Guignon Results

These are the results from a single run of a benchmark contributed by Thomas Guignon. This benchmark is philosophically similar to cpu-rate but contains some inline assembler to test the prefetch mechanism of the Athlons. The results it obtains for memory bandwidth are consistent with those of stream, but its memory access pattern showed no reduction of the single-CPU rate when two copies were run at a time (so that both CPUs were simultaneously busy).
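For concreteness, the "two at a time" test amounts to putting two copies of the single-CPU benchmark on the box at once. Here is a minimal harness for doing that, assuming the standalone_fnrm2 invocation shown in Thomas's results below (the 1200 MHz clock argument is for the Athlon); this little driver is mine, not part of the benchmark:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
  int i;

  /* fork two copies so that both CPUs are simultaneously busy */
  for (i = 0; i < 2; i++) {
    if (fork() == 0) {
      execl("./standalone_fnrm2", "standalone_fnrm2",
            "1000000", "10", "1200", (char *)NULL);
      perror("execl");          /* reached only if the exec fails */
      exit(1);
    }
  }
  while (wait(NULL) > 0)
    ;                           /* reap both children */
  return 0;
}

Each copy prints its own MFLOPS figures, so a drop in the per-copy rate relative to a single run directly measures memory-bus contention; here there was none.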

The most obvious thing about these results is that they show the 1.2 GHz Athlon to be substantially slower than any of the (much lower clock) dual PIIIs tested. I suspect that this is because of the sudden, discrete dropoff in float rate observed in the cpu-rate graph on the Athlon, compared to the much more gradual dropoff on the PIIIs (see the CPU Performance Summary page). Or, of course, the result could just be incorrect or insane because I compiled the code incorrectly, or the libraries on the system are broken, or...

Still, if this hypothesis is correct, chances are good that this rate will remain roughly constant for larger memory blocks on the Athlon but drop off significantly for the PIII, with the PIII ultimately ending up about 20% slower. So the hypothesis can be tested, and I will test it eventually, unless somebody suggests something more likely or better to do. I have some RDRAM-equipped dual 933 MHz PIIIs handy to test it with, and with a bit of work I can rerun Thomas's code in an adaptation of the log-plot-cpu-rate script for various memory sizes on a log scale (and generate statistics, I suppose), so look back at this site later in the week.
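The sweep itself is simple enough; here is a minimal sketch of the sort of driver I have in mind, assuming Thomas's fdnrm2() routine from the listing below and doubling the vector size each step to get a log scale:

#include <stdio.h>
#include <stdlib.h>

double fdnrm2(int n, double *x, int incx);  /* Thomas's routine, below */

int main(void)
{
  int n, i;

  /* log-scale sweep: 1K doubles (8 KB, well inside cache) up to
     16M doubles (128 MB, far outside it) */
  for (n = 1024; n <= 16*1024*1024; n *= 2) {
    double *x = (double *)calloc(n, sizeof(double));
    if (x == NULL) { perror("calloc"); exit(1); }
    for (i = 0; i < n; i++) x[i] = (double)i;
    /* time this call with rdtsc as in the listing and print n,
       MFLOPS = 2n/t, and MB/s = 8n/t for plotting */
    fdnrm2(n, x, 1);
    free(x);
  }
  return 0;
}

If the hypothesis is right, the Athlon's curve should stay flat out to the largest sizes while the PIII's falls off once the vector leaves L2.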

I'm following this with Thomas's comments on the code, some results he has obtained for various systems, and the (unpackaged) code itself. The packaged version is more attractive (it has self-documenting command line options and a small man page, for example) and will be up in a few days.


Results with 1-2 CPUs in use at a time

#=============================================================================
# Starting guignon benchmark testing memsize = 8000000
# iterating 10 times for CPU clock of  1200.00 MHz
# Note that there was no substantive difference when run 100 times
# or when run two at a time.
#=============================================================================
fdnrm2:
temps: 18019371, MFLOPS: 133.19, MB/S: 532.76, val:     5.773498e+08
temps: 17893696, MFLOPS: 134.125, MB/S: 536.502, val:     5.773498e+08
temps: 17916254, MFLOPS: 133.957, MB/S: 535.826, val:     5.773498e+08
temps: 17876585, MFLOPS: 134.254, MB/S: 537.015, val:     5.773498e+08
temps: 17807489, MFLOPS: 134.775, MB/S: 539.099, val:     5.773498e+08
temps: 17881062, MFLOPS: 134.22, MB/S: 536.881, val:     5.773498e+08
temps: 17917523, MFLOPS: 133.947, MB/S: 535.788, val:     5.773498e+08
temps: 17919857, MFLOPS: 133.93, MB/S: 535.719, val:     5.773498e+08
temps: 17835425, MFLOPS: 134.564, MB/S: 538.255, val:     5.773498e+08
temps: 17920399, MFLOPS: 133.926, MB/S: 535.702, val:     5.773498e+08
#=============================================================================
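As a sanity check on these numbers: the three columns are mutually consistent if "temps" is the raw rdtsc cycle count for one pass over n = 1,000,000 doubles (memsize 8000000 bytes), with 2n flops (a multiply and an add per element) and 8n bytes of traffic per pass. The flop and byte accounting is my reading of the code, not something stated in the output:

#include <stdio.h>

int main(void)
{
  double n = 8000000.0 / 8.0;      /* memsize 8000000 bytes of doubles */
  double cycles = 18019371.0;      /* "temps" from the first row above */
  double mhz = 1200.0;             /* CPU clock given on the command line */
  double t = cycles / (mhz * 1e6); /* elapsed wall time in seconds */

  printf("MFLOPS = %.2f\n", 2.0 * n / t / 1e6);  /* prints 133.19 */
  printf("MB/S   = %.2f\n", 8.0 * n / t / 1e6);  /* prints 532.76 */
  return 0;
}

The same arithmetic reproduces Thomas's PIII rows below, so the reported MB/S figure really is sustained main-memory bandwidth.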

Hello,
here is a stand-alone program for a test: the program computes a vector
2-norm with an unrolled loop. There is cleanup code before and after the
unrolled loop to allow cache-line-granularity access inside the loop,
with some prefetch for the next cache line. This code gives good results
with large vectors (ones that do not fit in cache).
Some tests follow on different motherboards:
- Asustek CUR DLS (SW LE), 2 x PIII 733 PC133
[guignon@toto1 tmp]$ ./standalone_fnrm2 1000000 10 733
fdnrm2:
temps: 8552151, MFLOPS: 171.419, MB/S: 685.675, val:     5.773498e+08
temps: 8526416, MFLOPS: 171.936, MB/S: 687.745, val:     5.773498e+08
temps: 8524574, MFLOPS: 171.973, MB/S: 687.894, val:     5.773498e+08
temps: 8528380, MFLOPS: 171.897, MB/S: 687.587, val:     5.773498e+08
temps: 8525014, MFLOPS: 171.965, MB/S: 687.858, val:     5.773498e+08
temps: 8524541, MFLOPS: 171.974, MB/S: 687.896, val:     5.773498e+08
temps: 8524871, MFLOPS: 171.967, MB/S: 687.87, val:     5.773498e+08
temps: 8524821, MFLOPS: 171.968, MB/S: 687.874, val:     5.773498e+08
temps: 8525662, MFLOPS: 171.951, MB/S: 687.806, val:     5.773498e+08
temps: 8596833, MFLOPS: 170.528, MB/S: 682.112, val:     5.773498e+08 
- Supermicro DER (SW HEsl), 2 x PIII 866 PC133
[guignon@superdrill tmp]$ ./standalone_fnrm2 1000000 10 866
fdnrm2:
temps: 9486635, MFLOPS: 182.573, MB/S: 730.291, val:     5.773498e+08
temps: 9541590, MFLOPS: 181.521, MB/S: 726.084, val:     5.773498e+08
temps: 9572349, MFLOPS: 180.938, MB/S: 723.751, val:     5.773498e+08
temps: 9534568, MFLOPS: 181.655, MB/S: 726.619, val:     5.773498e+08
temps: 9530261, MFLOPS: 181.737, MB/S: 726.948, val:     5.773498e+08
temps: 9530904, MFLOPS: 181.725, MB/S: 726.899, val:     5.773498e+08
temps: 9532143, MFLOPS: 181.701, MB/S: 726.804, val:     5.773498e+08
temps: 9551074, MFLOPS: 181.341, MB/S: 725.363, val:     5.773498e+08
temps: 9534616, MFLOPS: 181.654, MB/S: 726.616, val:     5.773498e+08
temps: 9531600, MFLOPS: 181.711, MB/S: 726.845, val:     5.773498e+08 
- MSI 694D (VIA), 2 x PIII 733 PC133
[thomas@mu001 tmp]$  ./standalone_fnrm2 1000000 10 733
fdnrm2:
temps: 7689652, MFLOPS: 190.646, MB/S: 762.583, val:     5.773498e+08
temps: 7594205, MFLOPS: 193.042, MB/S: 772.168, val:     5.773498e+08
temps: 7613548, MFLOPS: 192.551, MB/S: 770.206, val:     5.773498e+08
temps: 7570395, MFLOPS: 193.649, MB/S: 774.596, val:     5.773498e+08
temps: 7607657, MFLOPS: 192.701, MB/S: 770.802, val:     5.773498e+08
temps: 7604577, MFLOPS: 192.779, MB/S: 771.115, val:     5.773498e+08
temps: 7570142, MFLOPS: 193.656, MB/S: 774.622, val:     5.773498e+08
temps: 7606794, MFLOPS: 192.722, MB/S: 770.89, val:     5.773498e+08
temps: 7590203, MFLOPS: 193.144, MB/S: 772.575, val:     5.773498e+08
temps: 7619125, MFLOPS: 192.411, MB/S: 769.642, val:     5.773498e+08  

You will notice that the MSI motherboard has better results than the
other two with one processor, but dual-processor computation is far
better on the ServerWorks chipsets than on VIA (I will send you a
stand-alone dual-processor version later). I am sending you this email
directly so as not to pollute the beowulf mailing list, but if you
think the results and code are interesting enough to be sent to the
list, feel free to do so.

Cheers,

The program (compile with -O1):
#include <stdio.h>   /* perror, printf */
#include <stdlib.h>  /* calloc, free, exit */
#include <math.h>    /* sqrt */

#define L2_SIZE 1048576 

/* vide_cache ("empty the cache"): write and then read back a buffer
   twice the size of L2, evicting the test vector from cache before a
   timed pass. The alternating-sign sum is returned so the compiler
   cannot optimize the reads away. */
static int vide_cache(){
  int i;
  int t;
  int un = -1;
  
  char *cache;
  
  cache = (char *)calloc(2*L2_SIZE,sizeof(char));
  if(cache==NULL){
    perror("cache");
    exit(-1);
  }
  /* write pass */
  for(i=0;i<2*L2_SIZE;i++){
    cache[i] = i%4;
  }
  /* read pass */
  t = 0;
  for(i=0;i<2*L2_SIZE;i++){
    un = -un;
    t += un*cache[i];
  }

  free(cache);
  
  return(t);
}

#define CL_DSIZE 4

/* read the CPU's 64-bit time-stamp counter: rdtsc leaves the low word
   in eax and the high word in edx */
static inline void rdtsc(unsigned long long int *counter){

  asm("rdtsc \n\t"
      "movl %%eax,%0 \n\t"
      "movl %%edx,%1 \n\t"
      : "=m" (((unsigned *)counter)[0]), "=m" (((unsigned *)counter)[1])
      :
      : "eax" , "edx");
  
}

double fdnrm2(int n,double *x,int incx){
  // compile with -O only
  register double nrm,nrm1,nrm2,nrm3,nrm4;
  double *xmax,*xal,*xmaxal;
  double *p;

  if(incx==1){
    nrm = 0;

    // realign x to a cache-line boundary
    xal = (double *)((unsigned)x & 0xFFFFFFF0);
    xmax = x + n;
    xmaxal = (double *)((unsigned)xmax & 0xFFFFFFF0);
    for(p=x;x
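The listing is truncated at this point in the copy I have. Purely as illustration, here is a sketch of what the rest of fdnrm2() presumably does, based on Thomas's description above (head and tail cleanup around a loop unrolled to cache-line granularity, with a prefetch for an upcoming line). This is my reconstruction, not his code: GCC's __builtin_prefetch stands in for his inline assembler, and a 32-byte cache line (CL_DSIZE doubles) is assumed.

/* hypothetical reconstruction -- NOT the original code */
double fdnrm2_sketch(int n, double *x)
{
  double nrm = 0.0, nrm1 = 0.0, nrm2 = 0.0, nrm3 = 0.0, nrm4 = 0.0;
  double *p = x, *xmax = x + n;

  /* head cleanup: advance to a 32-byte boundary one element at a time */
  while (((unsigned long)p & 31) && p < xmax) {
    nrm += *p * *p;
    p++;
  }
  /* main loop: one cache line (CL_DSIZE = 4 doubles) per iteration,
     prefetching a couple of lines ahead so the data is already
     resident when the loop reaches it */
  while (p + CL_DSIZE <= xmax) {
    __builtin_prefetch(p + 2*CL_DSIZE, 0, 0);
    nrm1 += p[0]*p[0];
    nrm2 += p[1]*p[1];
    nrm3 += p[2]*p[2];
    nrm4 += p[3]*p[3];
    p += CL_DSIZE;
  }
  /* tail cleanup: whatever is left after the last full line */
  while (p < xmax) {
    nrm += *p * *p;
    p++;
  }
  return sqrt(nrm + nrm1 + nrm2 + nrm3 + nrm4);
}

With x[i] = i and n = 10^6 this returns sqrt of the sum of squares, about 5.7735e+08, matching the val column in all of the runs above.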