These are the results from a single run of a benchmark contributed by Thomas Guignon.
The most obvious thing about these results is that they show the 1.2 GHz Athlon to be substantially slower (about 133 to 135 MFLOPS) than any of the much lower-clocked dual PIIIs tested (about 171 to 194 MFLOPS). I suspect that this is because of the discrete and sudden drop-off in float rate observed in the CPU-rate graph on the Athlon, compared to the much more gradual drop-off on PIIIs (see the CPU Performance Summary page). Or, of course, the result could simply be correct, or it could be insane because I compiled the code incorrectly, or the libraries on the system are broken, or...
Still, if this hypothesis is correct, chances are good that this rate will remain constant for larger memory blocks on the Athlon but drop off significantly on the PIII, with the PIII ultimately being about 20% slower. So we can test it, and I will, eventually, unless somebody suggests something more likely or better to do. I have some RDRAM-equipped dual 933 MHz PIIIs handy to test this hypothesis with, and with a bit of work I can rerun Thomas's code in an adaptation of the log-plot-cpu-rate script for various memory sizes on a log scale (and generate statistics, I suppose), so look back at this site later in the week.
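A rough sketch of what such a sweep over memory sizes might look like follows. It is not the log-plot-cpu-rate script and not Thomas's code: sum_squares, rate_mbs and sweep are hypothetical names, the kernel is a plain sum of squares standing in for fdnrm2, and the timer is the standard C clock() rather than the cycle counter, so the absolute numbers would not be comparable with the runs on this page.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Stand-in kernel (assumption): plain sum of squares, i.e. the squared 2-norm. */
double sum_squares(size_t n, const double *x)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * x[i];
    return s;
}

/* Time reps passes over n doubles and return the rate in MB/s
   (10^6 bytes per second); 0.0 if the timer did not advance,
   -1.0 if the allocation failed. */
double rate_mbs(size_t n, int reps)
{
    double *x = malloc(n * sizeof *x);
    if (x == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        x[i] = (double)i;

    volatile double sink = 0.0;   /* keeps the compiler from dropping the loop */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        sink += sum_squares(n, x);
    clock_t t1 = clock();
    free(x);

    double sec = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (sec <= 0.0)
        return 0.0;
    return (double)reps * (double)n * sizeof(double) / sec / 1e6;
}

/* Sweep vector lengths from min_n to max_n, doubling each step (evenly
   spaced on a log scale), and print one line per size.
   Returns the number of sizes visited. */
int sweep(size_t min_n, size_t max_n, int reps)
{
    int points = 0;
    if (min_n == 0)
        return 0;
    for (size_t n = min_n; n <= max_n; n *= 2) {
        printf("%zu bytes: %.1f MB/s\n", n * sizeof(double), rate_mbs(n, reps));
        points++;
        if (n > max_n / 2)
            break;   /* the next doubling would pass max_n (and could overflow) */
    }
    return points;
}
```

Calling sweep(1024, 16 * 1024 * 1024, 10) would cover vectors from 8 KiB up to 128 MiB. If the hypothesis holds, the Athlon's rate should stay roughly flat across the large sizes while the PIII's keeps falling.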
I'm following this with Thomas's comments on the code, some results he's obtained for various systems, and the (unpackaged) code itself. The packaged version is more attractive (has autodocumenting command line options and a small man page, for example) and will be up in a few days.
#=============================================================================
# Starting guignon benchmark testing memsize = 8000000
# iterating 10 times for CPU clock of 1200.00 MHz
# Note that there was no substantive difference when run 100 times
# or when run two at a time.
#=============================================================================
fdnrm2:
temps: 18019371, MFLOPS: 133.19, MB/S: 532.76, val: 5.773498e+08
temps: 17893696, MFLOPS: 134.125, MB/S: 536.502, val: 5.773498e+08
temps: 17916254, MFLOPS: 133.957, MB/S: 535.826, val: 5.773498e+08
temps: 17876585, MFLOPS: 134.254, MB/S: 537.015, val: 5.773498e+08
temps: 17807489, MFLOPS: 134.775, MB/S: 539.099, val: 5.773498e+08
temps: 17881062, MFLOPS: 134.22, MB/S: 536.881, val: 5.773498e+08
temps: 17917523, MFLOPS: 133.947, MB/S: 535.788, val: 5.773498e+08
temps: 17919857, MFLOPS: 133.93, MB/S: 535.719, val: 5.773498e+08
temps: 17835425, MFLOPS: 134.564, MB/S: 538.255, val: 5.773498e+08
temps: 17920399, MFLOPS: 133.926, MB/S: 535.702, val: 5.773498e+08
#=============================================================================
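For anyone wanting to check the columns by hand: the printed numbers are consistent with temps being an elapsed cycle count, with 2 floating-point operations (a multiply and an add) and 8 bytes (one double) charged per vector element, and with MB meaning 10^6 bytes. That accounting is inferred from the output, not read from the code, so treat the sketch below as an assumption; rates and rates_from_cycles are hypothetical names.

```c
#include <assert.h>

/* Assumed accounting (inferred from the printed output):
   2 flops and 8 bytes per vector element; MB = 10^6 bytes. */
typedef struct {
    double mflops;
    double mbytes_per_s;
} rates;

/* n: vector length in doubles; cycles: elapsed cycle count ("temps");
   clock_mhz: the CPU clock given on the command line. */
rates rates_from_cycles(long n, double cycles, double clock_mhz)
{
    rates r;
    assert(n > 0 && cycles > 0.0 && clock_mhz > 0.0);
    double usec = cycles / clock_mhz;          /* MHz = cycles per microsecond */
    r.mflops = 2.0 * (double)n / usec;         /* flops per microsecond = MFLOPS */
    r.mbytes_per_s = 8.0 * (double)n / usec;   /* bytes per microsecond = MB/s */
    return r;
}
```

For the first Athlon line above (a memsize of 8000000 bytes, so n = 1000000 doubles, with temps = 18019371 at 1200 MHz) this gives about 133.19 MFLOPS and 532.76 MB/s, matching the printed values.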
Hello,

here is a standalone program for a test: the program computes a vector 2-norm with an unrolled loop. There is cleanup code before and after the unrolled loop to allow cache-line-granularity access inside the loop, with some prefetch for the next cache line. This code gives good results with large vectors (that do not fit in cache).

Some tests follow on different motherboards:

- Asustek CUR DLS (SW LE), 2 x PIII 733 PC133

[guignon@toto1 tmp]$ ./standalone_fnrm2 1000000 10 733
fdnrm2:
temps: 8552151, MFLOPS: 171.419, MB/S: 685.675, val: 5.773498e+08
temps: 8526416, MFLOPS: 171.936, MB/S: 687.745, val: 5.773498e+08
temps: 8524574, MFLOPS: 171.973, MB/S: 687.894, val: 5.773498e+08
temps: 8528380, MFLOPS: 171.897, MB/S: 687.587, val: 5.773498e+08
temps: 8525014, MFLOPS: 171.965, MB/S: 687.858, val: 5.773498e+08
temps: 8524541, MFLOPS: 171.974, MB/S: 687.896, val: 5.773498e+08
temps: 8524871, MFLOPS: 171.967, MB/S: 687.87, val: 5.773498e+08
temps: 8524821, MFLOPS: 171.968, MB/S: 687.874, val: 5.773498e+08
temps: 8525662, MFLOPS: 171.951, MB/S: 687.806, val: 5.773498e+08
temps: 8596833, MFLOPS: 170.528, MB/S: 682.112, val: 5.773498e+08

- Supermicro DER (SW HEsl), 2 x PIII 866 PC133

[guignon@superdrill tmp]$ ./standalone_fnrm2 1000000 10 866
fdnrm2:
temps: 9486635, MFLOPS: 182.573, MB/S: 730.291, val: 5.773498e+08
temps: 9541590, MFLOPS: 181.521, MB/S: 726.084, val: 5.773498e+08
temps: 9572349, MFLOPS: 180.938, MB/S: 723.751, val: 5.773498e+08
temps: 9534568, MFLOPS: 181.655, MB/S: 726.619, val: 5.773498e+08
temps: 9530261, MFLOPS: 181.737, MB/S: 726.948, val: 5.773498e+08
temps: 9530904, MFLOPS: 181.725, MB/S: 726.899, val: 5.773498e+08
temps: 9532143, MFLOPS: 181.701, MB/S: 726.804, val: 5.773498e+08
temps: 9551074, MFLOPS: 181.341, MB/S: 725.363, val: 5.773498e+08
temps: 9534616, MFLOPS: 181.654, MB/S: 726.616, val: 5.773498e+08
temps: 9531600, MFLOPS: 181.711, MB/S: 726.845, val: 5.773498e+08

- MSI 694D (VIA), 2 x PIII 733 PC133

[thomas@mu001 tmp]$ ./standalone_fnrm2 1000000 10 733
fdnrm2:
temps: 7689652, MFLOPS: 190.646, MB/S: 762.583, val: 5.773498e+08
temps: 7594205, MFLOPS: 193.042, MB/S: 772.168, val: 5.773498e+08
temps: 7613548, MFLOPS: 192.551, MB/S: 770.206, val: 5.773498e+08
temps: 7570395, MFLOPS: 193.649, MB/S: 774.596, val: 5.773498e+08
temps: 7607657, MFLOPS: 192.701, MB/S: 770.802, val: 5.773498e+08
temps: 7604577, MFLOPS: 192.779, MB/S: 771.115, val: 5.773498e+08
temps: 7570142, MFLOPS: 193.656, MB/S: 774.622, val: 5.773498e+08
temps: 7606794, MFLOPS: 192.722, MB/S: 770.89, val: 5.773498e+08
temps: 7590203, MFLOPS: 193.144, MB/S: 772.575, val: 5.773498e+08
temps: 7619125, MFLOPS: 192.411, MB/S: 769.642, val: 5.773498e+08

You will notice that the MSI motherboard has better results than the two others with one processor, but dual-processor computation is by far better on ServerWorks chipsets than on VIA (I will send you a standalone dual-processor version later). I am sending you this email directly so as not to pollute the beowulf mailing list, but if you think the results and code are interesting enough to be sent to the mailing list, feel free to do so.

A+

The program (compile with -O1):

#include
#include
#include
#include

#define L2_SIZE 1048576

/* Flush the caches by writing and then reading a buffer twice the size of L2. */
static int vide_cache(){
  int i;
  int t;
  int un = -1;
  char *cache;

  cache = (char *)calloc(2*L2_SIZE,sizeof(char));
  if(cache==NULL){
    perror("cache");
    exit(-1);
  }
  for(i=0;i<2*L2_SIZE;i++){
    cache[i] = i%4;
  }
  t = 0;
  for(i=0;i<2*L2_SIZE;i++){
    un = -un;
    t += un*cache[i];
  }
  free(cache);
  return(t);
}

#define CL_DSIZE 4

/* Read the CPU time-stamp counter (a cycle count) into *counter. */
static inline void rdtsc(unsigned long long int *counter){
  asm("rdtsc \n\t"
      "movl %%eax,%0 \n\t"
      "movl %%edx,%1 \n\t"
      : "=m" (((unsigned *)counter)[0]), "=m" (((unsigned *)counter)[1])
      :
      : "eax" , "edx");
}

double fdnrm2(int n,double *x,int incx){
  // compile with -O only
  register double nrm,nrm1,nrm2,nrm3,nrm4;
  double *xmax,*xal,*xmaxal;
  double *p;

  if(incx==1){
    nrm = 0;
    // realign x on a cache-line boundary
    xal = (double *)((unsigned)x & 0xFFFFFFF0);
    xmax = x + n;
    xmaxal = (double *)((unsigned)xmax & 0xFFFFFFF0);
    for(p=x;x
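
Since the listing above breaks off inside fdnrm2, here is a minimal portable sketch of the technique Thomas describes: scalar cleanup up to a cache-line boundary, an unrolled body that consumes one cache line per iteration with a prefetch hint for the next, then scalar cleanup of the tail. It is not a reconstruction of his code. sum_squares_unrolled and PREFETCH are hypothetical names, the 32-byte line size is assumed from CL_DSIZE, the four accumulators are my choice, __builtin_prefetch is a GCC/Clang builtin, and the function returns the squared norm (take the square root for the 2-norm itself).

```c
#include <stddef.h>
#include <stdint.h>

#define CL_DSIZE 4                              /* doubles per cache line, as in the listing */
#define CL_BYTES (CL_DSIZE * sizeof(double))    /* assumed 32-byte cache line */

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)       /* a hint only; never faults */
#else
#define PREFETCH(p) ((void)0)
#endif

/* Returns x[0]^2 + ... + x[n-1]^2 for a unit-stride vector. */
double sum_squares_unrolled(size_t n, const double *x)
{
    const double *p = x;
    const double *end = x + n;
    double s = 0.0, s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

    /* Head cleanup: scalar steps until p sits on a cache-line boundary. */
    while (p < end && ((uintptr_t)p % CL_BYTES) != 0) {
        s += *p * *p;
        p++;
    }

    /* Unrolled body: one full cache line per iteration, with four
       independent partial sums so the adds do not serialize. */
    while ((size_t)(end - p) >= CL_DSIZE) {
        PREFETCH(p + CL_DSIZE);                 /* start fetching the next line */
        s0 += p[0] * p[0];
        s1 += p[1] * p[1];
        s2 += p[2] * p[2];
        s3 += p[3] * p[3];
        p += CL_DSIZE;
    }

    /* Tail cleanup: whatever is left after the last full line. */
    while (p < end) {
        s += *p * *p;
        p++;
    }
    return s + (s0 + s1) + (s2 + s3);
}
```

Using uintptr_t for the alignment test avoids the (unsigned) pointer cast in the listing, which truncates addresses on 64-bit systems.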