Benchmaster: A System Testing Suite

by Robert G. Brown (rgb)

benchmaster Version 1.1.3


Contents


Back to top

Description

Benchmaster is a fairly sophisticated program designed to time and exercise very specific systems functions. It uses the fastest onboard clock that it can find (generally the CPU's cycle counter on x86-derived architectures) to time test "objects", and determines the precision of that timer including the overhead of the timer call.

A test object contains (in addition to test creators/destructor functions) a test routine with two branches -- one "empty" and one "full" -- that are structured to be, as nearly as possibly, identical except for the additional code to be timed in the full loop.

The test harness then determines iteration counts -- the number of times it has to run the empty or full branches to accumulate a time much greater than the timer resolution. It then proceeds to generate a requested number of samples of the timings of the empty and full branches. Finally, it subtracts the average full time from the average empty time to determine the result and evaluates the mean and standard deviations to produce the cumulative expected error.

Finally, the results are printed out in a standard XML based format with an optional header describing the test and certain runtime details. Numbers that are returned include the length of the vector (see discussion of vector tests below), the stride (ditto), and the mean time with error bars, in nanoseconds required to execute just the tested code fragment in the particular context of the test routine. Finally, a "megarate" is returned that is generally the number of million times the test fragment is executed per second. There are a few exceptions to this, see below.

The use of XML in the test output is one of the project's major design goals, as it in principle makes it possible to build e.g. DTDs for the benchmark description language to generate standard reports in a variety of media. Of course this is not yet done -- it is one of the next major design goals. Volunteers/contributions of code welcome... as are comments on the XML itself (which is still pretty malleable until at least one program or transformation process is written that uses it).

Back to top

Benchmaster Download Area

The version numbers have the following meaning. Note that these aren't necessarily what you might expect, so be sure to at least glance at this.

All benchmaster sources (.tgz and .src.rpm) files can be downloaded from this directory. In addition, i386 binary rpm's (currently built on top of Fedora Core 5) are present.

This project is currently semi-stable. The tests seem to work (at least for me) fairly consistently, there just aren't as many as I'd like there to eventually be. Alas, I have too many projects going at once, and have recently been spending a lot of time with the Dieharder project available elsewhere on this website. If you are interested in seeing the benchmaster project advanced more aggressively, contact me (especially with an offer to help out:-).

Below are descriptions of the different kinds of tests already in benchmaster.

Back to top

Vector Tests

As noted above, in addition to being sampled in one loop (with a controllable number of samples) and iterated inside that loop per sample (with an automatically set but user controllable number of iterations) some of the tested code fragments are loops that operate on vectors of numbers. For example, the stream benchmark by John D. McCalpin consists of four simple numerical operations -- a vector copy, scaling a vector, adding a vector, and adding and rescaling a vector. Each of these stream operations is one of the tests already programmed into benchmaster, so benchmaster is capable of "running stream" on a system

In stream, the length of the tested vector is generally fixed -- programmed directly into the test as a compile-time parameter. In benchmaster's stream (and other related vector arithmetic tests) the vector length is a variable and can be selected by the user at runtime. This permits one to directly observe how cache improves performance for various vector lengths and strides.

Most modern CPUs have elaborate optimization mechanisms designed to improve numerical performance on vectors in particular. They prefetch data from memory into the cache in order to ensure that it is waiting there when needed. They have special registers and pipelines that speed repeated operations iterated over a vector. However, sometimes one runs code that (although iterative) does not have such a fortunate memory access pattern. Benchmaster therefore contains a facility for permitting at least some vector tests to be performed in "shuffled order". Basically, a matching vector of vector indices is shuffled and used as the indices for the test. The overhead associated with the shuffling process per se is duplicated in the "empty" code fragment so that only the actual time required to access the memory in shuffled order contributes.

Back to top

Test Design

The general function and design of the timing harness is explained above and documented in the code itself. This section indicates how to add a test to the suite, as one of its major design goals is to make it easy for you to add your own test fragments and operations.

Before showing you an example of a test (the easiest way to document how to create a new one) let me remark on a few of the many problems that plague "benchmarking". For one, the speed with which code is executed depends on many, many things, some of which are out of our control. For example, there is an obvious dependence on system state -- if your task is swapped off of the CPU on a multitasking system in mid-instruction, your time for that sample will be quite high. There is an uncontrollable dependence on the compiler. I'm assuming that the full and empty branches are both likely to be close together in memory and both likely to be kept resident in cache in order for the timings to be comparable and subtractable. This is likely enough to be true, but there are no guarantees from the compiler.

It is also very difficult to test/time single instructions of just about any kind. A multiply on a system can take a fraction of a clock cycle. The finest-grained timekeeper on the system is the clock cycle counter (typically order of a nanosecond), and the code required to read it takes many nanoseconds to execute. Timing a multiply is thus akin to timing the beating of a hummingbird's wings with an hourglass, a process made worse by the compiler's tendency to optimize away instructions that it can tell are doing nothing and can be compressed.

This is just a warning. As it says in the program's Usage statement and man page, the "Mega-rates" returned by this tool are BOGUS and may not be even approximately correct. When interpreting results of existing tests or adding your own, be cautious and test the tester as much as the code fragment itself until the results make sense in your own mind.

One final warning about hidden optimizations, overflows, etc. Many CPUs are smart enough to use superfast internal arithmetic in order to perform certain operations. For example, multiplying by 0.0 or 1.0 or 2.0 on many CPUs will take much, much less time than multiplying by (say) 3.141592653589. For that reason I typically use this as a number to multiply by whenever I am testing multiplication. Of course if one iterates multiplication by \pi it doesn't take long to overflow a floating point variable, so one needs to use caution in designing test loops to avoid this without using 1.0 as a multiplier or dividing (which takes much longer than multiplication) when one wants to test only multiplication.

With all that said, a test consists of two pieces. An include file that minimally contains the function prototypes for the the test itself, for example (for the stream copy test):

/*
* $Id: benchmaster.abs,v 1.7 2004/12/17 15:31:56 rgb Exp $
*
* See copyright in copyright.h and the accompanying file COPYING
*
*/

/*
 * Goes with stream_copy_test.c.
 */
void stream_copy_init(Test *newtest);
void stream_copy_alloc();
void stream_copy_free();
int stream_copy_test(int full_flag);
void stream_copy_results();
void stream_copy_about();

and the stream copy test source for these components:

/*
 * $Id: benchmaster.abs,v 1.7 2004/12/17 15:31:56 rgb Exp $
 * See copyright in copyright.h and the accompanying file COPYING
 */

#include "benchmaster.h"

 /*
  *==================================================================
  * This is the "copy" test from the stream suite.  It is not
  * directly comparable to stream results for a variety of reasons.
  * For one, it uses malloc to allocate all vectors so vector
  * length may be different from what is compiled into any given copy
  * of stream.  Also, it uses the average time to determine the rate
  * and not the best time.  It will therefore generally return
  * results that are very SLIGHTLY LOWER/SLOWER than regular stream
  * (but which may be more realistic for general purpose code).
  *
  * It also uses a different timing harness, one that is both
  * more accurate (uses a superior timer) and which repeats the
  * computation many times, typically order 100, to obtain both a 
  * mean time and its standard deviation as test results.
  *==================================================================
  */


void stream_copy_init(Test *mytest){

 int i;

 mytest->alloc = stream_copy_alloc;
 mytest->free = stream_copy_free;
 mytest->test = stream_copy_test;
 mytest->results = stream_copy_results;
 snprintf(mytest->name,K,"stream copy");
 snprintf(mytest->about,K,"d[i] = a[i] (%d byte double vector)",sizeof(double));

 if(verbose == VERBOSE || verbose == V_INIT){
   printf("# Init for test %s\n",mytest->name);
 }


}

void stream_copy_alloc()
{

 int i;

 /*
  * Allocate vector(s) to be tested with and initialize it and all
  * associated test-specific variables.
  */
 d = (double *) malloc((size_t) (size*sizeof(double)));
 a = (double *) malloc((size_t) (size*sizeof(double)));
 /*
  * Initialize the vector. xtest is set from the command line, default PI.
  */
 for(i=0;i < size;i+=stride){
   a[i] = xtest;
 }

}

void stream_copy_free()
{

 int i;

 /*
  * Free all the memory we just allocated, to be neat and clean and
  * all that.
  */
 free(a);
 free(d);

}

int stream_copy_test(int full_flag)
{

 int i;
 
 if(full_flag){
   for(i=0;i < size;i+=stride){
     d[i] = a[i];
   }
   return(full_flag);
 } else {
   return(full_flag);
 }
}

void stream_copy_results(Test *mytest)
{

 double nanotime_norm;

 /*
  * This is the number of copy operations in the core loop.  We adjust the
  * test normalization so it is the SAME as that of stream, which computes
  * the rate as "megabytes/seconds": 1.0e-6*2*sizeof(double)*nsize/time
  * (in seconds).  We measure nanoseconds, so ours is just a teeny bit
  * different.
  */
 nanotime_norm = (double)size/stride;

 mytest->avg_time = fabs(mytest->avg_time_full - mytest->avg_time_empty)/nanotime_norm;
 mytest->sigma = (mytest->sigma_time_full + mytest->sigma_time_empty)/nanotime_norm;
 mytest->avg_megarate = 1000.0*2*sizeof(double)/mytest->avg_time;

 show_results(mytest);

}

Note that xtest is a user-settable value for the real number used to initialize the vector. It defaults to pi, but can be overridden on the command line so you can see for yourself the effect of using 1.0 or 0.0 in certain contexts of certain tests instead of a number that cannot be optimized at the CPU microcode level. Note that the avg_megarate is a bit different than from other tests as it returns a "bandwidth" in Mbytes/sec (to be comparable with stream) instead of a "bogomegarate", which is just the number of millions of iterations of the test fragment completed per second.

This code fragment in my testing harness produces results that are generally a few percent slower than those of off-the-shelf stream. This is understandable -- I use the average time instead of the minimum time to generate the number, so my stream number is average/expected performance over a reasonably large number (default 100) of samples while stream reports the observed peak performance on a fairly small sample size (default 10).

Once you have written a "microbenchmark" test fragment of your own, you have merely to insert it into the overall structure of the benchmaster program. The best way to do that is to follow exactly the way the existing tests are inserted. Put your test's include file and NAME into the enum and include list in tests.h. Initialize the test in startup.c. At that point the code should "just work". Try to use the provided global variables and pointers for things like xtest and vectors, just to keep the primary code from getting too gooped up. You should be able to put globals into the include file for your test if you need to add any new ones for specific tests, and should be sure to add code to parsecl.c and startup as required to manage them.

Note Well! Benchmaster is not written for the purpose of encouraging "benchmarketing" by vendors, so take any publication of results from the benchmark with a grain of salt.

Consider the source! I mean that quite literally. Benchmaster is a fully GPL tool that is available in rpm form. That means that you or a vendor can hack up the source, insert optimized machine code, etc. Note that you should insist on results of unmodified benchmaster source, compiled with the default flags as at least one component of what you consider when you are trying to measure or understand systems performance. Only then is it reasonable to have fun tuning things up to see how fast you can make them.

Back to top

Conclusion

As noted above, benchmaster is designed to be easy to modify to insert your own tests as "test objects". You should feel free to contribute particularly good or revealing tests back to me for inclusion in the primary distribution of the tool. If and when I get so many tests into the tool that the current method of selecting tests by number no longer works, I'll probably have to rewrite the toplevel interface to make it easier to keep track of everything, but that is likely some time off (and should not affect results).

You should also feel free to make suggestions or report bugs or contribute code to the timing harness itself -- I am constantly trying things to see if I can obtain better control over system state (and hence get more reproducible results) and to see if I can finagle accurate timings of those hummingbird wing beats using my hourglass. Perhaps one of you reading this knows just how to go about doing it in userspace code...

Future Plans for benchmaster include:

Participation in benchmaster is openly solicited and encouraged. All code is written in C for efficiency and its tight coupling to the operating system, for all that it is still "object oriented" in general design to keep it scalable and extensible. Contact me at rgbATphy.duke.edu if you are interested in participating.

I hope that you find benchmaster useful.