benchmaster
by
Robert G. Brown
Duke University Physics Department
Durham, NC 27708-0305
Copyright Robert G. Brown, 2025
Abstract
Benchmaster: A System Testing Suite
by Robert G. Brown (rgb)
benchmaster Version 1.1.3
Description
Benchmaster is a fairly sophisticated program designed to time and
exercise very specific system functions. It uses the fastest onboard
clock that it can find (generally the CPU's cycle counter on x86-derived
architectures) to time test "objects", and determines the precision of
that timer, including the overhead of the timer call itself.
A test object contains (in addition to test creator/destructor
functions) a test routine with two branches -- one "empty" and one
"full" -- that are structured to be, as nearly as possible, identical
except for the additional code to be timed in the full branch.
The test harness then determines iteration counts -- the number of
times it has to run the empty or full branch to accumulate a time much
greater than the timer resolution. It then proceeds to generate a
requested number of samples of the timings of the empty and full
branches. Finally, it subtracts the average empty time from the average
full time to determine the result, and evaluates the means and standard
deviations to produce the cumulative expected error.
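To make this concrete, here is a minimal sketch of the empty/full
subtraction idea. This is not benchmaster's actual harness code -- it
uses the portable clock_gettime() call rather than the cycle counter,
and all names in it are hypothetical:
#include <stdio.h>
#include <time.h>
/* Hypothetical test object: full_flag selects the timed branch. */
static volatile double x = 3.141592653589;
static volatile double result;
static int my_test(int full_flag)
{
  if(full_flag){
    result = x*x;      /* the code fragment under test: one multiply */
    return(full_flag);
  } else {
    return(full_flag); /* empty branch: identical except for the fragment */
  }
}
/* Accumulate enough iterations that the total time is much greater
 * than the timer resolution; return total time in nanoseconds. */
static double time_branch(int full_flag, long iterations)
{
  struct timespec t0,t1;
  long i;
  clock_gettime(CLOCK_MONOTONIC,&t0);
  for(i=0;i<iterations;i++) my_test(full_flag);
  clock_gettime(CLOCK_MONOTONIC,&t1);
  return (t1.tv_sec - t0.tv_sec)*1.0e9 + (t1.tv_nsec - t0.tv_nsec);
}
int main(void)
{
  long iterations = 10000000;
  double empty = time_branch(0,iterations);
  double full  = time_branch(1,iterations);
  /* The difference is (approximately) the time for the fragment alone. */
  printf("fragment time: %.3f ns\n",(full - empty)/iterations);
  return 0;
}
Benchmaster repeats this kind of measurement many times to collect
samples, from which it computes the mean and standard deviation rather
than relying on a single difference.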
The results are then printed out in a standard XML-based format with
an optional header describing the test and certain runtime details.
Numbers that are returned include the length of the vector (see the
discussion of vector tests below), the stride (ditto), and the mean time
with error bars, in nanoseconds, required to execute just the tested
code fragment in the particular context of the test routine. Finally, a
"megarate" is returned that is generally the number of millions of times
the test fragment is executed per second. There are a few exceptions to
this; see below.
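Purely as an illustration of the kind of record just described (the
actual tag names in benchmaster's output may well differ, and as noted
below the XML is still malleable), a single test result might look
something like:
<!-- hypothetical example only; real tag names may differ -->
<result>
  <name>stream copy</name>
  <length>1000000</length>
  <stride>1</stride>
  <avg_time units="nanoseconds">2.31 +/- 0.05</avg_time>
  <megarate>6926.4</megarate>
</result>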
The use of XML in the test output is one of the project's major
design goals, as it in principle makes it possible to build e.g. DTDs
for the benchmark description language to generate standard reports in a
variety of media. Of course this is not yet done -- it is one of the
next major design goals. Volunteers/contributions of code
welcome... as are comments on the XML itself (which is still pretty
malleable until at least one program or transformation process is
written that uses it).
The version numbers have the following meaning. Note that these
aren't necessarily what you might expect, so be sure to at least glance
at this.
- First number (major). Bumped only when major goals in the design
roadmap are reached (for example, finishing all the test ideas outlined
below:-). Version 1.x.x, for example, means that the basic testing
structure is well defined and stable.
- Second number (first minor). Bumped for bugfixes, feature
additions, major or minor improvements (but reset to zero when a major
design goal is reached).
- Third number (second minor). This number indicates the number of
tests currently supported.
All benchmaster source files (.tgz and .src.rpm) can be downloaded
from this directory. In addition, i386 binary RPMs (currently built on
top of Fedora Core 5) are present.
This project is currently semi-stable. The tests seem to work (at
least for me) fairly consistently; there just aren't as many as I'd like
there to eventually be. Alas, I have too many projects going at once,
and have recently been spending a lot of time with the Dieharder project
available elsewhere on this website. If you are interested in seeing
the benchmaster project advanced more aggressively, contact me
(especially with an offer to help out:-).
Below are descriptions of the different kinds of tests already in
benchmaster.
Vector Tests
As noted above, in addition to being sampled in one loop (with a
controllable number of samples) and iterated inside that loop per sample
(with an automatically set but user-controllable number of iterations),
some of the tested code fragments are loops that operate on vectors of
numbers. For example, the stream benchmark by John D. McCalpin consists
of four simple numerical operations -- copying a vector, scaling a
vector, adding two vectors, and adding and rescaling vectors (the
"triad"). Each of these stream operations is one of the tests already
programmed into benchmaster, so benchmaster is capable of "running
stream" on a system.
In stream, the length of the tested vector is generally fixed --
programmed directly into the test as a compile-time parameter. In
benchmaster's stream (and other related vector arithmetic tests) the
vector length is a variable and can be selected by the user at
runtime. This permits one to directly observe how cache improves
performance for various vector lengths and strides.
Most modern CPUs have elaborate optimization mechanisms designed to
improve numerical performance on vectors in particular. They prefetch
data from memory into the cache in order to ensure that it is waiting
there when needed. They have special registers and pipelines that speed
repeated operations iterated over a vector. However, sometimes one runs
code that (although iterative) does not have such a fortunate
memory access pattern. Benchmaster therefore contains a facility for
permitting at least some vector tests to be performed in "shuffled
order". Basically, a matching vector of vector indices is shuffled and
used as the indices for the test. The overhead associated with the
shuffling process per se is duplicated in the "empty" code fragment so
that only the actual time required to access the memory in shuffled
order contributes.
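A minimal sketch of the shuffled-index technique follows. This is
illustrative only -- benchmaster's actual implementation differs in
detail, and all names here are made up:
#include <stdlib.h>
/* Fill idx[] with 0..size-1, then Fisher-Yates shuffle it. */
void shuffle_indices(int *idx, int size)
{
  int i, j, tmp;
  for(i=0;i<size;i++) idx[i] = i;
  for(i=size-1;i>0;i--){
    j = rand()%(i+1);
    tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
  }
}
/* Hypothetical test routine: both branches walk the shuffled index
 * vector, so the index-lookup overhead cancels in the subtraction and
 * only the shuffled-order memory access itself is measured. */
int shuffled_copy_test(int full_flag, double *d, double *a,
                       int *idx, int size)
{
  int i;
  volatile int sink;
  if(full_flag){
    for(i=0;i<size;i++) d[idx[i]] = a[idx[i]];
  } else {
    for(i=0;i<size;i++) sink = idx[i];
  }
  return(full_flag);
}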
Test Design
The general function and design of the timing harness is explained
above and documented in the code itself. This section indicates how to
add a test to the suite, as one of its major design goals is to make it
easy for you to add your own test fragments and
operations.
Before showing you an example of a test (the easiest way to document
how to create a new one) let me remark on a few of the many problems
that plague "benchmarking". For one, the speed with which code is
executed depends on many, many things, some of which are out of our
control. For example, there is an obvious dependence on system state --
if your task is swapped off of the CPU on a multitasking system in
mid-instruction, your time for that sample will be quite high. There is
an uncontrollable dependence on the compiler. I'm assuming that the
full and empty branches are both likely to be close together in memory
and both likely to be kept resident in cache in order for the timings to
be comparable and subtractable. This is likely enough to be true, but
there are no guarantees from the compiler.
It is also very difficult to test/time single instructions of
just about any kind. A multiply on a system can take a fraction of a
clock cycle. The finest-grained timekeeper on the system is the
clock cycle counter (typically order of a nanosecond), and the code
required to read it takes many nanoseconds to execute. Timing a
multiply is thus akin to timing the beating of a hummingbird's wings
with an hourglass, a process made worse by the compiler's tendency to
optimize away instructions that it can tell are doing nothing and can be
compressed.
This is just a warning. As it says in the program's Usage statement
and man page, the "Mega-rates" returned by this tool are BOGUS and may
not be even approximately correct. When interpreting results of
existing tests or adding your own, be cautious and test the tester as
much as the code fragment itself until the results make sense in your
own mind.
One final warning about hidden optimizations, overflows, etc. Many
CPUs are smart enough to use superfast internal arithmetic in order to
perform certain operations. For example, multiplying by 0.0 or 1.0 or
2.0 on many CPUs will take much, much less time than multiplying by
(say) 3.141592653589. For that reason I typically use this as a number
to multiply by whenever I am testing multiplication. Of course, if one
iterates multiplication by \pi it doesn't take long to overflow a
floating point variable, so one has to design test loops that avoid
overflow without resorting to 1.0 as a multiplier or to division (which
takes much longer than multiplication) when one wants to test only
multiplication.
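One way to avoid the overflow while still multiplying by "hard"
constants (a sketch of the general trick, not necessarily what any given
benchmaster test does) is to alternate multiplications by \pi and 1/\pi,
so that the running product stays bounded:
/* Each pass performs two genuine multiplications by non-trivial
 * constants, but x never grows, since pi*(1/pi) = 1.  Note that this
 * only works if the compiler is not allowed to reassociate floating
 * point arithmetic (avoid -ffast-math), which is the case with
 * default compile flags. */
double multiply_fragment(double x, int n)
{
  const double pi   = 3.141592653589793;
  const double pinv = 1.0/3.141592653589793;
  int i;
  for(i=0;i<n;i++){
    x *= pi;
    x *= pinv;
  }
  return(x);
}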
With all that said, a test consists of two pieces: an include file
that minimally contains the function prototypes for the test itself,
for example (for the stream copy test):
/*
* $Id: benchmaster.abs,v 1.7 2004/12/17 15:31:56 rgb Exp $
*
* See copyright in copyright.h and the accompanying file COPYING
*
*/
/*
* Goes with stream_copy_test.c.
*/
void stream_copy_init(Test *newtest);
void stream_copy_alloc();
void stream_copy_free();
int stream_copy_test(int full_flag);
void stream_copy_results(Test *mytest);
void stream_copy_about();
and the stream copy test source for these components:
/*
* $Id: benchmaster.abs,v 1.7 2004/12/17 15:31:56 rgb Exp $
* See copyright in copyright.h and the accompanying file COPYING
*/
#include "benchmaster.h"
/*
*==================================================================
* This is the "copy" test from the stream suite. It is not
* directly comparable to stream results for a variety of reasons.
* For one, it uses malloc to allocate all vectors so vector
* length may be different from what is compiled into any given copy
* of stream. Also, it uses the average time to determine the rate
* and not the best time. It will therefore generally return
* results that are very SLIGHTLY LOWER/SLOWER than regular stream
* (but which may be more realistic for general purpose code).
*
* It also uses a different timing harness, one that is both
* more accurate (uses a superior timer) and which repeats the
* computation many times, typically order 100, to obtain both a
* mean time and its standard deviation as test results.
*==================================================================
*/
void stream_copy_init(Test *mytest){
  mytest->alloc = stream_copy_alloc;
  mytest->free = stream_copy_free;
  mytest->test = stream_copy_test;
  mytest->results = stream_copy_results;
  snprintf(mytest->name,K,"stream copy");
  snprintf(mytest->about,K,"d[i] = a[i] (%d byte double vector)",(int) sizeof(double));
  if(verbose == VERBOSE || verbose == V_INIT){
    printf("# Init for test %s\n",mytest->name);
  }
}
void stream_copy_alloc()
{
  int i;
  /*
   * Allocate vector(s) to be tested with and initialize it and all
   * associated test-specific variables.
   */
  d = (double *) malloc((size_t) (size*sizeof(double)));
  a = (double *) malloc((size_t) (size*sizeof(double)));
  /*
   * Initialize the vector. xtest is set from the command line, default PI.
   */
  for(i=0;i < size;i+=stride){
    a[i] = xtest;
  }
}
void stream_copy_free()
{
  /*
   * Free all the memory we just allocated, to be neat and clean and
   * all that.
   */
  free(a);
  free(d);
}
int stream_copy_test(int full_flag)
{
  int i;
  if(full_flag){
    for(i=0;i < size;i+=stride){
      d[i] = a[i];
    }
    return(full_flag);
  } else {
    return(full_flag);
  }
}
void stream_copy_results(Test *mytest)
{
  double nanotime_norm;
  /*
   * This is the number of copy operations in the core loop.  We adjust
   * the test normalization so it is the SAME as that of stream, which
   * computes the rate as "megabytes/second":
   * 1.0e-6*2*sizeof(double)*nsize/time (in seconds).  We measure
   * nanoseconds, so ours is just a teeny bit different.
   */
  nanotime_norm = (double)size/stride;
  mytest->avg_time = fabs(mytest->avg_time_full - mytest->avg_time_empty)/nanotime_norm;
  mytest->sigma = (mytest->sigma_time_full + mytest->sigma_time_empty)/nanotime_norm;
  mytest->avg_megarate = 1000.0*2*sizeof(double)/mytest->avg_time;
  show_results(mytest);
}
Note that xtest is a user-settable value for the real number used to
initialize the vector. It defaults to pi, but can be overridden on the
command line so you can see for yourself the effect of using 1.0 or 0.0
in certain contexts of certain tests instead of a number that cannot be
optimized at the CPU microcode level. Note also that the avg_megarate
here is a bit different from that of other tests, as it returns a
"bandwidth" in Mbytes/sec (to be comparable with stream) instead of a
"bogomegarate", which is just the number of millions of iterations of
the test fragment completed per second.
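To make the normalization concrete: if the measured average time per
copy works out to 2 nanoseconds, the reported bandwidth is
1000*2*sizeof(double)/2.0 = 8000 Mbytes/sec, since each copy moves
2*sizeof(double) = 16 bytes (one read plus one write).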
This code fragment in my testing harness produces results that are
generally a few percent slower than those of off-the-shelf stream. This
is understandable -- I use the average time instead of the minimum time
to generate the number, so my stream number is average/expected
performance over a reasonably large number (default 100) of samples
while stream reports the observed peak performance on a fairly small
sample size (default 10).
Once you have written a "microbenchmark" test fragment of your own,
you have merely to insert it into the overall structure of the
benchmaster program. The best way to do that is to follow exactly the
way the existing tests are inserted. Put your test's include file and
NAME into the enum and include list in tests.h. Initialize the test in
startup.c. At that point the code should "just work". Try to use the
provided global variables and pointers for things like xtest and
vectors, just to keep the primary code from getting too gooped up. You
should be able to put globals into the include file for your test if you
need to add any new ones for specific tests, and should be sure to add
code to parsecl.c and startup.c as required to manage them.
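As a purely hypothetical sketch (tests.h and startup.c themselves, and
the existing test entries in them, are the real authority for the names
and layout), the additions would look something like:
/* tests.h (hypothetical names): add the include and the enum tag,
 * following the pattern of the existing tests */
#include "my_new_test.h"
  /* ... in the enum of test NAMEs: */
  MY_NEW_TEST,
/* startup.c (hypothetical call): initialize it alongside the others */
my_new_test_init(&test_list[MY_NEW_TEST]);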
Note Well! Benchmaster is not written for the purpose
of encouraging "benchmarketing" by vendors, so take any
publication of results from the benchmark with a grain of salt.
Consider the source! I mean that quite literally. Benchmaster is a
fully GPL tool that is available in rpm form. That means that you or a
vendor can hack up the source, insert optimized machine code,
etc. You should therefore insist on results from unmodified benchmaster
source, compiled with the default flags, as at least one component of
what you consider when you are trying to measure or understand systems
performance. Only then is it reasonable to have fun tuning things up to
see how fast you can make them.
Conclusion
As noted above, benchmaster is designed to be easy to modify
to insert your own tests as "test objects". You should feel free to
contribute particularly good or revealing tests back to me for inclusion
in the primary distribution of the tool. If and when I get so many
tests into the tool that the current method of selecting tests by number
no longer works, I'll probably have to rewrite the toplevel interface to
make it easier to keep track of everything, but that is likely some time
off (and should not affect results).
You should also feel free to make suggestions or report bugs or
contribute code to the timing harness itself -- I am constantly trying
things to see if I can obtain better control over system state (and
hence get more reproducible results) and to see if I can finagle
accurate timings of those hummingbird wing beats using my hourglass.
Perhaps one of you reading this knows just how to go about doing it in
userspace code...
Future Plans for benchmaster include:
- More microbenchmark tests, including ones for e.g. disk access and
network latency/bandwidth (the latter will need an accompanying daemon
to run on the target)
- More shufflable tests (shuffled stream?)
- Selected application or library level tests.
- XML test interface on the input side as well as the output?
It would be nice to be able to read in a "test descriptor file" that
made benchmaster work through a suite of tests according to a set of
instructions and produce a table/report as output. This would also
facilitate the development of a...
- ...GUI. Both web and Gtk interfaces would greatly simplify the
use of the tool and would permit the immediate-gratification
presentation of e.g. performance curves as vector sizes are swept across
the size of L1 and L2 caches.
Participation in benchmaster is openly solicited and encouraged.
All code is written in C for efficiency and its tight coupling to the
operating system, for all that it is still "object oriented" in general
design to keep it scalable and extensible. Contact me at
rgbATphy.duke.edu if you are interested in participating.
I hope that you find benchmaster useful.