The first of these is sequential or streaming access. memtest allocates a block of memory as an unsigned integer array and promptly fills the array with its own indexes, so block[0] = 0, block[1] = 1, and so forth. It then times a loop (with microsecond resolution) that reads, then writes back, the contents of each array element, in order. It repeats the timed loop N times and forms averages and standard deviations of the measurements for any given block size.
This sequential access pattern gives maximum advantage to read-ahead caching strategies. It also maximally favors blocks that fit into existing CPU caches and so can avoid memory access entirely. No operations, float or integer, other than the read and write themselves (and the inevitable loop indexing and addressing) are executed, so the test is insensitive to float or integer speed per se.
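In rough outline, a single timed streaming pass looks something like the following sketch (illustrative C only, not memtest's actual source; the function name, the use of gettimeofday(), and the variable names are my own assumptions):

    #include <stdlib.h>
    #include <sys/time.h>

    /* Illustrative sketch of the streaming pass; returns the elapsed
       time of one pass in microseconds. */
    double stream_pass(unsigned int *block, size_t size)
    {
       struct timeval t0, t1;
       size_t i;
       unsigned int tmp;

       /* Fill the array with its own indices: block[0] = 0, block[1] = 1... */
       for (i = 0; i < size; i++) block[i] = (unsigned int) i;

       /* Timed loop: read, then write back, each element in order.  (A
          real benchmark also has to keep the compiler from optimizing
          the read/write away, e.g. by declaring things volatile.) */
       gettimeofday(&t0, NULL);
       for (i = 0; i < size; i++) {
          tmp = block[i];      /* read  */
          block[i] = tmp;      /* write */
       }
       gettimeofday(&t1, NULL);

       return (t1.tv_sec - t0.tv_sec) * 1.0e6
            + (double) (t1.tv_usec - t0.tv_usec);
    }

Repeating this N times per block size and accumulating the mean and standard deviation of the pass times yields the quoted figures.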
The second access pattern it tests is a random access pattern. To accomplish this, inside the sampling loop (N) but outside the timing loop, the block vector is filled with a random shuffling of its indices. As these stored indices are used to direct the access (in the streaming test as well), the reads and writes are executed in a random order.
This works to defeat caching strategies by only rarely accessing sequential pieces of memory. It also makes the access highly nonlocal, making it very difficult to obtain any advantage from the cache itself. When the block size becomes much bigger than any of the cache sizes, the chances become very large that the next bit of data requested will come from somewhere in main memory and not in a piece of cache preloaded by an earlier lookahead.
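A sketch of the corresponding random pass (same caveats and assumptions as the streaming sketch above) might be:

    /* Illustrative sketch of the random pass, not memtest's source. */
    double random_pass(unsigned int *block, size_t size)
    {
       struct timeval t0, t1;
       size_t i, j;
       unsigned int tmp;

       /* Outside the timed region: fill the block with a random
          permutation of its own indices (a Fisher-Yates shuffle here;
          a real benchmark might use a better generator than rand()). */
       for (i = 0; i < size; i++) block[i] = (unsigned int) i;
       for (i = size - 1; i > 0; i--) {
          j = (size_t) rand() % (i + 1);
          tmp = block[i]; block[i] = block[j]; block[j] = tmp;
       }

       /* Timed loop: exactly the same work as the streaming pass, but
          the stored index decides where the read/write lands. */
       gettimeofday(&t0, NULL);
       for (i = 0; i < size; i++) {
          j = block[i];
          tmp = block[j];      /* read  */
          block[j] = tmp;      /* write */
       }
       gettimeofday(&t1, NULL);

       return (t1.tv_sec - t0.tv_sec) * 1.0e6
            + (double) (t1.tv_usec - t0.tv_usec);
    }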
Note that by isolating the (numerically intensive) work of loading the random block addresses outside the timing loops, we don't increase the amount of actual work done inside them. As the figures below indicate, random access patterns do increase the time required to access a given piece of data (relative to optimally efficient streaming access), but only by roughly a factor of two. However, the program itself will take much longer to complete, because generating all those random numbers can be very time consuming.
To speed it up again, one can opt to read/write only every nth element in the block vector (or the vector element pointed to by the nth index). When testing very large blocks of memory it is suggested that this strategy be employed, and also that a large block increment be used. As the figures make plain, through most of the range of memory sizes to be tested, memory speeds are boring (that is, locally constant). A quick pass to find out where the interesting parts are (informed by a knowledge of the size of e.g. the L1 and L2 caches and the physical memory itself), followed by a refined pass across the interesting regions, is a sensible strategy.
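The every-nth-element variant amounts to nothing more than striding the timed loop in the sketches above; stride here is a hypothetical parameter of my own, not memtest's documented interface:

    /* Touch only every stride-th element: the same address range is
       sampled with far less total (and timed) work, which matters when
       the block is hundreds of megabytes. */
    for (i = 0; i < size; i += stride) {
       j = block[i];
       tmp = block[j];      /* read  */
       block[j] = tmp;      /* write */
    }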
When random access for a single CPU (filled triangles) or for dual CPUs
(filled hexagons) is compared, though, the story is quite different.
First of all, it is apparent that random access is some three times
slower than streaming access for a single CPU, not unlike what we saw
above for eve. Second, simultaneously accessing random memory
blocks with both CPUs adds an even larger penalty -- not quite
twice as slow as a single CPU with a random pattern of access. We
clearly saturate the memory bus.
Two figures are presented, with the latter showing performance on the
entire first megabyte; the performance is (as one can see) totally
boringly perfect except for a small kink at the size of the L1 cache.
This kink is worth studying in detail below. Note well that the dual
PII at 400 MHz
outperforms the dual Celeron at 466 MHz in memory access speeds,
especially for random access. This speedup will be compared later
across the entire range.
In the figure above we plot the average speed directly instead of the
time required to access the memory. Note that this "speed" is a
relative thing (which is one reason that we haven't worried about it
thus far). The timed loop contains a bunch of address and other
arithmetic which consumes time comparable to the raw memory access
times. It is true enough that this overhead will be present in
any related calculation, but it still prevents us from quite being
able to claim that the speed is the "memory read/write speed".
We could probably correct for this (by subtracting out the time required
to execute an empty timing loop) but it hardly seems worth it. Instead,
think of it as only a semi-quantitative result, most useful for purposes
of comparison.
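If one did want the correction, a crude approach (a sketch under the same assumptions as the earlier fragments, not something memtest does) would be to time a loop containing only the loop and index arithmetic and subtract it:

    #include <sys/time.h>

    /* Time an "empty" pass containing only loop/index arithmetic.
       sink is volatile so the compiler cannot delete the loop. */
    double empty_pass(size_t size)
    {
       struct timeval t0, t1;
       volatile size_t sink = 0;
       size_t i;

       gettimeofday(&t0, NULL);
       for (i = 0; i < size; i++) sink += i;
       gettimeofday(&t1, NULL);

       return (t1.tv_sec - t0.tv_sec) * 1.0e6
            + (double) (t1.tv_usec - t0.tv_usec);
    }

    /* corrected ~ stream_pass(block, size) - empty_pass(size) */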
However you view it, it beautifully illustrates the power of the L1
cache. Both kinds of memory access (streaming in open triangles and
random in filled triangles, as before) are faster when run out of L1
cache. Curiously, random access is faster than streaming in the
L1 cache itself, probably because the random access pattern forces the
load of the entire block sooner and more efficiently than a streaming
pattern. On the other hand, its drop-off is more dramatic when forced
to the L2 cache -- it becomes more than two times slower and appears to
be getting slower still, while the streaming speed remains
approximately constant.
It is worth looking at the L2 cache boundary the same way. In the next
figure, we present a direct comparison of the speed of lucifer
and brahma, for both one (square) and two (pentagon) tasks. brahma is
unfilled and lucifer is filled. You can see that brahma executes two
simultaneous memtests about as fast as lucifer can manage one.
Note the very sharp drop-off at the cache boundaries in both cases.
This cusp is also apparent on the figures above. Even for streaming
access, the benefits of in-cache versus out of cache execution are
apparent. However, the scale of this figure hides a very important
detail.
By blowing up the figure to where we can see the L1 and L2 cache
boundaries for both processors, we get a surprise! lucifer's L1 cache is
about fifty percent faster than brahma's! This is a much bigger
advantage than we'd expect from comparing the raw clocks. The advantage
carries over into the L2 cache as well -- for a single task lucifer's L2
cache is half again as fast as brahma's, but running two tasks at once
essentially erases that advantage, although it isn't clear from the
figure how reproducible the oscillations for lucifer are. Overall,
lucifer appears a bit more sensitive to its history of competition for
memory -- its Celeron performance isn't as robust as the PII's.
This figure shows what one can also read about on paper -- the cache of
the Celeron runs at the full CPU clock, while that of the PII runs at
half the clock speed. Because of the "buffering" effects of a larger L2
cache (which provides for things like room for more contexts in the
cache that would otherwise have to come in from main memory, possibly in
the middle of a timing loop) and the fact that our timing loops aren't
"pure" but contain irrelevant instructions, one doesn't see all of the
Celeron's clock advantage, but it is certainly there and quite
pronounced for streaming access. It is undoubtedly this factor that
protects the Celeron in the other direction in the previous figure,
where we could see that the memory speed advantage of the PII was
nowhere near what one would expect by just comparing the memory clock
speeds.
It is worthwhile to look at the same figure for the random access test.
Here the Celeron's L1 and L2 caches truly shine -- brahma can
obtain no real advantage from its larger L2 cache (as previously noted,
lucifer rapidly fills its L2 cache with the data). Although lucifer's
L1 is still only around half again as fast as brahma's, its L2 cache is
now more than twice as fast! The Celeron is a pretty good little
chip, considering that it typically costs less than half as much as an
equivalent-clock PII or PIII.
It's once again worthwhile to emphasize that these figures do not
mean that the Celeron will turn in significantly better (or worse)
performance on a typical numerical task. Remember, we're all but
ignoring the role of the CPU itself in all of this. What we're really
doing is determining how fast the CPU can grab a chunk of memory if and
when it really needs one, which as we can see above depends on all sorts
of things (like the CPU clock, the memory clock, the cache clock, and
how likely the required memory is to already be in cache). We've
ignored the work being done by the CPU on that memory --
presumably we're bringing it in for something more constructive than
just reading it and rewriting it.
It is this last component that favors a dual. Since memory access is
generally parallelized by the caching subsystem to the extent possible,
even a few numerical instructions being executed on the memory once it's
fetched is often enough to unblock a dual CPU system so that one unit is
fetching the next block of memory into cache while the other executes
instructions. In a future revision of memtest I'll probably stick a
tiny counting loop that does mindless floating point operations right in
with the read/write. This will obviously slow down the memory access
rate. What isn't so obvious is that this will free up the memory bus
from contention, permitting two jobs to run at the same rate even
though the dual's memory bus is saturable in flat-out access. We can then
count the number of instructions required (per memory access) to unblock
the memory bus. I have a private bet that it is as few as 3, but we'll
see...
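A modified inner loop of that sort might look roughly like the following (a sketch of the idea only -- this does not exist in memtest; flops_per_access and the other names are hypothetical):

    /* Hypothetical future pass: a few mindless floating point
       operations per memory access.  Raising flops_per_access until two
       simultaneous jobs stop slowing each other down would count the
       instructions needed to unblock the memory bus. */
    double work_pass(unsigned int *block, size_t size, int flops_per_access)
    {
       struct timeval t0, t1;
       size_t i, j;
       unsigned int tmp;
       int k;
       volatile double x = 1.0;   /* volatile: keep the flops alive */

       gettimeofday(&t0, NULL);
       for (i = 0; i < size; i++) {
          j = block[i];
          tmp = block[j];                    /* read  */
          block[j] = tmp;                    /* write */
          for (k = 0; k < flops_per_access; k++)
             x *= 1.000001;                  /* mindless flops */
       }
       gettimeofday(&t1, NULL);

       return (t1.tv_sec - t0.tv_sec) * 1.0e6
            + (double) (t1.tv_usec - t0.tv_usec);
    }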
The other thing we're ignoring is the interpolating quality most code
has between the two limits represented by pure streaming access and pure
random access. Most code accesses a whole bunch of reasonably
neighborly addresses (in memory or code space) followed by a jump. We
can see in the results above that there is roughly a factor of two
difference (practically speaking) in the speed of access in the two
kinds of access, with presumably some sort of statistical interpolation
that depends on the probability of each. However, the speed advantage
of 100 MHz over 66 MHz memory (even further disadvantaged by smaller
caches and so forth) is even in the worst case much smaller than the
ratio of the clocks, except in the exceptional regime where a program
fits into the L2 cache of one system but has to run out of main memory
on the other.
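As a crude back-of-the-envelope model of this interpolation (my own
assumption, not anything memtest measures), one might write the
effective access time as

   t_eff ~= p * t_random + (1 - p) * t_stream

where p is the fraction of accesses that behave like random jumps and
(1 - p) the fraction that stream. With t_random only a factor of two or
so larger than t_stream, even a very jumpy program pays at most roughly
that factor-of-two penalty in access time relative to a pure streaming
one.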
eve
eve is a 400 MHz single CPU Celeron built with an Abit motherboard using
the 440BX chipset. eve has only 64 MB of main memory, which makes it
good to demonstrate certain bottlenecks.
lucifer
lucifer is a dual 466 MHz Celeron built with an Abit BP6 motherboard.
lucifer has 128 MB of main memory. The primary interesting feature
observable with lucifer is the effect of dual CPUs on memory access on
a system with a smallish (128K) L2 cache.
brahma
brahma is a Dell PowerEdge 2300. It is a dual 400 MHz PII with onboard
U2W controller, the Intel 440BX chipset, and 512 MB of PC-100 ECC
SDRAM. We present basically the same two figures that we did for
lucifer.
This page was written and is maintained by Robert G. Brown
rgb@phy.duke.edu