The first of these is sequential or streaming access. memtest allocates a block of memory as an unsigned integer array and promptly fills the array with its own indexes, so block[0] = 0, block[1] = 1, and so forth. It then times a loop (with microsecond resolution) that reads, then writes back, the contents of each array element, in order. It repeats the timed loop N times and forms averages and standard deviations of the measurements for any given block size.
This sequential access pattern gives maximum advantage to read-ahead caching strategies. It also maximally favors blocks that fit into existing CPU caches and so can avoid memory access entirely. No operations, float or integer, other than the read and write themselves (and the inevitable loop indexing and addressing) are executed, so the test is insensitive to float or integer speed per se.
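In rough outline, a single timed streaming pass looks something like the following sketch (illustrative C only, not memtest's actual source; the function name, the use of gettimeofday(), and the variable names are my own assumptions):

    #include <stdlib.h>
    #include <sys/time.h>

    /* Illustrative sketch of the streaming pass; returns the elapsed
       time of one pass in microseconds. */
    double stream_pass(unsigned int *block, size_t size)
    {
       struct timeval t0, t1;
       size_t i;
       unsigned int tmp;

       /* Fill the array with its own indices: block[0] = 0, block[1] = 1... */
       for (i = 0; i < size; i++) block[i] = (unsigned int) i;

       /* Timed loop: read, then write back, each element in order.  (A
          real benchmark also has to keep the compiler from optimizing
          the read/write away, e.g. by declaring things volatile.) */
       gettimeofday(&t0, NULL);
       for (i = 0; i < size; i++) {
          tmp = block[i];      /* read  */
          block[i] = tmp;      /* write */
       }
       gettimeofday(&t1, NULL);

       return (t1.tv_sec - t0.tv_sec) * 1.0e6
            + (double) (t1.tv_usec - t0.tv_usec);
    }

Repeating this N times per block size and accumulating the mean and standard deviation of the pass times yields the quoted figures.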
The second access pattern it tests is a random access pattern. To accomplish this, inside the sampling loop (N) but outside the timing loop, the block vector is filled with a random shuffling of its indices. As these stored indices are used to direct the access (in the streaming test as well), the reads and writes are executed in a random order.
This works to defeat caching strategies by only rarely accessing sequential pieces of memory. It also makes the access highly nonlocal, making it very difficult to obtain any advantage from the cache itself. When the block size becomes much bigger than any of the cache sizes, the chances become very large that the next bit of data requested will come from somewhere in main memory and not in a piece of cache preloaded by an earlier lookahead.
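A sketch of the corresponding random pass (same caveats and assumptions as the streaming sketch above) might be:

    /* Illustrative sketch of the random pass, not memtest's source. */
    double random_pass(unsigned int *block, size_t size)
    {
       struct timeval t0, t1;
       size_t i, j;
       unsigned int tmp;

       /* Outside the timed region: fill the block with a random
          permutation of its own indices (a Fisher-Yates shuffle here;
          a real benchmark might use a better generator than rand()). */
       for (i = 0; i < size; i++) block[i] = (unsigned int) i;
       for (i = size - 1; i > 0; i--) {
          j = (size_t) rand() % (i + 1);
          tmp = block[i]; block[i] = block[j]; block[j] = tmp;
       }

       /* Timed loop: exactly the same work as the streaming pass, but
          the stored index decides where the read/write lands. */
       gettimeofday(&t0, NULL);
       for (i = 0; i < size; i++) {
          j = block[i];
          tmp = block[j];      /* read  */
          block[j] = tmp;      /* write */
       }
       gettimeofday(&t1, NULL);

       return (t1.tv_sec - t0.tv_sec) * 1.0e6
            + (double) (t1.tv_usec - t0.tv_usec);
    }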
Note that by isolating the (numerically intensive) work of loading the random block addresses outside the timing loops, we don't increase the amount of actual work done inside them. As the figures below indicate, random access patterns do increase the time required to access a given piece of data (relative to optimally efficient streaming access), but only by roughly a factor of two. However, the program itself will take much longer to complete, because generating all those random numbers can be very time consuming.
To speed it up again, one can opt to read/write only every nth element in the block vector (or the vector element pointed to by the nth index). When testing very large blocks of memory it is suggested that this strategy be employed, and also that a large block increment be used. As the figures make plain, through most of the range of memory sizes to be tested, memory speeds are boring (that is, locally constant). A quick pass to find out where the interesting parts are (informed by a knowledge of the size of e.g. the L1 and L2 caches and the physical memory itself), followed by a refined pass across the interesting regions, is a sensible strategy.
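The every-nth-element variant amounts to nothing more than striding the timed loop in the sketches above; stride here is a hypothetical parameter of my own, not memtest's documented interface:

    /* Touch only every stride-th element: the same address range is
       sampled with far less total (and timed) work, which matters when
       the block is hundreds of megabytes. */
    for (i = 0; i < size; i += stride) {
       j = block[i];
       tmp = block[j];      /* read  */
       block[j] = tmp;      /* write */
    }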
When random access for a single CPU (filled triangles) or for dual CPUs
(filled hexagons) is compared, though, the story is quite different.
First of all, it is apparent that random access is some three times
slower than streaming access for a single CPU, not unlike what we saw
above for eve. Second, simultaneously accessing random memory
blocks with both CPUs adds an even larger penalty -- not quite
twice as slow as a single CPU with a random pattern of access. We
clearly saturate the memory bus.
Two figures are presented, with the latter showing performance on the
entire first megabyte; the performance is (as one can see) totally
boringly perfect except for a small kink at the size of the L1 cache.
This kink is worth studying in detail below. Note well that the dual
PII at 400 MHz
outperforms the dual Celeron at 466 MHz in memory access speeds,
especially for random access. This speedup will be compared later
across the entire range.
In the figure above we plot the average speed directly instead of the
time required to access the memory. Note that this "speed" is a
relative thing (which is one reason that we haven't worried about it
thus far). The timed loop contains a bunch of address and other
arithmetic which consumes time comparable to the raw memory access
times. It is true enough that this overhead will be present in
any related calculation, but it still prevents us from quite being
able to claim that the speed is the "memory read/write speed".
We could probably correct for this (by subtracting out the time required
to execute an empty timing loop) but it hardly seems worth it. Instead,
think of it as only a semi-quantitative result, most useful for purposes
of comparison.
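If one did want the correction, a crude approach (a sketch under the same assumptions as the earlier fragments, not something memtest does) would be to time a loop containing only the loop and index arithmetic and subtract it:

    #include <sys/time.h>

    /* Time an "empty" pass containing only loop/index arithmetic.
       sink is volatile so the compiler cannot delete the loop. */
    double empty_pass(size_t size)
    {
       struct timeval t0, t1;
       volatile size_t sink = 0;
       size_t i;

       gettimeofday(&t0, NULL);
       for (i = 0; i < size; i++) sink += i;
       gettimeofday(&t1, NULL);

       return (t1.tv_sec - t0.tv_sec) * 1.0e6
            + (double) (t1.tv_usec - t0.tv_usec);
    }

    /* corrected ~ stream_pass(block, size) - empty_pass(size) */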
However you view it, it beautifully illustrates the power of the L1
cache. Both kinds of memory access (streaming in open triangles and
random in filled triangles, as before) are faster when run out of L1
cache. Curiously, random access is faster than streaming in the
L1 cache itself, probably because the random access pattern forces the
load of the entire block sooner and more efficiently than a streaming
pattern. On the other hand, its drop-off is more dramatic when forced
to the L2 cache -- it becomes more than two times slower and appears to
be getting slower still, while the streaming speed remains
approximately constant.
It is worth looking at the L2 cache boundary the same way. In the next
figure, we present a direct comparison of the speed of lucifer
and brahma, for both one (square) and two (pentagon) tasks. brahma is
unfilled and lucifer is filled. You can see that brahma executes two
simultaneous memtests about as fast as lucifer can manage one.
Note the very sharp drop-off at the cache boundaries in both cases.
This cusp is also apparent on the figures above. Even for streaming
access, the benefits of in-cache versus out of cache execution are
apparent. However, the scale of this figure hides a very important
detail.
By blowing up the figure to where we can see the L1 and L2 cache
boundaries for both processors, we get a surprise! lucifer's L1 cache is
about fifty percent faster than brahma's! This is a much bigger
advantage than we'd expect from comparing the raw clocks. The advantage
carries over into the L2 cache as well -- for a single task lucifer's L2
cache is half again as fast as brahma's, but running two tasks at once
essentially erases that advantage, although it isn't clear from the
figure how reproducible the oscillations for lucifer are. Overall,
lucifer appears a bit more sensitive to its history of competition for
memory -- its Celeron performance isn't as robust as the PII's.
This figure shows what one can also read about on paper -- the cache of
the Celeron runs at the full CPU clock, while that of the PII runs at
half the clock speed. Because of the "buffering" effects of a larger L2
cache (which provides for things like room for more contexts in the
cache that would otherwise have to come in from main memory, possibly in
the middle of a timing loop) and the fact that our timing loops aren't
"pure" but contain irrelevant instructions, one doesn't see all of the
Celeron's clock advantage, but it is certainly there and quite
pronounced for streaming access. It is undoubtedly this factor that
protects the Celeron in the other direction in the previous figure,
where we could see that the memory speed advantage of the PII was
nowhere near what one would expect by just comparing the memory clock
speeds.
It is worthwhile to look at the same figure for the random access test.
Here the Celeron's L1 and L2 caches truly shine -- brahma can
obtain no real advantage from its larger L2 cache (as previously noted,
lucifer rapidly fills its L2 cache with the data). Although lucifer's
L1 is still only around half again as fast as brahma's, its L2 cache is
now more than twice as fast! The Celeron is a pretty good little
chip, considering that it typically costs less than half as much as an
equivalent-clock PII or PIII.
It's once again worthwhile to emphasize that these figures do not
mean that the Celeron will turn in significantly better (or worse)
performance on a typical numerical task. Remember, we're all but
ignoring the role of the CPU itself in all of this. What we're really
doing is determining how fast the CPU can grab a chunk of memory if and
when it really needs one, which as we can see above depends on all sorts
of things (like the CPU clock, the memory clock, the cache clock, and
how likely the required memory is to already be in cache). We've
ignored the work being done by the CPU on that memory --
presumably we're bringing it in for something more constructive than
just reading it and rewriting it.
It is this last component that favors a dual. Since memory access is
generally parallelized by the caching subsystem to the extent possible,
even a few numerical instructions being executed on the memory once it's
fetched is often enough to unblock a dual CPU system so that one unit is
fetching the next block of memory into cache while the other executes
instructions. In a future revision of memtest I'll probably stick a
tiny counting loop that does mindless floating point operations right in
with the read/write. This will obviously slow down the memory access
rate. What isn't so obvious is that this will free up the memory bus
from contention, permitting two jobs to run at the same rate even
though the dual's memory bus is saturable in flat-out access. We can then
count the number of instructions required (per memory access) to unblock
the memory bus. I have a private bet that it is as few as 3, but we'll
see...
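A modified inner loop of that sort might look roughly like the following (a sketch of the idea only -- this does not exist in memtest; flops_per_access and the other names are hypothetical):

    /* Hypothetical future pass: a few mindless floating point
       operations per memory access.  Raising flops_per_access until two
       simultaneous jobs stop slowing each other down would count the
       instructions needed to unblock the memory bus. */
    double work_pass(unsigned int *block, size_t size, int flops_per_access)
    {
       struct timeval t0, t1;
       size_t i, j;
       unsigned int tmp;
       int k;
       volatile double x = 1.0;   /* volatile: keep the flops alive */

       gettimeofday(&t0, NULL);
       for (i = 0; i < size; i++) {
          j = block[i];
          tmp = block[j];                    /* read  */
          block[j] = tmp;                    /* write */
          for (k = 0; k < flops_per_access; k++)
             x *= 1.000001;                  /* mindless flops */
       }
       gettimeofday(&t1, NULL);

       return (t1.tv_sec - t0.tv_sec) * 1.0e6
            + (double) (t1.tv_usec - t0.tv_usec);
    }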
The other thing we're ignoring is the interpolating quality most code
has between the two limits represented by pure streaming access and pure
random access. Most code accesses a whole bunch of reasonably
neighborly addresses (in memory or code space) followed by a jump. We
can see in the results above that there is roughly a factor of two
difference (practically speaking) in the speed of access in the two
kinds of access, with presumably some sort of statistical interpolation
that depends on the probability of each. However, the speed advantage
of 100 MHz over 66 MHz memory (even further disadvantaged by smaller
caches and so forth) is even in the worst case much smaller than the
ratio of the clocks, except in the exceptional regime where a program
fits into the L2 cache of one system but has to run out of main memory
on the other.
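As a crude back-of-the-envelope model of this interpolation (my own
assumption, not anything memtest measures), one might write the
effective access time as

   t_eff ~= p * t_random + (1 - p) * t_stream

where p is the fraction of accesses that behave like random jumps and
(1 - p) the fraction that stream. With t_random only a factor of two or
so larger than t_stream, even a very jumpy program pays at most roughly
that factor-of-two penalty in access time relative to a pure streaming
one.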
eve
eve is a 400 MHz single CPU Celeron built with an Abit motherboard using
the 440BX chipset. eve has only 64 MB of main memory, which makes it
good to demonstrate certain bottlenecks.
lucifer
lucifer is a dual 466 MHz Celeron built with an Abit BP6 motherboard.
lucifer has 128 MB of main memory. The primary interesting feature
observable with lucifer is the effect of dual CPUs on memory access on
a system with a smallish (128K) L2 cache.
brahma
brahma is a Dell PowerEdge 2300. It is a dual 400 MHz PII with onboard
U2W controller, the Intel 440BX chipset, and 512 MB of PC-100 ECC
SDRAM. We present basically the same two figures that we did for
lucifer.
This page was written and is maintained by Robert G. Brown
rgb@phy.duke.edu