brahma
brahma is a Dell PowerEdge 2300. It is a dual 400 MHz PII with onboard
U2W controller, the Intel 440BX chipset, and 512 MB of PC-100 ECC
SDRAM. We present basically the same two figures that we did for
lucifer:

[figure]

and

[figure]

with the latter showing performance on the entire first megabyte as the
performance is (as one can see) totally boringly perfect except for a
small kink at the size of the L1 cache. This kink is worth studying in
detail below. Note well that the dual PII at 400 MHz
outperforms the dual Celeron at 466 MHz in memory access speeds,
especially for random access. This speedup will be compared later
across the entire range.
In the figure above we plot the average speed directly instead of the
time required to access the memory. Note that this "speed" is a
relative thing (which is one reason that we haven't worried about it
thus far). The timed loop contains a bunch of address and other
arithmetic which consumes time comparable to the raw memory access
times. It is true enough that this overhead will be present in
any related calculation, but it still prevents us from quite being
able to claim that the speed is the "memory read/write speed".
We could probably correct for this (by subtracting out the time required
to execute an empty timing loop) but it hardly seems worth it. Instead,
think of it as only a semi-quantitative result, most useful for purposes
of comparison.
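For the curious, here is roughly what that correction would look like.
This is a throwaway sketch, not memtest's actual code -- the seconds()
helper and the loop structure are invented purely for illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* wall clock seconds, good enough for a long timing loop */
    static double seconds(void)
    {
       struct timeval tv;
       gettimeofday(&tv, NULL);
       return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
       size_t n = 1 << 20;                 /* vector length, in doubles */
       double *v = calloc(n, sizeof(double));
       volatile double sink = 0.0;
       double t0, t_empty, t_full;
       size_t i;

       /* "empty" loop: the same loop and index arithmetic, no memory access */
       t0 = seconds();
       for (i = 0; i < n; i++)
          sink += (double) i;
       t_empty = seconds() - t0;

       /* full loop: the same arithmetic plus the actual read */
       t0 = seconds();
       for (i = 0; i < n; i++)
          sink += v[i];
       t_full = seconds() - t0;

       printf("raw %g ns/access, corrected %g ns/access\n",
              1.0e9 * t_full / n, 1.0e9 * (t_full - t_empty) / n);
       free(v);
       return 0;
    }

Of course the "empty" loop isn't a perfect model of the overhead (the
compiler may well treat the two loops differently), which is part of why
it hardly seems worth doing.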
However you view it, it beautifully illustrates the power of the L1
cache. Both kinds of memory access (streaming in open triangles and
random in filled triangles, as before) are faster when run out of L1
cache. Curiously, random access is faster than streaming in the
L1 cache itself, probably because the random access pattern forces the
load of the entire block sooner and more efficiently than a streaming
pattern. On the other hand, the drop-off for random access is more
dramatic when it is forced out to the L2 cache -- it becomes more than
two times slower and appears to be getting slower still, while the
streaming speed appears to be approximately constant.
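To make the two access patterns concrete, the two kinds of timed loops
look more or less like the following. This is an illustrative sketch in
the spirit of memtest, not its actual source, and the names are mine:

    #include <stdlib.h>

    /* streaming pass: walk the vector in order, which the cache (and any
       prefetch logic) handles about as well as it ever can */
    double stream_read(const double *v, size_t n)
    {
       double sum = 0.0;
       size_t i;
       for (i = 0; i < n; i++)
          sum += v[i];
       return sum;
    }

    /* "random" pass: follow an index table filled with 0..n-1 and shuffled
       (e.g. Fisher-Yates) before timing starts, so that successive reads
       land on unpredictable cache lines */
    double random_read(const double *v, const size_t *perm, size_t n)
    {
       double sum = 0.0;
       size_t i;
       for (i = 0; i < n; i++)
          sum += v[perm[i]];
       return sum;
    }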
It is worth looking at the L2 cache boundary the same way. In the next
figure, we present a direct comparison of the speed of lucifer
and brahma, for both one (square) and two (pentagon) tasks. brahma is
unfilled and lucifer is filled. You can see that brahma executes two
simultaneous memtests about as fast as lucifer can manage one.
Note the very sharp drop off at the cache boundaries in both cases.
This cusp is also apparent on the figures above. Even for streaming
access, the benefits of in-cache versus out of cache execution are
apparent. However, the scale of this figure hides a very important
detail.
By blowing up the figure to where we can see the L1 and L2 cache
boundaries for both processors, we get a surprise! lucifer's L1 cache is
about fifty percent faster than brahma's! This is a much bigger
advantage than we'd expect by comparing the raw clocks. The advantage
carries over into the L2 cache as well -- for a single task lucifer's L2
cache is half again as fast as brahma's, but running two tasks at once
essentially erases that advantage, although it isn't clear from the
figure how reproducible the oscillations for lucifer are. Overall,
lucifer appears a bit more sensitive to historical competition -- the
Celeron's performance isn't as robust as the PII's.
This figure shows what one can also read about on paper -- the cache of
the Celeron runs at the full CPU clock, while that of the PII runs at
half the clock speed. Because of the "buffering" effects of a larger L2
cache (which provides for things like room for more contexts in the
cache that otherwise have to come in from main memory, possibly in the
middle of a timing loop) and the fact that our timing loops aren't
"pure" but contain irrelevant instructions, one doesn't see all of the
Celeron's clock advantage, but it is certainly there and quite
pronounced for streaming access. It is undoubtedly this factor that
protects lucifer in the other direction in the previous figure, where we
could see that the memory speed advantage of the PII was nowhere near
what one would expect by just comparing the memory clock speeds.
It is worthwhile to look at the same figure for the random access test.
Here the Celeron's L1 and L2 caches truly shine -- brahma can
obtain no real advantage from its larger L2 cache (as previously noted,
lucifer rapidly fills its L2 cache with the data). Although lucifer's
L1 is still only around half again as fast as brahma's, its L2 cache is
now more than twice as fast! The Celeron is a pretty good little
chip, considering that it typically costs less than half as much as an
equivalent-clock PII or PIII.
It's once again worthwhile to emphasize that these figures do not
mean that the Celeron will turn in significantly better (or worse)
performance on a typical numerical task. Remember, we're all but
ignoring the role of the CPU itself in all of this. What we're really
doing is determining how fast the CPU can grab a chunk of memory if and
when it really needs one, which as we can see above depends on all sorts
of things (like the CPU clock, the memory clock, the cache clock, and
how likely the required memory is to already be in cache). We've
ignored the work being done by the CPU on that memory --
presumably we're bringing it in for something more constructive than
just reading it and rewriting it.
It is this last component that favors a dual. Since memory access is
generally parallelized by the caching subsystem to the extent possible,
even a few numerical instructions being executed on the memory once it's
fetched is often enough to unblock a dual CPU system so that one unit is
fetching the next block of memory into cache while the other executes
instructions. In a future revision of memtest I'll probably stick a
tiny counting loop that does mindless floating point operations right in
with the read/write. This will obviously slow down the memory access
rate. What isn't so obvious is that this will free up the memory bus
from contention, permitting two jobs to run at the same rate even
though the dual's memory bus is saturable in flat-out access. We can then
count the number of instructions required (per memory access) to unblock
the memory bus. I have a private bet that it is as few as 3, but we'll
see...
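Just to sketch what that future revision might look like (this is not
current memtest code; NFLOPS and the function name are invented for
illustration):

    #include <stddef.h>

    #define NFLOPS 3    /* flops per access -- the "private bet" above */

    /* read loop with a tunable amount of mindless floating point work per
       access; raising NFLOPS slows the measured access rate, but should
       eventually keep two simultaneous copies from fighting over the bus */
    double read_with_flops(const double *v, size_t n)
    {
       double sum = 0.0, x = 1.000001;
       size_t i;
       int k;

       for (i = 0; i < n; i++) {
          sum += v[i];                  /* the memory access being timed */
          for (k = 0; k < NFLOPS; k++)
             x *= 1.000001;             /* the mindless floating point work */
       }
       return sum + x;                  /* keep the compiler from throwing it away */
    }

One would then run two copies at once and raise NFLOPS until both turn
in the single-task rate.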
The other thing we're ignoring is the interpolating quality most code
has between the two limits represented by pure streaming access and pure
random access. Most code accesses a whole bunch of reasonably
neighborly addresses (in memory or code space) followed by a jump. We
can see in the results above that there is roughly a factor of two
difference (practically speaking) in the speed of access in the two
kinds of access, with presumably some sort of statistical interpolation
that depends on the probability of each. However, the speed advantage
of 100 MHz over 66 MHz memory (with the 66 MHz machine even further
disadvantaged by its smaller caches and so forth) is even in the worst
case much smaller than the ratio of the clocks, except in the
exceptional regime where a program fits into the L2 cache of one machine
but runs out of main memory on the other.
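One crude way to picture that interpolation: if some fraction of
accesses behave like the random limit and the rest like the streaming
limit, the expected time per access is just a weighted average of the
two measured limits. The numbers below are made up purely for
illustration:

    #include <stdio.h>

    int main(void)
    {
       double t_stream = 50.0;   /* ns/access in the streaming limit (made up)  */
       double t_random = 100.0;  /* ns/access in the random limit (made up)     */
       double p_jump   = 0.25;   /* fraction of accesses that behave like jumps */
       double t_avg;

       /* expected time per access for a mixed pattern: weighted average */
       t_avg = p_jump * t_random + (1.0 - p_jump) * t_stream;
       printf("expected time per access: %.1f ns\n", t_avg);
       return 0;
    }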
This page was written and is maintained by Robert G. Brown