Beowulf Resources and Links

This is the official home page for the Duke University Physics Department's Brahma Beowulf Project. Please feel free to explore this website. There are a number of things on the site itself that may be of use or interest to anyone interested in beowulf-style cluster computing.

This site is maintained by rgb. It and all works linked thereupon authored by Robert G. Brown are Copyright 2003 (or as indicated in the document) and made available through a modified Open Publication License unless superseded by another license directly associated with the document. (Current site version 2.2-1)



memtest

Memtest tarball
For those who just want to get the program and play with it.
About memtest
A brief description of memtest and some suggestions for ways to use it.
memtest results
Results of some applications of memtest in years past. Interesting to see the variation of memory speed, especially when accessing memory addresses randomly.

About memtest

memtest is a memory performance testing program. It makes no pretense of being exhaustive in its testing mechanism. Instead it focuses on two particular kinds of memory access.

The first of these is sequential or streaming access. memtest allocates a block of memory as an unsigned integer array and promptly fills the array with its own indexes, so block[0] = 0, block[1] = 1, and so forth. It then times a loop (with microsecond resolution) that reads, then writes back, the contents of each array element, in order. It repeats the timed loop N times and forms averages and standard deviations of the measurements for any given block size.
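The following is a minimal sketch, in C, of what such a streaming test might look like. The names (blocksize, nsamples) and all of the details are illustrative assumptions, not memtest's actual source; compile it without aggressive optimization or the do-nothing read/write may be elided entirely.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(void)
{
  size_t blocksize = 1 << 22;   /* unsigned ints in the test block (illustrative) */
  int nsamples = 10;            /* N repetitions for the average/standard deviation */
  unsigned int *block = malloc(blocksize*sizeof(unsigned int));
  if(block == NULL) return 1;

  /* fill the array with its own indices: block[0] = 0, block[1] = 1, ... */
  for(size_t i = 0;i < blocksize;i++) block[i] = (unsigned int) i;

  double sum = 0.0, sum2 = 0.0;
  for(int n = 0;n < nsamples;n++){
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);    /* microsecond resolution timer */
    for(size_t i = 0;i < blocksize;i++){
      unsigned int tmp = block[i];   /* read each element in order... */
      block[i] = tmp;                /* ...and write it back */
    }
    gettimeofday(&t1, NULL);
    double usec = (t1.tv_sec - t0.tv_sec)*1.0e6 + (t1.tv_usec - t0.tv_usec);
    sum += usec; sum2 += usec*usec;
  }
  double avg = sum/nsamples;
  double var = sum2/nsamples - avg*avg;
  printf("streaming pass: %.1f +/- %.1f microseconds\n", avg, var > 0.0 ? sqrt(var) : 0.0);
  free(block);
  return 0;
}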

This sequential access pattern gives maximum advantage to read-ahead caching strategies. It also maximally favors blocks that fit into existing CPU caches and so can avoid memory access entirely. No operations, float or integer, other than the read and write itself (and the inevitable loop indexing and addressing) are executed, so it is insensitive to float or integer speed per se.

The second access pattern it tests is random access. To accomplish this, inside the sampling loop (N) but outside the timing loop, the block vector is filled with a random shuffling of its indices. As these indices are used to direct the swap (in the streaming test as well) this results in the read/write being executed in a random order.

This works to defeat caching strategies by only rarely accessing sequential pieces of memory. It also makes the access highly nonlocal, making it very difficult to obtain any advantage from the cache itself. When the block size becomes much bigger than any of the cache sizes, the chances become very large that the next bit of data requested will come from somewhere in main memory and not from a piece of cache preloaded by an earlier lookahead.

Note that by isolating the (numerically intensive) work of loading the random block addresses we don't increase the amount of actual work done inside the timing loops. As the figures below indicate, random access patterns do increase the time required to access a given piece of data (relative to optimally efficient streaming access) but only by roughly a factor of two. However, the program itself will take much longer to complete, because generating all those random numbers can be very time consuming.
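A similarly hedged sketch of the random-access variant, building on the streaming sketch above (again illustrative only, not memtest's source): the shuffle runs once per sample outside the timed region, and the timed pass lets the contents of the block direct the access, which with in-order indices reduces to the simple streaming loop sketched earlier.

#include <stdlib.h>
#include <sys/time.h>

/* Fisher-Yates shuffle of the index vector -- done per sample, OUTSIDE the
 * timer.  rand() % n is not perfectly uniform, but adequate for illustration. */
void shuffle_indices(unsigned int *block, size_t n)
{
  for(size_t i = n - 1;i > 0;i--){
    size_t j = (size_t) rand() % (i + 1);
    unsigned int tmp = block[i];
    block[i] = block[j];
    block[j] = tmp;
  }
}

/* The timed pass: block[i] supplies the index of the element to read/write,
 * so a shuffled block produces the same work per pass but in random order. */
double timed_pass(unsigned int *block, size_t n)
{
  struct timeval t0, t1;
  gettimeofday(&t0, NULL);
  for(size_t i = 0;i < n;i++){
    unsigned int target = block[i];     /* index vector read in order...      */
    unsigned int tmp = block[target];   /* ...but the element it points to is */
    block[target] = tmp;                /* read and written in shuffled order */
  }
  gettimeofday(&t1, NULL);
  return (t1.tv_sec - t0.tv_sec)*1.0e6 + (t1.tv_usec - t0.tv_usec);
}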

To speed it up again, one can opt to read/write only every nth element in the block vector (or the vector element pointed to by the nth index). When testing very large blocks of memory it is suggested that this strategy be employed, along with a large block increment. As the figures make plain, through most of the range of memory sizes to be tested, memory speeds are boring (that is, locally constant). A quick pass through to find out where the interesting parts are (informed by a knowledge of the size of e.g. the L1 and L2 caches and the physical memory itself), followed by a refined pass across the interesting regions, is a sensible strategy.
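For illustration only, a stride can be folded into the same timed pass along the following lines (the stride parameter and its handling here are assumptions of the sketch, not a description of memtest's actual options):

#include <stddef.h>
#include <sys/time.h>

/* Touch only every stride-th entry of the index vector, so that very large
 * blocks can be sampled in a reasonable amount of time. */
double timed_strided_pass(unsigned int *block, size_t n, size_t stride)
{
  struct timeval t0, t1;
  gettimeofday(&t0, NULL);
  for(size_t i = 0;i < n;i += stride){
    unsigned int target = block[i];
    unsigned int tmp = block[target];   /* read...       */
    block[target] = tmp;                /* ...write back */
  }
  gettimeofday(&t1, NULL);
  return (t1.tv_sec - t0.tv_sec)*1.0e6 + (t1.tv_usec - t0.tv_usec);
}

Note that a strided pass performs only about n/stride accesses, which has to be accounted for when converting the measured times into per-access rates.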


Results from memtest

eve
A 400 MHz, 440BX Celeron with 64 MB of SDRAM.
lucifer
A 466 MHz dual Celeron (Abit BP6 motherboard) with 128 MB of SDRAM.
brahma
A Dell Poweredge 2300 with dual 400 MHz PIIs and the 440BX chipset. The system has 512 MB of PC-100 ECC SDRAM. This is (was) one of the head nodes of brahma.

Below are figures generated by applying memtest to a number of Intel-based systems I run at Duke or at home. Each figure is accompanied by a brief description of the hardware and some comments. Note well that the scales of these figures are not all the same. Note also that I don't claim that these results are either "correct" or general -- they are just what I got (and they generally make sense). Use them at your own risk. The better thing to do is to use memtest (or stream) to measure the memory properties of your own system(s) or specific prototypes for your cluster -- as some of the figures below should actually make clear, it is difficult to generalize about the memory performance of one system given that of another, even another that on paper should be similar.

eve

eve is a 400 MHz single CPU Celeron built with an Abit motherboard using the 440BX chipset. eve has only 64 MB of main memory, which makes it good to demonstrate certain bottlenecks.

In the figure above, eve has about 55 MB of memory that (at the time of the test) could be freed (with some pain) for a calculation. At the very end, it is just starting to swap. Note that random access to memory (filled triangles) is much slower (by nearly a factor of three) than streaming access (empty triangles).

The figure above shows eve's speed performance below 500K. The effect of the L1 and L2 caches is very clearly delineated -- a peak near the L1 boundary (at 16K) that tails into an extended peak across L2, ending when the code starts to fill the L2 cache (depending on what else is running).

lucifer

lucifer is a dual 466 MHz Celeron built with an Abit BP6 motherboard. lucifer has 128 MB of main memory. The primary interesting feature observable with lucifer is the effect of dual CPUs on memory access on a system with a smallish (128K) L2 cache.

Compare streaming single CPU performance (empty triangles) to streaming dual CPU performance (empty hexagons and stars). Note that there is no penalty associated with any of the "disadvantages" of the dual Celeron architecture. Memory speed is nearly the same for both and this equality extends (as we can see below) all the way down to the boundaries of the L2 cache (within statistical noise).

When random access for a single CPU (filled triangles) or for dual CPUs (filled hexagons) is compared, though, the story is quite different. First of all, it is apparent that random access is some three times slower than streaming access for a single CPU, not unlike what we saw above for eve. Second, simultaneously accessing random memory blocks with both CPUs adds an even larger penalty -- not quite twice as slow as a single CPU with a random pattern of access. We clearly saturate the memory bus.

In the figure above the relative speed within the first 530K of memory is displayed. One can clearly see the L2 cache boundary and, in the case of random access, the L1 cache boundary. Clearly memtest is revealing a lot of very interesting things about the memory performance of the system. In another section below, we'll discuss how much you should worry about this for your code. Chances are good that it has enough streaming component that the answer would be "not much" and that a dual Celeron is a sensible choice for you, but (as we'll see later) it is certainly possible that this is not correct.

brahma

brahma is a Dell Poweredge 2300. It is a dual 400 MHz PII with onboard U2W controller, the Intel 440BX chipset, and 512 MB of PC-100 ECC SDRAM. We present basically the same two figures that we did for lucifer:

and

with the latter showing performance on the entire first megabyte as the performance is (as one can see) totally boringly perfect except for a small kink at the size of the L1 cache. This kink is worth studying in detail below. Note well that the dual PII at 400 MHz outperforms the dual Celeron at 466 MHz in memory access speeds, especially for random access. This speedup will be compared later across the entire range.

In the figure above we plot the average speed directly instead of the time required to access the memory. Note that this "speed" is a relative thing (which is one reason that we haven't worried about it thus far). The timed loop contains a bunch of address and other arithmetic which consumes time comparable to the raw memory access time. It is true enough that this overhead will be present in any related calculation, but it still prevents us from quite claiming that the speed is the "memory read/write speed". We could probably correct for this (by subtracting out the time required to execute an empty timing loop) but it hardly seems worth it. Instead, think of it as only a semi-quantitative result, most useful for purposes of comparison.
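If one did want to make that correction, a back-of-the-envelope version might look like the following (purely illustrative helper functions, not part of memtest):

#include <stddef.h>

/* Raw "speed": accesses per microsecond, loop overhead and all. */
double raw_speed(double usec_per_pass, size_t naccesses)
{
  return (double) naccesses / usec_per_pass;
}

/* Overhead-corrected speed: subtract the time of an otherwise identical
 * timing loop that performs no read/write, leaving (roughly) the memory time. */
double corrected_speed(double usec_per_pass, double usec_empty_loop, size_t naccesses)
{
  return (double) naccesses / (usec_per_pass - usec_empty_loop);
}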

However you view it, it beautifully illustrates the power of the L1 cache. Both kinds of memory access (streaming in open triangles and random in filled triangles, as before) are faster when run out of L1 cache. Curiously, random access is faster than streaming in the L1 cache itself, probably because the random access pattern forces the load of the entire block sooner and more efficiently than a streaming pattern. On the other hand, its drop-off is more dramatic when forced to the L2 cache -- it becomes more than two times slower and appears to be getting slower still, while the streaming speed appears to be approximately constant.

It is worth looking at the L2 cache boundary the same way. In the next figure, we present a direct comparison of the speed of lucifer and brahma, for both one (square) and two (pentagon) tasks. brahma is unfilled and lucifer is filled. You can see that brahma executes two simultaneous memtests about as fast as lucifer can manage one.

Note the very sharp drop-off at the cache boundaries in both cases. This cusp is also apparent on the figures above. Even for streaming access, the benefits of in-cache versus out-of-cache execution are apparent. However, the scale of this figure hides a very important detail.

By blowing up the figure to where we can see the L1 and L2 cache boundaries for both processors, we get a surprise! lucifer's L1 cache is about fifty percent faster than brahma's! This is a much bigger advantage than we'd expect by comparing the raw clocks. This advantage carries over into the L2 cache as well -- for a single task lucifer's L2 cache is half again as fast as brahma's, but running two tasks at once essentially erases that advantage, although it isn't clear from the figure how reproducible the oscillations for lucifer are. lucifer overall appears a bit more sensitive to historical competition -- its Celeron performance isn't as robust as the PII's.

This figure shows what one can also read about on paper -- the cache of the Celeron runs at the full CPU clock, while that of the PII runs at half the clock speed. Because of the "buffering" effects of a larger L2 cache (which provides room for things like more contexts in the cache that would otherwise have to come in from main memory, possibly in the middle of a timing loop) and the fact that our timing loops aren't "pure" but contain irrelevant instructions, one doesn't see all of the Celeron's clock advantage, but it is certainly there and quite pronounced for streaming access. It is undoubtedly this factor that protects lucifer in the other direction in the previous figure, where we could see that the memory speed advantage of the PII was nowhere near what one would expect by just comparing the memory clock speeds.

It is worthwhile to look at the same figure for the random access test. Here the Celeron's L1 and L2 caches truly shine -- brahma can obtain no real advantage from its larger L2 cache (as previously noted, lucifer rapidly fills its L2 cache with the data). Although lucifer's L1 is still only around half again as fast as brahma's, its L2 cache is now more than twice as fast! The Celeron is a pretty good little chip, considering that it typically costs less than half as much as an equivalent-clock PII or PIII.

It's once again worthwhile to emphasize that these figures do not mean that the Celeron will turn in significantly better (or worse) performance on a typical numerical task. Remember, we're all but ignoring the role of the CPU itself in all of this. What we're really doing is determining how fast the CPU can grab a chunk of memory if and when it really needs one, which as we can see above depends on all sorts of things (like the CPU clock, the memory clock, the cache clock, and how likely the required memory is to already be in cache). We've ignored the work being done by the CPU on that memory -- presumably we're bringing it in for something more constructive than just reading it and rewriting it.

It is this last component that favors a dual. Since memory access is generally parallelized by the caching subsystem to the extent possible, even a few numerical instructions being executed on the memory once it's fetched is often enough to unblock a dual CPU system so that one unit is fetching the next block of memory into cache while the other executes instructions. In a future revision of memtest I'll probably stick a tiny counting loop that does mindless floating point operations right in with the read/write. This will obviously slow down the memory access rate. What isn't so obvious is that this will free up the memory bus from contention, permitting two jobs to run at the same rate even though the dual memory bus is saturable in flat out access. We can then count the number of instructions required (per memory access) to unblock the memory bus. I have a private bet that it is as few as 3, but we'll see...
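A hedged sketch of what such a modified pass might look like (the nflops parameter and the function name are assumptions of the sketch, not a preview of memtest's actual code):

#include <stddef.h>
#include <sys/time.h>

/* As the timed pass above, but with nflops mindless floating point operations
 * executed per memory access.  The volatile keeps the work from being
 * optimized away. */
double timed_pass_with_flops(unsigned int *block, size_t n, int nflops)
{
  struct timeval t0, t1;
  volatile double x = 1.000001;
  gettimeofday(&t0, NULL);
  for(size_t i = 0;i < n;i++){
    unsigned int target = block[i];
    unsigned int tmp = block[target];   /* the usual read...  */
    block[target] = tmp;                /* ...and write back  */
    for(int k = 0;k < nflops;k++)       /* mindless FP work   */
      x *= 1.000001;
  }
  gettimeofday(&t1, NULL);
  return (t1.tv_sec - t0.tv_sec)*1.0e6 + (t1.tv_usec - t0.tv_usec);
}

Scanning nflops upward while running two copies at once would then show directly how many floating point operations per access it takes before the two jobs stop contending for the memory bus.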

The other thing we're ignoring is the interpolating quality most code has between the two limits represented by pure streaming access and pure random access. Most code accesses a whole bunch of reasonably neighborly addresses (in memory or code space) followed by a jump. We can see in the results above that there is roughly a factor of two difference (practically speaking) between the speeds of the two kinds of access, with presumably some sort of statistical interpolation that depends on the probability of each. However, the speed advantage of 100 MHz over 66 MHz memory (even further disadvantaged by smaller caches and so forth) is even in the worst case much smaller than the ratio of the clocks, except in the exceptional regime where a program fits into the L2 cache of one system but runs out of main memory on the other.
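As a crude, hedged way of quantifying that interpolation (an estimate, not a measurement): if a fraction p of a code's accesses behave like the random pattern and the remaining (1 - p) stream, the effective access time is roughly

  t_eff = p*t_random + (1 - p)*t_stream

so with t_random only about twice t_stream, as measured above, the overall penalty varies smoothly between one and two depending on p.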


This page was written and is maintained by Robert G. Brown
rgb@phy.duke.edu