Brahma History

The Duke Physics Department is not new to cluster computing, nor did its cluster computing efforts begin with the beowulf in 1995. The department has run a Unix-based TCP/IP client/server network of commodity workstations since roughly the mid-80's. Originally this network was built around a PDP-11, a Sun 110M server with many serial ports and a serial port terminal, a handful of Sun 3's, an SGI Iris, and a Sun 386i.

A number of these systems were being used to perform computations in granular flow (Robert Behringer) and in quantum optics and condensed matter physics (Robert G. Brown and Mikael Ciftan), and of course numerous computations on specific topics were run on individual systems over the years (we are a physics department, after all:-).

Over a period of almost ten years, workstations (mostly Sun Sparcstations, but with SGI Iris's, an SGI 220S, and various PC's augmenting the mix) were added to the department's original thinwire 10Base2 network, and the group of Berndt Mueller in particular began using the network of systems for high performance computing in nuclear physics. SunOS did an excellent job of multitasking, so it was very natural to start to spread numerical jobs around and run them "in parallel" in the background on a workstation where some user was logged in and working interactively (but not computationally) in the foreground.

In about 1993, Brown learned about PVM -- one of the original parallel libraries designed to turn a cluster of networked workstations into a "virtual supercomputer". Within six months, his computations were running on top of PVM, using virtually all the systems in the physics department network as a compute cluster. In this way the jobs rapidly accumulated supercomputer-level quantities of time and permitted him to publish very precise Monte Carlo results without the tremendous expense of a real supercomputer, using what amounted to "free" computing (otherwise wasted idle cycles on desktop workstations).
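For readers who have never seen it, the flavor of PVM programming is easy to convey: a master task enrolls in the virtual machine, spawns worker tasks on whatever hosts are available, and exchanges packed messages with them. Below is a minimal sketch of the master side of such a master/worker Monte Carlo setup; the worker binary name ("mc_worker"), the message tags, and the work division are illustrative assumptions, not the actual structure of OnSpin3d.

    /* master.c -- minimal PVM master/worker sketch (link with -lpvm3) */
    #include <stdio.h>
    #include "pvm3.h"

    #define NWORKERS   8      /* hypothetical number of worker tasks     */
    #define TAG_WORK   1      /* made-up message tag for work parcels    */
    #define TAG_RESULT 2      /* made-up message tag for partial results */

    int main(void)
    {
        int tids[NWORKERS];
        int i, nspawned;
        long samples_per_worker = 1000000L;   /* illustrative work unit */
        double partial, total = 0.0;

        /* Enroll this process in the PVM virtual machine. */
        pvm_mytid();

        /* Spawn worker tasks on whatever hosts PVM has available
           (desktop and dedicated nodes alike). */
        nspawned = pvm_spawn("mc_worker", (char **)0, PvmTaskDefault,
                             "", NWORKERS, tids);

        /* Hand each worker its share of the Monte Carlo samples. */
        for (i = 0; i < nspawned; i++) {
            pvm_initsend(PvmDataDefault);
            pvm_pklong(&samples_per_worker, 1, 1);
            pvm_send(tids[i], TAG_WORK);
        }

        /* Collect one partial result from each worker and accumulate. */
        for (i = 0; i < nspawned; i++) {
            pvm_recv(-1, TAG_RESULT);
            pvm_upkdouble(&partial, 1, 1);
            total += partial;
        }

        if (nspawned > 0)
            printf("average of %d partial results: %g\n",
                   nspawned, total / nspawned);

        pvm_exit();
        return 0;
    }

The corresponding worker would simply receive its sample count, grind away on its share of the Monte Carlo sweeps, and send one packed partial result back; because each task runs independently for long stretches between messages, the pattern is very tolerant of the slow shared ethernet of the day.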

This worked so well that by 1994 Brown was using all of the physics department and the math department's architecturally similar Sun LAN as well, over thirty-two Sparc CPUs in all, in a single ongoing computation, and looking for more "free" compute power.

This was found in 1995 in the form of new Academic Computing clusters set up by Duke all over campus for student use. These were (mostly) Sparc 5 systems running Solaris 2.2 and 2.3, which had recently replaced the venerable BSD-based SunOS on all newly sold Sparc systems. Brown already had account access on the systems by virtue of being on the faculty; the systems (like all workstations used as interactive desktops) were 90% idle 90% of the time, and it looked like an ideal opportunity to extend the computation out over 100 or more systems delivering on the order of GFLOPs of total compute power.

Brown sought permission to run his (Monte Carlo) code in the background on these systems. Permission was granted on an experimental basis, and he promptly did so. For a bit more than six months, the "OnSpin3d" computation was spread out over as many as 150 systems at once in acpub, in the physics department, and in the math department.

Many GFLOP-years of cycles (depending on what you count as a "GFLOP") were accumulated on his embarrassingly parallel computation (which was no longer being run on top of PVM to avoid stressing the campus backbone network, but rather was being run more like a Grid application with remote, scripted job submission and occasional expect/tcl/rsh grazing of accumulated results). All this work went into a lovely Physical Review paper.

Alas, it turned out that although the idea of recovering cycles like this was conceptually sound and worked perfectly under BSD-based SunOS, SysV-based Solaris had a different, badly broken, scheduler. A background process that was "invisible" to the interactive user of a Sparcstation 1 console would, on a ten or more times faster Sparcstation 5, gradually boost its de facto priority until it ultimately rendered the interactive console unusably slow. In its early years and revisions, Solaris was known not-so-fondly as "Slowaris" among computer cognoscenti, largely for this reason.

Obviously, this was unacceptable, and Brown's "grand experiment" with turning an entire University WAN into a (for the time) enormous compute cluster ended. At the time there seemed to be little reason to "publish" the experiment for public accolade, but using published linpack results of roughly 10 MFLOPS for the Sparc 5, the aggregate peak power of this "Grid" style cluster in application to OnSpin3d was in the ballpark of 12-15 GFLOPs, enough to have easily put it in the top 100 on the top 500 list for 1995 (recognizing, of course, that the results for that list were based on a slightly different benchmark and hence not directly comparable).

Even so, the GFLOP-years of computation accumulated during the six months that OnSpin3d ran on the cluster were sufficient to allow a lovely paper to be written, and provided direct inspiration to Brown to build a dedicated cluster computer in the physics department to continue his work. He faced two problems. Sparcs (up to then the architecture of choice for the price/performance sensitive, especially at Duke, which had excellent pricing from Sun) were proving too expensive to buy in large quantities. At the same time, Solaris had proven to be a spectacular failure as a potential cluster operating system, and although the physics department had clung to SunOS 4.x for that very reason, it was increasingly difficult to resist Sun's pressure to "upgrade" to Solaris on new hardware.

Brown had already encountered linux and was experimenting with floppy-based Slackware and SLS on his home PC's, but the 133 MHz Pentium and AMD processors of the day, although "decent" performers in terms of the interactive desktop, were not stellar numerical performers compared to the 10 MFLOP or better Sparcs. In 1995, however, Intel had released its new Pentium Pro (P6 architecture) chips.

These chips promised several very significant improvements over the Pentium. Their floating point performance was much improved and actually compared well to the Sparcs, including the new Ultra Sparcs, in absolute terms while significantly beating them in price/performance. Also, they were architected to work in a dual processor configuration. At the time, the chassis, hard disk, and other supporting hardware that comprised a "system" were a significant fraction of its total cost. Packing two processors into one chassis represented a further improvement in the cost/benefit ratio for CPU bound code like OnSpin3d.

Brown did not have tremendously great resources to spend on computer hardware, and there were limits on what could reasonably be sought from the agency funding his research. This meant that building a cluster with 32 or more nodes was out of the question, as it would have cost well over $100,000 at the time. The design of a linux-based Pentium Pro cluster to run OnSpin3d was obvious; it was simply a matter of matching the scale of the design to the funding. Support was obtained from the Army Research Office to purchase four dual processor 200 MHz Pentium Pro systems, with the University providing a brand new switched 100BT ethernet for the entire department as part of a campus wide networking project.

Shortly after the system was designed and its funding sought, a visitor to the department from Drexel gave a short presentation on cluster computing, and for the first time we learned of the linux-based "beowulf" built the previous year, with similar clusters duplicated at Drexel and a few other institutions (the ones at the top of the "clusters" list on the beowulf.org website). We were not alone! This came as a welcome surprise, and Brown soon joined the beowulf mailing list and made lots of new friends, all heavily involved in building, designing, using and even studying COTS high-performance compute clusters.

The systems funded by the ARO were ordered and delivered in mid-1996, with a few additional systems purchased with DOE and NSF funds by individual groups in the department over the next year. The 2.0.0 linux kernel, the first production kernel that would support SMP operation on the Pentium Pro, had literally just been released in June and was promptly installed. A minor glitch with the vendor (the systems were ordered with TWO processors, but shipped with only one, making it remarkably difficult to get the right process scaling:-) was resolved satisfactorily.

Some interesting glitches in system operation caused Brown to make his first somewhat desperate foray into the new world of "kernel hacking", and gave him a deep insight into the benefits of running an open source operating system on compute nodes, where one could try to fix things instead of waiting "forever" for Sun to e.g. fix their scheduler. Nevertheless, by mid-September these initial problems were resolved, and OnSpin3d was running on all four linux-based SMP systems (three as "cluster nodes", one as a server and desktop console). Eight CPUs, 1600 aggregate MHz, and a reasonable fraction of a GFLOP at a cost of less than $20,000.

Brown's desktop had been named "ganesh" for years, as Brown grew up in India (a tradition that continues to this very day:-), and a motif of Hindu gods as names for Unix systems had been in use in the department for many years. The new systems were therefore christened Brahma, Shiva, Ganesh, and Arjuna, with the cluster itself named Brahma in honor of the "B" in "Beowulf" and with ganesh, on Brown's desktop, serving as the "head node".

Linux 2.0.0 soon proved to have a much better multitasking scheduler than Solaris (as did the earlier 0.9+ versions he had played with before), comparable in efficiency to the SunOS scheduler of days past. OnSpin3d ran happily in the background of desktop systems with no visible impact on interactive performance, so Brown rapidly pushed most of his nodes out to desktops within the physics department, which was somewhat starved at the time for desktop workstations. Graduate offices and faculty desks were graced with the original systems, and new systems were given out to anyone willing to buy a monitor, keyboard and mouse (not essential to their cluster functionality and hence not purchased with grant money). Brahma was architected as a "distributed parallel supercomputer" on the original PVM model, which could freely mix dedicated and desktop systems into a single organized model applied to a single (fairly coarse grained) parallel problem.

More and more systems were added to Brahma, and eventually systems started to be shelved in our relatively small server room. At any given time over half of the systems were out on desktops, but all the systems were typically at a load average of 1.0 or slightly higher, running OnSpin3d and other programs continuously in the background while permitting the interactive user to browse the burgeoning Web, read and send mail, run Mathematica, and do many other physics or instructionally related tasks. Few networks in the world squeezed more value out of their systems than our physics department's, which rapidly converted from all other operating systems to almost exclusive use of Linux as its overwhelming cost/benefit advantage became apparent.

All things pass in time, however, and clusters are no exception. Moore's law continued its inexorable advance, and PII's and the PIII's appeared. The first two major "all at once" additions of cluster nodes (the Intel equipment grant systems indicated as Brahma 2 and Brahma 3) continued to be named Brahma, and a node-naming scheme of b1, b2, b3... was introduced to differentiate "dedicated" compute nodes from nodes on people's desktops. This scheme failed to indicate processor generation or node architecture, however, and users had to keep track by hand of where the older 400 MHz CPUs ended and the newer 933 MHz systems began. The use of desktop systems as compute nodes began to gradually decline as well, not because it was obtrusive but because it simply became too difficult to know what was available and free.

Beginning with the ganesh cluster (g01-g15 and ganesh itself), all new "Brahma" clusters have been given unique "names" of their own that are associated with their ownership, application, and numbering scheme. This, together with xmlsysd and wulfstat, has made it very simple indeed to track the clusters for utilization, load, availability, and so forth.

At the time of this writing (April, 2003) Brahma has grown from its original GFLOP or so of aggregate power to somewhere in the ballpark of 50-100 GFLOPs, with over a hundred nodes and two hundred CPUs. Its power keeps growing as well; by now the power of COTS cluster computing is no secret, and Duke has gone from a couple of clusters in 1996 to literally more clusters and sub-clusters than can be counted in 2003. The newly renovated cluster/server room in the Duke physics building houses both our own brahma-related cluster(s) and the 96 CPU (Dual Athlon 1800+) "genie" cluster owned and operated by the Institute for Statistics and Decision Sciences at Duke, for example.

Commodity computer based network cluster computing was not invented by any single individual or group (although, as noted above, Jack Dongarra and the inventors of PVM would have one of the most valid claims based on the fact that this software package was one of the first explicitly designed to support this very thing). It has come into its present form by means of the collective effort of hundreds of individuals and groups in institutions and universities all over the world. Since the mid-90's, the organizing "brain" of that effort has without question been the beowulf mailing list, organized by Thomas Sterling and Donald Becker as an outgrowth of their "beowulf" project at NASA-Goddard.

Duke University and the Duke University Physics Department, via the Brahma project, are proud to have been one of the first major cluster computing sites, and one of the major contributors to the development and promotion of linux and cluster computing on the beowulf list for many years.