HEAD: Beowulf Infrastructure

DECK: How to care for and feed your beowulf-style supercomputer cluster

TOC_LINE: Beowulf nodes are cheap and easy to cost out, but estimating the true costs of a place for them to live, power, cooling, and a scalable administrative infrastructure can be a headache. This article shows you everything you need to know before writing a proposal or a budget for your first beowulf.

AUTHOR: Robert G. Brown

SUBHEAD: Introduction

Beowulf-style supercomputers built out of over-the-counter (OTC) hardware are by far the cheapest way to buy floating point cycles, especially at the hardware level. Small "hobby scale" beowulfs (less than perhaps 8 nodes, where for the purposes of this article a "node" will refer to a single case, which might house one or more CPUs) can pretty much be built "anywhere", and with an appropriate software installation and integration scheme they can be managed by anyone with a decent knowledge of linux (or really any other unix) and some systems administration skills. I have a compute cluster of this scale in my home, for example.

As the number of nodes increases, however, careful attention must be paid to physical infrastructure. Cluster nodes consume electricity. They must be kept cool. They weigh a certain amount, have a footprint on the floor, and take up volume. They can be racked or stacked in a variety of ways. They require network wiring. One must be able to physically access the fronts and/or backs to cable them up or power cycle them.

In a similar vein, if one is installing only a handful of nodes, taking an hour, or even several hours, to install each node is still only a day or two of time. It may be fairly easy and relatively inexpensive to put a monitor and keyboard on each node, or use a cheap keyboard, monitor and mouse switch to get to each node to do an install or upgrade. Spending minutes per node per day administering the nodes again may add up to only a day or two of time over a year. This leaves most of the year for cluster production.

When one plans to install a hundred nodes, this kind of seat-of-the-pants approach simply does not work. A hundred nodes might require 10 kilowatts of electrical power, several "tons" of air conditioning (a term that doesn't refer to the weight of the air conditioner but rather to its capacity), could weigh several tons, might need hundreds of square feet of floor space, and could easily cost $10,000 a year in recurring costs for power and cooling, over and above the cost of the power lines and air conditioning units it needs in the first place.

Management costs face a similar crisis. Spending two hours installing each node adds up to 200 hours (or five full work weeks) to install the cluster. Spending two minutes per node per day administering the nodes adds up to over three hours a day! One can easily get to the point where one has no time to use a cluster, or where managing even what is really a fairly small cluster as professional clusters go has become a more than full time job.

As one can see, infrastructure costs can be considerable for larger clusters, and poor methodology (methodology that does not scale well with the number of nodes) can lead to disaster. This article describes some things you should know before you run out and buy a few hundred nodes to put in your metaphorical garage, and suggests at least a few ways to honestly estimate the fixed and recurring requirements and costs for running a relatively large cluster.
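To make the scaling problem concrete, here is a minimal back-of-the-envelope sketch in shell. The node count, per-node wattage, and per-node time figures are just the illustrative assumptions used above, not measurements; substitute your own.

#!/bin/sh
# Back-of-the-envelope scaling estimate for a hypothetical 100-node cluster.
# All of the inputs below are assumptions for illustration only.
NODES=100
WATTS_PER_NODE=100          # rough per-node draw; see "Power and Cooling"
INSTALL_HOURS_PER_NODE=2    # hands-on installation time per node
ADMIN_MIN_PER_NODE_DAY=2    # daily hands-on administration per node

ADMIN_MIN=$(( NODES * ADMIN_MIN_PER_NODE_DAY ))
echo "Total power draw:  $(( NODES * WATTS_PER_NODE )) W"
echo "Installation time: $(( NODES * INSTALL_HOURS_PER_NODE )) hours"
echo "Daily admin time:  $(( ADMIN_MIN / 60 )) hours, $(( ADMIN_MIN % 60 )) minutes"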
SUBHEAD: Cluster Space

As noted above, cluster nodes have a variety of physical dimensions. They have a "footprint" (the area of their base), a height (and hence a volume), and a weight. Their shelving or rackmount may increase their footprint. Access to the front and back of the nodes must typically be preserved to allow nodes to be moved in and out of the cluster, to allow cool air in and warm air out, and to provide access to network and power cabling. This access space will usually require 2-3x the footprint of the node itself in even the most efficient cluster room layout.

Nodes are often stacked up vertically, tower units on heavy duty steel shelving or rackmount units in two post or four post racks. In very rough terms, four shelves of four tower units per shelf (16 nodes) might occupy a strip two feet wide by three feet long by close to eight feet high. Adding access, a fairly minimal space for such a cluster would be twenty square feet in a room with at least eight foot ceilings. Assuming a pessimistic weight per node, including the weight fraction of the shelving that supports it, of 30 pounds (14 kg), the loaded shelf could weigh 500 pounds.

Rackmount clusters are often installed in 43U racks. These racks are 19" wide and just over six feet (or under two meters) tall, where one "U" is 1.75 inches. Depending on configuration, rackmount nodes can still weigh 22 pounds (10 kg) per U and are often roughly 30" deep. Including access, a fully loaded rack requires a minimum of 13 square feet (with at least seven foot ceilings) and can weigh 1000 pounds or even more if an uninterruptible power supply (with its heavy batteries) is included.

Finally there is a "blade" configuration, which we do not discuss here other than to say that it permits still higher densities of nodes, at a considerably higher cost. This might be right for someone needing many nodes in a very restricted physical space.

Clearly space becomes a major factor in large cluster design. One needs to carefully consider even things like floor strength, as one stacks up half a ton per square meter; few things can ruin your day like a 43U rack filled with expensive equipment falling through a ceiling, even if it doesn't hit anybody. Humidity is another bad thing -- electrical circuits don't like getting wet. The next major things to consider, however, are power and air conditioning.
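As a quick sanity check on space and floor loading, here is a similarly minimal sketch using the rough per-rack figures quoted above. The constants are assumptions; replace them with the specifications of your actual racks and nodes.

#!/bin/sh
# Rough space and weight estimate for a rack-mounted cluster.
# The constants are the approximate figures from the text, not vendor specs.
NODES=100
NODES_PER_RACK=40        # 1U nodes in a 43U rack, leaving room for switches
SQFT_PER_RACK=13         # rack footprint plus front and back access
LBS_PER_RACK=1000        # fully loaded, excluding UPS batteries

RACKS=$(( (NODES + NODES_PER_RACK - 1) / NODES_PER_RACK ))   # round up
echo "Racks needed: $RACKS"
echo "Floor space:  $(( RACKS * SQFT_PER_RACK )) square feet (minimum)"
echo "Floor load:   up to $(( RACKS * LBS_PER_RACK )) pounds"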
SUBHEAD: Power and Cooling

I am going to assume that everyone reading this knows what a "watt" is, if only from their experience with light bulbs and EZ-bake ovens. People who are totally clueless about wiring might read the Electrical Wiring FAQ, or any good introductory physics text, as well as discussions in the beowulf list archives and other websites in the references.

To begin with, you have to provide a place to plug each node into in any cluster room -- enough circuits, and good physical contiguity of the circuit receptacles to the points where the nodes are physically located. Typically electricity is provided in the form of "power poles" next to rack locations, or receptacles built into the wall, ceiling, or floor (for drop ceiling or raised floor facilities).

Nodes can draw a wide range of electrical power. A "reasonable" estimate is 100-200W per CPU, but this is very crude. Power requirements vary with CPU clock, memory, disk, network, and other peripherals (and with time, as systems evolve), so one needs to carefully consider one's actual node configuration, under load. Blade computers, or older (slower clock) systems, stripped, might be less. Measure the actual power draw of a prototype of your expected nodes under a variety of loads if possible. Remember that nodes you buy five years from now for the space may require even more power.

Here are a couple of suggestions, based on personal and painful experience, regarding your electrical wiring requirements for a compute cluster location. One is to overwire. A 20 amp, 120 VAC circuit in principle can deliver about 1700 W rms (average) power without blowing. One would thus naively expect to be able to run as many as 16 100W nodes on a single circuit, but in practice you might find circuit breakers tripping at 10 nodes as systems draw in excess of their average rate while booting, for example. Reserving 50% of the capacity of each circuit in your estimates wouldn't be excessive. The cost of excess capacity, amortized over ten years, is trivial compared to the cost of inadequate capacity and the resulting headaches and loss of productivity.

A second suggestion is to learn about the kind of line distortion that occurs when large numbers of switching power supplies (the kind found in most computers) are on a single line, especially on a shared run from the receptacle to the power bus and neutral line ground. Note well -- sharing of neutrals is a shockingly bad idea in a computer cluster room. There should be a separate run from a dedicated power panel to the receptacles. All wiring should be done by experienced, licensed professionals so that it meets or exceeds the requirements of the National Electrical Code. I strongly suggest that anyone considering electrical infrastructure (renovation or new construction) for a cluster begin by reading the Harmonics Q&A FAQ and consider getting a harmonic mitigating transformer for the space. This FAQ provides a marvelous education in just how putting many switching power supplies on a single line can distort line voltage, generate spurious and injurious system noise, reduce a system's natural capacity to withstand surges, and more. Do not assume that your building's existing wiring (even where adequate in terms of nominal capacity) will be adequate to run a cluster, unless you wish to be tormented by power-related hardware problems.

Finally, consider uninterruptible power. Although the marginal benefit of keeping the nodes up through short power outages may or may not be significant in your power grid, a good UPS conditions power far better than most surge protectors. A single UPS for your whole facility is likely to be cheaper and more manageable than individual UPS units for all the nodes.

All the power that goes into a room through all of those electrical cords has to be removed from the room, generally with one or more air conditioning (AC) units. A single loaded shelf (16 cases) can draw anywhere from 1.6 KW to close to 5 KW (in a loaded dual configuration). A single loaded 43U rack might draw from 4 KW to well over 10 KW in its meter-square floor space. Totalled up, this (plus a margin for switches and other cluster equipment, plus heat produced by human bodies and electrical lights and the AC units themselves) is the heat that must be removed from the room in question. AC is typically purchased or installed in units of "tons" (the heat required to melt a ton of ice at 0 degrees C in 24 hours). This works out to be about 3500 watts, or 3 tons of AC per 10 KW of load in the space. Again, it is better to have surplus capacity than inadequate capacity, because one really wishes to keep the room at temperatures below 20 degrees C (perhaps around 60 degrees F). Every 10 degrees F above 70 F reduces the expected life of a system by roughly a year, and consequently increases the amount of time spent dealing with hardware failure.
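Putting the wiring and cooling rules of thumb together, here is another minimal sketch. The per-node draw, usable circuit wattage, 50% reserve, and 3500 W per "ton" are the rough figures quoted above, not engineering data; a licensed electrician and a mechanical engineer get the final word.

#!/bin/sh
# Circuits and AC tonnage for a hypothetical cluster, using the rough
# rules of thumb from the text.
NODES=100
WATTS_PER_NODE=150       # assumed average draw per node under load
CIRCUIT_WATTS=1700       # usable watts on a 20 A, 120 VAC circuit
WATTS_PER_TON=3500       # approximate capacity of one "ton" of AC

TOTAL=$(( NODES * WATTS_PER_NODE ))
# Reserve 50% of each circuit, i.e. size the wiring for twice the expected load.
CIRCUITS=$(( (TOTAL * 2 + CIRCUIT_WATTS - 1) / CIRCUIT_WATTS ))
TONS=$(( (TOTAL + WATTS_PER_TON - 1) / WATTS_PER_TON ))
echo "Total load:    $TOTAL W"
echo "20 A circuits: $CIRCUITS (at 50% reserve)"
echo "AC capacity:   at least $TONS tons, plus a margin for lights, people, and the AC itself"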
Any AC/Power system installed should also have a "thermal kill switch" (or other automated, thermally enabled shutdown mechanism) that shuts down all room power if the AC fails but power doesn't and ambient temperatures exceed (say) 32 C/90 F. Professional care must be taken to distribute cooled air so that it can actually be taken up by the intake vents of the systems, and to collect their heated exhaust air and return it to the chiller. The system should be capable of being balanced against load distribution in the room, increasing airflow where it is most needed. In operation the room should have no particularly hot or cold spots, although it will always be warmer "behind" a rack (where the hot air is being exhausted) than in front. Many possibilities exist for distribution -- up through a raised floor, down from the ceiling (be careful about condensation drips!), from a single heat exchanger run from a remote chilled water supply, or multiple units installed locally.

SUBHEAD: Networking

The final aspect of physical infrastructure to consider is network access. I do not refer to the network backbone of the cluster itself, which is likely to be local to the cluster and simply a matter of routing wires to switches within the room (although this may well require wiring trays or conduits to keep the wiring neat and maintainable). Some clusters are intended to be operated "locally" -- from a head node or other access point physically contiguous to the cluster, with no access to a WAN or outside LAN. This is fine, provided that one allows for the fact that a loaded cluster room sounds like a 747 taking off and is typically cold enough to require a jacket or sweater to work in. Most clusters, however, integrate with a building LAN so that users can access the cluster from their offices. Many clusters even permit access from a campus WAN, or across the Internet (secured with e.g. the secure shell, ssh). In either case, one must ensure that the physical cluster space contains fiber or copper connections to the appropriate backbone.

SUBHEAD: Physical Infrastructure Costs

Space, power, AC and network access all cost money to provide. They cost money in two ways -- a capital investment in building or renovating a space so that it is suitable for your cluster, and recurring costs for using the space. The capital cost is highly variable (obviously) but can easily be in the tens to hundreds of thousands of dollars, depending on the capacity desired, the availability and cost of power, AC and network connections, and more. This cost must be viewed as being amortized over the lifetime of the space. For example, a $30,000 renovation for a space to hold 100 nodes over ten years adds a cost of $30 per node per year. Adding "rent" and "interest" might push this to $50 per node per year -- not much, viewed this way, but the $30,000 must be provided "up front" before building the cluster at all.

Recurring costs can be estimated as follows. A simple calculation shows that 1 W of power used 24 hours a day for a year in a grid where power retails for $0.08 per KW-hour costs about $0.70. The cost of the AC needed to remove that watt can be estimated at around $0.30. We will thus use $1 per watt per year to estimate our recurring cost for power and AC, together. Note that this might be high or low by as much as 50% depending on actual costs in your area.
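The same estimate takes only a few lines of shell; the electric rate and the AC overhead (expressed here as a fraction of the direct power cost, consistent with the $0.30 figure above) are the assumptions to adjust for your own site.

#!/bin/sh
# Recurring power-and-cooling cost estimate. The rate and AC fraction
# are assumptions; substitute your local figures.
NODES=100
WATTS_PER_NODE=150       # assumed average draw per node
RATE=0.08                # dollars per kWh
AC_FRACTION=0.3          # AC cost as a fraction of the direct power cost

awk -v n=$NODES -v w=$WATTS_PER_NODE -v r=$RATE -v ac=$AC_FRACTION 'BEGIN {
    kwh_per_watt_year = 24 * 365 / 1000                  # 8.76 kWh per watt-year
    dollars_per_watt  = kwh_per_watt_year * r * (1 + ac) # roughly $1 per watt-year
    printf "Cost per watt-year: $%.2f\n", dollars_per_watt
    printf "Cluster (%d nodes): $%.0f per year\n", n, n * w * dollars_per_watt
}'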
We thus must expect to spend in the ballpark of $100 to $200 per node per year (a figure that can also absorb the $50 estimate for amortized renovation costs) just to run the cluster. A 100 node cluster will generally have over $10,000 per year in recurring costs just to keep it turned on!

As we can see, a compute cluster large enough to be considered a "supercomputer" may cost far less per CPU to purchase, but it requires very much the same physical infrastructure as a comparable "big iron" supercomputer -- a suitable space, plentiful electrical power, and cooling capacity sufficient to remove all of that power as it is turned into heat by the cluster's operation. One needs to factor the cost of all of this (both fixed and recurring) into the "total cost of ownership" (TCO) budgeting for the cluster. This raises the cost of the cluster over the naive estimate that included only the cost of the hardware, but it is still quite low (and far lower than the cost of big iron).

However, we still have to consider the second aspect of cluster computing: management and operational infrastructure. How difficult (and hence "expensive") is it to install cluster nodes, to manage cluster nodes, to monitor cluster nodes as they do their work? In the next few sections we examine these important aspects of cluster infrastructure.

SUBHEAD: Cluster Installation

Little specialized skill is required to physically install a compute cluster, once the physical infrastructure (space with racks or shelves, power, and cooling) is prepared to receive the nodes. Almost anybody not actively incompetent in computing can remove tower units or rackmount systems from their boxes and shelve them or rack them as the case may be. Cabling them up neatly (both network cabling and power cabling) is easily done with a pack of cable ties or specialized rack cable supports. The network switches required will generally be rack or shelf mounted as appropriate, and operating them is often just a matter of plugging in the cables.

The only "tricky" part is installing a suitable image of linux on all of the cluster nodes. However, these days linux is quite possibly the easiest operating system of all in terms of installation. It is obviously beyond the scope of this or any simple article to teach a total novice all that they need to know about system administration: how to set up a web or NFS server, how to configure a network, how to add accounts. If I could reduce all of that to a few thousand words of prose, I'd be in the wrong business as a physicist! Instead I'm going to assume that you, dear reader, are at least moderately competent in all of these things, and direct you to a list of resources (not the least of which is the linux section of your local bookstore) if you are just getting started with Unix in general, linux in particular, systems administration, and clusters all at the same time.

Even within "linux" there are many choices to be made. There are many general purpose linux distributions, each with advantages and disadvantages. There are also specialized linux distributions, including one from Scyld (a company founded by many of the original NASA Goddard beowulf group) that is designed specifically for building true beowulf compute clusters. Finally, there are many vendors (some of whom are linked to the Brahma site) who would be happy to provide you with a ready-to-operate "turnkey" cluster. Amazingly, a turnkey linux cluster can retail for as little as the OTC hardware cost plus a 10-20% "integration charge", which (as we will see below) is quite reasonable.
In this section, we will outline at least one way to install cluster nodes using the Red Hat (RH) linux distribution. We will see that linux installation of cluster nodes (and workstations) is rapidly tending toward the ideal limit of scaling efficiency. That is, it is possible at this point to install a suitably equipped cluster node by simply connecting it to the network and turning it on. From that point on, all of the software and operational maintenance of the node is fully automated, with no recurring costs that scale on a per-node basis for the lifetime of the node, except for those associated with (unavoidable) hardware failure and monitoring.

To install RH-based linux cluster nodes (or, for that matter, LAN workstations) on a maximally scalable basis one proceeds as follows:

1) Set up an (installation) webserver, ideally for and accessible by your entire institution (not just the LAN where the cluster might be located). This server should have the highest bandwidth you can manage to all systems it will serve, as it will need to pump on the order of a GB of data through the net on a typical install. We will refer to it below simply as the install server.

2) Place a mirror or copy of the Red Hat distribution of your choice there. We tend to trail the current distribution by one (and are still at 7.3) out of a mix of conservatism and because we have a lot of local packages to build and test before making the current one "official". We also have to wait until certain "convenient times" in the academic cycle before upgrading everything. This sort of thing will vary by organization. Let us imagine that the distribution lives under a fairly typical path on the install server.

3) Set up a DHCP and PXE server. General documentation can be found in the mini-HOWTOs referenced below and in e.g. /usr/share/doc/dhcp*. However, a peek at a working dhcpd.conf and pxe.conf is worth, as they say, a thousand finesses. Such a peek (and many other things besides) is provided on a special site I have set up (see the Resources sidebar), which you can view as a virtual extension of this article (so it doesn't end up ten thousand words long, at least not here). A skeletal example also appears in the first sketch following this list.

4) Create a suitable kickstart file. Also documented on the linux-mag link, and sketched below.

5) Set up yum, both on the installation webserver and by adding a yum package with a suitable /etc/yum.conf to the kickstart package list. Note that yum can support multiple archives, and of course it is documented and cross-referenced on the linux-mag link. The second sketch following this list shows what the client side might look like.

With this setup, it should be easy for you to install a node by just, well, racking it up, cabling it, and turning it on. Reinstalling and upgrading it is even simpler (as one doesn't have to be within a hundred miles of the node). Keeping the packages on all the nodes up to date is fully automated, provided only that one keeps the archives themselves up to date.
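For concreteness, here is a minimal, hypothetical sketch of the server-side pieces from steps 3 and 4. Every address, hostname, and path below is a placeholder invented for illustration (the real, working examples live on the site mentioned above), and both fragments are abbreviated rather than complete configurations.

#!/bin/sh
# Hypothetical sketch only: all addresses and paths are placeholders.
# Append a PXE boot stanza to the ISC dhcpd configuration (step 3).
cat >> /etc/dhcpd.conf <<'EOF'
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.199;   # addresses handed out to nodes
    option routers 192.168.1.1;
    next-server 192.168.1.2;             # TFTP/PXE boot server
    filename "pxelinux.0";               # network boot loader
}
EOF

# A skeletal kickstart file served by the install webserver (step 4).
mkdir -p /var/www/html/ks
cat > /var/www/html/ks/node-ks.cfg <<'EOF'
install
url --url http://192.168.1.2/redhat/7.3
lang en_US
keyboard us
rootpw --iscrypted <hash-goes-here>
timezone US/Eastern
bootloader --location=mbr
clearpart --all --initlabel
part /boot --size 64
part swap --size 512
part / --size 1 --grow
network --bootproto dhcp
skipx
reboot
%packages
@ Base
yum
%post
# local post-install customization goes here
EOF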
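A similarly hypothetical sketch of the client side of step 5 follows: a minimal /etc/yum.conf pointing at a placeholder repository URL, plus the sort of daily cron job described in the next section. The old single-file yum.conf format shown here is the one used by early yum releases of that era; adjust for whatever version you actually run.

#!/bin/sh
# Hypothetical sketch only: the repository URL is a placeholder.
# Point yum at the institutional archive (old-style, single-file yum.conf).
cat > /etc/yum.conf <<'EOF'
[main]
cachedir=/var/cache/yum
logfile=/var/log/yum.log

[base]
name=Red Hat Linux base plus local packages
baseurl=http://192.168.1.2/redhat/7.3/
EOF

# Nightly unattended update, as described under "Maintaining a Cluster".
cat > /etc/cron.daily/yum-update <<'EOF'
#!/bin/sh
/usr/bin/yum -y update > /dev/null 2>&1
EOF
chmod 755 /etc/cron.daily/yum-update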
SUBHEAD: Maintaining a Cluster

Cluster software maintenance can be almost completely automated with yum. Once /etc/yum.conf is installed so that it points at your archive directory (which has the yum headers for the archive extracted and stored in its ./headers subdirectory), running "yum update" as root will update every package on your cluster to the current/latest revisions provided in the archive. In our environment, the yum rpm itself installs both a suitable /etc/yum.conf and a daily cron task to run the update command. This means that every system installed from our primary archive will automatically update every night to the latest revisions in the repository.

If a security, performance, or bugfix update of any package is released, we simply rebuild or install the updated, patched package with a later release number and insert it into the archive (in e.g. an other-pkgs/updates subdirectory), and by the next day it will be installed on every campus machine with no further action being taken. Even linux workstations installed by students in the dorms with little or no linux experience thus remain reasonably secure, as well as functionally current.

For cluster nodes, this means that they require basically no hands-on software management. Any package whose setup can be encapsulated as an RPM plus (perhaps) a %post script can be pushed to every node overnight by just dropping it into a directory and forgetting it. If one is in a hurry, there are a variety of ways to distribute a root yum update command to all the cluster nodes via e.g. ssh, with a single command run on any node, server, or workstation that can access all the cluster nodes.

It's hard to get much simpler or more efficient than that. Node software maintenance has no costs that scale on a per-node basis, and absolutely minimal costs on the institutional or LAN basis. Perfect scaling -- the dream of systems managers everywhere.

SUBHEAD: Monitoring a Cluster

SUBHEAD: Management Infrastructure Costs

SUBHEAD: Conclusion

[ BEGIN Sidebar One - "Resources" ]

Electrical Wiring FAQ: I

Scyld: I

Brahma: I

DHCP mini HOWTO: I

PXE mini HOWTO: I

yum Website: I

All example configuration files referenced in the article, together with further explanatory text, can be found on the brahma website at: I

[ END Sidebar One ]