HEAD: Beowulf Infrastructure

DECK: How to care for and feed your beowulf-style supercomputer cluster

TOC_LINE: Beowulf nodes are cheap and easy to cost out, but estimating the true costs of a place for them to live, power, cooling, and a scalable administrative infrastructure can be a headache. This article shows you everything you need to know before writing a proposal or a budget for your first beowulf.

AUTHOR: Robert G. Brown

SUBHEAD: Introduction

Beowulf-style supercomputers built out of over-the-counter (OTC) hardware are by far the cheapest way to buy floating point cycles, especially at the hardware level. Small "hobby scale" beowulfs (less than perhaps 8 nodes, where for the purposes of this article a "node" will refer to a single case, which might house one or more CPUs) can pretty much be built "anywhere", and with an appropriate software installation and integration scheme they can be managed by anyone with a decent knowledge of linux (or really any other unix) and some systems administration skills. I have a compute cluster of this scale in my home, for example.

As the number of nodes increases, however, careful attention must be paid to physical infrastructure. Cluster nodes consume electricity. They must be kept cool. They weigh a certain amount, have a footprint on the floor, and take up volume. They can be racked or stacked in a variety of ways. They require network wiring. One must be able to physically access the fronts and/or backs to cable them up or power cycle them.

In a similar vein, if one is installing only a handful of nodes, taking an hour, or even several hours, to install each node is still only a day or two of time. It may be fairly easy and relatively inexpensive to put a monitor and keyboard on each node, or use a cheap keyboard, monitor and mouse switch to get to each node to do an install or upgrade. Spending minutes per node per day administering the nodes again may add up to only a day or two of time over a year. This leaves most of the year for cluster production.

When one plans to install a hundred nodes, this kind of seat-of-the-pants approach simply does not work. A hundred nodes might require 10 kilowatts of electrical power, several "tons" of air conditioning (a term that doesn't refer to the weight of the air conditioner but rather to its capacity), could weigh several tons, might need hundreds of square feet of floor space, and could easily cost $10,000 a year in recurring costs for power and cooling, over and above the cost of the power lines and air conditioning units it needs in the first place.

Management costs face a similar crisis. Spending two hours installing each node adds up to 200 hours (or five full work weeks) to install the cluster. Spending two minutes per node per day administering the nodes adds up to over three hours a day! One can easily get to the point where one has no time to use a cluster, or where managing even what is really a fairly small cluster as professional clusters go has become a more than full time job.

As one can see, infrastructure costs can be considerable for larger clusters, and poor methodology (methodology that does not scale well with the number of nodes) can lead to disaster. This article describes some things you should know before you run out and buy a few hundred nodes to put in your metaphorical garage, and suggests at least a few ways to honestly estimate the fixed and recurring requirements and costs for running a relatively large cluster.
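To make the scaling problem concrete, here is a minimal back-of-the-envelope sketch in shell. The node count, per-node wattage, and per-node time figures are just the illustrative assumptions used above, not measurements; substitute your own.

#!/bin/sh
# Back-of-the-envelope scaling estimate for a hypothetical 100-node cluster.
# All of the inputs below are assumptions for illustration only.
NODES=100
WATTS_PER_NODE=100          # rough per-node draw; see "Power and Cooling"
INSTALL_HOURS_PER_NODE=2    # hands-on installation time per node
ADMIN_MIN_PER_NODE_DAY=2    # daily hands-on administration per node

ADMIN_MIN=$(( NODES * ADMIN_MIN_PER_NODE_DAY ))
echo "Total power draw:  $(( NODES * WATTS_PER_NODE )) W"
echo "Installation time: $(( NODES * INSTALL_HOURS_PER_NODE )) hours"
echo "Daily admin time:  $(( ADMIN_MIN / 60 )) hours, $(( ADMIN_MIN % 60 )) minutes"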
SUBHEAD: Cluster Space

As noted above, cluster nodes have a variety of physical dimensions. They have a "footprint" (the area of their base), a height (and hence a volume), and a weight. Their shelving or rackmount may increase their footprint. Access to the front and back of the nodes must typically be preserved to allow nodes to be moved in and out of the cluster, to allow cool air in and warm air out, and to provide access to network and power cabling. This access space will usually require 2-3x the footprint of the node itself in even the most efficient cluster room layout.

Nodes are often stacked up vertically, tower units on heavy duty steel shelving or rackmount units in two post or four post racks. In very rough terms, four shelves of four tower units per shelf (16 nodes) might occupy a strip two feet wide by three feet long by close to eight feet high. Adding access, a fairly minimal space for such a cluster would be twenty square feet in a room with at least eight foot ceilings. Assuming a pessimistic weight per node, including the weight fraction of the shelving that supports it, of 30 pounds (14 kg), the loaded shelf could weigh 500 pounds.

Rackmount clusters are often installed in 43U racks. These racks are 19" wide and just over six feet (or under two meters) tall, where one "U" is 1.75 inches. Depending on configuration, rackmount nodes can still weigh 22 pounds (10 kg) per U and are often roughly 30" deep. Including access, a fully loaded rack requires a minimum of 13 square feet (with at least seven foot ceilings) and can weigh 1000 pounds or even more if an uninterruptible power supply (with its heavy batteries) is included.

Finally there is a "blade" configuration, which we do not discuss here other than to say that it permits still higher densities of nodes, at a considerably higher cost. This might be right for someone needing many nodes in a very restricted physical space.

Clearly space becomes a major factor in large cluster design. One needs to carefully consider even things like floor strength, as one stacks up half a ton per square meter; few things can ruin your day like a 43U rack filled with expensive equipment falling through a ceiling, even if it doesn't hit anybody. Humidity is another bad thing -- electrical circuits don't like getting wet. The next major things to consider, however, are power and air conditioning.
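As a quick sanity check on space and floor loading, here is a similarly minimal sketch using the rough per-rack figures quoted above. The constants are assumptions; replace them with the specifications of your actual racks and nodes.

#!/bin/sh
# Rough space and weight estimate for a rack-mounted cluster.
# The constants are the approximate figures from the text, not vendor specs.
NODES=100
NODES_PER_RACK=40        # 1U nodes in a 43U rack, leaving room for switches
SQFT_PER_RACK=13         # rack footprint plus front and back access
LBS_PER_RACK=1000        # fully loaded, excluding UPS batteries

RACKS=$(( (NODES + NODES_PER_RACK - 1) / NODES_PER_RACK ))   # round up
echo "Racks needed: $RACKS"
echo "Floor space:  $(( RACKS * SQFT_PER_RACK )) square feet (minimum)"
echo "Floor load:   up to $(( RACKS * LBS_PER_RACK )) pounds"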
SUBHEAD: Power and Cooling

I am going to assume that everyone reading this knows what a "watt" is, if only from their experience with light bulbs and EZ-bake ovens. People who are totally clueless about wiring might read the Electrical Wiring FAQ, or any good introductory physics text, as well as discussions in the beowulf list archives and other websites in the references.

To begin with, you have to provide a place to plug each node into in any cluster room -- enough circuits, and good physical contiguity of the circuit receptacles to the points where the nodes are physically located. Typically electricity is provided in the form of "power poles" next to rack locations, or receptacles built into the wall, ceiling, or floor (for drop ceiling or raised floor facilities).

Nodes can draw a wide range of electrical power. A "reasonable" estimate is 100-200W per CPU, but this is very crude. Power requirements vary with CPU clock, memory, disk, network, and other peripherals (and with time, as systems evolve), so one needs to carefully consider one's actual node configuration, under load. Blade computers, or older (slower clock) systems, stripped, might be less. Measure the actual power draw of a prototype of your expected nodes under a variety of loads if possible. Remember that nodes you buy five years from now for the space may require even more power.

Here are a couple of suggestions, based on personal and painful experience, regarding your electrical wiring requirements for a compute cluster location. One is to overwire. A 20 amp, 120 VAC circuit in principle can deliver about 1700 W rms (average) power without blowing. One would thus naively expect to be able to run as many as 16 100W nodes on a single circuit, but in practice you might find circuit breakers tripping at 10 nodes as systems draw in excess of their average rate while booting, for example. Reserving 50% of the capacity of each circuit in your estimates wouldn't be excessive. The cost of excess capacity, amortized over ten years, is trivial compared to the cost of inadequate capacity and the resulting headaches and loss of productivity.

A second suggestion is to learn about the kind of line distortion that occurs when large numbers of switching power supplies (the kind found in most computers) are on a single line, especially on a shared run from the receptacle to the power bus and neutral line ground. Note well -- sharing of neutrals is a shockingly bad idea in a computer cluster room. There should be a separate run from a dedicated power panel to the receptacles. All wiring should be done by experienced, licensed professionals so that it meets or exceeds the requirements of the National Electrical Code. I strongly suggest that anyone considering electrical infrastructure (renovation or new construction) for a cluster begin by reading the Harmonics Q&A FAQ and consider getting a harmonic mitigating transformer for the space. This FAQ provides a marvelous education in just how putting many switching power supplies on a single line can distort line voltage, generate spurious and injurious system noise, reduce a system's natural capacity to withstand surges, and more. Do not assume that your building's existing wiring (even where adequate in terms of nominal capacity) will be adequate to run a cluster, unless you wish to be tormented by power-related hardware problems.

Finally, consider uninterruptible power. Although the marginal benefit of keeping the nodes up through short power outages may or may not be significant in your power grid, a good UPS conditions power far better than most surge protectors. A single UPS for your whole facility is likely to be cheaper and more manageable than individual UPS units for all the nodes.

All the power that goes into a room through all of those electrical cords has to be removed from the room, generally with one or more air conditioning (AC) units. A single loaded shelf (16 cases) can draw anywhere from 1.6 KW to close to 5 KW (in a loaded dual configuration). A single loaded 43U rack might draw from 4 KW to well over 10 KW in its meter-square floor space. Totalled up, this (plus a margin for switches and other cluster equipment, plus heat produced by human bodies and electrical lights and the AC units themselves) is the heat that must be removed from the room in question. AC is typically purchased or installed in units of "tons" (the heat required to melt a ton of ice at 0 degrees C in 24 hours). This works out to be about 3500 watts, or 3 tons of AC per 10 KW of load in the space. Again, it is better to have surplus capacity than inadequate capacity, because one really wishes to keep the room at temperatures below 20 degrees C (perhaps around 60 degrees F). Every 10 degrees F above 70 F reduces the expected life of a system by roughly a year, and consequently increases the amount of time spent dealing with hardware failure.
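Putting the wiring and cooling rules of thumb together, here is another minimal sketch. The per-node draw, usable circuit wattage, 50% reserve, and 3500 W per "ton" are the rough figures quoted above, not engineering data; a licensed electrician and a mechanical engineer get the final word.

#!/bin/sh
# Circuits and AC tonnage for a hypothetical cluster, using the rough
# rules of thumb from the text.
NODES=100
WATTS_PER_NODE=150       # assumed average draw per node under load
CIRCUIT_WATTS=1700       # usable watts on a 20 A, 120 VAC circuit
WATTS_PER_TON=3500       # approximate capacity of one "ton" of AC

TOTAL=$(( NODES * WATTS_PER_NODE ))
# Reserve 50% of each circuit, i.e. size the wiring for twice the expected load.
CIRCUITS=$(( (TOTAL * 2 + CIRCUIT_WATTS - 1) / CIRCUIT_WATTS ))
TONS=$(( (TOTAL + WATTS_PER_TON - 1) / WATTS_PER_TON ))
echo "Total load:    $TOTAL W"
echo "20 A circuits: $CIRCUITS (at 50% reserve)"
echo "AC capacity:   at least $TONS tons, plus a margin for lights, people, and the AC itself"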
Any AC/Power system installed should also have a "thermal kill switch" (or other automated, thermally enabled shutdown mechanism) that shuts down all room power if the AC fails but power doesn't and ambient temperatures exceed (say) 32 C/90 F. Professional care must be taken to distribute cooled air so that it can actually be taken up by the intake vents of the systems, and to collect their heated exhaust air and return it to the chiller. The system should be capable of being balanced against load distribution in the room, increasing airflow where it is most needed. In operation the room should have no particularly hot or cold spots, although it will always be warmer "behind" a rack (where the hot air is being exhausted) than in front. Many possibilities exist for distribution -- up through a raised floor, down from the ceiling (be careful about condensation drips!), from a single heat exchanger run from a remote chilled water supply, or multiple units installed locally.

SUBHEAD: Networking

The final aspect of physical infrastructure to consider is network access. I do not refer to the network backbone of the cluster itself, which is likely to be local to the cluster and simply a matter of routing wires to switches within the room (although this may well require wiring trays or conduits to keep the wiring neat and maintainable). Some clusters are intended to be operated "locally" -- from a head node or other access point physically contiguous to the cluster, with no access to a WAN or outside LAN. This is fine, provided that one allows for the fact that a loaded cluster room sounds like a 747 taking off and is typically cold enough to require a jacket or sweater to work in. Most clusters, however, integrate with a building LAN so that users can access the cluster from their offices. Many clusters even permit access from a campus WAN, or across the Internet (secured with e.g. the secure shell, ssh). In either case, one must ensure that the physical cluster space contains fiber or copper connections to the appropriate backbone.

SUBHEAD: Physical Infrastructure Costs

Space, power, AC and network access all cost money to provide. They cost money in two ways -- a capital investment in building or renovating a space so that it is suitable for your cluster, and recurring costs for using the space. The capital cost is highly variable (obviously) but can easily be in the tens to hundreds of thousands of dollars, depending on the capacity desired, the availability and cost of power, AC and network connections, and more. This cost must be viewed as being amortized over the lifetime of the space. For example, a $30,000 renovation for a space to hold 100 nodes over ten years adds a cost of $30 per node per year. Adding "rent" and "interest" might push this to $50 per node per year -- not much, viewed this way, but the $30,000 must be provided "up front" before building the cluster at all.

Recurring costs can be estimated as follows. A simple calculation shows that 1 W of power used 24 hours a day for a year in a grid where power retails for $0.08 per KW-hour costs about $0.70. The cost of the AC needed to remove that watt can be estimated at around $0.30. We will thus use $1 per watt per year to estimate our recurring cost for power and AC, together. Note that this might be high or low by as much as 50% depending on actual costs in your area.
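The same estimate takes only a few lines of shell; the electric rate and the AC overhead (expressed here as a fraction of the direct power cost, consistent with the $0.30 figure above) are the assumptions to adjust for your own site.

#!/bin/sh
# Recurring power-and-cooling cost estimate. The rate and AC fraction
# are assumptions; substitute your local figures.
NODES=100
WATTS_PER_NODE=150       # assumed average draw per node
RATE=0.08                # dollars per kWh
AC_FRACTION=0.3          # AC cost as a fraction of the direct power cost

awk -v n=$NODES -v w=$WATTS_PER_NODE -v r=$RATE -v ac=$AC_FRACTION 'BEGIN {
    kwh_per_watt_year = 24 * 365 / 1000                  # 8.76 kWh per watt-year
    dollars_per_watt  = kwh_per_watt_year * r * (1 + ac) # roughly $1 per watt-year
    printf "Cost per watt-year: $%.2f\n", dollars_per_watt
    printf "Cluster (%d nodes): $%.0f per year\n", n, n * w * dollars_per_watt
}'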
We thus must expect to spend in the ballpark of $100 to $200 per node per year (a figure that can also absorb the $50 estimate for amortized renovation costs) just to run the cluster. A 100 node cluster will generally have over $10,000 per year in recurring costs just to keep it turned on!

As we can see, a compute cluster large enough to be considered a "supercomputer" may cost far less per CPU to purchase, but it requires very much the same physical infrastructure as a comparable "big iron" supercomputer -- a suitable space, plentiful electrical power, and cooling capacity sufficient to remove all of that power as it is turned into heat by the cluster's operation. One needs to factor the cost of all of this (both fixed and recurring) into the "total cost of ownership" (TCO) budgeting for the cluster. This raises the cost of the cluster over the naive estimate that included only the cost of the hardware, but it is still quite low (and far lower than the cost of big iron).

However, we still have to consider the second aspect of cluster computing: management and operational infrastructure. How difficult (and hence "expensive") is it to install cluster nodes, to manage cluster nodes, to monitor cluster nodes as they do their work? In the next few sections we examine these important aspects of cluster infrastructure.

SUBHEAD: Cluster Installation

Little specialized skill is required to physically install a compute cluster, once the physical infrastructure (space with racks or shelves, power, and cooling) is prepared to receive the nodes. Almost anybody not actively incompetent in computing can remove tower units or rackmount systems from their boxes and shelve them or rack them as the case may be. Cabling them up neatly (both network cabling and power cabling) is easily done with a pack of cable ties or specialized rack cable supports. The network switches required will generally be rack or shelf mounted as appropriate, and operating them is often just a matter of plugging in the cables.

The only "tricky" part is installing a suitable image of linux on all of the cluster nodes. However, these days linux is quite possibly the easiest operating system of all in terms of installation. It is obviously beyond the scope of this or any simple article to teach a total novice all that they need to know about system administration: how to set up a web or NFS server, how to configure a network, how to add accounts. If I could reduce all of that to a few thousand words of prose, I'd be in the wrong business as a physicist! Instead I'm going to assume that you, dear reader, are at least moderately competent in all of these things, and direct you to a list of resources (not the least of which is the linux section of your local bookstore) if you are just getting started with Unix in general, linux in particular, systems administration, and clusters all at the same time.

Even within "linux" there are many choices to be made. There are many general purpose linux distributions, each with advantages and disadvantages. There are also specialized linux distributions, including one from Scyld (a company founded by many of the original NASA Goddard beowulf group) that is designed specifically for building true beowulf compute clusters. Finally, there are many vendors (some of whom are linked to the Brahma site) who would be happy to provide you with a ready-to-operate "turnkey" cluster. Amazingly, a turnkey linux cluster can retail for as little as the OTC hardware cost plus a 10-20% "integration charge", which (as we will see below) is quite reasonable.
In this section, we will outline at least one way to install cluster nodes using the Red Hat (RH) linux distribution. We will see that linux installation of cluster nodes (and workstations) is rapidly tending toward the ideal limit of scaling efficiency. That is, it is possible at this point to install a suitably equipped cluster node by simply connecting it to the network and turning it on. From that point on, all of the software and operational maintenance of the node is fully automated, with no recurring costs that scale on a per-node basis for the lifetime of the node, except for those associated with (unavoidable) hardware failure and monitoring.

To install RH-based linux cluster nodes (or, for that matter, LAN workstations) on a maximally scalable basis one proceeds as follows:

1) Set up an (installation) webserver, ideally for and accessible by your entire institution (not just the LAN where the cluster might be located). This server should have the highest bandwidth you can manage to all systems it will serve, as it will need to pump on the order of a GB of data through the net on a typical install. We will refer to it below simply as the install server.

2) Place a mirror or copy of the Red Hat distribution of your choice there. We tend to trail the current distribution by one (and are still at 7.3) out of a mix of conservatism and because we have a lot of local packages to build and test before making the current one "official". We also have to wait until certain "convenient times" in the academic cycle before upgrading everything. This sort of thing will vary by organization. Let us imagine that the distribution lives under a fairly typical path on the install server.

3) Set up a DHCP and PXE server. General documentation can be found in the mini-HOWTOs referenced below and in e.g. /usr/share/doc/dhcp*. However, a peek at a working dhcpd.conf and pxe.conf is worth, as they say, a thousand finesses. Such a peek (and many other things besides) is provided on a special site I have set up (see the Resources sidebar), which you can view as a virtual extension of this article (so it doesn't end up ten thousand words long, at least not here). A skeletal example also appears in the first sketch following this list.

4) Create a suitable kickstart file. Also documented on the linux-mag link, and sketched below.

5) Set up yum, both on the installation webserver and by adding a yum package with a suitable /etc/yum.conf to the kickstart package list. Note that yum can support multiple archives, and of course it is documented and cross-referenced on the linux-mag link. The second sketch following this list shows what the client side might look like.

With this setup, it should be easy for you to install a node by just, well, racking it up, cabling it, and turning it on. Reinstalling and upgrading it is even simpler (as one doesn't have to be within a hundred miles of the node). Keeping the packages on all the nodes up to date is fully automated, provided only that one keeps the archives themselves up to date.
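For concreteness, here is a minimal, hypothetical sketch of the server-side pieces from steps 3 and 4. Every address, hostname, and path below is a placeholder invented for illustration (the real, working examples live on the site mentioned above), and both fragments are abbreviated rather than complete configurations.

#!/bin/sh
# Hypothetical sketch only: all addresses and paths are placeholders.
# Append a PXE boot stanza to the ISC dhcpd configuration (step 3).
cat >> /etc/dhcpd.conf <<'EOF'
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.199;   # addresses handed out to nodes
    option routers 192.168.1.1;
    next-server 192.168.1.2;             # TFTP/PXE boot server
    filename "pxelinux.0";               # network boot loader
}
EOF

# A skeletal kickstart file served by the install webserver (step 4).
mkdir -p /var/www/html/ks
cat > /var/www/html/ks/node-ks.cfg <<'EOF'
install
url --url http://192.168.1.2/redhat/7.3
lang en_US
keyboard us
rootpw --iscrypted <hash-goes-here>
timezone US/Eastern
bootloader --location=mbr
clearpart --all --initlabel
part /boot --size 64
part swap --size 512
part / --size 1 --grow
network --bootproto dhcp
skipx
reboot
%packages
@ Base
yum
%post
# local post-install customization goes here
EOF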
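A similarly hypothetical sketch of the client side of step 5 follows: a minimal /etc/yum.conf pointing at a placeholder repository URL, plus the sort of daily cron job described in the next section. The old single-file yum.conf format shown here is the one used by early yum releases of that era; adjust for whatever version you actually run.

#!/bin/sh
# Hypothetical sketch only: the repository URL is a placeholder.
# Point yum at the institutional archive (old-style, single-file yum.conf).
cat > /etc/yum.conf <<'EOF'
[main]
cachedir=/var/cache/yum
logfile=/var/log/yum.log

[base]
name=Red Hat Linux base plus local packages
baseurl=http://192.168.1.2/redhat/7.3/
EOF

# Nightly unattended update, as described under "Maintaining a Cluster".
cat > /etc/cron.daily/yum-update <<'EOF'
#!/bin/sh
/usr/bin/yum -y update > /dev/null 2>&1
EOF
chmod 755 /etc/cron.daily/yum-update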
SUBHEAD: Maintaining a Cluster

Cluster software maintenance can be almost completely automated with yum. Once /etc/yum.conf is installed so that it points at your archive directory (which has the yum headers for the archive extracted and stored in its ./headers subdirectory), running "yum update" as root will update every package on your cluster to the current/latest revisions provided in the archive. In our environment, the yum rpm itself installs both a suitable /etc/yum.conf and a daily cron task to run the update command. This means that every system installed from our primary archive will automatically update every night to the latest revisions in the repository.

If a security, performance, or bugfix update of any package is released, we simply rebuild or install the updated, patched package with a later release number and insert it into the archive (in e.g. an other-pkgs/updates subdirectory), and by the next day it will be installed on every campus machine with no further action being taken. Even linux workstations installed by students in the dorms with little or no linux experience thus remain reasonably secure, as well as functionally current.

For cluster nodes, this means that they require basically no hands-on software management. Any package whose setup can be encapsulated as an RPM plus (perhaps) a %post script can be pushed to every node overnight by just dropping it into a directory and forgetting it. If one is in a hurry, there are a variety of ways to distribute a root yum update command to all the cluster nodes via e.g. ssh, with a single command run on any node, server, or workstation that can access all the cluster nodes.

It's hard to get much simpler or more efficient than that. Node software maintenance has no costs that scale on a per-node basis, and absolutely minimal costs on the institutional or LAN basis. Perfect scaling -- the dream of systems managers everywhere.

SUBHEAD: Monitoring a Cluster

SUBHEAD: Management Infrastructure Costs

SUBHEAD: Conclusion

[ BEGIN Sidebar One - "Resources" ]

Electrical Wiring FAQ: I

Scyld: I

Brahma: I

DHCP mini HOWTO: I

PXE mini HOWTO: I

yum Website: I

All example configuration files referenced in the article, together with further explanatory text, can be found on the brahma website at: I

[ END Sidebar One ]