In the preceding section we saw how our maintenance requirements can vary considerably depending on how carefully we automated our installation and how scalable (identical) our nodes are. If one sets the nodes up using either of the last two methods, they basically should not need to be backed up at all (footnote 11.21). If one builds the nodes by hand, one probably will need to back them up.
The server node, of course, should be carefully backed up and indeed should likely have backup storage attached to match the disk it serves. Any decent book on linux or unix systems administration will tell you how to go about backing up disk space and I won't say any more about it. Once it is set up and automated, though, all you ever do is change an occasional tape.
As is the case with any network of computers, you will need to take care to monitor both system messages (see suggestions below for centralizing this) and things like system load, CPU temperature, available memory and so forth. There are several very nice tools for the latter discussed in the Appendix on software (with URLs to download sites). If your beowulf or cluster is being used by lots of groups, you may need to referee demands for resources. Finally, you will likely have to teach some of these groups how to write parallel code for beowulfs. Giving them this book (and a handful of others) is a good first step, but of course some groups will need more help than that.
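To make the monitoring idea concrete, here is a minimal sketch of the kind of per-node check one might run from cron. It is not one of the tools from the Appendix, just an illustration. It assumes a linux node that exposes /proc/loadavg and /proc/meminfo; the function names and the thresholds (max_load_per_cpu, min_free_kb) are made up for this example and should be tuned to your own nodes. CPU temperature is left out because the sensor interface varies from board to board.

```python
from pathlib import Path


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integers (kB where a unit is given)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def node_warnings(loadavg_text, meminfo_text, ncpus=1,
                  max_load_per_cpu=1.5, min_free_kb=65536):
    """Return a list of warning strings; thresholds are illustrative assumptions."""
    warnings = []
    load1, _, _ = parse_loadavg(loadavg_text)
    if load1 > max_load_per_cpu * ncpus:
        warnings.append(f"high load: {load1:.2f}")
    mem = parse_meminfo(meminfo_text)
    # Older kernels do not report MemAvailable, so fall back to MemFree.
    free_kb = mem.get("MemAvailable", mem.get("MemFree", 0))
    if free_kb < min_free_kb:
        warnings.append(f"low memory: {free_kb} kB available")
    return warnings


if __name__ == "__main__":
    loadavg, meminfo = Path("/proc/loadavg"), Path("/proc/meminfo")
    if loadavg.exists() and meminfo.exists():
        for message in node_warnings(loadavg.read_text(), meminfo.read_text()):
            print(message)
```

Run on every node and mailed or logged to the head node, something like this gives a crude early warning. The dedicated tools do the same job far better, so treat it only as a picture of what is being watched.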
That's really about it. The nodes are "instantly" replaceable or re-installable, and as long as you protect and manage the head node as you would any server or workstation, the whole thing should just tick right along. Occasionally things will break in ways you don't understand at all: running a particular application will crash the system, or your network will fall apart.
That's what the beowulf list is for. So my parting advice on maintenance is to join this list at www.beowulf.org if you haven't already. I'll be there, and so will a lot of people who know far more than I about beowulfery and linux in general. Collectively, we know more than anybody! Chances are pretty good that we can help, and if we can't, at least you'll know that the problem you are facing is real.
As you can see, beowulfs take a fair amount of work to plan. They take less work (once you've learned what you're doing) to build. Properly built, they take almost no work at all to maintain. Fix or replace the hardware when it breaks, upgrade the operating systems and tools periodically, maintain your server node and that's it. I don't spend an hour a week on maintaining my home beowulf, and on average spend little more on the one at Duke (mostly on hardware maintenance, actually - Grrr).