
The (De)Centralized Management Group

This section articulates the primary duties and responsibilities of the core group that will support cluster operations on campus. As noted in the sections above, this group will be responsible for:

In parentheses after each chore is a list of names giving the expected initial assignment of responsibilities. Most of these "assignments" simply recognize that the work is already being done, for the most part, by these individuals, or would clearly be expected to devolve to one of them once this model is approved and adopted.

The first entry, coordination, is perhaps the most important task on the list. The University already has in place a remarkable amount of the infrastructure required to support cluster computing on nearly any desired scale. What it primarily lacks is a clearinghouse: a central entity that can connect this infrastructure and support mechanism with those who might wish to participate. This is also very much a decentralized operational model. Individuals from the University (Bill Rankin), OIT (the acpub group), Arts and Sciences (Seth Vidal), and even the faculty (Robert Brown) all play key roles in establishing the initial "centrally managed cluster". Support on a broader basis comes from the entire University cluster community. Initially, at least, very little additional staff should be required to get the project off the ground, but coordinating the staff and contributions we already have (from all over campus) will definitely require someone working very hard to make it so.

However, there is one very important caveat to this very optimistic staff layout. Some individuals who play key roles in this project (notably Seth Vidal) are already immensely valuable to the University and essential to the success of the model. Because of their value they are already heavily overburdened. Great care must be taken to prioritize their task assignments and to provide additional support to their local management groups, so as to protect in some measure their time and sanity. Seth can and does manage install.dulug.duke.edu. He can very likely help a great deal by working with Bill to come up with, for example, a universally accessible, nearly fully automated cluster node kickstart file and floppy or PXE images that automatically access it. However, Seth also has primary responsibilities managing the physics LAN and secondary responsibilities helping systems managers in many other University departments. He simply cannot be spread indefinitely thin by adding extensive cluster training and support responsibilities beyond what he already does voluntarily and as time permits (e.g. on the dulug mailing list), at least not without further augmenting the physics system staff to free some of his time.

The same is true, to a greater or lesser extent, for all the decentralized participants in this organization. Robert Brown already provides ongoing cluster computing support to groups all over the world via the beowulf list, and there is no reason to suppose that this support would not extend to anyone at Duke who needs it, time permitting (given his other responsibilities of teaching, doing research, and raising children). Jeff Chase is similarly engaged in teaching and research. Sean O'Connell (like the various other systems managers with cluster experience who would almost certainly participate in the decentralized staff) has LAN responsibilities that are primary, but would otherwise cheerfully contribute time and energy via a list or cluster support group.

This point has been made repeatedly above and is worth repeating yet again. The "decentralized" model for centralized support proposed above is entirely consistent with the philosophy and reality of cluster management as it has evolved on campus. It provides greatly improved and somewhat centralized support for (almost) "free": out of opportunity cost time and zero margin scaling of FTE effort already being paid for around campus, plus the cost of a coordinating central group that can "fill in the cracks" created by the decentralized approach. However, wherever providing that support pushes key individuals to FTE boundaries, the University and its participating schools and entities have to be prepared to increase staff or redirect responsibilities to make it work. The model is scalable and largely self-supporting, but it is not free. The way this is funded and accommodated may require some genuine cooperation and honest accounting for value between many disparate branches of the University; it is the view of this plan that this is all to the good.

It is also the editorial opinion of the author of this plan that this is actually possible at this point in Duke's IT history. Ten years ago this model would have laughably and expensively failed as someone or other attempted to build a centralized "cluster computing empire" and have it funded completely out of proportion to the services delivered. Today the situation is entirely different, with genuine cooperation and coordination between all the various levels of IT across campus and a sense of collegiality and commonality of purpose that transcends funding models and empires. This model will likely test the University's ability to implement things cooperatively and efficiently across many administrative boundaries, but given the success of netcom, of acpub, and of other groups that operate in just that manner, there is no reason to doubt the possibility of success here.

Regardless, it is very likely that implementation of this plan will require adding one or two more FTEs (probably under the direction of Bill Rankin and/or Rob Carter) over the course of its first year, at least if it is at all successful in "selling" public cluster nodes. If it is not so successful, the plan in any event minimizes startup investment in the first year and can easily enough be modified to reflect experience at the end of that year.


Robert G. Brown 2003-06-02