The existing model for cluster computing at Duke is one of many locally centralized, generally autonomous, cluster computing operations. This model works, and it works for certain very good reasons. Well-designed clusters, located in facilities that provide adequate infrastructure such as physical space, power, cooling capacity, and networking, scale extremely well in their system management requirements. That is, barring hardware failure, a cluster node should require full-time equivalent (FTE) labor on the order of an hour a year or even less to install, update, and operate. In a department that already has a competent systems manager or systems management group, it is often possible to install and operate a cluster using opportunity cost labor provided by the local manager as just another aspect of managing the departmental LAN.
This is a particularly efficient solution, as the LAN manager already provides most of the core services required by the cluster (e.g. account management, disk and backup services, software installation and management services, and security) for the departmental groups utilizing the cluster resource. These services can be extended to the cluster nodes for essentially zero marginal cost, making the labor cost for installing and maintaining the nodes the only cost that scales with the size of the cluster, and this cost scales in a particularly predictable way.
This model is also efficient for a second reason. Since there are many clusters on campus, each engineered according to the needs of its local users and being perpetually built and rebuilt as new moneys become available, there is an evolutionary optimization that naturally occurs as new ideas are tried out, good ideas and bad ideas are discovered in small-scale experiments, and these ideas and experiences are shared across campus. This model works well in the rapidly changing world of computer and networking hardware, where "revolutionary" changes occur every year and are an accepted part of doing business.
This should be compared to the likely efficiency of a monolithic model where all cluster computer operations on campus were organized and managed by a single, centralized authority. Bad ideas would be costly on an institutional scale instead of a departmental or group scale; good ideas would have to diffuse into the institution from other institutions; change would necessarily proceed at a much slower rate. Worst of all, the cluster managers would likely become increasingly dissociated from their client base and increasingly narrow in their support of the wide range of user environments likely to be familiar to the cluster users. Accountability and flexibility would be lost.
These negative elements associated with monolithic models can all be observed now in those existing computer operations at Duke that are heavily centralized, especially in the realms of mainframe computing and in the generally homogeneous academic computing clusters. Those of us who have been associated in some way with computing on campus over decades recall well the days of the Triangle Universities Computation Center (TUCC) and its campus equivalent (DUCC), and the inefficiencies that actively drove the primary computer users on campus to abandon this model altogether in favor of organization at the departmental scale.
For all of these reasons, the model proposed herein for improved institutional support of cluster computing remains a model that is centralized locally, at the departmental level where that makes sense and in a number of distributed cluster sites where it does not make sense. It avoids the creation of any sort of monolithic centralized cluster facility that might become the Duke Supercomputing Center (DSC) to mirror the North Carolina Supercomputing Center (NCSC) as DUCC once mirrored TUCC. It relies on institutional organization and coordination enabled by technology to achieve the desired support at the institutional scale while retaining the flexibility and cost efficiency of the localized management model.
The primary features of the proposed model are thus:
There are some additional cost penalties, however. The cost of physically managing and installing the nodes is considerably higher than with strictly local nodes, as it takes a relatively long time for the departmental manager to travel from their primary departmental LAN to the cluster site to perform the maintenance and installation duties that require physical presence. During this time offsite, their management of their departmental LAN is obviously somewhat less responsive. Similarly, they are necessarily less responsive to the needs of the cluster owners when those needs require a trip offsite to where the cluster is physically located. At a guess, offsite management by the systems manager of the owning group is roughly twice as costly per node as onsite management by the local systems manager of the owning group.
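The labor figures quoted in this paper can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only. It assumes that labor scales linearly with node count, takes one hour per node per year as the onsite figure (the text says "on the order of an hour a year or even less", barring hardware failure), and applies the guessed factor of two for offsite management. The function name, its parameters, and the node counts are hypothetical and chosen purely for illustration.

```python
def annual_labor_hours(nodes, hours_per_node=1.0, offsite=False, offsite_factor=2.0):
    """Estimate yearly systems-management labor for a cluster, in hours.

    Assumptions (not measurements): labor is linear in the number of nodes,
    hardware failures are excluded, and offsite management multiplies the
    per-node cost by `offsite_factor` (a guess of roughly two).
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    factor = offsite_factor if offsite else 1.0
    return nodes * hours_per_node * factor


if __name__ == "__main__":
    # Hypothetical cluster sizes, for illustration only.
    for n in (16, 64, 256):
        onsite = annual_labor_hours(n)
        offsite = annual_labor_hours(n, offsite=True)
        print(f"{n:4d} nodes: onsite {onsite:6.1f} h/yr, offsite {offsite:6.1f} h/yr")
```

Under these assumptions, even a few hundred nodes cost a small fraction of one FTE year in labor, and the offsite penalty, while real, stays proportional to cluster size.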
Running a cluster in addition to a LAN involves tradeoffs that affect productivity in many ways, the most obvious one being that in many cases an administrator must choose to do one or the other, performing a sort of task prioritization or triage as needs for services and support emerge. If the LAN manager is relatively underutilized, this is not generally a problem. If they are already heavily burdened, the additional work can easily overburden them and result in a reduction in the quality of services.
Also, these local systems administrators are (generally) well-trained in LAN administration but may lack expertise germane to cluster management per se (where it differs). The construction of a university-level mechanism to better support and to better train onsite and offsite local managers is also a primary focus of the model proposed in this white paper.
This, then, is an outline for a campus cluster support model that is fleshed out in more detail below. In it, clusters will continue to be both managed and physically located locally where it makes obvious sense to do so, as this results in by far the greatest economies of scale. Nevertheless, a University-level cluster computing operation will be proposed that will remain at least partly delocalized itself, and which will be responsible for providing a variety of levels and kinds of support to groups operating or hoping to operate clusters for many purposes throughout the University.