Next: Remote management-remote site Up: Cluster Siting Previous: Local management-local site   Contents

Local management-remote site

This model is very similar in cost to the previous (local management, local site) model. It presumes that the campus network backbone can be arranged and routed so that the remote site LAN can be made "local" as far as network routing, security, and network management are concerned. In particular, the most cost-effective local-remote schemes will be those where NFS servers in the departmental LAN can be transparently, securely, and efficiently mounted by the compute node clients at the remote site. In effect, the network is "flat" across the two locations.

This can easily enough be set up between most departments and potential offsite but on-campus cluster locations. The University wisely provided a large amount of overcapacity in fibers and switching when designing its backbone, so it is fairly straightforward to set up a pipe from (say) the physics server room directly to (say) the ISDS LAN, so that the ISDS cluster in the physics server room is completely within the management boundaries of the ISDS LAN.

In this case, then, all the LAN-specific aspects of setting up a generic cluster and making its nodes available to the owner/operators within the departmental LAN are again zero marginal cost per node. As before, the irreducible charges are node installation and maintenance, cluster-specific modifications and software management, and dealing with the special problems and needs of cluster users (in addition to the standard infrastructure cost of $1 per watt per year).

The one important additional cost is the extra time required to do anything physical with the nodes themselves. Since the nodes are in one location and the manager in another, the manager must leave their LAN and go to the node location to do things like physically boot a node with the power switch or reset button, pull a node that appears to have broken and diagnose the problem, or install a new node or set of nodes. The time spent in transit, in particular, is lost relative to the local-local model. This makes it more likely that an FTE boundary will be encountered, especially if there are a lot of nodes at the remote site so that frequent trips are required.
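As a rough illustration of how transit time eats into a manager's capacity, the arithmetic can be sketched as below. This is only a sketch under stated assumptions: the function name, the 40-hour work week, and the example trip counts and durations are all hypothetical and do not come from the cost figures in this document.

```python
def transit_fte_fraction(trips_per_week, round_trip_minutes, hours_per_week=40.0):
    """Fraction of one FTE spent purely in transit to the remote site.

    All inputs are hypothetical; hours_per_week=40 is an assumed
    full-time work week, not a figure taken from the text.
    """
    transit_hours = trips_per_week * round_trip_minutes / 60.0
    return transit_hours / hours_per_week


if __name__ == "__main__":
    # Assumed example: five trips a week, 30 minutes round trip each.
    fraction = transit_fte_fraction(trips_per_week=5, round_trip_minutes=30)
    print(f"Transit consumes {fraction:.1%} of one FTE")
```

The point of the sketch is only that the transit overhead scales with the number of trips, which in turn tends to grow with the number of nodes at the remote site; a manager already near full load can be pushed across an FTE boundary by this term alone.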

In addition, there are nonlinear costs associated with the lost productivity of individuals working in the department LAN who require immediate service when the LAN manager is offsite. In the worst-case scenario, the LAN crashes the minute the LAN manager has left. Even if they return immediately once they reach the offsite cluster location and notice that the LAN is down or that some crisis has occurred, the delay due to transit can easily cost an extra hour of downtime and lost productivity for an entire department's worth of LAN users. This is one of many reasons that LAN managers don't like to have jobs split across locations, clusters or not.
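The scale of this lost productivity can be estimated with a back-of-the-envelope calculation like the following. It is a hedged sketch only: the function names, the number of users, the number of trips, and the per-trip probability of a crisis are hypothetical placeholders chosen for illustration, not measurements.

```python
def lost_person_hours(users, delay_hours):
    """Person-hours lost when every LAN user waits out one transit delay."""
    return users * delay_hours


def expected_yearly_loss(users, delay_hours, trips_per_year, p_crisis_per_trip):
    """Expected person-hours lost per year.

    p_crisis_per_trip is an assumed probability that the LAN fails
    while the manager is away on a given trip.
    """
    expected_incidents = trips_per_year * p_crisis_per_trip
    return expected_incidents * lost_person_hours(users, delay_hours)


if __name__ == "__main__":
    # Assumed example: 50 users, one extra hour of downtime per incident,
    # 100 trips a year, and a 1% chance of a crisis during any one trip.
    print(lost_person_hours(users=50, delay_hours=1.0))
    print(expected_yearly_loss(50, 1.0, trips_per_year=100, p_crisis_per_trip=0.01))
```

Even with a small per-trip probability, the loss is multiplied by the whole department's headcount, which is why a single incident can outweigh the manager's own transit time.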

It is worth noting that many of these additional costs and inconveniences can sometimes be avoided if there is a small surplus of FTE management capability at the physical site. In that case, it may be possible to simply send an email requesting that a site-local manager visit the server/cluster room and toggle power on a downed node, or even do the physical installation of a node (which may be nothing more complicated than racking the node, cabling it, and turning it on).

On a more organized basis, once some sort of centralized management entity is created, it may be possible to achieve something very similar at a higher level. The central management group might be running clusters in some of the same sites. As a concrete example, the physics cluster room might host both an ISDS cluster and a public cluster in addition to physics' own local clusters. It might well be that the public cluster is managed by one of the physics LAN managers (who is partially funded by the University for doing this particular job) or by the physics LAN management group (where all or part of an FTE in that group is provided for managing the public cluster). It would be a simple matter to give that individual or group the additional responsibility for the physical (not LAN-level) care of the remotely sited ISDS cluster, so that routine work such as pulling downed nodes for service or installing new nodes is done by them without necessarily requiring the physical presence of the ISDS LAN manager.


Robert G. Brown 2003-04-03