The Sun Grid Engine is a tool that makes a network of compute
resources into a "grid" where users can submit jobs and have them queued
and redistributed to run on resources anywhere on the grid as they
become available. It contains at least provisional abilities to manage
policy on grid resources that are also interactive workstations -- to
stop grid tasks "instantly" if an interactive user types or moves a
mouse. It also contains provisional abilities to handle code migration
with checkpoints. It operates in userspace (requires no kernel
modifications or modules) via daemons.
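To make the queue-and-dispatch idea concrete, here is a toy sketch in Python. It is not SGE's scheduler or API; the function name dispatch and the busy_interactive parameter are invented for illustration. It only shows the basic policy described above: queued jobs go to hosts as they become free, and a workstation with an active console user is left alone.

```python
from collections import deque


def dispatch(jobs, free_hosts, busy_interactive=()):
    """Assign queued jobs to free hosts in first-come, first-served order.

    jobs             -- job names, oldest first (hypothetical labels)
    free_hosts       -- hosts that currently have no grid job running
    busy_interactive -- workstations whose console user is active;
                        the toy policy never places a job on these
    Returns (assignments, still_queued).
    """
    queue = deque(jobs)
    assignments = {}
    for host in free_hosts:
        if not queue:
            break  # nothing left to place
        if host in busy_interactive:
            continue  # leave an in-use workstation alone
        assignments[host] = queue.popleft()
    return assignments, list(queue)


placed, waiting = dispatch(
    ["job1", "job2", "job3"],
    ["node1", "ws1", "node2"],
    busy_interactive={"ws1"},
)
print(placed)   # {'node1': 'job1', 'node2': 'job2'}
print(waiting)  # ['job3']
```

The third job stays queued until another host frees up or the user at ws1 goes idle. A real grid scheduler layers priorities, resource requests, and checkpointing on top of this.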
Parenthetical Aside

With all those advantages, it is nearly
ideal for clusters devoted to embarrassingly parallel tasks. The only
catch is that it is SO powerful, and intended to build and run on SO
many architectures, that one gets the definite feeling that it would
build more simply and work better if most of those architectures were
trashed and energy focused on linux and gcc-based systems. aimk, for
example, is something I once played with and even hacked for my own
extensive use back when I was managing a highly heterogeneous Unix
network. Ah, what a child I was.
Eventually I sobered up and realized that heterogeneity is the root
of a lot of evil where systems management and application development
are concerned, and that I really wanted NOT to EVER AGAIN have to screw
around with where this particular flavor of Unix keeps that particular
include file and how to hack and patch the code if the file (and its
associated library) were different and possibly incompatible.
Especially in a complex application with many contributing developers
and nonlinear constraints (like support by Sun, making it impolitic for
a Sun build ever to break) where somebody working on architecture X can
break the hell out of architecture Y and not have the fact revealed
until extensive testing has occurred on all the A-Z architectures
"supported", requiring yet another #ifdef instrumentation.
Still, SGE (like PVM) will undoubtedly prove to be worth the hassle
in the long run. In the meantime, one can only ask WHY the developers
put four or five steps into the README.BUILD instead of providing a
single script entitled "build.sh" or (better yet) a toplevel makefile
with autodocumenting targets? So fine, maybe aimk is great, but normal
humans have never used it so hide it behind regular old make. Or why
they use the incredibly arcane aimk at all instead of GNU's autoconf
(intended to serve the same purpose, but a lot more modern and in
common enough use to actually be functional)? Or (being the linux bigot
that I am:-) why they don't just scrap even this and focus on linux/gcc
builds only with a Makefile a child could understand?:-p
Our site REQUIRES RPMs (ideally built from src RPMs so they can
easily be REbuilt) for scalable management purposes. Building SGE into
portable RPM form looks like it will be a truly joyful process.
Especially given that following the strict instructions in README.BUILD
on a clean checkout from the CVS tree has failed every time I've tried
it, most recently on a perfectly updated RH 7.3 system. For that
reason I've been slow to get enthused, although this summer I may have
the time required to get over the initial build blues.