BPROC: Beowulf Distributed Process Space
Goals / Grand Vision
The goal of this package is to provide key elements needed for a
single system image on a Beowulf cluster. Currently, Beowulf-style
clusters still look like a collection of PCs on a network. Once
logged into the front end of the cluster, the only way to start
processes on other nodes in the system is via rsh. MPI and PVM hide
this detail from the user, but it's still there when either of them
starts up. Cleaning up after jobs is often made tedious by this as
well, especially when the jobs are misbehaving.
The bproc distributed PID space addresses these issues by
providing a mechanism to start processes on remote nodes without ever
logging into another node, and by making all the remote processes
visible in the process table of the cluster's front end node.
The hope is that this will completely eliminate the need for people to
log in on the nodes of a cluster.
Overview
BPROC introduces a distributed process ID (PID) space. This allows a
node to run processes which appear in its process tree even though the
processes are physically present on other nodes. The remote processes
also appear to be part of the PID space of the front end node and not
the node which they are running on. The node which is distributing
its PID space is called the master, and the other nodes running
processes for the master are the slaves. Each PID space has exactly one master
and zero or more slaves.
Each PID space corresponds to a real PID space on some machine.
Therefore each machine can be the master of only one PID space. A
single machine can be a slave in more than one PID space.
Ghost processes
Remote processes are represented on the master node by "ghost"
processes. These are kernel threads like any other kernel thread on
the system (i.e. nfsiod, kswapd, etc). They have no memory space,
open files, or file system context but they can wake up for signals or
other events and do things in kernel space. Using these threads,
the existing signal handling code and process management code remains
unchanged. Ghosts perform these basic functions:
- Signals they receive are forwarded to the real processes they
represent. Since they are kernel threads, even SIGKILL and SIGSTOP
can be caught and forwarded without destroying or stopping the ghost.
- When the remote process exits it will forward its exit code back
to the ghost and the ghost will also exit with the same code. This
allows other processes to wait() and receive meaningful exit status
for remote processes.
- When a remote process wants to fork, it will need to obtain a PID
for the new child process from the master node. This is obtained by
asking the ghost process to fork and return the PID of the new child
ghost process. (This also keeps the parent-child relationships in sync.)
- When a remote process waits on a child, the ghost will do the
same. This prevents accumulation of ghost zombies and keeps the
process trees in sync.
PID Masquerading
The PID masquerading modifications make it possible for a process to
appear to exist in a different PID space.
Processes are still part of the single PID space that we're used to,
but the PID related syscalls (getpid, getppid, kill, fork, wait) have
been modified to treat their arguments differently and to give
different responses for processes that have been tagged as
masqueraded.
PID masquerading also introduces a user space daemon to control some
of the PID related operations normally done by the kernel. Each
daemon will define a new "PID space" and can create new processes in
that space. Operations such as new PID allocation are handled by this
daemon. (When a masqueraded process forks, a new masqueraded
PID will be needed for the child process. This request
gets sent out to the user space daemon which will forward it to the
master node. The ghost process there will fork and return the child
PID it gets back to the slave on the node. The child's new
masqueraded PID is set to that PID.)
The PID related syscalls will only operate on other masqueraded
processes that are in the same PID space (that is, they're under the
control of the same daemon). Other non-masqueraded processes on the
local system become effectively invisible as a result.
Signals (kill(2)) that cannot be delivered locally are bounced out to
the user space daemon for delivery.
This is used in conjunction with ghosts to make it appear as though a
piece of the master node's PID space has been moved onto the slave
node. When creating remote processes, there are really 2 processes
created, a ghost on the master to represent the remote process and the
real process on the slave node. The (masqueraded) process on the
slave gets the same PID as the ghost on the master node. The daemon
controlling the masqueraded process space forwards requests from
the real process back to the master node. This way, any operations the
real process performs (fork, kill, wait, etc.) will be performed in the
context of the master node's process space. Most requests will be
satisfied by the ghost thread.
Starting processes
There are two basic ways to start a process in this scheme. The
simpler one is rexec (remote execute), which takes the same
arguments as execve plus a node number. This interface also has
roughly the same semantics as the execve system call. It doesn't
involve transferring much data, but it does require that all binaries
and any libraries they require be installed on the remote nodes.
The other interface which bproc provides is a "move" or "rfork"
interface. This works by saving a process' memory region and
recreating it on the remote node. This has the advantage that it can
transport the binary and anything mmap'ed (such as the dynamically
linked libraries) to the remote node. This could allow a great
reduction in the size of the software required to be installed on a
node.
More information
Here are the slides from the BPROC talk at Linux Expo '99.
NEW: The bproc user guide is available here.
Getting bproc
The latest bproc release is available from: bproc-0.2.0.tar.gz
Current state
The state of this code is alpha. I believe it to be stable on the
x86 and Alpha architectures. That is, it's not likely to cause any
problems with the kernel, although there are most likely still cases
where it can get hung up and will require you to restart the daemons,
etc.
Also, features and functionality are still actively being added so
interfaces are subject to change, etc, etc.
Mailing list
The beowulf-bproc@beowulf.gsfc.nasa.gov
mailing list exists for the purpose of
discussing development of this package.
To subscribe to this list
send a message containing "subscribe" as the body to
beowulf-bproc-request@beowulf.gsfc.nasa.gov.
For other help with majordomo (such as how to unsubscribe) see our
majordomo page.
Gee, wouldn't it be nice if... ([Partial] Wish List)
- ptrace-able ghosts: Wouldn't it be great if one could run GDB or
strace on a remote process right on the front end without having to log into the remote node?
- Better / more configurable IO forwarding.
- TTY allocation w/ the rudimentary IO forwarding.
Other projects
(Or: Why we believe we're not reinventing the wheel.)
- Beowulf Global PID space
- This should not be confused with the older global PID space done
for Beowulf systems. Bproc is not a derivative of that scheme. The
old global PID space scheme was a way for nodes to avoid allocating
PIDs that would collide with one another. That, in conjunction with a
signal forwarding scheme, allowed signaling processes on other nodes
transparently, but it still required you to learn about remote
processes in some other way.
- Condor
- Condor exists to address problems of resource allocation over very
large numbers of systems owned by different people. It includes some
process migration capabilities as well.
Bproc aims to provide a
single process space within the tightly controlled environment of a
beowulf cluster. Bproc doesn't address resource allocation or load
balancing at all.
- Mosix
- This project aims to provide a single system image similar to
Mosix's but does not attempt Mosix-style transparent process
migration. Bproc will allow a process to migrate from one node to
another but this migration is not transparent or automatic. On the
other hand, bproc should avoid most if not all of the performance
penalties associated with Mosix style migration.
Contact: Erik Hendriks hendriks@cesdis.gsfc.nasa.gov
Page last modified: 1999/08/19 17:37:09 GMT
CESDIS is operated for NASA by the USRA