BPROC: Beowulf Distributed Process Space

Goals / Grand Vision

The goal of this package is to provide key elements needed for a single system image on a Beowulf cluster. Currently, Beowulf-style clusters still look like a collection of PCs on a network. Once logged into the front end of the cluster, the only way to start processes on other nodes in the system is via rsh. MPI and PVM hide this detail from the user, but it's still there when either of them starts up. Cleaning up after jobs is often made tedious by this as well, especially when the jobs are misbehaving.

The bproc distributed PID space addresses these issues by providing a mechanism to start processes on remote nodes without ever logging into another node, and by making all the remote processes visible in the process table of the cluster's front end node. The hope is that this will completely eliminate the need for people to log in on the nodes of a cluster.

Overview

BPROC introduces a distributed process ID (PID) space. This allows a node to run processes which appear in its process tree even though they are physically present on other nodes. The remote processes appear to be part of the PID space of the front end node, not of the node they are actually running on. The node which distributes its PID space is called the master, and the other nodes running processes for the master are the slaves. Each PID space has exactly one master and zero or more slaves.

Each PID space corresponds to a real PID space on some machine. Therefore each machine can be the master of only one PID space. A single machine can be a slave in more than one PID space.

Ghost processes

Remote processes are represented on the master node by "ghost" processes. These are kernel threads like any other kernel thread on the system (i.e. nfsiod, kswapd, etc.). They have no memory space, open files, or file system context, but they can wake up for signals or other events and do things in kernel space. Using these threads, the existing signal handling code and process management code remains unchanged. Ghosts perform the basic functions needed to stand in for their remote processes: forking when the remote process forks, receiving and forwarding signals, and holding the exit status when the remote process exits.

PID Masquerading

The PID masquerading modifications make it possible for a process to appear to exist in a different PID space. Processes are still part of the single PID space that we're used to, but the PID-related syscalls (getpid, getppid, kill, fork, wait) have been modified to treat their arguments differently and to give different responses for processes that have been tagged as masqueraded.

PID masquerading also introduces a user space daemon to control some of the PID-related operations normally done by the kernel. Each daemon defines a new "PID space" and can create new processes in that space. Operations such as new PID allocation are handled by this daemon. (When a masqueraded process forks, a new masqueraded PID is needed for the child process. This request gets sent out to the user space daemon, which forwards it to the master node. The ghost process there forks, and the child PID it gets back is returned to the slave node, where it becomes the child's new masqueraded PID.)
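
As a concrete illustration, here is a minimal C sketch of that round trip. Every name in it (fork_request, master_handle_fork, and so on) is invented for this example, and a direct function call stands in for the network messages the real daemons exchange:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical message carried between the slave and master daemons. */
    typedef struct {
        pid_t masq_pid;   /* masqueraded PID of the process that forked */
        pid_t child_pid;  /* filled in by the master's reply */
    } fork_request;

    /* Master side: the ghost representing masq_pid forks, and the new
     * ghost's PID on the master becomes the child's masqueraded PID. */
    static void master_handle_fork(fork_request *req)
    {
        pid_t ghost_child = fork();   /* stand-in for the ghost forking */
        if (ghost_child == 0)
            _exit(0);                 /* a real ghost would idle in-kernel */
        req->child_pid = ghost_child; /* reply carries the allocated PID */
    }

    /* Slave side: a masqueraded process forked, so ask the master what
     * PID the child should masquerade as. */
    static pid_t slave_request_child_pid(pid_t masq_pid)
    {
        fork_request req = { masq_pid, -1 };
        master_handle_fork(&req);     /* really: send request, await reply */
        return req.child_pid;
    }

    int main(void)
    {
        pid_t child = slave_request_child_pid(getpid());
        printf("child's masqueraded PID: %d\n", (int)child);
        waitpid(child, NULL, 0);      /* reap the stand-in ghost */
        return 0;
    }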

The PID-related syscalls will only operate on other masqueraded processes that are in the same PID space (that is, processes under control of the same daemon). Other, non-masqueraded processes on the local system become effectively invisible as a result.

Signals (kill(2)) that cannot be delivered locally are bounced out to the user space daemon for delivery.

This is used in conjunction with ghosts to make it appear as though a piece of the master node's PID space has been moved onto the slave node. When creating a remote process, there are really two processes created: a ghost on the master to represent the remote process, and the real process on the slave node. The (masqueraded) process on the slave gets the same PID as the ghost on the master node. The daemon controlling the masqueraded process space forwards requests from the real process back to the master node. This way, any operations the real process performs (fork, kill, wait, etc.) will be performed in the context of the master node's process space. Most requests will be satisfied by the ghost thread.
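
For example, because a remote process and its ghost share a PID, signaling a remote process from the front end needs nothing beyond the ordinary kill(2). A hedged sketch (the PID on the command line is assumed to name a remote process):

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s pid\n", argv[0]);
            return 1;
        }

        /* PID of a remote process as it appears in the front end's
         * process table -- i.e. the PID of its ghost. */
        pid_t remote_pid = (pid_t)atoi(argv[1]);

        /* The signal lands on the ghost, which bounces it out to the
         * daemons for delivery on the slave node. */
        if (kill(remote_pid, SIGTERM) != 0)
            perror("kill");
        return 0;
    }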

Starting processes

There are two basic ways to start a process in this scheme. The simpler one is the rexec (remote execute) interface, which takes the same arguments as execve plus a node number, and has roughly the same semantics as the execve system call. This doesn't involve transferring much data, but it does require that all binaries, and any libraries they require, be installed on the remote nodes.
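
For illustration, assuming a C wrapper named bproc_rexec whose prototype follows the description above (execve's arguments plus a node number; the name and exact signature are assumptions, not taken from this page):

    #include <stdio.h>

    /* Assumed prototype: execve(2)'s arguments plus a node number. */
    extern int bproc_rexec(int node, const char *path,
                           char *const argv[], char *const envp[]);

    int main(void)
    {
        char *const argv[] = { "hostname", NULL };
        char *const envp[] = { NULL };

        /* Replace the current process with /bin/hostname on node 3.
         * /bin/hostname must already be installed on that node. */
        bproc_rexec(3, "/bin/hostname", argv, envp);
        perror("bproc_rexec");   /* like execve, only reached on failure */
        return 1;
    }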

The other interface which bproc provides is a "move" or "rfork" interface. This works by saving a process's memory regions and recreating them on the remote node. This has the advantage that it can transport the binary and anything mmap'ed (such as dynamically linked libraries) to the remote node, which could greatly reduce the amount of software that has to be installed on each node.
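
A comparable sketch for the move interface, assuming a fork-style wrapper named bproc_rfork (again, the name and prototype are assumptions):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Assumed prototype: like fork(2), but the child's memory image is
     * recreated on the given node. */
    extern pid_t bproc_rfork(int node);

    int main(void)
    {
        pid_t pid = bproc_rfork(3);   /* child continues on node 3 */
        if (pid == 0) {
            printf("hello from the remote node\n");
            _exit(0);
        }
        /* Parent on the front end: the child's ghost makes waitpid(2)
         * behave as if the child were local. */
        waitpid(pid, NULL, 0);
        return 0;
    }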

More information

Here are the slides from the talk about BPROC at the Linux Expo '99.

NEW: The bproc user guide is available here.

Getting bproc

The latest bproc release is available from: bproc-0.2.0.tar.gz

Current state

The state of this code is alpha. I believe it to be stable on the x86 and Alpha architectures; that is, it's not likely to cause any problems with the kernel, although there are most likely still cases where it can get hung up and require you to restart the daemons. Also, features and functionality are still actively being added, so interfaces are subject to change.

Mailing list

The beowulf-bproc@beowulf.gsfc.nasa.gov mailing list exists for the purpose of discussing development of this package.

To subscribe to this list send a message containing "subscribe" as the body to beowulf-bproc-request@beowulf.gsfc.nasa.gov.

For other help with majordomo (such as how to unsubscribe) see our majordomo page.

Gee, wouldn't it be nice if... ([Partial] Wish List)

Other projects

(Or: Why we believe we're not reinventing the wheel.)
Beowulf Global PID space
This should not be confused with the older global PID space done for Beowulf systems. Bproc is not a derivative of that scheme. The old global PID space scheme was a way for nodes to avoid allocating PIDs that would collide with one another. That, in conjunction with a signal forwarding scheme, allowed transparent signaling of processes on other nodes, but it still required you to find out about remote processes in some other way.
Condor
Condor exists to address problems of resource allocation over very large numbers of systems owned by different people. It includes some process migration capabilities as well.

Bproc aims to provide a single process space within the tightly controlled environment of a Beowulf cluster. Bproc doesn't address resource allocation or load balancing at all.

Mosix
Bproc aims to provide a single system image similar to Mosix's, but does not attempt Mosix-style transparent process migration. Bproc will allow a process to migrate from one node to another, but this migration is neither transparent nor automatic. On the other hand, bproc should avoid most if not all of the performance penalties associated with Mosix-style migration.

Contact: Erik Hendriks hendriks@cesdis.gsfc.nasa.gov
Page last modified: 1999/08/19 17:37:09 GMT
CESDIS is operated for NASA by the USRA