BPROC: User's Guide: C Interface Library

3. C Interface Library

Programs using bproc should include the bproc header file sys/bproc.h and be linked with -lbproc. This package builds both static and dynamic versions of libbproc.

3.1 System Information

bproc_init() initializes the bproc library. It reads the current machine state from /var/run/bproc. (This machine state is only available on the master node.) It also reads an initial node mapping from $HOME/.bprocnodes if that file exists.

int bproc_numnodes(void)

Returns the number of slave nodes in the system (the front end is not counted). The nodes are numbered 0 through n-1.

int bproc_nodeup(int node)

Returns true if node is up.
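The numbering and the up/down test described above can be combined to list the usable nodes. The following is a minimal sketch, not part of libbproc: collect_up_nodes and demo_nodeup are hypothetical names, and the predicate is passed in as a function pointer so the example compiles without the library. The pointer type matches the bproc_nodeup signature given in this guide.

```c
/* Same shape as bproc_nodeup(), so the real function can be passed in. */
typedef int (*nodeup_fn)(int node);

/*
 * Collect the numbers of all nodes that report as up.
 * numnodes: node count, e.g. the value returned by bproc_numnodes();
 *           nodes are numbered 0 through numnodes-1.
 * nodeup:   predicate such as bproc_nodeup.
 * out:      caller-provided array with room for at least numnodes entries.
 * Returns the number of entries written to out.
 */
int collect_up_nodes(int numnodes, nodeup_fn nodeup, int *out)
{
    int count = 0;
    for (int node = 0; node < numnodes; node++) {
        if (nodeup(node))
            out[count++] = node;
    }
    return count;
}

/* Hypothetical stand-in for bproc_nodeup(): pretends even nodes are up. */
int demo_nodeup(int node)
{
    return node % 2 == 0;
}
```

With the real library, the call would presumably look like collect_up_nodes(bproc_numnodes(), bproc_nodeup, out) once the library has been initialized.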

int bproc_nodeaddr(int node, struct sockaddr *s, int size)

Saves the IP address of node in the structure pointed to by s. Note that bproc_init has to be called on the master node in order for this information to be available.

3.2 Node mapping

The library allows the user to create a node mapping that sits on top of the real node numbers. This allows the user to always see the nodes in use as nodes 0 through n-1, regardless of which physical nodes they are. A mapping is presented as an array of integers: element 0 holds the real node number that node 0 maps onto, element 1 holds the real node number for node 1, and so on. bproc_init() reads an initial node mapping.

int bproc_set_node_map(int *map, int numnodes)

This sets the node mapping being used by libbproc. map is a pointer to an array of numnodes entries which lists the real node numbers that node numbers 0 through numnodes-1 should map onto.

void bproc_clear_node_map(void)

This clears any node mapping that might be present. After this call, all node numbers will be treated as physical node numbers.
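The mapping semantics described in this section can be modelled in a few lines. This is only an illustration of the lookup, not libbproc's implementation: logical_to_physical is a hypothetical helper, a NULL map stands for the state after bproc_clear_node_map(), and the -1 result for an out-of-range node is an assumption of the sketch (the guide does not say how the library reports that case).

```c
#include <stddef.h>

/*
 * Translate a logical node number into a physical node number.
 * map:      array of numnodes physical node numbers, or NULL for "no mapping".
 * numnodes: number of entries in map.
 * node:     logical node number to translate.
 * Returns the physical node number, or -1 if node is outside the map
 * (an assumption made for this sketch).
 */
int logical_to_physical(const int *map, int numnodes, int node)
{
    if (map == NULL)
        return node;            /* no mapping: numbers are already physical */
    if (node < 0 || node >= numnodes)
        return -1;              /* logical node not covered by the mapping */
    return map[node];           /* element i is the real node for logical i */
}
```

For example, with the map {4, 7, 9}, logical node 0 refers to physical node 4 and logical node 2 refers to physical node 9.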

3.3 Creating processes on remote nodes

Bproc provides a number of mechanisms for creating processes on remote nodes. It is probably better to think of these mechanisms as moving processes from the front end to the remote node. The rexec mechanism is like doing a move then exec with lower overhead. The rfork mechanism is implemented as an ordinary fork on the front end and then a move to the remote node before the system call returns. Execmove does an exec and then move before the exec returns to the new process.

Movement to another machine on the system is voluntary and is not transparent. Once a process has been moved, all of its open files are lost except for STDOUT and STDERR. These two are replaced with a single socket, so their output is combined. An IO daemon forwards data between the other end of that connection and whatever the original STDOUT was connected to. No pseudo tty operations are done.

The move is completely visible to the process after it has moved, except for process ID space operations. Process ID space operations include fork(), wait(), kill(), etc. All file operations will operate on files local to the node that the process has been moved to. Memory that was shared on the front end will no longer be shared.

Processes currently cannot move twice. The process movement API is only provided on the master node.

Bug: Any child processes that a process had before moving will no longer be visible to it after moving. SIGCHLDs will be delivered when they exit, but it will be impossible to pick up their exit status with wait().

int bproc_rexec(int node, char *cmd, char **argv, char **envp)

This call is like execve in that it replaces the current process with a new one. The new process is created on node and the local process becomes the ghost representing it. All arguments are interpreted on the remote machine. The binary and all libraries it needs must be present on the remote machine. This function returns -1 on failure and does not return on success.

int bproc_move(int node, int flags)

This call moves the current process to the remote node number given by node. The flags argument determines the details of the memory space move. See section 3.4 (VMADump) for details on the flags argument. Returns 0 on success, -1 on failure.

int bproc_rfork(int node, int flags)

The semantics of this function are designed to mimic fork(), except that the child process created will end up on the node given by the node argument. Behind the scenes, the process forks a child and that child performs a bproc_move() to move itself to the remote node.

By combining these two operations in a system call, we can prevent zombies and SIGCHLD's in the case that the fork is successful but the move is not.

On success, this function returns the process ID of the new child process. On failure, it returns -1.
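To see why the fork and the move are combined, it helps to look at what the parent observes when they are done as two separate user-space steps. The sketch below is not how bproc_rfork() is implemented. It uses plain POSIX fork() and waitpid(), and takes the move operation as a function pointer with the same shape as bproc_move() so that it compiles without the library. fork_then_move_status and failing_move are hypothetical names.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same shape as bproc_move(), so the real function could be passed in. */
typedef int (*move_fn)(int node, int flags);

/* Hypothetical stand-in for a move that always fails. */
int failing_move(int node, int flags)
{
    (void)node;
    (void)flags;
    return -1;
}

/*
 * Fork, attempt the move in the child, and report what the parent sees.
 * By the time the move fails, fork() has already succeeded, so the parent
 * is left with a dead child that it must reap with waitpid().
 * Returns the child's exit status (0 = move succeeded, 1 = move failed),
 * or -1 if fork() or waitpid() failed.
 */
int fork_then_move_status(int node, int flags, move_fn move)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: exit at once; a real child would continue its work here. */
        _exit(move(node, flags) == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In this two-step version the caller has to clean up after a failed move. With the combined call, a failed move is reported as an ordinary -1 return and the parent has nothing to reap.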

int bproc_execmove(int node, char *cmd, char **argv, char **envp)

This function allows migration of ordinary binaries by allowing you to exec a new process and move the new process before it "wakes up".

Returns -1 on failure, does not return on success.

3.4 VMADump: Dumping and restoring processes

VMADump (short for Virtual Memory Area Dumper) is a kernel module distributed with bproc which will dump a process's state to, or restore it from, a file descriptor. It will read or write to pipes, sockets, etc. as well as ordinary files. These functions are used internally by bproc to move processes around. The saved state includes:

All of the process's memory regions. The data for all writable regions is saved. Read-only regions that are mmap'ed from files (e.g. glibc code) can be stored as file references to reduce the size of dumps.
Other information about mmap'ed memory regions, such as where the bss and stacks are. This allows stacks to grow and brk (and therefore malloc) to work after restoring the memory space.
The process's registers including FPU state.
The process's signal handlers.

The following interface is provided for vmadump in libbproc:

int bproc_vmadump(int fd, int flags)

This takes the current process and dumps it to the file descriptor fd. In the process that performs the dump, it returns the number of bytes written to fd. When the process is later undumped, the same call returns 0 in the restored process. The flags argument determines which memory regions will have their data dumped and which ones will be stored as file references. Writable memory regions are never stored as file references.

VMAD_DUMP_LIBS

If given, read-only mmaps from files in /lib and /usr/lib will not be stored as file references.

VMAD_DUMP_EXEC

If given, read-only mmaps from the executable file will not be stored as file references.

VMAD_DUMP_OTHER

If given, other read-only mmaps not falling into the categories above will not be stored as file references.

VMAD_DUMP_ALL

If given, no read-only mmaps will be stored as file references. This is the logical OR of the other flags, and it is the safest option if in doubt.
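The flag rules above amount to a small decision table, sketched here for file-backed mappings. The DEMO_DUMP_* bit values are hypothetical and chosen only for illustration. The real VMAD_DUMP_* constants come from the bproc headers and may have different values. stored_as_file_reference and enum region_source are likewise names invented for this sketch.

```c
/* Hypothetical bit values; the real VMAD_DUMP_* constants may differ. */
enum {
    DEMO_DUMP_LIBS  = 1 << 0,
    DEMO_DUMP_EXEC  = 1 << 1,
    DEMO_DUMP_OTHER = 1 << 2,
    /* "All" is the logical OR of the other flags, as the guide states. */
    DEMO_DUMP_ALL   = DEMO_DUMP_LIBS | DEMO_DUMP_EXEC | DEMO_DUMP_OTHER
};

/* Where a file-backed mapping comes from. */
enum region_source {
    REGION_LIB,    /* file in /lib or /usr/lib */
    REGION_EXEC,   /* the executable file */
    REGION_OTHER   /* any other file */
};

/*
 * Returns 1 if a file-backed region would be stored as a file reference,
 * 0 if its data would be written into the dump.
 */
int stored_as_file_reference(int flags, enum region_source src, int writable)
{
    if (writable)
        return 0;   /* writable regions are never stored as references */

    switch (src) {
    case REGION_LIB:   return !(flags & DEMO_DUMP_LIBS);
    case REGION_EXEC:  return !(flags & DEMO_DUMP_EXEC);
    case REGION_OTHER: return !(flags & DEMO_DUMP_OTHER);
    }
    return 0;       /* unknown source: dump the data to be safe */
}
```

Passing no flags gives the smallest dump but relies on the same files being present when the image is restored. Passing the "all" flag gives a larger dump that does not depend on them.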

int bproc_vmaundump(int fd)

This attempts to undump an image from fd. This function is not very error tolerant: if something goes wrong halfway through undumping, it will return with a half-undumped process. If successful, the current process is replaced with the image from the dump, much like exec.

