**Profile and Analyze Task(s) for Parallelizability**:
- Profile your task(s). Determine how much time is spent
doing work that could be done in parallel and how much time is
spent doing work that must be done serially. Task profiling is
covered in a chapter below.
- Use Amdahl's Law (covered in a chapter of its own) to determine
whether or not there is any point in proceeding further. If not,
quit: your task runs optimally on a single processor, and all you
get to choose is which of the various single processors to buy. This
book can still help (although that isn't its primary purpose) - check
out the chapter on ``node hardware'' to learn about benchmarking
numerical performance.

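To make this go/no-go estimate concrete, here is a minimal sketch of Amdahl's Law. The 10% serial fraction is an invented profiling result for illustration, not a measurement:

```python
def amdahl_speedup(serial_fraction, nprocs):
    """Upper bound on speedup with nprocs processors, by Amdahl's Law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nprocs)

# Suppose profiling shows 10% of the runtime is inherently serial.
# Even with unlimited processors the speedup can never exceed 1/0.10 = 10x.
for p in (2, 4, 8, 16, 1024):
    print(f"{p:5d} processors -> {amdahl_speedup(0.10, p):5.2f}x speedup")
```

If the bound this gives you is below the speedup you need, there is no point in continuing: buy the fastest single node you can instead.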
**Parallelize your code**:
- Just splitting up the serially written routines in your
original task (as you probably did to get your first estimate of
potential parallel speedup) isn't particularly optimal - one has to
adopt ``real'' parallel algorithms to gain the greatest benefit.
Consult one of the resources in the bibliography to learn about
parallel algorithms.
- Figure out how much information will have to be communicated
between the various subtasks being done in parallel, and estimate how
long it will take for those communications to complete on various
kinds of networks.
- Parallelizing the code adds overhead to both the serial and
parallel components, so recalculate the time estimates.
- Some of the tasks finish at the same time, some don't.
Sometimes we have to wait for everything to be done and resynchronize
(at ``barriers''), sometimes we don't. Add in an estimate of all
the serial time wasted waiting for things to finish up and so forth.
- Use the more detailed forms of Amdahl's Law given in the
chapter to determine whether or not there is any point in proceeding
further. If the answer is unequivocally no, then quit. Give it up;
use a single processor that is as fast as you can afford or need.
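One minimal way to sketch this bookkeeping is a time model with a per-node communication penalty. The linear penalty and all timing numbers below are illustrative assumptions, not the detailed forms given in the Amdahl's Law chapter:

```python
def estimated_wall_time(t_serial, t_work, nprocs, t_comm):
    """Crude parallel time model: serial part, plus divided parallel work,
    plus a communication/synchronization cost that grows with node count."""
    return t_serial + t_work / nprocs + t_comm * nprocs

def estimated_speedup(t_serial, t_work, nprocs, t_comm):
    return (t_serial + t_work) / estimated_wall_time(t_serial, t_work, nprocs, t_comm)

# 10 s of serial work, 990 s of parallelizable work, 0.5 s of
# communication overhead per node: speedup peaks and then *declines*
# as nodes are added, because communication eventually dominates.
best = max(range(1, 201), key=lambda p: estimated_speedup(10, 990, p, 0.5))
print(best, round(estimated_speedup(10, 990, best, 0.5), 2))
```

Note that past the peak, adding nodes makes the job slower - one reason the network matters as much as the nodes.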

Note that these first two steps can be greatly shortened (to approximately zero) if a decently written parallel version of your code already exists. Quite a lot of parallel code has already been written by folks around the world. Look around in the chapter on parallel applications and ask on the beowulf list before tackling a new parallelization project - you might find that your particular wheel has already been invented, or that some other wheel can, with a little regrinding, serve as a template for your own.

**Build Tables**:
- Build a single node performance table for as many candidate
nodes as you wish or can afford, using measurements where you can and
calculated estimates where you must.
- Build an estimated multinode performance table for as many
networks as you wish or can afford, using measurements where you can
and calculated estimates where you must.
- All your answers in the previous step depended on a certain
program size. Sometimes making a program ``bigger'' makes it
parallelize more efficiently (explained below in the Amdahl's Law
chapter). Determine the approximate effect of scaling up the task
size *and* varying the number of nodes in your multinode performance
table.
- Attach costs and determine benefits as weights, and transform
the multinode performance table into a cost-benefit table (covered
in a chapter of its own).

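The cost-benefit transformation can be sketched as follows. Every configuration, price, and speedup figure below is invented for illustration - your own multinode performance table supplies the real numbers:

```python
# (description, price in dollars, estimated speedup on your task)
multinode_table = [
    ("8 budget nodes, slow network",    8000,  5.1),
    ("8 fast nodes, slow network",     14000,  6.0),
    ("16 budget nodes, fast switch",   22000, 11.2),
]

# Weight benefit (speedup) by cost to get a cost-benefit table.
cost_benefit = [
    (name, price, s, s / (price / 1000.0))  # speedup per $1000
    for name, price, s in multinode_table
]
for name, price, s, cb in sorted(cost_benefit, key=lambda row: -row[3]):
    print(f"{name:30s} ${price:6d} {s:5.1f}x {cb:5.2f}x per $1000")
```

Sorting by the last column ranks the candidate designs by price performance at a glance.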
**Design your Beowulf**:
- Determine your performance threshold.
- Determine your price threshold.
- Select the beowulf (or cluster, or even dedicated parallel
supercomputer - if necessary) design that optimizes price
performance.
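A sketch of this final selection step, reusing the same kind of invented cost-benefit table (the thresholds and all figures are hypothetical):

```python
# (description, price in dollars, estimated speedup on your task)
designs = [
    ("8 budget nodes, slow network",    8000,  5.1),
    ("16 budget nodes, fast switch",   22000, 11.2),
    ("32 budget nodes, fast switch",   45000, 18.0),
]

required_speedup = 8.0   # performance threshold
budget = 30000           # price threshold

# Keep only designs that clear both thresholds, then pick the one
# with the best price performance (fewest dollars per 1x of speedup).
feasible = [d for d in designs if d[2] >= required_speedup and d[1] <= budget]
winner = min(feasible, key=lambda d: d[1] / d[2])
print(winner[0])
```

If `feasible` comes up empty, that is Amdahl's Law telling you to relax a threshold or stop here.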


Simple, really. You figure out if your computational task can be sped
up ``at all'' by parallelizing it under ideal circumstances. You figure
out pretty much how to parallelize it, up to and including writing or
finding or adapting a parallel version of the computational code. You
then do a *cost-benefit analysis* of the alternative ways to get the
work done^{2.9}.

This is a bit oversimplified, of course. For example, the *best*
parallel algorithm to choose for certain numerical tasks depends on
things like the size of the task and the relative speeds of numerical
operations and IPCs, so using one particular algorithm for all the
different combinations of node and network hardware isn't likely to
give you a global optimum in price performance. The price performance
is a function of many variables - the problem itself, the node design
(and all the variables that describe a node), the network design (and
all the variables that describe the network, including e.g. the
topology), and the algorithms and methodology used to parallelize the
problem. Furthermore, it is a *nonlinear* function of all of those
variables - sometimes changing something a tiny bit at very low cost
or no cost at all can double overall performance. Or halve it.

Still, the general approach above should suffice to get you into the
*vicinity* of an optimum solution, which is generally good enough.
In particular, it will give you a very reliable answer to the question
of whether or not you can afford to build a beowulf that will run your
problem satisfactorily faster. If the answer is yes (and often it will
be) then some strategic prototyping and parametric tweaking will often
let you do even better.