Of course, reality is generally *worse* than the highly optimistic
upper bound predicted by Amdahl's Law^{4.10}.
After all, imagine how long it would take you to actually build that
model airplane if the entire population of the world WERE in your
kitchen trying to help. ``Forever'' might be a reasonable answer as you
(and your kitchen) are crushed beneath the weight of all that help.
``Forever'' is the generally correct answer for parallel processing,
too.

There are lots of ways to divvy up the work, and the process of divvying
up the work itself takes time. It is also entirely possible to reach
fundamental limitations on how far you can subdivide a finite task made
of discrete parts (imagine a billion hands on one model airplane that
has, after all, only forty or fifty parts that are typically connected
by glue bonds made between two objects at a time). There are technical
details concerning the order in which subtasks have to be completed that
can prevent the parallel work from being cleanly divisible among
nodes^{4.11}. And so on. The following analysis
works through a few of the simpler and more obvious corrections that we
have to consider when parallelizing a task.

One way of looking at all of these corrections is that accounting for
all the extra time spent setting up and executing a parallelized task
*changes the serial and parallel fractions!* We might expect to see
new terms being added to or taken away from the fraction.
Alternatively, thinking about the time it takes to complete the various
chores in a team effort, we left out a bunch of *times* that may
well be important. Indeed, in many cases (or scales) these additional
times may be *dominant* - the *most* important thing that
determines the rate or relative speedup.

To understand some of the times we continue with our example of building a model airplane. Note that in our original speedup estimate we ignored the time that it took you to give your friend the wing part of the kit and take back the completed wing. Well, if it takes your friend twenty minutes to build a wing and twenty seconds total to receive the parts and hand back a finished wing, that's probably ok. Your final time estimate is twenty seconds longer than you thought, but compared to twenty minutes that's not too big an error.

What if your friend lives next door and is not in the kitchen with you?
You have to get up, take him the wing parts and a tube of glue, come
back, and go to work. As soon as he finishes, he has to get up and
bring you the finished wing. Now it might take *five minutes* to
take him the wing parts and *five minutes* to come back and get to
work and suddenly you've spent ten minutes setting up a parallel process
designed to save you twenty minutes. That ten minute net savings might
also be relative to an hour's total labor if you did it all yourself.
Not too good.

At least you're still in the black - you finish your airplane faster
than you would have without your friend. On the other hand, if he lived
on the other side of town (a ten minute drive either way) it would take
you much *longer* to complete the airplane with his help than it
would without it. Whoa! We can actually lose ground and *slow a
program down* by parallelizing it! Amdahl's Law (which at least
permitted all speedups strictly greater than 1 for any value of ) is
*way too optimistic*.

We've discovered a couple of new ideas in our corrected model airplane example, and we'll now proceed to incorporate them into our algebraic discussion of times and rates and speedups and such. The most important is the analogy of ``Inter Processor Communications'' (IPCs) - this is the communications step where one processor (you) sends part of the program and/or data (the wing parts and glue and instructions for assembly) to another processor (your friend), and later get back a finished wing. In all parallel code, SOME sort of IPC's are necessary. They can take a long time compared to the time actually required to do the parallel work or they can take a short time compared to the time to do the parallel work.

In some very important cases this communication process can be done only one or twice and may take a very short time compared to the time working in parallel, for example at the beginning and the end of a calculation. In all cases, though, the program itself and initial data has to somehow get to the processors working in parallel and the results of their parallel work have to be reunited into some finished whole. IPC's are definitely essential to the notion of parallel processes.

And^{4.12}they take time. And^{4.13} for
us to correctly estimate the speedup of a parallelized program, we have
to insert the time required, since it may well be significant.

In fact, if we think about it, it costs us time at least TWICE, in fundamentally different ways. Parallelizing a program, we generally increase by a bit the time required to complete the serial fraction. If we have friends waiting to take various airplane parts that are originally in our possession and build them, we may well have to carry the parts and instructions and glue to each one, one after the other. The more friends, the longer this takes. Overall, this time thus scales like the number of friends (or worse) and, if you are also responsible for doing the serial work it adds directly to the serial time because you're not working on the airplane at all while you're carrying parts around and collecting finished sub-assemblies.

We also have to increase the time required to execute the parallel task
on each node a bit over what it would have taken serially. It takes you
a certain amount of time (already accounted for in the original serial
time estimate) to put newspapers on the table^{4.14}, get the glue
open^{4.15}, and read the overall instructions
- your friend *also* needs to spread his own newspapers on his own
table ^{4.16}, open the glue, and read the general instructions for
himself before he can get started on building his wing from the
wing-specific instructions. It might take him a few minutes to do these
things and we have to increase the time it takes for ``wing building in
parallel by a friend'' over the time it takes for ``building each wing
as part of sequentially building the whole airplane yourself''. This
time usually does *not* scale up with the number of friends helping
as they can all be spreading newspaper, etc. at the same time. At the
same time, it doesn't scale down - your friends can't set up their
worksites ``in parallel'' in less time than it takes to set up a
worksite.

Finally, you and your friend will almost certainly have to wait for each
other from time to time, as already illustrated above. If you finish
one part that gets glued to a part he's working on as the next step, you
have to wait for him. Or, you may well be sharing resources. You may
have to wait while he uses the glue, for example, and a bit later he may
have to wait for you to give it back. In the meantime, you each *may* have to wait idle, although this often depends on how the task is
organized.

All this introduces new times and fractions and rates into our earlier statement of Amdahl's Law. Here's a table to help you keep track of this menagerie of variables:

- The original single-processor serial time.
- The (average) additional
*serial*time spent doing things like IPC's, setup, and so forth, per processor, in all parallelized tasks. - The original single-processor parallizable time.
- The (average) additional time spent by each processor doing just the setup and work that it does in parallel. This may well include idle time, which is often important enough to be accounted for separately.

Using these definitions, we can write our modified task completion time
when using processors:

(4.8) |

To find the Amdahlian^{4.17}speedup, we again have to evaluate . This time, being lazy,
we'll note that the always cancels so we're really just evaluating
:

Now, if we were to be picky (and let's be, just this once) this result,
however useful and marvelous, is still way too general (and hence
incorrect) to be *truly* useful and marvelous. We are in a bit of a
quandry, though. Every time we add a bit of detail, our speedup
expression gets a bit more complex. This cannot be helped, it can only
be understood, however much work it takes to understand it for your
particular numerical task. In *many* cases, this expression will
suffice to get at least a general feel for the scaling properties of a
task that might be parallelized on a typical beowulf. In others, it
won't, and you'll have to work much harder.

As just a single (but very important) example of the latter, it is
well-known that certain numerical tasks require a pairwise exchange of
information between *all nodes* between parallel steps. Each
pairwise communication might take a time (where I have no idea
what the subscript stands for, but it is different from , and
). If there are nodes and each node can thus talk to *other* nodes and we make the unhappy assumption that they are connected
by a *hub* that permits only *one* pair to talk in *one*
direction at a time (no broadcasting allowed), we find that the *total* serial IPC step requires individual pairwise
communications each costing , or

(4.10) |

This, alas, goes in the *denominator* of our relative rate
expression, which now contains terms with powers of that range from
-1 to 2. Oooo^{4.18}. Suddenly:

(4.11) |

This is by no means the only kind of scaling behavior possible. If
represents your problem ``size'' (for example, the length of a lattice
side or the number of items in a list to be sorted) then the work being
split up can depend strongly on as well. There may well be (work
and/or communication) times that scale like , other times that scale
like , and so forth. If you parallelize the part that scales like
but *not* the part that scales like , you might get decent
speedup for smallish but at some value of the will
overwhelm it.

Unfortunately, we're just getting warmed up. Imagine what we might have
to write if we account for *all* of the times spent in all the
parallelized subroutines of a complex piece of code, *including* the
effects of nonlinear determinants like cache size, memory speed, memory
size, context switching, communications speed, communications latency,
communications *pattern*, the need for points of parallel code
resynchronization (called ``barriers''), and a whole lot
more^{4.20}. Each
little piece cranks a new term into our relative rate equation, or
modifies a term that is already there.

The final insult is that all of these times are totally algorithm
dependent and completely different algorithms are often ``best'' for
parallel computations than for serial computations. There are Clever
Tricks that can often be used to change the communications pattern and
that produce quite different scalings of the communication times and
idle times. I *told* you that this sort of thing carries over into
the realm of Real Computer Science really quickly. Trying to
calculate all these terms and times, in detail, *a priori* for
complex pieces of code is well-nigh impossible, and few beowulfers (or
other kinds of parallel computer programmers) do it. Real Computer
Scientists do, it appears (often less for the result than for the papers
they can publish describing the result), and bless them for it since
then we don't have to.

However, there is no escaping the need to perform a few *basic and
practical* steps. The Wise Beowulfer *will* determine the
-scaling and -scaling of the times of at least the most important
blocks of parallelizable code in your task(s) (hopefully, your task will
fall into some ``generic'' category discussed in detail below, but you
at least have to identify the category). The Wise Beowulfer *will*
at least think about the interaction between your hardware and
networking design and algorithm and these powers and times.

Let me conclude this chapter by showing you why you should bother with a couple of practical - if somewhat contrived - examples.

Suppose *you* are charged with building a beowulf to carry out an
``all pairs communicate every step'' task like the one described above
that led to the (somewhat naive) time scaling for ``all
pairs one at a time'' *hub* based communications. Studying the
problem, you rapidly learn that you can achieve the same result (all
pairs exchanging data) in time if a *switch* that can
support simultaneous bidirectional communications is used instead
of a hub (and is even). Or you could try using a broadcast on the
hub (if your software parallel communications library supports a true
hardware broadcast, a thing that might require a prototype to validate
since some libraries might implement their group ``broadcast'' function
as a series of pairwise communications to the hosts in the group list),
which would yield time. You also must consider that might
be ten times smaller for a 100 Mbps hub than for a 10 Mbps switch that
costs about the same amount. Then there is a 100 Mbps switch to
consider, which costs a bit more. Then, we haven't really considered
the algorithm. Depending on the message sizes, the latency, and a few
very esoteric things, there are algorithms that might reduce the
to (e.g.) even for a hub (or parallel library) with no
broadcast. Finally, we haven't worried about *how* to program a
synchronized transfer like the one required to obtain optimum time
through a switch. All the nodes have to be talking at the same time and
to just the right node, in both directions, to get the improved node
scaling, and this is not at all easy to arrange.

Your job, your future, your health and your happiness all ride on making the right choice here. Which one do you buy?

One is tempted to say ``the 100 Mbps switch'', and for many problems
(possibly most) this would be the correct answer. That's why it is in
the ``recipe'' given at the very beginning. It is relatively cheap and
adequate to give decent scaling (both power and base time) for many
problems. However, if you can only afford eight nodes, and you're doing
a problem where
for a *10* Mbps hub, the correct
answer might be to get the cheapest damn thing you can lay your hands
on. There are times when the answer could even be ``who needs a hub''
- early PC-based ``beowulfish'' computer efforts not infrequently
involved a floppy disk carried around to a bunch of computers *by
hand*^{4.21}. Pop in the floppy, merge the data, carry it
around, and when it is all shared start another week-long run on all the
computers involved.

In quite a few cases, though, the correct answer would be a far more
expensive *gigabit* (per second) switch, like Myrinet or perhaps
gigabit ethernet. If you want to be able to scale up to ``lots'' of
nodes (say 64 or more) it may be crucial to reduce by an order of
magnitude *and* obtain the most favorable -scaling!

This illustrates just a few of the realities confronting the would-be
beowulf designer. In some cases you can derive an approximate scaling
form for the relative rate equation appropriate for your particular
algorithm and communication pattern. In other cases (most cases,
really) you are far better off looking it up (along with a whole lot of
other stuff) in a *real* book on parallel computation^{4.22}. The reason you should consider
looking things up is because there are smart and stupid ways to do
simple little things like multiply matrices in parallel or send a
message between all nodes in between program steps. Some of them are
very non-intuitive - you'll never invent them on your
own^{4.23}. This is what
Computer Scientists (the real variety) live for. C'mon, give them their
moment of glory.

Once you have the scaling form of the relative speedup appropriate to
your algorithm and the various network media types you are considering
you can use it (and some measurements) to make *estimates* for the
speedup possible for a parallelized chunk of code. This is less
difficult than it sounds. In practice, all this mathematical work isn't
so daunting - usually most of the parallelizable work is done in just a
few program blocks and all the surrounding serial code can be added up
at once into the irreducible serial work and the irreducible serial
time. In a lot of cases, in fact, there will be just *one*
parallelizable block. In the best cases the *whole program* can be
converted to a parallel block, where the only required serial code is
something to start a lot of programs in parallel and collect the
results. These are called ``embarrassingly^{4.24} parallel
computations''.

Although embarrassingly parallel computations are important enough to be
given their own acronym (EPC) and to be considered in detail later,
we'll take a moment to think about them here as well as they have an
important lesson for us to learn before we leave the discussion of
mathematical estimates of rates and so forth behind. To understand them
we can return to Our Favorite Metaphor by thinking of building *lots* of identical model airplanes with our friends. One can get great
parallel efficiency (another way of saying ``a speedup like
'') by just getting friends into a room^{4.25} and
giving each one their own kit and glue. If it takes you ten minutes to
distribute 100 kits, and your ``nodes'' build 100 airplanes in one hour
more, you've built 100 airplanes in and hour and ten minutes, for a
speedup a hair less than 100. Not bad compared to the 100 hours it
would have taken you to build them all one at a time, and you didn't
even need to get glue on your own fingers. If you have 1000
friends^{4.26} and can still distribute
all the kits in an hour or so (good luck), your gains get even better.

The model airplane construction, in this case, is being run as an
embarrassingly parallel task. Now you know what the phrase means. One
processor starts essentially identical jobs (on other processors)
all at once, then kicks back for a relatively long time (perhaps sipping
a metaphorical brew or two, perhaps doing a job itself) until they all
complete, and then collecting the results. Repeat until finished, with
near perfect scaling. Technically, we've arranged things so that ,
, and are all much much less than (with no
particularly strong additional constraints on the way tasks are started
or finish) so that
as required. This is the way a
compute cluster of nearly *any* sort can be used to get fabulous
amounts of work done in parallel. Later we'll talk about the SETI
project and how to turn the entire internet into a cluster
supercomputer.

This embarrassingly parallel example also gives us a hint of how to *improve* our speedup for parallel operation, all things being equal.
Suppose we can distribute one airplane kit per minute and need to build
ten airplanes. Suppose it also takes only one minute to build an
airplane one at a time (perhaps they are the cheap balsa ones your
dentist gives to your kids as a ``reward'' for not biting her fingers).
Hmmm, pretty lousy gain^{4.27}. Now, think about the speedup if it
takes two minutes to build an airplane^{4.28}. Better, but not
spectacular. What about a hundred minutes^{4.29}? Aha!
In a lot of cases we can go from pretty shabby parallel speedup scaling
to spectacular astounding parallel speedup scaling by just *increasing the amount of parallel work done* while, of course, keeping a
lid on the additional serial fraction associated with doing the
additional parallel work.

As a parenthetical aside (in a work that a cruel person might consider a
huge conglomeration of parenthetical asides [some including nested
parenthetical comments of their own] arranged non-parenthetically),
one could also be tempted to *reorganize the task completely* from
its serial arrangement by setting up an assembly line where each friend
just adds one part to a model airplane and hands it to the next person
in line. As Henry Ford discovered, such an arrangement requires
considerably greater effort (and capital) to set up, but actually can
allow the model airplanes to be completed even *faster* than in the
embarrassingly coarse grained parallel implementation of the serial work
by actively reducing the time required to complete the parallelized
work.

Naturally, similar arrangements can occur in parallel programming,
especially when considering the additional costs of e.g. flushing and
reloading a cache or performing a context switch (which can make it more
expensive to do a series of different things in parallel than to do one
thing many times). We might even discuss a few later. This is one of
several circumstances where Amdahl's Law *might* be wrong, or at
least (as previously noted) irrelevant, as there is no useful analog of
an assembly line in a non-parallel work situation.

Anyway, I've now completed most of the formal algebraic analysis that
I'm going to do. That's the good news. The bad news is that I didn't
even try to do a complete or detailed job of the formal analysis - I've
only taught you enough^{4.30} that you
should be able to *figure out* how to do what you need to do for
your own particular task of beowulf design. If your task is complicated
enough to be beyond the power of this simple analysis to elucidate (and
isn't similar to one of the ones I consider in detail later) then I
guess you'll have some work to do, including obtaining and learning from
more advanced resources.

There is still one important step to complete before leaving scaling laws completely. Many of you probably looked at equations like (4.9) with the patient, somewhat quizzical expression that I might have if suddenly confronted with a pair of Tibetan monks asking directions to the nearest mall (in Tibetan, of course). I'm so glad you managed to hold on (out of sheer politeness, I'm sure). In the next section we'll actually show the pictures.