This is a new and improved outline for this paper.

    I) Rough introduction:  Amdahl's Law and its improved versions.  Skim
past discussion of communication patterns and algorithms (refer to
online references).
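
A minimal sketch of the Amdahl's Law item in (I), in case a worked
form is useful for the introduction.  If a fraction p of the run time
parallelizes perfectly over N nodes and the rest stays serial, the
ideal speedup is 1 / ((1 - p) + p/N).  The function name and the sample
values are illustrative assumptions, and the communication costs that
the improved versions account for are deliberately left out.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Ideal Amdahl speedup for a given parallel fraction and node count.

    Assumes the parallel part scales perfectly and ignores all
    communication cost, so the result is an upper bound.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Illustrative sweep: p = 0.9 is a made-up value, not a measurement.
for n in (1, 2, 4, 8, 16, 64):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

However many nodes are added, the speedup stays below 1 / (1 - p),
which is the point the introduction needs to make.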

   II) Discuss/Define Primary Bottlenecks:  Latency vs. Bandwidth.
Indicate how individual programs may be rate limited by either or both.
Also mention superlinear speedup, which can appear when one redesigns a
program to live within some significant boundary (e.g., so that each
node's working set fits in cache).
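
One way to make the latency vs. bandwidth distinction in (II) concrete
is the usual first-order model: the time to move a message is a fixed
latency plus its size divided by the bandwidth.  The sketch below
assumes that model; the function names and the sample latency and
bandwidth figures are placeholders, not measurements of any real
network.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """First-order model: fixed latency plus size over bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def crossover_bytes(latency_s, bandwidth_bytes_per_s):
    """Message size at which the latency and bandwidth terms are equal."""
    return latency_s * bandwidth_bytes_per_s


def limiting_factor(nbytes, latency_s, bandwidth_bytes_per_s):
    """Name the term that dominates the modeled transfer time."""
    if nbytes < crossover_bytes(latency_s, bandwidth_bytes_per_s):
        return "latency"
    return "bandwidth"


# Placeholder figures chosen only to illustrate the model.
LATENCY_S = 100e-6        # assumed 100 microseconds per message
BANDWIDTH = 10e6          # assumed 10 million bytes per second

for size in (64, 1_000, 1_000_000):
    t = transfer_time(size, LATENCY_S, BANDWIDTH)
    print(size, limiting_factor(size, LATENCY_S, BANDWIDTH), t)
```

A program that sends many small messages sits on the latency side of
the crossover; one that sends a few large ones sits on the bandwidth
side.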

  III) Then in order:  CPU.  Memory (L1/L2/main).  Network (IPC channel)
-- raw (socket) and cooked (in PVM/MPI application).  Disk.  Very short.
Just define and indicate the primary rate-defining features.

   IV) Engineering tools to determine key latencies/bottlenecks.
lmbench (under development, but a "complete" suite).  Tools in
"beobench".  Homemade tools.
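
As an example of what a homemade tool in (IV) might look like, here is
a minimal timing harness that takes the best of several runs and uses
it to estimate memory copy bandwidth.  This is only a sketch under
stated assumptions: the helper names are hypothetical, it is not how
lmbench measures anything, and interpreter overhead will dominate for
small buffers.

```python
import time


def best_time(fn, repeats=5):
    """Return the smallest wall-clock time over several calls to fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def copy_bandwidth(nbytes, repeats=5):
    """Estimate bytes per second for copying an nbytes buffer."""
    src = bytearray(nbytes)
    seconds = best_time(lambda: bytes(src), repeats)
    # Guard against a zero reading from a coarse timer.
    return nbytes / max(seconds, 1e-12)


# Sweep buffer sizes; the sizes are arbitrary illustrative choices.
for size in (16 * 1024, 1024 * 1024, 32 * 1024 * 1024):
    print(size, copy_bandwidth(size) / 1e6, "MB/s")
```

Taking the minimum rather than the mean is a design choice: it filters
out runs disturbed by other activity on the machine.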

    V) Profiling an application and then parallelizing it (predicting
overall performance).

   VI) Conclusion:  The >>best<< benchmark is always your application.
Complexity matters, and only rarely can one completely predict
performance on the basis of benchmark (micro or macro) numbers.

This MAY be too much.  Then again, it may not.  It's worth trying.