From josip@icase.edu Fri Dec  8 18:53:43 2000
Date: Fri, 08 Dec 2000 12:49:07 -0500
From: Josip Loncaric <josip@icase.edu>
To: Daniel Ridge <newt@scyld.com>
Cc: beowulf@beowulf.org
Subject: Re: Compiling Beowulf software

Daniel Ridge wrote:
> 
> Your fresh Scyld Beowulf machine probably does not have LAM installed --
> we ship a very slightly modified MPICH instead.
> BTW: I had a conversation with Jeff Squyers on this point and I think we
> might be able to get LAM to support the Scyld Beowulf platform with only a
> small amount of work.

That would be rather nice.  In our tests, LAM has generally performed
better than MPICH (which has 36% higher latency).  Also, LAM shared
memory performance using usysv transport on our SMP boxes is about as
good as the hardware can deliver (1 microsecond latency, 266 Mbyte/s
peak bandwidth).  MPICH shared memory performance is not as good (16
microsecond latency, 235 Mbyte/s peak bandwidth).  On the minus side,
LAM requires an auxiliary daemon, lamd, on each node.

While we are on the topic of daemons on nodes, PBS runs its own
(pbs_mom).  Also, some networks require daemons on nodes (e.g. Giganet
cLAN uses clanmgr and clanagent).  It would be nice if this could be
incorporated into a Scyld cluster on a per-node basis (e.g. some nodes
may need such daemons, others not).  Given that clusters are often built
from several generations of hardware, there may be other node-specific
requirements (e.g. different lm_sensors or ECC monitoring modules).  A
mechanism similar to using /var/beowulf/boot.img.# to load node # with
its own boot file may help (we were able to use this to load SMP or
uniprocessor kernels into nodes as appropriate).

Could the Scyld node_up script be augmented to carry out node-specific
hardware initialization and daemon startup?  Perhaps the default
node_up could try starting a node_specific.# script after finishing the
default setup...

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip@icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric@larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134

_______________________________________________
Beowulf mailing list
Beowulf@beowulf.org
http://www.beowulf.org/mailman/listinfo/beowulf

From josip@icase.edu Fri Dec  8 18:55:40 2000
Date: Fri, 08 Dec 2000 15:18:50 -0500
From: Josip Loncaric <josip@icase.edu>
To: Patrick Geoffray <pgeoffra@cs.utk.edu>
Cc: beowulf@beowulf.org
Subject: LAM SMP performance

Patrick Geoffray wrote:
> 
> On Fri, 8 Dec 2000, Josip Loncaric wrote:
> 
> > memory performance using usysv transport on our SMP boxes is about as
> > good as the hardware can deliver (1 microsecond latency, 266 Mbyte/s
> > peak bandwidth).  MPICH shared memory performance is not as good (16
> > microsecond latency, 235 Mbyte/s peak bandwidth).  On the minus side,
> 
> I am very surprised by the SMP performance. 1 us is very very (too) low,
> it's the cost of a system call. usysv uses SYS V semaphores, and I don't
> think it's possible to reach this level of latency with them.

I believe that you are thinking of sysv (semaphores).  LAM compiled with
usysv uses spinlocks, and the peak 266 Mbyte/s bandwidth is reached for
8KB cache-to-cache copies.  Memory gets involved only for larger message
sizes, and then the bandwidth drops to 127 Mbyte/s.  See my raw data
(NPmpi from netpipe-2.3) at:

http://www.icase.edu/~josip/phase23-64-TCP-lam/NPmpi.out
http://www.icase.edu/~josip/phase23-64-TCP-mpich/NPmpi.out

and the summary of my findings at

http://www.icase.edu/~josip/MPIonCoral.html

Also, I'm told that LAM reaches similar shared memory performance levels
on Suns (Solaris).

Important: LAM with usysv (spinlocks) works great, but performance can
drop by a factor of 100,000 if more than one process per CPU is
started.  If you must use more than one process per CPU, compile LAM
with sysv (semaphores) instead.  Benchmark your code and pick the best
library for the job...

Sincerely,
Josip


From josip@icase.edu Fri Dec  8 19:02:54 2000
Date: Fri, 08 Dec 2000 17:26:53 -0500
From: Josip Loncaric <josip@icase.edu>
To: Patrick Geoffray <pgeoffra@cs.utk.edu>
Cc: beowulf@beowulf.org
Subject: Re: LAM SMP performance

Patrick Geoffray wrote:
> 
> On Fri, 8 Dec 2000, Josip Loncaric wrote:
> 
> > I believe that you are thinking of sysv (semaphores).  LAM compiled with
> > usysv uses spinlocks, and the peak 266 Mbyte/s bandwidth is reached for
> > 8KB cache-to-cache copies.  Memory gets involved only for larger message
> > sizes, and then the bandwidth drops to 127 Mbyte/s.  See my raw data
> 
> For the bandwidth measurement, it's a good occasion to talk about a good
> way to measure SMP bandwidth: some people do not accept cache-to-cache
> performance values because they do not show the real memory bus capacity,
> while others do.
> 
> I believe that it gives a more accurate result, since an application
> usually writes the message just before sending it, so the data is in the
> sender's cache. On the other hand, the message can be asynchronous and the
> cache can be trashed on the receiving side before the user application
> uses the payload.
> 
> What do you think ?

The performance figures which NPmpi reports are what an application
sees, and therefore should be accepted.  These cache effects are due to
the computer architecture, not clever coding.  (I believe that at the
lowest level, LAM invokes plain memcpy() to move shared memory data, but
management of data movements is actually done by hardware, which
exploits the L2 caches as much as it can.)

We all know that RAM bandwidth is a bottleneck that should be avoided
whenever possible.  Therefore, the whole idea of having caches is to get
performance boosts at small but still reasonable data sizes. 
Applications which manage their work in a cache friendly way will see
significant benefits, whether doing matrix multiplies or exchanging data
via shared memory.  In that light, peak performance is definitively
interesting, even when we measure the entire curve from 1 byte to 1
Mbyte message size.

BTW, if you are interested in sustained (out-of-cache) shared memory
performance, LAM-6.3.2-usysv still works very nicely (127.3 Mbyte/s). 
MPICH-1.2.0 is almost as good at 120.0 Mbyte/s.

Sincerely,
Josip  


From josip@icase.edu Fri Dec  8 19:06:22 2000
Date: Fri, 08 Dec 2000 18:31:47 -0500
From: Josip Loncaric <josip@icase.edu>
To: Patrick Geoffray <pgeoffra@cs.utk.edu>
Cc: beowulf@beowulf.org
Subject: Re: LAM SMP performance

Patrick Geoffray wrote:
> 
> On another hand, the message can be asynchronous and the
> cache can be trashed on the receiving side before the user application
> uses the payload.

I forgot to say that this is not a "data push" situation.  It is the
receiver's act of picking up the payload that activates the
cache-to-cache transfer, because (thanks to cache snooping) the sender's
CPU detects that the receiver's CPU is trying to access modified data in
the sender's cache.  The sender's CPU signals this to the receiver (via
the HITM# signal line) and performs an implicit write-back of the
modified data.  Intel's PII manual states that "The implicit write-back
is transferred directly to the initial requesting processor and snooped
by the memory controller to assure that system memory has been updated."
This single step gives the receiver's CPU the sender's data, while the
memory controller updates RAM.  This situation is very likely when the
receiver acts immediately.

However, if the receiver is busy doing something else for a while and
then decides to act, it could find that the sender's copy has long gone
from cache to RAM.  Then, the receiver would have to reload the data
from RAM.  Since the receiver acted so slowly, this outcome seems fair
to me.

BTW, since spinlocks are so fast, the probability of finding the data
still in the sender's cache is greater.

Sincerely,
Josip

