From rcferri@us.ibm.com Wed Apr 11 23:39:47 2001
Date: Thu, 5 Apr 2001 21:40:32 -0400
From: Richard C Ferri <rcferri@us.ibm.com>
To: Robert G. Brown <rgb@phy.duke.edu>
Cc: Giovanni Scalmani <Giovanni@lsdm.dichi.unina.it>, beowulf@beowulf.org
Subject: Re: Node cloning





     Since cloning continues to be a fertile topic, I'll jump right in...
if you're not interested in node installation or cloning, skip this note...

     I feel that the node installation problem using the NFS root/tftp/PXE
boot approach has been solved by LUI (oss.software.ibm.com/lui) and others.
I don't see why anyone would need to roll their own solution.  When one
defines a node to LUI, LUI creates a custom remote NFS root and updates
dhcpd.conf or /etc/bootptab with an entry for the node.  One chooses a set
of resources to install on the node, and creates a disk partition table.
Resources are lists of RPMs or tar files, custom kernels, and individual
files (/etc/hosts, /etc/resolv.conf, and /etc/shadow would be good
examples).  You either PXE boot the node, or boot from diskette using
etherboot technology.  The node boots, gets a custom boot kernel over the
network via tftp, and transfers control.  The kernel mounts the remote
root, and reads the list of allocated resources.  Based on the resources,
the node partitions the hard drive, creates filesystems, installs RPMs or
tar files, copies any specified files, installs a custom kernel, and so on.
The software configures the eth0 device based on the IP info for that
particular node, assigns a default route, and runs lilo to make the node
ready to boot.  If you allow rsh, LUI will also remove the /etc/bootptab
entry, and optionally reboot the node.  It keeps a log of all activity
during install.
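
     To give the flavor, the /etc/bootptab entry LUI sets up for a node
looks something like this (standard bootptab syntax; the values here are
made up):

     node01:ha=00A0C9123456:ip=192.168.1.101:sm=255.255.255.0:bf=node01.kernel

or the equivalent host stanza in dhcpd.conf if you use dhcp.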

     The goal of the LUI project is to install any distro on any
architecture (ia-32, Itanium, PowerPC and Alpha).  So far RedHat and ia-32
are supported; SuSE and PowerPC are in test but not ready for prime time.
It's an open source project, and open to contributors.  Since LUI is
resource based, and resources are reusable, it's perfect for heterogeneous
clusters, that is, clusters where nodes have different requirements.  Many
people have said that the NFS/tftp/pxe solution doesn't scale and should be
abandoned.  Well, users have installed 80-way clusters using LUI, and while
that's not huge, it's not dog meat either.

     Simple cloning, basically copying an image from one golden node to
another, changing some rudimentary info along the way, is performed today
by SystemImager, based on rsync technology.  rsync is superior to a simple
copy in that you can easily exclude files or directories (/var, for
example) and can be used for maintenance as well.  rsync does intelligent
copying for maintenance -- it copies only files that are different on the
source and target systems, and copies only the parts of each file that have
changed.  SystemImager and rsync are good solutions when the nodes in your
cluster are basically the same, except for IP info and disk size.
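
     The underlying idea, stripped down (a hand-rolled sketch of the rsync
approach, not SystemImager's actual invocation):

     # run on a node: pull the golden image over, skipping volatile trees
     rsync -a --delete --exclude=/proc --exclude=/var goldennode:/ /

SystemImager wraps this sort of thing up for you and adds the partitioning
and the per-node fixups.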

     Then there's kickstart.  Well, it's ok if you do RedHat.

     I think the real burning issue is not how to install nodes, but
*whether* to install nodes or embrace the beowulf 2 technology from Scyld.
I think Scyld is close to becoming the Linux beowulf appliance, a turnkey
commodity supercomputer.  It will be interesting to see how many new
clusters adopt traditional beowulf solutions, and how many adopt beowulf
2...

the view from here, Rich

Richard Ferri
IBM Linux Technology Center
rcferri@us.ibm.com
845.433.7920

"Robert G. Brown" <rgb@phy.duke.edu>@beowulf.org on 04/05/2001 06:47:46 P=
M

Sent by:  beowulf-admin@beowulf.org


To:   Giovanni Scalmani <Giovanni@lsdm.dichi.unina.it>
cc:   <beowulf@beowulf.org>
Subject:  Re: Node cloning



On Thu, 5 Apr 2001, Giovanni Scalmani wrote:

>
> Hi!
>
> On Thu, 5 Apr 2001, Oscar Roberto López Bonilla wrote:
>
> > And then use the command (this will take long, so you can do it
> > overnight)
> >          cp /dev/hda /dev/hdb ; cp /dev/hda /dev/hdc ; cp /dev/hda /dev/hdd
>
>   I also did it this way for my cluster, BUT I've experienced instability
> on some nodes (3 or 4 out of 20). My guess was that "cp /dev/hda /dev/hdb"
> also copied the bad-blocks list of hda onto hdb, and this looks wrong
> to me. So I partitioned and made the filesystems on each node and then
> cloned the content of each filesystem. Those nodes are now stable.
>
> A question to the 'cp gurus' out there: is my guess correct about
> the bad blocks list?

One of many possible problems, actually.  This approach to cloning
makes me shudder -- things like the devices in /dev generally have to be
built, not copied; there are issues with the boot blocks, the bad block
lists, and the bad blocks themselves on both target and host.  Raw
devices are dangerous things to use as if they were flat files.

Tarpipes (with tar configured the same way it would be for a
backup|restore but writing/reading stdout) are a much safer way to
proceed.  Or dump/restore pipes on systems that have it -- either one is
equivalent to making a backup and restoring it onto the target disk.
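
For example, something like this (a sketch only -- the hostname and mount
point are illustrative, and the target's filesystems have to be made and
mounted first):

   # on the golden node: stream / across to the new node's mounted root
   tar -C / --exclude=proc -cf - . | rsh newnode 'tar -C /mnt/newroot -xpf -'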
One reason I gave up cloning was that just cloning isn't enough.  (And I
gave up after investing many months writing a first-generation cloning
tool for nodes -- one that booted a diskless configuration, formatted a
local disk, and cloned itself onto the local disk -- and after starting a
second-generation GUI-driven one.)  There is all sorts of stuff that needs
to be done to the clones to give them a unique identity (even something as
simple as their own ssh keys); one needs to rerun lilo; and you have to
keep one "pristine" host to use as the master to clone, or you get the
very host configuration creep you set out to avoid.  Either way you
inevitably end up having to upgrade all the nodes or install security or
functionality updates.

These days there are just better ways (in my opinion) to proceed if your
goal is simple installation and easy upgrade/update and low maintenance.
Cloning is also very nearly an irreversible decision -- if you adopt
clone methods it can get increasingly difficult to maintain your cluster
without ALSO developing tools and methods that could just as easily have
been used to install and clean things up post-install.

Even so, if you are going to clone, I think that the diskless->local
clone is a very good way to proceed, because it facilitates
reinstallation and emergency operation of a node even if a hard disk
crashes (you can run it diskless while getting a replacement).  It does
require either a floppy drive ($15) or a PXE chip, but this is a pretty
trivial investment per node.

   rgb

--
Robert G. Brown                            http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu






_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From pdiaz88@terra.es Fri Jul  6 12:28:55 2001
Date: Fri, 6 Jul 2001 19:15:00 +0000
From: "Pedro [iso-8859-1] Daz Jimnez" <pdiaz88@terra.es>
To: Robert G. Brown <rgb@phy.duke.edu>, Eric Linenberg <elinenbe@umich.edu>
Cc: beowulf@beowulf.org
Subject: Re: Newbie who needs help!


On the subject of installation and disk cloning on cluster systems, I've
written a small article which can be reached here:
http://planetcluster.org/sections.php?op=viewarticle&artid=3
It's basically a compilation of my own experiences installing medium-sized
clusters, common errors I've found, and strategies used.

Hope you'll find it interesting

Regards
Pedro


On Friday 06 July 2001 14:39, Robert G. Brown wrote:
> On Thu, 5 Jul 2001, Eric Linenberg wrote:
> > Ok, this is going to be kind of long, but I figured there are people
> > out there with more experience than me, and I don't have the option to
> > mess up as I have to finish this project by Aug. 7th!
> >
> > I am working as a research assistant, and my task is to build an 8
> > node Beowulf cluster to run LS-DYNA (the world's most advanced
> > general purpose nonlinear finite element program (from their page))
> > (lstc.com)  My budget is $25,000 and I just want general help with
>
> A pretty generous budget for an eight node operation based on Intel or
> AMD, depending on what kind of networking the application needs.  I
> spent only $15K on a 16 node beowulf equipped with 1.33 GHz cpus
> (including the Home Depot heavy duty shelf:-).  Duals are actually
> generally cheaper on a per-CPU basis, although if you get large memory
> systems the cost of memory goes up very quickly.
>
> > where I should begin and what should be done to maximize the
> > upgradability (would it be possible to just image the disk -- change
> > the IP settings, maybe update a boot script to add another node to the
> > cluster?) and to maximize the performance (what are the benefits of
> > dual-processor machines -- what about gigabit network cards?)
>
> Any of the 15 "slave" nodes (not the server node, which is of course
> more complicated) can be reinstalled from scratch in between five and
> six minutes flat by simply booting the node with a kickstart-default
> floppy (no keyboard, monitor, or mouse required at any time).  I've
> reinstalled nodes in just this way to demonstrate to visitors just how
> easy it is to administer and maintain a decent beowulf design -- just
> pop the floppy in, press the reset button, and by the time I'm finished
> giving them a tour of the hardware layout the node reboots itself back
> into the state it was in when I pressed reset, but with a brand new disk
> image.
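>
> (For the curious: the floppy is just the stock Red Hat boot floppy with a
> syslinux.cfg that defaults to kickstart -- something like this, from
> memory, so treat it as illustrative:
>
>     default ks
>     label ks
>         kernel vmlinuz
>         append ks=floppy initrd=initrd.img
>
> with the ks.cfg for "a node" copied onto the same floppy.)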
>
> Then there is Scyld, which is even more transparently scalable but
> requires that you adopt the Scyld view and build a "true beowulf"
> architected cluster and think about the cluster a bit differently than
> just a "pile of PC's" with a distributed application (dedicated or not).
> Not being able to login to nodes, NFS mount from nodes, use nodes like
> networked workstations (headless or not) works for folks used to e.g.
> SP3's but isn't as mentally comfortable to folks used to using a
> departmental LAN as a distributed computing resource.
>
> Finally, as has been discussed on the list a number of times, yes, you
> can maintain "preinstalled" disk images and write post-install scripts
> to transform a disk image into a node with a particular name and IP
> number.  Although I've written at least four generations worth of
> scripts to do just this over the last 14 years or so (some dated back to
> SunOS boxes) I have to say that I think that this is the worst possible
> solution to this particular problem.  Perhaps it is >>because<< I've
> invested so much energy in it for so long that I dislike this approach
> -- I know from personal experience that although it scales better than a
> one-at-a-time installation/maintenance approach, it sucks down immense
> amounts of personal energy to write and/or tune the scripts used and it
> is very difficult and clumsy to maintain.
>
> For example, if your cluster paradigm is a NOW/COW arrangement and the
> nodes aren't on a fully protected private network (with a true
> firewall/gateway/head node between them and the nasty old Internet with
> all of its darkness and impurity and evil:-) then you will (if you are a
> sane person, and you seem sensible enough) want to religiously and
> regularly install updates on whatever distribution you put on the nodes.
> After all, there have been wide open holes in every release of every
> networked operating system (that claimed to have a security system in
> the first place) ever made.  If you don't patch these holes as they are
> discovered in a timely way, you are inviting some pimple-faced kid in
> Arizona or some juvenile entrepreneur in Singapore to put an IRC server
> or a SPAM-forwarder onto your systems.  If you use the image-based
> approach, you will have to FIRST upgrade your image, THEN recopy it to
> all of your nodes and run your post-install script.  If any of the
> software packages in your upgrade/update that interact with your
> post-install script have changed, you'll have to hand edit and test your
> post-install script.  Even so, there is a good chance that you'll have
> to reinstall all the nodes more than once to get it all right.
>
> Sure, there are alternatives.  You can maintain a single node as a
> template (you'll have to anyway) and then write a fairly detailed
> rsync-based script to synchronize the images of your template and your
> nodes, but not >>this<< file or >>that<< file, and even so if you do a
> major distribution upgrade you'll simply have to reinstall from the bare
> image as replacing e.g. glibc on a running system is probably not a good
> idea.  No matter how you cut it, you'll end up doing a fair amount of
> work to keep your systems sync'd and current and quite a lot of work for
> a full upgrade.
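>
> (The core of such a script is only a few lines -- run on the template,
> with the per-node files to skip listed in an exclude file; illustrative
> only:
>
>     for n in node01 node02 node03; do
>         rsync -a --delete --exclude-from=/etc/rsync-excludes / ${n}:/
>     done
>
> The devil is in what goes into the exclude file.)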
>
> Compare with the kickstart alternative above.  The ONLY work is in
> building the kickstart file for "a node", which is mostly a matter of
> selecting packages for the install and yes, writing a post-install
> script to handle any site-specific customization.  The post-install
> script will generally NOT have to mess with specific packages, though,
> since their RPMs already contain the post-install instructions
> appropriate for seamless installation as a general rule.  At most it
> will have to install the right fstab, set up NIS or install the default
> /etc/password and so forth -- the things that have to be done regardless
> of the (non-Scyld) node install methodology.
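>
> A minimal sketch of what I mean -- illustrative only, not my actual file;
> the package group and paths are made up:
>
>     # ks.cfg fragment
>     %packages
>     @ Networked Workstation
>
>     %post
>     # site-specific customization, run inside the fresh install:
>     # pull the right fstab and the default passwd from the head node
>     rcp head:/export/config/fstab /etc/fstab
>     rcp head:/export/config/passwd /etc/passwd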
>
> Regarding the scaling to more nodes -- the work is truly not significant
> and how much there is depends on how much you care that a particular
> node retains its identity.  I tend to boot each new node twice -- once
> to get its ethernet number for the dhcpd.conf table entry that assigns
> it its own particular IP number, and once to do the actual install.
> This is laziness on my part -- if I were more energetic (or had 256 nodes
> to manage and were PROPERLY lazy:-) I'd invest the energy in developing
> a schema whereby nodes were booted with an IP number from a pool during
> the install while gleaning their ethernet addresses and then e.g. run a
> secondary script on the dhcpd server to install all the gleaned
> addresses with hard IP numbers and do a massive reboot of the newly
> installed nodes.  Or something -- there are several other ways to
> proceed.  However, with only 8-16 nodes it is hardly worth it to mess
> with this as it takes only twenty seconds to do a block copy in the
> dhcpd.conf and edit the ethernet numbers to correspond to what you pull
> from the logs -- maybe five minutes total for 8 nodes, and even the
> simplest script would take a few hours to write and test.  For 128 nodes
> it is worth it, of course.
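>
> (The dhcpd.conf entries in question are just per-node host blocks like
> this one -- addresses made up:
>
>     host node01 {
>         hardware ethernet 00:A0:C9:12:34:56;
>         fixed-address 192.168.1.101;
>     }
>
> which is why the block copy and edit goes so quickly.)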
>
> > Another concern here is actual floor space.  We have about a 6ft x 3ft
> > area for the computers, so I think I am just going to be putting them
> > onto a Home Depot industrial shelving system or something similar, so
> > dual processor systems may be much better for me.  Cooling and
> > electricity have both already been taken care of.
>
> With only 8 nodes the space is adequate and they should fairly easily
> run on a single 20 Amp circuit.  You're right at the margin where
> cooling becomes an issue -- you'll be burning between one and two
> kilowatts, sustained, with most node designs -- depending mostly on
> whether they are single or dual processor nodes.
>
> > I appreciate any help that is provided as I know someone out there has
> > had similar experiences (possibly with this software package)
>
> I'm afraid I cannot help you with the software package, but I can still
> give you some generic advice -- in one sense you MAY be heavily
> overbudgeted for only 8 nodes.  I actually favor answering all the
> architectural questions before setting the budget and not afterwards,
> but I'm also fully aware that this isn't always how things work in the
> real world.
>
> What you need to do (possibly by checking with the authors of the
> package itself) is to figure out what is likely to bottleneck its
> operation at the scales you wish to apply it.  Is it CPU bound (a good
> thing, if so)?  Then use your budget to get as many CPU cycles as
> possible per dollar and minimize your networking and memory expenses
> (e.g. cheap switched 100BT and just get the smallest memory that will
> comfortably hold your application, or at least 256 MB, whichever is
> larger).  Is it memory I/O bound (lots of vector operations, stream-like
> performance)?  Then investing in DDR-equipped Athlons or perhaps a P4
> with a wider memory path may make sense.  Look carefully at stream or
> cpu-rate benchmarks and optimize cost benefit in the occupied memory
> size regime you expect to run the application at.  Is it a "real
> parallel application" that has moderate-to-fine granularity, may be
> synchronous (has barriers where >>all<< the nodes have to complete a
> subtask before computation proceeds on >>any<< node)?  In that case your
> budget may be about right for eight nodes as you'll need to invest in a
> high-end network like myrinet or possibly gigabit ethernet.  In either
> case you may find yourself actually spending MORE on the networking per
> node than you do on the nodes themselves.
>
> Finally, you need to think carefully about the single vs dual node
> alternatives.  If the package is memory I/O bound AT ALL on a single CPU
> it is a BAD IDEA to get a dual packaging as you'll simply ensure that
> one CPU is often waiting for the other CPU to finish using memory so it
> can use memory.  You can easily end up paying for two processors in one
> node and getting only 1.3-1.4x as much work done as you would with two
> processors in two nodes.  You also have to think carefully about duals if
> you are network bound -- remember, both CPUs in a dual will be sharing a
> single bus structure and quite possibly sharing a single NIC (or bonded
> channel).  Again, if your computation involves lots of communication
> between nodes, one CPU can often be waiting for the other to finish
> using the sole IPC channel so it can proceed.  Waiting is "bad".  We
> hate waiting.  Waiting wastes money and our time.
>
> Generally, duals make good economic sense for strictly CPU bound tasks
> and "can" make decent sense for certain parallel computation models
> where the two CPUs can sanely share the communications resource or where
> one CPU manages net traffic while the other does computations.  The
> latter can often be accomplished just as well with better/higher end
> communications channels, though -- you have to look at the economics and
> scaling.
>
> Given a choice between myrinet and gigabit ethernet, my impressions from
> being on the list a long time and listening are that myrinet is pretty
> much "the best" IPC channel for parallel computations.  It is very low
> latency, very high bandwidth, and puts a minimal burden on the CPU when
> operating.  Good drivers exist for the major parallel computation
> libraries e.g. MPI.  Check to make sure your application supports its
> use if it is a real parallel app.  It may be that gigabit ethernet is
> finally coming into its own -- I personally have no direct experience
> with either one as my own tasks are generally moderately coarse grained
> to embarrassingly parallel and I don't need high speed networking.
>
> Hope some of this helps.  If you are very fortunate and your task is CPU
> bound (or only weakly memory bound) and coarse grained to EP and will
> fit comfortably in 512-768 MB of memory, you can probably skip the
> eight-node-cluster stage altogether.  If you build a "standard" beowulf
> with switched 100BT and nodes with minimal gorp (a floppy and HD,
> memory, a decent NIC, perhaps a cheap video card) you can get 512 MB
> DDR-equipped bleeding edge (1.4 GHz) Athlon nodes for perhaps $850
> apiece.  (Cheap) switched 100Base ports cost anywhere from $10 each to
> perhaps $30 each in units from 8 to 40 ports.  You can easily do
> something like:
>
> 23 $900 nodes = $20700
> 1 $2000 "head node" with lotsa disk and maybe a Gbps ethernet NIC
> 1 <$1000 24-port 100BT switch with a gigabit port/uplink for your head node
> $500 for shelving etc.
>
> That comes to roughly $24200 all told, so you could build a 24 node 'wulf
> easily for your $25K budget.  Even if
> you have to get myrinet for each node (and hence spend $2000/node) you
> can probably afford 12 nodes, one equipped as a head node.
>
> Good luck.
>
>     rgb

-- 

  __________________________________________________
 /                                                  \
 | Pedro Diaz Jimenez                               |
 |                                                  |
 | pdiaz88@terra.es      pdiaz@acm.asoc.fi.upm.es   |
 |                                                  |
 |                                                  |
 | http://planetcluster.org                         |
 | Clustering & H.P.C. news and documentation       |
 |                                                  |
 | There are no stupid questions, but there's a lot |
 | of inquisitive idiots                            |
 |        Anonymous                                 |
 \__________________________________________________/


From scott@moriarty.chem.ualberta.ca Thu Aug 30 08:40:40 2001
Date: Wed, 29 Aug 2001 17:12:37 -0600 (MDT)
From: Scott Delinger <scott@moriarty.chem.ualberta.ca>
Reply-To: scott.delinger@ualberta.ca
To: rgb@phy.duke.edu
Subject: compute node cloning

Robert,

I know you use RH's kickstart, but I thought this one deserves a mention,
as I am using it to roll out compute nodes (and master nodes!) now. In
fact, I do this recursively:

master node (with client node image) imaged.

Clone master node for next 'wulf, then use the client node image to begin
cloning client nodes.

Also, you can edit the image directly (which .iso images wouldn't allow).

http://www.systemimager.org/
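
The day-to-day usage is simple -- roughly this, with the option spellings
from memory, so check the docs before trusting them:

    # on the image server: capture the golden client's image
    getimage -golden-client node01 -image compute-v1

    # on each client to be laid down or re-synced:
    updateclient -image compute-v1 -server imageserver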

I expect it to save about three weeks of time on the cloning of five
clusters this "summer" (which includes fall this year).

Scott

-- 
Scott Delinger, Ph.D.                      scott.delinger@ualberta.ca
I.T. Administrator
Department of Chemistry
University of Alberta
Edmonton, Alberta, Canada T6G 2G2

