From charwel@chthry.chem.lsu.edu Fri Nov 10 12:34:00 2000
Date: Tue, 19 Oct 1999 20:58:30 -0500 (CDT)
From: Chris <charwel@chthry.chem.lsu.edu>
To: beuwulf list <beowulf@cesdis1.gsfc.nasa.gov>
Subject: checkpointing?

hi,

checkpointing was recently mentioned on the list. i am trying to learn
more about it and see whether it works for my apps.

currently, i know of two sources for linux checkpointing:
http://www.cs.rochester.edu/~edpin/epckpt/  (for 2.2.1)
http://bioinfo.mbb.yale.edu/~wkrebs/queue.html 

1) any positive or negative experiences with these? -- please share :>
2) are there other resources i should check out?

thanks,
chris
charwel@chthry.chem.lsu.edu

-------------------------------------------------------------------
To unsubscribe send a message body containing "unsubscribe"
to beowulf-request@beowulf.org

From llonergan@hpti.com Fri Nov 10 12:34:46 2000
Date: Wed, 20 Oct 1999 08:37:33 -0400
From: Luke Lonergan <llonergan@hpti.com>
To: beuwulf list <beowulf@cesdis1.gsfc.nasa.gov>
Cc: Chris <charwel@chthry.chem.lsu.edu>
Subject: RE: checkpointing?

> currently, i know of two sources for linux checkpointing:
> http://www.cs.rochester.edu/~edpin/epckpt/  (for 2.2.1)
> http://bioinfo.mbb.yale.edu/~wkrebs/queue.html

The first is Eduardo's epckpt. The second, GNU Queue, uses epckpt, so in
fact both rely on the same kernel checkpointing scheme. Unfortunately for
us Alpha aficionados, Eduardo's routine only works on x86 machines.

> 1) any positive or negative experiences with these? -- please share :>
> 2) are there other resources i should check out?

The negatives on epckpt are that it doesn't support Alpha and that it
doesn't handle open file descriptors or messages in flight (open sockets).
To be safe, the user's program still needs to catch a signal and prepare
for the checkpoint itself.
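
As a rough illustration of the "catch a signal and prepare" step, here is a
minimal sketch in Python. It is not epckpt's or GNU Queue's interface. The
choice of SIGUSR1 as the notification signal, the ResumableLog helper, and
the run() loop are all assumptions made up for this example. The idea is
only that the handler sets a flag, and the program closes its open file at a
safe point so that no descriptor is open when the process image is taken.

```python
import os
import signal

# Assumption: SIGUSR1 announces an upcoming checkpoint. The signal actually
# used depends on the checkpointer or batch system and is not given here.
CHECKPOINT_SIGNAL = signal.SIGUSR1

checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Only set a flag; the real work happens at a safe point in run()."""
    global checkpoint_requested
    checkpoint_requested = True


signal.signal(CHECKPOINT_SIGNAL, request_checkpoint)


class ResumableLog:
    """Hypothetical helper: an output file that can be closed and reopened."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "a")

    def write(self, text):
        self.handle.write(text)

    def quiesce(self):
        # Flush to disk and close, so no descriptor is open at snapshot time.
        self.handle.flush()
        os.fsync(self.handle.fileno())
        self.handle.close()
        self.handle = None

    def resume(self):
        # Reopen in append mode and continue where the file left off.
        self.handle = open(self.path, "a")


def run(steps, log):
    """Do `steps` units of work; return how many checkpoints were prepared."""
    global checkpoint_requested
    prepared = 0
    for step in range(steps):
        log.write(f"step {step}\n")
        if checkpoint_requested:
            log.quiesce()
            # A checkpointer would snapshot the process at this point.
            log.resume()
            checkpoint_requested = False
            prepared += 1
    return prepared
```

Sockets and messages in flight would need the same treatment (drain, close,
reconnect after restart), which this sketch does not attempt.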

I don't know how well GNU Queue functions for checkpoint/restart. I do know
that Greg Lindahl's recent fixes/additions to the PBS system for
checkpoint/restart work well. We will be releasing them to the community
after they get used/hardened a bit.

Luke

