LAM/MPI General User's Mailing List Archives

From: Samuel M. Goosen (smgoosen_at_[hidden])
Date: 2005-07-18 08:29:25


Jeff,

I simply want to checkpoint a multi-node job, reboot the nodes and then
restart the job. No migration, but the original lamd(s) are gone.

Thanks,
Sam Goosen

> -----Original Message-----
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf
> Of Jeff Squyres
> Sent: Friday, July 15, 2005 8:29 PM
> To: General LAM/MPI mailing list
> Subject: Re: LAM: Lam and multinode BLCR
>
> On Jul 15, 2005, at 5:02 PM, Samuel M. Goosen wrote:
>
> > A very basic question:
> >
> > Has anyone been able to use BLCR to checkpoint and then restart a
> > multi-node job?
>
> Yes. :-)
>
> > According to this doc
> > http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
> >
> > I found:
> >
> > > BLCR Known Limitations:
> > > 1. BLCR doesn't support checkpointing of a process group yet.
> > > 2. To restart from a context file, the PID of the original process
> > > must NOT be in use.
> > > > 3. To restart from a context file, the original executables and
> > > > shared libraries used must exist and their contents must remain
> > > > the same.
>
> These look correct.
>
> > > As a result of limitation 2 and the fact that process IDs are not
> > > unique between nodes in a cluster, the integration discussed below
> > > will not migrate jobs between 2 different nodes.
>
> If you have a fairly homogeneous cluster, we've been able to restart on
> a different set of nodes than the one the job was originally running on
> (because the executable and all relevant shared libraries are available
> on the other nodes as well).
>
> So is your question just about restarting, or about migration?
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/