On Jul 15, 2005, at 5:02 PM, Samuel M. Goosen wrote:
> A very basic question:
>
> Has anyone been able to use BLCR to checkpoint and then restart a
> multi-node job?
Yes. :-)
> According to this doc
> http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
>
> I found:
>
> > BLCR Known Limitations:
> > 1. BLCR doesn't support checkpointing of a process group yet.
> > 2. To restart from a context file, the PID of the original process
> > must NOT be in use.
> > 3. To restart from a context file, the original executables and
> > shared libraries used must exist and their contents must remain
> > the same.
These look correct.
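As a rough illustration of limitations 2 and 3, here is a minimal Python sketch of a pre-restart sanity check. It is not part of BLCR: the helper names (pid_in_use, changed_files, restart_looks_safe) and the idea of a manifest of SHA-256 digests recorded at checkpoint time are assumptions made for this example, and it assumes a POSIX system.

```python
import errno
import hashlib
import os


def pid_in_use(pid):
    """Return True if a process with this PID currently exists on this node."""
    if pid <= 0:
        # Zero and negative values address process groups in kill(2).
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only, nothing is sent
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            return False  # no such process
        if exc.errno == errno.EPERM:
            return True  # process exists but belongs to another user
        raise
    return True


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(manifest):
    """Return the paths that are missing or differ from the recorded digests.

    manifest is a hypothetical dict mapping each executable or shared
    library path to the digest recorded when the checkpoint was taken.
    """
    bad = []
    for path, digest in manifest.items():
        if not os.path.isfile(path) or sha256_of(path) != digest:
            bad.append(path)
    return sorted(bad)


def restart_looks_safe(pid, manifest):
    """True if the original PID is free and no recorded file has changed."""
    return not pid_in_use(pid) and not changed_files(manifest)
```

A check like this would be run on the target node just before attempting the restart. It only reports whether the two conditions appear to hold at that moment; the PID could still be taken between the check and the restart.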
> > As a result of limitation 2 and the fact that process IDs are not
> > unique between nodes in a cluster, the integration discussed below
> > will not migrate jobs between 2 different nodes.
With a fairly homogeneous cluster, we've been able to restart a job on
a different set of nodes than it was originally running on, because
the executable and all relevant shared libraries are available on the
other nodes as well.
So is your question just about restarting, or about migration?
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/