A very basic question:
Has anyone been able to use BLCR to checkpoint and then restart a multi-node
job?
According to this doc
http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf
I found:
> BLCR Known Limitations:
> 1. BLCR doesn't support checkpointing of a process group yet.
> 2. To restart from a context file, the PID of the original process
> must NOT be in use.
> 3. To restart from a context file, the original executables and shared
> libraries used must exist and their contents must remain the same.
> As a result of limitation 2 and the fact that process IDs are not
> unique between nodes in a cluster, the integration discussed below
> will not migrate jobs between 2 different nodes.
> (Note: a little trick can be used here by checking the status of the
> cr_restart command within the shell script and resubmitting the job.)
>
> "Berkeley Lab Checkpoint/Restart (BLCR) is a kernel module that allows
> you to save a process to a file and restore the process from the file.
> This file is called a context file. A context file is similar to a
> core file, but a context file holds enough information to continue
> running the process. A context file can be created at any point in a
> process's execution. The process may be resumed from that point at a
> later time, or even on a different workstation."
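
For what it's worth, here is how I read the "little trick" in the note quoted above: try the restart, check its exit status, and resubmit the job if it failed. This is only a rough sketch under my own assumptions, not something taken from the doc. The function name restart_or_resubmit is made up, both commands are passed in as argument lists, and treating any non-zero exit status from the restart as "failed" is my guess at what the note intends.

```python
import subprocess


def restart_or_resubmit(restart_cmd, resubmit_cmd):
    """Try to restart from a checkpoint; resubmit the job if that fails.

    Both arguments are command lines given as lists of strings.
    Returns "restarted" if restart_cmd exits with status 0,
    "resubmitted" if it fails but resubmit_cmd exits with status 0,
    and "failed" otherwise.
    """
    try:
        restart_status = subprocess.run(restart_cmd).returncode
    except OSError:
        # Command missing or not executable: treat it like a failed restart.
        restart_status = 1

    if restart_status == 0:
        return "restarted"

    # Assumption: a non-zero status means the restart did not work
    # (for example, the original PID was already in use on this node).
    try:
        resubmit_status = subprocess.run(resubmit_cmd).returncode
    except OSError:
        return "failed"

    return "resubmitted" if resubmit_status == 0 else "failed"
```

In a job script this might be called as restart_or_resubmit(["cr_restart", context_file], ["qsub", job_script]), where context_file and job_script are placeholder paths of my own. Note that the restart command blocks for as long as the restarted process runs, so the resubmission only happens once it has exited with an error.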
Sam Goosen
PBS Pro support