Hello -
It looks like we didn't cover your particular case very well (using
LAM's checkpoint/restart code under a batch scheduler). Shame on
us. Can you apply the attached patch and rebuild? I think this will
solve the problem, but I don't have a system running both PBS and C/R
right now, so I haven't been able to test it.
Thanks,
Brian
On Sep 16, 2005, at 10:02 AM, thanhtn wrote:
> Hi all,
>
> I am using LAM (v7.0.6) + BLCR (v0.3.1) + PBS (v2.3.16).
> I think I installed them correctly because:
> - I can submit jobs successfully.
> - I can checkpoint and restart an MPI program
> that is run directly with the mpirun command.
> But when I submit an MPI job (my script is below), I can
> checkpoint the mpirun process and it generates a context file
> (and each MPI process has a context file), but I can't
> restart it.
>
> - Here is my script:
> #!/bin/sh
> #PBS -l walltime=10:00:00
> #PBS -l mem=400mb
> #PBS -l ncpus=2
> #PBS -j oe
>
> lamboot
> mpirun N -ssi rpi crtcp -ssi cr blcr ./hello
> lamhalt
>
> - I submit the job with the command:
> qsub myscript
> - I checkpoint with the command:
> cr_checkpoint <PID of mpirun>
> - I restart with the command:
> cr_restart <context file>
> - Although I had run lamboot, I still get this error:
> -----------------------------------------------------------------------------
> It seems that there is no lamd running on the host may15.
>
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "mpirun" command.
>
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment. See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
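
For reference, the restart step quoted above can be guarded with a quick
check that a lamd is actually reachable from the shell doing the restart.
This is only a minimal sketch: it is not part of the attached patch and
is not a fix for the PBS case. It assumes that `lamnodes` exits non-zero
when no lamd is running; `LAMNODES`, `CR_RESTART`, and
`restart_if_lam_up` are hypothetical names introduced here for
illustration.

```shell
#!/bin/sh
# Sketch only: refuse to call cr_restart unless a lamd appears reachable.
# LAMNODES and CR_RESTART are hypothetical override variables (not part of
# LAM or BLCR) so the underlying tools can be swapped out.
LAMNODES=${LAMNODES:-lamnodes}
CR_RESTART=${CR_RESTART:-cr_restart}

restart_if_lam_up() {
    context_file=$1

    # The context file written by cr_checkpoint must be readable.
    if [ ! -r "$context_file" ]; then
        echo "context file not readable: $context_file" >&2
        return 2
    fi

    # Assumption: lamnodes fails when no lamd is reachable from this shell.
    if ! "$LAMNODES" >/dev/null 2>&1; then
        echo "no lamd reachable from this shell; not restarting" >&2
        return 1
    fi

    "$CR_RESTART" "$context_file"
}
```

Calling `restart_if_lam_up <context file>` from the same environment as
the failed restart should at least show whether the lamd is visible
there before cr_restart is attempted.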