
LAM/MPI General User's Mailing List Archives


From: Josh Hursey (jjhursey_at_[hidden])
Date: 2007-03-22 13:24:33


Interesting. What versions of LAM/MPI and BLCR are you using? Can you
checkpoint/restart a non-MPI application on each of the two machines
individually?
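
A minimal sketch of such a single-node check, to be run on each machine in turn. It assumes BLCR's cr_run, cr_checkpoint, and cr_restart utilities are on the PATH and that cr_checkpoint writes its default context.<pid> file into the current directory. The function name blcr_smoke_test, the delay argument, and the ./myapp binary are hypothetical placeholders, not part of BLCR.

```shell
# Sketch: checkpoint and restart a plain (non-MPI) program on one node.
# blcr_smoke_test and its arguments are illustrative names, not BLCR tools.
blcr_smoke_test() {
    app=$1
    delay=${2:-5}                    # seconds to let the program run first

    cr_run "$app" &                  # start the program with BLCR support loaded
    pid=$!
    sleep "$delay"

    cr_checkpoint "$pid" || return 1 # assumed to write ./context.<pid>

    # Stop the original process so the restart can reuse its PID.
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true

    cr_restart "context.$pid"        # resume from the saved context file
}
```

For example, `blcr_smoke_test ./myapp 5` on each node; if the checkpoint or the restart fails on either machine, the problem is likely in the BLCR setup rather than in LAM/MPI.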

If you can do all of that, I'd be interested in seeing a debugging
backtrace (say, from gdb) of mpirun and of the processes it launched.
That should tell us where they got stuck or what they are waiting on.
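
One way to collect those backtraces, as a rough sketch: attach gdb in batch mode to each PID and dump all threads. It assumes gdb is installed on every node and that you own the target processes. The function name dump_backtraces and the GDB override variable are hypothetical conveniences.

```shell
# Sketch: print a backtrace of every thread in each PID given as an argument.
# dump_backtraces and the GDB variable are illustrative, not part of LAM/MPI.
dump_backtraces() {
    for pid in "$@"; do
        echo "=== backtrace for PID $pid ==="
        # -batch makes gdb run the command, detach, and exit on its own.
        ${GDB:-gdb} -p "$pid" -batch -ex "thread apply all bt" 2>&1
    done
}
```

On the node running mpirun this could be called as `dump_backtraces 10411` (the mpirun PID from the original report), and on each node with the PIDs of the lamtest processes, for instance those listed by `pgrep lamtest`.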

Cheers,
Josh

On Mar 13, 2007, at 10:19 AM, Fu HongYi wrote:

> Hi everyone,
> I am working with BLCR and LAM/MPI, but something is going wrong and I
> don't know why.
> First I installed BLCR and LAM/MPI. Then I tested checkpoint/restart
> (C/R) on an MPI program running on a single node, and everything went
> smoothly: the program ran, the checkpoint commands executed
> successfully, the context files were generated, and the restart also
> worked. Later I tried the same experiment on a 2-node cluster, where
> it failed. I started the MPI program with the command:
>
> mpirun -np 2 -ssi rpi crtcp -ssi cr blcr C ./lamtest
>
> While the program was running, I took a checkpoint with the command:
>
> lamcheckpoint -ssi cr blcr -pid 10411 (*10411 is the pid of mpirun.)
>
> The command then hung and never returned until I pressed Ctrl-C.
> I checked the working directory, i.e., my home directory, and found no
> context file. However, some temporary files named
> .context-xxxxx-xx.tmp were present.
> Could someone please tell me what the problem is? I would much
> appreciate it.
> Thanks.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/