BTW: The LAM/MPI version I use is 7.1.1
The problem seems to be that when I invoke
>cr_checkpoint --term $PID_of_mpirun
only the mpirun process is checkpointed; none of the other MPI processes
are. So when I call
>cr_restart context.$PID_of_mpirun
it always reports:
>mpirun (rpwait): Bad file descriptor
That is to be expected, since no MPI process image was checkpointed. But why?
I have checked the output parameter of MPI_Init_thread(..., &provided), and
the thread level has indeed been set to MPI_THREAD_SERIALIZED.
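
To make the "only mpirun is checkpointed" observation easier to verify, here is a minimal shell sketch that lists the context files actually present. The function name list_contexts is hypothetical. It assumes that every checkpoint image is named context.*, as the mpirun one is, and that the images of the other MPI processes, if any were written, would be in the current directory or in the /tmp directory given to --with-cr-base-file-dir in the configure line of the attached mail. Both are assumptions, not confirmed LAM behaviour.

```shell
#!/bin/sh
# list_contexts DIR...: print every regular file named context.* found
# directly inside the given directories (hypothetical helper, sketch only).
list_contexts() {
    for dir in "$@"; do
        for f in "$dir"/context.*; do
            # An unmatched glob stays literal and fails the -f test.
            [ -f "$f" ] && echo "$f"
        done
    done
    return 0
}
```

Calling `list_contexts . /tmp` right after cr_checkpoint would then show whether anything besides context.$PID_of_mpirun was written.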
Would you give me some hints?
Thanks!
Yuan
attached mail follows:
Dear LAM developer/maintainer,
I am now installing LAM/MPI + BLCR on my Linux cluster. After
installing BLCR, everything seemed OK. I also tested
cr_checkpoint/cr_restart with a small C test program, and it works fine.
But when I compile an MPI program with LAM, cr_checkpoint
--term/cr_restart no longer works.
i.e.
First, I start the application with:
lamboot ./hostf_lam
/home/yuantang/local/bin/mpirun C -ssi rpi crtcp -ssi cr blcr \
    -x LD_LIBRARY_PATH ${prog}
Then I invoke:
>cr_checkpoint --term $PID_of_mpirun
This generates the file context.$PID_of_mpirun in the current directory.
Then I invoke:
>cr_restart $PID_of_mpirun
It reports:
>mpirun (rpwait): Bad file descriptor
and exits.
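
For reference, the checkpoint and restart steps above can be written as a small shell sketch. The function names run and ckpt_sequence are hypothetical, and DRY_RUN defaults to 1 so that the commands are only printed, not executed. The restart is given the context file name (context.PID), since that is the file cr_checkpoint generates; this is a sketch of the sequence as described, not a fix for the error.

```shell
#!/bin/sh
# Sketch only: wraps the two BLCR commands quoted in this mail.
# With DRY_RUN=1 (the default) each command is printed instead of run.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# ckpt_sequence PID: checkpoint and terminate mpirun, then restart it
# from the context file written to the current directory.
ckpt_sequence() {
    pid="$1"
    run cr_checkpoint --term "$pid" || return 1
    # In a real run, stop early if no context file was produced.
    if [ "${DRY_RUN:-1}" != "1" ] && [ ! -f "context.$pid" ]; then
        echo "context.$pid was not created" >&2
        return 1
    fi
    run cr_restart "context.$pid"
}
```

With the job already started by mpirun, `DRY_RUN=0 ckpt_sequence $PID_of_mpirun` would execute the two steps for real.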
My LAM configuration line is as follows:
./configure --prefix=/home/yuantang/local --with-rpi=crtcp \
    --with-threads=posix --with-wrapper-extra-ldflags \
    --with-cr-blcr=/home/yuantang/local --with-cr-base-file-dir=/tmp
Could you give me some hints about what might be wrong?
Thanks!
Yuan