Dear LAM developer/maintainer,
I am installing LAM/MPI + BLCR on my Linux cluster. After installing
BLCR, everything seemed OK. I also tested cr_checkpoint/cr_restart
with a small C test program, and it works fine. But when I compile an
MPI program with LAM, cr_checkpoint --term/cr_restart does not work
any more.
For example:
First, I start the application with:
lamboot ./hostf_lam
/home/yuantang/local/bin/mpirun C -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH ${prog}
Then I invoke:
>cr_checkpoint --term $PID_of_mpirun
This generates the file context.$PID_of_mpirun in the current directory.
Then I invoke:
>cr_restart $PID_of_mpirun
It reports:
>mpirun (rpwait): Bad file descriptor
and exits.
My LAM configuration line is as follows:
./configure --prefix=/home/yuantang/local --with-rpi=crtcp
--with-threads=posix --with-wrapper-extra-ldflags
--with-cr-blcr=/home/yuantang/local --with-cr-base-file-dir=/tmp
Could you give me some hints about what might be wrong?
Thanks!
Yuan