
LAM/MPI Development Mailing List Archives


From: Yuan Tang (yuantang_at_[hidden])
Date: 2006-01-03 23:05:29


BTW: The LAM/MPI version I use is 7.1.1

The problem seems to be that when I invoke

>cr_checkpoint --term $PID_of_mpirun

only the mpirun process is checkpointed; none of the other MPI processes
are. So when I call

>cr_restart context.$PID_of_mpirun

it always reports:
>mpirun (rpwait): Bad file descriptor

Evidently, no image of any MPI process was checkpointed. But why? I have
checked the output parameter of MPI_Init_thread(..., &provided), and the
thread level has indeed been set to MPI_THREAD_SERIALIZED.

Would you give me some hints?

Thanks!

Yuan


attached mail follows:


Dear LAM developer/maintainer,

I am installing LAM/MPI + BLCR on my Linux cluster. After installing
BLCR, everything seemed OK. I also tested cr_checkpoint/cr_restart with
a small C test program, and it worked fine. But when I compile an MPI
program with LAM, cr_checkpoint --term/cr_restart no longer works.
For example, I first launch the application with:
lamboot ./hostf_lam
/home/yuantang/local/bin/mpirun C -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH ${prog}

Then I invoke:
>cr_checkpoint --term $PID_of_mpirun

This generates the file context.$PID_of_mpirun in the current directory.
Then I invoke:
>cr_restart $PID_of_mpirun
It reports:
>mpirun (rpwait): Bad file descriptor
and exits.
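
To make the sequence easier to reproduce, here is a rough sketch of the checkpoint/restart steps as shell helpers. The function names (context_file, checkpoint_job, restart_job) are made up for illustration; the only real commands used are cr_checkpoint --term and cr_restart, and the context.<pid> file name is simply what I observed in the current directory.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of LAM or BLCR.

# Name of the context file that appeared in the current directory
# after checkpointing the process with the given PID (as observed above).
context_file() {
    printf 'context.%s\n' "$1"
}

# Checkpoint the process with the given PID and terminate it.
checkpoint_job() {
    cr_checkpoint --term "$1"
}

# Restart from the context file that belongs to the given PID.
restart_job() {
    cr_restart "$(context_file "$1")"
}
```

With these defined, the idea is to start mpirun in the background, record its PID with pid=$!, and then call checkpoint_job "$pid" followed by restart_job "$pid".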

My LAM configuration line is as follows:
./configure --prefix=/home/yuantang/local --with-rpi=crtcp
--with-threads=posix --with-wrapper-extra-ldflags
--with-cr-blcr=/home/yuantang/local --with-cr-base-file-dir=/tmp

Could you give me some hints about what might be wrong?

Thanks!

Yuan