LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-03-18 07:37:07


On Mar 18, 2005, at 7:20 AM, Heiko Bauke wrote:

>> Try to checkpoint the first process started by mpirun instead of
>> mpirun itself. I think this will work. (I tried this a few months
>> ago, and it worked.)
>
> Thanks for your reply. But checkpointing the first process started by
> mpirun did not solve my problem.

That advice is actually not correct. You must checkpoint mpirun itself,
not an individual MPI process.

> Today I also tried checkpointing with LAM/MPI 7.0.6, without success,
> but LAM/MPI 7.0.6 behaves differently. With LAM/MPI 7.1.1, a single
> checkpoint file was saved, containing the context of mpirun but not
> the context of the application, and no error messages were displayed.
> With LAM/MPI 7.0.6, no checkpoint file was written at all, but the
> error message "rploadgov failed." was displayed in the console where
> mpirun was running.
>
> Another difference in the behaviour of LAM/MPI 7.1.1 and 7.0.6 is that
> under 7.0.6 mpirun is listed three times in the process table, while
> under 7.1.1 mpirun appears only once. I don't know whether this is a
> bug or a feature.

This is the Linux 2.4 "feature" of threads showing up as separate
entries in the process table. So you're seeing the multiple threads
inside a single mpirun, not three separate mpirun processes.
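
Putting the two points above together: the thing to checkpoint is mpirun, and on Linux 2.4 mpirun can show up several times in the process table, so it helps to pick the right entry. The following is only a rough sketch, not anything shipped with LAM/MPI. The names `ProcRow`, `pick_mpirun_pid`, and `checkpoint_command` are hypothetical, the rows are assumed to come from something like `ps -eo pid,ppid,comm`, and it assumes that thread entries appear as descendants of the original mpirun entry. It also assumes BLCR's `cr_checkpoint` is the checkpointer in use, which this thread does not actually state.

```python
from typing import Iterable, List, NamedTuple


class ProcRow(NamedTuple):
    """One process-table row (hypothetical shape, e.g. parsed from ps output)."""
    pid: int
    ppid: int
    comm: str


def pick_mpirun_pid(rows: Iterable[ProcRow], name: str = "mpirun") -> int:
    """Return the PID of the original mpirun entry.

    Assumption: when threads are listed as separate entries, each thread
    entry has another entry of the same name as its parent, so the original
    process is the only match whose parent is not itself a match.
    """
    matches = [r for r in rows if r.comm == name]
    if not matches:
        raise LookupError(f"no process named {name!r} found")

    match_pids = {r.pid for r in matches}
    roots = [r for r in matches if r.ppid not in match_pids]
    if len(roots) != 1:
        # More than one independent mpirun job is running; do not guess.
        raise LookupError(f"expected one {name!r} job, found {len(roots)}")
    return roots[0].pid


def checkpoint_command(pid: int) -> List[str]:
    """Build the argument list for checkpointing the given PID.

    Assumption: BLCR's cr_checkpoint is installed and takes the target PID
    as its argument.
    """
    return ["cr_checkpoint", str(pid)]
```

The resulting list could be handed to `subprocess.run`. The point of the sketch is only that the PID passed to the checkpointer is mpirun's, never that of one of the MPI processes it launched.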

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/