
LAM/MPI General User's Mailing List Archives


From: Heiko Bauke (heiko.bauke_at_[hidden])
Date: 2005-03-18 07:20:42


Hi,

On Thu, 17 Mar 2005 22:06:03 +0100
FARKAS Zoltan <zfarkas_at_[hidden]> wrote:

> Heiko Bauke wrote:
> > Dear all,
> >
> > I'm trying to use LAM/MPI 7.1.1 with Berkeley Lab Checkpoint/Restart
> > 0.4.0 and kernel 2.4.26. But I don't get things working correctly. Is
> > anybody using BLCR to checkpoint MPI applications?
[...]
> Try to checkpoint the first process started by mpirun instead of mpirun.
> I think this will work. (I've tried this a few months ago, and this has
> worked).

Thanks for your reply, but checkpointing the first process started by
mpirun did not solve my problems.

Today I also tried checkpointing with LAM/MPI 7.0.6, again without success,
but LAM/MPI 7.0.6 behaves differently. With LAM/MPI 7.1.1, a single
checkpoint file was saved that contains the context of mpirun but not the
context of the application, and no error messages were displayed. With
LAM/MPI 7.0.6, no checkpoint file was written at all, and the console where
mpirun was running showed the error message "rploadgov failed."

Another difference in behaviour between LAM/MPI 7.1.1 and 7.0.6 is that
under 7.0.6 mpirun is listed three times in the process table, while
under 7.1.1 mpirun appears only once. I don't know whether this is a bug
or a feature.
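
For anyone who wants to compare this on their own installation, here is a
minimal sketch of how the mpirun entries in the process table could be
counted. The function names read_process_table and mpirun_entries are made
up for illustration, and the sketch assumes a POSIX ps that accepts
"-e -o comm=". It only counts entries; it does not explain why 7.0.6 shows
three of them.

```python
import subprocess
from collections import Counter


def read_process_table() -> str:
    """Return one command name per line, as printed by `ps -e -o comm=`.

    Assumption: a POSIX-compatible `ps` is on the PATH.
    """
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def count_commands(ps_output: str) -> Counter:
    """Count how often each command name occurs in the given ps output."""
    names = (line.strip() for line in ps_output.splitlines())
    return Counter(name for name in names if name)


def mpirun_entries(ps_output: str) -> int:
    """Number of process-table entries whose command name is exactly 'mpirun'."""
    return count_commands(ps_output)["mpirun"]
```

Calling mpirun_entries(read_process_table()) while a job is running gives
the number of mpirun entries on that node, which makes it easy to compare
the two LAM/MPI versions side by side.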

If anybody has successfully done checkpointing with LAM/MPI and BLCR, I
would like to know their exact configuration.

        Heiko

--
-- Common sense in an uncommon degree is what the world calls
-- wisdom. (Samuel Coleridge, 1772-1834)
-- Supercomputing in Magdeburg @ http://tina.nat.uni-magdeburg.de
--                 Heiko Bauke @ http://www.uni-magdeburg.de/bauke