Hi,
On Thu, 17 Mar 2005 22:06:03 +0100
FARKAS Zoltan <zfarkas_at_[hidden]> wrote:
> Heiko Bauke wrote:
> > Dear all,
> >
> > I'm trying to use LAM/MPI 7.1.1 with Berkeley Lab Checkpoint/Restart
> > 0.4.0 and kernel 2.4.26. But I don't get things working correctly. Is
> > anybody using BLCR to checkpoint MPI applications?
[...]
> Try to checkpoint the first process started by mpirun instead of mpirun.
> I think this will work. (I've tried this a few months ago, and this has
> worked).
Thanks for your reply, but checkpointing the first process started by
mpirun did not solve my problems.
Today I also tried checkpointing with LAM/MPI 7.0.6, without success. But
LAM/MPI 7.0.6 behaves differently. With LAM/MPI 7.1.1, a single checkpoint
file was saved that contains the context of mpirun but not the context of
the application. No error messages were displayed. With LAM/MPI 7.0.6,
no checkpoint file was written at all, but the console where mpirun runs
displayed the error message "rploadgov failed."
Another difference in behaviour between LAM/MPI 7.1.1 and 7.0.6 is that
under 7.0.6 mpirun is listed three times in the process table, while
under 7.1.1 mpirun appears only once. I don't know whether this is a bug
or a feature.
If anybody has successfully done checkpointing with LAM/MPI and BLCR, I
would like to know their exact configuration.
Heiko
--
-- Common sense in an uncommon degree is what the world calls
-- wisdom. (Samuel Coleridge, 1772-1834)
-- Supercomputing in Magdeburg @ http://tina.nat.uni-magdeburg.de
-- Heiko Bauke @ http://www.uni-magdeburg.de/bauke