Hi all,
I have run into a problem when checkpointing LAM/MPI code with BLCR.
My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
I have built blcr-0.5.0, and it works well with serial codes.
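By "works well with serial codes" I mean the standard BLCR sequence, roughly
(the binary name and PID below are just placeholders):

$ cr_run ./serial_job &        # start under BLCR's preload library
$ cr_checkpoint <pid>          # writes context.<pid> in the current directory
$ cr_restart context.<pid>     # resume from the context file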
I then built LAM/MPI 7.1.2 as follows:
---------------------------------------------
$ ./configure --prefix=/home/pst/lam \
              --with-rsh="ssh -x" \
              --with-cr-blcr=/home/pst/blcr
$ make
$ make install
---------------------------------------------
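Before running anything, the environment is set up roughly like this (the boot
schema is omitted here; lamboot just boots the default node list):

$ export PATH=/home/pst/lam/bin:$PATH
$ lamboot                      # start the LAM run-time (default boot schema)
$ laminfo | grep "SSI cr"      # confirm the blcr cr module is present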
The laminfo output is:
-----------------------------------------------------
LAM/MPI: 7.1.2
Prefix: /home/pst/lam
Architecture: i686-pc-linux-gnu
Configured by: pst
Configured on: Sat Mar 24 00:40:42 GMT 2007
Configure host: master00
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: g77
Fortran symbols: double_underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: blcr (API v1.0, Module v1.1)
SSI cr: self (API v1.0, Module v1.0)
--------------------------------------------------------
My parallel code runs fine under LAM without any checkpointing:
$ mpirun -np 2 ./job
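The job itself is built against this LAM installation with the usual wrapper
compiler; the source file name below is just a placeholder (mpif77 would be
the Fortran equivalent):

$ mpicc -o job job.c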
Then I run my parallel job in a checkpointable way:
$ mpirun -np 2 -ssi cr blcr ./rotating
And checkpoint it from another window:
$ lamcheckpoint -ssi cr blcr -pid 11928
This produces one context file for mpirun:
    context.mpirun.11928
plus two context files for the job:
    context.11928-n0-11929
    context.11928-n0-11930
So far, so good :)
-------------------------------------------------------
However, when I restart the job with the context file:
$ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
I get the following error:
Results CORRECT on rank 0        [this line is my program's own output]
MPI_Finalize: internal MPI error: Invalid argument (rank 137389200,
MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Finalize()
Rank (0, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
Has anyone met this problem before? Any idea how to solve it?
Many Thanks
--Yuan
Yuan Wan
--
Unix Section
Information Services Infrastructure Division
University of Edinburgh
tel: 0131 650 4985
email: ywan_at_[hidden]
2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ