I noticed that you didn't request the crtcp RPI SSI module, which is
needed for checkpointing. I'm not sure that this is the problem, but
can you try it with:
> $ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
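Once the job is running over the crtcp RPI, the checkpoint and restart
steps themselves are the same ones you already used. For reference, a
rough sketch of that part of the cycle, reusing your own commands
(<pid> is just a placeholder for the PID that mpirun reports):
> $ lamcheckpoint -ssi cr blcr -pid <pid>
> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.<pid>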
Let me know if that helps.
-- Josh
On Mar 26, 2007, at 9:13 AM, Yuan Wan wrote:
>
> Hi all,
>
> I have run into a problem when checkpointing LAM/MPI code using BLCR.
>
> My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
> I have built blcr-0.5.0 and it works well with serial codes.
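> (For reference, the serial test was along the lines of the usual BLCR
> round trip sketched here; serial_job, <pid> and context.<pid> are
> placeholders, and cr_run, cr_checkpoint and cr_restart are the
> standard BLCR tools:)
>
> $ cr_run ./serial_job &        # run the program with the BLCR library preloaded
> $ cr_checkpoint <pid>          # checkpoint it; writes a context.<pid> file
> $ cr_restart context.<pid>     # resume from that checkpoint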
>
> I built LAM/MPI 7.1.2
> ---------------------------------------------
> $ ./configure --prefix=/home/pst/lam \
>     --with-rsh="ssh -x" \
>     --with-cr-blcr=/home/pst/blcr
> $ make
> $ make install
> ---------------------------------------------
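> As a quick sanity check that both checkpoint-related modules were
> built, the laminfo output can simply be filtered, e.g.:
>
> $ laminfo | grep -E 'crtcp|blcr'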
>
> The laminfo output is
> -----------------------------------------------------
> LAM/MPI: 7.1.2
> Prefix: /home/pst/lam
> Architecture: i686-pc-linux-gnu
> Configured by: pst
> Configured on: Sat Mar 24 00:40:42 GMT 2007
> Configure host: master00
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: gcc
> C++ compiler: g++
> Fortran compiler: g77
> Fortran symbols: double_underscore
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: no
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (API v1.1, Module v0.6)
> SSI boot: rsh (API v1.1, Module v1.1)
> SSI boot: slurm (API v1.1, Module v1.0)
> SSI coll: lam_basic (API v1.1, Module v7.1)
> SSI coll: shmem (API v1.1, Module v1.0)
> SSI coll: smp (API v1.1, Module v1.2)
> SSI rpi: crtcp (API v1.1, Module v1.1)
> SSI rpi: lamd (API v1.0, Module v7.1)
> SSI rpi: sysv (API v1.0, Module v7.1)
> SSI rpi: tcp (API v1.0, Module v7.1)
> SSI rpi: usysv (API v1.0, Module v7.1)
> SSI cr: blcr (API v1.0, Module v1.1)
> SSI cr: self (API v1.0, Module v1.0)
> --------------------------------------------------------
>
>
> My parallel code runs fine under LAM without any checkpointing:
> $ mpirun -np 2 ./job
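> (For illustration only: a minimal long-running MPI program of the same
> general shape, a hypothetical stand-in rather than the actual code,
> that leaves enough time for a checkpoint to be taken:)
>
> /* toy stand-in for the real program -- NOT the actual code */
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
>     int rank, i, token;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     for (i = 0; i < 60; i++) {
>         token = i;
>         /* a little traffic so the RPI layer is actually exercised */
>         MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
>         if (rank == 0)
>             printf("iteration %d\n", token);
>         sleep(1);   /* keep the job alive long enough to checkpoint */
>     }
>
>     printf("Results CORRECT on rank %d\n", rank);
>     MPI_Finalize();
>     return 0;
> }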
>
> Then I ran the job in a checkpointable way:
> $ mpirun -np 2 -ssi cr blcr ./rotating
>
> Then I checkpointed the job from another window:
> $ lamcheckpoint -ssi cr blcr -pid 11928
>
> This operation produces a context file for mpirun
>
> "context.mpirun.11928"
>
> plus two context files for the job
>
> "context.11928-n0-11929"
> "context.11928-n0-11930"
>
> So far so good, it seems :)
> -------------------------------------------------------
>
> However, when I restarted the job from the context file:
> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>
> I got the following error:
>
> Results CORRECT on rank 0   [this line is the program's own output]
>
> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200,
> MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Finalize()
> Rank (0, MPI_COMM_WORLD): - main()
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
>
> Has anyone run into this problem before and figured out how to solve it?
>
> Many Thanks
>
> --Yuan
>
>
> Yuan Wan
> --
> Unix Section
> Information Services Infrastructure Division
> University of Edinburgh
>
> tel: 0131 650 4985
> email: ywan_at_[hidden]
>
> 2032 Computing Services, JCMB
> The King's Buildings,
> Edinburgh, EH9 3JZ
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/