On Mon, 26 Mar 2007, Josh Hursey wrote:
Hi Josh,
Thanks for your suggestion.
In fact, I have already tried explicitly using the "crtcp" module, but the
checkpoint failed:
$ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
$ lamcheckpoint -ssi cr blcr -pid 17256
-----------------------------------------------------------------------------
Encountered a failure in the SSI types while continuing from
checkpoint. Aborting in despair :-(
-----------------------------------------------------------------------------
Also, the code never exits after it reaches the end.
I checked the 'ps' list and found two 'mpirun' and three 'cr_checkpoint'
processes still running:
---------------------------------------
17255 ? 00:00:00 lamd
17256 pts/2 00:00:00 mpirun
17257 ? 00:00:15 rotating
17258 ? 00:00:15 rotating
17263 pts/3 00:00:00 lamcheckpoint
17264 pts/3 00:00:00 cr_checkpoint
17265 pts/2 00:00:00 mpirun
17266 ? 00:00:00 cr_checkpoint
17267 ? 00:00:00 cr_checkpoint
---------------------------------------
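
For what it is worth, this is how I have been cleaning things up before
retrying (lamclean to clear the stuck run; the LAM_MPI_SSI_rpi environment
variable is only my reading of how to select crtcp without the command-line
flag, so please correct me if that is wrong):
---------------------------------------
$ kill 17263 17264 17265 17266 17267   # stuck lamcheckpoint/cr_checkpoint and extra mpirun
$ lamclean                             # clear leftover user processes from the LAM universe
$ export LAM_MPI_SSI_rpi=crtcp         # intended to be equivalent to "-ssi rpi crtcp"
$ mpirun -np 2 -ssi cr blcr ./rotating
---------------------------------------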
--Yuan
> I noticed that you didn't request the crtcp SSI module needed for
> checkpointing. I am not sure that this is the problem, but can you
> try it with:
>> $ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
>
>
> Let me know if that helps.
>
> -- Josh
>
> On Mar 26, 2007, at 9:13 AM, Yuan Wan wrote:
>
>>
>> Hi all,
>>
>> I have run into a problem when checkpointing LAM/MPI code using BLCR.
>>
>> My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
>> I have built blcr-0.5.0 and it works well with serial codes.
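>>
>> The serial check I did was roughly the standard BLCR sequence below
>> (the program name is only a placeholder):
>>
>> $ cr_run ./serial_job &              # run under BLCR's checkpoint library
>> $ cr_checkpoint -f serial.ckpt $!    # checkpoint the background job by PID
>> $ cr_restart serial.ckpt             # restart from the saved context file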
>>
>> I built LAM/MPI 7.1.2
>> ---------------------------------------------
>> $ ./configure --prefix=/home/pst/lam \
>>               --with-rsh="ssh -x" \
>>               --with-cr-blcr=/home/pst/blcr
>> $ make
>> $ make install
>> ---------------------------------------------
>>
>> The laminfo output is:
>> -----------------------------------------------------
>> LAM/MPI: 7.1.2
>> Prefix: /home/pst/lam
>> Architecture: i686-pc-linux-gnu
>> Configured by: pst
>> Configured on: Sat Mar 24 00:40:42 GMT 2007
>> Configure host: master00
>> Memory manager: ptmalloc2
>> C bindings: yes
>> C++ bindings: yes
>> Fortran bindings: yes
>> C compiler: gcc
>> C++ compiler: g++
>> Fortran compiler: g77
>> Fortran symbols: double_underscore
>> C profiling: yes
>> C++ profiling: yes
>> Fortran profiling: yes
>> C++ exceptions: no
>> Thread support: yes
>> ROMIO support: yes
>> IMPI support: no
>> Debug support: no
>> Purify clean: no
>> SSI boot: globus (API v1.1, Module v0.6)
>> SSI boot: rsh (API v1.1, Module v1.1)
>> SSI boot: slurm (API v1.1, Module v1.0)
>> SSI coll: lam_basic (API v1.1, Module v7.1)
>> SSI coll: shmem (API v1.1, Module v1.0)
>> SSI coll: smp (API v1.1, Module v1.2)
>> SSI rpi: crtcp (API v1.1, Module v1.1)
>> SSI rpi: lamd (API v1.0, Module v7.1)
>> SSI rpi: sysv (API v1.0, Module v7.1)
>> SSI rpi: tcp (API v1.0, Module v7.1)
>> SSI rpi: usysv (API v1.0, Module v7.1)
>> SSI cr: blcr (API v1.0, Module v1.1)
>> SSI cr: self (API v1.0, Module v1.0)
>> --------------------------------------------------------
>>
>>
>> My parallel code works well with LAM without any checkpointing:
>> $ mpirun -np 2 ./job
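>>
>> In case it is relevant, the structure of the test program is roughly like
>> the sketch below (a simplified stand-in, not the actual ./rotating source;
>> it just passes a token around a ring long enough to be checkpointed):
>>
>> /* Simplified sketch -- not the real "rotating" code. */
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, size, i, token = 0;
>>     MPI_Status status;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     /* Pass a token around the ring for a while, so there is time
>>        to run lamcheckpoint from another window. */
>>     for (i = 0; i < 600; i++) {
>>         MPI_Sendrecv_replace(&token, 1, MPI_INT,
>>                              (rank + 1) % size, 0,         /* send to right neighbour */
>>                              (rank + size - 1) % size, 0,  /* receive from left       */
>>                              MPI_COMM_WORLD, &status);
>>         token++;
>>         sleep(1);
>>     }
>>
>>     if (rank == 0)
>>         printf("Results CORRECT on rank %d\n", rank);
>>
>>     MPI_Finalize();
>>     return 0;
>> }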
>>
>> Then I ran my parallel job in a checkpointable way:
>> $ mpirun -np 2 -ssi cr blcr ./rotating
>>
>> And checkpointed the job from another window:
>> $ lamcheckpoint -ssi cr blcr -pid 11928
>>
>> This produced a context file for mpirun:
>>
>> "context.mpirun.11928"
>>
>> plus two context files for the job
>>
>> "context.11928-n0-11929"
>> "context.11928-n0-11930"
>>
>> Seems so far so good :)
>> -------------------------------------------------------
>>
>> However, when I restarted the job with the context file:
>> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>>
>> I got the following error:
>>
>> Results CORRECT on rank 0    [this line is the normal output from my code]
>>
>> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200,
>> MPI_COMM_WORLD)
>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>> Rank (0, MPI_COMM_WORLD): - MPI_Finalize()
>> Rank (0, MPI_COMM_WORLD): - main()
>> -----------------------------------------------------------------------------
>> It seems that [at least] one of the processes that was started with
>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>> more than one process did not invoke MPI_INIT -- mpirun was only
>> notified of the first one, which was on node n0).
>>
>> mpirun can *only* be used with MPI programs (i.e., programs that
>> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
>> to run non-MPI programs over the lambooted nodes.
>> -----------------------------------------------------------------------------
>>
>> Has anyone met this problem before, and does anyone know how to solve it?
>>
>> Many Thanks
>>
>> --Yuan
>>
>>
>> Yuan Wan
>> --
>> Unix Section
>> Information Services Infrastructure Division
>> University of Edinburgh
>>
>> tel: 0131 650 4985
>> email: ywan_at_[hidden]
>>
>> 2032 Computing Services, JCMB
>> The King's Buildings,
>> Edinburgh, EH9 3JZ
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
> ----
> Josh Hursey
> jjhursey_at_[hidden]
> http://www.open-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>
--
Unix Section
Information Services Infrastructure Division
University of Edinburgh
tel: 0131 650 4985
email: ywan_at_[hidden]
2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ