LAM/MPI General User's Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2007-03-27 16:46:35


Unfortunately I cannot reproduce this. I am using the latest build of
LAM/MPI (7.1.3) and BLCR (0.5.0), and all seems well. :(

Can you upgrade to the latest LAM/MPI with a fresh build? This will
help reduce the number of variables that could be causing the issue.
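
(A sketch of what such a fresh build might look like, reusing the
configure flags from the 7.1.2 build quoted below; run it in a freshly
unpacked LAM/MPI 7.1.3 source tree so no stale objects are picked up,
and adjust the prefixes to your own paths:)

$ ./configure --prefix=/home/pst/lam \
              --with-rsh="ssh -x" \
              --with-cr-blcr=/home/pst/blcr
$ make
$ make install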

The failure of the SSI types upon checkpoint worries me a bit, since
I have never seen that error thrown before. It makes me think that
some memory is getting corrupted across the checkpoint.
Can you try checkpointing/restarting a simple MPI program to see if
it has the same problem? Something like a hello world program in a
wait loop. This will give you enough time to checkpoint the process,
terminate it, and restart it.
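
(For concreteness, a minimal sketch of that kind of test program: a
hello world that sleeps in a loop so there is time to checkpoint, kill,
and restart it. The loop count and sleep interval below are arbitrary;
compile it with mpicc and run it the same way as the rotating job.)

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello, world: I am rank %d of %d\n", rank, size);

    /* Wait loop: leaves a couple of minutes to lamcheckpoint the job,
       terminate it, and lamrestart it before MPI_Finalize is reached. */
    for (i = 0; i < 120; i++) {
        printf("rank %d: iteration %d\n", rank, i);
        fflush(stdout);
        sleep(1);
    }

    MPI_Finalize();
    return 0;
}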

-- Josh

On Mar 27, 2007, at 12:39 PM, Paul H. Hargrove wrote:

> Yuan,
>
> I've certainly not seen anything like that before. The fact that the
> error message changed after adding "-ssi rpi crtcp" suggests to me that
> Josh was on the right track. However, the new failure mode looks even
> more ominous.
>
> My best guess would be that something changed in either BLCR or FC6
> that has broken the assumptions being made by the crtcp rpi module in
> LAM/MPI. I don't currently have a system on which to test LAM/MPI+BLCR,
> so I can't verify this.
>
> Depending on what has broken, the fix might belong in either LAM/MPI
> or BLCR. I am afraid I probably won't have any chance to look at this
> in detail for a couple weeks at least.
>
> Not sure about the 2 mpirun instances, but would guess that one of them
> might be internal to lamcheckpoint operation. Passing an option such as
> "-f" or "-l" to ps would give the parent id (PPID) and make it clear
> who/what started the 2nd mpirun. As for the 3 cr_checkpoint instances,
> they correspond to the 3 context files you would eventually get: one
> for the mpirun and one for each of the two "rotating" processes.
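
(As an aside, an illustration of the ps suggestion, using the PID of the
second mpirun from Yuan's listing below; the full-format flag adds a PPID
column showing which process spawned it -- no output is reproduced here:)

$ ps -f -p 17265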
>
> -Paul
>
> Yuan Wan wrote:
>> On Mon, 26 Mar 2007, Paul H. Hargrove wrote:
>>
>> Hi Paul,
>>
>> Thanks for your reply.
>>
>> I have tried to explicitly use the "crtcp" module, but it caused a
>> failure on checkpoint:
>>
>> $ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
>> $ lamcheckpoint -ssi cr blcr -pid 17256
>>
>> -----------------------------------------------------------------------------
>>
>> Encountered a failure in the SSI types while continuing from
>> checkpoint. Aborting in despair :-(
>> -----------------------------------------------------------------------------
>>
>> And the code never exits after it reaches the end.
>> I checked the 'ps' list and found there are two 'mpirun' and
>> three 'checkpoint' processes running:
>> ---------------------------------------
>> 17255 ? 00:00:00 lamd
>> 17256 pts/2 00:00:00 mpirun
>> 17257 ? 00:00:15 rotating
>> 17258 ? 00:00:15 rotating
>> 17263 pts/3 00:00:00 lamcheckpoint
>> 17264 pts/3 00:00:00 cr_checkpoint
>> 17265 pts/2 00:00:00 mpirun
>> 17266 ? 00:00:00 cr_checkpoint
>> 17267 ? 00:00:00 cr_checkpoint
>> ---------------------------------------
>>
>> --Yuan
>>
>>
>>
>>>
>>> Yuan,
>>>
>>> I've not encountered this problem before. It looks as if something
>>> is triggering a LAM-internal error message. It is possible that this
>>> is a result of a BLCR problem, or it could be a LAM/MPI problem. If
>>> the problem *is* in BLCR, then there is not enough information here
>>> to try to find it.
>>> I see that you have also asked on the LAM/MPI mailing list, and that
>>> Josh Hursey made a suggestion there. I am monitoring that thread and
>>> will make any BLCR-specific comments if I can. However, at this point
>>> I don't have any ideas beyond Josh's suggestion to explicitly set the
>>> rpi module to crtcp.
>>>
>>> -Paul
>>>
>>> Yuan Wan wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I have run into a problem when checkpointing LAM/MPI code using BLCR.
>>>>
>>>> My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
>>>> I have built blcr-0.5.0 and it works well with serial codes.
>>>>
>>>> I built LAM/MPI 7.1.2
>>>> ---------------------------------------------
>>>> $ ./configure --prefix=/home/pst/lam \
>>>>               --with-rsh="ssh -x" \
>>>>               --with-cr-blcr=/home/pst/blcr
>>>> $ make
>>>> $ make install
>>>> ---------------------------------------------
>>>>
>>>> The laminfo output is
>>>> -----------------------------------------------------
>>>> LAM/MPI: 7.1.2
>>>> Prefix: /home/pst/lam
>>>> Architecture: i686-pc-linux-gnu
>>>> Configured by: pst
>>>> Configured on: Sat Mar 24 00:40:42 GMT 2007
>>>> Configure host: master00
>>>> Memory manager: ptmalloc2
>>>> C bindings: yes
>>>> C++ bindings: yes
>>>> Fortran bindings: yes
>>>> C compiler: gcc
>>>> C++ compiler: g++
>>>> Fortran compiler: g77
>>>> Fortran symbols: double_underscore
>>>> C profiling: yes
>>>> C++ profiling: yes
>>>> Fortran profiling: yes
>>>> C++ exceptions: no
>>>> Thread support: yes
>>>> ROMIO support: yes
>>>> IMPI support: no
>>>> Debug support: no
>>>> Purify clean: no
>>>> SSI boot: globus (API v1.1, Module v0.6)
>>>> SSI boot: rsh (API v1.1, Module v1.1)
>>>> SSI boot: slurm (API v1.1, Module v1.0)
>>>> SSI coll: lam_basic (API v1.1, Module v7.1)
>>>> SSI coll: shmem (API v1.1, Module v1.0)
>>>> SSI coll: smp (API v1.1, Module v1.2)
>>>> SSI rpi: crtcp (API v1.1, Module v1.1)
>>>> SSI rpi: lamd (API v1.0, Module v7.1)
>>>> SSI rpi: sysv (API v1.0, Module v7.1)
>>>> SSI rpi: tcp (API v1.0, Module v7.1)
>>>> SSI rpi: usysv (API v1.0, Module v7.1)
>>>> SSI cr: blcr (API v1.0, Module v1.1)
>>>> SSI cr: self (API v1.0, Module v1.0)
>>>> --------------------------------------------------------
>>>>
>>>>
>>>> My parallel code works well with LAM without any checkpointing:
>>>> $ mpirun -np 2 ./job
>>>>
>>>> Then I run my parallel job in a checkpointable way:
>>>> $ mpirun -np 2 -ssi cr blcr ./rotating
>>>>
>>>> And checkpoint this job in another window:
>>>> $ lamcheckpoint -ssi cr blcr -pid 11928
>>>>
>>>> This operation produces a context file for mpirun
>>>>
>>>> "context.mpirun.11928"
>>>>
>>>> plus two context files for the job
>>>>
>>>> "context.11928-n0-11929"
>>>> "context.11928-n0-11930"
>>>>
>>>> So far, so good :)
>>>> -------------------------------------------------------
>>>>
>>>> However, when I restart the job with the context file:
>>>> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>>>>
>>>> I got the following error:
>>>>
>>>> Results CORRECT on rank 0 [this line is output printed by the code itself]
>>>>
>>>> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200,
>>>> MPI_COMM_WORLD)
>>>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>> Rank (0, MPI_COMM_WORLD): - MPI_Finalize()
>>>> Rank (0, MPI_COMM_WORLD): - main()
>>>>
>>>> -----------------------------------------------------------------------------
>>>> It seems that [at least] one of the processes that was started with
>>>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>>>> more than one process did not invoke MPI_INIT -- mpirun was only
>>>> notified of the first one, which was on node n0).
>>>>
>>>> mpirun can *only* be used with MPI programs (i.e., programs that
>>>> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
>>>> to run non-MPI programs over the lambooted nodes.
>>>>
>>>> -----------------------------------------------------------------------------
>>>>
>>>> Has anyone met this problem before, and does anyone know how to solve it?
>>>>
>>>> Many Thanks
>>>>
>>>>
>>>> --Yuan
>>>>
>>>>
>>>> Yuan Wan
>>>
>>>
>>>
>>
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> HPC Research Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

----
Josh Hursey
jjhursey_at_[hidden]
http://www.open-mpi.org/