LAM/MPI General User's Mailing List Archives

From: Neville Lee (neville.lee_at_[hidden])
Date: 2004-10-21 05:42:56


I compiled 7.1.2b6 today, and it finally works.

I have a question, though: there are 3 mpirun processes instead of 1. Why
is that?

Jeff Squyres wrote:

> Ugh. This is actually two issues:
>
> 1. LAM didn't find your BLCR installation directory, so it's getting a
> path wrong at run time (hence the "rploadgov failed") and failing to
> do the checkpoint. Ensure that you use
> --with-cr-blcr=/path/to/blcr/installation in your configure
> statement. I've also updated the logic a little such that if LAM
> doesn't find the path at configure time, it'll simply rely on your
> $PATH to find the blcr applications.
>
> 2. There was faulty logic in the crlam blcr module such that it didn't
> add -lcr when it compiled its module. I've fixed it so that it now
> does -- it should no longer be necessary to add "-lcr" to the mpicc
> command line.
>
> I've committed both fixes and am cutting b6 right now. It'll be on
> the web site later today.
>
> Sorry for all the hassle -- I swear we'll get this right in the near
> future! :-)
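
For the first issue above, here is a minimal sketch of a sanity check to run before configuring. It assumes the BLCR utilities `cr_checkpoint` and `cr_restart` live in `bin/` under the installation prefix, which is a layout assumption and not something stated in this thread. The helper names `blcr_prefix_ok` and `lam_configure_cmd` are hypothetical. Only the `--with-cr-blcr` and `--with-rpi=crtcp` flags come from the messages here.

```shell
# Hypothetical helper: succeed only if DIR looks like a BLCR install prefix.
# Assumption: the BLCR utilities are installed in DIR/bin.
blcr_prefix_ok() {
    [ -x "$1/bin/cr_checkpoint" ] && [ -x "$1/bin/cr_restart" ]
}

# Hypothetical helper: print (do not run) a configure command for DIR,
# or complain on stderr and return 1 if DIR does not look right.
lam_configure_cmd() {
    if blcr_prefix_ok "$1"; then
        printf './configure --with-cr-blcr=%s --with-rpi=crtcp\n' "$1"
    else
        printf 'no cr_checkpoint/cr_restart under %s/bin\n' "$1" >&2
        return 1
    fi
}
```

For example, `lam_configure_cmd /usr/local/blcr` prints the configure line only if that prefix passes the check, so a wrong path is caught before a long build rather than at checkpoint time.
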
>
>
>
> On Oct 20, 2004, at 12:40 PM, <ducong_at_[hidden]> wrote:
>
>> I ran into this problem before and solved it by adding -lcr when
>> compiling with mpicc. Compiling with LAM/MPI 7.0 also works.
>> mpicc -lcr ....
>> Cong
>>
>>
>> On Wed, 20 Oct 2004 22:13:51 +0800, Neville Lee
>> <neville.lee_at_[hidden]> wrote:
>>
>>> Thanks for the reply.
>>>
>>> I tried version 7.1.2b5. When doing cr_checkpoint, it says:
>>> rploadgov failed.: No such file or directory
>>> The mpirun process is also terminated after cr_checkpoint, but the MPI
>>> program continues running.
>>> I configured LAM with
>>> --with-blcr=/usr/local --with-rpi=crtcp
>>>
>>> I also tried 7.0.6 and 7.2b1r9913 with the same configure parameters.
>>> 7.0.6 does not have any problems, but 7.2b1r9913 has problems similar
>>> to those in 7.1.1.
>>>
>>> Any explanation for this?
>>>
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Jeff Squyres <jsquyres_at_[hidden]>
>>> To: General LAM/MPI mailing list <lam_at_[hidden]>
>>> Date: Tue, 19 Oct 2004 11:05:17 -0400
>>> Subject: Re: LAM: cr_pthread.c:82 cri_pthread_init: When linking
>>> libpthread, it must be linked AFTER libcr
>>> <div class="moz-text-flowed" style="font-family: -moz-fixed">Sorry
>>> for the delay on this -- I looked into this and found that there
>>> are actually two issues here:
>>>
>>> - compiling MPI apps with checkpoint support using BLCR
>>> - using checkpoints at run-time
>>>
>>> The compiling issue turns out to be by design of BLCR -- the "cr"
>>> library must be linked in before libpthread (which is what you were
>>> seeing). In the DSO module case, the cr library is linked to the
>>> module (and not the user's app), so it gets loaded in the process
>>> *after* libpthread, and there's really no way to get the ordering right.
>>> Hence, these components really need to be statically linked into libmpi
>>> (I've added release notes about this for 7.1.2).
>>>
>>> There are two main ways to do this:
>>>
>>> 1. Configure all LAM modules to be statically linked into libmpi. This
>>> is the default mode, so if you don't specify --enable-shared
>>> --disable-static --with-modules, it should build this way.
>>>
>>> 2. Configure just the cr modules statically linked into libmpi. For
>>> example:
>>>
>>> ./configure --disable-static --enable-shared
>>> --with-modules=boot,coll,rpi
>>>
>>> (you can be a little more fine-grained than that if you want -- the
>>> above will also compile the self modules statically in libmpi, for
>>> example)
>>>
>>> The second issue is that we apparently accidentally disabled blcr
>>> altogether with a hackaround for a corner case that shouldn't matter
>>> (ducong mailed me about this off-list). I have fixed this in SVN and
>>> have released a new beta tarball with the fixes -- 7.1.2b5. Could you
>>> give it a whirl?
>>>
>>> http://www.lam-mpi.org/beta/
>>>
>>> Let me know how this goes.
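
The link-order requirement described above (the cr library ahead of libpthread) can be checked on a link command line with a small sketch like the following. The function name `lcr_precedes_lpthread` is hypothetical. It only looks at explicit `-lcr` and `-lpthread` tokens and ignores the `-pthread` compiler flag, so treat it as a rough check and not as a statement of what the linker does.

```shell
# Hypothetical helper: $1 is a full link command line, for example the
# output of `mpicc -showme`. Succeeds only if both -lcr and -lpthread are
# present and the first -lcr comes before the first -lpthread.
lcr_precedes_lpthread() {
    lp_cr=0
    lp_pt=0
    lp_i=0
    # $1 is left unquoted on purpose so the shell splits it into tokens.
    for lp_tok in $1; do
        lp_i=$((lp_i + 1))
        case $lp_tok in
            -lcr)      if [ "$lp_cr" -eq 0 ]; then lp_cr=$lp_i; fi ;;
            -lpthread) if [ "$lp_pt" -eq 0 ]; then lp_pt=$lp_i; fi ;;
        esac
    done
    [ "$lp_cr" -gt 0 ] && [ "$lp_pt" -gt 0 ] && [ "$lp_cr" -lt "$lp_pt" ]
}
```

A possible use is `lcr_precedes_lpthread "$(mpicc -showme)" || echo "-lcr is missing or comes after -lpthread"`.
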
>>>
>>> On Oct 17, 2004, at 3:27 PM, Neville Lee wrote:
>>>
>>>> I'm having the exact same problem.
>>>>
>>>> mpicc -showme:
>>>> gcc -I/usr/local/include -pthread -ldl -lpthread -L/lib
>>>> -L/usr/local/lib
>>>> -llammpio -llamf77mpi -lmpi -llam -lutil -lcr -ldl
>>>>
>>>> ldd a.out
>>>> libm.so.6 => /lib/libm.so.6 (0x4002c000)
>>>> libdl.so.2 => /lib/libdl.so.2 (0x4004e000)
>>>> libpthread.so.0 => /lib/libpthread.so.0 (0x40051000)
>>>> libutil.so.1 => /lib/libutil.so.1 (0x400a2000)
>>>> libcr.so.0 => /usr/local/lib/libcr.so.0 (0x400a6000)
>>>> libc.so.6 => /lib/libc.so.6 (0x400ad000)
>>>> /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
>>>>
>>>> With mpicc -v source.c 2> out, I can see that -lcr appears after
>>>> -lpthread in the argument list of collect2.
>>>> So I removed the message lines from the file 'out', leaving only the
>>>> commands, and ran the file as a script. This produces an executable
>>>> that runs without complaints.
>>>>
>>>> And ldd output of the new executable:
>>>> libm.so.6 => /lib/libm.so.6 (0x4002c000)
>>>> libdl.so.2 => /lib/libdl.so.2 (0x4004e000)
>>>> libcr.so.0 => /usr/local/lib/libcr.so.0 (0x40051000)
>>>> libpthread.so.0 => /lib/libpthread.so.0 (0x40058000)
>>>> libutil.so.1 => /lib/libutil.so.1 (0x400aa000)
>>>> libc.so.6 => /lib/libc.so.6 (0x400ad000)
>>>> /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
>>>> Apparently libcr appears before libpthread now.
>>>>
>>>> Is this a bug in mpicc?
>>>>
>>>> However, after that I can mpirun the program and do cr_checkpoint,
>>>> but when I call cr_restart, it says:
>>>> mpirun (rpwait): Bad file descriptor
>>>> Any ideas?
>>>>
>>>> BTW I'm using LAM-MPI 7.1.1 and blcr 0.2.3.
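
The comparison of the two `ldd` listings above can be automated with a short sketch. It follows the same heuristic used in this message (the position of libcr relative to libpthread in `ldd` output), and the function name `cr_listed_before_pthread` is hypothetical. It reads the listing on standard input, so it can be tried on saved output as well as on a live binary.

```shell
# Hypothetical helper: reads `ldd` output on stdin and succeeds only if a
# libcr line appears before the first libpthread line.
cr_listed_before_pthread() {
    awk '
        /libcr\.so/      && !cr { cr = NR }   # first line naming libcr
        /libpthread\.so/ && !pt { pt = NR }   # first line naming libpthread
        END { exit (cr && pt && cr < pt) ? 0 : 1 }
    '
}
```

For example, `ldd a.out | cr_listed_before_pthread && echo "libcr is listed first"`.
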
>>>>
>>>>> Can you send the output of "mpicc -showme" and "ldd a.out"?
>>>>>
>>>>> What version of LAM are you using?
>>>>>
>>>>>
>>>>> On Oct 12, 2004, at 1:51 PM, <ducong_at_xxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>> When I try to run an MPI program, I get the following error:
>>>>> $ mpirun -ssi rpi crtcp -np 1 a.out
>>>>> cr_pthread.c:82 cri_pthread_init: When linking libpthread, it must be
>>>>> linked AFTER libcr
>>>>>
>>>>> My configuration is as follows:
>>>>> $ ./configure --with-cr-blcr=/usr/local/blcr --with-rpi=crtcp
>>>>> --prefix=/home/ducong/lam --with-rsh=ssh -x
>>>>>
>>>>> How can I solve this problem?
>>>>> Thanks
>>>>> _______________________________________________
>>>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>>>
>>>>>
>>>>> -- {+} Jeff Squyres {+} jsquyres_at_xxxxxxxxxxx {+}
>>>>> http://www.lam-mpi.org/
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> {+} Jeff Squyres
>>> {+} jsquyres_at_[hidden]
>>> {+} http://www.lam-mpi.org/
>>>
>>> </div>
>>>
>>>
>>>
>>
>