Ugh. This is actually two issues:
1. LAM didn't find your BLCR installation directory, so it's getting a
path wrong at run time (hence the "rploadgov failed") and failing to do
the checkpoint. Ensure that you use
--with-cr-blcr=/path/to/blcr/installation in your configure statement.
I've also updated the logic a little such that if LAM doesn't find the
path at configure time, it'll simply rely on your $PATH to find the
blcr applications.
2. There was faulty logic in the crlam blcr module such that it didn't
add -lcr when it compiled its module. I've fixed it so that it now
does -- it should no longer be necessary to add "-lcr" to the mpicc
command line.
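
For the first point, it can help to confirm where the BLCR tools actually live before re-running configure. The following is only a sketch: find_blcr_prefix is a hypothetical helper name (not part of LAM or BLCR), and it assumes that cr_checkpoint is installed under <prefix>/bin. It prints a suggested configure line using the --with-cr-blcr and --with-rpi=crtcp flags mentioned in this thread; it does not run configure itself.

```shell
# Hypothetical helper (not part of LAM or BLCR): print the BLCR install
# prefix by locating cr_checkpoint on $PATH.
# Assumption: cr_checkpoint lives in <prefix>/bin.
find_blcr_prefix() {
    tool=$(command -v cr_checkpoint) || return 1
    bindir=$(dirname "$tool")
    dirname "$bindir"
}

# Only print a suggested configure line; run it by hand after checking it.
if prefix=$(find_blcr_prefix); then
    echo "./configure --with-cr-blcr=$prefix --with-rpi=crtcp"
else
    echo "cr_checkpoint not found in PATH" >&2
fi
```

If the helper finds nothing, BLCR is probably not on your $PATH at all, in which case the PATH fallback described above would not help either.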
I've committed both fixes and am cutting b6 right now. It'll be on the
web site later today.
Sorry for all the hassle -- I swear we'll get this right in the near
future! :-)
On Oct 20, 2004, at 12:40 PM, <ducong_at_[hidden]> wrote:
> I have run into this problem before and solved it by adding -lcr when
> compiling with mpicc. Compiling with LAM/MPI 7.0 also works, if that
> is an option for you.
> mpicc -lcr ....
> Cong
>
>
> On Wed, 20 Oct 2004 22:13:51 +0800, Neville Lee
> <neville.lee_at_[hidden]> wrote:
>> Thanks for the reply.
>>
>> I tried version 7.1.2b5. When doing cr_checkpoint, it says:
>> rploadgov failed.: No such file or directory
>> The mpirun process is also terminated after cr_checkpoint, but the MPI
>> program continues running.
>> I configured LAM with
>> --with-blcr=/usr/local --with-rpi=crtcp
>>
>> I also tried 7.0.6 and 7.2b1r9913 with the same configure parameters.
>> 7.0.6 does not have any problems, but 7.2b1r9913 shows problems similar
>> to those in 7.1.1.
>>
>> Any explanation for this?
>>
>>
>>
>> ---------- Forwarded message ----------
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> To: General LAM/MPI mailing list <lam_at_[hidden]>
>> Date: Tue, 19 Oct 2004 11:05:17 -0400
>> Subject: Re: LAM: cr_pthread.c:82 cri_pthread_init: When linking
>> libpthread, it must be linked AFTER libcr
>> <div class="moz-text-flowed" style="font-family: -moz-fixed">Sorry
>> for the delay on this -- I looked into this and found that there
>> are actually two issues here:
>>
>> - compiling MPI apps with checkpoint support using BLCR
>> - using checkpoints at run-time
>>
>> The compiling issue turns out to be by design of BLCR -- the "cr"
>> library must be linked in before libpthread (which is what you were
>> seeing). In the DSO module case, the cr library is linked to the
>> module (and not the user's app), so it gets loaded into the process
>> *after* libpthread, and there's really no way to get the ordering right.
>> Hence, these components really need to be statically linked into libmpi
>> (I've added release notes about this for 7.1.2).
>>
>> There are two main ways to do this:
>>
>> 1. Configure all LAM modules to be statically linked into libmpi.
>> This is the default mode, so if you don't specify --enable-shared
>> --disable-static --with-modules, it should build this way.
>>
>> 2. Configure just the cr modules to be statically linked into libmpi.
>> For example:
>>
>> ./configure --disable-static --enable-shared
>> --with-modules=boot,coll,rpi
>>
>> (you can be a little more fine-grained than that if you want -- the
>> above will also compile the self modules statically in libmpi, for
>> example)
>>
>> The second issue is that we apparently accidentally disabled blcr
>> altogether with a hackaround for a corner case that shouldn't matter
>> (ducong mailed me about this off-list). I have fixed this in SVN and
>> have released a new beta tarball with the fixes -- 7.1.2b5. Could you
>> give it a whirl?
>>
>> http://www.lam-mpi.org/beta/
>>
>> Let me know how this goes.
>>
>> On Oct 17, 2004, at 3:27 PM, Neville Lee wrote:
>>
>>> I'm having the exact same problem.
>>>
>>> mpicc -showme:
>>> gcc -I/usr/local/include -pthread -ldl -lpthread -L/lib
>>> -L/usr/local/lib
>>> -llammpio -llamf77mpi -lmpi -llam -lutil -lcr -ldl
>>>
>>> ldd a.out
>>> libm.so.6 => /lib/libm.so.6 (0x4002c000)
>>> libdl.so.2 => /lib/libdl.so.2 (0x4004e000)
>>> libpthread.so.0 => /lib/libpthread.so.0 (0x40051000)
>>> libutil.so.1 => /lib/libutil.so.1 (0x400a2000)
>>> libcr.so.0 => /usr/local/lib/libcr.so.0 (0x400a6000)
>>> libc.so.6 => /lib/libc.so.6 (0x400ad000)
>>> /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
>>>
>>> With mpicc -v source.c 2> out, I can see that -lcr appears after
>>> -lpthread in the argument list of collect2.
>>> So I removed the message lines from the file 'out', leaving only the
>>> commands, and ran the file as a script. This produces an executable
>>> that runs without complaints.
>>>
>>> And ldd output of the new executable:
>>> libm.so.6 => /lib/libm.so.6 (0x4002c000)
>>> libdl.so.2 => /lib/libdl.so.2 (0x4004e000)
>>> libcr.so.0 => /usr/local/lib/libcr.so.0 (0x40051000)
>>> libpthread.so.0 => /lib/libpthread.so.0 (0x40058000)
>>> libutil.so.1 => /lib/libutil.so.1 (0x400aa000)
>>> libc.so.6 => /lib/libc.so.6 (0x400ad000)
>>> /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
>>> Apparently libcr appears before libpthread now.
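
The ordering check above can also be scripted rather than read by eye. This is a minimal sketch, assuming the order printed by ldd reflects load order, as the two listings above suggest; cr_before_pthread is a hypothetical helper name, not a LAM or BLCR tool.

```shell
# Hypothetical helper: read `ldd` output on stdin and succeed only if
# libcr is listed, and listed before libpthread (if libpthread is present).
cr_before_pthread() {
    awk '
        /libcr\.so/      { if (!cr) cr = NR }   # first line naming libcr
        /libpthread\.so/ { if (!pt) pt = NR }   # first line naming libpthread
        END { exit !(cr && (!pt || cr < pt)) }
    '
}

# Usage: ldd a.out | cr_before_pthread && echo "link order OK"
```

Fed the first ldd listing above, the helper fails (libpthread comes first); fed the second, it succeeds.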
>>>
>>> Is this a bug of mpicc?
>>>
>>> However, after that I can mpirun the program and do cr_checkpoint,
>>> but when I call cr_restart, it says:
>>> mpirun (rpwait): Bad file descriptor
>>> Any ideas?
>>>
>>> BTW I'm using LAM-MPI 7.1.1 and blcr 0.2.3.
>>>
>>>> Can you send the output of "mpicc -showme" and "ldd a.out"?
>>>>
>>>> What version of LAM are you using?
>>>>
>>>>
>>>> On Oct 12, 2004, at 1:51 PM, <ducong_at_xxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> Hi,
>>>> When I try to run an MPI program, I get the following error:
>>>> $ mpirun -ssi rpi crtcp -np 1 a.out
>>>> cr_pthread.c:82 cri_pthread_init: When linking libpthread, it must
>>>> be linked AFTER libcr
>>>>
>>>> My configuration is as follows:
>>>> $ ./configure --with-cr-blcr=/usr/local/blcr --with-rpi=crtcp
>>>> --prefix=/home/ducong/lam --with-rsh=ssh -x
>>>>
>>>> How can I solve this problem?
>>>> Thanks
>>>> _______________________________________________
>>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>>
>>>>
>>>> -- {+} Jeff Squyres {+} jsquyres_at_xxxxxxxxxxx {+}
>>>> http://www.lam-mpi.org/
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>>
>> </div>
>>
>>
>>
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/