Sorry for the delay on this -- I looked into this and found that there
are actually two issues here:
- compiling MPI apps with checkpoint support using BLCR
- using checkpoints at run-time
The compiling issue turns out to be by design of BLCR -- the "cr"
library must be linked in before libpthread (which is what you were
seeing). In the DSO module case, the cr library is linked to the
module (and not the user's app), so it gets loaded into the process
*after* libpthread, and there's really no way to get the ordering right.
Hence, these components really need to be statically linked into libmpi
(I've added release notes about this for 7.1.2).
There are two main ways to do this:
1. Configure all LAM modules to be statically linked into libmpi. This
is the default mode, so if you don't specify --enable-shared
--disable-static --with-modules, it should build this way.
2. Configure just the cr modules to be statically linked into libmpi.
For example:
    ./configure --disable-static --enable-shared \
        --with-modules=boot,coll,rpi
(you can be a little more fine-grained than that if you want -- the
above will also compile the self modules statically into libmpi, for
example)
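A quick way to sanity-check a rebuilt binary is to look at where libcr
and libpthread land in the ldd listing, as in the comparison quoted
below. This is only a sketch: check_cr_order is a hypothetical helper
name (not part of LAM or BLCR), and it assumes that the order of the ldd
listing is a reasonable proxy for the load order.

```shell
# check_cr_order (hypothetical helper): reads `ldd` output on stdin.
# Exit status: 0 if libcr is listed and libpthread does not precede it,
#              1 if libpthread is listed before libcr,
#              2 if libcr is not listed at all.
check_cr_order() {
    awk '
        /libcr\.so/      { if (!cr) cr = NR }   # first line naming libcr
        /libpthread\.so/ { if (!pt) pt = NR }   # first line naming libpthread
        END {
            if (!cr)            { print "libcr is not linked";              exit 2 }
            if (!pt || cr < pt) { print "ok: libpthread does not precede libcr"; exit 0 }
            print "bad: libpthread precedes libcr"
            exit 1
        }'
}
```

It would be used as "ldd a.out | check_cr_order"; a non-zero exit status
means the binary is likely to hit the cri_pthread_init error at startup.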
The second issue is that we apparently accidentally disabled blcr
altogether with a hackaround for a corner case that shouldn't matter
(ducong mailed me about this off-list). I have fixed this in SVN and
have released a new beta tarball with the fixes -- 7.1.2b5. Could you
give it a whirl?
http://www.lam-mpi.org/beta/
Let me know how this goes.
On Oct 17, 2004, at 3:27 PM, Neville Lee wrote:
> I'm having the exact same problem.
>
> mpicc -showme:
> gcc -I/usr/local/include -pthread -ldl -lpthread -L/lib
> -L/usr/local/lib
> -llammpio -llamf77mpi -lmpi -llam -lutil -lcr -ldl
>
> ldd a.out
> libm.so.6 => /lib/libm.so.6 (0x4002c000)
> libdl.so.2 => /lib/libdl.so.2 (0x4004e000)
> libpthread.so.0 => /lib/libpthread.so.0 (0x40051000)
> libutil.so.1 => /lib/libutil.so.1 (0x400a2000)
> libcr.so.0 => /usr/local/lib/libcr.so.0 (0x400a6000)
> libc.so.6 => /lib/libc.so.6 (0x400ad000)
> /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
>
> With mpicc -v source.c 2> out, I can see that -lcr appears after
> -lpthread in the argument list of collect2.
> So I remove the message lines in file 'out', leaving only commands, and
> run the file as a script. This produces an executable that runs without
> complaints.
>
> And ldd output of the new executable:
> libm.so.6 => /lib/libm.so.6 (0x4002c000)
> libdl.so.2 => /lib/libdl.so.2 (0x4004e000)
> libcr.so.0 => /usr/local/lib/libcr.so.0 (0x40051000)
> libpthread.so.0 => /lib/libpthread.so.0 (0x40058000)
> libutil.so.1 => /lib/libutil.so.1 (0x400aa000)
> libc.so.6 => /lib/libc.so.6 (0x400ad000)
> /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
> Apparently libcr appears before libpthread now.
>
> Is this a bug of mpicc?
>
> However, after that I can mpirun the program and do cr_checkpoint, but
> when I call cr_restart, it says:
> mpirun (rpwait): Bad file descriptor
> Any ideas?
>
> BTW I'm using LAM-MPI 7.1.1 and blcr 0.2.3.
>
>> Can you send the output of "mpicc -showme" and "ldd a.out"?
>>
>> What version of LAM are you using?
>>
>>
>> On Oct 12, 2004, at 1:51 PM, <ducong_at_xxxxxxxxx> wrote:
>>
>>
>>
>> Hi,
>> When I try to run an MPI program, I get the following error:
>> $ mpirun -ssi rpi crtcp -np 1 a.out
>> cr_pthread.c:82 cri_pthread_init: When linking libpthread, it must be
>> linked AFTER libcr
>>
>> My configuration is as follows:
>> $ ./configure --with-cr-blcr=/usr/local/blcr --with-rpi=crtcp
>> --prefix=/home/ducong/lam --with-rsh=ssh -x
>>
>> How to solve this problem?
>> Thanks
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>>
>> -- {+} Jeff Squyres {+} jsquyres_at_xxxxxxxxxxx {+}
>> http://www.lam-mpi.org/
>>
>>
>>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/