Thanks for the detailed analysis!

First off, sorry about the PGI linker problems. Good, bad, or
indifferent, we're chained to support only what GNU Libtool supports.
:-\ This is apparently a situation where Libtool isn't doing quite the
Right Thing. It would be great if you could file a bug on the Libtool
list -- PGI will probably do the same thing with Libtool for even trivial
compilations (i.e., something significantly simpler than LAM's build
process).
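
For anyone else who hits this before Libtool is fixed: the workaround
from the report below (handing the run path to the linker via -Wl
instead of as a bare --rpath switch) amounts to rewriting the link line.
A rough sketch, not something LAM or Libtool provides; the function name
fix_rpath is made up for illustration, and it assumes a sed that accepts
-E:

```shell
# Hypothetical helper: rewrite "--rpath DIR" as "-Wl,--rpath,DIR" so the
# compiler driver forwards the option to the linker instead of parsing it.
fix_rpath() {
    sed -E 's/--rpath +([^ ]+)/-Wl,--rpath,\1/g'
}

# Example input: a shortened form of the failing lamboot link line.
echo 'pgcc -O -o lamboot lamboot.o -lpthread --rpath /opt/gm/lib' | fix_rpath
```

The rewritten command would still have to be run by hand from the
relevant build directory, as in the manual link step quoted below.
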

We actually don't have access to any Opterons with PGI and gm for
testing. It looks like it's failing inside malloc, which, with LAM's gm
module, *should* be the internal ptmalloc module. And from the gdb
backtrace, it looks like it is. So something is dying inside
chunk_alloc(), but I'm guessing the real problem happened earlier, such
that the stack and/or heap was already corrupted and chunk_alloc() is
dying as a result.

Can you run this through valgrind or some other memory-checking
debugger? (I don't know if valgrind supports Opteron.) You should be able
to test just a single "hello world" and/or laminfo and get the same
results, if it really is a ptmalloc problem.
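
Something along these lines is what I have in mind. This is only a
sketch: it assumes valgrind is installed and works on that platform, the
install prefix is copied from your configure line, and run_memcheck is
just a made-up wrapper name.

```shell
# Install prefix taken from the configure line in the report; adjust as needed.
LAM_PREFIX=/home/marc/lam-7.0.6.pgi.myrinet

# Hypothetical wrapper: run one program under valgrind's memcheck tool,
# or report that valgrind is not available on this machine.
run_memcheck() {
    if ! command -v valgrind >/dev/null 2>&1; then
        echo "valgrind not found; skipping $1" >&2
        return 127
    fi
    valgrind --tool=memcheck --num-callers=20 "$@"
}
```

Start with the smallest case, run_memcheck "$LAM_PREFIX/bin/laminfo",
since it crashes without a lamboot, and then try
run_memcheck ./cpi.lam.pgi.myrinet.
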
On Thu, 20 May 2004, Marc Rieffel wrote:
> I'm having trouble running programs that I build with LAM and Myrinet
> and PGI. It works if I use gcc and myrinet or if I use pgi and
> ethernet, but not PGI and myrinet. This is on an Opteron system running
> Rocks 3.1.0, which is based on RHEL 3.0.
>
> I'm using http://www.lam-mpi.org/download/files/lam-7.0.6.tar.bz2 .
>
> Configuration:
>
> CC=pgcc FC=pgf77 F90=pgf90 CXX=pgCC ./configure
> --prefix=/home/marc/lam-7.0.6.pgi.myrinet/ --with-rsh=/usr/bin/ssh
> --with-rpi=gm --with-gm=/opt/gm
>
> Building LAM:
>
> make
>
> Linking fails because PGI doesn't understand -rpath,
>
> /bin/sh ../../libtool --mode=link pgcc -O -o lamboot lamboot.o
> ../../share/libmpi.la ../../share/liblam.la -laio -lutil -lpthread
> -lpthread
> pgcc -O -o lamboot lamboot.o ../../share/.libs/libmpi.a -L/opt/gm/lib
> /opt/gm/lib/libgm.so ../../share/.libs/liblam.a -laio -lutil -lpthread
> --rpath /opt/gm/lib --rpath /opt/gm/lib
> pgcc-Warning-Unknown switch: --rpath
> pgcc-Warning-Unknown switch: --rpath
> File with unknown suffix passed to linker: /opt/gm/lib
> File with unknown suffix passed to linker: /opt/gm/lib
> /opt/gm/lib: file not recognized: Is a directory
>
> but I can fix that by linking manually:
>
> (cd tools/lamboot/; pgcc -O -o lamboot lamboot.o
> ../../share/.libs/libmpi.a -L/opt/gm/lib /opt/gm/lib/libgm.so
> ../../share/.libs/liblam.a -laio -lutil -lpthread )
>
> or by using "-Wl,--rpath,/opt/gm/lib" instead of "--rpath /opt/gm/lib
> --rpath /opt/gm/lib". This might be a useful change to incorporate in
> future versions of LAM.
>
> make install
>
> Building CPI (from mpich-1.2.5.2/examples):
>
> /home/marc/lam-7.0.6.pgi.myrinet/bin/mpicc cpi.c -o cpi.lam.pgi.myrinet
> -lm
>
> Starting lamboot:
>
> opt310: /home/marc/examples >
> /home/marc/lam-7.0.6.pgi.myrinet//bin/lamboot -v
> /home/marc/examples/hostfile
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<11707> ssi:boot:base:linear: booting n0 (opt310.public)
> n-1<11707> ssi:boot:base:linear: booting n1 (compute-0-0.local)
> n-1<11707> ssi:boot:base:linear: finished
>
> Running cpi:
>
> opt310: /home/marc/examples > /home/marc/lam-7.0.6.pgi.myrinet/bin/mpirun -np 1 ./cpi.lam.pgi.myrinet
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
>
> And when I run it directly instead of through mpirun, it segfaults:
>
> [root_at_opt310 examples]# ./cpi.lam.pgi.myrinet
> Segmentation fault
>
> gdb reports this back trace:
>
> #0 0x000000000041c3b2 in chunk_alloc ()
> #1 0x000000000041c25a in malloc ()
> #2 0x000000000041b827 in malloc_hook_ini ()
> #3 0x000000000041c1b6 in malloc ()
> #4 0x000000000044a364 in lam_arr_init ()
> #5 0x00000000004136a0 in lam_updown_init ()
> #6 0x000000000041372a in lam_setfunc ()
> #7 0x0000000000404966 in MPI_Init ()
> #8 0x0000000000403cd0 in main (argc=1, argv=0x7fbffff868) at cpi.c:20
>
> Similarly, laminfo crashes:
>
> /home/marc/lam-7.0.6.pgi.myrinet/bin/laminfo
> Segmentation fault
>
>
> I've attached the output from configuring and building LAM.
>
> Again, it works just fine when I use these combinations:
> mpich-gcc-myrinet
> mpich-gcc-ethernet
> mpich-pgi-myrinet
> mpich-pgi-ethernet
> lam-gcc-myrinet
> lam-pgi-ethernet
>
> So it's something about the combination of lam, pgi, and myrinet,
> perhaps Opteron-specific. I haven't tried it on Intel systems.
>
>
> <<config.log.gz>>
> <<make.myrinet.pgi.log.gz>>
>
> Thanks.
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/