I'm having trouble running programs built with LAM, Myrinet, and PGI. They work if I use gcc with Myrinet or PGI with Ethernet, but not PGI with Myrinet. This is on an Opteron system running Rocks 3.1.0, which is based on RHEL 3.0.
I'm using http://www.lam-mpi.org/download/files/lam-7.0.6.tar.bz2 .
Configuration:
CC=pgcc FC=pgf77 F90=pgf90 CXX=pgCC ./configure --prefix=/home/marc/lam-7.0.6.pgi.myrinet/ --with-rsh=/usr/bin/ssh --with-rpi=gm --with-gm=/opt/gm
Building LAM:
make
Linking fails because PGI doesn't understand --rpath:
/bin/sh ../../libtool --mode=link pgcc -O -o lamboot lamboot.o ../../share/libmpi.la ../../share/liblam.la -laio -lutil -lpthread -lpthread
pgcc -O -o lamboot lamboot.o ../../share/.libs/libmpi.a -L/opt/gm/lib /opt/gm/lib/libgm.so ../../share/.libs/liblam.a -laio -lutil -lpthread --rpath /opt/gm/lib --rpath /opt/gm/lib
pgcc-Warning-Unknown switch: --rpath
pgcc-Warning-Unknown switch: --rpath
File with unknown suffix passed to linker: /opt/gm/lib
File with unknown suffix passed to linker: /opt/gm/lib
/opt/gm/lib: file not recognized: Is a directory
I can work around that by linking manually:
(cd tools/lamboot/; pgcc -O -o lamboot lamboot.o ../../share/.libs/libmpi.a -L/opt/gm/lib /opt/gm/lib/libgm.so ../../share/.libs/liblam.a -laio -lutil -lpthread )
Alternatively, using "-Wl,--rpath,/opt/gm/lib" instead of "--rpath /opt/gm/lib --rpath /opt/gm/lib" also works. This might be a useful change to incorporate in future versions of LAM.
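As a rough sketch of that substitution (not a patch to LAM or libtool), the snippet below rewrites a link line as a plain string. The link line is a shortened, hypothetical copy of the failing command above, and the variable names are made up for illustration. It assumes a sed that supports -E.

```shell
# Hypothetical link line, shortened from the failing pgcc command above.
link_cmd='pgcc -O -o lamboot lamboot.o -laio -lutil -lpthread --rpath /opt/gm/lib --rpath /opt/gm/lib'

# Rewrite each "--rpath DIR" pair into the single token "-Wl,--rpath,DIR",
# so the compiler driver hands the option to the linker instead of
# trying to interpret it itself.
fixed_cmd=$(printf '%s\n' "$link_cmd" | sed -E 's/(^| )--rpath +([^ ]+)/\1-Wl,--rpath,\2/g')

# Show the rewritten command; nothing is compiled or linked here.
printf '%s\n' "$fixed_cmd"
```

This only demonstrates the flag rewrite on a string. Where the flag would actually need to change in the LAM build is something I haven't tracked down.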
make install
Building CPI (from mpich-1.2.5.2/examples):
/home/marc/lam-7.0.6.pgi.myrinet/bin/mpicc cpi.c -o cpi.lam.pgi.myrinet -lm
Starting lamboot:
opt310: /home/marc/examples > /home/marc/lam-7.0.6.pgi.myrinet//bin/lamboot -v /home/marc/examples/hostfile
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
n-1<11707> ssi:boot:base:linear: booting n0 (opt310.public)
n-1<11707> ssi:boot:base:linear: booting n1 (compute-0-0.local)
n-1<11707> ssi:boot:base:linear: finished
Running cpi:
opt310: /home/marc/examples > /home/marc/lam-7.0.6.pgi.myrinet/bin/mpirun -np 1 ./cpi.lam.pgi.myrinet
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
And when I run it directly instead of through mpirun, it segfaults:
[root@opt310 examples]# ./cpi.lam.pgi.myrinet
Segmentation fault
gdb reports this back trace:
#0 0x000000000041c3b2 in chunk_alloc ()
#1 0x000000000041c25a in malloc ()
#2 0x000000000041b827 in malloc_hook_ini ()
#3 0x000000000041c1b6 in malloc ()
#4 0x000000000044a364 in lam_arr_init ()
#5 0x00000000004136a0 in lam_updown_init ()
#6 0x000000000041372a in lam_setfunc ()
#7 0x0000000000404966 in MPI_Init ()
#8 0x0000000000403cd0 in main (argc=1, argv=0x7fbffff868) at cpi.c:20
Similarly, laminfo crashes:
/home/marc/lam-7.0.6.pgi.myrinet/bin/laminfo
Segmentation fault
I've attached the output from configuring and building LAM.
Again, it works just fine when I use these combinations:
mpich-gcc-myrinet
mpich-gcc-ethernet
mpich-pgi-myrinet
mpich-pgi-ethernet
lam-gcc-myrinet
lam-pgi-ethernet
So it's something about the combination of LAM, PGI, and Myrinet, perhaps Opteron-specific. I haven't tried it on Intel systems.
<<config.log.gz>>
<<make.myrinet.pgi.log.gz>>
Thanks.