
LAM/MPI General User's Mailing List Archives


From: John Ouellette (ouellet_at_[hidden])
Date: 2005-09-30 10:12:03


Hi -- I'm trying to get LAM/MPI to work with a BProc-based cluster. Although
I am able to get LAM to compile and boot using the bproc module, I am unable
to get any parallel jobs to run.

Our cluster has 128 nodes connected with Myrinet. We're currently using
MPICH-GM and the GCC compilers, but are switching to the Intel compilers
because we need a fully functional Fortran 90 compiler. I've had problems
getting MPICH to compile properly using the Intel (v9.0) compilers, and bproc
patches are not available for the newest version of MPICH-GM, so LAM seemed
to be the natural choice. For our C-based codes, MPICH and GCC seem to work
fine.

We're currently running a 2.6.9 vanilla kernel patched with the bproc
4.0.0pre8 patches and libraries. lam-7.1.1 compiles and boots fine, but the
lamtest suite fails one set of tests with the following errors:

mpirun -x TEST -s h C -ssi rpi gm /home/ouellet/lamtests-7.1.1/dynamic/./comm_join
[**ERROR**]: LAM/MPI MPI_COMM_WORLD rank 1, file comm_join.c:129:
ERROR: Client could not gethostbyname properly
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 5284 failed on node n1 (10.0.0.35) with exit status 1.

The test always fails on the first node of the lambooted nodes, regardless of
which physical nodes are used. The nodes all get their names (and each
other's) directly from bproc, so this should be working....

When we try to run a parallel code in the lam environment, the process on the
first node hangs -- it doesn't die, it just doesn't do anything -- while the
processes on the other nodes quickly get to a state where they are waiting
for the first process to return a result. The process on the first node also
can't be killed: the node must be rebooted to get rid of it.

Actually, as I was writing this, I found a solution to the above error with
the lamtest suite: I manually added the nodes I was using to the hosts file
which is exported to the bproc nodes (and updated the nsswitch.conf file).
This cleared the gethostbyname error, but not the problem with the process on
the first node in the LAM environment.

The one bproc-related error in the LAM configuration is the following:

configure:25972: checking for bproc_getnodebyname in -lbproc
configure:26002: icc -o conftest -O3 -I/usr/local/include -pthread
-DLAM_BUILDING=1 -L/usr/local/lib conftest.c -lbproc -lbproc >&5
/tmp/iccueYKUr.o(.text+0x12): In function `main':
: undefined reference to `bproc_getnodebyname'

Is it possible that the version of bproc we're using (4.0.0pre8) is newer than
the version with which LAM is compatible?

Thanks for any assistance,
John Ouellette

  

-- 
+++++++++++++++++++++++++++++++++++
John Ouellette
Department of Astrophysics
American Museum of Natural History
Ph: 212-313-7919 Fax: 212-769-5007