LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2005-09-30 11:06:51


On Sep 30, 2005, at 10:12 AM, John Ouellette wrote:

> We're currently running a 2.6.9 vanilla kernel patched with the bproc
> 4.0.0pre8 patches and libraries. lam-7.1.1 compiles and boots fine, but
> the lamtest suite fails one set of tests with the following errors:
>
> mpirun -x TEST -s h C -ssi rpi gm /home/ouellet/lamtests-7.1.1/dynamic/./comm_join
> [**ERROR**]: LAM/MPI MPI_COMM_WORLD rank 1, file comm_join.c:129:
> ERROR: Client could not gethostbyname properly

Yeah, this is an error with the test itself, not with LAM/MPI. The
test resolves host names with gethostbyname() (the call named in the
error), which doesn't work well on BProc clusters. My guess is that
all the dynamic tests are going to fail, because they all do some
really nasty things to verify correctness that probably won't work
on the BProc compute nodes.
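
To make the failure mode concrete, here is a minimal sketch of that
kind of check. It is not the code from comm_join.c; it assumes the
test takes the local name from gethostname() and resolves it with
gethostbyname(). The names host_check, name_resolves and
check_local_host are made up for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <netdb.h>
#include <stddef.h>
#include <unistd.h>

/* Result codes for the sketch (hypothetical, not from lamtests). */
enum host_check {
    HOST_OK = 0,        /* local name found and resolved */
    HOST_NO_NAME = 1,   /* gethostname() itself failed */
    HOST_NO_LOOKUP = 2  /* name found, but the resolver rejected it */
};

/* Returns 1 if the system resolver can resolve name, else 0. */
int name_resolves(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return 0;
    return gethostbyname(name) != NULL;
}

/* Fetch the local host name, then check that it resolves. */
enum host_check check_local_host(void)
{
    char name[256];

    if (gethostname(name, sizeof name) != 0)
        return HOST_NO_NAME;
    name[sizeof name - 1] = '\0';  /* may be unterminated if truncated */

    if (!name_resolves(name))
        return HOST_NO_LOOKUP;
    return HOST_OK;
}
```

On a node whose resolver has no entry for the local name,
check_local_host() in this sketch returns HOST_NO_LOOKUP, which is the
same kind of failure as the "could not gethostbyname" message above.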

> When we try to run a parallel code in the lam environment, the process
> on the first node hangs -- it doesn't die, it just doesn't do anything
> -- while the processes on the other nodes quickly get to a state where
> they are waiting for the first process to return a result. The process
> on the first node also can't be killed: the node must be rebooted to
> get rid of it.

Interesting... I know it's a pain on BProc, but can you get a stack
trace on the first node? I can't think of anything we'd be doing
that would cause this, and I certainly haven't seen it before, but
the bulk of our testing is over Ethernet rather than Myrinet.

> Actually, as I was writing this, I found a solution to the above error
> with the lamtest suite: I manually added the nodes I was using to the
> hosts file which is exported to the bproc nodes (and updated the
> nsswitch.conf file). This cleared the gethostbyname error, but not the
> problem with the process on the first node in the LAM environment.

That makes sense - with the nodes listed in the hosts file,
gethostbyname() can resolve them, so the lookup in the test succeeds.

> The one bproc related error in the LAM configuration is the following:
>
> configure:25972: checking for bproc_getnodebyname in -lbproc
> configure:26002: icc -o conftest -O3 -I/usr/local/include -pthread
> -DLAM_BUILDING=1 -L/usr/local/lib conftest.c -lbproc -lbproc >&5
> /tmp/iccueYKUr.o(.text+0x12): In function `main':
> : undefined reference to `bproc_getnodebyname'
>
> Is it possible that the version of bproc we're using (4.0.0pre8) is
> newer than the version with which LAM is compatible?

I believe that we should be fine with BProc 4.0pre8. The test above
only determines whether we have bproc_getnodebyname or not, so the
undefined reference is not a build problem. If we don't have it, we
use the BProc 4 method for getting the information.
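
For anyone unfamiliar with this kind of configure check, the pattern is
roughly the following. This is only a sketch of compile-time fallback,
not LAM's source: the macro name HAVE_BPROC_GETNODEBYNAME and the
functions below are assumptions for illustration, and no real BProc
calls are made.

```c
#include <string.h>

/* Which lookup path was compiled in (names are hypothetical). */
enum lookup_path {
    PATH_GETNODEBYNAME,  /* link test for bproc_getnodebyname passed */
    PATH_BPROC4          /* link test failed; use the BProc 4 method */
};

/*
 * configure would define the (hypothetical) macro only if the small
 * test program linked against -lbproc. If the link fails with an
 * undefined reference, the macro stays undefined and the fallback
 * branch is compiled instead.
 */
enum lookup_path selected_lookup_path(void)
{
#ifdef HAVE_BPROC_GETNODEBYNAME
    return PATH_GETNODEBYNAME;
#else
    return PATH_BPROC4;
#endif
}

/* Human-readable label for the chosen path. */
const char *describe_path(enum lookup_path p)
{
    return p == PATH_GETNODEBYNAME ? "bproc_getnodebyname"
                                   : "BProc 4 method";
}
```

Built without the macro defined, as would happen after the failed link
test in your log, selected_lookup_path() returns PATH_BPROC4.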

Brian