On May 25, 2005, at 6:18 AM, Zeljko Sljivancanin wrote:
> I compiled lam-7.1.1 on our opteron cluster with myrinet network.
> I login to the nodes using PBSpro (qsub -I ), and when I try
> lamboot it fails.
> My config.log, and outputs from 'laminfo' and 'lamboot -d' are
> attached.
> I would appreciate very much your suggestions.
Do you have multiple TCP networks on your nodes? And/or do you have
non-uniform routing in your cluster?
It looks like LAM is using one network for connections but expecting
connections to come in from another (that's what the "unexpected
connection" messages mean).
Here's an abbreviated version of your output:
> [snipped]
> n-1<30185> ssi:boot:select: selected boot module tm
> n-1<30185> ssi:boot:tm: found the following 4 hosts:
> n-1<30185> ssi:boot:tm: n0 node001 (cpu=1)
> n-1<30185> ssi:boot:tm: n1 node002 (cpu=1)
> n-1<30185> ssi:boot:tm: n2 node003 (cpu=1)
> n-1<30185> ssi:boot:tm: n3 node004 (cpu=1)
> n-1<30185> ssi:boot:tm: starting RTE procs
> n-1<30185> ssi:boot:base:linear_windowed: starting
> n-1<30185> ssi:boot:base:linear_windowed: window size: 5
> n-1<30185> ssi:boot:base:server: opening server TCP socket
> n-1<30185> ssi:boot:base:server: opened port 54279
> n-1<30185> ssi:boot:base:linear_windowed: booting n0 (node001)
> n-1<30185> ssi:boot:tm: starting wipe on (node001)
> n-1<30185> ssi:boot:tm: starting on n0 (node001):
> /home/sljivanc/local/bin/tkill -setsid -d
> n-1<30185> ssi:boot:tm: successfully launched on n0 (node001)
> n-1<30185> ssi:boot:tm: waiting for completion on n0 (node001)
> [snipped]
> n-1<30185> ssi:boot:base:linear_windowed: finished launching
> n-1<30185> ssi:boot:base:server: expecting connection from finite list
> n-1<30185> ssi:boot:base:server: got connection from 10.2.2.1
> n-1<30185> ssi:boot:base:server: this connection is expected (n0)
> n-1<30185> ssi:boot:base:server: remote lamd is at 10.2.2.1:32820
> n-1<30185> ssi:boot:base:server: expecting connection from finite list
> n-1<30185> ssi:boot:base:server: got connection from 10.2.2.2
> n-1<30185> ssi:boot:base:server: unexpected connection; dropping
> n-1<30185> ssi:boot:base:server: got connection from 10.2.2.3
> n-1<30185> ssi:boot:base:server: unexpected connection; dropping
> n-1<30185> ssi:boot:base:server: got connection from 10.2.2.4
> n-1<30185> ssi:boot:base:server: unexpected connection; dropping
I'm guessing that the "expected" connection is from the localhost, but
the others are from the unexpected network.
Specifically, LAM resolves the names provided by PBS (node001, node002,
etc.) and expects connections to come from exactly those IP addresses. But
if you have non-uniform routing in your cluster (e.g., FNN kinds of
things from the University of Kentucky), it is possible that the
connections will take a different route and appear to be coming from a
different IP address.
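If it helps to check this by hand, here is a rough sh sketch of the comparison described above. It is not LAM's code: classify_peer and resolve_ipv4 are hypothetical helper names, and resolve_ipv4 assumes a glibc system where "getent ahostsv4" is available.

```shell
#!/bin/sh
# Sketch only: mimics the "expected connection" check described above.
# classify_peer and resolve_ipv4 are hypothetical helpers, not part of LAM.

# classify_peer PEER_IP EXPECTED_IP...
# Prints "expected" if PEER_IP matches one of the expected addresses,
# otherwise prints "unexpected" and returns non-zero.
classify_peer() {
    peer=$1
    shift
    for addr in "$@"; do
        if [ "$peer" = "$addr" ]; then
            echo "expected"
            return 0
        fi
    done
    echo "unexpected"
    return 1
}

# resolve_ipv4 HOSTNAME
# Prints the first IPv4 address the resolver returns for HOSTNAME.
# Assumes getent is installed; prints nothing if the lookup fails.
resolve_ipv4() {
    getent ahostsv4 "$1" 2>/dev/null | awk 'NR == 1 { print $1 }'
}

# Example use on the node where lamboot runs (not executed here):
#   expected=$(for h in node001 node002 node003 node004; do resolve_ipv4 "$h"; done)
#   classify_peer 10.2.2.2 $expected
```

If the addresses that node001 through node004 resolve to do not include the 10.2.2.x addresses shown in the lamboot output, that would be consistent with the guess above.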
LAM allows you to disregard the "expected connection" list by setting
the "boot_base_promisc" SSI parameter. For example:
lamboot -ssi boot_base_promisc 1
This should fix the problem that you're seeing. If you want to set
this cluster-wide (so that users don't need to include "-ssi
boot_base_promisc 1" in all their lamboot command lines), you can add a
shell startup script in /etc/profile.d (or wherever your default shell
startup scripts are) and set this as an environment variable:
csh: setenv LAM_MPI_SSI_boot_base_promisc 1
sh: export LAM_MPI_SSI_boot_base_promisc=1
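As a sketch of what the sh-style startup file might look like (the file name /etc/profile.d/lam-promisc.sh is only an assumption; use whatever your site's conventions are), this variant leaves alone any non-empty value a user has already set:

```shell
# Hypothetical file: /etc/profile.d/lam-promisc.sh (the name is an assumption)
# Sourced by sh-compatible login shells.
# Keep any non-empty value the user has already set; otherwise default to 1.
: "${LAM_MPI_SSI_boot_base_promisc:=1}"
export LAM_MPI_SSI_boot_base_promisc
```

csh users would need a separate file using the setenv form shown above.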
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/