We iterated off-list a little to solve this problem.
Short version:
--------------
1. There is a problem with LAM 6.5.x and 7.0 if you try to use a unix
socket with a really long pathname. It rarely/never happens in 6.5.x, but
can happen in 7.0 in environments where, for example, $TMPDIR has a very
long value.
2. There are workarounds for both versions:
- in 6.5.x, set $LAM_MPI_SOCKET_SUFFIX to something shorter.
- in 7.0, set $LAM_MPI_SESSION_PREFIX to something shorter. This will
take precedence over $TMPDIR.
6.5.x is unlikely to be fixed (since it is officially retired). Fixes for
7.0 have been committed to CVS. Since there is a workaround, and since
this bug affects so few people, we do not plan to release 7.0.1 because of
it. Future versions of LAM/MPI will include this fix.
Longer version:
---------------
There is a maximum pathname length for unix socket names that is
significantly shorter than MAXPATHLEN. The limit is the size of the
sun_path member of struct sockaddr_un; on recent Linux systems, for
example, that is 108 bytes, including the terminating NUL. If the
pathname of the socket you're trying to open is over that length,
binding to the socket will fail.
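
To make the limit concrete, here is a minimal sketch of a length check, assuming the relevant bound is the size of sun_path in struct sockaddr_un. The function names are made up for illustration; they are not part of LAM/MPI.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Capacity of sun_path in bytes, including the terminating NUL.
 * The value is platform-specific, so it is computed, not hard-coded. */
size_t unix_socket_path_capacity(void)
{
    return sizeof(((struct sockaddr_un *)0)->sun_path);
}

/* Hypothetical helper: returns 1 if `path` plus its NUL terminator
 * fits in sun_path, 0 if bind() could not even be given the name. */
int unix_socket_path_fits(const char *path)
{
    return strlen(path) < unix_socket_path_capacity();
}
```

Passing the kernel socket pathname from the lamboot -d output below to unix_socket_path_fits() is a quick way to see whether a given $TMPDIR value is too long on a particular system.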
Starting with 7.0, LAM/MPI uses $TMPDIR as one of the factors determining
where LAM's session directory should be located. The problem can happen,
for example, in PBS Pro environments with LAM/MPI 7.0 where TMPDIR is
automatically set by PBS Pro to a sufficiently long value (due, in part,
to a sufficiently long hostname of the PBS Pro server). It results in
exactly what was noted by Jon Bernard in his post listed below.
The workaround for users is simply to set LAM_MPI_SESSION_PREFIX to
something else (i.e., something shorter than the too-long $TMPDIR value
set by PBS Pro). Using $LAM_MPI_SESSION_PREFIX will override the use of
$TMPDIR, and therefore the problem will be avoided.
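
The precedence described above can be sketched roughly as follows. This is not LAM's actual code: the function names are hypothetical, and both the "/tmp" fallback and the treatment of an empty value as unset are assumptions made for the example.

```c
#include <stdlib.h>

/* Return the first non-empty candidate: the session prefix first,
 * then TMPDIR, then a fallback directory. */
const char *choose_session_base(const char *session_prefix, const char *tmpdir)
{
    if (session_prefix != NULL && session_prefix[0] != '\0')
        return session_prefix;      /* takes precedence over TMPDIR */
    if (tmpdir != NULL && tmpdir[0] != '\0')
        return tmpdir;
    return "/tmp";                  /* assumed fallback, not from the LAM sources */
}

/* Convenience wrapper that reads the two environment variables
 * named in this mail. */
const char *session_base_from_env(void)
{
    return choose_session_base(getenv("LAM_MPI_SESSION_PREFIX"),
                               getenv("TMPDIR"));
}
```

With LAM_MPI_SESSION_PREFIX set to a short directory, the long PBS Pro value of TMPDIR is never consulted.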
The fix within LAM/MPI itself is to chdir() to the target directory and
open the relative unix socket filename from there. The filename that LAM
uses (without the path) is guaranteed to be short enough not to cause
problems. This fix enables even very long values in $TMPDIR to work
properly (e.g., even in the case noted in the mail below). This fix has
been committed to CVS for the 7.x series. 6.5.x is officially retired,
and will not be fixed.
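
For the curious, the chdir() technique looks roughly like the sketch below. It is an illustration under stated assumptions, not the code committed to CVS: the function name is hypothetical, and saving and restoring the previous working directory is an addition for the example that the fix itself may not do.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Bind a unix socket called `name` inside directory `dir` by changing
 * into `dir` first, so only the short relative name has to fit in
 * sun_path. Returns the bound socket fd, or -1 with errno set. */
int bind_unix_socket_in_dir(const char *dir, const char *name)
{
    struct sockaddr_un addr;
    int fd, saved_cwd, rc = -1, saved_errno;

    if (strlen(name) >= sizeof(addr.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Remember where we were so the caller's cwd is left unchanged. */
    saved_cwd = open(".", O_RDONLY);
    if (saved_cwd < 0)
        return -1;
    if (chdir(dir) != 0) {
        saved_errno = errno;
        close(saved_cwd);
        errno = saved_errno;
        return -1;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd >= 0) {
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strcpy(addr.sun_path, name);    /* length checked above */
        rc = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    }
    saved_errno = errno;

    /* Go back; a failure to restore the cwd is reported as an error
     * (the socket file, if created, is left in place in that case). */
    if (fchdir(saved_cwd) != 0 && rc == 0) {
        rc = -1;
        saved_errno = errno;
    }
    close(saved_cwd);

    if (rc != 0) {
        if (fd >= 0)
            close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}
```

Because bind() only ever sees the relative name, the length of the directory path no longer counts against the sun_path limit.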
Thanks to Jon Bernard for making us aware of this problem.
On Wed, 9 Jul 2003, Jon B Bernard wrote:
> I've just built lam-7.0 on a RedHat 7.3 system on which we've happily
> been using 6.5.9 for months. I'm having trouble getting things to work,
> however: lamboot fails with
>
> lamd kernel: problem with bind(): Invalid argument
>
> The output of laminfo and lamboot -d follows. There are no filters or
> routing problems, and telnet 172.20.3.57 33577 gives me a connection
> refused error.
>
> Thanks,
> Jon Bernard
>
> LAM/MPI: 7.0
> Prefix: /usr/local/lam/7.0/gnu/ssh
> Architecture: i686-pc-linux-gnu
> Configured by: root
> Configured on: Tue Jul 8 16:07:45 CDT 2003
> Configure host: cahaba
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (Module v0.5)
> SSI boot: rsh (Module v1.0)
> SSI coll: lam_basic (Module v7.0)
> SSI coll: smp (Module v1.0)
> SSI rpi: crtcp (Module v1.0)
> SSI rpi: lamd (Module v7.0)
> SSI rpi: sysv (Module v7.0)
> SSI rpi: tcp (Module v7.0)
> SSI rpi: usysv (Module v7.0)
>
>
> n0<31675> ssi:boot: Opening
> n0<31675> ssi:boot: opening module globus
> n0<31675> ssi:boot: initializing module globus
> n0<31675> ssi:boot:globus: globus-job-run not found, globus boot will
> not run
> n0<31675> ssi:boot: module not available: globus
> n0<31675> ssi:boot: opening module rsh
> n0<31675> ssi:boot: initializing module rsh
> n0<31675> ssi:boot:rsh: module initializing
> n0<31675> ssi:boot:rsh:agent: ssh -x
> n0<31675> ssi:boot:rsh:username: <same>
> n0<31675> ssi:boot:rsh:verbose: 1000
> n0<31675> ssi:boot:rsh:algorithm: linear
> n0<31675> ssi:boot:rsh:priority: 10
> n0<31675> ssi:boot: module available: rsh, priority: 10
> n0<31675> ssi:boot: finalizing module globus
> n0<31675> ssi:boot:globus: finalizing
> n0<31675> ssi:boot: closing module globus
> n0<31675> ssi:boot: Selected boot module rsh
> n0<31675> ssi:boot:base: looking for boot schema in following
> directories:
> n0<31675> ssi:boot:base: <current directory>
> n0<31675> ssi:boot:base: $TROLLIUSHOME/etc
> n0<31675> ssi:boot:base: $LAMHOME/etc
> n0<31675> ssi:boot:base: /usr/local/lam/7.0/gnu/ssh/etc
> n0<31675> ssi:boot:base: looking for boot schema file:
> n0<31675> ssi:boot:base:
> /var/spool/PBS/5.3.2/aux/13446.cahaba.cahaba.eng.uab.edu
> n0<31675> ssi:boot:base: found boot schema:
> /var/spool/PBS/5.3.2/aux/13446.cahaba.cahaba.eng.uab.edu
> n0<31675> ssi:boot:rsh: found the following hosts:
> n0<31675> ssi:boot:rsh: n0 node57 (cpu=2)
> n0<31675> ssi:boot:rsh: n1 node3 (cpu=2)
> n0<31675> ssi:boot:rsh: n2 node46 (cpu=2)
> n0<31675> ssi:boot:rsh: n3 node47 (cpu=2)
> n0<31675> ssi:boot:rsh: n4 node53 (cpu=2)
> n0<31675> ssi:boot:rsh: n5 node51 (cpu=2)
> n0<31675> ssi:boot:rsh: n6 node48 (cpu=2)
> n0<31675> ssi:boot:rsh: n7 node58 (cpu=2)
> n0<31675> ssi:boot:rsh: resolved hosts:
> n0<31675> ssi:boot:rsh: n0 node57 --> 172.20.3.57 (origin)
> n0<31675> ssi:boot:rsh: n1 node3 --> 172.20.3.3
> n0<31675> ssi:boot:rsh: n2 node46 --> 172.20.3.46
> n0<31675> ssi:boot:rsh: n3 node47 --> 172.20.3.47
> n0<31675> ssi:boot:rsh: n4 node53 --> 172.20.3.53
> n0<31675> ssi:boot:rsh: n5 node51 --> 172.20.3.51
> n0<31675> ssi:boot:rsh: n6 node48 --> 172.20.3.48
> n0<31675> ssi:boot:rsh: n7 node58 --> 172.20.3.58
> n0<31675> ssi:boot:rsh: starting RTE procs
> n0<31675> ssi:boot:base:linear: starting
> n0<31675> ssi:boot:base:server: opening server TCP socket
> n0<31675> ssi:boot:base:server: opened port 33577
> n0<31675> ssi:boot:base:linear: booting n0 (node57)
> n0<31675> ssi:boot:rsh: starting lamd on (node57)
> n0<31675> ssi:boot:rsh: starting on n0 (node57): hboot -t -c
> lam-conf.lamd -d -sessionsuffix pbs-13446.cahaba.cahaba.eng.uab.edu -I
> -H 172.20.3.57 -P 33577 -n 0 -o 0
> n0<31675> ssi:boot:rsh: launching locally
> tkill: setting prefix to (null)
> tkill: setting suffix to pbs-13446.cahaba.cahaba.eng.uab.edu
> tkill: got killname back:
> /tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahaba
> .cahaba.eng.uab.edu/lam-killfile
> tkill: removing socket file ...
> tkill: socket file:
> /tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahaba
> .cahaba.eng.uab.edu/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file:
> /tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahaba
> .cahaba.eng.uab.edu/lam-io-socket
> tkill: f_kill =
> "/tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahab
> a.cahaba.eng.uab.edu/lam-killfile"
> tkill: nothing to kill:
> "/tmp/pbs.13446.cahaba.cahaba.eng.uab.edu/lam-jon_at_node57-pbs-13446.cahab
> a.cahaba.eng.uab.edu/lam-killfile"
> hboot: performing tkill
> hboot: tkill -sessionsuffix pbs-13446.cahaba.cahaba.eng.uab.edu -d
> hboot: booting...
> hboot: fork /usr/local/lam/7.0/gnu/ssh/bin/lamd
> [1] 31679 lamd -H 172.20.3.57 -P 33577 -n 0 -o 0 -d -sessionsuffix
> pbs-13446.cahaba.cahaba.eng.uab.edu
> n0<31675> ssi:boot:rsh: successfully launched on n0 (node57)
> n0<31675> ssi:boot:base:server: expecting connection from finite list
> lamd kernel: problem with bind(): Invalid argument
> n0<31675> ssi:boot:base:server: got connection from 144.155.5.8
> ------------------------------------------------------------------------
> -----
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back. Possible reasons that this may happen:
>
> - There are network filters between the lamboot agent host and
> the remote host such that communication on random TCP ports
> is blocked
> - Network routing from the remote host to the local host isn't
> properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
> one is the IP address of where the lamboot agent was invoked, the
> other is the port number that the lamboot agent is expecting the
> newly-booted process to call back on (this will be a random
> integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
> indicated on the hboot command line. For example,
> telnet <ipnumber> <portnumber>
> If all goes well, you should get a "Connection refused" error. If
> you get any other kind of error, it could indicate either of the
> two conditions above. Consult with your system/network
> administrator.
> ------------------------------------------------------------------------
> -----
> n0<31675> ssi:boot:base:server: failed to connect to remote lamd!
> n0<31675> ssi:boot:base:server: closing server socket
> n0<31675> ssi:boot:base:linear: aborted!
> ------------------------------------------------------------------------
> -----
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> ------------------------------------------------------------------------
> -----
> lamboot did NOT complete successfully
>
> LAM 7.0/MPI 2 C++/ROMIO - Indiana University
>
> lamboot: wipe -- nothing to do
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/