LAM/MPI General User's Mailing List Archives

From: Javier Martínez de Pisón Ascacíbar (fjmartin_at_[hidden])
Date: 2006-02-02 11:21:13


Hi, LAM community,

I have a lamboot problem with LAM v7.0.4 (the version that ABAQUS 6.5.1
requires) on 3 machines (xeon2, xeon3, apidell) running SUSE 9.3.

I believe LAM is correctly installed on all 3 machines.

Access via "rsh" works in every direction without prompting for a
password.

Firewalls (iptables) are stopped.

But "lamboot" doesn't start. What can I do? What is happening?

For example, I tried "telnet 192.168.0.28 4271" from xeon3 and it
seems to work as expected: I got a "Connection refused" error.
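
The same check can be scripted instead of typed into telnet. What follows is only a minimal sketch: the function name probe and the 5-second timeout are illustrative choices, not anything that LAM provides. It makes one TCP connection attempt and reports how it ended. According to the lamboot help text, "refused" is the healthy answer when nothing is listening on the port, while a timeout or another error may point to filtering or routing trouble.

```python
import socket


def probe(host, port, timeout=5.0):
    """Make one TCP connection attempt and classify how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"        # something is listening on the port
    except ConnectionRefusedError:
        return "refused"         # host reachable, nothing listening
    except socket.timeout:
        return "timeout"         # no answer at all within the timeout
    except OSError as exc:
        return "error: %s" % exc  # e.g. no route to host


# Example, run on xeon3 with the address and port from the hboot command line:
#   print(probe("192.168.0.28", 4271))
```

Running it from every node towards the lamboot host, and back again, covers each direction that the manual telnet test would.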

This is my lamboot report:

> lamboot -v hostpar -d
n0<19818> ssi:boot: Opening
n0<19818> ssi:boot: opening module globus
n0<19818> ssi:boot: initializing module globus
n0<19818> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<19818> ssi:boot: module not available: globus
n0<19818> ssi:boot: opening module rsh
n0<19818> ssi:boot: initializing module rsh
n0<19818> ssi:boot:rsh: module initializing
n0<19818> ssi:boot:rsh:agent: rsh
n0<19818> ssi:boot:rsh:username: <same>
n0<19818> ssi:boot:rsh:verbose: 1000
n0<19818> ssi:boot:rsh:algorithm: linear
n0<19818> ssi:boot:rsh:priority: 10
n0<19818> ssi:boot: module available: rsh, priority: 10
n0<19818> ssi:boot: finalizing module globus
n0<19818> ssi:boot:globus: finalizing
n0<19818> ssi:boot: closing module globus
n0<19818> ssi:boot: Selected boot module rsh

LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University

n0<19818> ssi:boot:base: looking for boot schema in following directories:
n0<19818> ssi:boot:base: <current directory>
n0<19818> ssi:boot:base: $TROLLIUSHOME/etc
n0<19818> ssi:boot:base: $LAMHOME/etc
n0<19818> ssi:boot:base: /opt/lam/etc
n0<19818> ssi:boot:base: looking for boot schema file:
n0<19818> ssi:boot:base: hostpar
n0<19818> ssi:boot:base: found boot schema: hostpar
n0<19818> ssi:boot:rsh: found the following hosts:
n0<19818> ssi:boot:rsh: n0 xeon2 (cpu=2)
n0<19818> ssi:boot:rsh: n1 xeon3 (cpu=2)
n0<19818> ssi:boot:rsh: n2 apidell (cpu=2)
n0<19818> ssi:boot:rsh: resolved hosts:
n0<19818> ssi:boot:rsh: n0 xeon2 --> 192.168.0.28 (origin)
n0<19818> ssi:boot:rsh: n1 xeon3 --> 192.168.0.38
n0<19818> ssi:boot:rsh: n2 apidell --> 192.168.0.220
n0<19818> ssi:boot:rsh: starting RTE procs
n0<19818> ssi:boot:base:linear: starting
n0<19818> ssi:boot:base:server: opening server TCP socket
n0<19818> ssi:boot:base:server: opened port 4271
n0<19818> ssi:boot:base:linear: booting n0 (xeon2)
n0<19818> ssi:boot:rsh: starting lamd on (xeon2)
n0<19818> ssi:boot:rsh: starting on n0 (xeon2): hboot -t -c lam-conf.lamd -d -v -I -H 192.168.0.28 -P 4271 -n 0 -o 0
n0<19818> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-abaquspar_at_xeon2/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-abaquspar_at_xeon2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-abaquspar_at_xeon2/lam-io-socket
tkill: f_kill = "/tmp/lam-abaquspar_at_xeon2/lam-killfile"
tkill: nothing to kill: "/tmp/lam-abaquspar_at_xeon2/lam-killfile"
hboot: booting...
hboot: fork /opt/lam/bin/lamd
hboot: attempting to execute
[1] 19821 lamd -H 192.168.0.28 -P 4271 -n 0 -o 0 -d
n0<19818> ssi:boot:rsh: successfully launched on n0 (xeon2)
n0<19818> ssi:boot:base:server: expecting connection from finite list
n-1<19821> ssi:boot: Opening
n-1<19821> ssi:boot: opening module globus
n-1<19821> ssi:boot: initializing module globus
n-1<19821> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<19821> ssi:boot: module not available: globus
n-1<19821> ssi:boot: opening module rsh
n-1<19821> ssi:boot: initializing module rsh
n-1<19821> ssi:boot:rsh: module initializing
n-1<19821> ssi:boot:rsh:agent: rsh
n-1<19821> ssi:boot:rsh:username: <same>
n-1<19821> ssi:boot:rsh:verbose: 1000
n-1<19821> ssi:boot:rsh:algorithm: linear
n-1<19821> ssi:boot:rsh:priority: 10
n-1<19821> ssi:boot: module available: rsh, priority: 10
n-1<19821> ssi:boot: finalizing module globus
n-1<19821> ssi:boot:globus: finalizing
n-1<19821> ssi:boot: closing module globus
n-1<19821> ssi:boot: Selected boot module rsh
n0<19818> ssi:boot:base:server: got connection from 192.168.0.28
n0<19818> ssi:boot:base:server: this connection is expected (n0)
n0<19818> ssi:boot:base:server: remote lamd is at 192.168.0.28:10919
n0<19818> ssi:boot:base:linear: booting n1 (xeon3)
n0<19818> ssi:boot:rsh: starting lamd on (xeon3)
n0<19818> ssi:boot:rsh: starting on n1 (xeon3): hboot -t -c lam-conf.lamd -d -v -s -I "-H 193.146.235.28 -P 4271 -n 1 -o 0"
n0<19818> ssi:boot:rsh: launching remotely
n0<19818> ssi:boot:rsh: attempting to execute "rsh xeon3 -n echo $SHELL"
n0<19818> ssi:boot:rsh: remote shell /bin/bash
n0<19818> ssi:boot:rsh: attempting to execute "rsh xeon3 -n hboot -t -c lam-conf.lamd -d -v -s -I "-H 192.168.0.28 -P 4271 -n 1 -o 0""
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-abaquspar_at_xeon3/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-abaquspar_at_xeon3/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-abaquspar_at_xeon3/lam-io-socket
tkill: f_kill = "/tmp/lam-abaquspar_at_xeon3/lam-killfile"
tkill: nothing to kill: "/tmp/lam-abaquspar_at_xeon3/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /opt/lam/bin/lamd
[1] 22404 lamd -H 192.168.0.28 -P 4271 -n 1 -o 0 -d
n0<19818> ssi:boot:rsh: successfully launched on n1 (xeon3)
n0<19818> ssi:boot:base:server: expecting connection from finite list
n0<19818> ssi:boot:base:server: got connection from 192.168.0.39
n0<19818> ssi:boot:base:server: unexpected connection; dropping
n0<19818> ssi:boot:base:server: got connection from 192.168.0.39
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

- There are network filters between the lamboot agent host and
the remote host such that communication on random TCP ports
is blocked
- Network routing from the remote host to the local host isn't
properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
one is the IP address of where the lamboot agent was invoked, the
other is the port number that the lamboot agent is expecting the
newly-booted process to call back on (this will be a random
integer).

2. Manually login to the remote machine and try to telnet to the port
indicated on the hboot command line. For example,
telnet <ipnumber> <portnumber>
If all goes well, you should get a "Connection refused" error. If
you get any other kind of error, it could indicate either of the
two conditions above. Consult with your system/network
administrator.
-----------------------------------------------------------------------------
n0<19818> ssi:boot:base:server: failed to connect to remote lamd!
n0<19818> ssi:boot:base:server: closing server socket
n0<19818> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n0<19824> ssi:boot: Opening
n0<19824> ssi:boot: opening module globus
n0<19824> ssi:boot: initializing module globus
n0<19824> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<19824> ssi:boot: module not available: globus
n0<19824> ssi:boot: opening module rsh
n0<19824> ssi:boot: initializing module rsh
n0<19824> ssi:boot:rsh: module initializing
n0<19824> ssi:boot:rsh:agent: rsh
n0<19824> ssi:boot:rsh:username: <same>
n0<19824> ssi:boot:rsh:verbose: 1000
n0<19824> ssi:boot:rsh:algorithm: linear
n0<19824> ssi:boot:rsh:priority: 10
n0<19824> ssi:boot: module available: rsh, priority: 10
n0<19824> ssi:boot: finalizing module globus
n0<19824> ssi:boot:globus: finalizing
n0<19824> ssi:boot: closing module globus
n0<19824> ssi:boot: Selected boot module rsh
n0<19824> ssi:boot:base: looking for boot schema in following directories:
n0<19824> ssi:boot:base: <current directory>
n0<19824> ssi:boot:base: $TROLLIUSHOME/etc
n0<19824> ssi:boot:base: $LAMHOME/etc
n0<19824> ssi:boot:base: /opt/lam/etc
n0<19824> ssi:boot:base: looking for boot schema file:
n0<19824> ssi:boot:base: hostpar
n0<19824> ssi:boot:base: found boot schema: hostpar
n0<19824> ssi:boot:rsh: found the following hosts:
n0<19824> ssi:boot:rsh: n0 xeon2 (cpu=2)
n0<19824> ssi:boot:rsh: n1 xeon3 (cpu=2)
n0<19824> ssi:boot:rsh: n2 apidell (cpu=2)
n0<19824> ssi:boot:rsh: resolved hosts:
n0<19824> ssi:boot:rsh: n0 xeon2 --> 192.168.0.28 (origin)
n0<19824> ssi:boot:rsh: n1 xeon3 --> 192.168.0.38
n0<19824> ssi:boot:rsh: n2 apidell --> 192.168.0.220
n0<19824> ssi:boot:rsh: starting RTE procs
n0<19824> ssi:boot:base:linear: starting
n0<19824> ssi:boot:base:linear: booting n0 (xeon2)
n0<19824> ssi:boot:rsh: starting wipe on (xeon2)
n0<19824> ssi:boot:rsh: starting on n0 (xeon2): tkill -d -v
n0<19824> ssi:boot:rsh: launching locally
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-abaquspar_at_xeon2/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-abaquspar_at_xeon2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-abaquspar_at_xeon2/lam-io-socket
tkill: f_kill = "/tmp/lam-abaquspar_at_xeon2/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 19821 ...
tkill: killed
tkill: all finished
n0<19824> ssi:boot:rsh: successfully launched on n0 (xeon2)
n0<19824> ssi:boot:base:linear: booting n1 (xeon3)
n0<19824> ssi:boot:rsh: starting wipe on (xeon3)
n0<19824> ssi:boot:rsh: starting on n1 (xeon3): tkill -d -v
n0<19824> ssi:boot:rsh: launching remotely
n0<19824> ssi:boot:rsh: attempting to execute "rsh xeon3 -n echo $SHELL"
n0<19824> ssi:boot:rsh: remote shell /bin/bash
n0<19824> ssi:boot:rsh: attempting to execute "rsh xeon3 -n tkill -d -v"
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-abaquspar_at_xeon3/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-abaquspar_at_xeon3/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-abaquspar_at_xeon3/lam-io-socket
tkill: f_kill = "/tmp/lam-abaquspar_at_xeon3/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 22404 ...
tkill: killed
tkill: all finished
n0<19824> ssi:boot:rsh: successfully launched on n1 (xeon3)
n0<19824> ssi:boot:base:linear: booting n2 (apidell)
n0<19824> ssi:boot:rsh: starting wipe on (apidell)
n0<19824> ssi:boot:rsh: starting on n2 (apidell): tkill -d -v
n0<19824> ssi:boot:rsh: launching remotely
n0<19824> ssi:boot:rsh: attempting to execute "rsh apidell -n echo $SHELL"
n0<19824> ssi:boot:rsh: remote shell /bin/bash
n0<19824> ssi:boot:rsh: attempting to execute "rsh apidell -n tkill -d -v"
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-abaquspar_at_apidell/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-abaquspar_at_apidell/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-abaquspar_at_apidell/lam-io-socket
tkill: f_kill = "/tmp/lam-abaquspar_at_apidell/lam-killfile"
tkill: nothing to kill: "/tmp/lam-abaquspar_at_apidell/lam-killfile"
n0<19824> ssi:boot:rsh: successfully launched on n2 (apidell)
n0<19824> ssi:boot:base:linear: finished
n0<19824> ssi:boot:rsh: all RTE procs started
n0<19824> ssi:boot:rsh: finalizing
n0<19824> ssi:boot: Closing
lamboot did NOT complete successfully
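
One detail in the output above may be worth cross-checking: xeon3 is resolved to 192.168.0.38, but the callback that the server drops as unexpected comes from 192.168.0.39. The sketch below is only an illustration of how to pull that comparison out of a saved "lamboot -d" log. The function name unexpected_callers and both regular expressions are assumptions based on the line formats shown in this report, not on any documented LAM log format.

```python
import re

# Patterns inferred from the log lines in this report (an assumption).
RESOLVED = re.compile(
    r"ssi:boot:rsh:\s+(n\d+)\s+(\S+)\s+-->\s+(\d+\.\d+\.\d+\.\d+)")
CONNECTED = re.compile(r"got connection from\s+(\d+\.\d+\.\d+\.\d+)")


def unexpected_callers(log_text):
    """Return (expected, odd): the resolved address map and the list of
    calling addresses that do not appear among the resolved hosts."""
    expected = {}   # ip -> (node id, host name)
    callers = []    # every address that called back, in order
    for line in log_text.splitlines():
        match = RESOLVED.search(line)
        if match:
            expected[match.group(3)] = (match.group(1), match.group(2))
            continue
        match = CONNECTED.search(line)
        if match:
            callers.append(match.group(1))
    odd = []
    for ip in callers:
        if ip not in expected and ip not in odd:
            odd.append(ip)
    return expected, odd
```

Fed the log above, this reports 192.168.0.39 as a caller that matches none of the three resolved addresses. Why xeon3 answers from that address is not something the log itself explains.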

abaquspar_at_pc5036:~> laminfo
LAM/MPI: 7.0.4
Prefix: /opt/lam
Architecture: i686-pc-linux-gnu
Configured by: root
Configured on: Wed Feb 1 17:37:21 CET 2006
Configure host: xeon2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0.1)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)
abaquspar_at_pc5036:~>

Thanks