
LAM/MPI General User's Mailing List Archives


From: Zeljko Sljivancanin (Zeljko.Sljivancanin_at_[hidden])
Date: 2005-05-25 05:18:23


Hi list,

I compiled lam-7.1.1 on our Opteron cluster with a Myrinet network.
I log in to the nodes using PBSpro (qsub -I), and when I try lamboot
it fails.
My config.log and the outputs from 'laminfo' and 'lamboot -d' are attached.
I would very much appreciate your suggestions.

Best regards,
Zeljko Sljivancanin
   


             LAM/MPI: 7.1.1
              Prefix: /u11/sljivanc/local
        Architecture: x86_64-unknown-linux-gnu
       Configured by: sljivanc
       Configured on: Wed May 25 11:11:52 CEST 2005
      Configure host: login1
      Memory manager: ptmalloc2
          C bindings: yes
        C++ bindings: yes
    Fortran bindings: yes
          C compiler: gcc
        C++ compiler: g++
    Fortran compiler: g77
     Fortran symbols: double_underscore
         C profiling: yes
       C++ profiling: yes
   Fortran profiling: yes
      C++ exceptions: no
      Thread support: yes
       ROMIO support: yes
        IMPI support: no
       Debug support: no
        Purify clean: no
            SSI boot: globus (API v1.1, Module v0.6)
            SSI boot: rsh (API v1.1, Module v1.1)
            SSI boot: slurm (API v1.1, Module v1.0)
            SSI boot: tm (API v1.1, Module v1.1)
            SSI coll: lam_basic (API v1.1, Module v7.1)
            SSI coll: shmem (API v1.1, Module v1.0)
            SSI coll: smp (API v1.1, Module v1.2)
             SSI rpi: crtcp (API v1.1, Module v1.1)
             SSI rpi: gm (API v1.1, Module v1.2)
             SSI rpi: lamd (API v1.0, Module v7.1)
             SSI rpi: sysv (API v1.0, Module v7.1)
             SSI rpi: tcp (API v1.0, Module v7.1)
             SSI rpi: usysv (API v1.0, Module v7.1)
              SSI cr: self (API v1.0, Module v1.0)

n-1<30185> ssi:boot:open: opening
n-1<30185> ssi:boot:open: opening boot module globus
n-1<30185> ssi:boot:open: opened boot module globus
n-1<30185> ssi:boot:open: opening boot module rsh
n-1<30185> ssi:boot:open: opened boot module rsh
n-1<30185> ssi:boot:open: opening boot module slurm
n-1<30185> ssi:boot:open: opened boot module slurm
n-1<30185> ssi:boot:open: opening boot module tm
n-1<30185> ssi:boot:open: opened boot module tm
n-1<30185> ssi:boot:select: initializing boot module tm
n-1<30185> ssi:boot:tm: module initializing
n-1<30185> ssi:boot:tm:verbose: 1000
n-1<30185> ssi:boot:tm:priority: 50
n-1<30185> ssi:boot:select: boot module available: tm, priority: 50
n-1<30185> ssi:boot:select: initializing boot module globus
n-1<30185> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<30185> ssi:boot:select: boot module not available: globus
n-1<30185> ssi:boot:select: initializing boot module slurm
n-1<30185> ssi:boot:slurm: not running under SLURM
n-1<30185> ssi:boot:select: boot module not available: slurm
n-1<30185> ssi:boot:select: initializing boot module rsh
n-1<30185> ssi:boot:rsh: module initializing
n-1<30185> ssi:boot:rsh:agent: rsh
n-1<30185> ssi:boot:rsh:username: <same>
n-1<30185> ssi:boot:rsh:verbose: 1000
n-1<30185> ssi:boot:rsh:algorithm: linear
n-1<30185> ssi:boot:rsh:no_n: 0
n-1<30185> ssi:boot:rsh:no_profile: 0
n-1<30185> ssi:boot:rsh:fast: 0
n-1<30185> ssi:boot:rsh:ignore_stderr: 0
n-1<30185> ssi:boot:rsh:priority: 10
n-1<30185> ssi:boot:select: boot module available: rsh, priority: 10
n-1<30185> ssi:boot:select: finalizing boot module globus
n-1<30185> ssi:boot:globus: finalizing
n-1<30185> ssi:boot:select: closing boot module globus
n-1<30185> ssi:boot:select: finalizing boot module slurm
n-1<30185> ssi:boot:slurm: finalizing
n-1<30185> ssi:boot:select: closing boot module slurm
n-1<30185> ssi:boot:select: finalizing boot module rsh
n-1<30185> ssi:boot:rsh: finalizing
n-1<30185> ssi:boot:select: closing boot module rsh
n-1<30185> ssi:boot:select: selected boot module tm
n-1<30185> ssi:boot:tm: found the following 4 hosts:
n-1<30185> ssi:boot:tm: n0 node001 (cpu=1)
n-1<30185> ssi:boot:tm: n1 node002 (cpu=1)
n-1<30185> ssi:boot:tm: n2 node003 (cpu=1)
n-1<30185> ssi:boot:tm: n3 node004 (cpu=1)
n-1<30185> ssi:boot:tm: starting RTE procs
n-1<30185> ssi:boot:base:linear_windowed: starting
n-1<30185> ssi:boot:base:linear_windowed: window size: 5
n-1<30185> ssi:boot:base:server: opening server TCP socket
n-1<30185> ssi:boot:base:server: opened port 54279
n-1<30185> ssi:boot:base:linear_windowed: booting n0 (node001)
n-1<30185> ssi:boot:tm: starting wipe on (node001)
n-1<30185> ssi:boot:tm: starting on n0 (node001): /home/sljivanc/local/bin/tkill -setsid -d
n-1<30185> ssi:boot:tm: successfully launched on n0 (node001)
n-1<30185> ssi:boot:tm: waiting for completion on n0 (node001)
n-1<30185> ssi:boot:tm: finished on n0 (node001)
n-1<30185> ssi:boot:tm: starting lamd on (node001)
n-1<30185> ssi:boot:tm: starting on n0 (node001): /home/sljivanc/local/bin/lamd -H 10.2.2.1 -P 54279 -n 0 -o 0 -d
n-1<30185> ssi:boot:tm: successfully launched on n0 (node001)
n-1<30185> ssi:boot:base:linear_windowed: booting n1 (node002)
n-1<30185> ssi:boot:tm: starting wipe on (node002)
n-1<30185> ssi:boot:tm: starting on n1 (node002): /home/sljivanc/local/bin/tkill -setsid -d
n-1<30185> ssi:boot:tm: successfully launched on n1 (node002)
n-1<30185> ssi:boot:tm: waiting for completion on n1 (node002)
n-1<30185> ssi:boot:tm: finished on n1 (node002)
n-1<30185> ssi:boot:tm: starting lamd on (node002)
n-1<30185> ssi:boot:tm: starting on n1 (node002): /home/sljivanc/local/bin/lamd -H 10.2.2.1 -P 54279 -n 1 -o 0 -d
n-1<30185> ssi:boot:tm: successfully launched on n1 (node002)
n-1<30185> ssi:boot:base:linear_windowed: booting n2 (node003)
n-1<30185> ssi:boot:tm: starting wipe on (node003)
n-1<30185> ssi:boot:tm: starting on n2 (node003): /home/sljivanc/local/bin/tkill -setsid -d
n-1<30185> ssi:boot:tm: successfully launched on n2 (node003)
n-1<30185> ssi:boot:tm: waiting for completion on n2 (node003)
n-1<30185> ssi:boot:tm: finished on n2 (node003)
n-1<30185> ssi:boot:tm: starting lamd on (node003)
n-1<30185> ssi:boot:tm: starting on n2 (node003): /home/sljivanc/local/bin/lamd -H 10.2.2.1 -P 54279 -n 2 -o 0 -d
n-1<30185> ssi:boot:tm: successfully launched on n2 (node003)
n-1<30185> ssi:boot:base:linear_windowed: booting n3 (node004)
n-1<30185> ssi:boot:tm: starting wipe on (node004)
n-1<30185> ssi:boot:tm: starting on n3 (node004): /home/sljivanc/local/bin/tkill -setsid -d
n-1<30185> ssi:boot:tm: successfully launched on n3 (node004)
n-1<30185> ssi:boot:tm: waiting for completion on n3 (node004)
n-1<30185> ssi:boot:tm: finished on n3 (node004)
n-1<30185> ssi:boot:tm: starting lamd on (node004)
n-1<30185> ssi:boot:tm: starting on n3 (node004): /home/sljivanc/local/bin/lamd -H 10.2.2.1 -P 54279 -n 3 -o 0 -d
n-1<30185> ssi:boot:tm: successfully launched on n3 (node004)
n-1<30185> ssi:boot:base:linear_windowed: finished launching
n-1<30185> ssi:boot:base:server: expecting connection from finite list
n-1<30185> ssi:boot:base:server: got connection from 10.2.2.1
n-1<30185> ssi:boot:base:server: this connection is expected (n0)
n-1<30185> ssi:boot:base:server: remote lamd is at 10.2.2.1:32820
n-1<30185> ssi:boot:base:server: expecting connection from finite list
n-1<30185> ssi:boot:base:server: got connection from 10.2.2.2
n-1<30185> ssi:boot:base:server: unexpected connection; dropping
n-1<30185> ssi:boot:base:server: got connection from 10.2.2.3
n-1<30185> ssi:boot:base:server: unexpected connection; dropping
n-1<30185> ssi:boot:base:server: got connection from 10.2.2.4
n-1<30185> ssi:boot:base:server: unexpected connection; dropping
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<30185> ssi:boot:base:server: failed to connect to remote lamd!
n-1<30185> ssi:boot:base:server: closing server socket
n-1<30185> ssi:boot:base:linear_windowed: aborted!
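
Step 2 of the help message above (the manual telnet check) can also be scripted. The following is a minimal sketch in Python, not part of LAM/MPI; the function name `probe` and the 5-second timeout are assumptions made for illustration. It tries one TCP connection and reports whether the port was open, refused, or unreachable.

```python
import socket


def probe(host, port, timeout=5.0):
    """Try one TCP connect to host:port and classify the outcome.

    Returns "open" if something accepted the connection, "refused" if the
    host answered but nothing is listening on the port, "timeout" if no
    answer arrived within `timeout` seconds, or "error: ..." otherwise.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # e.g. "No route to host" or "Network is unreachable"
        return "error: %s" % exc
    finally:
        sock.close()
```

Run on a compute node such as node002 with the address and port from the lamd command line in the log, for example `print(probe("10.2.2.1", 54279))`. A result of "refused" corresponds to the "Connection refused" outcome that the help message describes as the good case once lamboot has exited; "timeout" or "error: ..." would point toward the filtering or routing problems it lists.
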
n-1<30188> ssi:boot:open: opening
n-1<30188> ssi:boot:open: opening boot module globus
n-1<30188> ssi:boot:open: opened boot module globus
n-1<30188> ssi:boot:open: opening boot module rsh
n-1<30188> ssi:boot:open: opened boot module rsh
n-1<30188> ssi:boot:open: opening boot module slurm
n-1<30188> ssi:boot:open: opened boot module slurm
n-1<30188> ssi:boot:open: opening boot module tm
n-1<30188> ssi:boot:open: opened boot module tm
n-1<30188> ssi:boot:select: initializing boot module tm
n-1<30188> ssi:boot:tm: module initializing
n-1<30188> ssi:boot:tm:verbose: 1000
n-1<30188> ssi:boot:tm:priority: 50
n-1<30188> ssi:boot:select: boot module available: tm, priority: 50
n-1<30188> ssi:boot:select: initializing boot module globus
n-1<30188> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<30188> ssi:boot:select: boot module not available: globus
n-1<30188> ssi:boot:select: initializing boot module slurm
n-1<30188> ssi:boot:slurm: not running under SLURM
n-1<30188> ssi:boot:select: boot module not available: slurm
n-1<30188> ssi:boot:select: initializing boot module rsh
n-1<30188> ssi:boot:rsh: module initializing
n-1<30188> ssi:boot:rsh:agent: rsh
n-1<30188> ssi:boot:rsh:username: <same>
n-1<30188> ssi:boot:rsh:verbose: 1000
n-1<30188> ssi:boot:rsh:algorithm: linear
n-1<30188> ssi:boot:rsh:no_n: 0
n-1<30188> ssi:boot:rsh:no_profile: 0
n-1<30188> ssi:boot:rsh:fast: 0
n-1<30188> ssi:boot:rsh:ignore_stderr: 0
n-1<30188> ssi:boot:rsh:priority: 10
n-1<30188> ssi:boot:select: boot module available: rsh, priority: 10
n-1<30188> ssi:boot:select: finalizing boot module globus
n-1<30188> ssi:boot:globus: finalizing
n-1<30188> ssi:boot:select: closing boot module globus
n-1<30188> ssi:boot:select: finalizing boot module slurm
n-1<30188> ssi:boot:slurm: finalizing
n-1<30188> ssi:boot:select: closing boot module slurm
n-1<30188> ssi:boot:select: finalizing boot module rsh
n-1<30188> ssi:boot:rsh: finalizing
n-1<30188> ssi:boot:select: closing boot module rsh
n-1<30188> ssi:boot:select: selected boot module tm
n-1<30188> ssi:boot:tm: found the following 4 hosts:
n-1<30188> ssi:boot:tm: n0 node001 (cpu=1)
n-1<30188> ssi:boot:tm: n1 node002 (cpu=1)
n-1<30188> ssi:boot:tm: n2 node003 (cpu=1)
n-1<30188> ssi:boot:tm: n3 node004 (cpu=1)
n-1<30188> ssi:boot:tm: starting RTE procs
n-1<30188> ssi:boot:base:linear_windowed: starting
n-1<30188> ssi:boot:base:linear_windowed: no startup protocol
n-1<30188> ssi:boot:base:linear_windowed: invoking linear
n-1<30188> ssi:boot:base:linear: starting
n-1<30188> ssi:boot:base:linear: booting n0 (node001)
n-1<30188> ssi:boot:tm: starting wipe on (node001)
n-1<30188> ssi:boot:tm: starting on n0 (node001): /home/sljivanc/local/bin/tkill -setsid -d
n-1<30188> ssi:boot:tm: successfully launched on n0 (node001)
n-1<30188> ssi:boot:tm: waiting for completion on n0 (node001)
n-1<30188> ssi:boot:tm: finished on n0 (node001)
n-1<30188> ssi:boot:base:linear: booting n1 (node002)
n-1<30188> ssi:boot:tm: starting wipe on (node002)
n-1<30188> ssi:boot:tm: starting on n1 (node002): /home/sljivanc/local/bin/tkill -setsid -d
n-1<30188> ssi:boot:tm: successfully launched on n1 (node002)
n-1<30188> ssi:boot:tm: waiting for completion on n1 (node002)
n-1<30188> ssi:boot:tm: finished on n1 (node002)
n-1<30188> ssi:boot:base:linear: booting n2 (node003)
n-1<30188> ssi:boot:tm: starting wipe on (node003)
n-1<30188> ssi:boot:tm: starting on n2 (node003): /home/sljivanc/local/bin/tkill -setsid -d
n-1<30188> ssi:boot:tm: successfully launched on n2 (node003)
n-1<30188> ssi:boot:tm: waiting for completion on n2 (node003)
n-1<30188> ssi:boot:tm: finished on n2 (node003)
n-1<30188> ssi:boot:base:linear: booting n3 (node004)
n-1<30188> ssi:boot:tm: starting wipe on (node004)
n-1<30188> ssi:boot:tm: starting on n3 (node004): /home/sljivanc/local/bin/tkill -setsid -d
n-1<30188> ssi:boot:tm: successfully launched on n3 (node004)
n-1<30188> ssi:boot:tm: waiting for completion on n3 (node004)
n-1<30188> ssi:boot:tm: finished on n3 (node004)
n-1<30188> ssi:boot:base:linear: finished
n-1<30188> ssi:boot:tm: all RTE procs started
n-1<30188> ssi:boot:tm: finalizing
n-1<30188> ssi:boot: Closing
lamboot did NOT complete successfully

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
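
In the 'lamboot -d' output above, the callback from 10.2.2.1 is accepted as n0, while the callbacks from 10.2.2.2, 10.2.2.3 and 10.2.2.4 are dropped as unexpected. For longer logs, those lines can be pulled out automatically. This is a minimal sketch, not a LAM/MPI tool; the function name `classify_connections` is hypothetical, and the patterns assume the exact message wording shown in this log.

```python
import re

# Patterns copied from the wording of the 'lamboot -d' lines in this post.
GOT = re.compile(r"ssi:boot:base:server: got connection from (\S+)")
EXPECTED = re.compile(
    r"ssi:boot:base:server: this connection is expected \((n\d+)\)")
DROPPED = re.compile(r"ssi:boot:base:server: unexpected connection; dropping")


def classify_connections(lines):
    """Pair each 'got connection' line with the verdict that follows it.

    Returns (accepted, dropped): accepted maps source IP -> LAM node id,
    dropped is the list of source IPs that lamboot rejected.
    """
    accepted, dropped = {}, []
    last_ip = None
    for line in lines:
        match = GOT.search(line)
        if match:
            last_ip = match.group(1)
            continue
        if last_ip is None:
            continue
        match = EXPECTED.search(line)
        if match:
            accepted[last_ip] = match.group(1)
            last_ip = None
        elif DROPPED.search(line):
            dropped.append(last_ip)
            last_ip = None
    return accepted, dropped
```

Called as `classify_connections(open("lamboot.log"))`, where the file name is an assumption, it returns the accepted and dropped source addresses. One possible reading of dropped callbacks is that the addresses lamboot associates with node002 to node004 differ from the addresses those nodes actually connect from, but the log alone does not confirm this.
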