
LAM/MPI General User's Mailing List Archives


From: Van-Khiem Truong (Khiem-Van.Truong_at_[hidden])
Date: 2007-03-29 04:01:27


  Hi,

    I would like some help with the "lamboot" procedure. I have
installed LAM/MPI on two HP-UX machines: the first is a PA-RISC 2.0,
the second is a multiprocessor HP Itanium.

    The installation seems to be fine, except for the ptmalloc2 module
(/share/memory/ptmalloc2), where I had to change the "Makefile" to
remove the file malloc.c; otherwise the build reports that variables
are already declared.

    On the HP PA-RISC machine, I can run "lamboot" and connect to
another HP PA-RISC machine. On the multiprocessor HP machine, however,
lamboot reports that it boots, but the callback does not work. The
file containing the error message is attached below.

      Thank you in advance and best regards,

      Truong V.K.
      research engineer
      ONERA-France

------------------------------------------------------------------------

===============================================================
nanopus 112 : lamboot -v -d hostfile

================================================================
n-1<1698> ssi:boot:open: opening
n-1<1698> ssi:boot:open: opening boot module globus
n-1<1698> ssi:boot:open: opened boot module globus
n-1<1698> ssi:boot:open: opening boot module rsh
n-1<1698> ssi:boot:open: opened boot module rsh
n-1<1698> ssi:boot:open: opening boot module slurm
n-1<1698> ssi:boot:open: opened boot module slurm
n-1<1698> ssi:boot:select: initializing boot module slurm
n-1<1698> ssi:boot:slurm: not running under SLURM
n-1<1698> ssi:boot:select: boot module not available: slurm
n-1<1698> ssi:boot:select: initializing boot module rsh
n-1<1698> ssi:boot:rsh: module initializing
n-1<1698> ssi:boot:rsh:agent: /usr/bin/remsh
n-1<1698> ssi:boot:rsh:username: <same>
n-1<1698> ssi:boot:rsh:verbose: 1000
n-1<1698> ssi:boot:rsh:algorithm: linear
n-1<1698> ssi:boot:rsh:no_n: 0
n-1<1698> ssi:boot:rsh:no_profile: 0
n-1<1698> ssi:boot:rsh:fast: 0
n-1<1698> ssi:boot:rsh:ignore_stderr: 0
n-1<1698> ssi:boot:rsh:priority: 10
n-1<1698> ssi:boot:select: boot module available: rsh, priority: 10
n-1<1698> ssi:boot:select: initializing boot module globus
n-1<1698> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<1698> ssi:boot:select: boot module not available: globus
n-1<1698> ssi:boot:select: finalizing boot module slurm
n-1<1698> ssi:boot:slurm: finalizing
n-1<1698> ssi:boot:select: closing boot module slurm
n-1<1698> ssi:boot:select: finalizing boot module globus
n-1<1698> ssi:boot:globus: finalizing
n-1<1698> ssi:boot:select: closing boot module globus
n-1<1698> ssi:boot:select: selected boot module rsh

LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University

n-1<1698> ssi:boot:base: looking for boot schema in following directories:
n-1<1698> ssi:boot:base: <current directory>
n-1<1698> ssi:boot:base: $TROLLIUSHOME/etc
n-1<1698> ssi:boot:base: $LAMHOME/etc
n-1<1698> ssi:boot:base: /home/truong/Local/lam-mpi-install/etc
n-1<1698> ssi:boot:base: looking for boot schema file:
n-1<1698> ssi:boot:base: hostfile
n-1<1698> ssi:boot:base: found boot schema: hostfile
n-1<1698> ssi:boot:rsh: found the following hosts:
n-1<1698> ssi:boot:rsh: n0 nanopus (cpu=1)
n-1<1698> ssi:boot:rsh: n1 hudson (cpu=1)
n-1<1698> ssi:boot:rsh: resolved hosts:
n-1<1698> ssi:boot:rsh: n0 nanopus --> 125.1.5.218 (origin)
n-1<1698> ssi:boot:rsh: n1 hudson --> 125.1.7.17
n-1<1698> ssi:boot:rsh: starting RTE procs
n-1<1698> ssi:boot:base:linear: starting
n-1<1698> ssi:boot:base:server: opening server TCP socket
n-1<1698> ssi:boot:base:server: opened port 49939
n-1<1698> ssi:boot:base:linear: booting n0 (nanopus)
n-1<1698> ssi:boot:rsh: starting lamd on (nanopus)
n-1<1698> ssi:boot:rsh: starting on n0 (nanopus): hboot -t -c lam-conf.lamd -d -v -I -H 125.1.5.218 -P 49939 -n 0 -o 0
n-1<1698> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-truong_at_nanopus/lam-killfile
tkill: f_kill = "/tmp/lam-truong_at_nanopus/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 24192 ...
tkill: already dead
tkill: removing socket file ...
tkill: socket file: /tmp/lam-truong_at_nanopus/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-truong_at_nanopus/lam-io-socket
tkill: all finished
hboot: booting...
hboot: fork /home/truong/Local/lam-mpi-install/bin/lamd
hboot: attempting to execute
[1] 1701 lamd -H 125.1.5.218 -P 49939 -n 0 -o 0 -d
n-1<1698> ssi:boot:rsh: successfully launched on n0 (nanopus)
n-1<1698> ssi:boot:base:server: expecting connection from finite list
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

       - There are network filters between the lamboot agent host and
         the remote host such that communication on random TCP ports
         is blocked
       - Network routing from the remote host to the local host isn't
         properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
  one is the IP address of where the lamboot agent was invoked, the
  other is the port number that the lamboot agent is expecting the
  newly-booted process to call back on (this will be a random
  integer).

2. Manually login to the remote machine and try to telnet to the port
  indicated on the hboot command line. For example,
      telnet <ipnumber> <portnumber>
  If all goes well, you should get a "Connection refused" error. If
  you get any other kind of error, it could indicate either of the
  two conditions above. Consult with your system/network
  administrator.
-----------------------------------------------------------------------------
n-1<1698> ssi:boot:base:server: failed to connect to remote lamd!
n-1<1698> ssi:boot:base:server: closing server socket
n-1<1698> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
=============================================================================

*******************************************************************************************

    When I try the telnet procedure on the remote machine (the
multiprocessor), the answer is:
===========================
nanopus 116 : telnet 125.1.7.17 2000
Trying...
Connected to ::ffff:125.1.7.17.
Escape character is '^]'.
============================
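
    The check described in steps 1 and 2 of the lamboot message can also
be scripted. The following is only a minimal sketch, assuming Python 3
is available on the machine being tested; the function names
agent_endpoint and probe_port are hypothetical and are not part of
LAM/MPI. It reads the -H and -P arguments from the hboot command line
(125.1.5.218 and 49939 in the log above) and tries a TCP connection to
that address. According to the message, "refused" is the expected
result once lamboot has exited; a timeout or another error may point to
the filtering or routing problems it lists.

```python
import re
import socket


def agent_endpoint(hboot_line):
    """Return (ip, port) taken from the -H and -P arguments of an hboot line."""
    host = re.search(r"-H\s+(\S+)", hboot_line)
    port = re.search(r"-P\s+(\d+)", hboot_line)
    if host is None or port is None:
        raise ValueError("no -H/-P arguments found in: %r" % hboot_line)
    return host.group(1), int(port.group(1))


def probe_port(host, port, timeout=5.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (possibly filtered).
        return "timed out"
    except OSError as exc:
        # Routing failures, unreachable hosts, name resolution errors, etc.
        return "error: %s" % exc
```

    Run on the remote host, for example:

      ip, port = agent_endpoint("hboot -t -c lam-conf.lamd -d -v -I -H 125.1.5.218 -P 49939 -n 0 -o 0")
      print(ip, port, probe_port(ip, port))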