LAM/MPI General User's Mailing List Archives

From: zayar (zayar43_at_[hidden])
Date: 2008-03-06 10:13:30


Dear members,
I have a problem with lamboot. I found this topic on the FAQ page and tried the suggested solutions, but the error persists. When booting LAM/MPI on openSUSE 10.3, I get the following error messages:

zayar_at_HPC-3:~>lamboot -v bhost
LAM 7.1.4/MPI 2 C++/ROMIO - Indiana University

n-1<25538> ssi:boot:base:linear: booting n0 (HPC-3)
n-1<25538> ssi:boot:base:linear: booting n1 (HPC-2)
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<25538> ssi:boot:base:linear: aborted!
n-1<25544> ssi:boot:base:linear: booting n0 (HPC-3)
n-1<25544> ssi:boot:base:linear: booting n1 (HPC-2)
n-1<25544> ssi:boot:base:linear: finished
lamboot did NOT complete successfully
zayar_at_HPC-3:~> telnet (my-remote-ip) 23451
Trying (my-remote-ip)...
telnet: connect to address (my-remote-ip): Connection refused
zayar_at_HPC-3:~> telnet 127.0.0.1 32154
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
zayar_at_HPC-3:~> ssh -x hpc-2 hostname
HPC-2
zayar_at_HPC-3:~>
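
The telnet check from step 2 of the message above can also be scripted. What follows is only a minimal sketch in Python, not part of LAM/MPI: the function name `probe_callback_port` is hypothetical, and the host and port must be taken from the hboot command line printed by "lamboot -d". According to the message, it is meant to be run on the remote machine (HPC-2 here) against the address where lamboot was invoked. A result of "refused" matches the "Connection refused" outcome the message describes as the good case, while "timeout" or "unreachable" would suggest the filtering or routing problems it lists.

```python
import socket


def probe_callback_port(host, port, timeout=5.0):
    """Try one TCP connection to host:port and classify the outcome.

    Returns one of:
      "open"        - something accepted the connection
      "refused"     - the host answered but nothing is listening
                      (the result the lamboot message says to expect)
      "timeout"     - no answer at all, e.g. a filter silently drops packets
      "unreachable" - any other OS-level error (routing, name lookup, ...)
    """
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "unreachable"
```

For example, `probe_callback_port("<ipnumber>", <portnumber>)` with the two values from the hboot command line plays the role of `telnet <ipnumber> <portnumber>`.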
Please advise me.
 
Thanks.
       