LAM/MPI General User's Mailing List Archives

From: Ross Heikes (ross_at_[hidden])
Date: 2005-07-29 12:13:08


Well Brian,
Here is what my lamhostfile looks like:

172.30.1.130 cpu=1
172.20.1.130 cpu=1
172.30.1.131 cpu=1
172.20.1.131 cpu=1
slikrock cpu=2

Here slikrock is the master node, and node 30 and node 31 are each
reachable through two subnetworks (172.20 and 172.30).
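
As a sanity check on the schema above, here is a minimal Python sketch that
tallies the entries and CPU counts it requests. The helper name
parse_boot_schema is hypothetical, and the script only assumes the
"host cpu=N" line layout shown above, with "#" starting a comment and a
missing cpu= meaning one CPU. It does not call LAM in any way.

```python
def parse_boot_schema(text):
    """Return a list of (host, cpus) pairs from 'host cpu=N' lines."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        host, cpus = fields[0], 1  # assumption: one CPU when cpu= is absent
        for field in fields[1:]:
            if field.startswith("cpu="):
                cpus = int(field[len("cpu="):])
        entries.append((host, cpus))
    return entries


# The lamhostfile from this post, copied verbatim.
schema = """\
172.30.1.130 cpu=1
172.20.1.130 cpu=1
172.30.1.131 cpu=1
172.20.1.131 cpu=1
slikrock cpu=2
"""

entries = parse_boot_schema(schema)
print(len(entries), "entries,", sum(c for _, c in entries), "CPUs")
# prints: 5 entries, 6 CPUs
```

So the file as written asks for five separate entries, with each of the two
remote machines listed once per subnetwork.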

If I run lamboot lamhostfile, this is what I get:

The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

         - There are network filters between the lamboot agent host and
           the remote host such that communication on random TCP ports
           is blocked
         - Network routing from the remote host to the local host isn't
           properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
    one is the IP address of where the lamboot agent was invoked, the
    other is the port number that the lamboot agent is expecting the
    newly-booted process to call back on (this will be a random
    integer).

2. Manually login to the remote machine and try to telnet to the port
    indicated on the hboot command line. For example,
        telnet <ipnumber> <portnumber>
    If all goes well, you should get a "Connection refused" error. If
    you get any other kind of error, it could indicate either of the
    two conditions above. Consult with your system/network
    administrator.
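
The telnet check in step 2 of the message above can also be scripted. This is
only a rough Python sketch of the same idea, not part of LAM: the function
name check_callback_port is hypothetical, and the host and port must be
taken from the hboot command line printed by "lamboot -d".

```python
import socket


def check_callback_port(host, port, timeout=5.0):
    """Try a TCP connection, like `telnet host port`, and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered and rejected the connection: the path is open.
        return "refused"
    except socket.timeout:
        # No answer at all: packets may be filtered or misrouted.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "No route to host" or a name lookup failure.
        return "error: %s" % exc


# Usage, run on the remote node (substitute the real values):
#   print(check_callback_port("<ipnumber>", <portnumber>))
```

Going by the message above, "refused" is the good outcome here; a timeout or
any other error points at the filtering or routing problems it lists.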

I have read the lamboot FAQ and have even tried
lamboot -ssi mpi_hostname
That runs, but ps shows that most of the executables are running on the
master node rather than on the intended nodes, so cluster performance
degrades.
Is there any other way I can use both subnetworks?

thanks