LAM/MPI General User's Mailing List Archives

From: 460853_at_[hidden]
Date: 2006-11-14 13:11:16


Hello!!

This is my first message to the list. I am new not only to the list but also to
LAM/MPI itself, so some of you may find my questions rather basic. Please bear
with me; I am just getting started.

My question now is the following:

I have two machines in a cluster. I created the lamhosts file below:

--------------------- lamhosts -------
155.210.155.67
155.210.155.70
--------------------------------------

I am on machine .67 (whose name is rcp13) and I want to boot LAM on .67 and
.70 (the remote one, whose name is venus2). If I run recon -v lamhosts, it
reports that recon completed successfully, but when I then run
lamboot -v lamhosts, it prints this:

hector_at_rdp13:~/Pa aprendé/Pruebas MPI> lamboot -v lamhosts

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n-1<12668> ssi:boot:base:linear: booting n0 (155.210.155.67)
n-1<12668> ssi:boot:base:linear: booting n1 (155.210.155.70)
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
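The telnet check in step 2 of the message above could also be scripted. The following is only a rough sketch, assuming Python 3 is available on the remote machine; the function name probe and the 3-second timeout are my own choices, not anything taken from LAM. It makes a single TCP connection attempt and reports how it ended. Going by the message, "refused" would be the good outcome, while a timeout would point towards a filter dropping the packets.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection to host:port and classify how it ended.

    Hypothetical helper, not part of LAM/MPI. The return value is one of
    "connected", "refused", "timed out" or "error: <details>".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (e.g. packets silently dropped).
        return "timed out"
    except OSError as exc:
        # Anything else, such as "no route to host".
        return "error: %s" % exc
```

It would be called on the remote machine with the address and port taken from the hboot command line, for example print(probe("<ipnumber>", <portnumber>)) with the real values filled in.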

I know there is a firewall on each machine that leaves only the SSH port (22)
open, so I guess that is the cause of the problem. Which ports do I have to
open in order to boot LAM?

Running lamboot with the -d option, I saw (among many other things) this:

   lamd -H 155.210.155.67 -P 6459 -n 1 -o 0 -d

So I guess this means that the .155.70 machine must be able to reach port 6459
on the .155.67 machine. Am I right? Is the solution then to open port 6459 on
the .155.67 machine? Should I also open this port on the .155.70 machine? If
not, which ports should I open? I do not know whether opening only this port
will be enough.
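To make that reading concrete, here is a small sketch that pulls the address and port out of the lamd line. It rests on my assumption that -H and -P are the callback address and port that the error message describes; the helper name callback_endpoint is made up for illustration. Note that the error message calls the port a random integer, so the value would presumably differ on every lamboot run.

```python
import shlex


def callback_endpoint(command_line):
    """Return (host, port) taken from the -H and -P options of a lamd line.

    Hypothetical helper. It assumes -H is followed by the address and -P by
    the port on which the lamboot agent waits for the callback.
    """
    args = shlex.split(command_line)
    host = args[args.index("-H") + 1]
    port = int(args[args.index("-P") + 1])
    return host, port


# The line copied from the lamboot -d output.
line = "lamd -H 155.210.155.67 -P 6459 -n 1 -o 0 -d"
print(callback_endpoint(line))
```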

Thank you in advance!!

PS: I tried Googling a bit but did not find anything, so if this question has
already been asked (I guess it has), a link to a useful page would be enough.