On Mar 29, 2005, at 10:55 AM, Vinicius de Lima wrote:
> Does anybody know what went wrong?
>
> log_error:
>
> [swingle_at_swingle ~]$ lamboot -d
> <snip>
> n-1<18345> ssi:boot:rsh: starting on n1 (swingle3): hboot -t -c
> lam-conf.lamd -d -s -I "-H 200.144.120.137 -P 36658 -n 1 -o 0"
> n-1<18345> ssi:boot:rsh: launching remotely
> n-1<18345> ssi:boot:rsh: attempting to execute: ssh -x swingle3 -n
> 'echo $SHELL'
> swingle_at_swingle3's password:
> n-1<18345> ssi:boot:rsh: remote shell /bin/bash
> n-1<18345> ssi:boot:rsh: attempting to execute: ssh -x swingle3 -n
> hboot -t -c lam-conf.lamd -d -s -I '"-H 200.144.120.137 -P 36658 -n 1
> -o 0"'
> swingle_at_swingle3's password:
It looks like when you rsh/ssh into your remote machine, it is
prompting you for a password. LAM requires that:
"The user needs to be able to execute command on remote nodes without
being prompted for a password, and with no extraneous output on
stderr."
http://www.lam-mpi.org/faq/category3.php3#question2
Try to execute:
ssh -x swingle3 -n 'echo $SHELL'
and make sure it doesn't prompt you for any input.
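If you would rather have the check fail outright than sit at a prompt, here is a rough sketch of how it could be scripted. It assumes OpenSSH, whose BatchMode option disables password prompts. The function name check_remote_shell and the SSH_CMD override variable are made up for illustration and are not part of LAM or ssh:

```shell
#!/bin/sh
# Sketch only: verify that a remote command runs with no password prompt
# and no output on stderr, which is what LAM's rsh boot module expects.
# check_remote_shell and SSH_CMD are hypothetical names, not LAM or ssh features.
check_remote_shell() {
    host=$1
    err=$(mktemp) || return 2
    # BatchMode=yes (OpenSSH) makes ssh exit with an error instead of prompting.
    out=$(${SSH_CMD:-ssh} -o BatchMode=yes -x "$host" -n 'echo $SHELL' 2>"$err")
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "FAIL: ssh to $host exited with status $rc (password required or host unreachable?)"
    elif [ -s "$err" ]; then
        echo "FAIL: ssh to $host wrote extraneous output to stderr"
        rc=1
    else
        echo "OK: $host reports remote shell $out"
    fi
    rm -f "$err"
    return "$rc"
}

# Example (host name taken from the log above):
#   check_remote_shell swingle3
```

A FAIL here most likely means lamboot will hit the same problem on that node.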
You may want to take a look at the FAQs on the LAM/MPI site if you
need some pointers on how to set this up.
Hope this helps,
Josh
> <snip>
> [1] 17254 lamd -H 200.144.120.137 -P 36658 -n 1 -o 0 -d
> n-1<18345> ssi:boot:rsh: successfully launched on n1 (swingle3)
> n-1<18345> ssi:boot:base:server: expecting connection from finite list
> -----------------------------------------------------------------------------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back. Possible reasons that this may happen:
>
> - There are network filters between the lamboot agent host and
> the remote host such that communication on random TCP ports
> is blocked
> - Network routing from the remote host to the local host isn't
> properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
> one is the IP address of where the lamboot agent was invoked, the
> other is the port number that the lamboot agent is expecting the
> newly-booted process to call back on (this will be a random
> integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
> indicated on the hboot command line. For example,
> telnet <ipnumber> <portnumber>
> If all goes well, you should get a "Connection refused" error. If
> you get any other kind of error, it could indicate either of the
> two conditions above. Consult with your system/network
> administrator.
> -----------------------------------------------------------------------------
> n-1<18345> ssi:boot:base:server: failed to connect to remote lamd!
> n-1<18345> ssi:boot:base:server: closing server socket
> n-1<18345> ssi:boot:base:linear: aborted!
> <snip>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
----
Josh Hursey
jjhursey_at_[hidden]
http://www.lam-mpi.org/