Hi Irshad,
I'm not a LAM developer, but my first thought about
this problem is your choice of rsh versus ssh. I had
been using ssh with MPICH, and I switched to rsh with
LAM/MPI without any loss in efficiency. If you are on
a closed LAN, I think you should consider using rsh
instead of ssh.
ssh presents constant configuration problems, and its
security implementation is really quite paranoid. I
think it's a bit overboard and adds quite a bit of
overhead in setup time.
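One thing that might be worth ruling out first: the
"Connection refused" in your output suggests, as far as
I can tell, that nothing is accepting rsh connections on
Master at all, which is a different failure from a
.rhosts or permissions problem. Below is a rough sketch
of how to check that, assuming Python is available on
the node you boot from. The function name
port_accepts_connections is my own invention for
illustration; 514 is the standard TCP port for rsh.

```python
import socket

# Standard TCP port for rsh (listed as "shell" in /etc/services).
RSH_PORT = 514


def port_accepts_connections(host, port=RSH_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This only shows that something is listening on the port. It says
    nothing about whether .rhosts authentication would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and host names
        # that cannot be resolved.
        return False
```

Calling port_accepts_connections("Master") from Slave
tells you whether anything answers on the rsh port. If
it returns False, the rsh service on Master is probably
not running or is being blocked, and editing .rhosts
will not change the result.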
--- Irshad Ahmed <irshi2000_at_[hidden]> wrote:
>
> Hello,
>
> I am using Red Hat 9. I get the following error when
> booting LAM using the "lamhost" file
>
> which contains two nodes
>
> 1- Master, 2- Slave
>
> >>> both machines have same account,
>
> >>> .rhosts file contains " + + "
>
> >>> No cshrc instead have .bashrc in "/home/ahmed/"
>
> >>> /etc/hosts.equive contains " + + "
>
> >>> The following error comes
>
> >>> ssh is also not working, even with a password. It
> (ssh) was all right before I accidentally changed the
> "known_hosts" file in the .ssh folder
>
>
> *************************************************************************
>
> [ahmed_at_Slave ahmed]$ lamboot -v -ssi boot rsh
> /home/ahmed/lamhost
>
> LAM 7.0.3/MPI 2 C++/ROMIO - Indiana University
>
> n0<2906> ssi:boot:base:linear: booting n0 (Slave)
>
> n0<2906> ssi:boot:base:linear: booting n1 (Master)
>
> ERROR: LAM/MPI unexpectedly received the following
> on stderr:
>
> Master: Connection refused
>
> -----------------------------------------------------------------------------
>
> LAM failed to execute a process on the remote node
> "Master".
>
> LAM was not trying to invoke any LAM-specific
> commands yet -- we were
>
> simply trying to determine what shell was being used
> on the remote
>
> host.
>
> LAM tried to use the remote agent command "rsh"
>
> to invoke "echo $SHELL" on the remote node.
>
> This usually indicates an authentication problem
> with the remote
>
> agent, or some other configuration type of error in
> your .cshrc or
>
> .profile file. The following is a list of items that
> you may wish to
>
> check on the remote node:
>
> - You have an account and can login to the remote
> machine
>
> - Incorrect permissions on your home directory
> (should
>
> probably be 0755)
>
> - Incorrect permissions on your $HOME/.rhosts file
> (if you are
>
> using rsh -- they should probably be 0644)
>
> - You have an entry in the remote $HOME/.rhosts file
> (if you
>
> are using rsh) for the machine and username that you
> are
>
> running from
>
> - Your .cshrc/.profile must not print anything out
> to the
>
> standard error
>
> - Your .cshrc/.profile should set a correct TERM
> type
>
> - Your .cshrc/.profile should set the SHELL
> environment
>
> variable to your default shell
>
> Try invoking the following command at the unix
> command line:
>
> rsh Master -n echo $SHELL
>
> You will need to configure your local setup such
> that you will *not*
>
> be prompted for a password to invoke this command on
> the remote node.
>
> No output should be printed from the remote node
> before the output of
>
> the command is displayed.
>
> When you can get this command to execute
> successfully by hand, LAM
>
> will probably be able to function properly.
>
> -----------------------------------------------------------------------------
>
> n0<2906> ssi:boot:base:linear: Failed to boot n1
> (Master)
>
> n0<2906> ssi:boot:base:linear: aborted!
>
> -----------------------------------------------------------------------------
>
> lamboot encountered some error (see above) during
> the boot process,
>
> and will now attempt to kill all nodes that it was
> previously able to
>
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this
> process, you may
>
> have LAM daemons still running on remote nodes.
>
> -----------------------------------------------------------------------------
>
> n0<2911> ssi:boot:base:linear: booting n0 (Slave)
>
> n0<2911> ssi:boot:base:linear: booting n1 (Master)
>
> ERROR: LAM/MPI unexpectedly received the following
> on stderr:
>
> Master: Connection refused
>
> -----------------------------------------------------------------------------
>
> LAM failed to execute a process on the remote node
> "Master".
>
> LAM was not trying to invoke any LAM-specific
> commands yet -- we were
>
> simply trying to determine what shell was being used
> on the remote
>
> host.
>
> LAM tried to use the remote agent command "rsh"
>
> to invoke "echo $SHELL" on the remote node.
>
> This usually indicates an authentication problem
> with the remote
>
> agent, or some other configuration type of error in
> your .cshrc or
>
> .profile file. The following is a list of items that
> you may wish to
>
> check on the remote node:
>
> - You have an account and can login to the remote
> machine
>
> - Incorrect permissions on your home directory
> (should
>
> probably be 0755)
>
> - Incorrect permissions on your $HOME/.rhosts file
> (if you are
>
> using rsh -- they should probably be 0644)
>
> - You have an entry in the remote $HOME/.rhosts file
> (if you
>
=== message truncated ===
> _______________________________________________
> This list is archived at
> http://www.lam-mpi.org/MailArchives/lam/