LAM/MPI General User's Mailing List Archives

From: Terry Frankcombe (T.Frankcombe_at_[hidden])
Date: 2005-02-04 05:37:09


You need to make sure that every node has made an ssh connection to every
other node at least once before you start running your multi-node jobs (as the
user that LAM executes as). This adds all the required entries to the relevant
known_hosts files, which suppresses the stderr output that's stopping LAM.

I feel sure that this is in the user guide somewhere. Did you read that?
(If it's not, it should be. It's a common enough situation.)
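
One way to make those first connections in bulk is sketched below. It is only
a sketch under some assumptions: the machine file has one hostname at the
start of each line (as in a LAM boot schema or the SGE-generated
$TMPDIR/machines), passwordless ssh already works between all nodes, and it is
acceptable on your cluster to accept unknown host keys automatically via the
OpenSSH option StrictHostKeyChecking=no. The function names and the DRY_RUN
variable are made up for this example.

```shell
#!/bin/sh
# Sketch: have every node ssh to every other node once, so that each
# known_hosts file is populated before lamboot runs.

# Print the unique hostnames in a machine file, ignoring comments,
# blank lines, and anything after the first field (e.g. "cpu=2").
hosts_from_schema() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }' | sort -u
}

# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# For each (src, dst) pair, log in to src and ssh from there to dst.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# StrictHostKeyChecking=no accepts new host keys without asking
# (an assumption about what your site's security policy allows).
prime_known_hosts() {
    hosts=$(hosts_from_schema "$1")
    opts="-n -o BatchMode=yes -o StrictHostKeyChecking=no"
    for src in $hosts; do
        for dst in $hosts; do
            run ssh $opts "$src" "ssh $opts $dst true"
        done
    done
}

# Example (run as the user that LAM executes as):
#   DRY_RUN=1 prime_known_hosts "$TMPDIR/machines"   # show the commands
#   prime_known_hosts "$TMPDIR/machines"             # actually connect
```

The "Permanently added" warnings will appear once while this runs; after that
the same connections should be silent.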

> Hi,
> We have a Mac OS X cluster that uses the Sun Grid Engine
> batch system. LAM 7.1.1 is installed on it. When I
> submit MPI jobs with 2 processes, the jobs work (on a
> dual-CPU machine). But when I increase the number of
> processes (so that they are distributed to the other
> machines in the cluster), my job crashes during
> lamboot. Below is the error message I get from
> lamboot. The error may be related to
> boot_rsh_ignore_stderr; the message says that it should
> be set to 1. I checked with laminfo -param all all
> and saw that the default value of
> boot_rsh_ignore_stderr is 0. Is there a way to
> change this setting at execution time, i.e.
> in the following line
> /usr/local/bin/lamboot -v -ssi boot rsh -ssi rsh_agent
> "ssh -x -q" $TMPDIR/machines
>
> or should it be set when LAM is installed? If so,
> how?
>
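
For reference, the error message quoted further down names the SSI parameter,
and the lamboot line above already passes SSI parameters with
"-ssi <name> <value>", so the setting can presumably go on the same command
line. A minimal sketch, reusing the path and agent options from the command
above (the wrapper function name and the DRY_RUN variable are made up for this
example):

```shell
#!/bin/sh
# Sketch: the lamboot invocation from the original post with
# boot_rsh_ignore_stderr set to 1 on the command line.
lamboot_ignore_stderr() {
    machines=$1
    # Build the argument list; "ssh -x -q" stays a single argument.
    set -- /usr/local/bin/lamboot -v -ssi boot rsh \
        -ssi rsh_agent "ssh -x -q" \
        -ssi boot_rsh_ignore_stderr 1 "$machines"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show what would be executed (quoting is not preserved).
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example:
#   DRY_RUN=1 lamboot_ignore_stderr "$TMPDIR/machines"
```

Bear in mind that ignoring stderr also hides genuine errors from the remote
nodes, so populating known_hosts is the cleaner fix.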
> I checked the system for passwordless login, and it
> works fine. The lamboot crash is not due to a login
> problem.
> Thanks
> Mustafa Uludogan
>
>
> n-1<22335> ssi:boot:base:linear: booting n0 (node045)
> n-1<22335> ssi:boot:base:linear: booting n1 (node049)
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> Warning: Permanently added 'node049' (RSA) to the list of known hosts.
> -----------------------------------------------------------------------------
> LAM attempted to execute a process on the remote node "node049",
> but received some output on the standard error. This heuristic
> assumes that any output on the standard error indicates a fatal error,
> and therefore aborts. You can disable this behavior (i.e., have LAM
> ignore output on standard error) in the rsh boot module by setting the
> SSI parameter boot_rsh_ignore_stderr to 1.
>
> LAM tried to use the remote agent command "ssh"
> to invoke "echo $SHELL" on the remote node.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> This can indicate an authentication error with the remote agent, or
> can indicate an error in your $HOME/.cshrc, $HOME/.login, or
> $HOME/.profile files. The following is a (non-inclusive) list of items
> that you should check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should
>   probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are
>   using rsh -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you
>   are using rsh) for the machine and username that you are
>   running from
> - Your .cshrc/.profile must not print anything out to the
>   standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment
>   variable to your default shell
>
> Try invoking the following command at the unix command line:
>
> ssh -x node049 -n 'echo $SHELL'
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
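
The manual check suggested in that message can be scripted so that it flags
exactly what LAM trips over, namely anything at all on stderr. This is a rough
sketch: the function name and the SSH_CMD override are invented for the
example, and the ssh arguments are the ones from the message above.

```shell
#!/bin/sh
# Sketch: run LAM's suggested test command on one host and report
# whether ssh wrote anything to standard error.
check_stderr_clean() {
    host=$1
    # "2>&1 >/dev/null" captures stderr only; the stdout of
    # 'echo $SHELL' (the remote shell path) is thrown away.
    err=$("${SSH_CMD:-ssh}" -x "$host" -n 'echo $SHELL' 2>&1 >/dev/null)
    if [ -n "$err" ]; then
        printf '%s: unexpected stderr: %s\n' "$host" "$err"
        return 1
    fi
    printf '%s: clean\n' "$host"
}

# Example:
#   check_stderr_clean node049
```

A host that still prints the "Permanently added" warning is reported as
unexpected stderr on its first run and should come back clean on the second,
once its key is in known_hosts.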
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/