LAM/MPI General User's Mailing List Archives

From: Mustafa Uludogan (uludoganmustafa_at_[hidden])
Date: 2005-02-03 17:20:11


Hi,
We have a Mac OS X cluster running the Sun Grid Engine
batch system, with LAM 7.1.1 installed. When I submit
2-CPU MPI jobs, they run fine (on a single dual-CPU
machine). But when I increase the number of processes
so that the job is distributed across other machines
in the cluster, the job crashes during lamboot. The
error message from lamboot is below. The error may be
related to boot_rsh_ignore_stderr: the message
suggests setting it to 1. I checked with
"laminfo -param all all" and saw that the default
value of boot_rsh_ignore_stderr is 0. Is there a way
to change this setting at run time, i.e. in the
following line:

/usr/local/bin/lamboot -v -ssi boot rsh -ssi rsh_agent "ssh -x -q" $TMPDIR/machines

Or does it have to be set when LAM is installed? If
so, how?
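
To make the question concrete, here is a rough sketch of what
I imagine a run-time setting would look like. It assumes that
-ssi accepts boot_rsh_ignore_stderr as a key/value pair in the
same way it accepts boot and rsh_agent in the line above; I
have not confirmed this. lamboot_cmd is a made-up helper that
only prints the command line and does not run lamboot.

```shell
# Hypothetical dry-run helper: prints the lamboot command line, does not run it.
# ASSUMPTION: boot_rsh_ignore_stderr (the name comes from the error text) can be
# passed with -ssi, like the boot and rsh_agent parameters already in use.
lamboot_cmd() {
    machines="$1"   # path to the machine file written by the batch system
    printf '%s\n' "/usr/local/bin/lamboot -v -ssi boot rsh -ssi rsh_agent \"ssh -x -q\" -ssi boot_rsh_ignore_stderr 1 $machines"
}

# Show the command that would be used inside the job script.
lamboot_cmd "$TMPDIR/machines"
```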

I checked passwordless login on the system and it
works fine, so the lamboot crash is not due to a
login problem.

Thanks,
Mustafa Uludogan

n-1<22335> ssi:boot:base:linear: booting n0 (node045)
n-1<22335> ssi:boot:base:linear: booting n1 (node049)
ERROR: LAM/MPI unexpectedly received the following on stderr:
Warning: Permanently added 'node049' (RSA) to the list of known hosts.
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "node049",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "ssh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        ssh -x node049 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
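
The hand check that the message asks for can be scripted. The
sketch below is not part of LAM: stderr_is_empty is a made-up
helper name. It runs any command, throws away its standard
output, and succeeds only if nothing was written to standard
error, which is the same condition the lamboot heuristic
described above appears to apply.

```shell
# Hypothetical helper: succeed only if the given command writes nothing to stderr.
stderr_is_empty() {
    # Redirect stderr into the capture first, then discard stdout.
    err=$("$@" 2>&1 >/dev/null)
    [ -z "$err" ]
}

# Example use against the command from the error message
# (left as a comment because it needs a reachable node049):
#   if stderr_is_empty ssh -x node049 -n 'echo $SHELL'; then
#       echo "stderr is clean"
#   else
#       echo "something was printed on stderr"
#   fi
```

A warning such as the "Permanently added ... known hosts" line
would make this check fail, even when the login itself works
without a password.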

                