LAM/MPI General User's Mailing List Archives

From: wzlu (wzlu_at_[hidden])
Date: 2008-06-20 01:39:49


Hi all,

I have a problem running lamboot and recon.
The rsh test works, but lamboot and recon do not.
The messages follow. Please tell me how to boot LAM. Thanks a lot.

Best Regards
Wei-Zhao Lu

rsh message:
[wzlu@in04033 lam_test]$ rsh 10.0.4.34 -n 'echo $SHELL'
connect to address 10.0.4.34: Connection refused
Trying krb4 rsh...
connect to address 10.0.4.34: Connection refused
trying normal rsh (/usr/bin/rsh)
/bin/bash
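
The rsh command does print /bin/bash, but the "Connection refused" and "trying normal rsh" lines are the same ones that LAM reports receiving on stderr in the output further down. A minimal sketch for checking by hand whether a remote command writes anything to standard error; the helper name check_stderr is hypothetical and is not part of LAM or rsh:

```shell
# check_stderr CMD [ARGS...]
# Runs the command, discards its stdout, and captures only its stderr.
# Returns 0 if stderr was empty, 1 otherwise.
check_stderr() {
    # Redirection order matters: stderr goes to the capture first,
    # then stdout is sent to /dev/null.
    err=$("$@" 2>&1 >/dev/null)
    if [ -n "$err" ]; then
        printf '%s\n' "stderr not empty:"
        printf '%s\n' "$err"
        return 1
    fi
    printf '%s\n' "stderr clean"
    return 0
}

# Intended use on the head node (not run here):
#   check_stderr rsh 10.0.4.34 -n 'echo $SHELL'
```

If this reports a non-empty stderr for the rsh command, that would be consistent with the LAM abort shown in the lamboot output.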

lamboot message:
[wzlu@in04033 lam_test]$ lamboot -v host

LAM 7.1.4/MPI 2 C++/ROMIO - Indiana University

n-1<4940> ssi:boot:base:linear: booting n0 (10.0.4.33)
n-1<4940> ssi:boot:base:linear: booting n1 (10.0.4.34)
ERROR: LAM/MPI unexpectedly received the following on stderr:
connect to address 10.0.4.34: Connection refused
connect to address 10.0.4.34: Connection refused
trying normal rsh (/usr/bin/rsh)
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "10.0.4.34",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "rsh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

rsh 10.0.4.34 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<4940> ssi:boot:base:linear: Failed to boot n1 (10.0.4.34)
n-1<4940> ssi:boot:base:linear: aborted!
n-1<4945> ssi:boot:base:linear: booting n0 (10.0.4.33)
n-1<4945> ssi:boot:base:linear: booting n1 (10.0.4.34)
ERROR: LAM/MPI unexpectedly received the following on stderr:
connect to address 10.0.4.34: Connection refused
connect to address 10.0.4.34: Connection refused
trying normal rsh (/usr/bin/rsh)
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "10.0.4.34",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "rsh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

rsh 10.0.4.34 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<4945> ssi:boot:base:linear: Failed to boot n1 (10.0.4.34)
n-1<4945> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully

recon message:
[wzlu@in04033 lam_test]$ recon -v host
n-1<4949> ssi:boot:base:linear: booting n0 (10.0.4.33)
n-1<4949> ssi:boot:base:linear: booting n1 (10.0.4.34)
ERROR: LAM/MPI unexpectedly received the following on stderr:
connect to address 10.0.4.34: Connection refused
connect to address 10.0.4.34: Connection refused
trying normal rsh (/usr/bin/rsh)
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "10.0.4.34",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.

LAM tried to use the remote agent command "rsh"
to invoke "echo $SHELL" on the remote node.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:

- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell

Try invoking the following command at the unix command line:

rsh 10.0.4.34 -n 'echo $SHELL'

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<4949> ssi:boot:base:linear: Failed to boot n1 (10.0.4.34)
n-1<4949> ssi:boot:base:linear: aborted!
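
The error text above names one possible workaround: setting the SSI parameter boot_rsh_ignore_stderr to 1. A sketch of what that might look like on the lamboot command line, assuming the usual "-ssi <key> <value>" option form and the same hostfile name "host" used in this post. The variable names are only for illustration, and this has not been tried on the cluster in question:

```shell
# Hostfile name taken from the commands in this post.
HOSTFILE=host

# Same lamboot invocation as above, plus the SSI parameter named in the
# error message so that output on stderr is not treated as fatal.
LAMBOOT_CMD="lamboot -ssi boot_rsh_ignore_stderr 1 -v $HOSTFILE"

if command -v lamboot >/dev/null 2>&1; then
    # Unquoted on purpose so the string splits into command and arguments.
    $LAMBOOT_CMD
else
    echo "lamboot not found; would run: $LAMBOOT_CMD"
fi
```

This only hides the stderr output from LAM. It does not remove the refused connection attempts that rsh makes before falling back to /usr/bin/rsh.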