Hi,
You will probably find your answer in the FAQ at http://www.lam-mpi.org/faq/. I had a similar problem,
and it was solved by following the FAQ section on booting LAM.
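In case it helps, here is a rough sketch of a check you could run first. It asks the remote node's non-interactive shell (the one rsh starts, which is not your login shell) whether it can find the LAM helper programs. The function name check_remote_path is made up for illustration and is not part of LAM. The sketch assumes the remote shell understands "command -v" (bash does), and it looks at the output rather than the exit status, since rsh does not reliably pass the remote exit status back.

```shell
#!/bin/sh
# Sketch only: check_remote_path is a hypothetical helper, not a LAM tool.
# Usage: check_remote_path AGENT HOST PROGRAM...
check_remote_path() {
    agent=$1
    host=$2
    shift 2
    missing=0
    for prog in "$@"; do
        # The quoted command string is evaluated by the remote shell, so it
        # is resolved against the PATH that a non-interactive rsh login gets.
        found=$("$agent" "$host" -n "command -v $prog" 2>/dev/null)
        if [ -n "$found" ]; then
            echo "$host: $prog found at $found"
        else
            echo "$host: $prog NOT found in non-interactive PATH"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the host name from the error output in this thread:
# check_remote_path rsh panda2 hboot tkill
```

If this reports hboot as not found while an interactive login on the same node does find it, then the PATH setting is most likely in a part of the startup files that the non-interactive shell never reaches.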
Cheers
Farschad
-----Original Message-----
From: "Sergei Lisenkov" <proffess_at_[hidden]>
To: lam_at_[hidden]
Date: Sun, 14 Sep 2003 17:45:57 +0400 (MSD)
Subject: LAM: Problem with running lamboot
Dear LAM users,
If you have any ideas, please help. I have a Beowulf cluster (16 dual Pentium III nodes = 32 CPUs). I have installed lam-7.0 and tried to run lamboot. I added the path to lamboot in my .bashrc file. This is what I got:
[proffess_at_panda work]$ lamboot -v -ssi boot rsh mynodes
LAM 7.0.1b6/MPI 2 C++/ROMIO - Indiana University
n0<4914> ssi:boot:base:linear: booting n0 (panda)
n0<4914> ssi:boot:base:linear: booting n1 (panda2)
ERROR: LAM/MPI unexpectedly received the following on stderr:
bash: line 1: hboot: command not found
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "panda2",
but received some output on the standard error.
"hboot" is in the PATH; I don't know why LAM cannot find it.
LAM tried to use the remote agent command "rsh"
to invoke "hboot" on the remote node.
This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a list of items that you may
wish to check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell
Try invoking the following command at the unix command line:
rsh panda2 -n hboot -t -c lam-conf.lamd -v -s -I "-H 195.208.40.134 -P 35032 -n 1 -o 0"
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n0<4914> ssi:boot:base:linear: Failed to boot n1 (panda2)
n0<4914> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n0<4920> ssi:boot:base:linear: booting n0 (panda)
n0<4920> ssi:boot:base:linear: booting n1 (panda2)
ERROR: LAM/MPI unexpectedly received the following on stderr:
bash: line 1: tkill: command not found
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "panda2",
but received some output on the standard error.
LAM tried to use the remote agent command "rsh"
to invoke "tkill" on the remote node.
This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a list of items that you may
wish to check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell
Try invoking the following command at the unix command line:
rsh panda2 -n tkill -v
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n0<4920> ssi:boot:base:linear: Failed to boot n1 (panda2)
n0<4920> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully
The file "hboot" is in the PATH, so I don't know why LAM cannot find it. I can use rsh to log in to the other nodes without a password, and /tmp is writable. What is wrong?
Thanks,
Best wishes,
Sergey
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
-------------------------------------
Farschad Torabi
Ph. D. Student
Mechanical Eng. Dept.
University of Tehran
ftorabi_at_[hidden]