Hello, everybody,
I am new to LAM/MPI. At first I ran Red Hat 9.0 with LAM 6.5, and I could run
the example programs with:
recon -v lamhosts (two hosts in file lamhosts)
lamboot -v lamhosts
mpirun -np 2 fpi
That worked fine. But after I downloaded and installed the newest LAM 7.0,
everything failed to run.
The following is the error output from "recon -v lamhosts":
>>>>>>>
n0<5849> ssi:boot:base:linear: booting n0 (192.168.0.81)
n0<5849> ssi:boot:base:linear: booting n1 (192.168.0.37)
ERROR: LAM/MPI unexpectedly received the following on stderr:
192.168.0.37: Connection refused
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "192.168.0.37".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.
LAM tried to use the remote agent command "rsh"
to invoke "echo $SHELL" on the remote node.
This usually indicates an authentication problem with the remote
agent, or some other configuration type of error in your .cshrc or
profile file. The following is a list of items that you may wish to
check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell
Try invoking the following command at the unix command line:
rsh 192.168.0.37 -n echo $SHELL
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
n0<5849> ssi:boot:base:linear: Failed to boot n1 (192.168.0.37)
n0<5849> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
recon was not able to complete successfully. There can be any number
of problems that did not allow recon to work properly. You should use
the "-d" option to recon to get more information about each step that
recon attempts.
Any error message above may present a more detailed description of the
actual problem.
Here is general a list of prerequisites that *must* be fulfilled
before recon can work:
- Each machine in the hostfile must be reachable and operational.
- You must have an account on each machine.
- You must be able to rsh(1) to the machine (permissions
are typically set in the user's $HOME/.rhosts file).
*** Sidenote: If you compiled LAM to use a remote shell program
other than rsh (with the --with-rsh option to ./configure;
e.g., ssh), or if you set the LAMRSH environment variable
to an alternate remote shell program, you need to ensure
that you can execute programs on remote nodes with no
password. For example:
unix% ssh -x pinky uptime
3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10
- The LAM executables must be locatable on each machine, using
the shell's search path and possibly the LAMHOME environment
variable.
- The shell's start-up script must not print anything on standard
error. You can take advantage of the fact that rsh(1) will
start the shell non-interactively. The start-up script (such
as .profile or .cshrc) can exit early in this case, before
executing many commands relevant only to interactive sessions
and likely to generate output.
-----------------------------------------------------------------------------
<<<<<<<
I am sure the same program runs with LAM 6.5, and I have changed nothing in
the configuration apart from running "rpm -e lam" and "rpm -ihv lam-7.0-i586.rpm".
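For anyone who wants to repeat the by-hand check that the message recommends against every host in the file, something like the following could be used. It is only a rough sketch: check_hosts is a hypothetical helper of mine, not part of LAM, and it assumes the host name is the first field on each line of the hostfile and that "#" starts a comment. It uses the LAMRSH variable named in the message to pick the remote agent and falls back to rsh.

```shell
#!/bin/sh
# Sketch only: try the remote agent against every host in a hostfile.
# check_hosts is a hypothetical helper, not a LAM command.
check_hosts() {
    hostfile=$1
    # LAMRSH is the variable named in the recon message; default to rsh.
    agent=${LAMRSH:-rsh}
    status=0
    # Assumption: host name is the first field; '#' starts a comment.
    for host in $(sed -e 's/#.*//' "$hostfile" | awk 'NF { print $1 }'); do
        # Same form as the command recon suggests: <agent> <host> -n echo $SHELL
        # $agent is left unquoted on purpose so a value like "ssh -x" splits.
        if $agent "$host" -n 'echo $SHELL' >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            status=1
        fi
    done
    return $status
}
```

It would be called as "check_hosts lamhosts", or as LAMRSH="ssh -x" check_hosts lamhosts to try ssh instead. A host reported as FAILED only means the agent command returned a non-zero status or could not be run; it does not say why.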
Thanks a lot,
kenny