
LAM/MPI General User's Mailing List Archives


From: damien_at_[hidden]
Date: 2004-06-10 14:25:17


The first thing you need to do is make sure LAM is installed or available
on all the nodes. I've never tried to boot LAM that way though. You
might want to try the basic lamboot with a host definition file.
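
For reference, here is a minimal sketch of that basic approach. It assumes a two-node host definition file built from the host names in the quoted output below. The helper name write_hostfile is made up for illustration, and the cpu=2 field is an assumption about the boot schema syntax, so check it against the documentation for your LAM version.

```shell
#!/bin/sh
# write_hostfile PATH: write a two-node LAM host definition file to PATH.
# Hypothetical helper; the hosts come from the quoted lamboot output, and
# the cpu=2 field is an assumption about the boot schema syntax.
write_hostfile() {
    cat > "$1" <<'EOF'
matrx.engr.ucdavis.edu cpu=2
192.168.0.1 cpu=2
EOF
}

# Typical use on the head node (not run here):
#   write_hostfile hostfile
#   lamboot -v hostfile
```

Keeping the file to two nodes at first makes it easier to see which node is failing before adding the rest.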

The second thing you need to do, if LAM is installed, is make sure that
you can run commands remotely on the other nodes, either through ssh or
rsh. If you're running rsh, try this command:

rsh nodename ls -laF

where nodename is the name of one of the nodes (its IP address works
too). You should get a directory listing from the node, with file sizes,
dates, and so on. If you get any other response at all, such as an error
message or a password prompt, then remote command execution is not set up
properly and LAM won't work.
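
The same check can be scripted. The sketch below is only an illustration: check_remote_hboot is a made-up helper name, and it assumes the remote agent accepts the "host -n command" form seen in the quoted lamboot output and that "which" exists on the remote node. It treats a non-zero exit status or any output on standard error as a failure, which is roughly what lamboot complains about.

```shell
#!/bin/sh
# check_remote_hboot AGENT HOST
# Hypothetical helper: ask HOST, through the remote agent (e.g. /usr/bin/rsh),
# where hboot lives. Fails if the agent exits non-zero or if anything at all
# is printed on standard error.
check_remote_hboot() {
    agent=$1
    host=$2
    # Capture stderr only; stdout (the path printed by which) is discarded.
    if ! err=$("$agent" "$host" -n which hboot 2>&1 >/dev/null); then
        echo "$host: remote command failed: $err"
        return 1
    fi
    if [ -n "$err" ]; then
        echo "$host: unexpected output on stderr: $err"
        return 1
    fi
    echo "$host: ok"
}

# Typical use (not run here):
#   check_remote_hboot /usr/bin/rsh 192.168.0.1
```

If this reports that hboot cannot be found on a node, then either LAM is not installed there or its bin directory is not on the PATH that non-interactive shells get.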

The documentation on the website covers this in great detail; you should
look there.

Damien

>
> The machine on which I run my codes has 20 nodes; each node has a distinct
> IP address and 2 processors.
>
> When I tried to boot LAM on those nodes, an error occurred.
> I have no idea about this error.
> I guess maybe LAM is not installed on those nodes except node 0.
>
> When I tried the command /usr/bin/rsh 192.168.0.1 -n hboot -t -c
> lam-conf.lam -v -s -I "-H 169.237.129.129 -P 52716 -n 1 -o 0", the error
> message "bash: hboot: command not found" occurred.
>
> Again, I am totally confused.
>
> Anyway, could you give me some solutions to fix this problem?
> Thank you
>
> Below is the error message
>
> [cycchou_at_matrx demos]$ lamboot -v hostfile
>
> LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
>
> Executing hboot on n0 (matrx.engr.ucdavis.edu - 2 CPUs)...
> Executing hboot on n1 (192.168.0.1 - 2 CPUs)...
> bash: hboot: command not found
> -----------------------------------------------------------------------------
> LAM attempted to execute a process on the remote node "192.168.0.1",
> but received some output on the standard error.
>
> LAM tried to use the remote agent command "/usr/bin/rsh"
> to invoke "hboot" on the remote node.
>
> This can indicate an authentication error with the remote agent, or
> can indicate an error in your $HOME/.cshrc, $HOME/.login, or
> $HOME/.profile files. The following is a list of items that you may
> wish to check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should
> probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are
> using rsh -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you
> are using rsh) for the machine and username that you are
> running from
> - Your .cshrc/.profile must not print anything out to the
> standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment
> variable to your default shell
>
> Try invoking the following command at the unix command line:
>
> /usr/bin/rsh 192.168.0.1 -n hboot -t -c lam-conf.lam -v -s -I "-H
> 169.237.129.129 -P 52716 -n 1 -o 0"
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> wipe ...
>
> LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
>
> Executing tkill on n0 (matrx.engr.ucdavis.edu)...
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>