LAM/MPI General User's Mailing List Archives

From: RANGI, JAI (Jai.Rangi_at_[hidden])
Date: 2004-05-12 09:10:50


Hi, I am getting a very strange error while booting LAM.
Here is the error:
rangij_at_sd1:~> lamboot -v hostfile

LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University

Executing hboot on n0 (sd1 - 1 CPU)...
Executing hboot on n1 (sd2 - 1 CPU)...
bash: line 1: hboot: command not found
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "sd2",
but received some output on the standard error.

LAM tried to use the remote agent command "/usr/bin/rsh"
to invoke "hboot" on the remote node.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a list of items that you may
wish to check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        /usr/bin/rsh sd2 -n hboot -t -c lam-conf.lam -v -s -I "-H 192.168.1.101 -P 34259 -n 1 -o 0 "

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
wipe ...

LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University

Executing tkill on n0 (sd1)...

All the file permissions and rsh/ssh configurations are OK; I can log on to
any node with a password prompt.
Then I ran:

rangij_at_sd1:~> /usr/bin/rsh sd2 -n hboot -t -c lam-conf.lam -v -s -I "-H 192.168.1.101 -P 34259 -n 1 -o 0 "
bash: line 1: hboot: command not found
rangij_at_sd1:~> rsh sd2 which hboot
which: no hboot in (/usr/bin:/bin)

So from this it looks like hboot is not in the path. But this is what the
$PATH variable displays:

rangij_at_sd1:~> rsh sd2 echo $PATH
/home/rangij/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/local/lam-6.5.9/bin:/usr/local/gamess:/usr/lib/SmallEiffel/bin:/usr/lib/java/bin
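
One caveat about this check: because $PATH is unquoted, the local shell on sd1 expands it before rsh runs, so the output is likely sd1's own PATH and not the one the non-interactive shell on sd2 sees. A minimal sketch of the difference follows. The `remote` helper is hypothetical (it is not part of rsh or LAM); it just runs a command in a child shell with a restricted PATH, standing in for "rsh sd2 <command>", and the restricted value /usr/bin:/bin is taken from the `which` output above.

```shell
# Hypothetical stand-in for "rsh sd2 <command>": like rsh, it joins its
# arguments into one string and hands that to a fresh non-interactive shell,
# here started with a restricted PATH.
remote() { env PATH=/usr/bin:/bin sh -c "$*"; }

# Unquoted: the calling shell expands $PATH first, so the child shell only
# echoes the caller's PATH back.
remote echo $PATH

# Single-quoted: $PATH reaches the child shell unexpanded, so the child
# shell's own PATH is printed.
remote 'echo $PATH'
```

With real rsh, the equivalent check would be rsh sd2 'echo $PATH' with the single quotes kept.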

rangij_at_sd1:~> which hboot
/usr/local/lam-6.5.9/bin/hboot

If I log on to sd2, I can see the command:
rangij_at_sd1:~> rsh sd2
Last login: Wed May 12 08:57:37 from sd1.sdcluster.jacks.local
Have a lot of fun...
rangij_at_sd2:~> which hboot
/usr/local/lam-6.5.9/bin/hboot
rangij_at_sd2:~>

Any idea what I am missing? Why is it searching for the command only in /bin
and /usr/bin?

Thanks

-Jai Rangi