The hboot command "does not exist" on any node other than node 0.
What am I supposed to do?
> The problem is that when you do a password-less ssh in bash, the
> ".bashrc" file in your home directory does not get sourced. Therefore
> the shell has no search "PATH" in which to find any of the commands.
> To work around this:
>
> 1. Either put all your PATH settings in the /etc/bashrc file, or
> 2. Create a .bashrc file with all the PATH settings in your home
> directory AND create a ".bash_profile" file containing the line
> "source .bashrc"
>
> This should do the trick.
>
> Yusuf
>
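A minimal sketch of option 2 above, assuming the LAM binaries live in a hypothetical /usr/local/lam/bin (substitute whichever directory actually contains hboot on your nodes). The function name add_lam_path is made up for illustration and is not part of LAM.

```shell
# add_lam_path DIR [HOME_DIR]
# Hypothetical helper: puts DIR on PATH via .bashrc and makes
# .bash_profile source .bashrc. HOME_DIR defaults to $HOME.
add_lam_path() {
  dir=$1
  home=${2:-$HOME}

  # Append the PATH line to .bashrc unless DIR is already mentioned there.
  if ! grep -qsF -- "$dir" "$home/.bashrc"; then
    printf 'export PATH="%s:$PATH"\n' "$dir" >> "$home/.bashrc"
  fi

  # Have login shells read .bashrc too, unless .bash_profile already does.
  if ! grep -qsF -- '.bashrc' "$home/.bash_profile"; then
    printf '%s\n' '[ -f "$HOME/.bashrc" ] && . "$HOME/.bashrc"' >> "$home/.bash_profile"
  fi
}
```

Run it on each node as "add_lam_path /usr/local/lam/bin" (once is enough if the home directory is shared across nodes). Calling it again does not add duplicate lines.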
>
> On Jun 10, 2004, at 12:25 PM, damien_at_[hidden] wrote:
>
> > The first thing you need to do is make sure LAM is installed or
> > available on all the nodes. I've never tried to boot LAM that way
> > though. You might want to try the basic lamboot with a host
> > definition file.
> >
> > The second thing you need to do, if LAM is installed, is make sure
> > that you can run commands remotely on the other nodes, either through
> > ssh or rsh. If you're running rsh, try this command:
> >
> > rsh nodename ls -laF
> >
> > where nodename is the name of one of the nodes, or you can put the IP
> > address in instead. You should get a directory listing from the node
> > with file sizes, dates, etc. If you get any other response at all,
> > like an error message or a request for a password, then your remote
> > command execution is not set up properly and LAM won't work.
> >
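The same kind of test can be pointed at hboot itself, since a working "ls" does not prove that the LAM directory is on the non-interactive PATH. This is only a sketch: check_hboot is a hypothetical helper, and it assumes the remote login shell understands "command -v" (the "bash:" prefix in the error message suggests bash).

```shell
# check_hboot AGENT HOST
# Hypothetical helper: asks HOST, through AGENT (rsh or ssh), where
# hboot is in the non-interactive PATH. Returns non-zero if not found.
check_hboot() {
  agent=$1
  host=$2

  # -n keeps the remote agent from reading stdin, as lamboot does.
  if path=$("$agent" -n "$host" 'command -v hboot' 2>/dev/null) && [ -n "$path" ]; then
    echo "$host: hboot found at $path"
  else
    echo "$host: hboot not found in the non-interactive PATH"
    return 1
  fi
}
```

For example, "check_hboot /usr/bin/rsh 192.168.0.1" uses the agent and node from the error output. If it reports "not found" while an interactive login on that node does find hboot, the PATH is being set in a startup file that remote commands do not read.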
> > The documentation on the website covers this in great detail; you
> > should look there.
> >
> > Damien
> >
> >>
> >> The machine on which I run my codes has 20 nodes; each node has a
> >> distinct IP address and 2 processors.
> >>
> >> When I tried to boot LAM on those nodes, an error occurred.
> >> I have no idea about this error.
> >> I guess maybe LAM is not installed on those nodes except node 0.
> >>
> >> When I tried the command
> >> /usr/bin/rsh 192.168.0.1 -n hboot -t -c lam-conf.lam -v -s -I "-H 169.237.129.129 -P 52716 -n 1 -o 0"
> >> the error message "bash: hboot: command not found" occurred.
> >>
> >> Again, I am totally confused.
> >>
> >> Anyway, could you give me some solutions to fix this problem?
> >> Thank you
> >>
> >> Below is the error message
> >>
> >> [cycchou_at_matrx demos]$ lamboot -v hostfile
> >>
> >> LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
> >>
> >> Executing hboot on n0 (matrx.engr.ucdavis.edu - 2 CPUs)...
> >> Executing hboot on n1 (192.168.0.1 - 2 CPUs)...
> >> bash: hboot: command not found
> >> -----------------------------------------------------------------------------
> >> LAM attempted to execute a process on the remote node "192.168.0.1",
> >> but received some output on the standard error.
> >>
> >> LAM tried to use the remote agent command "/usr/bin/rsh"
> >> to invoke "hboot" on the remote node.
> >>
> >> This can indicate an authentication error with the remote agent, or
> >> can indicate an error in your $HOME/.cshrc, $HOME/.login, or
> >> $HOME/.profile files. The following is a list of items that you may
> >> wish to check on the remote node:
> >>
> >> - You have an account and can login to the remote machine
> >> - Incorrect permissions on your home directory (should
> >> probably be 0755)
> >> - Incorrect permissions on your $HOME/.rhosts file (if you are
> >> using rsh -- they should probably be 0644)
> >> - You have an entry in the remote $HOME/.rhosts file (if you
> >> are using rsh) for the machine and username that you are
> >> running from
> >> - Your .cshrc/.profile must not print anything out to the
> >> standard error
> >> - Your .cshrc/.profile should set a correct TERM type
> >> - Your .cshrc/.profile should set the SHELL environment
> >> variable to your default shell
> >>
> >> Try invoking the following command at the unix command line:
> >>
> >> /usr/bin/rsh 192.168.0.1 -n hboot -t -c lam-conf.lam -v -s -I "-H 169.237.129.129 -P 52716 -n 1 -o 0"
> >>
> >> You will need to configure your local setup such that you will *not*
> >> be prompted for a password to invoke this command on the remote node.
> >> No output should be printed from the remote node before the output of
> >> the command is displayed.
> >>
> >> When you can get this command to execute successfully by hand, LAM
> >> will probably be able to function properly.
> >> -----------------------------------------------------------------------------
> >> -----------------------------------------------------------------------------
> >> lamboot encountered some error (see above) during the boot process,
> >> and will now attempt to kill all nodes that it was previously able to
> >> boot (if any).
> >>
> >> Please wait for LAM to finish; if you interrupt this process, you may
> >> have LAM daemons still running on remote nodes.
> >> -----------------------------------------------------------------------------
> >> wipe ...
> >>
> >> LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
> >>
> >> Executing tkill on n0 (matrx.engr.ucdavis.edu)...
> >> _______________________________________________
> >> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>
Best Regards,
Yu-Cheng Chou
Integration Engineering Lab
Mechanical and Aeronautical Engineering
University of California, Davis