You do need the LAM executables installed on all nodes, and they must
be in your PATH on every node.

Check the "Booting LAM" section of the LAM FAQ. There is also
information about this topic in the LAM/MPI User's Guide
(http://www.lam-mpi.org/using/docs/).
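
As a rough way to check this, you can ask each node what a
non-interactive remote shell finds for hboot. This is only a sketch:
check_node and RSH_AGENT are names made up for this example, and it
assumes that "which" exists on the nodes and that rsh works without a
password. It inspects the output rather than the exit status, because
classic rsh does not pass the remote exit status back.

```shell
#!/bin/sh
# Sketch: confirm that a LAM executable is found in the PATH that a
# non-interactive remote shell gets on each node.
# Assumptions: RSH_AGENT and check_node are illustrative names, not part
# of LAM; "which" is available on every node.

RSH_AGENT=${RSH_AGENT:-rsh}

# check_node HOST PROGRAM: report whether PROGRAM resolves on HOST.
check_node() {
    host=$1
    prog=$2
    # Classic rsh does not return the remote exit status, so look at
    # what "which" printed: a successful lookup is an absolute path.
    out=$($RSH_AGENT -n "$host" which "$prog" 2>/dev/null)
    case $out in
        /*) echo "$host: $prog found at $out" ;;
        *)  echo "$host: $prog NOT found in PATH"
            return 1 ;;
    esac
}

# Check hboot on every node given on the command line.
for node in "$@"; do
    check_node "$node" hboot
done
```

Run it as "sh check_lam_path.sh 192.168.1.1 192.168.1.2" (the file
name is arbitrary). If hboot is not found on a node, the first thing
to look at is the PATH set by that node's shell startup files for
non-interactive logins.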
On Jan 27, 2006, at 1:19 PM, hayden wrote:
> Hi LAM community
>
> I fixed the problem with my superfluous text, but I am summarily
> greeted with new problems:
>
> When I type lamboot -v lamhostfile I now get the following output
> (when trying to boot one distant node):
>
>
> __________________________________________________________________________
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n-1<12538> ssi:boot:base:linear: booting n0 (192.168.1.1)
> n-1<12538> ssi:boot:base:linear: booting n1 (192.168.1.2)
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> base: cannot find process schema (null): No such file or directory
> -----------------------------------------------------------------------------
> *** Oops -- I cannot open the LAM help file.
> *** I tried looking for it in the following places:
> ***
> *** $HOME/lam-helpfile
> *** $HOME/lam-7.1.1-helpfile
> *** $HOME/etc/lam-helpfile
> *** $HOME/etc/lam-7.1.1-helpfile
> *** $LAMHELPDIR/lam-helpfile
> *** $LAMHELPDIR/lam-7.1.1-helpfile
> *** $LAMHOME/etc/lam-helpfile
> *** $LAMHOME/etc/lam-7.1.1-helpfile
> *** $SYSCONFDIR/lam-helpfile
> *** $SYSCONFDIR/lam-7.1.1-helpfile
> ***
> *** You were supposed to get help on the program "hboot"
> *** about the topic "cant-parse-config"
> ***
> *** Sorry!
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> LAM attempted to execute a process on the remote node "192.168.1.2",
> but received some output on the standard error. This heuristic
> assumes that any output on the standard error indicates a fatal error,
> and therefore aborts. You can disable this behavior (i.e., have LAM
> ignore output on standard error) in the rsh boot module by setting the
> SSI parameter boot_rsh_ignore_stderr to 1.
>
> LAM tried to use the remote agent command "rsh"
> to invoke "hboot" on the remote node.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> This can indicate an authentication error with the remote agent, or
> can indicate an error in your $HOME/.cshrc, $HOME/.login, or
> $HOME/.profile files. The following is a (non-inclusive) list of items
> that you should check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should
>   probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are
>   using rsh -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you
>   are using rsh) for the machine and username that you are
>   running from
> - Your .cshrc/.profile must not print anything out to the
>   standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment
>   variable to your default shell
>
> Try invoking the following command at the unix command line:
>
>     rsh 192.168.1.2 -n hboot -t -c lam-conf.lamd -v -s -I '"-H 192.168.1.1 -P 33693 -n 1 -o 0"'
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
> -----------------------------------------------------------------------------
> n-1<12538> ssi:boot:base:linear: Failed to boot n1 (192.168.1.2)
> n-1<12538> ssi:boot:base:linear: aborted!
> n-1<12544> ssi:boot:base:linear: booting n0 (192.168.1.1)
> n-1<12544> ssi:boot:base:linear: booting n1 (192.168.1.2)
> n-1<12544> ssi:boot:base:linear: finished
> lamboot did NOT complete successfully
>
> ____________________________________________________________________________
>
>
> I am rather confused by the error message, specifically because it
> implies that LAM is trying to run hboot on the distant node, and hboot
> only exists on the server... I tried installing the LAM binaries on
> the node as well, but this didn't change the error message.
>
> I'm totally at sea as to why this may be. Please help!
>
> Thank you
>
> Hayden Eastwood
>
> ______________________________________________
> Hayden Eastwood
> Perdita Barran Research Group
> Joseph Black Building
> Edinburgh University
> West Mains Road
> EH9 3JJ
>
> Tel: 0131 650 4773
> e-mail: s0237717_at_[hidden]
> Research page: http://homepages.ed.ac.uk/pbarran/PBRG/
> "You have to be an academic to believe some things; no ordinary
> person would be so stupid." -George Orwell
>
> -----Original Message-----
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf Of Jeff Squyres
> Sent: 27 January 2006 13:47
> To: General LAM/MPI mailing list
> Subject: Re: LAM: lamboot problems
>
> On Jan 26, 2006, at 10:15 AM, hayden wrote:
>
>> 1. Why the hell does each machine echo "Which manual page do you
>> want" every time I run a remote command? Do you have any idea where
>> instructions for generating such text might lie (so I can get rid of
>> it)? I've had a look at the .bashrc files and they definitely contain
>> no "echo" statements. This text occurs only when I run
>> rsh <machineName> <command> and occurs as a single line immediately
>> before the command is executed.
>
> I wouldn't look for echo statements, I'd look for "man" statements
> (i.e., that looks like an error from the "man" command).
>
> This is something that is going to be specific to your local setup,
> and there isn't much that we can do to help you find it -- look again
> in your .bashrc files (if you're a bash user) and perhaps in the
> various shell startup files in /etc. You might want to put echo
> statements in your own .bashrc (etc.) as a search method to try to
> pin down where/when the errant command is coming from.
>
>> 2. If I can't get rid of this text, can I just tell lamboot to
>> ignore superfluous messages somehow? I tried using the "-x" flag
>> (for ignoring errors; I found it by typing "man lamboot"), but
>> lamboot doesn't recognise this flag.
>
> -x is for fault tolerant mode, meaning errors on the network
> transport -- not errors in startup. You can, however, instruct
> lamboot to ignore output on stderr during rsh-based boots by setting
> the SSI parameter
>
> boot_rsh_ignore_stderr
>
> to 1. You can do this with:
>
> shell$ lamboot -ssi boot_rsh_ignore_stderr 1 ...
> or
> shell$ export LAM_MPI_SSI_boot_rsh_ignore_stderr=1
> shell$ lamboot ...
>
> Hope that helps.
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/