LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-09-29 20:36:47


Two quick things:

- Can you upgrade your LAM version? 6.5.8 is very old -- the 6.5
series is no longer supported.

- The operative phrase of the error message is:

> No output should be printed from the remote node before the output of
> the command is displayed.

In your case, "stty: ..." is being displayed. I suspect that one of
your shell setup files (e.g., $HOME/.profile) is erroneously invoking
the stty command for non-interactive remote logins. LAM saw the first
few bytes of the stty error message on stderr, assumed that the remote
command had failed, and therefore aborted.
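
One common way to keep stty out of non-interactive logins is to run it
only when standard input is a terminal. The following is a minimal
sketch for $HOME/.profile, not the poster's actual file: the helper name
is_interactive is hypothetical, and "stty sane" is only a placeholder
for whatever stty call the real file makes.

```shell
# Sketch for $HOME/.profile (sh/ksh syntax).
# is_interactive is a hypothetical helper name, not part of LAM or ksh.
is_interactive() {
    # [ -t 0 ] succeeds only when standard input is a terminal,
    # which is not the case when ssh is run with -n.
    [ -t 0 ]
}

if is_interactive; then
    # Placeholder for whatever stty call the real .profile makes.
    stty sane
fi
```

With a guard like this, an interactive login still gets its terminal
settings, while a remote command started by lamboot should see no stty
output on stderr.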

Eliminate that error and lamboot should work. A better test than the
one shown in the error message would be:

        /usr/bin/ssh -x -a p03.asdl.ae.gatech.edu -n uptime

Then check whether that "stty" error appears.
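
Since LAM only objects to output on stderr, it can help to look at
stderr alone when running that test by hand. The helpers below are a
small sketch; stderr_of and stderr_is_clean are hypothetical names, not
LAM commands.

```shell
# Print only what a command writes to stderr; its stdout is discarded.
stderr_of() {
    # Order matters: stderr is first pointed at the current stdout,
    # then stdout itself is sent to /dev/null.
    "$@" 2>&1 >/dev/null
}

# Succeed only if the command wrote nothing to stderr.
stderr_is_clean() {
    noise=$(stderr_of "$@")
    if [ -n "$noise" ]; then
        printf 'unexpected stderr output: %s\n' "$noise"
        return 1
    fi
    return 0
}
```

For the node in question, that would be invoked as:

        stderr_is_clean /usr/bin/ssh -x -a p03.asdl.ae.gatech.edu -n uptime

If you try the full hboot command from the error message the same way,
pass the remote part as one quoted argument. Left unquoted, the local
shell interprets the parentheses and the ";" itself instead of sending
them to the remote node.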

On Sep 28, 2004, at 5:27 PM, Sriram Rallabhandi wrote:

>
> Hi all,
>
> I'm new to LAM_MPI set-up although I have done MPI programming
> before. I have a Beowulf cluster with shared memory nodes. I created a
> lam_machines file and am invoking the command "lamboot -dv lam_machines".
> I get the following output:
>
>
> -----------------------------------------------------------------------
> -----------------------------------------------------------------------
> ---------------------------
> [sriramr_at_p02 basic]$ lamboot -dv lam_machines
>
> LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University
>
> lamboot: boot schema file: lam_machines
> lamboot: opening hostfile lam_machines
> lamboot: found the following hosts:
> lamboot:   n0 p02.asdl.ae.gatech.edu
> lamboot:   n1 p03.asdl.ae.gatech.edu
> lamboot:   n2 p04.asdl.ae.gatech.edu
> lamboot: resolved hosts:
> lamboot:   n0 p02.asdl.ae.gatech.edu --> 172.16.3.102
> lamboot:   n1 p03.asdl.ae.gatech.edu --> 172.16.3.103
> lamboot:   n2 p04.asdl.ae.gatech.edu --> 172.16.3.104
> lamboot: found 3 host node(s)
> lamboot: origin node is 0 (p02.asdl.ae.gatech.edu)
> Executing hboot on n0 (p02.asdl.ae.gatech.edu - 1 CPU)...
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I "
> -H 172.16.3.102 -P 32850 -n 0 -o 0     ""
> hboot: process schema = "/etc/lam/lam-conf.lam"
> hboot: found /usr/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/bin/lamd
> hboot: attempting to execute
> [1]  12357 lamd -H 172.16.3.102 -P 32850 -n 0 -o 0 -d
> Executing hboot on n1 (p03.asdl.ae.gatech.edu - 1 CPU)...
> lamboot: attempting to execute "/usr/bin/ssh -x -a
> p03.asdl.ae.gatech.edu -n echo $SHELL"
> lamboot: got remote shell /bin/ksh
> lamboot: attempting to execute "/usr/bin/ssh -x -a
> p03.asdl.ae.gatech.edu -n (. ./.profile; hboot -t -c lam-conf.lam -d
> -v -s -I "-H 172.16.3.102 -P 32850 -n 1 -o 0    " )"
> stty:
> -----------------------------------------------------------------------
> ------
> LAM attempted to execute a process on the remote node
> "p03.asdl.ae.gatech.edu",
> but received some output on the standard error.
>
> LAM tried to use the remote agent command "/usr/bin/ssh"
> to invoke "hboot" on the remote node.
>
> This can indicate an authentication error with the remote agent, or
> can indicate an error in your $HOME/.cshrc, $HOME/.login, or
> $HOME/.profile files.  The following is a list of items that you may
> wish to check on the remote node:
>
>         - You have an account and can login to the remote machine
>         - Incorrect permissions on your home directory (should
>           probably be 0755)
>         - Incorrect permissions on your $HOME/.rhosts file (if you are
>           using rsh -- they should probably be 0644)
>         - You have an entry in the remote $HOME/.rhosts file (if you
>           are using rsh) for the machine and username that you are
>           running from
>         - Your .cshrc/.profile must not print anything out to the
>           standard error
>         - Your .cshrc/.profile should set a correct TERM type
>         - Your .cshrc/.profile should set the SHELL environment
>           variable to your default shell
>
> Try invoking the following command at the unix command line:
>
>         /usr/bin/ssh -x -a p03.asdl.ae.gatech.edu -n (. ./.profile;
> hboot -t -c lam-conf.lam -d -v -s -I "-H 172.16.3.102 -P 32850 -n 1 -o
> 0    " )
>
> You will need to configure your local setup such that you will *not*
> be prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of
> the command is displayed.
>
> When you can get this command to execute successfully by hand, LAM
> will probably be able to function properly.
>
> -----------------------------------------------------------------------
> ------
>
> -----------------------------------------------------------------------
> ------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
>
> -----------------------------------------------------------------------
> ------
> wipe ...
>
> LAM 6.5.8/MPI 2 C++/ROMIO - Indiana University
>
> Executing tkill on n0 (p02.asdl.ae.gatech.edu)...
> lamboot did NOT complete successfully
> -----------------------------------------------------------------------
> -----------------------------------------------------------------------
> -----------
>
> From the output above, the root node is attempting to invoke the
> following command:
>
> /usr/bin/ssh -x -a p03.asdl.ae.gatech.edu -n (. ./.profile; hboot -t
> -c lam-conf.lam -d -v -s -I "-H 172.16.3.102 -P 32850 -n 1 -o 0    " )
>
> I don't know why there are parentheses in the above command. With those
> parentheses, I get a "badly placed ()'s" error. So I removed the
> parentheses and invoked the command from the root node (p02):
>
> /usr/bin/ssh -x -a p03.asdl.ae.gatech.edu -n . ./.profile; hboot -t -c
> lam-conf.lam -d -v -s -I "-H 172.16.3.102 -P 32857 -n 1 -o 0    "
>
> and got the following output without any errors.
>
> stty: standard input: Invalid argument
> hboot: process schema = "/etc/lam/lam-conf.lam"
> hboot: found /usr/bin/lamd
> hboot: performing tkill
> hboot: tkill
> hboot: booting...
> hboot: fork /usr/bin/lamd
> [1]  12495 lamd -H 172.16.3.102 -P 32857 -n 1 -o 0 -d
>
> However, when I do this and then invoke the mpirun command, the other
> nodes are not recognized and I get the following output.
>
> -----------------------------------------------------------------------
> ------
> It seems that [at least] one of processes that was started with mpirun
> did not invoke MPI_INIT before quitting (it is possible that more than
> one process did not invoke MPI_INIT -- mpirun was only notified of the
> first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------
> ------
>
> I have set up the nodes so that I can ssh into any of them without
> entering the password. I know there have been many posts about lamboot
> problems in these archives, but none of them resolved my problem.
>
> Could someone help me set up LAM and MPI on my cluster?
>
> Thanks
> Sriram
>
>
>
>
> -----------------------------------------------------------------------
> --------
> Sriram K. Rallabhandi
> Graduate Research Assistant       Work: 404 385 2789
> Aerospace Engineering                 Res:  404 603 9160
> Georgia Inst. of Technology
>
> -----------------------------------------------------------------------
> --------
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/