LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-10-17 15:00:25


On Oct 17, 2005, at 3:50 PM, James Dorsey wrote:

> BACKGROUND:
> 1) Using lam-7.1.1 under FreeBSD 5.4
>
> 2) recon works on all nodes by rsh from the master node.
>
> 3) I can't seem to get $LAMHOME to "stick" after rebooting the master or
> the nodes, and the non-interactive sh shell invoked by rsh doesn't seem
> to search the usual places for things to add to my path. So I went for a
> quick dirty fix - I wrote a script to put links to all the files in
> /usr/local/lam-mpi/bin into /usr/local/bin on each node. I don't think
> this should be causing the problem, but thought I'd mention it as it's a
> bit "non-standard".

This should not be necessary.

> 4) Lamboot is launched from a script in my path called "slam" (Start
> LAM), which contains the following line only:
>
> lamboot -v -d -ssi boot rsh ~/.pantheonmap
>
> where ~/.pantheonmap is the boot schema file containing the name of the
> master and three slave nodes.

This also should not be necessary.

> 5) /usr/local/lam-mpi/bin is present on each node, as there's a mix of
> processor types (AMD64, AMD32, P-3) so I wasn't sure if an nfs share
> would work.

One possible way to do this (there are many) is to have multiple
installations of LAM on NFS in different directories. Then just set the
PATH as appropriate on each node -- LAM uses $PATH resolution for its
executables, so as long as it's Right on each node, the Right Things
will happen.
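
As a rough sketch of the per-node PATH idea: the snippet below picks a
LAM prefix from the machine type and prepends its bin directory to PATH.
The /nfs/lam/... paths and the function name lam_prefix_for_arch are
made up for illustration, and the machine-type strings are just the
values "uname -m" typically reports. It would need to go in a startup
file that the shell started by rsh actually reads (which one that is
depends on your login shell).

```shell
# Hypothetical layout: one LAM build per architecture on the NFS share.
# These directory names are assumptions, not anything LAM requires.
lam_prefix_for_arch() {
    case "$1" in
        amd64|x86_64)        echo "/nfs/lam/amd64" ;;
        i386|i486|i586|i686) echo "/nfs/lam/i386" ;;
        *)                   return 1 ;;   # unknown machine type
    esac
}

# Prepend the matching bin directory; leave PATH alone if nothing matches.
if prefix=$(lam_prefix_for_arch "$(uname -m)"); then
    PATH="$prefix/bin:$PATH"
    export PATH
fi
```

With something like this in place, "which lamboot" over rsh should show
the build that matches each node.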

> 6) The home directory for each slave node is an nfs mount of the master
> home directory.
>
> PROBLEM:
>
> Lamboot fails due to output on stderr from the slave nodes. I changed

Excellent amount of detail; thanks!

> THE QUESTION (AT LAST):
>
> Is it likely that I've messed up the configuration, or is it possible
> that lam-7.1.1 is getting the rsh commands wrong (in the placement of
> quotation marks)? Is there something I can do about this?

I *think* that many of your problems stem from LAM 7.1.1 putting the ]
in the wrong place -- it's missing a space, causing the parsing on the
remote node to go badly.
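
To illustrate why one missing space matters (this is only the general
sh rule, not the actual command that LAM generates): the closing ] of a
test has to be a separate word. If it is glued to the previous argument,
the test fails and prints a diagnostic on stderr.

```shell
# Illustration only; these strings are not what LAM sends over rsh.
well_formed='[ -n "x" ]'   # "]" is its own word
malformed='[ -n "x"]'      # "]" glued to the previous argument

# The well-formed test exits 0.
sh -c "$well_formed" && echo "well-formed test succeeded"

# The malformed test exits non-zero and complains about a missing "]";
# stderr is discarded here so that only the summary line is shown.
sh -c "$malformed" 2>/dev/null || echo "malformed test failed"
```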

The latest beta of 7.1.2 fixes this issue -- could you give that a
whirl? It might also fix your $LAMHOME/PATH issues (if the shell
parsing is wrong right off the bat, other things can go wrong).

        http://www.lam-mpi.org/beta/

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/