LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-06-04 07:26:18


As you've noted, this is a complicated issue. :-(

In short, the situation is the same. However, LAM 7.1 should make this
kind of situation at least slightly easier. I believe that you and I had
a similar conversation a while ago, but a) I could be remembering that
wrong, and b) I might as well provide a full explanation here for the
list and web archives.

So the issue is that LAM heavily uses POSIX semantics. Our philosophy has
always been to try to extend user-expected serial POSIX semantics into the
parallel world. One aspect of this is using the $PATH. Specifically, on
remote nodes, LAM uses the $PATH during lamboot and mpirun to find
executables to run. For example, when you lamboot, each lamd will get its
own environment. If you're using rsh/ssh, this means that your "dot" files
will be sourced, your typical environment is set up, and the lamd simply
inherits it. If you're using a different boot module, your environment
may be constructed differently -- tm, for example, simply copies the
environment from where lamboot was executed. Either way, each lamd
will have its own $PATH.

So when you "mpirun C foo", the lamd will look for a "foo" in your $PATH,
and will execute the first one that it finds (note that mpirun/the lamd
usually automatically appends "." to the searched $PATH so that the common
case of "foo" in your current working directory will work as expected).
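
The lookup described above behaves much like an ordinary shell $PATH
search. As a rough sketch only (this is plain sh, not LAM's actual code,
and the function name find_first is made up for illustration), the
"first match wins, with '.' searched last" rule looks like this:

```shell
# Hypothetical helper: find_first <name> <colon-separated-dirs>
# Prints the first executable called <name>, searching "." last.
# The body runs in a subshell, so the IFS and "set -f" changes stay local.
find_first() (
    name=$1
    IFS=:
    set -f                          # no filename globbing while splitting
    for dir in $2 .; do
        [ -n "$dir" ] || dir=.      # an empty entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            exit 0
        fi
    done
    exit 1                          # nothing found anywhere
)
```

Running find_first foo "$PATH" on a node shows which "foo" would win
there. If two LAM installations are both in the $PATH, whichever
directory comes first decides which one is picked up.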

Combine this philosophy with the fact that we really make no effort to
maintain binary compatibility between versions. I realize that this
creates a hardship for some people, but please remember that we *are*
academics after all, and LAM/MPI is about parallel computing research.
We are not a commercially supported venture, and simply do not have the
resources to ensure binary compatibility between releases. We try our
hardest to make rock-solid software that combines cutting edge research
with production-quality code, but enforcing binary compatibility is
simply too much for us to manage. I apologize, but that's the way it is.

Ok, I'll stop whining. :-)

So as a specific consequence of this, as Richard pointed out, you *must*
ensure that the same version of lamboot and the lamd is used on all nodes
in the universe, and your MPI application must be compiled with the same
LAM version, too. This typically means modifying your "dot" files to
change your $PATH, or sourcing an additional $HOME/.lamrc file, or some
other similar scheme. Let me give an example. Say you have multiple LAM
installations on all your nodes:

         /home/software/lam-6.5.9
         /home/software/lam-7.0.5
         /home/software/lam-7.0.6

Each of those has a "/bin" subdirectory containing lamboot, lamd,
mpirun, mpicc, and all the rest.

Somewhere in your .tcshrc (I'm a tcsh kind of guy -- other shells are
similar but with slightly different syntax), you'll likely have something
similar to:

-----
   set path = (/home/software/lam-6.5.9/bin $path)
# set path = (/home/software/lam-7.0.5/bin $path)
# set path = (/home/software/lam-7.0.6/bin $path)
-----

And just uncomment the one you need. Or perhaps you have just one line
and edit it as appropriate; both styles are functionally identical.
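
For readers using a Bourne-style shell (sh, bash), a rough equivalent of
the tcsh lines above might look like the following. The variable name
LAM_PREFIX is made up for this example, and which startup file is
actually read for non-interactive rsh/ssh logins depends on the shell and
how it is configured, so check that yours really is sourced in that case.

```shell
# Pick exactly one LAM installation (paths taken from the example above).
# LAM_PREFIX is a hypothetical helper variable, not something LAM reads.
LAM_PREFIX=/home/software/lam-6.5.9
# LAM_PREFIX=/home/software/lam-7.0.5
# LAM_PREFIX=/home/software/lam-7.0.6

# Put the chosen installation's bin directory first so it wins the search.
PATH=$LAM_PREFIX/bin:$PATH
export PATH
```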

An alternative method is to keep the decision outside of .tcshrc in a
separate file, such as $HOME/.lamrc. So in your .tcshrc, you simply have:

-----
   source $HOME/.lamrc
-----

And in $HOME/.lamrc, you have a "set path" line (or some variation
thereof). This is slightly nicer because then you can modify which LAM
installation you want to use simply by editing .lamrc -- you can even have
automated scripts do this, because presumably .lamrc could even be a
single line, and is therefore trivial to edit or outright replace -- and
avoid editing (and potentially damaging) your .tcshrc file.
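
As an illustration of the "automated scripts" idea, here is a minimal sh
sketch that replaces .lamrc with a single tcsh "set path" line. The
function name write_lamrc and its arguments are made up for this
example; it also assumes that $HOME is shared across the nodes (or that
the file is updated on every node), since each remote shell sources its
own copy.

```shell
# Hypothetical helper: write_lamrc <lam-install-dir> [lamrc-file]
# Replaces the file with one tcsh "set path" line for that installation.
write_lamrc() {
    prefix=$1
    lamrc=${2:-$HOME/.lamrc}
    if [ ! -d "$prefix/bin" ]; then
        echo "write_lamrc: no bin directory under $prefix" >&2
        return 1
    fi
    # Write to a temporary file first and then rename it, so that a shell
    # starting up at the same moment never sources a half-written file.
    tmp=$lamrc.tmp.$$
    printf 'set path = (%s/bin $path)\n' "$prefix" > "$tmp" &&
        mv "$tmp" "$lamrc"
}
```

An application startup script could then call, for example,
write_lamrc /home/software/lam-7.0.6 before running lamboot.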

<sidenote> The OSCAR cluster project uses something like this for
switching between LAM/MPI and MPICH. It uses the environment
"modules" project (http://modules.sf.net/) with a thin layer for
persistence called "switcher" that effects a two-layer set of defaults:
what the cluster sysadmin sets as the default implementation/version
of MPI and then what the user sets. For example, the sysadmin can set
"everyone will use LAM/MPI 7.0.5." A user who installs their own copy
of LAM/MPI 7.0.6 can then set their personal settings to "use LAM/MPI
7.0.6" (thereby overriding the system defaults).
Although a .lamrc-like method is used to effect setting the $PATH (and
$MANPATH), the user doesn't edit "dot" files at all -- commands such
as "switcher mpi = lam-7.0.6" do the trick. </sidenote>

But even with this OSCAR/switcher system, it still boils down to editing
something that gets sourced during .tcshrc. This can be unattractive
for ISVs who really don't want to touch user "dot" files at all.

So after all this explanation -- here's how 7.1 will make this a
little better:

In LAM 7.1, there will be a "prefix" option to lamboot that lets you
specify the installation prefix directory for the lamd (and therefore
its PATH) on remote nodes when using the rsh boot module. This
effectively allows you to specify the directory where LAM is installed
on remote nodes without touching "dot" files.

On Wed, 2 Jun 2004, Richard Hadsell wrote:

> I realize from the FAQ and from past experience that an MPI application
> must be compiled for a specific version of LAM. I am trying to plan how
> we will make the transition from our old version (lam-6.6b2) to a new
> version (e.g., lam-7.0.6). We release a new version of our application
> every week or two, but the old versions hang around and continue to be
> used.
>
> I recall from a previous upgrade a few years ago that we can continue to
> use multiple versions of LAM as long as we set LAMHOME and PATH
> appropriately. The problem is that this is not very selectable
> dynamically. It used to require that we set those environment variables
> in each user's startup scripts (.cshrc or related scripts), so that the
> lamd's started by lamboot would be the correct ones.
>
> Is the situation still as restrictive? For instance, if lamboot were to
> start up lamd's on remote machines using an absolute path and
> communicating LAMHOME directly, it would only require the environment to
> be correct in the shell where the application startup script (the one
> that calls lamboot) is executed.
>
> I'm looking for an easier way to have our application select the LAM
> version when the user starts it. Right now it seems that the startup
> script will have to write a resource file (e.g., .lamrc) in the user's
> home directory that his .cshrc file will source to set the environment
> variables.
>
> Any other ideas?

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/