
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-10-05 08:31:46


On Oct 5, 2004, at 9:22 AM, Ricardo Nishikido Pereira wrote:

> I'm trying to run MPI applications in a heterogeneous cluster, with
> PC Linux and Macintosh nodes.
>
> I can lamboot correctly, but when I attempt to run an application I
> get an error saying that linux nodes are using tcp while mac nodes are
> using usysv:
>
> MPI_COMM_WORLD rank 0: tcp (v7.0.0)
> MPI_COMM_WORLD rank 9: usysv (v7.1.0)

This is unfortunately a known problem -- LAM does not coordinate well
when different modules are available on different nodes (or, more
specifically, when one module would be better than another on a given
node).

> Then, I try to invoke mpirun telling it to use tcp and it says:

You correctly deduced the answer: adding -ssi rpi tcp to the mpirun
command line will force all ranks to use tcp. You could also use
-ssi rpi usysv, since usysv also uses TCP for off-node communication.
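
As a rough sketch of the forced selection described above, a small
wrapper can pin the RPI on every launch. The function name
run_all_ranks, the RPI variable, and ./my_mpi_app are hypothetical;
only the -ssi rpi option and the tcp and usysv module names come from
this thread. C is LAM's shorthand for one process per CPU in the
booted universe.

```shell
#!/bin/sh
# Sketch only: run_all_ranks and RPI are illustrative names, not part of LAM.

run_all_ranks() {
    # Force every rank to use the same RPI module (tcp unless RPI is set).
    # "C" asks LAM's mpirun for one process per CPU in the booted universe.
    mpirun -ssi rpi "${RPI:-tcp}" C "$@"
}

# Example (hypothetical application name):
#   run_all_ranks ./my_mpi_app
#   RPI=usysv run_all_ranks ./my_mpi_app
```

Pinning the module this way sidesteps the per-node selection, so no
rank picks usysv while another picks tcp.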

> MPI_COMM_WORLD rank 0: tcp (v7.0.0)
> MPI_COMM_WORLD rank 9: tcp (v7.1.0)
>
> I've installed lam-7.1.1 on all nodes, so I don't know why there are
> different versions of tcp. When I run programs only on the mac nodes
> or only on the linux nodes everything is fine.

Double check your paths when running non-interactive jobs on these
nodes. Somehow it's finding an older TCP module -- perhaps a prior LAM
installation? For example, compare the output of:

ssh otherhost
laminfo

(i.e., an interactive login) vs. the following:

ssh otherhost laminfo

Check the path shown in the output of laminfo as well as the version
numbers of the modules.
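
The comparison can be scripted roughly as follows. This is a sketch
under assumptions: it assumes the remote login shell is bash, so that
"bash -lc" approximates what an interactive login would see, and the
function name compare_laminfo is hypothetical. A different remote
shell reads different startup files and would need a different
command for the login case.

```shell
#!/bin/sh
# Sketch only: compare_laminfo is an illustrative name, and the login case
# assumes bash is available on the remote node.

compare_laminfo() {
    host="$1"
    tmpdir=$(mktemp -d) || return 1

    # Approximates an interactive login: bash -l reads the login startup files.
    ssh "$host" 'bash -lc laminfo' > "$tmpdir/login.txt" 2>&1

    # Non-interactive session, i.e. "ssh otherhost laminfo".
    ssh "$host" laminfo > "$tmpdir/noninteractive.txt" 2>&1

    # diff prints any differing lines (paths, module versions) and
    # returns non-zero when the two outputs are not identical.
    if diff "$tmpdir/login.txt" "$tmpdir/noninteractive.txt"; then
        echo "same laminfo output in both environments on $host"
    else
        echo "laminfo output differs between login and non-interactive shells on $host"
    fi
    rm -rf "$tmpdir"
}

# Example: compare_laminfo otherhost
```

Any differing lines point at the installation that the non-interactive
path is picking up.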

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/