On Oct 5, 2004, at 9:22 AM, Ricardo Nishikido Pereira wrote:
> I'm trying to run MPI applications in a heterogeneous cluster, with
> PC Linux and Macintosh nodes.
>
> I can lamboot correctly, but when I attempt to run an application I
> get an error saying that linux nodes are using tcp while mac nodes are
> using usysv:
>
> MPI_COMM_WORLD rank 0: tcp (v7.0.0)
> MPI_COMM_WORLD rank 9: usysv (v7.1.0)
This is unfortunately a known problem -- LAM does not do well
coordinating when there are different modules available on different
nodes (or, more specifically, when one module would be better than
another on a given node).
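
To spot this kind of mismatch quickly, here is a minimal sketch that reads
lines in the format shown above and reports whether every rank picked the
same module and version. The function name check_rpi_agreement is made up
for illustration; it is not part of LAM.

```shell
# Hypothetical helper: read "MPI_COMM_WORLD rank N: module (vX.Y.Z)" lines
# on stdin and report whether all ranks agree on module and version.
check_rpi_agreement() {
    awk '
        { sub(/^> */, "") }                  # tolerate quoted ("> ") lines
        /^MPI_COMM_WORLD rank [0-9]+:/ {
            key = $4 " " $5                  # e.g. "tcp (v7.0.0)"
            seen[key]++
        }
        END {
            n = 0
            for (k in seen) { n++; print seen[k], "rank(s):", k }
            exit (n > 1)                     # non-zero status on any mismatch
        }'
}
```

Paste the lines from the mpirun error message into its standard input; a
non-zero exit status means at least two ranks disagree.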
> Then, I try to invoke mpirun telling it to use tcp and it says:
You correctly deduced the answer: adding -ssi rpi tcp to the mpirun
command line will force all ranks to use tcp. You could also use -ssi rpi
usysv, since usysv also uses TCP for off-node communication.
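
As a rough sketch, the command line would look something like the output
of the snippet below. The helper lam_run_cmd and the program name
./my_mpi_app are placeholders, not anything from your setup; "C" is LAM's
notation for one process per CPU in the booted universe.

```shell
# Hypothetical helper: build an mpirun command line that pins every rank
# to a single RPI module. It only prints the command; it does not run it.
lam_run_cmd() {
    rpi="$1"; shift
    printf 'mpirun -ssi rpi %s C %s\n' "$rpi" "$*"
}

# Placeholder application name; substitute your own MPI binary.
lam_run_cmd tcp ./my_mpi_app
lam_run_cmd usysv ./my_mpi_app
```

Printing the command first makes it easy to confirm the same module name
is requested for every run before launching anything.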
> MPI_COMM_WORLD rank 0: tcp (v7.0.0)
> MPI_COMM_WORLD rank 9: tcp (v7.1.0)
>
> I've installed lam-7.1.1 in all nodes, so I don't know why there are
> different versions of tcp. When I run programs only in the mac nodes
> or only in the linux nodes everything is fine.
Double check your paths when running non-interactive jobs on these
nodes. Somehow it's finding an older TCP module -- perhaps a prior LAM
installation? For example, compare the output of:
ssh otherhost
laminfo
(i.e., an interactive login) vs. the following:
ssh otherhost laminfo
Check the path shown in the output of laminfo as well as the version
numbers of the modules.
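
A minimal sketch of that comparison, run from the node where you invoke
mpirun. The function name check_host is made up, and it assumes bash is
installed on the remote node; "bash -lc" only approximates an interactive
login by reading the login profile files, so treat its output as a hint
rather than an exact match for what you see after "ssh otherhost".

```shell
# Hypothetical helper: show which laminfo a non-interactive ssh command
# finds versus the one a login shell finds on the same host.
check_host() {
    host="$1"
    echo "== non-interactive: $host =="
    ssh "$host" 'command -v laminfo; laminfo'
    echo "== login shell: $host =="
    # Assumes bash exists remotely; -l makes it read the login profile.
    ssh "$host" 'bash -lc "command -v laminfo; laminfo"'
}
```

Running "check_host otherhost" for each node prints both results one
after the other; if the two laminfo paths differ, the non-interactive
PATH is the likely culprit.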
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/