
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-06-14 13:25:19


Just to clarify -- you mean more than one *process*, not more than one
*thread*, right? LAM is not thread-safe (i.e., you can't have multiple
threads in an MPI call at the same time). See the FAQ for more
information on this.
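
For example, if your code uses threads at all, the only safe pattern
with a non-thread-safe MPI is to keep every MPI call in a single
thread. A minimal sketch of that pattern (hypothetical code, not from
your application):

        /* All MPI traffic stays in the main thread; worker threads
           compute only.  Build with: mpicc -pthread sketch.c */
        #include <mpi.h>
        #include <pthread.h>
        #include <stdio.h>

        static void *worker(void *arg) {
            /* computation only -- no MPI calls in here */
            return NULL;
        }

        int main(int argc, char **argv) {
            int rank;
            pthread_t t;
            MPI_Init(&argc, &argv);            /* main thread only */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            pthread_create(&t, NULL, worker, NULL);
            pthread_join(&t, NULL);            /* join before more MPI */
            printf("rank %d done\n", rank);
            MPI_Finalize();                    /* main thread only */
            return 0;
        }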

So something is getting borked when your processes start up -- my
guess is that it has to do with shared memory. usysv should be the
default RPI, which uses shared memory for on-node message passing (see
the LAM User's Guide for more details). It's possible that LAM is not
able to allocate enough shared memory for multiple processes on a
single node, or perhaps that the second process is not able to attach
to the shared memory segment, etc.
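
If you want to sanity-check System V shared memory on the node
entirely outside of LAM, a quick standalone probe (a sketch I'm
improvising here; adjust the size to taste) would be something like:

        /* Try to allocate and attach one SysV shared memory segment. */
        #include <stdio.h>
        #include <sys/types.h>
        #include <sys/ipc.h>
        #include <sys/shm.h>

        int main(void) {
            size_t size = 4 * 1024 * 1024;   /* 4 MB -- a guess */
            int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
            if (id < 0) { perror("shmget"); return 1; }
            void *p = shmat(id, NULL, 0);
            if (p == (void *) -1) { perror("shmat"); return 1; }
            shmdt(p);
            shmctl(id, IPC_RMID, NULL);      /* remove the segment */
            printf("ok: got and attached %lu bytes\n",
                   (unsigned long) size);
            return 0;
        }

Running two copies of that at the same time on one node roughly mimics
two local LAM processes each attaching their own segment.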

Try running with a different RPI to see what happens. For example:

        mpirun C -ssi rpi tcp hello

And see if that succeeds (the TCP RPI explicitly does not use shared
memory). Then try explicitly with the usysv RPI:

        mpirun C -ssi rpi usysv hello

And see if that fails. If it does, run with the debug flags enabled
and see if more information is provided about the cause of the failure:

        mpirun C -ssi rpi usysv -ssi rpi_verbose 1000 hello
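
(For reference, the kind of test program in play here is presumably
just a minimal MPI hello world along these lines -- a sketch, since
the actual hello.c wasn't posted:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            printf("Hello, world! I am %d of %d\n", rank, size);
            MPI_Finalize();
            return 0;
        }

Even a program that small will exercise RPI initialization in
MPI_INIT, which is where the failure below occurs.)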

On Jun 14, 2005, at 12:45 AM, John McCorquodale wrote:

> I've got a little bproc4.0.0pre8 cluster and I'm trying to get LAM
> going on it (over gigE). I get blowups when I try to use the second
> CPU on my 2-way nodes. No frills Debian gcc/g77 3.3.6. Here's
> lam-bhost.def:
>
> ---lam-bhost.def---
> strongbad.strongbadia
> 0.strongbadia cpu=2
> 1.strongbadia cpu=2
> ---end lam-bhost.def---
>
> $ lamboot
> $ lamnodes
> n0 strongbad.strongbadia:1:no_schedule,origin,this_node
> n1 0.strongbadia:2:
> n2 1.strongbadia:2:
>
> And when I run one thread per node it works fine:
>
> $ mpirun N hello
> Hello, world! I am 0 of 2
> Hello, world! I am 1 of 2
>
> But when I try to run on all the CPUs (or, in fact, when I try _any_
> mpirun syntax that would start more than one thread on any one or
> more physical nodes), things go awry. I get the same behavior in LAM
> 7.1.1 and today's (13 June) snapshot. This is such a fundamental
> problem (and no useful hits in Google) that I must just be missing
> something important that everybody else in the world thinks is
> obvious. Anybody care to clue me in on what I'm doing wrong? Only
> using half my processors makes me a sad panda.
>
> Here's the blowup:
>
> $ mpirun C hello
>
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT. This is a
> fatal error; I must abort.
>
> This occurred on host 1 (n2).
> The PID of failed process was 31617 (MPI_COMM_WORLD rank: 2)
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 31615 failed on node n1 (192.168.1.100) with exit status 1.
> -----------------------------------------------------------------------------
>
> Does this ring any bells for anybody? Does this "just work fine" for
> anybody?
>
> Thanks!
>
> -mcq
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/