
LAM/MPI General User's Mailing List Archives


From: daniel.egloff_at_[hidden]
Date: 2004-11-04 12:03:44


Dear Jeff

Thanks for the reply. I have no idea what the cause of our erratic
behaviour was.
After a recompile / reinstall of LAM 7.1.1 (with slightly different
config options, in particular building shared libs)
and a full reboot of all cluster nodes, it works again.

I wonder when I will get the "The selected RPI failed to
initialize..." error again.

It might be caused by a buggy MPI application which crashes often and
which we are currently debugging. Is it possible that this causes
shared memory / semaphores to be exhausted?
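For reference, System V shared memory segments and semaphore sets are not
released automatically when a process crashes, so repeated crashes of an MPI
job can indeed exhaust them. A sketch of how to check for (and clean up)
leftovers on a Linux node, assuming the standard `ipcs` / `ipcrm` tools are
installed:

```shell
# List System V IPC objects currently allocated on this node.  Shared
# memory segments and semaphore sets are NOT freed when an MPI process
# crashes, so leftovers from dead jobs show up here.
ipcs -m    # shared memory segments
ipcs -s    # semaphore sets

# To remove a leftover object, take its id from the ipcs output, e.g.:
#   ipcrm -m <shmid>    # remove a shared memory segment
#   ipcrm -s <semid>    # remove a semaphore set
# (only remove objects owned by your own crashed jobs)
```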

PS: Thanks for the quick reply, particularly in such busy times; I
really appreciate the support of the LAM team :-)

Regards, Daniel.

                                                                                                                                        
Jeff Squyres <jsquyres_at_lam-mpi.org>
Sent by: lam-bounces_at_lam-mpi.org
Date: 04.11.04 14:54
To: General LAM/MPI mailing list <lam_at_[hidden]>
Subject: Re: LAM: Runtime Error: The selected RPI failed to initialize during MPI_INIT. This is a fatal error; I must abort.
Please reply to: General LAM/MPI mailing list

Sometimes it takes us a few days to reply to mails; we get busy too
(particularly since next week is the SC conference!). :-)

What I'm guessing here is that you're running out of shared memory /
semaphores -- I don't think we give good help messages when this
happens. Can you look in the LAM/MPI User's Guide for the shared memory
RPI sections and look at the description of how many semaphores and how
much shared memory you'll need for 4 processes on one node? You might
want to check that against what your system resources actually are;
there are also some tips in there about reducing shared memory usage.
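For reference, on Linux the kernel-wide System V IPC limits can be read
from `/proc`; a sketch (the file names are Linux-specific, other Unixes
expose the equivalent values through `sysctl` variables):

```shell
# Kernel limits that bound how much SysV shared memory / how many
# semaphores all processes on the node can use in total:
cat /proc/sys/kernel/shmmax   # max bytes in a single shared memory segment
cat /proc/sys/kernel/shmall   # system-wide shared memory limit, in pages
cat /proc/sys/kernel/sem      # SEMMSL SEMMNS SEMOPM SEMMNI
# The four sem fields are: max semaphores per set, max semaphores
# system-wide, max operations per semop() call, max semaphore sets.
```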

Additionally, you might want to try running 3 or 2 processes and see if
the problem goes away -- this could potentially indicate the above
issue (running out of resources). Finally, in a worst case scenario,
you might explicitly try the tcp RPI -- communication would potentially
be slower than the shmem RPIs (I'm assuming you have a 4-way SMP?), but
I'm guessing it would work.
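For reference, in LAM 7.x the RPI can be selected on the mpirun command
line through an SSI parameter; a command sketch (it requires a working LAM
installation, and `ring` is the example program from this thread):

```shell
# Force the tcp RPI instead of the shared-memory ones:
mpirun -ssi rpi tcp n0,0,0,0 ring

# For comparison, explicitly select a shared-memory RPI:
# mpirun -ssi rpi sysv n0,0,0,0 ring
```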

On Nov 3, 2004, at 9:17 AM, daniel.egloff_at_[hidden] wrote:

> Dear lam/mpi list
>
> I use LAM/MPI 7.0.6-4 (URL for source see below) which I recompiled
> from source, because the Debian package strips symbol information and
> therefore does not work with the TotalView debugger. (Would be a good
> idea to mention that to the Debian package builders too.)
>
> I observe the following odd behaviour (which I did not have with the
> 7.0.6-2 Debian binary package from the Debian package archive). FYI:
> ring is the ring example application from the lam examples:
>
> ******************************************
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> e3050_at_platosrv:~/workspace/lam-examples/examples/main/ring$ mpirun
> n0,0,0,0 ring
>
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT. This is a
> fatal error; I must abort.
>
> This occurred on host platosrv (n0).
> The PID of failed process was 1076 (MPI_COMM_WORLD rank: 0)
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT. This is a
> fatal error; I must abort.
>
> This occurred on host platosrv (n0).
> The PID of failed process was 1077 (MPI_COMM_WORLD rank: 1)
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 1078 failed on node n0 (147.50.18.157) with exit status 1.
> -----------------------------------------------------------------------------
>
> ********************************************
>
> If I use only 3 processes on the same node, i.e. mpirun n0,0,0 ring,
> things work.
> I had applications which even only ran with 2 processes on the
> "root node".
>
> I also have to do a lamhalt / lamboot hostfile sequence before mpirun
> works again without errors, once I got errors like the above.
>
> I somewhere stumbled over a note of Jeff's that such errors are going
> to be fixed in lam 7.1.x.
> Do I need to switch to version 7.1.x?
>
> A quick answer would be very much appreciated.
>
> Binary source which I recompiled:
> http://ftp.debian.org/debian/pool/main/l/lam/lam_7.0.6-4.dsc
> http://ftp.debian.org/debian/pool/main/l/lam/lam_7.0.6.orig.tar.gz
> http://ftp.debian.org/debian/pool/main/l/lam/lam_7.0.6-4.diff.gz
>

___________________________________________________________________

Disclaimer:

This message is intended only for the named recipient and may contain
confidential or privileged information.

If you have received it in error, please advise the sender by return
e-mail and delete this message and any attachments. Any unauthorised
use or dissemination of this information is strictly prohibited.