Sometimes it takes us a few days to reply to mail; we get busy too
(particularly since next week is the SC conference!). :-)

My guess is that you're running out of shared memory and/or
semaphores -- I don't think we give good help messages when this
happens. Can you look in the LAM/MPI User Guide at the shared memory
RPI sections and check the description of how many semaphores and how
much shared memory you'll need for 4 processes on one node? You might
want to compare that against what your system resources actually are;
there are also some tips in there about reducing shared memory usage.

Additionally, you might want to try running 3 or 2 processes and see if
the problem goes away -- that would point towards the issue above
(running out of resources).

Finally, as a last resort, you might explicitly try the tcp RPI --
communication would probably be slower than with the shmem RPIs (I'm
assuming you have a 4-way SMP?), but I'm guessing it would work.
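
For the "check what your system resources actually are" step, here is a
minimal sketch, assuming a Linux system that exposes the System V IPC
limits under /proc/sys/kernel (an assumption; the report only says
Debian). The function names parse_sem_limits and read_limits are made up
for illustration and are not part of LAM/MPI. The script only reports
the kernel limits; the numbers LAM actually needs for 4 processes still
have to come from the User Guide.

```python
from pathlib import Path

# Field order of /proc/sys/kernel/sem on Linux:
#   semmsl: max semaphores per semaphore set
#   semmns: max semaphores system-wide
#   semopm: max operations per semop() call
#   semmni: max number of semaphore sets
SEM_FIELDS = ("semmsl", "semmns", "semopm", "semmni")


def parse_sem_limits(text):
    """Parse the four whitespace-separated integers from /proc/sys/kernel/sem."""
    values = [int(token) for token in text.split()]
    if len(values) != len(SEM_FIELDS):
        raise ValueError(
            f"expected {len(SEM_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(SEM_FIELDS, values))


def read_limits(proc_root="/proc/sys/kernel"):
    """Collect semaphore and shared memory limits (assumes Linux /proc)."""
    root = Path(proc_root)
    limits = parse_sem_limits((root / "sem").read_text())
    # shmmax: max bytes per segment; shmall: max total pages;
    # shmmni: max number of segments.
    for name in ("shmmax", "shmall", "shmmni"):
        limits[name] = int((root / name).read_text())
    return limits
```

Calling read_limits() on the node in question returns a dictionary of
the current limits, which can then be compared by hand against the
requirements listed in the User Guide.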
On Nov 3, 2004, at 9:17 AM, daniel.egloff_at_[hidden] wrote:
> Dear lam/mpi list
>
> I use LAM/MPI 7.0.6-4 (url for source see below) which I recompiled
> from source, because the Debian package strips symbol information and
> therefore does not work with the TotalView debugger. (Would be a good
> idea to mention that to the Debian package builders too)
>
> I observe the following odd behaviour (which I did not have with the
> 7.0.6-2 Debian binary package from the Debian package archive). FYI:
> ring is the ring example application from the lam examples:
>
> ******************************************
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> e3050_at_platosrv:~/workspace/lam-examples/examples/main/ring$ mpirun
> n0,0,0,0 ring
> -----------------------------------------------------------------------
> ------
> The selected RPI failed to initialize during MPI_INIT. This is a
> fatal error; I must abort.
>
> This occurred on host
> platosrv---------------------------------------------------------------
> --------------
> (n0The selected RPI failed to initialize during MPI_INIT. This is a
> ).
> fatal error; I must abort.
> The PID of failed process was 1076
> (MPI_COMM_WORLD rank: 0)
> -----------------------------------------------------------------------
> ------
> This occurred on host platosrv (n0).
> The PID of failed process was 1077 (MPI_COMM_WORLD rank: 1)
> -----------------------------------------------------------------------
> ------
> -----------------------------------------------------------------------
> ------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 1078 failed on node n0 (147.50.18.157) with exit status 1.
> -----------------------------------------------------------------------
> ------
>
> ********************************************
>
> If I use only 3 processes on the same node, i.e., mpirun n0,0,0 ring,
> things work.
> I have had applications which would only run with 2 processes on the
> "root node".
>
> Once I have hit errors like the ones above, I also have to do a
> lamhalt / lamboot hostfile sequence before mpirun works again.
>
> I stumbled somewhere over a note from Jeff saying that such errors are
> going to be fixed in lam 7.1.x.
> Do I need to switch to version 7.1.x?
>
> A quick answer will be very much appreciated.
>
> Debian source package which I recompiled:
> http://ftp.debian.org/debian/pool/main/l/lam/lam_7.0.6-4.dsc
> http://ftp.debian.org/debian/pool/main/l/lam/lam_7.0.6.orig.tar.gz
> http://ftp.debian.org/debian/pool/main/l/lam/lam_7.0.6-4.diff.gz
>
>
>
> Best regards,
>
> Daniel Egloff
> Zürcher Kantonalbank, VFK
> Lagerstrasse 47, 8004 Zürich
> Tel. +41 (0) 1 292 45 33, Fax +41 (0) 1 292 45 93
> Briefadresse: Postfach, 8010 Zürich, http://www.zkb.ch
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/