I'm guessing that you have run out of SYSV semaphores. These are
OS-managed resources that can unfortunately persist even after a
process dies. For example, if an MPI process that is using the sysv
(or usysv) RPI dies badly, it can orphan SYSV resources in the OS. If
this happens a few times, the OS may run out of SYSV resources, and
you won't be able to run any new sysv/usysv processes.
The lamclean command (and lamhalt) should release all of them, and
you should be able to run again. If that doesn't work, run the
"ipcs" command to see whether an excessive number of resources is
still being claimed; the "ipcrm" command should be able to remove them.
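If you end up cleaning by hand, something along these lines may help. It is only a rough sketch, not part of LAM: it assumes the Linux layout of "ipcs -s" output (columns key, semid, owner, perms, nsems) and that your ipcrm accepts "-s <semid>". The function names are made up for illustration, and it only prints the commands so that you can check them before running anything.

```python
import getpass
import subprocess


def orphaned_semids(ipcs_output, owner):
    """Return the semaphore-array ids owned by `owner` in `ipcs -s` text.

    Assumes Linux-style rows: key semid owner perms nsems.
    """
    semids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and banner lines do not.
        if len(fields) >= 5 and fields[0].startswith("0x") and fields[2] == owner:
            semids.append(fields[1])
    return semids


def ipcrm_commands(semids):
    """Build one `ipcrm -s <semid>` argument list per semaphore array."""
    return [["ipcrm", "-s", semid] for semid in semids]


def print_cleanup_commands():
    """Dry run: list the ipcrm commands for the current user's semaphores."""
    result = subprocess.run(
        ["ipcs", "-s"], capture_output=True, text=True, check=True
    )
    for cmd in ipcrm_commands(orphaned_semids(result.stdout, getpass.getuser())):
        print(" ".join(cmd))
```

Call print_cleanup_commands() only after lamhalt, when no MPI jobs of yours are still running, since it lists every semaphore array you own and not only the orphaned ones.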
On Mar 12, 2006, at 12:45 PM, Simon Prunet wrote:
> Hello all,
>
> I used (successfully) for some time the lam suite
> version 7.1.1 on a 4-way SMP 64bit linux box.
>
> All of a sudden, mpirun, even on very simple codes,
> stopped running on more than one processor, with
> the following error:
>
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT. This is a
> fatal error; I must abort.
>
> This occurred on host node-01 (n0).
> The PID of failed process was 27008 (MPI_COMM_WORLD rank: 0)
> -----------------------------------------------------------------------------
>
> Going through past posts, there were hints that it was a problem
> related with the rpi used by default, and indeed this problem
> disappears when I used the tcp rpi, and only appears with the sysv
> or usysv rpi's...
>
> Any idea ?
>
> Thanks for your help,
>
> Simon
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/