LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2007-02-24 14:20:04


On Feb 23, 2007, at 12:09 PM, Javier Fernández wrote:

> Michael Creel wrote:

>> 16000 data points and 3 compute nodes: 13.253614
>> 16000 data points and 4 compute nodes: 10.133724
>> 20000 data points and 1 compute nodes: 60.665225
>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>> Rank (0, MPI_COMM_WORLD): - MPI_Intercomm_merge()
>> Rank (0, MPI_COMM_WORLD): - MPI_Comm_spawn()
>> Rank (0, MPI_COMM_WORLD): - main()
>> MPI_Intercomm_merge: internal MPI error: out of descriptors (rank 0, comm 82)
>> MPI_Intercomm_merge: internal MPI error: out of descriptors (rank 0, MPI_COMM_PARENT)
>
> I fear my ignorance will show up here... you really obtained _that_ call
> stack? MPI_Intercomm_merge called from within MPI_Comm_spawn?

I'm pretty sure this is just an artifact of how we do some
communicator setup and not a big deal.

>> This is reproducible - it always happens when the problem gets large
>> enough. Any ideas what the problem might be? Thanks, Michael
>>
> comm 82 is a rather high number. Have you already tried to free those
> merged communicators after use? I'm not sure MPI_Finalize can
> automagically free them... in fact I'm not sure you MPI_Finalize
> between epochs :-)

'out of descriptors' in this case means that LAM cannot find a
communicator identifier that isn't already in use. The number of
communicators that can be in use at one time can be pretty low in LAM
(especially if you are using the lamd communicator mechanism). If I
had to guess, either you are creating a huge number of communicators
that are all in use at once, or you are not freeing communicators
you've created once you no longer need them.
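
To make the failure mode concrete, here is a toy model of a fixed-size
identifier pool. It is only a sketch under assumptions: POOL_SIZE, id_pool,
and the pool_* and run_rounds functions are hypothetical names made up for
illustration, and this is not LAM's actual allocator or its real limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool size, chosen only for illustration; LAM's real
   limit and allocation scheme are not described in this thread. */
#define POOL_SIZE 8

typedef struct {
    bool in_use[POOL_SIZE];
} id_pool;

/* Mark every identifier as free. */
void pool_init(id_pool *p)
{
    for (int i = 0; i < POOL_SIZE; i++)
        p->in_use[i] = false;
}

/* Return the lowest free identifier, or -1 when none is left
   (the toy analogue of "out of descriptors"). */
int pool_acquire(id_pool *p)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Release an identifier so a later acquire can reuse it. */
void pool_release(id_pool *p, int id)
{
    assert(id >= 0 && id < POOL_SIZE);
    p->in_use[id] = false;
}

/* Try to acquire one identifier per round and return how many rounds
   succeeded. When release_each_round is false, identifiers leak. */
int run_rounds(int rounds, bool release_each_round)
{
    id_pool p;
    int ok = 0;

    pool_init(&p);
    for (int r = 0; r < rounds; r++) {
        int id = pool_acquire(&p);
        if (id < 0)
            break;              /* pool exhausted */
        ok++;
        if (release_each_round)
            pool_release(&p, id);
    }
    return ok;
}
```

In this model, run_rounds(100, false) stops after POOL_SIZE rounds because
nothing is ever released, while run_rounds(100, true) completes all 100
rounds by reusing the same identifier.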

I'd look at the code to make sure you're freeing communicators when
you are done with them -- that should help with the identifier
allocation issues.
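
As a rough sketch of what that could look like on the spawning side, assuming
the program spawns workers and merges the resulting intercommunicator once per
problem size (the original code was not posted, so WORKER_CMD, run_epoch, and
the loop bounds below are hypothetical):

```c
#include <mpi.h>

/* Hypothetical worker executable name, for illustration only. */
#define WORKER_CMD "./worker"

/* Spawn nworkers processes, merge them into one intracommunicator,
   do the work, then release both communicators before returning. */
static void run_epoch(int nworkers)
{
    MPI_Comm inter;   /* intercommunicator returned by the spawn */
    MPI_Comm merged;  /* intracommunicator produced by the merge */

    MPI_Comm_spawn(WORKER_CMD, MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(inter, 0, &merged);

    /* ... distribute data and collect results over `merged` ... */

    /* Release both communicators so their identifiers can be reused. */
    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* One spawn/merge/free cycle per worker count. */
    for (int n = 1; n <= 4; n++)
        run_epoch(n);

    MPI_Finalize();
    return 0;
}
```

MPI_Intercomm_merge, MPI_Comm_free, and MPI_Comm_disconnect are collective, so
the workers need matching calls on their side (on the communicator obtained
from MPI_Comm_get_parent and on their own merged communicator) before they
finalize.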

Hope that helps,

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/