Brian Barrett wrote:
> On Feb 23, 2007, at 12:09 PM, Javier Fernández wrote:
>
>> Michael Creel wrote:
>
>>> 16000 data points and 3 compute nodes: 13.253614
>>> 16000 data points and 4 compute nodes: 10.133724
>>> 20000 data points and 1 compute nodes: 60.665225
>>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>> Rank (0, MPI_COMM_WORLD): - MPI_Intercomm_merge()
>>> Rank (0, MPI_COMM_WORLD): - MPI_Comm_spawn()
>>> Rank (0, MPI_COMM_WORLD): - main()
>>> MPI_Intercomm_merge: internal MPI error: out of descriptors (rank 0,
>>> comm 82)
>>> MPI_Intercomm_merge: internal MPI error: out of descriptors (rank 0,
>>> MPI_COMM_PARENT)
>> I fear my ignorance will show up here... did you really get _that_
>> call stack? MPI_Intercomm_merge called from within MPI_Comm_spawn?
>
> I'm pretty sure this is just an artifact of how we do some
> communicator setup and not a big deal.
>
>>> This is reproducible - it always happens when the problem gets large
>>> enough. Any ideas what the problem might be? Thanks, Michael
>>>
>> comm 82 is a rather high number. Have you already tried freeing those
>> merged communicators after use? I'm not sure MPI_Finalize can
>> automagically free them... in fact I'm not sure you even call
>> MPI_Finalize between epochs :-)
>
> 'out of descriptors' in this case means that LAM cannot find a
> communicator identifier that isn't already in use. The number of
> communicators that can be in use at one time can be pretty low in LAM
> (especially if you are using the lamd communication mechanism). If I
> had to guess, the system ran out of identifiers because you are either
> creating a huge number of communicators that are all in use at once,
> or not freeing communicators you've created once you're done with them.
>
> I'd look at the code to make sure you're freeing communicators when
> you are done with them -- that should help with the identifier
> allocation issues.
>
>
> Hope that helps,
>
> Brian
>
Thanks Brian,
I'm corresponding off-list with Javier about this. At this point I think it's my
code, not LAM/MPI, that's the problem. I'll go back on-list if I start to think
otherwise.
Michael
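
For anyone hitting the same "out of descriptors" error: the pattern Brian
describes boils down to matching every spawn/merge with the corresponding
frees. Below is a minimal sketch in C; the "worker" executable name, the
process count, and the loop bound are placeholders for illustration, not
taken from Michael's actual code.

/* Sketch (placeholder names): each MPI_Comm_spawn/MPI_Intercomm_merge
 * pair is matched by MPI_Comm_free calls, so the communicator
 * identifiers can be recycled instead of running out. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int epoch;

    MPI_Init(&argc, &argv);

    for (epoch = 0; epoch < 100; epoch++) {
        MPI_Comm intercomm, merged;

        /* Spawn 4 workers; "worker" is a placeholder executable name. */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        /* Merge parent and children into one intracommunicator. */
        MPI_Intercomm_merge(intercomm, 0, &merged);

        /* ... do this epoch's work using 'merged' ... */

        /* Free both communicators so their identifiers can be reused;
         * without these calls every iteration consumes fresh descriptors
         * and LAM eventually reports "out of descriptors". */
        MPI_Comm_free(&merged);
        MPI_Comm_free(&intercomm);
    }

    MPI_Finalize();
    return 0;
}

Note that the intercommunicator returned by MPI_Comm_spawn and the
intracommunicator returned by MPI_Intercomm_merge each consume an
identifier, so both need to be freed.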