On Feb 1, 2006, at 7:13 PM, Brian Wainscott wrote:
> We are running lam-6.5.9 and our application is getting this error
> message (seems to also happen on lam-7.x):
My first thought is "please upgrade if possible!" :-), but if this is
also happening with 7.1.1, then this might actually be a problem in LAM.
> MPI_Comm_dup: internal MPI error: out of descriptors (rank 0, comm
> 4087)
>
> From looking at the source code I can see there is a limit of about
> 4096 or so communicators. The thing is, we have checked carefully and we
> only have about 20 or so communicators at any given time -- they
> regularly get created and freed.
>
> So my question is this: is it possible that, even though we call
> MPI_COMM_FREE, the communicator is not freed? I suspect an unwaited for
> ISEND or IRECV somewhere, that is causing a communicator to be kept
> internally after we free it. We are checking on this now, but I wonder
> if there is something else that might be going on?
A communicator should not be holding a file descriptor open. The way
the data structures are set up, communicators are not the entities
that "own" network resources; they only hold links to the underlying
reference-counted data structures that "own" network resources such
as file descriptors.
Can you describe the situation a little more?
- Does this always happen at the same point in your code? I.e., is
it reproducible in a regular fashion?
- If you're on an operating system with /proc, can you look during a
run and see what all the fd's are being used for?
- Can you attach a debugger and see exactly where in MPI_COMM_DUP this
error is occurring? There are several places in share/mpi/cdup.c
where MPI_ERR_INTERN could be returned; knowing which one it is might
be helpful in tracking down the cause.
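For the /proc check, here is a minimal sketch of one way to look at the fd's of a running process. It assumes Linux, where /proc/<pid>/fd contains one symlink per open descriptor. The function names list_open_fds and summarize are made up for illustration; they are not part of LAM or of any MPI tool.

```python
import os
from collections import Counter


def list_open_fds(pid="self"):
    """Return {fd_number: link_target} read from /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    fds = {}
    for name in os.listdir(fd_dir):
        try:
            fds[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return fds


def summarize(fds):
    """Count descriptors by kind: 'socket', 'pipe', 'anon_inode', or 'file'."""
    kinds = Counter()
    for target in fds.values():
        if target.startswith(("socket:", "pipe:", "anon_inode:")):
            kinds[target.split(":", 1)[0]] += 1
        else:
            # Regular files, devices, and anything else with a path-like target.
            kinds["file"] += 1
    return kinds
```

Calling summarize(list_open_fds(pid)) with the pid of one of your MPI processes at a few points during the run should show whether the number of sockets or files grows over time.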
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/