On Tuesday, July 15, 2003, at 08:58 AM, Damien Declat wrote:
> we are using the MPI_COMM_SPAWN primitive, on different machines in
> order to launch executables. We've tried to spawn a huge number of
> executables but found a limit for this number.
> In fact on a COMPAQ, we crashed after more than 2000 calls to
> MPI_COMM_SPAWN whereas this number was equal to 180 on an SGI o2000.
> So we just wonder how this number is set. Is it a machine-dependent
> number that can't be controlled, or is it possible to set a LAM constant
> to increase this number?

There are no explicit restrictions in LAM on the number of times that
you can call MPI_COMM_SPAWN. However, there are some implicit
restrictions that may be causing problems for you, most of which are
not adjustable. I can't tell exactly which one you are hitting without
a better idea of how your application is using MPI_COMM_SPAWN and what
error messages LAM is producing.

The first is the number of available communicators. In LAM 6.5.x and
earlier, there was a rather small limit on the number of communicators
available, and each call to MPI_COMM_SPAWN results in the creation of a
communicator on the caller's side. If you are using LAM 7.0, this
limit is probably not causing problems for you: it was increased for
most RPIs (the lamd RPI being the exception) to close to INT_MAX.

There is also a limit on the number of file descriptors that a process
can have open at any given time. Each call to MPI_COMM_SPAWN results
in the creation of a new file descriptor for each process spawned, so I
wouldn't be surprised if this were causing your problems. There is
nothing LAM can do about this limit other than die on the spawn.
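
To check whether the file descriptor limit is a plausible culprit, a
small C sketch along the following lines may help. It is not part of
LAM; the function names fd_soft_limit and max_spawn_calls are made up
for illustration, and the estimate assumes exactly one descriptor per
spawned process (as described above) plus a caller-supplied guess at
how many descriptors are already in use.

```c
#include <sys/resource.h>

/* Hypothetical helper: returns the soft limit on open file descriptors
 * for the calling process, or -1 if it cannot be read or is unlimited. */
long fd_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long)rl.rlim_cur;
}

/* Hypothetical helper: rough upper bound on the number of spawn calls
 * before descriptors run out.
 *   fd_limit        - descriptor limit, e.g. from fd_soft_limit()
 *   reserved        - descriptors assumed to be in use already
 *                     (stdin, stdout, stderr, sockets, ...)
 *   procs_per_spawn - processes started by each spawn call, each
 *                     assumed to cost one descriptor in the caller
 * Returns 0 when no spawn would fit or the arguments make no sense. */
long max_spawn_calls(long fd_limit, long reserved, long procs_per_spawn)
{
    if (fd_limit <= reserved || procs_per_spawn <= 0)
        return 0;
    return (fd_limit - reserved) / procs_per_spawn;
}
```

For example, max_spawn_calls(fd_soft_limit(), 3, 1) gives a rough
ceiling for a program that spawns one process per call and never
releases the descriptors. If that figure is close to the number of
spawns at which your application dies, the descriptor limit is a
likely suspect.
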

Finally, there is an internal limit in the lamd on the number of LAM
processes that can be running on a single node. This limit is
somewhere between 32 and 128, depending on which version of LAM you are
using. While you can adjust this value at compile time, doing so is
not recommended, as it can occasionally cause the lamd to behave in
"odd" ways.

Hope this helps,
Brian