LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2003-06-11 11:16:24


On Wed, 11 Jun 2003, Joao Pestana Ferreira wrote:

> I have a MPI program that crashes if submitted under PBS with the
> following error:
>
> "Too many open files in system"
>
> Any help?

This means you are hitting a file descriptor limit somewhere along the
line. It's pretty hard to tell which fd limit you are hitting - systems
can have multiple levels of fd limits (per process, per user, system-wide,
etc.). LAM opens a fully-connected set of TCP connections, so if you are
running a 256 process job, each process holds 255 sockets to its peers
plus 4-6 fds for things like stdin, talking to the lamd, etc., and the
per-process limit needs to be above that total (roughly 260). The
per-user and system-wide limits need to be a bit over that times the
number of processes per node. So you can see how these limits grow really
fast :).
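
To put rough numbers on the arithmetic above, here is a minimal sketch. The
function name and the default of 6 overhead fds (the upper end of the 4-6
range mentioned above) are assumptions for illustration, not anything LAM
provides.

```python
def estimate_fd_needs(total_procs, procs_per_node, overhead_fds=6):
    """Rough fd estimate for a fully-connected TCP job.

    overhead_fds is an assumed allowance for stdin, the lamd connection, etc.
    """
    # One socket to every other process, plus the fixed overhead.
    per_process = (total_procs - 1) + overhead_fds
    # Every process on a node counts against per-user / system-wide limits.
    per_node = per_process * procs_per_node
    return per_process, per_node


per_process, per_node = estimate_fd_needs(total_procs=256, procs_per_node=4)
print("per-process fds needed:", per_process)
print("per-node fds needed:", per_node)
```

With 256 processes at 4 per node, this works out to 261 fds per process and
1044 per node, which is already past a per-process limit of 256 and a per-user
limit of 1024.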

On most systems, it is possible to increase the fd limits, at the cost of
slightly higher memory usage. If you are running one process per node,
that is about your only option. If you are running multiple processes per
node, you may be able to use the sysv or usysv RPIs to reduce your fd
count - rather than using TCP locally, they use shared memory.
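
Before raising anything, it can help to see what the per-process limit
actually is. The sketch below uses Python's standard `resource` module on a
Unix-like system; the helper names are made up for illustration, and it only
looks at the per-process limit, not the per-user or system-wide ones.

```python
import resource


def soft_fd_limit():
    """Return the current soft per-process limit on open files."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def limit_is_sufficient(needed, soft=None):
    """True if the soft limit (queried if not given) allows `needed` fds."""
    if soft is None:
        soft = soft_fd_limit()
    # RLIM_INFINITY means no per-process limit is enforced.
    return soft == resource.RLIM_INFINITY or soft >= needed


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
print("enough for 261 fds:", limit_is_sufficient(261))
```

Keep in mind that a job started under a batch system may see different limits
than your login shell, so the check is most useful when run from inside the
job itself.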

One other thing I didn't mention. I kind of assumed you were trying to
run really big jobs. If you are trying to run something like 64 or fewer
processes over 16 or more machines, you really shouldn't be hitting fd
limits on a modern OS. In this case, I might use a tool like lsof to make
sure you aren't leaking file descriptors somewhere. With the exception of
the MPI-2 dynamic processes functions, LAM shouldn't be adding more file
descriptors after MPI_INIT, so an increase in open fds after MPI_INIT
probably means your app is leaking fds.
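
If lsof isn't handy, a process can also count its own open fds and compare
the count at two points in time. This is only a sketch of the idea and
assumes Linux, since it reads /proc/self/fd; `count_open_fds` is a
hypothetical helper, and the three files opened here just simulate a leak.
In a real application you would take the first count right after MPI_INIT
and the second one later in the run.

```python
import os


def count_open_fds():
    """Count this process's open fds (Linux-only: reads /proc/self/fd)."""
    # Listing the directory briefly opens one extra fd, which shows up in
    # every count, so the difference between two counts is unaffected.
    return len(os.listdir("/proc/self/fd"))


before = count_open_fds()

# Simulate a leak: open three files and "forget" to close them.
leaked = [open(os.devnull) for _ in range(3)]

after = count_open_fds()
print("fds gained since first count:", after - before)

# Clean up so the example itself does not leak.
for f in leaked:
    f.close()
```

A count that keeps climbing between checkpoints is a good hint that something
is being opened and never closed.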

Hope this helps,

Brian

-- 
  Brian Barrett
  LAM/MPI developer and all around nice guy
  Have a LAM/MPI day: http://www.lam-mpi.org/