I'm running on a PC cluster of 16 Pentium 2.4 GHz machines, and the OS
is Linux. The application is an ocean model. I suspect the code is not
very well implemented, although the author told me that ocean models
are very sensitive to latency. Even so, I cannot get more than 20% CPU
utilization when running on only 4 processors, and scalability gets
worse as I add more processors.
I think that the "too many open files in system" problem that I'm
having is very similar to the one reported in
http://lam-mpi.lzu.edu.cn/MailArchives/lam/msg02505.php
As he said in that discussion, if I use the front node I won't have
this problem.
Maybe my app is leaking fds...
Can you be more specific about how to use the lsof tool?
Joao
This means you are hitting a file descriptor limit somewhere along the
line. It's pretty hard to tell which fd limit you are hitting: systems
can have multiple levels of fd limits (per process, per user,
system-wide, etc.). LAM opens a fully-connected set of TCP connections,
so if you are running a 256 process job, the per-process limit needs to
be somewhat over 255 (one connection to each of the other 255
processes, plus 4-6 fds for things like stdin, talking to the lamd,
etc.). And the per-user and system-wide limits need to be a bit over
that times the number of processes per node. So you can see how these
limits grow really fast :).
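
As a rough illustration of the arithmetic above, here is a minimal
Python sketch. The function name estimate_fds and the default overhead
of 6 fds (the upper end of the 4-6 mentioned above) are assumptions
made for illustration, and it assumes every peer is reached over TCP.

```python
def estimate_fds(total_procs, procs_per_node, overhead_fds=6):
    """Rough fd estimate for a fully-connected TCP job.

    overhead_fds is an assumed figure for stdin, the link to the lamd,
    and similar; the message above puts it at 4-6.
    """
    # One TCP connection to every other process, plus the fixed overhead.
    per_process = (total_procs - 1) + overhead_fds
    # Per-user and system-wide limits must cover every process on the node.
    per_node = per_process * procs_per_node
    return per_process, per_node


if __name__ == "__main__":
    for procs, ppn in [(64, 4), (256, 1), (256, 2)]:
        per_process, per_node = estimate_fds(procs, ppn)
        print(f"{procs} procs, {ppn} per node: "
              f"{per_process} fds per process, {per_node} fds per node")
```

The numbers are a lower bound only: anything the application opens
itself (data files, logs) comes on top.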
On most systems, it is possible to increase the fd limits, at the cost
of slightly higher memory usage. If you are running one process per
node, that is about your only option. If you are running multiple
processes per node, you may be able to use the sysv or usysv RPIs to
reduce your fd count - rather than using TCP locally, they use shared
memory.
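
For the per-process limit, a minimal sketch of inspecting and raising
it from inside a process, using Python's standard resource module on a
Unix-like system. The helper name raise_fd_soft_limit is hypothetical.
This only changes the calling process and its children, and only up to
the hard limit; per-user and system-wide limits are set elsewhere by
the administrator and are not touched here.

```python
import resource


def raise_fd_soft_limit(wanted):
    """Raise the soft RLIMIT_NOFILE toward `wanted`; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)  # an unprivileged process cannot exceed the hard limit
    if wanted <= soft:
        return soft  # never lower the limit in this sketch
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return wanted


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("before: soft =", soft, "hard =", hard)
    print("soft limit now:", raise_fd_soft_limit(1024))
```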
One other thing I didn't mention. I kind of assumed you were trying to
run really big jobs. If you are trying to run something like 64 or
fewer processes over 16 or more machines, you really shouldn't be
hitting fd limits on a modern OS. In this case, I might use a tool like
lsof to make sure you aren't leaking file descriptors somewhere. With
the exception of the MPI-2 dynamic processes functions, LAM shouldn't
be adding more file descriptors after MPI_INIT, so an increase in open
fds after MPI_INIT probably means your app is leaking fds.
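
The before/after comparison described above can be sketched in Python.
This is an illustration of the idea, not a LAM facility: open_fds is a
hypothetical helper that probes each descriptor number with os.fstat,
and the 1024 upper bound is an assumed scan range. From outside the
process, lsof -p <pid> lists the same open files, so running it
repeatedly against a running MPI process shows whether the count keeps
growing.

```python
import os


def open_fds(max_fd=1024):
    """Return the set of descriptor numbers below max_fd that are currently open."""
    fds = set()
    for fd in range(max_fd):
        try:
            os.fstat(fd)  # raises OSError (EBADF) if fd is not open
        except OSError:
            continue
        fds.add(fd)
    return fds


if __name__ == "__main__":
    before = open_fds()  # e.g. taken right after initialization

    # Stand-in for application work that forgets to close a file.
    leaked = os.open(os.devnull, os.O_RDONLY)

    after = open_fds()   # taken later in the run
    print("new fds since the first snapshot:", sorted(after - before))

    os.close(leaked)
```

Taking a snapshot right after initialization and again at intervals
makes it easy to see which descriptors appeared in between.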
Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/