Karl Forner wrote:
>
> I have set up a batch system that seems similar to yours, and in my case
> the failure with the persistent lamd's is tied to two bugs:
> - mpirun does not kill all the processes if a LAM application fails, so
> some files remain open
> - some files stay open too if the job is interrupted by, for instance, a
> SIGINT (CTRL+C)
>
> so I found a kind of workaround that you can find in the mailing list:
> "LAM: Re: mpirun (set_stdio): Too many open files in system"
>
Hi Karl,
We dealt with that one quite a while ago. We found that we need
to go on a killing spree (ssh/rsh kill -9 pid) if lamclean does
not return within 20 seconds. This is a last resort that we seldom
need to exercise, however, since lamclean seems to do the right
thing most of the time.
We found that it is most important to kill the lamd's FIRST, before
killing other processes, since generally killing the lamd's causes
the other controlling and monitoring processes to return. If
you kill mpirun, or your own controlling/monitoring processes, first,
you can wind up with scary bound but unattached sockets, unkillable
zombie processes, and other Halloween horrors ;^)
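For anyone who wants to script this, the order of operations above could
be sketched roughly as follows. This is only an illustration, not our
actual code: the host list, the use of pkill by process name (instead of
kill -9 on a known pid), and the function names build_kill_plan and
cleanup are all assumptions made for the example.

```python
import subprocess

def build_kill_plan(hosts, other_names=("mpirun",)):
    """Return remote kill commands, with every lamd kill ahead of any other kill."""
    # Assumption: processes are matched by exact name with pkill -x.
    plan = [["ssh", h, "pkill", "-9", "-x", "lamd"] for h in hosts]
    for name in other_names:
        plan += [["ssh", h, "pkill", "-9", "-x", name] for h in hosts]
    return plan

def cleanup(hosts, timeout=20, run=subprocess.run):
    """Try lamclean first; fall back to forced kills only if it hangs."""
    try:
        run(["lamclean"], timeout=timeout)
        return []  # lamclean returned in time, nothing else to do
    except subprocess.TimeoutExpired:
        pass  # last resort: go on the killing spree
    plan = build_kill_plan(hosts)
    for cmd in plan:
        run(cmd)  # a non-zero exit (nothing matched) is not treated as an error
    return plan
```

The run parameter exists only so the sequence can be exercised without a
real cluster; in normal use it defaults to subprocess.run.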
In any case, we have not seen '(set_stdio): Too many open files in system'
for a long time.
Phil
--
Phil Ehrens <pehrens_at_[hidden]>| Fun stuff:
The LIGO Laboratory, MS 18-34 | http://www.ralphmag.org
California Institute of Technology | http://www.yellow5.com
1200 East California Blvd. | http://www.total.net/~fishnet/
Pasadena, CA 91125 USA | http://slashdot.org
Phone:(626)395-8518 Fax:(626)793-9744 | http://kame56.homepage.com