LAM/MPI General User's Mailing List Archives

From: Karl Forner (Karl.Forner_at_[hidden])
Date: 2004-03-19 04:21:03


Hello,

I've been using LAM in production on two clusters for years, and there's
a very annoying bug that is still present even
in the latest version.

When you kill a LAM job, for example by typing 'CTRL+C' in the terminal,
some files are left open by the LAM daemon.
Once the number of open files reaches 71, you can no longer launch
new jobs; you get an error message like:

lamexec (set_stdio): Too many open files in system

It is easy to reproduce, for example on a Linux cluster with Red Hat
7.2 running LAM 7.0.4:

% lamboot -b -v

Then get the PID of the LAM daemon, e.g.:
% PID=`pgrep -u $USER lamd`

Then count the number of open files (plus one, because "ls -l" also
prints a "total" line):
% ls -l /proc/$PID/fd | wc -l
You should have 11 open files.

Then repeat the following steps.

Launch a simple LAM command:
% lamexec N sleep 10
and interrupt it with one or two 'CTRL+C'.
You can check with "ls -l /proc/$PID/fd | wc -l" that the number of
open files is increasing.

Repeat until you reach 71 open files; you should then get the error
message.
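
To make the check above less error-prone, here is a minimal sketch of a helper script. It is not part of LAM; the function names count_open_fds and report_lamd_fds are my own, and it assumes Linux with /proc mounted and at most one lamd per user. It uses "ls -1" rather than "ls -l", so the count is not off by one. Run it before and after each interrupted lamexec to watch the number grow.

```shell
#!/bin/sh
# Count the open file descriptors of a process via /proc (Linux only).
# "ls -1" prints one entry per line without the "total" header that
# "ls -l" adds, so the result needs no off-by-one correction.
count_open_fds() {
    ls -1 "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Report the count for the current user's lamd, if one is running.
# Assumption: at most one lamd per user; only the first PID is used.
report_lamd_fds() {
    command -v pgrep >/dev/null 2>&1 || return 0
    pid=$(pgrep -u "$(id -un)" lamd 2>/dev/null | head -n 1) || pid=""
    [ -n "$pid" ] || { echo "no lamd running"; return 0; }
    echo "lamd (pid $pid): $(count_open_fds "$pid") open files"
}

report_lamd_fds
```

The CTRL+C step is left manual on purpose: a signal sent from a script may not behave the same way as an interrupt typed in the terminal.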

Is this bug already known?
Do you need any help fixing it?

Thanks,
Karl FORNER