Phil Ehrens wrote:
>Hi All,
>
>This question is directed to LAM users. I have already received
>help from the LAM team with this, but I need the attention of
>folks who are using LAM within a batch processing environment.
>
>We have wrapped the LAM environment for use in a data processing
>batch pipeline. We typically run at a rate of one pipeline every
>7 seconds, or about 12,000 jobs every 24 hours.
>
>We do this using persistent lamd's. We configure the comm-world
>using per-job schema files in /tmp that we pass as mpirun
>command line arguments.
>
>We see an MPI_INIT failure about once in every 2,000 jobs
>processed. In most cases we can recover and retry after
>running lamboot again, but we still have to abort about
>1 job in 5,000 because the retry fails.
>
>Is anybody on this list doing batch processing with LAM and
>having a higher rate of success? If so, can you tell me
>what strategy you are using to deal with cleanup and retry
>after mpirun fails? We cannot retry repeatedly without
>the pipeline getting backed up. The ideal solution is, of
>course, one that allows querying the state of the LAM
>universe, but we have not found a safe way to do that.
>
>Phil
>
>
Do you have any idea what causes the failure?
I have set up a batch system that seems similar to yours, and in my case
the failures with the persistent lamds are tied to two bugs:
- mpirun does not kill all the processes if a LAM application fails, so
some files remain open
- some files also stay open if the job is interrupted by, for instance, a
SIGINT (CTRL+C)
So I found a kind of workaround, which you can find in the mailing-list
thread:
"LAM: Re: mpirun (set_stdio): Too many open files in system"
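
For the cleanup-and-retry question, here is a rough sketch of a bounded
retry wrapper. It is not the workaround from the thread cited above, only
an illustration of the general idea: run lamclean after a failed mpirun so
that leftover processes do not keep files open, and fall back to lamboot
if that fails. The names run_job, schema_path and hostfile are made up for
the example, and I am assuming that a non-zero exit status from lamclean
is a usable sign that the universe is not responding, which you would need
to verify on your installation.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the command's exit status.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def run_job(schema_path: str, hostfile: str, max_retries: int = 1,
            run: Runner = run_command) -> bool:
    """Run one job with mpirun, cleaning up and retrying a bounded number of times.

    schema_path and hostfile are hypothetical parameters: the per-job
    application schema and the boot schema passed to lamboot.
    """
    for attempt in range(max_retries + 1):
        if run(["mpirun", schema_path]) == 0:
            return True
        if attempt == max_retries:
            break  # out of retries: give up so the pipeline does not back up
        # Remove leftover processes from the failed run before retrying.
        if run(["lamclean"]) != 0:
            # Assumption: a failing lamclean means the universe needs a reboot.
            run(["lamboot", hostfile])
    return False
```

Keeping max_retries small bounds the extra time a failing job can cost,
which matters at one pipeline every 7 seconds.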
Karl