Hi All,
This question is directed at LAM users. I have already received
help from the LAM team with this, but I need the attention of
folks who are using LAM within a batch processing environment.
We have wrapped the LAM environment for use in a data processing
batch pipeline. We typically run at a rate of one pipeline every
7 seconds, or about 12,000 jobs every 24 hours.
We do this using persistent lamd daemons. We configure
MPI_COMM_WORLD for each job using per-job schema files in /tmp,
which we pass as mpirun command line arguments.
We see an MPI_INIT failure about once in every 2,000 jobs
processed. In most cases we can recover and retry after
running lamboot again, but we still have to abort about
1 job in 5,000 because the retry fails.
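To make the recover-and-retry cycle described above concrete, here is a
minimal sketch of a bounded retry wrapper. It is an illustration under
assumptions, not our actual wrapper: the names run_job, run_command,
app_schema and boot_schema are hypothetical, and the sketch assumes that
mpirun is given the per-job schema file as its only argument, that lamboot
is given a boot schema file, and that a nonzero mpirun exit status can be
treated as a failure worth retrying.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical type: a runner takes an argv list and returns the exit status.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run an external command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def run_job(app_schema: str, boot_schema: str,
            max_retries: int = 1, run: Runner = run_command) -> bool:
    """Run one job; on failure, re-run lamboot and retry a bounded number of times.

    Returns True if the job succeeded and False if it has to be aborted.
    """
    for attempt in range(max_retries + 1):
        # Assumption: any nonzero exit status from mpirun counts as a failure.
        # A real wrapper would need to distinguish an MPI_INIT failure from
        # an ordinary application error.
        if run(["mpirun", app_schema]) == 0:
            return True
        if attempt < max_retries:
            # Assumed recovery step: boot the LAM universe again before retrying.
            run(["lamboot", boot_schema])
    return False
```

With max_retries=1 this gives up after a single retry, which keeps the
worst-case delay per job bounded so that the pipeline does not back up.
The run parameter exists only so the logic can be exercised without a LAM
installation.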
Is anybody on this list doing batch processing with LAM and
having a higher rate of success? If so, can you tell me
what strategy you are using to deal with cleanup and retry
after mpirun fails? We cannot retry repeatedly without
the pipeline getting backed up. The ideal solution is, of
course, one that allows querying the state of the LAM
universe, but we have not found a safe way to do that.
Phil
--
Phil Ehrens <pehrens_at_[hidden]>| Fun stuff:
The LIGO Laboratory, MS 18-34 | http://www.ralphmag.org
California Institute of Technology | http://www.yellow5.com
1200 East California Blvd. | http://www.total.net/~fishnet/
Pasadena, CA 91125 USA | http://slashdot.org
Phone:(626)395-8518 Fax:(626)793-9744 | http://kame56.homepage.com