
LAM/MPI General User's Mailing List Archives


From: Phil Ehrens (pehrens_at_[hidden])
Date: 2004-10-06 10:23:01


Hi All,

This is the kind of problem I hate to come to a list with,
because there is no clear evidence of what is going wrong,
but here goes.

We have started testing our batch processing system, which
uses multiple (16-32) LAM users with persistent LAM universes,
with LAM 7.1.1.

On our first-pass test system, a single node system with
16 users and 9 virtual nodes (1 master and 8 slaves),
we are seeing lamd's exiting without warning when we
run at our normal cycling rate of 1000 jobs/hr for that
system. When the cycling rate is dropped to 800 jobs/hr
the lamd's stay up indefinitely, as they do with LAM 7.0.X
at 1000 jobs/hr.

I suspect some issue with LAM 7.1.1 and reusing a LAM
universe too quickly in succession. That is, I think
there is some state persisting in the LAM universe
for some short time that causes lamd to exit if a new
multicomputer is carved out 'too soon'.
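
For context, the per-job cycle for a given LAM user looks
roughly like this (lamboot/lamclean/lamhalt are the standard
LAM tools; the hostfile and job names here are just
illustrative):

    lamboot -v hostfile       # carve out the multicomputer
                              # (1 master + 8 slaves)
    mpirun -np 9 ./batch_job  # run the job on the virtual nodes
    lamclean                  # kill any leftover user processes
    lamhalt                   # tear down the universe

Under high load the next lamboot for the same user can follow
the lamhalt almost immediately, which is exactly when we see
the lamd's exit.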

Since we normally 'round-robin' through the 16 users,
it is only under VERY high load that a single user can
get reused quickly, essentially at random.

Any ideas? Were there any changes in the cleanup code when a
LAM job completes that might step on the 'next job'?

Once again, we are seeing this anomaly only when switching
from LAM 7.0.X to LAM 7.1.1. It does not exist in
the 7.0.X branch, which we are currently using to process
roughly 70,000 batch jobs per day successfully.

Phil

-- 
Phil Ehrens <pehrens_at_[hidden]>| Fun stuff:
The LIGO Laboratory, MS 18-34         | http://www.ralphmag.org
California Institute of Technology    | http://www.yellow5.com
1200 East California Blvd.            | http://www.total.net/~fishnet/
Pasadena, CA 91125 USA                | http://slashdot.org
Phone:(626)395-8518 Fax:(626)793-9744 | http://kame56.homepage.com