LAM/MPI General User's Mailing List Archives

From: Phil Ehrens (pehrens_at_[hidden])
Date: 2004-09-21 11:08:08


Jeff Squyres wrote:
> On Sep 20, 2004, at 1:26 PM, Phil Ehrens wrote:
>
> >We recently had some lam-killfile files persist across the
> >reboot of a machine, so that the process ID no longer
> >existed or was owned by a different user than the one
> >expected.
> >This caused a problem that, while not difficult to diagnose,
> >was difficult to diagnose automatically via our usual heuristics.
>
> Let me make sure I understand the problems...
>
> >Since the failures were due to:
> >
> > a.) no such process
>
> So are you talking about the equivalent of:
>
> lamboot
> kill -9 `ps -eadf | grep lamd | grep my_user_id | grep -v grep | awk '{ print $2 }'`
> lamhalt
>
> Or:
>
> lamboot
> kill -9 ....
> lamboot

I'm not sure we are on the same page. What happens is lamboot
runs into the following error, caused by the pid contained
in the lam-killfile not being a pid killable by the user, and
as a result lamboot fails to produce a usable universe:

tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-search07_at_node113/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-search07_at_node113/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-search07_at_node113/lam-io-socket
tkill: f_kill = "/tmp/lam-search07_at_node113/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 2799 ...
tkill:
ERROR: LAM/MPI unexpectedly received the following on stderr:
tkill (kill): Operation not permitted

So, from my perspective, the solution is:

lamboot (fails)
kill -9 ... (does nothing)
rm -rf /tmp/lam-${USER}*
lamboot (succeeds)
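
A rough sketch of how that workaround might be automated ahead of lamboot. The function names and the base-directory/user arguments are hypothetical, and it assumes the first field on the first line of a lam-killfile is the pid that tkill would signal (the file's real layout is not shown in this thread). It uses `kill -0`, which sends nothing and only reports whether the pid exists and is signalable by the caller, i.e. the same condition tkill trips over above. It does not verify that the pid is actually a lamd, and when run as root every live pid counts as signalable.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: the first field of the first line of a
# lam-killfile is the pid that tkill would signal.

# killfile_is_stale FILE
# Returns 0 when FILE holds a pid the current user cannot signal
# (no such process, or owned by someone else); 1 otherwise.
killfile_is_stale() {
    killfile=$1
    [ -r "$killfile" ] || return 1        # no killfile: nothing to clean
    pid=$(awk 'NR == 1 { print $1 }' "$killfile")
    case $pid in
        ''|*[!0-9]*) return 0 ;;          # unparsable contents: treat as stale
    esac
    # kill -0 sends no signal; it only tests existence and permission.
    ! kill -0 "$pid" 2>/dev/null
}

# clean_stale_sessions BASEDIR USER
# Removes BASEDIR/lam-USER* session directories whose killfile is stale.
clean_stale_sessions() {
    base=$1
    user=$2
    for dir in "$base"/lam-"$user"*; do
        [ -d "$dir" ] || continue         # glob matched nothing
        if killfile_is_stale "$dir/lam-killfile"; then
            rm -rf "$dir"
        fi
    done
}
```

Something like `clean_stale_sessions /tmp "$USER"` before lamboot would then stand in for the unconditional rm -rf, leaving a session directory alone when its lamd is still alive and owned by the caller.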

> Both of those scenarios work fine for me -- the session directory is
> either removed or the killfile is reset. Either way, the desired
> result is achieved.
>
> What I don't remember, however, is if this is something we fixed in 7.1
> (i.e., if it was a problem in 7.0).

Oops. We are using lam-7.0.6. Alas, I cannot update the LAM version
independent of our entire certified suite of tools, and there is a
two week long testing regime that must occur before we can have a
point release. I will recommend that we begin testing with 7.1 ASAP.

> So are you talking about that kind of scenario, or something else?
>
> > b.) pid owned by different user
>
> So is this a scenario something like:
>
> - user A lamboots
> - machine reboots
> - user A is deleted
> - user B -- with the same username, but a different uid -- is added
> - user B lamboots
>
> I can see how this would be a problem; we didn't anticipate that kind
> of scenario at all.
>
> Is this what you're talking about?

Ah. No. We use a stable of 16 LAM users named search01 - search16.
After a machine reboot there may be a new lamd running with the pid
found in user search05's lam-killfile, but it is in fact user
search03's lamd. So when lamboot runs for user search05, the tkill
fails.

The obvious solution to this second case is for us to delete existing
LAM directories under /tmp at reboot, but it is a condition that lamboot
could handle more elegantly (and maybe it does in LAM 7.1).
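
For the reboot-time cleanup, a minimal sketch of what a boot script might call. The function name is made up for illustration, and it assumes every directory matching lam-* under the base directory is a leftover LAM session that is safe to delete, which is only true before any lamd has been started after the reboot.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: at boot, before any lamd is started, every
# BASEDIR/lam-* directory is a leftover session and can be removed.

# purge_lam_sessions [BASEDIR]
# Removes all lam-* directories under BASEDIR (default /tmp).
purge_lam_sessions() {
    base=${1:-/tmp}
    for dir in "$base"/lam-*; do
        # Skip an unmatched glob, plain files, and symlinks.
        [ -d "$dir" ] && [ ! -L "$dir" ] || continue
        rm -rf -- "$dir"
    done
}
```

Calling `purge_lam_sessions /tmp` from an early init script would cover all of search01 - search16 at once.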

We will start qualifying LAM 7.1 immediately.

Thanks Jeff!

Phil

-- 
Phil Ehrens <pehrens_at_[hidden]>| Fun stuff:
The LIGO Laboratory, MS 18-34         | http://www.ralphmag.org
California Institute of Technology    | http://www.yellow5.com
1200 East California Blvd.            | http://www.total.net/~fishnet/
Pasadena, CA 91125 USA                | http://slashdot.org
Phone:(626)395-8518 Fax:(626)793-9744 | http://kame56.homepage.com