LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-09-21 10:23:23


On Sep 20, 2004, at 1:26 PM, Phil Ehrens wrote:

> We recently had some lam-killfile files persist across the
> reboot of a machine, so that the process ID they recorded
> either no longer existed or was owned by a different user
> than the one expected.
> This caused a problem that, while not difficult to diagnose,
> was difficult to diagnose automatically via our usual heuristics.

Let me make sure I understand the problems...

> Since the failures were due to:
>
> a.) no such process

So are you talking about the equivalent of:

lamboot
kill -9 `ps -eadf | grep lamd | grep my_user_id | grep -v grep | awk '{ print $2 }'`
lamhalt

Or:

lamboot
kill -9 ....
lamboot

Both of those scenarios work fine for me -- the session directory is
either removed or the killfile is reset. Either way, the desired
result is achieved.

What I don't remember, however, is whether this is something we fixed
in 7.1 (i.e., whether it was a problem in 7.0).

So are you talking about that kind of scenario, or something else?

> b.) pid owned by different user

So is this a scenario something like:

- user A lamboots
- machine reboots
- user A is deleted
- user B -- with the same username, but a different uid -- is added
- user B lamboots

I can see how this would be a problem; we didn't anticipate that kind
of scenario at all.

Is this what you're talking about?
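
For automated diagnosis, the two failure modes above ((a) no such process, (b) pid owned by a different user) can be told apart by probing the recorded pid with signal 0. The following is only a minimal sketch, not anything LAM itself does: the function name classify_pid and its return labels are made up for illustration, it assumes a POSIX system, and it assumes the pid has already been read out of the killfile by some other means.

```python
import os


def classify_pid(pid: int) -> str:
    """Classify a recorded pid as 'missing', 'foreign', or 'ours'.

    Hypothetical helper for illustration; POSIX only. Do not run this on
    Windows, where os.kill() with signal 0 does not act as a harmless probe.
    """
    try:
        # Signal 0 performs only the existence and permission checks;
        # no signal is actually delivered to the target.
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: case (a), no such process.
        return "missing"
    except PermissionError:
        # EPERM: case (b), the process exists but we may not signal it,
        # which normally means it belongs to a different user.
        return "foreign"
    # The probe succeeded: the process exists and we are allowed to signal it.
    return "ours"


if __name__ == "__main__":
    # Probe our own pid as a trivial demonstration.
    print(classify_pid(os.getpid()))
```

Two caveats apply. When run as root the probe succeeds for any live pid, so "ours" says nothing about ownership in that case. And after a reboot a recorded pid may have been reused by an unrelated process belonging to the same user, which this check cannot detect.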

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/