On Sep 20, 2004, at 1:26 PM, Phil Ehrens wrote:
> We recently had some lam-killfile files persist across the
> reboot of a machine, so that the process i.d. no longer
> existed or was owned by a different user than the one
> expected.
> This caused a problem that, while not difficult to diagnose,
> was difficult to diagnose automatically via our usual heuristics.
Let me make sure I understand the problems...
> Since the failures were due to:
>
> a.) no such process
So are you talking about the equivalent of:
lamboot
kill -9 `ps -eadf | grep lamd | grep my_user_id | grep -v grep | awk '{ print $2 }'`
lamhalt
Or:
lamboot
kill -9 ....
lamboot
Both of those scenarios work fine for me -- the session directory is
either removed or the killfile is reset. Either way, the desired
result is achieved.
What I don't remember, however, is whether this is something we fixed in 7.1
(i.e., whether it was a problem in 7.0).
So are you talking about that kind of scenario, or something else?
> b.) pid owned by different user
So is this a scenario something like:
- user A lamboots
- machine reboots
- user A is deleted
- user B -- with the same username, but a different uid -- is added
- user B lamboots
I can see how this would be a problem; we didn't anticipate that kind
of scenario at all.
Is this what you're talking about?
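If it helps with automated diagnosis, here is a minimal sketch of a
stale-pid check covering both a.) and b.). This is not something LAM does
itself: pid_owned_by is a hypothetical helper, and it assumes you have
already extracted the pid from the killfile by some other means (the
killfile format is not covered here). It also assumes a ps that accepts
"-o uid=", as the Linux procps and BSD versions do.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): succeed only if process $1
# exists and is owned by numeric uid $2.
pid_owned_by() {
    pid=$1
    want_uid=$2
    # "-o uid=" prints only the owner's uid with no header line;
    # empty output means there is no such process.
    owner=$(ps -o uid= -p "$pid" 2>/dev/null | tr -d '[:space:]')
    [ -n "$owner" ] && [ "$owner" = "$want_uid" ]
}
```

Something like this would flag a leftover killfile entry, given a pid in $pid:
pid_owned_by "$pid" "$(id -u)" || echo "stale killfile entry: $pid"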
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/