LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2004-09-21 16:38:19


On Sep 21, 2004, at 11:55 AM, Phil Ehrens wrote:

> Jeff Squyres wrote:
>> On Sep 21, 2004, at 12:08 PM, Phil Ehrens wrote:
>>
>>> [snipped]
>>> Ah. No. We use a stable of 16 LAM users named search01 - search16.
>>> After a machine reboot there may be a new lamd running with the pid
>>> found in user search05's lam-killfile, but it in fact is user
>>> search03's lamd. So when lamboot runs for user search05, the tkill
>>> will fail.
>>
>> Ahh... now I understand. Yes, this is a scenario that we did not
>> anticipate -- 7.1 won't handle it any better than 7.0. :-\
>
> I am so disappointed... shame grips me like a giant clam.
>
>> Lemme investigate this; it may not be hard to fix.
>
> Great! Thanks!

Ok, I have a quick fix that I think makes sense. lamboot was failing
because tkill wrote something to stderr, instead of staying quiet, when
it got a permission-denied error while trying to kill another user's
process. tkill no longer emits anything on stderr in that case, so
life should be good.

I've attached a patch against tkill.c that should apply to the 7.0.x
tree and 7.1. Can you give it a shot and see if it works for you?
I've committed the change into both the LAM svn trunk and 7.1 branch,
so it will sneak into LAM 7.1.1, whenever that happens.

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/