On Oct 11, 2006, at 12:21 PM, Josh Lehan wrote:
> Hi. I sent this message to the lam-devel list a few days ago, but
> that list appears dead. I'm assuming the developers have moved on
> to OpenMPI.
It's not dead dead dead :-), but it is only read infrequently
because most of us are spending 99% of our time on Open MPI. :-)
> In LAM 7.1.2, I found a segfault in lamd when "lamhalt" is used to
> tear down a LAM network.
>
> It happens if the "tkill" executable is not found.
How exactly does this happen, actually? The code as it stands
searches $LAMHOME/bin and the compiled-in default $bindir; is tkill
not found there? Do you not have the tkill binary distributed to the
back-end bproc nodes?
> It's in the appropriately named function diediedie() in
> otb/sys/haltd/haltd.c
>
> I traced it out, and what's going on is this:
>
> It's building a list of locations to search for "tkill", and passes
> that to sfh_path_findv().
I'd actually augment your patch to not even search $PATH in the first
place (since it's going to be searched wrong). Something like the
attached.
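
As a rough illustration of the idea (this is not the attached patch, and not LAM's actual code), the sketch below searches only a fixed list of directories, never $PATH, and makes the caller handle the "not found" case explicitly. The names find_in_dirs() and report_tool() are hypothetical, and the assumption that the crash comes from an unchecked "not found" result is mine; sfh_path_findv() itself is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (NOT LAM's sfh_path_findv()): look for `name` in each
 * directory of a NULL-terminated list. Returns a malloc'ed full path that
 * the caller must free, or NULL if no executable match is found. */
static char *find_in_dirs(const char *name, const char *const dirs[])
{
    assert(name != NULL && dirs != NULL);

    for (size_t i = 0; dirs[i] != NULL; ++i) {
        /* directory + '/' + name + terminating NUL */
        size_t len = strlen(dirs[i]) + 1 + strlen(name) + 1;
        char *full = malloc(len);
        if (full == NULL)
            return NULL;
        snprintf(full, len, "%s/%s", dirs[i], name);
        if (access(full, X_OK) == 0)
            return full;          /* found an executable candidate */
        free(full);
    }
    return NULL;                  /* not found in any listed directory */
}

/* Hypothetical caller: checks for NULL before touching the result, so a
 * missing binary becomes an error return instead of a crash. */
static int report_tool(const char *name, const char *const dirs[])
{
    char *path = find_in_dirs(name, dirs);
    if (path == NULL) {
        fprintf(stderr, "%s: not found, skipping\n", name);
        return -1;
    }
    printf("would run %s\n", path);
    free(path);
    return 0;
}
```

In the haltd case the directory list would presumably hold just $LAMHOME/bin and the compiled-in $bindir, with the tool name being "tkill".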
Thanks for the patch -- I'm no longer a LAM maintainer, but this
patch meets with my approval. :-)
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems