LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-10-13 17:01:24


On Oct 11, 2006, at 12:21 PM, Josh Lehan wrote:

> Hi. I sent this message to the lam-devel list a few days ago, but
> that list appears dead. I'm assuming the developers have moved on
> to OpenMPI.

It's not dead dead dead :-), but only viewed with quite low frequency
because most of us are spending 99% of our time on Open MPI. :-)

> In LAM 7.1.2, I found a segfault in lamd when "lamhalt" is used to
> tear down a LAM network.
>
> It happens if the "tkill" executable is not found.

How exactly does this happen, actually? The code as it stands
searches $LAMHOME/bin and the compiled-in default $bindir; is tkill
not found there? Do you not have the tkill binary distributed to the
back-end bproc nodes?

> It's in the appropriately named function diediedie() in
> otb/sys/haltd/haltd.c
>
> I traced it out, and what's going on is this:
>
> It's building a list of locations to search for "tkill", and passes
> that to sfh_path_findv().

I'd actually augment your patch to not even search $PATH in the first
place (since it's going to be searched wrong). Something like the
attached.
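
To make the idea concrete, here is a rough, self-contained sketch. It is
not the attached patch and not the actual haltd.c code: find_helper() and
DEFAULT_BINDIR are hypothetical names, and it uses plain access(2) rather
than sfh_path_findv(). It only looks in $LAMHOME/bin and a compiled-in
default directory, never in $PATH, and it returns NULL when nothing is
found.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the compiled-in $bindir; not LAM's real value. */
#ifndef DEFAULT_BINDIR
#define DEFAULT_BINDIR "/usr/local/bin"
#endif

/*
 * Look for an executable called `name` in $LAMHOME/bin and then in
 * DEFAULT_BINDIR, deliberately ignoring $PATH.  On success the full
 * path is written to buf and buf is returned; otherwise NULL is returned.
 */
static char *find_helper(const char *name, char *buf, size_t len)
{
    const char *dirs[2];
    size_t ndirs = 0;
    char lamhome_bin[4096];
    const char *lamhome = getenv("LAMHOME");

    assert(name != NULL && buf != NULL);

    /* Only add $LAMHOME/bin if LAMHOME is set and the path fits. */
    if (lamhome != NULL && lamhome[0] != '\0') {
        int n = snprintf(lamhome_bin, sizeof(lamhome_bin), "%s/bin", lamhome);
        if (n > 0 && (size_t)n < sizeof(lamhome_bin))
            dirs[ndirs++] = lamhome_bin;
    }
    dirs[ndirs++] = DEFAULT_BINDIR;

    for (size_t i = 0; i < ndirs; i++) {
        int n = snprintf(buf, len, "%s/%s", dirs[i], name);
        if (n < 0 || (size_t)n >= len)
            continue;               /* candidate path too long; skip it */
        if (access(buf, X_OK) == 0)
            return buf;             /* found an executable candidate */
    }
    return NULL;                    /* not found anywhere we looked */
}
```

A caller would check for NULL before trying to exec the result, e.g.
log an error and carry on with the shutdown, rather than assume the
lookup succeeded.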

Thanks for the patch -- I'm no longer a LAM maintainer, but this
patch meets with my approval. :-)

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems