I'm not too familiar with LAM's SGE integration, unfortunately (more
specifically: it has been explained to me many times, but I have no systems
with SGE, so the information stays resident in my brain for about 24-48
hours and then experiences a page fault). I seem to recall that "loose
integration" means that you're getting the host list from SGE but not using
SGE's controls to launch/kill LAM.
Can you indicate how you're verifying that the lamd is not killed? In some
of the 7.x versions, the lamd on the origin node would remain around for up
to 2 seconds after the others were killed (and after lamhalt had returned),
but it would then kill itself. Hence, if you do something like this:

    lamboot
    mpirun ....
    lamhalt
    ps -eadf | grep lamd

You would still see the lamd running on the origin, but this is somewhat of
a false positive because the lamd will shortly kill itself.
I believe that we updated 7.1.2 to have lamhalt not return until *all* lamds
were dead (even the one on the origin).
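
If you want to rule out the false positive on 7.1.1, one option is to give
the origin lamd a few seconds to exit before checking for it. The following
is only a rough sketch, not something that ships with LAM: the function name
wait_until_gone and the 5-second timeout are made up for illustration, and
it assumes that pgrep is available on your nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM/MPI): poll until no process with
# the given name remains, or until the timeout (in seconds) runs out.
# Assumes pgrep is installed. Note that it matches processes of any user,
# so another user's lamd on the same node would also be counted.
wait_until_gone() {
    name=$1
    remaining=$2
    # pgrep -x matches the process name exactly and exits 0 on a match.
    while pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$remaining" -le 0 ]; then
            return 1    # still running after the timeout
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0            # no matching process left
}
```

You would call it right after lamhalt, for example
"lamhalt; wait_until_gone lamd 5 || echo 'lamd is still running'". If that
message still shows up after several seconds, then the lamd really is not
shutting down and it is not just the delayed exit described above.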
On 8/2/06 3:29 PM, "Richard Bohn" <rxbeee_at_[hidden]> wrote:
> Being new to MPI and Sun Grid Engine, I need help.
>
> I'm trying to run LAM-MPI (7.1.1) under SGE (6.0u6) using SSH. My little
> helloworld job runs ok on multiple nodes, but I noticed that the lamd process
> is not being shut down on the first node. All the other nodes have lamd killed
> and this is repeatable every time. Right now the cluster is kind of set up for
> loose integration.
>
> Thank you in advance for your help!
>
> Rick Bohn
> Rochester Institute of Technology
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems