Hi Jeff,
Your memory serves you well with regard to loose integration.
I'm verifying that lamd is still running by ssh'ing into each compute node and doing a ps command. I have done this after several minutes of waiting to make sure the dust has settled and that's when I see the lone lamd still running on the first node.
I will look into installing 7.1.2 and see whether it has any effect. I'm also trying tight integration but am having trouble getting it to work; I'm not sure whether the problem is ssh-related, since we don't use rsh on the system. I can get it to run, but the lamd is not a child of the SGE process.
Thanks for your help.
Rick
________________________________
From: lam-bounces_at_[hidden] on behalf of Jeff Squyres
Sent: Thu 8/3/2006 7:37 AM
To: General LAM/MPI mailing list
Subject: Re: LAM: lamd on first node not shutting down
I'm not too familiar with LAM's SGE integration, unfortunately (more
specifically: it has been explained to me many times, but I have no systems
with SGE, so the information stays resident in my brain for about 24-48
hours and then experiences a page fault). I seem to recall that "loose
integration" means that you're getting the host list from SGE but not using
SGE's controls to launch/kill LAM.
Can you indicate how you're verifying that the lamd is not killed? In some
of the 7.x versions, the lamd on the origin node would remain around for up
to 2 seconds after the others were killed (and after lamhalt returned), but it
would then kill itself. Hence, if you do something like this:
    lamboot
    mpirun ....
    lamhalt
    ps -eadf | grep lamd
You would still see the lamd running on the origin, but this is somewhat of
a false positive because the lamd will shortly kill itself.
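
One way to avoid that false positive is to poll for a short grace period instead of checking once. The following is only a minimal sketch: `wait_until_gone` is a hypothetical helper (not part of LAM/MPI), it assumes `pgrep` is available on the node, and the 5-second default is an arbitrary margin over the roughly 2 seconds mentioned above.

```sh
#!/bin/sh
# Sketch: wait up to a grace period for a named process to exit.
# wait_until_gone is a hypothetical helper, not a LAM/MPI command.
# Assumes pgrep is installed on the node being checked.
wait_until_gone() {
    name=$1          # exact process name to look for, e.g. lamd
    grace=${2:-5}    # seconds to wait before giving up (assumed default)
    elapsed=0
    while pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$grace" ]; then
            return 1     # still running after the grace period
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0             # no matching process left
}

# Example use after a job (commented out so the sketch has no side effects):
#   lamboot
#   mpirun ....
#   lamhalt
#   wait_until_gone lamd 5 || echo "lamd still running on $(hostname)"
```

If the helper still reports a lamd after the grace period, the process is probably genuinely left over rather than just slow to exit.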
I believe that we updated 7.1.2 to have lamhalt not return until *all* lamds
were dead (even the one on the origin).
On 8/2/06 3:29 PM, "Richard Bohn" <rxbeee_at_[hidden]> wrote:
> Being new to MPI and Sun Grid Engine, I need help.
>
> I'm trying to run LAM-MPI (7.1.1) under SGE (6.0u6) using SSH. My little
> helloworld job runs OK on multiple nodes, but I noticed that the lamd process
> is not being shut down on the first node. All the other nodes have lamd killed,
> and this is repeatable every time. Right now the cluster is kind of set up for
> loose integration.
>
> Thank you in advance for your help!
>
> Rick Bohn
> Rochester Institute of Technology
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems