If you really want to work on it, the proper way would be to write a
boot SSI module that understands SGE. hboot would likely not be used
(e.g., it's not used in a TM environment) -- it's a holdover from the
bad-old rsh/ssh days -- I can explain more if you care. The boot SSI
docs are on our documentation web page.
We had extensive discussions with the SGE guys about this a while ago,
and their feeling was that the script-based approach was simpler, which
is why we never bothered to write a boot SSI module. But I'm a purist;
if I had the cycles, I'd like to see an SGE boot SSI module (it would
be an easier experience for the sysadmin, too).
More specifically, I'd like to see this kind of support in Open MPI --
we're not doing too much new work in LAM these days.
I'd also like to see something better than a linear startup mechanism
in SGE -- just my $0.02. ;-)
On Feb 14, 2005, at 4:59 PM, Anthony J. Ciani wrote:
> Hi Reuti,
>
> I never had any success with hboot (qrsh or ssh) in SGE 6.0 during the
> prologue. My current implementation is a wrapper around mpirun which
> runs inside the job script (so accounting is wrong), but it works.
> The LAM is then killed with a simple "lamhalt" in the epilogue (at
> least this works). Only once did "lamhalt" fail to clean up an errant
> task on one node. I have not determined why it failed to kill the
> child on that one node, but it seems to be a fairly rare and random
> occurrence.
>
> Another option is to combine and rewrite lamboot/hboot into a
> "lamboot_sge". Actually, this could probably just be a modified
> lamboot which detects the SGE environment and natively reads the SGE
> hostfile (which is the same as a LAM hostfile with added junk on each
> line), and then calls an hboot which detects the SGE environment and
> uses qrsh correctly to start the child lamd daemons. Even better
> would be to use qrsh to start lamd directly without using hboot, but I
> don't know what hboot needs to do before starting lamd. As I am not
> familiar with the internals of hboot or lamd, it would take me some
> time to reverse engineer them, when I do get the time.
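
The hostfile half of the "lamboot_sge" idea above is fairly mechanical. Here is a minimal sketch in C, assuming each line of the SGE hostfile ($PE_HOSTFILE) has the form "host slots queue processor-range" and that the LAM boot schema wants "host cpu=N". The function name sge_to_lam_line is made up for illustration; it is not part of LAM or SGE.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not LAM or SGE code): convert one line of an SGE
 * PE hostfile, assumed to look like
 *     node01 2 all.q@node01 UNDEFINED
 * into a LAM boot schema line of the form
 *     node01 cpu=2
 * Only the first two fields are used; the rest of the line is ignored.
 * Returns 0 on success, -1 if the line cannot be parsed or the output
 * buffer is too small.
 */
int sge_to_lam_line(const char *sge_line, char *out, size_t outlen)
{
    char host[256];
    int slots;
    int n;

    /* Expect a hostname followed by a positive slot count. */
    if (sscanf(sge_line, "%255s %d", host, &slots) != 2 || slots < 1)
        return -1;

    n = snprintf(out, outlen, "%s cpu=%d", host, slots);
    if (n < 0 || (size_t) n >= outlen)
        return -1;

    return 0;
}
```

A modified lamboot could apply this to every line read from the file named by $PE_HOSTFILE. The harder part, starting lamd on each node through qrsh, is not covered by this sketch.
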
>
> The greatest hurdle to this is that I saw (on some list) the hboot
> problem on SGE 6.0 was somehow related to SGE's improper handling of
> some signal associated with the fork() and/or system() calls (or maybe
> hboot's handling of said signal). I never had the time to really look
> into this, but it "seems" to be this way. On the face, hboot simply
> fails to execute the first ssh or qrsh and hangs. It would REALLY be
> nice then if we could get sge_shepherd or sge_execd to directly start
> hboot/lamd, which is what using qrsh was supposed to accomplish, so
> why doesn't it work!? It made the problem harder to diagnose too,
> because the job hangs and you have to manually clean it up by ssh'ing
> to each node (the job is too hung for even qdel to clear it). This
> could be related to the fact that lamd doesn't exit.
>
> Of course, this may all only be temporary, as SGE is rumored to be
> getting proper support for either Globus and/or TM in the
> not-so-distant future.
>
> On Mon, 14 Feb 2005, Reuti wrote:
>> Hi Anthony and Jeff,
>>
>> first I started on my own without looking into the supplied Perl
>> script from Christopher. For now with 5.3, but I will also look into
>> it on another cluster I have access to with 6.0. I found that the
>> problem is twofold:
>>
>> 1. I used qrsh to start the lamd, and of course SGE will lose
>> control of the daemons and the accounting will be wrong (therefore I
>> modified hboot.c). As we don't need accounting, this is a working
>> version for us, and with a simple lamhalt in the PE of SGE, all jobs
>> are killed nicely and semaphores/shared memory segments are removed
>> according to the lam-killfile and lam-registry. To get this working,
>> I made a really small modification to the SGE supplied
>> startmpi.sh/stopmpi.sh scripts and routed the $TMPDIR for LAM to
>> /tmp. Otherwise, after the qrsh returns, the SGE-created TMPDIR will be
>> deleted too early and lamhalt will fail. I see this working with
>> jobs that finish normally and also when I use qdel.
>>
>> 2. I tried Christopher's script, and I understand the idea of always
>> using qrsh twice. This way you get correct accounting, don't
>> need to modify hboot.c, and have control of the daemons.
>> Restrictions: setup must be some kind of allocation_rule set to two
>> (or calculate the correct number of qrshs which should be allowed).
>> But: you will have semaphores and shared memory segments left over.
>>
>> Currently I am thinking of a way to get both working.
>>
>> Maybe the problem is with the allocation of processes under Linux. A
>> 'perfect' solution would start a virtual machine on each allocated
>> node of a parallel job. After the job, the virtual machine will be
>> destroyed and you don't have files, processes or semaphores left over...
>>
>>
>> With Christopher's scripts, I got it working only on one machine. I'm
>> still looking into why it's not running across the nodes.
>>
>>
>> When I have a final Howto, I will publish it again on the
>> sunsource.net site, and put a link here for further reference. I just
>> played around with catching the semget() and shmget() calls of
>> dynamically linked applications with some kind of "ipc_wrapper.so"
>> and loading it with LD_PRELOAD before the application starts. Then I
>> could delete the stuff on my own. I'm still researching...
>>
>>
>> Maybe I will get back to one of you in the next few days.
>>
>> Cheers - Reuti
>>
>>
>> BTW: Where and how can I reach Christopher? His supplied eMail
>> address is no longer working.
>>
>>
>> Anthony J. Ciani wrote:
>>> Hi Reuti,
>>> Was this with SGE 5.3 or 6.0 or both? Are you still using the
>>> already published tight integration scripts (sge-lam and qrsh-lam),
>>> or did you modify them? As I recall, the "sge-lam" script had a
>>> problem with 6.0 in that the TMPDIR hadn't been created before the
>>> prologue ran...
>>> Actually, could you go ahead and publish the integration scripts and
>>> an example PE either here or on the SGE list (or both)?
>>>> From: Reuti <reuti_at_[hidden]>
>>>> Hi all,
>>>> I integrated LAM 7.1.1 in SGE, still using only qrsh. I already found
>>>> the discussion in the mailing list archive about the setsid() in
>>>> hboot.c. It seems to me that the already present setsid() in hboot.h
>>>> is not working, since under SGE it's already the session leader.
>>>> Instead I put a setsid() after the creation of the child in line 317:
>>>> else if (pid == 0) { /* child */
>>>> setsid();
>>>> This seems to work, and I can use qrsh to start the daemons. Are there
>>>> any side effects which could hit me? Also the shutdown is working
>>>> properly and I could be happy with this solution.
>>>> Cheers - Reuti
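
To make the setsid() change described above concrete, here is a standalone sketch of the same pattern. It is not the actual hboot.c code, and spawn_in_new_session is a hypothetical name. The relevant POSIX rule is that setsid() fails with EPERM when the caller is already a process group leader (which a session leader always is), whereas a freshly forked child is never a group leader, so calling setsid() in the child succeeds.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not LAM code): fork a child, call setsid() in
 * the child, and report through the child's exit status whether it
 * became the leader of a new session.
 * Returns 1 if the child got its own session, 0 if it did not,
 * and -1 on a fork() or waitpid() error.
 */
int spawn_in_new_session(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return -1;
    }
    else if (pid == 0) { /* child */
        /* The child is not a process group leader, so this call
         * is allowed even if the parent already leads a session. */
        if (setsid() == (pid_t) -1)
            _exit(1);

        /* After setsid() the session ID equals the child's own PID.
         * A real launcher would exec the daemon at this point. */
        _exit(getsid(0) == getpid() ? 0 : 1);
    }

    /* parent: collect the child's verdict */
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

One side effect to keep in mind: a child in its own session no longer belongs to the parent's session or process group, so signals delivered to the parent's process group will not reach it. That matches the loss of control and accounting discussed elsewhere in this thread.
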
>>> ------------------------------------------------------------
>>> Anthony Ciani (aciani1_at_[hidden])
>>> Computational Condensed Matter Physics
>>> Department of Physics, University of Illinois, Chicago
>>> http://ciani.phy.uic.edu/~tony
>>> ----------------------------------------------------------
>>> --
>>
>
> ------------------------------------------------------------
> Anthony Ciani (aciani1_at_[hidden])
> Computational Condensed Matter Physics
> Department of Physics, University of Illinois, Chicago
> http://ciani.phy.uic.edu/~tony
> ------------------------------------------------------------
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/