I have tested LAM on its own and it works fine; the problem only occurs with SGE + LAM.
The firewall is off on that machine, and lamboot -d looks fine.
All of that makes me think either the script does not work or I may have
messed up the script.
Thanks,
Alan.
#!/usr/bin/perl
$LAMHOME="/usr";
$verbose=1;
#$debug=0;
$debug=1;
# close STDIN to avoid stdio race conditions and tty issues
close(STDIN);
if( $debug == 1){
open(SGEDEBUG,"> /tmp/sgedebug.$ENV{JOB_ID}.$$");
select(SGEDEBUG); $|=1;
open(STDERR,">> /tmp/sgedebug.$ENV{JOB_ID}.$$");
}
# set output for stderr and stdout to be unbuffered
select(STDERR); $|=1;
select(STDOUT); $|=1;
$lamboot="$LAMHOME/bin/lamboot";
$lamhalt="$LAMHOME/bin/lamhalt";
#$sgelamconf="${SGE_ROOT}/lam/sge-lam-conf.lamd";
# read in the args to figure out our task
$func=shift @ARGV;
$SGE_ROOT="$ENV{SGE_ROOT}";
$sgelamconf="$SGE_ROOT/lam/sge-lam-conf.lamd";
$arch=`${SGE_ROOT}/util/arch`;
chomp($arch);
$qrsh="${SGE_ROOT}/bin/${arch}/qrsh";
# add LAM and SGE to path
$ENV{'PATH'}.=":${SGE_ROOT}/bin/${arch}";
$ENV{'PATH'}.=":${LAMHOME}/bin";
#debug_print("TMPDIR = $ENV{TMPDIR}");
debug_print("LAMHOME = $LAMHOME");
debug_print("SGE_ROOT = $SGE_ROOT");
debug_print("PATH = $ENV{PATH}");
debug_print("qrsh = $qrsh");
debug_print("ARGV = \"".join("\" \"",@ARGV)."\"");
debug_print("sgelamconf = $sgelamconf");
if("$func" eq "start"){
debug_print("func=start");
print "Starting SGE + LAM Integration\n";
print "\t using tight integration scheme\n";
start_proc_args();
}elsif("$func" eq "stop"){
debug_print("func=stop");
print "Stopping SGE + LAM Integration\n";
stop_proc_args();
}elsif("$func" eq "qrsh-remote"){
debug_print("func=qrsh-remote");
qrsh_remote();
}elsif("$func" eq "qrsh-local"){
debug_print("func=qrsh-local");
qrsh_local();
}else{
print STDERR "\nusage: $0 {start|stop|qrsh-remote|qrsh-local}\n\n";
exit(-1);
}
sub start_proc_args()
{
# we currently place the LAM host file in the TMPDIR that SGE uses.
# if we place it elsewhere we need to clean it up
$lamhostsfile="$ENV{TMPDIR}/lamhostfile";
# flags and options for lamboot (-x, -s and -np may be useful in some envs)
@lambootargs=("-nn","-ssi","boot","rsh",
"-ssi","boot_rsh_agent","$SGE_ROOT/lam/sge-lam qrsh-remote",
"-c","$sgelamconf");
if($verbose){ push(@lambootargs,"-v"); }
if($debug){ push(@lambootargs,"-d"); }
push(@lambootargs,"$lamhostsfile");
debug_print("LAMBOOT ARGS: @lambootargs");
### Need to convert the SGE hostfile to a LAM hostfile format
# open and read the PE hostfile
open(SGEHOSTFILE,"< $ENV{PE_HOSTFILE}")
or die "cannot open PE hostfile $ENV{PE_HOSTFILE}: $!\n";
# convert to LAM bhost file format
@lamhostslist=();
while(<SGEHOSTFILE>){
($host,$ncpu,$junk)=split(/\s+/);
push( @lamhostslist,"$host cpu=$ncpu");
}
close(SGEHOSTFILE);
debug_print("LAMHOSTSLIST: @lamhostslist");
# create the new lam bhost file
open(LAMHOSTFILE,"> $lamhostsfile");
print LAMHOSTFILE join("\n",@lamhostslist);
print LAMHOSTFILE "\n";
close(LAMHOSTFILE);
debug_print("Exec Lamboot: $lamboot @lambootargs");
# close the debug log only after the last debug_print
if($debug){ close(SGEDEBUG); }
exec($lamboot,@lambootargs);
}
sub stop_proc_args(){
if($verbose){ push(@lamhaltargs,"-v"); }
if($debug){ push(@lamhaltargs,"-d"); }
debug_print("Exec Lamhalt: $lamhalt @lamhaltargs");
if($debug){ close(SGEDEBUG); }
exec($lamhalt,@lamhaltargs);
}
sub qrsh_remote()
{
@myargs=("-inherit","-nostdin","-V",@ARGV);
debug_print("QRSH REMOTE CONFIG: @myargs");
debug_print("Exec qrsh-remote: $qrsh @myargs");
if($debug){ close(SGEDEBUG); }
exec($qrsh,@myargs);
}
sub qrsh_local()
{
# we are running a local qrsh to add the lamd into the current job
# on the local node using the LAM boot schema
# get the hostname to pass to qrsh
$hostname=`/bin/hostname`;
chomp($hostname);
# tell SGE to add this command into the JOB_ID job by using qrsh -inherit
# the hostname is not passed in this case in ARGV by lamboot
@myargs=("-inherit","-nostdin","-V","$hostname",@ARGV);
debug_print("QRSH LOCAL CONFIG: @myargs");
debug_print("Exec qrsh-local: $qrsh @myargs");
if($debug){ close(SGEDEBUG); }
exec($qrsh,@myargs);
}
sub debug_print()
{
if($debug){
print SGEDEBUG "SGE-LAM DEBUG: @_\n";
}
}
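
For what it's worth, the hostfile conversion in start_proc_args can be checked on its own, outside of SGE. This is only a minimal sketch: it assumes each PE_HOSTFILE line looks like "host slots queue processor-range", and the helper name pe_to_bhost, the queue name all.q and the second host node2.example.org are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of sge-lam): convert the lines of an SGE
# PE hostfile into LAM boot schema lines of the form "host cpu=N".
sub pe_to_bhost {
    my @lines = @_;
    my @bhost;
    foreach my $line (@lines) {
        next if $line =~ /^\s*$/;              # skip blank lines
        my ($host, $ncpu) = split ' ', $line;  # first two fields only
        push @bhost, "$host cpu=$ncpu";
    }
    return @bhost;
}

# Sample input; the queue name and the second host are invented.
my @sample = (
    "jardine2.math.uwo.ca 2 all.q\@jardine2.math.uwo.ca UNDEFINED\n",
    "node2.example.org 1 all.q\@node2.example.org UNDEFINED\n",
);
print "$_\n" for pe_to_bhost(@sample);
```

If the printed lines match what ends up in $TMPDIR/lamhostfile, the conversion step is probably not where things go wrong.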
On Tue, 10 Aug 2004, Jeff Squyres wrote:
> On Aug 10, 2004, at 4:01 PM, C.L. Lai [ALAN] wrote:
>
> > I am trying to get that script working on SGE 6 + LAM 7
> > However, I got some errors, I wonder if it's the script's problem or
> > it's my setting.
>
> I can't answer the question about the SGE stuff, but the error you're
> getting is quite odd and may not be related. Let's investigate that
> first, and if that doesn't work out, we'll ping the SGE guys and see
> what they say. :-)
>
> > Here is my error
> > %cat sgedebug.528.7715
> > SGE-LAM DEBUG: LAMHOME = /usr
> [snipped]
> > n0<7715> ssi:boot:rsh: starting lamd on (jardine2.math.uwo.ca)
> > n0<7715> ssi:boot:rsh: starting on n0 (jardine2.math.uwo.ca): hboot -t -c
> > /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-528-0 -I
> > -H 129.100.75.78 -P 36671 -n 0 -o 0
> > n0<7715> ssi:boot:rsh: launching locally
> > n0<7715> ssi:boot:rsh: successfully launched on n0 (jardine2.math.uwo.ca)
> > n0<7715> ssi:boot:base:server: expecting connection from finite list
> > n0<7715> ssi:boot:base:server: got connection from 0.0.0.0
>
> What's happening here is that LAM forked off the lamd locally, but then
> the lamd didn't call back to lamboot and say "I'm ok!". lamboot
> eventually got tired of waiting and gave up.
>
> So the question is: why didn't the lamd call back to lamboot?
>
> The most common reason for this is firewalling software -- LAM uses
> random TCP and UDP ports assigned by the OS. Hence, you either need to
> disable firewalling software or allow TCP and UDP traffic on random
> ports from your trusted set of nodes (including the localhost).
>
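
A rough way to rule out the simplest case on a single node is a loopback check like the sketch below. It is not part of LAM or SGE, and the helper name loopback_tcp_ok is made up. It only shows that a process can listen on an OS-assigned TCP port and be reached from 127.0.0.1; it says nothing about UDP or about traffic between nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: open a listener on an OS-assigned TCP port
# (LocalPort => 0) and try to connect to it over the loopback address.
# Returns the port number on success, 0 on failure.
sub loopback_tcp_ok {
    my $server = IO::Socket::INET->new(
        LocalAddr => '127.0.0.1',
        LocalPort => 0,          # let the OS pick a free port
        Proto     => 'tcp',
        Listen    => 1,
    ) or return 0;
    my $port = $server->sockport();

    my $client = IO::Socket::INET->new(
        PeerAddr => '127.0.0.1',
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 5,
    );
    my $ok = defined $client ? $port : 0;

    close($client) if defined $client;
    close($server);
    return $ok;
}

my $port = loopback_tcp_ok();
print $port ? "loopback TCP connect on port $port: ok\n"
            : "loopback TCP connect failed\n";
```

Running it from inside an SGE job as well as from a login shell would show whether the job environment behaves differently.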
> You might want to look in the syslogs -- "lamboot -d" causes the lamd
> to output some information to the syslogs; there may be information in
> there about why the lamd died before connecting back to lamboot. Also
> look for a corefile indicating that the lamd aborted improperly.
>
> Let me know what you find.
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>