LAM/MPI General User's Mailing List Archives

From: C.L. Lai [ALAN] (clai33_at_[hidden])
Date: 2004-08-11 08:30:47


I have tested LAM on its own and it works fine; this problem only occurs
with SGE + LAM. The firewall is off on that machine, and lamboot -d looks fine.

All of that makes me think that either the script itself does not work or
I have messed it up.

Thanks,
Alan.

#!/usr/bin/perl

$LAMHOME="/usr";

$verbose=1;
#$debug=0;
$debug=1;

# close STDIN to avoid stdio race conditions and tty issues
close(STDIN);

if( $debug == 1){
        open(SGEDEBUG,"> /tmp/sgedebug.$ENV{JOB_ID}.$$");
        select(SGEDEBUG); $|=1;
        open(STDERR,">> /tmp/sgedebug.$ENV{JOB_ID}.$$");
}

# set output for stderr and stdout to be unbuffered
select(STDERR); $|=1;
select(STDOUT); $|=1;
 
$lamboot="$LAMHOME/bin/lamboot";
$lamhalt="$LAMHOME/bin/lamhalt";
#$sgelamconf="${SGE_ROOT}/lam/sge-lam-conf.lamd";

# read in the args to figure out our task
$func=shift @ARGV;

$SGE_ROOT="$ENV{SGE_ROOT}";
$sgelamconf="$SGE_ROOT/lam/sge-lam-conf.lamd";

$arch=`${SGE_ROOT}/util/arch`;
chomp($arch);
$qrsh="${SGE_ROOT}/bin/${arch}/qrsh";

# add LAM and SGE to path
$ENV{'PATH'}.=":${SGE_ROOT}/bin/${arch}";
$ENV{'PATH'}.=":${LAMHOME}/bin";

#debug_print("TMPDIR = $ENV{TMPDIR}");
debug_print("LAMHOME = $LAMHOME");
debug_print("SGE_ROOT = $SGE_ROOT");
debug_print("PATH = $ENV{PATH}");
debug_print("qrsh = $qrsh");
debug_print("ARGV = \"".join("\" \"",@ARGV)."\"");
debug_print("sgelamconf = $sgelamconf");

if("$func" eq "start"){
        debug_print("func=start");
        print "Starting SGE + LAM Integration\n";
        print "\t using tight integration scheme\n";
        start_proc_args();
}elsif("$func" eq "stop"){
        debug_print("func=stop");
        print "Stopping SGE + LAM Integration\n";
        stop_proc_args();
}elsif("$func" eq "qrsh-remote"){
        debug_print("func=qrsh-remote");
        qrsh_remote();
}elsif("$func" eq "qrsh-local"){
        debug_print("func=qrsh-local");
        qrsh_local();
}else{
        print STDERR "\nusage: $0 {start|stop}\n\n";
        exit(-1);
}

sub start_proc_args()
{

  # we currently place the LAM host file in the TMPDIR that SGE uses.
  # if we place it elsewhere we need to clean it up
  $lamhostsfile="$ENV{TMPDIR}/lamhostfile";

  # flags and options for lamboot (-x, -s and -np may be useful in some envs)
  @lambootargs=("-nn","-ssi","boot","rsh","-ssi","boot_rsh_agent",
                "$SGE_ROOT/lam/sge-lam qrsh-remote","-c","$sgelamconf");
  if($verbose){ push(@lambootargs,"-v"); }
  if($debug){ push(@lambootargs,"-d"); }
  push(@lambootargs,"$lamhostsfile");
  # the host file is already the last element of @lambootargs
  debug_print("LAMBOOT ARGS: @lambootargs");

  ### Need to convert the SGE hostfile to a LAM hostfile format
  # open and read the PE hostfile
  open(SGEHOSTFILE,"< $ENV{PE_HOSTFILE}")
    or die "cannot open PE_HOSTFILE '$ENV{PE_HOSTFILE}': $!";
  # convert to LAM bhost file format
  @lamhostslist=();
  while(<SGEHOSTFILE>){
        ($host,$ncpu,$junk)=split(/\s+/);
        push( @lamhostslist,"$host cpu=$ncpu");
  }
  close(SGEHOSTFILE);

  debug_print("LAMHOSTSLIST: @lamhostslist");
  # create the new lam bhost file
  open(LAMHOSTFILE,"> $lamhostsfile");
  print LAMHOSTFILE join("\n",@lamhostslist);
  print LAMHOSTFILE "\n";
  close(LAMHOSTFILE);

  # log before closing the debug handle, otherwise the message is lost
  debug_print("Exec Lamboot: $lamboot @lambootargs");
  if($debug){ close(SGEDEBUG); }
  exec($lamboot,@lambootargs);
}

sub stop_proc_args(){

  if($verbose){ push(@lamhaltargs,"-v"); }
  if($debug){ push(@lamhaltargs,"-d"); }

  # log before closing the debug handle, otherwise the message is lost
  debug_print("Exec Lamhalt: $lamhalt @lamhaltargs");
  if($debug){ close(SGEDEBUG); }
  exec($lamhalt,@lamhaltargs);
}

sub qrsh_remote()
{

  @myargs=("-inherit","-nostdin","-V",@ARGV);

  debug_print("QRSH REMOTE CONFIG: @myargs");
  # log before closing the debug handle, otherwise the message is lost
  debug_print("Exec qrsh-remote: $qrsh @myargs");
  if($debug){ close(SGEDEBUG); }
  exec($qrsh,@myargs);
}

sub qrsh_local()
{
  # we are running a local qrsh to add the lamd into the current job
  # on the local node using the LAM boot schema

  # get the hostname to pass to qrsh
  $hostname=`/bin/hostname`;
  chomp($hostname);

  # tell SGE to add this command into the JOB_ID job by using qrsh -inherit
  # the hostname is not passed in this case in ARGV by lamboot
  @myargs=("-inherit","-nostdin","-V","$hostname",@ARGV);

  debug_print("QRSH LOCAL CONFIG: @myargs");
  # log before closing the debug handle, otherwise the message is lost
  debug_print("Exec qrsh-local: $qrsh @myargs");
  if($debug){ close(SGEDEBUG); }
  exec($qrsh,@myargs);
}

sub debug_print()
{
  if($debug){
    print SGEDEBUG "SGE-LAM DEBUG: @_\n";
  }
}
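
As a side note on the host file conversion in start_proc_args(): the loop
there writes a "host cpu=" line for any blank or short line in the PE
hostfile. Below is a minimal standalone sketch of the same conversion with a
guard for such lines. It assumes the PE hostfile has the host name in the
first column and the slot count in the second; the function name
sge_to_lam_hosts and the sample host names are made up for illustration and
are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert SGE PE hostfile lines (host name, slot count, then other columns)
# into LAM boot schema lines of the form "host cpu=N".
# Blank lines and lines without a numeric slot count are skipped.
sub sge_to_lam_hosts {
    my @lines = @_;
    my @out;
    for my $line (@lines) {
        # split ' ' ignores leading whitespace and the trailing newline
        my ($host, $ncpu) = split ' ', $line;
        next unless defined $ncpu && $ncpu =~ /^\d+$/;
        push @out, "$host cpu=$ncpu";
    }
    return @out;
}

# Hypothetical sample input; these hosts and queue names are illustrative only.
my @sample = (
    'node1.example.org 2 all.q@node1.example.org UNDEFINED',
    'node2.example.org 4 all.q@node2.example.org UNDEFINED',
    '',
);
print "$_\n" for sge_to_lam_hosts(@sample);
```

With the sample input this prints one "cpu=" line per host and drops the
empty line, so the resulting file contains only entries lamboot can use.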

On Tue, 10 Aug 2004, Jeff Squyres wrote:

> On Aug 10, 2004, at 4:01 PM, C.L. Lai [ALAN] wrote:
>
> > I am trying to get that script working on SGE 6 + LAM 7
> > However, I got some errors, I wonder if it's the script's problem or
> > it's my setting.
>
> I can't answer the question about the SGE stuff, but the error you're
> getting is quite odd and may not be related. Let's investigate that
> first, and if that doesn't work out, we'll ping the SGE guys and see
> what they say. :-)
>
> > Here is my error
> > %cat sgedebug.528.7715
> > SGE-LAM DEBUG: LAMHOME = /usr
> [snipped]
> > n0<7715> ssi:boot:rsh: starting lamd on (jardine2.math.uwo.ca)
> > n0<7715> ssi:boot:rsh: starting on n0 (jardine2.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-528-0 -I -H 129.100.75.78 -P 36671 -n 0 -o 0
> > n0<7715> ssi:boot:rsh: launching locally
> > n0<7715> ssi:boot:rsh: successfully launched on n0 (jardine2.math.uwo.ca)
> > n0<7715> ssi:boot:base:server: expecting connection from finite list
> > n0<7715> ssi:boot:base:server: got connection from 0.0.0.0
>
> What's happening here is that LAM forked off the lamd locally, but then
> the lamd didn't call back to lamboot and say "I'm ok!". lamboot
> eventually got tired of waiting and gave up.
>
> So the question is: why didn't the lamd call back to lamboot?
>
> The most common reason for this is firewalling software -- LAM uses
> random TCP and UDP ports assigned by the OS. Hence, you either need to
> disable firewalling software or allow TCP and UDP traffic on random
> ports from your trusted set of nodes (including the localhost).
>
> You might want to look in the syslogs -- "lamboot -d" causes the lamd
> to output some information to the syslogs; there may be information in
> there about why the lamd died before connecting back to lamboot. Also
> look for a corefile indicating that the lamd aborted improperly.
>
> Let me know what you find.
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
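
Regarding the point in the quoted reply that LAM uses random TCP ports
assigned by the OS: one rough way to check whether such connections get
through on a node is to listen on an OS-assigned port and connect back to it.
The sketch below is only an approximation of what lamboot and the lamd do,
not LAM's actual code; the function name can_connect_back is hypothetical,
and it tests TCP only (not UDP) and only connections from the machine to
itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;
use Sys::Hostname;

# Listen on an OS-assigned TCP port bound to $addr, then connect to it.
# Returns 1 on success, 0 if either the listen or the connect step fails.
sub can_connect_back {
    my ($addr) = @_;
    my $server = IO::Socket::INET->new(
        LocalAddr => $addr,
        LocalPort => 0,          # 0 asks the OS to pick a free port
        Proto     => 'tcp',
        Listen    => 1,
    ) or return 0;
    my $client = IO::Socket::INET->new(
        PeerAddr => $addr,
        PeerPort => $server->sockport,
        Proto    => 'tcp',
        Timeout  => 5,
    ) or return 0;
    close($client);
    close($server);
    return 1;
}

# Check the loopback address and the address this host name resolves to.
for my $addr ('127.0.0.1', hostname()) {
    printf "%-30s %s\n", $addr, can_connect_back($addr) ? 'ok' : 'FAILED';
}
```

A FAILED line for the host name would suggest that either the name does not
resolve to a local address or something is filtering the connection; an "ok"
line does not rule out problems between different nodes.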