A few things:
- Thanks for the super-comprehensive mail! It helped explain your
situation a lot.
- We actually didn't write the sge-lam script; it was written and is
maintained by the SGE folks. They might be the best people to talk to
about this. As such, I've CC'ed Chris Duncan of the SGE team on this mail
-- he's been my main contact with them whenever I've had SGE questions,
and he's been more than patient with me. :-)
- All that being said, it was my understanding (we unfortunately don't
have access to any machines with SGE in order to do any testing) that the
sge-lam stuff worked just fine with the 6.5 series.
- The SGE team has written new scripts for the LAM 7.0.x series, but they
haven't been published on their site yet. They also depend on a fix that
will be in the upcoming 7.0.5 release (or you can easily patch an existing
7.0.x installation).
- Even with the patch, the new SGE scripts don't seem to be working
properly yet. There's a very recent thread on lam-devel that is
discussing this:
http://www.lam-mpi.org/MailArchives/lam-devel/msg00100.php
- This is a side note to the topic (especially in light of the new SGE
scripts not working 100% with 7.0.x yet), but is there any chance that you
can upgrade to LAM 7.0.x? 6.5.6 is *VERY* old (about 2.5 years). There have
been a *lot* of bug fixes and improvements since then (indeed, the entire
6.5 line is now deprecated).
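- While you wait to hear from the SGE folks, one cheap check, going by the
first item in the lamboot error text quoted below ("hboot" itself could not
be found), is to confirm from inside an SGE job which LAM executables are
actually visible on the $PATH that the job gets. Here is a minimal sketch in
plain sh. The function names find_in_path and report_lam_tools are made up
for illustration (they are not part of LAM or SGE), and the tool list is
simply the one you mention needing from LAMBINDIR:

```shell
#!/bin/sh
# Hypothetical helpers, not part of LAM or SGE.

# find_in_path TOOL DIRLIST
# Print the first executable named TOOL found in the colon-separated
# DIRLIST; return 1 if none is found.
find_in_path() {
    old_ifs=$IFS
    IFS=:
    for dir in $2; do
        # An empty PATH element means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -x "$dir/$1" ] && [ ! -d "$dir/$1" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$1"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# report_lam_tools DIRLIST TOOL...
# Print one line per tool; return 1 if any tool is missing.
report_lam_tools() {
    search_path=$1
    shift
    status=0
    for tool in "$@"; do
        if location=$(find_in_path "$tool" "$search_path"); then
            printf 'found   %s -> %s\n' "$tool" "$location"
        else
            printf 'MISSING %s\n' "$tool"
            status=1
        fi
    done
    return $status
}

# Example: check the PATH that this shell (or SGE job) actually sees.
report_lam_tools "$PATH" hboot lamd mpirun sge-lam qrsh-lam ||
    echo "some LAM tools are not on PATH"
```

Running this once from your login shell and once as the body of a small
job submitted with qsub should show whether the two environments differ.
If they do, that would at least narrow down where to look.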
On Mon, 8 Mar 2004, Kevin Wells wrote:
> Hello,
>
> I have a problem getting LAM MPI to work with Sun Grid Engine.
>
> We have been successfully running LAM version 6.5.6/MPI on a Rocks Linux
> cluster for several months, on a setup with 10 nodes including the front-end.
>
> I now want to run LAM-MPI together with Sun Grid Engine on the cluster with
> the front-end acting as master-server and all the nodes as execution hosts.
> To do this I've just installed Grid Engine version 5.3p5.
>
> However, we have the problem that hboot will not execute via the Grid Engine,
> and I'm unable to understand why this is.
>
> I have set up the LAM and Grid Engine environment as described in the readme
> for the Sun Parallel Environment Integration Package for SGE with LAM.
>
> Grid Engine creates the machine files for itself and for LAM, but it doesn't
> start lam-mpi. LAM produces the following messages, which are repeated in an
> endless cycle for all the nodes in the cluster:
>
> LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
>
> lamboot: boot schema file: /tmp/344.1.compute-0-3_smalljobs/lamhostfile
> lamboot: opening hostfile /tmp/344.1.compute-0-3_smalljobs/lamhostfile
> lamboot: found the following hosts:
> lamboot: n0 compute-0-3
> lamboot: resolved hosts:
> lamboot: n0 compute-0-3 --> 192.168.150.250
> lamboot: found 1 host node(s)
> lamboot: origin node is 0 (compute-0-3)
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -s -I " -H
> 192.168.150.250 -P 49096 -n 0 -o 0 ""
> lamboot did NOT complete successfully
>
> LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
>
>
> LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
>
> lamboot: boot schema file: /tmp/344.1.compute-0-4_smalljobs/lamhostfile
> lamboot: opening hostfile /tmp/344.1.compute-0-4_smalljobs/lamhostfile
> lamboot: found the following hosts:
> lamboot: n0 compute-0-4
> lamboot: resolved hosts:
> lamboot: n0 compute-0-4 --> 192.168.150.249
> lamboot: found 1 host node(s)
> lamboot: origin node is 0 (compute-0-4)
> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -s -I " -H
> 192.168.150.249 -P 48405 -n 0 -o 0 ""
>
>
>
> LAM failed to fork/exec a process to launch the local LAM daemon
> (lamd). LAM first launches hboot to launch the local LAM daemon, so
> several things could have gone wrong:
>
> - "hboot" itself could not be found (check your $PATH)
> - "hboot" failed for some reason (consult previous error messages,
> if any)
> - Too many processes exist and Unix could not fork another
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> It seems that there is no lamd running on this host, which indicates
> that the LAM/MPI runtime environment is not operating. The LAM/MPI
> runtime environment is necessary for the "lamhalt" command.
>
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment. See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
>
>
> Obviously the LAM environment is not being set up, but I can't understand
> why.
>
>
> I've defined a queue on each node, with no calendar entry. A Parallel
> Environment has also been created which contains all the queues and the user
> list, together with the sge-lam script provided by Sun as start and stop
> arguments.
>
> I start the Grid Engine job using the following command (I've also tried it
> with the QMON GUI, with the same result):
>
> qsub -v -V -pe cluster_sgelam 4 /home/scripts/sge_mpi/start_mpi_p2002_sge
>
>
>
> The script start_mpi_p2002_sge simply contains the binary that Grid Engine
> and lam-mpi are to run.
>
>
> /opt/lam-eth-gnu/6.5.6/bin/mpirun -np $NP -O -nger -c2c -v -x
> PAMHOME=$PAMHOME $EXE $JOB > $JOB.out
>
> #/opt/lam-eth-gnu/6.5.6/bin/mpirun -np $NSLOTS -O -nger -c2c -v -x
> PAMHOME=$PAMHOME $EXE $JOB > $JOB.out
>
>
> The Grid Engine path file settings.sh is a file from Sun which the user runs
> at the start of the session; it simply contains the paths:
>
> LAMBINDIR=/opt/lam-eth-gnu/6.5.6/bin
> SGEBINDIR=$SGE_ROOT/bin/linux
> PATH=$SGE_ROOT/bin/$ARCH:$LAMBINDIR:/opt/lam-eth-gnu/6.5.6/etc:$PATH; export
> PATH
>
> (LAMBINDIR is necessary for qrsh-lam, sge-lam, as well as hboot, lamd and
> mpirun amongst other things).
>
>
> The configuration for the queues looks like this:
>
> root> qconf -sq compute-0-0_smalljobs
> qname compute-0-0_smalljobs
> hostname compute-0-0
> seq_no 1
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors 2
> qtype BATCH INTERACTIVE PARALLEL
> rerun FALSE
> slots 20
> tmpdir /tmp
> shell /bin/bash
> shell_start_mode NONE
> prolog NONE
> epilog NONE
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists listofusers
> xuser_lists NONE
> subordinate_list NONE
> complex_list NONE
> complex_values NONE
> calendar NONE
> initial_state enabled
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
>
>
>
> This is the configuration for the PE (Parallel Environment):
>
> simcompute-0:/home/testuser
> root> qconf -sp cluster_sgelam
> pe_name cluster_sgelam
> queue_list compute-0-0_smalljobs compute-0-1_smalljobs
> compute-0-2_smalljobs compute-0-3_smalljobs compute-0-4_smalljobs
> compute-0-5_smalljobs
> compute-0-6_smalljobs compute-0-7_smalljobs compute-0-8_smalljobs
> slots 6
> user_lists listofusers
> xuser_lists NONE
> start_proc_args /opt/lam-eth-gnu/6.5.6/bin/sge-lam start
> stop_proc_args /opt/lam-eth-gnu/6.5.6/bin/sge-lam stop
> allocation_rule $fill_up
> control_slaves FALSE
> job_is_first_task TRUE
>
>
> I've tried different allocation rules - $round_robin and $pe_slots - but
> there is no difference.
>
>
> The file lam-conf.lam simply contains the following:
>
> /opt/lam-eth-gnu/6.5.6/bin/qrsh-lam local /opt/lam-eth-gnu/6.5.6/bin/lamd
> $inet_topo $debug
>
>
> qrsh-lam is installed. I've also tried "qrsh-lam remote".
>
> As sge-lam and qrsh-lam are Perl scripts, I have made sure Perl is
> accessible on the PATH.
>
> The start command qsub is run as a normal user, i.e. not root. This user is
> defined in the "listofusers" list, which is included in the Parallel
> Environment config.
>
> If anyone has an idea from this as to why LAM is not working with Grid
> Engine, or what we might be doing wrong here, I'd be very grateful for your
> help.
>
> Thanks in advance!
> Kevin
>
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/