Hello,
I have a problem getting LAM MPI to work with Sun Grid Engine.
We have been running LAM 6.5.6/MPI successfully on a Rocks Linux cluster
for several months, on a setup of 10 nodes including the front-end.
I now want to run LAM-MPI together with Sun Grid Engine on this cluster, with
the front-end acting as master server and all the nodes as execution hosts.
To do this I've just installed Grid Engine version 5.3p5.
However, hboot will not execute via Grid Engine, and I am unable to understand
why.
I have set up the LAM and Grid Engine environment as described in the README
for the Sun Parallel Environment Integration Package for SGE with LAM.
Grid Engine creates the machine files for itself and for LAM, but it does not
start LAM-MPI. LAM produces the following messages, which are repeated in an
endless cycle for all the nodes in the cluster:
LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
lamboot: boot schema file: /tmp/344.1.compute-0-3_smalljobs/lamhostfile
lamboot: opening hostfile /tmp/344.1.compute-0-3_smalljobs/lamhostfile
lamboot: found the following hosts:
lamboot: n0 compute-0-3
lamboot: resolved hosts:
lamboot: n0 compute-0-3 --> 192.168.150.250
lamboot: found 1 host node(s)
lamboot: origin node is 0 (compute-0-3)
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -s -I " -H 192.168.150.250 -P 49096 -n 0 -o 0 ""
lamboot did NOT complete successfully
LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
lamboot: boot schema file: /tmp/344.1.compute-0-4_smalljobs/lamhostfile
lamboot: opening hostfile /tmp/344.1.compute-0-4_smalljobs/lamhostfile
lamboot: found the following hosts:
lamboot: n0 compute-0-4
lamboot: resolved hosts:
lamboot: n0 compute-0-4 --> 192.168.150.249
lamboot: found 1 host node(s)
lamboot: origin node is 0 (compute-0-4)
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -s -I " -H 192.168.150.249 -P 48405 -n 0 -o 0 ""
LAM failed to fork/exec a process to launch the local LAM daemon
(lamd). LAM first launches hboot to launch the local LAM daemon, so
several things could have gone wrong:
- "hboot" itself could not be found (check your $PATH)
- "hboot" failed for some reason (consult previous error messages,
if any)
- Too many processes exist and Unix could not
fork
another
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that there is no lamd running on this host, which indicates
that the LAM/MPI runtime environment is not operating. The LAM/MPI
runtime environment is necessary for the "lamhalt" command.
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across
multiple
machines.
-----------------------------------------------------------------------------
Obviously the LAM environment is not being set up, but I can't understand
why.
I've defined a queue on each node, with no calendar entry. A Parallel
Environment has also been created which contains all the queues and the user
list, together with the sge-lam script provided by Sun as the start and stop
procedure arguments (start_proc_args / stop_proc_args).
I start the Grid Engine job using the following command (I've also tried it
via the QMON GUI, with the same result):
qsub -v -V -pe cluster_sgelam 4 /home/scripts/sge_mpi/start_mpi_p2002_sge
The script start_mpi_p2002_sge simply contains the mpirun call that launches
the binary under Grid Engine and LAM-MPI:
/opt/lam-eth-gnu/6.5.6/bin/mpirun -np $NP -O -nger -c2c -v -x PAMHOME=$PAMHOME $EXE $JOB > $JOB.out
#/opt/lam-eth-gnu/6.5.6/bin/mpirun -np $NSLOTS -O -nger -c2c -v -x PAMHOME=$PAMHOME $EXE $JOB > $JOB.out
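To make the script easier to follow, here is roughly what it boils down to. This is only a sketch, not the actual file; the EXE, JOB and PAMHOME assignments below are placeholders for values that are set elsewhere in our environment:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# placeholders -- in the real script these values come from our setup
EXE=/path/to/binary            # hypothetical path to the executable
JOB=testcase                   # hypothetical job / input name
PAMHOME=/path/to/pam           # hypothetical
NP=$NSLOTS                     # number of slots granted by Grid Engine
/opt/lam-eth-gnu/6.5.6/bin/mpirun -np $NP -O -nger -c2c -v -x PAMHOME=$PAMHOME $EXE $JOB > $JOB.out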
The Grid Engine path file settings.sh is a file from Sun which the user runs
at the start of the session; it simply contains the paths:
LAMBINDIR=/opt/lam-eth-gnu/6.5.6/bin
SGEBINDIR=$SGE_ROOT/bin/linux
PATH=$SGE_ROOT/bin/$ARCH:$LAMBINDIR:/opt/lam-eth-gnu/6.5.6/etc:$PATH; export PATH
(LAMBINDIR is needed so that qrsh-lam and sge-lam, as well as hboot, lamd and
mpirun, among other things, can be found on the PATH.)
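Since the LAM message above explicitly suggests that hboot might not be found in $PATH, one sanity check I can think of (my own idea, not something from the Sun README) is to submit a trivial job into the same PE and print what the job shell actually sees:

cat > ~/check_env.sh <<'EOF'
#!/bin/bash
echo "PATH=$PATH"
which hboot lamboot sge-lam qrsh-lam perl
EOF
chmod +x ~/check_env.sh       # just in case, since shell_start_mode is NONE
qsub -pe cluster_sgelam 1 ~/check_env.sh
# output then appears in ~/check_env.sh.o<jobid>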
The configuration for the queues looks like this:
root> qconf -sq compute-0-0_smalljobs
qname compute-0-0_smalljobs
hostname compute-0-0
seq_no 1
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors 2
qtype BATCH INTERACTIVE PARALLEL
rerun FALSE
slots 20
tmpdir /tmp
shell /bin/bash
shell_start_mode NONE
prolog NONE
epilog NONE
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists listofusers
xuser_lists NONE
subordinate_list NONE
complex_list NONE
complex_values NONE
calendar NONE
initial_state enabled
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
This is the configuration for the PE (Parallel Environment):
simcompute-0:/home/testuser root> qconf -sp cluster_sgelam
pe_name cluster_sgelam
queue_list compute-0-0_smalljobs compute-0-1_smalljobs
compute-0-2_smalljobs compute-0-3_smalljobs compute-0-4_smalljobs
compute-0-5_smalljobs
compute-0-6_smalljobs compute-0-7_smalljobs compute-0-8_smalljobs
slots 6
user_lists listofusers
xuser_lists NONE
start_proc_args /opt/lam-eth-gnu/6.5.6/bin/sge-lam start
stop_proc_args /opt/lam-eth-gnu/6.5.6/bin/sge-lam stop
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task TRUE
I've tried different allocation rules ($round_robin and $pe_slots), but it
makes no difference.
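For completeness, I change the rule along these lines (qconf -mp opens the PE definition in an editor) and then re-submit the job:

qconf -mp cluster_sgelam      # edit allocation_rule, save, exit, re-submit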
The file lam-conf.lam simply contains the following:
/opt/lam-eth-gnu/6.5.6/bin/qrsh-lam local /opt/lam-eth-gnu/6.5.6/bin/lamd $inet_topo $debug
qrsh-lam is installed. I've also tried "qrsh-lam remote".
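One thing I can still try (my own debugging idea, not from the integration README) is to open an interactive session on an execution host through Grid Engine and run hboot by hand with the same flags that appear in the lamboot output above, to see its error output directly:

qrsh                              # interactive shell on an exec host (no PE, to keep the test simple)
. /path/to/settings.sh            # hypothetical path to the Sun settings.sh described above
which hboot qrsh-lam perl         # everything lam-conf.lam relies on
hboot -t -c lam-conf.lam -d       # same flags as in the lamboot output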
Since sge-lam and qrsh-lam are Perl scripts, I have made sure that Perl is
accessible via the PATH.
The qsub command is run as a normal user, i.e. not as root. This user is
included in the "listofusers" access list referenced in the Parallel
Environment configuration.
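The access list membership can be checked with qconf (listofusers is the list named above):

qconf -su listofusers         # shows the members of the access list the queues and the PE refer to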
If anyone has an idea from this as to why LAM is not working with Grid
Engine, or what we might be doing wrong here, I'd be very grateful for your
help.
Thanks in advance!
Kevin