Howdy, I'm using LAM 7.1.1 and Grid Engine 6.0u4 on a
Rocks 4.0.0 cluster with 128 nodes.
I set up the parallel environment for lam_loose_rsh
using the instructions at:
http://gridengine.sunsource.net/howto/lam-integration/lam-integration.html
I configured the cluster to distribute the
/opt/gridengine/lam_loose_rsh directory to each of the compute nodes, so they
all have a copy of startlam.sh and stoplam.sh (both of which are executable by
everyone).
I'm running a simple hello world test in which each process
prints the name of the compute node it is running on. The output
correctly shows the name of each node, so the job itself looks like it's working.
However, if I check the job's head node for processes
under my name, I see:
/opt/lam/intel/bin/lamd -H 172.20.5.166 -P 42681 -n 0 -o 0 -sessionsuffix sge-26884-undefined
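A minimal sketch of one way to run that check from a script. The helper name find_stray_lamd is made up for illustration, and the ps options shown in the comment assume a Linux procps-style ps:

```shell
# find_stray_lamd: read "PID ARGS..." lines on stdin and print only those
# whose command (second field) is lamd, with or without a leading path.
# The function name is hypothetical, not part of LAM or Grid Engine.
find_stray_lamd() {
    awk '$2 ~ /(^|\/)lamd$/ { print }'
}

# Typical use on a compute node (assumes procps ps):
#   ps -u "$(id -un)" -o pid= -o args= | find_stray_lamd
```

Any line it prints is a lamd still owned by the current user on that node.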
I added the -v and -d switches to lamhalt in stoplam.sh,
and here's what I see in the job log:
/opt/gridengine/default/spool/compute-2-39/active_jobs/26890.1/pe_hostfile
compute-2-39.local
compute-4-104.local
compute-2-64.local
compute-4-98.local
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
Using lamhalt:
/opt/lam/intel/bin/lamhalt on node compute-2-39.local
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
Shutting down LAM
hreq: sending HALT_PING to n1 (compute-4-104.local)
hreq: sending HALT_PING to n2 (compute-2-64.local)
hreq: sending HALT_PING to n3 (compute-4-98.local)
hreq: waiting for HALT ACKs from remote LAM daemons
hreq: received HALT_ACK from n1 (compute-4-104.local)
hreq: sending HALT_DIE to n1 (compute-4-104.local)
hreq: received HALT_ACK from n2 (compute-2-64.local)
hreq: sending HALT_DIE to n2 (compute-2-64.local)
hreq: received HALT_ACK from n3 (compute-4-98.local)
hreq: sending HALT_DIE to n3 (compute-4-98.local)
hreq: sending HALT_PING to n0 (compute-2-39.local)
hreq: received HALT_ACK from n0 (compute-2-39.local)
hreq: sending HALT_DIE to n0 (compute-2-39.local)
lamhalt: local LAM daemon halted
LAM halted
mkdir: No such file or directory
I'm not sure where that "mkdir: No such file or directory" message
is coming from. However, if I ssh to the job's head compute node and kill the
lamd process, another "mkdir: No such file or directory" line gets written to
the job log file.
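To see whether a lamd is left behind on the other nodes of a job as well, the same check could be repeated for every host in the job's pe_hostfile. This is only a sketch: pe_hosts is a made-up helper name, and it assumes the usual Grid Engine pe_hostfile layout with the host name in the first column.

```shell
# pe_hosts FILE: print each distinct host name (first column) of a
# Grid Engine pe_hostfile, in order of first appearance.
# The function name is hypothetical; blank lines are skipped.
pe_hosts() {
    awk 'NF > 0 && !seen[$1]++ { print $1 }' "$1"
}

# Possible use while the job is still active (assumes password-less ssh
# to the compute nodes and a procps-style ps there):
#   for host in $(pe_hosts "$PE_HOSTFILE"); do
#       echo "== $host"
#       ssh "$host" "ps -u $(id -un) -o pid= -o args= | grep '[l]amd'"
#   done
```

A host with no output under its "==" line has no lamd running for that user.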