
LAM/MPI General User's Mailing List Archives


From: Robin Humble (rjh_at_[hidden])
Date: 2003-06-05 01:42:12


I configured lam-7.0b13 with:
  configure --with-tm=/opt/pbs --with-fc=ifc --with-boot=tm --with-rpi=usysv
which (as I understand it) makes tm the default boot method and usysv
the default rpi, suitable for a dual-Xeon Linux cluster.
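
As a sanity check, laminfo should list which SSI modules got compiled in,
and (if I'm reading the lam-7 docs right) the defaults can also be
overridden per run with -ssi, e.g.:

-----------
# check which boot/rpi modules were built in
laminfo

# pick the boot and rpi modules explicitly at runtime
lamboot -ssi boot tm
mpirun -ssi rpi usysv C ./code < input
-----------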

I then qsub a script like this to OpenPBS (openpbs-oscar-2.3.16-7 on an
OSCAR 2.0 cluster):

-----------
#!/bin/csh -f
#PBS -l nodes=8:ppn=2
#PBS -q workq
#PBS -r n

lamboot
mpirun -O C ./code < input
lamhalt
-----------
(yeah, I know I probably don't need the -O to mpirun)

The problem comes if I qdel this job whilst it's running: sure, all the
lamds die (yay!), but they leave shared memory segments around on all nodes
_except_ the root node. E.g. ipcs shows:

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 819200     rjh        600        16810368   0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems      status
0x00000000 819200     rjh        600        3

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
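
For now I can clean these up by hand on each node with something like the
following (rough sketch only; note that newer util-linux wants 'ipcrm -m id'
and 'ipcrm -s id' instead of the older 'ipcrm shm id' form used here):

-----------
#!/bin/sh
# stopgap: remove my orphaned IPC objects on this node:
# shm segments with no attached processes, plus my semaphore arrays
# (dangerous if another of my jobs is still running on the node!)
for id in `ipcs -m | awk '$3 == "rjh" && $6 == 0 { print $2 }'`; do
    ipcrm shm $id
done
for id in `ipcs -s | awk '$3 == "rjh" { print $2 }'`; do
    ipcrm sem $id
done
-----------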

Is there any way to make lamd tidy up its shared memory before it exits?
It seems to do this on the root node but not on the rest.

Or alternatively (more OpenPBS/OSCAR related): is there a way for the
batch script to trap the signal from qdel and run a lamhalt on all nodes?
E.g. an old-fashioned 'wipe -b $PBS_NODEFILE' or similar?
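
Something like this sketch is what I'm imagining (untested, and assuming
OpenPBS delivers a SIGTERM to the job before killing it; I've switched to
/bin/sh since csh can't trap arbitrary signals):

-----------
#!/bin/sh
#PBS -l nodes=8:ppn=2
#PBS -q workq
#PBS -r n

# assumed: qdel sends SIGTERM first, so tear LAM down from the trap
cleanup() {
    wipe -b $PBS_NODEFILE    # 'lamwipe' in lam-7 terminology
    exit 1
}
trap cleanup TERM

lamboot
# background + wait so the trap fires as soon as the signal arrives,
# instead of only after mpirun returns
mpirun -O C ./code < input &
wait $!
lamhalt
-----------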

This has to work in a production environment, so we can't have piles of
orphaned shared memory areas left around; eventually jobs refuse to
start :-/

Great work on lam-7 BTW. The runtime-switchable stuff is ace :-)

cheers,
robin