LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2003-06-05 08:15:32


Were the LAM session directories also left hanging around in /tmp? If
so, then it is likely that the lamds on the non-root nodes never got their
death warning.

OpenPBS/PBS Pro are *supposed* to send a SIGTERM, wait a couple of seconds,
then send a SIGKILL. PBS Pro does this pretty much every time. OpenPBS
seems to have certain times when it just doesn't bother to do anything.
We use the SIGTERM as our signal to clean up all the shared memory
segments, kill all our processes, and so on. If we don't get it, then we
are basically out of luck as far as cleaning up goes.
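
To make the SIGTERM-then-clean-up pattern concrete, here is a rough
shell-level sketch of the same idea. It is not LAM's actual code, and I
have not tested it under PBS. The function name run_with_cleanup and the
variable JOB_CLEANUP are made up for illustration. The cleanup command
has to finish within the couple of seconds before the SIGKILL arrives.

```shell
#!/bin/sh
# Sketch only: run a job and, if SIGTERM arrives first, run a cleanup command.
# JOB_CLEANUP is a hypothetical variable (not defined by LAM or PBS) holding
# the cleanup command line; it defaults to a no-op.

run_with_cleanup() {
    job_pid=
    # On SIGTERM: stop the job if it has started, then run the cleanup command.
    trap '[ -n "$job_pid" ] && kill "$job_pid" 2>/dev/null; ${JOB_CLEANUP:-true}' TERM

    # Start the job in the background; a foreground job would delay the
    # trap until the job had finished.
    "$@" &
    job_pid=$!

    # wait returns early (status above 128) when a trapped signal arrives.
    wait "$job_pid"
    status=$?

    trap - TERM
    return "$status"
}
```

In a Bourne-shell batch script this might be used by setting
JOB_CLEANUP=lamhalt and then calling "run_with_cleanup mpirun C ./code".
Note that a background job in a non-interactive sh reads from /dev/null
by default, so an input file is better opened inside a small wrapper
script than redirected on that line.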

If the root node is getting the signal and the other nodes aren't, it may
be possible to do something with that. I'll have to look into the problem
some more.
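
One direction, given that the root node does see the signal, would be to
fan the cleanup out from there. The sketch below only shows the fan-out
step. The function names and the REMOTE_SHELL variable are made up for
illustration, ssh is an assumption about your site, and whether a remote
command started this way survives the qdel is something I have not
checked.

```shell
#!/bin/sh
# Sketch only: run one command on every distinct node of a PBS job.
# REMOTE_SHELL is a hypothetical variable (not defined by LAM or PBS)
# naming the remote-execution program; ssh is assumed as the default.

# Print each host in the node file once. PBS lists a host once per
# allocated processor, so ppn=2 yields two lines per node.
unique_nodes() {
    sort -u "$1"
}

# Run the given command on each distinct node in the node file.
on_each_node() {
    nodefile=$1
    shift
    for node in $(unique_nodes "$nodefile"); do
        # stdin comes from /dev/null so the remote shell cannot
        # swallow the batch script's own input.
        ${REMOTE_SHELL:-ssh} "$node" "$@" </dev/null
    done
}
```

For example, "on_each_node $PBS_NODEFILE ipcs -m" would show what is
left on each node before you decide what to remove.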

Brian

On Thu, 5 Jun 2003, Robin Humble wrote:

>
> I configured lam-7.0b13 with:
> configure --with-tm=/opt/pbs --with-fc=ifc --with-boot=tm --with-rpi=usysv
> which (as I understand it) makes tm the default boot method, and usysv
> the default rpi. Suitable for a dual-Xeon linux cluster.
>
> I then qsub a script like this to OpenPBS (openpbs-oscar-2.3.16-7 on an
> OSCAR 2.0 cluster):
>
> -----------
> #!/bin/csh -f
> #PBS -l nodes=8:ppn=2
> #PBS -q workq
> #PBS -r n
>
> lamboot
> mpirun -O C ./code < input
> lamhalt
> -----------
> (yeah, I know I probably don't need the -O to mpirun)
>
> the problem comes if I qdel this job whilst it's running - sure all the
> lamd's die (yay!), but they leave shared mem segments around on all nodes
> _except_ the root node. eg. ipcs shows:
>
> ------ Shared Memory Segments --------
> key        shmid      owner      perms      bytes      nattch     status
> 0x00000000 819200     rjh        600        16810368   0
>
> ------ Semaphore Arrays --------
> key        semid      owner      perms      nsems      status
> 0x00000000 819200     rjh        600        3
>
> ------ Message Queues --------
> key        msqid      owner      perms      used-bytes   messages
>
>
> Is there any way to make lamd tidy up its shared memory before it exits?
> It seems to be doing it on the root node but not the rest.
>
> Or alternatively (more OpenPBS/OSCAR related) is there a way for the
> batch script to trap a signal from qdel and to run a lamhalt on all nodes?
> eg. an old fashioned 'wipe -b $PBS_NODEFILE' or similar?
>
> This has to work in a production environment so we can't have piles of
> orphaned shared memory areas being left around as eventually jobs refuse to
> start :-/
>
> Great work on lam-7 BTW. runtime switchable stuff is ace :-)
>
> cheers,
> robin

-- 
  Brian Barrett
  LAM/MPI developer and all around nice guy
  Have a LAM/MPI day: http://www.lam-mpi.org/