Hi,
I am using LAM + TORQUE + BLCR, and I failed to restart the LAM task under PBS (TORQUE).
I tested it like this:
(1) Use TORQUE to submit a task. The script is:
#PBS -S /bin/bash
#PBS -N Linpack
#PBS -l nodes=2:ppn=1
lamboot
lamnodes
echo "==================="
cd /home/lxz/src/HPL/hpl/bin/lxz
mpirun -np 2 xhpl
echo "==================="
lamhalt
I use "qsub linpack.sh" to submit it to torque.
(2) checkpoint it manually:
cr_checkpoint *** (PID of mpirun)
After it runs, I can find the checkpoint files (3 files here).
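For step (2), the PID can be looked up instead of typed by hand. This is only a minimal sketch: it assumes pgrep is installed on the node and that I have a single mpirun running there. The helper name find_mpirun_pid is made up for illustration.

```shell
# Hypothetical helper: print the PID of the oldest process named exactly
# "mpirun" that belongs to the current user. Returns non-zero if none is found.
find_mpirun_pid() {
    pgrep -o -x -u "$(id -un)" mpirun
}

# Usage (on the node where mpirun is running):
#   pid=$(find_mpirun_pid) && cr_checkpoint "$pid"
```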
(3) kill the task:
killall xhpl
(4) restart the task.
The script:
#PBS -S /bin/bash
#PBS -N Mm5
#PBS -l nodes=2:ppn=1
lamboot
lamnodes
echo "==================="
cd /home/lxz/tmp
cr_restart context.11468
echo "==================="
lamhalt
I type "qsub restartlinpack.sh" on the command line.
But the task is not restarted correctly. I traced the source code of BLCR and found the reason: "cr_restore_all_files" fails because it cannot find the file "/usr/spool/PBS/spool/69.ganode00.OU", and so the restart of the task fails.
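The missing file can be confirmed from the node before calling cr_restart. A small sketch follows; the function name check_saved_file is made up for illustration, and the path is the one from the failure above (its name suggests it belongs to job 69, the original job, but that is my guess).

```shell
# Hypothetical helper: report whether a file that the checkpoint refers to
# still exists on this node. Returns non-zero when the file is missing.
check_saved_file() {
    if [ -e "$1" ]; then
        echo "present: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Usage, with the path from the failure above:
#   check_saved_file /usr/spool/PBS/spool/69.ganode00.OU
```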
I am using lam-7.1.2b30, torque-2.0.0p8 and blcr-0.4.1_b4.
Am I doing something wrong? How can I checkpoint and restart a task under TORQUE (or OpenPBS) and LAM?
Thanks!
Liu xuezhao
2006-06-21