LAM/MPI General User's Mailing List Archives

From: Liu Xuezhao (lxz_at_[hidden])
Date: 2006-06-21 06:09:56


Hi,

   I am using LAM + TORQUE + BLCR, and I failed to restart the LAM task under PBS (TORQUE).
   I tested it like this:
   (1) Use TORQUE to submit a task; the script is:
    #PBS -S /bin/bash
    #PBS -N Linpack
    #PBS -l nodes=2:ppn=1
    lamboot
    lamnodes
    echo "==================="
    cd /home/lxz/src/HPL/hpl/bin/lxz
    mpirun -np 2 xhpl
    echo "==================="
    lamhalt
        I used "qsub linpack.sh" to submit it to TORQUE.
   (2) checkpoint it manually:
    cr_checkpoint *** (PID of mpirun)
        After it runs, I can find the checkpoint files (3 files here).
   (3) kill the task:
    killall xhpl
   (4) restart the task.
        The script:
    #PBS -S /bin/bash
    #PBS -N Mm5
    #PBS -l nodes=2:ppn=1
    lamboot
    lamnodes
    echo "==================="
    cd /home/lxz/tmp
    cr_restart context.11468
    echo "==================="
    lamhalt
        I typed "qsub restartlinpack.sh" on the command line.
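
    For reference, step (2) could be scripted roughly as below. This is only a sketch, not what was actually run: it assumes pgrep is installed and that exactly one mpirun owned by the current user is running on the node. The names single_pid, checkpoint_mpirun and CHECKPOINT_CMD are made up for illustration.

```shell
#!/bin/bash
# Sketch only: checkpoint the single running mpirun with BLCR.
# Assumptions (not from the original post): pgrep is available, and exactly
# one mpirun owned by the current user is running on this node.

# Print the PID only when exactly one PID was passed in; otherwise fail.
single_pid() {
    if [ "$#" -ne 1 ]; then
        echo "expected exactly one PID, got $#" >&2
        return 1
    fi
    echo "$1"
}

# Look up mpirun's PID and hand it to cr_checkpoint.
checkpoint_mpirun() {
    local pid
    # $(pgrep ...) is left unquoted on purpose: each PID becomes one argument.
    pid=$(single_pid $(pgrep -u "$(id -un)" -x mpirun)) || return 1
    # CHECKPOINT_CMD is a hypothetical override for dry runs (e.g. "echo");
    # the default is BLCR's cr_checkpoint, invoked as in step (2).
    ${CHECKPOINT_CMD:-cr_checkpoint} "$pid"
}
```

    Calling checkpoint_mpirun on the node where mpirun runs should then be equivalent to typing the cr_checkpoint command by hand. Refusing to continue when zero or several PIDs match avoids checkpointing the wrong process.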
    
    But the task is not restarted correctly. I traced the source code of BLCR and found that the reason is that "cr_restore_all_files" fails because it cannot find the file "/usr/spool/PBS/spool/69.ganode00.OU", so the task fails to restart.
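
    The missing file could be confirmed from inside the restart job, before cr_restart is called, with a small check like the one below. This is a sketch under assumptions: missing_files is a made-up helper name, and the path is simply the one reported above. The ".OU" suffix suggests it was the output spool file of the original job, but that has not been verified here.

```shell
#!/bin/bash
# Sketch only: report which files expected at restart time do not exist.
# missing_files is a hypothetical helper, not part of BLCR, LAM or TORQUE.

# Print "missing: <path>" for every argument that does not exist;
# return non-zero if at least one path is missing.
missing_files() {
    local rc=0 path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            rc=1
        fi
    done
    return "$rc"
}

# Possible use in the restart script, just before cr_restart:
#   missing_files /usr/spool/PBS/spool/69.ganode00.OU \
#       || echo "cr_restart will probably fail on this node" >&2
```

    The check does not fix anything; it only makes the failure visible in the job output without having to trace BLCR itself.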

    I am using lam-7.1.2b30, torque-2.0.0p8 and blcr-0.4.1_b4.

    Am I doing something wrong? How can I checkpoint and restart a task under TORQUE (or OpenPBS) and LAM?
    Thanks!

Liu xuezhao
2006-06-21