Dear All,
I am running my software with the following script:
#!/bin/bash
AMBERHOME=/nfsexportn277/amber/amber8
export MPI_REMSH=/usr/bin/ssh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hptc/lib:/opt/hptc/lsf/top/6.0/linux2.4-glibc2.3-amd64-slurm/lib:/opt/intel/fce/9.0/lib:/opt/hpmpi/lib/linux_amd64:/opt/hpmpi/lib/linux_amd64:/opt/intel/fce/9.0/lib
export PATH=$PATH:/opt/hpmpi/bin
export SAURABH=/nfshomen278/saurabha/mid15-16
for i in 32; do
echo "START... $i.. "
ulimit -c unlimited
NO_NODES=`expr $i / 2`
#bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
#    -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
#    $AMBERHOME/exe/pmemd -O -i $SAURABH/md10_r.in -c $SAURABH/md10.rst \
#    -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md10_r.rst \
#    -o $SAURABH/md10_r.out -x $SAURABH/md10_r.mdcrd -ref $SAURABH/md10.rst
bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
    -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
    $AMBERHOME/exe/pmemd -O -i $SAURABH/md11_r.in -c $SAURABH/md11.rst \
    -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md11_r.rst \
    -o $SAURABH/md11_r.out -x $SAURABH/md11_r.mdcrd -ref $SAURABH/md11.rst
bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
    -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
    $AMBERHOME/exe/pmemd -O -i $SAURABH/md12.in -c $SAURABH/md11_r.rst \
    -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md12.rst \
    -o $SAURABH/md12.out -x $SAURABH/md12.mdcrd -ref $SAURABH/md11_r.rst
bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
    -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
    $AMBERHOME/exe/pmemd -O -i $SAURABH/md13.in -c $SAURABH/md12.rst \
    -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md13.rst \
    -o $SAURABH/md13.out -x $SAURABH/md13.mdcrd -ref $SAURABH/md12.rst
bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
    -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
    $AMBERHOME/exe/pmemd -O -i $SAURABH/md14.in -c $SAURABH/md13.rst \
    -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md14.rst \
    -o $SAURABH/md14.out -x $SAURABH/md14.mdcrd -ref $SAURABH/md13.rst
bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
    -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
    $AMBERHOME/exe/pmemd -O -i $SAURABH/md15.in -c $SAURABH/md14.rst \
    -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md15.rst \
    -o $SAURABH/md15.out -x $SAURABH/md15.mdcrd -ref $SAURABH/md14.rst
while test 1;
do
str=`bjobs 2>&1`
echo "bjobs output: $str";
if [ "X$str" == "XNo unfinished job found" ];
then
break;
fi
sleep 20;
done
done
echo "DONE."
However, after some time (2-3 hours) my jobs suddenly stop with the
following error:
srun: interrupt (one more within 1 sec to abort)
srun: interrupt (one more within 1 sec to abort)
srun: task[0-31]: running
srun: task0: running
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
srun: sending Ctrl-C to job
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
srun: error: n177: task[12-13]: Exited with exit code 1
srun: Terminating job
forrtl: error (69): process interrupted (SIGINT)
srun: error: n160: task0: Exited with exit code 1
If someone could tell me the possible reason for this error, it would
be a great help.
saurabh