
LAM/MPI General User's Mailing List Archives


From: saurabh agrawal (imsam100_at_[hidden])
Date: 2007-05-09 10:09:54


Dear All,

I am running my software with the following script:
#!/bin/bash

AMBERHOME=/nfsexportn277/amber/amber8

export MPI_REMSH=/usr/bin/ssh

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hptc/lib:/opt/hptc/lsf/top/6.0/linux2.4-glibc2.3-amd64-slurm/lib:/opt/intel/fce/9.0/lib:/opt/hpmpi/lib/linux_amd64:/opt/hpmpi/lib/linux_amd64:/opt/intel/fce/9.0/lib

export PATH=$PATH:/opt/hpmpi/bin

export SAURABH=/nfshomen278/saurabha/mid15-16

for i in 32; do

    echo "START... $i.. "

    ulimit -c unlimited

    NO_NODES=`expr $i / 2`

    #bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
    #    -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
    #    $AMBERHOME/exe/pmemd -O -i $SAURABH/md10_r.in -c $SAURABH/md10.rst \
    #    -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md10_r.rst \
    #    -o $SAURABH/md10_r.out -x $SAURABH/md10_r.mdcrd -ref $SAURABH/md10.rst

    bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
        -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
        $AMBERHOME/exe/pmemd -O -i $SAURABH/md11_r.in -c $SAURABH/md11.rst \
        -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md11_r.rst \
        -o $SAURABH/md11_r.out -x $SAURABH/md11_r.mdcrd -ref $SAURABH/md11.rst

    bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
        -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
        $AMBERHOME/exe/pmemd -O -i $SAURABH/md12.in -c $SAURABH/md11_r.rst \
        -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md12.rst \
        -o $SAURABH/md12.out -x $SAURABH/md12.mdcrd -ref $SAURABH/md11_r.rst

    bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
        -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
        $AMBERHOME/exe/pmemd -O -i $SAURABH/md13.in -c $SAURABH/md12.rst \
        -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md13.rst \
        -o $SAURABH/md13.out -x $SAURABH/md13.mdcrd -ref $SAURABH/md12.rst

    bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
        -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
        $AMBERHOME/exe/pmemd -O -i $SAURABH/md14.in -c $SAURABH/md13.rst \
        -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md14.rst \
        -o $SAURABH/md14.out -x $SAURABH/md14.mdcrd -ref $SAURABH/md13.rst

    bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i \
        -ext "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
        $AMBERHOME/exe/pmemd -O -i $SAURABH/md15.in -c $SAURABH/md14.rst \
        -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md15.rst \
        -o $SAURABH/md15.out -x $SAURABH/md15.mdcrd -ref $SAURABH/md14.rst

    # Wait until LSF reports no unfinished jobs.
    while test 1; do
        str=`bjobs 2>&1`
        echo "bjobs output: $str"
        if [ "X$str" == "XNo unfinished job found" ]; then
            break
        fi
        sleep 20
    done

done

echo "DONE."
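For what it's worth, the `bjobs` polling loop at the end of the script can be factored into a small helper with an upper bound on the wait, so a hung scheduler cannot block the script forever. This is only a sketch: `bjobs` is stubbed out here so the snippet can run without an LSF installation; on a real cluster the stub would be removed so the actual LSF `bjobs` command is used (assuming the same "No unfinished job found" output).

```shell
#!/bin/bash

# Stub of bjobs for illustration only; remove on a real LSF cluster.
bjobs() { echo "No unfinished job found"; }

# Poll bjobs every $2 seconds until no jobs remain or $1 seconds elapse.
wait_for_jobs() {
    local max_wait=$1 interval=$2 elapsed=0 out
    while [ "$elapsed" -lt "$max_wait" ]; do
        out=$(bjobs 2>&1)
        if [ "X$out" = "XNo unfinished job found" ]; then
            echo "all jobs finished"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "timed out waiting for jobs"
    return 1
}

wait_for_jobs 60 1
```

The timeout means a stuck queue produces a clear "timed out" message and a non-zero exit status instead of an endless loop.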

But after 2-3 hours my jobs suddenly stop with the following error:

srun: interrupt (one more within 1 sec to abort)
srun: interrupt (one more within 1 sec to abort)
srun: task[0-31]: running
srun: task0: running
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
srun: sending Ctrl-C to job
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
forrtl: error (69): process interrupted (SIGINT)
srun: error: n177: task[12-13]: Exited with exit code 1
srun: Terminating job
forrtl: error (69): process interrupted (SIGINT)
srun: error: n160: task0: Exited with exit code 1

If someone could tell me the possible reason for this error, it would be a great help.

saurabh
