It looks like you should be contacting Platform (for LSF), HP (for HP-MPI), and/
or the SLURM developers for help with your issue.
This mailing list supports the LAM implementation of MPI; we can't
help with the software that it appears you are using.
Good luck.
On May 9, 2007, at 10:09 AM, saurabh agrawal wrote:
> Dear All,
>
> I am running my software with the help of the given
> script,
> #!/bin/bash
>
> export AMBERHOME=/nfsexportn277/amber/amber8
>
> export MPI_REMSH=/usr/bin/ssh
>
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hptc/lib:/opt/hptc/lsf/top/6.0/linux2.4-glibc2.3-amd64-slurm/lib:/opt/intel/fce/9.0/lib:/opt/hpmpi/lib/linux_amd64
>
> export PATH=$PATH:/opt/hpmpi/bin
>
> export SAURABH=/nfshomen278/saurabha/mid15-16
>
> for i in 32; do
>
>     echo "START... $i.. "
>
>     ulimit -c unlimited
>
>     NO_NODES=`expr $i / 2`
>
>     # bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i -ext \
>     #     "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
>     #     $AMBERHOME/exe/pmemd -O -i $SAURABH/md10_r.in -c $SAURABH/md10.rst \
>     #     -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md10_r.rst \
>     #     -o $SAURABH/md10_r.out -x $SAURABH/md10_r.mdcrd -ref $SAURABH/md10.rst
>
>     bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i -ext \
>         "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
>         $AMBERHOME/exe/pmemd -O -i $SAURABH/md11_r.in -c $SAURABH/md11.rst \
>         -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md11_r.rst \
>         -o $SAURABH/md11_r.out -x $SAURABH/md11_r.mdcrd -ref $SAURABH/md11.rst
>
>     bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i -ext \
>         "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
>         $AMBERHOME/exe/pmemd -O -i $SAURABH/md12.in -c $SAURABH/md11_r.rst \
>         -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md12.rst \
>         -o $SAURABH/md12.out -x $SAURABH/md12.mdcrd -ref $SAURABH/md11_r.rst
>
>     bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i -ext \
>         "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
>         $AMBERHOME/exe/pmemd -O -i $SAURABH/md13.in -c $SAURABH/md12.rst \
>         -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md13.rst \
>         -o $SAURABH/md13.out -x $SAURABH/md13.mdcrd -ref $SAURABH/md12.rst
>
>     bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i -ext \
>         "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
>         $AMBERHOME/exe/pmemd -O -i $SAURABH/md14.in -c $SAURABH/md13.rst \
>         -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md14.rst \
>         -o $SAURABH/md14.out -x $SAURABH/md14.mdcrd -ref $SAURABH/md13.rst
>
>     bsub -K -e $SAURABH/err_9_5_07.txt -o $SAURABH/bslog.txt -n $i -ext \
>         "SLURM[nodelist=n[21-36]]" /opt/hpmpi/bin/mpirun -srun \
>         $AMBERHOME/exe/pmemd -O -i $SAURABH/md15.in -c $SAURABH/md14.rst \
>         -p $SAURABH/mid-15-16_sat.top -r $SAURABH/md15.rst \
>         -o $SAURABH/md15.out -x $SAURABH/md15.mdcrd -ref $SAURABH/md14.rst
>
>     while true; do
>         str=`bjobs 2>&1`
>         echo "bjobs output: $str"
>         if [ "X$str" == "XNo unfinished job found" ]; then
>             break
>         fi
>         sleep 20
>     done
> done
>
> echo "DONE."
>
> But after 2-3 hours my jobs suddenly stop with the
> following error:
>
> srun: interrupt (one more within 1 sec to abort)
> srun: interrupt (one more within 1 sec to abort)
> srun: task[0-31]: running
> srun: task0: running
> forrtl: error (69): process interrupted (SIGINT)
> [... previous line repeated 9 more times ...]
> srun: sending Ctrl-C to job
> forrtl: error (69): process interrupted (SIGINT)
> [... previous line repeated 21 more times ...]
> srun: error: n177: task[12-13]: Exited with exit code 1
> srun: Terminating job
> forrtl: error (69): process interrupted (SIGINT)
> srun: error: n160: task0: Exited with exit code 1
>
>
> If someone could tell me the possible reason for this
> error, it would be a great help.
>
>
> saurabh
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Cisco Systems