>>Oh, this might be the problem. I am using 7.1.1. I will try the beta
>>version and let you know.
>
>
> It probably is. Sorry; I really should have brought that up in the
> first mail. :-\
That's ok. I tried the beta version and I still have the same problem: I
see only one context file. Again, to be clear, this is what I am doing:
1. I submit a job using the following script:
#PBS -l nodes=2:ppn=1
setenv PATH ${PATH}:/usr/local/bin
lamboot $PBS_NODEFILE
mpirun -ssi rpi crtcp -np 2 /home/ppadala/kabru/progs/hello
lamhalt
The program runs on the two machines, and I can see both processes with ps.
2. I run lamcheckpoint on the machine that's running mpirun. I get one
context file in my home dir.
3. I kill the processes (on both machines), and Torque and Maui clean up
their data structures.
4. I run lamrestart on the context file on the machine where mpirun was
run. At this stage nothing happens: no new processes appear.
Am I missing something here? Do I need to lamboot before lamrestart?
Shouldn't I have lamhalt in the original script? Do I need to kill lamd
instead of the process?
Thanks for your help.
--
Pradeep Padala
http://ppadala.blogspot.com