LAM/MPI General User's Mailing List Archives

From: Pradeep Padala (ppadala_at_[hidden])
Date: 2005-07-19 14:27:06


>>Oh, this might be the problem. I am using 7.1.1. I will try the beta
>>version and let you know.
>
>
> It probably is. Sorry; I really should have brought that up in the
> first mail. :-\

That's ok. I tried the beta version and I still have the same problem: I
see only one context file. Again, to be clear, this is what I am doing:

1. I submit a job using the following script:
    #PBS -l nodes=2:ppn=1
    setenv PATH ${PATH}:/usr/local/bin
    lamboot $PBS_NODEFILE
    mpirun -ssi rpi crtcp -np 2 /home/ppadala/kabru/progs/hello
    lamhalt

    The program runs on the two machines and I can see the processes
    with ps.

2. I run lamcheckpoint on the machine that's running mpirun. I get one
    context file in my home dir.

3. I kill the processes (on both machines). Torque and Maui then clean
    up their data structures.

4. I run lamrestart on the context file, on the machine where mpirun was
    run. At this stage nothing happens: no new processes appear.
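
For reference, steps 2 and 4 above might look roughly like the sketch below. This is a dry run that only prints the commands. It assumes the BLCR checkpoint/restart module is in use, and the flags shown (-ssi cr blcr, -pid, cr_blcr_context_file) are given from memory of the LAM/MPI documentation, so they should be checked against the lamcheckpoint and lamrestart man pages for the installed version. MPIRUN_PID and CONTEXT_FILE are placeholders, not values taken from this thread.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for steps 2 and 4 instead of running them.
# MPIRUN_PID and CONTEXT_FILE are hypothetical placeholders.
MPIRUN_PID="${MPIRUN_PID:-12345}"
CONTEXT_FILE="${CONTEXT_FILE:-$HOME/context.mpirun.$MPIRUN_PID}"

checkpoint_cmd() {
    # Step 2: request a checkpoint of the job by giving the PID of mpirun.
    # Flags are an assumption; verify them in the lamcheckpoint man page.
    echo "lamcheckpoint -ssi cr blcr -pid $MPIRUN_PID"
}

restart_cmd() {
    # Step 4: restart from the context file written for mpirun.
    # Flags are an assumption; verify them in the lamrestart man page.
    echo "lamrestart -ssi cr blcr -ssi cr_blcr_context_file $CONTEXT_FILE"
}

checkpoint_cmd
restart_cmd
```

Printing the commands first makes it easy to confirm which PID and which context file are being passed before anything is run on the cluster.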

Am I missing something here? Do I need to lamboot before lamrestart?
Should I leave lamhalt out of the original script? Do I need to kill
lamd instead of the processes?

Thanks for your help

-- 
Pradeep Padala
http://ppadala.blogspot.com