I installed BLCR and LAM/MPI as follows:
1. BLCR-0.3.1 compiled on all cluster nodes.
2. Recompiled LAM/MPI with BLCR support on the master node only.
laminfo then shows that BLCR support has been built into LAM. I also
loaded blcr.o and vmdump.o with insmod on all nodes.
I can successfully run an MPI application and checkpoint it with these
commands:
# mpirun -np 4 -ssi rpi rctcp -ssi blcr application
# cr_checkpoint <PID of mpirun>
A context file was generated, but when I restarted the application with
the cr_restart command, it reported a file descriptor error.
Can anybody give me some suggestions? Should I recompile LAM/MPI on all
nodes? Does cr_checkpoint generate context files on all nodes or only
on the master node? Should the directory that holds the context file
be a shared mounted directory?
Thanks
Tong
Dell