Hi everyone,
I am working with BLCR and LAM/MPI, but something goes wrong and I don't know the reason.
First I installed BLCR and LAM/MPI. Then I tested checkpoint/restart on an MPI program running on a single node, and everything went smoothly: the program ran, the checkpoint commands executed successfully, the context files were generated, and the restart also worked properly. Later I tried the same experiment on a 2-node cluster, and there it failed. I started the MPI program with this command:
mpirun -np 2 -ssi rpi crtcp -ssi cr blcr C ./lamtest
While the program was running, I tried to checkpoint it with the following command (10411 is the PID of mpirun):
lamcheckpoint -ssi cr blcr -pid 10411
The command hung at that point and never returned until I pressed Ctrl-C.
I checked the working directory, i.e. my home directory, and no context file was found. However, some temporary files with names like .context-xxxxx-xx.tmp were present.
Could someone please tell me what the problem is? I would much appreciate it.
Thanks.