Hi,
I ran into this problem too. I think it is a bug in lam-7.1.1, but the LAM developers have not confirmed it.
You can fall back to the previous version, LAM 7.0.6, but I have found that 7.0.6 has a bug of its own: it produces the file "/tmp/lam-xxx_at_node01/lam-crtcp-rank-0.txt", which prevents the execution from being restarted after lamhalt and lamboot have been run.
It seems that fault tolerance is not an essential feature of LAM/MPI? Could the LAM developers tell us something about the project plan for fault tolerance in LAM/MPI?
Thanks.
Liu
2006-01-04
================================
>BTW: The LAM/MPI version I use is 7.1.1
>
>And the problem seems to be that when I invoke
>
> >cr_checkpoint --term $PID_of_mpirun
>
>only the mpirun process is checkpointed; none of the other MPI processes
>are. So when I call
>
> >cr_restart context.$PID_of_mpirun
>
>it always reports :
> >mpirun (rpwait): Bad file descriptor
>
>The cause is obvious: no MPI process image was checkpointed. But why? I have
>checked the output parameter of MPI_Init_thread(..., &provided), and the
>thread level has indeed been set to MPI_THREAD_SERIALIZED already.
>
>Would you give me some hints?
>
>Thanks!
>
>Yuan
>
= = = = = = = = = = = = = = = = = = = =