LAM/MPI Development Mailing List Archives

From: Liu Xuezhao (lxz_at_[hidden])
Date: 2006-01-04 01:39:00


Hi,

   I ran into this problem as well. I think it is a bug in LAM 7.1.1, but the LAM developers have not confirmed it.
   You can fall back to the previous version, LAM 7.0.6, but I have found that 7.0.6 has a bug of its own: it produces the file "/tmp/lam-xxx_at_node01/lam-crtcp-rank-0.txt", which prevents the job from being restarted once lamhalt and lamboot have been run.
   It seems that fault tolerance is not a particularly important feature of LAM/MPI. Could the LAM developers tell us something about the project's plans for fault tolerance in LAM/MPI?
   Thanks.

Liu
2006-01-04
================================

>BTW: The LAM/MPI version I use is 7.1.1
>
>And the problem seems to be that when I invoke
>
> >cr_checkpoint --term $PID_of_mpirun
>
>only the mpirun process is checkpointed; none of the other MPI processes
>are. So when I call
>
> >cr_restart context.$PID_of_mpirun
>
>it always reports:
> >mpirun (rpwait): Bad file descriptor
>
>That much is expected, since no MPI process image was checkpointed. But why not? I have
>checked the output parameter of MPI_Init_thread(..., &provided), and the
>thread level has indeed been set to MPI_THREAD_SERIALIZED.
>
>Would you give me some hints?
>
>Thanks!
>
>Yuan
>
= = = = = = = = = = = = = = = = = = = =