Brian Barrett wrote:
> On May 15, 2007, at 8:23 AM, Sauget Marc wrote:
>
>> I have developed an application with LAM/MPI.
>> I have just one small problem.
>>
>> My application takes a very long time to converge and uses many
>> computers. I use the "student" computers, and this kind of computer
>> is not "very" safe ...
>>
>> I have seen that there is a fault-tolerant version of LAM/MPI,
>> but I don't need that (too complex for my little application). I just
>> want to know whether there is a version of LAM that doesn't stop
>> running if one sub-node (other than the "0" node) on the grid reboots
>> or fails. Or else another distribution, a module, or a start option
>> that does this.
>>
>> For the moment, I use the stable version that is present in the Debian
>> Linux distribution (lam4). Because I have added a system to detect
>> node failures, I can continue my work if a machine crashes, but
>> not if a machine reboots. (In the first case, the virtual machine
>> doesn't know about the node failure and so carries on with its work;
>> in the second, the node warns the global virtual machine, which then
>> crashes ... I think.)
>
> There is rudimentary support for what you are trying to do in
> LAM/MPI, but it is not well tested and definitely not supported. If
> you run lamboot with the -x option, it will enable "fault tolerance"
> in the LAM universe. The lam daemons will detect a node failure and
> fail all communication pending to that node.
>
> LAM's fault tolerance is really only useful for manager/worker codes
> where the worker is launched with MPI_COMM_SPAWN. Have a look at
> examples/fault/README in any recent LAM tarball for more
> information. If you need more fault tolerance than this provides,
> you might want to look at FT-MPI from the University of Tennessee,
> Knoxville.
>
> Hope this helps,
>
> Brian
>
Sorry for the delay, and thanks for the answer.
I had already found this information by reading the man page, and
I have used that example :D
Shame on me for asking this question :D
Thanks
++ Marc
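
For readers of this thread, here is a minimal sketch of the manager/worker
idea Brian mentions: the manager hands out tasks, and when a worker fails it
drops that worker and requeues the lost task. This is a plain-Python
simulation that does not use MPI at all. The names WorkerFailed, make_worker
and run_manager are hypothetical, and the raised exception merely stands in
for whatever error a real LAM program would see on a failed worker's
communicator (examples/fault/README is the place to check for the real
mechanism).

```python
from collections import deque


class WorkerFailed(Exception):
    """Hypothetical stand-in for an MPI error from a dead worker's node."""


def make_worker(name, fails_at_call=None):
    """Build a simulated worker that squares each task it is given.

    If fails_at_call is set, the worker "crashes" on that call number
    and on every call after it.
    """
    calls = 0

    def worker(task):
        nonlocal calls
        calls += 1
        if fails_at_call is not None and calls >= fails_at_call:
            raise WorkerFailed(name)
        return task * task

    worker.name = name
    return worker


def run_manager(tasks, workers):
    """Hand tasks to workers round-robin and requeue work lost to failures."""
    pending = deque(tasks)
    alive = deque(workers)
    results = {}
    failed = []
    while pending:
        if not alive:
            raise RuntimeError("all workers failed")
        task = pending.popleft()
        worker = alive.popleft()
        try:
            results[task] = worker(task)
        except WorkerFailed:
            # Drop the dead worker and put its task back in the queue.
            failed.append(worker.name)
            pending.append(task)
            continue
        alive.append(worker)  # the worker survived, so it can take more work
    return results, failed


# Three simulated workers; "w1" fails on its second task.
workers = [make_worker("w0"), make_worker("w1", fails_at_call=2), make_worker("w2")]
results, failed = run_manager(range(6), workers)
print(sorted(results.items()))
print(failed)
```

The point of the design is that the manager never depends on any single
worker: a failure costs only the one task that worker was holding, which is
why this style of program suits the limited fault tolerance described above.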