Hi,
I have developed an application with LAM/MPI.
I have just one small problem.
My application takes a very long time to converge and uses many
computers. I run it on the "student" computers, and this kind of
computer is not "very" safe ...
I have seen that there is a fault-tolerant version of LAM/MPI, but I
don't need that (too complex for my little application). I just want
to know whether there is a version of LAM that does not stop running
if one sub-node (other than node "0") on the grid reboots or fails. Or,
better than another distribution, a module or a start option.
For the moment, I use the stable version that is present in the Debian
Linux distribution (lam4). Because I have added a system to detect node
failure, I can continue my work if a machine crashes, but not if a
machine reboots. (In the first case, the virtual machine does not know
about the node failure and so continues its work; in the second, the
node warns the global virtual machine, which crashes ... I think.)
Thanks for your help
Sincerely,
Marc