LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2007-05-26 19:08:39


On May 15, 2007, at 8:23 AM, Sauget Marc wrote:

> I have developed an application using LAM/MPI, and I have one small
> problem.
>
> My application takes a very long time to converge and uses many
> computers. I run it on the student computers, and these machines are
> not very reliable.
>
> I have seen that there is a fault-tolerant version of LAM/MPI, but I
> don't need that (it is too complex for my small application). I just
> want to know whether there is a version of LAM that does not stop
> running if one sub-node (other than node 0) on the grid reboots or
> fails. Alternatively, is there another distribution, a module, or a
> startup option that does this?
>
> For the moment, I use the stable version shipped in the Debian Linux
> distribution (lam4). Because I have added a system to detect node
> failures, I can continue my work if a machine crashes, but not if a
> machine reboots. In the first case, the virtual machine does not
> learn of the node failure and so carries on with its work; in the
> second, I think the node notifies the global virtual machine, which
> then crashes.

There is rudimentary support in LAM/MPI for what you are trying to
do, but it is not well tested and definitely not supported. If you
run lamboot with the -x option, it will enable "fault tolerance" in
the LAM universe. The LAM daemons will then detect a node failure and
fail all communication pending with that node.

LAM's fault tolerance is really only useful for manager/worker codes
where the workers are launched with MPI_COMM_SPAWN. Have a look at
examples/fault/README in any recent LAM tarball for more
information. If you need more fault tolerance than this provides,
you might want to look at FT-MPI from the University of Tennessee,
Knoxville.
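
To make the manager/worker idea concrete, here is a rough sketch of
the bookkeeping on the manager side. It is not the code from
examples/fault, and it deliberately contains no MPI calls so that it
compiles on its own. The names run_tasks, dispatch_fn, and
simulated_dispatch are all made up for illustration. In a real
program, the dispatch callback would presumably wrap the send and
receive on the intercommunicator returned by MPI_Comm_spawn, with the
error handler set to MPI_ERRORS_RETURN so that a failed worker shows
up as a non-zero return code instead of aborting the manager; that
part is an assumption, not something shown here.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_WORKERS 16

/* Hypothetical callback: hand `task` to `worker` and wait for its answer.
 * Returns 0 on success, non-zero if communication with the worker failed
 * (assumed to mirror an MPI error code returned under MPI_ERRORS_RETURN). */
typedef int (*dispatch_fn)(int worker, int task, double *result);

/* Hand out tasks round-robin. A worker whose dispatch fails is marked dead
 * and never used again; the task it was given is retried on the next live
 * worker. Returns the number of tasks completed, which equals ntasks
 * unless every worker has failed. */
int run_tasks(int ntasks, int nworkers, dispatch_fn dispatch, double *results)
{
    int alive[MAX_WORKERS];
    int nalive, next = 0, w;

    assert(dispatch != NULL && results != NULL);

    if (nworkers > MAX_WORKERS)
        nworkers = MAX_WORKERS;
    for (w = 0; w < nworkers; ++w)
        alive[w] = 1;
    nalive = nworkers;

    w = 0;
    while (next < ntasks && nalive > 0) {
        if (alive[w]) {
            if (dispatch(w, next, &results[next]) == 0) {
                ++next;                 /* task finished, move on */
            } else {
                alive[w] = 0;           /* stop using this worker */
                --nalive;
                fprintf(stderr, "worker %d failed; retrying task %d\n",
                        w, next);
            }
        }
        w = (w + 1) % nworkers;
    }
    return next;
}

/* Stand-in for real communication: worker 1 is always "down", every
 * other worker returns the square of the task number. */
int simulated_dispatch(int worker, int task, double *result)
{
    if (worker == 1)
        return 1;
    *result = (double)task * (double)task;
    return 0;
}
```

Calling run_tasks(5, 3, simulated_dispatch, results) completes all
five tasks on workers 0 and 2 after worker 1 is dropped. The point of
the design is that the manager never depends on any single worker,
which is why this style of code suits the limited fault tolerance
described above.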

Hope this helps,

Brian

-- 
   Brian Barrett
   LAM/MPI Developer
   Make today a LAM/MPI day!