On Oct 20, 2003, at 1:39 PM, Robert LeBlanc wrote:
> I was thinking over the weekend that we could expand our cluster to
> possibly include workstations that don't get used very much, running
> programs that don't have a lot of network traffic. LAM seems a good
> choice because of the ease of parallelizing programs. The questions that
> came up are these: Is LAM fault tolerant, so that if a node becomes
> unavailable it moves the work unit to another node? Is there a
> server daemon that monitors all the nodes in the list to see when
> nodes become available? It seems like this would be possible,
> especially if ssh is set up correctly; it could be used to reduce load
> on the cluster in real time. Any comments or suggestions?

LAM has some fault tolerance capabilities, but not to the degree you
are seeking. In particular, we do not have process migration
capabilities. We are currently looking into extending our Linux
checkpoint/restart abilities to include
checkpointing/migrating/restarting a single process (as opposed to the
entire application, as we do now), but there are no timelines as to
when this might happen.

However, depending on your application, LAM plus some
scheduling/control software may meet your needs. It is generally
possible to write a client/server type application in a fault tolerant
manner with LAM. You might want to take a look at the "fault" example
that comes with LAM/MPI. It can survive a complete failure of any one
of the worker nodes (there is a README included with the example that
includes a more detailed explanation).
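
As a rough illustration of the client/server idea, here is a minimal
sketch of the manager's bookkeeping in plain Python. It is not the code
from the "fault" example and it does not use MPI at all. The names
run_manager, flaky_worker and WorkerDied are made up for this sketch; a
real LAM program would notice the lost worker through an MPI error
return rather than a Python exception.

```python
from collections import deque


class WorkerDied(Exception):
    """Hypothetical stand-in for an MPI error returned when a node is lost."""


def run_manager(work_units, workers, compute):
    """Hand out work units round-robin; requeue a unit if its worker dies.

    `compute(worker, unit)` is an assumed callback that either returns a
    result or raises WorkerDied.
    """
    pending = deque(work_units)
    alive = list(workers)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("all workers lost; cannot finish")
        worker = alive[turn % len(alive)]
        unit = pending.popleft()
        try:
            results[unit] = compute(worker, unit)
            turn += 1
        except WorkerDied:
            # Drop the dead worker and put its unit back in the queue.
            alive.remove(worker)
            pending.append(unit)
    return results, alive


def flaky_worker(worker, unit):
    """Toy compute function: worker 'n2' always fails, others square the unit."""
    if worker == "n2":
        raise WorkerDied(worker)
    return unit * unit


if __name__ == "__main__":
    results, alive = run_manager(range(6), ["n1", "n2", "n3"], flaky_worker)
    print(sorted(results.items()))
    print(alive)
```

The point of the design is that only the manager keeps state about
outstanding work, so losing a worker costs one work unit of recomputation
and nothing else. The manager itself remains a single point of failure in
this sketch.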

We don't have the monitoring abilities you are looking for -
this is really outside of the scope of MPI. There are a number of
packages that do this kind of monitoring. You might be able to make Ganglia work
as you require. Some of the grid packages may meet your requirements,
but I'm not certain about that.

Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/