> The kind of fail-over system you're describing is really one where every
> node can be a master, which basically means that they're equivalent in
> capability. The "master" is nominated by an election based on some sort
> of rules (i.e. first to boot, fastest network, whatever). You could
> possibly do this with LAM by using shell scripts to configure the
> election winner as the master, say by NFS, and from there configure the
> slaves. The scripts would be pretty complex though. As Bogdan says,
> you'll need a hot-swap master as well in case the master comes down. If
> a slave dies, you'd have to bring down the LAM universe and bring it
> back up minus the dead node. More scripting. It's doable, but it's a
> lot of work.
Yes, this is exactly my goal. And yes, it is quite a lot of work.
That is why I wanted to avoid re-inventing the wheel as much as
possible. Doing it all from scratch does have some benefits though.
Does anyone know what work has been done to show whether or not LAM
(or any MPI suite) can adapt a running process, without the process
being aware of it, should a node drop out? This should theoretically
be possible, especially if you're willing to use some sort of
kludge: say, just restart the lost process on a surviving node
(doubling up on a processor) and re-assign the dropped-out node's
'id' to it. There are of course still issues with that (the memory
that was lost, etc.), but I am by no means an expert in this field.
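
To make the kludge a bit more concrete, here is a rough sketch of just
the bookkeeping step: given a map of process 'id' to host, move the
ids of a dead host onto the least-loaded survivors. This is plain
Python, not LAM or MPI code. The function name, the host names and the
"least-loaded, ties broken by name" rule are all my own assumptions,
and it does nothing about the lost memory.

```python
from collections import Counter


def reassign_ranks(rank_to_host, dead_host):
    """Return a new id -> host map with dead_host's ids moved to survivors.

    Hypothetical helper for illustration only; it does not talk to LAM.
    """
    survivors = sorted({h for h in rank_to_host.values() if h != dead_host})
    if not survivors:
        raise RuntimeError("no surviving nodes left")

    # How many processes each surviving host is already running.
    load = Counter(h for h in rank_to_host.values() if h != dead_host)

    new_map = {}
    for rank in sorted(rank_to_host):
        host = rank_to_host[rank]
        if host == dead_host:
            # Double up on the least-loaded survivor (ties broken by name).
            host = min(survivors, key=lambda h: (load[h], h))
            load[host] += 1
        new_map[rank] = host
    return new_map


if __name__ == "__main__":
    layout = {0: "node0", 1: "node1", 2: "node2", 3: "node3"}
    # node2 drops out; its id (2) is handed to a surviving host.
    print(reassign_ranks(layout, "node2"))
```

The process ids stay the same, so the rest of the job could in
principle keep addressing id 2 as before. Whether any MPI suite lets
you swap the host underneath a live id is exactly what I am asking.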
Thanks for the reply!
Bill