Based on Jeff Squyres' response, waiting for OpenMPI looks like a good
option. You could build a reasonably robust implementation by doing what
MySQL does and replicating the data for one process on another. Note
that I don't mean duplicating the problem evaluation (you could, but
you'd lose 50% of your performance).
By replicating the data you will lose some available memory, but in
theory you could backtrack to where you were just before a given
failure, reconstruct your problem, and continue. This implies that you
have persisted your previous problem state somewhere: in memory, on
disk, or wherever. How far you go depends on how fail-safe your system
has to be.
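The backtrack-and-restore idea above is essentially periodic
checkpointing. A minimal sketch, assuming a hypothetical solver whose
state fits in a dict and using Python's pickle purely for illustration
(a real MPI code would serialize its own structures):

```python
import os
import pickle

CKPT = "solver_state.ckpt"        # hypothetical checkpoint file name
CKPT_TMP = CKPT + ".tmp"

def save_checkpoint(state):
    """Persist the problem state atomically: write a temp file, then
    rename it, so a crash mid-write never corrupts the last good copy."""
    with open(CKPT_TMP, "wb") as f:
        pickle.dump(state, f)
    os.replace(CKPT_TMP, CKPT)

def load_checkpoint():
    """Return the last saved state, or None on a fresh start."""
    if not os.path.exists(CKPT):
        return None
    with open(CKPT, "rb") as f:
        return pickle.load(f)

# On (re)start, resume from the last checkpoint if one exists.
state = load_checkpoint() or {"iteration": 0, "data": []}
for _ in range(3):                # stand-in for the real solve loop
    state["iteration"] += 1
    state["data"].append(state["iteration"] ** 2)
    save_checkpoint(state)        # checkpoint after every step
```

After a failure you simply restart, reload the last checkpoint, and
continue from there instead of from iteration zero.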
Damien Hocking
Light travels faster than sound. This is why some people appear bright until they speak.
Jeff Layton wrote:
> William Bierman wrote:
>
>>> The kind of fail-over system you're describing is really one where
>>> every
>>> node can be a master, which basically means that they're equivalent in
>>> capability. The "master" is nominated by an election based on some
>>> sort
>>> of rules (i.e. first to boot, fastest network, whatever). You could
>>> possibly do this with LAM by using shell scripts to configure the
>>> election winner as the master, say by NFS, and from there configure the
>>> slaves. The scripts would be pretty complex though. As Bogdan says,
>>> you'll need a hot-swap master as well in case the master goes
>>> down. If
>>> a slave dies, you'd have to bring down the LAM universe and bring it
>>> back up minus the dead node. More scripting. It's doable, but it's a
>>> lot of work.
>>>
>>
>>
>> Yes, this is exactly my goal. And yes, it is quite a lot of work.
>> That is why I wanted to avoid re-inventing the wheel as much as
>> possible. Doing it all from scratch does have some benefits, though.
>>
>> Does anyone know what work has been done to show whether or not LAM
>> (or any MPI suite) can adapt a running process without the process
>> being aware of it, should a node drop out? This should theoretically
>> be possible, especially if you're willing to use some sort of a
>> kludge... say, restart the process doubling up on a processor, and
>> re-assign the dropped-out node's 'id'. There are still of course
>> issues with that (the memory that was lost, etc.) .. but I am by no
>> means an expert in this field.
>>
>
> Well, I'm not an expert either :) However, you might think of Jeff's
> new project: OpenMPI (www.open-mpi.org). It incorporates FT-MPI
> from the University of Tennessee (http://icl.cs.utk.edu/ftmpi/). I've
> never used it, but here's a quote from their front page:
>
> "FT-MPI survives the crash of n-1 processes in an n-process job, and,
> if required, can respawn/restart them. However, it is still the
> responsibility of the application to recover the data-structures and
> the data on the crashed processes."
>
> I'm not sure this meets your requirements since it says that you have to
> recover the data-structures yourself. One other option is to HA each node
> so that if you lose a node, the other one picks up. I don't know how well
> this works with MPI though.
> With MPI-2 you can dynamically add and subtract nodes. This might
> buy you something depending upon what you are doing.
> ICL also has a project called HARNESS (http://icl.cs.utk.edu/harness/)
> that you might be interested in.
> I've followed this thread sort of half-heartedly, but let me ask a
> question.
> Why are you interested in so much redundancy in a cluster? (I'm not being
> accusatory, just curious). I've run MPI jobs for several weeks without
> any problems. I've also had clusters stay up without any failure for
> almost 12
> months - and that was running 24/7. Do you have an app that needs to run
> an extremely long time? Have you looked at checkpointing your code?
> (Again, just curious).
>
> Jeff
>
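The master-election scheme discussed earlier in the thread can be
approximated with an atomic file creation on shared storage: the first
node to create a lockfile wins the election and becomes master. A rough
sketch; the function names and lockfile path are invented for
illustration, and a real deployment would put the lockfile on NFS
rather than a temp directory:

```python
import os
import socket
import tempfile

# Stand-in for a path on shared NFS storage visible to all nodes.
LOCKFILE = os.path.join(tempfile.mkdtemp(), "master.lock")

def try_become_master(lockfile=LOCKFILE):
    """First-to-boot election: O_CREAT|O_EXCL is atomic, so exactly
    one node wins the race to create the lockfile."""
    try:
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False              # someone else already won
    os.write(fd, socket.gethostname().encode())
    os.close(fd)
    return True

def current_master(lockfile=LOCKFILE):
    """Read the election winner's hostname from the lockfile."""
    with open(lockfile) as f:
        return f.read()

role = "master" if try_become_master() else "slave"
```

Every node runs the same script; the losers read the lockfile to learn
who the master is and configure themselves as slaves. (Note that
O_EXCL over NFS is only reliable on reasonably modern NFS versions.)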