LAM/MPI Development Mailing List Archives

From: Liu Xuezhao (lxz_at_[hidden])
Date: 2005-12-01 21:33:47


Thank you very much for telling me about Open MPI.

But I have a question. Does Open MPI support fault tolerance (i.e., checkpoint and restart for parallel applications)? I know that LAM/MPI together with BLCR provides checkpoint/restart functionality; will the CR module of LAM/MPI be included in Open MPI? I have not found any information about it.

We are now doing some work on fault tolerance, specifically checkpoint/restart, for MPI applications. It seems that LAM/MPI and BLCR can provide a basic solution for it. I wonder whether this combination is used in practice on real cluster systems; can you tell me something about that? Thanks again.

Regards!
Liu Xuezhao
2005-12-02

======= 2005-12-01 11:33:00 you wrote: =======
>There is a basic design document for writing a new RPI component
>available at:
>
> http://www.lam-mpi.org/using/docs/
>
>In particular, you should probably start with the "The System
>Services Interface to LAM/MPI" document for an overview of our
>component system. The entire RPI interface is documented in "Request
>Progression Interface (RPI) System Services Interface (SSI) Modules
>for LAM/MPI". The TCP RPI is pretty complex code - some friends of
>LAM put together a very nice document on its inner workings that is
>also linked from the documentation page. That should get you started
>working with LAM/MPI's transport engine.
>
>You might want to take a look at Open MPI instead of LAM/MPI. Open
>MPI is the successor to LAM/MPI, developed by the LAM/MPI, FT-MPI, LA-
>MPI, and PAC-X MPI development teams. It has many of the great
>features of LAM/MPI (the component system, run-time selectable device
>support, etc.), as well as a bunch of new features (real multi-device
>support, better performance, good datatype support, etc.). The
>communication architecture is slightly different - there are two
>component layers under the MPI interface - the PML, which handles MPI
>semantics, message fragmenting, etc. and the BTL, which handles
>moving packets of data (and not much else).
>
>For your research, you would have to implement a BTL using the LLC
>protocol, which could then be compared to the TCP BTL. The BTL
>interface is quite small - only 11 functions (3 of which are optional
>to implement). Unfortunately, there is not as much documentation
>available for the BTL interface as there is for LAM's RPI interface.
>However, I think you will find the reduction in complication more
>than offsets the lack of documentation.
>
>If you are interested in using Open MPI, I'd suggest looking at some
>of the papers available here:
>
> http://www.open-mpi.org/papers/
>
>Note that references to the TEG PML are out of date - we've
>redesigned the lower layers and no longer actively support the TEG
>PML (instead using the OB1 PML and the BTL interface). If you have
>any questions, there is a very responsive mailing list available -
>I'd recommend subscribing to the devel mailing list - more
>information is found here:
>
> http://www.open-mpi.org/community/lists/
>
>
>Hope this helps,
>
>Brian
>
>
>--
> Brian Barrett
> LAM/MPI developer and all around nice guy
> Have a LAM/MPI day: http://www.lam-mpi.org/
>
>
>_______________________________________________
>lam-devel mailing list
>lam-devel_at_[hidden]
>http://www.lam-mpi.org/mailman/listinfo.cgi/lam-devel
