LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-08-30 08:01:40


On Aug 29, 2005, at 9:04 PM, John Robinson wrote:

>> [snipped]
> I agree with what you propose about error codes. All I am reporting is
> what I have observed. If I had designed it, I would have done it much
> more along the lines you suggest.
>
> Note that I turn on exceptions, which is only helpful in a C++ context.
> In C++:
>
> MPI::COMM_WORLD.Set_errhandler( MPI::ERRORS_THROW_EXCEPTIONS );

You can do something similar in C and Fortran, except use the
MPI_ERRORS_RETURN error handler. Check out the MPI-1 specification for
a full description of its error handling capabilities (and
limitations). MPI::ERRORS_THROW_EXCEPTIONS was added in MPI-2 with the
C++ bindings because it's a natural error mechanism that is available
in C++ that is not available in C or Fortran. So yes, it's only
available in C++, but it only makes sense in C++.

> That is not a very good answer. This situation leads me to wonder if
> the accept/connect and publish/lookup were added at the last minute to
> the standard. They do not seem to be well thought through, IMHO.

No, there was actually a lot of debate for many months about the whole
dynamic chapter. :-)

The error mechanisms for these functions are quite consistent with all
the other MPI functions.

>> This background process should be part of the MPI daemon, but not
>> applications, right?
>> The daemon has all the knowledge of which processes published what
>> names, and which processes died, and it is a background process.
>
> I meant writing a daemon that could poke into the MPI environment to
> figure out if something is stopped or hung. If some MPI states cause
> the program to abort, you could still monitor using a shell script that
> decodes the completion status of the programs it invokes, and write a
> lot of little programs for each probe (MPI present, name published,
> test connection works, etc.). But this gets pretty messy.

Agreed. This is a function of LAM's implementation. We had Grand
Plans to make this more fine-grained and more useful for things like
this, but then we started working on Open MPI (see a mail from me about
this earlier this morning -- starting work on Open MPI meant stopping
work on many things in LAM, with the intent that we would [eventually]
do them in Open MPI instead).

We do plan to have such fine-grained tools in Open MPI (e.g., command
line tools to unpublish a name, kill a specific process and/or parallel
job, etc.), but they will not be included in Open MPI v1.0.
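
Until such tools exist, the "decode the completion status of each probe"
approach described above can be sketched roughly as follows. This is
only an illustration in C using plain POSIX calls (fork, execvp,
waitpid), not anything that ships with LAM or Open MPI; the names
run_probe and probe_result are hypothetical, and which probe programs
you run, and what their exit codes mean, is up to you.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical outcome categories for one probe program. */
enum probe_result { PROBE_OK, PROBE_FAILED, PROBE_KILLED, PROBE_ERROR };

/*
 * Run argv[0] with its arguments, wait for it, and decode how it ended.
 * On PROBE_OK/PROBE_FAILED, *detail holds the exit code; on PROBE_KILLED
 * it holds the signal number; otherwise it is left at 0.
 */
enum probe_result run_probe(char *const argv[], int *detail)
{
    int status;
    pid_t pid;

    *detail = 0;
    pid = fork();
    if (pid < 0)
        return PROBE_ERROR;            /* could not create a child */

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                    /* exec failed; shell-style code */
    }

    /* Retry if the wait is interrupted by a signal handler. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return PROBE_ERROR;
    }

    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return *detail == 0 ? PROBE_OK : PROBE_FAILED;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);    /* e.g. the probe was aborted */
        return PROBE_KILLED;
    }
    return PROBE_ERROR;
}
```

A monitor would call run_probe() once per probe and treat PROBE_KILLED
differently from PROBE_FAILED, since an aborted probe and a probe that
reported "not found" mean different things.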

> I am not aware of a way to hook a user program to the MPI daemon (lamd
> in the case of LAM), but I like your suggestion. Another approach
> would be a more elaborate utility that could check various MPI states
> (think, an expanded mpitask), and return completion status to its
> caller.

I replied something about this in a mail a few minutes ago -- you might
want to combine MPI semantics with a few semantics of your own (e.g.,
lockfiles or somesuch) to know when processes are there and/or dead,
etc. It would take some thought, but you should be able to produce a
reasonable-enough (although probably not perfect) system to get it
right 99.9% of the time.

> Not_a_MPI_designer,_just_a_user_ly y'rs

We like input from everyone -- even criticism! :-)

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/