LAM/MPI General User's Mailing List Archives

From: Lei_at_[hidden]
Date: 2005-08-29 16:32:14


Thanks a lot for your help, John!

John Robinson wrote:

>Hi Lei,
>
>I am working on a client-server application, and using MPI_Publish_name,
>although I am writing in C++.
>
>To avoid the crashes on name lookup or publish, I wrap the MPI calls in
>try/catch blocks.
>
>
So MPI_Publish_name(), MPI_Lookup_name(), and MPI_Comm_connect() actually
throw exceptions that your C++ program can catch? I am not sure I understand
how your wrapping works. I thought that if a call to, say,
MPI_Comm_connect() is going to crash, the process will crash whether or
not the C++ program catches anything.

Also, for Fortran/C programs the mechanism for catching errors is to check
return codes. So when MPI_Lookup_name() finds that a name is not published
in the MPI universe, the call should return a "not found" error code, so
that the Fortran/C program can do its own error handling (such as spawning
the process that publishes the name).
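
A minimal sketch of that return-code flow, written in C++ so that it stays
independent of any particular MPI installation. Everything named here
(LookupStatus, LookupResult, lookup_or_spawn and the two callbacks) is
hypothetical and only illustrates the control flow. In a real program the
lookup callback would presumably call MPI_Lookup_name() after installing
MPI_ERRORS_RETURN on MPI_COMM_WORLD (the default handler,
MPI_ERRORS_ARE_FATAL, aborts the process instead of returning) and map the
MPI_ERR_NAME error class to NotPublished. Whether a given LAM version
actually reports that class is an assumption I have not verified.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical outcome of one lookup attempt. In a real program these would
// be derived from the return code of MPI_Lookup_name().
enum class LookupStatus { Found, NotPublished, OtherError };

struct LookupResult {
    LookupStatus status;
    std::string port;  // meaningful only when status == Found
};

// Hypothetical callbacks: one wraps the name lookup, the other starts the
// process that publishes the name (for example via MPI_Comm_spawn()).
using LookupFn = std::function<LookupResult(const std::string&)>;
using SpawnFn = std::function<bool(const std::string&)>;

// Look up `service`. If it is not published, spawn the publisher once and
// retry, up to `max_attempts` lookups in total. Returns true and fills
// `port_out` on success; returns false on any other error, on a failed
// spawn, or when the attempts run out.
bool lookup_or_spawn(const std::string& service, const LookupFn& lookup,
                     const SpawnFn& spawn_publisher, int max_attempts,
                     std::string& port_out) {
    assert(max_attempts > 0);
    bool spawned = false;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        const LookupResult r = lookup(service);
        if (r.status == LookupStatus::Found) {
            port_out = r.port;
            return true;
        }
        if (r.status == LookupStatus::OtherError) {
            return false;  // not a "name missing" case, so do not spawn
        }
        // NotPublished: start the publisher the first time, then retry.
        if (!spawned) {
            if (!spawn_publisher(service)) {
                return false;
            }
            spawned = true;
        }
    }
    return false;
}
```

The point of the sketch is only that the "name not published" case has to be
distinguishable from every other failure; otherwise the caller cannot know
whether spawning the publisher is the right reaction.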

>The lingering published name I do not yet need (or have) a solution for,
>but I can imagine a background process/cron job that wakes up from time
>to time to check for the name and whether the server is accepting
>connections, and takes appropriate action. Again, I expect you would
>need try/catch, and probably timeouts, to handle crashed servers or a
>hung lam environment (i.e. in need of lamhalt/lamwipe).
>
>
>
This background process should be part of the MPI daemon rather than of
the applications, right? The daemon already knows which processes
published which names and which processes have died, and it is itself a
background process.

-Lei