Hi Lei,
Some responses below...
Lei_at_ICS wrote:
> Thanks a lot for your help, John!
>
> John Robinson wrote:
>
>
>>Hi Lei,
>>
>>I am working on a client-server application, and using MPI_Publish_name,
>>although I am writing in C++.
>>
>>To avoid the crashes on name lookup or publish, I wrap the MPI calls in
>>try/catch blocks.
>>
>>
>
> So MPI_Publish_name(), MPI_Lookup_name(), and MPI_Comm_connect() actually
> throw errors for your C++ program to catch? I am not sure I understand
> how your wrapping works. I thought if a call to, say, MPI_Comm_connect() is
> going to crash, the process will crash whether or not the C++ program
> catches anything.
>
> Also, for Fortran/C programs the mechanism to catch errors is to check
> return codes. So when MPI_Lookup_name() finds out that a name is not
> published in the MPI universe, the call should return an "unfound" error
> code for the Fortran/C programs to do error handling (such as spawning
> the process that publishes the name).
I agree with what you propose about error codes. All I am reporting is
what I have observed. If I had designed it, I would have done it much
more along the lines you suggest.
Note that I turn on exceptions, which is only helpful in a C++ context:
    MPI::COMM_WORLD.Set_errhandler( MPI::ERRORS_THROW_EXCEPTIONS );
This is only available for C++ programs, unfortunately. This is one
place where the language bindings are not completely symmetrical. So
unless you can write (at least part of) your application in C++, you are
probably out of luck.
That is not a very good answer. This situation leads me to wonder whether
the accept/connect and publish/lookup calls were added to the standard at
the last minute. They do not seem to be well thought through, IMHO.
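For what it is worth, the try/catch wrapping mentioned above can be sketched roughly as follows. This is only an illustration, written so that it compiles without an MPI installation: LookupError and fake_lookup are hypothetical stand-ins, not part of any MPI library. In a real program the callable would invoke MPI::Lookup_name, the caught type would be MPI::Exception, and MPI::ERRORS_THROW_EXCEPTIONS would have to be installed first.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for MPI::Exception, so the sketch builds without MPI.
struct LookupError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Run `lookup` and turn a thrown Ex into a plain "false" instead of a crash.
// On success the result of the lookup is stored in `port`.
// Assumption: in real code `lookup` would wrap MPI::Lookup_name and Ex would
// be MPI::Exception.
template <class Ex, class Lookup>
bool try_lookup(Lookup lookup, std::string& port) {
    try {
        port = lookup();
        return true;
    } catch (const Ex&) {
        return false;  // e.g. the name is not published
    }
}

// Hypothetical name service in which only "ocean" has been published.
inline std::string fake_lookup(const std::string& service) {
    assert(!service.empty());  // precondition of this toy function
    if (service != "ocean") {
        throw LookupError("name not published: " + service);
    }
    return "port-for-" + service;
}
```

A caller would test the return value of try_lookup<LookupError>(...) and, when it is false, fall back to whatever recovery it wants, such as spawning the process that publishes the name.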
>>The lingering published name I do not yet need (or have) a solution for,
>>but I can imagine a background process/cron job that wakes up from time
>>to time to check for the name and whether the server is accepting
>>connections, and takes appropriate action. Again, I expect you would
>>need try/catch, and probably timeouts, to handle crashed servers or a
>>hung lam environment (i.e. in need of lamhalt/lamwipe).
>>
>>
>>
>
> This background process should be part of the MPI daemon, but not
> applications, right? The daemon has all the knowledge of which processes
> published what names, and which processes died, and it is a background
> process.
I meant writing a daemon that could poke into the MPI environment to
figure out if something is stopped or hung. If some MPI states cause
the program to abort, you could still monitor using a shell script that
decodes the completion status of the programs it invokes, and write a
lot of little programs for each probe (MPI present, name published, test
connection works, etc.). But this gets pretty messy.
I am not aware of a way to hook a user program to the MPI daemon (lamd
in the case of LAM), but I like your suggestion. Another approach would
be a more elaborate utility that could check various MPI states (think,
an expanded mpitask), and return completion status to its caller.
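The idea of decoding the completion status of small probe programs could also be done from a C++ monitor instead of a shell script. Below is a minimal sketch, assuming a POSIX system. The names ProbeResult, decode_status and run_probe are made up for illustration, and treating exit codes above 128 as "aborted" follows the usual shell convention (128 + signal number), not anything MPI-specific.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <sys/wait.h>  // POSIX: WIFEXITED, WEXITSTATUS, WIFSIGNALED

// Coarse outcome of one probe program (hypothetical classification).
enum class ProbeResult { Ok, Failed, Aborted, NotRun };

// Translate the raw status returned by std::system() into a ProbeResult.
inline ProbeResult decode_status(int status) {
    if (status == -1) return ProbeResult::NotRun;          // no shell could be started
    if (WIFSIGNALED(status)) return ProbeResult::Aborted;  // the shell itself was killed
    if (!WIFEXITED(status)) return ProbeResult::NotRun;
    const int code = WEXITSTATUS(status);
    if (code == 0) return ProbeResult::Ok;
    if (code == 127) return ProbeResult::NotRun;   // sh convention: command not found
    if (code > 128) return ProbeResult::Aborted;   // sh convention: 128 + signal number
    return ProbeResult::Failed;
}

// Run one probe command through the shell and classify how it finished.
inline ProbeResult run_probe(const std::string& command) {
    assert(!command.empty());
    return decode_status(std::system(command.c_str()));
}
```

A monitor would call run_probe once per probe (MPI present, name published, test connection works) and act on the first result that is not Ok. This sketch has no timeout, so a hung probe would still hang the monitor.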
Not_a_MPI_designer,_just_a_user_ly y'rs
/jr
---
> -Lei
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/