You might want to experiment with posting a nonblocking receive on a
special "error" tag. Then do all communication in a nonblocking manner,
but instead of waiting on just the one [normal] request, wait on both the
outstanding request *and* the "error" request.
If the error request completes, you know that some process has signaled an
error, and you therefore have a safe way to exit your loops without leaving
MPI hanging or in an error state.
That's a somewhat nebulous explanation, so here's some pseudocode:
-----
MPI_Request error_req = MPI_REQUEST_NULL;

typedef enum return_codes { ret_success, ret_error } return_codes_t;

void main_loop_function(void) {
    /* Post a zero-byte receive that completes only when some process
       sends a message on the error tag */
    MPI_Irecv(NULL, 0, MPI_INT, MPI_ANY_SOURCE, my_error_tag, comm,
              &error_req);
    /* ... */
    for (i = 0; .......) {
        /* ... */
        if (do_send(...) == ret_error) {
            /* ...error handling... */
        }
        if (do_recv(...) == ret_error) {
            /* ...error handling... */
        }
    }
}

return_codes_t do_send(...) {
    int i;
    MPI_Status status;
    MPI_Request pending[2] = { error_req, MPI_REQUEST_NULL };

    MPI_Isend(..., &pending[1]);
    /* Returns as soon as either the error receive or the send completes */
    MPI_Waitany(2, pending, &i, &status);
    if (i == 0) {
        /* originator of error is in status.MPI_SOURCE */
        /* note: the send in pending[1] is still outstanding here */
        /* re-post error tag request (if necessary) */
        MPI_Irecv(NULL, 0, MPI_INT, MPI_ANY_SOURCE, my_error_tag, comm,
                  &error_req);
        return ret_error;
    }
    return ret_success;
}

return_codes_t do_recv(...) {
    /* ...similar to do_send... */
}
-----
This is off the top of my head; you may need to play with this a bit
to make it work for you.
Hope this helps.
On Wed, 2 Jul 2003 nrycro_at_[hidden] wrote:
> This really isn't a LAM-specific problem, but I wanted to find out
> if anyone has handled errors in parallel.... I have a situation where
> one process throws an error for some reason (e.g. FPE) and jumps out of
> a loop, leaving other processes hanging in some MPI function within the
> loop. I can rescue the application from this error, so I don't wish to abort.
>
> I can't see a clear solution. There could be a global error check before
> each MPI function, which the process that had thrown the error and jumped
> out of the loop would go through as well, but this seems expensive and
> difficult, especially where there are multiple p2p comms at the same time.
>
> I've also played with the idea of probing for a message before entering
> MPI functions but this only works if the error is thrown before another
> process probes...
>
> There could be another approach in which another thread exists on each
> process... the thread would never make MPI calls but could send a signal
> to escape the hung process.... But then MPI would be left in an erroneous
> state....
>
> Has anyone experienced or tried to solve this problem....?
>
> thanks in advance,
>
> Noel.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/