
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-07-05 13:21:55


You might want to experiment with the approach of posting a non-blocking
receive on a special "error" tag. Then always do all communication in a
non-blocking manner, but instead of waiting on just the one [normal]
request, wait on both the outstanding request *and* the "error" request.

If the error request finishes, you know you've got an error, and therefore
have a safe way to exit your loops without leaving MPI hanging or in an
error state.

That's a somewhat nebulous explanation, so here's some pseudocode:

-----
MPI_Request error_req = MPI_REQUEST_NULL;
typedef enum return_codes { ret_success, ret_error } return_codes_t;

void main_loop_function(void) {
  int i;

  /* post the zero-byte "error" receive once, before the loop */
  MPI_Irecv(NULL, 0, MPI_INT, MPI_ANY_SOURCE, my_error_tag, comm,
            &error_req);

  /* ... */
  for (i = 0; .......) {
    /* ... */
    if (do_send(...) == ret_error) {
      /* ...error handling... */
    }
    if (do_recv(...) == ret_error) {
      /* ...error handling... */
    }
  }
}

return_codes_t do_send(...) {
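  int i;
  MPI_Status status;

  /* pending[0] is the shared error request; pending[1] is this send */
  MPI_Request pending[2] = { error_req, MPI_REQUEST_NULL };
  MPI_Isend(..., &pending[1]);
  MPI_Waitany(2, pending, &i, &status);
  if (i == 0) {
    /* originator of error is in status.MPI_SOURCE */
    /* note: the send in pending[1] is still outstanding here and must
       still be completed (e.g. MPI_Wait) or freed (MPI_Request_free) */
    /* re-post error tag request (if necessary) */
    MPI_Irecv(NULL, 0, MPI_INT, MPI_ANY_SOURCE, my_error_tag, comm,
              &error_req);
    return ret_error;
  }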
  return ret_success;
}

return_codes_t do_recv(...) {
  /* ...similar to do_send... */
}
-----

This is off the top of my head; you may need to play with this a bit
to make it work for you.

Hope this helps.

On Wed, 2 Jul 2003 nrycro_at_[hidden] wrote:

> This really isn't a LAM-specific problem, but I wanted to find out
> if anyone has handled errors in parallel. I have a situation where
> one process throws an error for some reason (e.g. an FPE) and jumps out
> of a loop, leaving other processes hanging in some MPI function within
> the loop. I can rescue the application from this error, so I don't wish
> to abort.
>
> I can't see a clear solution. There could be a global error check before
> each MPI function, which the process that threw the error and jumped
> out of the loop would go through as well, but this seems expensive and
> difficult, especially where there are multiple p2p comms at the same time.
>
> I've also played with the idea of probing for a message before entering
> MPI functions but this only works if the error is thrown before another
> process probes...
>
> There could be another approach in which another thread exists on each
> process. The thread would never make MPI calls but could send a signal
> to escape the hung process. But then MPI would be left in an erroneous
> state.
>
> Has anyone experienced or tried to solve this problem?
>
> Thanks in advance,
>
> Noel.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/