This isn't really a LAM-specific problem, but I wanted to find out
whether anyone has dealt with error handling in parallel. I have a situation
where one process throws an error for some reason (e.g. an FPE) and jumps out
of a loop, leaving the other processes hanging in some MPI function within the
loop. I can recover the application from this error, so I don't wish to abort.
I can't see a clear solution. There could be a global error check before
each MPI function, which the process that threw the error and jumped out of
the loop would also go through, but this seems expensive and difficult,
especially where there are multiple point-to-point communications in flight
at the same time.
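
To make the global-check idea concrete, here is a minimal sketch, simulated with Python threads standing in for MPI processes so that it runs without an MPI installation. The names run_ranks and global_error are hypothetical, and the two-barrier exchange is only a stand-in for what in real MPI code would presumably be a single MPI_Allreduce on an error flag. Instead of jumping out of the loop, the failing rank records the error locally, and every rank leaves the loop at the same step.

```python
import threading


def run_ranks(n_ranks, n_steps, failing_rank, failing_step):
    """Simulate n_ranks workers; one of them hits an error at failing_step.

    Returns, for each rank, the step at which it left the loop
    (n_steps if it ran to completion).
    """
    flags = [0] * n_ranks                  # each rank's local error flag
    barrier = threading.Barrier(n_ranks)   # stands in for the collective call
    exit_step = [None] * n_ranks

    def global_error(rank, local_err):
        # Collective check: every rank contributes its flag and all ranks
        # see the same combined result (assumed analogous to an all-reduce).
        flags[rank] = local_err
        barrier.wait()                     # all flags have been written
        result = max(flags)
        barrier.wait()                     # all ranks have read the flags
        return result

    def worker(rank):
        for step in range(n_steps):
            local_err = 0
            try:
                if rank == failing_rank and step == failing_step:
                    raise FloatingPointError("simulated FPE")
            except FloatingPointError:
                local_err = 1              # record it rather than jump out
            if global_error(rank, local_err):
                exit_step[rank] = step     # every rank leaves at this step
                return
            # ... the real point-to-point communication would go here ...
        exit_step[rank] = n_steps

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return exit_step


if __name__ == "__main__":
    print(run_ranks(4, 5, failing_rank=2, failing_step=3))  # [3, 3, 3, 3]
```

The cost I am worried about is visible here: every loop iteration pays for one extra collective, and with several point-to-point exchanges per iteration a check would be needed before each of them.
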
I've also played with the idea of probing for a message before entering
MPI functions, but this only works if the error is thrown before another
process probes; otherwise that process has already entered the blocking call.
Another approach would be to have an extra thread on each process. The
thread would never make MPI calls itself but could send a signal to break
the hung process out of its call. But then MPI would be left in an
erroneous state.
Has anyone experienced or tried to solve this problem?
Thanks in advance,
Noel.