On Wed, 16 Jul 2003, W.PAKDEE wrote:
> My MPI parallel code has a problem running on our cluster, which uses
> a copper Gigabit switch. My jobs finish with errors ("One of the
> processes started by mpirun has exited with a nonzero exit"). However,
> the same code runs without problems on the Beowulf cluster using
> Myrinet cards.
>
> - What could possibly be the cause of the problem?
> - Is it the coding (parallel algorithms)?
> - Does it have anything to do with the rates of sending and receiving
> data across processors? (Do the CPUs' speeds have to be equal?)
These types of errors can be caused by a lot of things; it's really hard
to say without knowing your application in detail. Here are a few
questions to help you look into what the problem might be:
- Does LAM print any other error messages about why the process(es) die?
- Do your applications return with exit(0) (or "return 0" from main())?
- Does your application always call MPI_FINALIZE before exiting?
- Are any corefiles generated that you can look at in the debugger to see
why it died?
- Does your application finish, or does it die before normal
termination/completion?
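As a small illustration of the first two checks (exit status and corefiles), here is a hedged shell sketch. Since the actual application isn't known, the dying rank is simulated with `sh -c 'exit 3'`; in a real run you would be checking the status of mpirun and of your own binary:

```shell
# Simulate one process exiting with a nonzero status, then perform the
# same kind of check that LAM's mpirun does on each process it started.
# (The simulated command is a stand-in; nothing here is from the post.)
ulimit -c unlimited            # allow corefiles, so crashes leave something to debug
sh -c 'exit 3'                 # stand-in for a rank that dies abnormally
status=$?
echo "exit status: $status"    # prints "exit status: 3"; any nonzero value
                               # triggers the mpirun error quoted above
# If a real run leaves a corefile, inspect where the process died with:
#   gdb ./your_app core
```

Note that a process that falls off the end of main() without "return 0" (or that calls exit() with a nonzero argument) will look exactly like a crash to mpirun, which is why the exit-value questions above matter.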
I generally have a look at *where* an application dies, which then gives
good clues as to *why* it dies.
Hope that helps.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/