On Apr 21, 2005, at 5:22 AM, Andrew.Bridgeman_at_[hidden] wrote:
> I currently have a problem when running Dyna 5434 mpp on my Redhat ES
> workstations. The .error and .out text is below. It seems to point to
> a problem with lam as it is the last thing logged in the .out file.
> Could someone please advise me on what I can do to resolve this
> issue.
>
> error log
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> out log
>
> PID 4512 failed on node n0 with exit status 38.
> LSTC network license granted
Your problem appears early in your output and points to your
application, not LAM. All the other output from LAM is a consequence of
that first failure. There is an inherent race condition between when
LAM notices that a process has died and when it cleans up the other
processes. As a result, the other processes can notice that
communication has failed before the LAM runtime system kills them,
which is why you see all the errors about MPI functions failing.
Anyway, the problem is that one of the processes (PID 4512 on node n0)
exited. Based on the rest of the output, it appears that it wasn't
supposed to exit that early. And given the non-zero exit status, it
was probably an error condition.
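
To make the "exit status" part concrete, here is a minimal Python sketch of how a launcher can observe a non-zero exit status from one of the processes it started. This is only an illustration, not how LAM's mpirun is implemented; the function name run_and_report and the two stand-in "rank" commands are hypothetical.

```python
import subprocess
import sys


def run_and_report(commands):
    """Start every command, wait for all of them, and return a list of
    (rank, pid, status) tuples for the ones that exited non-zero."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    failures = []
    for rank, proc in enumerate(procs):
        status = proc.wait()  # exit status of this child process
        if status != 0:
            failures.append((rank, proc.pid, status))
    return failures


if __name__ == "__main__":
    # Hypothetical stand-ins for MPI ranks: the second one exits with 38.
    ranks = [
        [sys.executable, "-c", "import sys; sys.exit(0)"],
        [sys.executable, "-c", "import sys; sys.exit(38)"],
    ]
    for rank, pid, status in run_and_report(ranks):
        print(f"PID {pid} (rank {rank}) exited with status {status}")
```

The status reported here is whatever the process itself passed to exit, which is why the value 38 says something about the application rather than about the launcher.
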
I would look for places where your code can exit with a non-zero exit
status before the end of the application. Since you know the exit code
is 38, perhaps that will help localize the issue. If the error always
happens on one node, it might be caused by something the MPI rank on
that node is trying to do. You might be able to use a debugger to
figure out exactly where the exit is happening.
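
If you do have access to the source, one rough way to look for candidate exit points is to search it for literal exit codes. The sketch below is an assumption-laden example: the function find_exit_sites and the two patterns it looks for (C-style exit(N) and Fortran-style STOP N) are my own illustration, and it will miss any exit whose code is held in a variable or macro.

```python
import re

# Assumed patterns: C-style exit(N) and Fortran-style STOP N with a literal N.
EXIT_PATTERN = re.compile(
    r"\b(?:exit\s*\(\s*(\d+)\s*\)|stop\s+(\d+))", re.IGNORECASE
)


def find_exit_sites(source_text, code):
    """Return (line_number, stripped_line) for each line that exits with
    the given literal code."""
    hits = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for match in EXIT_PATTERN.finditer(line):
            value = match.group(1) or match.group(2)
            if int(value) == code:
                hits.append((lineno, line.strip()))
                break  # report each line at most once
    return hits


if __name__ == "__main__":
    # Hypothetical source fragment, used only to demonstrate the search.
    sample = "if (bad) exit(38);\nexit(0);\n      STOP 38\n"
    for lineno, text in find_exit_sites(sample, 38):
        print(f"line {lineno}: {text}")
```

Each hit is only a starting point; setting a debugger breakpoint at those lines is what would confirm which one is actually reached.
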
Take a look at the LAM/MPI FAQ for more information on debugging MPI
applications:
http://www.lam-mpi.org/faq/
Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/