
LAM/MPI General User's Mailing List Archives


From: Vishal Sahay (vsahay_at_[hidden])
Date: 2004-04-19 14:57:12


Hi --

# In gmx381_new_iter9, the error message is
# Node 13: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0013
#
# Node 13: error opening file /home/jr241/gmx381_new_iter9/local/d3plot06
#
# Node 12: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0012
#
# Node 4: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0004
#

You might want to check whether read access to these files / directories is
being restricted for you in some way (missing file, wrong permissions, or a
local directory that does not exist on that node).
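
If it helps, here is a minimal sketch (not your program, just an illustration
reusing one of the paths from your log) that prints the exact OS reason when
an open fails, so "error opening file" becomes something diagnosable like
"Permission denied" or "No such file or directory":

  #include <stdio.h>
  #include <errno.h>
  #include <string.h>

  int main(void)
  {
      /* example path taken from the error messages above */
      const char *path = "/home/jr241/gmx381_new_iter9/local/dbout.0013";
      FILE *fp = fopen(path, "r");
      if (fp == NULL) {
          /* strerror(errno) tells you whether it is EACCES (permissions),
             ENOENT (missing file or directory on that node), etc. */
          fprintf(stderr, "error opening %s: %s\n", path, strerror(errno));
          return 1;
      }
      fclose(fp);
      return 0;
  }

Running something like this on the node that reports the error will tell you
whether the problem is permissions or a path that simply does not exist there.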

# MPI_Recv: process in local group is dead (rank 7, MPI_COMM_WORLD)
# MPI_Recv: process in local group is dead (rank 6, MPI_COMM_WORLD)
# MPI_Recv: process in local group is dead (rank 11, MPI_COMM_WORLD)
# MPI_Recv: process in local group is dead (rank 10, MPI_COMM_WORLD)
#
# Rank (7, MPI_COMM_WORLD): Call stack within LAM:
# Rank (7, MPI_COMM_WORLD): - MPI_Recv()
# Rank (7, MPI_COMM_WORLD): - MPI_Bcast()
# Rank (7, MPI_COMM_WORLD): - MPI_Allreduce()
# Rank (7, MPI_COMM_WORLD): - main()
#
# Rank (6, MPI_COMM_WORLD): Call stack within LAM:
# Rank (6, MPI_COMM_WORLD): - MPI_Recv()
# Rank (6, MPI_COMM_WORLD): - MPI_Bcast()
# Rank (6, MPI_COMM_WORLD): - MPI_Allreduce()
# Rank (6, MPI_COMM_WORLD): - main()
# Rank (10, MPI_COMM_WORLD): Call stack within LAM:

This happens when some MPI process in your parallel job fails and crashes,
leaving the other processes that try to contact it reporting these errors.
You might want to look at your code and figure out why those processes crash.
Common causes are memory corruption or some other invalid operation that
makes a process terminate prematurely.
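
One thing that can help narrow it down is to have MPI return error codes
instead of aborting, and check the return value of the collective that shows
up in the call stack. This is only a sketch under the assumption that your
code calls MPI_Allreduce directly (as the LAM call stack suggests); adapt it
to your actual calls:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, err, len;
      double in = 1.0, out = 0.0;
      char msg[MPI_MAX_ERROR_STRING];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Return error codes instead of the default abort-on-error behavior. */
      MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

      err = MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      if (err != MPI_SUCCESS) {
          MPI_Error_string(err, msg, &len);
          fprintf(stderr, "rank %d: MPI_Allreduce failed: %s\n", rank, msg);
      }

      MPI_Finalize();
      return 0;
  }

That way the surviving ranks report which call failed and why, instead of just
dying inside MPI_Recv when a peer has already crashed.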

-Vishal