Hi,
I could not find a definitive answer to this one by scanning the mail
archives:
Using v 7.0.2 of LAM/MPI on a Linux rack running PBS, I cannot set the
session prefix using the LAM_MPI_SESSION_PREFIX (sp) environment variable.
My workaround is to unset the PBS_JOBID environment variable (with
unsetenv) before running lamboot.
Is this the expected and desired behavior?
I can explain my use case and reasoning if needed.
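For reference, the workaround described above can be sketched roughly as
follows. This is only an illustration, not anything taken from the LAM
documentation: the helper names (env_without_pbs_jobid, build_lam_env,
boot_lam) and the example prefix path are hypothetical, and it assumes
lamboot is on the PATH and reads LAM_MPI_SESSION_PREFIX from its
environment.

```python
import os
import subprocess


def env_without_pbs_jobid(base_env=None):
    """Return a copy of the environment with PBS_JOBID removed."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("PBS_JOBID", None)  # no error if the variable is not set
    return env


def build_lam_env(session_prefix, base_env=None):
    """Environment for lamboot: PBS_JOBID dropped, session prefix set."""
    env = env_without_pbs_jobid(base_env)
    env["LAM_MPI_SESSION_PREFIX"] = session_prefix
    return env


def boot_lam(session_prefix, lamboot_args=()):
    """Run lamboot with the adjusted environment; returns its exit code."""
    env = build_lam_env(session_prefix)
    return subprocess.call(["lamboot", *lamboot_args], env=env)


if __name__ == "__main__":
    # Hypothetical prefix path, for illustration only.
    raise SystemExit(boot_lam("/tmp/my-lam-session"))
```

The point of copying the environment rather than editing os.environ is
that PBS_JOBID stays visible to the rest of the job script.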
Thanks,
-rich
-----Original Message-----
From: Vishal Sahay [mailto:vsahay_at_[hidden]]
Sent: Monday, April 19, 2004 1:57 PM
To: General LAM/MPI mailing list
Subject: RE: LAM: Help: MPI_Irecv and pthreads.
Hi --
# In gmx381_new_iter9, the error message is
# Node 13: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0013
#
# Node 13: error opening file /home/jr241/gmx381_new_iter9/local/d3plot06
#
# Node 12: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0012
#
# Node 4: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0004
#
You might want to check whether your read access to these files or
their directory is being restricted in some way.
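One quick way to check is a small script along these lines, run as the
same user on the node that reports the error. It is only a sketch: the
helper names (check_access, report) are made up for illustration, and it
also reports write access on the assumption that the job may be trying
to create these files rather than read them.

```python
import os


def check_access(path):
    """Report whether the current user can see, read and write `path`."""
    return {
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


def report(directory, filenames):
    """Check the directory itself and each named file inside it."""
    results = {directory: check_access(directory)}
    for name in filenames:
        full = os.path.join(directory, name)
        results[full] = check_access(full)
    return results


if __name__ == "__main__":
    # Paths taken from the error messages quoted above.
    local_dir = "/home/jr241/gmx381_new_iter9/local"
    for path, status in report(local_dir, ["dbout.0013", "d3plot06"]).items():
        print(path, status)
```

If the directory itself shows up as missing or not writable on some
nodes only, that would point at a per-node filesystem or permission
difference rather than at the program.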
# MPI_Recv: process in local group is dead (rank 7, MPI_COMM_WORLD)
# MPI_Recv: process in local group is dead (rank 6, MPI_COMM_WORLD)
# MPI_Recv: process in local group is dead (rank 11, MPI_COMM_WORLD)
# MPI_Recv: process in local group is dead (rank 10, MPI_COMM_WORLD)
#
# Rank (7, MPI_COMM_WORLD): Call stack within LAM:
# Rank (7, MPI_COMM_WORLD): - MPI_Recv()
# Rank (7, MPI_COMM_WORLD): - MPI_Bcast()
# Rank (7, MPI_COMM_WORLD): - MPI_Allreduce()
# Rank (7, MPI_COMM_WORLD): - main()
#
# Rank (6, MPI_COMM_WORLD): Call stack within LAM:
# Rank (6, MPI_COMM_WORLD): - MPI_Recv()
# Rank (6, MPI_COMM_WORLD): - MPI_Bcast()
# Rank (6, MPI_COMM_WORLD): - MPI_Allreduce()
# Rank (6, MPI_COMM_WORLD): - main()
# Rank (10, MPI_COMM_WORLD): Call stack within LAM:
This happens when some MPI process in your parallel job fails and
crashes; the other processes that then try to contact it throw these
errors. You might want to review your code and figure out why these
processes crash. Possible reasons include a memory error or some
invalid operation causing premature termination.
-Vishal
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/