Hi Rich,
This may be more empathetic than helpful...
I could not get LAM_MPI_SESSION_PREFIX to work
if TMPDIR was set differently on different nodes
in my cluster; that could be your problem.
I ended up repackaging the RPM with the
following changes so that TMPDIR is ignored:
share/etc/kill.c:
/*
 * Does not work if TMPDIR is different on each node:
 *   } else if ((tmp = getenv("TMPDIR")) != NULL) {
 *           tmpprefix = strdup(tmp);
 */
config/ltmain.sh:
# Problem with different TMPDIRs on each node.
# test -n "$TMPDIR" && tmpdir="$TMPDIR"
share/totalview/config/ltmain.sh:
# Same problem when TMPDIR differs on each node.
# test -n "$TMPDIR" && tmpdir="$TMPDIR"
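For context, the else-if I commented out in kill.c sits inside the code
that picks the session prefix. My (possibly faulty) recollection is that
the order is LAM_MPI_SESSION_PREFIX, then TMPDIR, then /tmp, roughly like
the sketch below. This is reconstructed from memory, not the actual LAM
source, so treat it as a sketch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the prefix selection as I remember it (not the real LAM code). */
static char *pick_tmpprefix(void)
{
    char *tmp;

    if ((tmp = getenv("LAM_MPI_SESSION_PREFIX")) != NULL)
        return strdup(tmp);    /* explicit session prefix wins */
    if ((tmp = getenv("TMPDIR")) != NULL)
        return strdup(tmp);    /* this is the branch I commented out above */
    return strdup("/tmp");     /* default */
}

int main(void)
{
    printf("session prefix would be: %s\n", pick_tmpprefix());
    return 0;
}

With that ordering, a TMPDIR that differs per node means each lamd can end
up under a different prefix, which matches the failures I saw; once TMPDIR
was ignored, everything lined up for me.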
I never figured out the exact cause of my problem,
and I never reported it (I did not want to be a whiner)...
I just tracked the error down to TMPDIR
and hit it with a rock ;-)
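For what it's worth, your unsetenv workaround sounds sane to me. Something
along these lines (csh syntax since you mention unsetenv; not tested, and
the host file and binary names are just placeholders) is roughly what I
would put in the PBS script:

# Guess at a PBS job-script fragment -- adjust paths to taste.
unsetenv PBS_JOBID                   # keep the job id out of the session directory name
setenv LAM_MPI_SESSION_PREFIX /tmp   # any prefix that exists on every node
lamboot -v $HOME/hostfile            # placeholder host file
mpirun C ./my_app                    # placeholder binary; C = all CPUs
lamhalt

If I remember right, lamboot folds PBS_JOBID into the session directory
name when it is set, which would explain why unsetting it changes the
behavior -- but I never confirmed that, hence the rock.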
Regards,
Joe Griffin
Drake, Richard R wrote:
>Hi,
>
>I could not find the definitive answer to this one by scanning the mail
>archives:
>
>Using v 7.0.2 of LAM/MPI on a Linux rack running PBS, I cannot set the
>session prefix using the LAM_MPI_SESSION_PREFIX (sp) environment variable.
>My workaround is to unsetenv the PBS_JOBID environment variable before
>running lamboot.
>
>Is this the expected and desired behavior?
>
>I can explain my use case and reasoning if needed.
>
>Thanks,
>
>-rich
>
>
>-----Original Message-----
>From: Vishal Sahay [mailto:vsahay_at_[hidden]]
>Sent: Monday, April 19, 2004 1:57 PM
>To: General LAM/MPI mailing list
>Subject: RE: LAM: Help: MPI_Irecv and pthreads.
>
>
>Hi --
>
>
># In gmx381_new_iter9, the error message is
># Node 13: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0013
>#
># Node 13: error opening file /home/jr241/gmx381_new_iter9/local/d3plot06
>#
># Node 12: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0012
>#
># Node 4: error opening file /home/jr241/gmx381_new_iter9/local/dbout.0004
>#
>
>You might want to check whether your read access to these files / directories
>is being restricted in some way.
>
># MPI_Recv: process in local group is dead (rank 7, MPI_COMM_WORLD)
># MPI_Recv: process in local group is dead (rank 6, MPI_COMM_WORLD)
># MPI_Recv: process in local group is dead (rank 11, MPI_COMM_WORLD)
># MPI_Recv: process in local group is dead (rank 10, MPI_COMM_WORLD)
>#
># Rank (7, MPI_COMM_WORLD): Call stack within LAM:
># Rank (7, MPI_COMM_WORLD): - MPI_Recv()
># Rank (7, MPI_COMM_WORLD): - MPI_Bcast()
># Rank (7, MPI_COMM_WORLD): - MPI_Allreduce()
># Rank (7, MPI_COMM_WORLD): - main()
>#
># Rank (6, MPI_COMM_WORLD): Call stack within LAM:
># Rank (6, MPI_COMM_WORLD): - MPI_Recv()
># Rank (6, MPI_COMM_WORLD): - MPI_Bcast()
># Rank (6, MPI_COMM_WORLD): - MPI_Allreduce()
># Rank (6, MPI_COMM_WORLD): - main()
># Rank (10, MPI_COMM_WORLD): Call stack within LAM:
>
>
>This happens when some MPI process in your parallel job has failed and
>crashed, so the other processes that try to contact it end up throwing
>these errors. You might want to look at your code and figure out why those
>processes crash. Common causes are memory corruption or some other invalid
>operation leading to premature termination.
>
>-Vishal
>_______________________________________________
>This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>