Hi folks,
I have been trying to figure this problem out for a couple of days now
and have given myself a headache, so I
would appreciate some guidance.
<<lam712-doc.tar.gz>>
The LAM release is 7.1.2. Systems are RHEL WSR3.
There are many successful examples of the original application as well
as the test one, running on different node sets.
The problem scenario is 360 processes on 294 nodes. The test
application is a slight variant of the lam/examples/hello/hello.c
program. (As you might guess, this was not the original application).
lamboot is successful.
The mpirun specifying "-ssi rpi lamd" fails, issuing:
MPI_Recv: message truncated: Input/output error (rank 0, comm3) .
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Gatherv()
Rank (0, MPI_COMM_WORLD): - MPI_Init()
Rank (0, MPI_COMM_WORLD): - main()
No output from the application itself appears, which suggests to me that
MPI_Init never completes on rank 0.
Output captured from the run is in the mpirun_lamd.log.t1b file.
The mpirun specifying "-ssi rpi tcp" succeeds, without an intervening
lamboot.
Is it possible that the lamd on the master node is somehow being
overloaded?
The attached gzipped tar file contains:
config.log -
laminfo.out - output from a laminfo command run on the master node.
bootfile_t1 - the bootfile used
lamnodes.out - output from a lamnodes command run on the master node
after the lamboot
hello.c - source code of the test application
run.sh - file containing the commands used to set the environment
and run the test; it shows the parameters for lamboot and mpirun.
mpirun_lamd.log.t1b - output from the mpirun command specifying "-ssi
rpi lamd"
mpirun_tcp.log.t1b - output from the mpirun command specifying "-ssi rpi
tcp"
Any suggestion on how to capture more information to identify the real
problem would be appreciated.
Regards,
Mac McCalla
Geoscience Systems
Hess Corporation
500 Dallas St. , Houston, Texas 77002