- Next message: W.PAKDEE: "LAM: LAM and MPICH"
- Previous message: Gurgul, Dennis J.: "RE: LAM: lamboot problem !"
- Next in thread: Jeff Squyres: "Re: LAM: error msgs ---Pbind , YPBindProc_domain, and mpirun: cannot start"
- Reply: Jeff Squyres: "Re: LAM: error msgs ---Pbind , YPBindProc_domain, and mpirun: cannot start"
- Maybe reply: McCalla, Mac: "RE: LAM: error msgs ---Pbind , YPBindProc_domain, and mpirun: cannot start"
- Maybe reply: McCalla, Mac: "FW: LAM: error msgs ---Pbind , YPBindProc_domain, and mpirun: cannot start"
Hello,
Our LAM environment is version 6.5.9. I have a user running a LAM/MPI job
on a 40-node Beowulf cluster. Today his job has encountered 4 failures,
as follows:
At 10:58 ---> "lamboot -v -s lamboot.mpi" command received
PBIND-------------------------------------
LAM failed to execute a LAM binary on the remote node beo132.
(lamhalt was then executed successfully.)
Job was restarted by the user at step 1 (lamboot).
At 12:00 step 2 ---> "mpirun -lamd -w -pty -O -x
NJS_WORKDIR,NJS_STEPNAME xa=-40_migrate.apps" command received
YPBINDPROC_DOMAIN: Domain not bound
Job was restarted by the user at step 2.
At 12:33 step 29 ---> "mpirun -lamd -w -pty -O -x
NJS_WORKDIR,NJS_STEPNAME xa=-35_image_reduce.apps" command received
mpirun: cannot start /u/morton/fxmig/test/run_dv_reduce_bcast
on n35: invalid address tag
lamhalt was executed successfully.
lamboot was executed manually by the user.
Job was restarted by the user at step 29.
At 12:47 step 84 ---> "mpirun -lamd -w -pty -O -x
NJS_WORKDIR,NJS_STEPNAME xa=-24_image_reduce.apps" command received
mpirun: cannot start /u/morton/fxmig/test/run_dv_reduce_bcast
on n3: invalid address tag
It is not clear whether lamhalt was executed.
lamboot was executed manually by the user.
Job was restarted by the user at step 84.
At 13:52 step 109 ---> "mpirun -lamd -w -pty -O -x
NJS_WORKDIR,NJS_STEPNAME xa=-19_image_reduce.apps" command received
mpirun: cannot start /u/morton/fxmig/test/run_dv_reduce_bcast
on n35: invalid address tag
lamhalt was executed.
lamboot was executed manually by the user.
Job was restarted by the user at step 109.
=========================================================
Are the PBIND and YPBINDPROC errors related? Could they be related to
the invalid address tag messages? What do the invalid address tag
messages mean? Is there some diagnostic technique we could use to trap
what causes the invalid address tag message?
Thanks for your time.
Mac McCalla
Amerada Hess Corp.
Houston, TX