LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-07-24 06:04:43


On Wed, 23 Jul 2003, McCalla, Mac wrote:

> Our LAM environment is 6.5.9. I have a user running a LAM/MPI job on a
> 40-node Beowulf cluster. Today his job encountered 4 failures, as
> follows:
>
> at 10:58 ....."lamboot -v -s lamboot.mpi" command received
> PBIND-------------------------------------
> LAM failed to execute a LAM binary on the remote node beo132.
> NJS_WORKDIR,NJS_STEPNAME xa=-40_migrate.apps" command received
> YPBINDPROC_DOMAIN: Domain not bound
>
> At 12:33 step 29 ---> " mpirun -lamd -w -pty -O -x
> NJS_WORKDIR,NJS_STEPNAME xa=-35_image_reduce.apps"
> Command received
> Mpirun: cannot start /u/morton/fxmig/test/run_dv_reduce_bcast
> on n35: invalid address tag
> [snipped 2 more of the same error]
>
> Are the PBIND and YPBINDPROC errors related? Could they be related

I'm not entirely clear what you're showing us in the output above.
The only output that looks like it's coming from LAM is the "invalid
address tag" error.

Specifically, the "PBIND" and "YPBINDPROC_DOMAIN" errors were not
printed by LAM. The "LAM failed to execute a LAM binary..." message
doesn't look familiar, either.

> to the invalid address tag Messages? What do the invalid address
> tag messages mean? Is there some diagnostic technique we could use
> to trap what causes the invalid address tag message?

The invalid address tag is an odd one -- it's actually a lamd error
indicating that there was some kind of problem in the LAM session
directory.

I don't think the input and output that you sent are complete.
For example, the mpirun commands that you listed don't seem to include
an MPI executable name:

   mpirun -lamd -w -pty -O -x NJS_WORKDIR,NJS_STEPNAME \
       xa=-35_image_reduce.apps

In this command, is "xa=-35_image_reduce.apps" the user's executable?
I'm confused because some of the error messages you reported said that
"run_dv_reduce_bcast" could not be run.
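
To make the question concrete, here is a rough Python sketch that splits
the command above into options and leftover tokens. It assumes, purely for
illustration, that -x is the only option here that takes a value; the
helper name positional_args is made up for this example and is not part
of LAM.

```python
import shlex

# Assumption for illustration: of the options in this command, only -x
# consumes the following token as its value.
OPTIONS_WITH_VALUE = {"-x"}


def positional_args(command):
    """Return the tokens of an mpirun command line that are not options."""
    tokens = shlex.split(command)[1:]  # drop the leading "mpirun"
    positional = []
    skip_next = False
    for token in tokens:
        if skip_next:
            # This token is the value of the preceding option.
            skip_next = False
        elif token in OPTIONS_WITH_VALUE:
            skip_next = True
        elif not token.startswith("-"):
            positional.append(token)
    return positional


command = (
    "mpirun -lamd -w -pty -O -x NJS_WORKDIR,NJS_STEPNAME "
    "xa=-35_image_reduce.apps"
)
print(positional_args(command))
```

Under that assumption, the only token left over is
"xa=-35_image_reduce.apps", which is why I'm asking whether that is the
program (or application schema) being launched.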

Is the user's job script running mpirun a large number of times?

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/