On Fri, 5 Dec 2003, Maria Barnes wrote:
> Once in a while we get this strange error from mpirun command when
> running our application (/ldas/bin/wrapperAPI):
>
> mpirun: cannot start /ldas/bin/wrapperAPI on n0 (o): Success
To understand what's happening here, let me give a brief explanation of
how mpirun launches processes...
- mpirun sends a message to the lamd on the target node saying "please
launch the foo process"
- the lamd on that node then fork/exec's foo
- the lamd returns a message back to mpirun indicating the status of the
launch -- whether it was successful or not. In the case of failure,
the lamd's errno is included in the message.
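
The steps above can be sketched roughly as follows. This is only an illustration of the fork/exec-and-report-errno pattern, not LAM's actual code: the function name `launch` and the pipe used to carry the errno back are assumptions made for the example (the real lamd replies to mpirun with a message rather than a pipe).

```python
import errno
import os
import struct

def launch(path, argv):
    """Fork/exec `path` and return (pid, err): err is 0 if the exec
    succeeded, otherwise the errno from the failed exec.
    Hypothetical helper, not part of LAM."""
    # Python creates pipe fds non-inheritable (close-on-exec), so a
    # successful exec closes the write end and the parent reads EOF.
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: try to become the target program.
        try:
            os.close(r)
            os.execv(path, argv)
        except OSError as exc:
            # Never report 0 for a failed launch; fall back to EIO.
            code = exc.errno or errno.EIO
            os.write(w, struct.pack("i", code))
        finally:
            os._exit(127)
    # Parent: an empty read means the exec succeeded.
    os.close(w)
    data = os.read(r, struct.calcsize("i"))
    os.close(r)
    if data:
        os.waitpid(pid, 0)  # reap the child that failed to exec
        return pid, struct.unpack("i", data)[0]
    return pid, 0
```

On success the caller is still responsible for waiting on the returned pid.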
So I think what's happening here is that the lamd failed to launch the
process, but somehow either it didn't report the error properly, the OS
didn't report the error properly, or mpirun is displaying the error
improperly (hence, the "Success" message).
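
To illustrate where a "Success" string can come from: if a failure is reported with an errno of 0, formatting that errno yields the C library's text for "no error". The sketch below assumes that is what happens here; the function name and the message layout are made up for the example and only loosely modeled on the error quoted above.

```python
import os

def format_launch_error(prog, node, err):
    """Build an mpirun-style failure message from an errno value.
    Hypothetical helper; the layout is only modeled on the report."""
    return f"mpirun: cannot start {prog} on {node}: {os.strerror(err)}"

# A failure reported with errno 0 prints the "no error" text, which on
# Linux with glibc is the string "Success".
print(format_launch_error("/ldas/bin/wrapperAPI", "n0", 0))
```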
Can you verify under what conditions this occurs? I'm assuming that
/ldas/bin/wrapperAPI exists, is executable, etc.
IIRC, I've seen LAM do similar things (albeit not with a "Success"
message) when NFS is acting wonky. For example, I've seen NFS race
conditions where the executable is available on one node but not another.
Could this be happening here?
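
One way to check for this is to run a small script on every node at around the time the failure occurs and compare the output. The following is a minimal sketch; the function names are invented for the example, and how you run it on each node (ssh or otherwise) is up to you.

```python
import os
import socket
import stat

def check_executable(path):
    """Return (ok, reason) describing whether `path` is visible and
    executable from this node. Hypothetical helper."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return False, f"stat failed: {os.strerror(exc.errno)}"
    if not stat.S_ISREG(st.st_mode):
        return False, "not a regular file"
    if not os.access(path, os.X_OK):
        return False, "not executable by this user"
    return True, "ok"

def report(path):
    """One line per node, so output from several nodes can be compared."""
    ok, reason = check_executable(path)
    return f"{socket.gethostname()}: {path}: {reason}"
```

For example, `print(report("/ldas/bin/wrapperAPI"))` on each node would show whether any of them sees the file differently.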
> We used to see this error message more frequently when using older
> version of lam (5.x.x or 6.x.x - can't remember which one exactly), but
> it went away once we switched to the newer one. Few weeks ago we
> installed lam-7.0.3, and the error message is back. Could you, please,
> pin point to what could be the problem?
I don't think that we have changed this code significantly in quite a
while; I'm somewhat surprised that your problem went away and then
mysteriously re-appeared.
The problem will be in the kenyad in the lamd (the lamd is comprised of a
bunch of "pseudo-daemons"; the kenyad is the process control
pseudo-daemon). This is the entity that will be catching the error code
and sending it back to mpirun. If we can't track this down, I may ask you
to attach a debugger to your local lamd and see if we can figure out what
lamd thinks the problem is.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/