On Dec 9, 2005, at 3:43 PM, Dilani Perera wrote:
> 20071 out running on n15
> --------------------------------------------------------------------------
> It seems that rank 15 was not able to open a TCP client socket for
> some reason. LAM is likely to abort your program shortly. :-(
>
> Perhaps this unix error message will help:
>
> Unix errno: 111
> Connection refused
> -------------------------------------------------------------------------
I think most of the output here is a red herring.
> nuthatch% ---------------------------------------
> <
> /bin/ksh: -------------------------------------------------------------------------: not found
> nuthatch% MPI_Recv: process in local group is dead (rank 4, MPI_COMM_WORLD)
> /bin/ksh: syntax error: `(' unexpected
This message looks quite fishy to me. Why is ksh reporting an error here?
Can you verify that you're running the same application on all
nodes? It looks like mpirun found a shell script on at least some
nodes and tried to treat it like an MPI application (that's a wild
guess). Can you try running with the absolute path name of your
executable? (I'm assuming you have a shared filesystem visible to
all nodes.)
Something like:
mpirun -np 16 `pwd`/my_application
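If you want to test the "shell script instead of an MPI executable" theory directly, here is a rough sketch. classify_exe is a hypothetical helper (not part of LAM/MPI or Open MPI); it only reads the first four bytes of a file with od and prints "elf" for a compiled ELF binary (magic bytes 7f 45 4c 46), "script" for a file starting with "#!", and "unknown" otherwise. It assumes your nodes use ELF binaries (as Linux and Solaris do, for example), and you would run it on each node against the same absolute path you pass to mpirun.

```shell
# Hypothetical helper (not part of LAM/MPI or Open MPI): guess whether a
# file is a compiled ELF executable or an interpreted script.
classify_exe() {
    f=$1
    if [ ! -f "$f" ]; then
        echo "missing"
        return 1
    fi
    # Dump the first 4 bytes as hex, e.g. "7f454c46" for an ELF binary.
    magic=$(od -An -tx1 -N4 "$f" | tr -d ' \n')
    case $magic in
        7f454c46) echo "elf" ;;      # ELF magic: 0x7f 'E' 'L' 'F'
        2321*)    echo "script" ;;   # starts with "#!"
        *)        echo "unknown" ;;  # includes scripts with no "#!" line
    esac
}
```

Running classify_exe on `pwd`/my_application on every node should print "elf" everywhere; "script", "unknown", or "missing" on any node would point at the kind of mismatch described above.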
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/