LAM/MPI General User's Mailing List Archives

From: pjod1 (pjod1_at_[hidden])
Date: 2004-07-30 17:49:14


Hi,
   I've run into the following problem running LAM (7.0.6) with IMPI, and I
was wondering whether anyone has any ideas about what could be causing it.

I plan to use LAM (and its IMPI support) to run MPI jobs across 2 clusters. So I
downloaded LAM and installed it, and it's working fine. I also downloaded the
IMPI server from http://www.osl.iu.edu/research/impi/ and installed it as
well.

So for my first experiment I ran a simple hello world program on the following
setup:
- Two (LAM) clients connect to an IMPI server (each client has just one
machine), and this works fine.

But when I try to increase the number of machines beyond one for any of the
clients, the client just hangs.
From debugging through the code, the client seems to be hanging in the MPI_Recv
function. I searched the LAM/MPI user list to see whether the problem had
been discussed before, and I found a posting on exactly the same problem
(http://www.lam-mpi.org/MailArchives/lam/msg07560.php), but no reply was
posted.

Here is the command I run (having already run the lamboot command):

mpirun -client 0 143.239.22.116:9000 C ../../lam-7.0.6/examples/hello/hello -v

It just hangs, and when I press Ctrl-C I get the following output:

///////////////////////////////////////////////////////////////////
MPI_Recv: process in local group is dead (rank 2, comm 9)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Bcast()
Rank (0, MPI_COMM_WORLD): - MPI_Allreduce()
Rank (0, MPI_COMM_WORLD): - MPI_Comm_split()
Rank (0, MPI_COMM_WORLD): - MPI_Intercomm_merge()
Rank (0, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 28419 failed on node n0 (143.239.22.104) due to signal 15.
-----------------------------------------------------------------------------
/////////////////////////////////////////////////////////////////////

Thanks very much for your help.