LAM/MPI General User's Mailing List Archives

From: Karl Hahn (hahnk_at_[hidden])
Date: 2004-08-03 04:44:22


Hi again,

I have problems using Infiniband. This is my first
encounter with Infiniband, so I only know that it
is a kind of 'fast network' :-)

I have compiled a small program which I can run
with the tcp and lamd RPIs. With the ib RPI I get errors.

1) There are four lamnodes on a quad CPU machine.
So Infiniband is not actually used(?). But it should
still be possible to run the program. Is this right?

2) Can I find detailed error messages of LAM
anywhere?

3) Can someone recommend a kind of 'Infiniband
tutorial for LAM'? The cluster is connected with
Ethernet and Infiniband. Which one is used? How
can I control this?

4) It _could_ be that I have mixed two versions
of LAM (6.5.4(!) is installed, but I am using a
7.1 beta). I tried to avoid problems by setting a
new $PATH and $LD_LIBRARY_PATH, and by using
-I/xxx in the Makefile. But I am not sure ...
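
One way to check for the kind of mix-up described in 4) is to see which
directory each LAM command is resolved from under the current $PATH.
A minimal sketch, assuming a POSIX shell; the function name lam_tool_dirs
is made up for illustration, and laminfo is only expected to exist in
LAM 7.x, so "(not found)" for it may itself be a hint:

```shell
# Print, for each named tool, the directory it is resolved from via $PATH.
# Tools that cannot be found are reported as "(not found)".
lam_tool_dirs() {
    for tool in "$@"; do
        # "|| true" keeps the lookup from aborting a script run with "set -e".
        path=$(command -v "$tool" 2>/dev/null || true)
        if [ -n "$path" ]; then
            printf '%s\t%s\n' "$tool" "$(dirname "$path")"
        else
            printf '%s\t(not found)\n' "$tool"
        fi
    done
}

# The tool list is an assumption; adjust it to the commands actually used.
lam_tool_dirs lamboot mpirun mpicc laminfo
```

If the commands resolve from different directories, or some of them from
the 6.5.4 tree, the two installations are probably being mixed.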

Sorry for the beginners' questions!
Charlie

And here is the output of my program:

$ lamboot

LAM 7.1b13/MPI 2 C++/ROMIO - Indiana University

$ lamnodes
n0 node:4:origin,this_node

$ mpirun -c 4 -ssi rpi tcp ./stresstest

Try to initialize cluster...
  running 4 processes ...
  checking the LAM nodes ...
n0 node:4:origin,this_node

Broadcasting message with size 100MB to 3 slaves.
Broadcast done.
Gathering data.
Gather done.

$ mpirun -c 4 -ssi rpi ib ./stresstest
-----------------------------------------------------------------------------
An erroneous completion was generated while polling for the Infiniband
completion queue

The exact error string returned by Infiniband API is as follows:

-----------------------------------------------------------------------------
An erroneous completion was generated while polling for the Infiniband
completion queue

The exact error string returned by Infiniband API is as follows:

"Operation Completed Successfully"
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
An erroneous completion was generated while polling for the Infiniband
completion queue

The exact error string returned by Infiniband API is as follows:

"Operation Completed Successfully"
-----------------------------------------------------------------------------
"Operation Completed Successfully"
-----------------------------------------------------------------------------
MPI_Recv: internal MPI error: Invalid argument (rank 0, MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Init()
Rank (0, MPI_COMM_WORLD): - main()
MPI_Send: internal MPI error: Invalid argument (rank 3, MPI_COMM_WORLD)
Rank (3, MPI_COMM_WORLD): Call stack within LAM:
Rank (3, MPI_COMM_WORLD): - MPI_Send()
Rank (3, MPI_COMM_WORLD): - MPI_Init()
Rank (3, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
An erroneous completion was generated while polling for the Infiniband
completion queue

The exact error string returned by Infiniband API is as follows:

"Operation Completed Successfully"
-----------------------------------------------------------------------------
MPI_Send: internal MPI error: Invalid argument (rank 1, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Send()
Rank (1, MPI_COMM_WORLD): - MPI_Init()
Rank (1, MPI_COMM_WORLD): - main()
MPI_Send: internal MPI error: Invalid argument (rank 2, MPI_COMM_WORLD)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Send()
Rank (2, MPI_COMM_WORLD): - MPI_Init()
Rank (2, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 18491 failed on node n0 (x.x.x.x) with exit status 1.
-----------------------------------------------------------------------------