Hi again,
I have problems using Infiniband. This is my first
encounter with Infiniband, so I only know that it
is a kind of 'fast network' :-)
I have compiled a small program which runs fine
with the tcp and lamd RPIs. With the ib RPI I get errors.
1) There are four lamnodes on a quad CPU machine.
So Infiniband is not actually used(?). But it should
still be possible to run the program. Is this right?
2) Can I find detailed error messages of LAM
anywhere?
3) Can someone recommend a kind of 'Infiniband
tutorial for LAM'? The cluster is connected with
Ethernet and Infiniband. Which one is used? How
can I control this?
4) It _could_ be possible that I have mixed
two versions of LAM (the installed one is 6.5.4(!),
I use a 7.1 beta). I tried to avoid problems
by setting a new $PATH and $LD_LIBRARY_PATH, and
by using -I/xxx in the Makefile. But I am not sure ...
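To rule out question 4, a check along these lines might help. It is only a sketch: the helper name under_prefix and the prefix /opt/lam-7.1 are my own placeholders (not the real install path), and it only tests which copy of each LAM tool comes first in $PATH.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first "$1" found on PATH lives under
# the directory "$2". Returns 2 if the command is not found at all.
under_prefix() {
    cmd_path=$(command -v "$1" 2>/dev/null) || return 2
    case "$cmd_path" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Assumption: the 7.1 beta lives under this prefix; override via LAM_PREFIX.
LAM_PREFIX=${LAM_PREFIX:-/opt/lam-7.1}

# Check the tools that are used to boot, build and launch.
for tool in lamboot mpirun mpicc laminfo; do
    if under_prefix "$tool" "$LAM_PREFIX"; then
        echo "$tool: OK ($(command -v "$tool"))"
    else
        echo "$tool: NOT under $LAM_PREFIX ($(command -v "$tool" 2>/dev/null || echo 'not found'))"
    fi
done
```

If any tool is reported as not under the prefix, the old 6.5.4 binaries are probably still being picked up. As far as I understand, laminfo from the 7.x series should also print the version and the rpi modules that were compiled in, which would show whether ib is available at all.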
Sorry for the beginners' questions!
Charlie
And here is the output of my program:
$ lamboot
LAM 7.1b13/MPI 2 C++/ROMIO - Indiana University
$ lamnodes
n0 node:4:origin,this_node
$ mpirun -c 4 -ssi rpi tcp ./stresstest
Try to initialize cluster...
running 4 processes ...
checking the LAM nodes ...
n0 node:4:origin,this_node
Broadcasting message with size 100MB to 3 slaves.
Broadcast done.
Gathering data.
Gather done.
$ mpirun -c 4 -ssi rpi ib ./stresstest
-----------------------------------------------------------------------------
An erroneous completion was generated while polling for the Infiniband
completion queue
The exact error string returned by Infiniband API is as follows:
-----------------------------------------------------------------------------
An erroneous completion was generated while polling for the Infiniband
completion queue
The exact error string returned by Infiniband API is as follows:
"Operation Completed Successfully"
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
An erroneous completion was generated while polling for the Infiniband
completion queue
The exact error string returned by Infiniband API is as follows:
"Operation Completed Successfully"
-----------------------------------------------------------------------------
"Operation Completed Successfully"
-----------------------------------------------------------------------------
MPI_Recv: internal MPI error: Invalid argument (rank 0, MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Recv()
Rank (0, MPI_COMM_WORLD): - MPI_Init()
Rank (0, MPI_COMM_WORLD): - main()
MPI_Send: internal MPI error: Invalid argument (rank 3, MPI_COMM_WORLD)
Rank (3, MPI_COMM_WORLD): Call stack within LAM:
Rank (3, MPI_COMM_WORLD): - MPI_Send()
Rank (3, MPI_COMM_WORLD): - MPI_Init()
Rank (3, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
An erroneous completion was generated while polling for the Infiniband
completion queue
The exact error string returned by Infiniband API is as follows:
"Operation Completed Successfully"
-----------------------------------------------------------------------------
MPI_Send: internal MPI error: Invalid argument (rank 1, MPI_COMM_WORLD)
Rank (1, MPI_COMM_WORLD): Call stack within LAM:
Rank (1, MPI_COMM_WORLD): - MPI_Send()
Rank (1, MPI_COMM_WORLD): - MPI_Init()
Rank (1, MPI_COMM_WORLD): - main()
MPI_Send: internal MPI error: Invalid argument (rank 2, MPI_COMM_WORLD)
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Send()
Rank (2, MPI_COMM_WORLD): - MPI_Init()
Rank (2, MPI_COMM_WORLD): - main()
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 18491 failed on node n0 (x.x.x.x) with exit status 1.
-----------------------------------------------------------------------------