LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-08-04 16:32:09


Oops! My fault for not paying attention to the fact that Vishal
committed some IB changes and forgetting to roll a new beta. 7.1b14 is
now out (http://www.lam-mpi.org/beta/) that has these changes. It's
identical to last night's snapshot (except for the version number, of
course), so if you already grabbed that, you don't need to get b14.

We would love to get some external feedback on our ib RPI module.
Please let us know how it goes.
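
On the worry in the quoted message below about mixing the installed
6.5.4 with a private 7.1 beta: here is a minimal sketch of a check that
the LAM tools all resolve to the same directory on $PATH. It is not
part of LAM, and the helper name check_same_dir is made up for this
example.

```shell
# Print the directory each named tool resolves to on $PATH.
# Returns nonzero if a tool is missing or the directories differ.
# (check_same_dir is a hypothetical helper, not a LAM command.)
check_same_dir() {
    first=""
    rc=0
    for tool in "$@"; do
        resolved=$(command -v "$tool") || {
            echo "$tool: not found on PATH"
            rc=1
            continue
        }
        dir=$(dirname "$resolved")
        echo "$tool -> $dir"
        if [ -z "$first" ]; then
            first=$dir
        elif [ "$dir" != "$first" ]; then
            rc=1
        fi
    done
    return $rc
}
```

For example, "check_same_dir mpicc mpirun lamboot && mpicc -showme"
first confirms the three tools come from one place and then prints the
command line the wrapper compiler would use.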

On Aug 4, 2004, at 12:00 PM, Vishal Sahay wrote:

> Hi --
>
> Can you try out the Subversion (repository) version of LAM or the LAM
> nightly tarball from http://www.lam-mpi.org/svn/ and see if you still
> get the same problem? There were a few fixes that went in there.
>
> Also, the LAM docs under <your/LAM/source/dir>/doc describe how to use
> Infiniband with LAM.
>
> When you are using -ssi rpi ib, even for the four lamnodes on the same
> machine (4 CPU), Infiniband will be used.
>
> Just a note: when you have your own private installation of LAM, you do
> not need to pass "-I/xxx" explicitly. The include path is already built
> into the mpicc/mpiCC wrapper compilers. You can verify that with the
> "-showme" option to the wrapper compilers, e.g.: mpicc -showme
>
> Thanks!
> -Vishal
>
>
> On Tue, 3 Aug 2004, Karl Hahn wrote:
>
> # Hi again,
> #
> # I have problems using Infiniband. This is my first
> # encounter with Infiniband, so I only know that it
> # is a kind of 'fast network' :-)
> #
> # I have compiled a small program which I can run
> # with tcp and lamd rpi. With rpi ib I get errors.
> #
> # 1) There are four lamnodes on a quad CPU machine.
> # So Infiniband is not actually used(?). But it should
> # be possible to run the program. Is this right?
> #
> # 2) Can I find detailed error messages of LAM
> # anywhere?
> #
> # 3) Can someone recommend a kind of 'Infiniband
> # tutorial for LAM'? The cluster is connected with
> # Ethernet and Infiniband. Which one is used? How
> # can I control this?
> #
> # 4) It _could_ be possible that I have mixed
> # two versions of LAM (installed is a 6.5.4(!),
> # I use a 7.1.beta). I tried to avoid problems
> # using a new $PATH, $LD_LIBRARY_PATH, and
> # -I/xxx in the Makefile. But I am not sure ...
> #
> # Sorry for the beginners' questions!
> # Charlie
> #
> #
> # And here is the output of my program:
> #
> # $ lamboot
> #
> # LAM 7.1b13/MPI 2 C++/ROMIO - Indiana University
> #
> # $ lamnodes
> # n0 node:4:origin,this_node
> #
> # $ mpirun -c 4 -ssi rpi tcp ./stresstest
> #
> # Try to initialize cluster...
> # running 4 processes ...
> # checking the LAM nodes ...
> # n0 node:4:origin,this_node
> #
> # Broadcasting message with size 100MB to 3 slaves.
> # Broadcast done.
> # Gathering data.
> # Gather done.
> #
> # $ mpirun -c 4 -ssi rpi ib ./stresstest
> # -----------------------------------------------------------------------------
> # An erroneous completion was generated while polling for the Infiniband
> # completion queue
> #
> # The exact error string returned by Infiniband API is as follows:
> #
> # -----------------------------------------------------------------------------
> # An erroneous completion was generated while polling for the Infiniband
> # completion queue
> #
> # The exact error string returned by Infiniband API is as follows:
> #
> # "Operation Completed Successfully"
> # -----------------------------------------------------------------------------
> # -----------------------------------------------------------------------------
> # An erroneous completion was generated while polling for the Infiniband
> # completion queue
> #
> # The exact error string returned by Infiniband API is as follows:
> #
> # "Operation Completed Successfully"
> # -----------------------------------------------------------------------------
> # "Operation Completed Successfully"
> # -----------------------------------------------------------------------------
> # MPI_Recv: internal MPI error: Invalid argument (rank 0, MPI_COMM_WORLD)
> # Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> # Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> # Rank (0, MPI_COMM_WORLD): - MPI_Init()
> # Rank (0, MPI_COMM_WORLD): - main()
> # MPI_Send: internal MPI error: Invalid argument (rank 3, MPI_COMM_WORLD)
> # Rank (3, MPI_COMM_WORLD): Call stack within LAM:
> # Rank (3, MPI_COMM_WORLD): - MPI_Send()
> # Rank (3, MPI_COMM_WORLD): - MPI_Init()
> # Rank (3, MPI_COMM_WORLD): - main()
> # -----------------------------------------------------------------------------
> # An erroneous completion was generated while polling for the Infiniband
> # completion queue
> #
> # The exact error string returned by Infiniband API is as follows:
> #
> # "Operation Completed Successfully"
> # -----------------------------------------------------------------------------
> # MPI_Send: internal MPI error: Invalid argument (rank 1, MPI_COMM_WORLD)
> # Rank (1, MPI_COMM_WORLD): Call stack within LAM:
> # Rank (1, MPI_COMM_WORLD): - MPI_Send()
> # Rank (1, MPI_COMM_WORLD): - MPI_Init()
> # Rank (1, MPI_COMM_WORLD): - main()
> # MPI_Send: internal MPI error: Invalid argument (rank 2, MPI_COMM_WORLD)
> # Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> # Rank (2, MPI_COMM_WORLD): - MPI_Send()
> # Rank (2, MPI_COMM_WORLD): - MPI_Init()
> # Rank (2, MPI_COMM_WORLD): - main()
> # -----------------------------------------------------------------------------
> # One of the processes started by mpirun has exited with a nonzero exit
> # code. This typically indicates that the process finished in error.
> # If your process did not finish in error, be sure to include a "return
> # 0" or "exit(0)" in your C code before exiting the application.
> #
> # PID 18491 failed on node n0 (x.x.x.x) with exit status 1.
> # -----------------------------------------------------------------------------
> #
> #
> # _______________________________________________
> # This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> #
>
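
To compare RPI modules the way the quoted runs above do (same binary,
same "-c 4 -ssi rpi <module>" arguments), a small loop can run each one
and keep its output for later inspection. This is only a sketch: the
function name try_rpis, the MPIRUN override, and the log location are
assumptions made for the example, not LAM features.

```shell
# Run one program under each listed RPI module and report the result.
# MPIRUN is a hypothetical override (defaults to "mpirun") so the loop
# can be exercised without a LAM installation.
# Output of each run is saved to ${TMPDIR:-/tmp}/rpi-<module>.log.
try_rpis() {
    prog=$1
    shift
    for rpi in "$@"; do
        log="${TMPDIR:-/tmp}/rpi-$rpi.log"
        if ${MPIRUN:-mpirun} -c 4 -ssi rpi "$rpi" "$prog" > "$log" 2>&1; then
            echo "$rpi: ok"
        else
            echo "$rpi: failed (log: $log)"
        fi
    done
}
```

For example, "try_rpis ./stresstest tcp lamd ib" runs the stress test
under the three modules mentioned in this thread, one after another.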

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/