LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-09-25 06:57:04


On Mon, 22 Sep 2003 seberino_at_[hidden] wrote:

> [description of non-uniform block array snipped]
>
> How do I get all of these "correctly calculated subarrays" from each node
> and make a copy of a "completely correct main array" on all nodes?

I don't think there's a single collective that will do this pattern for
you in one step -- I think you'll need at least a two-step process
(distribute across a process's column and then across its row).

Here's your data map:

----------------------------------------
| | | | |
| 0 | 4 | 8 | 12 |
| | | | |
| | | | |
----------------------------------------
| | | | |
| 1 | 5 | 9 | 13 |
| | | | |
----------------------------------------
| 2 | 6 | 10 | 14 |
| | | | |
----------------------------------------
| 3 | 7 | 11 | 15 |
| | | | |
----------------------------------------

Let's take block 0 as an example; it's 6 wide by 4 tall (a 6x4 block).

a) First, distribute across 0's column. All processes in 0's column
   share a common width (6). If you make a datatype that represents a
   single row in the sub-block of width 6 (i.e., representing 6x1),
   you can then use MPI_Allgatherv to share blocks 0, 1, 2, and 3
   among their respective processes. So process 0 will contribute 4
   of those datatypes, process 1 will contribute 3, and processes 2
   and 3 will contribute 2 each. (There's a rough sketch of this step
   after (b), below.)

b) Next, distribute across 0's row. All processes in the row share a
   common height (4). So make a datatype that represents a single
   column in the sub-block of height 4 (i.e., representing 1x4) and
   use MPI_Allgatherv to share blocks 0, 4, 8, and 12 among their
   respective processes.
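
For what it's worth, here's a minimal sketch of step (a) for the
process column containing blocks 0-3. It assumes (hypothetically)
that each process stores its sub-block contiguously in row-major
order, that the gathered 11x6 strip is stored the same way, and that
a column communicator has already been created (e.g., with
MPI_Comm_split); all of the names are made up. Step (b) would do the
analogous gather across a row communicator (the datatype bookkeeping
there depends on how you lay out the final array).

        #include <stdlib.h>
        #include <mpi.h>

        #define STRIP_WIDTH 6   /* common width of blocks 0-3 */

        /* Gather the full 11x6 column strip onto every process in
           col_comm.  my_block holds this process's rows (4, 3, 2, or
           2 of them) contiguously; strip receives all 11 rows. */
        void gather_column_strip(MPI_Comm col_comm, double *my_block,
                                 int my_rows, double *strip)
        {
          int i, np;
          MPI_Datatype rowtype;
          int *rowcounts, *displs;

          MPI_Comm_size(col_comm, &np);

          /* Datatype for one 6x1 row of a sub-block */
          MPI_Type_contiguous(STRIP_WIDTH, MPI_DOUBLE, &rowtype);
          MPI_Type_commit(&rowtype);

          /* Everyone learns how many rows everyone else has
             (4, 3, 2, 2 in the picture above)... */
          rowcounts = malloc(np * sizeof(int));
          displs = malloc(np * sizeof(int));
          MPI_Allgather(&my_rows, 1, MPI_INT, rowcounts, 1, MPI_INT,
                        col_comm);
          displs[0] = 0;
          for (i = 1; i < np; ++i)
            displs[i] = displs[i - 1] + rowcounts[i - 1];

          /* ...and then each process contributes my_rows rowtypes and
             receives the whole strip */
          MPI_Allgatherv(my_block, my_rows, rowtype,
                         strip, rowcounts, displs, rowtype, col_comm);

          MPI_Type_free(&rowtype);
          free(rowcounts);
          free(displs);
        }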

In general, I am an advocate of using MPI collectives whenever possible.
However, I do recognize that not all MPI implementations' collectives are
as optimized as they could be (including LAM's; they got "better" in 7.0,
but there's still a lot of work to do). So if this distributed all-to-all
kind of communication pattern is going to be executed frequently in your
program (e.g., every iteration), it may be a significant factor in your
application's overall performance. Hence, you may want to spend a little
time and see if you can beat the MPI implementation's
MPI_Allgatherv/datatype performance.

Two alternatives that immediately come to mind are:

1. A naive MPI_Isend approach -- just Isend each process's block to
   every other process. Something simple like:

        for (i = 0; i < size; ++i) {
          if (i != me) {
            MPI_Irecv(...);
            MPI_Isend(...);
          }
        }
        MPI_Waitall(...);

   And let the MPI implementation make progress on all the messages as
   it sees fit.

2. Optimize #1 a bit by doing process-pairwise exchanges (to eliminate
   the bottleneck of a simple loop-based approach where every process
   will start off by sending to process 0, then to process 1, etc.).
   E.g., with the 16 processes above, think of the entire exchange in
   15 timesteps (there's a sketch of this schedule at the end of this
   list).

   Timestep  Exchanges
      0      0<->1  2<->3  4<->5  6<->7  8<->9   10<->11  12<->13  14<->15
      1      0<->2  1<->3  4<->6  5<->7  8<->10  9<->11   12<->14  13<->15
      2      0<->3  1<->2  4<->7  5<->6  8<->11  9<->10   12<->15  13<->14
   ...and so on.

   It *may* even be beneficial to do this in a blocking manner (i.e.,
   use MPI_Send/MPI_Recv vs. MPI_Isend/MPI_Irecv). Given the regular
   pattern of data distribution, if you force the whole thing to be
   synchronous, it may avoid some unexpected receives, possibly make
   the code a little simpler, etc. You might want to try it both
   ways:

   2a. Post 15 Irecvs/Isends and then do a big Waitall (pre-post the
       receives in more-or-less the order that you expect to receive
       them, so that the MPI implementation potentially has less
       searching to do to find a matching receive).

   2b. Use a synchronous approach (i.e., MPI_Send/MPI_Recv).

   I should note that approach 2a is technically a violation of MPI.
   MPI says that you cannot use the same buffer twice simultaneously.
   Hence, executing MPI_Isend on the same buffer 15 times without
   waiting for any of them to complete first is technically not
   allowed.
   That being said, most MPI implementations allow this if you are
   purely *sending* from that buffer (i.e., reading -- not writing).
   But beware. You have been warned. :-)
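
   Here's a rough sketch of the pairing schedule in #2, in its
   blocking form (2b). It assumes the number of processes is a power
   of two (as it is for the 16 blocks above), and the buffer names,
   counts, and types are all made up -- in a real code the count for
   each partner depends on the size of that partner's block. Swapping
   the MPI_Sendrecv for pre-posted MPI_Irecv/MPI_Isend pairs plus a
   final MPI_Waitall gives you variant 2a.

        #include <mpi.h>

        /* Exchange my_block with every other process, one partner per
           timestep.  recvbufs[r] and counts[r] describe where rank r's
           block goes and how big it is (hypothetical bookkeeping). */
        void pairwise_exchange(double *my_block, int my_count,
                               double **recvbufs, int *counts,
                               MPI_Comm comm)
        {
          int np, me, step, partner;
          MPI_Status status;

          MPI_Comm_size(comm, &np);
          MPI_Comm_rank(comm, &me);

          /* Loop step s pairs each rank with (rank XOR s), which
             matches the table above (its timestep t is s = t + 1):
             s=1: 0<->1 2<->3 ...  s=2: 0<->2 1<->3 ...
             s=3: 0<->3 1<->2 ... */
          for (step = 1; step < np; ++step) {
            partner = me ^ step;
            MPI_Sendrecv(my_block, my_count, MPI_DOUBLE, partner, 0,
                         recvbufs[partner], counts[partner], MPI_DOUBLE,
                         partner, 0, comm, &status);
          }
        }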

> (in Fortran, COLUMNS are contiguous, right?)

Correct.

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/