LAM/MPI logo

LAM/MPI Development Mailing List Archives

  |   Home   |   Download   |   Documentation   |   FAQ   |   all just in this list

From: Ramachandra K (ariston_at_[hidden])
Date: 2005-11-12 05:42:01


Hi,

I have been trying to understand the flow control mechanism used in
the IB SSI RPI module in LAM/MPI (7.1.1) and if my understanding is
correct, there seems to be a case where the flow control mechanism
fails.

Before describing the possible failing case, I would like to put forth
my understanding of IB flow control. Please correct me if my
understanding is incorrect.

In the IB module:

1.Each process pre-posts a certain number (num_envelopes) of receive
work requests and also maintains the following counters (in the
corresponding proc structure):
    i) The number of RDMA Sends it has done to a particular process
        [fc_local_sent_env_cnt]
   ii) The number of unconsumed pre-posted RDMA Recv work requests for a
        particular process [rem_recv_bufs]
  iii) The number of RDMA Recvs from a particular process [fc_post_env_cnt]
   iv) The number of RDMA Recvs received by the other process from this
       process. This counter is remotely updated by the remote process using
       an RDMA Write. [fc_remote_recv_env_cnt]

Consider a scenario where processes A and B on two different nodes are
communicating. For ease of explanation, the following names are used for
the counters instead of the ones actually used in the code:

Process A maintains these counters for a process B:

Sb = Number of RDMA Sends done to B [fc_local_sent_env_cnt]
Pb = Number of unconsumed pre-posted RDMA Receive requests for process
     B [rem_recv_bufs]
Rb = Number of RDMA Send messages that were received at A from B.
     [fc_post_env_cnt]
Remb, Remote RDMA recv count at B = the RDMA recv count reported by B
which is the number of RDMA sends from A that were received by B. (This counter
value is written remotely by B using an RDMA write). [fc_remote_recv_env_cnt]

Similarly B will maintain the same counts for A: Sa, Pa, Ra, Rema.

Also consider N = num_envelopes.

As part of flow control, B updates A's Remb counter with the value of
Ra (B's variable) and A updates B's Rema counter with the value of Rb
(A's variable)

2. Whenever Recv completions occur, a process does the following:
    i) Process the recv completion, which may mean retrieving data from the
       envelope in case of a tiny message or doing an RDMA Read in case of a
       large message.
   ii) Since a recv work request has been consumed, decrement the number of
       unconsumed pre-posted RDMA Recv work request count associated with
       this process.
   iii) If the pre-posted recv work request count becomes less than or equal
       to N/2, repost N/2 work requests and update the unconsumed pre-posted
       recv work request count.
   iv) Increment the RDMA recv completion count and write this value to
       the corresponding process' counter using an RDMA write.

For example whenever B receives a work completion on the CQ associated with A,
it does the following:
    i) Process the work completion
   ii) Pa--;
  iii) if(Pa <= N/2) {repost N/2 work requests; Pa = Pa + N/2}
   iv) a. Ra++
       b. Write value of Ra to A's variable Remb using RDMA Write

3. Also before doing an RDMA send, a process checks if there are
enough unconsumed pre-posted RDMA Recv requests (free slots) on
the other end. For example before A does an RDMA Send to B, it
does the following:
   i) if N-(Sb-Remb) > 0 then proceed with the RDMA Send and do Sb++
   ii) if N-(Sb-Remb) is equal to 0, then it indicates that there are no more
       pre-posted recv requests available on the other end, so put the
       MPI send request on a queue, which will be processed later on

In the lam_ssi_rpi_ib_advance() function the value of N-(Sb-Remb) is
checked to see if the peer process has remotely updated Remb
(with an RDMA write). If it has been updated, the MPI send requests
which have been queued are processed.

Now consider an RDMA Send from A to B with initial number of
pre-posted requests, N = 4.
Initially on A Sb = 0, Pb = 4, Rb = 0, Remb = 0 and on B we have
Sa =0, Pa = 4, Ra = 0, Rema = 0.

Let RWR be the number of unconsumed receive work requests on the QP of B.
[RWR is not a variable maintained in the code, but it is the actual
number of unconsumed receive work requests present on the hardware]

As a special case scenario assume that initially B immediately polls
the CQ and updates counters but later on slows down and does not
immediately poll the CQ and update counters.
This is possible when A does a series of non-blocking sends while B
does one MPI_Recv() and then sleeps for some time. Thus B would poll
the CQ once and then not poll again till it wakes up. This scenario is
explained below, where Column A shows the action of A (like doing an
RDMA Send) and the column B shows the corresponding action of B
like a work request getting consumed or a work completion being processed)

     A B
1. Sb = 0 Remb = 0 Ra = 0, Pa = 4, RWR = 4

2. N-(Sb-Remb) = 4-(0-0)= 4 != 0, B polls immediately and updates counters
 so do an RDMA Send so Ra = 1, Pa=3. RWR = 3. B also does
 and Sb++ an RDMA Write to update
                                                 A's Remb variable
                                                 with value of Ra = 1

3.Sb = 1 Remb = 1
 N-(Sb-Remb) = 4-(1-1) = 4 != 0 Now lets say B sleeps, does not poll
 so do an RDMA Send and update any counters so Ra = 1, Pa = 3
 and Sb++ but RWR = 2

4. Sb = 2 Remb = 1 B still does not poll and
 N-(Sb-Remb) = 4-(2-1) = 3 != 0 update counters. So
 so do an RDMA Send Ra = 1, Pa = 3, but RWR = 1
 and Sb++

5. Sb = 3, Remb = 1 B still has'nt updated counters.
 N-(Sb-Remb) = 4-(3-1)= 2 != 0 so Ra = 1, Pa = 3 but RWR = 0
 so do an RDMA Send
 and Sb++

6. Sb = 4, Remb = 1
 N-(Sb-Remb) = 4-(4-1)= 1 != 0
 so do an RDMA Send
 and Sb++

This RDMA Send will fail because RWR=0 on B i.e B does not have any
pre-posted RDMA Recv work requests.

This problem can be avoided if B does an RDMA Write only when it
reposts work requests instead of B doing an RDMA Write after every
receive completion. Also whenever B reposts work requests it must
ensure that again N unconsumed work requests are available on the QP.

Regards,
Ram
----------------------------------------
K. Ramachandra
Member of Technical Staff
Great Software Laboratory Pvt Ltd
Pune, India