Hi,
(After looking at the archives, I realised that the formatting in my
earlier mail
is lost and the example is not very clear. So I am resending the mail
after re-formatting the example.)
I have been trying to understand the flow control mechanism used in
the IB SSI RPI module in LAM/MPI (7.1.1) and if my understanding is
correct, there seems to be a case where the flow control mechanism
fails.
Before describing the possible failing case, I would like to put forth
my understanding of IB flow control. Please correct me if my
understanding is incorrect.
In the IB module:
1.Each process pre-posts a certain number (num_envelopes) of receive
work requests and also maintains the following counters (in the
corresponding proc structure):
i) The number of RDMA Sends it has done to a particular process
[fc_local_sent_env_cnt]
ii) The number of unconsumed pre-posted RDMA Recv work requests for a
particular process [rem_recv_bufs]
iii) The number of RDMA Recvs from a particular process [fc_post_env_cnt]
iv) The number of RDMA Recvs received by the other process from this
process. This counter is remotely updated by the remote process using
an RDMA Write. [fc_remote_recv_env_cnt]
Consider a scenario where processes A and B on two different nodes are
communicating. For ease of explanation, the following names are used for
the counters instead of the ones actually used in the code:
Process A maintains these counters for a process B:
Sb = Number of RDMA Sends done to B [fc_local_sent_env_cnt]
Pb = Number of unconsumed pre-posted RDMA Receive requests for process
B [rem_recv_bufs]
Rb = Number of RDMA Send messages that were received at A from B.
[fc_post_env_cnt]
Remb, Remote RDMA recv count at B = the RDMA recv count reported by B
which is the number of RDMA sends from A that were received by B. (This counter
value is written remotely by B using an RDMA write). [fc_remote_recv_env_cnt]
Similarly B will maintain the same counts for A: Sa, Pa, Ra, Rema.
Also consider N = num_envelopes.
As part of flow control, B updates A's Remb counter with the value of
Ra (B's variable) and A updates B's Rema counter with the value of Rb
(A's variable)
2. Whenever Recv completions occur, a process does the following:
i) Process the recv completion, which may mean retrieving data from the
envelope in case of a tiny message or doing an RDMA Read in case of a
large message.
ii) Since a recv work request has been consumed, decrement the number of
unconsumed pre-posted RDMA Recv work request count associated with
this process.
iii) If the pre-posted recv work request count becomes less than or equal
to N/2, repost N/2 work requests and update the unconsumed pre-posted
recv work request count.
iv) Increment the RDMA recv completion count and write this value to
the corresponding process' counter using an RDMA write.
For example whenever B receives a work completion on the CQ associated with A,
it does the following:
i) Process the work completion
ii) Pa--;
iii) if(Pa <= N/2) {repost N/2 work requests; Pa = Pa + N/2}
iv) a. Ra++
b. Write value of Ra to A's variable Remb using RDMA Write
3. Also before doing an RDMA send, a process checks if there are
enough unconsumed pre-posted RDMA Recv requests (free slots) on
the other end. For example before A does an RDMA Send to B, it
does the following:
i) if N-(Sb-Remb) > 0 then proceed with the RDMA Send and do Sb++
ii) if N-(Sb-Remb) is equal to 0, then it indicates that there are no more
pre-posted recv requests available on the other end, so put the
MPI send request on a queue, which will be processed later on
In the lam_ssi_rpi_ib_advance() function the value of N-(Sb-Remb) is
checked to see if the peer process has remotely updated Remb
(with an RDMA write). If it has been updated, the MPI send requests
which have been queued are processed.
Now consider an RDMA Send from A to B with initial number of
pre-posted requests, N = 4.
Initially on A Sb = 0, Pb = 4, Rb = 0, Remb = 0 and on B we have
Sa =0, Pa = 4, Ra = 0, Rema = 0.
Let RWR be the number of unconsumed receive work requests on the QP of B.
[RWR is not a variable maintained in the code, but it is the actual
number of unconsumed receive work requests present on the hardware]
As a special case scenario assume that initially B immediately polls
the CQ and updates counters but later on slows down and does not
immediately poll the CQ and update counters.
This is possible when A does a series of non-blocking sends while B
does one MPI_Recv() and then sleeps for some time. Thus B would poll
the CQ once and then not poll again till it wakes up. This scenario is
explained below, where events occuring at A (like RDMA Send) and
and the corresponding events occuring at B (RDMA Recv etc) are shown
in each step.
1.
A: Sb = 0 Remb = 0
B: Ra = 0, Pa = 4, RWR = 4
2.
A: N-(Sb-Remb) = 4-(0-0)= 4 != 0, so do an RDMA Send
and Sb++
B: B polls immediately and updates counters so Ra = 1,
Pa=3. RWR = 3. B also does an RDMA Write to
update A's Remb variable with value of Ra = 1
3.
A: Sb = 1 Remb = 1
N-(Sb-Remb) = 4-(1-1) = 4 != 0
so do an RDMA Send
and Sb++
B: Now lets say B sleeps, does not poll
and update any counters so Ra = 1,
Pa = 3 but RWR = 2 [the work request
on the hardware is still consumed]
4.
A: Sb = 2 Remb = 1
N-(Sb-Remb) = 4-(2-1) = 3 != 0
so do an RDMA Send and Sb++
B: B still does not poll and
update counters. So
Ra = 1, Pa = 3, but RWR = 1
5.
A: Sb = 3, Remb = 1
N-(Sb-Remb) = 4-(3-1)= 2 != 0
so do an RDMA Send
and Sb++
B: B still has'nt updated counters.
so Ra = 1, Pa = 3 but RWR = 0
6.
A: Sb = 4, Remb = 1
N-(Sb-Remb) = 4-(4-1)= 1 != 0
so do an RDMA Send
and Sb++
This RDMA Send will fail because RWR=0 on B i.e B does not have any
pre-posted RDMA Recv work requests.
This problem can be avoided if B does an RDMA Write only when it
reposts work requests instead of B doing an RDMA Write after every
receive completion. Also whenever B reposts work requests it must
ensure that again N unconsumed work requests are available on the QP.
Regards,
Ram
----------------------------------------
K. Ramachandra
Member of Technical Staff
Great Software Laboratory Pvt Ltd
Pune, India
|