Hi,
In parallel matrix multiplication, the C submatrices C1, C2, C3, etc.
are computed using A and B submatrix pairs (A1, B1), (A2, B2),
(A3, B3), etc. received from other PEs. If I loop over C1, C2, ...
in that order, my MPI_Wait() may really have to spend time waiting
for the submatrix pairs (Ai, Bi) to come, even if other pairs (Aj, Bj)
have already arrived. So my question is: how do I pick the already
arrived pairs to compute, so that my CPUs are mostly busy and
the cost of communication is partially hidden? Is MPI_Probe()
the right way to go? Do I need to maintain a queue myself to manage
the skipped pairs (since they are still being communicated) so I can
come back to them at a later time?
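To make the question concrete, here is a rough sketch of the
skip-and-revisit loop I have in mind. It is only a simulation, not real
MPI code: FakeNet, pair_arrived() and compute_in_arrival_order() are
names I made up for illustration, and pair_arrived() merely stands in
for whatever non-blocking completion check would be used on the
receives for (Ai, Bi).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simulated network state (hypothetical, for illustration only). */
typedef struct {
    const int *arrival_tick; /* sweep number at which pair i is complete */
    int now;                 /* current sweep number */
} FakeNet;

/* Stand-in for a non-blocking completion check on pair i. */
bool pair_arrived(const FakeNet *net, int i) {
    return net->arrival_tick[i] <= net->now;
}

/*
 * Sweep over the pending pairs, compute every pair that has arrived,
 * and keep the rest in the pending list for the next sweep.
 * order[k] receives the index of the k-th pair that was computed.
 * Returns the number of sweeps in which nothing was ready (idle
 * sweeps), or -1 if the pending list could not be allocated.
 */
int compute_in_arrival_order(FakeNet *net, int n, int *order) {
    assert(net != NULL && order != NULL && n >= 0);

    int *pending = malloc((size_t)n * sizeof *pending);
    if (pending == NULL && n > 0) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        pending[i] = i;
    }

    int npending = n;
    int done = 0;
    int idle_sweeps = 0;

    while (npending > 0) {
        int kept = 0;
        bool progressed = false;

        for (int k = 0; k < npending; k++) {
            int i = pending[k];
            if (pair_arrived(net, i)) {
                /* The real code would compute Ci from (Ai, Bi) here. */
                order[done++] = i;
                progressed = true;
            } else {
                /* Not here yet: skip it and revisit on the next sweep. */
                pending[kept++] = i;
            }
        }

        npending = kept;
        if (!progressed) {
            idle_sweeps++;
        }
        net->now++; /* simulated time advances once per sweep */
    }

    free(pending);
    return idle_sweeps;
}
```

With four pairs whose simulated arrival ticks are 3, 0, 1, 0, this
computes them in the order 1, 3, 2, 0 and spends one sweep with nothing
to do. The pending array is the hand-maintained queue I am asking about.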
Any suggestions are highly appreciated!
Thanks,
-Lei