I found that, when using nonblocking communications, if I start a certain
number of communications (about 30 per process) in bulk, the performance
is really bad, even though there is enough computation time after these
operations to overlap with the communications, and I call "MPI_Test"
to drive progress. In this scenario, it is preferable to use the
blocking approach.
If I join buffers to reduce the number of communications, performance
with nonblocking communications improves dramatically, with gains of more
than 200% in some cases.
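
To make "joining buffers" concrete, here is a minimal sketch in C of what I
mean, not my actual code. The helper names (pack_chunks, unpack_chunks,
send_packed) and the USE_MPI macro are hypothetical, made up for this
example. It assumes both sides already know the chunk sizes and that the
packed data is sent as raw bytes between processes with the same data
representation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `count` separate chunks into one contiguous
 * buffer so they can travel as a single message.
 * Returns the number of bytes written. */
size_t pack_chunks(char *packed, size_t capacity,
                   const void *chunks[], const size_t sizes[], size_t count)
{
    size_t offset = 0;
    for (size_t i = 0; i < count; ++i) {
        assert(offset + sizes[i] <= capacity); /* caller sizes the buffer */
        memcpy(packed + offset, chunks[i], sizes[i]);
        offset += sizes[i];
    }
    return offset;
}

/* Hypothetical inverse: split a received contiguous buffer back into the
 * original chunks. Returns the number of bytes consumed. */
size_t unpack_chunks(const char *packed,
                     void *chunks[], const size_t sizes[], size_t count)
{
    size_t offset = 0;
    for (size_t i = 0; i < count; ++i) {
        memcpy(chunks[i], packed + offset, sizes[i]);
        offset += sizes[i];
    }
    return offset;
}

#ifdef USE_MPI /* hypothetical macro; define it when building with MPI */
#include <mpi.h>

/* One MPI_Isend for the whole packed buffer instead of one per chunk.
 * The packed buffer must stay untouched until the request completes. */
int send_packed(const char *packed, size_t nbytes, int dest, int tag,
                MPI_Comm comm, MPI_Request *request)
{
    return MPI_Isend(packed, (int)nbytes, MPI_BYTE, dest, tag, comm, request);
}
#endif
```

The receiver would post a single matching MPI_Irecv for the same number of
bytes and call unpack_chunks once the request completes. The copies cost
some time, but in my case far fewer requests were outstanding per process.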