As almost a new comer for MPI, I am confused on a specific application:
hierarchical communication.
After building a hierarchical bi-tree of processes, I try MPI_Reduce (or
MPI_Bcast) to communicate between the two processes at each level.
Surprisingly, the performance gets worse sometimes. For example,
Mpirun -np 16 executive
The first invoking of hierarchical-MPI_Reduce may costs merely 0.01 second,
but the second one may cost up to 0.38 second (for some processes), or 38
times.
The question is, what's the real reason, and tried to find it in a lot of
ways. Now I want your kindly help.
The following is the brief code:
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
MPI_Comm_size(MPI_COMM_WORLD,&np)
level = 0;
key = my_rank; //in MPI_COMM_WORLD
color = key/power_of_2[level+1]; //finally, colr = rank /
2^(levelno +1)
//color = 0, 0, 1, 1, 2, 2, 3, 3...
MPI_Comm_split(MPI_COMM_WORLD, color, key, &subComm[levelno][key]);
//go on splitting, until construct a bi-tree of level
ceiling(log(np)/log(2));
//build a hierarchichal bi-tree of processes using the MPI_COMM_SPLIT
// 0
// 0 8
// 0 4 8 12
// 0 2 4 6 8 10 12 14
// 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
for(levelno=1; levelno<iLevel; levelno++)
{
color = MPI_UNDEFINED;
if( key % power_of_2[levelno] == 0) //2 to some power
//if level=1, key is 0, 2, 4, 6, 8, ...
{
color = key/power_of_2[levelno+1]; //finally, colr = rank /
2^(levelno+1)
//key= 0, 1, 2, 3, 4, 5, 6, 7, 8, ... //in
MPI_COMM_WORLD
// = 0, 0, 0, 0, 1, 1, 1, 1, 2, ... //in new
communicator
}
MPI_Comm_split(MPI_COMM_WORLD, color, key,
&subComm[levelno][key]);
}
for(loop=0; loop < 2; loop ++)
{
// level 0, all invoke
MPI_Reduce(buff1,
buff2,size,MPI_DOUBLE,MPI_SUM,0,subComm[levelno][my_rank]);
for(levelno=0; levelno<iLevel; levelno++ )
{
if( rankIsInLevel(my_rank, power_of_2[levelno]))
//this process is in this level, i.e. my_rank %2^level == 0
MPI_Reduce(buff1, buff2, size, MPI_DOUBLE, MPI_SUM,0,
subComm[levelno][my_rank]);
}//for
}
The running time of different processes varies greatly, for example, for
process 0, the time is 0.01 * 2 = 0.02 seconds, but for process 1, the time
may be 0.38*2.
By the way, the machine is a linux cluster composed by Dell processors.
Thank you very much!
Xiren
|