As almost a new comer for
MPI, I am confused on a
specific application: hierarchical communication.
After building a hierarchical bi-tree
of processes, I try MPI_Reduce
(or MPI_Bcast) to communicate between
the two processes at each level.
Surprisingly, the performance gets worse sometimes. For example,
Mpirun –np 16 executive
The first invoking of hierarchical-MPI_Reduce may costs merely 0.01 second, but the second one may cost up to 0.38 second (for some processes), or 38
times.
The question is, what’s
the real reason, and tried to find it in a lot of ways. Now I want your kindly
help.
The following is the brief code:
MPI_Init(&argc,
&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
MPI_Comm_size(MPI_COMM_WORLD,&np)
level = 0;
key = my_rank;
//in
MPI_COMM_WORLD
color =
key/power_of_2[level+1]; //finally, colr = rank /
2^(levelno +1)
//color
= 0, 0, 1, 1, 2, 2, 3, 3...
MPI_Comm_split(MPI_COMM_WORLD, color, key, &subComm[levelno][key]);
//go on splitting,
until construct a bi-tree of level ceiling(log(np)/log(2));
//build a hierarchichal bi-tree of
processes using the MPI_COMM_SPLIT
//
0
//
0
8
//
0 4
8 12
//
0 2 4 6 8
10 12 14
//
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
for(levelno=1;
levelno<iLevel; levelno++)
{
color = MPI_UNDEFINED;
if( key % power_of_2[levelno] == 0) //2 to some power
//if level=1, key is 0, 2, 4, 6, 8, ...
{
color = key/power_of_2[levelno+1]; //finally, colr = rank / 2^(levelno+1)
//key= 0, 1, 2, 3, 4,
5, 6, 7, 8, ... //in MPI_COMM_WORLD
// = 0, 0, 0, 0,
1, 1, 1, 1, 2, ... //in new communicator
}
MPI_Comm_split(MPI_COMM_WORLD, color, key, &subComm[levelno][key]);
}
for(loop=0; loop < 2; loop ++)
{
// level 0, all
invoke
MPI_Reduce(buff1, buff2,size,MPI_DOUBLE,MPI_SUM,0,subComm[levelno][my_rank]);
for(levelno=0;
levelno<iLevel; levelno++ )
{
if( rankIsInLevel(my_rank, power_of_2[levelno]))
//this
process is in this level, i.e. my_rank
%2^level == 0
MPI_Reduce(buff1, buff2, size, MPI_DOUBLE,
MPI_SUM,0, subComm[levelno][my_rank]);
}//for
}
The running time of different processes varies greatly, for example, for
process 0, the time is
0.01 * 2 = 0.02 seconds, but for process 1, the time may be 0.38*2.
By the way, the machine is a
linux cluster composed by
Dell processors.
Thank you very much!
Xiren