As a relative newcomer to MPI, I am confused about one specific application: hierarchical communication.

 

After building a hierarchical binary tree of processes, I call MPI_Reduce (or MPI_Bcast) to communicate between the two processes in each group at each level.

 

Surprisingly, the performance sometimes gets worse. For example, with

 

mpirun -np 16 executable

 

The first call to the hierarchical MPI_Reduce may take only 0.01 seconds, but the second may take up to 0.38 seconds on some processes, which is 38 times longer.

 

My question is: what is the real cause? I have tried many ways to find it without success, so I would appreciate your help.

 

The following is a condensed version of the code:

 

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        levelno = 0;
        key = my_rank;                              // rank in MPI_COMM_WORLD
        color = key / power_of_2[levelno+1];        // color = rank / 2^(levelno+1)
        // color = 0, 0, 1, 1, 2, 2, 3, 3, ...
        MPI_Comm_split(MPI_COMM_WORLD, color, key, &subComm[levelno][key]);

 

    // Keep splitting until the binary tree has ceiling(log(np)/log(2)) levels.
    // This builds a hierarchical binary tree of processes using MPI_Comm_split:
    //                         0
    //             0                       8
    //       0           4           8          12
    //    0     2     4     6     8    10    12    14
    //  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

 

        for (levelno = 1; levelno < iLevel; levelno++)
        {
            color = MPI_UNDEFINED;   // ranks outside this level get MPI_COMM_NULL

            if (key % power_of_2[levelno] == 0)   // key is a multiple of 2^levelno
                                                  // if levelno = 1, key is 0, 2, 4, 6, 8, ...
            {
                color = key / power_of_2[levelno+1];   // color = rank / 2^(levelno+1)
                    // for levelno = 1:
                    // key   = 0, 2, 4, 6, 8, ...   // rank in MPI_COMM_WORLD
                    // color = 0, 0, 1, 1, 2, ...   // group in the new communicator
            }
            MPI_Comm_split(MPI_COMM_WORLD, color, key, &subComm[levelno][key]);
        }
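To double-check the grouping, the color rule can be tabulated without MPI. This is only a minimal sketch: tree_levels, split_color and print_colors are hypothetical helper names that do not appear in my code, and -1 stands in for MPI_UNDEFINED.

```c
#include <stdio.h>

/* Hypothetical helper: number of tree levels, i.e. ceiling(log2(np)),
 * computed with integers to avoid floating-point rounding. */
int tree_levels(int np)
{
    int levels = 0;
    int span = 1;            /* span == 2^levels */
    while (span < np) {
        span *= 2;
        levels++;
    }
    return levels;
}

/* Hypothetical helper: the color a rank would pass to MPI_Comm_split at a
 * given level. Returns -1 (standing in for MPI_UNDEFINED) when the rank is
 * not a multiple of 2^level and therefore sits out of that level. */
int split_color(int rank, int level)
{
    int stride = 1 << level;
    if (rank % stride != 0)
        return -1;
    return rank / (2 * stride);
}

/* Print one row per level showing the color of every rank. */
void print_colors(int np)
{
    int levels = tree_levels(np);
    for (int level = 0; level < levels; level++) {
        printf("level %d:", level);
        for (int rank = 0; rank < np; rank++)
            printf(" %2d", split_color(rank, level));
        printf("\n");
    }
}
```

Calling print_colors(16) lists, for each of the four levels, which ranks share a color and so land in the same two-process communicator.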

 

    for (loop = 0; loop < 2; loop++)
    {
        // level 0: every process takes part
        MPI_Reduce(buff1, buff2, size, MPI_DOUBLE, MPI_SUM, 0, subComm[0][my_rank]);

        // higher levels: only the ranks that belong to the level take part
        for (levelno = 1; levelno < iLevel; levelno++)
        {
            if (rankIsInLevel(my_rank, power_of_2[levelno]))   // i.e. my_rank % 2^levelno == 0
                MPI_Reduce(buff1, buff2, size, MPI_DOUBLE, MPI_SUM, 0, subComm[levelno][my_rank]);
        }
    }
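The body of rankIsInLevel is not shown above. A minimal sketch, assuming it performs only the divisibility test described in the comment; levels_joined is a hypothetical helper, not part of my code, added just to count how many of the per-level reduces a rank takes part in during one pass of the loop.

```c
/* Assumed body of rankIsInLevel: a rank belongs to a level when it is a
 * multiple of that level's stride (2^levelno, passed as power_of_2[levelno]). */
int rankIsInLevel(int rank, int stride)
{
    return rank % stride == 0;
}

/* Hypothetical helper: number of levels, out of iLevel, in which a rank
 * takes part in the per-level MPI_Reduce. */
int levels_joined(int rank, int iLevel)
{
    int count = 0;
    for (int levelno = 0; levelno < iLevel; levelno++) {
        if (rankIsInLevel(rank, 1 << levelno))
            count++;
    }
    return count;
}
```

With 16 processes (iLevel = 4), levels_joined gives 4 for rank 0, 3 for rank 4 and 1 for rank 1, so the ranks make different numbers of MPI_Reduce calls per pass.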

 

The running time varies greatly between processes. For example, process 0 takes 0.01 * 2 = 0.02 seconds, but process 1 may take 0.38 * 2 = 0.76 seconds.

 

By the way, the machine is a Linux cluster built from Dell hardware.

 

 

 

Thank you very much!

 

Xiren