
From: Pierre Valiron (Pierre.Valiron_at_[hidden])
Date: 2005-08-11 08:13:41


Hi all,

I am experimenting with MPI_Alltoall performance on a Gigabit Ethernet
cluster of 30 quad-processor nodes (Sun v40z running Solaris 10). This
seems to be a unique configuration in Europe, and we still lack updated
manuals for all the subtleties of Solaris 10. As far as I can guess, the
future Sun MPI might rely on Open MPI, so it makes sense to work with
LAM/MPI in the meantime. I don't need to advertise LAM/MPI any more on
this list either ;-).

I am using LAM/MPI 7.1.1 compiled with the Sun Studio 10 compilers.
Simple MPI codes run fine across all nodes and CPUs. Complex MPI codes
also run on individual 4-way nodes with excellent performance.

Now I am using a crude Alltoall benchmark to stress the whole cluster
and identify the bottlenecks and problems, and I hope to get some
useful feedback from this list.

- Using the default setup for the Ethernet cards and for the HP ProCurve
2848 switch results in poor performance, and even in the benchmark
freezing when the number of nodes is increased.

- We then forced the speed to 1000-fdx and enabled flow control on the
switch. This resulted in a big performance improvement and reduced the
occurrence of freezes with a large number of nodes.
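
For reference, forcing the link settings on the Solaris hosts looks
roughly like the commands below (a sketch only: I assume the bge driver
here, and the ndd parameter names differ from driver to driver):

> ndd -set /dev/bge0 adv_autoneg_cap 0
> ndd -set /dev/bge0 adv_1000fdx_cap 1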

So far, so good. However, I now come to performance issues.

I illustrate with examples using 13 and 23 nodes. The slice size is
expressed in bytes; as in the attached code, the "total" rate counts the
slice_len*(nproc-1) bytes sent by each of the nproc processes per
iteration, and "node" is that total divided by nproc.

a) 13 procs on 13 nodes:

> lamboot 13.p
> mpirun N a.out
World size = 13 processors
number of iterations for each buffer size: 10
Alltoall test...
Input maximum message size: 1000000
slice=1, 10 iter, time/iter= 0.000921 s, total=0.169383 MB/s, node=0.013029 MB/s
slice=2, 10 iter, time/iter= 0.000818 s, total=0.381278 MB/s, node=0.029329 MB/s
slice=4, 10 iter, time/iter= 0.000873 s, total=0.714762 MB/s, node=0.054982 MB/s
slice=8, 10 iter, time/iter= 0.000758 s, total=1.646636 MB/s, node=0.126664 MB/s
slice=16, 10 iter, time/iter= 0.000828 s, total=3.014479 MB/s, node=0.231883 MB/s
slice=32, 10 iter, time/iter= 0.000811 s, total=6.152254 MB/s, node=0.473250 MB/s
slice=64, 10 iter, time/iter= 0.000837 s, total=11.928426 MB/s, node=0.917571 MB/s
slice=128, 10 iter, time/iter= 0.000852 s, total=23.442176 MB/s, node=1.803244 MB/s
slice=256, 10 iter, time/iter= 0.000920 s, total=43.394747 MB/s, node=3.338057 MB/s
slice=512, 10 iter, time/iter= 0.000984 s, total=81.178504 MB/s, node=6.244500 MB/s
slice=1024, 10 iter, time/iter= 0.000958 s, total=166.678665 MB/s, node=12.821436 MB/s
slice=2048, 10 iter, time/iter= 0.001025 s, total=311.758090 MB/s, node=23.981392 MB/s
slice=4096, 10 iter, time/iter= 0.001307 s, total=488.776553 MB/s, node=37.598196 MB/s
slice=8192, 10 iter, time/iter= 0.002051 s, total=622.965434 MB/s, node=47.920418 MB/s
slice=16384, 10 iter, time/iter= 0.003691 s, total=692.468179 MB/s, node=53.266783 MB/s
slice=32768, 10 iter, time/iter= 0.067712 s, total=75.493148 MB/s, node=5.807165 MB/s
slice=65536, 10 iter, time/iter= 0.122046 s, total=83.768256 MB/s, node=6.443712 MB/s
slice=131072, 10 iter, time/iter= 0.064205 s, total=318.469431 MB/s, node=24.497649 MB/s
slice=262144, 10 iter, time/iter= 0.035519 s, total=1151.347352 MB/s, node=88.565181 MB/s
slice=524288, 10 iter, time/iter= 0.069661 s, total=1174.090857 MB/s, node=90.314681 MB/s
clock resolution in seconds: 0.00000100

b) 52 procs on 13 nodes:

World size = 13 processors
number of iterations for each buffer size: 10
Alltoall test...
Input maximum message size: 1000000000
slice=1, 10 iter, time/iter= 0.000993 s, total=0.157132 MB/s, node=0.012087 MB/s
slice=2, 10 iter, time/iter= 0.000795 s, total=0.392497 MB/s, node=0.030192 MB/s
slice=4, 10 iter, time/iter= 0.000864 s, total=0.722477 MB/s, node=0.055575 MB/s
slice=8, 10 iter, time/iter= 0.000840 s, total=1.486269 MB/s, node=0.114328 MB/s
slice=16, 10 iter, time/iter= 0.000826 s, total=3.022224 MB/s, node=0.232479 MB/s
slice=32, 10 iter, time/iter= 0.000854 s, total=5.848104 MB/s, node=0.449854 MB/s
slice=64, 10 iter, time/iter= 0.000875 s, total=11.410335 MB/s, node=0.877718 MB/s
slice=128, 10 iter, time/iter= 0.000903 s, total=22.112703 MB/s, node=1.700977 MB/s
slice=256, 10 iter, time/iter= 0.000892 s, total=44.771530 MB/s, node=3.443964 MB/s
slice=512, 10 iter, time/iter= 0.000911 s, total=87.645514 MB/s, node=6.741963 MB/s
slice=1024, 10 iter, time/iter= 0.000970 s, total=164.602604 MB/s, node=12.661739 MB/s
slice=2048, 10 iter, time/iter= 0.001075 s, total=297.255944 MB/s, node=22.865842 MB/s
slice=4096, 10 iter, time/iter= 0.001325 s, total=482.320051 MB/s, node=37.101542 MB/s
slice=8192, 10 iter, time/iter= 0.002076 s, total=615.645688 MB/s, node=47.357361 MB/s
slice=16384, 10 iter, time/iter= 0.003660 s, total=698.259495 MB/s, node=53.712269 MB/s
slice=32768, 10 iter, time/iter= 0.068230 s, total=74.920780 MB/s, node=5.763137 MB/s
slice=65536, 10 iter, time/iter= 0.125220 s, total=81.645177 MB/s, node=6.280398 MB/s
slice=131072, 10 iter, time/iter= 0.058053 s, total=352.214214 MB/s, node=27.093401 MB/s
slice=262144, 10 iter, time/iter= 0.035119 s, total=1164.464091 MB/s, node=89.574161 MB/s
slice=524288, 10 iter, time/iter= 0.065959 s, total=1239.990645 MB/s, node=95.383896 MB/s
slice=1048576, 10 iter, time/iter= 0.127614 s, total=1281.821578 MB/s, node=98.601660 MB/s
slice=2097152, 10 iter, time/iter= 0.241282 s, total=1355.906079 MB/s, node=104.300468 MB/s
slice=4194304, 10 iter, time/iter= 0.474827 s, total=1378.000783 MB/s, node=106.000060 MB/s
slice=8388608, 10 iter, time/iter= 0.955513 s, total=1369.549575 MB/s, node=105.349967 MB/s
clock resolution in seconds: 0.00000100

c) 23 procs on 23 nodes:

> lamboot 23.p
> mpirun N a.out
World size = 23 processors
number of iterations for each buffer size: 10
Alltoall test...
Input maximum message size: 1000000000
slice=1, 10 iter, time/iter= 0.001202 s, total=0.420936 MB/s, node=0.018302 MB/s
slice=2, 10 iter, time/iter= 0.001084 s, total=0.933502 MB/s, node=0.040587 MB/s
slice=4, 10 iter, time/iter= 0.001052 s, total=1.924132 MB/s, node=0.083658 MB/s
slice=8, 10 iter, time/iter= 0.002041 s, total=1.983521 MB/s, node=0.086240 MB/s
slice=16, 10 iter, time/iter= 0.001121 s, total=7.224758 MB/s, node=0.314120 MB/s
slice=32, 10 iter, time/iter= 0.001088 s, total=14.886276 MB/s, node=0.647229 MB/s
slice=64, 10 iter, time/iter= 0.001079 s, total=30.026603 MB/s, node=1.305504 MB/s
slice=128, 10 iter, time/iter= 0.001147 s, total=56.457527 MB/s, node=2.454675 MB/s
slice=256, 10 iter, time/iter= 0.001090 s, total=118.863542 MB/s, node=5.167980 MB/s
slice=512, 10 iter, time/iter= 0.001123 s, total=230.755304 MB/s, node=10.032839 MB/s
slice=1024, 10 iter, time/iter= 0.001186 s, total=436.878772 MB/s, node=18.994729 MB/s
slice=2048, 10 iter, time/iter= 0.001447 s, total=716.406010 MB/s, node=31.148087 MB/s
slice=4096, 10 iter, time/iter= 0.002087 s, total=993.090056 MB/s, node=43.177829 MB/s
slice=8192, 10 iter, time/iter= 0.674983 s, total=6.141120 MB/s, node=0.267005 MB/s
slice=16384, 10 iter, time/iter= 0.587798 s, total=14.104002 MB/s, node=0.613217 MB/s
slice=32768, 10 iter, time/iter= 0.476085 s, total=34.827021 MB/s, node=1.514218 MB/s
slice=65536, 10 iter, time/iter= 0.502088 s, total=66.046606 MB/s, node=2.871592 MB/s
slice=131072, 10 iter, time/iter= 0.343736 s, total=192.946120 MB/s, node=8.388962 MB/s
slice=262144, 10 iter, time/iter= 0.343521 s, total=386.132883 MB/s, node=16.788386 MB/s
slice=524288, 10 iter, time/iter= 0.472416 s, total=561.560062 MB/s, node=24.415655 MB/s
slice=1048576, 10 iter, time/iter= 0.608414 s, total=872.070204 MB/s, node=37.916096 MB/s
slice=2097152, 10 iter, time/iter= 0.785687 s, total=1350.612802 MB/s, node=58.722296 MB/s
slice=4194304, 10 iter, time/iter= 1.270416 s, total=1670.568532 MB/s, node=72.633414 MB/s
slice=8388608, 10 iter, time/iter= 2.079299 s, total=2041.378015 MB/s, node=88.755566 MB/s
slice=16777216, 10 iter, time/iter= 3.821460 s, total=2221.473168 MB/s, node=96.585790 MB/s
clock resolution in seconds: 0.00000100

d) 92 procs on 23 nodes:

> lamboot 23.p
> mpirun C a.out
World size = 92 processors
number of iterations for each buffer size: 10
Alltoall test...
Input maximum message size: 1000000000
slice=1, 10 iter, time/iter= 0.217072 s, total=0.038568 MB/s, node=0.000419 MB/s
slice=2, 10 iter, time/iter= 0.262327 s, total=0.063829 MB/s, node=0.000694 MB/s
slice=4, 10 iter, time/iter= 0.005088 s, total=6.581366 MB/s, node=0.071537 MB/s
slice=8, 10 iter, time/iter= 0.005142 s, total=13.026014 MB/s, node=0.141587 MB/s
slice=16, 10 iter, time/iter= 0.005807 s, total=23.066976 MB/s, node=0.250728 MB/s
slice=32, 10 iter, time/iter= 0.004918 s, total=54.478632 MB/s, node=0.592159 MB/s
slice=64, 10 iter, time/iter= 0.004915 s, total=109.019105 MB/s, node=1.184990 MB/s
slice=128, 10 iter, time/iter= 0.005022 s, total=213.367035 MB/s, node=2.319207 MB/s
slice=256, 10 iter, time/iter= 0.005834 s, total=367.368758 MB/s, node=3.993139 MB/s
slice=512, 10 iter, time/iter= 0.005537 s, total=774.164554 MB/s, node=8.414832 MB/s
slice=1024, 10 iter, time/iter= 0.470873 s, total=18.206448 MB/s, node=0.197896 MB/s
slice=2048, 10 iter, time/iter= 24.120486 s, total=0.710842 MB/s, node=0.007727 MB/s
(apparently frozen beyond this point)

From the above numbers one can draw some conclusions.

- Except in the largest case, small slice sizes result in small elapsed
times, seemingly dominated by Gigabit Ethernet latency.
- Increasing the slice size improves the throughput with little impact
on elapsed time, up to some limiting value, as indicated in the table
below:

nodes, procs, slice limit
13, 13, 16384
13, 52, 16384
23, 23, 4096
23, 92, 512 or 1024

Beyond this limiting slice value, the elapsed time first increases by 1
or 2 orders of magnitude, then generally *decreases* again as the slice
size grows... Except for the biggest case, very good performance is
restored for very large slice sizes.
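
A possible workaround, while the cause is unclear, would be to keep each
individual MPI_Alltoall call below the observed slice limit by splitting
a large exchange into several calls over sub-slices. A minimal sketch
(hypothetical, not part of the attached all2all.c; tmp_s and tmp_r are
scratch buffers of nproc*chunk bytes each):

/* chunked_alltoall: hypothetical sketch, NOT in the attached benchmark.
   Same exchange as MPI_Alltoall(sendbuf, slice, MPI_CHAR, ...), done as
   several alltoalls of at most chunk bytes per peer. */
#include <string.h>
#include "mpi.h"

void chunked_alltoall(char *sendbuf, char *recvbuf, int slice, int chunk,
                      char *tmp_s, char *tmp_r, MPI_Comm comm)
{
    int nproc, p, off, len;
    MPI_Comm_size(comm, &nproc);
    for (off = 0; off < slice; off += chunk) {
        len = (slice - off < chunk) ? (slice - off) : chunk;
        for (p = 0; p < nproc; p++)   /* pack the sub-slice for each peer */
            memcpy(tmp_s + p*len, sendbuf + p*slice + off, len);
        MPI_Alltoall(tmp_s, len, MPI_CHAR, tmp_r, len, MPI_CHAR, comm);
        for (p = 0; p < nproc; p++)   /* unpack into the right offsets */
            memcpy(recvbuf + p*slice + off, tmp_r + p*len, len);
    }
}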

The behaviour seems erratic. I also experimented on another cluster
(Linux, 13 dual-Xeon nodes). The trend is similar, but the limiting
slice sizes are different. I see no logic in this behaviour, and I
suspect a bottleneck in the kernel IP stack, in LAM/MPI, or in both.
The freezes, and the beneficial effect of enabling flow control on
Gigabit Ethernet, make me suspect primarily IP congestion problems at
the system level. However, LAM/MPI also has some performance issues
when the message size goes beyond the 64 KB "short message" limit, and
it might also be unfair with respect to IP bandwidth saturation.
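
To tell whether the trouble sits in the alltoall algorithm itself or
lower down in the IP stack, one could also compare against a naive
pairwise alltoall built on MPI_Sendrecv, which spreads the exchange over
nproc-1 ordered steps instead of flooding the switch all at once. Again
a hypothetical sketch, not part of the attached all2all.c (the tag 99 is
arbitrary):

/* pairwise_alltoall: hypothetical sketch, NOT in the attached benchmark.
   Equivalent to MPI_Alltoall(sendbuf, slice, MPI_CHAR, ...), but done
   as nproc-1 paired MPI_Sendrecv steps. */
#include <string.h>
#include "mpi.h"

void pairwise_alltoall(char *sendbuf, char *recvbuf, int slice,
                       MPI_Comm comm)
{
    int nproc, me, s, to, from;
    MPI_Status status;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &nproc);
    memcpy(recvbuf + me*slice, sendbuf + me*slice, slice); /* own slice */
    for (s = 1; s < nproc; s++) {
        to   = (me + s) % nproc;           /* partner we send to   */
        from = (me - s + nproc) % nproc;   /* partner we hear from */
        MPI_Sendrecv(sendbuf + to*slice,   slice, MPI_CHAR, to,   99,
                     recvbuf + from*slice, slice, MPI_CHAR, from, 99,
                     comm, &status);
    }
}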

Any similar experience on other Gigabit clusters under Linux or Solaris 10?
Any clue?

For the curious, I attach the all2all.c program.
Best regards.

-- 
Soutenez le mouvement SAUVONS LA RECHERCHE :
http://recherche-en-danger.apinc.org/
       _/_/_/_/    _/       _/       Dr. Pierre VALIRON
      _/     _/   _/      _/   Laboratoire d'Astrophysique
     _/     _/   _/     _/    Observatoire de Grenoble / UJF
    _/_/_/_/    _/    _/    BP 53  F-38041 Grenoble Cedex 9 (France)
   _/          _/   _/    http://www-laog.obs.ujf-grenoble.fr/~valiron/
  _/          _/  _/     Mail: Pierre.Valiron_at_[hidden]
 _/          _/ _/      Phone: +33 4 7651 4787  Fax: +33 4 7644 8821
_/          _/_/        

/* alltoall tweaked by PV, 11 Aug 2005 */
#include "mpi.h"
#include <stdio.h>
/* stdlib is needed for declaring malloc */
#include <stdlib.h>

#define MAX2(a,b) (((a)>(b)) ? (a) : (b))

int GlobalReadInteger(void);
void Hello(void);
void Ring(void);
void Alltoall(void);
/*
void Stress();
void Globals();
*/

int nbiter;

int main(int argc, char **argv)
{

    int me, option, namelen, size;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&me);
    MPI_Comm_size(MPI_COMM_WORLD,&size);

    if (size < 2) {
        fprintf(stderr, "systest requires at least 2 processes" );
/* MPI_Abort(MPI_COMM_WORLD,1);*/
    }

    MPI_Get_processor_name(processor_name,&namelen);

    fprintf(stderr,"Process %d is alive on %s\n",
            me, processor_name);

    while (1) {

        MPI_Barrier(MPI_COMM_WORLD);
        (void) fflush(stderr);

      again:
        if (me == 0) {
            /* Read user input for action */
            /* (void) printf("\nOptions: 0=quit, 1=Hello, 2=Ring, 3=Stress, ");
            (void) printf("4=Globals : "); */

            (void) printf("\nWorld size = %d processors\n", size);
            (void) fflush(stdout);
            (void) printf("\nOptions: 0=quit, 1=Hello, 2=Ring, 3=Alltoall : ");
            (void) fflush(stdout);
        }
        option = GlobalReadInteger();
        if ( (option < 0) || (option > 4) )
            goto again;

        if ( (option == 2) || (option == 3) ) {
            if (me == 0) {
                (void) printf("\nnumber of iterations for each buffer size: ");
                (void) fflush(stdout);
            }
            nbiter = GlobalReadInteger();
        }
        
        switch (option) {
          case 0:
            MPI_Finalize();
            return(0);
          case 1:
            Hello(); break;
          case 2:
            Ring(); break;
          case 3:
            Alltoall(); break;
/*
          case 4:
            Globals(); break;
*/
          default:
            fprintf(stderr,"systest: invalid option %d\n", option); break;
        }
    }
}

int GlobalReadInteger()
/*
  Process zero reads an integer from stdin and broadcasts
  to everyone else
*/
{
    int me, value = 0;   /* default in case the read fails */

    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    if (me == 0) {
        if (scanf("%d", &value) != 1)
            fprintf(stderr,"failed reading integer value from stdin\n");
    }
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    return value;
}

void Alltoall()
{
    int nproc, me;
    int i;
    char *buffer, *msg;
    int lenbuf, slice_len, max_len;
    double us_rate, us_rate_node;
    double start_ustime, used_ustime;

    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (me == 0) {
        (void) printf("\nAlltoall test... time network performance\n---------\n\n");
        (void) printf("Input maximum message size: ");
        (void) fflush(stdout);
    }

    /* note: both products below can overflow a 32-bit int for very
       large inputs or process counts; the <= 0 test only catches the
       cases where the result wraps negative */
    max_len = GlobalReadInteger()*nproc;
    if ( (max_len <= 0) || (max_len >= 16*1024*1024*nproc) )
        max_len = 16*1024*1024*nproc;
    if ( (buffer = (char *) malloc((unsigned) max_len)) == (char *) NULL) {
            printf("process %d could not allocate buffer of size %d\n",me,max_len);
            MPI_Abort(MPI_COMM_WORLD,7777);
    }
    if ( (msg= (char *) malloc((unsigned) max_len)) == (char *) NULL) {
            printf("process %d could not allocate buffer of size %d\n",me,max_len);
            MPI_Abort(MPI_COMM_WORLD,7777);
    }

    lenbuf = nproc;
    slice_len = lenbuf/nproc;
    while (lenbuf <= max_len) {
        start_ustime = MPI_Wtime();
        for (i=1; i<=nbiter; i++) {
            MPI_Alltoall(buffer, slice_len, MPI_CHAR, msg, slice_len, MPI_CHAR, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);  /* the barrier is included in the timing */
        }
        used_ustime = MPI_Wtime() - start_ustime;

        if (used_ustime > 0) { /* rate is megabytes per second */
            /* each of the nproc processes sends slice_len bytes to each
               of its nproc-1 peers, nbiter times */
            us_rate = (double)slice_len * ((double)nproc-1) * (double)nproc * (double)nbiter;
            us_rate = us_rate / (used_ustime*(double)1000000);
            us_rate_node = us_rate / (double)nproc;
        }
        else
            us_rate = us_rate_node = 0.0;  /* was uninitialized in this branch */

        if (me == 0) {
            printf("slice=%d, %d iter, time/iter= %f s, total=%f MB/s, node=%f MB/s\n", slice_len, nbiter, used_ustime/nbiter, us_rate, us_rate_node);
        }
        lenbuf *= 2;
        slice_len *= 2;
    }
    if (me == 0)
        printf("clock resolution in seconds: %10.8f\n", MPI_Wtick());
    free(buffer);
    free(msg);   /* was leaked */
}

void Hello()
/*
  Everyone exchanges a hello message with everyone else.
  The hello message just comprises the sending and target nodes.
*/
{
    int nproc, me;
    int type = 1;
    int buffer[2], node;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (me == 0) {
        printf("\nHello test ... show network integrity\n----------\n\n");
        fflush(stdout);
    }

    for (node = 0; node<nproc; node++) {
        if (node != me) {
            buffer[0] = me;
            buffer[1] = node;
            /* both partners send first; this is safe only because the
               2-int message is small enough to be buffered eagerly */
            MPI_Send(buffer, 2, MPI_INT, node, type, MPI_COMM_WORLD);
            MPI_Recv(buffer, 2, MPI_INT, node, type, MPI_COMM_WORLD, &status);

            if ( (buffer[0] != node) || (buffer[1] != me) ) {
                (void) fprintf(stderr, "Hello: %d!=%d or %d!=%d\n",
                               buffer[0], node, buffer[1], me);
                printf("Mismatch on hello process ids; node = %d\n", node);
            }

            printf("Hello from %d to %d\n", me, node);
            fflush(stdout);
        }
    }
}

void Ring() /* Time passing a message round a ring */
{
    int nproc, me;
    MPI_Status status;
    int type = 4;
    int left, right;
    int i;
    char *buffer;
    int lenbuf, max_len;
    double us_rate;
    double start_ustime, used_ustime;

    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    left = (me + nproc - 1) % nproc;
    right = (me + 1) % nproc;

    /* Find out how big a message to use */

    if (me == 0) {
        (void) printf("\nRing test...time network performance\n---------\n\n");
        (void) printf("Input maximum message size: ");
        (void) fflush(stdout);
    }
    max_len = GlobalReadInteger();
    if ( (max_len <= 0) || (max_len >= 4*1024*1024) )
        max_len = 512*1024;
    if ( (buffer = (char *) malloc((unsigned) max_len)) == (char *) NULL) {
        printf("process %d could not allocate buffer of size %d\n",me,max_len);
        MPI_Abort(MPI_COMM_WORLD,7777);
    }

    lenbuf = 1;
    while (lenbuf <= max_len) {
        start_ustime = MPI_Wtime();
        for (i=1; i<=nbiter; i++) {
            if (me == 0) {
                MPI_Send(buffer,lenbuf,MPI_CHAR,left, type,MPI_COMM_WORLD);
                MPI_Recv(buffer,lenbuf,MPI_CHAR,right,type,MPI_COMM_WORLD,&status);
            }
            else {
                MPI_Recv(buffer,lenbuf,MPI_CHAR,right,type,MPI_COMM_WORLD,&status);
                MPI_Send(buffer,lenbuf,MPI_CHAR,left, type,MPI_COMM_WORLD);
            }
        }
        used_ustime = MPI_Wtime() - start_ustime;

        if (used_ustime > 0) { /* rate is megabytes per second; the
                                  message crosses nproc links per iteration */
            us_rate = (double)nproc * (double)lenbuf * (double)nbiter;
            us_rate = us_rate / (used_ustime*(double)1000000);
        }
        else
            us_rate = 0.0;
        if (me == 0) {
            printf("len=%d bytes * %d iter, roundtrip= %f s, rate=%f Mbytes/s\n", lenbuf, nbiter, used_ustime/nbiter, us_rate);
        }
        lenbuf *= 2;
    }
    if (me == 0)
        printf("clock resolution in seconds: %10.8f\n", MPI_Wtick());
    free(buffer);
}