Here are a few results I got from running my code just now. I am giving the
runtime details from the Linux command 'top'; maybe this will give you a
better idea of what is going on with the execution of my parallel code.
3 processes on a single node took 152 seconds
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
6258 rrkuma0 14 0 5984 5984 820 S 54.9 0.7 1:10 10Blocking_Dyna
6257 rrkuma0 15 0 6072 6072 904 S 52.3 0.7 1:03 10Blocking_Dyna
6259 rrkuma0 14 0 5984 5984 820 R 50.5 0.7 1:03 10Blocking_Dyna
4 processes on a single node took 129 seconds
6365 rrkuma0 19 0 4852 4852 820 R 54.3 0.6 0:09 10Blocking_Dyna
6367 rrkuma0 18 0 4560 4560 820 S 51.4 0.5 0:08 10Blocking_Dyna
6364 rrkuma0 18 0 4944 4944 904 R 45.6 0.6 0:08 10Blocking_Dyna
6366 rrkuma0 16 0 4852 4852 820 R 44.6 0.6 0:08 10Blocking_Dyna
5 processes on a single node took 143 seconds
6429 rrkuma0 14 0 4280 4280 820 R 47.7 0.5 0:52 10Blocking_Dyna
6426 rrkuma0 13 0 4372 4372 904 S 39.0 0.5 0:52 10Blocking_Dyna
6427 rrkuma0 13 0 4284 4284 824 S 36.0 0.5 0:52 10Blocking_Dyna
6428 rrkuma0 14 0 4280 4280 820 R 36.0 0.5 0:51 10Blocking_Dyna
6430 rrkuma0 11 0 3152 3152 820 R 23.4 0.4 0:34 10Blocking_Dyna
3 processes run on 3 nodes took 121 seconds
6273 rrkuma0 18 0 6072 6072 904 R 62.4 0.7 0:10 10Blocking_Dyna
5661 rrkuma0 18 0 5980 5980 820 S 64.3 0.7 0:20 10Blocking_Dyna
5384 rrkuma0 16 0 5980 5980 820 S 55.8 0.7 0:36 10Blocking_Dyna
4 processes on 4 nodes took 110 seconds
6352 rrkuma0 17 0 4944 4944 904 R 55.3 0.6 0:05 10Blocking_Dyna
5741 rrkuma0 17 0 4848 4848 820 R 59.3 0.6 0:26 10Blocking_Dyna
5464 rrkuma0 19 0 4848 4848 820 S 54.9 0.6 0:39 10Blocking_Dyna
5255 rrkuma0 17 0 4556 4556 820 R 49.1 0.5 0:45 10Blocking_Dyna
5 processes on 5 nodes took 113 seconds
6448 rrkuma0 16 0 4372 4372 904 R 47.8 0.5 0:12 10Blocking_Dyna
5820 rrkuma0 17 0 4280 4280 824 S 51.4 0.5 0:19 10Blocking_Dyna
5543 rrkuma0 18 0 4276 4276 820 S 52.7 0.5 0:25 10Blocking_Dyna
5334 rrkuma0 16 0 4276 4276 820 S 49.5 0.5 0:32 10Blocking_Dyna
5821 rrkuma0 19 0 3148 3148 820 R 35.0 0.4 0:24 10Blocking_Dyna
Thanks,
Ravi R. Kumar
Quoting "Kumar, Ravi Ranjan" <rrkuma0_at_[hidden]>:
> Hello,
>
> I wrote a code in C++ using MPI. I divided a bigger block into smaller
> blocks and assigned each block to a different node/process. Below is the
> pseudocode:
>
>
> for (time = 1; time <= Nt; time++)
> {
>     do {
>         // some data exchange between neighbouring blocks (nodes/processes)
>         // some computation in each block (node/process)
>         MPI_Allreduce(... to find the convergence condition ...);
>     } while (!converged);
> }
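> 
> To make the pattern more concrete, one inner iteration looks roughly like
> the sketch below. This is a simplified illustration rather than my actual
> code: it assumes a 1-D decomposition with one ghost cell on each side, the
> names u, n, left, right and tol are placeholders, and it uses MPI_Sendrecv
> instead of separate MPI_Send/MPI_Recv calls so the paired exchange cannot
> deadlock.
> 
> #include <mpi.h>
> 
> // Sketch of the inner iterations of one time step: exchange halos with
> // the two neighbouring blocks, update the local block, then agree on
> // the convergence decision.
> void inner_iterations(double* u, int n, int left, int right, double tol)
> {
>     MPI_Status st;
>     double local_err, global_err;
>     do {
>         // send first interior point left, receive right ghost cell
>         MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  0,
>                      &u[n-1], 1, MPI_DOUBLE, right, 0,
>                      MPI_COMM_WORLD, &st);
>         // send last interior point right, receive left ghost cell
>         MPI_Sendrecv(&u[n-2], 1, MPI_DOUBLE, right, 1,
>                      &u[0],   1, MPI_DOUBLE, left,  1,
>                      MPI_COMM_WORLD, &st);
> 
>         local_err = 0.0;
>         // ... update interior points u[1]..u[n-2], accumulate local_err ...
> 
>         // every process must see the same convergence decision
>         MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE,
>                       MPI_MAX, MPI_COMM_WORLD);
>     } while (global_err > tol);
> }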
>
> Results from the parallel code agree quite well with the results from the
> serial code. However, the scalability is poor: when I increase the number
> of processes or nodes, there is not much improvement in the turnaround
> time. The serial code takes 225 seconds, whereas under the same conditions
> the parallel code with:
>
> 3 processes on a single node takes 167 seconds,
> 3 processes on 3 nodes takes 115 seconds,
> 4 processes on 4 nodes takes 111 seconds,
> 10 processes on 5 nodes takes 88 seconds.
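> 
> (That works out to a speedup of roughly 1.3 (225/167) with 3 processes on
> one node, 2.0 (225/115) on 3 nodes, and 2.6 (225/88) with 10 processes,
> i.e. a parallel efficiency of about 26% in the last case.)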
>
> Is this much scalability acceptable? I was expecting better performance, at
> least 10 times faster. Is there any way I can improve the speedup? I am
> using LAM-MPI on a Linux (Linux k00 2.4.17) cluster connected via Ethernet
> (TCP/IP - I do not know much about networking). Below are the cluster
> details:
>
> 20 + 2 spare nodes
> 44 1.4GHz Athlon processors
> 512 MB RAM per processor
> Channel-Bonded Network
> four 24-way switches
> four NICs per node
> 3-way+1 NFS or 4-way
>
> 40 GB hard drive per node
> Theoretical peak performance: 224 GFLOPS
>
>
> Is performance also related to blocking vs. non-blocking send/recv? By the
> way, I am using blocking MPI_Send/MPI_Recv. Or is it related to the
> communication overhead? How can this communication overhead be reduced?
> Please give me some ideas.
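> 
> For instance, would posting the halo exchange with non-blocking calls and
> doing the interior update while the messages are in flight help? Something
> like the following untested sketch (same placeholder names as the sketch
> above) is what I have in mind:
> 
> #include <mpi.h>
> 
> // Non-blocking halo exchange (sketch): post the messages, update the
> // interior while they are in flight, then finish the boundary points.
> void exchange_and_update(double* u, int n, int left, int right)
> {
>     MPI_Request reqs[4];
>     MPI_Status  stats[4];
> 
>     // post receives first, then sends, for both ghost cells
>     MPI_Irecv(&u[n-1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
>     MPI_Irecv(&u[0],   1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[1]);
>     MPI_Isend(&u[1],   1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[2]);
>     MPI_Isend(&u[n-2], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[3]);
> 
>     // ... update interior points u[2]..u[n-3], which need no ghost data,
>     //     while the messages are in flight ...
> 
>     MPI_Waitall(4, reqs, stats);
> 
>     // ... now update u[1] and u[n-2], which use the ghost cells ...
> }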
>
> Thanks in advance,
> Ravi R. Kumar
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>