Hello,
I wrote a program in C++ using MPI. I divided a larger block into smaller
blocks and assigned each block to a different node/process. Below is the
pseudocode:
for (time = 1; time <= Nt; time++)
{
    do {
        // exchange data with neighbouring blocks (nodes/processes)
        // some computation in each block (node/process)
        MPI_Allreduce(... to find convergence condition ...);
    } while (!converged);
}
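Concretely, the convergence check at the end of each inner iteration works
roughly like the sketch below (the residual and tolerance names are just
placeholders, not my exact variables):

#include <mpi.h>

// Each process computes a local residual for its block; MPI_Allreduce
// with MPI_MAX gives every process the global maximum, so all ranks
// decide together whether to leave the do-while loop.
bool globally_converged(double local_residual, double tolerance)
{
    double global_residual = 0.0;
    MPI_Allreduce(&local_residual, &global_residual, 1,
                  MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return global_residual < tolerance;
}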
Results from the parallel code agree quite well with the results from the
serial code. However, the scalability is poor: when I increase the number of
processes or nodes, there is not much improvement in the turnaround time. The
serial code takes 225 seconds, whereas for the same conditions the parallel
code with:
3 processes on a single node takes 167 seconds,
3 processes on 3 nodes takes 115 seconds,
4 processes on 4 nodes takes 111 seconds,
10 processes on 5 nodes takes 88 seconds.
Is this much scalability acceptable? I was expecting better performance, at
least 10 times the serial speed, but even with 10 processes the speedup is only
about 225/88 ≈ 2.6. Is there any way I can improve the speedup? I am using
LAM-MPI on a Linux (Linux k00 2.4.17) cluster connected via Ethernet (TCP/IP -
I do not know much about networking). Below are the cluster details:
20 + 2 spare nodes
44 1.4GHz Athlon processors
512 MB RAM per processor
Channel-Bonded Network
four 24-way switches
four NICs per node
3-way+1 NFS or 4-way
40 GB hard drive per node
Theoretical peak performance: 224 GFLOPS
Is performance related to blocking vs. non-blocking send/recv too? By the way,
I am using blocking MPI_Send/MPI_Recv. Or is it related to the communication
overhead? How can this communication overhead be reduced? Please give some
ideas.
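For example, would replacing the blocking calls with non-blocking ones, roughly
as in the sketch below, help hide some of the Ethernet latency? (The neighbour
ranks and halo buffers here are just placeholders for my actual exchange.)

#include <mpi.h>

// Sketch of a non-blocking halo exchange: post the receives and sends,
// compute on interior cells that do not need neighbour data, then wait
// for the exchange to finish before updating the boundary cells.
void exchange_and_compute(double *send_left, double *send_right,
                          double *recv_left, double *recv_right,
                          int n, int left, int right)
{
    MPI_Request reqs[4];
    MPI_Status  stats[4];
    MPI_Irecv(recv_left,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_right, n, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(send_left,  n, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(send_right, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    // ... compute interior points here, overlapping with communication ...

    MPI_Waitall(4, reqs, stats);

    // ... now compute the boundary points that use the received halo data ...
}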
Thanks in advance,
Ravi R. Kumar