On Mar 29, 2005, at 5:57 PM, Kumar, Ravi Ranjan wrote:
> However, I have another doubt. I run my code using 2 nodes (K00 & K02)
> in LAM. I used 10 processes to run my code hence 5 processes per node:
>
> [rrkuma0_at_k00 SOR]$ time mpirun -v -np 10 10Blocking_Dynamic_SOR_MPI
> 5491 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5513 10Blocking_Dynamic_SOR_MPI running on n1
> 5492 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5514 10Blocking_Dynamic_SOR_MPI running on n1
> 5493 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5515 10Blocking_Dynamic_SOR_MPI running on n1
> 5494 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5516 10Blocking_Dynamic_SOR_MPI running on n1
> 5495 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5517 10Blocking_Dynamic_SOR_MPI running on n1
>
> I was watching the output on the terminal. For the first few time
> steps, all the ranks were printing together. Later on I found that
> all the even-numbered processes printed their output together and the
> odd-numbered processes printed their output together. The
> even-numbered processes finished their jobs quite a bit earlier than
> the odd-numbered processes, and took much less time. What can be the
> reason for this difference? Why do the odd-numbered processes take so
> long? I don't think I put any workload difference between even- and
> odd-numbered processes in my code. I just followed one convention in
> data_exchange_subroutine: an even-numbered rank first sends data and
> then receives, whereas an odd-numbered rank first receives data and
> then sends. That's all.
There are two issues here:
1. You really can't rely on the ordering of output. So regardless of
what it *looks* like, you really can't say -- via printf-style output
-- what order things actually finished in. Getting the *real* order is
actually pretty hard; you have to account for the clock differences
between cluster nodes, etc. If all your nodes NTP-sync to a common
server (for example), it's a lot easier, but there may still be small
differences.
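To make the clock-skew point concrete, here's a toy sketch (plain
Python, no MPI; the node names, offsets, and timestamps are all
invented for illustration) showing how raw node-local timestamps can
report the wrong finish order until you correct each one by its node's
clock offset relative to a common timebase:

```python
# Toy illustration: two nodes whose clocks disagree by a known offset.
# Raw local timestamps suggest one finish order; offset-corrected
# timestamps reveal the true order.  All numbers here are made up.

# (rank, node, node-local finish timestamp in seconds)
finishes = [
    (0, "n0", 100.00),
    (1, "n1", 100.50),  # but n1's clock runs 2 s ahead of n0's
]

# Per-node clock offset relative to a common timebase (e.g. from NTP).
clock_offset = {"n0": 0.0, "n1": 2.0}

def true_time(node, local_ts):
    """Map a node-local timestamp onto the common timebase."""
    return local_ts - clock_offset[node]

raw_order = sorted(finishes, key=lambda f: f[2])
corrected_order = sorted(finishes, key=lambda f: true_time(f[1], f[2]))

print("raw order:      ", [r for r, _, _ in raw_order])
print("corrected order:", [r for r, _, _ in corrected_order])
```

The raw timestamps say rank 0 finished first; after correcting for
n1's 2-second skew, rank 1 actually finished first -- which is exactly
why "who printed first" tells you very little across nodes.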
2. When you oversubscribe a node, you really can't compare performance
at all. There are too many factors involved once you start thrashing
the CPU and memory subsystems, etc.
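That said, the even/odd ordering in your data_exchange_subroutine is
the standard way to avoid deadlock with blocking sends. Here's a toy
simulation (plain Python, no MPI; the ring-shift pattern and rank
count are just illustrative) of a synchronous rendezvous send/recv: if
every rank sends first, nobody's receive is ever posted and the system
deadlocks, whereas even-sends-first / odd-receives-first always makes
progress:

```python
def simulate(nprocs, ops_for):
    """Simulate synchronous (rendezvous) send/recv on a ring.

    ops_for(rank) returns rank's ordered operations, each ("send", dest)
    or ("recv", src).  Returns True if all ranks complete, False if no
    operation can make progress (deadlock).
    """
    pending = {r: list(ops_for(r)) for r in range(nprocs)}
    progressed = True
    while progressed and any(pending.values()):
        progressed = False
        for r in range(nprocs):
            if not pending[r]:
                continue
            kind, peer = pending[r][0]
            # A synchronous send completes only when the peer's *next*
            # operation is the matching receive, and vice versa.
            want = ("recv", r) if kind == "send" else ("send", r)
            if pending[peer] and pending[peer][0] == want:
                pending[r].pop(0)
                pending[peer].pop(0)
                progressed = True
    return not any(pending.values())

NPROCS = 10  # even number of ranks, as in your 10-process run
right = lambda r: (r + 1) % NPROCS
left = lambda r: (r - 1) % NPROCS

def everyone_sends_first(r):
    # Every rank does a blocking send before posting its receive.
    return [("send", right(r)), ("recv", left(r))]

def even_odd(r):
    # Even ranks send first; odd ranks receive first (your scheme).
    if r % 2 == 0:
        return [("send", right(r)), ("recv", left(r))]
    return [("recv", left(r)), ("send", right(r))]

print("all send first:", simulate(NPROCS, everyone_sends_first))
print("even/odd order:", simulate(NPROCS, even_odd))
```

Note that in real MPI, a standard-mode MPI_Send is allowed to buffer
small messages, so the all-send-first version often *appears* to work
for small message sizes -- and then deadlocks when the messages grow.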
> Again, I ran my code using 10 processes on a single node. All the
> processes
> ended simultaneously. See below:
>
> [rrkuma0_at_k00 SOR]$ time mpirun -v -np 10 10Blocking_Dynamic_SOR_MPI
> 5575 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5576 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5577 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5578 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5579 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5580 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5581 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5582 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5583 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> 5584 10Blocking_Dynamic_SOR_MPI running on n0 (o)
> Tue Mar 29 17:45:10 2005
>
> The time taken by 10 processes on a single node is 3 minutes, whereas
> the time taken by 10
> processes distributed over 2 nodes is 5 minutes. Why is this
> happening? Kindly
> clarify. Thanks a lot!
Keep in mind that MPI communication takes time. When it's all on one
node, it's done via shared memory and is very fast. When it's done
across multiple nodes, I suspect you're using a TCP network, and that's
orders of magnitude slower. So every Allreduce, Barrier, Send, Recv,
etc., takes time. Parallel codes are typically designed quite
carefully to minimize communication whenever possible, and/or to
overlap communication with computation.
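As a toy illustration of the overlap idea (plain Python; a background
thread stands in for a nonblocking MPI_Isend/MPI_Irecv + MPI_Wait, and
time.sleep() stands in for both the transfer and the computation),
starting the "communication" first and computing while it is in flight
costs roughly max(comm, compute) instead of comm + compute:

```python
import threading
import time

COMM_TIME = 0.2     # pretend network transfer
COMPUTE_TIME = 0.2  # pretend local computation

def communicate():
    time.sleep(COMM_TIME)

def compute():
    time.sleep(COMPUTE_TIME)

# Sequential: blocking communication, then computation.
t0 = time.perf_counter()
communicate()
compute()
sequential = time.perf_counter() - t0

# Overlapped: start the transfer "in the background" (like posting an
# MPI_Isend/MPI_Irecv), do useful work meanwhile, then wait for it
# (like MPI_Wait).
t0 = time.perf_counter()
comm = threading.Thread(target=communicate)
comm.start()   # ~ MPI_Isend / MPI_Irecv
compute()      # useful work while the transfer is in flight
comm.join()    # ~ MPI_Wait
overlapped = time.perf_counter() - t0

print(f"sequential: {sequential:.2f}s, overlapped: {overlapped:.2f}s")
```

The overlapped run finishes in about half the sequential time here;
real codes get whatever fraction of the communication they can
genuinely hide behind computation.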
Here's a page that I wrote a long, long time ago that explains some of
this kind of stuff: http://www.osl.iu.edu/~jsquyres/bladeenc/ See the
Technical Details page.
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/