
LAM/MPI General User's Mailing List Archives


From: Kumar, Ravi Ranjan (rrkuma0_at_[hidden])
Date: 2005-03-29 17:57:03


Hello,

Thanks for clarifying my doubts. I checked my parallel code results against the
serial code results. Both are in agreement (with and without the barrier), even
though the time-step outputs on the terminal are not synchronized. This also
confirms that it's a matter of printing order only.
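
For reference, a minimal sketch of the kind of rank-tagged progress printing in
question (an assumed example, not the actual code; the step count of 250 and
the message format are placeholders):

/* Sketch (assumed, not the original program): rank-tagged progress
 * printing.  Even with fflush() and barriers, the interleaving of these
 * lines on mpirun's terminal is not guaranteed, because each node
 * buffers its stdout and forwards it to mpirun asynchronously. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (step = 0; step < 250; step++) {
        /* ... one SOR iteration would go here ... */
        printf("rank %d reached time step %d\n", rank, step);
        fflush(stdout);                  /* flushes locally; does not order
                                            output across ranks */
        MPI_Barrier(MPI_COMM_WORLD);     /* synchronizes the processes,
                                            not their output */
    }

    MPI_Finalize();
    return 0;
}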

However, I have another doubt. I ran my code using 2 nodes (K00 & K02) in LAM,
with 10 processes, hence 5 processes per node:

[rrkuma0_at_k00 SOR]$ time mpirun -v -np 10 10Blocking_Dynamic_SOR_MPI
5491 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5513 10Blocking_Dynamic_SOR_MPI running on n1
5492 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5514 10Blocking_Dynamic_SOR_MPI running on n1
5493 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5515 10Blocking_Dynamic_SOR_MPI running on n1
5494 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5516 10Blocking_Dynamic_SOR_MPI running on n1
5495 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5517 10Blocking_Dynamic_SOR_MPI running on n1

I was watching the output on the terminal. For the first few time steps, all
the ranks were printing together. Later on, I found that all the even-numbered
processes printed their output together and all the odd-numbered processes
printed their output together. The even-numbered processes finished their job
quite a bit earlier than the odd-numbered ones; the time taken by the
even-numbered processes was much less than that of the odd-numbered ones. What
can be the reason for this difference? Why do the odd-numbered processes take
so long? I don't think I put any workload difference between the even- and
odd-numbered processes in my code. The only thing I did in the
data_exchange_subroutine is that an even-numbered rank first sends data and
then receives, whereas an odd-numbered rank first receives data and then sends.
That's all.
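
For context, a minimal sketch of that even/odd ordering, assuming a 1D row
decomposition with blocking sends and receives (the function and buffer names
are placeholders, not the actual data_exchange_subroutine):

/* Sketch (assumed names): even ranks send their boundary rows first and
 * then receive; odd ranks receive first and then send.  With blocking
 * MPI_Send/MPI_Recv this pairing guarantees that every send meets a
 * matching receive on the neighbouring rank, so no pair deadlocks. */
#include <mpi.h>

void exchange_halo(double *top_send, double *top_recv,
                   double *bot_send, double *bot_recv,
                   int count, int rank, int nprocs)
{
    MPI_Status status;
    int up   = rank - 1;     /* neighbour owning the rows above (if any) */
    int down = rank + 1;     /* neighbour owning the rows below (if any) */

    if (rank % 2 == 0) {
        /* even rank: send first, then receive */
        if (up >= 0)
            MPI_Send(top_send, count, MPI_DOUBLE, up, 0, MPI_COMM_WORLD);
        if (down < nprocs)
            MPI_Send(bot_send, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
        if (up >= 0)
            MPI_Recv(top_recv, count, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, &status);
        if (down < nprocs)
            MPI_Recv(bot_recv, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &status);
    } else {
        /* odd rank: receive first, then send */
        if (up >= 0)
            MPI_Recv(top_recv, count, MPI_DOUBLE, up, 0, MPI_COMM_WORLD, &status);
        if (down < nprocs)
            MPI_Recv(bot_recv, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &status);
        if (up >= 0)
            MPI_Send(top_send, count, MPI_DOUBLE, up, 0, MPI_COMM_WORLD);
        if (down < nprocs)
            MPI_Send(bot_send, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
    }
}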

Again, I ran my code using 10 processes on a single node. All the processes
finished at essentially the same time. See below:

[rrkuma0_at_k00 SOR]$ time mpirun -v -np 10 10Blocking_Dynamic_SOR_MPI
5575 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5576 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5577 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5578 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5579 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5580 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5581 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5582 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5583 10Blocking_Dynamic_SOR_MPI running on n0 (o)
5584 10Blocking_Dynamic_SOR_MPI running on n0 (o)
Tue Mar 29 17:45:10 2005

The time taken by 10 processes on a single node is 3 minutes, whereas the time
taken by 10 processes distributed over 2 nodes is 5 minutes. Why is this
happening? Kindly clarify. Thanks a lot!
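
One way to narrow this down would be to time the computation and the boundary
exchange separately inside each iteration; a hypothetical sketch (sor_sweep and
data_exchange are placeholders for the real routines, not the actual code):

/* Hypothetical diagnostic: measure compute time and exchange time
 * separately with MPI_Wtime(), to see how much of the extra run time on
 * 2 nodes is spent in inter-node communication rather than in the SOR
 * sweep itself. */
#include <mpi.h>
#include <stdio.h>

void sor_sweep(void);       /* placeholder for the local SOR update */
void data_exchange(void);   /* placeholder for the boundary exchange */

void timed_iteration(int rank)
{
    double t0, t_comp, t_comm;

    t0 = MPI_Wtime();
    sor_sweep();
    t_comp = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    data_exchange();
    t_comm = MPI_Wtime() - t0;

    printf("rank %d: compute %.3f s, exchange %.3f s\n",
           rank, t_comp, t_comm);
    fflush(stdout);
}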

Ravi R. Kumar

Quoting Jeff Squyres <jsquyres_at_[hidden]>:

> On Mar 29, 2005, at 4:30 PM, Kumar, Ravi Ranjan wrote:
>
> > In fact, I am printing out the 'rank' and the 'time step' just before
> > calling the MPI_Barrier. When all the ranks except 1 and 6 reach the
> > 250th time step, ranks 1 & 6 have only reached the 100th time step. Is
> > this just because of the printing order, or am I making some mistake in
> > implementing what I want from the code? Also, I am using MPI_Allreduce,
> > which should also enforce parallel stepping of all the processes.
> > Isn't it? Please clarify.
>
> Yes, you cannot rely on the ordering of this output, even with barriers
> -- particularly if your iteration is really short. Output can get
> buffered for a while on a remote node before it is sent over to mpirun,
> for example.
>
> Allreduce should also do a pretty good job of enforcing
> synchronization, since, by definition, every process needs to get the
> result (specifically: it meets the same definition of barrier in this
> case -- assuming that every process contributes data [i.e., count>0],
> no process can leave allreduce until all processes have entered
> allreduce). So your barrier is actually redundant in this situation.
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
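
Regarding the allreduce point in the quoted reply, a minimal sketch (an assumed
example, not code from either message) of why MPI_Allreduce with count > 0
already synchronizes all ranks, making an extra MPI_Barrier redundant:

/* Sketch (assumed values): no rank can return from MPI_Allreduce until
 * every rank has entered it, because every rank needs the combined
 * result.  The MPI_Barrier below therefore adds no synchronization. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    double local_err = 0.1, global_err;   /* placeholder residual values */

    MPI_Init(&argc, &argv);

    MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE,
                  MPI_MAX, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);   /* redundant: the allreduce above has
                                      already synchronized all ranks */

    printf("global residual = %f\n", global_err);
    MPI_Finalize();
    return 0;
}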