On Mon, 5 May 2003, Pak, Anne O wrote:
> I have a program where I use MPI_Comm_spawn to spawn processes on several
> remote nodes and then use MPI_Scatter to split and distribute some
> rather large input matrices to these spawned nodes.
>
> It seems like the bulk of the execution time for the program lies with
> using MPI_Scatter. Has anyone else experienced this or benchmarked this
> command's performance?
Let me tell you how LAM's MPI_Scatter call works: for 4 or fewer nodes, it
does a linear set of MPI_Sends from the root. For more than 4 nodes, a
binomial tree is used. So if you have one mega-communicator between the
master and all the spawned processes, it's likely that a binomial tree is
being used.
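To make "a linear set of MPI_Sends from the root" concrete, here's a rough
sketch of the linear case (this is not LAM's actual source; the MPI_BYTE /
fixed-chunk framing is just for illustration):
#include <mpi.h>
#include <string.h>
/* Sketch of a linear scatter: the root loops over MPI_Send, everyone
   else does a single MPI_Recv.  The root's cost grows linearly with the
   number of processes, which is why a binomial tree wins for larger
   communicators. */
void linear_scatter(char *sendbuf, char *recvbuf, int chunk_bytes,
                    int root, MPI_Comm comm)
{
    int rank, size, i;
    MPI_Status status;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == root) {
        for (i = 0; i < size; ++i) {
            if (i == root)
                memcpy(recvbuf, sendbuf + i * chunk_bytes, chunk_bytes);
            else
                MPI_Send(sendbuf + i * chunk_bytes, chunk_bytes, MPI_BYTE,
                         i, 0, comm);
        }
    } else {
        MPI_Recv(recvbuf, chunk_bytes, MPI_BYTE, root, 0, comm, &status);
    }
}
With a binomial tree, the root only sends to a few processes and they
re-send to others, so the total time grows more like log(N) than N.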
> Is the bottleneck the speed of the physical connection between the
> master node and all the slave nodes? or is there some overhead that i'm
> not taking into account?
The physical connection is going to be part of it. It also depends on
what kind of network you have (e.g., shared vs. switched).
This is actually a canonical problem in parallel computing: how to get the
data to the workers (particularly when it's a large amount of input data).
There are lots of solutions to this problem, but it really depends on your
exact situation. For example, you may find it easier/faster to distribute
the data ahead of time to the slaves using a shared filesystem or the
batch queueing system, or perhaps staging the data on the local disk where
each compute process runs, etc.
Analysis of how to get the data to your compute processes should probably
take into account things like (but not limited to): the amount of data to
be sent, network speed, available disk space, disk speed, amount of RAM,
frequency of sending the data, how often the [input] data changes, etc.
Keep in mind that there are some constants that you'll run into regardless
of whether you use MPI to distribute the data or not. For example, if you
have 1GB of data to send to each compute process, it will still take
roughly the same amount of time to send 1GB of data whether you use MPI or
raw TCP, because MPI doesn't add that much overhead (it might be somewhat
slower if you use NFS).
For example, if you use the same dataset for 100 runs, you might save a
lot of overhead by copying the entire datafile out to the local disk on all
the nodes and having all processes call open() on the file. Hence, you
pay the "scatter cost" only once when you scp (or whatever) the file to
all the nodes. At run time, since all processes have immediate access to
the data, the master process only needs to scatter indices of who is
supposed to do which work (vs. sending the actual input data itself).
That's just an example -- you'll need to look at your requirements to see
what works for you.
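A sketch of that "scatter indices, not data" pattern might look like the
following (the file name, chunk size, and index assignment are all made up
for illustration; the real thing depends on your data layout):
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define CHUNK_BYTES 4096   /* assumed size of one chunk of the input matrix */
int main(int argc, char *argv[])
{
    int rank, size, my_index, i;
    int *indices = NULL;
    char chunk[CHUNK_BYTES];
    FILE *fp;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        /* Decide who works on which chunk -- only these ints go over
           the network, not the input data itself. */
        indices = (int *) malloc(size * sizeof(int));
        for (i = 0; i < size; ++i)
            indices[i] = i;
    }
    MPI_Scatter(indices, 1, MPI_INT, &my_index, 1, MPI_INT, 0,
                MPI_COMM_WORLD);
    /* The bulk data comes from the copy that was staged on local disk
       beforehand (e.g., with scp), not from the master. */
    fp = fopen("/tmp/input.dat", "rb");
    if (fp != NULL) {
        fseek(fp, (long) my_index * CHUNK_BYTES, SEEK_SET);
        fread(chunk, 1, CHUNK_BYTES, fp);
        fclose(fp);
        /* ... compute on "chunk" ... */
    }
    if (rank == 0)
        free(indices);
    MPI_Finalize();
    return 0;
}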
> I am using C timing functions to measure the execution times. An
> excerpt of the code I'm using is as follows:
>
> #include <time.h>
> clock_t timer_start,timer_stop;
> double difference;
> timer_start=clock();
> MPI_Scatter(.....);
> .
> timer_stop=clock();
> difference = (double) (timer_stop-timer_start)/CLOCKS_PER_SEC;
> printf("%4.3e seconds\n", difference);
Keep in mind that this is only showing the time that it takes for *one*
process. Given that a binomial tree may be used, it's hard to define how
long the *entire* operation takes because it's a distributed operation and
some nodes may/will complete before others.
> typically 'difference' is on the order of 0.5 seconds, but i can
> physically FEEL that it's taking much longer than 0.5 seconds... in fact,
> running an independent stop watch on my computer shows that the total
> execution time is more like on the order of 3 seconds or so, so i'm not
> sure whether this great discrepancy is because I'm using the timer
> functions incorrectly, or if there is some overhead incurred from using
> MPI_Scatter that i'm not measuring.
To get a better (although still rough) measurement of how long the overall
process is taking, put in a barrier after the scatter. Something like:
timer_start = clock();
MPI_Scatter(...);
MPI_Barrier(...);
timer_stop = clock();
Since not everyone may finish the scatter at the same time, the barrier
afterwards will force a synchronization and make the end times much closer
together (a barrier is fairly quick; much quicker than a scatter to a
large number of nodes with a large amount of data).
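Filled out into something compilable, that suggestion might look like this
(buffer sizes and datatypes are made up; the point is just the barrier
between the scatter and the second timer read):
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define COUNT 262144   /* assumed number of ints sent to each process */
int main(int argc, char *argv[])
{
    int rank, size, i;
    int *sendbuf = NULL, *recvbuf;
    clock_t timer_start, timer_stop;
    double difference;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    recvbuf = (int *) malloc(COUNT * sizeof(int));
    if (rank == 0) {
        sendbuf = (int *) malloc((size_t) size * COUNT * sizeof(int));
        for (i = 0; i < size * COUNT; ++i)
            sendbuf[i] = i;   /* dummy data, just for timing */
    }
    timer_start = clock();
    MPI_Scatter(sendbuf, COUNT, MPI_INT, recvbuf, COUNT, MPI_INT, 0,
                MPI_COMM_WORLD);
    /* The barrier pulls everyone's "stop" time close together, so the
       measured interval is closer to the cost of the whole collective. */
    MPI_Barrier(MPI_COMM_WORLD);
    timer_stop = clock();
    difference = (double) (timer_stop - timer_start) / CLOCKS_PER_SEC;
    printf("rank %d: %4.3e seconds\n", rank, difference);
    free(recvbuf);
    if (rank == 0)
        free(sendbuf);
    MPI_Finalize();
    return 0;
}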
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/