On Fri, 30 Aug 2002, Robbie (Rohit) Nadig wrote:
> I have a large file that I want to search for specific patterns. The
> results of the search are presented to the user in a TCL GUI.
>
> Currently , searching for patterns takes a long time and hence makes
> interactive use nearly impossible.
>
> I was wondering if it would be easy to split the large file and load the
> file into programs on a LAM-mpi node, that way each node performs the
> search on a small segment of the entire file.
The question that you need to answer here is to determine the amount of
overhead involved in doing this (i.e., making your program parallel). So
ignore LAM and MPI for the moment, and do some back-of-the-envelope kinds
of calculations...
Let me make some assumptions about your setup -- you should be able to
fill in whatever values are actually correct for your environment.
- you have N nodes
- you have a M Mbps (mega bits per second) network
- the file is S bytes; you will be sending S/N bytes to each node
- searching the entire file on one node taks T seconds
So using these values, you need to figure out whether it's worthwhile to
do this in parallel. This is a fairly common type of analysis. The trick
is to ensure that the actual work performed:
1. can actually be done in less time than in serial
2. the overhead of working in parallel will not cause #1 to be false
In a perfect parallel program, you will reduce your operating time to T/N.
So the first question is: is T/N noticably smaller than N? "Noticably
smaller" is for you to define, depending on your application. For
example, if T is 10, and N is 2, then you're probably better off leaving
it in serial because 5 seconds faster is not worth the complexity of
making a program parallel (unless you plan to scale T up to much larger
values). As another example -- if T is 1000, and N is 4, you can
[potentially] reduce the run time to 250 seconds. This is a fairly large
win.
So assuming that you T/N computation shows that you want to proceed, then
you need to realize that T/N is only for "perfect" programs, and is
rare/never happens. You need to add the overhead caused by running in
parallel (sending and receiving data between the individual processes,
startup/shutdown costs, etc.). Let's call this "O". So the time for
execution of your parallel program (very roughly modeled, of course) is
actually T/N+O.
Sidenote: you may be able to reduce O a lot in your particular case, and
come close to the T/N timing. See below.
So let's try to calculate (roughly) what O will be. O is comprised of
multiple things. Let's take the most obvious ones: communication costs,
startup costs, shutdown costs.
- Startup costs are essentially executing an "rsh" or "ssh" to each node
involved. It's essentially how long lamboot takes to run across all N
nodes, and is generally proportional to N. This is probably not a huge
cost, and can certainly be amortized over lots of runs (since you only run
lamboot [essentially] once per login), but it should be noted. If you
plan to have lots of runs, you can ignore this, though, on the
amortization argument. To be fair, mpirun itself isn't instantaneous
either, but it's relatively small (for small to mid-sized values of N,
say, N < 100 or so), so let's ignore that as well.
- Shutdown costs are essentially the same as mpirun costs, so let's ignore
those, too.
- Communication costs are the big issue here. I'm guessing that your
communication pattern will essentially be sending S/N bytes to each node
(perhaps not to the "master" node, since it already has the file). At the
end, each node will send back a short reply (R bytes -- R is probably
pretty small).
You do have a choice here of using a networked filesystem which each node
simply does a normal "open()" to read the file. In my experience, this is
typically more expensive (slower) than using MPI simply because of the
complex network protocols involved, and the network contention of all
nodes simultaneously accessing the file. However, YMMV -- depending on
your particular network setup (e.g., AFS caches the file on each node, and
successive runs may be very fast -- faster than MPI). Another strategy is
to pre-position the file on the local disk of all nodes, for example using
scp with compression enabled (if your file is text and you use scp [or
whatever] with compression ebaled, it may be *very* fast to send it to all
nodes, and then when they all do open() on the file, it's local, there's
no network cost, and no network contention). So measure all of these
things and see which comes out to be fastest. The goal here is to reduce
O as low as possible, but combine that with operational convenience (i.e.,
one method may be significantly faster than others, but manual and/or
complex/cumbersome such that users won't want to do it, or will
consistently do it wrong, etc.).
You can more-or-less guess how long MPI will take to send your file
because MPI adds very little overhead to sending the raw bytes across the
network. So assuming that S is large (more to the point, assuming that
S/N is large), the overhead (i.e., a few bytes -- say 32 or 64) that MPI
adds to each message that is sent across the network is fairly negligible
in comparison to the data of the message itself. So consider a simplistic
example, where the "master" node reads in the entire file and does an
MPI_Send to each other process, sending them S/N bytes. This means that
MPI does (N-1) sends, each of S/N bytes, in a [most likely] serial
fashion. So the amount of time required to do this:
Time to send a single messages of (S/N) bytes:
(S/N) bytes * 8 bits * 1 megabit * 1 second
------ --------- ----------
1 byte 2^20 bits M megabits
If that's the time to send one message, you can (roughly) multiply that by
(N-1) to find out how long it takes to send (N-1) messages.
Also figure that if you have an 100Mbps network, you really only probably
get about 80-90Mbps (on a good day -- with nothing else going across the
network). Use a program like NetPIPE to figure out your actual network
throughput (do a google search for NetPIPE).
If N gets to be large, you may want to do something more clever than a
straight serial send. For example, you may want to use MPI_Scatter() to
do the sending, which, even though you'll have the send the "master"'s
bytes to itself, can be faster if the MPI implementation uses a tree-based
sending algorithm, rather than a linear algorithm (which LAM does, if N >
4). That gets a little more complicated to calculate, so I'll leave that
as an exercise for the reader (and this mail is long enough already!).
:-)
So remember that your run time is effectively T/N + O, where O is the time
to send (N-1)(S/N) bytes, per formula that we calulcated above. If R is
significantly less than S/N, then ignore its contribution to O. If R is
large, then add it in per the same formula, above (remember (N-1)R bytes
total).
So now you need to look at T/N + O and seem if its significantly
("noticably") less than T. Also compare the different calculations for O,
depending on whether you use a networked filesystem (NFS or AFS or ...),
whether you pre-stage the file on each node, or whether you use
compression or not.
Sidenote on compression: you can use compression with pre-staging or with
MPI -- you can certainly link to zlib to do compression yourself, if
you're so bold, and then use MPI to send compressed data instead of
uncompressed data. Compression will essentially reduce S/N to [hopefully]
a much smaller number, but at the cost of programming complexity, plus a
contribution to O itself for the cost of compressing and decompressing.
The trick here is to find a compression/decompression cost that is
significantly less than the cost of sending across the network. This
requires a similar analysis to what we have discussed here in terms of
communication costs. Text compresses very well, for example -- even
though it may take a few seconds to compress S/N bytes [call it Tc], you
may end up with a total number of bytes such that the total time to send
it across the network [call it Ts] is still a "win" -- i.e., (Tc*2 + Ts) <
time_to_send_S/N_bytes) -- remember that you have to compress to send and
then uncompress to receive, hence (Tc*2). It's worth looking at, though.
There's a lot of estimates involved here, and a lot depends on your local
setup and exactly what you're trying to do -- you can make any of the
calculations that I listed here more complex and a closer model to
reality. I simply state the "big factors" here to get your started.
So if I were you, I'd do a few tests like those mentioned above, and some
back-of-the-envelope calculations to see (at least in terms of order of
magnitude) how much a parallel approach will actually save in terms of
execution time before you spend the time parallelizing your program.
Hope this helps.
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
|