On Tue, 22 Nov 2005, Angel Tsankov wrote:
> OK, my delusion about the shared memory communication time stems
> from the assumption that the OS provides some efficient mechanism to
> share memory. Obviously this is not the case and copying cannot be
> avoided.
The OS does provide several mechanisms for sharing memory, but each
comes with strings attached, and none is both fast and portable.
Furthermore, the application counts too; if the author just blindly
calls functions (including MPI ones) without thinking of the most
efficient way of _solving the problem_, it's no wonder that
performance is lacking.
For example, if the application (and sometimes compiler) allows using
threads, it might be best to not share memory between processes (as in
a typical MPI application), but between threads - there is no copy to
be made, all threads have access to all (common) data.
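To make the threads case concrete, here is a minimal sketch using POSIX
threads (my assumption, any thread library would do). The names
threaded_sum, sum_slice and struct slice are made up for illustration.
Each thread gets a pointer into the caller's array, so nothing is
copied:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work description: a view into the caller's array, not a copy. */
struct slice {
    const double *data;   /* points at the shared array */
    size_t begin, end;    /* half-open index range [begin, end) */
    double partial;       /* written only by the owning thread */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    s->partial = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        s->partial += s->data[i];
    return NULL;
}

/* Sum n doubles with two threads. Returns 0 on success, -1 on failure. */
int threaded_sum(const double *data, size_t n, double *result)
{
    struct slice s[2] = {
        { data, 0,     n / 2, 0.0 },
        { data, n / 2, n,     0.0 },
    };
    pthread_t t[2];

    if (pthread_create(&t[0], NULL, sum_slice, &s[0]) != 0)
        return -1;
    if (pthread_create(&t[1], NULL, sum_slice, &s[1]) != 0) {
        pthread_join(t[0], NULL);
        return -1;
    }
    /* Joining is the only synchronization needed here: the threads
       read shared data and write to disjoint fields. */
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);

    *result = s[0].partial + s[1].partial;
    return 0;
}
```

With processes, each half of the array would first have to be sent to
the other rank; here both threads simply read it in place.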
Another example is related to data buffering. A typical MPI
application allocates a send buffer that is filled with data and a
receive buffer that is used to gather data from other ranks. If the
data that is in the send buffer is not needed afterwards by the
sending rank, it's normally more efficient for the OS to unmap this
memory zone from the sender process and map it in the receiver process
as a receive buffer - this makes data go from one rank to another
without any data copying, but it's only the application that knows
that the sender buffer is no longer needed - the MPI library can't
know this.
This is similar to what is done when using shared memory (directly,
not through MPI) - the two processes allocate from the shared memory
pool and can use the allocated area for direct access, of course with
some kind of synchronization mechanism. This however requires again
the application to know that it wants to use shared memory, because
the allocation is not done through malloc().
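A minimal sketch of the direct shared memory approach, assuming a
POSIX system that supports anonymous shared mappings (MAP_ANONYMOUS is
widely available but is an extension on some systems). The function
name read_from_child is made up, and waitpid() stands in for a real
synchronization mechanism:

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child stores `value` in shared memory; the parent reads it back.
   Returns the value seen by the parent, or -1 on failure. */
int read_from_child(int value)
{
    /* Allocated with mmap(), not malloc(): the area stays shared
       between parent and child after fork(). */
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {
        *shared = value;   /* written in place, no copy to the parent */
        _exit(0);
    }

    /* Crude synchronization: wait until the child has finished. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }

    int seen = *shared;
    munmap(shared, sizeof *shared);
    return seen;
}
```

Two unrelated processes would need a named object instead (for
example shm_open() followed by mmap()), plus a semaphore or similar in
place of waitpid().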
All of the above use application-specific knowledge and cannot be
implemented in an MPI library. One variation that can work more
generically with a bit of OS support is to allocate the memory area in
one process, map it into the other process without unmapping it from
the first, but using COW (copy-on-write) - if the receiver only reads
the content, there is no need for any additional operation; if the
receiver wants to modify the data, then the OS transparently creates a
copy (page-based) and the receiver modifies the copy. This however
doesn't come free - the virtual memory operations are somewhat costly
and page-based, so even if you want to transfer 1 byte, a whole page
(4K on x86) will be needed and the COW behaviour will also copy the
whole page when maybe only 1 byte is changed. This increases the
latency and can lead to worse total transfer times than when using
simple memcpy().
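The COW behaviour can be observed with a small sketch. Note the
assumption: it uses fork() with a private mapping, which is the
easiest way to get COW pages on a POSIX system, and not the
cross-process mapping an MPI library would need. The function name
parent_view_after_child_write is made up:

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child changes 1 byte of a COW page; the parent's view must stay
   intact. Returns the byte seen by the parent, or -1 on failure. */
int parent_view_after_child_write(void)
{
    long page = sysconf(_SC_PAGESIZE);   /* typically 4096 on x86 */
    if (page <= 0)
        return -1;

    /* Even for 1 byte of payload a whole page is mapped. */
    char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 1;   /* touch the page so it is backed by real memory */

    pid_t pid = fork();   /* after this the page is shared copy-on-write */
    if (pid < 0) {
        munmap(buf, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        buf[0] = 2;   /* 1-byte write: the kernel copies the whole page */
        _exit(buf[0] == 2 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        munmap(buf, (size_t)page);
        return -1;
    }

    int seen = buf[0];   /* the parent still sees its own value */
    munmap(buf, (size_t)page);
    return seen;
}
```

The child's single-byte store triggers a page fault and a full page
copy, which is the overhead described above; for small messages a
plain memcpy() of the payload can easily be cheaper.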
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]