On Fri, 14 Dec 2001, Byna Surendra wrote:
> Thank you very much for the information. While asking about the
> default buffer size i was asking about the default shared buffer
> allocated for each MPI process. I know that in SGI MPI 3.0
> implementation, a default 256KB shared buffer is allocated to every
> MPI Process on its local node. This shared memory is visible to all
> the processors in the system. (i mean Origin 2000 system). Based on
> your answers to my previous set of questions i think that it is 64KB in
> LAM (in both TCP communication and shared memory communication).
> Please correct me if I am wrong.
Not quite right -- there is a 64kb buffer (in the operating system) on the
socket for each TCP peer MPI process. For the shared memory, similar to SGI's
implementation, there is a global shmem pool shared by all MPI processes
in a single parallel program on the same node. Its default size is (np -
1) * 512kb.
You should read the INSTALL file that comes with LAM. Here's a link to
the online version, starting with the section that you'll be interested
in:
http://www.lam-mpi.org/6.5/install.php#Usysv_and_sysv_transports
> I just started to trace your code for lamsend.c. In this function,
> which call is exactly initializing the communication? i found some
Not sure what you're asking here. The communicator is initialized long
before lam_send() or lam_isend() is ever invoked.
MPI_COMM_WORLD is initialized during MPI_Init().
If you'd like to go source diving in LAM, it would probably be easiest if
you work with the current CVS copy so that we can provide information for
you at the head of the CVS tree (some parts are considerably different
than what is in the current stable version, 6.5.6). See
http://www.lam-mpi.org/cvs/ for information on how to obtain a CVS copy of
LAM.
Be sure also to read the README.cvs file in the top-level directory of the
CVS version of LAM.
> _mpi_xxxx functions. Could you please tell me where I can find the
> definition of these functions? If possible could you explain, what are
Several of these are actually macros -- we repeatedly use this
functionality but didn't want to pay the price of a function call for it.
Most of them are defined in share/include/mpisys.h.
> the sequence of operations and calls occur in an MPI_send and
> MPI_recv, in TCP communication and shared mem?
Both are long and complicated. :-) Let me give you the general overview of
the design. You might want to read the Request Progression Interface
(RPI) document -- the RPI is the lower layer of LAM. You can think of it
as the "device driver" layer. There's a TCP RPI, two flavors of shared
memory RPIs, and a [still beta] myrinet RPI.
Generally speaking, it's all about MPI requests.
The MPI layer in LAM maintains a queue of outstanding requests (both sends
and receives). For example, an MPI_Send creates an MPI request and places
it on this queue. The progression engine in the RPI is then triggered.
Note that user messages are sent in two parts: an envelope and the actual
message itself. The envelope consists of meta information about the
message: the communicator, the tag, etc.
The RPI is responsible for all bit-moving from one MPI process to another.
Hence, it's responsible for all of the actual message passing. The RPI
engine examines the queue and tries to make progress on any outstanding
request, whether it is a send or a receive.
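The queue-plus-progression idea above can be sketched roughly as follows. This is a toy model, not LAM's code: the struct layout, the names `queue_append` and `progress`, and the per-pass byte `budget` that stands in for the real transport are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum req_kind  { REQ_SEND, REQ_RECV };
enum req_state { REQ_PENDING, REQ_DONE };

/* Hypothetical request record; a real engine would dispatch on `kind`. */
struct request {
    enum req_kind   kind;
    enum req_state  state;
    size_t          bytes_left;   /* bytes still to move */
    struct request *next;
};

struct req_queue { struct request *head, *tail; };

/* Place a new request at the end of the queue of outstanding requests. */
void queue_append(struct req_queue *q, struct request *r)
{
    assert(q != NULL && r != NULL);
    r->state = REQ_PENDING;
    r->next  = NULL;
    if (q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
}

/* Stand-in for the transport: move at most `budget` bytes for one request. */
static void advance_one(struct request *r, size_t budget)
{
    size_t n = r->bytes_left < budget ? r->bytes_left : budget;
    r->bytes_left -= n;
    if (r->bytes_left == 0)
        r->state = REQ_DONE;
}

/* One pass of the engine: try to make progress on every outstanding
 * request. Returns how many requests are still pending afterwards. */
size_t progress(struct req_queue *q, size_t budget)
{
    size_t pending = 0;
    for (struct request *r = q->head; r != NULL; r = r->next) {
        if (r->state == REQ_DONE)
            continue;
        advance_one(r, budget);
        if (r->state != REQ_DONE)
            pending++;
    }
    return pending;
}
```

The caller keeps invoking `progress()` until it returns 0; nothing here blocks, which is the property the engine needs in order to interleave sends and receives.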
If it's a send, the RPI will try to send it. The TCP RPI is somewhat
complicated; it has an elaborate state machine because TCP write()'s down
a socket may or may not be complete -- if I write() 200 bytes, it's
perfectly legal for the OS to come back and say, "ok, I sent 73 of those
200 bytes". The TCP RPI has to remember this and try to send the
remaining 127 bytes the next time around. The same goes for receives --
we may try to read() 200 bytes and only receive 73. The TCP RPI has to
remember this and try to receive the remaining 127 bytes the next time
around.
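The bookkeeping for a short write() can be sketched like this. It is a minimal illustration using only POSIX write(); the `pending_send` struct and `advance_send` function are hypothetical names, not taken from the LAM source, and the real TCP RPI state machine is considerably more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one in-flight send. */
struct pending_send {
    const char *buf;    /* start of the message */
    size_t      len;    /* total bytes to send */
    size_t      done;   /* bytes the OS has accepted so far */
};

/*
 * Try to push more bytes down fd. Returns 1 once the whole message has
 * been written, 0 if bytes remain and the socket would block (call again
 * on the next pass), and -1 on any other error.
 */
int advance_send(int fd, struct pending_send *ps)
{
    assert(ps != NULL && ps->done <= ps->len);
    while (ps->done < ps->len) {
        ssize_t n = write(fd, ps->buf + ps->done, ps->len - ps->done);
        if (n > 0) {
            ps->done += (size_t)n;   /* possibly a short write: remember it */
        } else if (n < 0 && errno == EINTR) {
            continue;                /* interrupted by a signal, retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return 0;                /* resume from ps->done next time */
        } else {
            return -1;
        }
    }
    return 1;
}
```

With a non-blocking socket, a return of 0 leaves `ps->done` pointing at the first unsent byte, so the next call picks up exactly where the OS stopped. The receive side would mirror this with read().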
For sends, there is a max of one "active" request at a time -- this is the
request that is actively being sent. It will stay active until all the
bytes in the message have actually been sent. Depending on the type of
MPI send, the MPI request will be marked complete either when all the bytes
have been sent or when an ACK is received from the receiver.
For receives, LAM simply polls the sockets looking for incoming data. If
it finds a socket with incoming data, it reads the envelope and looks for
a match in the list of pending receives. If it finds a match, the message
is accepted directly into the user's buffer (this is why we always
recommend that you post receives before you perform sends). If a match is
not found, a temporary buffer is allocated to receive the message. When
the user finally posts the matching receive (by calling MPI_Recv or one of
the other receive flavors), the message will be received "immediately"
because the underlying RPI has already received it into local memory --
the message is memcpy'ed into the user's buffer and the MPI request is
marked complete.
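A rough sketch of the matching logic described above, under several simplifying assumptions: the payload is handed in as a pointer rather than read from a socket, wildcards such as MPI_ANY_SOURCE are ignored, and the user's buffer is assumed to be large enough. All struct and function names are invented for illustration and are not LAM's.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical envelope: the meta information that precedes the payload. */
struct envelope { int comm; int src; int tag; size_t len; };

struct posted_recv {              /* a receive the user has already posted */
    int    comm, src, tag;
    void  *user_buf;
    size_t cap;                   /* capacity of user_buf */
    int    complete;
    struct posted_recv *next;
};

struct unexpected {               /* a message that arrived before its receive */
    struct envelope env;
    void  *tmp;
    struct unexpected *next;
};

struct matcher {
    struct posted_recv *posted_head, *posted_tail;
    struct unexpected  *unexp_head,  *unexp_tail;
};

static int matches(const struct posted_recv *r, const struct envelope *e)
{
    return r->comm == e->comm && r->src == e->src && r->tag == e->tag;
}

/* A message arrived. Returns 1 if it went straight into a user buffer,
 * 0 if it was parked in a temporary buffer, -1 on allocation failure. */
int on_incoming(struct matcher *m, struct envelope env, const void *payload)
{
    for (struct posted_recv *r = m->posted_head; r; r = r->next) {
        if (!r->complete && matches(r, &env)) {
            assert(env.len <= r->cap);
            memcpy(r->user_buf, payload, env.len);
            r->complete = 1;
            return 1;
        }
    }
    struct unexpected *u = malloc(sizeof *u);
    if (!u) return -1;
    u->tmp = malloc(env.len ? env.len : 1);
    if (!u->tmp) { free(u); return -1; }
    memcpy(u->tmp, payload, env.len);
    u->env  = env;
    u->next = NULL;
    if (m->unexp_tail) m->unexp_tail->next = u; else m->unexp_head = u;
    m->unexp_tail = u;
    return 0;
}

/* The user posts a receive. Returns 1 if an already-arrived message
 * satisfied it immediately, 0 if it now waits on the posted list. */
int post_recv(struct matcher *m, struct posted_recv *r)
{
    struct unexpected *prev = NULL;
    for (struct unexpected *u = m->unexp_head; u; prev = u, u = u->next) {
        if (matches(r, &u->env)) {
            assert(u->env.len <= r->cap);
            memcpy(r->user_buf, u->tmp, u->env.len);   /* the extra copy */
            if (prev) prev->next = u->next; else m->unexp_head = u->next;
            if (m->unexp_tail == u) m->unexp_tail = prev;
            free(u->tmp);
            free(u);
            r->complete = 1;
            return 1;
        }
    }
    r->complete = 0;
    r->next = NULL;
    if (m->posted_tail) m->posted_tail->next = r; else m->posted_head = r;
    m->posted_tail = r;
    return 0;
}
```

Both lists are searched oldest-first, so two messages with the same communicator, source, and tag are delivered in arrival order. The pre-posted path does one copy; the unexpected path pays for a malloc and a second memcpy.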
That's a general overview of the process. There's a lot more detail
involved, but that's the gist of it. When you're reading the code, keep
in mind that MPI has strict ordering guarantees: if MPI process A sends
two messages to MPI process B, where the messages are sent on the same
communicator with the same tag -- first message A1 followed by message A2
-- then process B must receive A1 before A2. That will help you understand
why some of the code is the way that it is.
> My other question is, how are the processes initiated on each
> processor when mpirun is executed? is there one master process, which
> forks child processes in a number, specified in -np option? when do
> those child processes start running, is it when MPI_Init () is called
> or before that?
Loosely speaking, yes -- that role is played by the LAM daemon (lamd). When
you lamboot, there is one lamd placed on each node. mpirun sends a message
to each lamd indicating what program to fork, the command line arguments,
any relevant environment variables, some run time flags, etc.
The lamd's then fork/exec the user program and send some meta data back to
mpirun. mpirun then listens for calls from the forked user processes.
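For readers unfamiliar with the fork/exec step, here is a generic POSIX sketch of launching a program with a given command line and one extra environment variable. It is not lamd's code; `launch` and `wait_exit_status` are hypothetical helper names, and the meta data sent back to mpirun is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child and exec argv[0] with the given argument vector. If
 * env_name is non-NULL, that variable is set in the child only.
 * Returns the child's PID in the parent, or -1 if fork() failed.
 */
pid_t launch(char *const argv[], const char *env_name, const char *env_value)
{
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: adjust the environment, then replace the process image. */
        if (env_name != NULL && setenv(env_name, env_value, 1) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);               /* only reached if exec failed */
    }
    return pid;
}

/* Wait for the child; returns its exit code, or -1 if it did not exit
 * normally (e.g. it was killed by a signal). */
int wait_exit_status(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The PID returned by `launch()` is the kind of per-process information a daemon could report back to whoever requested the launch.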
Each user process, during MPI_Init, calls back to mpirun to give it its
location (i.e., its machine and PID) -- because of the LAM infrastructure
(for lack of a longer explanation right now), each forked user process
knows the "address" to call back to mpirun.
mpirun assembles all the messages and creates the group that will become
MPI_COMM_WORLD, and broadcasts that out to all the processes. Each of the
processes then takes that group and forms MPI_COMM_WORLD from it.
Connections to each peer rank are then initiated (opening sockets,
creating/joining shared memory, etc.). As such, mpirun is the
synchronization point where all MPI processes can "meet" and become aware
of each other.
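The "meet at mpirun" step can be modeled with a small toy, shown below. It makes assumptions that the message does not state: each check-in carries a slot number handed out at launch time, and ranks are simply those slot numbers. The names `checkin`, `assemble_world`, and `my_rank` are invented for illustration, and no real networking is shown.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODE 64

/* Where a process lives: its machine and PID. */
struct proc_loc { char node[MAX_NODE]; int pid; };

/* Hypothetical check-in message sent by each process during init. */
struct checkin { int slot; struct proc_loc loc; };

/*
 * Coordinator side: place the check-ins (which may arrive in any order)
 * into world[] by slot. Returns 0 if every slot 0..np-1 was filled
 * exactly once, -1 otherwise. Assumes a real PID is never -1.
 */
int assemble_world(const struct checkin *in, int n,
                   struct proc_loc *world, int np)
{
    assert(in != NULL && world != NULL);
    if (n != np)
        return -1;
    for (int i = 0; i < np; i++)
        world[i].pid = -1;                    /* mark slot as empty */
    for (int i = 0; i < n; i++) {
        int s = in[i].slot;
        if (s < 0 || s >= np || world[s].pid != -1)
            return -1;                        /* out of range or duplicate */
        world[s] = in[i].loc;
    }
    return 0;
}

/* Process side: after receiving the broadcast table, find our own rank. */
int my_rank(const struct proc_loc *world, int np, const struct proc_loc *me)
{
    for (int r = 0; r < np; r++)
        if (world[r].pid == me->pid && strcmp(world[r].node, me->node) == 0)
            return r;
    return -1;
}
```

Once every process holds the same table, each one knows the location of every peer rank and can start opening connections to them.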
To summarize: each user process starts up independently -- forked by its
respective lamd. When the process calls MPI_Init(), it communicates with
mpirun and becomes aware of its peers. At this point, these individual
processes can then be considered a parallel process.
As a sidenote -- you can think of MPI_COMM_SPAWN (and its friends) as a
special case of MPI_INIT. The same procedure effectively occurs during
MPI_COMM_SPAWN as MPI_INIT, except one of the already-existing MPI
processes functions as mpirun.
> I guess i have posted a lot of questions. Thank you for your patience
> in advance.
No problems. Feel free to keep asking questions.
Be sure to search the LAM mailing list archives; some of your questions
may have been asked and answered on the list already.
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/