LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-03-16 18:49:08


On Mar 16, 2006, at 5:55 PM, Alexandre Carissimi wrote:

> I was looking for the paper:
>
> Brian Barrett, Jeff Squyres, Andrew Lumsdaine. LAM/MPI Design
> Document. Open Systems Laboratory. Pervasive Technology Labs.
> Indiana University.
>
> It is mentioned in some LAM publications, but I couldn't find it.

You're right, we don't have such a document on the web page. I
believe that we decided that the documentation was not up-to-date, so
it was pulled. I will try to see if I can find the document, but
thus far I've failed.

> I would like to ask two questions about the LAM RTE (Run-Time
> Environment):
>
> (1) When lamboot runs, a set of n lamd daemons is started
> on the nodes listed in the hostfile, defining a multiprocessor
> virtual machine (is that right?). My question is: do the lamds
> establish a fully connected mesh among themselves? Is this
> done using TCP connections?

At startup, contact information is shared between all lamd
processes. Communication between lamd processes is over UDP, which
is connectionless, so we don't have to set up a fully connected mesh.
When new lamd processes are started (through lamgrow), the new
processes share their contact info with all other existing processes.
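
To make the "UDP means no mesh" point concrete, here is a minimal
sketch in Python. It is not LAM code: the function names
(make_daemon, exchange_contacts, broadcast, demo) are made up for
illustration, and the "daemons" are just UDP sockets on localhost. It
only shows that once each process knows the others' addresses, a
single unconnected socket per process is enough to reach every peer.

```python
import socket


def make_daemon():
    """Open one UDP socket on an OS-chosen localhost port (stand-in for a lamd)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))
    s.settimeout(2.0)  # avoid blocking forever if a datagram is lost
    return s


def exchange_contacts(daemons):
    """Collect every daemon's (host, port), mimicking the startup contact exchange."""
    return [d.getsockname() for d in daemons]


def broadcast(sender, contacts, payload):
    """Send one datagram to each peer from the same socket; no connect() involved."""
    me = sender.getsockname()
    for addr in contacts:
        if addr != me:
            sender.sendto(payload, addr)


def demo(n=4):
    """Start n stand-in daemons, let daemon 0 message the rest, return what arrived."""
    daemons = [make_daemon() for _ in range(n)]
    try:
        contacts = exchange_contacts(daemons)
        broadcast(daemons[0], contacts, b"hello from 0")
        received = [d.recvfrom(1024)[0] for d in daemons[1:]]
        return len(daemons), received
    finally:
        for d in daemons:
            d.close()


if __name__ == "__main__":
    count, messages = demo()
    print(count, "sockets total;", len(messages), "messages delivered")
```

With n processes this uses n sockets in total, where a fully connected
TCP mesh would need n*(n-1)/2 connections.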

> (2) Does an MPI process communicate with another MPI process using
> lamd as an intermediary? I mean, does an MPI process open a TCP
> connection to another MPI process on a remote (or even local)
> node, or not? Or does each MPI process communicate with lamd over
> a Unix pipe, with the lamds communicating among themselves over
> TCP? Is this correct?

It is possible (but not the default) for MPI applications to use the
lamd communication channel for MPI communication. The default,
however, is to use a direct connection between MPI processes. LAM/MPI
currently supports transfer over mixed shared memory and TCP, pure
TCP, Myrinet/GM, and InfiniBand.

> In fact, I have a third question: when I use the checkpoint/restart
> support, mpirun loads two additional modules: CRLAM and
> CRMPI. Do these modules coordinate their behavior across the nodes
> using UDP or TCP? Do they open additional TCP connection pairs
> dedicated to this function, or do they communicate through lamd?

The CR modules coordinate behavior over the out-of-band communication
channel provided by the lamds, so data is ultimately transferred over
UDP.

> If someone could help me answer these questions or give me
> pointers, I'd appreciate it. For the moment, I'm in a bit too much
> of a rush to dig deep into the MPI sources to look for these
> details. Any hints will be helpful.

Let us know if you have any other questions.

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/