LAM/MPI General User's Mailing List Archives

From: Alexandre Carissimi (asc_at_[hidden])
Date: 2006-03-17 07:46:59


Hi, Brian;

Thanks a lot for your very interesting answers. Based on them,
I'm now curious about two more things (again :)):

(1) MPI processes open TCP connections on demand. I mean, when a
     process A needs to send (or receive) something to/from a
     process B, they establish a TCP connection. How long does this
     connection exist? For example, take a simple ping-pong. Is the
     connection established the first time, with all later ping
     messages flowing over that same connection, or does each
     send/receive pair open a new connection?

(2) Why exactly, if I understood correctly, is there a pipe between
     an MPI process and lamd? Is it there to support LAM commands
     like lamgrow, lamclean, lamwipe, etc.? It seems that MPI_Init()
     uses this pipe.
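
To make the two alternatives in question (1) concrete, here is a minimal sketch using plain Python sockets on localhost. It is not LAM/MPI code and says nothing about what LAM actually does; the names run_echo_server, recv_exact and ping_pong are hypothetical, chosen only for this illustration. The server echoes each "ping" back as the reply and counts how many TCP connections it accepted.

```python
import socket
import threading


def run_echo_server(listener, stats, expected_connections):
    """Accept a fixed number of connections and echo bytes back on each."""
    for _ in range(expected_connections):
        conn, _addr = listener.accept()
        stats["accepted"] += 1
        with conn:
            while True:
                data = conn.recv(1024)
                if not data:  # peer closed the connection
                    break
                conn.sendall(data)


def recv_exact(sock, n):
    """Read exactly n bytes, since TCP may deliver a message in pieces."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before reply arrived")
        buf += chunk
    return buf


def ping_pong(rounds, reuse_connection):
    """Run `rounds` ping-pongs; return how many connections the server saw."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    port = listener.getsockname()[1]

    stats = {"accepted": 0}
    expected = 1 if reuse_connection else rounds
    server = threading.Thread(
        target=run_echo_server, args=(listener, stats, expected)
    )
    server.start()

    if reuse_connection:
        # Alternative A: connect once, then send every ping over that socket.
        with socket.create_connection(("127.0.0.1", port)) as sock:
            for _ in range(rounds):
                sock.sendall(b"ping")
                assert recv_exact(sock, 4) == b"ping"
    else:
        # Alternative B: open and close a fresh connection for every ping.
        for _ in range(rounds):
            with socket.create_connection(("127.0.0.1", port)) as sock:
                sock.sendall(b"ping")
                assert recv_exact(sock, 4) == b"ping"

    server.join()
    listener.close()
    return stats["accepted"]


if __name__ == "__main__":
    print("reused connection:", ping_pong(5, True))
    print("connection per message:", ping_pong(5, False))
```

With five rounds, the first call reports a single accepted connection and the second reports five, which is the difference I am asking about.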

I am interested in these details because we are planning to
stop MPI applications and relaunch them on other nodes (same CPU
architecture and OS, of course), and we have been having some problems.
It seems that the pipe information is stored inside the checkpoint
file, and this is breaking our migration process. On the same nodes,
I can stop (checkpoint) and restart MPI processes without problems.

Thanks again,

ASC

Brian Barrett wrote:
> On Mar 16, 2006, at 5:55 PM, Alexandre Carissimi wrote:
>
>
>>I was looking for the paper:
>>
>>Brian Barrett, Jeff Squyres, Andrew Lumsdaine. LAM/MPI Design
>>Document. Open Systems Laboratory. Pervasive Technology Labs.
>>Indiana University.
>>
>>It is mentioned in some LAM publications, but I couldn't find it.
>
>
> You're right, we don't have such a document on the web page. I
> believe that we decided that the documentation was not up-to-date, so
> it was pulled. I will try to see if I can find the document, but
> thus far I've failed.
>
>
>>I would like to ask two questions about the LAM RTE (Run Time
>>Environment):
>>
>>(1) At the lamboot command, a set of n lamd daemons is started
>> on the nodes described in the hostfile, defining a multiprocessor
>> virtual machine (is that right?). My question is: do the lamds
>> establish a fully connected mesh among themselves? Is this
>> done using TCP connections?
>
>
> At startup, contact information is shared between all lamd
> processes. Communication between lamd processes is over UDP, which
> means that we don't have to do fully connected meshes. When new lamd
> processes are started (through lamgrow), the new processes share
> their contact info with all other existing processes.
>
>
>>(2) Does an MPI process communicate with another MPI process using
>> lamd as an intermediary? I mean, does an MPI process open a TCP
>> connection to another MPI process on a remote (or even local)
>> node, or not? Does each MPI process communicate with lamd using a
>> Unix pipe, while the lamds communicate among themselves using TCP?
>> Is this correct?
>
>
> It is possible (but not the default) for MPI applications to use the
> lamd communication channel for MPI communication. The default,
> however, is to use a direct connection between MPI processes. LAM/MPI
> currently supports transfer over mixed shared memory and TCP, pure
> TCP, Myrinet/GM, and InfiniBand.
>
>
>>In fact, I have a third question: when I use the checkpoint/restart
>>support, mpirun loads two additional modules: CRLAM and CRMPI. Do
>>these modules coordinate their behavior among the nodes using UDP
>>or TCP? Do they open additional TCP connection pairs dedicated to
>>this function, or do they communicate using lamd?
>
>
> The CR modules coordinate behavior over the out-of-band communication
> channel provided by the lamds, so data is eventually transferred over
> UDP.
>
>
>>If someone could help me answer these questions or give me
>>pointers, I would appreciate it. For the moment, I'm a little too
>>rushed to dig deep into the MPI sources to look for these details.
>>Any hints will be helpful.
>
>
> Let us know if you have any other questions.
>
> Brian
>
>

-- 
___________________________________________________________________
CARISSIMI, Alexandre      Universidade Federal do Rio Grande do Sul
asc_at_[hidden]          Instituto de Informática
Tel: +55.51.33.16.61.69   Caixa Postal 15064
Fax: +55.51.33.16.73.08   CEP:91501-970 Porto Alegre - RS - Brasil
___________________________________________________________________