
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2003-06-20 07:07:40


On Thu, 19 Jun 2003, Andrey Slepuhin wrote:

> [snipped]
> The same socket. The main idea behind my question is that most MPI
> applications (especially mesh-based) do some computations, then
> MPI_Barrier(), then data exchange, so interprocess communications are
> not spread out in time, but are done synchronously and this is a
> bottleneck.

So your main concern is to optimize the latency between MPI processes.
Correct?

> [snipped]
> Really what I want to have is something like this (in lam-bhost.def):

Minor note: not necessarily lam-bhost.def, but whatever boot schema is
used (i.e., even one that the user provides on the command line).

> ...
> node-1 cpu=2 (192.168.0.1 192.168.0.2)
> node-2 cpu=2 (192.168.0.3 192.168.0.4)
> ...

In 7.0, this might not be too hard.

Sidenote: I would strongly advocate working with the 7.0 code tree
since:

a) it's a bit different (read: better organized) than the 6.5 tree,
   especially w.r.t. the RPI code,
b) the RPI is much more modular in the 7.0 tree, and
c) the 6.5.x tree will likely be retired in the not-too-distant future.

See http://www.lam-mpi.org/cvs/ for details on how to get an anonymous
CVS checkout.

Off the top of my head, here's what I see would need to be done:

- make a new attribute in the boot schema (e.g., "addresses") to hold
  all of a node's IP addresses. For example:

        node1 cpu=2 addresses="192.168.0.1 192.168.0.2"
        node2 cpu=2 addresses="192.168.0.3 192.168.0.4"
        ...

  Without going into details, doing it this way makes the data
  available throughout the LAM code base -- arbitrary key=value pairs
  are cached on the boot schema data (this is new in 7.0; does not
  exist in 6.5.x). So there's no code involved in this step -- just
  deciding on the key name.

- in the TCP RPI (and both shmem RPIs), there is a static
  function named connect_all() that takes care of connecting to new
  procs (both during MPI_INIT and MPI_COMM_SPAWN*). This is the
  function that you'll want to modify. It uses a "dance" algorithm to
  make the connections, something along these lines:

      open listening TCP socket
      foreach other_mpi_process
        if already_connected
          continue
        if my_id < other_process_id
          send listening socket IP port to other process (**)
          accept()
        else
          receive IP port number from other process (**)
          connect()
      close listening TCP socket

  The two (**) steps are done with LAM's out-of-band communication
  mechanism using the function calls nsend() and nrecv() (see their
  respective man pages).

  I think you would need to modify this function to something like the
  following:

      if (have_address_key_in_boot_schema)
        open listening TCP socket on all addresses
        make mapping of who should connect on which socket
      else
        open listening TCP socket on default address

      foreach other_mpi_process
        if already_connected
          continue
        if my_id < other_process_id
          send listening socket IP address and port to other process (**)
          accept()
        else
          receive IP address and port number from other process (**)
          connect()
      close listening TCP socket(s)

  i.e., if the "addresses" key is present, open multiple TCP sockets and
  then decide who is going to connect() on which socket (rough sketches
  of the mapping and of the modified dance follow below).
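
To make the mapping step a bit more concrete, here's a rough, untested
sketch. This is not actual LAM code -- the helper names are made up --
and it assumes the "addresses" value has already been pulled out of the
cached boot schema data as a single string:

      /* Hypothetical helpers, not the real LAM API. */
      #include <stdlib.h>
      #include <string.h>

      /* Split a whitespace-separated address string (e.g.
         "192.168.0.1 192.168.0.2") into an array; returns the number
         of addresses found. */
      static int
      parse_addresses(const char *value, char *addrs[], int max)
      {
        char *copy = strdup(value);
        char *tok;
        int n = 0;

        for (tok = strtok(copy, " \t"); tok != NULL && n < max;
             tok = strtok(NULL, " \t"))
          addrs[n++] = strdup(tok);
        free(copy);
        return n;
      }

      /* Deterministic rule that both sides can compute on their own:
         spread peers across the local addresses by peer rank. */
      static int
      pick_address_for_peer(int peer_rank, int num_addrs)
      {
        return peer_rank % num_addrs;
      }

Since both processes can evaluate pick_address_for_peer() themselves,
the out-of-band message only needs to carry the chosen address and port.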

Modifying this section should be sufficient; the rest of the TCP
progress engine doesn't know or care what IP address it's connected to
-- it just uses the sockets that were opened in connect_all().
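
To show the accept()/connect() decision itself, here's a similarly rough
sketch of what the modified dance could look like. The
oob_send_sockaddr() / oob_recv_sockaddr() calls are just stand-ins for
the out-of-band exchange (nsend()/nrecv() in the real code), listen_fds[]
is assumed to already hold one bound-and-listening socket per local
address, and error checking is omitted:

      #include <sys/socket.h>
      #include <netinet/in.h>

      extern int my_rank, nprocs, num_addrs;
      extern int listen_fds[];       /* one listening socket per address */

      /* Stand-ins for LAM's out-of-band exchange (nsend()/nrecv()). */
      extern void oob_send_sockaddr(int peer, const struct sockaddr_in *sa);
      extern void oob_recv_sockaddr(int peer, struct sockaddr_in *sa);

      /* From the sketch above. */
      extern int pick_address_for_peer(int peer_rank, int num_addrs);

      static void
      connect_all_sketch(int peer_fds[])
      {
        int peer;

        for (peer = 0; peer < nprocs; ++peer) {
          if (peer == my_rank || peer_fds[peer] >= 0)
            continue;                /* self, or already connected */

          if (my_rank < peer) {
            /* Lower rank: tell the peer which address/port to use, then
               accept the incoming connection on that listening socket. */
            int idx = pick_address_for_peer(peer, num_addrs);
            struct sockaddr_in sa;
            socklen_t len = sizeof(sa);

            getsockname(listen_fds[idx], (struct sockaddr *) &sa, &len);
            oob_send_sockaddr(peer, &sa);
            peer_fds[peer] = accept(listen_fds[idx], NULL, NULL);
          } else {
            /* Higher rank: connect to whatever address/port it was told. */
            struct sockaddr_in sa;
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            oob_recv_sockaddr(peer, &sa);
            connect(fd, (struct sockaddr *) &sa, sizeof(sa));
            peer_fds[peer] = fd;
          }
        }
      }

One caveat: with several peers pointed at the same listening socket, the
real code would also have to figure out which peer a given accept()ed
connection actually came from -- the sketch glosses over that.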

Does that help?

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/