On Thu, Jun 19, 2003 at 01:39:01PM -0400, Jeff Squyres wrote:
> On Thu, 19 Jun 2003, Andrey Slepuhin wrote:
>
> > I read their articles, but it seems that they solve another problem:
> > given multiple interfaces and multiple switches, how to route packets
> > depending on the destination address. But I want to do the following:
> > having only one switch and two network interfaces on each node, I want
> > to attach each of the two MPI processes running on a node to a separate
> > network interface to avoid collisions, while keeping shmem communication
> > between processes on the same node.
>
> Gotcha.
>
> Have you considered channel bonding? I don't know what the current
> state-of-the-art is with regards to channel bonding. I've never tried it
> myself -- I've heard both success and failure stories about it. There
> are two factors here -- latency and bandwidth.
I have heard that channel bonding doesn't work well for a single TCP
connection due to packet reordering. So it would probably be better to
keep the TCP sessions separate rather than bonding them into a single
double-width interface.
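Separating the sessions is already possible at the sockets level: binding
a socket to one NIC's local address before connect() pins that session's
traffic to that interface. A minimal sketch (the function name is made up
for illustration):

    /*
     * Sketch: pin a TCP session to one NIC by binding the socket
     * to that NIC's local address before connect().
     */
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int connect_via(const char *local_ip, const char *peer_ip, int port)
    {
        struct sockaddr_in local, peer;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_port = 0;                  /* any ephemeral port */
        inet_pton(AF_INET, local_ip, &local.sin_addr);

        /* bind() before connect() forces traffic out of this NIC */
        if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0) {
            close(fd);
            return -1;
        }

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(port);
        inet_pton(AF_INET, peer_ip, &peer.sin_addr);

        if (connect(fd, (struct sockaddr *) &peer, sizeof(peer)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

So the first process on node-1 could call, say,
connect_via("192.168.0.1", peer_ip, port) and the second process
connect_via("192.168.0.2", peer_ip, port).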
>
> The reason that I ask is because regardless of the route you take (pardon
> the pun), LAM is still single threaded. Hence, the OS will be the one
> that makes progress on the underlying write()'s and read()'s. So whether
> they're going across two different TCP sockets or you have them channel
> bonded, the OS is responsible for making progress across those two
> sockets. My only point here is that if you can do a quick-n-dirty channel
> bonding setup and get that working, it may be easier than modifying LAM.
See notes above.
>
> Modifying LAM is certainly possible; there are multiple factors involved:
>
> - if every MPI process potentially has two IP addresses, you'll need to
> decide on which one goes to which process, or have every process listening
> on both. This may seem trivial, but it's a fair amount of logistical work
> (before you even get to the interesting stuff). For example, if you have
> 2 sends pending to the *same* process, do they use different sockets
> (which would be hard), or do they use the same socket, and you have some
The same socket. The main idea behind my question is that most MPI
applications (especially mesh-based ones) do some computation, then
MPI_Barrier(), then data exchange, so interprocess communications are
not spread out in time but happen synchronously, and this is a
bottleneck.
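As an illustration of that pattern (the neighbor ranks and the
computation itself are placeholders, not from any real code):

    #include <mpi.h>

    void timestep(double *halo_out, double *halo_in, int n,
                  int left, int right, MPI_Comm comm)
    {
        /* ... local computation phase ... */

        MPI_Barrier(comm);   /* all processes arrive here together */

        /* so all pairs hit the network at the same moment -- this
         * burst is the bottleneck the second NIC is meant to widen */
        MPI_Sendrecv(halo_out, n, MPI_DOUBLE, right, 0,
                     halo_in,  n, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }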
> kind of ordering such that you spread the use of the two NICs on a
> per-process basis, not a per-message basis (which means you mainly have to
> come up with a distribution scheme that will actually provide some benefit
> to your applications).
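A per-process scheme could be as simple as choosing the NIC once by the
process's local rank; a sketch, where local_rank and nic_ips are
hypothetical names rather than existing LAM code:

    /*
     * Hypothetical per-process NIC selection: each of the two MPI
     * processes on a node picks one NIC once at startup and binds
     * all of its sockets to it.
     */
    const char *pick_local_ip(int local_rank, const char *nic_ips[2])
    {
        /* even local ranks take NIC 0, odd take NIC 1: per-process,
         * not per-message, so no reordering within a socket */
        return nic_ips[local_rank % 2];
    }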
>
> - LAM currently gets the IP address of each MPI process from the lamd
> routing table. You'd have to modify this scheme (although it probably
> wouldn't be too hard) to have each MPI process send its IP address around
> to each of its non-local peers.
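Conceptually that exchange is just an allgather of the chosen addresses.
Inside LAM's RPI it would happen over the lamd at init time rather than
via MPI calls, but the shape of it would be something like:

    #include <mpi.h>
    #include <string.h>

    #define ADDR_LEN 16            /* "xxx.xxx.xxx.xxx" + NUL */

    /* all_ips must hold nprocs * ADDR_LEN bytes */
    void exchange_addrs(const char *my_ip, char *all_ips, MPI_Comm comm)
    {
        char mine[ADDR_LEN];
        memset(mine, 0, sizeof(mine));
        strncpy(mine, my_ip, ADDR_LEN - 1);
        MPI_Allgather(mine, ADDR_LEN, MPI_CHAR,
                      all_ips, ADDR_LEN, MPI_CHAR, comm);
    }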
>
> So actually thinking about this a little more, and thinking about how the
> TCP components of the RPI set up and work, if you simply load balance per
> MPI process across the NICs, perhaps this wouldn't be *too* difficult to
> do... (but then again, keep in mind my bias as LAM's main RPI expert ;-).
> I think if you modify the initialization-time stuff, the rest of the
> progress engine will work exactly the same. But if you want to multiplex
> across the sockets, it'll be a bit harder.
>
> But it's still worth a try with channel bonding to see if that gives you
> what you want, too.
What I really want to have is something like this (in lam-bhost.def):
...
node-1 cpu=2 (192.168.0.1 192.168.0.2)
node-2 cpu=2 (192.168.0.3 192.168.0.4)
...
I don't think it would be hard.
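Just to illustrate, a hypothetical parser for such a line (this syntax
does not exist in LAM today; the sketch only shows the extension is
mechanically simple, and assumes the caller provides node[64] and
ip1[16]/ip2[16] buffers):

    #include <stdio.h>

    /* expects e.g.: node-1 cpu=2 (192.168.0.1 192.168.0.2) */
    int parse_bhost_line(const char *line, char *node, int *ncpus,
                         char *ip1, char *ip2)
    {
        return sscanf(line, "%63s cpu=%d (%15s %15[^)])",
                      node, ncpus, ip1, ip2) == 4;
    }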
Regards,
Andrey.
--
A right thing should be simple (tm)