Hello,
Here is my (long) take on bonding, especially with gigabit Ethernet...
On May 27, 2004, at 6:57 AM, jess michelsen wrote:
> Hence, I'm considering channel bonding (some call it link aggregation).
> As far as I have understood, this can be done in two ways. In both
> cases, the (two) NIC's will share the same IP number. They are either
> connected to two completely separate networks, or they are trunked.
>
> Which method would give best performance?
The clear choice for high performance is two separate networks, if
you can do it. Here's why:
1) Latency
For a given switch size, say 24 ports, if you use two separate
networks, a given node will have 23 neighbors reachable in a
single switch hop. If you use two-way trunking, that node would
have at most 11 neighbors that are only one switch hop away.
Each switch hop adds latency, especially for large packets, since
commodity Ethernet switches do not do wormhole routing. (The
sketch after point 3 works through this arithmetic.)
2) Bandwidth
If a given node is talking to a node not on the same switch(es),
the packets will go through the bottleneck of an uplink to another
switch. In a fat-tree, the uplinks would be wider than the links
to the nodes, so trunking the uplinks of a switch would help, but
you can only aggregate a limited number of links into one trunk,
usually 4. Thus, by going with separate networks, the total
capacity going "up" can be larger relative to a node's individual
link into the network (again, see the sketch after point 3).
I'm horrible at ASCII art, so I won't try to draw it.
3) Cost
Switches that support link aggregation or trunking cost more than
switches that don't, at least when I last looked.
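To put numbers on points 1 and 2, here is a back-of-the-envelope
sketch in Python; the 24-port switch and the 4-link trunking limit
are just the assumptions used above, so adjust for your own gear:

  PORTS = 24       # ports per commodity switch (assumption from above)
  TRUNK_MAX = 4    # typical limit on links aggregated into one trunk

  # Point 1 (latency): neighbors reachable in a single switch hop.
  # With separate networks every other port holds another node; with
  # 2-way trunking each node consumes two ports.
  neighbors_separate = PORTS - 1       # 23
  neighbors_trunked = PORTS // 2 - 1   # 11

  # Point 2 (bandwidth): uplink width relative to one node's link into
  # the network.  The uplink trunk is capped at TRUNK_MAX either way,
  # but a trunked node link is twice as wide, so the uplink looks only
  # half as "fat" from that node's point of view.
  uplink_vs_node_separate = TRUNK_MAX / 1   # 4x a node's link
  uplink_vs_node_trunked = TRUNK_MAX / 2    # 2x a node's link

  print(neighbors_separate, neighbors_trunked)            # 23 11
  print(uplink_vs_node_separate, uplink_vs_node_trunked)  # 4.0 2.0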
The two situations I am aware of that would call for trunking
instead of separate networks are:
1) a fault-tolerance setup, with the bonding mode set to one of the
various fault-tolerant modes (not the default round-robin mode);
the sketch below shows how to check which mode you are in.
2) needing to directly connect the bonded network to the outside
world via a non-bonded link/router/gateway.
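On Linux, the bonding driver reports its current mode in
/proc/net/bonding/<interface>, so you can verify you didn't end up
in round-robin mode by accident. A minimal sketch, assuming the
bond interface is named bond0:

  # Report which mode the Linux bonding driver is running in.
  # Assumes the driver is loaded and the interface is named bond0.
  def bonding_mode(iface="bond0"):
      with open("/proc/net/bonding/%s" % iface) as f:
          for line in f:
              if line.startswith("Bonding Mode:"):
                  return line.split(":", 1)[1].strip()
      return None

  print(bonding_mode())
  # e.g. "fault-tolerance (active-backup)" versus the
  # default "load balancing (round-robin)"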
However, there are issues with bonding gigabit Ethernet that
go beyond the network topology... There have been multiple
discussions of this issue on the Beowulf mailing list over
the past several years that might be worth googling. I'll
try to summarize my understanding of the issues:
Basically, the problem with bonding GigE comes from multiple places:
1) Your motherboard's PCI bus might not be fast enough to support
that much data rate. PCI-X is plenty fast, but plain 64-bit PCI
is not. Don't even think about 32-bit PCI... (The arithmetic
sketch after point 3 puts rough numbers on this.)
2) The interrupt load from a single GigE link can be high enough to
saturate the CPU; hence the introduction of NAPI in more recent
Linux kernels, as well as jumbo frames. Not all network drivers
support NAPI yet, and jumbo frames are not universally supported.
3) The packets seem to arrive out of order as far as the TCP stack
is concerned, and TCP will either:
a) presume, via its aggressive packet-loss-recovery scheme, that
a packet was dropped, and send duplicate ACKs asking for
resends... or
b) spend a lot of CPU time and memory bandwidth re-arranging the
packets back into order.
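To put rough numbers on problem #1: the peak rates below follow
from the bus width and clock, but the 0.6 "usable" factor is only
my rough assumption for real-world PCI efficiency, not a
measurement.

  USABLE = 0.6   # assumed fraction of peak actually deliverable
  buses = {
      "32-bit/33MHz PCI": 32 * 33e6 / 8,   # ~132 MB/s peak
      "64-bit/66MHz PCI": 64 * 66e6 / 8,   # ~528 MB/s peak
      "PCI-X 64/133":     64 * 133e6 / 8,  # ~1064 MB/s peak
  }

  # Two bonded GigE NICs at full duplex can ask the bus for about
  # 2 NICs x 2 directions x 125 MB/s = 500 MB/s.
  demand = 2 * 2 * 125e6

  for name, peak in buses.items():
      verdict = "fits" if peak * USABLE >= demand else "saturates"
      print("%s: ~%.0f MB/s usable -> %s"
            % (name, peak * USABLE / 1e6, verdict))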
Problem #3 comes about because, with packets arriving so quickly,
the driver will pull several packets from a single NIC before
getting packets from the next NIC. Yet bonding sent them out
in round-robin order, so, for example, eth0 would get all
the even packets and eth1 the odd packets. If you are following
closely, you'll notice that the NAPI solution to problem #2
causes problem #3...
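Here is a toy model of that interaction; the batch size of 4 is an
arbitrary stand-in for however many packets a NAPI poll happens to
drain from each NIC:

  # Round-robin transmit across two NICs, while the receiver drains
  # a batch of packets per NIC per poll (as NAPI does).
  from itertools import cycle

  NICS, BATCH, NPKTS = 2, 4, 16

  # Sender: bonding's round-robin mode alternates packets across NICs.
  queues = [[] for _ in range(NICS)]
  for seq, nic in zip(range(NPKTS), cycle(range(NICS))):
      queues[nic].append(seq)

  # Receiver: pull BATCH packets from one NIC before the next.
  arrived = []
  while any(queues):
      for q in queues:
          arrived.extend(q[:BATCH])
          del q[:BATCH]

  print(arrived)
  # [0, 2, 4, 6, 1, 3, 5, 7, 8, 10, 12, 14, 9, 11, 13, 15]
  # eth0's even packets arrive well ahead of eth1's odd packets,
  # which TCP sees as heavy reordering.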
I haven't looked recently to see if anyone has found a good solution
to problem #3. The /proc/sys/net/ipv4/tcp_reordering knob might
help.
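For the record, that knob can be read and raised like so (the
value 10 is an arbitrary example, the default is 3, and writing
requires root):

  # tcp_reordering is how many out-of-order segments TCP tolerates
  # before it assumes a packet was actually lost.
  KNOB = "/proc/sys/net/ipv4/tcp_reordering"

  with open(KNOB) as f:
      print("current:", f.read().strip())   # default is 3

  with open(KNOB, "w") as f:                # needs root
      f.write("10\n")                       # arbitrary example value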
I will have a cluster of Opterons soon (the parts started arriving
this week) that I could test bonded GigE on, and I will try the
tcp_reordering sysctl. If you are still early in the planning stages,
you might want to consider an FNN (Flat Neighborhood Network) for
your cluster to keep the latency and costs low. See some of
our work on FNNs here:
http://aggregate.org/FNN/
http://aggregate.org/KASY0/
Please contact me off the list if you think you might use an FNN.
Good luck with your cluster, and as Brian said, you might want
to build a small test cluster to check whether bonding will help,
or even whether you need more than a single GigE. If you have more
memory per node, each node can hold a larger share of the problem,
and the ratio of compute to communication might still be okay with
just a single GigE link.
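If you do build a test setup, even a crude point-to-point probe
like the sketch below will tell you whether you are anywhere near
wire speed. This is only a sketch (netperf or similar is the better
tool), and the port number and transfer size are arbitrary choices:

  # Minimal TCP throughput probe between two nodes.  Run
  # "python probe.py server" on one node and
  # "python probe.py client <host>" on the other.
  import socket, sys, time

  PORT, CHUNK, TOTAL = 5001, 1 << 16, 1 << 28   # 64 KB chunks, 256 MB

  if sys.argv[1] == "server":
      srv = socket.socket()
      srv.bind(("", PORT))
      srv.listen(1)
      conn, _ = srv.accept()
      got = 0
      while True:
          data = conn.recv(CHUNK)
          if not data:
              break
          got += len(data)
      print("received %d MB" % (got // 2**20))
  else:
      c = socket.socket()
      c.connect((sys.argv[2], PORT))
      buf = b"\0" * CHUNK
      t0, sent = time.time(), 0
      while sent < TOTAL:
          c.sendall(buf)
          sent += CHUNK
      c.close()
      print("%.2f Gb/s" % (sent * 8 / (time.time() - t0) / 1e9))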
--
Tim Mattox - tmattox_at_[hidden] - http://homepage.mac.com/tmattox/
http://aggregate.org/KAOS/ - http://advogato.org/person/tmattox/