Load Balancing Algorithms and ECMP


NOTE: The initial material is from this classic paper. The later flow-optimization material is from the instructor and is itself based on another classic paper.

Datacenter topologies must contain multiple paths between hosts to deliver the necessary aggregate bandwidth and to tolerate failures. Unfortunately, traditional L3 Internet routing protocols are optimized for delivering baseline connectivity rather than for load balancing among a set of available paths.

ECMP forwarding balances flows among multiple network paths. An ECMP-capable switch matches a packet’s destination address against the forwarding table. To choose among the possible next hops, the switch computes a hash of a predefined set of packet-header fields modulo the number of equal-cost next hops. By hashing on fields such as source and destination address, source and destination port number, and time to live, ECMP ensures that packets belonging to the same flow follow the same path through the network, mitigating the packet-reordering concerns endemic to multipath forwarding protocols.
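The hash-then-modulo selection described above can be sketched as follows. This is a minimal illustration, not a real switch data plane: the field set, the use of SHA-256, and the next-hop names are all assumptions (hardware switches use vendor-specific hash functions).

```python
# Sketch of ECMP next-hop selection: hash the flow's header fields,
# then take the result modulo the number of equal-cost next hops.
# The field choice and hash function are illustrative assumptions;
# real switches use vendor-specific hardware hashes.
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

hops = ["sw1", "sw2", "sw3", "sw4"]
# Packets of the same flow always hash to the same next hop,
# so no intra-flow reordering is introduced.
a = ecmp_next_hop("10.0.0.1", "10.0.1.1", 5000, 80, 6, hops)
b = ecmp_next_hop("10.0.0.1", "10.0.1.1", 5000, 80, 6, hops)
assert a == b
```

Because the mapping is static, two large flows that happen to hash to the same next hop stay collided for their entire lifetime, which is exactly the limitation discussed next.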

Although a step in the right direction, ECMP is oblivious to dynamically changing communication patterns. For any static hash algorithm, multiple flows can collide as shown below:

*Figure: Equal-cost multipath (ECMP) forwarding collisions. As in this example, multiple flows can collide on a congested link, resulting in reduced throughput.*

*Figure: Dynamic flow placement with a centralized scheduler.*

To address this limitation, dynamic flow scheduling in data centers with significant multipathing is needed. Two observations motivated the design:

  1. First, when flows are uniform in size and their individual rates are small, ECMP works well in distributing flows given a reasonable hash function. Problems arise with ECMP’s static hashing when flow length varies. In this case, a relatively small number of large flows can account for most bandwidth. So, good load balancing requires efficiently identifying and placing these large flows among the set of available paths between source and destination.

  2. Secondly, it’s insufficient to perform load balancing on the basis of measured communication patterns, because poor scheduling can mask the actual demand of a congestion-responsive flow (such as a TCP flow). Good scheduling therefore requires a mechanism to estimate a flow’s intrinsic bandwidth demand independent of any bottlenecks the network topology or past scheduling decisions might introduce.
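The first observation suggests a simple starting point for a dynamic scheduler: identify the few large ("elephant") flows and place them explicitly, while leaving the many small flows to ECMP. A toy sketch of that identification step is below; the 10% threshold, NIC capacity, and flow records are illustrative assumptions, not values from the original design.

```python
# Toy sketch of large-flow identification: flows whose measured rate
# exceeds a threshold fraction of the host link capacity become
# candidates for explicit placement; the rest stay on ECMP.
# The 10% threshold and the flow records are illustrative assumptions.
NIC_CAPACITY_BPS = 1e9                  # 1 Gb/s host link (assumed)
THRESHOLD = 0.1 * NIC_CAPACITY_BPS      # "large" = above 10% of the link

def large_flows(flow_rates):
    """flow_rates: dict mapping flow id -> measured rate in bits/s."""
    return {f: r for f, r in flow_rates.items() if r >= THRESHOLD}

flows = {"f1": 5e8, "f2": 2e6, "f3": 3e8}
print(sorted(large_flows(flows)))  # -> ['f1', 'f3']
```

Note that, per the second observation, the measured rate alone understates the demand of a TCP flow that is already bottlenecked, so a real scheduler must additionally estimate each flow's intrinsic demand.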

Principle behind load balancing

In most instances in which an SDN application needs to optimize flow rates, a Network Utility Maximization (NUM) approach is adopted. In what follows you can replace the word ‘source’ with ‘flow’ without loss of generality. The objective function, the network utility $U$, is a function of the utilities of the network nodes and is in general non-decomposable. For our purposes, though, we will assume that the network utility is the sum of the utilities $U_k$ of all users in the network. Each user is in general affiliated with one or more nodes and both produces and consumes node commodities (resources). Commodities may include rate units, electrical power consumption units (kWh), and other quantities of interest. Each user's utility is a function of one or more of these commodities. For example, for best-effort applications it is common in the literature to model the user utility as a continuous, increasing, differentiable function of the average consumed data rate $x$, e.g.

$$U_k=\log(x)$$

The constraints of the problem are in most instances described by a rate region or network stability region that is commonly assumed to be convex. For example, it is common to represent problems such as the one above as a queuing-network problem whose queue lengths (backlogs) must remain stable; an algorithm that stabilizes such queues solves the problem.

Network with N sources and L links

We consider a network with links

$$\mathcal{L} = \{1, \dots, L\}$$

and sources

$$\mathcal{S} = \{1, \dots, N\}$$

Each source $s$ is characterized by a utility function $U_s(x_s)$ that is a strictly concave, increasing, and continuously differentiable function of the transmission rate $x_s$. The path $L(s) \subseteq \mathcal{L}$ is the set of links that the source uses. For each link $l$ let

$$S(l) = \{s \in \mathcal{S} \mid l \in L(s)\}$$

be the set of sources that use this link. The primal problem can be stated as follows:

$$\max_{x_s \ge 0} \sum_s U_s(x_s)$$ $$\text{s.t.} \quad \sum_{s \in S(l)} x_s \le c_l, \quad l = 1, \dots, L$$

where the constraint simply says that the aggregate rate on any link does not exceed that link's capacity. This primal optimization problem has a unique solution since the objective function is strictly concave and the feasible region is convex. The problem can be solved iteratively, and the iteration converges as shown below.

*Figure: Convergence of the solver with capacities $C=\{1, \dots, 10\}$ and two sources sharing the same link. The example was kept simple so that the results are easy to replicate.*

*Figure: Allocation of the flow rates for various path-capacity constraints.*
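One standard iterative scheme for this problem is a dual (price-based) gradient method: each link maintains a price $\lambda_l$ that rises when the link is over capacity and falls otherwise, and each source best-responds by maximizing $U_s(x_s) - x_s \sum_{l \in L(s)} \lambda_l$. For $U_s = \log(x_s)$ the best response is $x_s = 1/\sum_{l \in L(s)} \lambda_l$. A minimal sketch for two sources sharing one link follows; the step size, capacity, and iteration count are assumptions chosen for illustration.

```python
# Dual (price-based) gradient sketch of the primal problem above:
# two sources with U_s(x_s) = log(x_s) sharing one link of capacity c.
# For log utility, a source's best response to the path price lam is
# x_s = 1 / lam. The link raises its price when over capacity and
# lowers it otherwise. Step size and capacity are illustrative.
c = 1.0          # link capacity (assumed)
step = 0.1       # price-update step size (assumed)
lam = 1.0        # initial link price

for _ in range(2000):
    x1 = 1.0 / lam                                   # source 1 best response
    x2 = 1.0 / lam                                   # source 2 best response
    lam = max(1e-6, lam + step * (x1 + x2 - c))      # projected price update

# With equal log utilities, both sources converge to a fair share near c/2.
print(round(x1, 3), round(x2, 3))
```

This mirrors the convergence behavior plotted above: the price settles at the value that makes the aggregate rate meet the capacity constraint with equality, and the rates split the link evenly because the utilities are identical.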

SDN controllers employ various optimization techniques and tools. You need to be familiar with at least one of them, and the author suggests Google's OR-Tools.

Experiments

http://csie.nqu.edu.tw/smallko/sdn/dijkstra_ecmp.htm