Clos Network Architectures
In 1953, Charles Clos published a paper in the Bell System Technical Journal that forms the basis of today’s data center network architectures, shown below. These architectures are also known as fat-tree architectures or Ethernet fabrics.
Clos Architecture
Folding the Clos architecture results in the leaf-spine connectivity that is popular today in data centers. Every leaf is connected to every spine node. The spines connect the leaves to one another, and the leaves connect the servers to the network. It is common practice to put the servers and the leaf switch in a single hardware rack, with the switch at the top of the rack (ToR).
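As a small illustration of this full-mesh leaf-to-spine wiring, the sketch below enumerates the links of a hypothetical fabric; the switch names and counts are illustrative assumptions, not values taken from the figure.

```python
def leaf_spine_links(num_leaves, num_spines):
    """In a leaf-spine fabric every leaf connects to every spine;
    return the resulting list of (leaf, spine) links."""
    return [(f"leaf{l}", f"spine{s}")
            for l in range(1, num_leaves + 1)
            for s in range(1, num_spines + 1)]

# A small hypothetical fabric: 4 leaves and 2 spines give 4 x 2 = 8 links.
for link in leaf_spine_links(4, 2):
    print(link)
```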
Although not mandated by the Clos topology, the use of homogeneous equipment is a key benefit of this architecture. The spines serve a single purpose: to connect the different leaves. The spines do not provide any other service, in contrast to the aggregation switches in legacy data center architectures, even though they structurally occupy the same position. In a Clos topology, all functionality is pushed to the edges of the network, the leaves and the servers themselves, rather than being provided by the center, represented by the spines. The spines are used only to scale the available bandwidth between the edges, and the control-plane load on a spine increases only marginally as more leaves are added to it. In contrast, the legacy architecture scaled services by beefing up the aggregation boxes’ CPUs.
In a Clos network, the Spanning Tree Protocol (STP) is not used as the switch interconnect control protocol; instead, Equal-Cost Multipath (ECMP) routing forwards a packet along any of the available equal-cost paths. The key characteristic of a Clos network is that the aggregate bandwidth does not decrease from one tier to the next as we approach the core. Whether this is achieved by a larger number of low-bandwidth links or a smaller number of high-bandwidth links does not fundamentally alter the fat-tree premise.
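To make the ECMP idea concrete, here is a minimal sketch of hash-based path selection: the switch hashes a flow’s five-tuple and uses the result to pick one of its equal-cost uplinks, so all packets of a flow stay on one path while different flows spread across the fabric. The field names and the modulo-based selection are illustrative assumptions, not a description of any particular switch implementation.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Pick one of the equal-cost next hops by hashing the flow's five-tuple."""
    five_tuple = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(five_tuple).digest()
    # Keeping a whole flow on one uplink avoids packet reordering.
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

# A leaf with one uplink to each of four spines (hypothetical names).
uplinks = ["spine1", "spine2", "spine3", "spine4"]
print(ecmp_next_hop("10.0.1.5", "10.0.2.9", "tcp", 43512, 443, uplinks))
```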
In a fat-tree topology, the number of links entering a switch in a given tier equals the number of links leaving that switch toward the next tier. This implies that as we approach the core, switches carry far more links than switches closer to the leaves of the tree; the root of the tree has more links than any switch below it.
As we approach the core of the data center network, these interconnecting links are increasingly built from 100Gb/s links, and simply overprovisioning the network with extra links is costly. Thus, part of the Ethernet fabric solution is to combine it with ECMP so that all of the potential bandwidth connecting one tier to another is available to the fabric.
Dimensioning Clos Architectures
In packet-switched networks, the oversubscription of a switch is defined as the ratio of downlink to uplink bandwidth. Assuming uplinks and downlinks of equal speed, a 1:1 oversubscription ratio means that every downlink has a corresponding uplink. A 1:1 oversubscribed network is also called a non-blocking network (technically, it is really non-contending), because traffic from one downlink to an uplink will not contend with traffic from other downlinks. However, even with an oversubscription ratio of 1:1, the Clos topology is only almost non-blocking: depending on the traffic pattern, flow hashing can result in packets from different downlinks using the same uplink. If you could rearrange the flows from different downlinks that end up on the same uplink onto other uplinks, you could make the network non-blocking again; this is why the topology is said to be rearrangeably non-blocking rather than strictly non-blocking.
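As a quick worked example (the port counts are hypothetical, not from the text), a leaf with 48 x 25GbE server-facing ports and 6 x 100GbE uplinks is 2:1 oversubscribed, while 32 x 10GbE downlinks paired with 8 x 40GbE uplinks give exactly 1:1.

```python
from fractions import Fraction

def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Downlink-to-uplink bandwidth ratio of a switch, as a reduced x:y string."""
    ratio = Fraction(down_ports * down_gbps, up_ports * up_gbps)
    return f"{ratio.numerator}:{ratio.denominator}"

print(oversubscription(48, 25, 6, 100))  # hypothetical leaf -> 2:1
print(oversubscription(32, 10, 8, 40))   # 32x10GbE down, 8x40GbE up -> 1:1
```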
| Requirement | Formula | Example / Explanation |
|---|---|---|
| Number of servers supported for 2 tiers and 1:1 oversubscription | $n^2/2$ | In the example topology, leaves L1-L4 have two ports facing the servers and one port to each spine, so each leaf connects to two servers. The same four-port switch used as a spine can connect to four leaves. Four leaves times two server ports per leaf gives eight servers (see the sketch after this table). |
| Number of required $n$-port switches for 2 tiers and 1:1 oversubscription | $n + \frac{n}{2}$ | A spine can connect all $n$ of its ports to leaves, so it supports $n$ leaf switches. Each leaf connects half its ports to the servers and the other half to the spines, so the $n$ leaves need only $n/2$ spines. |
| ISL dimensioning | | The main reason for using higher-speed uplinks is that fewer spine switches are needed for the same oversubscription ratio. For example, with 40 servers connected at 10GbE, 40GbE uplinks require only 10 spines, whereas 10GbE uplinks would require 40. This four-fold reduction in the number of spines matters to most data center operators because it significantly reduces the cost of cabling and the number of switches to manage. A third reason is that a higher-capacity uplink ISL reduces the chance of a few long-lived, high-bandwidth flows overloading a link. Server-facing/uplink speed combinations of 10GbE/40GbE or 25GbE/100GbE are common. |
| Scaling up (Facebook) | | One way to build a three-tier Clos topology is to take the two-tier Clos topology and, instead of attaching servers directly to the leaves, create another tier by attaching another row of switches. Effectively we replace the S1 and S2 switches of the example topology (first row of this table) with the two-tier topology itself, a kind of fractal expansion. |
| Scaling up (Azure / AWS) | | Another way to build a three-tier Clos topology is to take two ports from the spine switches in the two-tier Clos topology and use them to connect to another layer of switches (the top layer in the figure). |
| Number of servers supported for 3 tiers and 1:1 oversubscription | $n^3/4$ | With 128-port switches, we can support $128^3/4 = 524,288$ servers. |
| Number of required $n$-port switches for 3 tiers and 1:1 oversubscription | $n + n^2$ | With 64-port switches, we need $64 + 64^2 = 4,160$ switches. |
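The sketch below plugs numbers into the formulas from the table; the function names are ours, and the last two lines reproduce the 40-server uplink comparison from the ISL dimensioning row as a sanity check.

```python
def servers_2tier(n):
    """Servers supported by a 2-tier Clos of n-port switches at 1:1."""
    return n**2 // 2

def switches_2tier(n):
    """n-port switches needed for a 2-tier Clos at 1:1 (n leaves + n/2 spines)."""
    return n + n // 2

def servers_3tier(n):
    """Servers supported by a 3-tier Clos of n-port switches at 1:1."""
    return n**3 // 4

def switches_3tier(n):
    """n-port switches needed for a 3-tier Clos at 1:1."""
    return n + n**2

def spines_needed(servers_per_leaf, server_gbps, uplink_gbps):
    """Spines (one uplink to each) needed to keep a leaf at 1:1 oversubscription."""
    return -(-(servers_per_leaf * server_gbps) // uplink_gbps)  # ceiling division

print(servers_2tier(4))           # 8 servers in the 4-port example topology
print(servers_3tier(128))         # 524288 servers with 128-port switches
print(switches_3tier(64))         # 4160 switches with 64-port switches
print(spines_needed(40, 10, 40))  # 10 spines when uplinks are 40GbE
print(spines_needed(40, 10, 10))  # 40 spines when uplinks are 10GbE
```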
Real-World Examples
In this section we outline a non-exhaustive list of publicly available information on DCN designs. In one of the assignments, we ask students to compare Google’s and Facebook’s DCN designs presented below.
Google’s DCN
Facebook DCN
Introducing the data center fabric and the next-generation Facebook data center network