Clos Network Architectures
In 1953, Charles Clos published a paper in the Bell System Technical Journal that forms the basis of today's data center network architectures, shown below. These architectures are also known as fat-tree architectures or Ethernet fabrics.
Clos Architecture
Folding the Clos architecture results in the leaf-spine connectivity that is popular today in data centers. Every leaf is connected to every spine node. The spines connect the leaves to one another, and the leaves connect the servers to the network. It is common practice to put the servers and the leaf switch in a single hardware rack, with the switch at the top of the rack (ToR).
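As a small illustration of the full-mesh leaf-spine connectivity described above, here is a minimal sketch that enumerates the inter-switch links of a folded Clos fabric. The function and switch names are illustrative assumptions, not taken from any particular design.

```python
# Minimal sketch (illustrative only): in a folded Clos / leaf-spine fabric,
# every leaf is connected to every spine.
from itertools import product

def leaf_spine_links(num_leaves: int, num_spines: int) -> list[tuple[str, str]]:
    """Return the full mesh of leaf-to-spine links in a leaf-spine fabric."""
    leaves = [f"leaf{i}" for i in range(1, num_leaves + 1)]
    spines = [f"spine{j}" for j in range(1, num_spines + 1)]
    return [(leaf, spine) for leaf, spine in product(leaves, spines)]

# A 4-leaf, 2-spine fabric has 4 x 2 = 8 inter-switch links.
print(len(leaf_spine_links(4, 2)))  # 8
```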
Although not mandated by the Clos topology, the use of homogeneous equipment is a key benefit of this architecture. The spines serve a single purpose: to connect the different leaves. Unlike the aggregation switches in legacy data center architectures, which structurally occupy the same position, the spines do not provide any other service. In a Clos topology, all functionality is pushed to the edges of the network, the leaves and the servers themselves, rather than being provided by the center, represented by the spines. The spines are then used only to scale the available bandwidth between the edges, and the control-plane load on a spine increases only marginally as more leaves are added. In contrast, legacy architectures scaled services by beefing up the aggregation switches' CPUs.
In a Clos network, the Spanning Tree Protocol (STP) is not used as the switch interconnect control protocol; instead, Equal-Cost Multipath (ECMP) routing forwards a packet along any of the available equal-cost paths. The key characteristic of a Clos network is that the aggregate bandwidth does not decrease from one tier to the next as we approach the core. Whether this is achieved by a larger number of low-bandwidth links or a smaller number of high-bandwidth links does not fundamentally alter the fat-tree premise.
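To illustrate how ECMP keeps a flow on one path while spreading different flows across all equal-cost paths, here is a minimal sketch of hash-based next-hop selection. The function name, the use of SHA-256, and the uplink names are illustrative assumptions; real switches use vendor-specific hardware hash functions over the packet's 5-tuple.

```python
# Illustrative sketch of ECMP next-hop selection (not a real switch implementation):
# packets of the same flow hash to the same uplink, so a flow is not reordered,
# while different flows spread across the equal-cost paths.
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    """Pick one of the equal-cost uplinks by hashing the flow's 5-tuple."""
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(flow).digest()
    index = int.from_bytes(digest[:4], "big") % len(uplinks)
    return uplinks[index]

uplinks = ["spine1", "spine2", "spine3", "spine4"]
print(ecmp_next_hop("10.0.0.1", "10.0.1.9", 40712, 443, "tcp", uplinks))
```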
In a fat-tree topology, the number of links entering a switch at a given tier equals the number of links leaving that switch toward the next tier. This implies that, as we approach the core, switches carry far more links than switches near the leaves of the tree, and the root of the tree has more links than any switch below it.
As we approach the core of the data center network, these interconnecting links are increasingly built from 100 Gb/s links, and it is costly to simply overprovision the network with extra links. Thus, part of the Ethernet fabric solution is to combine it with ECMP so that all of the potential bandwidth connecting one tier to another is available to the fabric.
Dimensioning Clos Architectures
In packet-switched networks, the oversubscription of a switch is defined as the ratio of downlink to uplink bandwidth. Assuming equal-speed uplinks and downlinks, a 1:1 oversubscription ratio means that every downlink has a corresponding uplink. A 1:1 oversubscribed network is also called a non-blocking network (technically, it is really non-contending), because traffic from one downlink to an uplink will not contend with traffic from other downlinks. However, even with an oversubscription ratio of 1:1, the Clos topology is only almost non-blocking. Depending on the traffic pattern, flow hashing can result in packets from different downlinks using the same uplink. If you could rearrange the flows from different downlinks that end up on the same uplink to use other uplinks, you could make the network non-blocking again.
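As a quick check on the definition above, here is a minimal sketch that computes the oversubscription ratio of a single switch from its port counts and speeds. The function name and the example port configurations are illustrative assumptions, not taken from the text.

```python
# Sketch of the oversubscription calculation: total downlink bandwidth
# divided by total uplink bandwidth on one switch (illustrative only).
def oversubscription(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Ratio of total downlink bandwidth to total uplink bandwidth."""
    return (downlink_count * downlink_gbps) / (uplink_count * uplink_gbps)

# A leaf with 48 x 25 Gb/s server ports and 6 x 100 Gb/s uplinks is 2:1 oversubscribed.
print(oversubscription(48, 25, 6, 100))  # 2.0

# 16 x 25 Gb/s downlinks against 4 x 100 Gb/s uplinks gives 1:1 (non-contending).
print(oversubscription(16, 25, 4, 100))  # 1.0
```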
| Requirement | Formula |
|---|---|
| Example | *(figure)* |
| Number of servers supported for 2 tiers and 1:1 oversubscription | $n^2/2$. Explanation: in the example topology, leaves L1-L4 each have two ports facing the servers and one port to each spine, so each leaf connects to two servers. The same four-port switch used as a spine can connect to four leaves. Four leaves times two server ports per leaf gives eight servers. |
| Number of required $n$-port switches for 2 tiers and 1:1 oversubscription | $n + \frac{n}{2}$. Explanation: every leaf switch connects half its ports to the spines and the other half to the servers, while a spine connects all $n$ of its ports to leaves, so each spine can support $n$ leaf switches. A fabric of $n$-port switches therefore has $n$ leaves connecting to $n/2$ spines. |
| ISL (inter-switch link) dimensioning | |
| Scaling up (Facebook) | *(figure)* |
| Scaling up (Azure / AWS) | *(figure)* |
| Number of servers supported for 3 tiers and 1:1 oversubscription | $n^3/4$ (with 128-port switches, we can support $128^3/4 = 524{,}288$ servers) |
| Number of required $n$-port switches for 3 tiers and 1:1 oversubscription | $n + n^2$ (with 64-port switches, we need $64 + 64^2 = 4{,}160$ switches); these formulas are applied in the sketch below |
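The following sketch applies the dimensioning formulas from the table above for a fabric built from $n$-port switches at 1:1 oversubscription. The function names are illustrative assumptions; the formulas and the worked numbers come from the table.

```python
# Dimensioning sketch for 1:1 oversubscription with n-port switches,
# using the formulas from the table above (illustrative helper names).
def two_tier(n):
    servers = n**2 // 2      # n^2 / 2 servers
    switches = n + n // 2    # n leaves + n/2 spines
    return servers, switches

def three_tier(n):
    servers = n**3 // 4      # n^3 / 4 servers
    switches = n + n**2      # switch count per the table's 3-tier formula
    return servers, switches

print(two_tier(4))      # (8, 6)        -- matches the 4-port example topology
print(three_tier(128))  # (524288, 16512)
print(three_tier(64))   # (65536, 4160) -- 64 + 64^2 = 4,160 switches
```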
Real-World Examples
In this section we outline a non-exhaustive list of publicized information on DCN designs. In one of the assignments we ask students to compare Google's and Facebook's DCN designs presented below.
Google’s DCN
Facebook's DCN
Introducing the data center fabric and the next-generation Facebook data center network