Legacy DC Architectures

DC types

Private Single-Tenant: Individual organizations that maintain their own data centers belong in this category. The data center is for the private use of the organization, and there is only the one organization, or tenant, using it.

Private Multitenant: Organizations that provide specialized data center services on behalf of other client organizations belong in this category. IBM and EDS (now HP) are examples of companies that host such data centers. They are built and maintained by the organization providing the service, and there are multiple clients whose data is stored there, hence the term multitenant. These data centers are private because they offer their services contractually to specific clients.

Public Multitenant: Organizations that provide generalized data center services to any individual or organization belong in this category. Examples of companies that provide these services include Google and AWS. These data centers offer their services to the public: anybody who wishes to use them, whether individuals or large organizations, may access them via the web.

Why a new topology was needed

The figure below shows the network design that dominated single tenant data centers around 2000-2005.

Figure: Hierarchical access-aggregation-core network architecture

The endpoints, or compute nodes, are attached to access switches, the lowest layer in the figure above. The access switches connect up to the aggregation switches (also called distribution switches), which in turn connect to the core network (the root of the tree), thus connecting the access network to the rest of the world.

Two aggregation switches were used so that the access network would not be isolated if one of them failed. A pair of switches was considered adequate for both throughput and redundancy at the time, but two ultimately proved insufficient.

Traffic between the access and aggregation switches is forwarded via bridging. North of the aggregation boxes, packets are forwarded using routing. This is indicated in the figure by the presence of both L2 and L3, referring to Layer 2 networking (bridging) and Layer 3 networking (routing).

At the turn of the century, most traffic flows in the above architecture were north-south, reflecting the then-dominant client-server model. With the advent of microservices, where one click in the client can trigger hundreds of microservices distributed across a data center, east-west flows increased dramatically.

The access-agg-core architecture is heavily dependent on bridging. Packet forwarding underwent a revolution when specialized Application-Specific Integrated Circuits (ASICs) were developed to forward packets. This packet-switching silicon allowed significantly more interfaces to be connected to a single box and forwarded packets at much lower latency than was previously possible. This hardware switching technology, however, initially supported only Layer 2 (L2) bridging.

Scalability and cost aspects

This scale-up approach to data center networking has many drawbacks.

  1. First, it requires aggregation to higher-speed, denser switches as its fundamental scaling mechanism. For example, transitioning to 40-Gbps NICs requires 100-Gbps or 400-Gbps Ethernet switches. Unfortunately, the time to transition from one generation of Ethernet technology to the next has been increasing: 1-Gbps Ethernet debuted in 1998 and 10-Gbps Ethernet in 2002, but the widespread transition to 10 Gbps at end hosts came later still, and 40-Gbps Ethernet saw commercial deployment only in 2010. During the transition period, the highest-end switches command a significant price premium, primarily because of limited volumes, substantially increasing the data center network's cost.

Figure: Cost of legacy vs. Clos/fat-tree topologies

  2. Second, scale-up networking such as this limits the overall bisection bandwidth, in turn impacting the maximum degree of parallelism or overall system performance for large-scale distributed computations.

Issues of cost lead data center architects to oversubscribe their networks. For instance, in a typical data center organized around racks and rows, intra-rack communication through a top-of-rack switch might deliver nonblocking performance, communication with a server in a different rack in the same row might be oversubscribed by a factor of 5, and inter-row communication might be oversubscribed by some larger factor still. Oversubscription ratios of up to 240:1 have occurred in commercial deployments.
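As a rough illustration of how these ratios arise, here is a minimal sketch that computes the oversubscription ratio of a hypothetical top-of-rack switch. The port counts and speeds are assumptions chosen to reproduce the 5:1 figure above, not numbers from any particular deployment.

```python
# Sketch: oversubscription ratio of a hypothetical top-of-rack switch.
# All port counts and speeds below are illustrative assumptions.

def oversubscription_ratio(downlink_count: int, downlink_gbps: float,
                           uplink_count: int, uplink_gbps: float) -> float:
    """Worst-case traffic offered by the servers divided by uplink capacity."""
    offered = downlink_count * downlink_gbps    # what the servers could send
    available = uplink_count * uplink_gbps      # what the uplinks can carry
    return offered / available

# 40 servers at 10 Gbps sharing 2 x 40 Gbps uplinks to the aggregation layer.
ratio = oversubscription_ratio(40, 10, 2, 40)
print(f"oversubscription = {ratio:.0f}:1")      # -> oversubscription = 5:1
```

With a 5:1 ratio, only one fifth of the servers' aggregate bandwidth can leave the rack at any one time, and the row- and core-level ratios multiply on top of that, which is how end-to-end figures as large as 240:1 can arise.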

Protocol aspects

On top of the above limitations, there are additional ones originating from common networking protocols.

MAC Address Table Size

In switches and routers, the MAC address table is used to quickly determine the port or interface out of which the device should forward a packet. For speed, this table is implemented in hardware, and as such it has a physical limit to its size.

With network virtualization and the use of Ethernet technology across WANs, the layer two networks are being stretched geographically as never before. With server virtualization, the number of servers possible in a single layer two network has increased dramatically. With numerous virtual NICs on each virtual server, this problem of a skyrocketing number of MAC addresses is exacerbated. Layer two switches are designed to handle the case of a MAC table miss by flooding the frame out all ports except the one on which it arrived, as shown in the figure below.

Figure: MAC address table limit resulting in packet flooding

Flooding has the benefit of ensuring that the frame reaches its destination, provided that destination exists on the layer two network. Under normal circumstances, when the destination receives that initial frame, it responds; upon receipt of the response, the switch learns which port the MAC address is located on and populates its MAC table accordingly. This works well unless the MAC table is full, in which case the switch cannot learn the address, and frames sent to that destination continue to be flooded. This is a very inefficient use of bandwidth and can have a significant negative performance impact.

The Address Resolution Protocol (ARP), the IPv4 protocol used to determine the MAC address for a given IP address, typically uses broadcast for its queries. So in a network of, say, 100 hosts, every host receives at least an additional 100 queries (one for each of the 99 other hosts and one for the default gateway).
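To make the learning and flood-on-miss behavior concrete, here is a minimal sketch of a learning switch with a bounded MAC table. The class name, port numbering, shortened MAC addresses, and two-entry table size are illustrative assumptions, not any vendor's implementation.

```python
# Minimal sketch of a learning layer two switch with a bounded MAC table.
# The table size and frame format are illustrative assumptions.

class LearningSwitch:
    def __init__(self, num_ports: int, table_size: int):
        self.num_ports = num_ports
        self.table_size = table_size          # hardware tables have a hard limit
        self.mac_table: dict[str, int] = {}   # MAC address -> port

    def receive(self, frame_src: str, frame_dst: str, in_port: int) -> list[int]:
        """Return the list of ports the frame is sent out of."""
        # Learn the source address, but only if the hardware table has room.
        if frame_src not in self.mac_table and len(self.mac_table) < self.table_size:
            self.mac_table[frame_src] = in_port

        # Known destination: forward out of exactly one port.
        if frame_dst in self.mac_table:
            return [self.mac_table[frame_dst]]

        # Table miss: flood out of every port except the ingress port.
        return [p for p in range(self.num_ports) if p != in_port]

# Example: a tiny switch whose table can only ever hold two entries.
sw = LearningSwitch(num_ports=4, table_size=2)
print(sw.receive("aa:aa", "bb:bb", in_port=0))   # unknown dst -> flood: [1, 2, 3]
print(sw.receive("bb:bb", "aa:aa", in_port=1))   # aa:aa was learned -> unicast: [0]
print(sw.receive("cc:cc", "aa:aa", in_port=2))   # table full: cc:cc is never learned
print(sw.receive("dd:dd", "cc:cc", in_port=3))   # so frames to cc:cc keep flooding: [0, 1, 2]
```

Once the table is full, cc:cc can never be learned, so every frame addressed to it is flooded indefinitely, which is exactly the bandwidth waste described above.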

VLAN requirements

The introduction of data centers created the need to segregate traffic, both to ensure security and to keep the traffic of the various tenant networks separate. The technology responsible for providing this separation is the VLAN. As long as data centers remained single-tenant networks, the maximum of 4094 VLANs seemed more than sufficient. Expanding this twelve-bit field unnecessarily would not have been wise, since various tables in memory and in the ASICs must be large enough to accommodate the maximum number; making the maximum larger than 4094 ($2^{12}-2$) therefore had a definite downside at the time the design work was done.

As data centers continued to expand, however, especially with multitenancy and server virtualization, the number of VLANs required could easily exceed 4094. Once there are no more VLANs available, sharing the physical resources amongst the tenants quickly becomes complex. And since the number of bits allocated to hold the VLAN ID is fixed, and hardware has been built for years around that specific size, it is nontrivial to expand the field to accommodate more VLANs.

Figure: VLAN number limitation
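As a quick check on the $2^{12}-2$ figure, the snippet below counts the usable IDs in the twelve-bit VLAN field. The only assumption is the standard 802.1Q rule that IDs 0 and 4095 are reserved.

```python
# The 802.1Q VLAN ID is a 12-bit field; IDs 0 and 4095 are reserved,
# leaving 4094 usable VLANs.

VLAN_ID_BITS = 12
RESERVED_IDS = {0, 4095}              # 0 = priority tag only, 4095 = reserved

usable_vlans = 2 ** VLAN_ID_BITS - len(RESERVED_IDS)
print(usable_vlans)                   # -> 4094
```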

Spanning Tree Protocol

With respect to the Spanning Tree Protocol (STP), a loop is created by the triangle formed by the two aggregation switches and an access switch. STP breaks the loop by blocking the link connecting the access switch to the non-root aggregation switch. Unfortunately, this effectively halves the available network bandwidth, because an access switch can use only one of its links to the aggregation layer. If the active aggregation switch dies, or the link between an access switch and the active aggregation switch fails, STP will automatically unblock the link to the other aggregation switch.
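The bandwidth cost of this blocking is easy to quantify. Below is a small sketch assuming a hypothetical access switch with one 40-Gbps uplink to each of the two aggregation switches; the link speeds are assumptions.

```python
# Sketch: effect of STP blocking on an access switch's usable uplink bandwidth.
# Link speeds are illustrative assumptions: one uplink to each aggregation switch.

uplinks_gbps = [40, 40]

total_capacity = sum(uplinks_gbps)    # physically installed: 80 Gbps
usable_with_stp = max(uplinks_gbps)   # STP forwards on one uplink, blocks the other: 40 Gbps

print(f"installed uplink capacity: {total_capacity} Gbps")
print(f"usable with STP blocking:  {usable_with_stp} Gbps "
      f"({usable_with_stp / total_capacity:.0%})")
```

Only when the forwarding uplink or its aggregation switch fails does the blocked link start carrying traffic, so the second link buys redundancy but not extra bandwidth.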