How Data Center Networks Work: Leaf-Spine, BGP, and VXLAN

2026-06-12 · 11 min read

The network architecture inside a data center is nothing like a campus LAN scaled up. The traffic patterns, failure domains, protocol choices, and physical topology have all been redesigned over the past fifteen years to meet demands that traditional three-tier enterprise networking could not handle. Understanding how data center networks work is essential for understanding cloud infrastructure, distributed systems performance, and why protocols like BGP — originally designed for internet-scale routing between autonomous systems — now run inside buildings between individual server racks.

The Problem with Three-Tier Architecture

The traditional enterprise network was built for north-south traffic: clients on the access layer communicating with servers in the core. The three tiers — access, distribution, and core — formed a tree with a fat trunk and narrow branches. Traffic from a workstation to a server in the data center traveled up the tree and back down. This design worked when most computation happened on a handful of centralized servers that clients accessed.

Modern data center workloads are fundamentally different. A web request might trigger dozens of microservice calls, database queries, cache lookups, and storage operations — all between servers within the same data center. This east-west traffic (server-to-server within the data center) now dwarfs north-south traffic (clients entering from outside). Estimates from major hyperscalers put east-west at 70–85% of total data center traffic volume.

Three-tier networks failed at east-west for two reasons. First, traffic between racks had to traverse the distribution (aggregation) layer and often the core, creating bottlenecks at every uplink. Second, the architecture relied on Spanning Tree Protocol (STP) to prevent Layer 2 loops. STP blocked redundant uplinks, leaving capacity idle. Convergence after failures took 30–50 seconds in classic STP: 30 seconds for the listening and learning states at the default 15-second forward delay, plus up to 20 seconds of max-age expiry when the failure is indirect. RSTP is faster but still problematic at scale. And STP's failure modes, such as broadcast storms and topology oscillation, became catastrophic in large flat Layer 2 domains with hundreds of switches.

The Clos / Leaf-Spine Fabric

The solution, now universal among hyperscalers and rapidly adopted in enterprise data centers, is the Clos topology — specifically the two-tier leaf-spine variant. Charles Clos described the general multistage switching fabric in a 1953 Bell System paper. Modern data center networking rediscovered it because it achieves a key mathematical property: non-blocking or near-non-blocking interconnect.

In a leaf-spine fabric:

Leaf switches (also called ToR — Top of Rack) sit at the top of each server rack. Servers connect to their local leaf at 10G, 25G, or 100G. The leaf switch is the Layer 2 boundary; all inter-leaf communication is Layer 3.
Spine switches form the interconnect fabric. Every leaf connects to every spine with equal-cost uplinks, typically at 100G or 400G. Spines have no server-facing ports — they only connect to leaves and, optionally, to border leaves for WAN/internet egress.
Full mesh between leaf and spine — not between spines themselves (no spine-to-spine links). This is the defining property of the Clos topology and is what makes the non-blocking math work.
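The wiring rule above can be sketched in a few lines of Python. This is only an illustration: the function name build_leaf_spine and the switch naming scheme are invented for the example and do not come from any vendor tool.

```python
from itertools import product

def build_leaf_spine(num_leaves: int, num_spines: int) -> list[tuple[str, str]]:
    """Return every leaf-to-spine link in a two-tier fabric (hypothetical helper)."""
    leaves = [f"leaf{i}" for i in range(1, num_leaves + 1)]
    spines = [f"spine{j}" for j in range(1, num_spines + 1)]
    # Full mesh between tiers: each leaf connects to each spine exactly once.
    # No leaf-to-leaf or spine-to-spine links are ever created.
    return list(product(leaves, spines))

links = build_leaf_spine(num_leaves=4, num_spines=2)
print(len(links))  # 4 leaves x 2 spines = 8 links

def spines_of(leaf: str) -> set[str]:
    """Spines directly attached to the given leaf."""
    return {spine for leaf_name, spine in links if leaf_name == leaf}

# Any two leaves share every spine as a common neighbor, so each
# leaf-to-leaf path is exactly two hops (leaf -> spine -> leaf).
print(sorted(spines_of("leaf1") & spines_of("leaf3")))  # ['spine1', 'spine2']
```

The link count grows as leaves times spines, which is why cabling becomes a first-order concern as the fabric grows.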

Non-Blocking Math and Oversubscription

A fabric is non-blocking if any server can communicate with any other server simultaneously at full line rate without contention. The math is straightforward: if each leaf has K downlinks to servers and K uplinks of the same speed, one to each of K spine switches, then the leaf's uplink capacity equals its downlink capacity. Every server sending at full rate then has enough fabric capacity available, provided flows are spread evenly across the uplinks.

In practice, most data center fabrics are oversubscribed — servers can collectively generate more traffic than the uplinks can carry simultaneously. A 4:1 oversubscription ratio means the total downlink bandwidth is 4× the total uplink bandwidth from that leaf. This is acceptable because real server workloads are bursty, not simultaneously at line rate, and the economics of full non-blocking at large scale are prohibitive. A 4:1 or 3:1 oversubscription at the leaf-to-spine boundary is common; 1:1 (non-blocking) is reserved for storage and high-performance computing fabrics.
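The ratio is simple arithmetic, shown here as a minimal Python sketch. The port counts and speeds are hypothetical examples, not figures from any specific deployment.

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Total server-facing bandwidth divided by total spine-facing bandwidth."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Hypothetical leaf: 48 x 25G server ports, 4 x 100G uplinks.
print(oversubscription_ratio(48, 25, 4, 100))   # 1200 / 400 = 3.0, i.e. 3:1

# The same leaf with 12 x 100G uplinks would be non-blocking (1:1).
print(oversubscription_ratio(48, 25, 12, 100))  # 1200 / 1200 = 1.0
```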

Adding more spine switches increases bisection bandwidth: the aggregate bandwidth available between the two halves of the fabric when its servers are split into two equal groups, taken at the worst-case split. If spines become the bottleneck, you add more spines. If leaf port density limits server count, you add more leaves. The architecture scales horizontally in both dimensions, which is why it replaced the tree architecture where the core was a fixed bottleneck.
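A rough sizing sketch makes the two scaling dimensions concrete. It assumes exactly one link between each leaf and each spine, so the spine port count caps the number of leaves; the function name and the example port counts are hypothetical.

```python
def fabric_capacity(spine_ports: int, leaf_downlinks: int,
                    num_spines: int, uplink_gbps: float) -> dict:
    """Upper bounds for a two-tier fabric with one link per leaf-spine pair."""
    # Every leaf consumes one port on every spine, so spine ports cap leaf count.
    max_leaves = spine_ports
    return {
        "max_leaves": max_leaves,
        "max_servers": max_leaves * leaf_downlinks,
        # One uplink per spine, so uplink bandwidth grows with spine count.
        "uplink_gbps_per_leaf": num_spines * uplink_gbps,
    }

# Hypothetical 32-port spines and leaves with 48 server ports.
print(fabric_capacity(32, 48, num_spines=4, uplink_gbps=100))
# Doubling the spines doubles per-leaf uplink bandwidth; server count is unchanged.
print(fabric_capacity(32, 48, num_spines=8, uplink_gbps=100))
```

Going beyond the leaf count that one spine can terminate requires either higher-radix spines or an additional tier, which this sketch does not model.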

ECMP: Equal-Cost Multi-Path Routing

Because every leaf has equal-cost paths to every spine, the routing protocol installs multiple equal-cost next-hops for any destination prefix. ECMP hashes each flow across these paths so that traffic is distributed across all available uplinks. A leaf with 4 spine uplinks sees each flow consistently routed to one spine (ensuring in-order delivery within a TCP flow) while the aggregate traffic spreads across all four spines.

The hashing function is typically a 5-tuple hash (source IP, destination IP, source port, destination port, protocol). This provides good distribution for diverse traffic with many flows but can cause imbalance when a few elephant flows (large, long-lived flows) hash to the same spine. Adaptive load balancing schemes such as CONGA and DRILL address elephant flow imbalance, but 5-tuple ECMP remains the standard baseline.
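The per-flow behavior can be illustrated with a small Python sketch. It uses SHA-256 purely as a convenient deterministic hash; real switch ASICs use cheaper hardware hash functions, and the spine names and addresses below are made up.

```python
import hashlib
from collections import Counter

SPINES = ["spine1", "spine2", "spine3", "spine4"]  # hypothetical next-hops

def pick_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  proto: int, next_hops: list[str]) -> str:
    """Map a flow's 5-tuple to one next-hop, deterministically."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    # SHA-256 is a stand-in here; it is not what switching hardware runs.
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return next_hops[bucket % len(next_hops)]

# The same flow always lands on the same spine, preserving packet order.
flow = ("10.1.1.10", "10.2.2.20", 49152, 443, 6)
assert pick_next_hop(*flow, SPINES) == pick_next_hop(*flow, SPINES)

# Many flows that differ only in source port spread across the spines.
counts = Counter(
    pick_next_hop("10.1.1.10", "10.2.2.20", port, 443, 6, SPINES)
    for port in range(49152, 50152)
)
print(counts)
```

The hash balances the number of flows, not their byte counts, which is why two elephant flows that land in the same bucket can saturate one uplink while the others sit underused.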

Critically, ECMP requires a Layer 3 routed underlay: you cannot ECMP across Layer 2 paths because Spanning Tree would block all but one. The move to a fully routed leaf-spine fabric was inseparable from the abandonment of STP at the fabric level. Leaf switches may still run STP on their server-facing ports as a loop guard, and use LACP or multi-chassis LAG for dual-homed servers, but the spine and inter-leaf paths are IP-routed.

eBGP as the Data Center Routing Protocol

RFC 7938 ("Use of BGP for Routing in Large-Scale Data Centers") formally documented the practice that hyperscalers had already adopted: running eBGP (External BGP) as the routing protocol throughout the data center fabric, even though every router is within the same organizational domain.

The choice of eBGP over link-state IGPs (OSPF, IS-IS) for the data center is counterintuitive but has strong justifications:

ASN-per-rack model. Each leaf switch is assigned a unique private ASN (from the 4-byte private range 4200000000–4294967294). All spines in a tier share a single ASN. eBGP between leaf and spine means the AS path length naturally reflects the fabric hop count, and BGP's built-in loop prevention (rejecting any route whose AS path already contains the local ASN) keeps the routed fabric loop-free without any additional mechanism.
Incremental deployment. Adding a new leaf switch is simply adding a new BGP peer. No area reconfiguration, no LSA flooding storms, no need to drain the entire fabric for maintenance. Each leaf is topologically isolated — a misconfiguration on one leaf cannot corrupt the routing database of the entire fabric.
Policy expressiveness. BGP communities and route maps allow fine-grained traffic engineering, selective prefix announcement, and graceful maintenance withdrawal that IS-IS and OSPF cannot match natively.
Failure domain isolation. In IS-IS or OSPF, a link-state flooding bug or a router announcing malformed LSAs can destabilize the entire routing domain. In the eBGP model, each leaf's routes are contained within a BGP peering session; a broken peer affects only that leaf's prefixes.
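The ASN-per-rack model and AS path loop prevention can be sketched as follows. The allocation scheme (spine tier on the first private ASN, leaves numbered sequentially after it) is one hypothetical convention chosen for this example; real deployments pick their own numbering.

```python
PRIVATE_ASN_START = 4200000000  # first 4-byte private ASN
PRIVATE_ASN_END = 4294967294    # last 4-byte private ASN

def assign_asns(num_leaves: int, spine_asn: int = PRIVATE_ASN_START) -> dict[str, int]:
    """Spines share one ASN; each leaf gets the next unused private ASN."""
    plan = {"spine-tier": spine_asn}
    for i in range(1, num_leaves + 1):
        asn = spine_asn + i
        if asn > PRIVATE_ASN_END:
            raise ValueError("private 4-byte ASN range exhausted")
        plan[f"leaf{i}"] = asn
    return plan

def accepts(local_asn: int, as_path: list[int]) -> bool:
    """eBGP loop prevention: reject a route whose AS path contains our own ASN."""
    return local_asn not in as_path

plan = assign_asns(3)
# A prefix originated by leaf1, as re-advertised by a spine: [spine ASN, leaf1 ASN].
path = [plan["spine-tier"], plan["leaf1"]]

print(accepts(plan["leaf2"], path))       # True: leaf2 installs the route
print(accepts(plan["leaf1"], path))       # False: leaf1 sees its own ASN
print(accepts(plan["spine-tier"], path))  # False: another spine rejects it
```

The last case shows a side effect of sharing one ASN across the spine tier: a route that has already crossed a spine is never accepted by another spine, so traffic cannot bounce leaf to spine to leaf to spine.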

The downside of eBGP for the data center is convergence speed: with default timers, BGP detects and reacts to link failures more slowly than link-state protocols typically do. This is mitigated by using BFD (Bidirectional Forwarding Detection) for sub-second failure detection, tuning BGP timers aggressively (Keepalive 1s, Hold 3s), and using BGP Add-Paths to pre-install backup routes.
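To see why BFD matters even with aggressive BGP timers, compare worst-case detection times. The BFD values below (300 ms transmit interval, multiplier of 3) are assumed for illustration and are not taken from the text; the calculation also ignores the per-direction negotiation that real BFD sessions perform.

```python
def bgp_detection_ms(hold_time_s: float) -> float:
    """A silent peer is declared down when the BGP hold timer expires."""
    return hold_time_s * 1000

def bfd_detection_ms(tx_interval_ms: float, multiplier: int) -> float:
    """Simplified BFD detection time: interval times missed-packet multiplier."""
    return tx_interval_ms * multiplier

print(bgp_detection_ms(3))        # 3000 ms with the Hold 3s setting
print(bfd_detection_ms(300, 3))   # 900 ms with the assumed BFD values
```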

Top-of-Rack Switching and Cabling

The Top-of-Rack (ToR) design places a switch at the top of each server rack. Servers connect to the ToR via short (1–3 m) intra-rack cables at 10G, 25G, or 100G. The ToR connects to spines via longer structured cabling or fiber running through overhead trays or under-floor conduits. This design minimizes cable length, simplifies hot-swap of individual servers, and contains the Layer 2 domain to a single rack.

The alternative, End-of-Row (EoR), places larger switches at the end of a row of racks, with longer cables running from each server to the EoR. EoR reduces switch count (one EoR serves many racks) but requires more structured cabling and larger switch port density. EoR was more common in the three-tier era; ToR dominates in modern leaf-spine designs.

At hyperscaler scale, cabling becomes a massive operational challenge. Facebook/Meta published designs for structured cabling using modular fiber trunk cables with pre-terminated MPO connectors, allowing rapid recabling during fabric changes. The number of physical fiber runs in a large data center — millions of strands — rivals the complexity of the BGP routing table.

Overlay Networks: VXLAN and EVPN

A fully routed Layer 3 underlay is ideal for east-west traffic, but many workloads require Layer 2 adjacency across racks: live VM migration needs the VM's IP and MAC to remain valid when the VM moves between hypervisors in different racks; containers in a Kubernetes cluster may need to appear on the same subnet regardless of which worker node they run on. Solving this on a pure Layer 3 underlay requires an overlay.

VXLAN (Virtual Extensible LAN) encapsulates Ethernet frames in UDP packets (destination port 4789), allowing Layer 2 domains to span the routed Layer 3 underlay. A VXLAN Network Identifier (VNI) of 24 bits supports over 16 million logical networks, far exceeding the 4,094 VLAN limit. Each VTEP (VXLAN Tunnel Endpoint) — typically running on the leaf switch or the hypervisor's virtual switch — encapsulates outbound frames and decapsulates inbound ones. How VXLAN works covers the encapsulation mechanics in detail.
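As a minimal sketch of the encapsulation header itself, the following Python packs and parses the 8-byte VXLAN header: one flags byte with the VNI-valid bit set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names are invented for this example, and the overhead figure assumes an IPv4 underlay with no outer VLAN tag.

```python
import struct

VXLAN_UDP_PORT = 4789
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid

# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8) bytes.
ENCAP_OVERHEAD_BYTES = 14 + 20 + 8 + 8

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits left as zero.
    # Word 2: VNI in the top 24 bits, 8 reserved bits left as zero.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the VNI-valid flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8

hdr = vxlan_header(10010)
print(hdr.hex())             # 0800000000271a00
print(parse_vni(hdr))        # 10010
print(ENCAP_OVERHEAD_BYTES)  # 50
```

The 50 bytes of overhead are the reason underlay links in VXLAN fabrics are usually configured with a larger MTU than the overlay networks they carry.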

EVPN (Ethernet VPN, RFC 7432, with its use over VXLAN specified in RFC 8365) provides a BGP-based control plane for VXLAN. Instead of flooding unknown frames to learn MAC-to-VTEP mappings (flood-and-learn), EVPN distributes MAC/IP bindings as BGP Type-2 routes. A leaf switch learns that MAC AA:BB:CC:DD:EE:01 is reachable via VTEP 10.0.0.5 from a BGP update, not by flooding a frame and waiting for a reply. EVPN eliminates most flooded traffic from the fabric, dramatically improving scale and reducing the risk of broadcast-induced congestion. VXLAN-EVPN is now the de facto standard overlay architecture in modern data centers, and similar overlay techniques underpin cloud VPC networking.

Failure Domains and Maintenance

One of the key design goals of leaf-spine is limiting the failure domain — the scope of impact when a component fails. In a properly designed Clos fabric:

A server NIC failure affects only that server.
A leaf switch failure affects only the servers in that rack, typically 20–48 servers. Dual-homing servers to two leaf switches (using LACP or multi-chassis LAG) means a single leaf failure causes no loss of connectivity, only reduced bandwidth to the affected servers.
A spine switch failure reduces available bandwidth by 1/N (where N is the number of spines) but causes no connectivity loss. Traffic is redistributed via ECMP across the remaining spines within BGP convergence time (typically sub-second with BFD).
A single link failure has no impact beyond the flow-level rebalancing that ECMP performs automatically.

Maintenance procedures exploit this failure domain isolation. Taking a spine switch out of service for firmware upgrades involves withdrawing its BGP advertisements (causing all leaves to remove it as a next-hop), performing the upgrade, and re-advertising routes. Traffic shifts gracefully to the remaining spines. This is hitless maintenance — services continue without manual failover or maintenance windows, as long as there is sufficient spare capacity in the remaining spines.
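The "sufficient spare capacity" condition can be checked with simple arithmetic before draining a spine. The sketch below assumes equal-speed uplinks and even ECMP distribution; the function names and utilization figures are hypothetical.

```python
def remaining_fraction(num_spines: int, out_of_service: int = 1) -> float:
    """Fraction of leaf uplink capacity left when some spines are down."""
    if not 0 <= out_of_service < num_spines:
        raise ValueError("at least one spine must stay in service")
    return (num_spines - out_of_service) / num_spines

def can_drain(num_spines: int, peak_utilization: float,
              out_of_service: int = 1) -> bool:
    """True if the peak uplink load still fits on the spines left in service."""
    return peak_utilization <= remaining_fraction(num_spines, out_of_service)

print(remaining_fraction(4))   # 0.75: losing 1 of 4 spines removes 1/4 of capacity
print(remaining_fraction(8))   # 0.875
print(can_drain(4, 0.60))      # True: 60% peak load fits in 75% capacity
print(can_drain(4, 0.80))      # False: 80% peak load does not
```

Fabrics with more spines lose a smaller fraction of capacity per spine, which makes both unplanned failures and planned drains less disruptive.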

GPU Fabric and Rail-Optimized Networks

The emergence of large-scale AI training clusters has driven new networking requirements that the standard leaf-spine architecture does not optimally serve. Training large models on thousands of GPUs requires all-reduce collective operations, in which all GPUs exchange and combine data at the same time, with extremely high bandwidth and extremely low latency.

The rail-optimized fabric design emerged from this requirement. Within a server, GPUs are connected to each other through NVLink or PCIe switches. Between servers, each GPU has a dedicated network port, and the port for a given GPU position connects to the same dedicated top-of-rack switch (a "rail") in every server. GPUs on the same rail in different servers can therefore communicate at full bandwidth through a single switch hop, without traversing the spine layer. Between rails, spine-level connectivity provides cross-rail communication.
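A small Python sketch shows how the rail assignment determines the path between two GPUs. It assumes the rail is chosen purely by the GPU's position within its server; the naming and the three path categories are simplifications made for this example.

```python
def rail_switch(gpu_index: int) -> str:
    """The rail depends only on the GPU's position inside its server."""
    return f"rail{gpu_index}"

def path(src: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Network elements crossed between two (server, gpu_index) endpoints."""
    (src_server, src_gpu), (dst_server, dst_gpu) = src, dst
    if src_server == dst_server:
        # Same server: traffic stays on the internal NVLink/PCIe interconnect.
        return ["intra-server"]
    if src_gpu == dst_gpu:
        # Same rail: one switch hop, no spine involved.
        return [rail_switch(src_gpu)]
    # Different rails: cross the spine layer between the two rail switches.
    return [rail_switch(src_gpu), "spine", rail_switch(dst_gpu)]

print(path((0, 3), (5, 3)))  # ['rail3']
print(path((0, 3), (5, 4)))  # ['rail3', 'spine', 'rail4']
print(path((2, 0), (2, 7)))  # ['intra-server']
```

Collective communication libraries can exploit this layout by keeping most inter-server traffic between GPUs of the same index, so that the single-hop case dominates.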

NVIDIA's InfiniBand-based DGX SuperPOD and the open Ultra Ethernet Consortium designs target sub-microsecond latency and terabit-per-second bisection bandwidth for AI workloads — requirements that exceed what standard Ethernet leaf-spine can provide at comparable cost. The network design for AI clusters is evolving rapidly, and the distinction between the compute fabric (between GPUs) and the storage/east-west fabric (between servers) is increasingly a first-class design concern in hyperscale data centers.

Explore It Live

Data center networks are the physical substrate under every cloud service. You can examine the ASNs operated by the largest cloud and data center providers to understand their network footprint:

AS16509 — Amazon Web Services (operates leaf-spine fabrics in dozens of regions; eBGP throughout)
AS15169 — Google (pioneered Clos data center design; Jupiter and Orion network architectures)
AS8075 — Microsoft Azure (uses eBGP-in-the-datacenter per RFC 7938)