How VXLAN Works: Virtual Extensible LAN Overlay Networking
Virtual Extensible LAN (VXLAN) is a network overlay technology that encapsulates Layer 2 Ethernet frames inside Layer 3 UDP packets, enabling the creation of virtualized Layer 2 networks that span across Layer 3 boundaries. Defined in RFC 7348, VXLAN addresses the scalability limitations of traditional VLANs (limited to 4,094 IDs) by providing a 24-bit segment identifier (VNI) that supports up to 16 million logical networks. VXLAN has become the dominant overlay technology in modern data center fabrics, cloud infrastructure, and container networking environments.
VXLAN matters to network engineers working with large-scale data centers because it decouples the logical network topology from the physical underlay. Virtual machines and containers can migrate freely between physical hosts while retaining their MAC addresses, IP addresses, and network membership — the VXLAN overlay provides a consistent Layer 2 domain regardless of the physical location. When combined with EVPN (Ethernet VPN) as the control plane, VXLAN becomes a sophisticated, scalable fabric technology that integrates with BGP for MAC/IP advertisement and route distribution.
Why VXLAN Exists: The VLAN Scalability Problem
Traditional VLANs use a 12-bit VLAN ID field in the IEEE 802.1Q tag, limiting the maximum number of VLANs to 4,094 (IDs 0 and 4095 are reserved). In a multi-tenant data center hosting thousands of customers, each requiring network isolation, 4,094 VLANs is insufficient. Even within a single large enterprise, application teams, development environments, and microservice architectures can easily exhaust the VLAN space.
Beyond the ID limitation, VLANs have other scaling problems:
- Spanning Tree: VLANs rely on STP (Spanning Tree Protocol) for loop prevention, which blocks redundant links, wastes bandwidth, and creates complex failure domains.
- Layer 2 flooding: Broadcast, Unknown unicast, and Multicast (BUM) traffic is flooded to every port in the VLAN, including across trunk links between switches, creating unnecessary load.
- Physical topology constraints: VLANs must be trunked across every physical switch in the path. Adding a VLAN to a new rack requires configuration changes on every switch in the chain.
- No routing across L3 boundaries: A VLAN is a Layer 2 domain that cannot natively span across routed (Layer 3) boundaries between data center pods or buildings.
VXLAN solves all of these problems by using the Layer 3 underlay (IP/UDP) as a transport for Layer 2 frames. The underlay provides routing, ECMP load balancing, and loop-free forwarding. The overlay provides the logical Layer 2 connectivity with a much larger identifier space.
VXLAN Encapsulation
VXLAN uses a MAC-in-UDP encapsulation scheme. An original Ethernet frame from a virtual machine or container is wrapped in a VXLAN header, then placed inside a UDP packet, which is in turn encapsulated in an outer IP packet and outer Ethernet frame for transport across the underlay network.
The encapsulation adds 50 bytes of overhead to every frame: 14 bytes outer Ethernet, 20 bytes outer IP, 8 bytes UDP, and 8 bytes VXLAN header. This means the underlay MTU must be at least 50 bytes larger than the inner frame size. For standard 1500-byte inner frames, the underlay needs an MTU of at least 1550 bytes. Most data center fabrics configure a jumbo frame MTU of 9214 bytes on the underlay to accommodate VXLAN encapsulation without fragmenting inner frames.
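The layout above can be sketched by packing each header in order. This is a minimal illustration, not a working VTEP: MAC and IP addresses are zeroed and the UDP source port is fixed, where a real implementation fills in underlay addresses and a hashed source port.

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned VXLAN UDP destination port

def vxlan_headers(vni: int, src_port: int = 49152) -> bytes:
    """Build the 50 bytes of encapsulation wrapped around an inner frame."""
    # Outer Ethernet (14 bytes): dst MAC, src MAC, EtherType 0x0800 (IPv4)
    outer_eth = bytes(6) + bytes(6) + struct.pack("!H", 0x0800)
    # Outer IPv4 (20 bytes): version/IHL 0x45, DF bit set (0x4000), proto 17 (UDP)
    outer_ip = struct.pack("!BBHHHBBH4s4s",
                           0x45, 0, 0, 0, 0x4000, 64, 17, 0,
                           bytes(4), bytes(4))
    # UDP (8 bytes): hashed source port, destination 4789, length, checksum
    udp = struct.pack("!HHHH", src_port, VXLAN_PORT, 0, 0)
    # VXLAN (8 bytes): flags 0x08 (VNI-valid bit) plus reserved bits,
    # then the 24-bit VNI in the upper three bytes of the second word
    vxlan = struct.pack("!I", 0x08 << 24) + struct.pack("!I", vni << 8)
    return outer_eth + outer_ip + udp + vxlan
```

Summing the four pieces (14 + 20 + 8 + 8) gives exactly the 50-byte overhead discussed above.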
The VNI (VXLAN Network Identifier)
The VNI is the 24-bit field that identifies the VXLAN segment — the logical Layer 2 network. It is analogous to a VLAN ID but with a vastly larger space: 16,777,216 possible values compared to 4,094 VLANs. Each VNI defines an isolated broadcast domain. Traffic from one VNI cannot reach another VNI without explicit routing (inter-VXLAN routing), providing tenant isolation in multi-tenant environments.
UDP Source Port Hashing
VXLAN uses UDP destination port 4789. The source port is computed as a hash of fields from the inner frame (typically the inner source/destination MAC, IP, and TCP/UDP ports). This source port entropy is critical because it enables the underlay network to perform ECMP (Equal-Cost Multi-Path) load balancing across multiple parallel paths. Without source port variation, all VXLAN traffic between two VTEPs would hash to the same ECMP path, wasting available bandwidth. The hash-based source port distributes VXLAN flows across all available underlay paths.
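The per-flow hashing described above can be sketched in a few lines. CRC32 here stands in for whatever hash the ASIC or driver actually uses; the key point is that the same inner flow always yields the same source port (preserving packet order), while distinct flows spread across the ephemeral port range.

```python
import zlib

def vxlan_source_port(src_mac, dst_mac, src_ip, dst_ip, sport, dport):
    """Derive the outer UDP source port from inner-flow fields so the
    underlay's ECMP hash spreads distinct flows across parallel paths."""
    key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}{sport}{dport}".encode()
    # Map the hash into the dynamic port range 49152-65535 (16384 values)
    return 49152 + (zlib.crc32(key) % 16384)
```

Because the underlay hashes on the outer 5-tuple, varying only this source port is enough to distribute VXLAN traffic over every available path.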
VTEPs: VXLAN Tunnel Endpoints
A VTEP (VXLAN Tunnel Endpoint) is the device that performs VXLAN encapsulation and decapsulation. VTEPs are the edge devices of the VXLAN overlay — they sit at the boundary between the Layer 2 domain (virtual machines, containers, bare-metal servers) and the Layer 3 underlay network.
VTEPs can be implemented in:
- Hardware switches: Top-of-rack (ToR) switches in a leaf-spine fabric. Each leaf switch acts as a VTEP, encapsulating traffic from locally connected servers and decapsulating traffic arriving from other leaf switches.
- Hypervisor virtual switches: The vSwitch in a hypervisor (e.g., Open vSwitch, VMware vDS) can act as a VTEP, encapsulating traffic from individual VMs. This moves the VXLAN boundary to the server, making the physical network pure Layer 3.
- Software routers/gateways: Dedicated gateway devices that bridge between VXLAN segments and traditional VLANs or external networks.
Each VTEP has at least one IP address on the underlay network (the VTEP IP, often a loopback address). VXLAN tunnels are established between VTEP IPs. The underlay routing protocol (OSPF, IS-IS, or BGP) provides reachability between VTEP IPs, and the underlay network handles forwarding the encapsulated packets.
BUM Traffic Handling
One of the biggest challenges in VXLAN is handling BUM (Broadcast, Unknown unicast, and Multicast) traffic. In a traditional VLAN, BUM traffic is flooded to all ports in the VLAN. In a VXLAN overlay, the equivalent would be sending BUM traffic to all VTEPs that participate in the same VNI. Two primary approaches exist:
Multicast-Based Flooding
The original RFC 7348 approach maps each VNI to an IP multicast group (e.g., VNI 10000 → 239.1.1.1). BUM traffic is encapsulated and sent to the multicast group address. The underlay network's multicast routing (PIM-SM) delivers it to all VTEPs that have joined the group. This approach is simple but requires multicast infrastructure in the underlay, which many operators are reluctant to deploy due to complexity and troubleshooting difficulty.
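RFC 7348 leaves the VNI-to-group mapping to configuration. One common operational convention, sketched below, folds the VNI into a block of the 239.0.0.0/8 administratively scoped range; several VNIs may deliberately share a group to limit (S,G) state in the underlay. The base address here is an illustrative choice, not a standard.

```python
import ipaddress

def vni_to_group(vni: int, base: str = "239.1.0.0") -> str:
    """Deterministically map a VNI onto an underlay multicast group by
    adding the low 16 bits of the VNI to a configured base address."""
    base_int = int(ipaddress.IPv4Address(base))
    return str(ipaddress.IPv4Address(base_int + (vni % 65536)))
```

With this mapping, each VTEP joins the group for every locally configured VNI, and PIM-SM delivers BUM traffic only to interested VTEPs.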
Ingress Replication (Head-End Replication)
The VTEP maintains a list of all remote VTEPs participating in each VNI and unicasts a copy of BUM traffic to each one individually. This eliminates the need for multicast in the underlay but increases the amount of traffic generated by the source VTEP proportionally to the number of remote VTEPs. For small to medium deployments (dozens of VTEPs per VNI), ingress replication works well. For very large deployments, the replication overhead can become significant.
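A minimal sketch of head-end replication, with a plain dictionary standing in for the per-VNI flood list a real VTEP would build (from configuration or, as described below, from EVPN Type 3 routes):

```python
def ingress_replicate(frame: bytes, vni: int, flood_lists: dict) -> list:
    """Head-end replication: the source VTEP unicasts one encapsulated
    copy of a BUM frame to every remote VTEP in the VNI. The cost grows
    linearly with the number of remote VTEPs."""
    return [(remote_vtep, frame) for remote_vtep in flood_lists.get(vni, [])]

flood_lists = {10000: ["192.0.2.11", "192.0.2.12", "192.0.2.13"]}
copies = ingress_replicate(b"\xff" * 64, 10000, flood_lists)
```

The linear fan-out is visible directly: three remote VTEPs means three copies leave the source VTEP for every BUM frame.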
EVPN-Based Suppression
The most modern and scalable approach uses EVPN (described below) to distribute MAC and IP information via BGP, enabling VTEPs to proxy-respond to ARP/ND requests without flooding them. This dramatically reduces BUM traffic. When a VM sends an ARP request, the local VTEP intercepts it, looks up the target IP in its EVPN-learned MAC/IP database, and responds directly. The ARP request never crosses the underlay.
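The suppression logic reduces to a table lookup, sketched here with a dictionary standing in for the EVPN-learned MAC/IP database (the fallback behavior for unknown targets is implementation-dependent):

```python
def handle_arp_request(target_ip: str, evpn_table: dict) -> dict:
    """EVPN ARP suppression: the local VTEP answers from its BGP-learned
    MAC/IP table instead of flooding the request across the underlay."""
    mac = evpn_table.get(target_ip)
    if mac is not None:
        return {"action": "proxy-reply", "mac": mac}
    # Unknown target: fall back to flooding (or drop, per configuration)
    return {"action": "flood"}

evpn_table = {"10.1.1.20": "52:54:00:aa:bb:02"}
```

When the table hit succeeds, the requesting VM receives its ARP reply from the local VTEP and no BUM traffic enters the fabric at all.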
EVPN: The Control Plane for VXLAN
RFC 7348 defined only the VXLAN data plane (encapsulation format) and used flood-and-learn for MAC address discovery — the same mechanism as traditional Ethernet switching, but tunneled through VXLAN. This data-plane-only approach has significant limitations: it requires either multicast or ingress replication for all BUM traffic, it has no control over which VTEPs participate in which VNIs, and MAC learning is reactive rather than proactive.
EVPN (Ethernet VPN, RFC 7432) provides a proper control plane for VXLAN using BGP as the routing protocol. EVPN with VXLAN encapsulation is defined in RFC 8365. With EVPN, MAC and IP addresses are advertised as BGP routes rather than learned from flooded data-plane traffic. Each VTEP runs BGP (typically iBGP with route reflectors) and advertises the MAC/IP addresses of locally connected endpoints.
EVPN Route Types
EVPN uses BGP to carry several route types, each serving a specific purpose in the overlay network:
- Type 1 — Ethernet Auto-Discovery Route: Advertises Ethernet segment membership for multi-homing scenarios. Used for fast convergence when a link in a multi-homed Ethernet segment fails, and for split-horizon filtering to prevent BUM traffic loops in active-active multi-homing.
- Type 2 — MAC/IP Advertisement Route: The most fundamental EVPN route type. Advertises a MAC address, optionally with an associated IP address (IPv4 or IPv6), the VNI, and the VTEP IP. This enables remote VTEPs to learn MAC-to-VTEP mappings via BGP rather than flood-and-learn. When a VM sends its first packet, the local VTEP advertises the VM's MAC/IP as a Type 2 route. Remote VTEPs install the route and know exactly which VTEP to send traffic to for that MAC.
- Type 3 — Inclusive Multicast Ethernet Tag Route: Advertises VTEP participation in a VNI for BUM traffic. When a VTEP joins a VNI, it advertises a Type 3 route containing its VTEP IP. Other VTEPs use this to build the ingress replication list for BUM traffic in that VNI.
- Type 4 — Ethernet Segment Route: Used for Designated Forwarder (DF) election in multi-homed Ethernet segments. Ensures that only one VTEP in a redundancy group forwards BUM traffic toward a multi-homed host, preventing duplicates.
- Type 5 — IP Prefix Route: Advertises IP prefixes for inter-VXLAN routing (between different VNIs). Enables L3 routing between VXLAN segments, carrying the prefix, next-hop VTEP, VNI, and routing information.
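To make the control-plane learning concrete, here is a simplified model of Type 2 routes feeding a forwarding table. Real routes also carry a route distinguisher, route targets, labels, and a sequence number for MAC mobility, all omitted here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Type2Route:
    """Simplified EVPN Type 2 (MAC/IP Advertisement) route."""
    mac: str
    ip: Optional[str]   # the IP is optional in a Type 2 route
    vni: int
    vtep_ip: str

def build_fdb(routes):
    """Turn received Type 2 routes into a (vni, mac) -> remote-VTEP map,
    replacing data-plane flood-and-learn with control-plane learning."""
    return {(r.vni, r.mac): r.vtep_ip for r in routes}

routes = [
    Type2Route("52:54:00:aa:bb:01", "10.1.1.10", 10000, "192.0.2.11"),
    Type2Route("52:54:00:aa:bb:02", None, 10000, "192.0.2.12"),
]
fdb = build_fdb(routes)
```

A VTEP that receives these routes knows, before any data-plane traffic flows, exactly which remote VTEP terminates each MAC in VNI 10000.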
Symmetric and Asymmetric IRB
When traffic needs to be routed between different VXLAN segments (different VNIs), Integrated Routing and Bridging (IRB) is used. Two models exist:
Asymmetric IRB
In the asymmetric model, the ingress VTEP performs both the routing lookup (L3) and the bridging into the destination VNI. The packet crosses the underlay in the destination VNI. The egress VTEP only performs L2 bridging. This is "asymmetric" because the ingress does more work than the egress, and the return traffic path may use a different VNI.
The downside: the ingress VTEP must have both the source and destination VNIs configured, which means every VTEP must be configured with every VNI in the fabric. This does not scale in large multi-tenant environments.
Symmetric IRB
In the symmetric model, both the ingress and egress VTEPs perform routing. The ingress VTEP routes the packet from the source VNI into a shared L3 VNI (associated with a VRF), encapsulates it in VXLAN with the L3 VNI, and sends it across the underlay. The egress VTEP decapsulates, performs a routing lookup in the VRF, and bridges the packet into the destination VNI.
The advantage: each VTEP only needs to be configured with the VNIs of locally connected hosts plus the L3 VNI for the VRF. This scales much better in multi-tenant environments where different tenants have different sets of VNIs on different leaf switches. Symmetric IRB is the recommended and most widely deployed model in modern EVPN-VXLAN fabrics.
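The ingress half of symmetric IRB can be sketched as below, with hypothetical dictionary-based tables standing in for the tenant VRF and its EVPN-learned host routes:

```python
def symmetric_irb_ingress(pkt: dict, vrf: dict):
    """Ingress VTEP in symmetric IRB: route the packet out of its source
    VNI into the tenant's shared L3 VNI. The egress VTEP performs the
    second routing lookup and bridges into the destination VNI."""
    # Host/prefix routes in the tenant VRF, learned via EVPN Type 2/Type 5
    nh = vrf["routes"].get(pkt["dst_ip"])
    if nh is None:
        return None  # no route in the VRF: drop
    return {
        "encap_vni": vrf["l3_vni"],      # crosses the underlay in the L3 VNI
        "outer_dst": nh["vtep_ip"],      # egress VTEP from the EVPN route
        "inner_dmac": nh["router_mac"],  # rewritten to the egress router MAC
    }

vrf = {
    "l3_vni": 50001,
    "routes": {"10.2.2.20": {"vtep_ip": "192.0.2.12",
                             "router_mac": "00:00:5e:00:01:12"}},
}
fwd = symmetric_irb_ingress({"dst_ip": "10.2.2.20"}, vrf)
```

Note that the source and destination VNIs never appear in the ingress decision: only the L3 VNI does, which is exactly why a VTEP need not be configured with every VNI in the fabric.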
VXLAN in Data Center Leaf-Spine Fabrics
The most common VXLAN deployment architecture is the BGP-EVPN leaf-spine fabric:
- Leaf switches act as VTEPs. Each leaf connects to servers (hypervisors, bare-metal, containers) and provides the VXLAN overlay termination. Servers see traditional VLANs or untagged ports; the leaf handles VXLAN encapsulation.
- Spine switches form the IP underlay core. They route VXLAN-encapsulated traffic between leaf switches using ECMP. Spines are typically configured as BGP route reflectors for EVPN.
- eBGP underlay: Many designs use eBGP for the underlay routing (typically a unique private ASN per leaf, with the spines often sharing a common ASN), eliminating the need for an IGP like OSPF or IS-IS.
- iBGP EVPN overlay: EVPN routes are carried in iBGP, with spine switches serving as route reflectors. Alternatively, some designs use eBGP for both underlay and overlay.
This architecture provides a scalable, loop-free, multi-path fabric where any VM or container can communicate with any other, regardless of physical location. VXLAN segments can be stretched across multiple data centers by extending the EVPN control plane across a DCI (Data Center Interconnect) link.
VXLAN MTU and Performance Considerations
The 50-byte VXLAN encapsulation overhead introduces several practical considerations:
- MTU planning: The underlay MTU must accommodate the inner frame plus 50 bytes of encapsulation. For 1500-byte inner frames, the underlay needs at least 1550 bytes. The industry standard is to configure 9214-byte jumbo frames on the underlay, which accommodates inner jumbo frames up to 9164 bytes.
- Don't Fragment (DF) bit: VXLAN sets the DF bit in the outer IP header by default. If the encapsulated packet exceeds the underlay MTU, it is dropped rather than fragmented. This makes MTU consistency across the underlay critical — a single link with a smaller MTU will cause silent black-holing of large packets.
- ECMP hashing: The underlay must be configured to hash on the UDP source port for ECMP load balancing. Some older hardware hashes only on source/destination IP, which would send all traffic between two VTEPs down the same path.
- Hardware offload: Modern data center ASICs (e.g., Broadcom Trident and Tomahawk, Intel/Barefoot Tofino, NVIDIA Spectrum) support VXLAN encap/decap in hardware at line rate. Software-based VXLAN (e.g., in Open vSwitch) introduces measurable CPU overhead and latency, though kernel offload features like tc flower and hardware steering can mitigate this.
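The DF-bit black-holing described above can be checked mechanically. A minimal sketch (with hypothetical link names) that flags underlay links too small for full-size encapsulated packets:

```python
VXLAN_OVERHEAD = 50  # outer Ethernet 14 + IP 20 + UDP 8 + VXLAN 8

def find_blackhole_links(link_mtus: dict, inner_mtu: int) -> list:
    """With DF set in the outer header, any underlay link whose MTU is
    below inner_mtu + 50 silently drops full-size encapsulated packets."""
    required = inner_mtu + VXLAN_OVERHEAD
    return [(link, mtu) for link, mtu in sorted(link_mtus.items())
            if mtu < required]

links = {"leaf1-spine1": 9214, "spine1-leaf2": 1500, "leaf2-host1": 9214}
bad = find_blackhole_links(links, inner_mtu=1500)  # needs 1550 end to end
```

A single undersized link in an otherwise jumbo-framed fabric, as in this example, is precisely the failure mode that makes end-to-end MTU auditing worthwhile.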
VXLAN and Container Networking
VXLAN is widely used in container networking to provide overlay connectivity between pods running on different hosts. Container networking solutions like Flannel, Calico (in VXLAN mode), and Cilium use VXLAN tunnels between nodes to encapsulate pod-to-pod traffic. The container runtime on each node acts as a VTEP, encapsulating traffic destined for pods on remote nodes.
In Kubernetes environments, VXLAN overlays allow pods to communicate using a flat IP address space regardless of the underlying network topology. Each node is assigned a subnet from the pod CIDR, and VXLAN tunnels provide the connectivity between these subnets without requiring the physical network to understand pod addressing.
VXLAN Security Considerations
VXLAN itself provides no encryption, authentication, or integrity protection. The encapsulated traffic is carried in plaintext UDP packets. Any device on the underlay network that can capture traffic can inspect the inner frames. Security considerations include:
- Underlay isolation: The underlay network should be a trusted, physically secured infrastructure. VXLAN traffic should not traverse untrusted networks without additional protection.
- Tenant isolation: VNI isolation is enforced by VTEPs. A compromised VTEP could potentially inject traffic into any VNI. Proper access controls on the underlay and BGP session authentication are essential.
- Encryption: For VXLAN traffic crossing untrusted links (e.g., DCI over the internet), IPsec or MACsec encryption should be used on the underlay. Some implementations support VXLAN-GPE (Generic Protocol Extension) with inline encryption.
GPE, GENEVE, and the Future of Overlays
VXLAN-GPE (RFC draft) extends the VXLAN header with a Next Protocol field, enabling encapsulation of protocols other than Ethernet (e.g., IP, NSH for service chaining). GENEVE (RFC 8926) is a more flexible alternative to VXLAN that supports variable-length TLV options in the header, designed to be extensible enough to subsume both VXLAN and NVGRE.
In practice, VXLAN with EVPN remains the dominant production deployment model. GENEVE is gaining traction in cloud provider networks and is used by some container networking implementations, but the installed base of EVPN-VXLAN is massive and the transition to GENEVE is gradual.
Explore Network Infrastructure
VXLAN fabrics are part of the data center infrastructure that connects to the broader internet via BGP. Data center leaf switches often peer with border routers that run eBGP with upstream transit providers and peering partners. To see how networks interconnect at the BGP level, use the god.ad BGP Looking Glass to look up any IP address or ASN and trace the AS path between networks.