How ECMP Works: Equal-Cost Multi-Path Routing and Flow Hashing
Equal-Cost Multi-Path (ECMP) is a routing technique that allows a router to install multiple next-hops for the same destination prefix when two or more paths have identical cost in the routing protocol's metric. Instead of picking a single best path and discarding the alternatives, the router distributes traffic across all equal-cost paths simultaneously. ECMP is supported by the major interior gateway protocols — OSPF and IS-IS — and, with explicit configuration, by BGP. It is the fundamental mechanism that makes modern data center fabrics, anycast deployments, and high-bandwidth backbone links work.
ECMP matters because network links have finite capacity. A single 100 Gbps link between two switches can only carry 100 Gbps regardless of how many prefixes route over it. By installing multiple equal-cost paths and distributing traffic across them, a network operator can achieve aggregate bandwidth that exceeds any single link's capacity. A leaf switch with four 100 Gbps uplinks running ECMP to four spine switches achieves 400 Gbps of aggregate uplink bandwidth — not by bonding the links into one logical interface, but by making independent forwarding decisions per flow across all available next-hops.
How ECMP Forwarding Works
When a router computes its routing table and discovers multiple paths with equal cost to a destination, it installs all of them as valid next-hops. For example, if OSPF computes three shortest paths to 10.0.0.0/24 with cost 20, all three next-hops appear in the forwarding information base (FIB). The question then becomes: for any given packet destined for 10.0.0.0/24, which of the three next-hops should the router use?
There are two fundamental approaches: per-packet and per-flow load balancing. Per-packet round-robins individual packets across next-hops, which maximizes link utilization but causes packet reordering. TCP interprets reordered packets as congestion (triggering duplicate ACKs and fast retransmit), so per-packet ECMP devastates TCP throughput. UDP applications may also suffer if they expect in-order delivery. For this reason, virtually all modern ECMP implementations use per-flow distribution.
Hash-Based Flow Distribution
Per-flow ECMP works by computing a hash over selected packet header fields and using that hash to choose a next-hop. The hash function ensures that all packets belonging to the same flow always select the same next-hop, preserving packet ordering within the flow while still distributing different flows across available paths.
The most common hash input is the 5-tuple: source IP address, destination IP address, IP protocol number, source port, and destination port. This produces fine-grained distribution because each TCP or UDP session between two endpoints on different ports maps to a potentially different hash value and therefore a different next-hop.
Some implementations use a 3-tuple (source IP, destination IP, protocol) when Layer 4 information is unavailable — for example, when packets are encapsulated in GRE or IPsec tunnels and the outer header has no meaningful port numbers. The 3-tuple produces coarser distribution: all traffic between two IP addresses follows the same path regardless of how many sessions exist between them.
The hash function itself varies by platform. Common choices include CRC-16, CRC-32, XOR-based folding, and the Toeplitz hash (used for receive-side scaling, RSS, on NICs). The ideal hash function produces a uniform distribution of output values across the input space to ensure even load balancing. A poor hash function can produce systematic bias — consistently mapping more flows to some next-hops than others — leading to uneven link utilization despite multiple paths being available.
After computing the hash, the router selects a next-hop using hash mod N, where N is the number of equal-cost next-hops. This is conceptually simple but has an important consequence: when N changes (a link goes down or a new path appears), the mod operation remaps most flows to different next-hops, disrupting existing TCP sessions. Resilient hashing, discussed below, addresses this problem.
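The hash-mod-N selection described above can be sketched in a few lines of Python. This is an illustrative model, not any vendor's implementation: CRC-32 stands in for the platform hash, and the spine names are hypothetical.

```python
import zlib

def select_next_hop(src_ip: str, dst_ip: str, proto: int,
                    src_port: int, dst_port: int,
                    next_hops: list[str]) -> str:
    # Hash the 5-tuple; CRC-32 stands in for the ASIC's hash function.
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]  # hash mod N

hops = ["spine1", "spine2", "spine3"]

# Every packet of one TCP flow maps to the same next-hop, preserving order:
a = select_next_hop("10.0.1.5", "10.0.2.9", 6, 40000, 443, hops)
assert a == select_next_hop("10.0.1.5", "10.0.2.9", 6, 40000, 443, hops)

# Shrinking the group changes N, so the same flow may land elsewhere --
# the mod-N remapping problem that resilient hashing addresses.
b = select_next_hop("10.0.1.5", "10.0.2.9", 6, 40000, 443, hops[:2])
assert b in hops[:2]
```

Note that the function is stateless: the next-hop is recomputed per packet, and flow affinity falls out of the hash being deterministic, not from any per-flow table.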
ECMP in OSPF and IS-IS
OSPF and IS-IS natively support ECMP because they are shortest-path-first protocols. When Dijkstra's algorithm computes the shortest-path tree, it naturally discovers all equal-cost paths to every destination. If two paths to prefix X have cost 30, both are installed in the RIB with their respective next-hops.
The number of ECMP paths a router will install is configurable and platform-dependent. Cisco IOS defaults to 4 (configurable up to 32 with maximum-paths). Junos defaults to 16. Arista EOS supports up to 128. Linux kernel routing supports up to 256 next-hops per route. The practical limit often comes from hardware TCAM capacity in the forwarding ASIC rather than software constraints.
In OSPF, ECMP is automatic: if you have a symmetric topology where multiple paths have the same cost, all equal-cost next-hops appear in the routing table without any additional configuration. The key to achieving ECMP in OSPF is careful interface cost assignment. If you want four paths to be equal-cost, all four must have the same total metric. OSPF calculates interface cost as reference-bandwidth / interface-bandwidth by default (e.g., with the default 100 Mbps reference bandwidth, every link of 100 Mbps or faster is clamped to the minimum cost of 1; raising the reference to 100 Gbps gives a 1 Gbps link cost 100 and a 100 Gbps link cost 1). Setting all links in a tier to the same bandwidth naturally produces equal-cost paths.
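The cost formula is simple enough to sanity-check in a few lines. This sketch models the standard behavior (integer division, clamped to a minimum cost of 1); actual rounding details can vary by implementation.

```python
def ospf_cost(interface_bps: int, reference_bps: int) -> int:
    # cost = reference-bandwidth / interface-bandwidth, minimum 1
    return max(1, reference_bps // interface_bps)

# With the default 100 Mbps reference, all fast links collapse to cost 1,
# losing any distinction between 1G, 10G, and 100G:
assert ospf_cost(1_000_000_000, 100_000_000) == 1
assert ospf_cost(100_000_000_000, 100_000_000) == 1

# A 100 Gbps reference restores differentiation:
assert ospf_cost(1_000_000_000, 100_000_000_000) == 100
assert ospf_cost(100_000_000_000, 100_000_000_000) == 1
```

The collapse to cost 1 under the default reference is why operators raise reference-bandwidth fleet-wide before relying on bandwidth-derived costs for ECMP.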
IS-IS behaves similarly. Its default metric is 10 on all interfaces regardless of bandwidth, so by default all paths are equal-cost in IS-IS, which makes ECMP the norm unless you deliberately assign different metrics. IS-IS wide metrics (RFC 3784) allow metric values up to 16,777,215, giving operators granular control over path preference while still allowing ECMP when desired.
ECMP in BGP
BGP does not perform ECMP by default. The BGP best-path algorithm selects a single best path for each prefix and installs only that path in the RIB. This is by design: BGP is a policy routing protocol, and different paths may have very different policy implications even if they share the same AS path length.
To enable BGP ECMP, operators must explicitly configure maximum-paths (or the vendor-specific equivalent). Even then, BGP imposes strict conditions on which paths are considered "equal" for ECMP purposes. At minimum, the paths must match on AS path length. Depending on implementation, they may also need to match on origin code, MED, IGP metric to the next-hop, and other attributes.
There are two forms of BGP multipath:
- eBGP multipath — allows ECMP across paths learned from different eBGP neighbors. This is common at internet exchange points or when connecting to two upstream transit providers via separate links. The paths must have the same AS path length, and the bestpath as-path multipath-relax knob (Cisco) or multipath multiple-as (Junos) is typically needed to allow ECMP across paths with different neighboring ASes.
- iBGP multipath — allows ECMP across paths learned from different iBGP speakers (or route reflectors). The IGP metric to the BGP next-hop must be equal for the paths to qualify. This is used within large ASes to spread traffic across multiple exit points.
BGP add-path (RFC 7911) extends this further by allowing a BGP speaker to advertise multiple paths for the same prefix to its peers, rather than only the best path. Without add-path, a route reflector sends only its single best path to clients, hiding alternative paths that the clients might use for ECMP. With add-path enabled, the route reflector advertises multiple paths, and the client can independently evaluate them for multipath installation.
Unequal-Cost Multi-Path (UCMP)
Standard ECMP requires all paths to have exactly the same cost. In practice, network topologies are not always perfectly symmetric. A router might have two 100 Gbps uplinks and one 40 Gbps uplink — all reaching the same destination but with different link capacities or IGP metrics. Pure ECMP either uses all three paths equally (overloading the 40G link) or excludes the 40G path entirely (wasting capacity).
Unequal-Cost Multi-Path (UCMP) solves this by assigning weights to next-hops proportional to their capacity or metric. EIGRP was the first routing protocol to support UCMP natively through its variance command, which installs paths up to N times worse than the best path's metric. Modern implementations in OSPF and IS-IS achieve UCMP through explicit next-hop weights derived from link bandwidth ratios.
For example, with two 100G links and one 40G link, UCMP assigns weights 100:100:40, which reduce to 5:5:2. The hardware forwarding table implements this by allocating more hash buckets to the 100G links: 5 buckets each for the 100G links and 2 for the 40G link, out of 12 total. Flows then hash into these buckets proportionally, sending roughly 42%/42%/17% of traffic over each link.
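The weight-to-bucket allocation can be sketched as follows. The uplink names are hypothetical, and real ASICs allocate from a fixed bucket pool rather than reducing by GCD, but the proportional idea is the same.

```python
from functools import reduce
from math import gcd

def allocate_buckets(weights: dict[str, int]) -> list[str]:
    # Reduce weights by their GCD, then give each next-hop that many buckets.
    g = reduce(gcd, weights.values())
    table: list[str] = []
    for hop, weight in weights.items():
        table.extend([hop] * (weight // g))
    return table

table = allocate_buckets({"uplink-100g-a": 100,
                          "uplink-100g-b": 100,
                          "uplink-40g": 40})

assert len(table) == 12                  # 5 + 5 + 2 buckets total
assert table.count("uplink-40g") == 2    # ~17% of flows hash here
```

A flow is then mapped with hash % len(table), so the 40G link receives 2/12 of flows in expectation rather than an equal 1/3 share.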
UCMP is increasingly important in data centers that have undergone partial upgrades — for example, a fabric transitioning from 100G to 400G may have some spine links at 100G and others at 400G during the migration. Without UCMP, the 400G links would carry the same traffic volume as the 100G links, wasting 75% of the upgraded capacity.
The Polarization Problem
ECMP polarization occurs when multiple routers in a forwarding path use the same hash function with the same inputs, causing all of them to make the same next-hop selection for the same flow. The result is that traffic is not distributed evenly across links at each hop — flows that hash to path 0 at the first router also hash to path 0 at the second router, third router, and so on.
Consider a three-tier network: access, aggregation, and core. If all three tiers use an identical hash over the 5-tuple, a flow from host A to host B will always take the same relative "slot" at every tier. If that slot corresponds to the first next-hop at every level, links associated with the first next-hop at each tier will be overloaded while other links sit idle.
Solutions to polarization include:
- Per-router hash seed — most modern ASICs allow configuring a unique hash seed per router. The seed is XORed with the hash input, producing a different output for the same 5-tuple on each router. Arista, Cisco, and Juniper all support this. It is the most common and effective solution.
- Asymmetric hash functions — using different hash algorithms at different tiers. For example, CRC-16 at tier 1 and CRC-32 at tier 2. This is less common because it requires different ASIC configurations at different tiers.
- Including router-specific data — some implementations fold the router's own IP address or interface ID into the hash input, making the hash inherently different per device without explicit seed configuration.
Polarization is particularly damaging in Clos topologies where every packet traverses exactly the same number of hops. If polarization aligns the hash decisions at each tier, you can have a 64-port spine layer where a few spine switches carry the majority of traffic and the rest are nearly idle — negating the entire purpose of the multi-path fabric.
Resilient Hashing
Standard ECMP has a disruptive failure mode: when a next-hop is added or removed, the hash mod N operation changes for most flows because N has changed. If a router has 4 equal-cost paths and one goes down, hash mod 4 becomes hash mod 3, and approximately 75% of flows remap to a different next-hop. For stateful flows (TCP connections, load-balanced sessions), this remapping disrupts in-progress connections.
Resilient hashing (also called consistent hashing in ECMP context) minimizes flow remapping during next-hop changes. The technique pre-allocates a fixed-size hash table (typically 64 or 128 buckets) and assigns buckets to next-hops. When a next-hop is removed, only its buckets are redistributed to surviving next-hops. Flows that were already assigned to surviving next-hops keep their assignment. When a next-hop is added, buckets are moved from existing next-hops to the new one, again minimizing disruption.
For example, with 4 next-hops and 64 buckets, each next-hop owns 16 buckets. If next-hop 2 fails, its 16 buckets are distributed among next-hops 0, 1, and 3. Flows hashing into the 48 buckets owned by the surviving next-hops do not move; only flows in the 16 buckets that belonged to next-hop 2 are remapped — a disruption rate of 25% (1/N) instead of 75%. This is the theoretical minimum: you cannot avoid remapping the flows that were on the failed path.
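The bucket-table mechanics can be sketched directly. This is a simplified model (round-robin bucket assignment, simple redistribution of the failed hop's buckets); hardware implementations differ in how they choose which survivor inherits each bucket.

```python
BUCKETS = 64

def build_table(hops: list[str]) -> list[str]:
    # Fixed-size table: with 4 hops, each owns 16 of the 64 buckets.
    return [hops[i % len(hops)] for i in range(BUCKETS)]

def remove_hop(table: list[str], failed: str,
               survivors: list[str]) -> list[str]:
    # Only the failed hop's buckets are reassigned; all others keep
    # their owner, so flows on surviving hops are undisturbed.
    out = list(table)
    j = 0
    for i, hop in enumerate(out):
        if hop == failed:
            out[i] = survivors[j % len(survivors)]
            j += 1
    return out

table = build_table(["nh0", "nh1", "nh2", "nh3"])
after = remove_hop(table, "nh2", ["nh0", "nh1", "nh3"])

moved = sum(old != new for old, new in zip(table, after))
assert moved == 16          # exactly 1/N of buckets (16 of 64) remapped
```

A flow still selects its bucket with hash % BUCKETS; because BUCKETS is constant, the hash-to-bucket mapping never changes — only bucket ownership does.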
Resilient hashing is supported on many modern data center switching ASICs, including Broadcom silicon. Configuration syntax varies by platform: Arista EOS uses ip ecmp hash resilient, Cisco NX-OS uses hardware ecmp hash-resource resilient, and Cumulus Linux uses hash_config resilient in switchd.conf.
ECMP in Data Center Clos Fabrics
ECMP is the backbone of modern data center network design. The Clos (leaf-spine) topology is built around the assumption that every leaf switch has an equal-cost path to every other leaf switch through any spine. With N spine switches, there are exactly N equal-cost paths between any two leaf switches, and ECMP distributes traffic across all of them.
In a 5-stage Clos (leaf → spine → super-spine → spine → leaf), ECMP happens at two levels. A leaf chooses among its local spine switches (first-level ECMP), and each spine chooses among super-spine switches (second-level ECMP). The total number of paths between any two leaves in different pods is local_spines × super_spines, which can be enormous. A fabric with 4 local spines per pod and 8 super-spines has 32 equal-cost paths between any inter-pod leaf pair.
Data center fabrics overwhelmingly use eBGP as the routing protocol, following the design described in RFC 7938. Each switch runs its own unique autonomous system number, and every link between switches is an eBGP session. This means BGP multipath with as-path multipath-relax is required for ECMP, since the paths traverse different ASes. The alternative design — using OSPF or IS-IS as the underlay — gets ECMP natively but sacrifices BGP's policy flexibility and requires a separate protocol stack.
ECMP and Anycast
Anycast is the deployment of the same IP prefix from multiple locations. Within a data center, anycast is used extensively for services like DNS, load balancers, and distributed storage frontends. Multiple servers advertise the same /32 host route (or service VIP) into the fabric, and ECMP at the leaf and spine layers distributes traffic across all servers advertising that address.
This is effectively server-level ECMP. If four servers behind two leaf switches all announce 10.10.10.1/32, the fabric sees four equal-cost paths to that address: two directly connected at each leaf, plus additional paths via the spine layer. Traffic from anywhere in the fabric is ECMP'd across all four servers without any dedicated load balancer hardware in the path.
At the internet scale, anycast works the same way via BGP. When Cloudflare (AS13335) or Google (AS15169) announce the same prefix from dozens of PoPs worldwide, upstream networks see multiple BGP paths and select the one with the shortest AS path or best local preference. Within each upstream AS, if multiple exit points reach the anycast destination with equal cost, ECMP distributes flows across those exits.
ECMP-based anycast has an important property: because hash-based forwarding pins each flow to a single server, stateful protocols like TCP work correctly even though the same IP exists in multiple places. A flow's 5-tuple always hashes to the same next-hop, so all packets in a TCP connection reach the same server. This breaks only when the ECMP group changes — for example, when a server is added or removed — which is why resilient hashing is critical for anycast deployments.
Elephant Flows and ECMP Limitations
ECMP's per-flow granularity means that load distribution is only as good as the flow distribution. When traffic consists of many small, short-lived flows (mice flows), the hash function distributes them evenly and all links are similarly loaded. But a single large, long-lived flow (an elephant flow) — such as a database backup, VM migration, or MapReduce shuffle — can saturate one link while others remain underutilized.
If ten flows cross four ECMP paths and one flow represents 60% of total traffic, no hash function can balance the links: one link carries at least 60% load while the theoretical fair share is 25%. This is the fundamental limitation of per-flow ECMP — it cannot subdivide individual flows.
Several techniques mitigate elephant flow problems:
- Flowlet-based ECMP — when a flow has a gap between packets larger than the maximum reordering delay (typically tens of microseconds), the subsequent burst can be treated as a new "flowlet" and hashed independently. This effectively splits elephant flows at natural boundaries without causing reordering. Several modern data center switching ASICs implement flowlet switching in hardware.
- Packet spraying — in tightly controlled environments (such as within a single data center pod), per-packet spraying with reordering tolerance can be used. Some modern transport protocols like MPTCP and RoCEv2 handle reordering gracefully, making packet spraying viable for specific workloads.
- Weighted ECMP / traffic engineering — detecting elephant flows (via sFlow, INT, or flow table sampling) and explicitly routing them to less-loaded paths. This requires a centralized controller or SDN framework.
- More paths — increasing the number of ECMP paths (more spine switches) reduces the impact of any single elephant flow. With 32 paths instead of 4, one elephant flow affects only 1/32 of the fabric's capacity.
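Of these, flowlet switching is the most mechanical, and its core logic fits in a short sketch. The class name, the 50-microsecond threshold, and the random path re-pick are all illustrative assumptions; hardware implementations track flowlet state in dedicated tables and may re-pick based on link load rather than at random.

```python
import random

FLOWLET_GAP_US = 50.0   # assumed safe reordering gap, in microseconds

class FlowletBalancer:
    """Toy per-flowlet path selection across n_paths equal-cost paths."""

    def __init__(self, n_paths: int):
        self.n_paths = n_paths
        # flow 5-tuple -> (timestamp of last packet, current path)
        self.state: dict[tuple, tuple[float, int]] = {}

    def pick_path(self, flow: tuple, now_us: float) -> int:
        last = self.state.get(flow)
        if last is None or now_us - last[0] > FLOWLET_GAP_US:
            # Gap exceeded: a new flowlet may safely take a new path,
            # because all earlier packets have already drained.
            path = random.randrange(self.n_paths)
        else:
            path = last[1]   # same flowlet: stay on the same path
        self.state[flow] = (now_us, path)
        return path

b = FlowletBalancer(4)
flow = ("10.0.0.1", "10.0.0.2", 6, 5000, 443)
p1 = b.pick_path(flow, 0.0)
p2 = b.pick_path(flow, 10.0)     # 10 us gap: same flowlet, same path
assert p1 == p2
p3 = b.pick_path(flow, 1000.0)   # 990 us gap: new flowlet, may re-pick
assert 0 <= p3 < 4
```

The key invariant is that a path change only ever happens across a gap longer than the in-fabric reordering delay, so packets of the elephant flow still arrive in order even though its flowlets spread across paths.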
ECMP and LAG Interaction
Link Aggregation Groups (LAGs / port channels / bonded interfaces) and ECMP both distribute traffic across multiple links, but they operate at different layers. LAG bundles multiple physical links into a single logical interface at Layer 2, presenting one interface to the routing protocol. ECMP distributes across multiple next-hops (which may themselves be LAG interfaces) at Layer 3.
In practice, these two mechanisms are often nested: a router may have ECMP across four next-hops, each reachable via a 2-member LAG. Traffic is first distributed across the four ECMP next-hops, then within each next-hop's LAG across the two physical links. This two-level hashing can cause polarization if both levels use the same hash function — another instance of the polarization problem discussed above.
Modern data center designs favor ECMP over LAG wherever possible. Instead of bonding two 100G links into a 200G LAG between a leaf and spine, the design uses two separate L3 interfaces, each running its own BGP session. This gives the routing protocol visibility into individual link failures and avoids LAG-specific issues like LACP timeouts and minimum-links thresholds. The trend in data center design is to eliminate LAG entirely and rely exclusively on L3 ECMP.
Weighted ECMP and Traffic Engineering
Standard ECMP treats all paths as equal, but real networks sometimes need finer control. Weighted ECMP (W-ECMP) assigns different weights to next-hops, controlling what fraction of traffic each path receives. This is related to UCMP but specifically refers to the practice of manipulating weights for traffic engineering purposes rather than compensating for unequal link capacities.
In service provider networks, weighted ECMP can steer traffic away from congested links without completely removing them from the ECMP group. For example, a provider with three transit paths might assign weights 40:40:20 to send less traffic over a path that is approaching congestion. This is more graceful than removing the path entirely (which causes hash mod N disruption) or relying on the IGP metric (which offers only binary in-or-out control).
BGP-based traffic engineering solutions like Google's Espresso and Facebook's Edge Fabric use weighted ECMP extensively. They monitor link utilization in real time and adjust BGP path weights to balance traffic across peering links. The weight adjustments happen at the BGP level, and the forwarding hardware implements them through unequal bucket allocation in the ECMP hash table.
ECMP Troubleshooting
Diagnosing ECMP issues requires understanding both the control plane (are equal-cost paths being installed?) and the data plane (is traffic actually being distributed?).
Common ECMP problems and their symptoms:
- Missing ECMP paths — the routing table shows fewer next-hops than expected. Causes: maximum-paths not configured (BGP), asymmetric IGP metrics (OSPF/IS-IS), BGP attributes not matching (MED, origin, AS path), or hardware TCAM limits reached. Check: show ip route <prefix> to see installed next-hops, show ip bgp <prefix> to see all candidate paths and why some were not selected.
- Uneven distribution — interface counters show significantly different traffic volumes across ECMP members. Causes: hash polarization, elephant flows, poor hash function, or asymmetric flow distribution. Check: per-interface byte counters, flow export data (sFlow/NetFlow), hash bucket allocation.
- Connection disruption after link events — TCP connections reset when ECMP members change. Cause: standard (non-resilient) hashing remapping flows. Fix: enable resilient hashing.
- Packet reordering — TCP performance degradation without obvious packet loss. Cause: per-packet load balancing or LAG/ECMP hash changes mid-flow. Check: netstat -s for TCP retransmit and reordering statistics on endpoints.
ECMP in Real Networks
ECMP is invisible in the BGP routing table because it is a forwarding-plane mechanism. You cannot tell from a BGP looking glass whether a particular AS is using ECMP internally. However, you can infer ECMP potential by examining the routing table for prefixes that have multiple paths with equal AS path lengths from the same vantage point.
Large transit networks like Lumen (AS3356), NTT (AS2914), and Hurricane Electric (AS6939) rely heavily on ECMP across their backbone links. Hyperscale networks like Google (AS15169), Meta (AS32934), and Amazon (AS16509) run massive Clos fabrics internally where ECMP is the primary traffic distribution mechanism.
At internet exchange points, a network may receive the same prefix from multiple peers with the same AS path length. With BGP multipath enabled, the router ECMP-distributes traffic to that destination across all those peers. This is common at large IXPs like DE-CIX, AMS-IX, and LINX where hundreds of networks peer, creating many equal-cost alternatives for popular prefixes.
Explore Multi-Path Routing in the Global Table
While ECMP decisions happen inside individual networks and are not directly visible in BGP, you can observe the raw material that enables ECMP: multiple BGP paths with the same AS path length to the same destination. Use the god.ad BGP Looking Glass to look up any prefix and see how many paths exist, their AS path lengths, and the diversity of next-hops available. Prefixes from major anycast deployments and well-connected ASes often show many alternative paths — each one a potential ECMP candidate inside transit networks.