How DNS Load Balancing Works: From Round-Robin to Global Server Load Balancing
DNS load balancing is the practice of distributing client traffic across multiple servers or data centers by returning different IP addresses in DNS responses. Unlike traditional load balancers that sit in the data path and forward every packet, DNS-based load balancing operates at the name resolution layer -- it steers clients before they ever open a TCP connection. At its simplest, DNS load balancing is a round-robin rotation of A/AAAA records. At its most sophisticated, it becomes Global Server Load Balancing (GSLB): a system that considers client geography, server health, real-time load, and network conditions to direct each client to the optimal endpoint. Every major internet service -- from Cloudflare (AS13335) to Google (AS15169) to AWS (AS16509) -- uses DNS-based traffic steering as a critical component of its global infrastructure.
DNS Round-Robin: The Simplest Form
The most basic DNS load balancing technique is round-robin: configure multiple A or AAAA records for the same domain name, and the authoritative DNS server rotates the order of records in each response. When a client receives multiple addresses, most DNS resolvers and operating systems use the first address in the list, so rotating the order distributes connections across servers.
; Simple DNS round-robin
example.com. 300 IN A 198.51.100.10
example.com. 300 IN A 198.51.100.11
example.com. 300 IN A 198.51.100.12
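The rotation logic on the authoritative side can be sketched in a few lines. This is an illustrative model, not any DNS server's actual implementation; the addresses are taken from the zone snippet above:

```python
# Record set for example.com, matching the zone snippet above.
RECORDS = ["198.51.100.10", "198.51.100.11", "198.51.100.12"]

def round_robin_responses(records, n_queries):
    """Yield the record list in a different rotation for each query,
    mimicking an authoritative server that rotates answer order."""
    for i in range(n_queries):
        offset = i % len(records)
        yield records[offset:] + records[:offset]

# Each successive response leads with a different address, so clients
# that simply use the first answer are spread across all three servers.
for answer in round_robin_responses(RECORDS, 3):
    print(answer[0])
```

Because most clients connect to the first address in the answer, rotating the order is enough to spread fresh resolutions across the pool -- but, as the limitations below show, only statistically.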
DNS round-robin has severe limitations that make it unsuitable as a primary load balancing mechanism for production services:
- No health awareness -- If server 198.51.100.11 crashes, DNS continues returning its address. Clients that receive the dead server's IP will experience connection failures until the record is manually removed or the TTL expires.
- Uneven distribution -- DNS resolvers cache responses, and a single recursive resolver may serve thousands or millions of users. All users behind that resolver get the same cached answer, concentrating traffic on one server.
- No load awareness -- DNS round-robin distributes names, not connections or requests. One server may be overloaded while others are idle, and the DNS system has no mechanism to detect or correct this.
- TTL-dependent failover -- Even if you remove a dead server's record, cached entries persist until the TTL expires. With a 300-second TTL, clients may continue sending traffic to a dead server for up to 5 minutes.
Despite these limitations, DNS round-robin is still used as a coarse distribution mechanism, often in combination with other techniques. Many CDNs and cloud providers use DNS round-robin to distribute traffic across multiple load balancer VIPs, each of which then performs proper health-checked load balancing.
TTLs and the DNS Caching Problem
Every DNS record has a Time-to-Live (TTL) value that tells recursive resolvers how long to cache the response. TTL is the fundamental constraint on DNS-based load balancing and failover speed. The engineering tradeoff is straightforward:
- Short TTLs (10-60 seconds) -- Enable fast failover and frequent re-steering, but increase DNS query volume to your authoritative servers. At scale, this means millions of additional queries per minute. Short TTLs also increase latency for clients: every TTL expiration requires a new DNS lookup before the client can connect.
- Long TTLs (300-3600 seconds) -- Reduce DNS query load and improve client performance (fewer lookups), but make failover slow. A 1-hour TTL means clients may hit a dead server for up to an hour after failure.
In practice, many resolvers and client libraries do not strictly honor TTLs. Some resolvers impose minimum TTLs (e.g., 30 seconds) regardless of what the authoritative server specifies. Some clients (particularly Java's InetAddress cache) cache DNS responses indefinitely unless explicitly configured otherwise. Browser DNS caches, operating system caches, and corporate DNS appliances add additional caching layers, each with their own TTL enforcement quirks. This means that even a 30-second TTL does not guarantee failover within 30 seconds -- the actual time depends on the caching behavior of every layer between the client and the authoritative server.
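The TTL-clamping behavior described above can be modeled with a toy resolver cache. The class and its 30-second floor are illustrative assumptions, not any specific resolver's implementation:

```python
import time

class ResolverCache:
    """Toy recursive-resolver cache illustrating why a short authoritative
    TTL does not guarantee fast failover: this resolver clamps TTLs to a
    floor, as some real resolvers do."""
    MIN_TTL = 30  # assumed resolver-imposed floor, in seconds

    def __init__(self):
        self._cache = {}  # name -> (address, expiry)

    def store(self, name, address, authoritative_ttl, now=None):
        now = time.time() if now is None else now
        effective_ttl = max(authoritative_ttl, self.MIN_TTL)
        self._cache[name] = (address, now + effective_ttl)

    def lookup(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(name)
        if entry and entry[1] > now:
            return entry[0]
        return None  # expired or absent: re-query the authoritative server

cache = ResolverCache()
# The authoritative server asks for a 10-second TTL, hoping for fast failover...
cache.store("example.com", "198.51.100.10", authoritative_ttl=10, now=0)
# ...but the clamped 30-second TTL is still serving the old address at t=20,
# so a failover at t=10 would not become visible until t=30.
print(cache.lookup("example.com", now=20))
```

Stack several such layers (OS cache, browser cache, corporate appliance) and the effective failover time becomes the sum of their worst-case clamping behaviors.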
RFC 8767 (Serving Stale Data to Improve DNS Resiliency) explicitly encourages resolvers to serve expired cached records when the authoritative server is unreachable. This is good for resilience but means DNS changes may propagate even more slowly than the TTL suggests.
GeoDNS: Location-Aware Resolution
GeoDNS extends DNS load balancing with geographic intelligence. The authoritative DNS server determines the client's approximate location and returns the IP address of the nearest (or otherwise optimal) server or data center. This is the foundation of GSLB for most internet services.
How GeoDNS Determines Location
The DNS protocol itself does not carry the client's IP address -- the authoritative server only sees the IP address of the recursive resolver that forwarded the query. For large public resolvers like Google Public DNS (8.8.8.8) or Cloudflare DNS (1.1.1.1), the resolver's IP may be thousands of miles from the actual client. This is the fundamental problem that EDNS Client Subnet (ECS) solves.
EDNS Client Subnet (RFC 7871)
ECS is an extension to the DNS protocol that allows recursive resolvers to forward a truncated version of the client's IP address (typically /24 for IPv4, /56 for IPv6) to the authoritative server. The authoritative server can then use this subnet to determine the client's geographic location and return a geographically appropriate response.
ECS introduces a privacy tradeoff: the client's subnet is revealed to the authoritative server and any intermediate resolvers. This has led to debate in the DNS community. Some privacy-focused resolvers (notably Quad9 at 9.9.9.9) do not send ECS by default. The scope prefix length (how much of the client's address is revealed) is negotiable -- sending /24 reveals the client's /24 subnet (256 addresses), while a shorter prefix such as /20 reveals a larger, less precise block (4,096 addresses). The authoritative server responds with a "scope" indicating how location-specific its answer is, which determines how the resolver caches the response.
ECS also complicates resolver caching. Without ECS, a resolver caches one answer per domain name. With ECS, the resolver must cache different answers for different client subnets -- a single popular domain can generate thousands of cache entries, one per unique client subnet prefix. This dramatically increases resolver memory usage and is one reason not all resolvers support ECS.
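The ECS caching behavior can be sketched as a cache key that incorporates the truncated client subnet. This is an illustrative model of the RFC 7871 scope mechanism, not a resolver's real data structure:

```python
import ipaddress

def ecs_cache_key(qname, client_ip, scope_prefix_len):
    """Build a cache key that includes the client subnet truncated to the
    scope prefix length returned by the authoritative server (RFC 7871).
    A scope of 0 means the answer is location-independent and can be
    shared by all clients."""
    if scope_prefix_len == 0:
        return (qname, None)
    network = ipaddress.ip_network(
        f"{client_ip}/{scope_prefix_len}", strict=False)
    return (qname, str(network))

# Two clients in the same /24 share one cache entry...
key_a = ecs_cache_key("example.com", "203.0.113.7", 24)
key_b = ecs_cache_key("example.com", "203.0.113.200", 24)
# ...while a client in a different /24 forces a separate entry.
key_c = ecs_cache_key("example.com", "198.51.100.5", 24)
print(key_a == key_b, key_a == key_c)
```

With a /24 scope, a popular domain needs one cache entry per client /24 seen by the resolver, which is the memory blowup described above.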
Weighted DNS Records
Beyond simple round-robin, many DNS providers support weighted records that control the proportion of traffic each endpoint receives. AWS Route 53, Google Cloud DNS, and Cloudflare all support this natively:
; Weighted DNS records (illustrative syntax -- providers such as Route 53
; configure weights via their API or console, not in zone files)
; 70% of resolutions return the primary
example.com. 60 IN A 198.51.100.10 ; weight: 70
; 30% of resolutions return the secondary
example.com. 60 IN A 198.51.100.20 ; weight: 30
Weighted records are useful for gradual traffic migration (shift 10% of traffic to a new data center, monitor, increase), A/B testing at the DNS level, and capacity-proportional distribution across data centers with different sizes. Combined with health checks, weights can be dynamically adjusted: when a data center becomes unhealthy, its weight drops to zero and all traffic is steered to remaining healthy endpoints.
The granularity of weighted DNS is limited by caching. With a 60-second TTL and weighted records, the weight ratio is approximately achieved over many resolver cache misses, but any individual resolver caches a single answer for the TTL duration. This means weighted DNS provides statistical distribution over time, not per-request precision.
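The per-resolution selection behind weighted records can be sketched with a weighted random choice. The function and weights are illustrative (they mirror the 70/30 snippet above), not any provider's implementation:

```python
import random

# Hypothetical weighted record set mirroring the zone snippet above.
WEIGHTED_RECORDS = [("198.51.100.10", 70), ("198.51.100.20", 30)]

def weighted_answer(records, rng=random):
    """Pick one address with probability proportional to its weight,
    as a weighted-DNS authoritative server might on each cache miss."""
    addresses = [addr for addr, _ in records]
    weights = [w for _, w in records]
    return rng.choices(addresses, weights=weights, k=1)[0]

# Over many independent resolutions the split converges on 70/30; any
# single cached answer is all-or-nothing for that resolver's TTL window.
counts = {"198.51.100.10": 0, "198.51.100.20": 0}
for _ in range(10_000):
    counts[weighted_answer(WEIGHTED_RECORDS)] += 1
print(counts)
```

Note that each resolver caches just one of these answers for the TTL duration, which is exactly the statistical-not-per-request caveat described above.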
Health-Check-Driven DNS Failover
The critical improvement of GSLB over basic DNS round-robin is active health checking. A GSLB system continuously monitors the health of each endpoint and removes unhealthy endpoints from DNS responses.
Health check architectures vary by provider:
- AWS Route 53 -- Health checkers run from multiple AWS regions and probe endpoints via HTTP, HTTPS, or TCP. A configurable failure threshold (e.g., 3 consecutive failures from 3+ regions) triggers failover. Route 53 supports calculated health checks that aggregate multiple child checks with AND/OR logic.
- Cloudflare Load Balancing -- Health checks probe each origin pool from multiple Cloudflare data centers. If a pool fails, traffic is steered to the next pool in the failover order. Cloudflare's advantage is the speed of propagation: since Cloudflare is also the authoritative DNS server and the CDN edge, failover does not depend on TTL expiration -- Cloudflare can immediately stop returning unhealthy IPs.
- Google Cloud DNS -- Routing policies support health-checked endpoints with automatic failover to backup record sets. Health checks integrate with Google's global health checking infrastructure (the same system used for Cloud Load Balancing).
The fundamental limitation of DNS-based failover remains TTL caching. When the authoritative server removes an unhealthy IP from responses, clients that have already cached the unhealthy IP continue using it until the TTL expires. This is why DNS-based failover is typically combined with other mechanisms -- anycast withdrawal via BGP for L3 failover, or application-layer retries that try alternative addresses from the DNS response.
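The core health-check loop common to these providers can be sketched as follows. The threshold, field names, and probe function are illustrative assumptions, not any vendor's API:

```python
def healthy_answers(endpoints, probe, failure_threshold=3):
    """Return the addresses whose consecutive-failure count is below the
    threshold, re-probing each endpoint once. `probe` is a caller-supplied
    function (illustrative) returning True when the endpoint responds."""
    answers = []
    for ep in endpoints:
        if probe(ep["address"]):
            ep["failures"] = 0
        else:
            ep["failures"] += 1
        if ep["failures"] < failure_threshold:
            answers.append(ep["address"])
    return answers

# Simulated pool: .11 has already failed twice; one more failed probe
# crosses the threshold, so it is withdrawn from DNS responses.
pool = [
    {"address": "198.51.100.10", "failures": 0},
    {"address": "198.51.100.11", "failures": 2},
]
down = {"198.51.100.11"}
print(healthy_answers(pool, probe=lambda addr: addr not in down))
```

Real systems run probes from many vantage points and aggregate the results (Route 53's multi-region quorum, for example) precisely so that one checker's network problems do not falsely withdraw a healthy endpoint.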
GSLB Architecture Patterns
Global Server Load Balancing combines GeoDNS, health checking, and traffic policies into a system that optimizes global traffic distribution. Several architectural patterns are common:
Active-Active with Geographic Steering
Multiple data centers actively serve traffic, with GeoDNS directing clients to the nearest one. Health checks continuously monitor each site, and failing sites are removed from DNS. This is the most common GSLB pattern and provides both performance optimization (low latency via geographic proximity) and high availability (automatic failover to remaining sites).
The challenge is data consistency: if users in Tokyo are served by the Tokyo data center and users in London by the London data center, application state must be replicated or partitioned across sites. Databases, session stores, and caches all need multi-region strategies.
Active-Passive with DNS Failover
One data center is designated as primary and serves all traffic under normal conditions. A secondary site is kept in standby. Health checks monitor the primary, and if it fails, DNS records are updated to point to the secondary. This pattern is simpler operationally (no multi-region data replication needed during normal operation) but wastes the secondary site's capacity during normal conditions and has slower failover due to DNS TTL propagation.
Latency-Based Routing
Instead of routing based on geographic proximity, latency-based routing measures actual network latency from each DNS resolver location to each data center and returns the lowest-latency endpoint. AWS Route 53's latency-based routing uses this approach, maintaining a global latency database built from periodic measurements.
Latency-based routing can produce counter-intuitive results. A data center that is geographically farther may have lower latency due to better peering, dedicated fiber, or less congested paths. For example, a user in Mumbai might get lower latency to a Singapore data center than to a closer one in Chennai, depending on the BGP routing and submarine cable topology between those points.
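The selection step reduces to a minimum over a measured latency table. The table below is invented to illustrate the Mumbai example, not real measurement data:

```python
# Hypothetical latency table (ms) from resolver locations to data centers,
# illustrating that geographic proximity and network latency can diverge.
LATENCY_MS = {
    "mumbai": {"chennai": 85, "singapore": 62},
    "london": {"chennai": 160, "singapore": 175},
}

def latency_best_endpoint(resolver_location, table):
    """Return the data center with the lowest measured latency for the
    querying resolver's location, as latency-based routing would."""
    measurements = table[resolver_location]
    return min(measurements, key=measurements.get)

# Mumbai is geographically closer to Chennai, but the measured path to
# Singapore is faster, so latency-based routing picks Singapore.
print(latency_best_endpoint("mumbai", LATENCY_MS))
```

The engineering work in a real system is not this lookup but keeping the latency table fresh and representative across millions of resolver locations.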
Anycast DNS: Eliminating the TTL Problem
Anycast DNS takes a fundamentally different approach to DNS-based traffic distribution. Instead of returning different IP addresses to steer clients, anycast announces the same IP address from multiple locations via BGP. Internet routing automatically directs each client's DNS query to the topologically nearest anycast instance.
Anycast DNS is the dominant architecture for authoritative DNS services. Cloudflare (AS13335) operates its 1.1.1.1 resolver and all authoritative DNS from anycast nodes in 300+ cities. The root DNS servers (a.root-servers.net through m.root-servers.net) are predominantly anycast -- the "J-root" operated by Verisign has over 200 anycast instances globally. When one instance fails, BGP withdraws its route and traffic seamlessly shifts to the next-nearest instance without any DNS record changes or TTL dependencies.
Anycast for DNS works particularly well because DNS queries are typically single UDP packets -- there is no persistent connection state to break during re-routing. For TCP-based services, anycast failover is more complex because existing TCP connections may be disrupted when routing changes shift traffic to a different instance. This is why anycast is universal for DNS but less common for stateful services that require long-lived connections.
Combining DNS Load Balancing with Other Techniques
In practice, DNS load balancing is rarely used in isolation. It is one layer in a multi-tier traffic management architecture:
- DNS GSLB -- GeoDNS or anycast directs clients to the nearest data center or edge location. This is the coarsest level of traffic steering, operating at the data center or region granularity.
- Anycast + ECMP -- Within a data center, the service IP is announced via BGP from multiple load balancer instances. ECMP at the top-of-rack switch distributes flows across load balancer instances.
- L4/L7 load balancing -- Each load balancer instance (HAProxy, NGINX, Envoy) distributes requests across application server pools with health checking, session persistence, and content-based routing.
This layered approach provides defense in depth: DNS GSLB handles site-level failures, BGP/ECMP handles load balancer failures, and L4/L7 load balancing handles individual server failures. Each layer has its own health checking and failover mechanism, and the combination provides end-to-end resilience.
DNS Load Balancing for Multi-CDN
Large content publishers often use multiple CDN providers simultaneously (multi-CDN) and steer traffic between them using DNS. A DNS-based traffic manager like Citrix Intelligent Traffic Management, NS1, or Cloudflare Load Balancing sits as the authoritative DNS for the content domain and directs each client to the CDN that offers the best performance for that client's location.
Multi-CDN DNS steering considers: real-user performance measurements (RUM beacons from JavaScript embedded in pages), synthetic monitoring from global probes, CDN-reported availability and capacity, cost (steering traffic to the cheapest CDN that meets performance requirements), and contractual commit levels (ensuring minimum traffic volumes to each CDN to meet contract terms).
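A simplified steering policy combining two of these inputs (performance and cost) might look like the sketch below. The field names, budget, and numbers are illustrative, not any traffic manager's configuration schema:

```python
def pick_cdn(candidates, latency_budget_ms):
    """Choose the cheapest CDN whose measured performance meets the
    latency budget; fall back to the fastest CDN if none qualifies."""
    meeting_budget = [c for c in candidates if c["p95_ms"] <= latency_budget_ms]
    if meeting_budget:
        return min(meeting_budget, key=lambda c: c["cost_per_gb"])
    return min(candidates, key=lambda c: c["p95_ms"])

# Hypothetical per-client-region measurements for three CDNs.
cdns = [
    {"name": "cdn-a", "p95_ms": 80, "cost_per_gb": 0.020},
    {"name": "cdn-b", "p95_ms": 45, "cost_per_gb": 0.035},
    {"name": "cdn-c", "p95_ms": 70, "cost_per_gb": 0.018},
]
# cdn-a and cdn-c both meet a 100 ms budget; cdn-c is cheaper.
print(pick_cdn(cdns, latency_budget_ms=100)["name"])
```

Production systems layer the remaining inputs (RUM data freshness, CDN-reported capacity, contractual commit levels) onto this same decision, typically re-evaluated per client region.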
This is one area where DNS-based load balancing offers capabilities that are difficult to replicate at other layers. Because the DNS decision happens before the client connects, you can steer an entire session to a specific CDN -- something a packet-level load balancer cannot do, since by the time packets arrive the client has already connected to one CDN's address.
Challenges and Failure Modes
DNS load balancing has several well-known failure modes that network engineers must account for:
- Resolver caching beyond TTL -- Some resolvers and client libraries cache DNS responses longer than the TTL specifies. Java's default DNS caching is infinite for successful lookups unless networkaddress.cache.ttl is explicitly set. This can cause clients to use stale addresses long after a failover event.
- Negative caching (RFC 2308) -- If a DNS query returns NXDOMAIN or SERVFAIL, the negative response is cached. If your GSLB system temporarily returns errors during a configuration change, resolvers may cache the error response and refuse to re-query for the negative TTL period.
- Recursive resolver centralization -- A large fraction of internet DNS traffic flows through a small number of public resolvers (Google 8.8.8.8, Cloudflare 1.1.1.1, Quad9). When these resolvers cache an answer, it affects millions of users simultaneously. A misconfigured GSLB response cached by Google's resolver can steer a continent of traffic to the wrong location.
- EDNS Client Subnet inconsistency -- Not all resolvers support ECS. For resolvers that do not send ECS, GeoDNS falls back to resolver IP geolocation, which may be wildly inaccurate for large public resolvers. This creates a two-tier accuracy problem where some clients get precise geographic steering and others get approximate steering based on resolver location.
- DNS propagation during planned failover -- When you deliberately change DNS records (e.g., for maintenance), the old records remain cached in resolvers worldwide. Even with a 60-second TTL, full propagation takes 2-5 minutes due to resolver caching hierarchy. During this window, some clients hit the old servers and others hit the new ones.
Real-World DNS Load Balancing Implementations
Major services use DNS load balancing in distinct ways:
- Cloudflare (AS13335) -- Anycast DNS across 300+ cities for its CDN and security services. The authoritative DNS is the first tier of traffic steering; Cloudflare's network then routes requests to the nearest healthy origin via its internal backbone.
- AWS (AS16509) -- Route 53 provides latency-based routing, GeoDNS, weighted records, and health-checked failover. Route 53 itself is an anycast DNS service running from 200+ edge locations.
- Akamai (AS20940) -- One of the original GSLB implementations. Akamai's DNS infrastructure maps each client resolver to the optimal Akamai edge server based on real-time network conditions, server load, and content availability.
- Google (AS15169) -- Uses anycast DNS for its public services (8.8.8.8, google.com) and internal GSLB for distributing traffic across its global data center fleet.
DNS Load Balancing and BGP
BGP and DNS-based load balancing are the two primary mechanisms for global traffic distribution, and they serve complementary roles. BGP operates at the network layer, directing IP packets to the topologically closest announcement point. DNS operates at the application layer, directing clients to the IP address that the GSLB system deems optimal.
In sophisticated deployments, the two systems are tightly integrated. A GSLB controller monitors BGP routing tables to understand network topology and uses this information to make better DNS steering decisions. Conversely, when a BGP route is withdrawn (due to a site failure or maintenance), the GSLB system detects the loss of reachability and stops returning that site's IPs in DNS responses. This bidirectional integration ensures that DNS steering decisions are consistent with actual network reachability.
You can observe the interaction between DNS and BGP by looking up any major service on the god.ad BGP Looking Glass. The BGP routes show you which networks announce each service's IP prefixes, and the AS paths reveal the routing topology that determines where DNS-steered traffic actually flows.