How Load Balancers Work: L4 vs L7 Load Balancing
A load balancer distributes incoming network traffic across multiple backend servers so that no single server bears too much demand. Load balancers are what make it possible for sites like Google, Netflix, and Amazon to handle billions of requests per day. Without them, a single server failure would take down an entire service, and traffic spikes would overwhelm individual machines.
But not all load balancers are the same. The most fundamental distinction in load balancing is between Layer 4 (L4) and Layer 7 (L7) operation, referring to layers of the OSI model. An L4 load balancer makes routing decisions based on transport-layer information (IP addresses and TCP/UDP ports), while an L7 load balancer inspects application-layer data (HTTP headers, URLs, cookies) to make smarter decisions. Understanding this difference, along with the algorithms, architectures, and trade-offs involved, is essential for designing scalable infrastructure.
The OSI Model: Why "Layer 4" and "Layer 7"?
The OSI (Open Systems Interconnection) model is a conceptual framework that describes network communication in seven layers. Two of these layers are central to load balancing:
- Layer 4 — Transport Layer: This layer handles end-to-end communication. The key protocols here are TCP and UDP. L4 deals with source and destination IP addresses, port numbers, and connection state. An L4 load balancer sees packets and flows, but does not understand the application payload.
- Layer 7 — Application Layer: This is the layer where HTTP, HTTPS, gRPC, WebSocket, and other application protocols live. An L7 load balancer parses the full application request, inspecting things like the URL path, Host header, cookies, and request body. This lets it make content-aware routing decisions.
The distinction matters because it determines what information the load balancer can act on, how much processing overhead it introduces, and what kinds of routing policies it can enforce.
Layer 4 Load Balancing
An L4 load balancer operates at the transport layer. It receives incoming TCP or UDP connections and forwards them to a backend server based on limited information: the source IP, source port, destination IP, and destination port. It does not inspect the contents of the packets beyond this 4-tuple.
How L4 Load Balancing Works
When a client opens a TCP connection to the load balancer's virtual IP (VIP), the load balancer selects a backend server and either rewrites the packet's destination IP (NAT mode) or encapsulates it (tunneling/DSR mode). All subsequent packets in that TCP flow go to the same backend. The load balancer maintains a connection tracking table that maps each client flow to its assigned backend.
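The connection-tracking table described above can be sketched in a few lines. This is a minimal illustration, not any particular implementation: the backend IPs and the round-robin policy for new flows are placeholders.

```python
# Minimal sketch of an L4 load balancer's connection-tracking table.
# Backend addresses and the new-flow policy are illustrative.

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Maps each client flow (4-tuple) to its assigned backend so that
# every subsequent packet in the flow reaches the same server.
conn_table: dict[tuple, str] = {}
_next = 0  # simple round-robin cursor for assigning new flows


def pick_backend() -> str:
    global _next
    backend = BACKENDS[_next % len(BACKENDS)]
    _next += 1
    return backend


def route_packet(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """Return the backend for this packet, assigning one on a flow's first packet."""
    flow = (src_ip, src_port, dst_ip, dst_port)
    if flow not in conn_table:
        conn_table[flow] = pick_backend()  # new flow: choose and remember a backend
    return conn_table[flow]
```

Every packet carrying the same 4-tuple hits the same entry in the table, which is exactly why an L4 balancer needs no knowledge of the application payload.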
Because L4 load balancers do not need to parse HTTP or terminate TLS, they can process packets at extremely high throughput with minimal latency. Modern L4 load balancers implemented in the kernel (using technologies like IPVS, eBPF/XDP, or DPDK) can handle tens of millions of connections per second on commodity hardware.
Advantages of L4
- Speed — Minimal per-packet processing; operates at near line rate
- Protocol agnostic — Works with any TCP or UDP protocol, not just HTTP
- Simplicity — Less state to manage, fewer things to go wrong
- TLS passthrough — The load balancer never sees the encrypted payload, which is ideal when backends must handle their own TLS
Limitations of L4
- No content awareness — Cannot route based on URL, header, or cookie
- No connection multiplexing — Each client connection maps to exactly one backend connection
- Coarse health checks — Can only verify that a TCP port is open, not that the application is functioning correctly
- No request-level metrics — Cannot measure HTTP response codes, latency per endpoint, or request rates
Layer 7 Load Balancing
An L7 load balancer terminates the client's TCP (and usually TLS) connection, parses the application-layer protocol, and then opens a new connection to a backend server. Because it fully understands the application protocol, it can make decisions that are impossible at L4.
How L7 Load Balancing Works
The client connects to the load balancer, which accepts the TCP connection and performs the TLS handshake (if HTTPS). The load balancer then reads the full HTTP request — method, URL, headers, and potentially the body. Based on configured rules, it selects a backend and proxies the request over a separate connection. The response flows back through the load balancer to the client.
This is a full proxy architecture: there are two distinct TCP connections (client-to-LB and LB-to-backend), and the load balancer is an active participant in both. This gives it enormous power but also introduces additional latency and resource consumption compared to L4.
What L7 Can Do That L4 Cannot
- URL-based routing — Route /api/* requests to API servers and /static/* requests to a CDN origin
- Host-based routing — Route traffic for different domains arriving on the same IP to different backend pools
- Header/cookie inspection — Implement session affinity based on a session cookie, or A/B testing based on a custom header
- Request modification — Add, remove, or rewrite headers before forwarding to the backend
- Connection multiplexing — Maintain persistent connections to backends and multiplex many client requests over fewer backend connections (especially with HTTP/2)
- Compression and caching — Compress responses or serve cached content directly without hitting backends
- Rate limiting and WAF — Enforce per-URL or per-client rate limits, block malicious requests
- Rich health checks — Send HTTP requests to a health endpoint and verify the response code and body
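Host- and URL-based routing from the list above boils down to a rule table matched in order. The hostnames, pool names, and rules below are hypothetical, chosen only to show first-match semantics:

```python
# Hypothetical L7 routing table: match on Host, then on URL path prefix,
# first match wins. All names here are made up for illustration.

ROUTES = [
    # (host, path_prefix, backend_pool)
    ("api.example.com", "/",        "api-pool"),
    ("www.example.com", "/static/", "cdn-origin"),
    ("www.example.com", "/api/",    "api-pool"),
    ("www.example.com", "/",        "web-pool"),
]


def select_pool(host: str, path: str) -> str:
    """Return the first backend pool whose host and path prefix match."""
    for rule_host, prefix, pool in ROUTES:
        if host == rule_host and path.startswith(prefix):
            return pool
    return "default-pool"
```

Note the ordering matters: the catch-all "/" rule for www.example.com must come last, or it would shadow the more specific prefixes.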
L4 vs L7: A Comparison at a Glance
- Routing input — L4: IP addresses and ports (the 4-tuple); L7: URL, Host header, cookies, request body
- TLS — L4: passthrough only; L7: termination, optionally with re-encryption to backends
- Connection model — L4: one client flow maps to one backend flow; L7: full proxy with two connections, enabling multiplexing
- Health checks — L4: TCP handshake only; L7: HTTP status codes and response bodies
- Performance — L4: near line rate with minimal latency; L7: added latency and CPU cost for parsing and proxying
Load Balancing Algorithms
The algorithm a load balancer uses to choose which backend receives each connection (L4) or request (L7) is one of the most critical design decisions. Different algorithms optimize for different goals: fairness, locality, minimal latency, or cache efficiency.
Round Robin
The simplest algorithm. Requests are distributed to backends in sequential, circular order: server 1, server 2, server 3, server 1, server 2, and so on. Round robin assumes all backends are equally capable and all requests are equally expensive. It works well when both of these assumptions hold. It breaks down when backends have different capacities or when request costs vary widely.
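The circular order can be sketched in a few lines; the server names are placeholders:

```python
from itertools import cycle

# Round robin in its simplest form: an endless cycle over the backend list.
backends = ["server1", "server2", "server3"]
rr = cycle(backends)

# Each new request takes the next backend in circular order.
picks = [next(rr) for _ in range(5)]
```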
Weighted Round Robin
An extension of round robin where each backend is assigned a weight. A backend with weight 3 receives three times as many requests as a backend with weight 1. This is useful when backends have different hardware specifications — a server with 64 cores should receive more traffic than one with 8. Cloud providers use weighted round robin during rolling deployments, gradually increasing the weight of new instances as confidence grows.
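One well-known way to realize weights is the "smooth" weighted round robin used by NGINX: each turn, every backend's running score grows by its weight, the highest score wins, and the winner's score is reduced by the total weight. The sketch below is a simplified illustration with made-up server names and weights:

```python
# Smooth weighted round robin (the variant NGINX uses), simplified.
# Weights and names here are illustrative.

def smooth_wrr(weights: dict[str, int], n: int) -> list[str]:
    """Return the first n picks for the given backend weights."""
    current = {server: 0 for server in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for server, weight in weights.items():
            current[server] += weight       # every backend earns its weight
        winner = max(current, key=current.get)
        current[winner] -= total            # winner pays back the total
        picks.append(winner)
    return picks
```

Over any window of (sum of weights) picks, each backend appears exactly weight-many times, and the picks are interleaved rather than bunched, which is the "smooth" part.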
Least Connections
The load balancer tracks how many active connections each backend currently has and sends new connections to the backend with the fewest. This adapts naturally to backends with different processing speeds: a fast server completes connections quickly and therefore has fewer active connections, so it attracts more new ones. Least connections is the default choice for many production deployments because it handles heterogeneous backends and variable request costs better than round robin.
Weighted Least Connections
Combines weights with connection counting. The load balancer selects the backend with the lowest ratio of active connections to weight. A server with weight 4 and 20 active connections (ratio 5.0) would be preferred over a server with weight 2 and 12 connections (ratio 6.0).
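The ratio comparison is a one-liner; the figures below mirror the example above:

```python
# Weighted least connections: pick the backend with the smallest
# active-connections-to-weight ratio. Numbers match the example in the text.

def pick_weighted_least_conn(backends: dict[str, tuple[int, int]]) -> str:
    """backends maps name -> (active_connections, weight)."""
    return min(backends, key=lambda s: backends[s][0] / backends[s][1])


pool = {
    "a": (20, 4),  # ratio 5.0
    "b": (12, 2),  # ratio 6.0
}
```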
Random
Selects a backend at random for each request. Surprisingly effective at scale. The "power of two random choices" variant picks two backends at random and sends the request to the one with fewer connections. This simple heuristic provides near-optimal load distribution and avoids the herd behavior deterministic algorithms can exhibit, where several independent load balancers all steer new traffic to the same least-loaded backend at once.
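Power of two random choices fits in a few lines. The connection counts are illustrative state that a real balancer would track itself:

```python
import random

# "Power of two random choices": sample two distinct backends and take
# the one with fewer active connections. conns is illustrative state.

def p2c(conns: dict[str, int], rng: random.Random) -> str:
    a, b = rng.sample(list(conns), 2)
    return a if conns[a] <= conns[b] else b
```

Because the most-loaded backend loses every pairwise comparison, it naturally receives no new traffic until its load drops, without any global coordination.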
IP Hash
Hashes the client's source IP address to deterministically select a backend. The same client always reaches the same server, providing natural session affinity without cookies. The downside: if a backend goes down, all clients mapped to it are redistributed, and when the backend returns, the hash distribution shifts again.
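A hash-modulo sketch shows both the benefit and the weakness: the mapping is deterministic per client, but it depends on the backend count, so adding or removing a backend reshuffles most clients. MD5 here is just a convenient stable hash, not a recommendation:

```python
import hashlib

# IP hash: a deterministic map from client IP to a backend index.
# Changing len(backends) reshuffles most clients, which is the weakness
# that consistent hashing (next section) addresses.

def ip_hash(client_ip: str, backends: list[str]) -> str:
    digest = hashlib.md5(client_ip.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(backends)
    return backends[index]
```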
Consistent Hashing
A more sophisticated hashing approach that minimizes disruption when backends are added or removed. In consistent hashing, each backend is assigned multiple points on a hash ring. A request key (such as a URL or session ID) is hashed to a point on the ring, and the request goes to the nearest backend point clockwise. When a backend is removed, only the requests mapped to that backend's segment are redistributed; all other mappings remain stable.
Consistent hashing is essential for caching layers, where redistributing keys across all backends would invalidate caches and cause a "thundering herd" of cache misses hitting the origin servers simultaneously. CDNs use consistent hashing extensively to maintain cache locality.
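A minimal ring can be built with a sorted list and binary search. The vnode count and the use of MD5 are illustrative choices; production implementations vary:

```python
import bisect
import hashlib

# Minimal consistent-hash ring. Each backend gets several virtual points
# ("vnodes") on the ring; a key routes to the first point clockwise.

def _h(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, backends: list[str], vnodes: int = 100):
        # Each backend contributes `vnodes` points, kept sorted by hash.
        self._points = sorted(
            (_h(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def lookup(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping at the end.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]
```

The stability property falls out directly: removing a backend deletes only its own points, so every key whose nearest clockwise point belonged to a surviving backend keeps exactly the same mapping.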
Least Response Time
Tracks the average response time from each backend and sends requests to the fastest one. This is an L7-only algorithm (L4 load balancers cannot measure application response times). It works well for HTTP APIs where latency is the primary metric, but it can starve backends whose slowness comes from cold caches, creating a feedback loop where cold backends never warm up because they never receive traffic.
Direct Server Return (DSR)
Direct Server Return is a load balancing architecture where the return traffic from the backend bypasses the load balancer entirely and goes directly to the client. In a normal (non-DSR) setup, all traffic flows through the load balancer in both directions. DSR eliminates the return-path bottleneck.
How DSR Works
In DSR mode, the load balancer receives the incoming packet, rewrites the destination MAC address (L2 DSR) or encapsulates the packet in an IP-in-IP or GRE tunnel (L3 DSR), and forwards it to the backend. The backend must be configured with the VIP on a loopback interface so that it accepts the packet and responds directly to the client using the VIP as the source address. The client does not know the load balancer was involved.
DSR is exclusively an L4 technique. Because the return traffic does not traverse the load balancer, the load balancer never sees the response and cannot inspect or modify application-layer data. But the performance gains are significant: since response bodies are typically much larger than requests (think video streaming, large API responses, file downloads), removing the load balancer from the return path can reduce its bandwidth requirements by 10x or more.
When to Use DSR
- High-bandwidth services — Video streaming, large file transfers, where response traffic dwarfs request traffic
- UDP services — DNS, gaming, VoIP, where there is no persistent connection and each response packet is independent
- Scale constraints — When the load balancer's outbound bandwidth is the bottleneck
Health Checks
A load balancer must know which backends are healthy and able to serve traffic. Health checking is the mechanism for this. There are two fundamental approaches: active and passive.
Active Health Checks
The load balancer periodically sends probe requests to each backend and evaluates the response. Configuration typically includes:
- Interval — How often to probe (e.g., every 5 seconds)
- Timeout — How long to wait for a response before marking the probe as failed
- Unhealthy threshold — How many consecutive failures before marking the backend as down (e.g., 3 failures)
- Healthy threshold — How many consecutive successes before marking a down backend as up again
L4 health checks typically open a TCP connection to the backend port. If the three-way handshake completes, the backend is considered healthy. L7 health checks send an HTTP request (usually GET /health or GET /ready) and verify that the response has the expected status code (200) and optionally the expected body content.
L7 health checks are strictly more powerful: a backend might accept TCP connections but return 500 errors on every request due to a failed database connection. An L4 check would miss this; an L7 check would catch it.
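The threshold logic above is a small state machine: a backend changes state only after N consecutive failures or successes, which prevents flapping on a single bad probe. The threshold values below are the illustrative ones from the list:

```python
# Active health check state machine with consecutive-result thresholds.
# Threshold values are illustrative (3 failures down, 2 successes up).

class HealthState:
    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.up = True
        self.fails = 0
        self.oks = 0
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold

    def record_probe(self, success: bool) -> bool:
        """Record one probe result; return the backend's current state."""
        if success:
            self.fails = 0
            self.oks += 1
            if not self.up and self.oks >= self.healthy_threshold:
                self.up = True
        else:
            self.oks = 0
            self.fails += 1
            if self.up and self.fails >= self.unhealthy_threshold:
                self.up = False
        return self.up
```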
Passive Health Checks
Also called outlier detection. Instead of sending probes, the load balancer monitors real traffic and marks backends as unhealthy if they exceed error thresholds. For example, if a backend returns five consecutive 5xx errors, it is temporarily ejected from the pool. Envoy Proxy popularized this approach, calling it "outlier ejection."
Passive checks are faster to react (they detect failures on real traffic, not on periodic probes) but they require actual traffic to detect problems. A backend that receives no traffic cannot be passively checked. Best practice is to use both active and passive health checks together.
Session Persistence (Sticky Sessions)
By default, most load balancing algorithms distribute requests without regard for previous decisions. This is a problem for stateful applications where a user's session data lives in server memory and the user must always reach the same backend.
Cookie-Based Persistence
The load balancer (L7 only) injects a cookie into the response that identifies the assigned backend. On subsequent requests, the client sends this cookie back, and the load balancer routes to the same backend. If the backend goes down, the cookie becomes invalid and the user is assigned to a new backend (losing session state).
Source IP Persistence
The load balancer (L4 or L7) maps the client's source IP to a backend and maintains this mapping for a configured duration. This breaks when multiple users share the same IP (corporate NATs, CDN edge proxies) because all users behind that IP are forced to the same backend.
The Modern Approach
Modern architectures avoid sticky sessions entirely by externalizing session state to a shared store like Redis or a database. This makes backends truly stateless and interchangeable, enabling any backend to serve any request. This is the pattern used by all major cloud-native applications.
SSL/TLS Termination
TLS termination refers to where in the request path the TLS session is decrypted. There are three common patterns:
- TLS termination at the load balancer (L7) — The load balancer decrypts the TLS connection, inspects the HTTP content for routing, and forwards the request to backends over plaintext HTTP or re-encrypted HTTPS. This is the most common pattern because it allows full L7 routing and offloads CPU-intensive TLS operations from backends.
- TLS passthrough (L4) — The load balancer forwards the encrypted TCP stream to the backend without decrypting it. The backend handles TLS termination. This is required when the backend must see the original TLS certificate or when regulatory requirements prohibit decryption at intermediate points.
- TLS re-encryption (L7) — The load balancer terminates the client's TLS connection, inspects the content, then establishes a new TLS connection to the backend (mTLS). This provides end-to-end encryption while still allowing L7 routing. This is the standard pattern in zero-trust and service mesh architectures.
ECMP: Load Balancing at the Network Layer
Equal-Cost Multi-Path routing (ECMP) is a load balancing mechanism built into network routers and closely tied to BGP. When a router has multiple equally good paths to the same destination (same AS path length, same local preference), it can distribute traffic across all of them instead of picking just one.
How ECMP Works
The router hashes each packet's flow identifier (typically the 5-tuple: source IP, destination IP, source port, destination port, protocol) and uses the hash to select one of the equal-cost paths. All packets in the same flow follow the same path, preventing out-of-order delivery. Different flows are distributed across the available paths.
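The flow-hashing behavior can be modeled simply. Real routers use hardware hash functions rather than SHA-256, but the property is the same, one flow, one path:

```python
import hashlib

# ECMP path selection sketch: hash the flow 5-tuple and take it modulo
# the number of equal-cost next hops. Hash choice is illustrative.

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  proto: str, next_hops: list[str]) -> str:
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}"
    digest = hashlib.sha256(flow.encode()).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]
```

Because the hash covers the source port, two TCP connections from the same client usually take different paths, while every packet within one connection takes the same path and arrives in order.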
ECMP is especially important for anycast deployments. When a prefix like 1.1.1.0/24 is announced from multiple locations, ECMP at each router along the path ensures that traffic is spread across multiple next hops rather than all flowing through a single one.
ECMP and Load Balancers
In large-scale deployments, ECMP is used to scale the load balancers themselves. Multiple load balancer instances are placed behind a router, each announcing the same VIP. The router distributes incoming traffic across the load balancer instances via ECMP. This pattern, sometimes called "anycast load balancing," is how hyperscalers like Google and Facebook distribute traffic at scale. It combines network-layer load balancing (ECMP across LB instances) with application-layer load balancing (L7 within each LB instance).
Hardware vs Software Load Balancers
The load balancing landscape has shifted dramatically from dedicated hardware appliances to software-based solutions. Understanding the history and current state helps explain the design choices behind modern tools.
Hardware Load Balancers
For decades, load balancing was dominated by specialized hardware appliances from vendors like F5 (BIG-IP), Citrix (NetScaler), and A10 Networks. These devices use custom ASICs and FPGAs to process packets at high speed. They are expensive (often $50,000-$500,000+), proprietary, and require specialized skills to configure.
Hardware load balancers still exist in enterprises and service providers that value vendor support contracts and predictable performance. But the trend is decisively toward software.
Software Load Balancers
Modern software load balancers run on commodity servers and are often open source. They have reached performance levels that rival or exceed hardware appliances, especially for L7 workloads. The most important ones:
HAProxy
HAProxy is the gold standard for high-performance L4/L7 load balancing. Written in C, it is single-process, event-driven, and extremely efficient. HAProxy handles millions of concurrent connections and hundreds of thousands of requests per second per core. It has been in production use since 2001 and is used by GitHub, Reddit, Airbnb, and many other large-scale services. Its configuration language is purpose-built for load balancing and offers fine-grained control over every aspect of traffic management.
NGINX
NGINX started as a web server but its reverse proxy and load balancing capabilities have made it one of the most deployed load balancers in the world. NGINX excels at L7 load balancing and is often used as a combined web server, reverse proxy, and load balancer. The open-source version covers most use cases; NGINX Plus adds active health checks, session persistence, and dynamic reconfiguration. NGINX is behind many of the world's busiest websites.
Envoy Proxy
Envoy was built at Lyft and is now a CNCF graduated project. It was designed from the ground up for modern microservice architectures. Key features that distinguish Envoy from older load balancers: hot restart (zero-downtime binary upgrades), a rich xDS API for dynamic configuration (no config file reloads), first-class support for gRPC and HTTP/2, built-in distributed tracing, and advanced outlier detection. Envoy is the default data plane for service mesh platforms like Istio.
Linux IPVS
IPVS (IP Virtual Server) is a transport-layer load balancer built into the Linux kernel. It operates at L4 and can handle extremely high throughput because it processes packets in kernel space without copying them to user space. IPVS supports DSR, NAT, and tunneling modes. It is the load balancing backend for Kubernetes Services (kube-proxy in IPVS mode) and for many large-scale L4 deployments. Because it runs in the kernel, it has fewer features than user-space load balancers but significantly higher raw performance.
Cloud Load Balancers
Every major cloud provider offers managed load balancing services that abstract away the operational complexity:
- AWS — Network Load Balancer (L4, extremely high throughput), Application Load Balancer (L7, HTTP/HTTPS routing), Gateway Load Balancer (for inline security appliances)
- Google Cloud — Cloud Load Balancing offers both L4 and L7 in a single product, with global anycast front-ends that route to the nearest healthy backend using Google's backbone network
- Azure — Azure Load Balancer (L4) and Application Gateway (L7)
- Cloudflare (AS13335) — Uses anycast and its global edge network to provide load balancing at every one of its 300+ PoPs
Cloud load balancers are built on the same fundamental principles as software load balancers but add managed TLS certificate provisioning, auto-scaling, DDoS protection, and integration with the cloud provider's network fabric. Under the hood, Google's load balancer uses Maglev (a consistent-hashing L4 balancer), and AWS NLB uses a similar flow-based design with Hyperplane.
Global Server Load Balancing (GSLB)
All the load balancing discussed so far operates within a single site or region. Global Server Load Balancing (GSLB) distributes traffic across multiple geographic locations — entire data centers or regions rather than individual servers. GSLB is what determines whether your request goes to a data center in Frankfurt or one in Virginia.
DNS-Based GSLB
The most common GSLB mechanism uses DNS to steer traffic. When a client resolves a domain name, the authoritative DNS server returns different IP addresses based on:
- Geographic proximity — The client's resolver IP is geolocated and the nearest data center's IP is returned
- Health status — Unhealthy data centers are removed from DNS responses
- Load — Overloaded data centers receive fewer DNS mappings
- Latency — Some GSLB systems actively measure latency from each DNS resolver to each data center
DNS-based GSLB has a limitation: DNS responses are cached by resolvers and clients, so changes take time to propagate (bounded by the TTL). This means failover is not instant — it typically takes 30 seconds to several minutes.
Anycast-Based GSLB
Anycast provides an alternative to DNS-based GSLB. By announcing the same IP prefix from multiple locations via BGP, the network routing system itself directs each client to the nearest instance. Anycast failover is driven by BGP convergence, which typically takes seconds rather than minutes. This is why Cloudflare and Google use anycast for their critical services — it provides faster, more reliable GSLB than DNS alone.
The trade-off is that anycast routing is based on BGP path selection, which optimizes for network topology rather than user-perceived latency. A data center that is topologically close (few AS hops) may not always be the lowest-latency option. Sophisticated operators combine anycast with performance-based DNS steering to get the best of both worlds.
Load Balancing and BGP
Load balancing and BGP are deeply interconnected. BGP is itself a form of traffic engineering: by controlling which prefixes are announced from which locations, and by using techniques like AS path prepending and selective announcements, network operators can influence how traffic is distributed across their infrastructure.
Large-scale deployments frequently use BGP as the glue between layers:
- ECMP via BGP — Load balancers announce a VIP to the upstream router via BGP. The router sees multiple equal-cost paths and distributes flows via ECMP.
- Anycast load balancing — The same VIP is announced from multiple sites via BGP. Anycast routing steers traffic to the nearest site, where local load balancers distribute it to backends.
- Health-aware BGP — When all backends behind a load balancer go down, the load balancer withdraws its BGP announcement. The upstream router removes that path from its ECMP set, and traffic shifts to remaining healthy instances. This provides automatic failover at the network layer.
- Traffic engineering with communities — BGP communities can signal traffic preferences to upstream autonomous systems, influencing inbound traffic distribution across peering points and transit links.
You can observe these patterns in production by looking up the prefixes of major services. Look up 1.1.1.1 (Cloudflare DNS) and examine the routes: the prefix 1.1.1.0/24 is announced from hundreds of locations, each serving as a load-balanced entry point.
Putting It All Together
A production-grade deployment at scale typically layers multiple load balancing techniques:
- GSLB (DNS + anycast) steers users to the nearest data center
- ECMP at the border router distributes flows across multiple load balancer instances
- L4 load balancing (IPVS or equivalent) terminates the flow and picks a backend pool
- L7 load balancing (Envoy, NGINX, or HAProxy) parses HTTP and routes to the correct service
- Service mesh sidecar (Envoy again) handles inter-service load balancing, retries, and circuit breaking
Each layer adds a capability that the layer below cannot provide, and each layer has its own health checking, failover, and traffic distribution logic. The result is a system where any individual component — a server, a load balancer, an entire data center — can fail without users noticing.
Explore Network Infrastructure
Load balancers sit at the heart of internet infrastructure, working alongside BGP routing, anycast, and CDNs to deliver fast, reliable services. You can explore the routing infrastructure behind major load-balanced services using the looking glass:
- 1.1.1.1 — Cloudflare's globally anycast-load-balanced DNS
- 8.8.8.8 — Google's load-balanced DNS infrastructure
- AS13335 — Cloudflare's network, one of the most extensively load-balanced on the internet
- AS16509 — Amazon Web Services, powering ALB, NLB, and CloudFront
- AS15169 — Google's network, home of Maglev and global load balancing
Look up any IP address or ASN to see the BGP routes, AS paths, and network topology behind the load-balanced services you use every day.