How Load Balancers Work: L4 vs L7 Load Balancing
A load balancer is a network device or software component that distributes incoming traffic across multiple backend servers to improve availability, throughput, and fault tolerance. Load balancers operate at different layers of the OSI model -- Layer 4 (transport) balancers route TCP/UDP connections based on IP addresses and ports, while Layer 7 (application) balancers inspect HTTP headers, URLs, cookies, and payload content to make intelligent routing decisions. Every major internet service -- from Google (AS15169) to Meta (AS32934) to AWS (AS16509) -- relies on multiple tiers of load balancing to serve billions of requests per day. Understanding how load balancers work at the packet, connection, and application level is essential for anyone designing systems that must handle real-world traffic.
Layer 4 vs Layer 7: Two Fundamentally Different Approaches
The distinction between L4 and L7 load balancing is not merely about which OSI layer the balancer inspects -- it determines the entire forwarding architecture, performance envelope, and feature set available.
Layer 4 (Transport) Load Balancing
An L4 load balancer makes routing decisions based on information available in the TCP or UDP header: source IP, source port, destination IP, and destination port. It does not inspect the application-layer payload. When a client initiates a TCP connection, the L4 balancer selects a backend server and forwards all packets belonging to that connection to the same server for the connection's entire lifetime.
L4 balancers typically operate in one of two modes: NAT mode, where the balancer rewrites the destination IP address in each packet to the chosen backend server (and rewrites the source IP on return packets), or Direct Server Return (DSR), where the balancer forwards every client-to-server packet to the chosen backend without rewriting the destination IP, and return traffic flows directly from the backend to the client, bypassing the balancer entirely. DSR is critical for bandwidth-asymmetric workloads like video streaming, where the response is orders of magnitude larger than the request.
Because L4 balancers do not parse application protocols, they are extremely fast. Linux's IPVS (IP Virtual Server) module can handle millions of connections per second with microsecond-level latency overhead. The tradeoff is limited intelligence: an L4 balancer cannot route based on HTTP path, cannot insert headers, cannot terminate TLS, and cannot implement cookie-based session persistence. Every packet in a flow goes to the same backend, even if that backend is the wrong one for the specific request.
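The per-connection pinning described above can be sketched as a small flow table keyed on the 5-tuple. This is a minimal illustration, not how IPVS is implemented: the class name L4Balancer, the use of SHA-256 for the initial choice, and the example addresses are all assumptions made for the sketch.

```python
import hashlib
from typing import Dict, List, Tuple

# (src_ip, src_port, dst_ip, dst_port, protocol): all an L4 balancer inspects.
FiveTuple = Tuple[str, int, str, int, str]


class L4Balancer:
    """Toy flow table: pick a backend on the first packet, then stick to it."""

    def __init__(self, backends: List[str]) -> None:
        self.backends = backends
        self.flows: Dict[FiveTuple, str] = {}

    def _pick(self, flow: FiveTuple) -> str:
        # Hash the 5-tuple so the initial choice is deterministic per flow.
        digest = hashlib.sha256(repr(flow).encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.backends)
        return self.backends[index]

    def forward(self, flow: FiveTuple) -> str:
        # Later packets of a known flow reuse the recorded backend.
        if flow not in self.flows:
            self.flows[flow] = self._pick(flow)
        return self.flows[flow]

    def close(self, flow: FiveTuple) -> None:
        # A real balancer would do this on FIN/RST or an idle timeout.
        self.flows.pop(flow, None)
```

Note that nothing in forward() looks at the payload, which is why the balancer cannot tell whether the chosen backend suits a particular request.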
Layer 7 (Application) Load Balancing
An L7 load balancer terminates the client's TCP connection (and optionally TLS), fully parses the application-layer protocol (typically HTTP), and opens a separate connection to the selected backend server. This is a full reverse proxy architecture -- there are two independent TCP connections, and the balancer can inspect, modify, and route each request independently.
L7 balancing enables content-aware routing: requests to /api/v2/* can go to one backend pool, requests with a specific cookie to another, and requests from certain geolocations to a third. The balancer can add headers (like X-Forwarded-For), rewrite URLs, compress responses, cache content, and enforce rate limits. With HTTP/2 multiplexing, a single client connection can carry hundreds of concurrent requests, each potentially routed to a different backend.
The cost is latency and resource consumption. An L7 balancer must maintain state for both the client connection and backend connection, buffer request and response data, and perform protocol parsing for every transaction. Where an L4 balancer adds microseconds of latency, an L7 balancer typically adds 0.5-5ms depending on request size, TLS termination, and header processing. Software L7 balancers like HAProxy and NGINX can still handle hundreds of thousands of requests per second on commodity hardware, but they are doing fundamentally more work per request than an L4 balancer.
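To make content-aware routing concrete, the following sketch shows the kind of per-request decision an L7 balancer can make once it has parsed the request. The pool names ("api-v2", "beta", "default"), the "beta" cookie, and the Request type are hypothetical; only the /api/v2/ prefix and the X-Forwarded-For header come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    cookies: Dict[str, str] = field(default_factory=dict)
    client_ip: str = "0.0.0.0"


def choose_pool(req: Request) -> str:
    """Return the name of a (hypothetical) backend pool for this request."""
    if req.path.startswith("/api/v2/"):
        return "api-v2"
    if req.cookies.get("beta") == "1":
        return "beta"
    return "default"


def add_forwarded_for(req: Request) -> None:
    """Append the client IP to X-Forwarded-For, as a reverse proxy would."""
    existing = req.headers.get("X-Forwarded-For")
    req.headers["X-Forwarded-For"] = (
        f"{existing}, {req.client_ip}" if existing else req.client_ip
    )
```

Every request pays for this parsing and rule evaluation, which is part of the per-request cost discussed above.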
Load Balancing Algorithms
The algorithm that selects which backend receives each connection or request is the core decision engine of any load balancer. The choice of algorithm has profound effects on latency distribution, backend utilization, cache hit rates, and behavior under failure conditions.
Round-Robin
The simplest algorithm. Requests are distributed to backends in sequential order: server 1, server 2, server 3, server 1, server 2, server 3, and so on. Weighted round-robin assigns each server a weight proportional to its capacity -- a server with weight 3 receives three times as many requests as a server with weight 1. Round-robin is the default in most load balancers including HAProxy and NGINX.
Round-robin works well when requests are roughly uniform in cost and backends are roughly uniform in capacity. It fails when requests have high variance in processing time -- a few slow requests can queue up on one server while others sit idle. It also ignores server health beyond binary up/down checks; a server experiencing degraded performance (high CPU, garbage collection pauses, disk I/O saturation) continues receiving its full share of traffic.
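One common way to implement weighted round-robin is the "smooth" variant, which interleaves heavy and light servers instead of sending bursts to the heavy one. The sketch below is an illustration under that assumption; the class name and server names are made up, and real balancers add health state and dynamic weights on top.

```python
from typing import Dict


class SmoothWeightedRoundRobin:
    def __init__(self, weights: Dict[str, int]) -> None:
        self.weights = dict(weights)
        self.current = {name: 0 for name in weights}
        self.total = sum(weights.values())

    def next(self) -> str:
        # Every server gains its weight; the current leader is picked and
        # pays back the total, which spreads heavy servers out over time.
        for name, weight in self.weights.items():
            self.current[name] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen
```

With equal weights this degenerates to plain round-robin. Note that the selection uses only static weights, so a degraded server keeps receiving its full share, which is the weakness described above.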
Least Connections
Routes each new connection to the backend with the fewest active connections. This naturally adapts to heterogeneous backends and variable request costs: a faster server completes requests sooner, reducing its active connection count and attracting more traffic. A slow server accumulates connections and receives fewer new ones. Weighted least-connections combines this with static weights.
Least-connections is the best general-purpose algorithm for L4 balancing. For L7 balancing, least outstanding requests (sometimes called least-pending or least-busy) is the analogous metric, counting HTTP requests in flight rather than TCP connections. This distinction matters with HTTP/2 and gRPC, where a single TCP connection carries multiple concurrent requests.
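A minimal sketch of least-connections selection follows. The same structure works for least outstanding requests if acquire() and release() are called per request rather than per connection. The class and method names are illustrative assumptions.

```python
from typing import Dict, Iterable


class LeastConnections:
    def __init__(self, backends: Iterable[str]) -> None:
        self.active: Dict[str, int] = {b: 0 for b in backends}

    def acquire(self) -> str:
        # Pick the backend with the fewest in-flight connections
        # (ties go to the backend listed first).
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        # Call when the connection (or request) completes.
        self.active[backend] -= 1
```

A backend that finishes work quickly calls release() sooner, so its count drops and it is chosen again, which is the self-correcting behavior described above.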
Consistent Hashing
Maps each request to a backend using a hash function, typically applied to the client IP, a request header, a URL, or a cookie value. The "consistent" part is critical: when a backend is added or removed, only about 1/N of the keys are remapped (where N is the number of backends), rather than the near-total remapping that occurs with simple modular hashing. This stability is essential for caching workloads where remapping a key means a cache miss.
Consistent hashing is commonly implemented using a hash ring (introduced by Karger et al. in 1997). Each backend is assigned multiple points (virtual nodes) on a circular hash space. A request's hash value maps to a point on the ring, and the request is routed to the next backend clockwise. More virtual nodes per backend produce a more uniform distribution at the cost of slightly more memory and lookup time. Maglev hashing (Google, 2016) takes a different approach: instead of a ring, it precomputes a fixed-size lookup table in which each backend owns a nearly equal share of the entries, giving a more even distribution than a ring with few virtual nodes.
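The hash ring can be sketched in a few lines. This is a simplified illustration: the HashRing class, the "backend#i" naming of virtual nodes, and the choice of SHA-256 are assumptions for the example, and hash collisions between points are ignored.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends: Iterable[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self.points: List[int] = []      # sorted positions on the ring
        self.owner: Dict[int, str] = {}  # position -> backend
        for backend in backends:
            self.add(backend)

    def add(self, backend: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{backend}#{i}")
            bisect.insort(self.points, point)
            self.owner[point] = backend

    def remove(self, backend: str) -> None:
        self.points = [p for p in self.points if self.owner[p] != backend]
        self.owner = {p: b for p, b in self.owner.items() if b != backend}

    def lookup(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping past the end.
        idx = bisect.bisect_right(self.points, _hash(key)) % len(self.points)
        return self.owner[self.points[idx]]
```

Removing a backend only deletes that backend's points, so keys owned by other backends keep their mapping; only the removed backend's keys move to their next clockwise neighbor.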
Random with Two Choices (Power of Two)
A surprisingly effective algorithm backed by queueing theory. For each request, the balancer randomly selects two backends and routes the request to whichever has fewer active connections. This achieves near-optimal load distribution -- in the classic balls-into-bins model, the maximum load on any server is O(log log N), compared to O(log N / log log N) for purely random selection. The result was established by Azar, Broder, Karlin, and Upfal (1994), and the "power of two choices" paradigm was popularized by Mitzenmacher's 1996 thesis. It is used in modern systems like Envoy Proxy and Netflix's Zuul.
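The selection rule itself is very small. The sketch below assumes at least two backends and tracks active connections in a dictionary; the class name and the injectable random generator are choices made for the example.

```python
import random
from typing import Dict, Iterable, Optional


class PowerOfTwoChoices:
    def __init__(self, backends: Iterable[str],
                 rng: Optional[random.Random] = None) -> None:
        self.active: Dict[str, int] = {b: 0 for b in backends}
        self.rng = rng or random.Random()

    def acquire(self) -> str:
        # Sample two distinct backends and keep the less loaded one.
        first, second = self.rng.sample(list(self.active), 2)
        chosen = first if self.active[first] <= self.active[second] else second
        self.active[chosen] += 1
        return chosen

    def release(self, backend: str) -> None:
        self.active[backend] -= 1
```

Compared with full least-connections, this only reads two counters per decision, which matters when the counters are approximate or shared across many balancer threads.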
Least Response Time
Routes requests to the backend with the lowest observed response time, optionally weighted by active connections. This requires the balancer to maintain a running estimate of each backend's latency, typically using an exponentially weighted moving average (EWMA). It is the most adaptive algorithm but also the most susceptible to feedback loops: if a backend has temporarily low latency (perhaps because its cache just warmed up), it may receive a thundering herd of requests that overwhelm it.
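A minimal sketch of EWMA-based selection is shown below. The smoothing factor alpha and the policy of trying unmeasured backends first are assumptions for the example; production balancers usually combine the estimate with in-flight request counts to dampen the feedback loop described above.

```python
from typing import Dict, Iterable, Optional


class EwmaLatency:
    def __init__(self, backends: Iterable[str], alpha: float = 0.2) -> None:
        self.alpha = alpha
        self.estimate: Dict[str, Optional[float]] = {b: None for b in backends}

    def observe(self, backend: str, seconds: float) -> None:
        # new = alpha * sample + (1 - alpha) * previous
        prev = self.estimate[backend]
        if prev is None:
            self.estimate[backend] = seconds
        else:
            self.estimate[backend] = self.alpha * seconds + (1 - self.alpha) * prev

    def pick(self) -> str:
        # Unmeasured backends sort first so each one gets an initial sample.
        return min(
            self.estimate,
            key=lambda b: -1.0 if self.estimate[b] is None else self.estimate[b],
        )
```

A larger alpha reacts faster to latency changes but also makes the balancer more prone to chasing short-lived dips.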
Health Checks
A load balancer is only useful if it avoids sending traffic to failed backends. Health checking is the mechanism that continuously monitors backend availability and removes unhealthy servers from the rotation.
Active health checks are periodic probes sent by the balancer to each backend. The simplest is a TCP connect check -- the balancer opens a TCP connection to the backend's port and considers it healthy if the handshake completes. HTTP health checks send a GET request to a specific path (e.g., /healthz or /ready) and verify the response status code and optionally the body content. gRPC health checks use the standard grpc.health.v1.Health service. Sophisticated health checks can verify database connectivity, downstream dependency availability, and application-specific readiness.
Passive health checks (also called outlier detection) monitor actual client traffic rather than sending synthetic probes. If a backend returns a threshold number of errors (e.g., 5 consecutive HTTP 5xx responses or TCP connection failures), it is marked unhealthy and removed from rotation. Envoy Proxy implements this as "outlier ejection" with configurable thresholds and ejection durations.
The interaction between active and passive checks matters. Most production configurations use both: passive checks provide fast failure detection (sub-second, since they piggyback on real traffic), while active checks detect failures when no traffic is flowing and verify recovery before re-adding a backend. The fall and rise parameters in HAProxy prevent flapping: a server must fail N consecutive checks to be marked down and pass M consecutive checks to be marked up again.
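The rise/fall behavior amounts to a small state machine per backend. The sketch below illustrates the idea only; it is not HAProxy's implementation, and the HealthState class and its default thresholds are assumptions.

```python
class HealthState:
    def __init__(self, rise: int = 2, fall: int = 3, healthy: bool = True) -> None:
        self.rise = rise
        self.fall = fall
        self.healthy = healthy
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one check result; return whether the backend is considered up."""
        if check_passed == self.healthy:
            # Result agrees with the current state: reset the streak.
            self.streak = 0
            return self.healthy
        self.streak += 1
        threshold = self.fall if self.healthy else self.rise
        if self.streak >= threshold:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy
```

Because a single contradicting result never flips the state on its own (for thresholds above 1), one dropped probe or one lucky success does not cause the backend to flap in and out of rotation.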
Session Persistence (Sticky Sessions)
Many applications maintain server-side state that is tied to a specific client -- session data, shopping cart contents, authentication tokens, WebSocket connections. Session persistence (or "sticky sessions") ensures that subsequent requests from the same client are routed to the same backend server.
Common persistence mechanisms include:
- Source IP persistence -- Hash the client's IP address to determine the backend. Simple but broken by CGNAT, corporate proxies, and mobile networks where thousands of users share a single public IP.
- Cookie-based persistence -- The load balancer inserts a cookie (e.g., SERVERID=srv2) in the response. Subsequent requests from the same client include this cookie, and the balancer routes to the specified server. This is the most reliable L7 persistence mechanism. HAProxy supports this via cookie SERVERID insert indirect nocache.
- Header or URL-based persistence -- Hash a specific header value, query parameter, or URL path component. Useful for API traffic where a consistent client identifier exists in the request.
Session persistence has an inherent tension with load balancing: by pinning clients to specific servers, you reduce the balancer's ability to distribute load evenly. If a popular user or a bot is pinned to one server, that server may become overloaded while others are idle. The best architectural solution is to eliminate server-side session state entirely (store sessions in Redis, Memcached, or a database), making all backends interchangeable. When that is not feasible, use cookie-based persistence with a fallback to round-robin when the specified server is down.
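Cookie-based persistence with a round-robin fallback can be sketched as follows. The SERVERID cookie name comes from the example above; the CookiePersistence class, its up set, and the return convention are hypothetical and chosen only for illustration.

```python
import itertools
from typing import Dict, Iterable, Optional, Tuple


class CookiePersistence:
    COOKIE = "SERVERID"

    def __init__(self, backends: Iterable[str]) -> None:
        self.backends = list(backends)
        self.up = set(self.backends)  # backends currently passing health checks
        self._rr = itertools.cycle(self.backends)

    def _next_healthy(self) -> str:
        for _ in range(len(self.backends)):
            candidate = next(self._rr)
            if candidate in self.up:
                return candidate
        raise RuntimeError("no healthy backends")

    def route(self, cookies: Dict[str, str]) -> Tuple[str, Optional[str]]:
        """Return (backend, cookie_value_to_set); the second item is None
        when the client's existing cookie is still valid."""
        pinned = cookies.get(self.COOKIE)
        if pinned in self.up:
            return pinned, None
        # No cookie, or the pinned server is down: fall back to round-robin
        # and re-pin the client to the newly chosen backend.
        backend = self._next_healthy()
        return backend, backend
```

When the pinned server is down the client is silently moved, so any state that lived only on the old server is lost, which is the argument for externalizing session state.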
Direct Server Return (DSR)
In a standard NAT-based load balancing configuration, all traffic -- both client-to-server and server-to-client -- flows through the load balancer. For bandwidth-asymmetric workloads like video streaming, file downloads, and API responses with large payloads, the return path through the load balancer becomes a bottleneck. DSR solves this by having the backend server respond directly to the client, bypassing the load balancer on the return path.
DSR works by configuring the VIP address on a loopback interface on each backend server (with ARP suppression so backends do not respond to ARP requests for the VIP). The load balancer receives the client's packet destined for the VIP, selects a backend, and forwards the packet to the backend's real IP -- typically by encapsulating it in IP-in-IP or GRE, or by rewriting the destination MAC address (L2 DSR). The backend decapsulates or receives the packet, sees that the destination IP matches its loopback VIP, processes the request, and sends the response directly to the client with the VIP as the source address. The client sees a normal response from the VIP and is unaware of the DSR mechanism.
Linux's IPVS supports DSR via the -g (gatewaying/DR) and -i (IP tunneling) modes. L2 DSR (gatewaying) requires the backends to be on the same L2 segment as the load balancer. IP-in-IP tunneling (IPIP or GRE) removes this restriction, allowing backends to be in different subnets or even different data centers, at the cost of additional encapsulation overhead and reduced MTU.
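The address handling in L2 DSR can be illustrated with a toy packet model. This is a conceptual sketch only: the Packet type, the example VIP, and the MAC addresses are invented, and real DSR happens in the kernel or NIC rather than in application code.

```python
from dataclasses import dataclass, replace

VIP = "198.51.100.10"  # hypothetical virtual IP, also on each backend's loopback


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    dst_mac: str


def balancer_forward(pkt: Packet, backend_mac: str) -> Packet:
    # L2 DSR: only the destination MAC changes; the IP header is untouched,
    # so the backend still sees the VIP as the destination address.
    return replace(pkt, dst_mac=backend_mac)


def backend_reply(request: Packet, gateway_mac: str) -> Packet:
    if request.dst_ip != VIP:
        raise ValueError("packet is not addressed to the loopback VIP")
    # The reply is sourced from the VIP and sent toward the client through
    # the default gateway, never passing back through the balancer.
    return Packet(src_ip=VIP, dst_ip=request.src_ip, dst_mac=gateway_mac)
```

Because the reply carries the VIP as its source address, the client cannot distinguish it from a reply that went back through the balancer.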
Connection Draining (Graceful Shutdown)
When a backend server needs to be removed from rotation -- for maintenance, deployment, or because it is failing -- simply dropping all connections causes errors for in-flight requests. Connection draining (also called graceful shutdown or deregistration delay) solves this by stopping new connections to the server while allowing existing connections to complete naturally.
The draining process works as follows: the server is marked as "draining" in the load balancer configuration. New connections and requests are routed to other backends. Existing connections continue to be served until they complete or a timeout expires (typically 30-300 seconds). Once all connections have drained or the timeout is reached, the server is fully removed.
In HAProxy, you can drain a server via the stats socket: set server backend/srv1 state drain. The server stops receiving new connections but continues serving existing ones. AWS ALB calls this "deregistration delay" and defaults to 300 seconds. Kubernetes implements it via the terminationGracePeriodSeconds pod spec field combined with a preStop hook, which gives the application time to finish serving in-flight requests before receiving SIGKILL.
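The draining lifecycle can be modeled with a small state object. The sketch below is an assumption-laden illustration (the DrainableBackend class and its 30-second default are invented), with time passed in explicitly so the logic is easy to follow.

```python
from typing import Optional


class DrainableBackend:
    def __init__(self, name: str, drain_timeout: float = 30.0) -> None:
        self.name = name
        self.drain_timeout = drain_timeout
        self.active = 0
        self.drain_started: Optional[float] = None

    @property
    def accepting(self) -> bool:
        # A draining backend takes no new connections.
        return self.drain_started is None

    def start_drain(self, now: float) -> None:
        self.drain_started = now

    def open(self) -> None:
        if not self.accepting:
            raise RuntimeError("backend is draining")
        self.active += 1

    def close(self) -> None:
        self.active -= 1

    def removable(self, now: float) -> bool:
        # Safe to remove once idle, or once the drain timeout has expired.
        if self.drain_started is None:
            return False
        timed_out = now - self.drain_started >= self.drain_timeout
        return self.active == 0 or timed_out
```

The timeout is the safety valve: long-lived connections such as WebSockets may never finish on their own, so the balancer eventually removes the server anyway.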
TLS Termination and Re-encryption
Modern load balancers typically terminate TLS at the balancer, decrypting traffic and forwarding plain HTTP to backends. This centralizes certificate management, offloads CPU-intensive cryptographic operations from application servers, and enables L7 inspection and routing on the decrypted traffic. The cost is that traffic between the load balancer and backends is unencrypted on the internal network.
For environments requiring end-to-end encryption (PCI-DSS, HIPAA, zero-trust architectures), load balancers support TLS re-encryption: terminate the client's TLS connection, inspect the traffic for routing, then establish a new TLS connection to the backend. This doubles the TLS handshake overhead but maintains encryption throughout the path. Alternatively, TLS passthrough at L4 forwards encrypted traffic directly to backends without termination, preserving end-to-end encryption but sacrificing all L7 inspection capabilities.
Mutual TLS (mTLS) adds another dimension: the load balancer can verify client certificates on the frontend and present its own client certificate to backends. This is common in service mesh architectures where every service-to-service connection is authenticated via mTLS, and the sidecar proxy acts as both the TLS termination point and the load balancer.
The gRPC Load Balancing Problem
gRPC presents a unique challenge for load balancers. Because gRPC uses HTTP/2, all RPCs between a client and server are multiplexed over a single long-lived TCP connection. An L4 load balancer pins the entire connection to one backend, so all RPCs go to the same server regardless of the number of backends available. This completely defeats the purpose of load balancing.
The solution is L7 balancing that understands HTTP/2 framing and can route individual requests (HEADERS+DATA frame sequences) to different backends. Envoy, HAProxy (in HTTP mode), and gRPC's built-in client-side balancer all implement per-RPC balancing. The balancer terminates the client's HTTP/2 connection, extracts individual gRPC requests, and distributes them across backend connections using the configured algorithm.
An alternative is client-side load balancing, where the gRPC client itself maintains connections to all backends and distributes RPCs directly. This eliminates the proxy hop but requires the client to discover backends (via DNS, a service registry, or the gRPC name resolution API) and implement balancing logic. gRPC's grpclb and the newer xDS-based balancing (used with Envoy's control plane) provide standardized protocols for this.
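The difference between per-connection and per-RPC balancing shows up clearly in a toy simulation. The function names and the use of simple round-robin are assumptions for the example; each entry in connections is the number of RPCs sent over one long-lived client connection.

```python
import itertools
from typing import Dict, List


def balance_per_connection(connections: List[int], backends: List[str]) -> Dict[str, int]:
    """L4-style: every RPC on a connection lands on that connection's backend."""
    counts = {b: 0 for b in backends}
    rr = itertools.cycle(backends)
    for rpcs in connections:
        counts[next(rr)] += rpcs
    return counts


def balance_per_rpc(connections: List[int], backends: List[str]) -> Dict[str, int]:
    """L7-style: each RPC is routed independently of its connection."""
    counts = {b: 0 for b in backends}
    rr = itertools.cycle(backends)
    for rpcs in connections:
        for _ in range(rpcs):
            counts[next(rr)] += 1
    return counts
```

With a single client connection carrying 300 RPCs and three backends, the per-connection version puts all 300 on one backend, while the per-RPC version gives each backend 100.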
Load Balancer Architectures in Practice
Production systems typically use multiple tiers of load balancing. A common architecture at scale:
- BGP/Anycast layer -- Multiple edge locations announce the same IP prefix via BGP. Internet routing directs clients to the nearest location. This is the GSLB tier.
- L4 load balancer (ECMP) -- Within each edge location, routers use ECMP to spread flows across a pool of L4 balancers (IPVS, Maglev, or hardware), which in turn distribute connections across L7 balancer instances. Handles millions of connections per second with DSR.
- L7 load balancer -- HAProxy, NGINX, or Envoy terminates TLS, parses HTTP, routes requests to backend pools based on content, and maintains connection pools to backends.
- Application backends -- The actual service instances, often running in Kubernetes pods with their own service mesh sidecar proxies providing another layer of L7 balancing.
Google's Maglev (described in their 2016 paper) is the canonical example of this architecture. Maglev is a distributed L4 load balancer where each machine in a pool of Maglev nodes independently selects the same backend for a given connection using consistent hashing with connection tracking. This allows the Maglev pool itself to be load-balanced via ECMP without breaking connection affinity -- even if consecutive packets for the same connection arrive at different Maglev nodes, they are forwarded to the same backend.
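The reason every Maglev node can pick the same backend independently is that each node builds an identical lookup table from the backend list alone. The sketch below follows the table-population idea from the Maglev paper in simplified form: the hash functions, the seed strings, and the function names are assumptions, the table size must be prime, and the connection-tracking part is omitted.

```python
import hashlib
from typing import List


def _h(seed: str, name: str) -> int:
    digest = hashlib.sha256(f"{seed}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_maglev_table(backends: List[str], size: int = 65537) -> List[str]:
    """Fill a table of `size` slots (size must be prime) with backend names."""
    if not backends:
        raise ValueError("at least one backend is required")
    # Each backend gets its own permutation of the slots: offset + j * skip.
    offsets = [_h("offset", b) % size for b in backends]
    skips = [_h("skip", b) % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table: List = [None] * size
    filled = 0
    while True:
        # Backends take turns claiming their next preferred free slot.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_idx[i] * skips[i]) % size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == size:
                return table


def lookup(table: List[str], flow_key: str) -> str:
    # Any node holding the same table maps the same flow to the same backend.
    return table[_h("flow", flow_key) % len(table)]
```

Because backends claim slots in turn, their shares of the table differ by at most one slot, and two nodes given the same backend list compute identical tables without coordinating.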
Hardware vs Software Load Balancers
Traditional hardware load balancers (F5 BIG-IP, Citrix NetScaler/ADC, A10 Networks) are dedicated appliances with custom ASICs for packet processing. They offer high throughput (100+ Gbps), deterministic latency, and enterprise features (global traffic management, WAF, DDoS protection) in a single box. The downsides are cost ($50K-$500K+ per appliance), vendor lock-in, inflexible scaling (you cannot add half an appliance), and operational complexity of proprietary management interfaces.
Software load balancers (HAProxy, NGINX, Envoy, Linux IPVS, Katran) run on commodity hardware or virtual machines. They scale horizontally by adding instances, integrate with automation and CI/CD pipelines, and can be deployed anywhere -- bare metal, VMs, containers, cloud instances. Modern software L4 balancers that avoid the regular kernel network stack, either through kernel bypass (DPDK) or through in-driver packet processing (eBPF/XDP), achieve performance comparable to hardware appliances at a fraction of the cost.
The industry trend is decisively toward software. Facebook's Katran (open-sourced in 2018) uses eBPF/XDP to implement L4 load balancing entirely in the kernel's network driver layer, achieving line-rate packet processing without context switches to user space. Cloudflare's Unimog similarly uses XDP for L4 balancing across their global network. Even traditional hardware vendors now offer virtual editions of their appliances that run as software on standard servers.
Cloud Load Balancers
Cloud providers offer managed load balancing services that abstract away the infrastructure:
- AWS -- Network Load Balancer (L4, millions of requests/second, static IPs), Application Load Balancer (L7, content-based routing, WebSocket support), Gateway Load Balancer (transparent network gateway for appliance insertion). Classic Load Balancer is legacy and should not be used for new deployments.
- Google Cloud -- Global external L7 load balancer (anycast IP, cross-region), regional L4/L7 load balancers, internal L4/L7 load balancers. Google's external L7 LB uses the same Maglev/GFE infrastructure that serves Google Search.
- Azure -- Azure Load Balancer (L4), Application Gateway (L7 with WAF), Front Door (global L7 with CDN), Traffic Manager (DNS-based GSLB).
Cloud load balancers are deeply integrated with their respective platforms: auto-scaling groups automatically register/deregister backends, health checks are built in, and TLS certificate management is handled by the cloud's certificate service (for example, AWS Certificate Manager or Google Cloud's Certificate Manager). The tradeoff is limited customization and potential cost at high traffic volumes.
Load Balancing and BGP
Load balancers and BGP are complementary technologies in internet infrastructure. BGP handles inter-domain routing -- directing traffic from clients across the internet to the correct network and data center. Load balancers handle intra-site distribution -- spreading traffic across the specific servers within a data center that can serve the request.
At the point where these layers meet, load balancer instances announce service VIPs via BGP to their upstream routers. The router uses ECMP to distribute flows across load balancer instances. If a load balancer instance fails, it withdraws the VIP announcement or its BGP session with the router drops, and the router redistributes traffic to the surviving instances, typically within seconds depending on BGP timers and whether BFD is in use. This BGP-based health-aware routing is how hyperscale operators achieve both global and local traffic distribution without single points of failure.
You can observe this infrastructure in action by looking up any major service's IP address. The god.ad BGP Looking Glass shows the BGP routes, AS paths, and origin ASNs behind globally load-balanced services -- the routing layer that directs your traffic before any load balancer algorithm ever touches it.