How Load Balancers Work: L4 vs L7 Load Balancing

A load balancer distributes incoming network traffic across multiple backend servers so that no single server bears too much demand. Load balancers are what make it possible for sites like Google, Netflix, and Amazon to handle billions of requests per day. Without them, a single server failure would take down an entire service, and traffic spikes would overwhelm individual machines.

But not all load balancers are the same. The most fundamental distinction in load balancing is between Layer 4 (L4) and Layer 7 (L7) operation, referring to layers of the OSI model. An L4 load balancer makes routing decisions based on transport-layer information (IP addresses and TCP/UDP ports), while an L7 load balancer inspects application-layer data (HTTP headers, URLs, cookies) to make smarter decisions. Understanding this difference, along with the algorithms, architectures, and trade-offs involved, is essential for designing scalable infrastructure.

The OSI Model: Why "Layer 4" and "Layer 7"?

The OSI (Open Systems Interconnection) model is a conceptual framework that describes network communication in seven layers. Two of these layers are central to load balancing:

  - Layer 4, the transport layer: TCP and UDP — IP addresses, ports, and connection state
  - Layer 7, the application layer: the protocols applications actually speak, such as HTTP, gRPC, and WebSocket

The distinction matters because it determines what information the load balancer can act on, how much processing overhead it introduces, and what kinds of routing policies it can enforce.

Layer 4 Load Balancing

An L4 load balancer operates at the transport layer. It receives incoming TCP or UDP connections and forwards them to a backend server based on limited information: the source IP, source port, destination IP, and destination port. It does not inspect the contents of the packets beyond this 4-tuple.

How L4 Load Balancing Works

When a client opens a TCP connection to the load balancer's virtual IP (VIP), the load balancer selects a backend server and either rewrites the packet's destination IP (NAT mode) or encapsulates it (tunneling/DSR mode). All subsequent packets in that TCP flow go to the same backend. The load balancer maintains a connection tracking table that maps each client flow to its assigned backend.
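
The flow-to-backend mapping described above can be sketched in a few lines of Python. This is a toy model of the connection-tracking table, not a real packet path (real implementations such as IPVS do this in the kernel); the backend IPs and the hash-based selection are illustrative:

```python
import hashlib

class L4Balancer:
    """Toy model of an L4 balancer's connection-tracking table."""

    def __init__(self, backends):
        self.backends = backends
        self.flows = {}  # 4-tuple -> assigned backend

    def forward(self, src_ip, src_port, dst_ip, dst_port):
        flow = (src_ip, src_port, dst_ip, dst_port)
        if flow not in self.flows:
            # New flow: pick a backend by hashing the 4-tuple.
            digest = hashlib.sha256(repr(flow).encode()).digest()
            self.flows[flow] = self.backends[
                int.from_bytes(digest[:4], "big") % len(self.backends)]
        # Every subsequent packet of the same flow hits the same backend.
        return self.flows[flow]

lb = L4Balancer(["10.0.1.1", "10.0.1.2", "10.0.1.3"])  # hypothetical IPs
first = lb.forward("203.0.113.5", 52431, "10.0.0.1", 443)
assert lb.forward("203.0.113.5", 52431, "10.0.0.1", 443) == first
```

Note that the balancer never looks past the 4-tuple: the packet payload plays no role in the decision.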

Because L4 load balancers do not need to parse HTTP or terminate TLS, they can process packets at extremely high throughput with minimal latency. Modern L4 load balancers implemented in the kernel (using technologies like IPVS, eBPF/XDP, or DPDK) can forward tens of millions of packets per second on commodity hardware.

Advantages of L4

  - Very high throughput and low latency: no HTTP parsing or TLS handshakes, so each packet needs minimal processing
  - Protocol agnostic: works for any TCP or UDP service (databases, DNS, custom protocols), not just HTTP
  - TLS passthrough: encrypted traffic flows through untouched, so the load balancer never needs the certificates
  - Simple and robust: little state and little code in the critical path

Limitations of L4

  - No content-based routing: it cannot see URLs, headers, or cookies, so it cannot send /api and /static to different pools
  - Health checks are limited to TCP connectivity; a backend that accepts connections but returns errors looks healthy
  - No connection multiplexing, header rewriting, or response inspection
  - Session persistence is limited to source-IP affinity

Layer 7 Load Balancing

An L7 load balancer terminates the client's TCP (and usually TLS) connection, parses the application-layer protocol, and then opens a new connection to a backend server. Because it fully understands the application protocol, it can make decisions that are impossible at L4.

How L7 Load Balancing Works

The client connects to the load balancer, which accepts the TCP connection and performs the TLS handshake (if HTTPS). The load balancer then reads the full HTTP request — method, URL, headers, and potentially the body. Based on configured rules, it selects a backend and proxies the request over a separate connection. The response flows back through the load balancer to the client.

This is a full proxy architecture: there are two distinct TCP connections (client-to-LB and LB-to-backend), and the load balancer is an active participant in both. This gives it enormous power but also introduces additional latency and resource consumption compared to L4.
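
The routing step in this full-proxy flow is rule evaluation against the parsed request. A minimal sketch, assuming hypothetical path-prefix rules and pool names (real L7 balancers support far richer matching on headers, cookies, and methods):

```python
# Hypothetical routing rules: backend pool chosen by URL path prefix,
# as an L7 balancer would apply after parsing the HTTP request line.
ROUTES = {
    "/api":    ["api-1", "api-2"],
    "/static": ["static-1"],
    "/":       ["web-1", "web-2", "web-3"],
}

def pick_pool(path):
    """Return the pool whose prefix matches the path most specifically."""
    best = max((p for p in ROUTES if path.startswith(p)), key=len)
    return ROUTES[best]

assert pick_pool("/api/users") == ["api-1", "api-2"]
assert pick_pool("/index.html") == ["web-1", "web-2", "web-3"]
```

None of this is possible at L4, where the request line and headers are invisible.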

What L7 Can Do That L4 Cannot

  - Content-based routing: send /api to one pool and /static to another, based on URL, headers, or cookies
  - TLS termination: decrypt traffic, inspect it, and optionally re-encrypt toward the backend
  - Header manipulation: inject headers such as X-Forwarded-For so backends see the real client IP
  - Connection multiplexing: reuse a small pool of backend connections for many client connections
  - Application-aware health checks: verify HTTP status codes and response bodies, not just TCP connectivity
  - Cookie-based session persistence and request-level retries

L4 vs L7: A Visual Comparison

[Diagram: L4 vs L7 load balancing. On the L4 side, a client's TCP SYN reaches the L4 LB, which picks srv-1, srv-2, or srv-3 from IP and port alone; it sees only Src IP 203.0.113.5, Dst IP 10.0.0.1, Src Port 52431, Dst Port 443, Protocol TCP, Flags SYN — the payload is encrypted and opaque. On the L7 side, the client's HTTPS request reaches the L7 LB, which fully parses HTTP and routes /api, /web, and /static to different pools; it sees the full request: GET /api/users HTTP/1.1, Host: app.example.com, Cookie: session=abc123, X-Forwarded-For: 203.0.113.5.]

  Property                  L4                       L7
  Throughput                Very high                Moderate
  Content routing           No                       Yes (URL, header, cookie)
  SSL termination           Passthrough              Terminates + re-encrypts
  Health checks             TCP only                 HTTP (status code + body)
  Connection multiplexing   No                       Yes
  Protocol support          Any TCP/UDP              HTTP(S), gRPC, WebSocket
  Use case                  TCP proxying, DB, DNS    Web apps, APIs, microservices

Load Balancing Algorithms

The algorithm a load balancer uses to choose which backend receives each connection (L4) or request (L7) is one of the most critical design decisions. Different algorithms optimize for different goals: fairness, locality, minimal latency, or cache efficiency.

Round Robin

The simplest algorithm. Requests are distributed to backends in sequential, circular order: server 1, server 2, server 3, server 1, server 2, and so on. Round robin assumes all backends are equally capable and all requests are equally expensive. It works well when both of these assumptions hold. It breaks down when backends have different capacities or when request costs vary widely.
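
The sequential, circular order is exactly what Python's itertools.cycle produces, so a round-robin selector is a one-liner (backend names are illustrative):

```python
import itertools

backends = ["srv-1", "srv-2", "srv-3"]
rr = itertools.cycle(backends)  # endless sequential, circular iterator

# The first six requests wrap around the pool twice:
assignments = [next(rr) for _ in range(6)]
assert assignments == ["srv-1", "srv-2", "srv-3",
                       "srv-1", "srv-2", "srv-3"]
```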

Weighted Round Robin

An extension of round robin where each backend is assigned a weight. A backend with weight 3 receives three times as many requests as a backend with weight 1. This is useful when backends have different hardware specifications — a server with 64 cores should receive more traffic than one with 8. Cloud providers use weighted round robin during rolling deployments, gradually increasing the weight of new instances as confidence grows.
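
The simplest way to realize weights is to repeat each backend in the rotation according to its weight. This naive expansion sends a weighted backend's share in a burst; production balancers typically use a "smooth" interleaving instead, but the proportions come out the same (backend names and weights are illustrative):

```python
import itertools

weights = {"big": 3, "small": 1}  # hypothetical backends and weights

# Naive expansion: a backend with weight w appears w times per cycle.
schedule = [name for name, w in weights.items() for _ in range(w)]
wrr = itertools.cycle(schedule)

one_cycle = [next(wrr) for _ in range(sum(weights.values()))]
assert one_cycle.count("big") == 3 and one_cycle.count("small") == 1
```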

Least Connections

The load balancer tracks how many active connections each backend currently has and sends new connections to the backend with the fewest. This adapts naturally to backends with different processing speeds: a fast server completes connections quickly and therefore has fewer active connections, so it attracts more new ones. Least connections is the default choice for many production deployments because it handles heterogeneous backends and variable request costs better than round robin.
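
The selection itself reduces to a minimum over the connection-count table (a sketch; the counts here are synthetic):

```python
active = {"fast": 2, "medium": 5, "slow": 9}  # current connection counts

def least_connections(active):
    """Pick the backend with the fewest active connections."""
    return min(active, key=active.get)

chosen = least_connections(active)
active[chosen] += 1  # the new connection now counts against it
assert chosen == "fast"
```

The bookkeeping matters as much as the min(): the balancer must increment on assignment and decrement when the connection closes, or fast backends stop looking fast.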

Weighted Least Connections

Combines weights with connection counting. The load balancer selects the backend with the lowest ratio of active connections to weight. A server with weight 4 and 20 active connections (ratio 5.0) would be preferred over a server with weight 2 and 12 connections (ratio 6.0).
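
The numeric example above can be checked directly — the selection is a minimum over the connections-to-weight ratio (backend names are illustrative):

```python
# Backends as (weight, active connections); numbers from the example above.
pool = {"srv-a": (4, 20), "srv-b": (2, 12)}

def weighted_least_conns(pool):
    """Pick the backend with the lowest connections-to-weight ratio."""
    return min(pool, key=lambda b: pool[b][1] / pool[b][0])

assert weighted_least_conns(pool) == "srv-a"  # ratio 5.0 beats 6.0
```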

Random

Selects a backend at random for each request. Surprisingly effective at scale. The "power of two random choices" variant picks two backends at random and sends the request to the one with fewer connections. This simple heuristic provides near-optimal load distribution and avoids the herd effect that deterministic algorithms can exhibit, where several load balancers acting on the same (possibly stale) load information all pick the same least-loaded backend at once.
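
A sketch of the two-choices heuristic, with a small simulation to show how strongly it steers traffic away from loaded backends (the backend names, counts, and 10,000-trial size are all synthetic):

```python
import random

def power_of_two(active, rng=random):
    """Pick two backends at random; send to the one with fewer connections."""
    a, b = rng.sample(list(active), 2)
    return a if active[a] <= active[b] else b

active = {"s1": 10, "s2": 0, "s3": 5}
counts = {b: 0 for b in active}
for _ in range(10_000):
    counts[power_of_two(active)] += 1

# The idle backend wins every pairing it appears in; the most loaded
# backend never wins one, so it receives no new traffic at all here.
assert counts["s2"] > counts["s3"] > counts["s1"]
```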

IP Hash

Hashes the client's source IP address to deterministically select a backend. The same client always reaches the same server, providing natural session affinity without cookies. The downside: if a backend goes down, all clients mapped to it are redistributed, and when the backend returns, the hash distribution shifts again.
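
A minimal sketch of the idea (SHA-256 stands in for whatever hash a real balancer uses; the modulo step is also why a pool-size change reshuffles most clients):

```python
import hashlib

def ip_hash(client_ip, backends):
    """Deterministically map a client IP to a backend."""
    h = int.from_bytes(hashlib.sha256(client_ip.encode()).digest()[:4], "big")
    return backends[h % len(backends)]

backends = ["srv-1", "srv-2", "srv-3"]
# The same client always lands on the same server:
assert ip_hash("203.0.113.5", backends) == ip_hash("203.0.113.5", backends)
```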

Consistent Hashing

A more sophisticated hashing approach that minimizes disruption when backends are added or removed. In consistent hashing, each backend is assigned multiple points on a hash ring. A request key (such as a URL or session ID) is hashed to a point on the ring, and the request goes to the nearest backend point clockwise. When a backend is removed, only the requests mapped to that backend's segment are redistributed; all other mappings remain stable.

Consistent hashing is essential for caching layers, where redistributing keys across all backends would invalidate caches and cause a "thundering herd" of cache misses hitting the origin servers simultaneously. CDNs use consistent hashing extensively to maintain cache locality.
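
A minimal hash ring with virtual nodes makes the stability property concrete: removing a backend only remaps the keys that backend owned. This is a sketch under simplifying assumptions (SHA-256 for the ring points, 100 virtual nodes per backend, synthetic keys):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes (a minimal sketch)."""

    def __init__(self, backends, vnodes=100):
        # Each backend occupies `vnodes` points on the ring.
        self.ring = sorted((self._hash(f"{b}#{i}"), b)
                           for b in backends for i in range(vnodes))

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def lookup(self, key):
        """Walk clockwise from the key's point to the nearest backend point."""
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

full = HashRing(["cache-1", "cache-2", "cache-3"])
smaller = HashRing(["cache-1", "cache-2"])  # cache-3 removed

keys = [f"k{i}" for i in range(1000)]
moved = [k for k in keys
         if full.lookup(k) != "cache-3" and full.lookup(k) != smaller.lookup(k)]
assert moved == []  # only keys owned by cache-3 were redistributed
```

The guarantee follows from the construction: removing cache-3 deletes only its points, so every other key's nearest clockwise point is unchanged.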

Least Response Time

Tracks the average response time from each backend and sends requests to the fastest one. This is an L7-only algorithm (L4 load balancers cannot measure application response times). It works well for HTTP APIs where latency is the primary metric, but it can starve slow backends that are slow due to cold caches — creating a feedback loop where cold backends never warm up because they never receive traffic.

Direct Server Return (DSR)

Direct Server Return is a load balancing architecture where the return traffic from the backend bypasses the load balancer entirely and goes directly to the client. In a normal (non-DSR) setup, all traffic flows through the load balancer in both directions. DSR eliminates the return-path bottleneck.

How DSR Works

In DSR mode, the load balancer receives the incoming packet, rewrites the destination MAC address (L2 DSR) or encapsulates the packet in an IP-in-IP or GRE tunnel (L3 DSR), and forwards it to the backend. The backend must be configured with the VIP on a loopback interface so that it accepts the packet and responds directly to the client using the VIP as the source address. The client does not know the load balancer was involved.

DSR is exclusively an L4 technique. Because the return traffic does not traverse the load balancer, the load balancer never sees the response and cannot inspect or modify application-layer data. But the performance gains are significant: since response bodies are typically much larger than requests (think video streaming, large API responses, file downloads), removing the load balancer from the return path can reduce its bandwidth requirements by 10x or more.

When to Use DSR

DSR pays off when responses dwarf requests and no application-layer processing is needed: video and media streaming, large file downloads, and high-volume L4 services where the load balancer would otherwise become a bandwidth bottleneck. It is not an option when you need TLS termination, content-based routing, or anything else that requires the load balancer to see the response.

Health Checks

A load balancer must know which backends are healthy and able to serve traffic. Health checking is the mechanism for this. There are two fundamental approaches: active and passive.

Active Health Checks

The load balancer periodically sends probe requests to each backend and evaluates the response. Configuration typically includes:

  - Interval: how often to probe each backend
  - Timeout: how long to wait for a response before counting the probe as failed
  - Unhealthy threshold: how many consecutive failures mark a backend as down
  - Healthy threshold: how many consecutive successes bring it back into rotation

L4 health checks typically open a TCP connection to the backend port. If the three-way handshake completes, the backend is considered healthy. L7 health checks send an HTTP request (usually GET /health or GET /ready) and verify that the response has the expected status code (200) and optionally the expected body content.
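
The probe itself is just a TCP connect or an HTTP GET; the interesting part is the threshold logic that turns a stream of probe results into an up/down decision. A sketch of that state machine, with hypothetical default thresholds (real balancers make these configurable):

```python
class HealthState:
    """Track a backend's health from a stream of probe results.

    Marked down after `unhealthy_threshold` consecutive failures,
    restored after `healthy_threshold` consecutive successes.
    """

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy = True
        self.streak = 0  # consecutive results disagreeing with current state
        self.ht, self.ut = healthy_threshold, unhealthy_threshold

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self.streak = 0  # result agrees with current state; reset streak
            return self.healthy
        self.streak += 1
        if (not probe_ok and self.streak >= self.ut) or \
           (probe_ok and self.streak >= self.ht):
            self.healthy = probe_ok  # enough contrary evidence: flip state
            self.streak = 0
        return self.healthy

hs = HealthState()
assert hs.record(False) is True    # one failure: still up
assert hs.record(False) is True    # two failures: still up
assert hs.record(False) is False   # third consecutive failure: marked down
assert hs.record(True) is False    # one success: still down
assert hs.record(True) is True     # second consecutive success: restored
```

Requiring consecutive results in both directions prevents a single dropped probe (or a single lucky success) from flapping the backend in and out of the pool.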

L7 health checks are strictly more powerful: a backend might accept TCP connections but return 500 errors on every request due to a failed database connection. An L4 check would miss this; an L7 check would catch it.

Passive Health Checks

Also called outlier detection. Instead of sending probes, the load balancer monitors real traffic and marks backends as unhealthy if they exceed error thresholds. For example, if a backend returns five consecutive 5xx errors, it is temporarily ejected from the pool. Envoy Proxy popularized this approach, calling it "outlier ejection."

Passive checks react faster (they detect failures from real traffic rather than waiting for the next probe), but they require actual traffic to detect problems. A backend that receives no traffic cannot be passively checked. Best practice is to use both active and passive health checks together.
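
The ejection logic can be sketched as a per-backend error counter with a timed ejection window. The thresholds (five consecutive 5xx, 30-second ejection) mirror the example above but are assumptions, not any particular proxy's defaults:

```python
import time

class OutlierDetector:
    """Eject a backend after N consecutive 5xx responses (a sketch of
    passive health checking / outlier detection)."""

    def __init__(self, consecutive_5xx=5, ejection_seconds=30.0):
        self.limit = consecutive_5xx
        self.ejection_seconds = ejection_seconds
        self.errors = {}   # backend -> consecutive 5xx count
        self.ejected = {}  # backend -> time of ejection

    def observe(self, backend, status, now=None):
        now = time.monotonic() if now is None else now
        if 500 <= status <= 599:
            self.errors[backend] = self.errors.get(backend, 0) + 1
            if self.errors[backend] >= self.limit:
                self.ejected[backend] = now  # temporarily remove from pool
        else:
            self.errors[backend] = 0  # any success resets the streak

    def in_pool(self, backend, now=None):
        now = time.monotonic() if now is None else now
        if backend in self.ejected:
            if now - self.ejected[backend] < self.ejection_seconds:
                return False
            del self.ejected[backend]  # ejection period over; re-admit
            self.errors[backend] = 0
        return True

od = OutlierDetector()
for _ in range(5):
    od.observe("srv-1", 503, now=0.0)
assert od.in_pool("srv-1", now=1.0) is False   # ejected after 5 straight 5xx
assert od.in_pool("srv-1", now=31.0) is True   # re-admitted after 30 s
```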

Session Persistence (Sticky Sessions)

By default, most load balancing algorithms distribute requests without regard for previous decisions. This is a problem for stateful applications where a user's session data lives in server memory and the user must always reach the same backend.

Cookie-Based Persistence

The load balancer (L7 only) injects a cookie into the response that identifies the assigned backend. On subsequent requests, the client sends this cookie back, and the load balancer routes to the same backend. If the backend goes down, the cookie becomes invalid and the user is assigned to a new backend (losing session state).
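
The mechanism is simple enough to sketch end to end: route by the cookie if present and valid, otherwise pick a backend normally and tell the client to remember it. The cookie name lb_backend and the backend names are hypothetical:

```python
import itertools

backends = ["srv-1", "srv-2", "srv-3"]
rr = itertools.cycle(backends)

def route(cookies):
    """Route by the 'lb_backend' cookie; assign a backend if it is absent.

    Returns (backend, set_cookie): set_cookie is the cookie the LB should
    inject into the response, or None if the client already had a valid one.
    """
    assigned = cookies.get("lb_backend")
    if assigned in backends:  # cookie present and backend still in pool
        return assigned, None
    backend = next(rr)  # no (or stale) cookie: pick a backend normally
    return backend, f"lb_backend={backend}"

first, set_cookie = route({})
assert set_cookie == f"lb_backend={first}"
again, nothing = route({"lb_backend": first})
assert again == first and nothing is None
```

The "backend still in pool" check is what makes the cookie degrade gracefully: if the assigned backend disappears, the client is silently reassigned (at the cost of its session state, as noted above).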

Source IP Persistence

The load balancer (L4 or L7) maps the client's source IP to a backend and maintains this mapping for a configured duration. This breaks when multiple users share the same IP (corporate NATs, CDN edge proxies) because all users behind that IP are forced to the same backend.

The Modern Approach

Modern architectures avoid sticky sessions entirely by externalizing session state to a shared store like Redis or a database. This makes backends truly stateless and interchangeable, enabling any backend to serve any request. This is the pattern used by all major cloud-native applications.

SSL/TLS Termination

TLS termination refers to where in the request path the TLS encryption is terminated. There are three common patterns:

  - TLS passthrough: the load balancer (L4) forwards encrypted traffic untouched; the backends hold the certificates and terminate TLS themselves
  - TLS termination: the load balancer decrypts traffic and forwards plaintext to the backends over a trusted network
  - TLS bridging (termination + re-encryption): the load balancer decrypts, inspects, then re-encrypts traffic toward the backends, keeping every hop encrypted

ECMP: Load Balancing at the Network Layer

Equal-Cost Multi-Path routing (ECMP) is a load balancing mechanism built into network routers and closely tied to BGP. When a router has multiple equally-good paths to the same destination (same AS path length, same local preference), it can distribute traffic across all of them instead of picking just one.

How ECMP Works

The router hashes each packet's flow identifier (typically the 5-tuple: source IP, destination IP, source port, destination port, protocol) and uses the hash to select one of the equal-cost paths. All packets in the same flow follow the same path, preventing out-of-order delivery. Different flows are distributed across the available paths.
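
A sketch of the per-flow path selection (SHA-256 stands in for the router's hardware hash function; path names are illustrative):

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, paths):
    """Hash the 5-tuple and pick one of the equal-cost paths."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return paths[h % len(paths)]

paths = ["next-hop-1", "next-hop-2", "next-hop-3"]
# Every packet of one flow hashes to the same path, so no reordering:
p1 = ecmp_path("203.0.113.5", "10.0.0.1", 52431, 443, "tcp", paths)
p2 = ecmp_path("203.0.113.5", "10.0.0.1", 52431, 443, "tcp", paths)
assert p1 == p2
```

Because the hash is stateless, the router needs no per-flow table — determinism alone keeps flows on stable paths.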

ECMP is especially important for anycast deployments. When a prefix like 1.1.1.0/24 is announced from multiple locations, ECMP at each router along the path ensures that traffic is spread across multiple next hops rather than all flowing through a single one.

ECMP and Load Balancers

In large-scale deployments, ECMP is used to scale the load balancers themselves. Multiple load balancer instances are placed behind a router, each announcing the same VIP. The router distributes incoming traffic across the load balancer instances via ECMP. This pattern, sometimes called "anycast load balancing," is how hyperscalers like Google and Facebook distribute traffic at scale. It combines network-layer load balancing (ECMP across LB instances) with application-layer load balancing (L7 within each LB instance).

[Diagram: ECMP + load balancer architecture. Internet traffic reaches a border router running BGP + ECMP; ECMP hashing distributes flows across three LB instances, all announcing the same VIP (10.0.0.1, anycast). Each LB instance then balances its flows across backends web-01 through web-09. ECMP at the router distributes flows; each LB handles L4/L7 for its flows.]

Hardware vs Software Load Balancers

The load balancing landscape has shifted dramatically from dedicated hardware appliances to software-based solutions. Understanding the history and current state helps explain the design choices behind modern tools.

Hardware Load Balancers

For decades, load balancing was dominated by specialized hardware appliances from vendors like F5 (BIG-IP), Citrix (NetScaler), and A10 Networks. These devices use custom ASICs and FPGAs to process packets at high speed. They are expensive (often $50,000-$500,000+), proprietary, and require specialized skills to configure.

Hardware load balancers still exist in enterprises and service providers that value vendor support contracts and predictable performance. But the trend is decisively toward software.

Software Load Balancers

Modern software load balancers run on commodity servers and are often open source. They have reached performance levels that rival or exceed hardware appliances, especially for L7 workloads. The most important ones:

HAProxy

HAProxy is the gold standard for high-performance L4/L7 load balancing. Written in C, it is single-process, event-driven, and extremely efficient. HAProxy handles millions of concurrent connections and hundreds of thousands of requests per second per core. It has been in production use since 2001 and is used by GitHub, Reddit, Airbnb, and many other large-scale services. Its configuration language is purpose-built for load balancing and offers fine-grained control over every aspect of traffic management.

NGINX

NGINX started as a web server but its reverse proxy and load balancing capabilities have made it one of the most deployed load balancers in the world. NGINX excels at L7 load balancing and is often used as a combined web server, reverse proxy, and load balancer. The open-source version covers most use cases; NGINX Plus adds active health checks, session persistence, and dynamic reconfiguration. NGINX is behind many of the world's busiest websites.

Envoy Proxy

Envoy was built at Lyft and is now a CNCF graduated project. It was designed from the ground up for modern microservice architectures. Key features that distinguish Envoy from older load balancers: hot restart (zero-downtime binary upgrades), a rich xDS API for dynamic configuration (no config file reloads), first-class support for gRPC and HTTP/2, built-in distributed tracing, and advanced outlier detection. Envoy is the default data plane for service mesh platforms like Istio.

Linux IPVS

IPVS (IP Virtual Server) is a transport-layer load balancer built into the Linux kernel. It operates at L4 and can handle extremely high throughput because it processes packets in kernel space without copying them to user space. IPVS supports DSR, NAT, and tunneling modes. It is the load balancing backend for Kubernetes Services (kube-proxy in IPVS mode) and for many large-scale L4 deployments. Because it runs in the kernel, it has fewer features than user-space load balancers but significantly higher raw performance.

Cloud Load Balancers

Every major cloud provider offers managed load balancing services that abstract away the operational complexity:

  - AWS: Application Load Balancer (L7) and Network Load Balancer (L4)
  - Google Cloud: Cloud Load Balancing, a globally distributed L4/L7 service
  - Azure: Azure Load Balancer (L4) and Application Gateway (L7)

Cloud load balancers are built on the same fundamental principles as software load balancers but add managed TLS certificate provisioning, auto-scaling, DDoS protection, and integration with the cloud provider's network fabric. Under the hood, Google's load balancer uses Maglev (a consistent-hashing L4 balancer), and AWS NLB uses a similar flow-based design with Hyperplane.

Global Server Load Balancing (GSLB)

All the load balancing discussed so far operates within a single site or region. Global Server Load Balancing (GSLB) distributes traffic across multiple geographic locations — entire data centers or regions rather than individual servers. GSLB is what determines whether your request goes to a data center in Frankfurt or one in Virginia.

DNS-Based GSLB

The most common GSLB mechanism uses DNS to steer traffic. When a client resolves a domain name, the authoritative DNS server returns different IP addresses based on:

  - Geography: the resolver's (or, with EDNS Client Subnet, the client's) approximate location
  - Measured latency or network proximity to each data center
  - Data center health: unhealthy sites are dropped from responses
  - Weighted distribution: operator-configured traffic splits across regions

DNS-based GSLB has a limitation: DNS responses are cached by resolvers and clients, so changes take time to propagate (bounded by the TTL). This means failover is not instant — it typically takes 30 seconds to several minutes.

Anycast-Based GSLB

Anycast provides an alternative to DNS-based GSLB. By announcing the same IP prefix from multiple locations via BGP, the network routing system itself directs each client to the nearest instance. Anycast failover is driven by BGP convergence, which typically takes seconds rather than minutes. This is why Cloudflare and Google use anycast for their critical services — it provides faster, more reliable GSLB than DNS alone.

The trade-off is that anycast routing is based on BGP path selection, which optimizes for network topology rather than user-perceived latency. A data center that is topologically close (few AS hops) may not always be the lowest-latency option. Sophisticated operators combine anycast with performance-based DNS steering to get the best of both worlds.

Load Balancing and BGP

Load balancing and BGP are deeply interconnected. BGP is itself a form of traffic engineering: by controlling which prefixes are announced from which locations, and by using techniques like AS path prepending and selective announcements, network operators can influence how traffic is distributed across their infrastructure.

Large-scale deployments frequently use BGP as the glue between layers:

  - L4 load balancer instances announce their VIP to the border routers over BGP, and the routers spread flows across them with ECMP
  - An instance that fails (or is drained for maintenance) simply withdraws its announcement and stops receiving traffic
  - The same prefix announced from multiple sites (anycast) turns BGP path selection into a global traffic-steering mechanism

You can observe these patterns in production by looking up the prefixes of major services. Look up 1.1.1.1 (Cloudflare DNS) and examine the routes: the prefix 1.1.1.0/24 is announced from hundreds of locations, each serving as a load-balanced entry point.

Putting It All Together

A production-grade deployment at scale typically layers multiple load balancing techniques:

  1. GSLB (DNS + anycast) steers users to the nearest data center
  2. ECMP at the border router distributes flows across multiple load balancer instances
  3. L4 load balancing (IPVS or equivalent) terminates the flow and picks a backend pool
  4. L7 load balancing (Envoy, NGINX, or HAProxy) parses HTTP and routes to the correct service
  5. Service mesh sidecar (Envoy again) handles inter-service load balancing, retries, and circuit breaking

Each layer adds a capability that the layer below cannot provide, and each layer has its own health checking, failover, and traffic distribution logic. The result is a system where any individual component — a server, a load balancer, an entire data center — can fail without users noticing.

Explore Network Infrastructure

Load balancers sit at the heart of internet infrastructure, working alongside BGP routing, anycast, and CDNs to deliver fast, reliable services. You can explore the routing infrastructure behind major load-balanced services using the looking glass:

Look up any IP address or ASN to see the BGP routes, AS paths, and network topology behind the load-balanced services you use every day.
