gRPC Load Balancing: Strategies and Patterns
gRPC is built on top of HTTP/2, and that single design choice makes load balancing fundamentally harder than it is for traditional HTTP/1.1 services. A naive deployment behind a standard TCP load balancer will appear to work, pass every health check, and still route 100% of traffic to a single backend. Understanding why this happens, and the spectrum of strategies to fix it, is essential knowledge for anyone operating gRPC services at scale.
This article covers the full landscape of gRPC load balancing: why Layer 4 fails, how Layer 7 proxies solve it, client-side approaches, the lookaside load balancing protocol, the xDS API, health checking, connection draining, retries, and hedging.
Why Layer 4 Load Balancing Fails for gRPC
Traditional load balancers operate at Layer 4 (TCP). They accept an incoming TCP connection, pick a backend using a balancing algorithm (round-robin, least-connections, etc.), and establish a corresponding upstream connection. All bytes on that connection flow to the same backend for the lifetime of the connection. For HTTP/1.1, this works well enough because clients open many short-lived connections and each request typically gets its own connection, so load distributes naturally across backends.
HTTP/2 changes this equation entirely. The protocol was designed to multiplex many concurrent requests over a single TCP connection. gRPC inherits this behavior: a gRPC client establishes one HTTP/2 connection to a server and sends all RPCs as multiplexed streams on that connection. The connection stays open for the lifetime of the client process, which may be hours or days.
When you place a Layer 4 load balancer in front of gRPC backends, here is what happens:
- The client opens a TCP connection. The L4 load balancer forwards it to Backend A.
- The client sends all RPCs over that single connection. Every RPC goes to Backend A.
- Backends B, C, and D sit idle. The load balancer still reports every backend as healthy, because health checks are independent of traffic, and its connection-level metrics show nothing unusual: a single connection to a single backend looks perfectly normal.
- If you scale up and add Backends E and F, nothing changes. The existing connection still goes to Backend A.
The load balancer's connection-level metrics look fine. But the request-level distribution is catastrophically skewed. In a cluster with ten backends, one backend may handle 100% of the load while nine are idle. Even with many clients, the distribution is determined by which backend each client's single connection landed on, not by any per-request balancing.
This problem is not theoretical. It is the single most common operational surprise teams encounter when deploying gRPC for the first time. Load balancers like AWS NLB, bare HAProxy in TCP mode, or any iptables/IPVS-based solution (including the default Kubernetes kube-proxy in iptables mode) will all exhibit this behavior with gRPC.
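The skew is easy to demonstrate with a back-of-the-envelope simulation. The sketch below (client, backend, and RPC counts are invented for illustration) compares per-request distribution when each client is pinned to one backend at connect time, as an L4 balancer does, versus balanced per RPC:

```go
package main

import "fmt"

// distribute simulates per-request counts for L4 (connection-pinned)
// versus L7 (per-RPC round-robin) balancing. At connect time an L4
// balancer pins the whole connection to one backend; every RPC on
// that connection then follows it.
func distribute(clients, backends, rpcsPerClient int, perRequest bool) []int {
	counts := make([]int, backends)
	rpc := 0
	for c := 0; c < clients; c++ {
		pinned := c % backends // backend chosen once, at connection time
		for r := 0; r < rpcsPerClient; r++ {
			if perRequest {
				counts[rpc%backends]++ // L7: each RPC balanced independently
			} else {
				counts[pinned]++ // L4: every RPC follows the connection
			}
			rpc++
		}
	}
	return counts
}

func main() {
	// 3 clients, 4 backends: with L4 balancing, backend 3 never sees traffic.
	fmt.Println("L4:", distribute(3, 4, 1000, false))
	fmt.Println("L7:", distribute(3, 4, 1000, true))
}
```

With three clients and four backends, the L4 case leaves one backend completely idle no matter how long the system runs, while per-request balancing spreads the same 3,000 RPCs evenly.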
Layer 7 Load Balancing: The Proxy-Based Solution
The fix is to move the load balancing decision from the connection level to the request level. A Layer 7 (application-level) load balancer terminates the HTTP/2 connection from the client, inspects each individual gRPC request (which is an HTTP/2 stream), and routes it to a backend independently. This means one client connection can have its RPCs spread across all available backends.
Envoy Proxy
Envoy is the gold standard for gRPC load balancing. It was built at Lyft specifically to solve microservice communication challenges, and gRPC is a first-class citizen. Envoy understands the HTTP/2 framing, can parse gRPC metadata and trailers, and supports every load balancing algorithm you need.
A minimal Envoy configuration for gRPC load balancing looks like this:
static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: AUTO
          stat_prefix: ingress
          route_config:
            virtual_hosts:
            - name: grpc_service
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: grpc_backends }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: grpc_backends
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: grpc_backends
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: backend1, port_value: 50051 }
        - endpoint:
            address:
              socket_address: { address: backend2, port_value: 50051 }
The critical detail is http2_protocol_options on the upstream cluster. This tells Envoy to speak HTTP/2 to the backends, which is required for gRPC. Envoy then multiplexes upstream connections, distributing individual gRPC streams across backends according to the configured lb_policy.
Envoy supports several load balancing algorithms for gRPC traffic:
- ROUND_ROBIN -- distributes RPCs sequentially across backends.
- LEAST_REQUEST -- sends each RPC to the backend with the fewest active requests. This is generally the best default for gRPC because RPCs vary in cost.
- RING_HASH -- consistent hashing based on a request header or other key. Useful when you need session affinity (e.g., routing all RPCs for a user to the same backend).
- MAGLEV -- Google's consistent hashing algorithm, with better distribution properties than ring hash at the cost of more memory.
- RANDOM -- picks a random backend for each RPC. Surprisingly effective when the number of backends is large.
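As a sketch of what LEAST_REQUEST-style selection does per RPC, here is a minimal picker over in-flight request counts. Note that Envoy's actual implementation samples two backends at random and picks the less loaded one ("power of two choices") rather than scanning the whole list; the names and counts below are illustrative:

```go
package main

import "fmt"

// backend tracks in-flight RPCs. A least-request picker sends the
// next RPC to the backend with the fewest active requests, so slow
// or expensive RPCs naturally push new traffic elsewhere.
type backend struct {
	name   string
	active int
}

func pickLeastRequest(backends []*backend) *backend {
	best := backends[0]
	for _, b := range backends[1:] {
		if b.active < best.active {
			best = b
		}
	}
	return best
}

func main() {
	pool := []*backend{{"a", 5}, {"b", 2}, {"c", 7}}
	chosen := pickLeastRequest(pool)
	chosen.active++ // RPC dispatched; decrement again when it completes
	fmt.Println(chosen.name)
}
```

The in-flight counter is the whole trick: it is incremented at dispatch and decremented on completion, so a backend stuck on slow RPCs accumulates active requests and stops attracting new ones.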
nginx
nginx added native gRPC proxying support in version 1.13.10 (March 2018). It can terminate HTTP/2 connections from clients and proxy gRPC requests to upstream backends. The configuration is straightforward:
upstream grpc_backends {
    server backend1:50051;
    server backend2:50051;
    server backend3:50051;
}

server {
    listen 8080 http2;

    location / {
        grpc_pass grpc://grpc_backends;
    }
}
nginx balances gRPC requests per-RPC across the upstream servers. However, nginx has a significant limitation compared to Envoy: its gRPC health checking, observability, and advanced routing capabilities are less mature. nginx Plus (the commercial version) adds gRPC health checks, but the open-source version requires passive health checking (detecting failures after they happen).
HAProxy
HAProxy 1.9+ supports HTTP/2 and can act as a gRPC-aware Layer 7 proxy. The key is configuring it in HTTP mode (not TCP mode) with HTTP/2 enabled on both the frontend and backend:
frontend grpc_front
    bind *:8080 proto h2
    default_backend grpc_servers

backend grpc_servers
    balance roundrobin
    server backend1 backend1:50051 proto h2
    server backend2 backend2:50051 proto h2
    server backend3 backend3:50051 proto h2
The proto h2 directive is essential on both sides. Without it on the backend, HAProxy will attempt HTTP/1.1 to the upstream, which gRPC cannot use. HAProxy's per-request balancing in HTTP mode ensures proper distribution across gRPC backends.
Client-Side Load Balancing
Proxy-based load balancing adds a network hop, introduces a potential single point of failure, and adds latency. For high-performance or latency-sensitive gRPC deployments, client-side load balancing eliminates the proxy entirely. The gRPC client itself maintains connections to multiple backends and distributes RPCs across them.
gRPC has a built-in load balancing framework with a pluggable architecture. The system has three components:
- Name resolver -- translates a service name (like dns:///my-service) into a list of backend addresses.
- Load balancing policy -- decides which backend to send each RPC to, given the list of addresses.
- Subchannel management -- maintains the actual HTTP/2 connections (called subchannels) to each backend.
pick_first
pick_first is the default load balancing policy in gRPC. It takes the list of addresses from the name resolver, tries to connect to the first one, and sends all RPCs to that single backend. If the connection fails, it tries the next address. This is effectively no load balancing at all -- it is connection-level failover.
pick_first exists because many gRPC deployments use a proxy-based load balancer (where the client only sees a single virtual IP), or the service has only one backend. For these cases, the overhead of maintaining multiple connections is unnecessary.
round_robin
round_robin creates a subchannel (HTTP/2 connection) to every address returned by the name resolver and distributes RPCs across them in round-robin order. This is the simplest form of true client-side load balancing.
In Go, enabling round-robin is a one-line change:
conn, err := grpc.Dial(
    "dns:///my-service.example.com",
    grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
)
With DNS-based resolution, the client resolves the service name to multiple A/AAAA records and creates a connection to each. RPCs are distributed round-robin across these connections. When DNS records change (backends are added or removed), the client updates its connection pool.
The limitation of round_robin is that it treats all backends equally. It does not account for differences in backend capacity, current load, or response latency. A slow backend receives the same share of traffic as a fast one.
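Internally, the policy amounts to stepping an atomic counter across the set of READY subchannels. A minimal sketch of that picker logic, with subchannels represented as plain address strings (the real picker lives inside the gRPC library):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rrPicker mimics what gRPC's round_robin policy does for each RPC:
// an atomic counter indexes into the list of READY subchannels, so
// concurrent RPCs on the same channel are spread evenly without locks.
type rrPicker struct {
	subchannels []string
	next        uint64
}

func (p *rrPicker) Pick() string {
	n := atomic.AddUint64(&p.next, 1)
	return p.subchannels[(n-1)%uint64(len(p.subchannels))]
}

func main() {
	p := &rrPicker{subchannels: []string{
		"10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051",
	}}
	for i := 0; i < 4; i++ {
		fmt.Println(p.Pick()) // cycles .1, .2, .3, then wraps to .1
	}
}
```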
weighted_round_robin
The weighted_round_robin policy extends round-robin with dynamic weight adjustment. Backends periodically report their utilization through an out-of-band mechanism (typically the ORCA load reporting protocol), and the client adjusts the proportion of traffic each backend receives. A backend reporting high CPU utilization will receive fewer RPCs, while an underloaded backend receives more.
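A sketch of how ORCA-style reports might be turned into scheduling weights, assuming the simple formulation weight = qps / utilization in the spirit of gRPC's weighted_round_robin design (the library's actual algorithm also applies error-rate penalties and staleness decay; the numbers below are invented):

```go
package main

import "fmt"

// report is a simplified stand-in for an ORCA load report.
type report struct {
	addr string
	qps  float64 // queries served per second
	util float64 // CPU utilization, 0.0-1.0
}

// weights converts load reports into relative scheduling weights:
// a backend serving the same qps at lower utilization has spare
// capacity and therefore gets a higher weight.
func weights(reports []report) map[string]float64 {
	w := make(map[string]float64)
	for _, r := range reports {
		if r.util > 0 {
			w[r.addr] = r.qps / r.util
		}
	}
	return w
}

func main() {
	w := weights([]report{
		{"a", 100, 0.50}, // weight 200
		{"b", 100, 0.25}, // weight 400: same qps at half the CPU
	})
	fmt.Println(w["a"], w["b"])
}
```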
xDS-Based Policies
The most sophisticated client-side load balancing uses the xDS API (discussed in detail below). xDS-aware gRPC clients can implement policies like priority-based failover, locality-aware routing, and circuit breaking -- all without a proxy in the data path. The xDS control plane pushes configuration to the client, and the client's built-in load balancer executes the policy.
Lookaside Load Balancing (gRPC-LB Protocol)
Between the simplicity of client-side round-robin and the complexity of a full xDS control plane, there is an intermediate approach: lookaside load balancing, originally defined in the gRPC-LB protocol (sometimes called the "grpclb" protocol).
In this model, the client does not pick backends itself. Instead, it contacts a load balancer service (the "balancer" or "lookaside LB") to get a list of backend addresses. The flow works like this:
- The client resolves the service name via DNS. The DNS response includes an SRV record pointing to the load balancer, or the name resolver is configured to return the balancer's address.
- The client opens a gRPC stream to the load balancer, sending information about itself (client stats, requested service name).
- The load balancer responds with a server list -- an ordered list of backend addresses the client should use.
- The client connects directly to the backends in the server list and sends RPCs to them, typically using round-robin across the list.
- The load balancer can update the server list at any time by sending a new response on the stream. The client updates its connections accordingly.
- The client periodically reports load statistics (RPCs sent, latency, errors) back to the load balancer, which uses this information to compute updated server lists.
This design separates the data plane (direct client-to-backend RPCs) from the control plane (the load balancer deciding which backends each client should use). The load balancer has a global view of all backends and all clients, so it can make intelligent decisions: directing traffic away from overloaded backends, implementing weighted distribution, or draining backends for maintenance.
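For reference, the grpclb protocol is itself defined as a gRPC service. The sketch below is abridged and simplified from the grpc.lb.v1 load_balancer.proto definition; consult the actual file for the full set of messages and fields:

```proto
// Abridged from the legacy grpc.lb.v1 protocol definition.
service LoadBalancer {
  // One long-lived bidirectional stream per client: the client sends
  // its identity and periodic stats, the balancer pushes server lists.
  rpc BalanceLoad(stream LoadBalanceRequest) returns (stream LoadBalanceResponse);
}

message ServerList {
  repeated Server servers = 1;   // ordered backend list for this client
}

message Server {
  bytes ip_address = 1;
  int32 port = 2;
  string load_balance_token = 3; // echoed by the client for accounting
  bool drop = 4;                 // instruct the client to drop the RPC
}
```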
Google uses this pattern internally at enormous scale. The external load balancer has visibility into backend health, capacity, and geographic location, and it produces per-client server lists optimized for both load distribution and latency.
The gRPC-LB protocol is now considered legacy, superseded by the more general xDS API. But the architectural pattern -- a dedicated control plane service that tells clients where to send traffic -- lives on in xDS.
xDS API and Control Plane
The xDS API is the configuration protocol originally developed for Envoy proxy but now adopted by gRPC clients directly. "xDS" stands for "x Discovery Service," where x is a placeholder for the various resource types: LDS (Listener), RDS (Route), CDS (Cluster), EDS (Endpoint), and SDS (Secret).
When gRPC adopted xDS, it gained the ability to receive the same rich configuration that Envoy uses -- but directly in the client library, without a proxy. A gRPC client configured to use xDS connects to a control plane (like Istio's istiod, Traffic Director, or a custom xDS server) and receives:
- EDS (Endpoint Discovery) -- the list of backend addresses, grouped by locality (zone, region) with weights and health status. This replaces DNS-based backend discovery with a dynamic, push-based system.
- CDS (Cluster Discovery) -- cluster-level configuration including the load balancing policy, circuit breaker thresholds, connection limits, and outlier detection settings.
- RDS (Route Discovery) -- routing rules that determine which cluster handles a given RPC based on the service name, method name, headers, or other metadata. This enables traffic splitting (e.g., send 5% of traffic to a canary cluster).
- LDS (Listener Discovery) -- listener configuration including the HTTP filter chain, which can include authentication, rate limiting, and fault injection.
xDS enables features that are impossible with simple client-side load balancing:
- Locality-aware routing -- prefer backends in the same availability zone, failing over to other zones only when local capacity is insufficient. This minimizes cross-zone data transfer costs and latency.
- Priority-based failover -- define primary and secondary backend groups. Traffic only flows to the secondary group when the primary is unhealthy.
- Traffic splitting -- route a percentage of traffic to a new version of a service for canary deployments.
- Circuit breaking -- limit the maximum number of concurrent requests, connections, or retries to a backend to prevent cascading failures.
- Outlier detection -- automatically eject backends that return too many errors, re-admitting them after a cooldown period.
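Several of these features hang off the EDS resource. A much-abridged ClusterLoadAssignment sketch showing localities, priorities, and weights (addresses and values are illustrative, and real resources carry more fields):

```yaml
# Abridged EDS ClusterLoadAssignment: two localities with priorities.
cluster_name: grpc_backends
endpoints:
- locality: { region: us-central1, zone: us-central1-a }
  priority: 0              # preferred: same-zone traffic stays local
  load_balancing_weight: 3
  lb_endpoints:
  - endpoint:
      address:
        socket_address: { address: 10.0.1.5, port_value: 50051 }
- locality: { region: us-central1, zone: us-central1-b }
  priority: 1              # receives traffic only when priority 0 is unhealthy
  load_balancing_weight: 1
  lb_endpoints:
  - endpoint:
      address:
        socket_address: { address: 10.0.2.7, port_value: 50051 }
```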
Envoy as xDS Client
In a service mesh architecture, Envoy runs as a sidecar proxy alongside each service instance. An xDS control plane (such as Istio's istiod) pushes configuration to every Envoy sidecar. When Service A calls Service B, the request goes through Service A's local Envoy sidecar, which uses its xDS configuration to select a backend instance of Service B. The sidecar handles load balancing, retries, timeouts, and observability transparently.
This is the "service mesh" model: the application code is unaware of load balancing. It sends RPCs to localhost, and the sidecar handles everything. The advantage is that any language or framework gets the same sophisticated load balancing. The disadvantage is the added latency of two extra network hops (client to sidecar, sidecar to backend) and the operational complexity of managing sidecar proxies.
Proxyless gRPC (xDS in the Client)
Proxyless gRPC eliminates the sidecar. The gRPC client library itself speaks the xDS protocol, connecting directly to the control plane and applying the configuration internally. The client opens connections directly to backends -- no proxy, no sidecar, no extra hops.
Google's Traffic Director and Istio (1.14+) both support proxyless gRPC. The client is configured with a bootstrap file specifying the xDS control plane server:
{
  "xds_servers": [
    {
      "server_uri": "trafficdirector.googleapis.com:443",
      "channel_creds": [{"type": "google_default"}],
      "server_features": ["xds_v3"]
    }
  ],
  "node": {
    "id": "projects/12345/networks/default/nodes/client-abc",
    "locality": {
      "zone": "us-central1-a"
    }
  }
}
The client then uses the xds:/// name resolver scheme instead of dns:///:
// Requires a blank import of google.golang.org/grpc/xds to register the resolver:
conn, err := grpc.Dial("xds:///my-service.example.com")
This gives you the full power of Envoy's load balancing -- locality awareness, outlier detection, traffic splitting -- with zero proxy overhead. The tradeoff is that only languages with mature xDS support in the gRPC library can use it (Go, Java, and C++ have the best support as of 2025).
Health Checking
Load balancing is only as good as your health checking. Sending traffic to an unhealthy backend defeats the purpose of balancing entirely. gRPC defines a standard Health Checking Protocol (in the grpc.health.v1 package) that both proxies and clients can use to determine backend health.
The protocol is defined as a gRPC service:
service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
  rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
}

message HealthCheckRequest {
  string service = 1;
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
    SERVICE_UNKNOWN = 3;
  }
  ServingStatus status = 1;
}
The Check RPC is a simple request-response health check: the caller sends a service name and gets back a serving status. The Watch RPC is a server-streaming call: the caller sends a service name and receives status updates whenever the serving status changes. This is more efficient than polling with repeated Check calls.
The service field in the request allows per-service health checking on servers that host multiple gRPC services. An empty string checks the overall server health. A server that is healthy for one service but overloaded for another can report different statuses for each.
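The Check/Watch semantics can be modeled with a small in-process registry: a point-in-time lookup, plus subscriber channels notified on every transition. This is a behavioral sketch, not the library's implementation (grpc-go ships an equivalent server in google.golang.org/grpc/health):

```go
package main

import (
	"fmt"
	"sync"
)

// healthRegistry models grpc.health.v1 semantics: per-service statuses,
// a point-in-time Check, and Watch streams that see the current status
// immediately and every change thereafter.
type healthRegistry struct {
	mu       sync.Mutex
	statuses map[string]string        // service -> SERVING / NOT_SERVING
	watchers map[string][]chan string // Watch subscribers per service
}

func newHealthRegistry() *healthRegistry {
	return &healthRegistry{
		statuses: map[string]string{"": "SERVING"}, // "" = whole server
		watchers: map[string][]chan string{},
	}
}

func (h *healthRegistry) Check(service string) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	if s, ok := h.statuses[service]; ok {
		return s
	}
	return "SERVICE_UNKNOWN"
}

func (h *healthRegistry) Watch(service string) <-chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan string, 8)
	s, ok := h.statuses[service]
	if !ok {
		s = "SERVICE_UNKNOWN"
	}
	ch <- s // Watch delivers the current status on subscribe
	h.watchers[service] = append(h.watchers[service], ch)
	return ch
}

func (h *healthRegistry) SetStatus(service, status string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.statuses[service] = status
	for _, ch := range h.watchers[service] {
		ch <- status // buffered; a real server must handle slow readers
	}
}

func main() {
	h := newHealthRegistry()
	w := h.Watch("")
	fmt.Println(<-w)                 // status at subscribe time
	h.SetStatus("", "NOT_SERVING")   // e.g., beginning a drain
	fmt.Println(<-w)                 // pushed update, no polling needed
}
```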
Envoy can be configured to use gRPC health checking on upstream clusters:
clusters:
- name: grpc_backends
  health_checks:
  - timeout: 1s
    interval: 10s
    unhealthy_threshold: 3
    healthy_threshold: 2
    grpc_health_check:
      service_name: "my.service.Name"
This tells Envoy to call the gRPC health check endpoint every 10 seconds. A backend is marked unhealthy after 3 consecutive failures and healthy again after 2 consecutive successes. During the unhealthy period, Envoy removes the backend from the load balancing rotation.
For client-side load balancing, gRPC libraries support health checking natively. When enabled, the client calls Watch on each subchannel and removes unhealthy subchannels from the balancing pool. The subchannel is re-added when the backend reports SERVING again.
Connection Draining
Graceful shutdown is critical for gRPC services. Because HTTP/2 connections are long-lived and may carry in-flight RPCs, abruptly terminating a backend can cause widespread RPC failures. Connection draining is the process of gracefully removing a backend from service without dropping active requests.
The draining process works in several layers:
- Health check transition -- The backend updates its health status from SERVING to NOT_SERVING. Load balancers and clients that are watching health status stop sending new RPCs to this backend.
- HTTP/2 GOAWAY frame -- The backend sends a GOAWAY frame on each HTTP/2 connection. This tells the client that no new streams should be opened on this connection, but existing streams can complete. The GOAWAY frame includes the ID of the last stream the server will process.
- In-flight completion -- The backend continues processing all in-flight RPCs until they complete or a deadline is reached.
- Connection closure -- After all in-flight RPCs complete (or a grace period expires), the backend closes the TCP connections.
In a Kubernetes environment, this maps to the pod termination lifecycle:
- Kubernetes sends SIGTERM to the pod.
- The gRPC server stops accepting new RPCs and begins draining.
- The pod's readiness probe fails, causing the Kubernetes endpoints controller to remove it from the Service endpoints.
- The gRPC server finishes in-flight RPCs.
- After terminationGracePeriodSeconds, Kubernetes sends SIGKILL if the pod has not exited.
A common mistake is setting the termination grace period too short. If your RPCs can take up to 30 seconds to complete, your grace period must be at least 30 seconds (plus buffer for the readiness probe update to propagate). Otherwise, Kubernetes will SIGKILL the pod while RPCs are still in flight.
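A pod spec sketch that follows this reasoning; the values are illustrative, and the native grpc readiness probe requires a recent Kubernetes release (older clusters typically use an exec probe with the grpc-health-probe binary instead):

```yaml
# Fragment of a Deployment's pod template. Grace period = max RPC
# duration (30s) + buffer for endpoint removal to propagate.
spec:
  terminationGracePeriodSeconds: 45
  containers:
  - name: grpc-server
    readinessProbe:
      grpc:
        port: 50051
      periodSeconds: 5
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]  # let endpoint removal propagate before SIGTERM handling
```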
In Go, graceful shutdown looks like:
// On SIGTERM, drain gracefully but force-stop after a deadline:
done := make(chan struct{})
go func() {
    server.GracefulStop() // stops accepting new RPCs, waits for in-flight RPCs
    close(done)
}()
select {
case <-done:
case <-time.After(30 * time.Second):
    server.Stop() // force-close remaining connections after 30s
}
Retry Policies
Network failures, backend crashes, and transient errors are inevitable. gRPC has a built-in retry mechanism that can automatically retry failed RPCs without application code needing to handle retries explicitly.
Retries are configured in the service config, which can be returned by the name resolver, set statically on the client, or pushed via xDS:
{
  "methodConfig": [
    {
      "name": [{"service": "my.package.MyService"}],
      "retryPolicy": {
        "maxAttempts": 3,
        "initialBackoff": "0.1s",
        "maxBackoff": "1s",
        "backoffMultiplier": 2,
        "retryableStatusCodes": [
          "UNAVAILABLE",
          "DEADLINE_EXCEEDED"
        ]
      }
    }
  ]
}
The retry policy specifies:
- maxAttempts -- the total number of attempts, including the original. A value of 3 means one original attempt plus up to two retries.
- initialBackoff / maxBackoff / backoffMultiplier -- exponential backoff between retries. The delay is randomized (jittered) to prevent thundering herds.
- retryableStatusCodes -- which gRPC status codes trigger a retry. UNAVAILABLE (the server is down or unreachable) is the most common. RESOURCE_EXHAUSTED might be retryable if the overload is transient. INTERNAL is usually not retryable because it may indicate a bug.
Retries interact with load balancing in an important way: each retry attempt goes through a fresh load balancing pick, so a retry after a backend crash will typically land on a different, healthy backend rather than the one that just failed. This holds for both proxy-based and client-side load balancing.
There are important constraints on retries:
- Retry budgets -- to prevent retry storms, gRPC supports retry throttling (the retryThrottling block in the service config). Failures drain a token bucket and successes partially refill it; when the bucket drops below half, retries are suppressed. This caps the extra load that retries can add during an outage.
- Committed responses -- once the server has sent response headers (indicating it has started processing the RPC), the RPC is "committed" to that server and will not be retried, even if the server later fails. This prevents duplicate processing of an RPC that the server has already partially handled.
- Non-idempotent RPCs -- retrying a mutating RPC (like "transfer $100") is dangerous because the first attempt may have succeeded but the response was lost. Applications must design idempotent APIs or use idempotency keys to make retries safe.
Hedging
Hedged requests are a more aggressive form of redundancy than retries. Instead of waiting for an RPC to fail before sending a retry, hedging sends the same RPC to multiple backends simultaneously and uses the first successful response. The remaining in-flight attempts are cancelled.
Hedging is configured in the service config alongside (but mutually exclusive with) retry policies:
{
  "methodConfig": [
    {
      "name": [{"service": "my.package.MyService", "method": "Search"}],
      "hedgingPolicy": {
        "maxAttempts": 3,
        "hedgingDelay": "0.5s",
        "nonFatalStatusCodes": ["UNAVAILABLE", "INTERNAL"]
      }
    }
  ]
}
The hedgingDelay is the key parameter. The client sends the first attempt immediately. If no response arrives within 500ms, it sends a second attempt to a different backend (via the load balancer). If neither responds within another 500ms, it sends a third. The first successful response wins.
Hedging is powerful for tail-latency reduction. If your P99 latency is 500ms but your P50 is 10ms, a slow response from one backend will be rescued by a fast response from another. Google's research showed that hedging can reduce tail latency by 5-10x with only a modest increase in overall load (because most hedged attempts are cancelled quickly).
However, hedging has strict requirements:
- The RPC must be idempotent. Hedging sends the same request to multiple backends simultaneously, so any side effects will happen multiple times.
- Hedging increases backend load. With
maxAttempts: 3, the worst case is 3x the load on the backend fleet. ThehedgingDelaymitigates this: most RPCs complete before the delay expires, so additional attempts are never sent. nonFatalStatusCodescontrols which error responses are treated as "not yet failed" (waiting for other hedged attempts). Status codes not in this list cause immediate failure of the whole hedged RPC.
Hedging and retries cannot be used on the same method. Retries are for handling failures (reactive). Hedging is for reducing latency (proactive). Choose based on your requirements.
Putting It All Together: Choosing a Strategy
The right gRPC load balancing strategy depends on your environment, scale, and operational maturity. Here is a decision framework:
Use L7 proxy load balancing (Envoy, nginx, HAProxy) when:
- You need to support clients in any language, including those without sophisticated gRPC library support.
- You want centralized control over routing, retries, and observability.
- The additional latency hop is acceptable (typically <1ms within a datacenter).
- You are running in Kubernetes and using an ingress controller or service mesh.
Use client-side load balancing (round_robin, weighted_round_robin) when:
- You need the lowest possible latency and cannot tolerate a proxy hop.
- Your backend fleet is discovered through DNS and changes infrequently.
- You are running in a trusted internal environment where clients and backends are in the same network.
Use xDS-based load balancing (proxyless gRPC or Envoy sidecar) when:
- You need advanced traffic management: canary deployments, locality-aware routing, circuit breaking.
- You are already running a service mesh (Istio, Linkerd) or a managed control plane (Traffic Director).
- You want the sophistication of Envoy's load balancing without the proxy hop (proxyless gRPC).
Regardless of which strategy you choose, always implement the gRPC health checking protocol, configure retries for idempotent methods with UNAVAILABLE as a retryable status, and ensure your backends support graceful draining. These are not optional -- they are the foundation of reliable gRPC communication.
Common Pitfalls
Even with proper L7 load balancing, there are several traps that can lead to unbalanced load or degraded reliability:
- DNS caching -- gRPC clients cache DNS responses. If backends change IP addresses (common in Kubernetes), the client may not discover new backends until the DNS TTL expires. Use short TTLs or xDS for dynamic environments.
- Streaming RPCs -- long-lived server-streaming or bidirectional-streaming RPCs are load-balanced at the stream level, not the message level. A streaming RPC that runs for hours is pinned to a single backend for its entire lifetime, just like an HTTP/1.1 connection. Design streaming RPCs with bounded lifetimes and reconnect periodically.
- Subchannel exhaustion -- with client-side load balancing, the client creates one subchannel per backend address. In a large cluster with thousands of backends, this means thousands of HTTP/2 connections per client, which can exhaust file descriptors and memory. Use subsetting (only connecting to a subset of backends) or a proxy for very large clusters.
- Retry amplification -- retries at multiple layers (application, gRPC library, proxy, service mesh) can cause exponential retry storms during an outage. Configure retries at exactly one layer and ensure retry budgets are in place.
- Head-of-line blocking -- while HTTP/2 solves head-of-line blocking at the HTTP layer, it still exists at the TCP layer. A single lost TCP packet stalls all streams on that connection until the packet is retransmitted. For very latency-sensitive workloads, consider using multiple HTTP/2 connections per backend or gRPC over QUIC (HTTP/3) when it becomes available.
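The subsetting mentioned above can be sketched with the deterministic algorithm described in the Google SRE book: clients in the same "round" share one shuffle of the backend list and take disjoint slices of it, so each client holds few subchannels while the fleet as a whole covers every backend evenly. Names and sizes below are illustrative:

```go
package main

import (
	"fmt"
	"math/rand"
)

// subset returns the backends a given client should connect to.
// Clients are grouped into rounds of subsetCount; each round shuffles
// the full backend list with a shared seed and hands each client a
// disjoint slice, keeping per-backend connection counts uniform.
func subset(backends []string, clientID, subsetSize int) []string {
	subsetCount := len(backends) / subsetSize
	round := clientID / subsetCount // clients in a round share one shuffle
	r := rand.New(rand.NewSource(int64(round)))
	shuffled := append([]string(nil), backends...)
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	start := (clientID % subsetCount) * subsetSize
	return shuffled[start : start+subsetSize]
}

func main() {
	backends := []string{"b0", "b1", "b2", "b3", "b4", "b5"}
	// Each client opens only 3 subchannels instead of 6; clients 0 and 1
	// take complementary halves of the same shuffled list.
	fmt.Println(subset(backends, 0, 3))
	fmt.Println(subset(backends, 1, 3))
}
```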
Understanding these dynamics is part of the broader discipline of operating networked services at scale. Whether you're managing gRPC microservices or running load balancers in front of any protocol, the principles of connection management, health checking, and graceful degradation are universal. You can explore how these concepts apply at the network layer by looking up real-world infrastructure in the BGP looking glass -- examining how large-scale services like Google (AS15169) and Cloudflare (AS13335) use anycast and BGP to achieve global load distribution at the routing level.