gRPC Load Balancing: Strategies and Patterns

gRPC is built on top of HTTP/2, and that single design choice makes load balancing fundamentally harder than it is for traditional HTTP/1.1 services. A naive deployment behind a standard TCP load balancer will appear to work, pass every health check, and still route 100% of traffic to a single backend. Understanding why this happens, and the spectrum of strategies to fix it, is essential knowledge for anyone operating gRPC services at scale.

This article covers the full landscape of gRPC load balancing: why Layer 4 fails, how Layer 7 proxies solve it, client-side approaches, the lookaside load balancing protocol, the xDS API, health checking, connection draining, retries, and hedging.

Why Layer 4 Load Balancing Fails for gRPC

Traditional load balancers operate at Layer 4 (TCP). They accept an incoming TCP connection, pick a backend using a balancing algorithm (round-robin, least-connections, etc.), and establish a corresponding upstream connection. All bytes on that connection flow to the same backend for the lifetime of the connection. For HTTP/1.1, this works well enough because clients open many short-lived connections and each request typically gets its own connection, so load distributes naturally across backends.

HTTP/2 changes this equation entirely. The protocol was designed to multiplex many concurrent requests over a single TCP connection. gRPC inherits this behavior: a gRPC client establishes one HTTP/2 connection to a server and sends all RPCs as multiplexed streams on that connection. The connection stays open for the lifetime of the client process, which may be hours or days.

When you place a Layer 4 load balancer in front of gRPC backends, here is what happens:

  1. The client opens a TCP connection. The L4 load balancer forwards it to Backend A.
  2. The client sends all RPCs over that single connection. Every RPC goes to Backend A.
  3. Backends B, C, and D sit idle. The load balancer reports all backends as healthy, and its connection-level metrics show nothing amiss -- it is balancing connections, and there is only one.
  4. If you scale up and add Backends E and F, nothing changes. The existing connection still goes to Backend A.

The load balancer's connection-level metrics look fine. But the request-level distribution is catastrophically skewed. In a cluster with ten backends, one backend may handle 100% of the load while nine are idle. Even with many clients, the distribution is determined by which backend each client's single connection landed on, not by any per-request balancing.

[Diagram -- L4 (TCP) load balancing with gRPC: all three clients' HTTP/2 connections land on Backend A, which carries 100% of RPCs; Backends B, C, and D receive zero traffic despite being healthy.]

This problem is not theoretical. It is the single most common operational surprise teams encounter when deploying gRPC for the first time. Load balancers like AWS NLB, bare HAProxy in TCP mode, or any iptables/IPVS-based solution (including the default Kubernetes kube-proxy in iptables mode) will all exhibit this behavior with gRPC.

Layer 7 Load Balancing: The Proxy-Based Solution

The fix is to move the load balancing decision from the connection level to the request level. A Layer 7 (application-level) load balancer terminates the HTTP/2 connection from the client, inspects each individual gRPC request (which is an HTTP/2 stream), and routes it to a backend independently. This means one client connection can have its RPCs spread across all available backends.

[Diagram -- L7 (HTTP/2) load balancing with gRPC: the proxy (Envoy, nginx, or HAProxy) terminates each client's HTTP/2 connection, inspects each stream, and routes every RPC independently, so each of the four backends receives roughly 25% of RPCs.]

Envoy Proxy

Envoy is the gold standard for gRPC load balancing. It was built at Lyft specifically to solve microservice communication challenges, and gRPC is a first-class citizen. Envoy understands the HTTP/2 framing, can parse gRPC metadata and trailers, and supports every load balancing algorithm you need.

A minimal Envoy configuration for gRPC load balancing looks like this:

static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: AUTO
          stat_prefix: ingress
          route_config:
            virtual_hosts:
            - name: grpc_service
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: grpc_backends }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: grpc_backends
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: grpc_backends
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: backend1, port_value: 50051 }
        - endpoint:
            address:
              socket_address: { address: backend2, port_value: 50051 }

The critical detail is http2_protocol_options on the upstream cluster. This tells Envoy to speak HTTP/2 to the backends, which is required for gRPC. Envoy then multiplexes upstream connections, distributing individual gRPC streams across backends according to the configured lb_policy.

Envoy supports several load balancing algorithms for gRPC traffic:

  - ROUND_ROBIN -- cycle through healthy endpoints in order.
  - LEAST_REQUEST -- pick the endpoint with the fewest outstanding requests (by default, the better of two random choices).
  - RING_HASH and MAGLEV -- consistent hashing, useful for session affinity.
  - RANDOM -- pick a random healthy endpoint.

nginx

nginx added native gRPC proxying support in version 1.13.10 (March 2018). It can terminate HTTP/2 connections from clients and proxy gRPC requests to upstream backends. The configuration is straightforward:

upstream grpc_backends {
    server backend1:50051;
    server backend2:50051;
    server backend3:50051;
}

server {
    listen 8080 http2;

    location / {
        grpc_pass grpc://grpc_backends;
    }
}

nginx balances gRPC requests per-RPC across the upstream servers. However, nginx has a significant limitation compared to Envoy: its gRPC health checking, observability, and advanced routing capabilities are less mature. nginx Plus (the commercial version) adds gRPC health checks, but the open-source version requires passive health checking (detecting failures after they happen).

HAProxy

HAProxy 1.9+ supports HTTP/2 and can act as a gRPC-aware Layer 7 proxy. The key is configuring it in HTTP mode (not TCP mode) with HTTP/2 enabled on both the frontend and backend:

frontend grpc_front
    bind *:8080 proto h2
    default_backend grpc_servers

backend grpc_servers
    balance roundrobin
    server backend1 backend1:50051 proto h2
    server backend2 backend2:50051 proto h2
    server backend3 backend3:50051 proto h2

The proto h2 directive is essential on both sides. Without it on the backend, HAProxy will attempt HTTP/1.1 to the upstream, which gRPC cannot use. HAProxy's per-request balancing in HTTP mode ensures proper distribution across gRPC backends.

Client-Side Load Balancing

Proxy-based load balancing adds a network hop, introduces a potential single point of failure, and adds latency. For high-performance or latency-sensitive gRPC deployments, client-side load balancing eliminates the proxy entirely. The gRPC client itself maintains connections to multiple backends and distributes RPCs across them.

gRPC has a built-in load balancing framework with a pluggable architecture. The system has three components:

  1. Name resolver -- translates a service name (like dns:///my-service) into a list of backend addresses.
  2. Load balancing policy -- decides which backend to send each RPC to, given the list of addresses.
  3. Subchannel management -- maintains the actual HTTP/2 connections (called subchannels) to each backend.

pick_first

pick_first is the default load balancing policy in gRPC. It takes the list of addresses from the name resolver, tries to connect to the first one, and sends all RPCs to that single backend. If the connection fails, it tries the next address. This is effectively no load balancing at all -- it is connection-level failover.

pick_first exists because many gRPC deployments use a proxy-based load balancer (where the client only sees a single virtual IP), or the service has only one backend. For these cases, the overhead of maintaining multiple connections is unnecessary.

round_robin

round_robin creates a subchannel (HTTP/2 connection) to every address returned by the name resolver and distributes RPCs across them in round-robin order. This is the simplest form of true client-side load balancing.

In Go, enabling round-robin is a one-line change:

conn, err := grpc.Dial(
    "dns:///my-service.example.com",
    grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
)

With DNS-based resolution, the client resolves the service name to multiple A/AAAA records and creates a connection to each. RPCs are distributed round-robin across these connections. When DNS records change (backends are added or removed), the client updates its connection pool.

The limitation of round_robin is that it treats all backends equally. It does not account for differences in backend capacity, current load, or response latency. A slow backend receives the same share of traffic as a fast one.

weighted_round_robin

The weighted_round_robin policy extends round-robin with dynamic weight adjustment. Backends periodically report their utilization through an out-of-band mechanism (typically the ORCA load reporting protocol), and the client adjusts the proportion of traffic each backend receives. A backend reporting high CPU utilization will receive fewer RPCs, while an underloaded backend receives more.

xDS-Based Policies

The most sophisticated client-side load balancing uses the xDS API (discussed in detail below). xDS-aware gRPC clients can implement policies like priority-based failover, locality-aware routing, and circuit breaking -- all without a proxy in the data path. The xDS control plane pushes configuration to the client, and the client's built-in load balancer executes the policy.

Lookaside Load Balancing (gRPC-LB Protocol)

Between the simplicity of client-side round-robin and the complexity of a full xDS control plane, there is an intermediate approach: lookaside load balancing, originally defined in the gRPC-LB protocol (sometimes called the "grpclb" protocol).

In this model, the client does not pick backends itself. Instead, it contacts a load balancer service (the "balancer" or "lookaside LB") to get a list of backend addresses. The flow works like this:

  1. The client resolves the service name via DNS. The DNS response includes an SRV record pointing to the load balancer, or the name resolver is configured to return the balancer's address.
  2. The client opens a gRPC stream to the load balancer, sending information about itself (client stats, requested service name).
  3. The load balancer responds with a server list -- an ordered list of backend addresses the client should use.
  4. The client connects directly to the backends in the server list and sends RPCs to them, typically using round-robin across the list.
  5. The load balancer can update the server list at any time by sending a new response on the stream. The client updates its connections accordingly.
  6. The client periodically reports load statistics (RPCs sent, latency, errors) back to the load balancer, which uses this information to compute updated server lists.

This design separates the data plane (direct client-to-backend RPCs) from the control plane (the load balancer deciding which backends each client should use). The load balancer has a global view of all backends and all clients, so it can make intelligent decisions: directing traffic away from overloaded backends, implementing weighted distribution, or draining backends for maintenance.
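The entire exchange above runs over one long-lived bidirectional stream; the balancer service (from the grpc.lb.v1 package) is essentially:

```proto
service LoadBalancer {
  // The client sends an initial request (and periodic client stats);
  // the balancer streams back updated server lists as conditions change.
  rpc BalanceLoad(stream LoadBalanceRequest) returns (stream LoadBalanceResponse);
}
```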

Google uses this pattern internally at enormous scale. The external load balancer has visibility into backend health, capacity, and geographic location, and it produces per-client server lists optimized for both load distribution and latency.

The gRPC-LB protocol is now considered legacy, superseded by the more general xDS API. But the architectural pattern -- a dedicated control plane service that tells clients where to send traffic -- lives on in xDS.

xDS API and Control Plane

The xDS API is the configuration protocol originally developed for Envoy proxy but now adopted by gRPC clients directly. "xDS" stands for "x Discovery Service," where x is a placeholder for the various resource types: LDS (Listener), RDS (Route), CDS (Cluster), EDS (Endpoint), and SDS (Secret).

When gRPC adopted xDS, it gained the ability to receive the same rich configuration that Envoy uses -- but directly in the client library, without a proxy. A gRPC client configured to use xDS connects to a control plane (like Istio's istiod, Traffic Director, or a custom xDS server) and receives:

  - Listener and route configuration (LDS/RDS) -- how RPCs for a given target are matched and which cluster they should go to.
  - Cluster configuration (CDS) -- the load balancing policy, timeouts, and circuit-breaking settings for each backend service.
  - Endpoint assignments (EDS) -- the actual backend addresses, annotated with locality and weight.

xDS enables features that are impossible with simple client-side load balancing:

  - Locality-aware routing -- prefer backends in the client's own zone, spilling over to other zones only when local capacity is exhausted.
  - Traffic splitting -- route a configured percentage of RPCs to a canary cluster.
  - Outlier detection -- eject backends that return consecutive errors, without waiting for health checks.
  - Priority-based failover -- shift traffic to a secondary cluster or locality when the primary degrades.

Envoy as xDS Client

In a service mesh architecture, Envoy runs as a sidecar proxy alongside each service instance. An xDS control plane (such as Istio's istiod) pushes configuration to every Envoy sidecar. When Service A calls Service B, the request goes through Service A's local Envoy sidecar, which uses its xDS configuration to select a backend instance of Service B. The sidecar handles load balancing, retries, timeouts, and observability transparently.

This is the "service mesh" model: the application code is unaware of load balancing. It sends RPCs to localhost, and the sidecar handles everything. The advantage is that any language or framework gets the same sophisticated load balancing. The disadvantage is the added latency of two extra network hops (client to sidecar, sidecar to backend) and the operational complexity of managing sidecar proxies.

Proxyless gRPC (xDS in the Client)

Proxyless gRPC eliminates the sidecar. The gRPC client library itself speaks the xDS protocol, connecting directly to the control plane and applying the configuration internally. The client opens connections directly to backends -- no proxy, no sidecar, no extra hops.

Google's Traffic Director and Istio (1.14+) both support proxyless gRPC. The client is configured with a bootstrap file specifying the xDS control plane server:

{
  "xds_servers": [
    {
      "server_uri": "trafficdirector.googleapis.com:443",
      "channel_creds": [{"type": "google_default"}],
      "server_features": ["xds_v3"]
    }
  ],
  "node": {
    "id": "projects/12345/networks/default/nodes/client-abc",
    "locality": {
      "zone": "us-central1-a"
    }
  }
}

The client then uses the xds:/// name resolver scheme instead of dns:///:

import _ "google.golang.org/grpc/xds" // in Go, registers the xds:/// resolver scheme

conn, err := grpc.Dial("xds:///my-service.example.com")

This gives you the full power of Envoy's load balancing -- locality awareness, outlier detection, traffic splitting -- with zero proxy overhead. The tradeoff is that only languages with mature xDS support in the gRPC library can use it (Go, Java, and C++ have the best support as of 2025).

Health Checking

Load balancing is only as good as your health checking. Sending traffic to an unhealthy backend defeats the purpose of balancing entirely. gRPC defines a standard Health Checking Protocol (in the grpc.health.v1 package) that both proxies and clients can use to determine backend health.

The protocol is defined as a gRPC service:

service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
  rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
}

message HealthCheckRequest {
  string service = 1;
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
    SERVICE_UNKNOWN = 3;
  }
  ServingStatus status = 1;
}

The Check RPC is a simple request-response health check: the caller sends a service name and gets back a serving status. The Watch RPC is a server-streaming call: the caller sends a service name and receives status updates whenever the serving status changes. This is more efficient than polling with repeated Check calls.

The service field in the request allows per-service health checking on servers that host multiple gRPC services. An empty string checks the overall server health. A server that is healthy for one service but overloaded for another can report different statuses for each.

Envoy can be configured to use gRPC health checking on upstream clusters:

clusters:
- name: grpc_backends
  health_checks:
  - timeout: 1s
    interval: 10s
    unhealthy_threshold: 3
    healthy_threshold: 2
    grpc_health_check:
      service_name: "my.service.Name"

This tells Envoy to call the gRPC health check endpoint every 10 seconds. A backend is marked unhealthy after 3 consecutive failures and healthy again after 2 consecutive successes. During the unhealthy period, Envoy removes the backend from the load balancing rotation.

For client-side load balancing, gRPC libraries support health checking natively. When enabled, the client calls Watch on each subchannel and removes unhealthy subchannels from the balancing pool. The subchannel is re-added when the backend reports SERVING again.
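In Go, for example, enabling this is a service-config setting (a sketch; the empty serviceName checks overall server health, and the server must expose the grpc.health.v1 service for the check to succeed):

```json
{
  "loadBalancingConfig": [{"round_robin": {}}],
  "healthCheckConfig": {
    "serviceName": ""
  }
}
```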

Connection Draining

Graceful shutdown is critical for gRPC services. Because HTTP/2 connections are long-lived and may carry in-flight RPCs, abruptly terminating a backend can cause widespread RPC failures. Connection draining is the process of gracefully removing a backend from service without dropping active requests.

The draining process works in several layers:

  1. Health check transition -- The backend updates its health status from SERVING to NOT_SERVING. Load balancers and clients that are watching health status stop sending new RPCs to this backend.
  2. HTTP/2 GOAWAY frame -- The backend sends a GOAWAY frame on each HTTP/2 connection. This tells the client that no new streams should be opened on this connection, but existing streams can complete. The GOAWAY frame includes the ID of the last stream the server will process.
  3. In-flight completion -- The backend continues processing all in-flight RPCs until they complete or a deadline is reached.
  4. Connection closure -- After all in-flight RPCs complete (or a grace period expires), the backend closes the TCP connections.

In a Kubernetes environment, this maps to the pod termination lifecycle:

  1. Kubernetes sends SIGTERM to the pod.
  2. The gRPC server stops accepting new RPCs and begins draining.
  3. The endpoints controller removes the terminating pod from the Service endpoints (a failing readiness probe has the same effect for drains that are not pod deletions), so new connections stop being routed to it.
  4. The gRPC server finishes in-flight RPCs.
  5. After terminationGracePeriodSeconds, Kubernetes sends SIGKILL if the pod has not exited.

A common mistake is setting the termination grace period too short. If your RPCs can take up to 30 seconds to complete, your grace period must be at least 30 seconds (plus buffer for the readiness probe update to propagate). Otherwise, Kubernetes will SIGKILL the pod while RPCs are still in flight.
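In Kubernetes terms, that sizing might look like the following fragment (a sketch; the container name, the 45-second budget, and the 5-second preStop sleep are illustrative):

```yaml
spec:
  # RPCs can take up to 30s: allow 30s of drain plus buffer for endpoint
  # removal to propagate before Kubernetes escalates to SIGKILL.
  terminationGracePeriodSeconds: 45
  containers:
  - name: grpc-server
    lifecycle:
      preStop:
        exec:
          # Keep serving briefly so endpoint removal settles before the
          # container receives SIGTERM and begins draining.
          command: ["sleep", "5"]
```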

In Go, graceful shutdown looks like:

// On SIGTERM: stop accepting new RPCs and wait for in-flight ones to finish,
// but force-stop if draining exceeds the 30-second grace budget.
stopped := make(chan struct{})
go func() {
    server.GracefulStop() // returns once all in-flight RPCs complete
    close(stopped)
}()
select {
case <-stopped:
case <-time.After(30 * time.Second):
    server.Stop() // grace period exhausted; close connections immediately
}

Retry Policies

Network failures, backend crashes, and transient errors are inevitable. gRPC has a built-in retry mechanism that can automatically retry failed RPCs without application code needing to handle retries explicitly.

Retries are configured in the service config, which can be returned by the name resolver, set statically on the client, or pushed via xDS:

{
  "methodConfig": [
    {
      "name": [{"service": "my.package.MyService"}],
      "retryPolicy": {
        "maxAttempts": 3,
        "initialBackoff": "0.1s",
        "maxBackoff": "1s",
        "backoffMultiplier": 2,
        "retryableStatusCodes": [
          "UNAVAILABLE",
          "DEADLINE_EXCEEDED"
        ]
      }
    }
  ]
}

The retry policy specifies:

  - maxAttempts -- the maximum total number of attempts, including the original RPC (gRPC caps this at 5).
  - initialBackoff, maxBackoff, and backoffMultiplier -- the exponential backoff schedule; the actual delay before each retry is chosen randomly between zero and the current backoff ceiling.
  - retryableStatusCodes -- an RPC is retried only if it fails with one of these status codes.

Retries interact with load balancing in an important way: when an RPC is retried, the load balancer picks a different backend for the retry attempt. This means that a retry after a backend crash will land on a healthy backend, not the same crashed one. This is true for both proxy-based and client-side load balancing.

There are important constraints on retries:

  - Retries are only safe for idempotent methods. gRPC cannot know whether a method has side effects; an UNAVAILABLE on a non-idempotent method may mean the server already processed the request.
  - An RPC is only retried if no response messages have been received. Once the server has sent part of a streaming response, the RPC is committed to that attempt.
  - Retry throttling protects backends: when the observed failure rate is high, the client stops retrying rather than amplify an outage.

Hedging

Hedged requests are a more aggressive form of redundancy than retries. Instead of waiting for an RPC to fail before sending a retry, hedging sends the same RPC to multiple backends -- each attempt staggered by a short delay -- and uses the first successful response. The remaining in-flight attempts are cancelled.

Hedging is configured in the service config alongside (but mutually exclusive with) retry policies:

{
  "methodConfig": [
    {
      "name": [{"service": "my.package.MyService", "method": "Search"}],
      "hedgingPolicy": {
        "maxAttempts": 3,
        "hedgingDelay": "0.5s",
        "nonFatalStatusCodes": ["UNAVAILABLE", "INTERNAL"]
      }
    }
  ]
}

The hedgingDelay is the key parameter. The client sends the first attempt immediately. If no response arrives within 500ms, it sends a second attempt to a different backend (via the load balancer). If neither responds within another 500ms, it sends a third. The first successful response wins.

Hedging is powerful for tail-latency reduction. If your P99 latency is 500ms but your P50 is 10ms, a slow response from one backend will be rescued by a fast response from another. Google's research showed that hedging can reduce tail latency by 5-10x with only a modest increase in overall load (because most hedged attempts are cancelled quickly).

However, hedging has strict requirements:

  - Methods must be idempotent. Hedging deliberately executes the same request multiple times; a non-idempotent method would apply its side effects more than once.
  - Backends must tolerate the extra load. Every hedged attempt consumes backend resources until it is cancelled.
  - nonFatalStatusCodes governs error handling: a non-fatal status lets the remaining attempts continue, while any other error status fails the RPC immediately and cancels outstanding attempts.

Hedging and retries cannot be used on the same method. Retries are for handling failures (reactive). Hedging is for reducing latency (proactive). Choose based on your requirements.

Putting It All Together: Choosing a Strategy

The right gRPC load balancing strategy depends on your environment, scale, and operational maturity. Here is a decision framework:

Use L7 proxy load balancing (Envoy, nginx, HAProxy) when:

  - You cannot modify or configure every client (external consumers, many languages and frameworks).
  - You want centralized TLS termination, routing policy, and observability.
  - An extra network hop of latency is acceptable.

Use client-side load balancing (round_robin, weighted_round_robin) when:

  - Latency matters and you want no proxy in the data path.
  - You control the clients, and they can discover backends directly (via DNS or a custom resolver).
  - Your requirements are simple enough that per-client decisions, without a global view, are good enough.

Use xDS-based load balancing (proxyless gRPC or Envoy sidecar) when:

  - You need advanced policies: locality-aware routing, traffic splitting, outlier detection, priority failover.
  - You already operate (or are prepared to operate) a control plane such as Istio or Traffic Director.
  - For proxyless gRPC: your languages have mature xDS support in the gRPC library (Go, Java, C++).

Regardless of which strategy you choose, always implement the gRPC health checking protocol, configure retries for idempotent methods with UNAVAILABLE as a retryable status, and ensure your backends support graceful draining. These are not optional -- they are the foundation of reliable gRPC communication.

Common Pitfalls

Even with proper L7 load balancing, there are several traps that can lead to unbalanced load or degraded reliability:

Understanding these dynamics is part of the broader discipline of operating networked services at scale. Whether you're managing gRPC microservices or running load balancers in front of any protocol, the principles of connection management, health checking, and graceful degradation are universal. You can explore how these concepts apply at the network layer by looking up real-world infrastructure in the BGP looking glass -- examining how large-scale services like Google (AS15169) and Cloudflare (AS13335) use anycast and BGP to achieve global load distribution at the routing level.
