gRPC in Kubernetes: Deployment Patterns

Deploying a gRPC service to Kubernetes sounds straightforward until your first production load test reveals that a single pod is handling 90% of the traffic while the others sit idle. The root cause is a fundamental mismatch between how Kubernetes networking works and what gRPC needs. Kubernetes Services default to L4 (TCP-level) load balancing via kube-proxy, but gRPC multiplexes many requests over a single long-lived HTTP/2 connection. Once that connection is established to one pod, all subsequent RPCs flow over it, completely bypassing the load balancer. Understanding this mismatch and the patterns that solve it is essential for anyone running gRPC in production on Kubernetes.

The Core Problem: L4 Load Balancing Meets HTTP/2

Kubernetes Services use kube-proxy to distribute traffic across pods. Kube-proxy operates at Layer 4 of the OSI model, meaning it makes routing decisions when a new TCP connection is established. For traditional HTTP/1.1 traffic, this works well because clients open many short-lived connections, and each new connection gets balanced to a different pod.

gRPC is built on HTTP/2, which fundamentally changes this dynamic. HTTP/2 is designed to minimize connections. A single HTTP/2 connection multiplexes hundreds or thousands of concurrent requests (called streams) over one TCP socket. When a gRPC client connects through a Kubernetes Service, kube-proxy routes that one TCP connection to a single pod. From that point forward, every RPC the client sends travels over that same connection to that same pod, regardless of how many other pods are available.

[Diagram: a gRPC client sends 1000 RPCs/s to a ClusterIP Service. kube-proxy (L4) routes the single TCP connection to Pod 1, which receives all 1000 RPCs/s; Pods 2 and 3 receive none. All RPCs are routed to the same pod via a single HTTP/2 connection.]

The result is a hot-pod problem. In a deployment with three replicas, one pod might be saturated while the other two sit idle. CPU and memory metrics look fine on average, masking the fact that one pod is being hammered. This is the single most common gRPC-on-Kubernetes pitfall, and it applies to every Kubernetes distribution, whether you are running on GKE, EKS, AKS, or bare-metal clusters.

Headless Services and Client-Side Load Balancing

The simplest fix for gRPC load balancing in Kubernetes is to bypass kube-proxy entirely using a headless Service. A headless Service (one with clusterIP: None) does not allocate a virtual IP. Instead, a DNS lookup against the Service name returns the individual pod IPs as A/AAAA records.

apiVersion: v1
kind: Service
metadata:
  name: my-grpc-service
spec:
  clusterIP: None          # headless
  selector:
    app: my-grpc-service
  ports:
    - port: 50051
      targetPort: 50051
      protocol: TCP

With a headless Service, a gRPC client resolves the service hostname and receives a list of pod IPs. It can then open a connection to each pod and distribute RPCs across them. Most gRPC client libraries have built-in support for this via their name resolution and load balancing APIs.

In Go, for example, you can use the dns resolver scheme and the round_robin balancer:

import "google.golang.org/grpc"
import "google.golang.org/grpc/credentials/insecure"

conn, err := grpc.Dial(
    "dns:///my-grpc-service.default.svc.cluster.local:50051",
    grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"round_robin":{}}]}`),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)

The dns:/// prefix tells the gRPC library to use its built-in DNS resolver. The resolver periodically re-resolves the hostname, picks up new pods as they appear, and drops pods that have been removed. The round_robin policy creates a subchannel (connection) to each resolved address and distributes RPCs evenly across them.

Limitations of Client-Side Balancing

Client-side balancing with headless Services is lightweight and requires no additional infrastructure, but it has real constraints. Every client, in every language, must be configured with the right resolver and load balancing policy; one misconfigured client silently reverts to single-pod pinning. DNS-based discovery is pull-based: the client only learns about new or removed pods when it re-resolves, so there is a window where traffic still targets deleted pods or ignores new ones. And there is no central place to apply retries, timeouts, traffic splitting, or mutual TLS; each of these must be implemented in every client.

Envoy Sidecar and Istio for L7 Load Balancing

A service mesh solves the gRPC load balancing problem at the infrastructure level, removing the burden from individual clients. Istio, the most widely adopted service mesh, injects an Envoy proxy as a sidecar container into each pod. All inbound and outbound traffic passes through this proxy, which operates at Layer 7 and understands the HTTP/2 framing that gRPC uses.

[Diagram: the app container in a client pod sends all traffic through its Envoy sidecar, which spreads roughly 333 RPCs/s across each of three server pods (each with its own Envoy sidecar). Routing is L7-aware and per-RPC.]

Because Envoy decodes HTTP/2 frames, it can make a routing decision for each individual gRPC request, not just for the TCP connection. The client application connects to localhost (the sidecar), and Envoy handles service discovery, load balancing, retries, and circuit breaking transparently. The application code needs no gRPC-specific load balancing configuration at all.

Istio provides additional capabilities on top of Envoy's per-request balancing: automatic mutual TLS between pods, weighted traffic shifting for canary releases, per-route retries and timeouts, outlier detection for ejecting unhealthy backends, and uniform telemetry (metrics, distributed traces, and access logs) without any application changes.

A typical Istio DestinationRule for a gRPC service might look like:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-grpc-service
spec:
  host: my-grpc-service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s

The LEAST_REQUEST policy is generally preferred over ROUND_ROBIN for gRPC services because it accounts for variable RPC durations, sending new requests to the pod with the fewest active requests. The trade-off is operational complexity: Istio adds a sidecar to every pod, increasing memory consumption (typically 50-100MB per sidecar), adding latency (0.5-2ms per hop), and introducing a significant control plane that needs its own monitoring and maintenance.

Gateway API and gRPC Routes

Kubernetes Gateway API, the successor to the Ingress API, introduced native gRPC support through the GRPCRoute resource. This allows Kubernetes-native, declarative routing for gRPC traffic at the cluster edge without requiring a full service mesh.

apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: my-grpc-route
spec:
  parentRefs:
    - name: my-gateway
  rules:
    - matches:
        - method:
            service: mypackage.MyService
            method: GetUser
      backendRefs:
        - name: my-grpc-service
          port: 50051
    - matches:
        - method:
            service: mypackage.MyService
      backendRefs:
        - name: my-grpc-service
          port: 50051

The GRPCRoute resource allows matching on gRPC service and method names, enabling fine-grained routing at the gateway level. You can route different gRPC methods to different backend services, split traffic for canary deployments, or apply different policies per method. The key advantage over Ingress is that GRPCRoute is L7-aware by design: the gateway implementation (Envoy, Contour, or another conformant controller) decodes HTTP/2 and routes per-RPC.
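Beyond method matching, GRPCRoute can also match on gRPC metadata, which is carried as HTTP/2 headers. A sketch of a header-based rule (the x-env metadata key and the staging backend name are hypothetical):

```yaml
rules:
  - matches:
      - method:
          service: mypackage.MyService
        headers:
          - name: x-env                    # hypothetical metadata key
            value: staging
    backendRefs:
      - name: my-grpc-service-staging      # hypothetical backend
        port: 50051
```

This pattern lets a client opt into a staging backend by attaching metadata to its RPCs, without any change to the primary route.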

The Gateway API also supports traffic splitting via weighted backendRefs:

rules:
  - backendRefs:
      - name: my-grpc-service-v1
        port: 50051
        weight: 90
      - name: my-grpc-service-v2
        port: 50051
        weight: 10

This sends 10% of gRPC requests to the v2 deployment, regardless of how many TCP connections are open. Gateway API support for gRPC is now GA (stable) and implemented by most major gateway controllers including Envoy Gateway, Istio, Contour, and Traefik.

gRPC Health Checking in Kubernetes

Proper health checking is critical for gRPC services in Kubernetes. Kubernetes probes (liveness, readiness, startup) and the gRPC health checking protocol serve complementary but distinct purposes, and getting them wrong can cause subtle availability issues.

The gRPC Health Checking Protocol

gRPC defines a standard health checking protocol in grpc.health.v1.Health. It is a regular gRPC service with a Check RPC that returns a status enum: SERVING, NOT_SERVING, UNKNOWN, or SERVICE_UNKNOWN. The protocol also supports a Watch RPC for streaming health status updates. Implementing this protocol in your service is the first step:

// Go example using grpc-health
import "google.golang.org/grpc/health"
import healthpb "google.golang.org/grpc/health/grpc_health_v1"

healthServer := health.NewServer()
healthpb.RegisterHealthServer(grpcServer, healthServer)

// Set status per service
healthServer.SetServingStatus("mypackage.MyService", healthpb.HealthCheckResponse_SERVING)

// During graceful shutdown
healthServer.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

Kubernetes Probes for gRPC

Since Kubernetes 1.24, kubelet natively supports gRPC health probes without needing a third-party binary. You configure them in the pod spec:

containers:
  - name: my-grpc-service
    ports:
      - containerPort: 50051
    livenessProbe:
      grpc:
        port: 50051
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      grpc:
        port: 50051
        service: "mypackage.MyService"
      initialDelaySeconds: 5
      periodSeconds: 5
    startupProbe:
      grpc:
        port: 50051
      failureThreshold: 30
      periodSeconds: 2

The three probe types serve different purposes for gRPC services. The startup probe covers slow initialization (loading data, warming caches) and keeps the other probes from firing until the server is actually up. The readiness probe controls whether the pod receives traffic: failing it removes the pod from the Service endpoints without restarting it, which is exactly the behavior graceful shutdown relies on. The liveness probe restarts a process that is genuinely stuck; it should be less sensitive than the readiness probe so a transient overload does not trigger an unnecessary restart.

[Diagram: pod lifecycle and probe sequence. During Starting, the startup probe gates the others; during Serving, the readiness and liveness probes run; on SIGTERM the server sets NOT_SERVING and drains in-flight RPCs before termination.]

Older Clusters: grpc-health-probe Binary

For Kubernetes versions before 1.24, you need the grpc-health-probe binary. This is a small static binary that calls the gRPC health checking protocol and exits with code 0 when the service is healthy and a non-zero code otherwise. You include it in your container image and reference it in an exec probe:

livenessProbe:
  exec:
    command:
      - /bin/grpc-health-probe
      - -addr=:50051
  initialDelaySeconds: 10
  periodSeconds: 10

Prefer the native grpc: probe type when your cluster supports it. It eliminates the need for an extra binary, reduces image size, and avoids process forking for each health check.

Graceful Shutdown and Connection Draining

When Kubernetes terminates a pod (during a rolling update, scale-down, or node drain), the default behavior is to send SIGTERM, wait terminationGracePeriodSeconds (default 30 seconds), and then send SIGKILL. For gRPC services, the window between SIGTERM and SIGKILL is your opportunity to drain in-flight RPCs without dropping them.

A proper graceful shutdown sequence for a gRPC server in Kubernetes:

  1. Catch SIGTERM — register a signal handler before starting the gRPC server.
  2. Set health status to NOT_SERVING — this causes the readiness probe to fail, removing the pod from the Service endpoints. New RPCs will be routed to other pods.
  3. Wait for endpoint propagation — there is a delay (typically 1-5 seconds) between the readiness probe failing and all clients/proxies learning about the change. During this window, new RPCs may still arrive. Sleep briefly to account for this.
  4. Call GracefulStop() — this stops accepting new RPCs and waits for all in-flight RPCs to complete. Set a deadline so you do not exceed terminationGracePeriodSeconds.
  5. Force stop as a fallback — if GracefulStop does not complete within the deadline, call Stop() to forcefully terminate remaining RPCs.

// Go graceful shutdown pattern
import (
    "os"
    "os/signal"
    "syscall"
    "time"
)

sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

go func() {
    <-sigCh

    // 1. Mark as not ready
    healthServer.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

    // 2. Wait for endpoint propagation
    time.Sleep(5 * time.Second)

    // 3. Drain with a deadline
    done := make(chan struct{})
    go func() {
        grpcServer.GracefulStop()
        close(done)
    }()

    select {
    case <-done:
        // Drained cleanly
    case <-time.After(20 * time.Second):
        grpcServer.Stop() // Force stop
    }
}()

The terminationGracePeriodSeconds in your pod spec must be long enough to cover this entire sequence. If your longest RPC might take 60 seconds, set it to at least 70:

spec:
  terminationGracePeriodSeconds: 70
  containers:
    - name: my-grpc-service
      # ...

There is a subtle race condition to be aware of: Kubernetes sends SIGTERM and removes the pod from Endpoints concurrently, not sequentially. This means the pod might receive new RPCs after SIGTERM is sent but before all clients learn the pod is gone. The sleep in step 3 above handles this, but you can also use a preStop lifecycle hook for a more Kubernetes-native approach:

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]

This delays the SIGTERM signal by 5 seconds, giving the endpoints controller time to remove the pod from the Service before the application starts shutting down.

Horizontal Pod Autoscaling for gRPC

Autoscaling gRPC services is more nuanced than autoscaling HTTP/1.1 workloads. The standard CPU-based HPA often works poorly for gRPC because HTTP/2 multiplexing means a small number of connections can drive high RPC throughput with relatively low CPU usage (much of the time is spent waiting on I/O or downstream services).

Custom Metrics

Better autoscaling signals for gRPC services include the rate of RPCs handled per pod, the number of in-flight (concurrent) RPCs, and tail latency (p95/p99). For streaming or queue-driven services, consumer lag is often the most direct signal. These metrics track real load far more faithfully than CPU for I/O-bound gRPC workloads.

With Prometheus and the prometheus-adapter, you can expose gRPC-specific metrics to the HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-grpc-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-grpc-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: grpc_server_handled_total_rate
        target:
          type: AverageValue
          averageValue: "500"    # 500 RPCs/s per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60

The behavior section is important for gRPC. Scaling down too aggressively breaks existing connections. A stabilization window of 2 minutes and a maximum scale-down of 25% per minute gives clients time to re-establish connections as pods are removed. Without this, a rapid scale-down can cause a burst of errors as connections to deleted pods are reset.

KEDA for Event-Driven Scaling

For gRPC services that process messages from queues (Kafka, NATS, RabbitMQ), KEDA can scale based on queue depth or consumer lag. This is particularly useful for gRPC services that act as streaming processors, where the rate of incoming messages drives the need for more pods.

gRPC with Ingress Controllers

If you need to expose gRPC services outside the cluster without a full service mesh, several Ingress controllers support gRPC as a first-class protocol. The key requirement is that the Ingress controller must handle HTTP/2 end-to-end, not downgrade to HTTP/1.1 at any hop.

Ingress Controller gRPC Support Comparison

Controller      L7 gRPC   Per-RPC LB   GRPCRoute   Config Method
NGINX           Yes       Yes          No          Annotation
Traefik         Yes       Yes          Yes         IngressRoute CRD
Contour         Yes       Yes          Yes         HTTPProxy CRD
Envoy Gateway   Yes       Yes          Yes         Gateway API

All controllers above support HTTP/2 upstream and can load-balance per RPC.

NGINX Ingress Controller

The NGINX Ingress Controller supports gRPC through annotations. You need to use the nginx.ingress.kubernetes.io/backend-protocol: "GRPC" annotation (or "GRPCS" for TLS backends) to tell NGINX to use HTTP/2 for upstream connections:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-grpc-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - grpc.example.com
      secretName: grpc-tls
  rules:
    - host: grpc.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-grpc-service
                port:
                  number: 50051

NGINX terminates TLS and re-initiates HTTP/2 to each upstream pod. Because it understands HTTP/2, it can distribute individual gRPC requests across pods instead of pinning to a single backend. However, NGINX's gRPC support has some caveats: it does not support gRPC-Web natively (you need a separate proxy for browser clients), and configuration for streaming RPCs requires tuning timeouts carefully.

Traefik

Traefik handles gRPC natively when it detects HTTP/2 traffic. With its CRD-based IngressRoute, you can define gRPC-specific routing and middleware:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: my-grpc-route
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`grpc.example.com`)
      kind: Rule
      services:
        - name: my-grpc-service
          port: 50051
          scheme: h2c          # HTTP/2 cleartext to backend

Traefik also implements the Gateway API, so you can use the GRPCRoute resource directly. Its middleware chain supports retries, rate limiting, and circuit breaking, all of which apply per-RPC for gRPC traffic.

Contour

Contour, powered by Envoy, provides first-class gRPC support through its HTTPProxy CRD. Since Envoy is the data plane, Contour inherits all of Envoy's HTTP/2 and gRPC capabilities:

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: my-grpc-proxy
spec:
  virtualhost:
    fqdn: grpc.example.com
    tls:
      secretName: grpc-tls
  routes:
    - conditions:
        - prefix: /
      services:
        - name: my-grpc-service
          port: 50051
          protocol: h2c
      loadBalancerPolicy:
        strategy: WeightedLeastRequest

Contour's WeightedLeastRequest strategy is well suited for gRPC because it routes each RPC to the backend with the fewest active requests, accounting for variable RPC durations. Contour also supports request-level retries, per-route timeout configuration, and header-based routing for gRPC metadata.

Keepalive Tuning for Cloud Load Balancers

When gRPC services run behind cloud load balancers (AWS ALB/NLB, GCP load balancer, Azure Application Gateway), idle connection timeouts become a major source of errors. Cloud load balancers silently drop idle TCP connections after a timeout, typically 60-350 seconds depending on the provider. The gRPC client does not learn about the dropped connection until it tries to send the next RPC, resulting in a confusing RST or timeout error.

gRPC's keepalive mechanism sends periodic HTTP/2 PING frames to keep the connection alive and detect dead connections early. Both client and server need to be configured:

Client-Side Keepalive

// Go client keepalive
import "google.golang.org/grpc/keepalive"

conn, err := grpc.Dial(target,
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                20 * time.Second, // Send PING every 20s if idle
        Timeout:             5 * time.Second,  // Wait 5s for PING ACK
        PermitWithoutStream: true,             // PING even with no active RPCs
    }),
)

Server-Side Keepalive

// Go server keepalive
import "google.golang.org/grpc/keepalive"

grpcServer := grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionIdle:     5 * time.Minute,   // Close idle connections
        MaxConnectionAge:      30 * time.Minute,  // Force reconnect periodically
        MaxConnectionAgeGrace: 10 * time.Second,  // Grace period for in-flight RPCs
        Time:                  20 * time.Second,   // PING every 20s
        Timeout:               5 * time.Second,    // Wait 5s for PING ACK
    }),
    grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
        MinTime:             10 * time.Second, // Min time between client PINGs
        PermitWithoutStream: true,
    }),
)

The critical parameters interact directly with the cloud environment. The client's keepalive Time must sit comfortably below the load balancer's idle timeout (20 seconds is safe for every major provider). The server's EnforcementPolicy MinTime must be at or below the client's Time: if clients ping more often than the server permits, the server responds with a GOAWAY (ENHANCE_YOUR_CALM) and closes the connection. PermitWithoutStream matters for clients with bursty traffic, since without it an idle connection sends no PINGs and can still be dropped silently.

[Diagram: keepalive prevents silent connection drops. Without keepalive, a connection idle for 60s is silently dropped by the cloud LB and the next RPC fails with an RST or timeout. With a 20s keepalive, periodic PING frames keep the connection alive and the next RPC succeeds.]

MaxConnectionAge for Load Distribution

MaxConnectionAge deserves special attention because it solves a problem unique to gRPC in Kubernetes: when you scale out a deployment, existing clients do not discover the new pods. Unlike HTTP/1.1, where short-lived connections naturally get balanced to new backends, gRPC clients hold their connections open indefinitely. By setting MaxConnectionAge on the server, you force periodic reconnection. When the client reconnects, DNS resolution (or the mesh proxy) will include the new pods in the list, and load spreads naturally.

The trade-off is reconnection cost. Each reconnection requires a new TCP handshake, TLS handshake (if applicable), and HTTP/2 SETTINGS exchange. With gRPC connection multiplexing, a single client typically maintains only a handful of connections, so periodic reconnection is cheap. In practice, 15-30 minutes is a good MaxConnectionAge for most services. Latency-sensitive services that cannot tolerate even a single retry should use MaxConnectionAgeGrace to ensure in-flight RPCs complete before the connection is closed.

Production Deployment Patterns

Choosing the right pattern depends on your cluster's complexity and operational maturity. Here is a decision framework:

Pattern 1: Headless Service + Client-Side Balancing

Best for: small clusters, single-language stacks, low operational overhead tolerance.

Use a headless Service with clusterIP: None, configure each gRPC client with the dns:/// resolver and round_robin policy, and set MaxConnectionAge on the server. This is the simplest pattern with the least moving parts. The downside is that every client in every language must be configured correctly, and you get no traffic management features.

Pattern 2: Service Mesh (Istio / Linkerd)

Best for: large polyglot environments, teams that need traffic management, canary deployments, and mutual TLS.

Inject sidecars, use regular (non-headless) ClusterIP Services, and let the mesh handle per-RPC balancing, retries, and observability. Application code needs no gRPC-specific load balancing configuration. The cost is the operational burden of the mesh control plane and the per-pod resource overhead of the sidecar proxy.

Pattern 3: Gateway API + GRPCRoute

Best for: edge ingress of gRPC traffic into the cluster. Combine with either Pattern 1 or Pattern 2 for east-west (pod-to-pod) traffic.

Use GRPCRoute for north-south traffic entering the cluster, with method-level routing and traffic splitting. This pairs well with Envoy Gateway or Contour as the gateway implementation. For internal service-to-service gRPC traffic, you still need one of the other patterns.

Pattern 4: Ingress Controller

Best for: teams already running NGINX or Traefik Ingress who need to expose gRPC externally without adopting Gateway API.

Configure the Ingress controller with the appropriate annotations or CRDs for HTTP/2 backend protocol. This works well for exposing a small number of gRPC services externally but does not solve east-west load balancing.

Common Pitfalls and Debugging

Even with the right pattern, gRPC on Kubernetes surfaces operational issues that are not obvious from documentation. Client-side DNS caching can delay discovery of new pods after a scale-out, which is one more reason to set MaxConnectionAge on the server. Clients that ping more aggressively than the server's keepalive enforcement policy allows receive GOAWAY (ENHANCE_YOUR_CALM) responses that look like random disconnects. UNAVAILABLE errors during rolling updates usually trace back to a missing graceful shutdown sequence or a terminationGracePeriodSeconds that is too short. And any hop that silently downgrades to HTTP/1.1 (an older proxy, a misconfigured Ingress) breaks gRPC entirely, surfacing as opaque protocol errors on the client.

Putting It All Together

A production-ready gRPC deployment on Kubernetes combines several of these patterns. A typical setup for a high-traffic gRPC service looks like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-grpc-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-grpc-service
  template:
    metadata:
      labels:
        app: my-grpc-service
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: server
          image: my-grpc-service:v1.2.3
          ports:
            - containerPort: 50051
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              memory: 512Mi
          startupProbe:
            grpc:
              port: 50051
            failureThreshold: 30
            periodSeconds: 2
          readinessProbe:
            grpc:
              port: 50051
            periodSeconds: 5
          livenessProbe:
            grpc:
              port: 50051
            periodSeconds: 10
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]

This configuration handles the full lifecycle: startup probes prevent premature traffic, readiness probes remove unhealthy pods from load balancing, liveness probes restart stuck processes, and the preStop hook with graceful shutdown ensures zero-dropped RPCs during rolling updates. Combine this Deployment with your chosen load balancing pattern (headless Service, mesh, or gateway) and appropriate keepalive tuning for your cloud environment.

For more background on the protocols and infrastructure discussed here, see the guides on how gRPC works, gRPC load balancing strategies, the role of service meshes in gRPC architectures, and the fundamentals of how load balancers work.
