gRPC in Kubernetes: Deployment Patterns

Deploying a gRPC service to Kubernetes sounds straightforward until your first production load test reveals that a single pod is handling 90% of the traffic while the others sit idle. The root cause is a fundamental mismatch between how Kubernetes networking works and what gRPC needs. Kubernetes Services default to L4 (TCP-level) load balancing via kube-proxy, but gRPC multiplexes many requests over a single long-lived HTTP/2 connection. Once that connection is established to one pod, all subsequent RPCs flow over it, completely bypassing the load balancer. Understanding this mismatch and the patterns that solve it is essential for anyone running gRPC in production on Kubernetes.

The Core Problem: L4 Load Balancing Meets HTTP/2

Kubernetes Services use kube-proxy to distribute traffic across pods. Kube-proxy operates at Layer 4 of the OSI model, meaning it makes routing decisions when a new TCP connection is established. For traditional HTTP/1.1 traffic, this works well because clients open many short-lived connections, and each new connection gets balanced to a different pod.

gRPC is built on HTTP/2, which fundamentally changes this dynamic. HTTP/2 is designed to minimize connections. A single HTTP/2 connection multiplexes hundreds or thousands of concurrent requests (called streams) over one TCP socket. When a gRPC client connects through a Kubernetes Service, kube-proxy routes that one TCP connection to a single pod. From that point forward, every RPC the client sends travels over that same connection to that same pod, regardless of how many other pods are available.

[Diagram: a gRPC client sends 1000 RPCs/s to a ClusterIP Service. kube-proxy (L4) routes the single TCP connection to Pod 1, which receives all 1000 RPCs/s; Pods 2 and 3 receive none. All RPCs are routed to the same pod via a single HTTP/2 connection.]

The result is a hot-pod problem. In a deployment with three replicas, one pod might be saturated while the other two sit idle. CPU and memory metrics look fine on average, masking the fact that one pod is being hammered. This is the single most common gRPC-on-Kubernetes pitfall, and it applies to every Kubernetes distribution, whether you are running on GKE, EKS, AKS, or bare-metal clusters.

Headless Services and Client-Side Load Balancing

The simplest fix for gRPC load balancing in Kubernetes is to bypass kube-proxy entirely using a headless Service. A headless Service (one with clusterIP: None) does not allocate a virtual IP. Instead, a DNS lookup against the Service name returns the individual pod IPs as A/AAAA records.

apiVersion: v1
kind: Service
metadata:
  name: my-grpc-service
spec:
  clusterIP: None          # headless
  selector:
    app: my-grpc-service
  ports:
    - port: 50051
      targetPort: 50051
      protocol: TCP

With a headless Service, a gRPC client resolves the service hostname and receives a list of pod IPs. It can then open a connection to each pod and distribute RPCs across them. Most gRPC client libraries have built-in support for this via their name resolution and load balancing APIs.

In Go, for example, you can use the dns resolver scheme and the round_robin balancer:

import "google.golang.org/grpc"
import "google.golang.org/grpc/credentials/insecure"

conn, err := grpc.Dial(
    "dns:///my-grpc-service.default.svc.cluster.local:50051",
    grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"round_robin":{}}]}`),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)

The dns:/// prefix tells the gRPC library to use its built-in DNS resolver. The resolver periodically re-resolves the hostname, picks up new pods as they appear, and drops pods that have been removed. The round_robin policy creates a subchannel (connection) to each resolved address and distributes RPCs evenly across them.

Limitations of Client-Side Balancing

Client-side balancing with headless Services is lightweight and requires no additional infrastructure, but it has real constraints. Every client, in every language, must be configured with the right resolver and load balancing policy; one misconfigured client silently reverts to single-pod pinning. DNS-based discovery is pull-based: the client only learns about new or removed pods when it re-resolves, so there is a window where traffic still targets deleted pods or ignores new ones. And there is no central place to apply retries, timeouts, traffic splitting, or mutual TLS; each of these must be implemented in every client.

Envoy Sidecar and Istio for L7 Load Balancing

A service mesh solves the gRPC load balancing problem at the infrastructure level, removing the burden from individual clients. Istio, the most widely adopted service mesh, injects an Envoy proxy as a sidecar container into each pod. All inbound and outbound traffic passes through this proxy, which operates at Layer 7 and understands the HTTP/2 framing that gRPC uses.

[Diagram: the app container in a client pod sends all traffic through its Envoy sidecar, which spreads roughly 333 RPCs/s across each of three server pods (each with its own Envoy sidecar). Routing is L7-aware and per-RPC.]

Because Envoy decodes HTTP/2 frames, it can make a routing decision for each individual gRPC request, not just for the TCP connection. The client application connects to localhost (the sidecar), and Envoy handles service discovery, load balancing, retries, and circuit breaking transparently. The application code needs no gRPC-specific load balancing configuration at all.

Istio provides additional capabilities on top of Envoy's per-request balancing: automatic mutual TLS between pods, weighted traffic shifting for canary releases, per-route retries and timeouts, outlier detection for ejecting unhealthy backends, and uniform telemetry (metrics, distributed traces, and access logs) without any application changes.

A typical Istio DestinationRule for a gRPC service might look like:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-grpc-service
spec:
  host: my-grpc-service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s

The LEAST_REQUEST policy is generally preferred over ROUND_ROBIN for gRPC services because it accounts for variable RPC durations, sending new requests to the pod with the fewest active requests. The trade-off is operational complexity: Istio adds a sidecar to every pod, increasing memory consumption (typically 50-100MB per sidecar), adding latency (0.5-2ms per hop), and introducing a significant control plane that needs its own monitoring and maintenance.

Gateway API and gRPC Routes

Kubernetes Gateway API, the successor to the Ingress API, introduced native gRPC support through the GRPCRoute resource. This allows Kubernetes-native, declarative routing for gRPC traffic at the cluster edge without requiring a full service mesh.

apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: my-grpc-route
spec:
  parentRefs:
    - name: my-gateway
  rules:
    - matches:
        - method:
            service: mypackage.MyService
            method: GetUser
      backendRefs:
        - name: my-grpc-service
          port: 50051
    - matches:
        - method:
            service: mypackage.MyService
      backendRefs:
        - name: my-grpc-service
          port: 50051

The GRPCRoute resource allows matching on gRPC service and method names, enabling fine-grained routing at the gateway level. You can route different gRPC methods to different backend services, split traffic for canary deployments, or apply different policies per method. The key advantage over Ingress is that GRPCRoute is L7-aware by design: the gateway implementation (Envoy, Contour, or another conformant controller) decodes HTTP/2 and routes per-RPC.
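Beyond method matching, GRPCRoute can also match on gRPC metadata, which is carried as HTTP/2 headers. A sketch of a header-based rule (the x-env metadata key and the staging backend name are hypothetical):

```yaml
rules:
  - matches:
      - method:
          service: mypackage.MyService
        headers:
          - name: x-env                    # hypothetical metadata key
            value: staging
    backendRefs:
      - name: my-grpc-service-staging      # hypothetical backend
        port: 50051
```

This pattern lets a client opt into a staging backend by attaching metadata to its RPCs, without any change to the primary route.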

The Gateway API also supports traffic splitting via weighted backendRefs:

rules:
  - backendRefs:
      - name: my-grpc-service-v1
        port: 50051
        weight: 90
      - name: my-grpc-service-v2
        port: 50051
        weight: 10

This sends 10% of gRPC requests to the v2 deployment, regardless of how many TCP connections are open. Gateway API support for gRPC is now GA (stable) and implemented by most major gateway controllers including Envoy Gateway, Istio, Contour, and Traefik.

gRPC Health Checking in Kubernetes

Proper health checking is critical for gRPC services in Kubernetes. Kubernetes probes (liveness, readiness, startup) and the gRPC health checking protocol serve complementary but distinct purposes, and getting them wrong can cause subtle availability issues.

The gRPC Health Checking Protocol

gRPC defines a standard health checking protocol in grpc.health.v1.Health. It is a regular gRPC service with a Check RPC that returns a status enum: SERVING, NOT_SERVING, UNKNOWN, or SERVICE_UNKNOWN. The protocol also supports a Watch RPC for streaming health status updates. Implementing this protocol in your service is the first step:

// Go example using grpc-health
import "google.golang.org/grpc/health"
import healthpb "google.golang.org/grpc/health/grpc_health_v1"

healthServer := health.NewServer()
healthpb.RegisterHealthServer(grpcServer, healthServer)

// Set status per service
healthServer.SetServingStatus("mypackage.MyService", healthpb.HealthCheckResponse_SERVING)

// During graceful shutdown
healthServer.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

Kubernetes Probes for gRPC

Since Kubernetes 1.24, kubelet natively supports gRPC health probes without needing a third-party binary. You configure them in the pod spec:

containers:
  - name: my-grpc-service
    ports:
      - containerPort: 50051
    livenessProbe:
      grpc:
        port: 50051
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      grpc:
        port: 50051
        service: "mypackage.MyService"
      initialDelaySeconds: 5
      periodSeconds: 5
    startupProbe:
      grpc:
        port: 50051
      failureThreshold: 30
      periodSeconds: 2

The three probe types serve different purposes for gRPC services. The startup probe covers slow initialization (loading data, warming caches) and keeps the other probes from firing until the server is actually up. The readiness probe controls whether the pod receives traffic: failing it removes the pod from the Service endpoints without restarting it, which is exactly the behavior graceful shutdown relies on. The liveness probe restarts a process that is genuinely stuck; it should be less sensitive than the readiness probe so a transient overload does not trigger an unnecessary restart.

[Diagram: pod lifecycle and probe sequence. During Starting, the startup probe gates the others; during Serving, the readiness and liveness probes run; on SIGTERM the server sets NOT_SERVING and drains in-flight RPCs before termination.]

Older Clusters: grpc-health-probe Binary

For Kubernetes versions before 1.24, you need the grpc-health-probe binary. This is a small static binary that calls the gRPC health checking protocol and exits with code 0 when the service is healthy and a non-zero code otherwise. You include it in your container image and reference it in an exec probe:

livenessProbe:
  exec:
    command:
      - /bin/grpc-health-probe
      - -addr=:50051
  initialDelaySeconds: 10
  periodSeconds: 10

Prefer the native grpc: probe type when your cluster supports it. It eliminates the need for an extra binary, reduces image size, and avoids process forking for each health check.

Graceful Shutdown and Connection Draining

When Kubernetes terminates a pod (during a rolling update, scale-down, or node drain), the default behavior is to send SIGTERM, wait terminationGracePeriodSeconds (default 30 seconds), and then send SIGKILL. For gRPC services, the window between SIGTERM and SIGKILL is your opportunity to drain in-flight RPCs without dropping them.

A proper graceful shutdown sequence for a gRPC server in Kubernetes:

  1. Catch SIGTERM — register a signal handler before starting the gRPC server.
  2. Set health status to NOT_SERVING — this causes the readiness probe to fail, removing the pod from the Service endpoints. New RPCs will be routed to other pods.
  3. Wait for endpoint propagation — there is a delay (typically 1-5 seconds) between the readiness probe failing and all clients/proxies learning about the change. During this window, new RPCs may still arrive. Sleep briefly to account for this.
  4. Call GracefulStop() — this stops accepting new RPCs and waits for all in-flight RPCs to complete. Set a deadline so you do not exceed terminationGracePeriodSeconds.
  5. Force stop as a fallback — if GracefulStop does not complete within the deadline, call Stop() to forcefully terminate remaining RPCs.

// Go graceful shutdown pattern
import (
    "os"
    "os/signal"
    "syscall"
    "time"
)

sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

go func() {
    <-sigCh

    // 1. Mark as not ready
    healthServer.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

    // 2. Wait for endpoint propagation
    time.Sleep(5 * time.Second)

    // 3. Drain with a deadline
    done := make(chan struct{})
    go func() {
        grpcServer.GracefulStop()
        close(done)
    }()

    select {
    case <-done:
        // Drained cleanly
    case <-time.After(20 * time.Second):
        grpcServer.Stop() // Force stop
    }
}()

The terminationGracePeriodSeconds in your pod spec must be long enough to cover this entire sequence. If your longest RPC might take 60 seconds, set it to at least 70:

spec:
  terminationGracePeriodSeconds: 70
  containers:
    - name: my-grpc-service
      # ...

There is a subtle race condition to be aware of: Kubernetes sends SIGTERM and removes the pod from Endpoints concurrently, not sequentially. This means the pod might receive new RPCs after SIGTERM is sent but before all clients learn the pod is gone. The sleep in step 3 above handles this, but you can also use a preStop lifecycle hook for a more Kubernetes-native approach:

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]

This delays the SIGTERM signal by 5 seconds, giving the endpoints controller time to remove the pod from the Service before the application starts shutting down.

Horizontal Pod Autoscaling for gRPC

Autoscaling gRPC services is more nuanced than autoscaling HTTP/1.1 workloads. The standard CPU-based HPA often works poorly for gRPC because HTTP/2 multiplexing means a small number of connections can drive high RPC throughput with relatively low CPU usage (much of the time is spent waiting on I/O or downstream services).

Custom Metrics

Better autoscaling signals for gRPC services include the rate of RPCs handled per pod, the number of in-flight (concurrent) RPCs, and tail latency (p95/p99). For streaming or queue-driven services, consumer lag is often the most direct signal. These metrics track real load far more faithfully than CPU for I/O-bound gRPC workloads.

With Prometheus and the prometheus-adapter, you can expose gRPC-specific metrics to the HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-grpc-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-grpc-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: grpc_server_handled_total_rate
        target:
          type: AverageValue
          averageValue: "500"    # 500 RPCs/s per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60

The behavior section is important for gRPC. Scaling down too aggressively breaks existing connections. A stabilization window of 2 minutes and a maximum scale-down of 25% per minute gives clients time to re-establish connections as pods are removed. Without this, a rapid scale-down can cause a burst of errors as connections to deleted pods are reset.

KEDA for Event-Driven Scaling

For gRPC services that process messages from queues (Kafka, NATS, RabbitMQ), KEDA can scale based on queue depth or consumer lag. This is particularly useful for gRPC services that act as streaming processors, where the rate of incoming messages drives the need for more pods.

gRPC with Ingress Controllers

If you need to expose gRPC services outside the cluster without a full service mesh, several Ingress controllers support gRPC as a first-class protocol. The key requirement is that the Ingress controller must handle HTTP/2 end-to-end, not downgrade to HTTP/1.1 at any hop.

Ingress Controller gRPC Support Comparison

Controller      L7 gRPC   Per-RPC LB   GRPCRoute   Config Method
NGINX           Yes       Yes          No          Annotation
Traefik         Yes       Yes          Yes         IngressRoute CRD
Contour         Yes       Yes          Yes         HTTPProxy CRD
Envoy Gateway   Yes       Yes          Yes         Gateway API

All controllers above support HTTP/2 upstream and can load-balance per RPC.

NGINX Ingress Controller

The NGINX Ingress Controller supports gRPC through annotations. You need to use the nginx.ingress.kubernetes.io/backend-protocol: "GRPC" annotation (or "GRPCS" for TLS backends) to tell NGINX to use HTTP/2 for upstream connections:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-grpc-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - grpc.example.com
      secretName: grpc-tls
  rules:
    - host: grpc.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-grpc-service
                port:
                  number: 50051

NGINX terminates TLS and re-initiates HTTP/2 to each upstream pod. Because it understands HTTP/2, it can distribute individual gRPC requests across pods instead of pinning to a single backend. However, NGINX's gRPC support has some caveats: it does not support gRPC-Web natively (you need a separate proxy for browser clients), and configuration for streaming RPCs requires tuning timeouts carefully.

Traefik

Traefik handles gRPC natively when it detects HTTP/2 traffic. With its CRD-based IngressRoute, you can define gRPC-specific routing and middleware:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: my-grpc-route
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`grpc.example.com`)
      kind: Rule
      services:
        - name: my-grpc-service
          port: 50051
          scheme: h2c          # HTTP/2 cleartext to backend

Traefik also implements the Gateway API, so you can use the GRPCRoute resource directly. Its middleware chain supports retries, rate limiting, and circuit breaking, all of which apply per-RPC for gRPC traffic.

Contour

Contour, powered by Envoy, provides first-class gRPC support through its HTTPProxy CRD. Since Envoy is the data plane, Contour inherits all of Envoy's HTTP/2 and gRPC capabilities:

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: my-grpc-proxy
spec:
  virtualhost:
    fqdn: grpc.example.com
    tls:
      secretName: grpc-tls
  routes:
    - conditions:
        - prefix: /
      services:
        - name: my-grpc-service
          port: 50051
          protocol: h2c
      loadBalancerPolicy:
        strategy: WeightedLeastRequest

Contour's WeightedLeastRequest strategy is well suited for gRPC because it routes each RPC to the backend with the fewest active requests, accounting for variable RPC durations. Contour also supports request-level retries, per-route timeout configuration, and header-based routing for gRPC metadata.

Keepalive Tuning for Cloud Load Balancers

When gRPC services run behind cloud load balancers (AWS ALB/NLB, GCP load balancer, Azure Application Gateway), idle connection timeouts become a major source of errors. Cloud load balancers silently drop idle TCP connections after a timeout, typically 60-350 seconds depending on the provider. The gRPC client does not learn about the dropped connection until it tries to send the next RPC, resulting in a confusing RST or timeout error.

gRPC's keepalive mechanism sends periodic HTTP/2 PING frames to keep the connection alive and detect dead connections early. Both client and server need to be configured:

Client-Side Keepalive

// Go client keepalive
import "google.golang.org/grpc/keepalive"

conn, err := grpc.Dial(target,
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                20 * time.Second, // Send PING every 20s if idle
        Timeout:             5 * time.Second,  // Wait 5s for PING ACK
        PermitWithoutStream: true,             // PING even with no active RPCs
    }),
)

Server-Side Keepalive

// Go server keepalive
import "google.golang.org/grpc/keepalive"

grpcServer := grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionIdle:     5 * time.Minute,   // Close idle connections
        MaxConnectionAge:      30 * time.Minute,  // Force reconnect periodically
        MaxConnectionAgeGrace: 10 * time.Second,  // Grace period for in-flight RPCs
        Time:                  20 * time.Second,   // PING every 20s
        Timeout:               5 * time.Second,    // Wait 5s for PING ACK
    }),
    grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
        MinTime:             10 * time.Second, // Min time between client PINGs
        PermitWithoutStream: true,
    }),
)

The critical parameters interact directly with the cloud environment. The client's keepalive Time must sit comfortably below the load balancer's idle timeout (20 seconds is safe for every major provider). The server's EnforcementPolicy MinTime must be at or below the client's Time: if clients ping more often than the server permits, the server responds with a GOAWAY (ENHANCE_YOUR_CALM) and closes the connection. PermitWithoutStream matters for clients with bursty traffic, since without it an idle connection sends no PINGs and can still be dropped silently.

[Diagram: keepalive prevents silent connection drops. Without keepalive, a connection idle for 60s is silently dropped by the cloud LB and the next RPC fails with an RST or timeout. With a 20s keepalive, periodic PING frames keep the connection alive and the next RPC succeeds.]

MaxConnectionAge for Load Distribution

MaxConnectionAge deserves special attention because it solves a problem unique to gRPC in Kubernetes: when you scale out a deployment, existing clients do not discover the new pods. Unlike HTTP/1.1, where short-lived connections naturally get balanced to new backends, gRPC clients hold their connections open indefinitely. By setting MaxConnectionAge on the server, you force periodic reconnection. When the client reconnects, DNS resolution (or the mesh proxy) will include the new pods in the list, and load spreads naturally.

The trade-off is reconnection cost. Each reconnection requires a new TCP handshake, TLS handshake (if applicable), and HTTP/2 SETTINGS exchange. With gRPC connection multiplexing, a single client typically maintains only a handful of connections, so periodic reconnection is cheap. In practice, 15-30 minutes is a good MaxConnectionAge for most services. Latency-sensitive services that cannot tolerate even a single retry should use MaxConnectionAgeGrace to ensure in-flight RPCs complete before the connection is closed.

Production Deployment Patterns

Choosing the right pattern depends on your cluster's complexity and operational maturity. Here is a decision framework:

Pattern 1: Headless Service + Client-Side Balancing

Best for: small clusters, single-language stacks, low operational overhead tolerance.

Use a headless Service with clusterIP: None, configure each gRPC client with the dns:/// resolver and round_robin policy, and set MaxConnectionAge on the server. This is the simplest pattern with the least moving parts. The downside is that every client in every language must be configured correctly, and you get no traffic management features.

Pattern 2: Service Mesh (Istio / Linkerd)

Best for: large polyglot environments, teams that need traffic management, canary deployments, and mutual TLS.

Inject sidecars, use regular (non-headless) ClusterIP Services, and let the mesh handle per-RPC balancing, retries, and observability. Application code needs no gRPC-specific load balancing configuration. The cost is the operational burden of the mesh control plane and the per-pod resource overhead of the sidecar proxy.

Pattern 3: Gateway API + GRPCRoute

Best for: edge ingress of gRPC traffic into the cluster. Combine with either Pattern 1 or Pattern 2 for east-west (pod-to-pod) traffic.

Use GRPCRoute for north-south traffic entering the cluster, with method-level routing and traffic splitting. This pairs well with Envoy Gateway or Contour as the gateway implementation. For internal service-to-service gRPC traffic, you still need one of the other patterns.

Pattern 4: Ingress Controller

Best for: teams already running NGINX or Traefik Ingress who need to expose gRPC externally without adopting Gateway API.

Configure the Ingress controller with the appropriate annotations or CRDs for HTTP/2 backend protocol. This works well for exposing a small number of gRPC services externally but does not solve east-west load balancing.

Common Pitfalls and Debugging

Even with the right pattern, gRPC on Kubernetes surfaces operational issues that are not obvious from documentation. Client-side DNS caching can delay discovery of new pods after a scale-out, which is one more reason to set MaxConnectionAge on the server. Clients that ping more aggressively than the server's keepalive enforcement policy allows receive GOAWAY (ENHANCE_YOUR_CALM) responses that look like random disconnects. UNAVAILABLE errors during rolling updates usually trace back to a missing graceful shutdown sequence or a terminationGracePeriodSeconds that is too short. And any hop that silently downgrades to HTTP/1.1 (an older proxy, a misconfigured Ingress) breaks gRPC entirely, surfacing as opaque protocol errors on the client.

Putting It All Together

A production-ready gRPC deployment on Kubernetes combines several of these patterns. A typical setup for a high-traffic gRPC service looks like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-grpc-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-grpc-service
  template:
    metadata:
      labels:
        app: my-grpc-service
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: server
          image: my-grpc-service:v1.2.3
          ports:
            - containerPort: 50051
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              memory: 512Mi
          startupProbe:
            grpc:
              port: 50051
            failureThreshold: 30
            periodSeconds: 2
          readinessProbe:
            grpc:
              port: 50051
            periodSeconds: 5
          livenessProbe:
            grpc:
              port: 50051
            periodSeconds: 10
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]

This configuration handles the full lifecycle: startup probes prevent premature traffic, readiness probes remove unhealthy pods from load balancing, liveness probes restart stuck processes, and the preStop hook with graceful shutdown ensures zero-dropped RPCs during rolling updates. Combine this Deployment with your chosen load balancing pattern (headless Service, mesh, or gateway) and appropriate keepalive tuning for your cloud environment.

For more background on the protocols and infrastructure discussed here, see the guides on how gRPC works, gRPC load balancing strategies, the role of service meshes in gRPC architectures, and the fundamentals of how load balancers work.
