gRPC Error Handling and Status Codes

gRPC defines a structured error model that goes far beyond what HTTP status codes offer. Where HTTP gives you a numeric code and a text reason phrase, gRPC provides a machine-readable status code, a human-readable message, and an extensible details payload that can carry structured metadata about what went wrong, where, and why. Understanding this error model is essential to building resilient distributed systems with gRPC.

This guide covers the full gRPC error handling surface: the 17 canonical status codes, rich error details, propagation semantics, retries, deadlines, circuit breaking, and the nuances of error handling in streaming RPCs.

The 17 gRPC Status Codes

Every gRPC call completes with a Status consisting of a status code, an optional message string, and optional detail payloads. The status code is one of 17 values defined in google.rpc.Code. These are not arbitrary numbers like HTTP status codes that have accumulated over decades of RFCs. They were designed as a coherent set, each with a clear contract about semantics and retryability.

gRPC status codes by category:

Code  Name                 Notes
-----------------------------------------------------------------------
0     OK                   Success. Not an error.

Client errors:
1     CANCELLED            Client cancelled the request
3     INVALID_ARGUMENT     Bad request data. Do not retry without fixing input.
5     NOT_FOUND            Requested entity does not exist
6     ALREADY_EXISTS       Entity that client tried to create already exists
7     PERMISSION_DENIED    Caller lacks permission. Do not retry.
9     FAILED_PRECONDITION  System not in required state for the operation
11    OUT_OF_RANGE         Operation attempted past valid range
16    UNAUTHENTICATED      No valid authentication credentials

Server / infrastructure errors:
2     UNKNOWN
4     DEADLINE_EXCEEDED
8     RESOURCE_EXHAUSTED
10    ABORTED
12    UNIMPLEMENTED
13    INTERNAL
14    UNAVAILABLE
15    DATA_LOSS

OK (0)

The call succeeded. This is returned implicitly on success and is not an error. You will rarely construct an OK status explicitly, but it appears in monitoring dashboards, traces, and logs as the successful completion marker.

CANCELLED (1)

The operation was cancelled, typically by the caller. In Go, this corresponds to context.Canceled. In practice, a client cancels a request when it no longer needs the result — the user navigated away, a higher-level operation timed out, or a competing request finished first. The server should stop processing when it detects cancellation, as continuing wastes resources. Cancellation propagates automatically through gRPC call chains: if Service A calls Service B and Service A's client cancels, Service B's context is also cancelled.

UNKNOWN (2)

An unknown error occurred. This is the default code when an error is raised without a gRPC status code attached. If a server handler throws a raw exception in Java or panics in Go without wrapping the error, the framework maps it to UNKNOWN. Treat it as an indication of a bug or unhandled edge case. When you see UNKNOWN in production metrics, investigate and replace it with a more specific code.

INVALID_ARGUMENT (3)

The client sent a request that fails validation — a malformed field, a value out of bounds, a missing required parameter. This maps to HTTP 400. Retrying without changing the request is pointless. The error details should tell the client exactly which field is invalid and why. Use BadRequest error details (covered below) to provide per-field validation information.

DEADLINE_EXCEEDED (4)

The operation did not complete within the allocated deadline. This is one of the most important codes in distributed systems. Every gRPC call should have a deadline, and when that deadline expires, the call fails with this code. The operation may or may not have been executed on the server — the client cannot know. This maps loosely to HTTP 504 (Gateway Timeout). Whether to retry depends on the idempotency of the operation.

NOT_FOUND (5)

The requested resource does not exist. Maps to HTTP 404. Use this for entity lookups by ID, not for search queries that return empty results (those should return OK with an empty list). An important distinction: NOT_FOUND means the resource is expected to exist but does not. If the resource could be created, it is appropriate for the details to suggest the creation endpoint.

ALREADY_EXISTS (6)

A resource the client attempted to create already exists. Maps to HTTP 409. This is common in idempotent create operations. If the client retries a create that already succeeded, ALREADY_EXISTS tells it the operation effectively succeeded. The error details should include the identifier of the existing resource.

PERMISSION_DENIED (7)

The caller is authenticated but does not have permission for this operation. Maps to HTTP 403. Do not retry without changing the authorization context. This differs from UNAUTHENTICATED: PERMISSION_DENIED means "I know who you are, but you cannot do this." It should not leak information about whether the resource exists — if a user lacks read permission, return PERMISSION_DENIED rather than NOT_FOUND, to avoid exposing the existence of resources.

RESOURCE_EXHAUSTED (8)

A resource limit was reached — rate limit, quota, disk space, memory. Maps to HTTP 429 (Too Many Requests). This is retryable after a backoff period. Include RetryInfo or QuotaFailure in the error details to tell the client when and how it can retry. This is the correct code for rate limiting, not UNAVAILABLE.

FAILED_PRECONDITION (9)

The system is not in the state required for the operation to proceed. The client should not retry until the precondition is addressed. For example, trying to delete a non-empty directory, or attempting an operation on a resource that has not been initialized. The difference from INVALID_ARGUMENT is that the request itself is valid, but the system state makes it impossible to execute. The difference from ABORTED is that FAILED_PRECONDITION suggests the client should not retry without some intervention, while ABORTED suggests an immediate retry may succeed.

ABORTED (10)

The operation was aborted, typically due to a concurrency conflict like a transaction abort or an optimistic locking failure. The client should retry the entire read-modify-write sequence. This maps loosely to HTTP 409 in the context of conflicts. Unlike FAILED_PRECONDITION, an ABORTED operation may succeed on retry if the client re-reads the current state and resubmits.

OUT_OF_RANGE (11)

The operation was attempted past the valid range. For example, seeking past the end of a file, or requesting page -1 of a paginated API. This is distinct from INVALID_ARGUMENT: use OUT_OF_RANGE when the problem is a range issue that could be valid in other circumstances (like reading past the end of a growing file), and INVALID_ARGUMENT when the value is never valid (like a negative page size). OUT_OF_RANGE may indicate that the client should retry with different pagination parameters.

UNIMPLEMENTED (12)

The method is not implemented or not enabled on this server. Maps to HTTP 501. This commonly occurs during API evolution: a new RPC method is defined in the proto file but the server has not deployed an implementation yet. This is never retryable against the same server. It can also appear when a particular feature flag is disabled on a server.

INTERNAL (13)

An internal error occurred. This is the catch-all for server-side bugs — null pointer exceptions, broken invariants, unexpected states. Maps to HTTP 500. Like UNKNOWN, this should be investigated and ideally replaced with a more specific code. The difference: INTERNAL is explicitly set by server code that detects something is wrong, while UNKNOWN is the default when an error lacks a code.

UNAVAILABLE (14)

The service is temporarily unavailable. Maps to HTTP 503. This is the primary retryable server error. Use it for transient conditions: the server is starting up, shutting down, overloaded, or a downstream dependency is temporarily unreachable. gRPC clients with retry policies will automatically retry UNAVAILABLE errors. For persistent load issues, prefer RESOURCE_EXHAUSTED with backoff information.

DATA_LOSS (15)

Unrecoverable data loss or corruption. This is a severe error indicating that data the client depends on has been permanently lost or corrupted. Use it only for genuine data integrity failures, not for transient read errors. In practice, this is rare and always warrants investigation and alerting.

UNAUTHENTICATED (16)

The request does not have valid authentication credentials. Maps to HTTP 401. The client should obtain fresh credentials and retry. This differs from PERMISSION_DENIED: UNAUTHENTICATED means "I don't know who you are," while PERMISSION_DENIED means "I know who you are, but you cannot do this." Retrying with a refreshed token is appropriate.

Rich Error Details

The status code and message string are often insufficient to communicate what went wrong in a way that clients can act on programmatically. A message like "invalid request" does not tell the client which field was invalid, what the valid range is, or when it can retry. gRPC's rich error model solves this through the google.rpc.Status proto message, which carries an arbitrary list of typed detail payloads.

google.rpc.Status structure:

  code: int32                            // e.g. 3 (INVALID_ARGUMENT)
  message: string                        // "email field is not a valid address"
  details: repeated google.protobuf.Any

Common detail types carried in details:

  BadRequest    field_violations: [{field, description}]
  RetryInfo     retry_delay: Duration
  ErrorInfo     reason, domain, metadata map
  DebugInfo     stack_entries[], detail string

google.rpc.Status

The Status message is the wire format for gRPC errors. It contains three fields: the integer code, a developer-facing message string, and a list of google.protobuf.Any messages that carry typed detail payloads. The Any type acts as an envelope: each detail is serialized as a proto message with a type URL that lets the receiver deserialize it into the correct type.

In Go, you construct a rich error using the status and errdetails packages:

st := status.New(codes.InvalidArgument, "invalid email address")
st, _ = st.WithDetails(&errdetails.BadRequest{
    FieldViolations: []*errdetails.BadRequest_FieldViolation{{
        Field:       "email",
        Description: "must be a valid email address",
    }},
})
return st.Err()

The client extracts details by calling status.FromError(err) and iterating over the detail messages, type-asserting each one to the expected type.

ErrorInfo

ErrorInfo is the most general-purpose detail type. It carries a machine-readable reason string (like "QUOTA_EXCEEDED"), a domain string (like "api.myservice.com"), and a metadata map of string key-value pairs. This is the detail type to use when you want clients to switch on error reasons programmatically without parsing the human-readable message. Google's own APIs use ErrorInfo extensively, and it is the recommended first choice for machine-readable error context.

BadRequest

BadRequest carries a list of FieldViolation messages, each containing a field path (using dot notation for nested fields, like "user.address.zip_code") and a description. This lets clients display per-field validation errors in forms or UI elements. It is the correct detail type to pair with INVALID_ARGUMENT.

RetryInfo

RetryInfo contains a single retry_delay field (a google.protobuf.Duration) that tells the client how long to wait before retrying. Pair this with RESOURCE_EXHAUSTED or UNAVAILABLE. A rate-limiting server can say "try again in 30 seconds" instead of leaving the client to guess with exponential backoff.

QuotaFailure

QuotaFailure describes which quota was violated and its limit. This pairs with RESOURCE_EXHAUSTED and provides structured information about which specific quota was exceeded — API call quota, bandwidth quota, storage quota — and what the limits are.

PreconditionFailure

PreconditionFailure describes what preconditions were not met. Each violation has a type, a subject, and a description. Use this with FAILED_PRECONDITION to tell the client exactly what state needs to change before the operation can succeed.

DebugInfo

DebugInfo carries a stack trace and a detail string. This is intended for server-side debugging and should never be sent to external clients in production. Stack traces leak implementation details, internal hostnames, and code structure. Use DebugInfo only in development environments or internal-only services where the caller is trusted.

RequestInfo and ResourceInfo

RequestInfo carries a request_id and a serving_data field, useful for correlating errors with server-side logs. ResourceInfo identifies the resource the error relates to — its type, name, owner, and a description. These are helpful for error reporting dashboards and for clients that need to display which specific resource caused a failure.

Error Propagation Across Services

In a microservices architecture, a single client request can fan out across dozens of backend services. How errors propagate through this chain determines the quality of the error experience for the end user and the debuggability for operators.

Error propagation chain (client deadline: 5s):

  Client --> API Gateway (remaining: 4.8s) --> Service A (remaining: 4.2s)
                                                 |--> Service B: INTERNAL
                                                 |--> Service C: OK
  Service A maps Service B's failure and returns INTERNAL to the gateway.

Deadlines propagate automatically, shrinking at each hop. Error codes should be re-mapped at service boundaries to avoid leaking implementation details.

The Translation Problem

A common antipattern is blindly forwarding errors from downstream services to the caller. If Service B returns NOT_FOUND to Service A, and Service A forwards that directly to the client, the client receives a confusing error about an internal resource it never asked for. Service A should translate the error: if Service B's NOT_FOUND means Service A cannot fulfill the request, the appropriate code might be INTERNAL (the dependency failed) or NOT_FOUND (if Service A's resource genuinely does not exist), depending on the semantics.

Rules for Error Translation

At each service boundary, apply these rules:

- Never forward a downstream error blindly; translate it into a code that is meaningful to your own caller.
- Preserve retryability: if the downstream failure is transient (UNAVAILABLE), return a retryable code; if it is permanent, do not.
- Treat downstream client-error codes (INVALID_ARGUMENT, PERMISSION_DENIED) as your own bugs and return INTERNAL — your caller cannot fix your request to the dependency.
- Keep the original error in your logs, correlated by request ID, rather than leaking internal resource names or topology in the message you return.

Status Code Mapping at Boundaries

Here is a practical mapping for translating downstream errors:

Downstream Code        Your Response (typical)
-------------------------------------------------
INVALID_ARGUMENT   ->  INTERNAL  (your request to downstream was bad = your bug)
NOT_FOUND          ->  INTERNAL or NOT_FOUND  (depends on semantics)
PERMISSION_DENIED  ->  INTERNAL  (your credentials to downstream are wrong)
UNAVAILABLE        ->  UNAVAILABLE  (propagate retryability)
DEADLINE_EXCEEDED  ->  DEADLINE_EXCEEDED  (propagate if your deadline is also expiring)
RESOURCE_EXHAUSTED ->  UNAVAILABLE  (downstream is overloaded)

Retries

gRPC supports several retry mechanisms, each appropriate for different failure modes. Understanding which retries happen automatically and which require explicit configuration is critical to building reliable systems.

gRPC retry strategies:

Transparent retry (automatic, no configuration)
  Triggers: connection failures; GOAWAY before headers; stream RST before data.
  Constraint: only if the server has not seen the request bytes.
  Safety: always safe — the server never processed the request.

Configured retry (service config policy)
  Config: maxAttempts, retryableStatusCodes, initialBackoff, maxBackoff, backoffMultiplier.
  Wait between attempts: random(0, min(initialBackoff * multiplier^(n-1), maxBackoff)).

Hedged requests (parallel speculative)
  Config: maxAttempts, hedgingDelay, nonFatalStatusCodes.
  Behavior: send parallel requests after each hedgingDelay; use the first
  success and cancel the rest. Requires idempotent operations.

Transparent Retries

gRPC performs transparent retries automatically for connection-level failures where the server never received the request. These include TCP connection failures, HTTP/2 GOAWAY frames received before the request headers were sent, and stream resets before request data was transmitted. Because the server never processed the request, these retries are always safe — even for non-idempotent operations. You cannot disable transparent retries; they are part of the gRPC transport layer.

Configurable Retry Policy

For application-level errors (like UNAVAILABLE or ABORTED), you configure a retry policy in the gRPC service config. The policy specifies which status codes trigger a retry, the maximum number of attempts, and the backoff parameters. The backoff between retries uses a jittered exponential formula: a random value between 0 and min(initialBackoff * backoffMultiplier^(attempt-1), maxBackoff).

{
  "methodConfig": [{
    "name": [{"service": "my.package.MyService"}],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "10s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["UNAVAILABLE", "ABORTED"]
    }
  }]
}

A critical constraint: retry policies and hedging policies are mutually exclusive per method. You can use one or the other, not both. Also, the total number of retry attempts across all outstanding RPCs on a channel is bounded by a configurable throttle to prevent retry storms from overwhelming a recovering server.

Hedged Requests

Hedging is a latency-reduction technique where the client proactively sends multiple copies of the same request. After a configurable delay, if the first attempt has not returned, the client sends a second copy to a different backend. It uses whichever response arrives first and cancels the rest. This is powerful for latency-sensitive applications where the tail latency of a single request is high due to backend variance.

Hedging requires that the operation is idempotent — sending the same request multiple times must produce the same effect as sending it once. This makes it suitable for read operations and idempotent writes, but inappropriate for operations like "increment counter" or "transfer funds."

Retry Throttling

To prevent retry storms — where many clients retry simultaneously and overwhelm a recovering server — gRPC implements a token-based throttle. Each channel maintains a token count. Successful RPCs add tokens; retries consume them. When the token count drops below a threshold, retries are suppressed. This creates a natural backpressure mechanism: when the server is healthy, retries are allowed; when it is struggling, retries are automatically suppressed.

Deadlines and Timeout Propagation

Deadlines are the single most important reliability mechanism in gRPC. Every RPC should have a deadline. Without one, a failed request can hang indefinitely, tying up resources on the client, every intermediate service, and the final backend. A deadline says: "this entire operation must complete within N seconds, or fail."

How Deadlines Propagate

When a client sets a deadline, gRPC transmits it as a grpc-timeout header on the wire. Each service in the call chain receives the remaining deadline, not the original one. If the client sets a 5-second deadline and the first service takes 800ms to process before calling the next service, the next service receives a deadline of approximately 4.2 seconds. This automatic propagation ensures that the entire call chain is bounded by the original client's deadline.

// Go example: setting a deadline
ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()

// The deadline propagates automatically to downstream calls
resp, err := serviceA.DoSomething(ctx, req)
// If serviceA calls serviceB, serviceB's context carries the remaining deadline

Deadline Budgeting

In deep call chains, deadline propagation creates a "budget" that shrinks at each hop. A service receiving a request with only 200ms of remaining deadline must decide quickly whether it can complete the operation. If it needs to make multiple downstream calls, it may not have enough budget. Best practices:

- Check the remaining deadline before starting expensive work, and fail fast with DEADLINE_EXCEEDED if the budget clearly cannot cover it.
- Reserve a slice of the budget for your own processing and response serialization instead of passing the full remainder downstream.
- Monitor deadline utilization per hop so you can see which services consume the budget and tune entry-point deadlines accordingly.

The "No Deadline" Trap

If no deadline is set, the RPC has no timeout — it will wait forever (or until the connection drops). In production, this causes cascading failures: a slow backend causes its callers to accumulate blocked threads, which causes their callers to accumulate blocked threads, and so on up the call chain. Always set deadlines. If a particular method is expected to be slow (batch processing, long-running operations), use a long deadline, but always set one.

Circuit Breaking Patterns

Circuit breaking prevents a service from repeatedly calling a failing dependency, giving the dependency time to recover while the caller fails fast instead of accumulating timeouts.

Circuit breaker state machine:

  CLOSED (requests pass through)
      -- failure threshold exceeded --> OPEN
  OPEN (fail fast immediately)
      -- timeout expires --> HALF-OPEN
  HALF-OPEN (probe with limited traffic)
      -- probe succeeds --> CLOSED
      -- probe fails --> OPEN

The circuit breaker pattern uses three states. In the Closed state, all requests pass through normally. The circuit breaker tracks the error rate. When the error rate exceeds a threshold (e.g., 50% of requests fail within a 10-second window), the circuit opens. In the Open state, all requests fail immediately with UNAVAILABLE without hitting the backend. After a cooldown period, the circuit moves to Half-Open, allowing a limited number of probe requests through. If the probes succeed, the circuit closes again. If they fail, it reopens.

gRPC does not include a built-in circuit breaker, but the pattern integrates naturally. Circuit breakers are typically implemented as client-side interceptors (middleware). The interceptor checks the circuit state before making the RPC and records the outcome after. In service mesh architectures, the circuit breaker runs in the sidecar proxy (Envoy, Linkerd) rather than in application code.

Circuit Breaking with gRPC Health Checking

gRPC defines a standard health-checking protocol (grpc.health.v1.Health) that load balancers use to determine whether a server instance is healthy. When a health check fails, the load balancer removes the instance from the pool — acting as a per-instance circuit breaker. This is simpler than application-level circuit breaking and handles the most common case: individual server failures.

Outlier Detection

Service meshes like Envoy extend the circuit breaker concept with outlier detection, which ejects individual endpoints from the load balancing pool when they exhibit higher error rates than their peers. This is more granular than circuit breaking: instead of stopping all traffic to a service, it stops traffic only to the specific instances that are failing.

Error Handling in Streaming RPCs

Streaming RPCs introduce error handling complexities that do not exist in unary (request-response) calls. A stream is a long-lived connection where errors can occur at the beginning, middle, or end of data transmission, and where partial data may have already been delivered.

Server Streaming Errors

In a server-streaming RPC, the client sends one request and receives a stream of responses. An error can occur at any point in the stream. The server signals the error by setting the status on the stream trailer. The client sees it as an error on the next Recv() call after all buffered messages have been consumed.

The challenge: if the server sends 95 messages successfully and then fails on the 96th, the client has already processed 95 messages. The error handling must account for partial results. Options include:

- Making message processing idempotent so the client can safely reprocess messages after reconnecting.
- Carrying sequence numbers or resume tokens in the stream so a new stream can continue from the last message received.
- Buffering results client-side and committing them only when the stream terminates with OK, if partial results are unacceptable.

Client Streaming Errors

In a client-streaming RPC, the client sends a stream of messages and receives a single response. If the server encounters a problem with one of the messages (e.g., a validation failure), it can close the stream with an error status. The client discovers the error when it calls CloseAndRecv() or on the next Send() call (which may return io.EOF, indicating that the server closed the stream).

A subtlety: when the server returns an error mid-stream, the client may still have unsent messages buffered. The client must handle the error even if Send() has not returned an error yet — check for errors on CloseAndRecv().

Bidirectional Streaming Errors

Bidirectional streams are the most complex case. Both sides are sending and receiving concurrently. An error can originate from either side at any time. The error terminates the entire stream — there is no way to recover a failed bidirectional stream without establishing a new one.

For bidirectional streams, implement application-level error messages within your proto message types. Define a oneof that can be either a data message or an error message:

message StreamResponse {
  oneof payload {
    DataChunk data = 1;
    StreamError error = 2;
  }
}

message StreamError {
  string code = 1;
  string message = 2;
  bool retryable = 3;
}

This allows the server to communicate per-message errors without terminating the stream. The gRPC status is reserved for stream-level failures.
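On the consuming side, handling the oneof looks like the following sketch. The structs are hand-written stand-ins for the protoc-generated Go types (real generated code would expose GetData()/GetError() accessors on the oneof instead of bare pointer fields):

```go
package main

import "fmt"

// Stand-ins for the protoc-generated types from the proto above.
type DataChunk struct{ Payload []byte }

type StreamError struct {
	Code      string
	Message   string
	Retryable bool
}

type StreamResponse struct {
	// Exactly one of these is set, mirroring the proto oneof.
	Data *DataChunk
	Err  *StreamError
}

// handleResponse routes one stream message: data is processed, while a
// per-message error is surfaced without tearing down the stream.
func handleResponse(r *StreamResponse) (string, error) {
	switch {
	case r.Data != nil:
		return fmt.Sprintf("chunk of %d bytes", len(r.Data.Payload)), nil
	case r.Err != nil:
		return "", fmt.Errorf("in-stream error %s (retryable=%t): %s",
			r.Err.Code, r.Err.Retryable, r.Err.Message)
	default:
		return "", fmt.Errorf("empty oneof payload")
	}
}

func main() {
	out, _ := handleResponse(&StreamResponse{Data: &DataChunk{Payload: []byte("abc")}})
	fmt.Println(out)
	_, err := handleResponse(&StreamResponse{Err: &StreamError{
		Code: "PARSE_FAILED", Message: "record 17 malformed", Retryable: true,
	}})
	fmt.Println(err) // the stream itself stays open
}
```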

Retry Limitations with Streams

gRPC's built-in retry policy does not support streaming RPCs. The retry policy only applies to unary calls. For streaming RPCs, you must implement retry logic at the application level. This typically means detecting the stream termination, creating a new stream, and resuming from the last known position. Design your streaming protocols with resumability in mind from the start.

gRPC Status Codes vs HTTP Status Codes

gRPC status codes and HTTP status codes serve similar purposes but have fundamental design differences. Understanding the mapping between them is important when building gRPC-Web applications, HTTP/JSON transcoding gateways, or when debugging with browser developer tools.

gRPC to HTTP status code mapping:

gRPC Code            HTTP  Notes
---------------------------------------------------------------
OK                   200   Exact mapping
INVALID_ARGUMENT     400   Exact mapping
UNAUTHENTICATED      401   Exact mapping
PERMISSION_DENIED    403   Exact mapping
NOT_FOUND            404   Exact mapping
ALREADY_EXISTS       409   Conflict
ABORTED              409   Same HTTP code, different gRPC semantics
RESOURCE_EXHAUSTED   429   Too Many Requests
CANCELLED            499   Client Closed Request (non-standard)
INTERNAL / UNKNOWN   500   Server-side errors collapse to 500
UNIMPLEMENTED        501   Exact mapping
UNAVAILABLE          503   Service Unavailable
DEADLINE_EXCEEDED    504   Gateway Timeout

DATA_LOSS, FAILED_PRECONDITION, and OUT_OF_RANGE map to varying HTTP codes depending on context.

The mapping is not one-to-one. HTTP has dozens of status codes accumulated over decades (418 I'm a Teapot, 451 Unavailable for Legal Reasons, etc.), while gRPC intentionally limits itself to 17 with clearly defined semantics. Some gRPC codes map to the same HTTP code: ALREADY_EXISTS and ABORTED both map to HTTP 409. Going the other direction, HTTP 400 could be INVALID_ARGUMENT, FAILED_PRECONDITION, or OUT_OF_RANGE depending on the specific error.

The key difference in philosophy: HTTP status codes are transport-level indicators that proxies and caches act on. gRPC status codes are application-level indicators that encode retry semantics and actionability. A gRPC UNAVAILABLE (HTTP 503) explicitly means "retry this." An HTTP 503 might or might not be retryable depending on context that the status code does not capture.

gRPC-Web and Transcoding Considerations

When using gRPC-Web or HTTP/JSON transcoding (like Envoy's gRPC-JSON transcoder or Google Cloud Endpoints), the gRPC status code is translated to an HTTP status code in the response. The error details are serialized as a JSON body. Clients consuming the transcoded API see HTTP semantics, but the error payloads still follow the gRPC error model. Be aware that information can be lost in translation: the nuance between FAILED_PRECONDITION and INVALID_ARGUMENT is invisible in the HTTP 400 response code.

Best Practices for Error Design

Designing a consistent error strategy across a microservices platform requires discipline and clear conventions. These practices are drawn from Google's API design guide, which was developed from experience running gRPC at scale across thousands of services.

1. Use the Most Specific Status Code

Do not default to INTERNAL or UNKNOWN for every server error. Each status code carries semantic meaning that clients depend on. UNAVAILABLE tells clients to retry. INVALID_ARGUMENT tells clients to fix the input. FAILED_PRECONDITION tells clients to change the system state. Using the wrong code causes clients to take the wrong action: retrying a request that will never succeed, or giving up on a request that would succeed after a brief wait.

2. Error Messages Are for Developers, Not End Users

The message string in a gRPC status is for debugging. It should be technical, specific, and actionable for a developer reading logs. It should not be displayed to end users. Use LocalizedMessage in error details if you need user-facing error text. This separation ensures that error messages can include technical details (field names, internal identifiers) without worrying about user experience.

3. Use ErrorInfo for Machine-Readable Errors

Clients often need to branch on specific error conditions. Parsing error message strings is fragile — messages change across versions and are not part of the API contract. Instead, use ErrorInfo with a stable reason enum and domain. Clients can switch on the reason string reliably:

// ErrorInfo detail
reason: "DAILY_LIMIT_EXCEEDED"
domain: "api.myservice.com"
metadata: {
  "limit": "1000",
  "reset_time": "2025-01-15T00:00:00Z"
}

4. Include Retry Information

When returning UNAVAILABLE or RESOURCE_EXHAUSTED, include a RetryInfo detail with a concrete retry delay. This prevents clients from guessing with arbitrary backoffs and lets you control the retry behavior of your callers from the server side. If you know the rate limit resets in 30 seconds, tell the client.

5. Do Not Expose Internal Details to External Clients

Stack traces, internal hostnames, database query strings, and downstream service names should never appear in errors returned to external clients. Log them server-side, include them in DebugInfo for internal services, and include a request ID in RequestInfo so support teams can correlate client-reported errors with server-side logs.

6. Design Idempotent APIs to Simplify Error Handling

When a client receives DEADLINE_EXCEEDED or UNAVAILABLE, it does not know whether the server processed the request. If the operation is idempotent, the client can safely retry. If not, the client must implement complex at-most-once logic. Design your APIs to be idempotent wherever possible. Use client-generated request IDs to deduplicate retried mutations.
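Server-side deduplication by request ID can be sketched in plain Go. Holding the lock across op keeps the sketch simple; a production server would use per-key locking and an expiring store:

```go
package main

import (
	"fmt"
	"sync"
)

// dedupStore remembers results by client-generated request ID so a retried
// mutation returns the original result instead of executing twice.
type dedupStore struct {
	mu   sync.Mutex
	seen map[string]string
}

func newDedupStore() *dedupStore {
	return &dedupStore{seen: make(map[string]string)}
}

// Do executes op at most once per request ID and caches its result.
func (d *dedupStore) Do(requestID string, op func() string) string {
	d.mu.Lock()
	defer d.mu.Unlock()
	if res, ok := d.seen[requestID]; ok {
		return res // retried mutation: return the original result
	}
	res := op()
	d.seen[requestID] = res
	return res
}

func main() {
	store := newDedupStore()
	executions := 0
	op := func() string { executions++; return "transfer-complete" }

	store.Do("req-123", op) // first attempt
	store.Do("req-123", op) // client retry after DEADLINE_EXCEEDED
	fmt.Println(executions) // the mutation ran only once
}
```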

7. Always Set Deadlines

Every outgoing RPC call should have a deadline. Without deadlines, failed downstream services cause cascading resource exhaustion. Set deadlines at the entry point (API gateway, mobile client) and let them propagate through the call chain. Monitor deadline utilization: if most requests use 10% of their deadline, it is too generous; if many hit the deadline, it is too tight.

8. Test Error Paths

Error paths are code paths. Test that your service returns the correct status code, the correct error details, and that clients handle each error condition correctly. Use fault injection in integration tests to verify that retries work, circuit breakers trip, and cascading failures are contained. The correctness of error handling is at least as important as the correctness of the happy path.

9. Monitor Error Rates by Code

Track the distribution of gRPC status codes in your monitoring system. A sudden spike in DEADLINE_EXCEEDED indicates a latency problem. A spike in UNAVAILABLE indicates infrastructure issues. A spike in PERMISSION_DENIED might indicate a misconfigured deployment. Each status code tells a different story, and monitoring them separately gives you much richer operational visibility than a single "error rate" metric.

10. Document Your Error Contract

Your API documentation should specify which error codes each RPC method can return and under what conditions. This is part of your API contract. A method that can return NOT_FOUND, INVALID_ARGUMENT, and PERMISSION_DENIED should list all three with explanations. Clients should not have to guess or discover error codes through trial and error.

Further Reading

For a deeper understanding of the gRPC framework that underlies this error model, see How gRPC Works, which covers the protocol architecture, HTTP/2 transport, and protobuf serialization. For the specific challenges of error handling in long-lived connections, see gRPC Streaming Patterns, which covers server streaming, client streaming, and bidirectional streaming in detail.
