How API Gateways Work: Routing, Auth, Rate Limiting, and Protocol Translation

An API gateway is a server that acts as the single entry point for all client requests to a set of backend services. It sits between external consumers and internal microservices, handling cross-cutting concerns like authentication, rate limiting, request routing, protocol translation, and observability -- so that individual services do not have to implement these features themselves. API gateways are distinct from load balancers in purpose (though they often incorporate load balancing): where a load balancer distributes traffic across instances of the same service, an API gateway routes requests to different services based on the request path, method, headers, and authentication context. Kong, Amazon API Gateway, Apigee, and cloud-native solutions built on Envoy or NGINX are the dominant implementations. Understanding how API gateways work at the protocol, algorithmic, and architectural level is critical for designing secure, performant API-driven systems.

Request Routing: The Core Function

At its most fundamental, an API gateway is a reverse proxy that routes incoming requests to the appropriate backend service based on configurable rules. The routing decision typically considers:

  - the request path, usually matched by prefix (e.g., /api/v2/users/* to the user service)
  - the HTTP method (GET and POST on the same path may route to different backends)
  - the Host header (for multi-tenant or multi-domain gateways)
  - other headers, such as API version or content type
  - the authentication context (tenant, plan, or role established by the auth stage)

The routing layer must handle path rewriting: the external path /api/v2/users/123 may be rewritten to /users/123 before forwarding to the user service, stripping the /api/v2 prefix that only has meaning at the gateway level. This prefix stripping or path transformation is a universal feature of API gateways.
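The prefix-matching and path-rewriting behavior described above can be sketched in a few lines. This is a minimal illustration, not any real gateway's implementation; the route table and service names are invented for the example:

```python
# Minimal sketch of prefix-based routing with path rewriting.
# Each route: (external prefix to match, backend service, prefix to strip).
ROUTES = [
    ("/api/v2/users", "user-service", "/api/v2"),
    ("/api/v2/orders", "order-service", "/api/v2"),
]

def route(path: str):
    """Return (backend, rewritten_path) for the longest matching prefix."""
    # Longest prefix wins, so /api/v2/users/export can override /api/v2/users.
    for prefix, backend, strip in sorted(ROUTES, key=lambda r: -len(r[0])):
        if path.startswith(prefix):
            return backend, path[len(strip):]  # drop the gateway-only prefix
    return None, path  # no matching route: the gateway returns 404

backend, rewritten = route("/api/v2/users/123")
```

Real gateways compile routes into a trie or radix tree rather than scanning a list, but the matching semantics are the same.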

API Versioning at the Gateway

API gateways are the natural enforcement point for API versioning strategies:

  - URL path versioning (/api/v1/..., /api/v2/...): the gateway routes each version prefix to the corresponding service deployment
  - header versioning (a custom X-API-Version header or a versioned Accept media type): the gateway inspects headers to select the backend
  - query parameter versioning (?version=2): less common, but handled the same way at the routing layer

The gateway can also implement version negotiation: if a client requests v3 but only v2 is deployed, the gateway can return an error with supported versions in the response headers, or it can fall back to v2 with appropriate deprecation warnings.
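The negotiation logic described above is small enough to sketch. The deployed-version set, header names, and fallback policy here are assumptions for illustration, not a standard:

```python
# Hedged sketch of gateway-side version negotiation.
# DEPLOYED and the header names are illustrative assumptions.
DEPLOYED = {"v1", "v2"}
LATEST = "v2"

def negotiate(requested: str):
    """Return (status, version_served, extra_response_headers)."""
    if requested in DEPLOYED:
        return 200, requested, {}
    # Requested version not deployed: fall back to the latest deployed
    # version and warn the client, rather than failing the request.
    return 200, LATEST, {
        "X-API-Version-Served": LATEST,
        "Warning": f'299 - "version {requested} unavailable, served {LATEST}"',
    }
```

A stricter policy would return 404 or 406 with the supported versions listed in a response header; which behavior is right depends on the API's compatibility guarantees.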

Authentication and Authorization

Authentication is the single most common cross-cutting concern handled by API gateways. By centralizing auth at the gateway, individual services do not need to validate credentials, parse tokens, or manage key rotation -- they receive pre-authenticated requests with verified identity information in headers.

API Key Authentication

The simplest auth mechanism: clients include an API key in a header (X-API-Key: sk_live_abc123) or query parameter. The gateway looks up the key in its database, verifies it is valid and not revoked, identifies the associated client/tenant, and attaches the client identity to the request before forwarding. The gateway enforces rate limits, usage quotas, and access policies based on the API key's associated plan.
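The lookup-and-attach flow above is straightforward; a minimal sketch, with an in-memory dict standing in for the gateway's key database (the key record fields are invented for illustration):

```python
# Sketch of API-key authentication: look up the key, reject if revoked,
# attach verified identity headers for downstream services.
KEYS = {
    "sk_live_abc123": {"tenant": "acme", "plan": "pro", "revoked": False},
}

def authenticate(headers: dict):
    key = headers.get("X-API-Key")
    record = KEYS.get(key)
    if record is None or record["revoked"]:
        return None  # gateway responds 401 Unauthorized
    # These headers are trusted downstream only because the gateway
    # strips any client-supplied copies before setting them.
    return {"X-Tenant-Id": record["tenant"], "X-Plan": record["plan"]}
```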

API key auth is appropriate for server-to-server communication where keys can be kept secret. It is inappropriate for browser or mobile clients where the key would be exposed in client-side code.

JWT Validation

For OAuth 2.0 and JWT-based authentication, the gateway validates the token without contacting an external authorization server on every request. The gateway:

  1. Extracts the JWT from the Authorization: Bearer <token> header
  2. Verifies the signature using the issuer's public key (fetched and cached from the JWKS endpoint)
  3. Checks the expiration (exp), not-before (nbf), and issuer (iss) claims
  4. Optionally verifies the audience (aud) claim matches the API's expected audience
  5. Extracts identity claims (user ID, roles, scopes) and passes them to the backend as trusted headers
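The five steps above can be sketched with the standard library. Production gateways verify RS256/ES256 signatures against public keys fetched from the JWKS endpoint; this sketch uses HS256 (a shared secret) purely to stay self-contained:

```python
import base64, hashlib, hmac, json, time

def b64url_decode(seg: str) -> bytes:
    return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))

def validate_jwt(token: str, key: bytes, audience: str) -> dict:
    """Steps 1-5 for an HS256 token; raises ValueError on any failure."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    # Step 2: verify the signature over header.payload
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    now = time.time()
    # Step 3: temporal claims (issuer check elided for brevity)
    if claims.get("exp", 0) < now or claims.get("nbf", 0) > now:
        raise ValueError("token not currently valid")
    # Step 4: audience check
    if claims.get("aud") != audience:
        raise ValueError("wrong audience")
    # Step 5: return identity claims to forward as trusted headers
    return claims

def make_token(claims: dict, key: bytes) -> str:
    """Helper to mint a test token (not something a gateway would do)."""
    enc = lambda b: base64.urlsafe_b64encode(b).rstrip(b"=").decode()
    header = enc(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = enc(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{enc(sig)}"
```

Note that a real validator must also pin the expected algorithm from configuration rather than trusting the token's alg header, to prevent algorithm-confusion attacks.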

JWT validation is stateless: the gateway does not need a database lookup or network call to validate the token (assuming the signing key is cached). This makes it extremely fast -- sub-millisecond per request. The tradeoff is that JWTs cannot be revoked before expiration without maintaining a revocation list (which reintroduces statefulness). Short-lived tokens (5-15 minutes) mitigate this by bounding the window of compromise.

API Gateway Request Pipeline (diagram). A client request carrying Bearer <JWT> flows through four gateway stages before reaching user-service, order-service, or payment-svc:

  1. TLS termination -- decrypt HTTPS and extract SNI for multi-tenant routing; verify the client certificate for mTLS (machine-to-machine APIs)
  2. Authentication -- validate the JWT signature via cached JWKS public keys; check exp, iss, and aud claims; extract user/tenant identity (failure: 401 Unauthorized)
  3. Rate limiting -- token bucket or sliding window per API key, user, or IP; return a Retry-After header on 429 responses (failure: 429 Too Many Requests)
  4. Transform + route -- strip the /api/v2 prefix, add an X-User-Id header, select the backend by path prefix, and load balance

Each stage can short-circuit the pipeline: an auth failure returns 401 before the rate limit is checked.

OAuth 2.0 Token Introspection

For opaque tokens (non-JWT access tokens), the gateway cannot validate the token locally. Instead, it calls the authorization server's token introspection endpoint (RFC 7662) to verify the token and retrieve associated metadata (scopes, client ID, expiration). This adds a network round-trip to every request, so gateways typically cache introspection results with a short TTL (30-60 seconds).
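The short-TTL caching described above can be sketched as a small wrapper. The `introspect` callable stands in for the HTTP call to the RFC 7662 endpoint, and the injectable clock exists only to make the sketch testable:

```python
import time

class IntrospectionCache:
    """Cache opaque-token introspection results with a short TTL.

    `introspect` stands in for the RFC 7662 endpoint call; `now` is an
    injectable clock (defaults to a monotonic clock).
    """
    def __init__(self, introspect, ttl=30.0, now=time.monotonic):
        self.introspect, self.ttl, self.now = introspect, ttl, now
        self._cache = {}  # token -> (expires_at, result)

    def check(self, token: str) -> dict:
        hit = self._cache.get(token)
        if hit and hit[0] > self.now():
            return hit[1]                    # served from cache, no round-trip
        result = self.introspect(token)      # network call to the auth server
        self._cache[token] = (self.now() + self.ttl, result)
        return result
```

The TTL is the window during which a revoked token is still accepted, so the 30-60 second range is a deliberate tradeoff between auth-server load and revocation latency. A production cache would also bound its size and evict entries.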

mTLS for Machine-to-Machine APIs

For service-to-service API access, the gateway can require mutual TLS: the client must present a valid X.509 certificate signed by a trusted CA. The gateway verifies the certificate chain, checks revocation status (CRL or OCSP), and extracts the client identity from the certificate's subject or SAN field. mTLS is the strongest auth mechanism because the private key never leaves the client, but it is complex to manage at scale (certificate provisioning, rotation, revocation). Service meshes automate mTLS management for internal APIs.

Rate Limiting Algorithms

Rate limiting is the mechanism that protects backend services from abuse, prevents individual clients from consuming disproportionate resources, and enforces usage quotas tied to API pricing tiers. The choice of rate limiting algorithm affects accuracy, fairness, memory usage, and behavior at the boundary of the limit.

Token Bucket

The most intuitive algorithm. Each client has a "bucket" with a fixed capacity (e.g., 100 tokens). A token is removed for each request. Tokens are added at a fixed rate (e.g., 10 per second). If the bucket is empty, the request is rejected. This allows bursts up to the bucket capacity while enforcing a sustained rate equal to the refill rate.

Token bucket has two parameters: rate (tokens per second refill) and burst (bucket capacity). A configuration of rate=100/s, burst=200 allows a burst of 200 requests followed by a sustained 100 requests/second. The burst parameter is critical for real-world APIs: clients often send requests in bursts (page loads, batch operations) rather than at a perfectly uniform rate, and rejecting bursts that stay within the sustained rate creates a poor developer experience.
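A minimal token bucket with exactly the two parameters above, using lazy refill (tokens are topped up on each check based on elapsed time, rather than by a background timer). Time is passed in explicitly to keep the sketch deterministic:

```python
class TokenBucket:
    """rate = refill tokens/second, burst = bucket capacity."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst          # bucket starts full
        self.last = 0.0              # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # Lazily refill based on elapsed time, capped at capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With rate=100 and burst=200, an idle client can burst 200 requests at once, then is held to 100 requests/second thereafter, matching the example configuration above.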

Sliding Window Log

Stores a timestamp for each request in the current window. To check if a new request is within the limit, count the number of stored timestamps within the window [now - window_size, now]. This is perfectly accurate but memory-intensive: each request stores a timestamp, so a limit of 10,000 requests/minute requires storing up to 10,000 timestamps per client.
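The sliding window log is short to implement with a deque; the memory cost described above is visible directly as one stored timestamp per admitted request:

```python
from collections import deque

class SlidingWindowLog:
    """Exact but memory-heavy: one timestamp stored per admitted request."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.log = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have fallen outside [now - window, now].
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```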

Sliding Window Counter

A memory-efficient approximation of the sliding window log. Maintains counters for the current and previous fixed windows (e.g., current minute and previous minute). The rate estimate is: previous_count * overlap_fraction + current_count. For example, if the previous minute had 80 requests and we are 30 seconds into the current minute with 40 requests, the estimate is 80 * 0.5 + 40 = 80. This requires only two counters per client per window, regardless of the request rate.

Cloudflare uses this algorithm for their global rate limiting because it combines good accuracy with O(1) memory per client per rule -- essential when enforcing rate limits across millions of API keys.
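A sketch of the two-counter approximation, assuming time starts near zero for simplicity. The estimate formula is exactly the weighted sum described above:

```python
class SlidingWindowCounter:
    """Approximate sliding window using two fixed-window counters."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.current_start, self.current, self.previous = 0.0, 0, 0

    def _roll(self, now: float):
        # Advance to the window containing `now`, shifting counters.
        while now >= self.current_start + self.window:
            self.previous, self.current = self.current, 0
            self.current_start += self.window

    def estimate(self, now: float) -> float:
        self._roll(now)
        # Fraction of the previous window still overlapping the sliding window.
        overlap = 1.0 - (now - self.current_start) / self.window
        return self.previous * overlap + self.current

    def allow(self, now: float) -> bool:
        if self.estimate(now) < self.limit:
            self.current += 1
            return True
        return False
```

Replaying the article's example: 80 requests in the previous minute, 40 requests 30 seconds into the current minute, gives an estimate of 80 * 0.5 + 40 = 80.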

Fixed Window Counter

The simplest algorithm: maintain a counter for each fixed time window (e.g., each calendar minute). Increment on each request, reject when the counter exceeds the limit. The problem is boundary spikes: a client can send limit requests at the end of one window and limit requests at the beginning of the next, achieving 2x the limit in a window-length period spanning the boundary.
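The boundary spike is easy to demonstrate. This sketch keys counters by window index, then sends two bursts one second apart straddling a minute boundary:

```python
class FixedWindow:
    """One counter per fixed window; vulnerable to boundary spikes."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.counts = {}  # window index -> request count

    def allow(self, now: float) -> bool:
        bucket = int(now // self.window)
        if self.counts.get(bucket, 0) < self.limit:
            self.counts[bucket] = self.counts.get(bucket, 0) + 1
            return True
        return False
```

With a limit of 100/minute, 100 requests at t=59.0s and another 100 at t=60.5s are all accepted: 200 requests in 1.5 seconds, double the nominal limit.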

Leaky Bucket (Queue-Based)

Processes requests at a fixed rate, queuing excess requests. Unlike token bucket which allows bursts, leaky bucket enforces a strict output rate. Requests that arrive faster than the drain rate are queued (up to a queue capacity), and excess requests are dropped. This produces a perfectly smooth output rate, which is useful for protecting fragile backends that cannot handle any traffic spikes.
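A queue-based sketch: requests enter a bounded queue, and an explicit drain step dequeues them at the fixed rate. In a real gateway the drain would be driven by a worker forwarding requests to the backend:

```python
from collections import deque

class LeakyBucket:
    """Admit requests to a bounded queue; drain at a fixed rate."""
    def __init__(self, drain_rate: float, capacity: int):
        self.drain_rate, self.capacity = drain_rate, capacity
        self.queue = deque()
        self.last = 0.0

    def _drain(self, now: float):
        # Dequeue as many requests as the fixed rate allows since last drain.
        n = int((now - self.last) * self.drain_rate)
        if n:
            self.last += n / self.drain_rate
            for _ in range(min(n, len(self.queue))):
                self.queue.popleft()   # these requests are forwarded

    def offer(self, now: float, request) -> bool:
        self._drain(now)
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False   # queue full: request dropped
```

The cost of the smooth output rate is queueing delay: a request admitted to a full-ish queue waits behind everything ahead of it, so leaky bucket adds latency where token bucket adds burstiness.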

Distributed Rate Limiting

When the API gateway runs as multiple instances (scaled horizontally behind a load balancer), rate limiting must be coordinated across instances. Two approaches:

  - Centralized counters: every gateway instance increments shared counters in a low-latency store such as Redis. Limits are enforced exactly, at the cost of a network round-trip (and a new failure mode) on each request.
  - Local counters: each instance enforces an approximate share of the limit locally (e.g., the limit divided by the instance count, or local counters periodically synchronized). This adds no per-request latency, but the global limit is only approximately enforced.

AWS API Gateway uses a centralized approach with token bucket rate limiting. Kong supports both Redis-backed (exact) and local (approximate) rate limiting via its rate-limiting plugin.
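The centralized pattern typically reduces to an atomic increment-with-expiry against the shared store. A hedged sketch, with an in-process class standing in for Redis (in production the increment and expiry would run atomically server-side, e.g., via a Lua script or `INCR` plus `EXPIRE`):

```python
class SharedStore:
    """Stand-in for Redis: atomic INCR with expiry (single-process here)."""
    def __init__(self):
        self.data = {}  # key -> (count, expires_at)

    def incr_with_ttl(self, key: str, ttl: float, now: float) -> int:
        count, exp = self.data.get(key, (0, now + ttl))
        if now >= exp:
            count, exp = 0, now + ttl  # window expired: reset the counter
        count += 1
        self.data[key] = (count, exp)
        return count

def allow(store: SharedStore, api_key: str, limit: int,
          window: float, now: float) -> bool:
    # Every gateway instance increments the same shared counter,
    # so the limit holds across the whole fleet.
    bucket = int(now // window)
    return store.incr_with_ttl(f"rl:{api_key}:{bucket}", window, now) <= limit
```

This sketch uses fixed windows for simplicity; the sliding window counter composes with the same shared-store pattern by keeping two keys per client.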

Request and Response Transformation

API gateways transform requests and responses to decouple the external API contract from the internal service implementation.

Header Manipulation

Common header transformations include: adding identity headers (X-User-Id, X-Tenant-Id, X-Request-Id) that the auth stage populated, removing internal headers that should not leak to clients (server version, debug headers), adding security headers (X-Content-Type-Options, Strict-Transport-Security), and adding or modifying CORS headers for browser-based API consumers.
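A sketch of the response-side transformations above (the strip list and header values are illustrative choices, not a standard policy):

```python
# Illustrative response-header pipeline: strip internal headers,
# add security headers, pass everything else through.
STRIP = {"server", "x-debug-trace"}          # never leak to clients
SECURITY = {
    "X-Content-Type-Options": "nosniff",
    "Strict-Transport-Security": "max-age=31536000",
}

def transform_response_headers(upstream: dict) -> dict:
    out = {k: v for k, v in upstream.items() if k.lower() not in STRIP}
    out.update(SECURITY)
    return out
```

The request-side equivalent does the inverse: it strips any client-supplied identity headers (X-User-Id, X-Tenant-Id) before the auth stage sets them, so downstream services can trust those values unconditionally.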

Request/Response Body Transformation

Some gateways support transforming request and response bodies: converting XML payloads to JSON, restructuring JSON response shapes, filtering out sensitive fields from responses (e.g., removing internal IDs or PII from external-facing API responses), and aggregating responses from multiple backend services into a single response (API composition or BFF -- Backend for Frontend pattern).

Body transformation is expensive (requires buffering, parsing, and re-serializing the body) and should be used sparingly. If you find yourself doing heavy body transformation at the gateway, it often indicates a need for a dedicated BFF service rather than gateway-level transformation.

Protocol Translation

API gateways can translate between protocols: exposing a REST/JSON API to external clients while communicating with backend services via gRPC (gRPC-JSON transcoding), or accepting GraphQL queries and decomposing them into REST calls to multiple backend services. Envoy supports gRPC-JSON transcoding natively via its grpc_json_transcoder filter, allowing clients to call gRPC services using REST conventions.

API Gateway Architectures

API gateways can be deployed in several architectural patterns, each with different tradeoffs.

Edge Gateway (North-South)

The most common pattern: the gateway sits at the network edge, handling all external client traffic. It terminates TLS, authenticates requests, enforces rate limits, and routes to internal services. External clients never communicate directly with backend services. This is the pattern used by Kong, Amazon API Gateway, and Apigee.

Internal Gateway (East-West)

A gateway for service-to-service communication within the internal network. This is less common because service meshes handle most east-west concerns (mTLS, load balancing, observability). However, internal gateways are useful for: API composition (aggregating multiple service responses), protocol translation (REST-to-gRPC), and enforcing organizational API standards across teams.

Gateway per Domain (BFF Pattern)

A separate gateway for each client type: mobile gateway, web gateway, partner gateway. Each gateway is tailored to its consumer's needs -- the mobile gateway may aggregate data to reduce round trips, the web gateway may support WebSocket for real-time features, and the partner gateway may enforce stricter rate limits and different auth requirements. This pattern is called "Backend for Frontend" (BFF) and is common in large organizations with diverse API consumers.

API Gateway Deployment Patterns (diagram). Three topologies:

  - Gateway (single edge): mobile, web, and partner clients all share one gateway in front of svc-a, svc-b, and svc-c.
  - BFF (Backend for Frontend): each client type gets a tailored gateway -- Mobile GW, Web GW, Partner GW -- in front of the services.
  - Gateway + service mesh: an edge gateway handles north-south auth, rate limiting, and transformation, while the service mesh handles east-west mTLS, load balancing, and observability among svc-a, svc-b, and svc-c.

API Gateway Implementations:

  Gateway        | Type                  | Proxy                | Best For
  Kong           | OSS / Enterprise      | NGINX / Kong Gateway | Plugin ecosystem, K8s native
  AWS API GW     | Managed               | Proprietary          | Serverless, Lambda integration
  Envoy Gateway  | OSS (K8s Gateway API) | Envoy                | K8s-native, xDS, gRPC
  Apigee         | Managed (Google)      | Proprietary          | Enterprise API management
  Traefik        | OSS / Enterprise      | Go (native)          | Auto-discovery, Let's Encrypt

API Gateway vs Service Mesh vs Load Balancer

These three technologies overlap but serve different purposes:

  - Load balancer: distributes traffic across instances of the same service (L4 or L7); it has no knowledge of API semantics, auth, or quotas.
  - API gateway: routes external (north-south) traffic to different services based on path, method, and identity; enforces auth, rate limits, and the external API contract.
  - Service mesh: manages internal (east-west) service-to-service traffic via sidecar or node proxies; provides mTLS, retries, load balancing, and observability transparently.

In a complete architecture, all three coexist: the API gateway handles external authentication and rate limiting and routes requests to internal services; the service mesh handles inter-service mTLS and observability; and load balancers (both within the mesh and at the infrastructure layer) distribute traffic across service instances.

Observability at the API Gateway

API gateways are uniquely positioned for observability: they see every external request and can correlate traffic patterns, error rates, and latency across all backend services.

Key metrics exposed by API gateways:

  - request rate, broken down by route, backend, API key, and tenant
  - error rate by status class (4xx client errors vs 5xx backend errors)
  - latency percentiles (p50/p95/p99) per route and per upstream service
  - rate limit rejections (429s) per client, signaling quota pressure or abuse
  - auth failures (401/403), a leading indicator of misconfigured clients or attacks

API gateways also generate the initial trace context (the X-Request-Id or W3C traceparent header) that propagates through all downstream services, enabling end-to-end distributed tracing for every external request.
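Generating the W3C traceparent header mentioned above is a two-random-ID operation. A minimal sketch following the Trace Context format (version, 16-byte trace ID, 8-byte parent/span ID, flags):

```python
import secrets

def make_traceparent() -> str:
    """W3C trace context header: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{parent_id}-01"   # flags 01 = sampled
```

The gateway sets this header only when the client did not supply one; when a traceparent is already present, the gateway propagates the trace ID and generates a new parent (span) ID for its own hop.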

Security Considerations

As the entry point for all external traffic, the API gateway is a critical security boundary. Common security features include:

  - TLS termination with modern cipher policies, and mTLS for machine-to-machine clients
  - request validation: body size limits, content-type checks, and JSON/OpenAPI schema validation to reject malformed input before it reaches backends
  - IP allowlists/denylists and geo-based access controls
  - WAF integration to filter injection attempts and other common attack patterns
  - header hygiene: stripping internal headers from responses, and preventing clients from spoofing trusted identity headers (X-User-Id must be set only by the gateway)

API Gateways and Network Infrastructure

API gateways sit at the intersection of application architecture and network infrastructure. They are typically deployed behind a load balancer (or cloud load balancer) that provides L4 distribution across gateway instances, TLS passthrough or termination, and DNS-based GSLB across regions.

For globally distributed APIs, the API gateway is replicated across multiple regions, with DNS or anycast routing directing clients to the nearest gateway instance. Each regional gateway instance connects to both local backend services and (for some requests) cross-region services. The BGP routing topology between regions determines the latency of cross-region API calls, making the AS path between your cloud provider's regions a relevant factor in API architecture design.

Explore the routing infrastructure behind your API gateways and their backend services with the god.ad BGP Looking Glass. Look up the IP addresses of your gateway endpoints to see the BGP routes, origin ASNs, and network paths that determine how client traffic reaches your APIs.
