How API Gateways Work: Routing, Auth, Rate Limiting, and Protocol Translation
An API gateway is a server that acts as the single entry point for all client requests to a set of backend services. It sits between external consumers and internal microservices, handling cross-cutting concerns like authentication, rate limiting, request routing, protocol translation, and observability -- so that individual services do not have to implement these features themselves. API gateways are distinct from load balancers in purpose (though they often incorporate load balancing): where a load balancer distributes traffic across instances of the same service, an API gateway routes requests to different services based on the request path, method, headers, and authentication context. Kong, Amazon API Gateway, Apigee, and cloud-native solutions built on Envoy or NGINX are the dominant implementations. Understanding how API gateways work at the protocol, algorithmic, and architectural level is critical for designing secure, performant API-driven systems.
Request Routing: The Core Function
At its most fundamental, an API gateway is a reverse proxy that routes incoming requests to the appropriate backend service based on configurable rules. The routing decision typically considers:
- Path prefix -- /api/v2/users/* routes to the user service, /api/v2/orders/* to the order service
- HTTP method -- GET requests to /products route to the read-optimized query service, POST requests route to the write service
- Host header -- api.example.com routes to production, api-staging.example.com routes to staging
- Headers -- requests with X-API-Version: 2 route to the v2 backend, others to v1
- Query parameters -- requests with ?debug=true route to a debug-enabled backend
The routing layer must handle path rewriting: the external path /api/v2/users/123 may be rewritten to /users/123 before forwarding to the user service, stripping the /api/v2 prefix that only has meaning at the gateway level. This prefix stripping or path transformation is a universal feature of API gateways.
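Prefix matching and path rewriting can be sketched in a few lines. This is an illustrative model, not any particular gateway's implementation; the routing table, service names, and stripped prefix are assumptions for the example.

```python
# Minimal sketch of prefix-based routing with prefix stripping. The routing
# table, service names, and the stripped prefix are illustrative.

ROUTES = [
    # (match prefix, upstream service, prefix to strip before forwarding)
    ("/api/v2/users",  "user-service",  "/api/v2"),
    ("/api/v2/orders", "order-service", "/api/v2"),
]

def route(path: str):
    """Return (upstream, rewritten_path) for the longest matching prefix."""
    best = None
    for prefix, upstream, strip in ROUTES:
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, upstream, strip)
    if best is None:
        return None  # no matching route: the gateway returns 404
    prefix, upstream, strip = best
    rewritten = path[len(strip):] if path.startswith(strip) else path
    return upstream, rewritten or "/"
```

With this table, /api/v2/users/123 is forwarded to user-service as /users/123, matching the rewrite described above.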
API Versioning at the Gateway
API gateways are the natural enforcement point for API versioning strategies:
- URL path versioning (/v1/users, /v2/users) -- The gateway routes each version to a different backend service or a different version of the same service. This is the most common approach because it is explicit and easy to implement at the routing layer.
- Header-based versioning (Accept: application/vnd.api.v2+json or X-API-Version: 2) -- The gateway inspects custom headers and routes accordingly. Cleaner URLs, but requires clients to set headers correctly.
- Query parameter versioning (/users?version=2) -- Least common, but supported by most gateways.
The gateway can also implement version negotiation: if a client requests v3 but only v2 is deployed, the gateway can return an error with supported versions in the response headers, or it can fall back to v2 with appropriate deprecation warnings.
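The reject-with-supported-versions branch of that negotiation can be sketched as follows; the header name and the deployed version set are illustrative assumptions.

```python
# Sketch of gateway-side version negotiation: a version that is not
# deployed is rejected with the supported set in a response header.
# The X-Supported-Versions header name is an assumption for the example.

DEPLOYED_VERSIONS = ("1", "2")

def negotiate(requested: str):
    """Return (status_code, extra_headers) for a requested API version."""
    if requested in DEPLOYED_VERSIONS:
        return 200, {}
    # Unknown version: tell the client what the gateway can serve.
    return 406, {"X-Supported-Versions": ", ".join(DEPLOYED_VERSIONS)}
```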
Authentication and Authorization
Authentication is the single most common cross-cutting concern handled by API gateways. By centralizing auth at the gateway, individual services do not need to validate credentials, parse tokens, or manage key rotation -- they receive pre-authenticated requests with verified identity information in headers.
API Key Authentication
The simplest auth mechanism: clients include an API key in a header (X-API-Key: sk_live_abc123) or query parameter. The gateway looks up the key in its database, verifies it is valid and not revoked, identifies the associated client/tenant, and attaches the client identity to the request before forwarding. The gateway enforces rate limits, usage quotas, and access policies based on the API key's associated plan.
API key auth is appropriate for server-to-server communication where keys can be kept secret. It is inappropriate for browser or mobile clients where the key would be exposed in client-side code.
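The lookup-and-attach flow can be sketched as below. The in-memory key store, plan names, and identity header names are assumptions for the example; a real gateway backs this with a database or cache.

```python
# Illustrative API key check: look up the key, reject if missing or
# revoked, and attach the verified identity as trusted headers for the
# backend. Store contents and header names are assumptions.

API_KEYS = {
    "sk_live_abc123": {"tenant": "acme", "plan": "pro", "revoked": False},
}

def authenticate(headers: dict):
    """Return forwarded headers with identity attached, or None (401)."""
    record = API_KEYS.get(headers.get("X-API-Key", ""))
    if record is None or record["revoked"]:
        return None                      # gateway responds 401 Unauthorized
    forwarded = dict(headers)
    forwarded.pop("X-API-Key", None)     # do not leak the key upstream
    forwarded["X-Tenant-Id"] = record["tenant"]
    forwarded["X-Plan"] = record["plan"]
    return forwarded
```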
JWT Validation
For OAuth 2.0 and JWT-based authentication, the gateway validates the token without contacting an external authorization server on every request. The gateway:
- Extracts the JWT from the Authorization: Bearer <token> header
- Verifies the signature using the issuer's public key (fetched and cached from the JWKS endpoint)
- Checks the expiration (exp), not-before (nbf), and issuer (iss) claims
- Optionally verifies the audience (aud) claim matches the API's expected audience
- Extracts identity claims (user ID, roles, scopes) and passes them to the backend as trusted headers
JWT validation is stateless: the gateway does not need a database lookup or network call to validate the token (assuming the signing key is cached). This makes it extremely fast -- sub-millisecond per request. The tradeoff is that JWTs cannot be revoked before expiration without maintaining a revocation list (which reintroduces statefulness). Short-lived tokens (5-15 minutes) mitigate this by bounding the window of compromise.
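The claim checks above can be sketched with only the standard library. Note the hedge: this uses HS256 (a shared secret) so the example is self-contained; a production gateway verifies an RS256/ES256 signature against a public key cached from the issuer's JWKS endpoint instead.

```python
# Sketch of stateless JWT validation: signature, exp, nbf, iss, and aud
# checks. HS256 is used here for a self-contained example; real gateways
# verify asymmetric signatures against cached JWKS keys.
import base64
import hashlib
import hmac
import json
import time

def _b64url_decode(seg: str) -> bytes:
    return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))

def validate_jwt(token: str, secret: bytes, issuer: str, audience: str):
    """Return the claims dict if the token is valid, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        return None                          # malformed token
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        return None                          # signature mismatch
    claims = json.loads(_b64url_decode(payload_b64))
    now = time.time()
    if claims.get("exp", 0) < now:           # expired (missing exp rejected)
        return None
    if claims.get("nbf", 0) > now:           # not yet valid
        return None
    if claims.get("iss") != issuer or claims.get("aud") != audience:
        return None
    return claims                            # identity claims for trusted headers
```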
OAuth 2.0 Token Introspection
For opaque tokens (non-JWT access tokens), the gateway cannot validate the token locally. Instead, it calls the authorization server's token introspection endpoint (RFC 7662) to verify the token and retrieve associated metadata (scopes, client ID, expiration). This adds a network round-trip to every request, so gateways typically cache introspection results with a short TTL (30-60 seconds).
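The short-TTL caching of introspection results can be sketched as below; `introspect` stands in for the HTTPS call to the authorization server's RFC 7662 endpoint, and the clock is passed in to keep the example deterministic.

```python
# Sketch of caching token introspection results with a short TTL, so most
# requests avoid the round-trip to the authorization server.

class IntrospectionCache:
    def __init__(self, introspect, ttl: float = 30.0):
        self._introspect = introspect   # callable: token -> metadata dict
        self._ttl = ttl
        self._cache = {}                # token -> (expires_at, metadata)

    def check(self, token: str, now: float):
        hit = self._cache.get(token)
        if hit and hit[0] > now:
            return hit[1]               # cache hit: no network round-trip
        meta = self._introspect(token)  # network call on miss or expiry
        self._cache[token] = (now + self._ttl, meta)
        return meta
```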
mTLS for Machine-to-Machine APIs
For service-to-service API access, the gateway can require mutual TLS: the client must present a valid X.509 certificate signed by a trusted CA. The gateway verifies the certificate chain, checks revocation status (CRL or OCSP), and extracts the client identity from the certificate's subject or SAN field. mTLS is the strongest auth mechanism because the private key never leaves the client, but it is complex to manage at scale (certificate provisioning, rotation, revocation). Service meshes automate mTLS management for internal APIs.
Rate Limiting Algorithms
Rate limiting is the mechanism that protects backend services from abuse, prevents individual clients from consuming disproportionate resources, and enforces usage quotas tied to API pricing tiers. The choice of rate limiting algorithm affects accuracy, fairness, memory usage, and behavior at the boundary of the limit.
Token Bucket
The most intuitive algorithm. Each client has a "bucket" with a fixed capacity (e.g., 100 tokens). A token is removed for each request. Tokens are added at a fixed rate (e.g., 10 per second). If the bucket is empty, the request is rejected. This allows bursts up to the bucket capacity while enforcing a sustained rate equal to the refill rate.
Token bucket has two parameters: rate (tokens per second refill) and burst (bucket capacity). A configuration of rate=100/s, burst=200 allows a burst of 200 requests followed by a sustained 100 requests/second. The burst parameter is critical for real-world APIs: clients often send requests in bursts (page loads, batch operations) rather than at a perfectly uniform rate, and rejecting bursts that stay within the sustained rate creates a poor developer experience.
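The rate/burst behavior above can be sketched directly; time is injected as an argument so the refill math is deterministic and testable.

```python
# Token bucket with the two parameters described above: refill rate and
# burst (bucket capacity). The bucket starts full, allowing an initial burst.

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens added per second
        self.burst = burst        # maximum bucket capacity
        self.tokens = burst       # start full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1      # spend one token for this request
            return True
        return False              # bucket empty: reject
```

With rate=100 and burst=200, a burst of 200 requests succeeds immediately, and thereafter roughly 100 requests/second are sustained, matching the configuration example above.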
Sliding Window Log
Stores a timestamp for each request in the current window. To check if a new request is within the limit, count the number of stored timestamps within the window [now - window_size, now]. This is perfectly accurate but memory-intensive: each request stores a timestamp, so a limit of 10,000 requests/minute requires storing up to 10,000 timestamps per client.
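A minimal sliding window log looks like this; the deque holds one timestamp per accepted request, which is exactly the memory cost described above.

```python
# Sliding window log: exact, but stores one timestamp per accepted request.
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()        # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Evict timestamps that fell out of [now - window, now].
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```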
Sliding Window Counter
A memory-efficient approximation of the sliding window log. Maintains counters for the current and previous fixed windows (e.g., current minute and previous minute). The rate estimate is: previous_count * overlap_fraction + current_count. For example, if the previous minute had 80 requests and we are 30 seconds into the current minute with 40 requests, the estimate is 80 * 0.5 + 40 = 80. This requires only two counters per client per window, regardless of the request rate.
Cloudflare uses this algorithm for their global rate limiting because it combines good accuracy with O(1) memory per client per rule -- essential when enforcing rate limits across millions of API keys.
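The two-counter estimate can be sketched as below, and the worked example above (80 requests in the previous minute, 40 requests 30 seconds into the current one) falls out of the formula. The state layout is illustrative.

```python
# Sliding window counter: two counters per client approximate the sliding
# window log with O(1) memory.

class SlidingWindowCounter:
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_start = 0.0   # start time of the current fixed window
        self.current = 0
        self.previous = 0

    def _roll(self, now: float):
        # Advance fixed windows if time has moved past the current one.
        while now - self.current_start >= self.window:
            self.current_start += self.window
            self.previous, self.current = self.current, 0

    def allow(self, now: float) -> bool:
        self._roll(now)
        # Fraction of the sliding window overlapping the previous fixed window.
        overlap = 1.0 - (now - self.current_start) / self.window
        estimate = self.previous * overlap + self.current
        if estimate < self.limit:
            self.current += 1
            return True
        return False
```

With previous=80, current=40, and 30 seconds elapsed (overlap 0.5), the estimate is 80 * 0.5 + 40 = 80, matching the example above.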
Fixed Window Counter
The simplest algorithm: maintain a counter for each fixed time window (e.g., each calendar minute). Increment on each request, reject when the counter exceeds the limit. The problem is boundary spikes: a client can send the full limit of requests at the end of one window and the full limit again at the beginning of the next, achieving 2x the limit within a window-length period spanning the boundary.
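A sketch of the counter, with the boundary spike visible in the usage below:

```python
# Fixed window counter: one counter per fixed time window, reset at each
# window boundary.

class FixedWindowCounter:
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.window_id = None     # identifier of the current fixed window
        self.count = 0

    def allow(self, now: float) -> bool:
        wid = int(now // self.window)
        if wid != self.window_id:  # crossed a boundary: reset the counter
            self.window_id, self.count = wid, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a limit of 100 per minute, 100 requests at t=59.9s and another 100 at t=60.1s all succeed: 200 requests in 0.2 seconds, twice the nominal limit.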
Leaky Bucket (Queue-Based)
Processes requests at a fixed rate, queuing excess requests. Unlike token bucket which allows bursts, leaky bucket enforces a strict output rate. Requests that arrive faster than the drain rate are queued (up to a queue capacity), and excess requests are dropped. This produces a perfectly smooth output rate, which is useful for protecting fragile backends that cannot handle any traffic spikes.
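The admission side of a leaky bucket can be sketched as below: the queue drains continuously at the fixed rate, and arrivals beyond the queue capacity are dropped. Time is injected for determinism; actual forwarding at the drain rate is left out of the sketch.

```python
# Queue-based leaky bucket: admission decision for a queue that drains at a
# fixed rate, producing a smooth output toward the backend.

class LeakyBucket:
    def __init__(self, drain_rate: float, capacity: float):
        self.drain_rate = drain_rate   # requests forwarded per second
        self.capacity = capacity       # maximum queue depth
        self.depth = 0.0               # current (fractional) queue depth
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # The queue drains continuously between arrivals.
        self.depth = max(0.0, self.depth - (now - self.last) * self.drain_rate)
        self.last = now
        if self.depth + 1 <= self.capacity:
            self.depth += 1            # queued; forwarded at the drain rate
            return True
        return False                   # queue full: request is dropped
```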
Distributed Rate Limiting
When the API gateway runs as multiple instances (scaled horizontally behind a load balancer), rate limiting must be coordinated across instances. Two approaches:
- Centralized counter store -- All gateway instances read/write rate limit counters from Redis or Memcached. Each request performs a Redis INCR + EXPIRE (or Lua script for atomic check-and-increment). This provides accurate global rate limiting but adds a Redis round-trip (~0.5-1ms) to every request.
- Local rate limiting with synchronization -- Each instance maintains local counters and periodically synchronizes with peers or a central store. This is faster (no per-request Redis call) but less accurate: until synchronization catches up, the global rate can briefly exceed the configured limit, in the worst case approaching N times the per-instance allowance (where N is the number of gateway instances).
AWS API Gateway uses a centralized approach with token bucket rate limiting. Kong supports both Redis-backed (exact) and local (approximate) rate limiting via its rate-limiting plugin.
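The atomic check-and-increment that the centralized approach runs inside Redis (typically as a Lua script, so the check and the increment cannot interleave across instances) can be sketched in Python with a dict standing in for Redis; the key scheme is illustrative.

```python
# Logic of the centralized fixed-window check-and-increment, with a dict
# standing in for Redis. In production this runs as a single Lua script so
# the read and the increment are atomic across all gateway instances.

store = {}  # key -> count; Redis would also attach a TTL for expiry

def check_and_increment(client_id: str, limit: int,
                        window: float, now: float) -> bool:
    # One counter per client per fixed window (e.g., rl:acme:27531041).
    key = f"rl:{client_id}:{int(now // window)}"
    count = store.get(key, 0)
    if count >= limit:
        return False              # over limit in this window
    store[key] = count + 1        # in Redis: INCR + EXPIRE
    return True
```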
Request and Response Transformation
API gateways transform requests and responses to decouple the external API contract from the internal service implementation.
Header Manipulation
Common header transformations include: adding identity headers (X-User-Id, X-Tenant-Id, X-Request-Id) that the auth stage populated, removing internal headers that should not leak to clients (server version, debug headers), adding security headers (X-Content-Type-Options, Strict-Transport-Security), and adding or modifying CORS headers for browser-based API consumers.
Request/Response Body Transformation
Some gateways support transforming request and response bodies: converting XML payloads to JSON, restructuring JSON response shapes, filtering out sensitive fields from responses (e.g., removing internal IDs or PII from external-facing API responses), and aggregating responses from multiple backend services into a single response (API composition or BFF -- Backend for Frontend pattern).
Body transformation is expensive (requires buffering, parsing, and re-serializing the body) and should be used sparingly. If you find yourself doing heavy body transformation at the gateway, it often indicates a need for a dedicated BFF service rather than gateway-level transformation.
Protocol Translation
API gateways can translate between protocols: exposing a REST/JSON API to external clients while communicating with backend services via gRPC (gRPC-JSON transcoding), or accepting GraphQL queries and decomposing them into REST calls to multiple backend services. Envoy supports gRPC-JSON transcoding natively via its grpc_json_transcoder filter, allowing clients to call gRPC services using REST conventions.
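As an illustrative fragment (not a complete, verified Envoy configuration), the grpc_json_transcoder filter is configured with a compiled proto descriptor and the gRPC services to expose; the descriptor path and service name below are placeholders.

```yaml
# Illustrative Envoy HTTP filter chain fragment for gRPC-JSON transcoding.
# proto_descriptor and the service name are placeholders.
http_filters:
- name: envoy.filters.http.grpc_json_transcoder
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_json_transcoder.v3.GrpcJsonTranscoder
    proto_descriptor: "/etc/envoy/proto.pb"   # compiled FileDescriptorSet
    services: ["example.UserService"]
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```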
API Gateway Architectures
API gateways can be deployed in several architectural patterns, each with different tradeoffs.
Edge Gateway (North-South)
The most common pattern: the gateway sits at the network edge, handling all external client traffic. It terminates TLS, authenticates requests, enforces rate limits, and routes to internal services. External clients never communicate directly with backend services. This is the pattern used by Kong, Amazon API Gateway, and Apigee.
Internal Gateway (East-West)
A gateway for service-to-service communication within the internal network. This is less common because service meshes handle most east-west concerns (mTLS, load balancing, observability). However, internal gateways are useful for: API composition (aggregating multiple service responses), protocol translation (REST-to-gRPC), and enforcing organizational API standards across teams.
Gateway per Domain (BFF Pattern)
A separate gateway for each client type: mobile gateway, web gateway, partner gateway. Each gateway is tailored to its consumer's needs -- the mobile gateway may aggregate data to reduce round trips, the web gateway may support WebSocket for real-time features, and the partner gateway may enforce stricter rate limits and different auth requirements. This pattern is called "Backend for Frontend" (BFF) and is common in large organizations with diverse API consumers.
API Gateway vs Service Mesh vs Load Balancer
These three technologies overlap but serve different purposes:
- API Gateway -- Handles north-south traffic (external clients to internal services). Focuses on authentication, rate limiting, API versioning, and request transformation. Operates at the edge of the network.
- Service Mesh -- Handles east-west traffic (service-to-service within the cluster). Focuses on mTLS, observability, traffic splitting, and resilience. Operates transparently within the service network.
- Load Balancer -- Distributes traffic across instances of the same service. Focuses on availability and throughput. Operates at L4 or L7 with minimal request awareness.
In a complete architecture, all three coexist: the API gateway handles external authentication and rate limiting and routes requests to internal services; the service mesh handles inter-service mTLS and observability; and load balancers (both within the mesh and at the infrastructure layer) distribute traffic across service instances.
Observability at the API Gateway
API gateways are uniquely positioned for observability: they see every external request and can correlate traffic patterns, error rates, and latency across all backend services.
Key metrics exposed by API gateways:
- Request rate -- Total requests per second, broken down by route, consumer, and HTTP method. Essential for capacity planning and detecting traffic anomalies.
- Error rate -- Percentage of 4xx and 5xx responses per route. Distinguishes client errors (4xx -- bad requests, auth failures) from server errors (5xx -- backend failures).
- Latency distribution -- p50, p90, p95, p99 latency per route. The distribution matters more than the average: a p50 of 50ms with a p99 of 5000ms indicates an intermittent problem affecting 1% of requests.
- Rate limit hits -- How often clients are being rate-limited. High rate limit hit rates may indicate: misconfigured limits, a misbehaving client, a DDoS attack, or limits that are too low for legitimate traffic.
- Auth failures -- Failed authentication attempts per consumer. Spikes may indicate credential stuffing attacks, expired tokens, or misconfigured clients.
API gateways also generate the initial trace context (the X-Request-Id or W3C traceparent header) that propagates through all downstream services, enabling end-to-end distributed tracing for every external request.
Security Considerations
As the entry point for all external traffic, the API gateway is a critical security boundary. Common security features include:
- IP allow/deny lists -- Block or allow traffic based on source IP or CIDR range. Useful for restricting partner APIs to known IP ranges or blocking known-bad actors.
- WAF integration -- Web Application Firewall rules that inspect request bodies for SQL injection, XSS, and other OWASP Top 10 attack patterns. Many gateways integrate with ModSecurity rules or cloud WAF services (AWS WAF, Cloudflare WAF).
- Request size limits -- Enforce maximum request body size to prevent resource exhaustion. A 10MB request body submitted to every endpoint could overwhelm backends that do not expect large payloads.
- TLS enforcement -- Require minimum TLS version (1.2+), enforce strong cipher suites, and implement HSTS headers. The gateway is the TLS termination point, so its TLS configuration determines the security of the external-facing connection.
- CORS enforcement -- Enforce Cross-Origin Resource Sharing policies for browser-based API consumers. The gateway can validate the Origin header and return appropriate CORS headers without backend services needing to implement CORS logic.
API Gateways and Network Infrastructure
API gateways sit at the intersection of application architecture and network infrastructure. They are typically deployed behind a load balancer (or cloud load balancer) that provides L4 distribution across gateway instances, TLS passthrough or termination, and DNS-based GSLB across regions.
For globally distributed APIs, the API gateway is replicated across multiple regions, with DNS or anycast routing directing clients to the nearest gateway instance. Each regional gateway instance connects to both local backend services and (for some requests) cross-region services. The BGP routing topology between regions determines the latency of cross-region API calls, making the AS path between your cloud provider's regions a relevant factor in API architecture design.
Explore the routing infrastructure behind your API gateways and their backend services with the god.ad BGP Looking Glass. Look up the IP addresses of your gateway endpoints to see the BGP routes, origin ASNs, and network paths that determine how client traffic reaches your APIs.