How Rate Limiting Works
Every service exposed to the internet faces the same fundamental problem: demand can exceed capacity. Whether it is a DDoS attack flooding your servers with millions of requests, a misbehaving client hammering an API endpoint in a tight loop, or a viral spike in legitimate traffic, the result is the same — degraded performance or total unavailability. Rate limiting is the set of techniques that control how many requests a client or system can make within a given time window, protecting infrastructure from overload and ensuring fair access for everyone.
Rate limiting operates at every layer of the modern internet stack, from kernel-level packet filtering and firewall rules all the way up to application middleware and CDN edge policies. Understanding how it works — the algorithms, the tradeoffs, the distributed systems challenges — is essential for anyone building or operating internet-facing services.
Why Rate Limiting Matters
Rate limiting serves four critical purposes, each addressing a different failure mode:
DDoS mitigation. Distributed denial-of-service attacks generate traffic volumes that can overwhelm even well-provisioned infrastructure. Rate limiting at the network edge — whether through firewall rules, load balancer policies, or CDN rate limiting rules — drops excess traffic before it reaches application servers. Without rate limiting, a single attacker with a botnet can render a service unreachable for all users.
API abuse prevention. Public APIs are routinely scraped, brute-forced, or abused by automated tools. Rate limiting per API key, per IP, or per user account prevents any single consumer from monopolizing shared resources. This is why every major API — from GitHub to Stripe to Google Maps — enforces rate limits and returns 429 Too Many Requests when they are exceeded.
Fair usage. In multi-tenant systems, one noisy tenant can degrade performance for everyone else. Per-tenant rate limits ensure that a single customer's burst of activity does not starve other customers of compute, memory, or I/O. This is the "noisy neighbor" problem, and rate limiting is the primary defense.
Cost control. Cloud infrastructure is billed by usage. An unexpected traffic spike — whether malicious or accidental — can generate enormous bills. Rate limiting caps resource consumption, providing a financial safety net alongside the technical one.
Token Bucket Algorithm
The token bucket is the most widely used rate limiting algorithm, favored for its simplicity and its ability to allow controlled bursts of traffic. It works by maintaining a virtual "bucket" that holds tokens.
The rules are straightforward: tokens are added to the bucket at a fixed refill rate (e.g., 10 tokens per second). The bucket has a maximum capacity (e.g., 50 tokens). When a request arrives, it must consume one token from the bucket. If the bucket is empty, the request is rejected or queued. If the bucket is full, new tokens are discarded — they do not accumulate beyond the maximum.
The key property of the token bucket is burst allowance. If a client has been idle for a while, the bucket fills to capacity. The client can then send a burst of requests up to the bucket size before being throttled. This makes the token bucket well-suited for bursty traffic patterns where occasional spikes are acceptable but sustained overload is not.
Implementation is efficient: you do not need to actually tick a timer and add tokens. Instead, you store the last request timestamp and the current token count, then compute how many tokens have accumulated since then using simple arithmetic: tokens = min(max_tokens, stored_tokens + (now - last_time) * refill_rate). This lazy evaluation means a token bucket requires only two values per key — a timestamp and a counter — making it practical even at massive scale.
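As a sketch, that lazy-refill arithmetic looks like this in Python (single key, single thread; a production limiter would keep one timestamp/count pair per client and guard updates with a lock or atomic store):

```python
import time

class TokenBucket:
    """Lazy token bucket: stores only a token count and a timestamp."""

    def __init__(self, refill_rate: float, capacity: float):
        self.refill_rate = refill_rate      # tokens added per second
        self.capacity = capacity            # maximum bucket size
        self.tokens = capacity              # a full bucket permits an initial burst
        self.last_time = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Credit tokens accumulated since the last request -- no timer needed.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_time) * self.refill_rate)
        self.last_time = now
        if self.tokens >= 1:
            self.tokens -= 1                # spend one token for this request
            return True
        return False                        # bucket empty: reject (or queue)
```

After a long idle period the bucket is full, so a client can burst up to capacity requests before settling back to the refill rate.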
Linux's iptables uses a token bucket internally for its -m limit module. Nginx uses a variant for its limit_req directive. Amazon API Gateway, Stripe, and most cloud provider rate limiters are built on token buckets.
Leaky Bucket Algorithm
The leaky bucket algorithm is the dual of the token bucket. Imagine a bucket with a hole in the bottom: water (requests) pours in from the top, and drains out at a constant rate from the hole. If water pours in faster than it drains, the bucket fills up. Once the bucket overflows, excess water (requests) is discarded.
In implementation terms, the leaky bucket is a FIFO queue with a fixed processing rate. Incoming requests are added to the queue. A processor dequeues and handles requests at a constant rate. If the queue is full when a new request arrives, the request is dropped.
The critical difference from the token bucket: the leaky bucket smooths traffic to a constant output rate. There are no bursts. Even if 100 requests arrive simultaneously, they are processed one by one at the configured drain rate. This makes the leaky bucket ideal for systems that need a predictable, uniform request rate — such as writing to a database at a fixed throughput or sending messages to a rate-limited downstream API.
The tradeoff is latency. Requests that are queued rather than immediately processed experience added delay. And unlike the token bucket, there is no way to "save up" capacity for a burst. The leaky bucket enforces strict smoothing, which can feel unresponsive to users who expect immediate results after a period of inactivity.
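The leaky bucket is often implemented not as a literal queue but as a meter: track a virtual water level that drains continuously, and reject any request that would overflow it. This sketch shows the meter variant, which drops rather than queues excess requests, so it captures the rate-enforcement logic but not the added queueing latency:

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: the water level drains at a constant rate,
    and a request is rejected if adding it would overflow the bucket."""

    def __init__(self, drain_rate: float, capacity: float):
        self.drain_rate = drain_rate    # requests drained per second
        self.capacity = capacity        # effective queue depth before overflow
        self.level = 0.0
        self.last_time = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain in proportion to elapsed time, never below empty.
        self.level = max(0.0, self.level - (now - self.last_time) * self.drain_rate)
        self.last_time = now
        if self.level + 1 <= self.capacity:
            self.level += 1             # the request "pours in"
            return True
        return False                    # bucket would overflow: drop
```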
Fixed Window Counters
The simplest rate limiting approach is the fixed window counter. Divide time into fixed intervals (e.g., 1-minute windows) and maintain a counter for each window. Every request increments the counter for the current window. If the counter exceeds the limit, the request is denied. When the window expires, the counter resets to zero.
Fixed window counters are trivially implemented — a single counter and a timestamp per key — and easy to reason about. But they have a well-known flaw: the boundary burst problem.
Consider a limit of 100 requests per minute. A client sends 100 requests at 12:00:59 (the last second of one window) and another 100 requests at 12:01:00 (the first second of the next window). Both windows see exactly 100 requests, so both pass the rate limit check. But the server just handled 200 requests in two seconds — double the intended rate — because the requests straddled the window boundary.
Despite this flaw, fixed window counters remain widely used for their simplicity. Many internal rate limiting systems use them because the boundary burst problem rarely matters when limits are set conservatively. Redis's INCR command with EXPIRE makes fixed window counters a one-liner to implement.
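An in-memory sketch of the counter, keyed on an integer window index (a Redis version would use INCR on a key containing the window id, with EXPIRE for cleanup), also makes the boundary burst easy to demonstrate:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Fixed window counter: one integer per (key, window) pair."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counters = defaultdict(int)    # (key, window_id) -> request count

    def allow(self, key: str, now=None) -> bool:
        now = time.time() if now is None else now
        window_id = int(now // self.window)     # which window are we in?
        self.counters[(key, window_id)] += 1    # stale windows are simply abandoned
        return self.counters[(key, window_id)] <= self.limit
```

A client that spends its full quota in the last second of one window and again in the first second of the next achieves double the intended rate across the boundary, exactly the flaw described above.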
Sliding Window Log
The sliding window log fixes the boundary burst problem by tracking the exact timestamp of every request. Instead of maintaining a counter per window, you maintain a log (sorted set) of all request timestamps within the window duration.
When a new request arrives: (1) remove all entries older than now - window_size, (2) count the remaining entries, (3) if the count is below the limit, add the new timestamp and allow the request; otherwise, deny it.
This approach is perfectly accurate — there is no boundary burst problem because the window truly slides with each request. But it comes at a cost: memory usage is proportional to the number of requests, not the number of clients. A client making 10,000 requests per hour requires storing 10,000 timestamps. For high-volume APIs, this becomes impractical. Redis sorted sets (ZADD / ZREMRANGEBYSCORE / ZCARD) are a common implementation, but memory consumption can be significant.
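The three steps translate almost line-for-line into an in-memory sketch, with a deque standing in for the Redis sorted set:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding window log: one timestamp stored per accepted request,
    so memory grows with request volume -- the algorithm's main cost."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()      # timestamps of accepted requests, oldest first

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        # (1) Evict entries that have slid out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        # (2) Count the rest; (3) admit and record if under the limit.
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```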
Sliding Window Counter
The sliding window counter is a hybrid that combines the accuracy of the sliding log with the memory efficiency of fixed windows. It works by maintaining counters for two adjacent fixed windows and computing a weighted average based on the current position within the window.
For example, suppose the limit is 100 requests per minute. The previous window (12:00-12:01) had 84 requests. The current window (12:01-12:02) has 36 requests so far, and we are 40% through the current window (24 seconds in). The estimated request count is: 84 * (1 - 0.4) + 36 = 84 * 0.6 + 36 = 86.4. Since 86.4 is below 100, the request is allowed.
This approach uses only two counters per key — barely more memory than a fixed window counter — but produces a much more accurate approximation of a true sliding window. The error is bounded and predictable. Cloudflare's rate limiting implementation uses this algorithm, as do many other production systems.
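A sketch of the weighted-average check, rolling the two counters forward as windows expire:

```python
class SlidingWindowCounter:
    """Two fixed-window counters plus a weighted estimate of how many
    requests fall inside the true sliding window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_window = 0     # index of the window the counters track
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window)
        if window != self.current_window:
            # Roll forward; anything older than one full window is dropped.
            self.previous_count = (self.current_count
                                   if window == self.current_window + 1 else 0)
            self.current_count = 0
            self.current_window = window
        elapsed = (now % self.window) / self.window   # fraction of window elapsed
        estimate = self.previous_count * (1 - elapsed) + self.current_count
        if estimate < self.limit:
            self.current_count += 1
            return True
        return False
```

Replaying the worked example (84 requests in the previous window, 36 in the current, 40% elapsed) yields an estimate of 86.4, below the limit of 100, so the request is allowed.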
Rate Limiting at the Network Layer
The lowest level at which rate limiting can be applied is the network stack itself, operating on raw packets before they reach any application code.
iptables and nftables
Linux's iptables provides the -m limit match module, which implements a token bucket rate limiter at the packet level. For example:
iptables -A INPUT -p tcp --dport 80 -m limit --limit 25/sec --limit-burst 50 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j DROP
This accepts up to 25 HTTP packets per second with a burst tolerance of 50 packets, dropping everything else. The hashlimit module extends this with per-source-IP rate limiting, maintaining separate token buckets for each source address.
nftables, the successor to iptables, provides similar functionality with a cleaner syntax and better performance for large rulesets. Both operate in kernel space, so filtering decisions are made without any context switch to userspace.
BPF and XDP
For even higher performance, BPF (Berkeley Packet Filter) programs can be attached to network interfaces via XDP (eXpress Data Path). XDP programs run before the kernel's network stack even processes the packet, achieving line-rate filtering on modern NICs. Companies like Cloudflare and Meta use XDP-based rate limiting to handle multi-terabit DDoS attacks, dropping malicious packets at the NIC driver level before they consume any CPU or memory resources.
Rate Limiting at the Load Balancer
Load balancers sit between clients and application servers, making them a natural enforcement point for rate limiting.
Nginx
Nginx's limit_req module implements a leaky bucket algorithm. Configuration defines a shared memory zone for tracking request rates per key (typically the client IP):
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        proxy_pass http://backend;
    }
}
The rate=10r/s sets the drain rate (10 requests per second). The burst=20 allows up to 20 excess requests to queue. The nodelay flag processes burst requests immediately rather than spacing them out, converting the leaky bucket into something closer to a token bucket in practice.
HAProxy Stick Tables
HAProxy provides stick tables — in-memory key-value stores that track per-connection metrics. Stick tables can track request rates, connection counts, and byte counters per client IP, and use these for rate limiting decisions:
frontend http_front
    stick-table type ip size 200k expire 30s store http_req_rate(10s)
    http-request track-sc0 src
    http-request deny if { sc_http_req_rate(0) gt 100 }

The track-sc0 line records each client IP in the stick table; without a tracking rule, the table is never populated.
This denies clients exceeding 100 requests per 10-second window. HAProxy stick tables can be replicated across peers for distributed rate limiting.
Envoy Proxy
Envoy supports both local rate limiting (per-instance token bucket) and global rate limiting (via an external rate limit service). The global rate limit service is typically a gRPC service backed by Redis, allowing consistent rate limiting across an Envoy mesh. This architecture separates rate limiting policy from proxy configuration, making it easier to manage at scale.
Rate Limiting at the Application Layer
Application-layer rate limiting operates within the application code itself, typically as middleware in the request processing pipeline. This is where the most granular rate limiting decisions are made — per user, per API key, per endpoint, per operation type.
Application-layer rate limiters have access to rich request context that lower layers lack: the authenticated user identity, the API key, the specific endpoint being called, the request payload size. This enables sophisticated policies like "free tier users get 100 requests per hour; paid users get 10,000; enterprise users get custom limits."
The tradeoff is that application-layer rate limiting only protects after the request has already traversed the network, the load balancer, and the application framework. It cannot protect against volumetric attacks that overwhelm the network itself. Application-layer rate limiting is a complement to, not a replacement for, network and edge-layer protection.
Rate Limiting at the CDN/Edge
CDN providers offer rate limiting at their global edge network, which is often the most effective place to enforce limits because traffic is filtered before it reaches your infrastructure at all.
Cloudflare Rate Limiting lets you define rules based on URI paths, HTTP methods, headers, and response codes. Rules can match on patterns like "more than 100 requests per minute to /api/* from the same IP" and respond with a block, challenge, or custom response. Because Cloudflare operates anycast points of presence in over 300 cities, the rate limiting logic runs close to the client, and attack traffic never reaches the origin server. Cloudflare's newer WAF rate limiting rules use the sliding window counter algorithm internally.
AWS WAF provides rate-based rules that count requests matching a pattern and block source IPs that exceed the threshold. AWS WAF integrates with CloudFront, ALB, and API Gateway, providing rate limiting at the edge of AWS's network.
Edge-based rate limiting has one significant challenge: state synchronization. A CDN with 300 PoPs needs each PoP to have an accurate view of a client's request count. In practice, most CDN rate limiters enforce limits per-PoP rather than globally, meaning a distributed attacker hitting different PoPs could exceed the intended global limit. Some CDN providers mitigate this with eventual consistency — PoPs periodically sync counters — but there is always a window where the limit can be exceeded.
Distributed Rate Limiting
Rate limiting a single server is straightforward — you keep the state in memory. Distributed rate limiting across a fleet of servers is a fundamentally harder problem. If you run 50 application servers behind a load balancer, each server maintaining its own local rate limit counter would allow a client to make 50x the intended limit by having their requests distributed across all servers.
Redis-Based Rate Limiting
The standard solution is centralized state in Redis. All application servers check and update a shared counter in Redis for each rate-limited key. A typical implementation uses Redis's atomic INCR command with EXPIRE for fixed window counting, or sorted sets for sliding window logs.
The classic Redis rate limiter uses a MULTI/EXEC transaction to atomically increment a counter and set its expiration:
MULTI
INCR rate_limit:{user_id}:{window}
EXPIRE rate_limit:{user_id}:{window} 60
EXEC
This is atomic within a single Redis instance — no other client's commands can interleave between the INCR and EXPIRE. Note that MULTI/EXEC provides atomicity, not a distributed lock, and in Redis Cluster a transaction can only touch keys that hash to the same slot — a constraint to keep in mind when composing multi-key rate limit state.
Race Conditions
The fundamental race condition in distributed rate limiting is the check-then-increment pattern. If two servers simultaneously read the counter, both see it below the limit, and both increment it, the effective limit is exceeded by one. Redis's INCR is atomic and returns the new value, which eliminates this race for single-key counters — the pattern is "increment first, then check the returned value against the limit."
For more complex algorithms (sliding window log with ZADD + ZCARD), Lua scripts executed atomically on the Redis server are the standard solution. A Lua script runs as a single atomic operation, preventing any interleaving with other commands.
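As a sketch, a sliding window log limiter fits in one such script (assuming the caller passes millisecond timestamps and a unique request id to keep sorted-set members distinct):

```lua
-- KEYS[1]: sorted set for this rate-limit key
-- ARGV[1]: now in ms, ARGV[2]: window in ms, ARGV[3]: limit, ARGV[4]: request id
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, ARGV[1] - ARGV[2])
local count = redis.call('ZCARD', KEYS[1])
if count < tonumber(ARGV[3]) then
  redis.call('ZADD', KEYS[1], ARGV[1], ARGV[1] .. ':' .. ARGV[4])
  redis.call('PEXPIRE', KEYS[1], ARGV[2])
  return 1
end
return 0
```

Loaded once with SCRIPT LOAD and invoked via EVALSHA, the trim, count, and insert execute as a single atomic unit on the server.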
Consistent Hashing
For extremely high throughput, even Redis can become a bottleneck. Consistent hashing distributes rate limiting state across multiple Redis instances based on the rate-limited key (user ID, IP address, etc.). Each key deterministically maps to a specific Redis node, so all servers agree on which Redis instance holds the counter for a given key. Libraries like hashring or Redis Cluster's built-in hash slots provide this functionality.
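A minimal hash ring sketch (the virtual-node count and hash function here are illustrative choices, not a prescription):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: map each rate-limit key to a Redis node so that
    every application server agrees on where a key's counter lives."""

    def __init__(self, nodes, vnodes: int = 100):
        points = []
        for node in nodes:
            # Virtual nodes spread each physical node around the ring.
            for i in range(vnodes):
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self.hashes = [h for h, _ in points]
        self.nodes = [n for _, n in points]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's hash (wrap at the end).
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.hashes)
        return self.nodes[idx]
```

Because the mapping is deterministic, adding or removing a Redis node remaps only the keys whose nearest ring position changed, rather than reshuffling everything.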
HTTP Rate Limit Headers
Well-designed APIs communicate rate limit status to clients through HTTP headers. The 429 Too Many Requests status code comes from the IETF's RFC 6585, and the RateLimit-* headers are specified in the draft RateLimit header fields specification. The standard headers are:
- RateLimit-Limit — the maximum number of requests allowed in the current window (e.g., RateLimit-Limit: 100)
- RateLimit-Remaining — how many requests the client has left before being throttled (e.g., RateLimit-Remaining: 73)
- RateLimit-Reset — the time (in seconds or as a Unix timestamp) until the rate limit window resets (e.g., RateLimit-Reset: 42)
- Retry-After — included in 429 Too Many Requests responses, telling the client how long to wait before retrying (e.g., Retry-After: 30)
When a client exceeds the rate limit, the server responds with HTTP 429 Too Many Requests and a Retry-After header. Well-behaved clients respect this header and implement exponential backoff. Poorly-behaved clients that retry immediately after a 429 create a feedback loop of retries, making the overload worse — which is why rate limiting is often paired with escalating penalties (progressively longer block times for repeated violations).
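A well-behaved client's retry delay can be sketched as: honor Retry-After when the server provides it, otherwise back off exponentially with jitter (the base and cap values here are illustrative):

```python
import random

def backoff_delay(attempt: int, retry_after=None,
                  base: float = 0.5, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)   # the server told us exactly how long to wait
    # Full jitter: a random delay up to the capped exponential bound,
    # which avoids synchronized retry stampedes after an outage.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```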
API Rate Limiting Patterns
Production APIs rarely use a single flat rate limit. Instead, they compose multiple limits across different dimensions:
Per-user limits are keyed by authenticated user ID. They prevent any single user from monopolizing resources, regardless of how many IPs or API keys they use. This is the most common pattern for authenticated APIs.
Per-IP limits are keyed by client IP address. They protect against unauthenticated abuse and brute-force attacks. However, IP-based limits can be problematic behind NAT gateways or corporate proxies where many users share a single IP. The X-Forwarded-For header helps but can be spoofed.
Per-endpoint limits apply different rates to different API routes. A read-heavy endpoint like GET /api/status might allow 1000 requests per minute, while a write-heavy endpoint like POST /api/transfer might allow only 10. This reflects the different resource costs of different operations.
Tiered plans assign different rate limits based on subscription level. GitHub's API, for example, allows 60 requests per hour for unauthenticated users, 5,000 per hour for authenticated users, and custom higher limits for enterprise customers. The rate limit tier is typically encoded in the API key or OAuth token.
Composite keys combine multiple dimensions: "100 requests per minute per user per endpoint" prevents a single user from flooding a specific endpoint while allowing high aggregate usage across different endpoints.
Adaptive Rate Limiting and Circuit Breakers
Static rate limits are a blunt instrument. Adaptive rate limiting adjusts limits dynamically based on current system health — increasing limits when the system is healthy and reducing them under stress.
A common approach monitors backend latency, error rates, and CPU utilization. When latency exceeds a threshold (indicating the backend is struggling), the rate limiter reduces the allowed request rate. When metrics return to normal, limits are gradually increased. This creates a feedback loop that automatically protects the system without manual intervention.
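One common shape for this feedback loop is additive increase / multiplicative decrease (AIMD), the same scheme TCP congestion control uses; the thresholds and step sizes below are illustrative:

```python
class AdaptiveLimiter:
    """Adjust the allowed request rate from observed backend latency:
    back off sharply under stress, recover gradually when healthy."""

    def __init__(self, initial_rate: float, min_rate: float,
                 max_rate: float, latency_threshold_ms: float):
        self.rate = initial_rate
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.threshold = latency_threshold_ms

    def observe(self, p99_latency_ms: float) -> float:
        """Feed one monitoring interval's p99 latency; returns the new rate."""
        if p99_latency_ms > self.threshold:
            self.rate = max(self.min_rate, self.rate * 0.8)   # multiplicative decrease
        else:
            self.rate = min(self.max_rate, self.rate + 10)    # additive increase
        return self.rate
```

The resulting rate would then be fed into whichever mechanism (token bucket, sliding window counter) actually admits or rejects requests.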
Circuit breakers are a related pattern from the distributed systems toolkit. A circuit breaker monitors the error rate of calls to a downstream service. When failures exceed a threshold, the circuit "opens" and all subsequent requests are immediately rejected without even attempting the call — fast-failing rather than waiting for a timeout. After a configurable period, the circuit enters a "half-open" state and allows a limited number of test requests through. If those succeed, the circuit closes and normal traffic resumes.
Circuit breakers and rate limiters complement each other. Rate limiting controls the input rate; circuit breakers control the output behavior when a dependency fails. Netflix's Hystrix library popularized this pattern, and modern implementations like resilience4j and Envoy's circuit breaking continue to evolve it.
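The closed / open / half-open state machine described above can be sketched as follows (an injectable clock keeps the cooldown deterministic; the thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    fast-fail while open, admit one trial call after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"           # closed -> open -> half-open -> closed
        self.opened_at = 0.0

    def call(self, fn, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "open":
            if now - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: fast-failing")
            self.state = "half-open"    # cooldown elapsed: allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"     # trip (or re-trip) the breaker
                self.opened_at = now
            raise
        self.failures = 0               # any success fully closes the circuit
        self.state = "closed"
        return result
```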
Rate Limiting vs Throttling vs Backpressure
These three terms are often used interchangeably, but they describe distinct mechanisms:
Rate limiting enforces a hard cap on request volume. Requests that exceed the limit are rejected outright (HTTP 429, TCP RST, or packet drop). The decision is binary: allow or deny. Rate limiting protects the server by shedding load.
Throttling slows down requests rather than rejecting them. A throttled client might see artificially increased latency, reduced bandwidth, or deprioritized processing. Throttling preserves the request but degrades the experience — common in ISP bandwidth management and API quotas where partial service is better than no service.
Backpressure is a feedback mechanism where a downstream system signals to upstream producers that it is overwhelmed, causing the producers to slow down. Unlike rate limiting (which is imposed by the server on clients), backpressure propagates upstream through the system. TCP's flow control (window sizing) is a classic example: when a receiver's buffer fills, it advertises a smaller window, causing the sender to transmit less data. Reactive Streams and protocols like gRPC flow control implement application-level backpressure.
In practice, robust systems use all three: rate limiting at the edge to reject abusive traffic, throttling in the middle tier to degrade gracefully under load, and backpressure internally to prevent queue overflow between services.
BGP Blackholing as Network-Level Rate Limiting
At the most extreme end of rate limiting is BGP blackholing, specifically Remotely Triggered Black Hole (RTBH) routing. When a network is under a volumetric DDoS attack so large that no amount of per-packet rate limiting can handle it, the target network can announce a specific BGP route for the victim's IP address with a special community string that tells upstream routers to drop all traffic destined for that IP.
RTBH works by setting the next-hop of the blackholed route to a discard interface (typically the Null0 or discard interface on the router). When upstream providers receive this announcement, they install a route that silently drops all packets matching the prefix. The attack traffic is absorbed across the upstream network rather than concentrating at the target.
The tradeoff is brutal: blackholing drops all traffic to the victim IP, including legitimate traffic. The attacker effectively wins — the service is unavailable — but the collateral damage to the rest of the network is contained. This is why blackholing is a last resort, used when the attack volume exceeds the capacity of more selective mitigation techniques.
More sophisticated variants like flowspec (BGP Flow Specification, RFC 5575) allow networks to distribute fine-grained traffic filtering rules via BGP. Instead of blackholing an entire prefix, flowspec can match on source/destination IP, port, protocol, packet length, and other fields — enabling targeted rate limiting or blocking of attack traffic while allowing legitimate traffic through. Flowspec essentially distributes firewall rules across the internet's routing infrastructure.
You can observe BGP blackholing and RTBH communities in practice by examining BGP routes with community strings. Networks like Cloudflare (AS13335) and Lumen (AS3356) support RTBH communities that their customers can use to trigger upstream blackholing during attacks.
Choosing the Right Approach
The choice of rate limiting algorithm and enforcement layer depends on your specific requirements:
- Token bucket — best for APIs with bursty traffic patterns where you want to allow short spikes while enforcing a sustained rate
- Leaky bucket — best when you need a smooth, constant output rate regardless of input burstiness
- Fixed window — best when simplicity matters more than precision, and the boundary burst problem is acceptable
- Sliding window counter — best for production systems that need accuracy with bounded memory
- Network-layer (iptables/BPF) — best for volumetric attack mitigation where packet-level filtering is needed
- Load balancer — best for connection and request rate limiting before traffic reaches application servers
- Application layer — best for identity-aware, context-rich rate limiting decisions
- CDN/edge — best for global protection of origin infrastructure, especially against DDoS attacks
- BGP blackholing — last resort for attacks that exceed all other mitigation capacity
In practice, defense in depth is the right answer. A well-architected system applies rate limiting at multiple layers: CDN edge rules for volumetric protection, load balancer limits for connection management, application middleware for identity-based policies, and BGP blackholing as the ultimate safety valve. Each layer catches what the layer above missed, and together they provide comprehensive protection against the full spectrum of abuse.
You can explore the networks that provide these protections — CDN and DDoS mitigation providers like Cloudflare (AS13335), transit providers like Lumen (AS3356) that support RTBH, and the anycast infrastructure that makes global rate limiting possible — by looking up their ASNs and examining their routing tables.