Distributed Rate Limiting — Scaling Request Throttling Across Nodes
Why rate limiting gets hard at scale#
A single-node rate limiter is straightforward — track counters in memory and reject requests that exceed the threshold. But the moment you scale to multiple servers behind a load balancer, each node only sees a fraction of traffic. Without coordination, a client can blow past your global limit by spreading requests across nodes.
Distributed rate limiting solves this by sharing state across all nodes so the global request count is enforced, not a per-node approximation.
Centralized vs distributed rate limiters#
There are two broad architectures:
Centralized rate limiter#
Every application node queries a single shared store (typically Redis or Memcached) before allowing a request through.
Pros:
- Exact global counts — no approximation
- Simple mental model
- Single source of truth
Cons:
- The shared store becomes a single point of failure
- Every request adds a network round-trip
- Latency increases under high concurrency
Distributed (peer-to-peer) rate limiter#
Each node maintains local counters and periodically synchronizes with peers or a central coordinator.
Pros:
- Lower latency — most decisions are local
- Tolerates temporary network partitions
- Scales horizontally with the application
Cons:
- Counts are approximate between sync intervals
- More complex to implement and debug
- Brief over-admission windows are possible
Redis-based sliding window#
The most common production approach uses Redis as a centralized counter with a sliding window algorithm. The idea: store each request timestamp in a sorted set, then count entries within the current window.
```lua
-- Sliding window rate limiter in Redis Lua
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

-- Remove entries outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count remaining entries
local count = redis.call('ZCARD', key)
if count >= limit then
  return 0 -- rejected
end

-- Add current request with a random suffix so concurrent
-- requests in the same instant get distinct members
redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
redis.call('EXPIRE', key, window)
return 1 -- allowed
```
Running this as a Lua script inside Redis guarantees atomicity — no race conditions between the count check and the insert.
Why Lua scripts matter#
Without atomic execution, two requests arriving simultaneously could both read a count of 99 (under a limit of 100) and both succeed, pushing the actual count to 101. The Lua script ensures the read-check-write sequence is indivisible.
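The same read-check-write hazard exists in any shared-counter limiter. As an illustrative single-process sketch in Python (not the Redis implementation), here is the sliding-log logic, where a lock plays the role that atomic script execution plays inside Redis:

```python
import threading
import time
from collections import deque

class SlidingLogLimiter:
    """In-memory analogue of the Lua script above. The lock makes the
    remove/count/insert sequence indivisible, just as running the Lua
    script atomically does inside Redis."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()   # sorted request timestamps (the "log")
        self.lock = threading.Lock()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        with self.lock:  # read-check-write must not interleave
            # Drop entries outside the window (ZREMRANGEBYSCORE)
            while self.timestamps and self.timestamps[0] <= now - self.window:
                self.timestamps.popleft()
            if len(self.timestamps) >= self.limit:  # ZCARD >= limit
                return False
            self.timestamps.append(now)  # ZADD
            return True
```

Without the lock, two threads could both observe a count one below the limit and both append, reproducing exactly the 99-to-101 race described above.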
Token bucket at scale#
The token bucket algorithm refills tokens at a fixed rate, and each request consumes one token. It naturally supports burst traffic up to the bucket capacity.
In a distributed setting, you store bucket state in Redis:
bucket:{client_id} -> { tokens: 47, last_refill: 1711670400 }
On each request:
- Read current tokens and last refill timestamp
- Calculate tokens to add based on elapsed time
- If tokens are available, decrement and allow
- Otherwise, reject and return a Retry-After header
Use a Lua script or Redis transaction to keep this atomic. The refill calculation happens lazily — you do not need a background process adding tokens.
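A minimal single-node sketch of the lazy-refill logic in Python (the class name and the `(allowed, retry_after)` return shape are illustrative choices, not a standard API):

```python
import time

class TokenBucket:
    """Lazy-refill token bucket; a local sketch of the per-client
    {tokens, last_refill} state you would keep in Redis."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity            # max burst size
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill lazily from elapsed time; no background process needed
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Seconds until one token is available: a Retry-After hint
        return False, (1.0 - self.tokens) / self.refill_rate
```

In the distributed version, this entire method body becomes the Lua script so the read-refill-decrement sequence stays atomic in Redis.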
Fixed window vs sliding window vs sliding log#
| Algorithm | Accuracy | Memory | Burst behavior |
|---|---|---|---|
| Fixed window | Low — boundary spikes | O(1) per key | 2x burst at window edges |
| Sliding window | Medium — weighted average | O(1) per key | Smoothed across windows |
| Sliding log | High — exact timestamps | O(n) per key | No burst artifacts |
| Token bucket | High — configurable burst | O(1) per key | Controlled burst capacity |
For most APIs, the sliding window counter (a weighted blend of current and previous fixed windows) offers the best tradeoff between accuracy and memory.
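The weighted blend takes only a few lines. Assuming `prev_count` and `curr_count` are the totals of the previous and current fixed windows, a sketch of the estimate:

```python
def sliding_window_allow(prev_count, curr_count, limit, window, elapsed):
    """Sliding window counter: weight the previous fixed window by the
    fraction of it still covered by the rolling window, then add the
    current window's count."""
    weight = (window - elapsed) / window          # how much of prev still counts
    estimated = prev_count * weight + curr_count  # rolling estimate
    return estimated < limit
```

For example, with a limit of 100 per 60 s, 80 requests in the previous window, 30 so far in the current one, and 15 s elapsed, the estimate is 80 * 0.75 + 30 = 90, so the request is allowed.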
Rate limit synchronization across nodes#
When latency to a central Redis is too high, you can use a local + global hybrid approach:
- Each node maintains a local counter with a fraction of the global limit
- Periodically (every 100ms–1s), nodes push local counts to Redis and pull the global total
- If the global total approaches the limit, nodes tighten their local allocation
This is similar to how Google's Doorman and Envoy's rate limit service work. The tradeoff is tuning the sync interval — too fast wastes bandwidth, too slow allows over-admission.
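The hybrid loop above can be sketched as follows; `fetch_global_used` is a hypothetical callback standing in for the Redis push/pull, and the 50% tightening factor is an arbitrary illustrative choice:

```python
class HybridLimiter:
    """Local + global hybrid sketch: spend a local allocation on the
    fast path, reconcile against the shared global total on sync."""

    def __init__(self, global_limit, num_nodes, fetch_global_used):
        self.global_limit = global_limit
        self.allocation = global_limit / num_nodes  # optimistic equal share
        self.local_used = 0
        self.fetch_global_used = fetch_global_used  # pushes local, returns global

    def allow(self):
        if self.local_used < self.allocation:
            self.local_used += 1    # fast path: no network round-trip
            return True
        return False                # wait for the next sync to widen allocation

    def sync(self):
        # Push local count, pull the global total, tighten allocation
        global_used = self.fetch_global_used(self.local_used)
        remaining = max(0, self.global_limit - global_used)
        # Claim only half of what remains, leaving headroom for peers
        self.allocation = self.local_used + remaining * 0.5
```

Between syncs, each node can over-admit by at most its unspent allocation, which is exactly the approximation window the sync interval controls.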
Consistent hashing for client affinity#
Another approach: use consistent hashing in your load balancer so requests from the same client always hit the same node. This turns the distributed problem into a local one — at the cost of uneven load distribution and failover complexity.
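A minimal hash ring sketch, assuming MD5 and 100 virtual nodes per physical node (both arbitrary illustrative choices):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes, sketching how a
    load balancer could pin each client to one rate-limiting node."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring to
        # smooth out load imbalance
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, client_id):
        # First ring position clockwise from the client's hash
        idx = bisect.bisect(self.keys, self._hash(client_id)) % len(self.keys)
        return self.ring[idx][1]
```

Because a given client always hashes to the same ring position, its counters live on one node, so plain in-memory limiting suffices there; the cost surfaces when a node dies and its clients remap to a peer with no history.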
Handling failures gracefully#
What happens when Redis goes down? You have two choices:
- Fail open — allow all requests through (risk of overload)
- Fail closed — reject all requests (risk of total outage)
Most systems choose fail open with a local fallback rate limiter. Each node applies a conservative per-node limit until Redis recovers.
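A sketch of that fallback pattern; `redis_check` and `local_check` are hypothetical callables, not real client APIs:

```python
class FailOpenLimiter:
    """Fail-open wrapper: consult the shared (Redis-backed) check first,
    and fall back to a conservative per-node limiter when it errors."""

    def __init__(self, redis_check, local_check):
        self.redis_check = redis_check  # raises ConnectionError when Redis is down
        self.local_check = local_check  # stricter, purely local limit

    def allow(self, client_id):
        try:
            return self.redis_check(client_id)
        except ConnectionError:
            # Redis unreachable: degrade to the local fallback rather
            # than rejecting (fail closed) or admitting everything
            return self.local_check(client_id)
```

The local fallback limit is typically the global limit divided by the node count, minus a safety margin, so the cluster cannot wildly over-admit while Redis recovers.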
Rate limiting at the API gateway layer#
API gateways like Kong, Envoy, and NGINX can enforce rate limits before requests reach your application:
- Kong — built-in rate-limiting plugin with Redis or PostgreSQL backing
- Envoy — external rate limit service with a gRPC interface
- NGINX — limit_req module with shared memory zones
Gateway-level limiting protects your backend from traffic spikes but lacks application-level context (user tier, endpoint cost). Combine both layers for defense in depth.
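As one concrete example of the gateway layer, an NGINX limit_req configuration might look like this (the zone name, rate, and burst values are illustrative):

```nginx
# Shared-memory zone keyed by client IP: 10 MB of state, 100 req/s steady rate
limit_req_zone $binary_remote_addr zone=api:10m rate=100r/s;

server {
    location /api/ {
        # Permit short bursts of 20 above the steady rate without delaying them
        limit_req zone=api burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```

Note that this keys on IP address only; per-user or per-tier limits still belong in the application layer behind it.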
Implementing rate limit headers#
Always return standard headers so clients can self-throttle:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711670460
Retry-After: 30
Well-behaved clients use Retry-After to back off. Without these headers, clients resort to aggressive retries that amplify the problem.
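A well-behaved client loop might look like this sketch, where `send` is a hypothetical callable returning a `(status, headers)` pair:

```python
import time

def request_with_backoff(send, max_attempts=5):
    """Honor Retry-After on 429 instead of hammering the server;
    fall back to exponential backoff when the header is absent."""
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        # Prefer the server's hint; otherwise back off 1s, 2s, 4s, ...
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    return 429  # budget exhausted; surface the throttle to the caller
```

Adding random jitter to the fallback delay is also common, so a fleet of throttled clients does not retry in lockstep.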
Monitoring and observability#
Track these metrics for your rate limiter:
- Rejection rate — percentage of requests returning 429
- Latency overhead — time added by the rate limit check
- Redis hit rate — cache performance for rate limit keys
- Sync lag — delay between local and global counts (hybrid mode)
Alert when rejection rate spikes — it may indicate a traffic surge, a misbehaving client, or a misconfigured limit.
Visualize distributed rate limiting#
On Codelit, generate a multi-node API architecture with Redis-backed rate limiting to see how requests flow through the limiter, how token buckets refill, and how nodes synchronize state.
This is article #407 in the Codelit engineering blog series.
Build and explore distributed system architectures visually at codelit.io.