Rate Limiter System Design: Algorithms, Distributed Redis, and Scale#
A rate limiter controls how many requests a client can send in a given time window. It protects services from abuse, prevents resource starvation, and manages cost — every major API (Stripe, GitHub, Twitter) enforces rate limits.
Functional Requirements#
- Limit requests per client/IP/API key within a configurable time window
- Return clear feedback when a client is throttled (HTTP 429 + headers)
- Support multiple rules — e.g., 100 req/min for /api/search, 1000 req/min for /api/read
Non-Functional Requirements#
- Low latency — the limiter sits on the hot path; must add < 1 ms overhead
- Highly available — if the limiter goes down, traffic should not be blocked
- Distributed — works across multiple servers behind a load balancer
- Accurate — minimal over-counting or under-counting
Scale Estimation#
- DAU: 50 M
- Avg requests per user per day: 20
- Total daily requests: 1 B
- Avg QPS: ~12,000 (1 B / 86,400 s); peak QPS: ~30,000
- Rate limit check per request: 1 Redis call ≈ 0.5 ms
Each rate limit check must be atomic and sub-millisecond.
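The estimates above can be sanity-checked with a few lines of arithmetic (the ~2.5× peak-to-average factor is an assumption, not a measurement):

```python
DAU = 50_000_000
requests_per_user = 20

total = DAU * requests_per_user    # 1 B requests/day
avg_qps = total / 86_400           # seconds in a day
peak_qps = avg_qps * 2.5           # assumed peak-to-average ratio

print(round(avg_qps), round(peak_qps))   # 11574 28935 — roughly the ~30 K peak above
```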
Where to Place the Rate Limiter#
Client → API Gateway / Rate Limiter Middleware → Application Server → DB
Three options:
| Placement | Pros | Cons |
|---|---|---|
| Client-side | Zero server load | Easily bypassed |
| Middleware / Gateway | Centralized, language-agnostic | Extra hop |
| Application layer | Fine-grained per-endpoint rules | Coupled to app |
Recommended: Rate limit at the API gateway or middleware layer. Most cloud gateways (AWS API Gateway, Kong, Envoy) have built-in rate limiting.
Rate Limiting Algorithms#
1. Token Bucket#
The most widely used algorithm (used by AWS, Stripe).
- Bucket capacity: B tokens
- Refill rate: R tokens per second
- Each request consumes 1 token
- If tokens ≥ 1: allow and decrement; otherwise: reject (429)
Pros: Allows short bursts up to B, smooth long-term rate of R/s. Cons: Two parameters to tune per rule.
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate   # tokens per second
        self.last_refill = time.time()

    def allow(self):
        now = time.time()
        elapsed = now - self.last_refill
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
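To see the burst-then-throttle behavior, here is a self-contained variant with an injectable clock (the `clock` parameter is our addition, purely for deterministic demonstration):

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate   # tokens per second
        self.clock = clock               # injectable time source
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Simulated clock: capacity 5, refill 1 token/s
t = [0.0]
bucket = TokenBucket(5, 1.0, lambda: t[0])

burst = sum(bucket.allow() for _ in range(10))          # 10 requests at t=0
print(burst)                                            # 5 — burst capped at capacity

t[0] = 3.0                                              # 3 s later: 3 tokens refilled
after_refill = sum(bucket.allow() for _ in range(10))
print(after_refill)                                     # 3
```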
2. Sliding Window Log#
Track the exact timestamp of every request in a sorted set.
On each request:
1. Remove all entries older than (now - window_size)
2. Count remaining entries
3. If count < limit: allow, add timestamp
4. Else: reject
Pros: Perfectly accurate — no boundary issues. Cons: Memory-intensive (stores every timestamp). For 1 M users at 100 req/min, that's 100 M entries.
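The four steps above can be sketched in memory with a deque per client (a toy illustration; a production implementation keeps this log in a Redis sorted set):

```python
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()               # timestamps, oldest first

    def allow(self, now):
        # 1. Remove entries older than (now - window_size)
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        # 2-3. Allow only if the remaining count is under the limit
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False                     # 4. Reject

limiter = SlidingWindowLog(limit=3, window_seconds=60)
results = [limiter.allow(t) for t in (0, 10, 20, 30)]
print(results)                 # [True, True, True, False]
late = limiter.allow(61)
print(late)                    # True — the t=0 entry has expired
```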
3. Sliding Window Counter#
A memory-efficient approximation combining fixed windows:
estimated_count = current_window_count + previous_window_count × (1 - weight)
weight = elapsed_time_in_current_window / window_size
Example with a 1-minute window, limit = 100:
Previous minute: 80 requests
Current minute (40 s in): 30 requests
Estimated count = 30 + 80 × (20/60) = 30 + 26.7 ≈ 57
57 < 100 → allow
Pros: Low memory (two counters per key), reasonably accurate. Cons: Approximate — can allow ~1% more than the limit.
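A minimal sketch of the two-counter scheme, using the weighted estimate above (window boundaries aligned to wall-clock multiples of the window size; names are illustrative):

```python
class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_start = 0.0   # start of the current fixed window
        self.current = 0
        self.previous = 0

    def allow(self, now):
        # Roll the windows forward if a boundary has been crossed
        elapsed_windows = int((now - self.current_start) // self.window)
        if elapsed_windows == 1:
            self.previous, self.current = self.current, 0
            self.current_start += self.window
        elif elapsed_windows > 1:
            self.previous, self.current = 0, 0
            self.current_start = now - (now % self.window)

        weight = (now - self.current_start) / self.window
        estimated = self.current + self.previous * (1 - weight)
        if estimated < self.limit:
            self.current += 1
            return True
        return False

# Reproduce the worked example: previous = 80, current = 30, 40 s into the window
lim = SlidingWindowCounter(limit=100, window_seconds=60)
lim.previous, lim.current = 80, 30
print(lim.allow(40))   # True — estimated ≈ 57 < 100
```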
Algorithm Comparison#
| Algorithm | Memory | Accuracy | Burst Handling |
|---|---|---|---|
| Token Bucket | Low | Good | Allows controlled bursts |
| Sliding Window Log | High | Exact | No bursts beyond limit |
| Sliding Window Counter | Low | ~99% | Smooth approximation |
Distributed Rate Limiting with Redis#
In a distributed system, multiple servers must share rate limit state. Redis is the standard choice — single-threaded, atomic operations, sub-millisecond latency.
Token Bucket in Redis (Lua Script)#
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity, ARGV[2] = refill_rate, ARGV[3] = now
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)
if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
    return 1 -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
    return 0 -- rejected
end
The Lua script executes atomically in Redis — no race conditions.
Sliding Window Log in Redis#
import time
import uuid

def is_allowed(redis, key, limit, window_seconds):
    now = time.time()
    window_start = now - window_seconds
    pipe = redis.pipeline()                      # transaction=True → MULTI/EXEC
    pipe.zremrangebyscore(key, 0, window_start)  # prune entries outside window
    pipe.zcard(key)                              # count what remains
    member = f"{now}-{uuid.uuid4()}"             # unique member per request
    pipe.zadd(key, {member: now})                # record this request
    pipe.expire(key, window_seconds)
    results = pipe.execute()
    count = results[1]                           # zcard ran before our zadd
    return count < limit
This is the sliding window log variant (exact but memory-heavy). Note that the timestamp is recorded even for rejected requests, so a client that keeps hammering the limit keeps pushing its own window forward.
Handling Race Conditions#
The Read-Then-Write Problem#
Server A: reads count = 99 (limit 100)
Server B: reads count = 99
Server A: increments to 100, allows
Server B: increments to 100, allows ← should have been rejected!
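The interleaving above is easy to reproduce deterministically by separating the read and write steps (a toy illustration, not real Redis calls):

```python
limit = 100
count = 99                      # 99 requests already admitted this window
allowed_a = allowed_b = False

read_a = count                  # Server A reads
read_b = count                  # Server B reads the same, soon-to-be-stale value

if read_a < limit:              # A: check passes
    count = read_a + 1
    allowed_a = True

if read_b < limit:              # B: check also passes on the stale read
    count = read_b + 1
    allowed_b = True

print(allowed_a, allowed_b, count)   # True True 100 — one request too many admitted
```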
Solutions#
- Redis Lua scripts — atomic read+write in a single command (recommended)
- Redis MULTI/EXEC — transaction block, but watch/retry on contention
- Redis INCR with TTL — for simple fixed-window counters:
-- Atomic fixed-window counter
MULTI
INCR rate:user123:1711234560
EXPIRE rate:user123:1711234560 60
EXEC
-- Check if INCR result > limit
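The same fixed-window pattern, sketched in plain Python (a dict stands in for Redis keys with TTL; the window key is the floored timestamp, as in the `rate:user123:1711234560` key above):

```python
from collections import defaultdict

counters = defaultdict(int)   # stands in for Redis INCR-able keys

def fixed_window_allow(user, now, limit=100, window=60):
    window_key = (user, int(now // window) * window)   # e.g. ("user123", 1711234560)
    counters[window_key] += 1                          # the INCR step
    return counters[window_key] <= limit               # check result against limit

# 150 requests within one 60 s window: only the first 100 pass
allowed = sum(fixed_window_allow("user123", 1711234560 + i * 0.1) for i in range(150))
print(allowed)   # 100
```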
Rate Limit Headers#
Follow standard conventions so clients can adapt:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711234620
Retry-After: 47
| Header | Meaning |
|---|---|
| X-RateLimit-Limit | Max requests in the window |
| X-RateLimit-Remaining | Requests left in current window |
| X-RateLimit-Reset | Unix timestamp when the window resets |
| Retry-After | Seconds until the client should retry |
Always return these headers on both successful and rejected responses.
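A small helper for emitting these headers (names follow the de-facto X-RateLimit convention shown above; `reset_at` is a Unix timestamp and the function signature is illustrative):

```python
import math

def rate_limit_headers(limit, remaining, reset_at, now):
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),
    }
    if remaining <= 0:
        # Only throttled responses carry Retry-After
        headers["Retry-After"] = str(math.ceil(reset_at - now))
    return headers

h = rate_limit_headers(limit=100, remaining=0, reset_at=1711234620, now=1711234573)
print(h["Retry-After"])   # 47 — matches the example response above
```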
Client-Side Handling#
Well-behaved clients respect rate limits:
async function fetchWithRateLimit(url, options = {}, retries = 3) {
  const res = await fetch(url, options);
  if (res.status === 429 && retries > 0) {
    const retryAfter = parseInt(res.headers.get('Retry-After') || '1', 10);
    await new Promise(r => setTimeout(r, retryAfter * 1000));
    return fetchWithRateLimit(url, options, retries - 1); // bounded retry
  }
  return res;
}
Best practices:
- Exponential backoff with jitter on 429s
- Read headers proactively — slow down before hitting the limit
- Queue requests client-side to stay under the limit
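Exponential backoff with full jitter can be sketched as follows (the base and cap values are illustrative):

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full jitter: wait a uniform random amount up to the exponential ceiling."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

random.seed(7)  # seeded only to make the demo reproducible
delays = [backoff_delay(a) for a in range(6)]
print([round(d, 2) for d in delays])
```

The randomness spreads retries out so a fleet of throttled clients does not stampede back in lockstep the moment the window resets.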
Failure Modes#
| Scenario | Behaviour |
|---|---|
| Redis down | Fail open — allow traffic (availability > accuracy) |
| Clock skew between servers | Use Redis server time (TIME command), not local clocks |
| Hot key (celebrity user) | Shard by user + endpoint, or use local counters with periodic sync |
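Failing open can be sketched as a thin wrapper around the limiter call (the `check` callable stands in for whatever Redis-backed check is in use):

```python
def allow_request(check, *args):
    """Fail open: if the limiter backend errors out, admit the request."""
    try:
        return check(*args)
    except ConnectionError:
        # Availability over accuracy: never block traffic on infra failure
        return True

def redis_down(*_):
    raise ConnectionError("redis unreachable")

print(allow_request(redis_down, "user123"))          # True — failed open
print(allow_request(lambda key: False, "user123"))   # False — normal rejection
```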
Architecture Overview#
┌───────────────┐
Client ────►│ API Gateway │
│ (rate check) │
└───────┬───────┘
│ Lua script
┌───────▼───────┐
│ Redis Cluster │
│ (rate state) │
└───────────────┘
│
┌───────▼───────┐
│ Rules Config │
│ (per endpoint │
│ per tier) │
└───────────────┘
Rules are stored in a config service and cached at the gateway. Changes propagate via pub/sub.
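A cached rules lookup at the gateway might look like this (the endpoints, tiers, and numbers are illustrative, not from a real config):

```python
# Cached snapshot of the rules config, keyed by (tier, endpoint)
RULES = {
    ("free", "/api/search"): {"limit": 100,  "window": 60},
    ("free", "/api/read"):   {"limit": 1000, "window": 60},
    ("pro",  "/api/search"): {"limit": 1000, "window": 60},
}
DEFAULT_RULE = {"limit": 60, "window": 60}   # fallback for unmatched routes

def rule_for(tier, endpoint):
    return RULES.get((tier, endpoint), DEFAULT_RULE)

print(rule_for("free", "/api/search")["limit"])   # 100
print(rule_for("free", "/api/unknown")["limit"])  # 60 — falls back to default
```

On a pub/sub config change, the gateway would swap in a fresh `RULES` snapshot rather than mutating it in place.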
Key Takeaways#
- Token bucket is the go-to algorithm — simple, allows bursts, low memory
- Sliding window counter is the best approximation when you need smooth limiting
- Redis Lua scripts solve the distributed race condition problem atomically
- Always return rate limit headers — good API citizenship
- Fail open when the rate limiter is unavailable — never block legitimate traffic because of infrastructure failure
- Separate rate limits by user tier, endpoint, and method for fine-grained control
This is article #196 in the Codelit engineering blog.