# Design a Rate Limiter — Algorithms, Architecture, and Trade-offs

## Why every API needs rate limiting
Without rate limiting, a single client can:
- Exhaust your server resources (intentional or accidental)
- Drive up your cloud bill with runaway scripts
- Degrade experience for all other users
- Enable brute-force attacks on auth endpoints
Rate limiting is a system design interview classic because it touches distributed systems, algorithms, and real-world trade-offs.
## The four main algorithms

### 1. Token Bucket
A bucket holds tokens. Each request consumes one token. Tokens refill at a fixed rate.
- Bucket size: Maximum burst capacity (e.g., 10 requests)
- Refill rate: Steady-state limit (e.g., 5 tokens/second)
```
Bucket: capacity=10, refill=5/sec
Time 0: 10 tokens → 8 requests → 2 tokens left
Time 1: 2 + 5 refill = 7 tokens
Time 2: 7 + 5 refill = 12 → capped at 10 (capacity)
```

- Pros: allows bursts; memory efficient (2 values per user); widely used
- Cons: needs precise timing for refill
Used by: AWS API Gateway, Stripe, most production rate limiters
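The refill logic above can be sketched in a few lines of Python. This is a minimal single-process illustration; the class and parameter names are my own, and production code would also need per-user buckets and locking:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills lazily whenever a request arrives."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note the memory claim from above made concrete: the entire per-user state is two values, `tokens` and `last_refill`.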
### 2. Sliding Window Log
Store the timestamp of every request. Count requests within the window.
```
Window: 60 seconds, Limit: 100
Requests at: [t=0, t=5, t=12, ..., t=58]
New request at t=62 → remove entries before t=2, count remaining
```

- Pros: precise; no boundary issues
- Cons: memory-heavy (stores every timestamp); expensive at high request rates
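This algorithm transcribes almost directly into code. A sketch using a deque of timestamps per client, with an explicit `now` parameter so the example stays deterministic (names are illustrative):

```python
from collections import deque

class SlidingWindowLog:
    """Keep every request timestamp; evict entries that left the window."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that fell out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The memory cost is visible here: the deque holds up to `limit` timestamps per client, which is exactly what makes this approach expensive at high request rates.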
### 3. Sliding Window Counter
Combine fixed window counts with interpolation. Track current and previous window counts.
```
Previous window (0-60s): 84 requests
Current window (60-120s): 36 requests so far
Request at t=75 (25% into current window):
Weighted count = 84 × 0.75 + 36 = 99
Limit: 100 → ALLOW (99 < 100)
```

- Pros: memory efficient; smoother than fixed windows
- Cons: approximate (not exact), but close enough for most use cases
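The interpolation can be sketched like this (illustrative names; the test below reproduces the 84/36 worked example by assigning the counters directly, which you would never do outside a demo):

```python
class SlidingWindowCounter:
    """Approximate sliding window from current + previous window counts."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_start = 0.0  # start time of the current window
        self.current = 0          # requests in the current window
        self.previous = 0         # requests in the previous window

    def allow(self, now: float) -> bool:
        # Roll windows forward if time has moved past the current one
        while now - self.current_start >= self.window:
            self.previous, self.current = self.current, 0
            self.current_start += self.window
        # Weight the previous window by how much of it still overlaps
        elapsed_fraction = (now - self.current_start) / self.window
        weighted = self.previous * (1 - elapsed_fraction) + self.current
        if weighted < self.limit:
            self.current += 1
            return True
        return False
```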
### 4. Fixed Window Counter
Divide time into fixed windows (e.g., per minute). Count requests per window.
```
Window 00:00-00:59: 95 requests (limit: 100)
Window 01:00-01:59: 0 requests
```

- Pros: simplest implementation; low memory
- Cons: boundary spike — 100 requests at 0:59 + 100 at 1:00 = 200 in 2 seconds
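A sketch that also demonstrates the boundary-spike weakness (illustrative names; `now` is passed explicitly to keep the demo deterministic):

```python
class FixedWindowCounter:
    """One counter per time window; the counter resets at each boundary."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.window_id = None  # which window the counter belongs to
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.window_id:  # crossed a boundary: reset
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

The test below shows the spike: 100 requests just before the boundary and 100 just after are all allowed, twice the nominal rate in about one second.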
## Architecture for distributed rate limiting

In production you run multiple API servers, so rate-limit state must be shared: if each of N servers enforces the limit independently, a client can make up to N× the intended rate.

### Centralized: Redis
The most common approach. All API servers check/increment a counter in Redis.
```lua
-- Redis Lua script (atomic)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
  redis.call('EXPIRE', key, window)
end
if current > limit then
  return 0 -- DENIED
end
return 1 -- ALLOWED
```

Why Lua? The INCR and EXPIRE must execute atomically. Without the script, a crash between the two commands leaves a counter key with no expiry (a memory leak).
### Response headers

Always tell clients their rate limit status:

```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1679529600
```

On 429 responses, also send `Retry-After: 30` so clients know when to retry.
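A framework-agnostic helper for building these headers might look like this (a sketch; the function name and signature are illustrative, not from any particular library):

```python
import time

def rate_limit_headers(limit: int, used: int, reset_ts: int,
                       allowed: bool) -> dict:
    """Build standard rate-limit response headers.

    reset_ts is the Unix timestamp at which the current window resets.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - used)),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if not allowed:
        # Pair every 429 with Retry-After so clients can back off correctly
        headers["Retry-After"] = str(max(0, reset_ts - int(time.time())))
    return headers
```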
## Rate limiting strategies
| Strategy | Example |
|---|---|
| Per-user | 100 requests/min per API key |
| Per-IP | 50 requests/min per IP (for unauthenticated) |
| Per-endpoint | 10 POST /login per minute (prevent brute force) |
| Global | 10,000 requests/min total (protect infrastructure) |
| Tiered | Free: 100/min, Pro: 1000/min, Enterprise: unlimited |
## Edge cases to handle

- Denied requests: return HTTP 429 (Too Many Requests) with a `Retry-After` header. Don't silently drop requests.
- Race conditions in distributed systems: use Redis Lua scripts for atomic operations. Don't read-then-write from application code.
- Clock skew across servers: use Redis server time, not each application server's clock, or keep clocks NTP-synchronized.
- Graceful degradation: if Redis is down, you must choose between fail open (allow all requests) and fail closed (deny all). Most systems fail open with a local in-memory fallback.
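The fail-open pattern can be sketched as a thin wrapper (the two callables stand in for a real Redis client call and a local in-memory limiter; all names here are illustrative):

```python
class FailOpenLimiter:
    """Try the shared limiter first; if the shared store is unreachable,
    fall back to a per-process limiter instead of blocking all traffic."""

    def __init__(self, remote_check, local_check):
        self.remote_check = remote_check  # e.g. runs the Redis Lua script
        self.local_check = local_check    # in-memory fallback limiter

    def allow(self, key: str) -> bool:
        try:
            return self.remote_check(key)
        except ConnectionError:
            # Redis is down: fail open through the local limiter
            return self.local_check(key)
```

Changing the `except` branch to `return False` turns the same wrapper into a fail-closed limiter.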
## Where to place the rate limiter
- API Gateway — Best for global limits, before your code runs
- Reverse proxy (Nginx) — Simple, no code changes, per-IP only
- Application middleware — Flexible, can use user identity
- Service mesh sidecar — Per-service limits in microservices
Most production systems use multiple layers: Nginx for IP-based limits, the API gateway for per-key limits, and application middleware for per-endpoint limits.
## Visualize your rate limiter architecture
See how rate limiting fits into a production API stack — try Codelit to generate an interactive diagram showing how rate limiters connect to your gateway, cache, and application services.
## Key takeaways
- Token bucket is the go-to algorithm — allows bursts, memory efficient
- Redis is the standard for distributed rate limiting — atomic Lua scripts
- Always return 429 + Retry-After — don't silently drop requests
- Multiple layers — IP at edge, API key at gateway, endpoint at application
- Fail open when Redis is down — use local fallback
- Sliding window counter is the best compromise between accuracy and memory