# Rate Limiting: Algorithms, Patterns & Production Architecture
Every production API needs rate limiting. Without it, a single misbehaving client can exhaust your resources, degrade service for everyone, and run up your cloud bill. Rate limiting is the gatekeeper that keeps your system healthy under pressure.
## Why Rate Limit?
- Protect availability — prevent resource exhaustion from traffic spikes or abuse
- Ensure fairness — no single consumer monopolizes capacity
- Control costs — bound compute, bandwidth, and third-party API spend
- Mitigate attacks — slow down brute-force, credential stuffing, and scraping
## Rate Limiting Algorithms
### Fixed Window
Divide time into fixed intervals (e.g., 1-minute windows). Count requests per window. Simple, but suffers from boundary bursts — a client can fire 100 requests at 0:59 and another 100 at 1:00.
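A minimal in-process sketch of the fixed-window approach (the class and method names here are illustrative, not from any library):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Count requests per fixed window; reject once the window's limit is hit."""

    def __init__(self, limit: int, window_sec: int):
        self.limit = limit
        self.window_sec = window_sec
        self.counts = defaultdict(int)  # window index -> request count

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        window = int(now // self.window_sec)  # which window this request falls in
        if self.counts[window] >= self.limit:
            return False
        self.counts[window] += 1
        return True
```

The boundary-burst problem is visible in the structure: each window resets all at once, so two back-to-back windows each admit a full quota.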
### Sliding Window Log
Track each request timestamp. Count requests within the trailing window. Accurate, but storing every timestamp is memory-expensive at scale.
### Sliding Window Counter
A hybrid: combine the current window's count with a weighted portion of the previous window's count. Near-accurate with constant memory per key.

```python
def sliding_window_count(prev_count, curr_count, window_size, elapsed):
    """Estimate requests in the trailing window by weighting the previous
    window's count by how much of it still overlaps that window."""
    weight = (window_size - elapsed) / window_size
    return curr_count + prev_count * weight
```

For example, 15 seconds into a 60-second window (weight 0.75), 30 requests so far plus 40 in the previous window estimate to 30 + 40 × 0.75 = 60.
### Token Bucket
A bucket holds tokens up to a max capacity. Each request consumes a token. Tokens refill at a steady rate. Allows short bursts while enforcing an average rate.
```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```
### Leaky Bucket
Requests enter a queue (bucket) and are processed at a fixed rate. Excess requests overflow and are rejected. Smooths traffic perfectly but adds latency.
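A lazy-draining sketch of the idea (names are illustrative; a production version would drain on a background timer and actually defer processing rather than just admit or reject):

```python
import time
from collections import deque

class LeakyBucket:
    """Queue requests up to `capacity` and drain them at `leak_rate` per
    second; requests arriving while the queue is full are rejected."""

    def __init__(self, capacity: int, leak_rate: float, now=None):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic() if now is None else now

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drain requests that have leaked out since the last check.
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(now)
        return True
```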
## Implementation with Redis
Redis is the go-to backing store for distributed rate limiters. Atomic operations and key expiration make it ideal.
### Sliding Window Log in Redis
```lua
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests
-- ARGV[3] = current timestamp
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove entries that fell out of the trailing window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

local count = redis.call('ZCARD', key)
if count < limit then
  -- Random suffix keeps members unique if two requests share a timestamp
  redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
  redis.call('EXPIRE', key, window)
  return 1 -- allowed
end
return 0 -- rejected
```
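For reference, the same logic mirrored in plain Python — handy for unit tests or single-process services (the class name is ours, not a library API):

```python
import time

class SlidingWindowLog:
    """Plain-Python mirror of the Lua script: keep request timestamps and
    count only those inside the trailing window."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.log = []  # request timestamps, analogous to the sorted set

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # ZREMRANGEBYSCORE: drop timestamps that fell out of the window.
        self.log = [t for t in self.log if t > now - self.window_sec]
        if len(self.log) < self.limit:  # ZCARD check
            self.log.append(now)        # ZADD
            return True
        return False
```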
### Express Middleware Example
```typescript
import { Redis } from "ioredis";
import type { Request, Response, NextFunction } from "express";

const redis = new Redis();

// Simple fixed-window limiter: INCR a per-IP counter and let the key
// expire when the window ends.
export function rateLimit(limit: number, windowSec: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const key = `rl:${req.ip}`;
    const current = await redis.incr(key);
    if (current === 1) {
      // First request in this window: start the expiry clock.
      await redis.expire(key, windowSec);
    }
    const remaining = Math.max(0, limit - current);
    const resetAt = Math.ceil(Date.now() / 1000) + windowSec;
    res.set("X-RateLimit-Limit", String(limit));
    res.set("X-RateLimit-Remaining", String(remaining));
    res.set("X-RateLimit-Reset", String(resetAt));
    if (current > limit) {
      res.set("Retry-After", String(windowSec));
      return res.status(429).json({ error: "Too many requests" });
    }
    next();
  };
}
```
## Rate Limit Headers
Standard headers every API should return:
| Header | Purpose |
|---|---|
| X-RateLimit-Limit | Max requests allowed in window |
| X-RateLimit-Remaining | Requests left in current window |
| X-RateLimit-Reset | Unix timestamp when window resets |
| Retry-After | Seconds to wait before retrying (on 429) |
## Distributed Rate Limiting
In multi-node deployments, each instance needs a shared view of request counts. Strategies:
- Centralized store — Redis or Memcached. Simple and accurate, at the cost of a network round trip per request.
- Eventual consistency — each node keeps a local counter and syncs periodically. Faster but can overshoot limits briefly.
- Consistent hashing — route each client's requests to the same node. Avoids shared state but complicates failover.
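The eventual-consistency strategy can be sketched with an in-memory stand-in for the shared store (the `incrby` method mimics Redis INCRBY; all names here are illustrative):

```python
class SharedStore:
    """In-memory stand-in for a shared counter store (e.g. Redis INCRBY)."""
    def __init__(self):
        self.counts = {}

    def incrby(self, key: str, n: int) -> int:
        self.counts[key] = self.counts.get(key, 0) + n
        return self.counts[key]

class EventualLimiter:
    """Per-node limiter: decide using the last-synced global count plus a
    local delta, and fold the delta into the shared store periodically."""
    def __init__(self, store, key: str, limit: int):
        self.store, self.key, self.limit = store, key, limit
        self.local_delta = 0   # requests not yet pushed to the store
        self.global_count = 0  # last total seen from the store

    def allow(self) -> bool:
        if self.global_count + self.local_delta >= self.limit:
            return False
        self.local_delta += 1
        return True

    def sync(self) -> None:
        # Push our delta; the returned total includes other nodes' traffic.
        self.global_count = self.store.incrby(self.key, self.local_delta)
        self.local_delta = 0

```

Between syncs each node sees only its own delta, which is exactly how the brief overshoot happens.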
For most systems, a Redis cluster with the Lua script above handles thousands of rate-limit checks per second with sub-millisecond overhead.
## Granularity: Per-User vs Per-IP vs Per-API-Key
| Strategy | Best For | Watch Out |
|---|---|---|
| Per-IP | Unauthenticated endpoints, login pages | Shared IPs (NAT, corporate proxies) penalize many users |
| Per-User | Authenticated APIs | Requires auth before rate check |
| Per-API-Key | Developer platforms, SaaS APIs | Different tiers need different limits |
| Per-Endpoint | Protecting expensive operations | Combine with per-user for best results |
Production systems typically layer multiple strategies — a global per-IP limit plus a per-user limit on authenticated routes.
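The layering idea in miniature, using a bare per-key counter for brevity (all names are illustrative):

```python
class WindowCounter:
    """Minimal per-key counter used by the layered check below."""
    def __init__(self, limit: int):
        self.limit = limit
        self.counts = {}

    def allow(self, key: str) -> bool:
        n = self.counts.get(key, 0)
        if n >= self.limit:
            return False
        self.counts[key] = n + 1
        return True

def layered_allow(ip_limiter, user_limiter, ip, user_id) -> bool:
    """A request must pass the global per-IP limit and, when authenticated,
    the per-user limit as well."""
    if not ip_limiter.allow(f"ip:{ip}"):
        return False
    if user_id is not None and not user_limiter.allow(f"user:{user_id}"):
        return False
    return True
```

Note the ordering trade-off: a request rejected by the per-user check has already consumed per-IP budget.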
## Client-Side Handling
Good clients respect rate limits gracefully.
```typescript
async function fetchWithRetry(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;
    const retryAfter = res.headers.get("Retry-After");
    const delay = retryAfter
      ? parseInt(retryAfter, 10) * 1000
      : Math.min(1000 * 2 ** attempt, 30000); // exponential backoff
    await new Promise((r) => setTimeout(r, delay));
  }
  throw new Error("Rate limited after max retries");
}
```
Key client-side patterns:
- Respect Retry-After — always prefer the server's guidance
- Exponential backoff with jitter — prevents thundering herd on retry
- Circuit breaker — stop calling entirely if repeated 429s persist
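The "full jitter" variant of backoff is a common way to add that randomness; a sketch (function name and defaults are ours):

```python
import random

def backoff_delay_ms(attempt: int, base_ms: int = 1000, cap_ms: int = 30000) -> int:
    """Full-jitter backoff: uniform delay in [0, min(cap, base * 2^attempt)].

    Randomizing over the whole interval spreads retries out, so clients
    rate-limited at the same moment don't all come back in lockstep.
    """
    ceiling = min(cap_ms, base_ms * 2 ** attempt)
    return random.randint(0, ceiling)
```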
## Tools & Managed Solutions
| Tool | Type | Notes |
|---|---|---|
| Kong | API Gateway | Built-in rate limiting plugin, Redis-backed |
| Envoy | Service proxy | Local and global rate limit filters |
| AWS WAF | Cloud WAF | Rate-based rules at the edge |
| Cloudflare | CDN/WAF | Rate limiting rules with bot detection |
| Nginx | Reverse proxy | limit_req module, leaky bucket algorithm |
| express-rate-limit | Node.js middleware | Simple, pluggable stores |
For most teams, starting with a gateway-level rate limiter (Kong, Envoy) and adding application-level limits for sensitive endpoints is the pragmatic approach.
## Choosing the Right Algorithm
- Fixed window — simplest, fine for non-critical limits
- Sliding window counter — best balance of accuracy and efficiency
- Token bucket — when you need burst tolerance with average rate control
- Leaky bucket — when you need perfectly smooth output rate
## Summary
Rate limiting is not optional in production. Start with a sliding window counter in Redis, return proper X-RateLimit-* headers, layer per-IP and per-user limits, and make sure your clients handle 429s with exponential backoff. As traffic grows, move rate limiting to your API gateway and add distributed coordination.