# Design a Rate Limiter — Algorithms, Architecture, and Trade-offs

## Why every API needs rate limiting
Without rate limiting, a single client can:
- Exhaust your server resources (intentional or accidental)
- Drive up your cloud bill with runaway scripts
- Degrade experience for all other users
- Enable brute-force attacks on auth endpoints
Rate limiting is a system design interview classic because it touches distributed systems, algorithms, and real-world trade-offs.
## The four main algorithms

### 1. Token Bucket
A bucket holds tokens. Each request consumes one token. Tokens refill at a fixed rate.
- Bucket size: Maximum burst capacity (e.g., 10 requests)
- Refill rate: Steady-state limit (e.g., 5 tokens/second)
```
Bucket: capacity=10, refill=5/sec
Time 0: 10 tokens → 8 requests → 2 tokens left
Time 1: 2 + 5 refill = 7 tokens
Time 2: 7 + 5 refill = 12 → capped at 10 (capacity)
```

- Pros: allows bursts; memory efficient (2 values per user); widely used
- Cons: needs precise timing for refill
Used by: AWS API Gateway, Stripe, most production rate limiters
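The refill logic above can be sketched in a few lines of Python. This is a minimal single-process illustration; the class and parameter names are my own, and production code would also need per-user buckets and locking:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills lazily whenever a request arrives."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note the memory claim from above made concrete: the entire per-user state is two values, `tokens` and `last_refill`.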
### 2. Sliding Window Log
Store the timestamp of every request. Count requests within the window.
```
Window: 60 seconds, Limit: 100
Requests at: [t=0, t=5, t=12, ..., t=58]
New request at t=62 → remove entries before t=2, count remaining
```

- Pros: precise; no boundary issues
- Cons: memory-heavy (stores every timestamp); expensive at high request rates
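This algorithm transcribes almost directly into code. A sketch using a deque of timestamps per client, with an explicit `now` parameter so the example stays deterministic (names are illustrative):

```python
from collections import deque

class SlidingWindowLog:
    """Keep every request timestamp; evict entries that left the window."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that fell out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The memory cost is visible here: the deque holds up to `limit` timestamps per client, which is exactly what makes this approach expensive at high request rates.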
### 3. Sliding Window Counter
Combine fixed window counts with interpolation. Track current and previous window counts.
```
Previous window (0-60s): 84 requests
Current window (60-120s): 36 requests so far
Request at t=75 (25% into current window):
Weighted count = 84 × 0.75 + 36 = 99
Limit: 100 → ALLOW (99 < 100)
```

- Pros: memory efficient; smoother than fixed windows
- Cons: approximate (not exact), but close enough for most use cases
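The interpolation can be sketched like this (illustrative names; the test below reproduces the 84/36 worked example by assigning the counters directly, which you would never do outside a demo):

```python
class SlidingWindowCounter:
    """Approximate sliding window from current + previous window counts."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_start = 0.0  # start time of the current window
        self.current = 0          # requests in the current window
        self.previous = 0         # requests in the previous window

    def allow(self, now: float) -> bool:
        # Roll windows forward if time has moved past the current one
        while now - self.current_start >= self.window:
            self.previous, self.current = self.current, 0
            self.current_start += self.window
        # Weight the previous window by how much of it still overlaps
        elapsed_fraction = (now - self.current_start) / self.window
        weighted = self.previous * (1 - elapsed_fraction) + self.current
        if weighted < self.limit:
            self.current += 1
            return True
        return False
```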
### 4. Fixed Window Counter
Divide time into fixed windows (e.g., per minute). Count requests per window.
```
Window 00:00-00:59: 95 requests (limit: 100)
Window 01:00-01:59: 0 requests
```

- Pros: simplest implementation; low memory
- Cons: boundary spike — 100 requests at 0:59 + 100 at 1:00 = 200 in 2 seconds
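A sketch that also demonstrates the boundary-spike weakness (illustrative names; `now` is passed explicitly to keep the demo deterministic):

```python
class FixedWindowCounter:
    """One counter per time window; the counter resets at each boundary."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.window_id = None  # which window the counter belongs to
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.window_id:  # crossed a boundary: reset
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

The test below shows the spike: 100 requests just before the boundary and 100 just after are all allowed, twice the nominal rate in about one second.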
## Architecture for distributed rate limiting

In production you run multiple API servers, so rate-limit state must be shared: if each of N servers enforces the limit independently, a client can make up to N× the intended rate.

### Centralized: Redis
The most common approach. All API servers check/increment a counter in Redis.
```lua
-- Redis Lua script (atomic)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
  redis.call('EXPIRE', key, window)
end
if current > limit then
  return 0 -- DENIED
end
return 1 -- ALLOWED
```

Why Lua? The INCR and EXPIRE must execute atomically. Without the script, a crash between the two commands leaves a counter key with no expiry (a memory leak).
### Response headers

Always tell clients their rate limit status:

```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1679529600
```

On 429 responses, also send `Retry-After: 30` so clients know when to retry.
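A framework-agnostic helper for building these headers might look like this (a sketch; the function name and signature are illustrative, not from any particular library):

```python
import time

def rate_limit_headers(limit: int, used: int, reset_ts: int,
                       allowed: bool) -> dict:
    """Build standard rate-limit response headers.

    reset_ts is the Unix timestamp at which the current window resets.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - used)),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if not allowed:
        # Pair every 429 with Retry-After so clients can back off correctly
        headers["Retry-After"] = str(max(0, reset_ts - int(time.time())))
    return headers
```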
## Rate limiting strategies
| Strategy | Example |
|---|---|
| Per-user | 100 requests/min per API key |
| Per-IP | 50 requests/min per IP (for unauthenticated) |
| Per-endpoint | 10 POST /login per minute (prevent brute force) |
| Global | 10,000 requests/min total (protect infrastructure) |
| Tiered | Free: 100/min, Pro: 1000/min, Enterprise: unlimited |
## Edge cases to handle

- Denied requests: return HTTP 429 (Too Many Requests) with a `Retry-After` header. Don't silently drop requests.
- Race conditions in distributed systems: use Redis Lua scripts for atomic operations. Don't read-then-write from application code.
- Clock skew across servers: use Redis server time, not each application server's clock, or keep clocks NTP-synchronized.
- Graceful degradation: if Redis is down, you must choose between fail open (allow all requests) and fail closed (deny all). Most systems fail open with a local in-memory fallback.
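The fail-open pattern can be sketched as a thin wrapper (the two callables stand in for a real Redis client call and a local in-memory limiter; all names here are illustrative):

```python
class FailOpenLimiter:
    """Try the shared limiter first; if the shared store is unreachable,
    fall back to a per-process limiter instead of blocking all traffic."""

    def __init__(self, remote_check, local_check):
        self.remote_check = remote_check  # e.g. runs the Redis Lua script
        self.local_check = local_check    # in-memory fallback limiter

    def allow(self, key: str) -> bool:
        try:
            return self.remote_check(key)
        except ConnectionError:
            # Redis is down: fail open through the local limiter
            return self.local_check(key)
```

Changing the `except` branch to `return False` turns the same wrapper into a fail-closed limiter.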
## Where to place the rate limiter
- API Gateway — Best for global limits, before your code runs
- Reverse proxy (Nginx) — Simple, no code changes, per-IP only
- Application middleware — Flexible, can use user identity
- Service mesh sidecar — Per-service limits in microservices
Most production systems use multiple layers: Nginx for IP-based limits, the API gateway for per-key limits, and application middleware for per-endpoint limits.
## Visualize your rate limiter architecture
See how rate limiting fits into a production API stack — try Codelit to generate an interactive diagram showing how rate limiters connect to your gateway, cache, and application services.
## Key takeaways
- Token bucket is the go-to algorithm — allows bursts, memory efficient
- Redis is the standard for distributed rate limiting — atomic Lua scripts
- Always return 429 + Retry-After — don't silently drop requests
- Multiple layers — IP at edge, API key at gateway, endpoint at application
- Fail open when Redis is down — use local fallback
- Sliding window counter is the best compromise between accuracy and memory