Distributed Rate Limiting — Scaling Request Throttling Across Nodes
Why rate limiting gets hard at scale#
A single-node rate limiter is straightforward — track counters in memory and reject requests that exceed the threshold. But the moment you scale to multiple servers behind a load balancer, each node only sees a fraction of traffic. Without coordination, a client can blow past your global limit by spreading requests across nodes.
Distributed rate limiting solves this by sharing state across all nodes so the global request count is enforced, not a per-node approximation.
Centralized vs distributed rate limiters#
There are two broad architectures:
Centralized rate limiter#
Every application node queries a single shared store (typically Redis or Memcached) before allowing a request through.
Pros:
- Exact global counts — no approximation
- Simple mental model
- Single source of truth
Cons:
- The shared store becomes a single point of failure
- Every request adds a network round-trip
- Latency increases under high concurrency
Distributed (peer-to-peer) rate limiter#
Each node maintains local counters and periodically synchronizes with peers or a central coordinator.
Pros:
- Lower latency — most decisions are local
- Tolerates temporary network partitions
- Scales horizontally with the application
Cons:
- Counts are approximate between sync intervals
- More complex to implement and debug
- Brief over-admission windows are possible
Redis-based sliding window#
The most common production approach uses Redis as a centralized counter with a sliding window algorithm. The idea: store each request timestamp in a sorted set, then count entries within the current window.
```lua
-- Sliding window rate limiter in Redis Lua
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

-- Remove entries outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count remaining entries
local count = redis.call('ZCARD', key)
if count >= limit then
  return 0 -- rejected
end

-- Add current request with a random suffix so concurrent
-- requests in the same instant get distinct members
redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
redis.call('EXPIRE', key, window)
return 1 -- allowed
```
Running this as a Lua script inside Redis guarantees atomicity — no race conditions between the count check and the insert.
Why Lua scripts matter#
Without atomic execution, two requests arriving simultaneously could both read a count of 99 (under a limit of 100) and both succeed, pushing the actual count to 101. The Lua script ensures the read-check-write sequence is indivisible.
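The same read-check-write hazard exists in any shared-counter limiter. As an illustrative single-process sketch in Python (not the Redis implementation), here is the sliding-log logic, where a lock plays the role that atomic script execution plays inside Redis:

```python
import threading
import time
from collections import deque

class SlidingLogLimiter:
    """In-memory analogue of the Lua script above. The lock makes the
    remove/count/insert sequence indivisible, just as running the Lua
    script atomically does inside Redis."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()   # sorted request timestamps (the "log")
        self.lock = threading.Lock()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        with self.lock:  # read-check-write must not interleave
            # Drop entries outside the window (ZREMRANGEBYSCORE)
            while self.timestamps and self.timestamps[0] <= now - self.window:
                self.timestamps.popleft()
            if len(self.timestamps) >= self.limit:  # ZCARD >= limit
                return False
            self.timestamps.append(now)  # ZADD
            return True
```

Without the lock, two threads could both observe a count one below the limit and both append, reproducing exactly the 99-to-101 race described above.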
Token bucket at scale#
The token bucket algorithm refills tokens at a fixed rate, and each request consumes one token. It naturally supports burst traffic up to the bucket capacity.
In a distributed setting, you store bucket state in Redis:
bucket:{client_id} -> { tokens: 47, last_refill: 1711670400 }
On each request:
- Read current tokens and last refill timestamp
- Calculate tokens to add based on elapsed time
- If tokens are available, decrement and allow
- Otherwise, reject and return a Retry-After header
Use a Lua script or Redis transaction to keep this atomic. The refill calculation happens lazily — you do not need a background process adding tokens.
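A minimal single-node sketch of the lazy-refill logic in Python (the class name and the `(allowed, retry_after)` return shape are illustrative choices, not a standard API):

```python
import time

class TokenBucket:
    """Lazy-refill token bucket; a local sketch of the per-client
    {tokens, last_refill} state you would keep in Redis."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity            # max burst size
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill lazily from elapsed time; no background process needed
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Seconds until one token is available: a Retry-After hint
        return False, (1.0 - self.tokens) / self.refill_rate
```

In the distributed version, this entire method body becomes the Lua script so the read-refill-decrement sequence stays atomic in Redis.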
Fixed window vs sliding window vs sliding log#
| Algorithm | Accuracy | Memory | Burst behavior |
|---|---|---|---|
| Fixed window | Low — boundary spikes | O(1) per key | 2x burst at window edges |
| Sliding window | Medium — weighted average | O(1) per key | Smoothed across windows |
| Sliding log | High — exact timestamps | O(n) per key | No burst artifacts |
| Token bucket | High — configurable burst | O(1) per key | Controlled burst capacity |
For most APIs, the sliding window counter (a weighted blend of current and previous fixed windows) offers the best tradeoff between accuracy and memory.
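The weighted blend takes only a few lines. Assuming `prev_count` and `curr_count` are the totals of the previous and current fixed windows, a sketch of the estimate:

```python
def sliding_window_allow(prev_count, curr_count, limit, window, elapsed):
    """Sliding window counter: weight the previous fixed window by the
    fraction of it still covered by the rolling window, then add the
    current window's count."""
    weight = (window - elapsed) / window          # how much of prev still counts
    estimated = prev_count * weight + curr_count  # rolling estimate
    return estimated < limit
```

For example, with a limit of 100 per 60 s, 80 requests in the previous window, 30 so far in the current one, and 15 s elapsed, the estimate is 80 * 0.75 + 30 = 90, so the request is allowed.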
Rate limit synchronization across nodes#
When latency to a central Redis is too high, you can use a local + global hybrid approach:
- Each node maintains a local counter with a fraction of the global limit
- Periodically (every 100ms–1s), nodes push local counts to Redis and pull the global total
- If the global total approaches the limit, nodes tighten their local allocation
This is similar to how Google's Doorman and Envoy's rate limit service work. The tradeoff is tuning the sync interval — too fast wastes bandwidth, too slow allows over-admission.
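The hybrid loop above can be sketched as follows; `fetch_global_used` is a hypothetical callback standing in for the Redis push/pull, and the 50% tightening factor is an arbitrary illustrative choice:

```python
class HybridLimiter:
    """Local + global hybrid sketch: spend a local allocation on the
    fast path, reconcile against the shared global total on sync."""

    def __init__(self, global_limit, num_nodes, fetch_global_used):
        self.global_limit = global_limit
        self.allocation = global_limit / num_nodes  # optimistic equal share
        self.local_used = 0
        self.fetch_global_used = fetch_global_used  # pushes local, returns global

    def allow(self):
        if self.local_used < self.allocation:
            self.local_used += 1    # fast path: no network round-trip
            return True
        return False                # wait for the next sync to widen allocation

    def sync(self):
        # Push local count, pull the global total, tighten allocation
        global_used = self.fetch_global_used(self.local_used)
        remaining = max(0, self.global_limit - global_used)
        # Claim only half of what remains, leaving headroom for peers
        self.allocation = self.local_used + remaining * 0.5
```

Between syncs, each node can over-admit by at most its unspent allocation, which is exactly the approximation window the sync interval controls.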
Consistent hashing for client affinity#
Another approach: use consistent hashing in your load balancer so requests from the same client always hit the same node. This turns the distributed problem into a local one — at the cost of uneven load distribution and failover complexity.
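A minimal hash ring sketch, assuming MD5 and 100 virtual nodes per physical node (both arbitrary illustrative choices):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes, sketching how a
    load balancer could pin each client to one rate-limiting node."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring to
        # smooth out load imbalance
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, client_id):
        # First ring position clockwise from the client's hash
        idx = bisect.bisect(self.keys, self._hash(client_id)) % len(self.keys)
        return self.ring[idx][1]
```

Because a given client always hashes to the same ring position, its counters live on one node, so plain in-memory limiting suffices there; the cost surfaces when a node dies and its clients remap to a peer with no history.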
Handling failures gracefully#
What happens when Redis goes down? You have two choices:
- Fail open — allow all requests through (risk of overload)
- Fail closed — reject all requests (risk of total outage)
Most systems choose fail open with a local fallback rate limiter. Each node applies a conservative per-node limit until Redis recovers.
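A sketch of that fallback pattern; `redis_check` and `local_check` are hypothetical callables, not real client APIs:

```python
class FailOpenLimiter:
    """Fail-open wrapper: consult the shared (Redis-backed) check first,
    and fall back to a conservative per-node limiter when it errors."""

    def __init__(self, redis_check, local_check):
        self.redis_check = redis_check  # raises ConnectionError when Redis is down
        self.local_check = local_check  # stricter, purely local limit

    def allow(self, client_id):
        try:
            return self.redis_check(client_id)
        except ConnectionError:
            # Redis unreachable: degrade to the local fallback rather
            # than rejecting (fail closed) or admitting everything
            return self.local_check(client_id)
```

The local fallback limit is typically the global limit divided by the node count, minus a safety margin, so the cluster cannot wildly over-admit while Redis recovers.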
Rate limiting at the API gateway layer#
API gateways like Kong, Envoy, and NGINX can enforce rate limits before requests reach your application:
- Kong — built-in rate-limiting plugin with Redis or PostgreSQL backing
- Envoy — external rate limit service with a gRPC interface
- NGINX — limit_req module with shared memory zones
Gateway-level limiting protects your backend from traffic spikes but lacks application-level context (user tier, endpoint cost). Combine both layers for defense in depth.
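As one concrete example of the gateway layer, an NGINX limit_req configuration might look like this (the zone name, rate, and burst values are illustrative):

```nginx
# Shared-memory zone keyed by client IP: 10 MB of state, 100 req/s steady rate
limit_req_zone $binary_remote_addr zone=api:10m rate=100r/s;

server {
    location /api/ {
        # Permit short bursts of 20 above the steady rate without delaying them
        limit_req zone=api burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```

Note that this keys on IP address only; per-user or per-tier limits still belong in the application layer behind it.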
Implementing rate limit headers#
Always return standard headers so clients can self-throttle:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711670460
Retry-After: 30
Well-behaved clients use Retry-After to back off. Without these headers, clients resort to aggressive retries that amplify the problem.
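A well-behaved client loop might look like this sketch, where `send` is a hypothetical callable returning a `(status, headers)` pair:

```python
import time

def request_with_backoff(send, max_attempts=5):
    """Honor Retry-After on 429 instead of hammering the server;
    fall back to exponential backoff when the header is absent."""
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        # Prefer the server's hint; otherwise back off 1s, 2s, 4s, ...
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    return 429  # budget exhausted; surface the throttle to the caller
```

Adding random jitter to the fallback delay is also common, so a fleet of throttled clients does not retry in lockstep.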
Monitoring and observability#
Track these metrics for your rate limiter:
- Rejection rate — percentage of requests returning 429
- Latency overhead — time added by the rate limit check
- Redis hit rate — cache performance for rate limit keys
- Sync lag — delay between local and global counts (hybrid mode)
Alert when rejection rate spikes — it may indicate a traffic surge, a misbehaving client, or a misconfigured limit.
Visualize distributed rate limiting#
On Codelit, generate a multi-node API architecture with Redis-backed rate limiting to see how requests flow through the limiter, how token buckets refill, and how nodes synchronize state.
This is article #407 in the Codelit engineering blog series.
Build and explore distributed system architectures visually at codelit.io.