# Rate Limiting: Algorithms, Patterns & Production Architecture
Every production API needs rate limiting. Without it, a single misbehaving client can exhaust your resources, degrade service for everyone, and run up your cloud bill. Rate limiting is the gatekeeper that keeps your system healthy under pressure.
## Why Rate Limit?
- Protect availability — prevent resource exhaustion from traffic spikes or abuse
- Ensure fairness — no single consumer monopolizes capacity
- Control costs — bound compute, bandwidth, and third-party API spend
- Mitigate attacks — slow down brute-force, credential stuffing, and scraping
## Rate Limiting Algorithms
### Fixed Window
Divide time into fixed intervals (e.g., 1-minute windows). Count requests per window. Simple, but suffers from boundary bursts — a client can fire 100 requests at 0:59 and another 100 at 1:00.
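A minimal in-process sketch of the fixed-window approach (the class and method names here are illustrative, not from any library):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Count requests per fixed window; reject once the window's limit is hit."""

    def __init__(self, limit: int, window_sec: int):
        self.limit = limit
        self.window_sec = window_sec
        self.counts = defaultdict(int)  # window index -> request count

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        window = int(now // self.window_sec)  # which window this request falls in
        if self.counts[window] >= self.limit:
            return False
        self.counts[window] += 1
        return True
```

The boundary-burst problem is visible in the structure: each window resets all at once, so two back-to-back windows each admit a full quota.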
### Sliding Window Log
Track each request timestamp. Count requests within the trailing window. Accurate, but storing every timestamp is memory-expensive at scale.
### Sliding Window Counter
A hybrid: combine the current window's count with a weighted portion of the previous window's count. Near-accurate with constant memory per key.

```python
def sliding_window_count(prev_count, curr_count, window_size, elapsed):
    """Estimate requests in the trailing window by weighting the previous
    window's count by how much of it still overlaps that window."""
    weight = (window_size - elapsed) / window_size
    return curr_count + prev_count * weight
```

For example, 15 seconds into a 60-second window (weight 0.75), 30 requests so far plus 40 in the previous window estimate to 30 + 40 × 0.75 = 60.
### Token Bucket
A bucket holds tokens up to a max capacity. Each request consumes a token. Tokens refill at a steady rate. Allows short bursts while enforcing an average rate.
```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```
### Leaky Bucket
Requests enter a queue (bucket) and are processed at a fixed rate. Excess requests overflow and are rejected. Smooths traffic perfectly but adds latency.
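A lazy-draining sketch of the idea (names are illustrative; a production version would drain on a background timer and actually defer processing rather than just admit or reject):

```python
import time
from collections import deque

class LeakyBucket:
    """Queue requests up to `capacity` and drain them at `leak_rate` per
    second; requests arriving while the queue is full are rejected."""

    def __init__(self, capacity: int, leak_rate: float, now=None):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic() if now is None else now

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drain requests that have leaked out since the last check.
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(now)
        return True
```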
## Implementation with Redis
Redis is the go-to backing store for distributed rate limiters. Atomic operations and key expiration make it ideal.
### Sliding Window Log in Redis
```lua
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests
-- ARGV[3] = current timestamp
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove entries that fell out of the trailing window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

local count = redis.call('ZCARD', key)
if count < limit then
  -- Random suffix keeps members unique if two requests share a timestamp
  redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
  redis.call('EXPIRE', key, window)
  return 1 -- allowed
end
return 0 -- rejected
```
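For reference, the same logic mirrored in plain Python — handy for unit tests or single-process services (the class name is ours, not a library API):

```python
import time

class SlidingWindowLog:
    """Plain-Python mirror of the Lua script: keep request timestamps and
    count only those inside the trailing window."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.log = []  # request timestamps, analogous to the sorted set

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # ZREMRANGEBYSCORE: drop timestamps that fell out of the window.
        self.log = [t for t in self.log if t > now - self.window_sec]
        if len(self.log) < self.limit:  # ZCARD check
            self.log.append(now)        # ZADD
            return True
        return False
```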
### Express Middleware Example
```typescript
import { Redis } from "ioredis";
import type { Request, Response, NextFunction } from "express";

const redis = new Redis();

// Simple fixed-window limiter: INCR a per-IP counter and let the key
// expire when the window ends.
export function rateLimit(limit: number, windowSec: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const key = `rl:${req.ip}`;
    const current = await redis.incr(key);
    if (current === 1) {
      // First request in this window: start the expiry clock.
      await redis.expire(key, windowSec);
    }
    const remaining = Math.max(0, limit - current);
    const resetAt = Math.ceil(Date.now() / 1000) + windowSec;
    res.set("X-RateLimit-Limit", String(limit));
    res.set("X-RateLimit-Remaining", String(remaining));
    res.set("X-RateLimit-Reset", String(resetAt));
    if (current > limit) {
      res.set("Retry-After", String(windowSec));
      return res.status(429).json({ error: "Too many requests" });
    }
    next();
  };
}
```
## Rate Limit Headers
Standard headers every API should return:
| Header | Purpose |
|---|---|
| X-RateLimit-Limit | Max requests allowed in window |
| X-RateLimit-Remaining | Requests left in current window |
| X-RateLimit-Reset | Unix timestamp when window resets |
| Retry-After | Seconds to wait before retrying (on 429) |
## Distributed Rate Limiting
In multi-node deployments, each instance needs a shared view of request counts. Strategies:
- Centralized store — Redis or Memcached. Simple and accurate, at the cost of a network round trip per request.
- Eventual consistency — each node keeps a local counter and syncs periodically. Faster but can overshoot limits briefly.
- Consistent hashing — route each client's requests to the same node. Avoids shared state but complicates failover.
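The eventual-consistency strategy can be sketched with an in-memory stand-in for the shared store (the `incrby` method mimics Redis INCRBY; all names here are illustrative):

```python
class SharedStore:
    """In-memory stand-in for a shared counter store (e.g. Redis INCRBY)."""
    def __init__(self):
        self.counts = {}

    def incrby(self, key: str, n: int) -> int:
        self.counts[key] = self.counts.get(key, 0) + n
        return self.counts[key]

class EventualLimiter:
    """Per-node limiter: decide using the last-synced global count plus a
    local delta, and fold the delta into the shared store periodically."""
    def __init__(self, store, key: str, limit: int):
        self.store, self.key, self.limit = store, key, limit
        self.local_delta = 0   # requests not yet pushed to the store
        self.global_count = 0  # last total seen from the store

    def allow(self) -> bool:
        if self.global_count + self.local_delta >= self.limit:
            return False
        self.local_delta += 1
        return True

    def sync(self) -> None:
        # Push our delta; the returned total includes other nodes' traffic.
        self.global_count = self.store.incrby(self.key, self.local_delta)
        self.local_delta = 0

```

Between syncs each node sees only its own delta, which is exactly how the brief overshoot happens.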
For most systems, a Redis cluster with the Lua script above handles thousands of rate-limit checks per second with sub-millisecond overhead.
## Granularity: Per-User vs Per-IP vs Per-API-Key
| Strategy | Best For | Watch Out |
|---|---|---|
| Per-IP | Unauthenticated endpoints, login pages | Shared IPs (NAT, corporate proxies) penalize many users |
| Per-User | Authenticated APIs | Requires auth before rate check |
| Per-API-Key | Developer platforms, SaaS APIs | Different tiers need different limits |
| Per-Endpoint | Protecting expensive operations | Combine with per-user for best results |
Production systems typically layer multiple strategies — a global per-IP limit plus a per-user limit on authenticated routes.
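The layering idea in miniature, using a bare per-key counter for brevity (all names are illustrative):

```python
class WindowCounter:
    """Minimal per-key counter used by the layered check below."""
    def __init__(self, limit: int):
        self.limit = limit
        self.counts = {}

    def allow(self, key: str) -> bool:
        n = self.counts.get(key, 0)
        if n >= self.limit:
            return False
        self.counts[key] = n + 1
        return True

def layered_allow(ip_limiter, user_limiter, ip, user_id) -> bool:
    """A request must pass the global per-IP limit and, when authenticated,
    the per-user limit as well."""
    if not ip_limiter.allow(f"ip:{ip}"):
        return False
    if user_id is not None and not user_limiter.allow(f"user:{user_id}"):
        return False
    return True
```

Note the ordering trade-off: a request rejected by the per-user check has already consumed per-IP budget.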
## Client-Side Handling
Good clients respect rate limits gracefully.
```typescript
async function fetchWithRetry(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;
    const retryAfter = res.headers.get("Retry-After");
    const delay = retryAfter
      ? parseInt(retryAfter, 10) * 1000
      : Math.min(1000 * 2 ** attempt, 30000); // exponential backoff
    await new Promise((r) => setTimeout(r, delay));
  }
  throw new Error("Rate limited after max retries");
}
```
Key client-side patterns:
- Respect Retry-After — always prefer the server's guidance
- Exponential backoff with jitter — prevents thundering herd on retry
- Circuit breaker — stop calling entirely if repeated 429s persist
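The "full jitter" variant of backoff is a common way to add that randomness; a sketch (function name and defaults are ours):

```python
import random

def backoff_delay_ms(attempt: int, base_ms: int = 1000, cap_ms: int = 30000) -> int:
    """Full-jitter backoff: uniform delay in [0, min(cap, base * 2^attempt)].

    Randomizing over the whole interval spreads retries out, so clients
    rate-limited at the same moment don't all come back in lockstep.
    """
    ceiling = min(cap_ms, base_ms * 2 ** attempt)
    return random.randint(0, ceiling)
```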
## Tools & Managed Solutions
| Tool | Type | Notes |
|---|---|---|
| Kong | API Gateway | Built-in rate limiting plugin, Redis-backed |
| Envoy | Service proxy | Local and global rate limit filters |
| AWS WAF | Cloud WAF | Rate-based rules at the edge |
| Cloudflare | CDN/WAF | Rate limiting rules with bot detection |
| Nginx | Reverse proxy | limit_req module, leaky bucket algorithm |
| express-rate-limit | Node.js middleware | Simple, pluggable stores |
For most teams, starting with a gateway-level rate limiter (Kong, Envoy) and adding application-level limits for sensitive endpoints is the pragmatic approach.
## Choosing the Right Algorithm
- Fixed window — simplest, fine for non-critical limits
- Sliding window counter — best balance of accuracy and efficiency
- Token bucket — when you need burst tolerance with average rate control
- Leaky bucket — when you need perfectly smooth output rate
## Summary
Rate limiting is not optional in production. Start with a sliding window counter in Redis, return proper X-RateLimit-* headers, layer per-IP and per-user limits, and make sure your clients handle 429s with exponential backoff. As traffic grows, move rate limiting to your API gateway and add distributed coordination.