API Gateway Rate Limiting Patterns: Protect Your Services at the Edge
Every public API will be abused. Bots, scrapers, misbehaving clients, and accidental infinite loops will hammer your endpoints. Rate limiting at the API gateway is your first line of defense — it protects backend services before traffic even reaches them.
Why Rate Limit at the Gateway#
Rate limiting can happen at multiple layers. The gateway is the best first choice:
Client --> API Gateway (rate limit here) --> Backend Service
Benefits of gateway-level limiting:
- Rejects bad traffic before it consumes backend resources
- Centralized policy — one place to configure, not per-service
- Consistent behavior across all endpoints
- No code changes in backend services
Rate Limiting Strategies#
Per-Client Limiting#
The most common pattern. Each client (identified by API key, user ID, or IP) gets a quota:
Client A: 1000 requests / minute
Client B: 1000 requests / minute
Clients are isolated. One misbehaving client cannot exhaust capacity for others.
Identification methods:
- API key (most reliable)
- OAuth token / JWT subject claim
- IP address (unreliable behind NAT/proxies)
- Custom header (X-Client-ID)
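Putting these identification methods together, here is a minimal sketch of a resolver that prefers stable identifiers and falls back to IP only as a last resort; the header names are illustrative, not a standard:

```python
def resolve_client_id(headers, remote_addr):
    """Resolve a rate-limit key, preferring stable identifiers.

    Falls back through API key -> custom client header -> IP address.
    """
    api_key = headers.get("X-API-Key")
    if api_key:
        return f"key:{api_key}"
    client_header = headers.get("X-Client-ID")
    if client_header:
        return f"client:{client_header}"
    # Least reliable: many users can share one IP behind NAT or proxies
    return f"ip:{remote_addr}"
```

The returned key becomes the `client_id` passed to the limiter, so each identification tier gets its own namespace.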
Per-Endpoint Limiting#
Different endpoints have different costs. A search query is more expensive than a health check:
GET /api/health --> 10,000 req/min (cheap)
GET /api/search --> 100 req/min (expensive)
POST /api/export --> 10 req/min (very expensive)
GET /api/users/:id --> 2,000 req/min (moderate)
Without per-endpoint limits, a single shared quota lets a client pour all of its traffic into the most expensive operation. Per-endpoint limits cap costly endpoints independently, so cheap traffic and expensive traffic cannot trade off against each other.
Combined: Per-Client + Per-Endpoint#
The most robust approach layers both:
Client A:
  Global: 1,000 req/min
  GET /api/search: 100 req/min
  POST /api/export: 10 req/min

Client B (premium):
  Global: 5,000 req/min
  GET /api/search: 500 req/min
  POST /api/export: 50 req/min
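The layered scheme above can be sketched as a lookup that takes the tighter of the endpoint-specific and global caps. The tier names and numbers mirror the example; everything else is illustrative:

```python
# Illustrative limit tables (requests per minute), mirroring the example above
LIMITS = {
    "standard": {"global": 1000, "GET /api/search": 100, "POST /api/export": 10},
    "premium":  {"global": 5000, "GET /api/search": 500, "POST /api/export": 50},
}

def effective_limit(tier, endpoint):
    """Return the req/min cap for this tier and endpoint.

    A request must pass BOTH the endpoint-specific limit (if any)
    and the global per-client limit, so the effective cap is the
    tighter of the two.
    """
    table = LIMITS[tier]
    return min(table.get(endpoint, table["global"]), table["global"])
```

Endpoints without an explicit entry fall back to the client's global limit.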
Global Limiting#
A hard ceiling across all clients to protect system capacity:
All clients combined: 50,000 req/min to /api/search
This prevents total system overload even if individual client limits are generous.
Sliding Window Algorithm#
The sliding window is the preferred algorithm for gateway rate limiting. It avoids the burst problem of fixed windows.
Fixed Window Problem#
Window: 12:00:00 - 12:01:00 (limit: 100)
Client sends 0 requests from 12:00:00 - 12:00:55
Client sends 100 requests at 12:00:56 (allowed, within limit)
Client sends 100 requests at 12:01:01 (allowed, new window)
Result: 200 requests in 6 seconds -- double the intended rate
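A minimal fixed-window counter makes the boundary burst concrete:

```python
class FixedWindowLimiter:
    """Minimal fixed-window counter, to illustrate the boundary burst."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window_start = int(now // self.window) * self.window
        if window_start != self.current_window:
            # New window: the counter resets, forgetting recent traffic
            self.current_window = window_start
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at t=56s (end of window 0), 100 more at t=61s (start of window 1)
burst = sum(limiter.allow(56.0) for _ in range(100)) + \
        sum(limiter.allow(61.0) for _ in range(100))
# All 200 requests are admitted within a few seconds of wall-clock time
```

The sliding-window approach in the next section closes exactly this gap.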
Sliding Window Solution#
Track requests in a rolling window. At any point in time, count requests from the past N seconds:
At 12:01:30, count all requests from 12:00:30 to 12:01:30
Implementation with Redis#
```python
import time
import uuid

import redis

r = redis.Redis()

def is_rate_limited(client_id, limit, window_seconds):
    now = time.time()
    key = f"ratelimit:{client_id}"
    pipe = r.pipeline()
    # Remove entries older than the window
    pipe.zremrangebyscore(key, 0, now - window_seconds)
    # Record the current request; the sorted-set member must be unique
    # per request, so tag the timestamp with a random UUID
    pipe.zadd(key, {f"{now}:{uuid.uuid4()}": now})
    # Count requests currently in the window (including this one)
    pipe.zcard(key)
    # Expire idle keys so abandoned clients don't leak memory
    pipe.expire(key, window_seconds)
    results = pipe.execute()
    request_count = results[2]
    return request_count > limit
```
Algorithm Comparison#
| Algorithm | Accuracy | Memory | Burst Handling |
|---|---|---|---|
| Fixed window | Low | Low | Poor (boundary burst) |
| Sliding window log | High | High | Excellent |
| Sliding window counter | Medium | Low | Good |
| Token bucket | Medium | Low | Configurable burst |
| Leaky bucket | High | Low | No burst allowed |
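For comparison with the sliding-window code above, here is a minimal token bucket, the row in the table with a configurable burst. Names and parameters are illustrative:

```python
class TokenBucket:
    """Token bucket: steady refill rate plus a configurable burst size."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)  # start full: bursts are allowed up front
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket with `rate_per_sec=1, burst=5` admits five back-to-back requests, then throttles to one per second: the burst size and steady rate are tuned independently, which is the property the table credits it for.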
Quota Headers#
Clients need visibility into their limits. Use standard headers:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1714003200
The emerging IETF standard uses RateLimit headers:
RateLimit-Limit: 1000
RateLimit-Remaining: 742
RateLimit-Reset: 58
Where RateLimit-Reset is seconds until the window resets (not a Unix timestamp).
Always include these headers — even on successful responses. Clients can proactively throttle themselves before hitting limits.
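A small helper can emit both conventions from one set of counters; note the differing semantics of the two Reset fields. This sketch assumes you track the window's reset time as a Unix timestamp:

```python
def quota_headers(limit, used, window_reset_epoch, now):
    """Build both the de-facto X-RateLimit-* headers and the
    IETF-draft RateLimit-* headers from one set of counters.

    window_reset_epoch: Unix time at which the window resets.
    """
    remaining = max(0, limit - used)
    return {
        # De-facto convention: Reset is an absolute Unix timestamp
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(window_reset_epoch),
        # IETF draft convention: Reset is seconds from now
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(remaining),
        "RateLimit-Reset": str(max(0, window_reset_epoch - int(now))),
    }
```

Emitting both sets during a transition period lets old and new clients self-regulate without a breaking change.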
Retry-After Header#
When a client is rate limited, return 429 Too Many Requests with a Retry-After header:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit exceeded. Try again in 30 seconds.",
  "retry_after": 30
}
```
Client-Side Handling#
Well-behaved clients should respect Retry-After:
```python
import time

import requests

def api_call_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code == 429:
            # Honor the server's hint; default to 60s if the header is absent.
            # (Assumes the delta-seconds form; Retry-After may also be an
            # HTTP-date, which would need separate parsing.)
            retry_after = int(response.headers.get("Retry-After", 60))
            time.sleep(retry_after)
            continue
        return response
    raise Exception("Rate limited after max retries")
```
Graceful Degradation#
Rate limiting doesn't have to be all-or-nothing. Degrade gracefully:
Tiered Response Strategy#
| Load Level | Response |
|---|---|
| Normal (under 50% capacity) | Full response with all fields |
| Elevated (50-80%) | Disable expensive computed fields |
| High (80-95%) | Return cached responses, skip real-time data |
| Critical (over 95%) | Return 429 for non-essential endpoints |
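The tiers above can be encoded as a simple threshold map; the mode names are illustrative:

```python
def degradation_mode(load_fraction):
    """Map current load (0.0-1.0 of capacity) to a response strategy.

    Thresholds mirror the tier table above.
    """
    if load_fraction < 0.50:
        return "full"          # full response with all fields
    if load_fraction < 0.80:
        return "no_computed"   # disable expensive computed fields
    if load_fraction < 0.95:
        return "cached"        # cached responses only, skip real-time data
    return "reject"            # 429 for non-essential endpoints
```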
Priority Queuing#
Not all requests are equal. Assign priority levels:
Priority 1 (never throttle): Authentication, payments
Priority 2 (throttle last): Core CRUD operations
Priority 3 (throttle first): Search, analytics, exports
Priority 4 (shed first): Webhooks, batch operations
When capacity is constrained, shed low-priority traffic first.
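A sketch of priority-aware shedding, with illustrative paths and load thresholds:

```python
# Illustrative priority map; lower number = higher priority
PRIORITY = {
    "/api/auth": 1, "/api/payments": 1,
    "/api/users": 2,
    "/api/search": 3, "/api/export": 3,
    "/api/webhooks": 4, "/api/batch": 4,
}

def should_shed(path, load_fraction):
    """Shed lowest-priority traffic first as load rises.

    Thresholds are illustrative: priority 4 sheds above 80% load,
    priority 3 above 90%, priority 2 above 97%; priority 1 never sheds.
    """
    priority = PRIORITY.get(path, 3)  # unknown paths treated as priority 3
    thresholds = {4: 0.80, 3: 0.90, 2: 0.97}
    return load_fraction >= thresholds.get(priority, 2.0)
```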
Circuit Breaker Integration#
Combine rate limiting with circuit breakers:
If backend is healthy:
  Apply normal rate limits
If backend is degraded:
  Reduce rate limits by 50%
If backend circuit is open:
  Return 503 for all requests to that service
  Redirect to fallback/cache where possible
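This policy can be sketched as a limit multiplier keyed on circuit state. State names follow the usual closed/half-open/open convention; mapping "degraded" to the half-open state is an assumption of this sketch:

```python
def adjusted_limit(base_limit, circuit_state):
    """Scale the rate limit by backend health, per the policy above.

    circuit_state: "closed" (healthy), "half_open" (degraded),
    or "open" (backend failing).
    """
    if circuit_state == "closed":
        return base_limit            # healthy: normal limits
    if circuit_state == "half_open":
        return base_limit // 2       # degraded: cut limits by 50%
    return 0                         # open: admit nothing; serve 503/fallback
```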
Tools and Implementation#
Kong Rate Limiting Plugin#
```yaml
plugins:
  - name: rate-limiting
    config:
      minute: 1000
      hour: 10000
      policy: redis
      redis_host: redis.internal
      redis_port: 6379
      fault_tolerant: true
      hide_client_headers: false
      limit_by: consumer
```
Kong's open-source rate-limiting plugin uses a fixed-window algorithm with local, cluster, or Redis-backed policies; a sliding-window algorithm is available in the enterprise rate-limiting-advanced plugin.
AWS WAF Rate-Based Rules#
```json
{
  "Name": "RateLimitRule",
  "Priority": 1,
  "Action": { "Block": {} },
  "Statement": {
    "RateBasedStatement": {
      "Limit": 2000,
      "AggregateKeyType": "IP",
      "EvaluationWindowSec": 300,
      "ScopeDownStatement": {
        "ByteMatchStatement": {
          "SearchString": "/api/",
          "FieldToMatch": { "UriPath": {} },
          "PositionalConstraint": "STARTS_WITH",
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }]
        }
      }
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "RateLimitRule"
  }
}
```
Other Gateway Tools#
| Tool | Rate Limiting Style | Distributed Support |
|---|---|---|
| Kong | Plugin-based, Redis-backed | Yes |
| AWS API Gateway | Per-stage, per-method throttling | Yes (managed) |
| Envoy | Filter chain, local or global | Yes (with rate limit service) |
| NGINX | limit_req module, leaky bucket | Limited (per-instance) |
| Traefik | Middleware-based | Yes (with Redis) |
| Cloudflare | Edge-based, automatic | Yes (global edge) |
Common Mistakes#
- Rate limiting by IP only — many users share IPs behind corporate NATs
- No quota headers — clients can't self-regulate without visibility
- Hard 429 with no Retry-After — clients retry immediately, making things worse
- Same limits for all endpoints — expensive endpoints need tighter limits
- No global limit — individual client limits don't prevent total system overload
- Testing only happy path — load test your rate limiting before production
Implementation Checklist#
- Choose identification strategy (API key, JWT, IP)
- Define per-client and per-endpoint limits
- Set a global safety limit
- Implement sliding window (Redis-backed for distributed)
- Add quota headers to all responses
- Return 429 with Retry-After on limit breach
- Define graceful degradation tiers
- Monitor rate limit hit rates in dashboards
- Alert on sustained high rejection rates
- Document limits in your API reference
This is article #356 in the Codelit engineering series. Explore more at codelit.io.