API Gateway Rate Limiting Patterns: Protect Your Services at the Edge
Every public API will be abused. Bots, scrapers, misbehaving clients, and accidental infinite loops will hammer your endpoints. Rate limiting at the API gateway is your first line of defense — it protects backend services before traffic even reaches them.
Why Rate Limit at the Gateway#
Rate limiting can happen at multiple layers. The gateway is the best first choice:
Client --> API Gateway (rate limit here) --> Backend Service
Benefits of gateway-level limiting:
- Rejects bad traffic before it consumes backend resources
- Centralized policy — one place to configure, not per-service
- Consistent behavior across all endpoints
- No code changes in backend services
Rate Limiting Strategies#
Per-Client Limiting#
The most common pattern. Each client (identified by API key, user ID, or IP) gets a quota:
Client A: 1000 requests / minute
Client B: 1000 requests / minute
Clients are isolated. One misbehaving client cannot exhaust capacity for others.
Identification methods:
- API key (most reliable)
- OAuth token / JWT subject claim
- IP address (unreliable behind NAT/proxies)
- Custom header (X-Client-ID)
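Putting these identification methods together, here is a minimal sketch of a resolver that prefers stable identifiers and falls back to IP only as a last resort; the header names are illustrative, not a standard:

```python
def resolve_client_id(headers, remote_addr):
    """Resolve a rate-limit key, preferring stable identifiers.

    Falls back through API key -> custom client header -> IP address.
    """
    api_key = headers.get("X-API-Key")
    if api_key:
        return f"key:{api_key}"
    client_header = headers.get("X-Client-ID")
    if client_header:
        return f"client:{client_header}"
    # Least reliable: many users can share one IP behind NAT or proxies
    return f"ip:{remote_addr}"
```

The returned key becomes the `client_id` passed to the limiter, so each identification tier gets its own namespace.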
Per-Endpoint Limiting#
Different endpoints have different costs. A search query is more expensive than a health check:
GET /api/health --> 10,000 req/min (cheap)
GET /api/search --> 100 req/min (expensive)
POST /api/export --> 10 req/min (very expensive)
GET /api/users/:id --> 2,000 req/min (moderate)
Without per-endpoint limits, a single shared quota lets a client pour all of its traffic into the most expensive operation. Per-endpoint limits cap costly endpoints independently, so cheap traffic and expensive traffic cannot trade off against each other.
Combined: Per-Client + Per-Endpoint#
The most robust approach layers both:
Client A:
  Global: 1,000 req/min
  GET /api/search: 100 req/min
  POST /api/export: 10 req/min

Client B (premium):
  Global: 5,000 req/min
  GET /api/search: 500 req/min
  POST /api/export: 50 req/min
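The layered scheme above can be sketched as a lookup that takes the tighter of the endpoint-specific and global caps. The tier names and numbers mirror the example; everything else is illustrative:

```python
# Illustrative limit tables (requests per minute), mirroring the example above
LIMITS = {
    "standard": {"global": 1000, "GET /api/search": 100, "POST /api/export": 10},
    "premium":  {"global": 5000, "GET /api/search": 500, "POST /api/export": 50},
}

def effective_limit(tier, endpoint):
    """Return the req/min cap for this tier and endpoint.

    A request must pass BOTH the endpoint-specific limit (if any)
    and the global per-client limit, so the effective cap is the
    tighter of the two.
    """
    table = LIMITS[tier]
    return min(table.get(endpoint, table["global"]), table["global"])
```

Endpoints without an explicit entry fall back to the client's global limit.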
Global Limiting#
A hard ceiling across all clients to protect system capacity:
All clients combined: 50,000 req/min to /api/search
This prevents total system overload even if individual client limits are generous.
Sliding Window Algorithm#
The sliding window is the preferred algorithm for gateway rate limiting. It avoids the burst problem of fixed windows.
Fixed Window Problem#
Window: 12:00:00 - 12:01:00 (limit: 100)
Client sends 0 requests from 12:00:00 - 12:00:55
Client sends 100 requests at 12:00:56 (allowed, within limit)
Client sends 100 requests at 12:01:01 (allowed, new window)
Result: 200 requests in 6 seconds -- double the intended rate
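A minimal fixed-window counter makes the boundary burst concrete:

```python
class FixedWindowLimiter:
    """Minimal fixed-window counter, to illustrate the boundary burst."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window_start = int(now // self.window) * self.window
        if window_start != self.current_window:
            # New window: the counter resets, forgetting recent traffic
            self.current_window = window_start
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at t=56s (end of window 0), 100 more at t=61s (start of window 1)
burst = sum(limiter.allow(56.0) for _ in range(100)) + \
        sum(limiter.allow(61.0) for _ in range(100))
# All 200 requests are admitted within a few seconds of wall-clock time
```

The sliding-window approach in the next section closes exactly this gap.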
Sliding Window Solution#
Track requests in a rolling window. At any point in time, count requests from the past N seconds:
At 12:01:30, count all requests from 12:00:30 to 12:01:30
Implementation with Redis#
```python
import time
import uuid

import redis

r = redis.Redis()

def is_rate_limited(client_id, limit, window_seconds):
    now = time.time()
    key = f"ratelimit:{client_id}"
    pipe = r.pipeline()
    # Remove entries older than the window
    pipe.zremrangebyscore(key, 0, now - window_seconds)
    # Record the current request; the sorted-set member must be unique
    # per request, so tag the timestamp with a random UUID
    pipe.zadd(key, {f"{now}:{uuid.uuid4()}": now})
    # Count requests currently in the window (including this one)
    pipe.zcard(key)
    # Expire idle keys so abandoned clients don't leak memory
    pipe.expire(key, window_seconds)
    results = pipe.execute()
    request_count = results[2]
    return request_count > limit
```
Algorithm Comparison#
| Algorithm | Accuracy | Memory | Burst Handling |
|---|---|---|---|
| Fixed window | Low | Low | Poor (boundary burst) |
| Sliding window log | High | High | Excellent |
| Sliding window counter | Medium | Low | Good |
| Token bucket | Medium | Low | Configurable burst |
| Leaky bucket | High | Low | No burst allowed |
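For comparison with the sliding-window code above, here is a minimal token bucket, the row in the table with a configurable burst. Names and parameters are illustrative:

```python
class TokenBucket:
    """Token bucket: steady refill rate plus a configurable burst size."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)  # start full: bursts are allowed up front
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket with `rate_per_sec=1, burst=5` admits five back-to-back requests, then throttles to one per second: the burst size and steady rate are tuned independently, which is the property the table credits it for.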
Quota Headers#
Clients need visibility into their limits. Use standard headers:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1714003200
The emerging IETF standard uses RateLimit headers:
RateLimit-Limit: 1000
RateLimit-Remaining: 742
RateLimit-Reset: 58
Where RateLimit-Reset is seconds until the window resets (not a Unix timestamp).
Always include these headers — even on successful responses. Clients can proactively throttle themselves before hitting limits.
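A small helper can emit both conventions from one set of counters; note the differing semantics of the two Reset fields. This sketch assumes you track the window's reset time as a Unix timestamp:

```python
def quota_headers(limit, used, window_reset_epoch, now):
    """Build both the de-facto X-RateLimit-* headers and the
    IETF-draft RateLimit-* headers from one set of counters.

    window_reset_epoch: Unix time at which the window resets.
    """
    remaining = max(0, limit - used)
    return {
        # De-facto convention: Reset is an absolute Unix timestamp
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(window_reset_epoch),
        # IETF draft convention: Reset is seconds from now
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(remaining),
        "RateLimit-Reset": str(max(0, window_reset_epoch - int(now))),
    }
```

Emitting both sets during a transition period lets old and new clients self-regulate without a breaking change.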
Retry-After Header#
When a client is rate limited, return 429 Too Many Requests with a Retry-After header:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit exceeded. Try again in 30 seconds.",
  "retry_after": 30
}
```
Client-Side Handling#
Well-behaved clients should respect Retry-After:
```python
import time

import requests

def api_call_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code == 429:
            # Honor the server's hint; default to 60s if the header is absent.
            # (Assumes the delta-seconds form; Retry-After may also be an
            # HTTP-date, which would need separate parsing.)
            retry_after = int(response.headers.get("Retry-After", 60))
            time.sleep(retry_after)
            continue
        return response
    raise Exception("Rate limited after max retries")
```
Graceful Degradation#
Rate limiting doesn't have to be all-or-nothing. Degrade gracefully:
Tiered Response Strategy#
| Load Level | Response |
|---|---|
| Normal (under 50% capacity) | Full response with all fields |
| Elevated (50-80%) | Disable expensive computed fields |
| High (80-95%) | Return cached responses, skip real-time data |
| Critical (over 95%) | Return 429 for non-essential endpoints |
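The tiers above can be encoded as a simple threshold map; the mode names are illustrative:

```python
def degradation_mode(load_fraction):
    """Map current load (0.0-1.0 of capacity) to a response strategy.

    Thresholds mirror the tier table above.
    """
    if load_fraction < 0.50:
        return "full"          # full response with all fields
    if load_fraction < 0.80:
        return "no_computed"   # disable expensive computed fields
    if load_fraction < 0.95:
        return "cached"        # cached responses only, skip real-time data
    return "reject"            # 429 for non-essential endpoints
```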
Priority Queuing#
Not all requests are equal. Assign priority levels:
Priority 1 (never throttle): Authentication, payments
Priority 2 (throttle last): Core CRUD operations
Priority 3 (throttle first): Search, analytics, exports
Priority 4 (shed first): Webhooks, batch operations
When capacity is constrained, shed low-priority traffic first.
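A sketch of priority-aware shedding, with illustrative paths and load thresholds:

```python
# Illustrative priority map; lower number = higher priority
PRIORITY = {
    "/api/auth": 1, "/api/payments": 1,
    "/api/users": 2,
    "/api/search": 3, "/api/export": 3,
    "/api/webhooks": 4, "/api/batch": 4,
}

def should_shed(path, load_fraction):
    """Shed lowest-priority traffic first as load rises.

    Thresholds are illustrative: priority 4 sheds above 80% load,
    priority 3 above 90%, priority 2 above 97%; priority 1 never sheds.
    """
    priority = PRIORITY.get(path, 3)  # unknown paths treated as priority 3
    thresholds = {4: 0.80, 3: 0.90, 2: 0.97}
    return load_fraction >= thresholds.get(priority, 2.0)
```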
Circuit Breaker Integration#
Combine rate limiting with circuit breakers:
If backend is healthy:
  Apply normal rate limits
If backend is degraded:
  Reduce rate limits by 50%
If backend circuit is open:
  Return 503 for all requests to that service
  Redirect to fallback/cache where possible
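This policy can be sketched as a limit multiplier keyed on circuit state. State names follow the usual closed/half-open/open convention; mapping "degraded" to the half-open state is an assumption of this sketch:

```python
def adjusted_limit(base_limit, circuit_state):
    """Scale the rate limit by backend health, per the policy above.

    circuit_state: "closed" (healthy), "half_open" (degraded),
    or "open" (backend failing).
    """
    if circuit_state == "closed":
        return base_limit            # healthy: normal limits
    if circuit_state == "half_open":
        return base_limit // 2       # degraded: cut limits by 50%
    return 0                         # open: admit nothing; serve 503/fallback
```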
Tools and Implementation#
Kong Rate Limiting Plugin#
```yaml
plugins:
  - name: rate-limiting
    config:
      minute: 1000
      hour: 10000
      policy: redis
      redis_host: redis.internal
      redis_port: 6379
      fault_tolerant: true
      hide_client_headers: false
      limit_by: consumer
```
Kong's open-source rate-limiting plugin uses a fixed-window algorithm with local, cluster, or Redis-backed policies; a sliding-window algorithm is available in the enterprise rate-limiting-advanced plugin.
AWS WAF Rate-Based Rules#
```json
{
  "Name": "RateLimitRule",
  "Priority": 1,
  "Action": { "Block": {} },
  "Statement": {
    "RateBasedStatement": {
      "Limit": 2000,
      "AggregateKeyType": "IP",
      "EvaluationWindowSec": 300,
      "ScopeDownStatement": {
        "ByteMatchStatement": {
          "SearchString": "/api/",
          "FieldToMatch": { "UriPath": {} },
          "PositionalConstraint": "STARTS_WITH",
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }]
        }
      }
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "RateLimitRule"
  }
}
```
Other Gateway Tools#
| Tool | Rate Limiting Style | Distributed Support |
|---|---|---|
| Kong | Plugin-based, Redis-backed | Yes |
| AWS API Gateway | Per-stage, per-method throttling | Yes (managed) |
| Envoy | Filter chain, local or global | Yes (with rate limit service) |
| NGINX | limit_req module, leaky bucket | Limited (per-instance) |
| Traefik | Middleware-based | Yes (with Redis) |
| Cloudflare | Edge-based, automatic | Yes (global edge) |
Common Mistakes#
- Rate limiting by IP only — many users share IPs behind corporate NATs
- No quota headers — clients can't self-regulate without visibility
- Hard 429 with no Retry-After — clients retry immediately, making things worse
- Same limits for all endpoints — expensive endpoints need tighter limits
- No global limit — individual client limits don't prevent total system overload
- Testing only happy path — load test your rate limiting before production
Implementation Checklist#
- Choose identification strategy (API key, JWT, IP)
- Define per-client and per-endpoint limits
- Set a global safety limit
- Implement sliding window (Redis-backed for distributed)
- Add quota headers to all responses
- Return 429 with Retry-After on limit breach
- Define graceful degradation tiers
- Monitor rate limit hit rates in dashboards
- Alert on sustained high rejection rates
- Document limits in your API reference
This is article #356 in the Codelit engineering series. Explore more at codelit.io.