# API Gateway Patterns: A Deep Dive into Production Architecture
An API gateway sits between clients and your backend services. It absorbs cross-cutting concerns — routing, security, throttling, transformation — so individual services stay focused on business logic. In a microservices architecture, the gateway is often the single entry point for all external traffic.
## Request Routing
The gateway's most fundamental job is routing requests to the correct backend service based on path, headers, method, or query parameters.
```text
Client ──▶ /api/orders/*   ──▶ Order Service
       ──▶ /api/users/*    ──▶ User Service
       ──▶ /api/payments/* ──▶ Payment Service
```
### Path-Based Routing
The simplest model maps each URL prefix to a service:
```yaml
# Kong declarative config (the _format_version field is required
# by Kong's declarative loader)
_format_version: "3.0"
services:
  - name: order-service
    url: http://orders.internal:8080
    routes:
      - paths: ["/api/orders"]
  - name: user-service
    url: http://users.internal:8080
    routes:
      - paths: ["/api/users"]
```
### Header-Based and Weighted Routing
For canary deployments or A/B testing, route a percentage of traffic — or traffic with specific headers — to a different backend version:
```text
┌────────────┐
│  Gateway   │──── 95% ────▶ Service v1
│            │────  5% ────▶ Service v2
└────────────┘
```
This lets you validate new releases under real traffic before a full rollout.
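One way to implement weighted routing is to hash a stable key (such as a user ID) into a bucket, so each user consistently lands on the same backend version during the rollout. The sketch below assumes the backend names and a 95/5 split from the diagram; it is illustrative, not any particular gateway's implementation.

```python
import hashlib

# Illustrative backend pool: (name, weight), weights summing to 100.
BACKENDS = [("service-v1", 95), ("service-v2", 5)]

def pick_backend(routing_key: str) -> str:
    """Deterministically map a routing key (e.g. user ID) to a backend.

    Hashing the key means the same user always sees the same version,
    which keeps canary experiences consistent across requests.
    """
    digest = hashlib.sha256(routing_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in [0, 100)
    cumulative = 0
    for name, weight in BACKENDS:
        cumulative += weight
        if bucket < cumulative:
            return name
    return BACKENDS[-1][0]
```

Because the assignment is hash-based rather than random per request, ramping the canary from 5% to 10% only moves users whose buckets fall in the newly added range.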
## Rate Limiting
Rate limiting protects backends from abuse, ensures fair usage, and prevents cascading failures during traffic spikes.
### Common Algorithms
| Algorithm | Description | Best For |
|---|---|---|
| Token bucket | Tokens refill at a fixed rate; each request consumes one | Bursty traffic with sustained limits |
| Sliding window | Counts requests in a rolling time window | Smooth, predictable throttling |
| Fixed window | Counts requests in discrete time intervals | Simple implementation |
| Leaky bucket | Processes requests at a constant rate, queuing excess | Smoothing request flow |
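The token bucket from the table above can be sketched in a few lines. This is a single-process, unsynchronized version for clarity; a real gateway would need locking (or atomic Redis operations) across replicas.

```python
import time

class TokenBucket:
    """Token-bucket limiter: tokens refill at `rate` per second up to
    `capacity`; each request consumes one token. Bursts up to `capacity`
    are allowed, while sustained throughput converges to `rate`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity           # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket with `rate=1, capacity=3` admits three back-to-back requests, then rejects until roughly a second has passed.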
### Multi-Tier Limits
Production gateways often enforce multiple tiers simultaneously:
- Global limit — protect the entire platform (e.g., 100K req/s).
- Per-tenant limit — enforce plan-based quotas (free: 100 req/min, pro: 10K req/min).
- Per-endpoint limit — protect expensive operations (e.g., search: 20 req/s per user).
Rate limit state is typically stored in Redis for sub-millisecond lookups across gateway replicas.
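A minimal sketch of multi-tier checking, using fixed windows: every tier must be under quota for a request to pass. An in-memory dict stands in for Redis here (in production each key would be an `INCR` with an `EXPIRE` on the window); the plan names and quotas are the illustrative numbers from the list above.

```python
import time
from collections import defaultdict

WINDOW = 60  # seconds per fixed window
LIMITS = {
    "global": 100_000,      # whole-platform ceiling
    "tenant:free": 100,     # free plan: 100 req/min
    "tenant:pro": 10_000,   # pro plan: 10K req/min
}
_counters: dict = defaultdict(int)  # stand-in for Redis counters

def allow(tenant: str, plan: str, now: float = None) -> bool:
    """Admit the request only if every tier is under its quota."""
    now = time.time() if now is None else now
    window = int(now // WINDOW)
    checks = [
        (f"global:{window}", LIMITS["global"]),
        (f"{tenant}:{window}", LIMITS[f"tenant:{plan}"]),
    ]
    if any(_counters[key] >= limit for key, limit in checks):
        return False
    for key, _ in checks:       # charge all tiers together
        _counters[key] += 1
    return True
```

Note that the tiers are checked before any counter is incremented, so a request rejected by one tier does not consume quota at another.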
## Authentication and Authorization
The gateway centralizes identity verification so services do not each implement their own auth logic.
```text
┌────────┐     ┌─────────┐     ┌─────────────┐     ┌─────────┐
│ Client │───▶ │ Gateway │───▶ │ Auth Check  │───▶ │ Service │
│        │     │         │     │ (JWT/OAuth) │     │         │
└────────┘     └─────────┘     └─────────────┘     └─────────┘
```
Common patterns:
- JWT validation — the gateway verifies the token signature and expiry, then forwards claims as headers.
- OAuth 2.0 introspection — the gateway calls the authorization server to validate opaque tokens.
- API key lookup — the gateway checks the key against a store and attaches the associated tenant context.
- mTLS termination — the gateway verifies client certificates and passes identity downstream.
After authentication, the gateway can enforce coarse-grained authorization (does this tenant have access to this service?) while leaving fine-grained authorization to the service itself.
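The JWT-validation step can be sketched with only the standard library, assuming HS256 (a shared-secret HMAC). Real deployments should use a vetted library such as PyJWT, and typically RS256 with keys fetched from a JWKS endpoint rather than a shared secret.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url_decode(part: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def verify_jwt(token: str, secret: bytes) -> dict:
    """Verify an HS256 JWT's signature and expiry; return its claims."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims  # the gateway forwards these as headers, e.g. X-User-Id
```

After this check succeeds, the gateway strips the `Authorization` header and injects trusted identity headers, so backends never parse tokens themselves.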
## Request Transformation
Gateways modify requests and responses in flight to decouple clients from backend contracts.
### Common Transformations
- Path rewriting — strip the `/api` prefix before forwarding.
- Header injection — add correlation IDs, tenant context, or trace headers.
- Body mapping — convert between JSON and XML, or reshape payloads for legacy backends.
- Protocol translation — accept REST from clients and forward as gRPC to internal services.
- Request aggregation — combine multiple backend calls into a single client response (Backend-for-Frontend pattern).
```text
Client: GET /api/dashboard
  └─▶ Gateway fans out:
        ├─▶ GET /orders/recent
        ├─▶ GET /metrics/summary
        └─▶ GET /notifications/unread
  └─▶ Gateway merges responses into one payload
```
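The fan-out above can be sketched with `asyncio`. The three backend calls are stubbed as coroutines returning canned data; in a real gateway each would be an HTTP or gRPC call to an internal service.

```python
import asyncio

# Hypothetical backend stubs: in production these would be network calls.
async def get_recent_orders():
    return {"orders": ["o-1", "o-2"]}

async def get_metrics_summary():
    return {"metrics": {"p99_ms": 120}}

async def get_unread_notifications():
    return {"notifications": 3}

async def dashboard_handler() -> dict:
    """BFF-style aggregation: fan out concurrently, merge into one payload."""
    orders, metrics, notifs = await asyncio.gather(
        get_recent_orders(),
        get_metrics_summary(),
        get_unread_notifications(),
    )
    return {**orders, **metrics, **notifs}
```

Because the calls run concurrently under `asyncio.gather`, the client's latency is roughly the slowest backend call rather than the sum of all three.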
## Response Caching
Caching at the gateway layer reduces backend load and improves latency for repeated queries.
### Cache Strategies
- Time-based TTL — cache responses for a fixed duration (e.g., 60 seconds).
- Cache-Control aware — respect `Cache-Control`, `ETag`, and `Last-Modified` headers from backends.
- Vary-key caching — cache different variants based on `Accept`, `Authorization`, or custom headers.
- Stale-while-revalidate — serve stale content immediately while refreshing in the background.
Key rule: never cache authenticated, user-specific responses in a shared cache without proper vary keys. Leaking one user's data to another is a critical security incident.
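A sketch of a TTL cache whose key includes the vary headers, illustrating the rule above: because `Authorization` is part of the cache key, one user's cached response can never be served to another. The header set and structure are illustrative, not a specific gateway's API.

```python
import time

# Headers whose values must be part of the cache key.
VARY_HEADERS = ("accept", "authorization")

class GatewayCache:
    """Time-based TTL cache keyed on (method, path, vary-header values)."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.store = {}  # key tuple -> (stored_at, response)

    def _key(self, method: str, path: str, headers: dict) -> tuple:
        # Including Authorization in the key isolates each user's entries.
        return (method, path) + tuple(headers.get(h, "") for h in VARY_HEADERS)

    def get(self, method: str, path: str, headers: dict):
        entry = self.store.get(self._key(method, path, headers))
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, method: str, path: str, headers: dict, response) -> None:
        self.store[self._key(method, path, headers)] = (time.monotonic(), response)
```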
## Circuit Breaking
When a backend service becomes unhealthy, the gateway should stop sending traffic to it rather than letting requests pile up and cascade failures.
```text
┌─────────┐     ┌──────────┐     ┌─────────┐
│ Gateway │────▶│ Circuit  │────▶│ Backend │
│         │     │ Breaker  │     │ Service │
└─────────┘     └──────────┘     └─────────┘
```
States:
```text
CLOSED    ──▶ requests flow normally
OPEN      ──▶ requests fail fast (503)
HALF-OPEN ──▶ allow a probe request to test recovery
```
Configuration typically includes:
- Failure threshold — number of failures before opening the circuit (e.g., 5 in 10 seconds).
- Open duration — how long to wait before probing (e.g., 30 seconds).
- Success threshold — number of successful probes before closing the circuit.
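The three states and thresholds above can be sketched as a small state machine. Timestamps are injectable so the transitions are easy to test; the default thresholds mirror the illustrative numbers in the list.

```python
import time

class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF-OPEN state machine."""

    def __init__(self, failure_threshold=5, open_seconds=30.0,
                 success_threshold=2):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.state == "OPEN":
            if now - self.opened_at >= self.open_seconds:
                self.state = "HALF-OPEN"   # let a probe through
                self.successes = 0
                return True
            return False                   # fail fast: gateway returns 503
        return True

    def record_success(self) -> None:
        if self.state == "HALF-OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "CLOSED", 0  # recovered
        else:
            self.failures = 0              # a success resets the streak

    def record_failure(self, now=None) -> None:
        now = time.monotonic() if now is None else now
        # Any failure while HALF-OPEN, or too many while CLOSED, opens it.
        if self.state == "HALF-OPEN" or self.failures + 1 >= self.failure_threshold:
            self.state, self.opened_at, self.failures = "OPEN", now, 0
        else:
            self.failures += 1
```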
## Load Shedding
Load shedding goes further than rate limiting: when the system is under extreme pressure, the gateway deliberately drops low-priority requests to preserve capacity for critical ones.
### Priority Classification
```text
Priority 1 (Critical):    POST /api/payments/charge
Priority 2 (Important):   GET  /api/orders/{id}
Priority 3 (Best-effort): GET  /api/recommendations
```
When CPU or queue depth crosses a threshold, the gateway sheds Priority 3 first, then Priority 2, keeping Priority 1 alive as long as possible. This is how systems survive traffic 10x beyond capacity.
### Implementation Signals
- Request queue depth — if the queue exceeds a threshold, reject new low-priority requests.
- Backend latency — if p99 latency spikes, begin shedding.
- CPU / memory — system-level signals trigger adaptive shedding.
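Putting the signals and priority classes together, a shedding decision can be sketched as a function of the route's priority and a pressure signal such as queue depth. The routes come from the classification above; the thresholds are illustrative.

```python
# Route-to-priority map from the classification above (lower = more critical).
PRIORITIES = {
    "/api/payments/charge": 1,   # critical
    "/api/orders": 2,            # important
    "/api/recommendations": 3,   # best-effort
}

def should_shed(path: str, queue_depth: int) -> bool:
    """Decide whether to reject a request given current queue pressure.

    Thresholds (500, 900) are illustrative; production systems tune them
    against measured capacity or derive them adaptively.
    """
    priority = PRIORITIES.get(path, 3)   # unknown routes are best-effort
    if queue_depth > 900:
        return priority >= 2             # only critical traffic survives
    if queue_depth > 500:
        return priority >= 3             # shed best-effort first
    return False                         # healthy: admit everything
```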
## Tooling
### Kong
Kong is an open-source, Lua/OpenResty-based gateway with a rich plugin ecosystem. It supports declarative configuration, a database-backed admin API, and Kubernetes-native ingress via the Kong Ingress Controller.
Key plugins: `rate-limiting`, `jwt`, `oauth2`, `request-transformer`, `response-ratelimiting`, `proxy-cache`, circuit breaking (via custom plugins or Kong Gateway Enterprise).
### Tyk
Tyk is a Go-based gateway with built-in analytics, a developer portal, and GraphQL support. It offers rate limiting, circuit breaking, request/response transformation, and API versioning out of the box.
### AWS API Gateway
AWS API Gateway provides a fully managed solution with two flavors:
- HTTP API — lightweight, low-latency, best for proxying to Lambda or HTTP backends.
- REST API — full-featured with request validation, WAF integration, caching, and usage plans.
Both integrate with AWS Cognito for authentication, CloudWatch for monitoring, and X-Ray for tracing.
### Other Notable Tools
- Envoy — high-performance L4/L7 proxy, often used as a data plane for service meshes (Istio).
- NGINX — widely used as a reverse proxy with API gateway capabilities via OpenResty or NGINX Plus.
- Traefik — cloud-native edge router with automatic service discovery.
- Ambassador (Emissary-Ingress) — Kubernetes-native gateway built on Envoy.
## Key Takeaways
An API gateway is not just a reverse proxy — it is the control plane for your external API surface. Routing, security, throttling, transformation, caching, circuit breaking, and load shedding are all cross-cutting concerns that belong at the edge, not scattered across services.
Choose a gateway that matches your operational maturity: managed services like AWS API Gateway for simplicity, or self-hosted solutions like Kong and Envoy for maximum control.