Load Shedding Patterns: Protecting Services Under Overload
When a service receives more traffic than it can handle, two things can happen: it slows down for everyone, or it strategically rejects some requests so the rest succeed quickly. Load shedding is the art of choosing the second option — dropping work deliberately so the system stays healthy.
Why Overload Is Worse Than Rejection
An overloaded service does not degrade gracefully on its own. As request queues grow, latency spikes, timeouts cascade, and retries multiply the load further. A service at 110% capacity often delivers 0% useful throughput because every request times out before completing.
Load shedding breaks this cycle by rejecting excess work early, before it consumes resources.
Incoming traffic
│
▼
┌──────────────┐ over capacity ┌────────────┐
│ Admission │───────────────────▶│ Reject │
│ Controller │ │ (503) │
└──────┬───────┘ └────────────┘
│ within capacity
▼
┌──────────────┐
│ Process │
│ Request │
└──────────────┘
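The admission controller in the diagram can be sketched as a simple in-flight request counter. This is a minimal illustration, assuming a fixed capacity tuned per service; a real controller would use the adaptive signals discussed below.

```python
import threading

class AdmissionController:
    """Admit requests while in-flight count is under capacity; reject otherwise."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self) -> bool:
        with self.lock:
            if self.in_flight >= self.capacity:
                return False  # over capacity: caller responds with 503
            self.in_flight += 1
            return True

    def release(self) -> None:
        # Call when the request finishes, success or failure.
        with self.lock:
            self.in_flight -= 1

ctl = AdmissionController(capacity=2)
results = [ctl.try_admit() for _ in range(3)]  # third request is rejected
```

The check-and-reject happens before any request processing, so a rejected request costs almost nothing.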
Overload Detection Signals
Before shedding load, you need to know you are overloaded. Common signals include:
- CPU utilization — When CPU exceeds 80-90%, new requests will queue.
- Request queue depth — A growing queue means arrivals outpace completions.
- In-flight request count — Too many concurrent requests exhaust threads and memory.
- Latency percentiles — Rising p99 latency is an early warning of saturation.
- Error rate — Upstream timeouts and downstream failures indicate cascading overload.
The best systems combine multiple signals rather than relying on a single metric.
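One way to combine signals is a simple quorum: declare overload only when a majority of metrics trip. A minimal sketch, with illustrative thresholds that would need tuning per service:

```python
def is_overloaded(cpu_util: float, queue_depth: int, p99_ms: float,
                  cpu_limit: float = 0.85,
                  queue_limit: int = 100,
                  p99_limit_ms: float = 500.0) -> bool:
    """Overloaded if at least two of three saturation signals trip.

    Requiring a quorum avoids shedding on a single noisy metric.
    """
    signals = [
        cpu_util > cpu_limit,
        queue_depth > queue_limit,
        p99_ms > p99_limit_ms,
    ]
    return sum(signals) >= 2
```

A single hot metric (say, a brief CPU spike) does not trigger shedding; sustained pressure across metrics does.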
Priority-Based Shedding
Not all requests are equal. A health check from the load balancer matters more than a background analytics query. Priority-based shedding assigns each request a priority level and drops low-priority work first.
Priority levels (example):
P0 — Health checks, auth token refresh
P1 — User-facing reads (page loads, API GETs)
P2 — User-facing writes (form submissions)
P3 — Background jobs, analytics, prefetch
P4 — Internal batch processing
Implementation approaches:
- Header-based priority — Clients set a priority header; the gateway enforces it.
- Endpoint classification — Map URL patterns to priority tiers in configuration.
- Caller identity — Paid customers get higher priority than free-tier users.
- Request cost estimation — Expensive queries (large aggregations) get lower priority under pressure.
When load exceeds capacity, start rejecting P4, then P3, and so on until throughput stabilizes.
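That tier-by-tier tightening can be sketched as a shed level that moves one step at a time. This is an illustrative sketch using the P0-P4 tiers above; the overload signal itself would come from the detection section.

```python
def admit(priority: int, shed_level: int) -> bool:
    """Admit requests strictly below the current shed level.

    shed_level = 5 sheds nothing; 4 rejects P4; 3 rejects P3 and P4; etc.
    """
    return priority < shed_level

def next_shed_level(shed_level: int, overloaded: bool) -> int:
    """Tighten one tier while overloaded; relax one tier once healthy."""
    if overloaded:
        return max(1, shed_level - 1)  # never shed P0 health checks
    return min(5, shed_level + 1)
```

Moving one tier per control interval, rather than jumping straight to aggressive shedding, keeps the system from oscillating between over- and under-rejection.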
Adaptive Shedding
Static thresholds break when traffic patterns change. Adaptive shedding adjusts its rejection rate based on real-time feedback.
CoDel-inspired shedding: Borrowed from network congestion control, Controlled Delay (CoDel) measures how long requests spend in the queue. If queue sojourn time exceeds a target (e.g., 5ms) for a sustained interval, the system starts dropping requests at an increasing rate.
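A heavily simplified CoDel-style check, assuming the article's 5ms target and an illustrative 100ms sustain interval (full CoDel also shrinks the spacing between drops as overload persists, which is omitted here):

```python
class CodelShedder:
    """Drop requests once queue sojourn time stays above target for a full interval."""

    def __init__(self, target_s: float = 0.005, interval_s: float = 0.1):
        self.target_s = target_s
        self.interval_s = interval_s
        self.above_since = None  # timestamp when delay first exceeded target

    def should_drop(self, sojourn_s: float, now: float) -> bool:
        if sojourn_s < self.target_s:
            self.above_since = None  # delay recovered: reset
            return False
        if self.above_since is None:
            self.above_since = now   # start timing the bad period
            return False
        # Drop only if delay has been above target for a sustained interval.
        return now - self.above_since >= self.interval_s
```

Keying on sojourn time rather than queue length makes the policy robust to variable request cost: a short queue of slow requests still triggers shedding.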
PID controller approach: A proportional-integral-derivative controller adjusts the acceptance rate to keep a target metric (CPU, latency, queue depth) within bounds. This avoids the oscillation that simple threshold-based systems suffer from.
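A minimal proportional-integral sketch (the derivative term is omitted for brevity) that nudges the acceptance rate toward a latency target; the gains here are illustrative and would need tuning:

```python
class PIController:
    """Adjust acceptance rate to hold observed latency near a target."""

    def __init__(self, target: float, kp: float = 0.002, ki: float = 0.0005):
        self.target = target
        self.kp, self.ki = kp, ki
        self.integral = 0.0
        self.rate = 1.0  # fraction of requests accepted, in [0, 1]

    def update(self, observed: float) -> float:
        error = self.target - observed      # negative when over target
        self.integral += error
        self.rate += self.kp * error + self.ki * self.integral
        self.rate = max(0.0, min(1.0, self.rate))
        return self.rate
```

The integral term is what damps oscillation: instead of flapping around the threshold, the controller accumulates sustained error and converges on a stable acceptance rate.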
Gradient-based: The Envoy proxy's adaptive concurrency filter (inspired by Netflix's concurrency-limits library) compares recent request latency against the best latency it has observed; when that gradient worsens, it reduces the amount of concurrent traffic it will accept.
acceptance_rate = max(0, min(1,
(target_latency / observed_p99_latency) * current_rate
))
Client Cooperation
Load shedding works best when clients cooperate rather than blindly retrying.
503 with Retry-After
When rejecting a request, return HTTP 503 (Service Unavailable) with a Retry-After header:
HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json
{"error": "service_overloaded", "retry_after_seconds": 30}
Well-behaved clients respect this header and wait before retrying, which gives the server time to recover.
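Serializing that response is straightforward; a sketch that builds the exact wire format shown above (a real service would emit this from its framework's response API rather than by hand):

```python
def overload_response() -> bytes:
    """Build the 503 + Retry-After response shown above in HTTP/1.1 wire format."""
    body = b'{"error": "service_overloaded", "retry_after_seconds": 30}'
    head = (
        b"HTTP/1.1 503 Service Unavailable\r\n"
        b"Retry-After: 30\r\n"
        b"Content-Type: application/json\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    return head + body
```

Duplicating the retry hint in the JSON body helps clients whose HTTP libraries do not surface response headers easily.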
Exponential Backoff with Jitter
Clients should add randomized jitter to their retry delays. Without jitter, all clients retry at the same instant, creating a thundering herd that re-overloads the service.
delay = min(base * 2^attempt, max_delay) + random(0, jitter)
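A direct translation of that formula, with illustrative defaults for the base delay, cap, and jitter window:

```python
import random

def retry_delay(attempt: int, base: float = 0.1,
                max_delay: float = 30.0, jitter: float = 0.1) -> float:
    """Capped exponential backoff plus a random jitter component (seconds)."""
    return min(base * (2 ** attempt), max_delay) + random.uniform(0, jitter)
```

Variants differ in where the randomness goes; "full jitter" (randomizing over the entire exponential window, `random.uniform(0, min(base * 2**attempt, max_delay))`) spreads retries even more aggressively.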
Client-Side Token Buckets
Distribute token buckets to clients so they self-limit when tokens run out. The server replenishes tokens in response headers, creating a cooperative feedback loop.
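The client side of that loop can be sketched as a plain token bucket; in the cooperative scheme described above, `refill` would be driven by a grant the server sends back in a response header (the header name and grant protocol are deployment-specific):

```python
class TokenBucket:
    """Client-side limiter: spend a token per request, self-limit when empty."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tokens = capacity

    def try_consume(self) -> bool:
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False  # out of tokens: skip or delay the request

    def refill(self, n: int) -> None:
        # Called when the server grants more tokens; capped at capacity.
        self.tokens = min(self.capacity, self.tokens + n)
```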
Queue-Based Shedding
Instead of rejecting requests at the front door, queue-based shedding accepts requests into a bounded queue and applies policies when the queue is full.
LIFO (Last-In-First-Out) eviction: When the queue is full, evict the oldest request to make room for the newest arrival. The rationale is that the oldest request's caller has likely already timed out, so processing it would waste resources.
Random early detection (RED): As queue occupancy rises above a threshold, randomly drop incoming requests with increasing probability. This prevents the queue from ever reaching capacity and avoids synchronized drops.
Deadline-aware queues: Each request carries a deadline. The queue periodically purges requests whose deadlines have passed, freeing slots for requests that can still succeed.
┌─────────────────────────────────────────┐
│ Bounded Queue (capacity: 1000) │
│ │
│ ┌───┬───┬───┬───┬───┬─── ─ ─ ─┬───┐ │
│ │ R1│ R2│ R3│ R4│ R5│ │Rn │ │
│ └───┴───┴───┴───┴───┴─── ─ ─ ─┴───┘ │
│ │
│ Policy: LIFO eviction when full │
│ Purge: Remove expired deadlines │
└─────────────────────────────────────────┘
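The two policies in the diagram, evict-oldest-when-full and deadline purging, fit in a small bounded queue. A sketch (a production queue would also need locking and service-order policy):

```python
import collections

class DeadlineQueue:
    """Bounded queue: evicts the oldest entry when full, purges expired deadlines."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = collections.deque()  # entries of (request_id, deadline)

    def offer(self, request_id, deadline) -> None:
        if len(self.items) >= self.capacity:
            self.items.popleft()  # full: evict the oldest request
        self.items.append((request_id, deadline))

    def purge_expired(self, now) -> int:
        """Drop requests whose deadlines have passed; return slots freed."""
        before = len(self.items)
        self.items = collections.deque(
            (rid, dl) for rid, dl in self.items if dl > now
        )
        return before - len(self.items)
```

Running `purge_expired` periodically (or on every dequeue) keeps the queue's slots reserved for requests that can still succeed.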
Shedding at Different Layers
Load shedding is most effective when applied at multiple layers:
| Layer | Mechanism | Benefit |
|---|---|---|
| CDN / Edge | Rate limiting by IP or API key | Stops abuse before it hits origin |
| Load Balancer | Connection limits, queue depth | Protects backend fleet uniformly |
| API Gateway | Priority routing, token buckets | Enforces per-client quotas |
| Service mesh | Circuit breakers, retry budgets | Prevents cascading failures |
| Application | In-process admission control | Fine-grained, context-aware decisions |
| Database | Connection pool limits, query timeouts | Protects the hardest-to-scale layer |
Monitoring Load Shedding
You cannot improve what you do not measure. Track these metrics:
- Shed rate — Percentage of requests rejected over time.
- Shed breakdown by priority — Confirms low-priority requests are shed first.
- Goodput — Successful throughput (total throughput minus shed requests minus errors).
- Latency of accepted requests — Should remain stable even as total traffic rises.
- Recovery time — How quickly shed rate returns to zero after a spike.
Set alerts on shed rate crossing thresholds (e.g., shedding P1 requests means something serious is wrong).
Common Pitfalls
- Shedding health checks — If the load balancer's health check gets rejected, it marks the instance as unhealthy and removes it, worsening the overload. Always exempt health checks.
- No feedback to clients — Rejecting without Retry-After causes immediate retries that amplify load.
- Shedding too late — If admission control runs after expensive middleware (auth, parsing, logging), you still burn resources on rejected requests. Shed as early as possible.
- Static thresholds only — A fixed "reject above 1000 RPS" breaks when request cost varies. Use resource-based signals instead.
- Ignoring partial degradation — Sometimes serving a degraded response (cached data, fewer fields) is better than rejecting entirely.
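The last pitfall, partial degradation, can be sketched as a fallback path that serves a trimmed, cached response under load instead of rejecting. The cache contents and field names here are hypothetical:

```python
# Hypothetical cache of recently served profile names.
CACHE = {42: "Ada"}

def full_lookup(user_id):
    """Stand-in for the expensive path (database queries, aggregations)."""
    return {"user_id": user_id, "name": "Ada", "bio": "...", "degraded": False}

def get_profile(user_id, overloaded: bool):
    """Under load, prefer a degraded cached response over a 503."""
    if overloaded:
        cached = CACHE.get(user_id)
        if cached is not None:
            return {"user_id": user_id, "name": cached, "degraded": True}
    return full_lookup(user_id)
```

Flagging the response as degraded lets clients decide whether to re-fetch later or render what they have.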
Key Takeaways
- Load shedding is not failure — it is a deliberate reliability strategy that preserves service for the majority of requests.
- Use priority tiers so that critical traffic survives while background work is shed first.
- Adaptive algorithms (CoDel, PID controllers) outperform static thresholds in dynamic environments.
- Client cooperation through Retry-After headers and exponential backoff with jitter prevents thundering herds.
- Queue-based shedding with LIFO eviction and deadline awareness avoids wasting resources on stale requests.
- Apply shedding at multiple layers — edge, gateway, service, and database — for defense in depth.
Build and explore system design concepts hands-on at codelit.io.