Load Shedding Patterns: Protecting Services Under Overload
When a service receives more traffic than it can handle, two things can happen: it slows down for everyone, or it strategically rejects some requests so the rest succeed quickly. Load shedding is the art of choosing the second option — dropping work deliberately so the system stays healthy.
Why Overload Is Worse Than Rejection
An overloaded service does not degrade gracefully on its own. As request queues grow, latency spikes, timeouts cascade, and retries multiply the load further. A service at 110% capacity often delivers 0% useful throughput because every request times out before completing.
Load shedding breaks this cycle by rejecting excess work early, before it consumes resources.
Incoming traffic
│
▼
┌──────────────┐ over capacity ┌────────────┐
│ Admission │───────────────────▶│ Reject │
│ Controller │ │ (503) │
└──────┬───────┘ └────────────┘
│ within capacity
▼
┌──────────────┐
│ Process │
│ Request │
└──────────────┘
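The admission controller in the diagram can be sketched as a simple in-flight request counter. This is a minimal illustration, assuming a fixed capacity tuned per service; a real controller would use the adaptive signals discussed below.

```python
import threading

class AdmissionController:
    """Admit requests while in-flight count is under capacity; reject otherwise."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self) -> bool:
        with self.lock:
            if self.in_flight >= self.capacity:
                return False  # over capacity: caller responds with 503
            self.in_flight += 1
            return True

    def release(self) -> None:
        # Call when the request finishes, success or failure.
        with self.lock:
            self.in_flight -= 1

ctl = AdmissionController(capacity=2)
results = [ctl.try_admit() for _ in range(3)]  # third request is rejected
```

The check-and-reject happens before any request processing, so a rejected request costs almost nothing.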
Overload Detection Signals
Before shedding load, you need to know you are overloaded. Common signals include:
- CPU utilization — When CPU exceeds 80-90%, new requests will queue.
- Request queue depth — A growing queue means arrivals outpace completions.
- In-flight request count — Too many concurrent requests exhaust threads and memory.
- Latency percentiles — Rising p99 latency is an early warning of saturation.
- Error rate — Upstream timeouts and downstream failures indicate cascading overload.
The best systems combine multiple signals rather than relying on a single metric.
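One way to combine signals is a simple quorum: declare overload only when a majority of metrics trip. A minimal sketch, with illustrative thresholds that would need tuning per service:

```python
def is_overloaded(cpu_util: float, queue_depth: int, p99_ms: float,
                  cpu_limit: float = 0.85,
                  queue_limit: int = 100,
                  p99_limit_ms: float = 500.0) -> bool:
    """Overloaded if at least two of three saturation signals trip.

    Requiring a quorum avoids shedding on a single noisy metric.
    """
    signals = [
        cpu_util > cpu_limit,
        queue_depth > queue_limit,
        p99_ms > p99_limit_ms,
    ]
    return sum(signals) >= 2
```

A single hot metric (say, a brief CPU spike) does not trigger shedding; sustained pressure across metrics does.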
Priority-Based Shedding
Not all requests are equal. A health check from the load balancer matters more than a background analytics query. Priority-based shedding assigns each request a priority level and drops low-priority work first.
Priority levels (example):
P0 — Health checks, auth token refresh
P1 — User-facing reads (page loads, API GETs)
P2 — User-facing writes (form submissions)
P3 — Background jobs, analytics, prefetch
P4 — Internal batch processing
Implementation approaches:
- Header-based priority — Clients set a priority header; the gateway enforces it.
- Endpoint classification — Map URL patterns to priority tiers in configuration.
- Caller identity — Paid customers get higher priority than free-tier users.
- Request cost estimation — Expensive queries (large aggregations) get lower priority under pressure.
When load exceeds capacity, start rejecting P4, then P3, and so on until throughput stabilizes.
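That tier-by-tier tightening can be sketched as a shed level that moves one step at a time. This is an illustrative sketch using the P0-P4 tiers above; the overload signal itself would come from the detection section.

```python
def admit(priority: int, shed_level: int) -> bool:
    """Admit requests strictly below the current shed level.

    shed_level = 5 sheds nothing; 4 rejects P4; 3 rejects P3 and P4; etc.
    """
    return priority < shed_level

def next_shed_level(shed_level: int, overloaded: bool) -> int:
    """Tighten one tier while overloaded; relax one tier once healthy."""
    if overloaded:
        return max(1, shed_level - 1)  # never shed P0 health checks
    return min(5, shed_level + 1)
```

Moving one tier per control interval, rather than jumping straight to aggressive shedding, keeps the system from oscillating between over- and under-rejection.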
Adaptive Shedding
Static thresholds break when traffic patterns change. Adaptive shedding adjusts its rejection rate based on real-time feedback.
CoDel-inspired shedding: Borrowed from network congestion control, Controlled Delay (CoDel) measures how long requests spend in the queue. If queue sojourn time exceeds a target (e.g., 5ms) for a sustained interval, the system starts dropping requests at an increasing rate.
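A heavily simplified CoDel-style check, assuming the article's 5ms target and an illustrative 100ms sustain interval (full CoDel also shrinks the spacing between drops as overload persists, which is omitted here):

```python
class CodelShedder:
    """Drop requests once queue sojourn time stays above target for a full interval."""

    def __init__(self, target_s: float = 0.005, interval_s: float = 0.1):
        self.target_s = target_s
        self.interval_s = interval_s
        self.above_since = None  # timestamp when delay first exceeded target

    def should_drop(self, sojourn_s: float, now: float) -> bool:
        if sojourn_s < self.target_s:
            self.above_since = None  # delay recovered: reset
            return False
        if self.above_since is None:
            self.above_since = now   # start timing the bad period
            return False
        # Drop only if delay has been above target for a sustained interval.
        return now - self.above_since >= self.interval_s
```

Keying on sojourn time rather than queue length makes the policy robust to variable request cost: a short queue of slow requests still triggers shedding.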
PID controller approach: A proportional-integral-derivative controller adjusts the acceptance rate to keep a target metric (CPU, latency, queue depth) within bounds. This avoids the oscillation that simple threshold-based systems suffer from.
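A minimal proportional-integral sketch (the derivative term is omitted for brevity) that nudges the acceptance rate toward a latency target; the gains here are illustrative and would need tuning:

```python
class PIController:
    """Adjust acceptance rate to hold observed latency near a target."""

    def __init__(self, target: float, kp: float = 0.002, ki: float = 0.0005):
        self.target = target
        self.kp, self.ki = kp, ki
        self.integral = 0.0
        self.rate = 1.0  # fraction of requests accepted, in [0, 1]

    def update(self, observed: float) -> float:
        error = self.target - observed      # negative when over target
        self.integral += error
        self.rate += self.kp * error + self.ki * self.integral
        self.rate = max(0.0, min(1.0, self.rate))
        return self.rate
```

The integral term is what damps oscillation: instead of flapping around the threshold, the controller accumulates sustained error and converges on a stable acceptance rate.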
Gradient-based: The Envoy proxy's adaptive concurrency filter (inspired by Netflix's concurrency-limits library) compares recent request latency against the best latency it has observed; when that gradient worsens, it reduces the amount of concurrent traffic it will accept.
acceptance_rate = max(0, min(1,
(target_latency / observed_p99_latency) * current_rate
))
Client Cooperation
Load shedding works best when clients cooperate rather than blindly retrying.
503 with Retry-After
When rejecting a request, return HTTP 503 (Service Unavailable) with a Retry-After header:
HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json
{"error": "service_overloaded", "retry_after_seconds": 30}
Well-behaved clients respect this header and wait before retrying, which gives the server time to recover.
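Serializing that response is straightforward; a sketch that builds the exact wire format shown above (a real service would emit this from its framework's response API rather than by hand):

```python
def overload_response() -> bytes:
    """Build the 503 + Retry-After response shown above in HTTP/1.1 wire format."""
    body = b'{"error": "service_overloaded", "retry_after_seconds": 30}'
    head = (
        b"HTTP/1.1 503 Service Unavailable\r\n"
        b"Retry-After: 30\r\n"
        b"Content-Type: application/json\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    return head + body
```

Duplicating the retry hint in the JSON body helps clients whose HTTP libraries do not surface response headers easily.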
Exponential Backoff with Jitter
Clients should add randomized jitter to their retry delays. Without jitter, all clients retry at the same instant, creating a thundering herd that re-overloads the service.
delay = min(base * 2^attempt, max_delay) + random(0, jitter)
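A direct translation of that formula, with illustrative defaults for the base delay, cap, and jitter window:

```python
import random

def retry_delay(attempt: int, base: float = 0.1,
                max_delay: float = 30.0, jitter: float = 0.1) -> float:
    """Capped exponential backoff plus a random jitter component (seconds)."""
    return min(base * (2 ** attempt), max_delay) + random.uniform(0, jitter)
```

Variants differ in where the randomness goes; "full jitter" (randomizing over the entire exponential window, `random.uniform(0, min(base * 2**attempt, max_delay))`) spreads retries even more aggressively.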
Client-Side Token Buckets
Distribute token buckets to clients so they self-limit when tokens run out. The server replenishes tokens in response headers, creating a cooperative feedback loop.
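The client side of that loop can be sketched as a plain token bucket; in the cooperative scheme described above, `refill` would be driven by a grant the server sends back in a response header (the header name and grant protocol are deployment-specific):

```python
class TokenBucket:
    """Client-side limiter: spend a token per request, self-limit when empty."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tokens = capacity

    def try_consume(self) -> bool:
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False  # out of tokens: skip or delay the request

    def refill(self, n: int) -> None:
        # Called when the server grants more tokens; capped at capacity.
        self.tokens = min(self.capacity, self.tokens + n)
```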
Queue-Based Shedding
Instead of rejecting requests at the front door, queue-based shedding accepts requests into a bounded queue and applies policies when the queue is full.
LIFO (Last-In-First-Out) eviction: When the queue is full, evict the oldest request to make room for the newest arrival. The rationale is that the oldest request's caller has likely already timed out, so processing it would waste resources.
Random early detection (RED): As queue occupancy rises above a threshold, randomly drop incoming requests with increasing probability. This prevents the queue from ever reaching capacity and avoids synchronized drops.
Deadline-aware queues: Each request carries a deadline. The queue periodically purges requests whose deadlines have passed, freeing slots for requests that can still succeed.
┌─────────────────────────────────────────┐
│ Bounded Queue (capacity: 1000) │
│ │
│ ┌───┬───┬───┬───┬───┬─── ─ ─ ─┬───┐ │
│ │ R1│ R2│ R3│ R4│ R5│ │Rn │ │
│ └───┴───┴───┴───┴───┴─── ─ ─ ─┴───┘ │
│ │
│ Policy: LIFO eviction when full │
│ Purge: Remove expired deadlines │
└─────────────────────────────────────────┘
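The two policies in the diagram, evict-oldest-when-full and deadline purging, fit in a small bounded queue. A sketch (a production queue would also need locking and service-order policy):

```python
import collections

class DeadlineQueue:
    """Bounded queue: evicts the oldest entry when full, purges expired deadlines."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = collections.deque()  # entries of (request_id, deadline)

    def offer(self, request_id, deadline) -> None:
        if len(self.items) >= self.capacity:
            self.items.popleft()  # full: evict the oldest request
        self.items.append((request_id, deadline))

    def purge_expired(self, now) -> int:
        """Drop requests whose deadlines have passed; return slots freed."""
        before = len(self.items)
        self.items = collections.deque(
            (rid, dl) for rid, dl in self.items if dl > now
        )
        return before - len(self.items)
```

Running `purge_expired` periodically (or on every dequeue) keeps the queue's slots reserved for requests that can still succeed.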
Shedding at Different Layers
Load shedding is most effective when applied at multiple layers:
| Layer | Mechanism | Benefit |
|---|---|---|
| CDN / Edge | Rate limiting by IP or API key | Stops abuse before it hits origin |
| Load Balancer | Connection limits, queue depth | Protects backend fleet uniformly |
| API Gateway | Priority routing, token buckets | Enforces per-client quotas |
| Service mesh | Circuit breakers, retry budgets | Prevents cascading failures |
| Application | In-process admission control | Fine-grained, context-aware decisions |
| Database | Connection pool limits, query timeouts | Protects the hardest-to-scale layer |
Monitoring Load Shedding
You cannot improve what you do not measure. Track these metrics:
- Shed rate — Percentage of requests rejected over time.
- Shed breakdown by priority — Confirms low-priority requests are shed first.
- Goodput — Successful throughput (total throughput minus shed requests minus errors).
- Latency of accepted requests — Should remain stable even as total traffic rises.
- Recovery time — How quickly shed rate returns to zero after a spike.
Set alerts on shed rate crossing thresholds (e.g., shedding P1 requests means something serious is wrong).
Common Pitfalls
- Shedding health checks — If the load balancer's health check gets rejected, it marks the instance as unhealthy and removes it, worsening the overload. Always exempt health checks.
- No feedback to clients — Rejecting without Retry-After causes immediate retries that amplify load.
- Shedding too late — If admission control runs after expensive middleware (auth, parsing, logging), you still burn resources on rejected requests. Shed as early as possible.
- Static thresholds only — A fixed "reject above 1000 RPS" breaks when request cost varies. Use resource-based signals instead.
- Ignoring partial degradation — Sometimes serving a degraded response (cached data, fewer fields) is better than rejecting entirely.
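The last pitfall, partial degradation, can be sketched as a fallback path that serves a trimmed, cached response under load instead of rejecting. The cache contents and field names here are hypothetical:

```python
# Hypothetical cache of recently served profile names.
CACHE = {42: "Ada"}

def full_lookup(user_id):
    """Stand-in for the expensive path (database queries, aggregations)."""
    return {"user_id": user_id, "name": "Ada", "bio": "...", "degraded": False}

def get_profile(user_id, overloaded: bool):
    """Under load, prefer a degraded cached response over a 503."""
    if overloaded:
        cached = CACHE.get(user_id)
        if cached is not None:
            return {"user_id": user_id, "name": cached, "degraded": True}
    return full_lookup(user_id)
```

Flagging the response as degraded lets clients decide whether to re-fetch later or render what they have.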
Key Takeaways
- Load shedding is not failure — it is a deliberate reliability strategy that preserves service for the majority of requests.
- Use priority tiers so that critical traffic survives while background work is shed first.
- Adaptive algorithms (CoDel, PID controllers) outperform static thresholds in dynamic environments.
- Client cooperation through Retry-After headers and exponential backoff with jitter prevents thundering herds.
- Queue-based shedding with LIFO eviction and deadline awareness avoids wasting resources on stale requests.
- Apply shedding at multiple layers — edge, gateway, service, and database — for defense in depth.
Build and explore system design concepts hands-on at codelit.io.