Graceful Degradation: Keeping Systems Useful When Things Break
Systems fail. The question is not whether your service will experience a partial outage, but whether it will degrade gracefully or collapse entirely. Graceful degradation means shedding non-essential functionality under stress so that core operations continue working.
Degradation vs Failure
Hard failure: The checkout page returns a 500 error. No one can buy anything.
Graceful degradation: The checkout page works, but personalized recommendations are replaced with a static "popular items" list because the recommendation service is down.
The difference is planning. Graceful degradation requires you to decide in advance which features are essential and which can be dropped, reduced, or replaced when resources are scarce.
The Degradation Spectrum
Not all degradation is binary. Think of it as a spectrum:
Full Functionality
│
├── Reduced quality (lower-res images, cached data)
│
├── Partial features (disable recommendations, skip analytics)
│
├── Read-only mode (serve content, reject writes)
│
├── Static fallback (serve cached HTML, maintenance page)
│
└── Controlled shutdown (drain connections, return 503)
Each level preserves as much value as possible while shedding load from the failing component.
Feature Flags for Degradation
Feature flags are not just for progressive rollouts — they are your primary degradation control plane:
// Degradation flags (normally ON, turned OFF under stress)
const flags = {
  personalizedRecommendations: true,
  realTimeInventory: true,
  liveChat: true,
  orderTracking: true,
  analyticsCollection: true,
};
When the recommendation service degrades, flip personalizedRecommendations to false. The application falls back to a static list without any code deployment.
Automation: Connect feature flags to health checks and circuit breakers. When the circuit breaker for the recommendation service opens, automatically disable the flag. When it closes, re-enable.
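The automation described above can be sketched as a minimal circuit breaker that toggles a flag directly. This is an illustrative sketch, not a specific library: the class name, thresholds, and cooldown are assumptions.

```javascript
// Sketch: a minimal circuit breaker that flips a degradation flag.
// Thresholds and cooldown values are illustrative, not recommendations.
class FlagBreaker {
  constructor(flags, flagName, { failureThreshold = 5, cooldownMs = 30000 } = {}) {
    this.flags = flags;
    this.flagName = flagName;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
  }

  recordSuccess() {
    this.failures = 0;
    if (this.openedAt !== null && Date.now() - this.openedAt >= this.cooldownMs) {
      this.openedAt = null;
      this.flags[this.flagName] = true; // dependency recovered: re-enable the feature
    }
  }

  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold && this.openedAt === null) {
      this.openedAt = Date.now();
      this.flags[this.flagName] = false; // circuit opens: degrade automatically
    }
  }
}
```

Every call to the dependency reports into `recordSuccess`/`recordFailure`, so the flag flips without a human in the loop.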
Flag categories:
| Category | Examples | Impact of disabling |
|---|---|---|
| Non-essential | Recommendations, recently viewed, social proof | No revenue impact |
| Enhancing | Real-time inventory counts, live chat | Minor UX reduction |
| Important | Search autocomplete, order tracking | Noticeable but survivable |
| Critical | Checkout, authentication, payment | Cannot disable — must stay up |
Fallback Responses
Every external dependency should have a defined fallback:
Cache-based fallback:
async function getProductPrice(id) {
  try {
    const price = await pricingService.getPrice(id);
    await cache.set(`price:${id}`, price, TTL_1_HOUR);
    return price;
  } catch (err) {
    // Prefer a recently cached price; otherwise fall back to the catalog list price.
    const cached = await cache.get(`price:${id}`);
    if (cached) return { ...cached, stale: true };
    const product = await catalog.getProduct(id);
    return { price: product.listPrice, stale: true, source: "catalog" };
  }
}
Static fallback:
- Serve a pre-generated response when the dynamic service is unavailable.
- CDN edge workers can serve stale cache entries with a stale-while-revalidate header.
Default fallback:
- Return sensible defaults: empty arrays instead of errors, default configuration instead of fetched config, generic content instead of personalized content.
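A minimal sketch of the default-fallback pattern, with illustrative service names (`recService`, `configService` are assumptions for the example):

```javascript
// Sketch: sensible defaults instead of propagated errors.
async function getRecommendations(userId, recService) {
  try {
    return await recService.forUser(userId);
  } catch (err) {
    return []; // renders as "no recommendations", not an error page
  }
}

const DEFAULT_CONFIG = { theme: "light", pageSize: 20 };

async function loadConfig(configService) {
  try {
    return await configService.fetch();
  } catch (err) {
    return DEFAULT_CONFIG; // baked-in defaults instead of a hard failure
  }
}
```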
Read-Only Mode
When write infrastructure is stressed but read infrastructure is healthy, switch to read-only mode:
What changes:
- Users can browse, search, and view content.
- Write operations (create, update, delete) return a friendly message: "This feature is temporarily unavailable."
- Background jobs that write are paused.
Implementation:
function middleware(req, res, next) {
  if (isReadOnlyMode() && isMutatingRequest(req)) {
    return res.status(503).json({
      error: "read_only_mode",
      message: "We are performing maintenance. Read operations are available.",
      retryAfter: 300
    });
  }
  next();
}
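The middleware above assumes two helpers. A minimal sketch of one way to implement them, using an in-process switch (in practice the flag could live in a shared config store such as Redis or Consul):

```javascript
// Sketch of the helpers the read-only middleware assumes.
let readOnly = false;

function setReadOnlyMode(enabled) {
  readOnly = enabled;
}

function isReadOnlyMode() {
  return readOnly;
}

// HTTP methods that mutate state; safe methods (GET, HEAD, OPTIONS) pass through.
const MUTATING_METHODS = new Set(["POST", "PUT", "PATCH", "DELETE"]);

function isMutatingRequest(req) {
  return MUTATING_METHODS.has(req.method);
}
```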
Use cases:
- Database failover in progress (reads go to replica, writes are blocked until promotion completes).
- Storage quota exceeded.
- Write-ahead log backlog too deep.
Static Fallback Pages
When even the application server is struggling, serve static content from the CDN:
Strategy:
- Pre-generate static versions of critical pages (homepage, product pages, status page).
- Store them at the CDN edge.
- Configure the CDN to serve the static version when the origin returns 5xx errors.
# Cloudflare Page Rule (conceptual)
# If origin returns 5xx, serve /static-fallback/index.html
# with Cache-Control: public, max-age=60
What users see:
- A functional (but slightly stale) page instead of an error screen.
- A banner: "Some features are temporarily limited."
- Links to the status page for updates.
Queue Overflow Handling
When message queues back up, you need a degradation strategy:
Backpressure:
- Stop accepting new messages when the queue depth exceeds a threshold.
- Return 429 Too Many Requests to producers.
- Producers retry with exponential backoff.
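Producer-side backpressure can be sketched as follows. The queue API, depth threshold, and backoff cap are illustrative assumptions:

```javascript
// Sketch: reject new messages when queue depth crosses a threshold.
const MAX_DEPTH = 10000; // illustrative threshold

function tryEnqueue(queue, message) {
  if (queue.depth() >= MAX_DEPTH) {
    return { accepted: false, status: 429 }; // producer should back off and retry
  }
  queue.enqueue(message);
  return { accepted: true, status: 202 };
}

// Producer-side retry delay: exponential backoff, capped at 60 seconds.
function retryDelayMs(attempt) {
  return Math.min(1000 * 2 ** attempt, 60000);
}
```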
Priority shedding:
- Tag messages with priority levels.
- When the queue is above 80% capacity, drop low-priority messages (analytics events, non-critical notifications).
- Process high-priority messages (payments, order confirmations) first.
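The priority-shedding rule above can be sketched as a single admission check. The priority map, capacity, and 80% threshold are illustrative:

```javascript
// Sketch: drop low-priority messages once the queue passes 80% capacity.
const CAPACITY = 10000;
const SHED_RATIO = 0.8;
// Lower number = higher priority; unknown message types default to lowest.
const PRIORITY = { payment: 0, order_confirmation: 0, notification: 2, analytics: 3 };

function shouldAccept(queueDepth, messageType) {
  const priority = PRIORITY[messageType] ?? 3;
  if (queueDepth / CAPACITY < SHED_RATIO) return true; // under threshold: accept all
  return priority <= 1; // over threshold: only high-priority traffic gets in
}
```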
Overflow to secondary storage:
if (queue.depth() > THRESHOLD) {
// Spill to S3/DynamoDB for later processing
await overflowStore.put(message);
metrics.increment("queue.overflow");
} else {
await queue.enqueue(message);
}
Priority-Based Degradation
Not all users and not all requests deserve equal treatment during degradation:
Request priority tiers:
| Tier | Examples | Degradation behavior |
|---|---|---|
| P0 — Critical | Checkout, payment processing, auth | Never shed. Allocate reserved capacity. |
| P1 — High | Search, product pages, order status | Serve from cache if origin is slow. |
| P2 — Medium | Recommendations, reviews, wishlist | Disable under moderate load. |
| P3 — Low | Analytics, A/B test tracking, social | First to be shed. |
Implementation:
- Assign priority at the API gateway or load balancer level.
- Use separate thread pools or rate limiters per priority tier.
- When CPU or memory exceeds thresholds, shed P3 first, then P2.
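The tiered shedding step can be sketched as a lookup from tier to load threshold. The thresholds here are assumptions for illustration; in practice they would be tuned against real capacity data:

```javascript
// Sketch: shed requests by priority tier as system load rises.
// P3 sheds first, then P2; P0 is never shed.
const TIER_SHED_LOAD = { P3: 0.7, P2: 0.8, P1: 0.95, P0: Infinity };

function shouldShed(tier, systemLoad) {
  // Unknown tiers are treated as lowest priority.
  return systemLoad >= (TIER_SHED_LOAD[tier] ?? 0.7);
}
```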
SLA Tiers and Degradation
If your platform serves multiple customers with different SLA tiers, degradation should respect those commitments:
Enterprise tier (99.99% SLA):
- Dedicated capacity, never shed.
- Failover to standby infrastructure.
- Priority queue processing.
Business tier (99.9% SLA):
- Shared capacity with a reserved minimum.
- Features are degraded before business-tier traffic is shed.
Free tier (best effort):
- First to experience degradation.
- Rate limits tighten under load.
- Non-essential features disabled first.
function getCapacityBudget(userTier, systemLoad) {
  if (systemLoad > 0.9) {
    if (userTier === "free") return MINIMAL;
    if (userTier === "business") return REDUCED;
    return FULL; // enterprise
  }
  return FULL;
}
Implementing Graceful Degradation
Step 1: Classify features by criticality. Map every feature to a priority tier.
Step 2: Define fallbacks. For each non-critical feature, document what happens when it is unavailable.
Step 3: Wire up automation. Connect health checks and circuit breakers to feature flags.
Step 4: Test degradation paths. Use chaos engineering to trigger degradation and verify that fallbacks work.
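One lightweight way to exercise fallback paths is a fault-injecting wrapper around a dependency client. This is a sketch; the wrapper name and failure-rate knob are assumptions, and dedicated chaos tooling would replace it in production:

```javascript
// Sketch: wrap any async dependency call so tests can force it to fail
// and verify that the fallback path actually runs.
function withFaultInjection(fn, { failureRate = 1.0, random = Math.random } = {}) {
  return async (...args) => {
    if (random() < failureRate) {
      throw new Error("injected fault"); // simulate the dependency being down
    }
    return fn(...args);
  };
}
```

In a staging run, wrapping the pricing client with `failureRate: 1.0` should produce only cached or catalog-sourced prices, never 500s.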
Step 5: Communicate to users. Show clear messaging when features are degraded. Never silently fail.
Anti-Patterns
1. All-or-nothing availability
If one microservice fails and the entire page returns 500, you have no degradation strategy.
2. Degradation paths that are never tested
Fallback code that has not run in production will fail when you need it most. Test degradation regularly.
3. Silent degradation without user feedback
If search results are stale, tell the user. Hiding degradation erodes trust more than acknowledging it.
4. Manual-only degradation triggers
If a human must flip a switch at 3 AM, degradation is not graceful — it is delayed. Automate triggers.
Key Takeaways
Graceful degradation is the difference between a system that bends and one that breaks. Classify features by criticality, define fallbacks for every dependency, automate degradation triggers with feature flags and circuit breakers, and respect SLA tiers when shedding load. The goal is never perfection — it is maximum usefulness under imperfect conditions.
This is article #265 of the Codelit system design series. For more deep dives on resilience and fault-tolerance patterns, explore the full blog archive.