# Chaos Testing in Production: Breaking Things on Purpose
Every distributed system fails. The question is whether you discover failure modes during a controlled experiment or at 3 AM when customers are affected. Chaos testing — deliberately injecting faults into production — converts unknown unknowns into documented, rehearsed scenarios with proven runbooks.
## The Chaos Engineering Discipline
Netflix coined the term with Chaos Monkey in 2011. The discipline has since matured into a rigorous scientific method:
- Define steady state — Pick a measurable business metric: orders per minute, p99 latency, error rate.
- Hypothesize — "If we terminate 30% of API pods, the load balancer reroutes traffic and error rate stays below 0.5%."
- Inject failure — Run the experiment in production (or a production-like staging environment).
- Observe — Compare steady-state metrics during and after injection.
- Learn — If the hypothesis holds, confidence increases. If it breaks, you found a weakness before customers did.
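The five steps above can be sketched as a small harness. This is a minimal illustration, not any particular tool's API; `measure`, `inject`, and `rollback` are placeholder callables standing in for real metric queries and fault injectors:

```python
import statistics

def run_experiment(measure, inject, rollback, baseline, tolerance=0.05, samples=5):
    """Minimal chaos-experiment loop: verify steady state, inject a fault,
    observe, roll back, and report whether the hypothesis held."""
    # 1-2. Confirm steady state before injecting anything.
    before = [measure() for _ in range(samples)]
    if abs(statistics.mean(before) - baseline) / baseline > tolerance:
        raise RuntimeError("System not in steady state; aborting experiment")
    # 3-4. Inject the failure, observe, and always roll back.
    inject()
    try:
        during = [measure() for _ in range(samples)]
    finally:
        rollback()
    # 5. The hypothesis holds if the metric stayed within tolerance of baseline.
    deviation = abs(statistics.mean(during) - baseline) / baseline
    return {"deviation": deviation, "hypothesis_held": deviation <= tolerance}
```

The `finally` block matters: rollback must run even if observation itself fails, or the experiment becomes an outage.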
## Failure Injection Categories
Chaos experiments map to four broad categories of failure:
### Network Failures

```yaml
# Example: tc-based network delay injection
experiment:
  name: api-latency-spike
  target: service/order-api
  injection:
    type: network-delay
    latency: 500ms
    jitter: 100ms
    duration: 5m
    percentage: 50
```
Common network faults:
- Latency injection — Add delay to inter-service calls to test timeout handling.
- Packet loss — Drop a percentage of packets to simulate degraded links.
- DNS failure — Return NXDOMAIN for a dependency to verify fallback behavior.
- Partition — Block traffic between two availability zones.
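Under the hood, a config like the one above is typically translated into Linux `tc`/`netem` commands. A hedged sketch of that translation (the interface name is illustrative, and applying delay to only a percentage of traffic requires additional `tc` filter rules, omitted here):

```python
def tc_delay_command(interface: str, latency_ms: int, jitter_ms: int) -> list[str]:
    """Build the tc/netem command that adds latency and jitter to an
    interface's egress traffic."""
    return [
        "tc", "qdisc", "add", "dev", interface,
        "root", "netem",
        "delay", f"{latency_ms}ms", f"{jitter_ms}ms",
    ]

def tc_delay_rollback(interface: str) -> list[str]:
    """Build the matching rollback command that removes the qdisc."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]
```

Every injection command should be generated alongside its rollback, so the abort path never has to improvise.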
### Compute Failures
- Pod/container kill — Terminate random instances to test auto-scaling and load balancing.
- CPU stress — Saturate CPU to verify throttling and priority-based scheduling.
- Memory pressure — Allocate memory until OOM-killer activates to test graceful degradation.
- Node drain — Cordon and drain a Kubernetes node to test pod rescheduling.
### State Failures
- Disk fill — Fill the data volume to test write-ahead log behavior and alerting.
- Database failover — Trigger a primary-to-replica promotion to measure failover time.
- Cache eviction — Flush Redis/Memcached to test cold-cache performance.
- Clock skew — Shift the system clock to surface time-dependent bugs in TLS, tokens, or cron jobs.
### Dependency Failures
- Third-party API unavailability — Block egress to a payment provider to test circuit breakers.
- Message queue backlog — Pause consumers to build a backlog, then resume and verify ordering guarantees.
- Certificate expiry simulation — Inject an expired TLS cert to validate alerting and auto-renewal.
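The circuit breakers that a blocked-egress experiment exercises follow a simple state machine: trip after consecutive failures, fail fast while open, retry after a cooldown. A minimal sketch of the pattern (thresholds and names are illustrative, not a production implementation):

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive errors, short-circuit to a
    fallback while open, and allow a trial call after reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the real call until the cooldown expires.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A chaos experiment against a payment provider should confirm exactly this behavior: after the breaker trips, latency collapses to the fallback path instead of piling up timeouts.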
## Game Days: Structured Chaos Events
A game day is a scheduled, team-wide chaos exercise. It turns individual experiments into organizational learning.
### Game Day Playbook

```
1. PRE-GAME (1 week before)
   ├── Select 3-5 experiments
   ├── Notify on-call teams
   ├── Confirm rollback procedures
   └── Set blast radius limits

2. GAME DAY (2-4 hours)
   ├── Briefing: steady-state metrics, hypotheses, abort criteria
   ├── Run experiments sequentially
   ├── Real-time observation in shared dashboard
   └── Halt if any abort criterion is triggered

3. POST-GAME (same day)
   ├── Debrief: which hypotheses held, which broke
   ├── File action items for every failure
   └── Update runbooks with new learnings
```
### Who Should Participate
- SRE / Platform team — Runs the experiments and monitors infrastructure.
- Application engineers — Observe service behavior and validate business logic resilience.
- Product / Business stakeholders — Understand customer impact and prioritize remediation.
## Safety Mechanisms
Running chaos in production requires guardrails. Without them, an experiment becomes an outage.
### Blast Radius Control
Limit the scope of every experiment:
```yaml
safety:
  max_targets: 3        # Never affect more than 3 instances
  max_percentage: 30    # Or 30% of the fleet, whichever is smaller
  excluded_services:
    - payment-gateway
    - auth-service
  excluded_hours:
    - "17:00-09:00 UTC" # No experiments outside business hours
    - weekends
```
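A policy like the one above can be enforced in code before any target is touched. A sketch, assuming the fleet is known as (instance, service) pairs; the time-window check is omitted for brevity:

```python
def select_targets(fleet, max_targets=3, max_percentage=30, excluded_services=()):
    """Apply blast-radius limits: never more than max_targets instances or
    max_percentage of the eligible fleet, whichever is smaller, and never
    any instance belonging to an excluded service.

    fleet is a list of (instance_id, service_name) pairs."""
    eligible = [inst for inst, svc in fleet if svc not in excluded_services]
    cap = min(max_targets, len(eligible) * max_percentage // 100)
    return eligible[:cap]
```

In a real tool the eligible instances would be shuffled before slicing; the key property is that the cap is computed from the eligible fleet, not requested by the experiment.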
### Automatic Abort Conditions
Define machine-enforced stop criteria:
```yaml
abort_conditions:
  - metric: error_rate_5xx
    threshold: "> 2%"
    window: 1m
  - metric: p99_latency_ms
    threshold: "> 3000"
    window: 2m
  - metric: orders_per_minute
    threshold: "< 80% of baseline"
    window: 3m
```
When any condition triggers, the chaos tool automatically rolls back the injection and sends an alert.
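The watchdog itself reduces to a loop over conditions. A minimal sketch, assuming relative thresholds like "80% of baseline" have already been normalized to absolute numbers:

```python
import operator

# Comparison operators matching the threshold syntax in the config.
OPS = {">": operator.gt, "<": operator.lt}

def should_abort(conditions, metrics):
    """Return the name of the first tripped abort condition, or None.

    conditions is a list of {"metric", "op", "threshold"} dicts with
    numeric thresholds; metrics maps metric name to its current
    windowed value (missing metrics are skipped, not treated as failures)."""
    for cond in conditions:
        value = metrics.get(cond["metric"])
        if value is not None and OPS[cond["op"]](value, cond["threshold"]):
            return cond["metric"]
    return None
```

A tool would run this on every evaluation tick and invoke the experiment's rollback as soon as it returns a condition name.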
### Progressive Rollout
Start every new experiment at minimal blast radius and increase gradually:
- Canary — Inject into 1 instance, observe for 5 minutes.
- Limited — Expand to 10% of targets, observe for 10 minutes.
- Broad — Scale to the planned percentage if metrics remain healthy.
## Automated Chaos in CI/CD
Mature organizations run chaos experiments as part of their deployment pipeline:
```yaml
# .github/workflows/chaos-gate.yml
name: Chaos Gate
on:
  deployment_status:
jobs:
  chaos-smoke:
    # deployment_status has no activity types, so filter on the
    # status state inside the job condition instead.
    if: >-
      github.event.deployment_status.state == 'success' &&
      github.event.deployment.environment == 'production'
    runs-on: ubuntu-latest
    steps:
      - name: Wait for deployment stabilization
        run: sleep 120
      - name: Run chaos experiment suite
        uses: chaos-toolkit/run-experiment@v2
        with:
          experiment: experiments/post-deploy-suite.json
          abort-on-failure: true
      - name: Rollback deployment on chaos failure
        if: failure()
        run: |
          gh api repos/$GITHUB_REPOSITORY/deployments \
            --method POST \
            --field ref=$PREVIOUS_SHA \
            --field environment=production
```
### What to Test in CI
| Stage | Experiment | Blast Radius |
|---|---|---|
| Post-deploy | Kill 1 new pod, verify health check | Single pod |
| Nightly | Network partition between zones | One AZ |
| Weekly | Full dependency failure suite | 10-30% |
| Pre-release | Game day with new feature flags | Staging |
## Measuring Resilience Improvement
Chaos testing is only valuable if you track improvement over time.
### Key Metrics
- Mean Time to Detect (MTTD) — How quickly alerts fire after injection starts.
- Mean Time to Recover (MTTR) — How long until steady state is restored after injection stops.
- Blast Radius Containment — Did the failure stay within the expected scope?
- Hypothesis Success Rate — Percentage of experiments where the system behaved as predicted.
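MTTD and MTTR fall out of four timestamps per experiment; a small helper to compute them (parameter names are illustrative):

```python
from datetime import datetime

def experiment_timings(injected_at, alerted_at, stopped_at, steady_at):
    """Per-experiment timings in seconds, following the definitions above:
    TTD is injection start to first alert; TTR is injection stop to
    restored steady state. Averaging across experiments yields MTTD/MTTR."""
    return {
        "ttd": (alerted_at - injected_at).total_seconds(),
        "ttr": (steady_at - stopped_at).total_seconds(),
    }
```

Recording these four timestamps automatically in the chaos tool is what makes quarter-over-quarter comparison trustworthy.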
### Resilience Scorecard

```
Service: order-api
Quarter: Q1 2026

Experiments Run:        24
Hypotheses Confirmed:   19 (79%)
New Weaknesses Found:    5
Weaknesses Remediated:   4
Avg MTTD:  45s   (down from 2m last quarter)
Avg MTTR:  3.2m  (down from 8m last quarter)
```
Track the scorecard quarterly. A healthy chaos program shows increasing hypothesis success rates and decreasing MTTD/MTTR.
## Tooling Landscape
| Tool | Scope | Environment |
|---|---|---|
| Chaos Monkey | Instance termination | AWS |
| Litmus | Kubernetes-native chaos | Any K8s cluster |
| Gremlin | Full-spectrum SaaS | Any |
| Chaos Toolkit | Open-source, extensible | Any |
| AWS FIS | AWS-native fault injection | AWS |
| Toxiproxy | Network-level proxy faults | Any |
## Common Pitfalls
- Skipping the hypothesis — Without a prediction, you are just breaking things. Always write down what you expect before injecting.
- No abort mechanism — Every experiment must have automatic rollback. Manual-only rollback is a single point of failure.
- Testing only in staging — Staging environments rarely match production topology, traffic patterns, or data volume. Start in staging, graduate to production.
- Chaos without observability — If you cannot measure steady state, you cannot detect deviation. Invest in metrics and tracing first.
- Blame culture — Chaos experiments should reveal system weaknesses, not individual mistakes. Blameless post-mortems are essential.
## Getting Started Checklist
- Define 3 steady-state business metrics you can measure in real time
- Pick one low-risk experiment (e.g., kill a single stateless pod)
- Set abort conditions tied to your SLOs
- Run the experiment with the team watching a shared dashboard
- Write up findings and file remediation tickets
- Schedule your first game day within 30 days
This is article #352 on codelit.io — explore the full library for more on reliability engineering, system design, and production-grade infrastructure.