# Chaos Testing in Production: Breaking Things on Purpose
Every distributed system fails. The question is whether you discover failure modes during a controlled experiment or at 3 AM when customers are affected. Chaos testing — deliberately injecting faults into production — converts unknown unknowns into documented, rehearsed scenarios with proven runbooks.
## The Chaos Engineering Discipline
Netflix coined the term with Chaos Monkey in 2011. The discipline has since matured into a rigorous scientific method:
- Define steady state — Pick a measurable business metric: orders per minute, p99 latency, error rate.
- Hypothesize — "If we terminate 30% of API pods, the load balancer reroutes traffic and error rate stays below 0.5%."
- Inject failure — Run the experiment in production (or a production-like staging environment).
- Observe — Compare steady-state metrics during and after injection.
- Learn — If the hypothesis holds, confidence increases. If it breaks, you found a weakness before customers did.
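The five steps above can be sketched as a small harness. This is a minimal illustration, not any particular tool's API; `measure`, `inject`, and `rollback` are placeholder callables standing in for real metric queries and fault injectors:

```python
import statistics

def run_experiment(measure, inject, rollback, baseline, tolerance=0.05, samples=5):
    """Minimal chaos-experiment loop: verify steady state, inject a fault,
    observe, roll back, and report whether the hypothesis held."""
    # 1-2. Confirm steady state before injecting anything.
    before = [measure() for _ in range(samples)]
    if abs(statistics.mean(before) - baseline) / baseline > tolerance:
        raise RuntimeError("System not in steady state; aborting experiment")
    # 3-4. Inject the failure, observe, and always roll back.
    inject()
    try:
        during = [measure() for _ in range(samples)]
    finally:
        rollback()
    # 5. The hypothesis holds if the metric stayed within tolerance of baseline.
    deviation = abs(statistics.mean(during) - baseline) / baseline
    return {"deviation": deviation, "hypothesis_held": deviation <= tolerance}
```

The `finally` block matters: rollback must run even if observation itself fails, or the experiment becomes an outage.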
## Failure Injection Categories
Chaos experiments map to four broad categories of failure:
### Network Failures

```yaml
# Example: tc-based network delay injection
experiment:
  name: api-latency-spike
  target: service/order-api
  injection:
    type: network-delay
    latency: 500ms
    jitter: 100ms
    duration: 5m
    percentage: 50
```
Common network faults:
- Latency injection — Add delay to inter-service calls to test timeout handling.
- Packet loss — Drop a percentage of packets to simulate degraded links.
- DNS failure — Return NXDOMAIN for a dependency to verify fallback behavior.
- Partition — Block traffic between two availability zones.
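Under the hood, a config like the one above is typically translated into Linux `tc`/`netem` commands. A hedged sketch of that translation (the interface name is illustrative, and applying delay to only a percentage of traffic requires additional `tc` filter rules, omitted here):

```python
def tc_delay_command(interface: str, latency_ms: int, jitter_ms: int) -> list[str]:
    """Build the tc/netem command that adds latency and jitter to an
    interface's egress traffic."""
    return [
        "tc", "qdisc", "add", "dev", interface,
        "root", "netem",
        "delay", f"{latency_ms}ms", f"{jitter_ms}ms",
    ]

def tc_delay_rollback(interface: str) -> list[str]:
    """Build the matching rollback command that removes the qdisc."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]
```

Every injection command should be generated alongside its rollback, so the abort path never has to improvise.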
### Compute Failures
- Pod/container kill — Terminate random instances to test auto-scaling and load balancing.
- CPU stress — Saturate CPU to verify throttling and priority-based scheduling.
- Memory pressure — Allocate memory until OOM-killer activates to test graceful degradation.
- Node drain — Cordon and drain a Kubernetes node to test pod rescheduling.
### State Failures
- Disk fill — Fill the data volume to test write-ahead log behavior and alerting.
- Database failover — Trigger a primary-to-replica promotion to measure failover time.
- Cache eviction — Flush Redis/Memcached to test cold-cache performance.
- Clock skew — Shift the system clock to surface time-dependent bugs in TLS, tokens, or cron jobs.
### Dependency Failures
- Third-party API unavailability — Block egress to a payment provider to test circuit breakers.
- Message queue backlog — Pause consumers to build a backlog, then resume and verify ordering guarantees.
- Certificate expiry simulation — Inject an expired TLS cert to validate alerting and auto-renewal.
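The circuit breakers that a blocked-egress experiment exercises follow a simple state machine: trip after consecutive failures, fail fast while open, retry after a cooldown. A minimal sketch of the pattern (thresholds and names are illustrative, not a production implementation):

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive errors, short-circuit to a
    fallback while open, and allow a trial call after reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the real call until the cooldown expires.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A chaos experiment against a payment provider should confirm exactly this behavior: after the breaker trips, latency collapses to the fallback path instead of piling up timeouts.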
## Game Days: Structured Chaos Events
A game day is a scheduled, team-wide chaos exercise. It turns individual experiments into organizational learning.
### Game Day Playbook

```
1. PRE-GAME (1 week before)
   ├── Select 3-5 experiments
   ├── Notify on-call teams
   ├── Confirm rollback procedures
   └── Set blast radius limits

2. GAME DAY (2-4 hours)
   ├── Briefing: steady-state metrics, hypotheses, abort criteria
   ├── Run experiments sequentially
   ├── Real-time observation in shared dashboard
   └── Halt if any abort criterion is triggered

3. POST-GAME (same day)
   ├── Debrief: which hypotheses held, which broke
   ├── File action items for every failure
   └── Update runbooks with new learnings
```
### Who Should Participate
- SRE / Platform team — Runs the experiments and monitors infrastructure.
- Application engineers — Observe service behavior and validate business logic resilience.
- Product / Business stakeholders — Understand customer impact and prioritize remediation.
## Safety Mechanisms
Running chaos in production requires guardrails. Without them, an experiment becomes an outage.
### Blast Radius Control
Limit the scope of every experiment:
```yaml
safety:
  max_targets: 3        # Never affect more than 3 instances
  max_percentage: 30    # Or 30% of the fleet, whichever is smaller
  excluded_services:
    - payment-gateway
    - auth-service
  excluded_hours:
    - "17:00-09:00 UTC" # No experiments outside business hours
    - weekends
```
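A policy like the one above can be enforced in code before any target is touched. A sketch, assuming the fleet is known as (instance, service) pairs; the time-window check is omitted for brevity:

```python
def select_targets(fleet, max_targets=3, max_percentage=30, excluded_services=()):
    """Apply blast-radius limits: never more than max_targets instances or
    max_percentage of the eligible fleet, whichever is smaller, and never
    any instance belonging to an excluded service.

    fleet is a list of (instance_id, service_name) pairs."""
    eligible = [inst for inst, svc in fleet if svc not in excluded_services]
    cap = min(max_targets, len(eligible) * max_percentage // 100)
    return eligible[:cap]
```

In a real tool the eligible instances would be shuffled before slicing; the key property is that the cap is computed from the eligible fleet, not requested by the experiment.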
### Automatic Abort Conditions
Define machine-enforced stop criteria:
```yaml
abort_conditions:
  - metric: error_rate_5xx
    threshold: "> 2%"
    window: 1m
  - metric: p99_latency_ms
    threshold: "> 3000"
    window: 2m
  - metric: orders_per_minute
    threshold: "< 80% of baseline"
    window: 3m
```
When any condition triggers, the chaos tool automatically rolls back the injection and sends an alert.
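The watchdog itself reduces to a loop over conditions. A minimal sketch, assuming relative thresholds like "80% of baseline" have already been normalized to absolute numbers:

```python
import operator

# Comparison operators matching the threshold syntax in the config.
OPS = {">": operator.gt, "<": operator.lt}

def should_abort(conditions, metrics):
    """Return the name of the first tripped abort condition, or None.

    conditions is a list of {"metric", "op", "threshold"} dicts with
    numeric thresholds; metrics maps metric name to its current
    windowed value (missing metrics are skipped, not treated as failures)."""
    for cond in conditions:
        value = metrics.get(cond["metric"])
        if value is not None and OPS[cond["op"]](value, cond["threshold"]):
            return cond["metric"]
    return None
```

A tool would run this on every evaluation tick and invoke the experiment's rollback as soon as it returns a condition name.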
### Progressive Rollout
Start every new experiment at minimal blast radius and increase gradually:
- Canary — Inject into 1 instance, observe for 5 minutes.
- Limited — Expand to 10% of targets, observe for 10 minutes.
- Broad — Scale to the planned percentage if metrics remain healthy.
## Automated Chaos in CI/CD
Mature organizations run chaos experiments as part of their deployment pipeline:
```yaml
# .github/workflows/chaos-gate.yml
name: Chaos Gate
on:
  deployment_status:
jobs:
  chaos-smoke:
    # deployment_status has no activity types, so filter on the
    # status state inside the job condition instead.
    if: >-
      github.event.deployment_status.state == 'success' &&
      github.event.deployment.environment == 'production'
    runs-on: ubuntu-latest
    steps:
      - name: Wait for deployment stabilization
        run: sleep 120
      - name: Run chaos experiment suite
        uses: chaos-toolkit/run-experiment@v2
        with:
          experiment: experiments/post-deploy-suite.json
          abort-on-failure: true
      - name: Rollback deployment on chaos failure
        if: failure()
        run: |
          gh api repos/$GITHUB_REPOSITORY/deployments \
            --method POST \
            --field ref=$PREVIOUS_SHA \
            --field environment=production
```
### What to Test in CI
| Stage | Experiment | Blast Radius |
|---|---|---|
| Post-deploy | Kill 1 new pod, verify health check | Single pod |
| Nightly | Network partition between zones | One AZ |
| Weekly | Full dependency failure suite | 10-30% |
| Pre-release | Game day with new feature flags | Staging |
## Measuring Resilience Improvement
Chaos testing is only valuable if you track improvement over time.
### Key Metrics
- Mean Time to Detect (MTTD) — How quickly alerts fire after injection starts.
- Mean Time to Recover (MTTR) — How long until steady state is restored after injection stops.
- Blast Radius Containment — Did the failure stay within the expected scope?
- Hypothesis Success Rate — Percentage of experiments where the system behaved as predicted.
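MTTD and MTTR fall out of four timestamps per experiment; a small helper to compute them (parameter names are illustrative):

```python
from datetime import datetime

def experiment_timings(injected_at, alerted_at, stopped_at, steady_at):
    """Per-experiment timings in seconds, following the definitions above:
    TTD is injection start to first alert; TTR is injection stop to
    restored steady state. Averaging across experiments yields MTTD/MTTR."""
    return {
        "ttd": (alerted_at - injected_at).total_seconds(),
        "ttr": (steady_at - stopped_at).total_seconds(),
    }
```

Recording these four timestamps automatically in the chaos tool is what makes quarter-over-quarter comparison trustworthy.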
### Resilience Scorecard

```
Service: order-api
Quarter: Q1 2026

Experiments Run:        24
Hypotheses Confirmed:   19 (79%)
New Weaknesses Found:    5
Weaknesses Remediated:   4
Avg MTTD:  45s   (down from 2m last quarter)
Avg MTTR:  3.2m  (down from 8m last quarter)
```
Track the scorecard quarterly. A healthy chaos program shows increasing hypothesis success rates and decreasing MTTD/MTTR.
## Tooling Landscape
| Tool | Scope | Environment |
|---|---|---|
| Chaos Monkey | Instance termination | AWS |
| Litmus | Kubernetes-native chaos | Any K8s cluster |
| Gremlin | Full-spectrum SaaS | Any |
| Chaos Toolkit | Open-source, extensible | Any |
| AWS FIS | AWS-native fault injection | AWS |
| Toxiproxy | Network-level proxy faults | Any |
## Common Pitfalls
- Skipping the hypothesis — Without a prediction, you are just breaking things. Always write down what you expect before injecting.
- No abort mechanism — Every experiment must have automatic rollback. Manual-only rollback is a single point of failure.
- Testing only in staging — Staging environments rarely match production topology, traffic patterns, or data volume. Start in staging, graduate to production.
- Chaos without observability — If you cannot measure steady state, you cannot detect deviation. Invest in metrics and tracing first.
- Blame culture — Chaos experiments should reveal system weaknesses, not individual mistakes. Blameless post-mortems are essential.
## Getting Started Checklist
- Define 3 steady-state business metrics you can measure in real time
- Pick one low-risk experiment (e.g., kill a single stateless pod)
- Set abort conditions tied to your SLOs
- Run the experiment with the team watching a shared dashboard
- Write up findings and file remediation tickets
- Schedule your first game day within 30 days
This is article #352 on codelit.io — explore the full library for more on reliability engineering, system design, and production-grade infrastructure.