Chaos Engineering: A Practical Guide to Building Resilient Systems
Production will surprise you. Disks fill, networks partition, pods vanish, and DNS lies. Chaos engineering is the discipline of proactively injecting failures into a system to uncover weaknesses before they surface as outages. Instead of waiting for 3 a.m. pages, you break things on purpose — on your terms.
Principles of Chaos Engineering
The Principles of Chaos Engineering distill the practice into four ideas:
- Build a hypothesis around steady state — Define what "normal" looks like using business metrics (orders per second, error rate, p99 latency), not just CPU graphs.
- Vary real-world events — Simulate failures that actually happen: server crashes, network partitions, clock skew, certificate expiry.
- Run experiments in production — Staging cannot replicate the full topology, traffic shape, and data volume of production.
- Minimize blast radius — Start small. Affect one pod, one availability zone, one percentage of traffic. Expand only after confidence grows.
Steady-State Hypothesis
Every chaos experiment begins with a steady-state hypothesis: a measurable statement about the system's normal behavior.
> Hypothesis: When we terminate 1 of 3 API pods,
> the error rate stays below 0.5% and p99 latency stays under 300 ms
> for the next 5 minutes.
If the hypothesis holds, the system is resilient to that failure mode. If it breaks, you have found a weakness worth fixing — and you found it before your users did.
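Encoded in code, a hypothesis like the one above becomes a check you can evaluate automatically during an experiment. A minimal sketch, assuming you can already sample error rate and p99 latency from your metrics store (the class and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class SteadyStateHypothesis:
    """Thresholds are illustrative; substitute your own SLOs."""
    max_error_rate: float       # fraction of failed requests, 0.005 = 0.5%
    max_p99_latency_ms: float

    def holds(self, error_rate: float, p99_latency_ms: float) -> bool:
        """True if the observed metrics stay within the hypothesised bounds."""
        return (error_rate <= self.max_error_rate
                and p99_latency_ms <= self.max_p99_latency_ms)

# The hypothesis from the text: error rate below 0.5%, p99 under 300 ms.
hypothesis = SteadyStateHypothesis(max_error_rate=0.005, max_p99_latency_ms=300)
```

Fed live observations, `hypothesis.holds(0.002, 240)` passes while `hypothesis.holds(0.01, 240)` fails, which is exactly the signal an automated experiment runner needs.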
Blast Radius Control
Running chaos without guardrails is just breaking things. Control the blast radius with:
- Scope limiting — Target a single pod, container, or availability zone rather than the whole fleet.
- Traffic percentage — Route only a fraction of requests through the faulty path.
- Automatic rollback — Define abort conditions. If error rate exceeds a threshold, halt the experiment immediately.
- Time boxing — Cap experiment duration. A 60-second network partition is informative; a 60-minute one is an outage.
- Feature flags — Gate experiments behind flags so you can kill them in one click.
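Several of these guardrails (time boxing, automatic rollback on an abort condition) combine naturally into a small harness. A hedged sketch — the callables and thresholds are placeholders, not any real tool's API:

```python
import time

def run_with_abort(inject, rollback, read_error_rate,
                   max_error_rate=0.01, duration_s=60, poll_s=5):
    """Run a time-boxed experiment; roll back early if the abort
    condition (error rate above threshold) trips."""
    inject()                                    # start the fault
    deadline = time.monotonic() + duration_s    # time boxing
    try:
        while time.monotonic() < deadline:
            if read_error_rate() > max_error_rate:
                return "aborted"                # abort condition tripped
            time.sleep(poll_s)
        return "completed"
    finally:
        rollback()                              # always undo the fault
```

The `finally` block is the important part: the fault is removed whether the experiment completes, aborts, or crashes.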
Common Experiments
Here are the experiments teams run most often:
Kill Pod / Instance
Terminate a random pod or VM. Validates that the scheduler reschedules work and load balancers drain correctly.
```bash
kubectl delete pod api-server-7b4d9 --grace-period=0
```
Network Partition
Isolate a service from its dependency. Reveals whether timeouts, retries, and circuit breakers are configured correctly.
```bash
# Drop all egress traffic on eth0 for 30 seconds.
# (Limiting the loss to port 5432/Postgres would additionally require
# a prio qdisc and a tc filter; netem alone affects the whole interface.)
tc qdisc add dev eth0 root netem loss 100% \
  && sleep 30 \
  && tc qdisc del dev eth0 root
```
Latency Injection
Add artificial delay to a network call. Exposes cascading timeout failures and thread pool exhaustion.
```yaml
# Toxiproxy config
- name: redis-latency
  listen: 0.0.0.0:6380
  upstream: redis:6379
  toxics:
    - type: latency
      attributes:
        latency: 2000
        jitter: 500
```
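On the client side, the defense against a 2000 ms ± 500 ms latency toxic is a timeout. A self-contained Python sketch that simulates the injected latency locally (no Toxiproxy involved; the function names are illustrative) and shows the guard:

```python
import concurrent.futures
import random
import time

def call_redis_via_proxy(latency_ms=2000, jitter_ms=500):
    """Stand-in for a call routed through the latency toxic above:
    sleeps latency +/- jitter, then returns a fake reply."""
    delay_ms = latency_ms + random.uniform(-jitter_ms, jitter_ms)
    time.sleep(delay_ms / 1000)
    return "PONG"

def call_with_timeout(fn, timeout_s):
    """Client-side guard: give up rather than hold a thread forever."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return None   # treat as a miss and fall back
```

With a 500 ms timeout against a 2 s injected delay, every call returns `None` instead of blocking — the condition that, without a timeout, exhausts thread pools one stuck request at a time.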
CPU / Memory Stress
Saturate compute resources on a node. Tests autoscaling policies and pod eviction priorities.
```bash
stress-ng --cpu 4 --vm 2 --vm-bytes 512M --timeout 60s
```
DNS Failure
Return NXDOMAIN or SERVFAIL for a downstream service. Validates DNS caching and fallback behavior.
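A DNS-failure experiment can be prototyped entirely in-process by replacing the resolver. A sketch, assuming a simple last-known-address cache as the fallback (host names and addresses here are illustrative):

```python
import socket

def resolve_with_fallback(host, cache):
    """Resolve host; on DNS failure, serve the last known address
    from a local cache instead of failing the request."""
    try:
        addr = socket.getaddrinfo(host, None)[0][4][0]
        cache[host] = addr          # refresh cache on success
        return addr
    except socket.gaierror:
        if host in cache:
            return cache[host]      # stale but usable
        raise                       # no fallback available

# Fault injection: make every lookup fail, as NXDOMAIN/SERVFAIL would.
_real_getaddrinfo = socket.getaddrinfo
def _nxdomain(*args, **kwargs):
    raise socket.gaierror("simulated NXDOMAIN")
socket.getaddrinfo = _nxdomain
```

With a warm cache the service keeps answering through the outage; with a cold cache the error surfaces immediately. Restore `socket.getaddrinfo = _real_getaddrinfo` when the experiment ends.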
Clock Skew
Shift the system clock forward or backward. Breaks certificate validation, token expiry checks, and lease-based distributed locks.
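Expiry logic survives clock-skew experiments only when the clock is injectable, so the experiment can shift "now" without touching the host. A minimal sketch of a token-validity check and the skews that break it:

```python
import time

def token_valid(issued_at, ttl_s, now=None):
    """Token expiry check with an injectable clock: pass `now` to
    simulate skew instead of changing the system time."""
    now = time.time() if now is None else now
    # Valid from the moment of issue until the TTL elapses.
    return issued_at <= now < issued_at + ttl_s
```

Skewing the clock an hour forward expires a fresh token; skewing it backward past the issue time makes the token appear not-yet-valid — the same failure modes that hit certificate validation and lease-based locks.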
The Netflix Simian Army
Netflix pioneered chaos engineering with the Simian Army — a suite of tools that each inject a different failure:
| Tool | What It Does |
|---|---|
| Chaos Monkey | Randomly terminates instances in production |
| Latency Monkey | Injects artificial delays into RESTful calls |
| Conformity Monkey | Shuts down instances that don't follow best practices |
| Chaos Gorilla | Simulates an entire availability zone outage |
| Chaos Kong | Simulates an entire region outage |
Chaos Monkey remains the most widely adopted. Netflix open-sourced it, and it now integrates with Spinnaker for automated deployments.
Chaos Engineering Tools
Chaos Monkey (Netflix)
The original. Runs on Spinnaker, randomly terminates instances on a schedule. Best for teams already in the Netflix/Spinnaker ecosystem.
Litmus
A CNCF project designed for Kubernetes-native chaos. Experiments are defined as CRDs (ChaosEngine, ChaosExperiment) and stored in a hub.
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  appinfo:
    appns: production
    applabel: app=api-server
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
```
Gremlin
A commercial platform with a polished UI, RBAC, and safety controls built in. Offers attack categories: resource, network, state, and application-level. Great for enterprises that need audit trails.
Toxiproxy (Shopify)
A TCP proxy for simulating network conditions. Language-agnostic, lightweight, and excellent for local development and integration tests.
Chaos Mesh
Another CNCF project for Kubernetes. Supports time chaos (clock skew), I/O chaos (filesystem faults), and kernel chaos. Managed through a dashboard or CRDs.
Game Days
A game day is a scheduled, team-wide chaos exercise. Think of it as a fire drill for your infrastructure.
How to Run a Game Day
- Choose a scenario — e.g., "Primary database becomes unreachable for 2 minutes."
- Define the steady-state hypothesis — "The application serves degraded but functional responses; error rate stays below 1%."
- Brief the team — Everyone knows the experiment is happening. Observability dashboards are open.
- Execute the experiment — Inject the failure.
- Observe — Watch dashboards, logs, and alerts. Note anything unexpected.
- Debrief — Write up findings. What broke? What held? What needs fixing?
- Remediate — File tickets, fix the gaps, and re-run the experiment to verify the fix.
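The execute/observe/debrief steps above can be sketched as a small runner that records a finding for the write-up. Every callable is supplied by the team running the exercise; the names are illustrative:

```python
def run_game_day(scenario, hypothesis_holds, inject, rollback, observe):
    """Steps 4-6 of a game day as code: execute the experiment,
    observe, and record a verdict for the debrief."""
    inject()                      # 4. execute: inject the failure
    try:
        observations = observe()  # 5. observe: dashboards, logs, alerts
    finally:
        rollback()                # always undo the injected fault
    verdict = "held" if hypothesis_holds(observations) else "broke"
    return {"scenario": scenario,          # 6. raw material for the debrief
            "verdict": verdict,
            "observations": observations}
```

The returned record is deliberately boring: a scenario, a verdict, and the evidence — exactly what the debrief and the remediation tickets need.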
Game days build organizational muscle memory. The more you practice failure, the more calmly you respond to real incidents.
Chaos in CI/CD
Mature teams shift chaos left by integrating experiments into the deployment pipeline:
```
┌──────────┐    ┌───────────┐    ┌────────────┐    ┌──────────┐
│ Build &  │───▶│  Deploy   │───▶│   Chaos    │───▶│ Promote  │
│   Test   │    │  Staging  │    │ Experiment │    │ to Prod  │
└──────────┘    └───────────┘    └────────────┘    └──────────┘
                                       │
                                 Abort if SLO
                                   violated
```
- Run lightweight chaos experiments (pod kill, latency injection) in staging after every deploy.
- Gate promotion to production on passing chaos results.
- Use Litmus or Chaos Mesh CRDs triggered by your CI runner (GitHub Actions, GitLab CI, Argo Workflows).
- Store experiment results as artifacts for audit and trend analysis.
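Gating promotion can be as simple as a script that reads the stored artifact and fails the CI job on any violation. A sketch — the artifact format shown (a list of name/verdict records) is an assumption, not a standard; adapt it to your tooling:

```python
import json

def gate_on_chaos_results(path):
    """CI gate: read the experiment-result artifact and return the
    exit code for the pipeline stage (0 = promote, 1 = abort)."""
    with open(path) as f:
        results = json.load(f)
    failed = [r["name"] for r in results if r.get("verdict") != "passed"]
    if failed:
        print("chaos gate failed:", ", ".join(failed))
        return 1   # non-zero exit blocks promotion
    return 0       # all experiments passed; safe to promote
```

Wired into the pipeline as `sys.exit(gate_on_chaos_results("results.json"))` after the experiment stage, the promote step only runs when every experiment held.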
When to Start
You do not need a perfectly observable system to start chaos engineering. Start with:
- A single service you understand well.
- One hypothesis about a known failure mode.
- A safe environment (staging) and a plan to graduate to production.
The goal is not to prove your system is perfect. The goal is to find the weaknesses you didn't know existed — and fix them before your users find them for you.
Building resilient distributed systems? Codelit helps teams design, visualize, and document architectures that survive chaos — not just on game days, but every day.
Article #177 on codelit.io