Chaos Engineering: A Practical Guide to Building Resilient Systems
Production will surprise you. Disks fill, networks partition, pods vanish, and DNS lies. Chaos engineering is the discipline of proactively injecting failures into a system to uncover weaknesses before they surface as outages. Instead of waiting for 3 a.m. pages, you break things on purpose — on your terms.
Principles of Chaos Engineering
The Principles of Chaos Engineering distill the practice into four ideas:
- Build a hypothesis around steady state — Define what "normal" looks like using business metrics (orders per second, error rate, p99 latency), not just CPU graphs.
- Vary real-world events — Simulate failures that actually happen: server crashes, network partitions, clock skew, certificate expiry.
- Run experiments in production — Staging cannot replicate the full topology, traffic shape, and data volume of production.
- Minimize blast radius — Start small. Affect one pod, one availability zone, one percentage of traffic. Expand only after confidence grows.
Steady-State Hypothesis
Every chaos experiment begins with a steady-state hypothesis: a measurable statement about the system's normal behavior.
> Hypothesis: When we terminate 1 of 3 API pods,
> the error rate stays below 0.5% and p99 latency stays under 300 ms
> for the next 5 minutes.
If the hypothesis holds, the system is resilient to that failure mode. If it breaks, you have found a weakness worth fixing — and you found it before your users did.
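Encoded in code, a hypothesis like the one above becomes a check you can evaluate automatically during an experiment. A minimal sketch, assuming you can already sample error rate and p99 latency from your metrics store (the class and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class SteadyStateHypothesis:
    """Thresholds are illustrative; substitute your own SLOs."""
    max_error_rate: float       # fraction of failed requests, 0.005 = 0.5%
    max_p99_latency_ms: float

    def holds(self, error_rate: float, p99_latency_ms: float) -> bool:
        """True if the observed metrics stay within the hypothesised bounds."""
        return (error_rate <= self.max_error_rate
                and p99_latency_ms <= self.max_p99_latency_ms)

# The hypothesis from the text: error rate below 0.5%, p99 under 300 ms.
hypothesis = SteadyStateHypothesis(max_error_rate=0.005, max_p99_latency_ms=300)
```

Fed live observations, `hypothesis.holds(0.002, 240)` passes while `hypothesis.holds(0.01, 240)` fails, which is exactly the signal an automated experiment runner needs.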
Blast Radius Control
Running chaos without guardrails is just breaking things. Control the blast radius with:
- Scope limiting — Target a single pod, container, or availability zone rather than the whole fleet.
- Traffic percentage — Route only a fraction of requests through the faulty path.
- Automatic rollback — Define abort conditions. If error rate exceeds a threshold, halt the experiment immediately.
- Time boxing — Cap experiment duration. A 60-second network partition is informative; a 60-minute one is an outage.
- Feature flags — Gate experiments behind flags so you can kill them in one click.
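Several of these guardrails (time boxing, automatic rollback on an abort condition) combine naturally into a small harness. A hedged sketch — the callables and thresholds are placeholders, not any real tool's API:

```python
import time

def run_with_abort(inject, rollback, read_error_rate,
                   max_error_rate=0.01, duration_s=60, poll_s=5):
    """Run a time-boxed experiment; roll back early if the abort
    condition (error rate above threshold) trips."""
    inject()                                    # start the fault
    deadline = time.monotonic() + duration_s    # time boxing
    try:
        while time.monotonic() < deadline:
            if read_error_rate() > max_error_rate:
                return "aborted"                # abort condition tripped
            time.sleep(poll_s)
        return "completed"
    finally:
        rollback()                              # always undo the fault
```

The `finally` block is the important part: the fault is removed whether the experiment completes, aborts, or crashes.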
Common Experiments
Here are the experiments teams run most often:
Kill Pod / Instance
Terminate a random pod or VM. Validates that the scheduler reschedules work and load balancers drain correctly.
```bash
kubectl delete pod api-server-7b4d9 --grace-period=0
```
Network Partition
Isolate a service from its dependency. Reveals whether timeouts, retries, and circuit breakers are configured correctly.
```bash
# Drop all egress traffic on eth0 for 30 seconds.
# (Limiting the loss to port 5432/Postgres would additionally require
# a prio qdisc and a tc filter; netem alone affects the whole interface.)
tc qdisc add dev eth0 root netem loss 100% \
  && sleep 30 \
  && tc qdisc del dev eth0 root
```
Latency Injection
Add artificial delay to a network call. Exposes cascading timeout failures and thread pool exhaustion.
```yaml
# Toxiproxy config
- name: redis-latency
  listen: 0.0.0.0:6380
  upstream: redis:6379
  toxics:
    - type: latency
      attributes:
        latency: 2000
        jitter: 500
```
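On the client side, the defense against a 2000 ms ± 500 ms latency toxic is a timeout. A self-contained Python sketch that simulates the injected latency locally (no Toxiproxy involved; the function names are illustrative) and shows the guard:

```python
import concurrent.futures
import random
import time

def call_redis_via_proxy(latency_ms=2000, jitter_ms=500):
    """Stand-in for a call routed through the latency toxic above:
    sleeps latency +/- jitter, then returns a fake reply."""
    delay_ms = latency_ms + random.uniform(-jitter_ms, jitter_ms)
    time.sleep(delay_ms / 1000)
    return "PONG"

def call_with_timeout(fn, timeout_s):
    """Client-side guard: give up rather than hold a thread forever."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return None   # treat as a miss and fall back
```

With a 500 ms timeout against a 2 s injected delay, every call returns `None` instead of blocking — the condition that, without a timeout, exhausts thread pools one stuck request at a time.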
CPU / Memory Stress
Saturate compute resources on a node. Tests autoscaling policies and pod eviction priorities.
```bash
stress-ng --cpu 4 --vm 2 --vm-bytes 512M --timeout 60s
```
DNS Failure
Return NXDOMAIN or SERVFAIL for a downstream service. Validates DNS caching and fallback behavior.
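A DNS-failure experiment can be prototyped entirely in-process by replacing the resolver. A sketch, assuming a simple last-known-address cache as the fallback (host names and addresses here are illustrative):

```python
import socket

def resolve_with_fallback(host, cache):
    """Resolve host; on DNS failure, serve the last known address
    from a local cache instead of failing the request."""
    try:
        addr = socket.getaddrinfo(host, None)[0][4][0]
        cache[host] = addr          # refresh cache on success
        return addr
    except socket.gaierror:
        if host in cache:
            return cache[host]      # stale but usable
        raise                       # no fallback available

# Fault injection: make every lookup fail, as NXDOMAIN/SERVFAIL would.
_real_getaddrinfo = socket.getaddrinfo
def _nxdomain(*args, **kwargs):
    raise socket.gaierror("simulated NXDOMAIN")
socket.getaddrinfo = _nxdomain
```

With a warm cache the service keeps answering through the outage; with a cold cache the error surfaces immediately. Restore `socket.getaddrinfo = _real_getaddrinfo` when the experiment ends.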
Clock Skew
Shift the system clock forward or backward. Breaks certificate validation, token expiry checks, and lease-based distributed locks.
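Expiry logic survives clock-skew experiments only when the clock is injectable, so the experiment can shift "now" without touching the host. A minimal sketch of a token-validity check and the skews that break it:

```python
import time

def token_valid(issued_at, ttl_s, now=None):
    """Token expiry check with an injectable clock: pass `now` to
    simulate skew instead of changing the system time."""
    now = time.time() if now is None else now
    # Valid from the moment of issue until the TTL elapses.
    return issued_at <= now < issued_at + ttl_s
```

Skewing the clock an hour forward expires a fresh token; skewing it backward past the issue time makes the token appear not-yet-valid — the same failure modes that hit certificate validation and lease-based locks.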
The Netflix Simian Army
Netflix pioneered chaos engineering with the Simian Army — a suite of tools that each inject a different failure:
| Tool | What It Does |
|---|---|
| Chaos Monkey | Randomly terminates instances in production |
| Latency Monkey | Injects artificial delays into RESTful calls |
| Conformity Monkey | Shuts down instances that don't follow best practices |
| Chaos Gorilla | Simulates an entire availability zone outage |
| Chaos Kong | Simulates an entire region outage |
Chaos Monkey remains the most widely adopted. Netflix open-sourced it, and it now integrates with Spinnaker for automated deployments.
Chaos Engineering Tools
Chaos Monkey (Netflix)
The original. Runs on Spinnaker, randomly terminates instances on a schedule. Best for teams already in the Netflix/Spinnaker ecosystem.
Litmus
A CNCF project designed for Kubernetes-native chaos. Experiments are defined as CRDs (ChaosEngine, ChaosExperiment) and stored in a hub.
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  appinfo:
    appns: production
    applabel: app=api-server
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
```
Gremlin
A commercial platform with a polished UI, RBAC, and safety controls built in. Offers attack categories: resource, network, state, and application-level. Great for enterprises that need audit trails.
Toxiproxy (Shopify)
A TCP proxy for simulating network conditions. Language-agnostic, lightweight, and excellent for local development and integration tests.
Chaos Mesh
Another CNCF project for Kubernetes. Supports time chaos (clock skew), I/O chaos (filesystem faults), and kernel chaos. Managed through a dashboard or CRDs.
Game Days
A game day is a scheduled, team-wide chaos exercise. Think of it as a fire drill for your infrastructure.
How to Run a Game Day
- Choose a scenario — e.g., "Primary database becomes unreachable for 2 minutes."
- Define the steady-state hypothesis — "The application serves degraded but functional responses; error rate stays below 1%."
- Brief the team — Everyone knows the experiment is happening. Observability dashboards are open.
- Execute the experiment — Inject the failure.
- Observe — Watch dashboards, logs, and alerts. Note anything unexpected.
- Debrief — Write up findings. What broke? What held? What needs fixing?
- Remediate — File tickets, fix the gaps, and re-run the experiment to verify the fix.
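The execute/observe/debrief steps above can be sketched as a small runner that records a finding for the write-up. Every callable is supplied by the team running the exercise; the names are illustrative:

```python
def run_game_day(scenario, hypothesis_holds, inject, rollback, observe):
    """Steps 4-6 of a game day as code: execute the experiment,
    observe, and record a verdict for the debrief."""
    inject()                      # 4. execute: inject the failure
    try:
        observations = observe()  # 5. observe: dashboards, logs, alerts
    finally:
        rollback()                # always undo the injected fault
    verdict = "held" if hypothesis_holds(observations) else "broke"
    return {"scenario": scenario,          # 6. raw material for the debrief
            "verdict": verdict,
            "observations": observations}
```

The returned record is deliberately boring: a scenario, a verdict, and the evidence — exactly what the debrief and the remediation tickets need.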
Game days build organizational muscle memory. The more you practice failure, the more calmly you respond to real incidents.
Chaos in CI/CD
Mature teams shift chaos left by integrating experiments into the deployment pipeline:
```
┌──────────┐    ┌───────────┐    ┌────────────┐    ┌──────────┐
│ Build &  │───▶│  Deploy   │───▶│   Chaos    │───▶│ Promote  │
│   Test   │    │  Staging  │    │ Experiment │    │ to Prod  │
└──────────┘    └───────────┘    └────────────┘    └──────────┘
                                       │
                                 Abort if SLO
                                   violated
```
- Run lightweight chaos experiments (pod kill, latency injection) in staging after every deploy.
- Gate promotion to production on passing chaos results.
- Use Litmus or Chaos Mesh CRDs triggered by your CI runner (GitHub Actions, GitLab CI, Argo Workflows).
- Store experiment results as artifacts for audit and trend analysis.
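Gating promotion can be as simple as a script that reads the stored artifact and fails the CI job on any violation. A sketch — the artifact format shown (a list of name/verdict records) is an assumption, not a standard; adapt it to your tooling:

```python
import json

def gate_on_chaos_results(path):
    """CI gate: read the experiment-result artifact and return the
    exit code for the pipeline stage (0 = promote, 1 = abort)."""
    with open(path) as f:
        results = json.load(f)
    failed = [r["name"] for r in results if r.get("verdict") != "passed"]
    if failed:
        print("chaos gate failed:", ", ".join(failed))
        return 1   # non-zero exit blocks promotion
    return 0       # all experiments passed; safe to promote
```

Wired into the pipeline as `sys.exit(gate_on_chaos_results("results.json"))` after the experiment stage, the promote step only runs when every experiment held.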
When to Start
You do not need a perfectly observable system to start chaos engineering. Start with:
- A single service you understand well.
- One hypothesis about a known failure mode.
- A safe environment (staging) and a plan to graduate to production.
The goal is not to prove your system is perfect. The goal is to find the weaknesses you didn't know existed — and fix them before your users find them for you.
Building resilient distributed systems? Codelit helps teams design, visualize, and document architectures that survive chaos — not just on game days, but every day.
Article #177 on codelit.io