Testing Strategy: From the Testing Pyramid to Chaos Engineering
Shipping fast means nothing if every release is a coin flip between a clean deploy and a 3 a.m. incident. A well-designed testing strategy turns that coin flip into a confidence gradient — cheap, fast checks at the bottom, expensive but realistic checks at the top, and feedback loops that catch regressions before they reach users.
The Testing Pyramid
Mike Cohn's testing pyramid remains the most practical mental model for allocating test effort:
     ╱  E2E  ╲        ← few, slow, expensive
    ╱─────────╲
   ╱Integration╲      ← moderate count
  ╱─────────────╲
 ╱  Unit Tests   ╲    ← many, fast, cheap
╱─────────────────╲
Unit tests verify individual functions or classes in isolation. They run in milliseconds, require no infrastructure, and give pinpoint failure messages. Aim for thousands.
Integration tests verify that two or more modules cooperate correctly — a service talking to its database, or an API gateway routing to a downstream service. They need real (or containerised) dependencies and run in seconds.
End-to-end (E2E) tests simulate real user journeys through the full stack. Tools like Playwright and Cypress drive a browser; API-level E2E tests hit the deployed system. They are slow, flaky, and expensive — but they catch issues nothing else can.
The pyramid's lesson is proportionality: if your E2E suite takes longer than your deploy pipeline, the feedback loop is broken.
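At the base of the pyramid, a unit test is just a fast, dependency-free assertion with a pinpoint failure message. A minimal Python sketch (the function and values are invented for illustration):

```python
# A pure function and its unit test: no I/O, no infrastructure,
# millisecond runtime, and an exact failure location when it breaks.
def apply_discount(total: float, percent: float) -> float:
    """Return the total after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(total * (1 - percent / 100), 2)

def test_apply_discount():
    assert apply_discount(50.0, 10) == 45.0
    assert apply_discount(100.0, 0) == 100.0

test_apply_discount()
```

A runner such as pytest would discover and execute `test_apply_discount` automatically; it is called directly here so the sketch is self-contained.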
TDD vs BDD
Test-Driven Development (TDD) follows the red-green-refactor cycle:
- Write a failing test that defines the desired behaviour.
- Write the minimum code to make it pass.
- Refactor while keeping tests green.
TDD works best for library code, algorithms, and domain logic where the interface is well-understood.
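The red-green-refactor cycle can be sketched in Python; the classic FizzBuzz kata is used here purely as an example:

```python
# Step 1 (red): write the test first. It fails at this point
# because fizzbuzz does not exist yet.
def test_fizzbuzz():
    assert fizzbuzz(3) == "Fizz"
    assert fizzbuzz(5) == "Buzz"
    assert fizzbuzz(15) == "FizzBuzz"
    assert fizzbuzz(7) == "7"

# Step 2 (green): write the minimum code that makes it pass.
def fizzbuzz(n: int) -> str:
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

# Step 3 (refactor): with the test green, restructure freely
# (e.g. extract the divisibility rules) and rerun the test.
test_fizzbuzz()
```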
Behaviour-Driven Development (BDD) extends TDD with a shared language between developers, testers, and product owners. Specifications are written in Gherkin syntax:
Feature: Cart checkout
  Scenario: Apply discount code
    Given a cart with two items totalling $50
    When the user applies code "SAVE10"
    Then the total should be $45
Tools like Cucumber, SpecFlow, and Behave parse these specs into executable tests. BDD shines when requirements are ambiguous and stakeholder alignment matters more than raw speed.
The two are not mutually exclusive. Many teams use TDD for unit tests and BDD for acceptance tests.
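Behind a Gherkin scenario, each step maps to a step-definition function. In Behave these would be @given/@when/@then-decorated functions receiving a shared context; the simplified, runner-free Python sketch below mirrors the checkout scenario (the Cart class and discount table are invented for illustration):

```python
# Plain-function sketch of behave-style step definitions.
class Cart:
    CODES = {"SAVE10": 0.10}  # assumed discount table

    def __init__(self, items):
        self.total = sum(items)

    def apply_code(self, code):
        self.total = round(self.total * (1 - self.CODES[code]), 2)

def given_a_cart_with_items(context, items):
    context["cart"] = Cart(items)

def when_user_applies_code(context, code):
    context["cart"].apply_code(code)

def then_total_should_be(context, expected):
    assert context["cart"].total == expected

# Drive the scenario exactly as the Gherkin spec describes it.
context = {}
given_a_cart_with_items(context, [20, 30])  # two items totalling $50
when_user_applies_code(context, "SAVE10")
then_total_should_be(context, 45)
```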
Contract Testing with Pact
In a microservices architecture, integration tests between services become combinatorially expensive. Contract testing solves this by decoupling provider and consumer verification.
Pact is the most widely adopted contract testing framework:
- The consumer writes a test that defines the requests it will make and the responses it expects. Pact records these as a contract (a JSON pact file).
- The provider replays the contract against its real implementation and verifies every interaction.
- A Pact Broker stores contracts, tracks verification status, and gates deployments via can-i-deploy.
Consumer Test ──▶ Pact File ──▶ Pact Broker ──▶ Provider Verification
Contract testing catches breaking API changes without deploying both services together. It is faster than integration tests and more reliable than shared staging environments.
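The idea can be illustrated without the real Pact tooling. In this simplified Python sketch, a contract is a recorded request/response pair and the provider verifies it against its own handler (all names and payloads are invented; the actual Pact workflow serialises contracts to pact files and replays them over HTTP):

```python
# What a consumer test would record into a pact file: the request it
# will make and the response it expects.
contract = {
    "request": {"method": "GET", "path": "/products/42"},
    "response": {"status": 200, "body": {"id": 42, "name": "Widget"}},
}

def provider_handler(method, path):
    """Stand-in for the provider's real implementation."""
    if method == "GET" and path == "/products/42":
        return 200, {"id": 42, "name": "Widget"}
    return 404, {}

def verify(contract, handler):
    """Provider-side verification: replay the recorded interaction."""
    req, expected = contract["request"], contract["response"]
    status, body = handler(req["method"], req["path"])
    return status == expected["status"] and body == expected["body"]

assert verify(contract, provider_handler)  # provider honours the contract
```

If the provider renames a field or changes a status code, `verify` fails before either service is deployed, which is exactly the gate can-i-deploy enforces.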
Chaos Testing
Chaos testing deliberately injects failures into a running system to verify resilience. Netflix popularised the discipline with Chaos Monkey, which randomly terminates production instances.
Modern chaos engineering follows a scientific method:
- Hypothesise — "If we kill one replica of the payment service, latency stays under 500 ms."
- Inject — Use tools like LitmusChaos, Gremlin, or AWS Fault Injection Simulator to introduce the failure.
- Observe — Monitor dashboards, SLOs, and alerts.
- Learn — If the hypothesis fails, fix the weakness and re-run.
Common fault injections include network latency, DNS failure, CPU stress, disk fill, and dependency unavailability. Start in staging, graduate to production with a small blast radius, and always have a kill switch.
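The hypothesise-inject-observe loop can be sketched with a toy replica pool in Python (the service names and hypothesis are invented; real experiments use the tools above against live infrastructure):

```python
import random

# Hypothesis: if we kill one replica, every request is still served.
replicas = {"pay-1": True, "pay-2": True, "pay-3": True}  # name -> healthy

def handle_request():
    """Round-robin-ish 'load balancer': route to any healthy replica."""
    healthy = [name for name, ok in replicas.items() if ok]
    if not healthy:
        raise RuntimeError("total outage: hypothesis falsified")
    return random.choice(healthy)

# Inject: terminate one replica at random (Chaos Monkey style).
victim = random.choice(list(replicas))
replicas[victim] = False

# Observe: all requests must succeed and avoid the dead replica.
served_by = {handle_request() for _ in range(100)}
assert victim not in served_by  # traffic rerouted around the failure
```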
Load Testing with k6 and Locust
Performance bugs are invisible until traffic spikes. Load testing makes them visible on your schedule.
k6 (Grafana) uses JavaScript to define virtual user scenarios:
import http from "k6/http";
import { check } from "k6";

export const options = { vus: 200, duration: "5m" };

export default function () {
  const res = http.get("https://api.example.com/products");
  check(res, { "status 200": (r) => r.status === 200 });
}
k6 excels at developer ergonomics, CI integration, and deterministic output.
Locust (Python) is event-driven and distributed:
from locust import HttpUser, task

class ProductUser(HttpUser):
    @task
    def browse(self):
        self.client.get("/products")
Locust shines when you need a web UI for real-time monitoring or when your team already writes Python.
Both tools support ramping strategies, thresholds, and cloud execution. The key is testing regularly — not just before launch.
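Thresholds are the part worth automating: the load tool collects latency samples and the run fails if a percentile exceeds its budget. A tool-agnostic Python sketch of that gate, with fabricated samples:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: ceiling of the fractional index."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, math.ceil(p / 100 * (len(ordered) - 1)))
    return ordered[index]

# Fabricated per-request latencies from a load run, in milliseconds.
latencies_ms = [120, 135, 110, 480, 150, 140, 125, 160, 130, 145]

p95 = percentile(latencies_ms, 95)
assert p95 < 500, f"p95 latency {p95} ms exceeds the 500 ms budget"
```

k6 expresses the same idea declaratively as `thresholds: { http_req_duration: ["p(95)<500"] }`; the point is that the budget fails the CI job, not a human reading a dashboard.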
Mutation Testing
Code coverage tells you which lines were executed, not whether your tests actually verify behaviour. Mutation testing fills that gap.
A mutation testing tool (Stryker for JS/TS, PIT for Java, mutmut for Python) modifies your source code — flipping conditionals, removing statements, changing return values — and reruns your test suite. If tests still pass after a mutation, they are too weak.
Source: if (age >= 18) return "adult";
Mutant: if (age > 18) return "adult"; // boundary mutation
If no test fails for the mutant above, you are missing a test for age === 18. Mutation score (killed mutants / total mutants) is a far more meaningful quality metric than line coverage.
Run mutation testing on critical business logic, not the entire codebase — it is computationally expensive.
Test Environments
A reliable testing strategy requires environments that mirror production without blocking each other:
| Environment | Purpose | Data | Infra |
|---|---|---|---|
| Local | Developer inner loop | Seed / mocks | Docker Compose |
| CI | Automated gate | Fixtures | Ephemeral containers |
| Staging | Pre-prod validation | Sanitised prod clone | Prod-like cluster |
| Preview | PR-level demo | Seed | Ephemeral (Vercel, Fly) |
| Production | Real traffic | Real | Full |
Ephemeral environments (one per PR) are becoming the default. Tools like Argo CD, Terraform workspaces, and Neon database branching make them practical. They eliminate the "staging is broken" bottleneck.
Feature Flag Testing
Feature flags introduce a new dimension to testing: every flag combination is a potential code path.
Best practices:
- Test both states — Every flag should have unit tests for its on and off paths.
- Integration test the rollout — Verify that a 50% rollout does not break sessions or cause data inconsistency.
- Clean up stale flags — Untested dead code is worse than no code. Set expiry dates and automate flag removal.
- Test flag evaluation performance — A synchronous remote flag check on every request adds latency. Use local evaluation with SDKs from LaunchDarkly, Unleash, or Flagsmith.
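The first practice, testing both states, can be as simple as reading the flag through an injectable source so each path can be pinned independently (the function and flag names below are invented for illustration):

```python
# Feature-gated logic reads its flags from an injected dict, so a
# test can force each state rather than depend on remote config.
def checkout_total(cart_total, flags):
    if flags.get("new_pricing", False):
        return round(cart_total * 0.95, 2)  # flag on: promo pricing
    return cart_total                        # flag off: legacy path

# One assertion per flag state; both code paths stay covered
# until the flag is retired.
assert checkout_total(100, {"new_pricing": True}) == 95.0
assert checkout_total(100, {"new_pricing": False}) == 100
assert checkout_total(100, {}) == 100  # default must be the safe path
```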
Feature flag testing is especially important for trunk-based development, where incomplete features are merged behind flags daily.
Shift-Left Testing
Shift-left means moving quality checks earlier in the development lifecycle — from QA after merge to the developer's machine before commit.
Concrete shift-left practices:
- Pre-commit hooks — Run linters, formatters, and fast unit tests via Husky or lefthook.
- IDE integration — Show test results and coverage inline (Wallaby.js, Jest runner).
- Static analysis — Catch bugs at compile time with TypeScript strict mode, Rust's borrow checker, or Semgrep rules.
- Dependency scanning — Detect vulnerable packages before they enter the lockfile (Snyk, Socket, Renovate).
- Schema validation — Validate OpenAPI specs, GraphQL schemas, and database migrations in CI before any runtime test.
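Several of these practices meet in the pre-commit hook configuration. A minimal lefthook.yml sketch (the command names and globs are illustrative, not prescriptive):

```yaml
# lefthook.yml — runs before every commit is created
pre-commit:
  parallel: true
  commands:
    lint:
      glob: "*.{js,ts}"
      run: npx eslint {staged_files}
    unit-tests:
      run: npx jest --onlyChanged --bail
```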
The further left a bug is caught, the cheaper it is to fix. A type error caught by the compiler costs seconds; the same bug caught in production costs hours of incident response.
Putting It All Together
A modern testing strategy is not a single technique — it is a layered system:
┌──────────────────────────────────────────────────┐
│ Shift-Left: linters, types, pre-commit hooks │
├──────────────────────────────────────────────────┤
│ Unit Tests (TDD) — fast, isolated, thousands │
├──────────────────────────────────────────────────┤
│ Integration / Contract Tests (Pact) — services │
├──────────────────────────────────────────────────┤
│ E2E / BDD — critical user journeys │
├──────────────────────────────────────────────────┤
│ Load Tests (k6 / Locust) — performance budgets │
├──────────────────────────────────────────────────┤
│ Chaos Tests — resilience under failure │
├──────────────────────────────────────────────────┤
│ Mutation Tests — test quality verification │
├──────────────────────────────────────────────────┤
│ Feature Flag Tests — flag state coverage │
└──────────────────────────────────────────────────┘
Each layer catches a different class of defect. Skip one and you leave a gap that will eventually be filled by an incident.
This is article #157 in the Codelit engineering blog series.