Performance Testing: A Complete Guide to Load, Stress, Soak, and Spike Testing
Performance bugs are invisible until they are catastrophic. A system that handles 100 requests per second in development can collapse at 1,000 in production. Performance testing replaces hope with data — it tells you exactly where your system breaks and how far you are from that threshold. This guide covers every type of performance test, the tools to run them, and the metrics that matter.
Types of Performance Tests#
Each test type answers a different question about your system.
[Diagram: three load profiles over time — a stress test ramps upward to find the breaking point, a load test holds steady at normal operating capacity, and a soak test sustains a moderate load for a long time to find leaks.]
Load Testing#
Load testing validates that your system meets expected traffic levels. You simulate the anticipated number of concurrent users and verify that response times and error rates stay within acceptable bounds.
When to use: Before every release, after infrastructure changes, and as part of CI/CD pipelines.
Stress Testing#
Stress testing pushes the system beyond its expected capacity to find the breaking point. You ramp up load until response times degrade or errors spike, then observe how the system recovers.
When to use: Capacity planning, pre-launch validation, after architectural changes.
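The ramp-until-failure idea can be sketched in a few lines. This is a minimal illustration, not a real load generator: the latency model below is invented, and in practice the "P95 at N users" numbers would come from actual test runs.

```python
# Step up simulated load until observed P95 latency breaches a threshold,
# then report the last level that passed. The latency model is a toy.

def p95_latency_at(users: int) -> float:
    """Toy latency model: roughly flat until ~800 users, then degrading."""
    base = 120.0  # ms
    if users <= 800:
        return base + users * 0.05
    return base + (users - 800) ** 1.5  # sharp degradation past saturation

def find_breaking_point(threshold_ms: float, step: int = 100,
                        max_users: int = 5000) -> int:
    """Return the highest step load whose P95 stays under the threshold."""
    last_good = 0
    for users in range(step, max_users + 1, step):
        if p95_latency_at(users) >= threshold_ms:
            break
        last_good = users
    return last_good

print(find_breaking_point(threshold_ms=500.0))  # → 800
```

The same loop structure works with a real tool driving the load at each step; only `p95_latency_at` changes.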
Soak Testing (Endurance Testing)#
Soak testing applies a moderate, sustained load over an extended period — hours or even days. It exposes problems that only surface over time: memory leaks, connection pool exhaustion, disk space depletion, and garbage collection pauses.
When to use: Before production launches, after memory-management changes, for services that run continuously.
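A common way to turn soak-test memory samples into a leak signal is to fit a least-squares line and look at the slope. The sketch below uses invented sample data; real samples would come from your monitoring system.

```python
# Fit a least-squares line to periodic RSS samples from a soak test; a
# persistently positive slope over hours suggests a leak.

def slope(xs, ys):
    """Least-squares slope of ys over xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Memory (MB) sampled every 10 minutes over a simulated 2-hour soak.
minutes = [10 * i for i in range(12)]
rss_mb = [512 + 3.2 * i + (i % 3) for i in range(12)]  # steady upward drift

mb_per_hour = slope(minutes, rss_mb) * 60
print(f"growth: {mb_per_hour:.1f} MB/hour")  # flag if growth persists
```

A slope near zero is normal churn; tens of MB per hour that never flattens out is the signature of a leak.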
Spike Testing#
Spike testing subjects the system to sudden, dramatic increases in load. It validates auto-scaling behavior, circuit breakers, and graceful degradation.
When to use: Systems exposed to flash sales, viral events, or unpredictable traffic patterns.
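The defining feature of a spike test is the load shape: a sudden jump with no ramp. A sketch of such a profile, which a driver (for example a custom Locust `LoadTestShape`) could sample each second — the numbers are illustrative:

```python
# Spike load profile: baseline traffic, a sudden jump, then a return to
# baseline. target_users(t) gives the desired concurrency at second t.

def target_users(t: int, baseline: int = 50, peak: int = 1000,
                 spike_start: int = 60, spike_end: int = 120) -> int:
    """Target concurrent users at elapsed second t for a spike test."""
    if spike_start <= t < spike_end:
        return peak  # sudden jump, no ramp -- that's the point
    return baseline

profile = [target_users(t) for t in (0, 59, 60, 119, 120)]
print(profile)  # → [50, 50, 1000, 1000, 50]
```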
Key Metrics#
Latency Percentiles#
Averages hide outliers. Always measure percentiles.
| Percentile | Meaning |
|---|---|
| P50 (median) | Half of requests are faster than this |
| P95 | 95% of requests are faster — the "typical worst case" |
| P99 | 99% of requests are faster — tail latency |
| P99.9 | One-in-a-thousand worst case |
A system with P50 = 50ms and P99 = 2,000ms has a severe tail latency problem even though the median looks healthy.
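To make the P50-vs-P99 gap concrete, here is a minimal nearest-rank percentile over raw latency samples. Real tools bucket or stream these values, but the definition is the same; the sample data mirrors the example above.

```python
# Nearest-rank percentile: the smallest value >= p% of the samples.

def percentile(samples, p):
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
    return ordered[rank - 1]

# 95 fast requests plus 5 slow outliers (ms).
latencies = [50] * 95 + [2000] * 5

print(percentile(latencies, 50), percentile(latencies, 99))  # → 50 2000
```

Note that the *average* of this data set is 147.5 ms — a number that describes almost none of the actual requests, which is exactly why percentiles matter.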
Throughput#
Requests per second (RPS) or transactions per second (TPS). Track this alongside latency — throughput without latency context is meaningless.
Error Rate#
The percentage of requests that return errors (5xx, timeouts, connection refused). Your SLO should define the maximum acceptable error rate under load.
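An error-rate check makes a natural pass/fail gate for a load-test run. A minimal sketch, with illustrative counts and a hypothetical 1% SLO:

```python
# Compare a run's observed error rate against an SLO ceiling.

def error_rate(total: int, errors: int) -> float:
    return errors / total if total else 0.0

SLO_MAX_ERROR_RATE = 0.01  # at most 1% of requests may fail under load

total, errors = 120_000, 84  # e.g. 5xx + timeouts + connection refused
rate = error_rate(total, errors)
print(f"{rate:.4%}", "PASS" if rate <= SLO_MAX_ERROR_RATE else "FAIL")
```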
Resource Utilization#
CPU, memory, disk I/O, and network bandwidth on the server side. Correlate resource metrics with latency to identify bottlenecks.
[Chart: latency vs. throughput (RPS) — latency stays nearly flat at low throughput, then climbs sharply past the saturation point as the system approaches capacity.]
Tools#
k6#
k6 (by Grafana Labs) is a modern load testing tool with tests written in JavaScript.
```javascript
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 100 }, // ramp up
    { duration: "5m", target: 100 }, // steady state
    { duration: "2m", target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1000"],
    http_req_failed: ["rate<0.01"],
  },
};

export default function () {
  const res = http.get("https://api.example.com/products");
  check(res, {
    "status is 200": (r) => r.status === 200,
    "body is not empty": (r) => r.body.length > 0,
  });
  sleep(1);
}
```
Run it:
```bash
k6 run load-test.js
```
Strengths: Developer-friendly, scriptable, built-in thresholds, Grafana integration, supports protocol-level testing (HTTP, gRPC, WebSocket).
Locust#
Locust is a Python-based load testing framework where user behavior is defined as code.
```python
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def view_products(self):
        self.client.get("/api/products")

    @task(1)
    def view_cart(self):
        self.client.get("/api/cart")
```
Strengths: Python ecosystem, distributed mode, real-time web UI, easy to model complex user flows.
Gatling#
Gatling uses a Scala DSL for test scenarios and produces detailed HTML reports.
```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class BasicSimulation extends Simulation {
  val httpProtocol = http.baseUrl("https://api.example.com")

  val scn = scenario("Load Test")
    .exec(http("Get Products").get("/products"))
    .pause(1)

  setUp(
    scn.inject(
      rampUsersPerSec(1).to(100).during(120),
      constantUsersPerSec(100).during(300)
    )
  ).protocols(httpProtocol)
    .assertions(
      global.responseTime.percentile3.lt(1000), // 3rd configured percentile (95th by default)
      global.successfulRequests.percent.gt(99.0)
    )
}
```
Strengths: Excellent reports, high performance (Akka-based), CI/CD friendly, protocol support (HTTP, WebSocket, JMS).
Apache JMeter#
JMeter is a veteran tool with a GUI for building test plans and extensive protocol support.
Strengths: GUI for non-developers, massive plugin ecosystem, supports JDBC, LDAP, FTP, and SMTP testing.
Weaknesses: Resource-heavy, XML-based test plans are hard to version control, GUI-centric workflow.
Tool Comparison#
| Feature | k6 | Locust | Gatling | JMeter |
|---|---|---|---|---|
| Language | JavaScript | Python | Scala | GUI/XML |
| Protocol support | HTTP, gRPC, WS | HTTP | HTTP, WS, JMS | HTTP, JDBC, LDAP |
| Distributed mode | k6 Cloud / xk6 | Built-in | Enterprise | Built-in |
| CI/CD integration | Excellent | Good | Excellent | Moderate |
| Resource efficiency | High | Moderate | High | Low |
Establishing Baselines#
A baseline is a performance snapshot under known conditions. Without one, you cannot tell whether a change improved or degraded performance.
How to Create a Baseline#
- Fix the environment — use a dedicated performance testing environment that mirrors production sizing.
- Define the workload — model realistic user behavior, not just single-endpoint hammering.
- Run the test — execute a standard load test (e.g., 30 minutes at expected peak load).
- Record metrics — capture P50, P95, P99, throughput, error rate, and resource utilization.
- Store results — commit results to version control or a metrics store for comparison.
Baseline Example#
```text
Baseline: v2.4.1 — 2026-03-15
Environment: 3x c5.2xlarge, RDS r6g.xlarge
Workload: 500 concurrent users, 60/30/10 product/cart/checkout split

P50 latency:   45ms
P95 latency:   180ms
P99 latency:   420ms
Throughput:    2,340 RPS
Error rate:    0.02%
CPU (avg):     62%
Memory (avg):  71%
```
Run the same test after every significant change and compare against this baseline.
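That comparison is easy to automate. A minimal sketch of a regression gate — the metric keys and numbers mirror the baseline example above, and the 10% tolerance is an illustrative policy, not a universal rule:

```python
# Compare a fresh run against the stored baseline and flag any metric
# that regressed by more than 10%. "Higher is worse" for everything
# except throughput.

BASELINE = {"p95_ms": 180, "p99_ms": 420, "rps": 2340, "error_rate": 0.0002}
TOLERANCE = 0.10

def regressions(current: dict) -> list:
    failed = []
    for key, base in BASELINE.items():
        cur = current[key]
        if key == "rps":  # throughput: lower is worse
            if cur < base * (1 - TOLERANCE):
                failed.append(key)
        elif cur > base * (1 + TOLERANCE):
            failed.append(key)
    return failed

run = {"p95_ms": 205, "p99_ms": 430, "rps": 2310, "error_rate": 0.0003}
print(regressions(run))  # → ['p95_ms', 'error_rate']
```

In a pipeline, a non-empty result would fail the build.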
Integrating Performance Tests into CI/CD#
Pipeline Strategy#
```text
Code Push ──► Unit Tests ──► Integration Tests ──► Performance Tests ──► Deploy
                                                          │
                                               Compare against baseline
                                                          │
                                                  Pass / Fail gate
```
Practical Tips#
- Run a smoke performance test on every PR — 1–2 minutes at low load to catch obvious regressions.
- Run a full load test nightly — longer duration against a dedicated environment.
- Run soak and stress tests weekly — these take hours and need dedicated resources.
- Set threshold-based gates — fail the pipeline if P95 latency exceeds the baseline by more than 10%.
k6 in GitHub Actions#
- name: Run performance test
uses: grafana/k6-action@v0.3.1
with:
filename: tests/performance/load-test.js
flags: --out json=results.json
- name: Check thresholds
run: |
if grep -q '"thresholds":.*"failures":[^0]' results.json; then
echo "Performance regression detected"
exit 1
fi
Common Bottlenecks and Fixes#
| Symptom | Likely Cause | Fix |
|---|---|---|
| Latency climbs linearly with load | CPU saturation | Scale horizontally or optimize hot paths |
| Sudden latency spike at threshold | Connection pool exhaustion | Increase pool size, add circuit breakers |
| Gradual memory increase in soak test | Memory leak | Profile with heap dumps, fix retention |
| High P99 with normal P50 | GC pauses or lock contention | Tune GC, reduce synchronization |
| Error rate spikes on scale-down | Ungraceful shutdown | Implement drain and readiness probes |
Key Takeaways#
- Test all four types: load, stress, soak, and spike — each reveals different failures.
- Measure percentiles (P50, P95, P99), not averages.
- Choose tools that fit your team: k6 for JavaScript teams, Locust for Python, Gatling for JVM shops.
- Establish baselines and compare every change against them.
- Integrate performance tests into CI/CD with automated threshold gates.
- Correlate latency with resource utilization to pinpoint bottlenecks.
- Run soak tests to catch memory leaks and resource exhaustion before production does.