Performance Testing: A Complete Guide to Load, Stress, Soak, and Spike Testing
Performance bugs are invisible until they are catastrophic. A system that handles 100 requests per second in development can collapse at 1,000 in production. Performance testing replaces hope with data — it tells you exactly where your system breaks and how far you are from that threshold. This guide covers every type of performance test, the tools to run them, and the metrics that matter.
Types of Performance Tests#
Each test type answers a different question about your system.
[Diagram: three load profiles over time — a stress test ramps upward to find the breaking point, a load test holds steady at normal operating capacity, and a soak test sustains a moderate load for a long time to find leaks.]
Load Testing#
Load testing validates that your system meets expected traffic levels. You simulate the anticipated number of concurrent users and verify that response times and error rates stay within acceptable bounds.
When to use: Before every release, after infrastructure changes, and as part of CI/CD pipelines.
Stress Testing#
Stress testing pushes the system beyond its expected capacity to find the breaking point. You ramp up load until response times degrade or errors spike, then observe how the system recovers.
When to use: Capacity planning, pre-launch validation, after architectural changes.
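The ramp-until-failure idea can be sketched in a few lines. This is a minimal illustration, not a real load generator: the latency model below is invented, and in practice the "P95 at N users" numbers would come from actual test runs.

```python
# Step up simulated load until observed P95 latency breaches a threshold,
# then report the last level that passed. The latency model is a toy.

def p95_latency_at(users: int) -> float:
    """Toy latency model: roughly flat until ~800 users, then degrading."""
    base = 120.0  # ms
    if users <= 800:
        return base + users * 0.05
    return base + (users - 800) ** 1.5  # sharp degradation past saturation

def find_breaking_point(threshold_ms: float, step: int = 100,
                        max_users: int = 5000) -> int:
    """Return the highest step load whose P95 stays under the threshold."""
    last_good = 0
    for users in range(step, max_users + 1, step):
        if p95_latency_at(users) >= threshold_ms:
            break
        last_good = users
    return last_good

print(find_breaking_point(threshold_ms=500.0))  # → 800
```

The same loop structure works with a real tool driving the load at each step; only `p95_latency_at` changes.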
Soak Testing (Endurance Testing)#
Soak testing applies a moderate, sustained load over an extended period — hours or even days. It exposes problems that only surface over time: memory leaks, connection pool exhaustion, disk space depletion, and garbage collection pauses.
When to use: Before production launches, after memory-management changes, for services that run continuously.
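A common way to turn soak-test memory samples into a leak signal is to fit a least-squares line and look at the slope. The sketch below uses invented sample data; real samples would come from your monitoring system.

```python
# Fit a least-squares line to periodic RSS samples from a soak test; a
# persistently positive slope over hours suggests a leak.

def slope(xs, ys):
    """Least-squares slope of ys over xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Memory (MB) sampled every 10 minutes over a simulated 2-hour soak.
minutes = [10 * i for i in range(12)]
rss_mb = [512 + 3.2 * i + (i % 3) for i in range(12)]  # steady upward drift

mb_per_hour = slope(minutes, rss_mb) * 60
print(f"growth: {mb_per_hour:.1f} MB/hour")  # flag if growth persists
```

A slope near zero is normal churn; tens of MB per hour that never flattens out is the signature of a leak.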
Spike Testing#
Spike testing subjects the system to sudden, dramatic increases in load. It validates auto-scaling behavior, circuit breakers, and graceful degradation.
When to use: Systems exposed to flash sales, viral events, or unpredictable traffic patterns.
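The defining feature of a spike test is the load shape: a sudden jump with no ramp. A sketch of such a profile, which a driver (for example a custom Locust `LoadTestShape`) could sample each second — the numbers are illustrative:

```python
# Spike load profile: baseline traffic, a sudden jump, then a return to
# baseline. target_users(t) gives the desired concurrency at second t.

def target_users(t: int, baseline: int = 50, peak: int = 1000,
                 spike_start: int = 60, spike_end: int = 120) -> int:
    """Target concurrent users at elapsed second t for a spike test."""
    if spike_start <= t < spike_end:
        return peak  # sudden jump, no ramp -- that's the point
    return baseline

profile = [target_users(t) for t in (0, 59, 60, 119, 120)]
print(profile)  # → [50, 50, 1000, 1000, 50]
```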
Key Metrics#
Latency Percentiles#
Averages hide outliers. Always measure percentiles.
| Percentile | Meaning |
|---|---|
| P50 (median) | Half of requests are faster than this |
| P95 | 95% of requests are faster — the "typical worst case" |
| P99 | 99% of requests are faster — tail latency |
| P99.9 | One-in-a-thousand worst case |
A system with P50 = 50ms and P99 = 2,000ms has a severe tail latency problem even though the median looks healthy.
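To make the P50-vs-P99 gap concrete, here is a minimal nearest-rank percentile over raw latency samples. Real tools bucket or stream these values, but the definition is the same; the sample data mirrors the example above.

```python
# Nearest-rank percentile: the smallest value >= p% of the samples.

def percentile(samples, p):
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
    return ordered[rank - 1]

# 95 fast requests plus 5 slow outliers (ms).
latencies = [50] * 95 + [2000] * 5

print(percentile(latencies, 50), percentile(latencies, 99))  # → 50 2000
```

Note that the *average* of this data set is 147.5 ms — a number that describes almost none of the actual requests, which is exactly why percentiles matter.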
Throughput#
Requests per second (RPS) or transactions per second (TPS). Track this alongside latency — throughput without latency context is meaningless.
Error Rate#
The percentage of requests that return errors (5xx, timeouts, connection refused). Your SLO should define the maximum acceptable error rate under load.
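An error-rate check makes a natural pass/fail gate for a load-test run. A minimal sketch, with illustrative counts and a hypothetical 1% SLO:

```python
# Compare a run's observed error rate against an SLO ceiling.

def error_rate(total: int, errors: int) -> float:
    return errors / total if total else 0.0

SLO_MAX_ERROR_RATE = 0.01  # at most 1% of requests may fail under load

total, errors = 120_000, 84  # e.g. 5xx + timeouts + connection refused
rate = error_rate(total, errors)
print(f"{rate:.4%}", "PASS" if rate <= SLO_MAX_ERROR_RATE else "FAIL")
```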
Resource Utilization#
CPU, memory, disk I/O, and network bandwidth on the server side. Correlate resource metrics with latency to identify bottlenecks.
[Chart: latency vs. throughput (RPS) — latency stays nearly flat at low throughput, then climbs sharply past the saturation point as the system approaches capacity.]
Tools#
k6#
k6 (by Grafana Labs) is a modern load testing tool with tests written in JavaScript.
```javascript
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 100 }, // ramp up
    { duration: "5m", target: 100 }, // steady state
    { duration: "2m", target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1000"],
    http_req_failed: ["rate<0.01"],
  },
};

export default function () {
  const res = http.get("https://api.example.com/products");
  check(res, {
    "status is 200": (r) => r.status === 200,
    "body is not empty": (r) => r.body.length > 0,
  });
  sleep(1);
}
```
Run it:
```bash
k6 run load-test.js
```
Strengths: Developer-friendly, scriptable, built-in thresholds, Grafana integration, supports protocol-level testing (HTTP, gRPC, WebSocket).
Locust#
Locust is a Python-based load testing framework where user behavior is defined as code.
```python
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def view_products(self):
        self.client.get("/api/products")

    @task(1)
    def view_cart(self):
        self.client.get("/api/cart")
```
Strengths: Python ecosystem, distributed mode, real-time web UI, easy to model complex user flows.
Gatling#
Gatling uses a Scala DSL for test scenarios and produces detailed HTML reports.
```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class BasicSimulation extends Simulation {
  val httpProtocol = http.baseUrl("https://api.example.com")

  val scn = scenario("Load Test")
    .exec(http("Get Products").get("/products"))
    .pause(1)

  setUp(
    scn.inject(
      rampUsersPerSec(1).to(100).during(120),
      constantUsersPerSec(100).during(300)
    )
  ).protocols(httpProtocol)
    .assertions(
      global.responseTime.percentile3.lt(1000), // 3rd configured percentile (95th by default)
      global.successfulRequests.percent.gt(99.0)
    )
}
```
Strengths: Excellent reports, high performance (Akka-based), CI/CD friendly, protocol support (HTTP, WebSocket, JMS).
Apache JMeter#
JMeter is a veteran tool with a GUI for building test plans and extensive protocol support.
Strengths: GUI for non-developers, massive plugin ecosystem, supports JDBC, LDAP, FTP, and SMTP testing.
Weaknesses: Resource-heavy, XML-based test plans are hard to version control, GUI-centric workflow.
Tool Comparison#
| Feature | k6 | Locust | Gatling | JMeter |
|---|---|---|---|---|
| Language | JavaScript | Python | Scala | GUI/XML |
| Protocol support | HTTP, gRPC, WS | HTTP | HTTP, WS, JMS | HTTP, JDBC, LDAP |
| Distributed mode | k6 Cloud / xk6 | Built-in | Enterprise | Built-in |
| CI/CD integration | Excellent | Good | Excellent | Moderate |
| Resource efficiency | High | Moderate | High | Low |
Establishing Baselines#
A baseline is a performance snapshot under known conditions. Without one, you cannot tell whether a change improved or degraded performance.
How to Create a Baseline#
- Fix the environment — use a dedicated performance testing environment that mirrors production sizing.
- Define the workload — model realistic user behavior, not just single-endpoint hammering.
- Run the test — execute a standard load test (e.g., 30 minutes at expected peak load).
- Record metrics — capture P50, P95, P99, throughput, error rate, and resource utilization.
- Store results — commit results to version control or a metrics store for comparison.
Baseline Example#
```text
Baseline: v2.4.1 — 2026-03-15
Environment: 3x c5.2xlarge, RDS r6g.xlarge
Workload: 500 concurrent users, 60/30/10 product/cart/checkout split

P50 latency:   45ms
P95 latency:   180ms
P99 latency:   420ms
Throughput:    2,340 RPS
Error rate:    0.02%
CPU (avg):     62%
Memory (avg):  71%
```
Run the same test after every significant change and compare against this baseline.
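That comparison is easy to automate. A minimal sketch of a regression gate — the metric keys and numbers mirror the baseline example above, and the 10% tolerance is an illustrative policy, not a universal rule:

```python
# Compare a fresh run against the stored baseline and flag any metric
# that regressed by more than 10%. "Higher is worse" for everything
# except throughput.

BASELINE = {"p95_ms": 180, "p99_ms": 420, "rps": 2340, "error_rate": 0.0002}
TOLERANCE = 0.10

def regressions(current: dict) -> list:
    failed = []
    for key, base in BASELINE.items():
        cur = current[key]
        if key == "rps":  # throughput: lower is worse
            if cur < base * (1 - TOLERANCE):
                failed.append(key)
        elif cur > base * (1 + TOLERANCE):
            failed.append(key)
    return failed

run = {"p95_ms": 205, "p99_ms": 430, "rps": 2310, "error_rate": 0.0003}
print(regressions(run))  # → ['p95_ms', 'error_rate']
```

In a pipeline, a non-empty result would fail the build.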
Integrating Performance Tests into CI/CD#
Pipeline Strategy#
```text
Code Push ──► Unit Tests ──► Integration Tests ──► Performance Tests ──► Deploy
                                                          │
                                               Compare against baseline
                                                          │
                                                  Pass / Fail gate
```
Practical Tips#
- Run a smoke performance test on every PR — 1–2 minutes at low load to catch obvious regressions.
- Run a full load test nightly — longer duration against a dedicated environment.
- Run soak and stress tests weekly — these take hours and need dedicated resources.
- Set threshold-based gates — fail the pipeline if P95 latency exceeds the baseline by more than 10%.
k6 in GitHub Actions#
- name: Run performance test
uses: grafana/k6-action@v0.3.1
with:
filename: tests/performance/load-test.js
flags: --out json=results.json
- name: Check thresholds
run: |
if grep -q '"thresholds":.*"failures":[^0]' results.json; then
echo "Performance regression detected"
exit 1
fi
Common Bottlenecks and Fixes#
| Symptom | Likely Cause | Fix |
|---|---|---|
| Latency climbs linearly with load | CPU saturation | Scale horizontally or optimize hot paths |
| Sudden latency spike at threshold | Connection pool exhaustion | Increase pool size, add circuit breakers |
| Gradual memory increase in soak test | Memory leak | Profile with heap dumps, fix retention |
| High P99 with normal P50 | GC pauses or lock contention | Tune GC, reduce synchronization |
| Error rate spikes on scale-down | Ungraceful shutdown | Implement drain and readiness probes |
Key Takeaways#
- Test all four types: load, stress, soak, and spike — each reveals different failures.
- Measure percentiles (P50, P95, P99), not averages.
- Choose tools that fit your team: k6 for JavaScript teams, Locust for Python, Gatling for JVM shops.
- Establish baselines and compare every change against them.
- Integrate performance tests into CI/CD with automated threshold gates.
- Correlate latency with resource utilization to pinpoint bottlenecks.
- Run soak tests to catch memory leaks and resource exhaustion before production does.