Distributed Tracing: See Every Request Across Your Entire System
A user clicks "Place Order." The request hits your API gateway, routes to the order service, calls inventory, payment, notification, and shipping services. The response takes 4.2 seconds. Which service is slow?
Without distributed tracing, you're guessing. With it, you see the entire journey of every request across every service.
Why Distributed Tracing?#
Logs tell you what happened inside one service. Metrics tell you aggregate behavior. Traces tell you what happened across services for a single request.
Logs: "Payment processed in 200ms" (one service, no context)
Metrics: "p99 latency is 3.2s" (aggregate, no specifics)
Traces: "Request abc-123 spent 2.8s waiting for inventory lock" (end-to-end, specific)
In a monolith, a stack trace gives you the full picture. In microservices, distributed tracing is your stack trace.
Core Concepts#
Traces#
A trace represents the entire lifecycle of a request through your system. It has a unique trace ID that follows the request across every service boundary.
Trace ID: abc-123-def-456
Duration: 4.2s
Services: api-gateway → order → inventory → payment → notification
Spans#
A trace is made of spans — individual units of work. Each span has a parent, forming a tree.
[api-gateway: 4.2s]
├── [order-service: 3.8s]
│   ├── [inventory-check: 2.8s]  ← bottleneck
│   ├── [payment-process: 0.6s]
│   └── [notification-send: 0.2s]
└── [response-serialize: 0.1s]
Each span contains:
- Operation name: what work was done
- Start/end timestamps: duration
- Tags/attributes: metadata like http.status_code=200
- Events/logs: timestamped annotations within the span
- Parent span ID: the causal relationship
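Conceptually, a span is just a small structured record. A minimal Python sketch of the fields above (illustrative only — not the actual OTel data model classes or wire format):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str                  # shared by every span in the trace
    span_id: str                   # unique to this unit of work
    parent_span_id: Optional[str]  # None for the root span
    name: str                      # operation name, e.g. "inventory-check"
    start_ns: int                  # start timestamp (nanoseconds)
    end_ns: int                    # end timestamp (nanoseconds)
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end_ns - self.start_ns) / 1e6

# The 2.8s inventory-check span from the tree above, as a record
span = Span("abc123", "span789", None, "inventory-check",
            1_000_000_000, 3_800_000_000,
            attributes={"http.status_code": 200})
```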
Context Propagation#
The trace ID must travel with the request across service boundaries. This is context propagation.
Service A ──HTTP header──→ Service B ──gRPC metadata──→ Service C
           traceparent:               traceparent:
           00-abc123-span1-01         00-abc123-span2-01
The W3C Trace Context standard defines the traceparent header:
traceparent: 00-{trace-id}-{parent-span-id}-{trace-flags}
traceparent: 00-abc123def456-span789-01
Without context propagation, you get disconnected spans — useless fragments instead of a complete picture.
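The header itself is easy to build and parse. A plain-Python sketch of the W3C format (the trace ID below is the spec's own example; real services should use an OTel propagator rather than hand-rolling this):

```python
def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # version 00; trace-flags 01 = sampled, 00 = not sampled
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,            # 32 hex chars (128-bit)
        "parent_span_id": parent_span_id,  # 16 hex chars (64-bit)
        "sampled": flags == "01",
    }

header = build_traceparent("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
# The downstream service extracts the context from the incoming header
ctx = parse_traceparent(header)
```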
OpenTelemetry#
OpenTelemetry (OTel) is the industry standard for instrumentation. It provides APIs, SDKs, and the OTLP protocol for traces, metrics, and logs.
Basic Instrumentation#
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: export spans in batches over OTLP/gRPC
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

# Create spans (order, order_id, check_inventory, charge_card defined elsewhere)
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.total", 99.99)

    with tracer.start_as_current_span("check_inventory"):
        inventory_result = check_inventory(order.items)

    with tracer.start_as_current_span("process_payment"):
        payment_result = charge_card(order.payment)
Auto-Instrumentation#
OTel provides automatic instrumentation for common libraries — HTTP clients, database drivers, message queues — so most spans appear without code changes.
# Python: instrument Flask + requests + psycopg2 automatically
opentelemetry-instrument --service_name order-service python app.py

# Node.js: auto-instrument Express + pg + ioredis
node --require @opentelemetry/auto-instrumentations-node/register app.js
Tracing Backends#
Jaeger#
Open-source, built by Uber. Strong at trace visualization, dependency graphs, and performance analysis.
- Best for: teams wanting full control, self-hosted deployments
- Storage: Cassandra, Elasticsearch, or memory
- UI: built-in web interface with trace comparison
Zipkin#
Open-source, originally from Twitter. Simpler than Jaeger, well-established ecosystem.
- Best for: simpler setups, broad language support
- Storage: Cassandra, Elasticsearch, MySQL, in-memory
- UI: clean trace timeline view
Managed Options#
- Grafana Tempo: integrates with Grafana dashboards, cost-effective object storage
- AWS X-Ray: native AWS integration
- Datadog APM / Honeycomb / Lightstep: SaaS with advanced analysis
All modern backends accept OTLP, so switching is straightforward.
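Because everything speaks OTLP, the usual deployment pattern is app → OpenTelemetry Collector → backend. A minimal Collector pipeline sketch (the `jaeger:4317` endpoint is illustrative — point it at whichever OTLP-capable backend you run):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: jaeger:4317   # any OTLP-capable backend
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

Swapping backends means changing the exporter endpoint, not re-instrumenting your services.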
Sampling Strategies#
Tracing every request is expensive at scale. Sampling controls what gets recorded.
Head-Based Sampling#
Decision made at the start of the trace.
Sample 10% of all requests randomly
→ Simple, predictable cost
→ Misses rare errors (only 10% chance of capturing them)
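Head-based ratio sampling is typically deterministic on the trace ID, so every service makes the same keep/drop decision for the same trace. A plain-Python sketch of the idea (OTel's built-in `TraceIdRatioBased` sampler implements this for you):

```python
def should_sample(trace_id: str, ratio: float) -> bool:
    # Deterministic: the decision is a pure function of the 128-bit trace ID,
    # so every service in the trace keeps or drops the same requests.
    return int(trace_id, 16) < ratio * (2 ** 128)

# 10% baseline: small IDs fall under the cutoff, large ones don't
keep = should_sample("0" * 32, 0.10)
drop = should_sample("f" * 32, 0.10)
```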
Tail-Based Sampling#
Decision made at the end of the trace, after seeing all spans.
Keep traces that:
- Have errors (status >= 500)
- Exceed latency threshold (> 2s)
- Hit specific services
- Random 5% of the rest
→ Captures all interesting traces
→ Requires a collection tier that buffers traces before deciding
Priority Sampling#
Tag certain requests as must-sample:
with tracer.start_as_current_span("critical_operation") as span:
    span.set_attribute("sampling.priority", 1)  # always keep
Practical Sampling Config#
# OpenTelemetry Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
Trace-Based Testing#
Traces are not just for debugging — they validate system behavior.
Contract Testing with Traces#
Assert that expected spans exist with correct attributes:
def test_order_creates_payment_span():
    response = client.post("/orders", json=order_data)
    trace = get_trace(response.headers["traceparent"])

    payment_span = find_span(trace, "process_payment")
    assert payment_span is not None
    assert payment_span.attributes["payment.method"] == "card"
    assert payment_span.status == "OK"
Performance Regression Detection#
Compare span durations across deployments:
v2.3.0: inventory-check p50=120ms p99=800ms
v2.4.0: inventory-check p50=350ms p99=2400ms ← regression
Correlating Traces with Logs and Metrics#
The real power comes from connecting all three pillars of observability.
Inject Trace ID into Logs#
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        record.trace_id = format(ctx.trace_id, '032x')
        record.span_id = format(ctx.span_id, '016x')
        return True

# Attach the filter and include the IDs in the log format
handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter("%(asctime)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s"))
logging.getLogger().addHandler(handler)
Now every log line carries the trace ID:
2026-03-28 14:22:01 [trace_id=abc123 span_id=span789] Payment processed for order #42
Click the trace ID in Grafana, Datadog, or your logging tool to jump directly to the full trace.
Exemplars: Metrics to Traces#
Exemplars attach trace IDs to metric data points. When you see a latency spike in a histogram, click through to the exact trace that caused it.
http_request_duration{service="order"} 4.2s exemplar={trace_id="abc123"}
Common Pitfalls#
- Missing context propagation — one service that doesn't forward headers breaks the entire trace.
- Over-sampling in production — 100% sampling at scale is expensive. Start with tail-based sampling.
- Ignoring async flows — message queues need manual context injection into message headers.
- Too many custom spans — instrument boundaries and slow operations, not every function call.
- No trace-to-log correlation — traces without log context lose half their debugging value.
Key Takeaways#
- Distributed tracing shows the end-to-end journey of a request across services.
- OpenTelemetry is the standard — instrument once, export anywhere.
- Context propagation is the foundation — broken propagation means broken traces.
- Tail-based sampling captures errors and slow traces while controlling cost.
- Correlate traces with logs and metrics for the full observability picture.
Start with auto-instrumentation, add custom spans at service boundaries, and connect traces to your logging pipeline. Within a week, you'll wonder how you ever debugged without it.
Build tools that help teams visualize system architecture and trace flows at codelit.io.
Article #154 in the Codelit engineering series.