Distributed Tracing: See Every Request Across Your Entire System
A user clicks "Place Order." The request hits your API gateway, routes to the order service, calls inventory, payment, notification, and shipping services. The response takes 4.2 seconds. Which service is slow?
Without distributed tracing, you're guessing. With it, you see the entire journey of every request across every service.
Why Distributed Tracing?#
Logs tell you what happened inside one service. Metrics tell you aggregate behavior. Traces tell you what happened across services for a single request.
Logs: "Payment processed in 200ms" (one service, no context)
Metrics: "p99 latency is 3.2s" (aggregate, no specifics)
Traces: "Request abc-123 spent 2.8s waiting for inventory lock" (end-to-end, specific)
In a monolith, a stack trace gives you the full picture. In microservices, distributed tracing is your stack trace.
Core Concepts#
Traces#
A trace represents the entire lifecycle of a request through your system. It has a unique trace ID that follows the request across every service boundary.
Trace ID: abc-123-def-456
Duration: 4.2s
Services: api-gateway → order → inventory → payment → notification
Spans#
A trace is made of spans — individual units of work. Each span has a parent, forming a tree.
[api-gateway: 4.2s]
├── [order-service: 3.8s]
│   ├── [inventory-check: 2.8s]  ← bottleneck
│   ├── [payment-process: 0.6s]
│   └── [notification-send: 0.2s]
└── [response-serialize: 0.1s]
Each span contains:
- Operation name: what work was done
- Start/end timestamps: duration
- Tags/attributes: metadata like http.status_code=200
- Events/logs: timestamped annotations within the span
- Parent span ID: the causal relationship
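Conceptually, a span is just a small structured record. A minimal Python sketch of the fields above (illustrative only — not the actual OTel data model classes or wire format):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str                  # shared by every span in the trace
    span_id: str                   # unique to this unit of work
    parent_span_id: Optional[str]  # None for the root span
    name: str                      # operation name, e.g. "inventory-check"
    start_ns: int                  # start timestamp (nanoseconds)
    end_ns: int                    # end timestamp (nanoseconds)
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end_ns - self.start_ns) / 1e6

# The 2.8s inventory-check span from the tree above, as a record
span = Span("abc123", "span789", None, "inventory-check",
            1_000_000_000, 3_800_000_000,
            attributes={"http.status_code": 200})
```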
Context Propagation#
The trace ID must travel with the request across service boundaries. This is context propagation.
Service A ──HTTP header──→ Service B ──gRPC metadata──→ Service C
           traceparent:               traceparent:
           00-abc123-span1-01         00-abc123-span2-01
The W3C Trace Context standard defines the traceparent header:
traceparent: 00-{trace-id}-{parent-span-id}-{trace-flags}
traceparent: 00-abc123def456-span789-01
Without context propagation, you get disconnected spans — useless fragments instead of a complete picture.
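The header itself is easy to build and parse. A plain-Python sketch of the W3C format (the trace ID below is the spec's own example; real services should use an OTel propagator rather than hand-rolling this):

```python
def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # version 00; trace-flags 01 = sampled, 00 = not sampled
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,            # 32 hex chars (128-bit)
        "parent_span_id": parent_span_id,  # 16 hex chars (64-bit)
        "sampled": flags == "01",
    }

header = build_traceparent("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
# The downstream service extracts the context from the incoming header
ctx = parse_traceparent(header)
```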
OpenTelemetry#
OpenTelemetry (OTel) is the industry standard for instrumentation. It provides APIs, SDKs, and the OTLP protocol for traces, metrics, and logs.
Basic Instrumentation#
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: export spans in batches over OTLP/gRPC
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

# Create spans (order, order_id, check_inventory, charge_card defined elsewhere)
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.total", 99.99)

    with tracer.start_as_current_span("check_inventory"):
        inventory_result = check_inventory(order.items)

    with tracer.start_as_current_span("process_payment"):
        payment_result = charge_card(order.payment)
Auto-Instrumentation#
OTel provides automatic instrumentation for common libraries — HTTP clients, database drivers, message queues — so most spans appear without code changes.
# Python: instrument Flask + requests + psycopg2 automatically
opentelemetry-instrument --service_name order-service python app.py

# Node.js: auto-instrument Express + pg + ioredis
node --require @opentelemetry/auto-instrumentations-node/register app.js
Tracing Backends#
Jaeger#
Open-source, built by Uber. Strong at trace visualization, dependency graphs, and performance analysis.
- Best for: teams wanting full control, self-hosted deployments
- Storage: Cassandra, Elasticsearch, or memory
- UI: built-in web interface with trace comparison
Zipkin#
Open-source, originally from Twitter. Simpler than Jaeger, well-established ecosystem.
- Best for: simpler setups, broad language support
- Storage: Cassandra, Elasticsearch, MySQL, in-memory
- UI: clean trace timeline view
Managed Options#
- Grafana Tempo: integrates with Grafana dashboards, cost-effective object storage
- AWS X-Ray: native AWS integration
- Datadog APM / Honeycomb / Lightstep: SaaS with advanced analysis
All modern backends accept OTLP, so switching is straightforward.
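Because everything speaks OTLP, the usual deployment pattern is app → OpenTelemetry Collector → backend. A minimal Collector pipeline sketch (the `jaeger:4317` endpoint is illustrative — point it at whichever OTLP-capable backend you run):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: jaeger:4317   # any OTLP-capable backend
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

Swapping backends means changing the exporter endpoint, not re-instrumenting your services.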
Sampling Strategies#
Tracing every request is expensive at scale. Sampling controls what gets recorded.
Head-Based Sampling#
Decision made at the start of the trace.
Sample 10% of all requests randomly
→ Simple, predictable cost
→ Misses rare errors (only 10% chance of capturing them)
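Head-based ratio sampling is typically deterministic on the trace ID, so every service makes the same keep/drop decision for the same trace. A plain-Python sketch of the idea (OTel's built-in `TraceIdRatioBased` sampler implements this for you):

```python
def should_sample(trace_id: str, ratio: float) -> bool:
    # Deterministic: the decision is a pure function of the 128-bit trace ID,
    # so every service in the trace keeps or drops the same requests.
    return int(trace_id, 16) < ratio * (2 ** 128)

# 10% baseline: small IDs fall under the cutoff, large ones don't
keep = should_sample("0" * 32, 0.10)
drop = should_sample("f" * 32, 0.10)
```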
Tail-Based Sampling#
Decision made at the end of the trace, after seeing all spans.
Keep traces that:
- Have errors (status >= 500)
- Exceed latency threshold (> 2s)
- Hit specific services
- Random 5% of the rest
→ Captures all interesting traces
→ Requires a collection tier that buffers traces before deciding
Priority Sampling#
Tag certain requests as must-sample:
with tracer.start_as_current_span("critical_operation") as span:
    span.set_attribute("sampling.priority", 1)  # always keep
Practical Sampling Config#
# OpenTelemetry Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
Trace-Based Testing#
Traces are not just for debugging — they validate system behavior.
Contract Testing with Traces#
Assert that expected spans exist with correct attributes:
def test_order_creates_payment_span():
    response = client.post("/orders", json=order_data)
    trace = get_trace(response.headers["traceparent"])

    payment_span = find_span(trace, "process_payment")
    assert payment_span is not None
    assert payment_span.attributes["payment.method"] == "card"
    assert payment_span.status == "OK"
Performance Regression Detection#
Compare span durations across deployments:
v2.3.0: inventory-check p50=120ms p99=800ms
v2.4.0: inventory-check p50=350ms p99=2400ms ← regression
Correlating Traces with Logs and Metrics#
The real power comes from connecting all three pillars of observability.
Inject Trace ID into Logs#
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        record.trace_id = format(ctx.trace_id, '032x')
        record.span_id = format(ctx.span_id, '016x')
        return True

# Attach the filter and include the IDs in the log format
handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter("%(asctime)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(message)s"))
logging.getLogger().addHandler(handler)
Now every log line carries the trace ID:
2026-03-28 14:22:01 [trace_id=abc123 span_id=span789] Payment processed for order #42
Click the trace ID in Grafana, Datadog, or your logging tool to jump directly to the full trace.
Exemplars: Metrics to Traces#
Exemplars attach trace IDs to metric data points. When you see a latency spike in a histogram, click through to the exact trace that caused it.
http_request_duration{service="order"} 4.2s exemplar={trace_id="abc123"}
Common Pitfalls#
- Missing context propagation — one service that doesn't forward headers breaks the entire trace.
- Over-sampling in production — 100% sampling at scale is expensive. Start with tail-based sampling.
- Ignoring async flows — message queues need manual context injection into message headers.
- Too many custom spans — instrument boundaries and slow operations, not every function call.
- No trace-to-log correlation — traces without log context lose half their debugging value.
Key Takeaways#
- Distributed tracing shows the end-to-end journey of a request across services.
- OpenTelemetry is the standard — instrument once, export anywhere.
- Context propagation is the foundation — broken propagation means broken traces.
- Tail-based sampling captures errors and slow traces while controlling cost.
- Correlate traces with logs and metrics for the full observability picture.
Start with auto-instrumentation, add custom spans at service boundaries, and connect traces to your logging pipeline. Within a week, you'll wonder how you ever debugged without it.
Build tools that help teams visualize system architecture and trace flows at codelit.io.
Article #154 in the Codelit engineering series.