Microservices Observability Stack: Traces, Metrics, and Logs at Scale#
A monolith fails in one place. Microservices fail in the spaces between services — network calls, message queues, shared databases. Without observability, debugging a 15-service request chain is like finding a needle in 15 haystacks simultaneously.
The Three Pillars Per Service#
Every service must emit three types of telemetry:
Logs — discrete events with context ("user 123 failed auth at 14:32:01")
Metrics — numeric measurements over time (request_duration_seconds, error_count)
Traces — end-to-end request journeys across service boundaries
Why All Three Matter#
Scenario: Checkout latency spikes from 200ms to 3 seconds
Metrics tell you: "checkout-service p99 latency spiked at 14:30"
Traces tell you: "the slow requests spend 2.8s waiting on inventory-service"
Logs tell you: "inventory-service is retrying database queries due to lock contention"
No single pillar gives the full picture. Metrics detect the problem, traces localize it, logs explain it.
Distributed Tracing Correlation#
The cornerstone of microservices observability is trace context propagation:
User Request
│
▼
[API Gateway] trace_id: abc123, span_id: span-1
│
├──► [Auth Service] trace_id: abc123, span_id: span-2, parent: span-1
│
├──► [Order Service] trace_id: abc123, span_id: span-3, parent: span-1
│ │
│ ├──► [Inventory] trace_id: abc123, span_id: span-4, parent: span-3
│ │
│ └──► [Payment] trace_id: abc123, span_id: span-5, parent: span-3
│
└──► [Notification] trace_id: abc123, span_id: span-6, parent: span-1
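On the wire, this context travels in the W3C `traceparent` HTTP header, which OpenTelemetry propagates by default. A minimal sketch of the encoding — the helper functions here are illustrative, not part of any library:

```python
# W3C traceparent format: version-trace_id-parent_span_id-flags
# e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Serialize trace context for an outgoing request header."""
    return f"00-{trace_id:032x}-{span_id:016x}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    """Recover trace context from an incoming request header."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "trace_id": int(trace_id, 16),
        "span_id": int(span_id, 16),
        "sampled": flags == "01",
    }
```

With real instrumentation you never build this header by hand; the SDK's propagators inject and extract it on every instrumented HTTP call.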
OpenTelemetry Instrumentation#
from flask import Flask, request, jsonify
import requests

from opentelemetry import trace
from opentelemetry.trace import StatusCode
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

class InventoryError(Exception):
    pass

# Initialize once at service startup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

# Instrument `requests` so trace context is injected into outgoing headers;
# without this, propagation is NOT automatic
RequestsInstrumentor().instrument()

@app.route("/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", request.json["customer_id"])
        # Context propagated to downstream calls via the instrumented client
        inventory = requests.post(
            "http://inventory-service/reserve",
            json={"items": request.json["items"]},
        )
        if inventory.status_code != 200:
            span.set_status(StatusCode.ERROR, "Inventory reservation failed")
            span.record_exception(InventoryError(inventory.text))
            return jsonify({"error": "inventory reservation failed"}), 502
        order = save_order(request.json)  # order persistence elided in this excerpt
        return jsonify({"order_id": order.id})
Correlating Logs with Traces#
Inject trace context into every log line:
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.span_id else ""
        return True

# Log format includes trace context
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [trace=%(trace_id)s span=%(span_id)s] %(message)s"
))

logger = logging.getLogger(__name__)
logger.addFilter(TraceContextFilter())
logger.addHandler(handler)
Now you can jump from a log line directly to the full distributed trace.
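If your log pipeline prefers structured output (Loki can index JSON fields, for example), the same trick works as a JSON formatter. A stdlib-only sketch — the field names are a common convention, not a standard:

```python
import json
import logging

class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object, including the trace_id /
    span_id attributes set by a filter like TraceContextFilter."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
        })
```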
Service Maps#
Auto-generated topology maps show how services communicate:
┌─────────────┐
│ API Gateway │
└──────┬──────┘
┌────────────┼────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Auth │ │ Orders │ │ Search │
│ 12ms p50│ │ 45ms p50│ │ 30ms p50│
│ 0.1% err│ │ 0.5% err│ │ 0.2% err│
└──────────┘ └────┬─────┘ └──────────┘
┌─────┴──────┐
▼ ▼
┌──────────┐ ┌──────────┐
│Inventory │ │ Payment │
│ 80ms p50│ │ 120ms p50│
│ 0.3% err│ │ 0.8% err│
└──────────┘ └──────────┘
Service maps are generated from trace data — no manual configuration. They reveal:
- Dependencies you did not know existed
- Latency bottlenecks at each hop
- Error propagation paths across services
- Traffic patterns and request volumes
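Conceptually, the map is just an edge list derived from parent/child span relationships that cross a service boundary. A toy sketch of the aggregation (real backends such as Tempo's metrics-generator do this continuously over streaming span data; the span dict shape here is simplified):

```python
def service_edges(spans):
    """Derive service-to-service call edges from a batch of spans.
    Each span: {"span_id", "parent_id", "service"}."""
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s["parent_id"])
        # Only parent/child pairs in different services form a map edge
        if parent and parent["service"] != s["service"]:
            edges.add((parent["service"], s["service"]))
    return edges

spans = [
    {"span_id": "1", "parent_id": None, "service": "gateway"},
    {"span_id": "3", "parent_id": "1", "service": "orders"},
    {"span_id": "4", "parent_id": "3", "service": "inventory"},
]
# service_edges(spans) → {("gateway", "orders"), ("orders", "inventory")}
```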
Golden Signals#
Google SRE's four golden signals, applied per service:
1. Latency#
# p50, p95, p99 latency per service
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m]))
by (le)
)
2. Traffic#
# Requests per second by service and endpoint
sum(rate(http_requests_total{service="order-service"}[5m]))
by (method, path)
3. Errors#
# Error rate as percentage
sum(rate(http_requests_total{service="order-service", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="order-service"}[5m]))
* 100
4. Saturation#
# CPU usage as a percentage of the container's CPU limit
# (the usage metric is a counter, so take a rate first)
rate(container_cpu_usage_seconds_total{service="order-service"}[5m])
/ (container_spec_cpu_quota{service="order-service"}
   / container_spec_cpu_period{service="order-service"})
* 100

# Memory usage as a percentage of the container's memory limit
container_memory_working_set_bytes{service="order-service"}
/ container_spec_memory_limit_bytes{service="order-service"}
* 100
SLO Dashboards#
Service Level Objectives turn metrics into business commitments:
# SLO definitions
slos:
  - name: "Order API Availability"
    target: 99.95
    indicator:
      type: availability
      query: |
        sum(rate(http_requests_total{service="order-service",status!~"5.."}[30d]))
        / sum(rate(http_requests_total{service="order-service"}[30d]))

  - name: "Order API Latency"
    target: 99.0
    indicator:
      type: latency
      threshold: 500ms
      query: |
        sum(rate(http_request_duration_seconds_bucket{
          service="order-service", le="0.5"
        }[30d]))
        / sum(rate(http_request_duration_seconds_count{
          service="order-service"
        }[30d]))
Error Budget Tracking#
Monthly Error Budget for 99.95% availability:
Total minutes: 43,200 (30 days)
Budget: 21.6 minutes of downtime
Consumed: 8.3 minutes (38.4%)
Remaining: 13.3 minutes (61.6%)
Burn rate: 1.2x (on track)
When the error budget is exhausted:
- Freeze feature releases until reliability improves
- Prioritize reliability work (retries, circuit breakers, capacity)
- Conduct incident reviews for every budget-burning event
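The budget arithmetic above is simple enough to sketch directly. Numbers match the 99.95% example; in practice `downtime_minutes` would be derived from your availability SLI rather than passed in by hand:

```python
def error_budget(target_pct: float, downtime_minutes: float, window_days: int = 30):
    """Compute the error budget for an availability SLO over a rolling window."""
    total = window_days * 24 * 60                  # minutes in the window
    budget = total * (1 - target_pct / 100)        # allowed downtime
    consumed_pct = downtime_minutes / budget * 100
    return {
        "budget_minutes": round(budget, 1),
        "consumed_pct": round(consumed_pct, 1),
        "remaining_minutes": round(budget - downtime_minutes, 1),
    }

# error_budget(99.95, 8.3)
# → {'budget_minutes': 21.6, 'consumed_pct': 38.4, 'remaining_minutes': 13.3}
```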
Incident Response with Traces#
When an alert fires, traces accelerate root cause analysis:
Alert: "order-service p99 latency > 2s for 5 minutes"
Step 1: Open service map → see order-service depends on inventory + payment
Step 2: Check golden signals → inventory-service error rate spiked to 15%
Step 3: Find slow traces → filter traces where order-service span > 2s
Step 4: Drill into trace → inventory-service span shows 1.8s database query
Step 5: Check inventory logs (filtered by trace_id) → "lock wait timeout"
Step 6: Root cause → database migration running on inventory table
Time to root cause: 4 minutes (vs 30+ minutes without traces)
Alert Routing Configuration#
# Prometheus alerting rules tied to golden signals (routed via Alertmanager)
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} error rate above 5%"
          runbook: "https://runbooks.internal/high-error-rate"
          dashboard: "https://grafana.internal/d/golden-signals"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 5m
        labels:
          severity: warning
The Grafana LGTM Stack#
A complete open-source observability platform:
L — Loki (logs) — like Prometheus, but for log aggregation
G — Grafana (dashboards) — unified visualization for all telemetry
T — Tempo (traces) — distributed tracing backend
M — Mimir (metrics) — horizontally scalable Prometheus
Architecture#
Services → OpenTelemetry Collector → ┬─► Mimir (metrics)
├─► Loki (logs)
└─► Tempo (traces)
│
Grafana ◄─── Engineers
Deployment with Docker Compose#
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - ./provisioning:/etc/grafana/provisioning

  loki:
    image: grafana/loki
    ports:
      - "3100:3100"

  tempo:
    image: grafana/tempo
    ports:
      - "3200:3200"

  mimir:
    image: grafana/mimir
    ports:
      - "9009:9009"
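The collector container above mounts an `otel-config.yaml` that defines the fan-out. A minimal sketch of that file, assuming the compose service names as endpoints — exporter names vary across collector versions (the `loki` exporter in particular has been superseded by `otlphttp` in newer releases), so treat this as a starting point, not a drop-in config:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheusremotewrite:          # metrics → Mimir
    endpoint: http://mimir:9009/api/v1/push
  loki:                           # logs → Loki
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:                     # traces → Tempo
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```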
The power of the LGTM stack is correlation: click a metric spike in Grafana, jump to traces from that time window, then drill into logs for the specific trace. All in one UI.
Key Takeaways#
Observability is not monitoring with a fancier name — it is the ability to ask new questions about your system without deploying new code:
- Emit all three pillars from every service: logs, metrics, traces
- Propagate trace context across every service boundary with OpenTelemetry
- Build service maps automatically from trace data to visualize dependencies
- Track golden signals (latency, traffic, errors, saturation) per service
- Define SLOs with error budgets to balance reliability and velocity
- Use the Grafana LGTM stack for a unified, open-source observability platform
The cost of observability is infrastructure. The cost of no observability is 3 AM debugging sessions with kubectl logs and guesswork.
Article #321 in the Codelit engineering series. Level up your DevOps and microservices architecture at codelit.io.