Microservices Observability Stack: Traces, Metrics, and Logs at Scale#
A monolith fails in one place. Microservices fail in the spaces between services — network calls, message queues, shared databases. Without observability, debugging a 15-service request chain is like finding a needle in 15 haystacks simultaneously.
The Three Pillars Per Service#
Every service must emit three types of telemetry:
Logs — discrete events with context ("user 123 failed auth at 14:32:01")
Metrics — numeric measurements over time (request_duration_seconds, error_count)
Traces — end-to-end request journeys across service boundaries
Why All Three Matter#
Scenario: Checkout latency spikes from 200ms to 3 seconds
Metrics tell you: "checkout-service p99 latency spiked at 14:30"
Traces tell you: "the slow requests spend 2.8s waiting on inventory-service"
Logs tell you: "inventory-service is retrying database queries due to lock contention"
No single pillar gives the full picture. Metrics detect the problem, traces localize it, logs explain it.
Distributed Tracing Correlation#
The cornerstone of microservices observability is trace context propagation:
User Request
│
▼
[API Gateway] trace_id: abc123, span_id: span-1
│
├──► [Auth Service] trace_id: abc123, span_id: span-2, parent: span-1
│
├──► [Order Service] trace_id: abc123, span_id: span-3, parent: span-1
│ │
│ ├──► [Inventory] trace_id: abc123, span_id: span-4, parent: span-3
│ │
│ └──► [Payment] trace_id: abc123, span_id: span-5, parent: span-3
│
└──► [Notification] trace_id: abc123, span_id: span-6, parent: span-1
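On the wire, this context travels in the W3C `traceparent` HTTP header, which OpenTelemetry propagates by default. A minimal sketch of the encoding — the helper functions here are illustrative, not part of any library:

```python
# W3C traceparent format: version-trace_id-parent_span_id-flags
# e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Serialize trace context for an outgoing request header."""
    return f"00-{trace_id:032x}-{span_id:016x}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    """Recover trace context from an incoming request header."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "trace_id": int(trace_id, 16),
        "span_id": int(span_id, 16),
        "sampled": flags == "01",
    }
```

With real instrumentation you never build this header by hand; the SDK's propagators inject and extract it on every instrumented HTTP call.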
OpenTelemetry Instrumentation#
from flask import Flask, request, jsonify
import requests

from opentelemetry import trace
from opentelemetry.trace import StatusCode
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

class InventoryError(Exception):
    pass

# Initialize once at service startup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

# Instrument `requests` so trace context is injected into outgoing headers;
# without this, propagation is NOT automatic
RequestsInstrumentor().instrument()

@app.route("/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", request.json["customer_id"])
        # Context propagated to downstream calls via the instrumented client
        inventory = requests.post(
            "http://inventory-service/reserve",
            json={"items": request.json["items"]},
        )
        if inventory.status_code != 200:
            span.set_status(StatusCode.ERROR, "Inventory reservation failed")
            span.record_exception(InventoryError(inventory.text))
            return jsonify({"error": "inventory reservation failed"}), 502
        order = save_order(request.json)  # order persistence elided in this excerpt
        return jsonify({"order_id": order.id})
Correlating Logs with Traces#
Inject trace context into every log line:
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.span_id else ""
        return True

# Log format includes trace context
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [trace=%(trace_id)s span=%(span_id)s] %(message)s"
))

logger = logging.getLogger(__name__)
logger.addFilter(TraceContextFilter())
logger.addHandler(handler)
Now you can jump from a log line directly to the full distributed trace.
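If your log pipeline prefers structured output (Loki can index JSON fields, for example), the same trick works as a JSON formatter. A stdlib-only sketch — the field names are a common convention, not a standard:

```python
import json
import logging

class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object, including the trace_id /
    span_id attributes set by a filter like TraceContextFilter."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
        })
```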
Service Maps#
Auto-generated topology maps show how services communicate:
┌─────────────┐
│ API Gateway │
└──────┬──────┘
┌────────────┼────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Auth │ │ Orders │ │ Search │
│ 12ms p50│ │ 45ms p50│ │ 30ms p50│
│ 0.1% err│ │ 0.5% err│ │ 0.2% err│
└──────────┘ └────┬─────┘ └──────────┘
┌─────┴──────┐
▼ ▼
┌──────────┐ ┌──────────┐
│Inventory │ │ Payment │
│ 80ms p50│ │ 120ms p50│
│ 0.3% err│ │ 0.8% err│
└──────────┘ └──────────┘
Service maps are generated from trace data — no manual configuration. They reveal:
- Dependencies you did not know existed
- Latency bottlenecks at each hop
- Error propagation paths across services
- Traffic patterns and request volumes
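Conceptually, the map is just an edge list derived from parent/child span relationships that cross a service boundary. A toy sketch of the aggregation (real backends such as Tempo's metrics-generator do this continuously over streaming span data; the span dict shape here is simplified):

```python
def service_edges(spans):
    """Derive service-to-service call edges from a batch of spans.
    Each span: {"span_id", "parent_id", "service"}."""
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s["parent_id"])
        # Only parent/child pairs in different services form a map edge
        if parent and parent["service"] != s["service"]:
            edges.add((parent["service"], s["service"]))
    return edges

spans = [
    {"span_id": "1", "parent_id": None, "service": "gateway"},
    {"span_id": "3", "parent_id": "1", "service": "orders"},
    {"span_id": "4", "parent_id": "3", "service": "inventory"},
]
# service_edges(spans) → {("gateway", "orders"), ("orders", "inventory")}
```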
Golden Signals#
Google SRE's four golden signals, applied per service:
1. Latency#
# p50, p95, p99 latency per service
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m]))
by (le)
)
2. Traffic#
# Requests per second by service and endpoint
sum(rate(http_requests_total{service="order-service"}[5m]))
by (method, path)
3. Errors#
# Error rate as percentage
sum(rate(http_requests_total{service="order-service", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="order-service"}[5m]))
* 100
4. Saturation#
# CPU usage as a percentage of the container's CPU limit
# (the usage metric is a counter, so take a rate first)
rate(container_cpu_usage_seconds_total{service="order-service"}[5m])
/ (container_spec_cpu_quota{service="order-service"}
   / container_spec_cpu_period{service="order-service"})
* 100

# Memory usage as a percentage of the container's memory limit
container_memory_working_set_bytes{service="order-service"}
/ container_spec_memory_limit_bytes{service="order-service"}
* 100
SLO Dashboards#
Service Level Objectives turn metrics into business commitments:
# SLO definitions
slos:
  - name: "Order API Availability"
    target: 99.95
    indicator:
      type: availability
      query: |
        sum(rate(http_requests_total{service="order-service",status!~"5.."}[30d]))
        / sum(rate(http_requests_total{service="order-service"}[30d]))

  - name: "Order API Latency"
    target: 99.0
    indicator:
      type: latency
      threshold: 500ms
      query: |
        sum(rate(http_request_duration_seconds_bucket{
          service="order-service", le="0.5"
        }[30d]))
        / sum(rate(http_request_duration_seconds_count{
          service="order-service"
        }[30d]))
Error Budget Tracking#
Monthly Error Budget for 99.95% availability:
Total minutes: 43,200 (30 days)
Budget: 21.6 minutes of downtime
Consumed: 8.3 minutes (38.4%)
Remaining: 13.3 minutes (61.6%)
Burn rate: 1.2x (on track)
When the error budget is exhausted:
- Freeze feature releases until reliability improves
- Prioritize reliability work (retries, circuit breakers, capacity)
- Conduct incident reviews for every budget-burning event
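The budget arithmetic above is simple enough to sketch directly. Numbers match the 99.95% example; in practice `downtime_minutes` would be derived from your availability SLI rather than passed in by hand:

```python
def error_budget(target_pct: float, downtime_minutes: float, window_days: int = 30):
    """Compute the error budget for an availability SLO over a rolling window."""
    total = window_days * 24 * 60                  # minutes in the window
    budget = total * (1 - target_pct / 100)        # allowed downtime
    consumed_pct = downtime_minutes / budget * 100
    return {
        "budget_minutes": round(budget, 1),
        "consumed_pct": round(consumed_pct, 1),
        "remaining_minutes": round(budget - downtime_minutes, 1),
    }

# error_budget(99.95, 8.3)
# → {'budget_minutes': 21.6, 'consumed_pct': 38.4, 'remaining_minutes': 13.3}
```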
Incident Response with Traces#
When an alert fires, traces accelerate root cause analysis:
Alert: "order-service p99 latency > 2s for 5 minutes"
Step 1: Open service map → see order-service depends on inventory + payment
Step 2: Check golden signals → inventory-service error rate spiked to 15%
Step 3: Find slow traces → filter traces where order-service span > 2s
Step 4: Drill into trace → inventory-service span shows 1.8s database query
Step 5: Check inventory logs (filtered by trace_id) → "lock wait timeout"
Step 6: Root cause → database migration running on inventory table
Time to root cause: 4 minutes (vs 30+ minutes without traces)
Alert Routing Configuration#
# Prometheus alerting rules tied to golden signals (routed via Alertmanager)
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} error rate above 5%"
          runbook: "https://runbooks.internal/high-error-rate"
          dashboard: "https://grafana.internal/d/golden-signals"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 5m
        labels:
          severity: warning
The Grafana LGTM Stack#
A complete open-source observability platform:
L — Loki (logs) — like Prometheus, but for log aggregation
G — Grafana (dashboards) — unified visualization for all telemetry
T — Tempo (traces) — distributed tracing backend
M — Mimir (metrics) — horizontally scalable Prometheus
Architecture#
Services → OpenTelemetry Collector → ┬─► Mimir (metrics)
├─► Loki (logs)
└─► Tempo (traces)
│
Grafana ◄─── Engineers
Deployment with Docker Compose#
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - ./provisioning:/etc/grafana/provisioning

  loki:
    image: grafana/loki
    ports:
      - "3100:3100"

  tempo:
    image: grafana/tempo
    ports:
      - "3200:3200"

  mimir:
    image: grafana/mimir
    ports:
      - "9009:9009"
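The collector container above mounts an `otel-config.yaml` that defines the fan-out. A minimal sketch of that file, assuming the compose service names as endpoints — exporter names vary across collector versions (the `loki` exporter in particular has been superseded by `otlphttp` in newer releases), so treat this as a starting point, not a drop-in config:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheusremotewrite:          # metrics → Mimir
    endpoint: http://mimir:9009/api/v1/push
  loki:                           # logs → Loki
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:                     # traces → Tempo
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```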
The power of the LGTM stack is correlation: click a metric spike in Grafana, jump to traces from that time window, then drill into logs for the specific trace. All in one UI.
Key Takeaways#
Observability is not monitoring with a fancier name — it is the ability to ask new questions about your system without deploying new code:
- Emit all three pillars from every service: logs, metrics, traces
- Propagate trace context across every service boundary with OpenTelemetry
- Build service maps automatically from trace data to visualize dependencies
- Track golden signals (latency, traffic, errors, saturation) per service
- Define SLOs with error budgets to balance reliability and velocity
- Use the Grafana LGTM stack for a unified, open-source observability platform
The cost of observability is infrastructure. The cost of no observability is 3 AM debugging sessions with kubectl logs and guesswork.
Article #321 in the Codelit engineering series. Level up your DevOps and microservices architecture at codelit.io.