# Observability Architecture: Logs, Metrics & Traces at Scale
Modern distributed systems fail in ways no single log file can explain. Observability architecture gives engineering teams the ability to ask arbitrary questions about system behavior — without deploying new code to answer them.
This guide covers the three pillars, how observability differs from monitoring, the OpenTelemetry standard, and practical patterns for logs, metrics, traces, alerting, and cost control.
## Monitoring vs Observability
Monitoring tells you when something is broken. Observability tells you why.
| Aspect | Monitoring | Observability |
|---|---|---|
| Approach | Predefined checks and thresholds | Explore arbitrary questions |
| Data model | Known-unknowns | Unknown-unknowns |
| Tooling | Dashboards, alerts | Traces, high-cardinality queries |
| Failure mode | Alert fatigue | Higher storage cost |
Monitoring is a subset of observability. You still need alerts — but an observable system lets you debug issues that no one anticipated when the alerts were written.
## The Three Pillars

### 1. Logs — What Happened
Structured logs are the foundation. Emit JSON, not plain text:
```json
{
  "timestamp": "2026-03-28T14:32:01.003Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "message": "charge failed",
  "customer_id": "cust_9182",
  "error_code": "insufficient_funds"
}
```
Key practices:

- Always include `trace_id` so logs correlate with traces.
- Use severity levels consistently: `debug`, `info`, `warn`, `error`, `fatal`.
- Ship logs to a centralized store (Elasticsearch, Loki, Datadog Logs).
### 2. Metrics — How the System Behaves Over Time
Metrics are numeric time-series data: counters, gauges, histograms.
```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: "payment-api"
    scrape_interval: 15s
    static_configs:
      - targets: ["payment-api:9090"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*"
        action: drop
```
The four golden signals to track for every service:
- Latency — response time distribution (p50, p95, p99)
- Traffic — requests per second
- Errors — 5xx rate, application error rate
- Saturation — CPU, memory, queue depth
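To make the latency signal concrete, here is a dependency-free sketch of the nearest-rank percentile math behind p50/p95/p99. The sample values are invented; production systems compute this from histogram buckets (e.g. Prometheus `histogram_quantile`) rather than raw samples.

```javascript
// Nearest-rank percentile over a window of latency samples (ms).
// Illustrates the math only -- real systems aggregate into histogram
// buckets instead of retaining every raw sample.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [12, 15, 18, 22, 35, 40, 55, 80, 120, 300];
console.log(percentile(latencies, 50)); // p50 -> 35
console.log(percentile(latencies, 95)); // p95 -> 300
console.log(percentile(latencies, 99)); // p99 -> 300
```

Note how a single 300 ms outlier dominates p95 and p99 while leaving p50 untouched — the reason tail percentiles, not averages, belong on dashboards.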
### 3. Traces — The Request Journey
Distributed tracing follows a single request across service boundaries. Each service creates a span; the collection of spans forms a trace.
```
[Gateway 12ms] → [Auth 4ms] → [Payment 85ms] → [Notification 22ms]
```
Tools like Jaeger and Zipkin visualize trace waterfalls and surface slow spans.
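To make the span/trace relationship concrete, here is a dependency-free sketch that models the waterfall above as parent-linked spans — the same shape a real tracer exports over the wire. The IDs, timings, and helper function are illustrative assumptions.

```javascript
// Toy trace model: each span has an id, a parent link, and a duration (ms).
// Together the spans form one trace; the root span is the gateway request.
const spans = [
  { id: "s1", parent: null, service: "gateway",      start: 0,  duration: 123 },
  { id: "s2", parent: "s1", service: "auth",         start: 2,  duration: 4 },
  { id: "s3", parent: "s1", service: "payment",      start: 8,  duration: 85 },
  { id: "s4", parent: "s1", service: "notification", start: 95, duration: 22 },
];

// The slowest child span is the first place to look in a waterfall view.
function slowestSpan(spans) {
  return spans
    .filter((s) => s.parent !== null)
    .reduce((a, b) => (b.duration > a.duration ? b : a));
}

console.log(slowestSpan(spans).service); // "payment"
```

A trace UI performs exactly this kind of query — "which span ate the budget?" — across millions of traces, which is why span data must carry consistent service names and parent IDs.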
## OpenTelemetry: The Standard
OpenTelemetry (OTel) is the CNCF standard for telemetry collection. It unifies logs, metrics, and traces under one SDK.
```js
// Node.js OpenTelemetry setup
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "payment-api",
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: "http://otel-collector:4318/v1/metrics",
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
The OTel Collector acts as a pipeline between your apps and backends:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s  # required; the limiter refuses to start without it
    limit_mib: 512

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```
By decoupling collection from export, you can swap backends (Jaeger to Tempo, Prometheus to Datadog) without touching application code.
## Alerting Strategies
Alerts should be actionable, not noisy. Follow these principles:
- Alert on symptoms, not causes. Alert on "error rate > 1%", not "pod restarted".
- Use severity tiers. Page for customer-facing impact; ticket for degradation; log for informational.
- Include runbook links in every alert so the on-call engineer knows what to do.
```yaml
# Prometheus alerting rule
groups:
  - name: payment-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payment-api", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment-api"}[5m]))
          > 0.01
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Payment API error rate above 1%"
          runbook: "https://wiki.internal/runbooks/payment-errors"
```
## SLOs, SLIs, and SLAs
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator — a measured metric | 99.2% of requests < 300ms |
| SLO | Service Level Objective — internal target | 99.5% success rate over 30 days |
| SLA | Service Level Agreement — contractual promise | 99.9% uptime or credits issued |
Use error budgets to balance reliability and velocity. If your SLO is 99.5%, you have a 0.5% error budget. When the budget is depleted, freeze feature releases and focus on reliability.
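The error-budget arithmetic is simple enough to sketch. Assuming the 99.5% SLO and 30-day window from the example above, the budget works out to roughly 216 "bad" minutes per window:

```javascript
// Error budget for an availability SLO over a rolling window.
function errorBudget(sloPercent, windowDays) {
  const totalMinutes = windowDays * 24 * 60;      // 43200 for 30 days
  const budgetFraction = 1 - sloPercent / 100;    // ~0.005 for 99.5%
  return {
    budgetFraction,
    budgetMinutes: totalMinutes * budgetFraction, // ~216 allowed bad minutes
  };
}

const { budgetMinutes } = errorBudget(99.5, 30);
console.log(Math.round(budgetMinutes)); // 216
```

An outage that burns 100 of those 216 minutes in week one leaves very little room for risky deploys in the remaining three weeks — which is exactly the signal an error budget is meant to give.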
## Dashboards with Grafana
Effective dashboards follow a hierarchy:
- Service overview — golden signals for all services on one screen.
- Service detail — per-service latency histograms, error breakdowns, saturation.
- Investigation — trace search, log drill-down, correlated views.
Avoid "dashboard sprawl." Every dashboard should answer a specific question. If no one looks at it during incidents, delete it.
## Logging Architecture at Scale
For high-throughput systems, a buffered pipeline prevents log loss:
```
App → Fluentd/Vector (buffer) → Kafka → Elasticsearch/Loki
```
- Buffer locally so a downstream outage does not back-pressure application threads.
- Sample verbose logs (debug/info) at high traffic — keep 100% of error and warn.
- Set retention policies — 7 days hot, 30 days warm, 90 days cold in object storage.
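The sampling rule in the second bullet reduces to a small filter at the shipper. A sketch, where the 10% keep-rate for verbose levels is an assumed example value and the injectable `random` parameter exists only to make the logic testable:

```javascript
// Keep every warn/error/fatal record; sample debug/info at a fixed rate.
function makeSampler(keepRate, random = Math.random) {
  const alwaysKeep = new Set(["warn", "error", "fatal"]);
  return (record) => alwaysKeep.has(record.level) || random() < keepRate;
}

const shouldShip = makeSampler(0.1); // ship ~10% of debug/info traffic
shouldShip({ level: "error", message: "charge failed" }); // always true
```

In practice the equivalent knob lives in the shipper config (Vector's `sample` transform, Fluentd's sampling filter) rather than application code, so the rate can change without a deploy.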
## Cost Management
Observability costs grow with cardinality and volume. Control it:
- Drop unused metrics at the collector level (see the `metric_relabel_configs` example above).
- Use tail-based sampling for traces — keep 100% of error traces, sample 5% of success traces.
- Aggregate logs before indexing — count repeated messages instead of storing each one.
- Set per-team budgets in multi-tenant platforms and charge back.
A common mistake is indexing every field. Index only fields you query. Store the rest as unindexed payload.
## Putting It All Together
A mature observability stack looks like this:
```
        Applications (OTel SDK)
                  ↓
  OTel Collector (process, route, sample)
                  ↓
┌──────────────┬──────────────┬──────────────┐
│ Jaeger/Tempo │  Prometheus  │   Loki/ES    │
│   (traces)   │  (metrics)   │    (logs)    │
└──────────────┴──────────────┴──────────────┘
        ↓              ↓              ↓
         Grafana (unified dashboards)
                       ↓
        Alertmanager → PagerDuty/Slack
```
Start with OpenTelemetry instrumentation, define SLOs before building dashboards, and treat your observability pipeline as infrastructure — version-controlled and reviewed like any other code.