# Observability Architecture: Logs, Metrics & Traces at Scale
Modern distributed systems fail in ways no single log file can explain. Observability architecture gives engineering teams the ability to ask arbitrary questions about system behavior — without deploying new code to answer them.
This guide covers the three pillars, how observability differs from monitoring, the OpenTelemetry standard, and practical patterns for logs, metrics, traces, alerting, and cost control.
## Monitoring vs Observability
Monitoring tells you when something is broken. Observability tells you why.
| Aspect | Monitoring | Observability |
|---|---|---|
| Approach | Predefined checks and thresholds | Explore arbitrary questions |
| Data model | Known-unknowns | Unknown-unknowns |
| Tooling | Dashboards, alerts | Traces, high-cardinality queries |
| Failure mode | Alert fatigue | Higher storage cost |
Monitoring is a subset of observability. You still need alerts — but an observable system lets you debug issues that no one anticipated when the alerts were written.
## The Three Pillars

### 1. Logs — What Happened
Structured logs are the foundation. Emit JSON, not plain text:
```json
{
  "timestamp": "2026-03-28T14:32:01.003Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "message": "charge failed",
  "customer_id": "cust_9182",
  "error_code": "insufficient_funds"
}
```
Key practices:

- Always include `trace_id` so logs correlate with traces.
- Use severity levels consistently: `debug`, `info`, `warn`, `error`, `fatal`.
- Ship logs to a centralized store (Elasticsearch, Loki, Datadog Logs).
### 2. Metrics — How the System Behaves Over Time
Metrics are numeric time-series data: counters, gauges, histograms.
```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: "payment-api"
    scrape_interval: 15s
    static_configs:
      - targets: ["payment-api:9090"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*"
        action: drop
```
The four golden signals to track for every service:
- Latency — response time distribution (p50, p95, p99)
- Traffic — requests per second
- Errors — 5xx rate, application error rate
- Saturation — CPU, memory, queue depth
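To make the latency signal concrete, here is a dependency-free sketch of the nearest-rank percentile math behind p50/p95/p99. The sample values are invented; production systems compute this from histogram buckets (e.g. Prometheus `histogram_quantile`) rather than raw samples.

```javascript
// Nearest-rank percentile over a window of latency samples (ms).
// Illustrates the math only -- real systems aggregate into histogram
// buckets instead of retaining every raw sample.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [12, 15, 18, 22, 35, 40, 55, 80, 120, 300];
console.log(percentile(latencies, 50)); // p50 -> 35
console.log(percentile(latencies, 95)); // p95 -> 300
console.log(percentile(latencies, 99)); // p99 -> 300
```

Note how a single 300 ms outlier dominates p95 and p99 while leaving p50 untouched — the reason tail percentiles, not averages, belong on dashboards.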
### 3. Traces — The Request Journey
Distributed tracing follows a single request across service boundaries. Each service creates a span; the collection of spans forms a trace.
```
[Gateway 12ms] → [Auth 4ms] → [Payment 85ms] → [Notification 22ms]
```
Tools like Jaeger and Zipkin visualize trace waterfalls and surface slow spans.
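To make the span/trace relationship concrete, here is a dependency-free sketch that models the waterfall above as parent-linked spans — the same shape a real tracer exports over the wire. The IDs, timings, and helper function are illustrative assumptions.

```javascript
// Toy trace model: each span has an id, a parent link, and a duration (ms).
// Together the spans form one trace; the root span is the gateway request.
const spans = [
  { id: "s1", parent: null, service: "gateway",      start: 0,  duration: 123 },
  { id: "s2", parent: "s1", service: "auth",         start: 2,  duration: 4 },
  { id: "s3", parent: "s1", service: "payment",      start: 8,  duration: 85 },
  { id: "s4", parent: "s1", service: "notification", start: 95, duration: 22 },
];

// The slowest child span is the first place to look in a waterfall view.
function slowestSpan(spans) {
  return spans
    .filter((s) => s.parent !== null)
    .reduce((a, b) => (b.duration > a.duration ? b : a));
}

console.log(slowestSpan(spans).service); // "payment"
```

A trace UI performs exactly this kind of query — "which span ate the budget?" — across millions of traces, which is why span data must carry consistent service names and parent IDs.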
## OpenTelemetry: The Standard
OpenTelemetry (OTel) is the CNCF standard for telemetry collection. It unifies logs, metrics, and traces under one SDK.
```js
// Node.js OpenTelemetry setup
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "payment-api",
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: "http://otel-collector:4318/v1/metrics",
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
The OTel Collector acts as a pipeline between your apps and backends:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s  # required; the limiter refuses to start without it
    limit_mib: 512

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```
By decoupling collection from export, you can swap backends (Jaeger to Tempo, Prometheus to Datadog) without touching application code.
## Alerting Strategies
Alerts should be actionable, not noisy. Follow these principles:
- Alert on symptoms, not causes. Alert on "error rate > 1%", not "pod restarted".
- Use severity tiers. Page for customer-facing impact; ticket for degradation; log for informational.
- Include runbook links in every alert so the on-call engineer knows what to do.
```yaml
# Prometheus alerting rule
groups:
  - name: payment-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payment-api", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment-api"}[5m]))
          > 0.01
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Payment API error rate above 1%"
          runbook: "https://wiki.internal/runbooks/payment-errors"
```
## SLOs, SLIs, and SLAs
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator — a measured metric | 99.2% of requests < 300ms |
| SLO | Service Level Objective — internal target | 99.5% success rate over 30 days |
| SLA | Service Level Agreement — contractual promise | 99.9% uptime or credits issued |
Use error budgets to balance reliability and velocity. If your SLO is 99.5%, you have a 0.5% error budget. When the budget is depleted, freeze feature releases and focus on reliability.
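The error-budget arithmetic is simple enough to sketch. Assuming the 99.5% SLO and 30-day window from the example above, the budget works out to roughly 216 "bad" minutes per window:

```javascript
// Error budget for an availability SLO over a rolling window.
function errorBudget(sloPercent, windowDays) {
  const totalMinutes = windowDays * 24 * 60;      // 43200 for 30 days
  const budgetFraction = 1 - sloPercent / 100;    // ~0.005 for 99.5%
  return {
    budgetFraction,
    budgetMinutes: totalMinutes * budgetFraction, // ~216 allowed bad minutes
  };
}

const { budgetMinutes } = errorBudget(99.5, 30);
console.log(Math.round(budgetMinutes)); // 216
```

An outage that burns 100 of those 216 minutes in week one leaves very little room for risky deploys in the remaining three weeks — which is exactly the signal an error budget is meant to give.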
## Dashboards with Grafana
Effective dashboards follow a hierarchy:
- Service overview — golden signals for all services on one screen.
- Service detail — per-service latency histograms, error breakdowns, saturation.
- Investigation — trace search, log drill-down, correlated views.
Avoid "dashboard sprawl." Every dashboard should answer a specific question. If no one looks at it during incidents, delete it.
## Logging Architecture at Scale
For high-throughput systems, a buffered pipeline prevents log loss:
```
App → Fluentd/Vector (buffer) → Kafka → Elasticsearch/Loki
```
- Buffer locally so a downstream outage does not back-pressure application threads.
- Sample verbose logs (debug/info) at high traffic — keep 100% of error and warn.
- Set retention policies — 7 days hot, 30 days warm, 90 days cold in object storage.
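The sampling rule in the second bullet reduces to a small filter at the shipper. A sketch, where the 10% keep-rate for verbose levels is an assumed example value and the injectable `random` parameter exists only to make the logic testable:

```javascript
// Keep every warn/error/fatal record; sample debug/info at a fixed rate.
function makeSampler(keepRate, random = Math.random) {
  const alwaysKeep = new Set(["warn", "error", "fatal"]);
  return (record) => alwaysKeep.has(record.level) || random() < keepRate;
}

const shouldShip = makeSampler(0.1); // ship ~10% of debug/info traffic
shouldShip({ level: "error", message: "charge failed" }); // always true
```

In practice the equivalent knob lives in the shipper config (Vector's `sample` transform, Fluentd's sampling filter) rather than application code, so the rate can change without a deploy.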
## Cost Management
Observability costs grow with cardinality and volume. Control it:
- Drop unused metrics at the collector level (see the `metric_relabel_configs` example above).
- Use tail-based sampling for traces — keep 100% of error traces, sample 5% of success traces.
- Aggregate logs before indexing — count repeated messages instead of storing each one.
- Set per-team budgets in multi-tenant platforms and charge back.
A common mistake is indexing every field. Index only fields you query. Store the rest as unindexed payload.
## Putting It All Together
A mature observability stack looks like this:
```
        Applications (OTel SDK)
                  ↓
  OTel Collector (process, route, sample)
                  ↓
┌──────────────┬──────────────┬──────────────┐
│ Jaeger/Tempo │  Prometheus  │   Loki/ES    │
│   (traces)   │  (metrics)   │    (logs)    │
└──────────────┴──────────────┴──────────────┘
        ↓              ↓              ↓
         Grafana (unified dashboards)
                       ↓
        Alertmanager → PagerDuty/Slack
```
Start with OpenTelemetry instrumentation, define SLOs before building dashboards, and treat your observability pipeline as infrastructure — version-controlled and reviewed like any other code.