# Observability vs Monitoring: Metrics, Logs, and Traces Explained
Monitoring tells you when something is broken. Observability tells you why.
Most teams start with monitoring (alerts when CPU is high) and realize too late they need observability (understanding why requests are slow for users in Europe on Tuesdays). This guide covers the three pillars and how to implement them.
## Monitoring vs Observability
| | Monitoring | Observability |
|---|---|---|
| Question | "Is it broken?" | "Why is it broken?" |
| Approach | Predefined dashboards and alerts | Explore any question about system behavior |
| Data | Known metrics (CPU, memory, error rate) | High-cardinality data (per-request, per-user) |
| When useful | Known failure modes | Unknown unknowns |
You need both. Monitoring catches known issues fast. Observability helps debug novel problems.
## The Three Pillars
### 1. Metrics — Numbers Over Time
Time-series data: values measured at regular intervals.

```text
http_requests_total{service="api", status="200"} 145892
http_request_duration_seconds{service="api", quantile="0.99"} 0.42
cpu_usage_percent{host="web-1"} 72.5
```
Types:
- Counter — only goes up (total requests, errors)
- Gauge — goes up and down (CPU, memory, queue depth)
- Histogram — distribution of values (request duration P50/P95/P99)
Best for: Dashboards, alerting, capacity planning, SLO tracking
Tools: Prometheus, Datadog, CloudWatch, Grafana + Mimir
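The three metric types can be sketched in plain JavaScript for intuition (a toy illustration only; production services would use a real client such as prom-client or the OTel SDK, and real histograms use buckets rather than storing every sample):

```javascript
// Toy versions of the three metric types.

class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) { this.value += n; } // counters only go up
}

class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; } // gauges go up and down
}

class Histogram {
  constructor() { this.samples = []; }
  observe(v) { this.samples.push(v); }
  quantile(q) {
    // naive quantile: sort all samples and index into them
    const sorted = [...this.samples].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];
  }
}

const requests = new Counter();
const cpu = new Gauge();
const duration = new Histogram();

requests.inc();
cpu.set(72.5);
[0.02, 0.05, 0.05, 0.41, 0.42].forEach((s) => duration.observe(s));
console.log(requests.value, cpu.value, duration.quantile(0.99)); // → 1 72.5 0.42
```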
### 2. Logs — Events with Context
Structured records of what happened:

```json
{
  "timestamp": "2026-03-28T14:32:01Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "user_456",
  "message": "Stripe charge failed",
  "error": "card_declined",
  "amount": 99.99
}
```
Prefer structured logs over unstructured text: emit JSON with consistent field names so your log backend can index, filter, and correlate them.
Best for: Debugging specific errors, audit trails, compliance
Tools: ELK (Elasticsearch + Logstash + Kibana), Grafana Loki, Datadog Logs, CloudWatch Logs
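A minimal structured logger is just a function that emits one JSON object per event with a consistent set of base fields. A sketch (illustrative only; the field names are assumptions, not a standard schema, and real services would use a library like pino or winston):

```javascript
// Sketch of a structured logger: one JSON line per event, with
// service and trace_id stamped onto every entry automatically.

function makeLogger(service, traceId) {
  const base = { service, trace_id: traceId };
  return (level, message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      ...base,
      message,
      ...fields,
    };
    console.log(JSON.stringify(entry)); // one line, machine-parseable
    return entry;
  };
}

const log = makeLogger("payment-service", "abc123");
log("error", "Stripe charge failed", { error: "card_declined", amount: 99.99 });
```

Stamping `trace_id` onto every entry is what lets you jump from a slow trace to the exact log lines it produced.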
### 3. Traces — Request Journey
Follow a single request across all services:

```text
[Trace: abc123] 450ms total
├── API Gateway      12ms
├── Auth Service      8ms
├── Order Service   180ms
│   ├── PostgreSQL   45ms
│   └── Redis Cache   2ms (cache miss → DB)
├── Payment Service 230ms ← bottleneck!
│   ├── Fraud Check  15ms
│   └── Stripe API  210ms ← external dependency
└── Email Service    20ms (async, not blocking)
```
Best for: Finding bottlenecks, understanding latency, debugging distributed systems
Tools: Jaeger, Zipkin, Datadog APM, Grafana Tempo, OpenTelemetry
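The trace tree above can be modeled as a flat list of spans sharing a `trace_id`; finding the bottleneck is then just picking the slowest span. A toy model (real tracers like Jaeger also store parent/child relationships and start/end timestamps):

```javascript
// Toy span store: spans grouped by trace_id, bottleneck = slowest span.

const spans = [
  { trace_id: "abc123", name: "API Gateway", durationMs: 12 },
  { trace_id: "abc123", name: "Auth Service", durationMs: 8 },
  { trace_id: "abc123", name: "Order Service", durationMs: 180 },
  { trace_id: "abc123", name: "Payment Service", durationMs: 230 },
  { trace_id: "abc123", name: "Email Service", durationMs: 20 },
];

function bottleneck(traceId, allSpans) {
  return allSpans
    .filter((s) => s.trace_id === traceId)
    .reduce((max, s) => (s.durationMs > max.durationMs ? s : max));
}

console.log(bottleneck("abc123", spans).name); // → Payment Service
```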
## How They Work Together
A user reports: "Checkout is slow."
- Metrics → Dashboard shows P99 latency spiked from 200ms to 2s at 2pm
- Traces → Filter traces where duration > 1s → Payment Service is the bottleneck
- Logs → Filter Payment Service logs at 2pm → Stripe API timeout errors, retries causing delays
Without all three, you'd be guessing.
## OpenTelemetry — The Standard
OpenTelemetry (OTel) is the open standard for all three pillars:

```text
Your App → OTel SDK → OTel Collector → Datadog / Grafana / Jaeger
                                       (any backend)
```
Why OTel matters:
- One SDK for metrics + logs + traces
- Vendor-neutral — switch backends without code changes
- Auto-instrumentation for popular frameworks (Express, Django, Spring)
- Context propagation (trace_id flows across services)
```javascript
// Node.js with OpenTelemetry
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

const span = tracer.startSpan("process-order");
try {
  span.setAttribute("order.id", orderId);
  span.setAttribute("order.amount", amount);
  // ... do work ...
} finally {
  span.end(); // always end the span, even if the work throws
}
```
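Context propagation is what lets the `trace_id` flow across services: each outgoing request carries it in the W3C Trace Context `traceparent` header, formatted as `version-traceid-spanid-flags`. OTel propagators handle this automatically; the sketch below only illustrates the header format itself:

```javascript
// Sketch of the W3C Trace Context `traceparent` header, which carries
// the trace_id between services. OTel propagators do this for you.

function buildTraceparent(traceId, spanId, sampled = true) {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(header) {
  const [version, traceId, spanId, flags] = header.split("-");
  return { version, traceId, spanId, sampled: flags === "01" };
}

// The receiving service parses the header and continues the same trace.
const header = buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7");
const ctx = parseTraceparent(header);
console.log(ctx.traceId); // → 4bf92f3577b34da6a3ce929d0e0e4736
```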
## Observability Stack Comparison
| Stack | Metrics | Logs | Traces | Cost Model |
|---|---|---|---|---|
| Datadog | Yes | Yes | Yes | Per host + ingestion |
| Grafana Cloud | Mimir | Loki | Tempo | Per metric/log/trace |
| New Relic | Yes | Yes | Yes | Per GB ingested |
| AWS | CloudWatch | CloudWatch | X-Ray | Per metric/request |
| Elastic | Metrics | ELK | APM | Per node/cluster |
| Self-hosted | Prometheus | Loki | Jaeger | Infrastructure only |
## Decision Guide
- Startup (< 20 engineers): Grafana Cloud free tier or Datadog
- Scale-up: Datadog or New Relic (full platform)
- Cost-conscious: Self-hosted Prometheus + Loki + Jaeger
- AWS-native: CloudWatch + X-Ray
- Enterprise: Datadog or Elastic
## Key Metrics to Track
### The Four Golden Signals (Google SRE)
| Signal | What | Example |
|---|---|---|
| Latency | How long requests take | P50: 50ms, P99: 200ms |
| Traffic | Request volume | 5000 req/sec |
| Errors | Failure rate | 0.5% 5xx responses |
| Saturation | Resource utilization | CPU 75%, memory 80% |
### RED Method (Microservices)
- Rate — requests per second
- Errors — errors per second
- Duration — latency histogram
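The RED numbers can be derived from a window of raw request records. A minimal sketch (the record shape and field names are illustrative; in practice these come pre-aggregated from your metrics backend):

```javascript
// Compute RED metrics (rate, errors, duration) over a time window.

function redMetrics(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const pick = (q) =>
    durations[Math.min(durations.length - 1, Math.floor(q * durations.length))];
  return {
    rate: requests.length / windowSeconds,                            // req/sec
    errorRate: requests.filter((r) => r.status >= 500).length / windowSeconds,
    p50: pick(0.5),   // duration as a distribution, not an average
    p99: pick(0.99),
  };
}

const recentRequests = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 55 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 60 },
];
console.log(redMetrics(recentRequests, 2)); // → { rate: 2, errorRate: 0.5, p50: 60, p99: 900 }
```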
### USE Method (Infrastructure)
- Utilization — % of resource used
- Saturation — work queued
- Errors — error count
## SLOs, SLIs, and SLAs
| Term | Meaning | Example |
|---|---|---|
| SLI (Indicator) | What you measure | P99 latency, error rate |
| SLO (Objective) | Your target | P99 < 200ms, errors < 0.1% |
| SLA (Agreement) | Customer contract | 99.9% uptime or credits |
SLI ← you measure it. SLO ← you set it. SLA ← you promise it.
Error budget: if your SLO is 99.9%, you have a 0.1% budget for failures (about 43 minutes per 30-day month). Spend it on deployments and experiments.
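The budget arithmetic is simple enough to sketch:

```javascript
// Error budget: the share of a time window the SLO allows you to fail.
// 99.9% over 30 days → 0.1% of 43200 minutes ≈ 43.2 minutes.

function errorBudgetMinutes(sloPercent, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9));  // ≈ 43.2 minutes per 30 days
console.log(errorBudgetMinutes(99.99)); // ≈ 4.3 minutes — each extra nine cuts the budget 10x
```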
## Architecture Example
### Microservices Observability Stack
```text
Services (with OTel SDK)
  → OTel Collector
      → Prometheus (metrics) → Grafana Dashboards
      → Loki (logs)          → Grafana Explore
      → Jaeger (traces)      → Grafana Traces
      → Alertmanager         → PagerDuty / Slack
```
Codelit can also generate Datadog, Sentry, and Grafana configs automatically from any architecture — just use the Monitoring Setup export.
## Summary
- Metrics for dashboards and alerts — "is it broken?"
- Logs for debugging — "what went wrong?"
- Traces for latency — "where is it slow?"
- OpenTelemetry as the standard — one SDK, any backend
- Four Golden Signals for every service — latency, traffic, errors, saturation
- Set SLOs — measure what matters to users, not what's easy to measure
Generate observability configs at codelit.io — Datadog, Sentry, and Grafana dashboard exports from any architecture diagram.