# Observability vs Monitoring: Metrics, Logs, and Traces Explained
Monitoring tells you when something is broken. Observability tells you why.
Most teams start with monitoring (alerts when CPU is high) and realize too late they need observability (understanding why requests are slow for users in Europe on Tuesdays). This guide covers the three pillars and how to implement them.
## Monitoring vs Observability
| | Monitoring | Observability |
|---|---|---|
| Question | "Is it broken?" | "Why is it broken?" |
| Approach | Predefined dashboards and alerts | Explore any question about system behavior |
| Data | Known metrics (CPU, memory, error rate) | High-cardinality data (per-request, per-user) |
| When useful | Known failure modes | Unknown unknowns |
You need both. Monitoring catches known issues fast. Observability helps debug novel problems.
## The Three Pillars
### 1. Metrics — Numbers Over Time
Time-series data: values measured at regular intervals.

```text
http_requests_total{service="api", status="200"} 145892
http_request_duration_seconds{service="api", quantile="0.99"} 0.42
cpu_usage_percent{host="web-1"} 72.5
```
Types:
- Counter — only goes up (total requests, errors)
- Gauge — goes up and down (CPU, memory, queue depth)
- Histogram — distribution of values (request duration P50/P95/P99)
Best for: Dashboards, alerting, capacity planning, SLO tracking
Tools: Prometheus, Datadog, CloudWatch, Grafana + Mimir
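The three metric types can be sketched in plain JavaScript for intuition (a toy illustration only; production services would use a real client such as prom-client or the OTel SDK, and real histograms use buckets rather than storing every sample):

```javascript
// Toy versions of the three metric types.

class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) { this.value += n; } // counters only go up
}

class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; } // gauges go up and down
}

class Histogram {
  constructor() { this.samples = []; }
  observe(v) { this.samples.push(v); }
  quantile(q) {
    // naive quantile: sort all samples and index into them
    const sorted = [...this.samples].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];
  }
}

const requests = new Counter();
const cpu = new Gauge();
const duration = new Histogram();

requests.inc();
cpu.set(72.5);
[0.02, 0.05, 0.05, 0.41, 0.42].forEach((s) => duration.observe(s));
console.log(requests.value, cpu.value, duration.quantile(0.99)); // → 1 72.5 0.42
```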
### 2. Logs — Events with Context
Structured records of what happened:

```json
{
  "timestamp": "2026-03-28T14:32:01Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "user_456",
  "message": "Stripe charge failed",
  "error": "card_declined",
  "amount": 99.99
}
```
Prefer structured logs over unstructured text: emit JSON with consistent field names so your log backend can index, filter, and correlate them.
Best for: Debugging specific errors, audit trails, compliance
Tools: ELK (Elasticsearch + Logstash + Kibana), Grafana Loki, Datadog Logs, CloudWatch Logs
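A minimal structured logger is just a function that emits one JSON object per event with a consistent set of base fields. A sketch (illustrative only; the field names are assumptions, not a standard schema, and real services would use a library like pino or winston):

```javascript
// Sketch of a structured logger: one JSON line per event, with
// service and trace_id stamped onto every entry automatically.

function makeLogger(service, traceId) {
  const base = { service, trace_id: traceId };
  return (level, message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      ...base,
      message,
      ...fields,
    };
    console.log(JSON.stringify(entry)); // one line, machine-parseable
    return entry;
  };
}

const log = makeLogger("payment-service", "abc123");
log("error", "Stripe charge failed", { error: "card_declined", amount: 99.99 });
```

Stamping `trace_id` onto every entry is what lets you jump from a slow trace to the exact log lines it produced.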
### 3. Traces — Request Journey
Follow a single request across all services:

```text
[Trace: abc123] 450ms total
├── API Gateway      12ms
├── Auth Service      8ms
├── Order Service   180ms
│   ├── PostgreSQL   45ms
│   └── Redis Cache   2ms (cache miss → DB)
├── Payment Service 230ms ← bottleneck!
│   ├── Fraud Check  15ms
│   └── Stripe API  210ms ← external dependency
└── Email Service    20ms (async, not blocking)
```
Best for: Finding bottlenecks, understanding latency, debugging distributed systems
Tools: Jaeger, Zipkin, Datadog APM, Grafana Tempo, OpenTelemetry
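The trace tree above can be modeled as a flat list of spans sharing a `trace_id`; finding the bottleneck is then just picking the slowest span. A toy model (real tracers like Jaeger also store parent/child relationships and start/end timestamps):

```javascript
// Toy span store: spans grouped by trace_id, bottleneck = slowest span.

const spans = [
  { trace_id: "abc123", name: "API Gateway", durationMs: 12 },
  { trace_id: "abc123", name: "Auth Service", durationMs: 8 },
  { trace_id: "abc123", name: "Order Service", durationMs: 180 },
  { trace_id: "abc123", name: "Payment Service", durationMs: 230 },
  { trace_id: "abc123", name: "Email Service", durationMs: 20 },
];

function bottleneck(traceId, allSpans) {
  return allSpans
    .filter((s) => s.trace_id === traceId)
    .reduce((max, s) => (s.durationMs > max.durationMs ? s : max));
}

console.log(bottleneck("abc123", spans).name); // → Payment Service
```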
## How They Work Together
A user reports: "Checkout is slow."
- Metrics → Dashboard shows P99 latency spiked from 200ms to 2s at 2pm
- Traces → Filter traces where duration > 1s → Payment Service is the bottleneck
- Logs → Filter Payment Service logs at 2pm → Stripe API timeout errors, retries causing delays
Without all three, you'd be guessing.
## OpenTelemetry — The Standard
OpenTelemetry (OTel) is the open standard for all three pillars:

```text
Your App → OTel SDK → OTel Collector → Datadog / Grafana / Jaeger
                                       (any backend)
```
Why OTel matters:
- One SDK for metrics + logs + traces
- Vendor-neutral — switch backends without code changes
- Auto-instrumentation for popular frameworks (Express, Django, Spring)
- Context propagation (trace_id flows across services)
```javascript
// Node.js with OpenTelemetry
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

const span = tracer.startSpan("process-order");
try {
  span.setAttribute("order.id", orderId);
  span.setAttribute("order.amount", amount);
  // ... do work ...
} finally {
  span.end(); // always end the span, even if the work throws
}
```
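Context propagation is what lets the `trace_id` flow across services: each outgoing request carries it in the W3C Trace Context `traceparent` header, formatted as `version-traceid-spanid-flags`. OTel propagators handle this automatically; the sketch below only illustrates the header format itself:

```javascript
// Sketch of the W3C Trace Context `traceparent` header, which carries
// the trace_id between services. OTel propagators do this for you.

function buildTraceparent(traceId, spanId, sampled = true) {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(header) {
  const [version, traceId, spanId, flags] = header.split("-");
  return { version, traceId, spanId, sampled: flags === "01" };
}

// The receiving service parses the header and continues the same trace.
const header = buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7");
const ctx = parseTraceparent(header);
console.log(ctx.traceId); // → 4bf92f3577b34da6a3ce929d0e0e4736
```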
## Observability Stack Comparison
| Stack | Metrics | Logs | Traces | Cost Model |
|---|---|---|---|---|
| Datadog | Yes | Yes | Yes | Per host + ingestion |
| Grafana Cloud | Mimir | Loki | Tempo | Per metric/log/trace |
| New Relic | Yes | Yes | Yes | Per GB ingested |
| AWS | CloudWatch | CloudWatch | X-Ray | Per metric/request |
| Elastic | Metrics | ELK | APM | Per node/cluster |
| Self-hosted | Prometheus | Loki | Jaeger | Infrastructure only |
## Decision Guide
- Startup (< 20 engineers): Grafana Cloud free tier or Datadog
- Scale-up: Datadog or New Relic (full platform)
- Cost-conscious: Self-hosted Prometheus + Loki + Jaeger
- AWS-native: CloudWatch + X-Ray
- Enterprise: Datadog or Elastic
## Key Metrics to Track
### The Four Golden Signals (Google SRE)
| Signal | What | Example |
|---|---|---|
| Latency | How long requests take | P50: 50ms, P99: 200ms |
| Traffic | Request volume | 5000 req/sec |
| Errors | Failure rate | 0.5% 5xx responses |
| Saturation | Resource utilization | CPU 75%, memory 80% |
### RED Method (Microservices)
- Rate — requests per second
- Errors — errors per second
- Duration — latency histogram
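The RED numbers can be derived from a window of raw request records. A minimal sketch (the record shape and field names are illustrative; in practice these come pre-aggregated from your metrics backend):

```javascript
// Compute RED metrics (rate, errors, duration) over a time window.

function redMetrics(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const pick = (q) =>
    durations[Math.min(durations.length - 1, Math.floor(q * durations.length))];
  return {
    rate: requests.length / windowSeconds,                            // req/sec
    errorRate: requests.filter((r) => r.status >= 500).length / windowSeconds,
    p50: pick(0.5),   // duration as a distribution, not an average
    p99: pick(0.99),
  };
}

const recentRequests = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 55 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 60 },
];
console.log(redMetrics(recentRequests, 2)); // → { rate: 2, errorRate: 0.5, p50: 60, p99: 900 }
```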
### USE Method (Infrastructure)
- Utilization — % of resource used
- Saturation — work queued
- Errors — error count
## SLOs, SLIs, and SLAs
| Term | Meaning | Example |
|---|---|---|
| SLI (Indicator) | What you measure | P99 latency, error rate |
| SLO (Objective) | Your target | P99 < 200ms, errors < 0.1% |
| SLA (Agreement) | Customer contract | 99.9% uptime or credits |
SLI ← you measure it. SLO ← you set it. SLA ← you promise it.
Error budget: if your SLO is 99.9%, you have a 0.1% budget for failures (about 43 minutes per 30-day month). Spend it on deployments and experiments.
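The budget arithmetic is simple enough to sketch:

```javascript
// Error budget: the share of a time window the SLO allows you to fail.
// 99.9% over 30 days → 0.1% of 43200 minutes ≈ 43.2 minutes.

function errorBudgetMinutes(sloPercent, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9));  // ≈ 43.2 minutes per 30 days
console.log(errorBudgetMinutes(99.99)); // ≈ 4.3 minutes — each extra nine cuts the budget 10x
```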
## Architecture Example
### Microservices Observability Stack
```text
Services (with OTel SDK)
  → OTel Collector
      → Prometheus (metrics) → Grafana Dashboards
      → Loki (logs)          → Grafana Explore
      → Jaeger (traces)      → Grafana Traces
      → Alertmanager         → PagerDuty / Slack
```
Codelit can also generate Datadog, Sentry, and Grafana configs automatically from any architecture — just use the Monitoring Setup export.
## Summary
- Metrics for dashboards and alerts — "is it broken?"
- Logs for debugging — "what went wrong?"
- Traces for latency — "where is it slow?"
- OpenTelemetry as the standard — one SDK, any backend
- Four Golden Signals for every service — latency, traffic, errors, saturation
- Set SLOs — measure what matters to users, not what's easy to measure
Generate observability configs at codelit.io — Datadog, Sentry, and Grafana dashboard exports from any architecture diagram.