# Observability for Production Systems — Logs, Metrics, and Traces
## Something's broken. Now what?
Your dashboard is red. Users are complaining. Latency is spiking. You need to figure out what's wrong — fast.
This is where observability matters. Not as a buzzword, but as the difference between "fixed in 5 minutes" and "debugging for 3 hours."
## The three pillars
### 1. Logs — what happened
Logs tell you the story. A sequence of events that led to the problem.
Structured logging is non-negotiable. JSON logs with consistent fields:
{"level":"error","service":"payment","trace_id":"abc123","message":"Stripe API timeout","duration_ms":30000,"user_id":"u-456"}
Unstructured logging is useless at scale:
ERROR: something went wrong with payment for user
You can't search, filter, or aggregate unstructured logs. Structured logs let you query: "show me all errors from the payment service in the last hour where duration > 5000ms."
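Python's standard `logging` module can emit lines like the structured example above. Here is a minimal sketch; the `JsonFormatter` class and the field names are illustrative, chosen to mirror the example, not taken from any particular library:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with consistent fields."""
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        # Merge optional structured fields passed via `extra=...`
        for key in ("trace_id", "duration_ms", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("payment")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error(
    "Stripe API timeout",
    extra={"service": "payment", "trace_id": "abc123",
           "duration_ms": 30000, "user_id": "u-456"},
)
```

Because every record goes through one formatter, every service emits the same field names, which is what makes the "duration > 5000ms" query possible later.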
Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki + Grafana, Datadog Logs.
### 2. Metrics — what's happening now
Metrics are numbers over time. They tell you the health of your system at a glance.
The four golden signals (Google SRE):
- Latency — how long requests take (p50, p95, p99)
- Traffic — requests per second
- Errors — error rate as a percentage
- Saturation — how full your resources are (CPU, memory, disk, connections)
If you only track four things, track these.
RED method (for services):
- Rate — requests per second
- Errors — failed requests per second
- Duration — response time distribution
USE method (for infrastructure):
- Utilization — percentage of resource used
- Saturation — queue depth / backlog
- Errors — hardware/resource errors
Tools: Prometheus + Grafana, Datadog, CloudWatch, New Relic.
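To make the RED method concrete, here is a sketch of how the three numbers fall out of a window of raw request samples. The `Request` and `red_metrics` names are illustrative; in production a metrics library (e.g. the Prometheus client) maintains these counters and histograms for you:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    ok: bool

def red_metrics(requests, window_seconds):
    """Rate, Errors, Duration (nearest-rank p95) over one sample window."""
    n = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    p95 = durations[min(n - 1, rank - 1)] if durations else 0.0
    return {
        "rate_rps": n / window_seconds,
        "error_rps": errors / window_seconds,
        "p95_ms": p95,
    }

# 100 requests over a 10-second window, 1 in 20 failing
window = [Request(duration_ms=50 + i, ok=(i % 20 != 0)) for i in range(100)]
metrics = red_metrics(window, window_seconds=10)
```

Note that Duration is reported as a percentile, not an average: a healthy mean can hide a terrible p95, which is why the golden signals call out tail latency explicitly.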
### 3. Traces — where time is spent
A trace follows a single request through all the services it touches. Each service adds a span with timing information.
[User Request] 450ms total
├── API Gateway: 5ms
├── Auth Service: 20ms
├── Order Service: 380ms
│ ├── Database Query: 350ms ← bottleneck!
│ └── Cache Check: 2ms
└── Response: 5ms
Without traces, you know the request was slow. With traces, you know the database query in the order service is the bottleneck.
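The span tree above can be sketched as plain data. The `Span` shape and `bottleneck` helper below are illustrative; a real tracer like OpenTelemetry records start/end timestamps and parent IDs for you, but the analysis is the same: find the leaf span where the time actually went.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    name: str
    duration_ms: float
    parent: Optional[str] = None

def bottleneck(spans):
    """Return the leaf span (one with no children) that consumed the most time."""
    parents = {s.parent for s in spans if s.parent}
    leaves = [s for s in spans if s.name not in parents]
    return max(leaves, key=lambda s: s.duration_ms)

# The trace from the diagram above, flattened into spans
trace = [
    Span("API Gateway", 5),
    Span("Auth Service", 20),
    Span("Order Service", 380),
    Span("Database Query", 350, parent="Order Service"),
    Span("Cache Check", 2, parent="Order Service"),
]
print(bottleneck(trace).name)  # → Database Query
```

Looking only at leaf spans matters: the Order Service span is 380ms, but almost all of that is its child database query, which is the thing you actually need to fix.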
Tools: Jaeger, Zipkin, OpenTelemetry, Datadog APM.
## The observability stack in practice
Most production systems use this stack:
| Layer | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Dashboards, alerts, SLO tracking |
| Logs | Loki or ELK | Searchable event history |
| Traces | Jaeger + OpenTelemetry | Request flow visualization |
| Alerting | PagerDuty / OpsGenie | On-call notifications |
| Status | Statuspage.io | Public incident communication |
## Alerting: the part everyone gets wrong
Alert fatigue is real. If your team gets 50 alerts a day, they start ignoring all of them. Then the real incident happens and nobody notices.
Rules for good alerting:
- Alert on symptoms, not causes. "Error rate > 5%" not "CPU > 80%"
- Every alert should be actionable. If there's nothing to do, it's not an alert
- Set severity levels: critical (wake someone up) vs warning (check in the morning)
- Use SLO-based alerts: "We're burning through our error budget faster than expected"
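A toy illustration of the severity-level rule, assuming a 5% error-rate threshold as in the symptom example above (the function name and thresholds are illustrative, not from any alerting product):

```python
def evaluate_alert(error_rate, critical_threshold=0.05):
    """Symptom-based severity: alert on what users feel, not on CPU."""
    if error_rate >= critical_threshold:
        return "critical"          # page: wake someone up
    if error_rate >= critical_threshold / 2:
        return "warning"           # ticket: check in the morning
    return None                    # healthy: stay quiet, avoid alert fatigue

print(evaluate_alert(0.07))  # → critical
```

The key design choice is the `None` branch: most evaluations should produce no alert at all, which is how you keep the daily alert count low enough that "critical" still means something.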
## SLOs: the business side of observability
An SLO (Service Level Objective) is a target: "99.9% of requests complete in under 500ms."
This gives you an error budget: 0.1% of requests can be slow. If you're burning through your budget too fast, slow down deployments and fix reliability. If you have budget to spare, ship features faster.
SLOs turn observability data into business decisions.
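The error-budget arithmetic is simple enough to sketch directly. The `error_budget` helper below is illustrative; the burn-rate convention is the common one, where a value above 1.0 means the budget will run out before the SLO window ends:

```python
def error_budget(slo_target, total_requests, failed_requests, window_elapsed):
    """Error-budget accounting for an SLO like '99.9% of requests succeed'.

    window_elapsed is the fraction of the SLO window that has passed (0..1).
    A burn rate above 1.0 means the budget runs out before the window ends.
    """
    budget = (1.0 - slo_target) * total_requests   # failures the SLO allows
    consumed = failed_requests / budget
    return {
        "budget_requests": budget,
        "budget_consumed": consumed,
        "burn_rate": consumed / window_elapsed,
    }

# 99.9% SLO, 1M requests expected this window, 600 failures at the 25% mark:
status = error_budget(0.999, 1_000_000, 600, window_elapsed=0.25)
```

Here the budget is 1,000 allowed failures, 60% of it is gone with 75% of the window remaining, so the burn rate is 2.4: a signal to pause risky deployments rather than an incident page.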
## Where observability fits in your architecture
On Codelit, generate any production system and you'll see where monitoring and logging components connect. Every backend service should have metrics exported, every database should have query monitoring, and every API gateway should have request tracing.
Build observable systems: describe your architecture on Codelit.io and audit each component's monitoring coverage.