# Observability for Production Systems — Logs, Metrics, and Traces
## Something's broken. Now what?
Your dashboard is red. Users are complaining. Latency is spiking. You need to figure out what's wrong — fast.
This is where observability matters. Not as a buzzword, but as the difference between "fixed in 5 minutes" and "debugging for 3 hours."
## The three pillars
### 1. Logs — what happened
Logs tell you the story. A sequence of events that led to the problem.
Structured logging is non-negotiable. JSON logs with consistent fields:
{"level":"error","service":"payment","trace_id":"abc123","message":"Stripe API timeout","duration_ms":30000,"user_id":"u-456"}
Unstructured logging is useless at scale:
ERROR: something went wrong with payment for user
You can't search, filter, or aggregate unstructured logs. Structured logs let you query: "show me all errors from the payment service in the last hour where duration > 5000ms."
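Python's standard `logging` module can emit lines like the structured example above. Here is a minimal sketch; the `JsonFormatter` class and the field names are illustrative, chosen to mirror the example, not taken from any particular library:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with consistent fields."""
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        # Merge optional structured fields passed via `extra=...`
        for key in ("trace_id", "duration_ms", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("payment")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error(
    "Stripe API timeout",
    extra={"service": "payment", "trace_id": "abc123",
           "duration_ms": 30000, "user_id": "u-456"},
)
```

Because every record goes through one formatter, every service emits the same field names, which is what makes the "duration > 5000ms" query possible later.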
Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki + Grafana, Datadog Logs.
### 2. Metrics — what's happening now
Metrics are numbers over time. They tell you the health of your system at a glance.
The four golden signals (Google SRE):
- Latency — how long requests take (p50, p95, p99)
- Traffic — requests per second
- Errors — error rate as a percentage
- Saturation — how full your resources are (CPU, memory, disk, connections)
If you only track four things, track these.
RED method (for services):
- Rate — requests per second
- Errors — failed requests per second
- Duration — response time distribution
USE method (for infrastructure):
- Utilization — percentage of resource used
- Saturation — queue depth / backlog
- Errors — hardware/resource errors
Tools: Prometheus + Grafana, Datadog, CloudWatch, New Relic.
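To make the RED method concrete, here is a sketch of how the three numbers fall out of a window of raw request samples. The `Request` and `red_metrics` names are illustrative; in production a metrics library (e.g. the Prometheus client) maintains these counters and histograms for you:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    ok: bool

def red_metrics(requests, window_seconds):
    """Rate, Errors, Duration (nearest-rank p95) over one sample window."""
    n = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    p95 = durations[min(n - 1, rank - 1)] if durations else 0.0
    return {
        "rate_rps": n / window_seconds,
        "error_rps": errors / window_seconds,
        "p95_ms": p95,
    }

# 100 requests over a 10-second window, 1 in 20 failing
window = [Request(duration_ms=50 + i, ok=(i % 20 != 0)) for i in range(100)]
metrics = red_metrics(window, window_seconds=10)
```

Note that Duration is reported as a percentile, not an average: a healthy mean can hide a terrible p95, which is why the golden signals call out tail latency explicitly.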
### 3. Traces — where time is spent
A trace follows a single request through all the services it touches. Each service adds a span with timing information.
[User Request] 450ms total
├── API Gateway: 5ms
├── Auth Service: 20ms
├── Order Service: 380ms
│ ├── Database Query: 350ms ← bottleneck!
│ └── Cache Check: 2ms
└── Response: 5ms
Without traces, you know the request was slow. With traces, you know the database query in the order service is the bottleneck.
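The span tree above can be sketched as plain data. The `Span` shape and `bottleneck` helper below are illustrative; a real tracer like OpenTelemetry records start/end timestamps and parent IDs for you, but the analysis is the same: find the leaf span where the time actually went.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    name: str
    duration_ms: float
    parent: Optional[str] = None

def bottleneck(spans):
    """Return the leaf span (one with no children) that consumed the most time."""
    parents = {s.parent for s in spans if s.parent}
    leaves = [s for s in spans if s.name not in parents]
    return max(leaves, key=lambda s: s.duration_ms)

# The trace from the diagram above, flattened into spans
trace = [
    Span("API Gateway", 5),
    Span("Auth Service", 20),
    Span("Order Service", 380),
    Span("Database Query", 350, parent="Order Service"),
    Span("Cache Check", 2, parent="Order Service"),
]
print(bottleneck(trace).name)  # → Database Query
```

Looking only at leaf spans matters: the Order Service span is 380ms, but almost all of that is its child database query, which is the thing you actually need to fix.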
Tools: Jaeger, Zipkin, OpenTelemetry, Datadog APM.
## The observability stack in practice
Most production systems use this stack:
| Layer | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Dashboards, alerts, SLO tracking |
| Logs | Loki or ELK | Searchable event history |
| Traces | Jaeger + OpenTelemetry | Request flow visualization |
| Alerting | PagerDuty / OpsGenie | On-call notifications |
| Status | Statuspage.io | Public incident communication |
## Alerting: the part everyone gets wrong
Alert fatigue is real. If your team gets 50 alerts a day, they start ignoring all of them. Then the real incident happens and nobody notices.
Rules for good alerting:
- Alert on symptoms, not causes. "Error rate > 5%" not "CPU > 80%"
- Every alert should be actionable. If there's nothing to do, it's not an alert
- Set severity levels: critical (wake someone up) vs warning (check in the morning)
- Use SLO-based alerts: "We're burning through our error budget faster than expected"
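A toy illustration of the severity-level rule, assuming a 5% error-rate threshold as in the symptom example above (the function name and thresholds are illustrative, not from any alerting product):

```python
def evaluate_alert(error_rate, critical_threshold=0.05):
    """Symptom-based severity: alert on what users feel, not on CPU."""
    if error_rate >= critical_threshold:
        return "critical"          # page: wake someone up
    if error_rate >= critical_threshold / 2:
        return "warning"           # ticket: check in the morning
    return None                    # healthy: stay quiet, avoid alert fatigue

print(evaluate_alert(0.07))  # → critical
```

The key design choice is the `None` branch: most evaluations should produce no alert at all, which is how you keep the daily alert count low enough that "critical" still means something.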
## SLOs: the business side of observability
An SLO (Service Level Objective) is a target: "99.9% of requests complete in under 500ms."
This gives you an error budget: 0.1% of requests can be slow. If you're burning through your budget too fast, slow down deployments and fix reliability. If you have budget to spare, ship features faster.
SLOs turn observability data into business decisions.
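The error-budget arithmetic is simple enough to sketch directly. The `error_budget` helper below is illustrative; the burn-rate convention is the common one, where a value above 1.0 means the budget will run out before the SLO window ends:

```python
def error_budget(slo_target, total_requests, failed_requests, window_elapsed):
    """Error-budget accounting for an SLO like '99.9% of requests succeed'.

    window_elapsed is the fraction of the SLO window that has passed (0..1).
    A burn rate above 1.0 means the budget runs out before the window ends.
    """
    budget = (1.0 - slo_target) * total_requests   # failures the SLO allows
    consumed = failed_requests / budget
    return {
        "budget_requests": budget,
        "budget_consumed": consumed,
        "burn_rate": consumed / window_elapsed,
    }

# 99.9% SLO, 1M requests expected this window, 600 failures at the 25% mark:
status = error_budget(0.999, 1_000_000, 600, window_elapsed=0.25)
```

Here the budget is 1,000 allowed failures, 60% of it is gone with 75% of the window remaining, so the burn rate is 2.4: a signal to pause risky deployments rather than an incident page.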
## Where observability fits in your architecture
On Codelit, generate any production system and you'll see where monitoring and logging components connect. Every backend service should have metrics exported, every database should have query monitoring, and every API gateway should have request tracing.
Build observable systems: describe your architecture on Codelit.io and audit each component's monitoring coverage.