Design a Logging System — From Printf to Petabyte-Scale Log Aggregation
Why logging is a system design problem#
On one server, tail -f /var/log/app.log works. With 100 servers running 20 microservices, you need a centralized logging system that collects, stores, and searches billions of log lines per day.
Structured logging#
Stop logging strings. Log structured data:
Bad:
2026-03-24 10:30:15 ERROR Payment failed for user 123, amount $49.99, reason: card_declined
Good:
{
  "timestamp": "2026-03-24T10:30:15Z",
  "level": "error",
  "service": "payment-service",
  "event": "payment_failed",
  "user_id": 123,
  "amount": 49.99,
  "currency": "USD",
  "reason": "card_declined",
  "trace_id": "abc-123-def"
}
Structured logs are searchable, filterable, and aggregatable. You can query "all payment failures over $100 in the last hour" without regex.
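As a minimal sketch of how an application might emit logs in this shape, the following uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `fields` attribute, the hard-coded service name, and the `failures_over` helper are illustrative assumptions, not part of any particular library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one per line)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-service",  # assumed service name
            "event": record.getMessage(),
        }
        # Structured fields passed as extra={"fields": {...}} appear on the
        # record as the (hypothetical) attribute `fields`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def failures_over(lines, threshold):
    """Return parsed payment_failed events whose amount exceeds threshold."""
    events = (json.loads(line) for line in lines)
    return [
        e for e in events
        if e.get("event") == "payment_failed" and e.get("amount", 0) > threshold
    ]


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("payments")
    logger.addHandler(handler)
    logger.error(
        "payment_failed",
        extra={"fields": {"user_id": 123, "amount": 49.99, "currency": "USD"}},
    )
```

Because every field is a typed JSON value, the "failures over $100" query becomes a plain comparison in `failures_over` instead of a regular expression over free text.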
Architecture#
Apps → Log Collector → Message Queue → Processor → Storage → Query/UI
1. Collection#
Apps write logs to stdout (twelve-factor app). A sidecar or agent collects them:
- Fluentd/Fluent Bit — CNCF graduated project, lightweight, plugin ecosystem
- Filebeat — Elastic's collector, ships to Logstash/Elasticsearch
- Vector — Rust-based, fast, supports transforms
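To make the agent's role concrete, here is a rough sketch of the core loop such a collector runs: read lines from a stream, group them into batches, and hand each batch to a shipper. The `collect` function and its `ship` callback are hypothetical names for illustration. Real agents such as Fluent Bit or Vector also handle file rotation, checkpoints, and retries, which this sketch leaves out.

```python
import io
from typing import Callable, Iterable, List


def collect(
    stream: Iterable[str],
    ship: Callable[[List[str]], None],
    batch_size: int = 100,
) -> int:
    """Read log lines from stream and ship them in batches.

    Returns the number of lines shipped. Blank lines are skipped.
    """
    batch: List[str] = []
    shipped = 0
    for raw in stream:
        line = raw.rstrip("\n")
        if not line:
            continue
        batch.append(line)
        if len(batch) >= batch_size:
            ship(batch)
            shipped += len(batch)
            batch = []  # start a fresh list so shipped batches are not mutated
    if batch:  # flush the final partial batch
        ship(batch)
        shipped += len(batch)
    return shipped


if __name__ == "__main__":
    fake_stdout = io.StringIO("line one\nline two\nline three\n")
    collect(fake_stdout, lambda b: print("shipping", b), batch_size=2)
```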
2. Transport#
Collectors ship logs to a message queue for buffering:
- Kafka — High throughput, durable, replay capability
- Direct shipping — Skip the queue for smaller deployments
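The buffering role of the queue can be illustrated with a small in-memory stand-in. This is only a sketch of the idea: Kafka persists messages to disk and supports replay, whereas the hypothetical `LogBuffer` below holds lines in memory and drops new ones when full, so it shows the buffering behavior and nothing more.

```python
import queue


class LogBuffer:
    """Bounded in-memory buffer between collectors and processors."""

    def __init__(self, capacity: int) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # count of lines rejected because the buffer was full

    def publish(self, line) -> bool:
        """Try to enqueue a line without blocking; return False if full."""
        try:
            self._q.put_nowait(line)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch: int) -> list:
        """Remove and return up to max_batch lines in arrival order."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch
```

Tracking `dropped` makes the trade-off visible: a buffer absorbs bursts only up to its capacity, which is one reason larger deployments put a durable queue in this position.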
3. Processing#
Parse, enrich, filter, and transform logs:
- Logstash — Part of ELK stack, powerful but resource-heavy
- Fluent Bit — Lightweight processing at the edge
- Vector — Processing pipeline with VRL (Vector Remap Language) transforms
Enrichment examples: add GeoIP data, parse user-agent strings, redact PII, add Kubernetes metadata.
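A processing stage along these lines might look like the sketch below, which filters out debug events, redacts email addresses and card-like numbers, and attaches Kubernetes metadata. The regular expressions are deliberately simple and will miss edge cases; the `process` function, the `kubernetes` key, and the metadata fields are assumptions for illustration.

```python
import re
from typing import Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Replace email addresses and card-like numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return CARD_RE.sub("[REDACTED_CARD]", text)


def process(event: dict, metadata: dict) -> Optional[dict]:
    """Filter, redact, and enrich one parsed log event.

    Returns None when the event should be dropped.
    """
    if event.get("level") == "debug":  # filter: drop debug noise
        return None
    # Redact every string value; leave numbers and other types untouched.
    cleaned = {
        key: redact(value) if isinstance(value, str) else value
        for key, value in event.items()
    }
    cleaned["kubernetes"] = dict(metadata)  # enrich with a copy of the metadata
    return cleaned
```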
4. Storage#
Two main approaches:
Indexed (Elasticsearch):
- Full-text search, fast queries
- Expensive storage (inverted indexes)
- Best for: operational debugging, search-heavy workloads
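Log-optimized (Grafana Loki, ClickHouse):
- Store compressed logs cheaply, typically in object storage
- Loki indexes only metadata (labels), not content; ClickHouse relies on columnar storage and sparse indexes rather than a full inverted index
- Best for: high volume, cost-sensitive, label-based queries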
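The difference between the two indexing strategies can be shown with a toy model. The sketch below is not how Elasticsearch or Loki are implemented; it only contrasts an index over every token in the message with an index over labels followed by a scan of the matching messages. All names and sample data are hypothetical.

```python
from collections import defaultdict

# Each entry is (labels, message).
LOGS = [
    ({"service": "payment", "level": "error"}, "card declined for order 17"),
    ({"service": "payment", "level": "info"}, "payment accepted for order 18"),
    ({"service": "search", "level": "error"}, "timeout contacting index shard"),
]


def build_inverted_index(entries):
    """Full-text style: map every message token to the entries containing it."""
    index = defaultdict(set)
    for doc_id, (_, message) in enumerate(entries):
        for token in message.lower().split():
            index[token].add(doc_id)
    return index


def build_label_index(entries):
    """Label style: map each (label, value) pair to the entries carrying it."""
    index = defaultdict(set)
    for doc_id, (labels, _) in enumerate(entries):
        for pair in labels.items():
            index[pair].add(doc_id)
    return index


def label_query(entries, label_index, labels, needle):
    """Narrow by labels using the index, then scan message text for needle."""
    candidates = set(range(len(entries)))
    for pair in labels.items():
        candidates &= label_index.get(pair, set())
    return sorted(i for i in candidates if needle in entries[i][1])
```

Even with three log lines, the token index has more keys than the label index, which hints at why indexing content costs more storage while label-only indexing pushes work to query time.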
5. Query and visualization#
- Kibana — Dashboards, visualizations, alerting for Elasticsearch
- Grafana — Unified dashboards for logs, metrics, and traces
- Custom UI — Build your own with the query API
Log levels#
Use them consistently:
| Level | When to use |
|---|---|
| ERROR | Something failed, needs attention |
| WARN | Unexpected but handled, might become an error |
| INFO | Normal operations worth recording |
| DEBUG | Detailed diagnostic info, disabled in production |
Rule: If you'd wake someone up at 3 AM for it, it's ERROR. If not, it's WARN or INFO.
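One way to apply the table and the rule in code, sketched with Python's standard `logging` module: set the threshold per environment so DEBUG is off in production, and treat ERROR and above as page-worthy. The environment names and the `configure` and `should_page` helpers are assumptions for illustration.

```python
import logging


def configure(env: str) -> logging.Logger:
    """Return a logger whose threshold depends on the environment."""
    logger = logging.getLogger(f"app.{env}")
    # DEBUG is only enabled outside production.
    logger.setLevel(logging.INFO if env == "production" else logging.DEBUG)
    return logger


def should_page(levelno: int) -> bool:
    """The 3 AM rule: only ERROR and above should wake someone up."""
    return levelno >= logging.ERROR


if __name__ == "__main__":
    log = configure("production")
    log.debug("cache miss for key=%s", "user:123")  # suppressed in production
    log.warning("retrying payment provider call")   # passes the threshold
```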
Retention and cost#
Logs are expensive to store. Tier your retention:
| Tier | Duration | Storage | Use case |
|---|---|---|---|
| Hot | 7 days | SSD/Elasticsearch | Active debugging |
| Warm | 30 days | HDD/cheaper storage | Recent investigations |
| Cold | 1 year | Object storage (S3) | Compliance, audit |
| Archive | 7 years | Glacier | Legal requirements |
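The tiering policy can be expressed as a small lookup by log age. The sketch below assumes the durations in the table are cumulative age limits (a log moves to warm after 7 days, to cold after 30, and so on) and that anything older than 7 years is deleted; both are interpretations, not something the table states outright.

```python
from datetime import datetime, timedelta, timezone

# (maximum age, tier name), ordered from newest to oldest.
TIERS = [
    (timedelta(days=7), "hot"),
    (timedelta(days=30), "warm"),
    (timedelta(days=365), "cold"),
    (timedelta(days=7 * 365), "archive"),
]


def tier_for(log_time: datetime, now: datetime) -> str:
    """Return the storage tier for a log written at log_time."""
    age = now - log_time
    for max_age, name in TIERS:
        if age <= max_age:
            return name
    return "delete"  # assumed policy once the archive period has passed


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(tier_for(now - timedelta(days=45), now))
```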
Alerting#
Don't just store logs — act on them:
- Error rate spike — Alert when error rate exceeds 5% of requests
- Specific patterns — "OOM killed", "disk full", "connection refused"
- Absence detection — Alert when a service stops logging (it might be dead)
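The three alert types above reduce to simple checks once logs are structured. The following is a minimal sketch: the 5% threshold and the patterns come from the list, while the function names, the 300-second silence window, and the use of plain numeric timestamps are assumptions.

```python
PATTERNS = ("OOM killed", "disk full", "connection refused")


def error_rate_alert(total_requests: int, error_count: int, threshold: float = 0.05) -> bool:
    """Fire when errors exceed threshold as a fraction of requests."""
    if total_requests == 0:
        return False  # no traffic; absence detection covers this case
    return error_count / total_requests > threshold


def matches_pattern(message: str) -> bool:
    """Fire when a message contains any known-bad pattern (case-insensitive)."""
    lowered = message.lower()
    return any(pattern.lower() in lowered for pattern in PATTERNS)


def silent_services(last_seen: dict, now: float, max_silence_seconds: float = 300) -> list:
    """Return services whose most recent log is older than the allowed silence.

    last_seen maps service name to the timestamp (in seconds) of its last log.
    """
    return sorted(
        service
        for service, timestamp in last_seen.items()
        if now - timestamp > max_silence_seconds
    )
```

In practice these checks would run over a rolling time window inside the alerting tool rather than over raw counters, but the conditions themselves stay the same.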
Common mistakes#
Logging too much. DEBUG-level logging in production can generate terabytes of data. Be selective.
Not including trace IDs. Without correlation IDs, you can't follow a request across services.
Logging PII. Names, emails, and credit card numbers in logs can put you in violation of GDPR. Redact at collection time.
No retention policy. Logs grow forever until storage runs out at 3 AM on a Friday.
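For the trace ID mistake in particular, one possible approach in Python is to keep the ID in a context variable and attach it to every record with a logging filter, so individual log calls cannot forget it. The `trace_id_var`, `TraceIdFilter`, and `start_request` names are hypothetical, and how the incoming ID is read from a request header is left out.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the trace ID for the current request context; "-" means none set.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop the record, only annotate it


def start_request(incoming_trace_id: Optional[str] = None) -> str:
    """Reuse the caller's trace ID if present, otherwise generate one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    start_request("abc-123-def")
    logger.warning("inventory lookup slow")
```

Passing the same ID to downstream services (for example in a request header) is what lets the query layer stitch one request's logs together across services.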
Key takeaways#
- Structured logging (JSON) over free-text — searchable, aggregatable
- Centralize with Fluentd/Vector → Kafka → Elasticsearch/Loki
- Tier retention — hot (7d), warm (30d), cold (1y), archive (7y)
- Always include trace_id — correlate logs across services
- Redact PII at collection time, not after storage
- Alert on patterns — don't just store logs, act on them