system-design, infrastructure, devops

Design a Logging System — From Printf to Petabyte-Scale Log Aggregation

March 24, 2026 · 4 min read · By Mo

Why logging is a system design problem#

On one server, tail -f /var/log/app.log works. With 100 servers running 20 microservices, you need a centralized logging system that collects, stores, and searches billions of log lines per day.

Structured logging#

Stop logging strings. Log structured data:

Bad:

2026-03-24 10:30:15 ERROR Payment failed for user 123, amount $49.99, reason: card_declined

Good:

{
  "timestamp": "2026-03-24T10:30:15Z",
  "level": "error",
  "service": "payment-service",
  "event": "payment_failed",
  "user_id": 123,
  "amount": 49.99,
  "currency": "USD",
  "reason": "card_declined",
  "trace_id": "abc-123-def"
}

Structured logs are searchable, filterable, and aggregatable. You can query "all payment failures over $100 in the last hour" without regex.
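Here is a minimal sketch of that idea using Python's standard logging module. The JsonFormatter class, the "fields" key, and the payment_failures_over helper are hypothetical names chosen for illustration; a real deployment would more likely use an existing JSON logging library and run the query in the log store.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": self.service,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged in as-is.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def payment_failures_over(lines: list[str], threshold: float) -> list[dict]:
    """Filter JSON log lines by field values instead of by regex."""
    events = (json.loads(line) for line in lines)
    return [
        e for e in events
        if e.get("event") == "payment_failed" and e.get("amount", 0) > threshold
    ]
```

With the formatter attached to a handler, a call such as logger.error("payment_failed", extra={"fields": {"user_id": 123, "amount": 49.99}}) produces one JSON line per event, and the "over $100" question becomes a field comparison.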

Architecture#

Apps → Log Collector → Message Queue → Processor → Storage → Query/UI

1. Collection#

Apps write logs to stdout, as the twelve-factor app methodology recommends. A sidecar or node-level agent collects them:

Fluentd/Fluent Bit — CNCF standard, lightweight, plugin ecosystem
Filebeat — Elastic's collector, ships to Logstash/Elasticsearch
Vector — Rust-based, fast, supports transforms

2. Transport#

Collectors ship logs to a message queue for buffering:

Kafka — High throughput, durable, replay capability
Direct shipping — Skip the queue for smaller deployments
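To see what buffering buys you, here is a toy sketch with an in-memory bounded queue standing in for the message queue. LogBuffer is a hypothetical class for illustration only; unlike Kafka it offers no durability or replay, it just shows how a buffer decouples producers from a slower consumer and what happens when it fills up.

```python
import queue


class LogBuffer:
    """Bounded in-memory buffer standing in for the message queue."""

    def __init__(self, max_size: int) -> None:
        self._queue = queue.Queue(maxsize=max_size)
        self.dropped = 0

    def publish(self, line: str) -> bool:
        """Enqueue without blocking the app; count drops when full."""
        try:
            self._queue.put_nowait(line)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, batch_size: int) -> list[str]:
        """Pull up to batch_size lines for the downstream processor."""
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

Dropping on overflow is one possible policy, assumed here so the application never blocks on logging. Blocking or spilling to disk are the usual alternatives.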

3. Processing#

Parse, enrich, filter, and transform logs:

Logstash — Part of ELK stack, powerful but resource-heavy
Fluent Bit — Lightweight processing at the edge
Vector — Processing pipeline with VRL (Vector Remap Language) transforms

Enrichment examples: add GeoIP data, parse user-agent strings, redact PII, add Kubernetes metadata.
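The parse, filter, and enrich steps can be sketched as a single function. This is an illustration, not how Logstash, Fluent Bit, or Vector implement them; the process function and the K8S_METADATA values are made up for the example.

```python
import json
from typing import Optional

# Hypothetical metadata an agent might attach; not from any real cluster.
K8S_METADATA = {"namespace": "payments", "pod": "payment-service-abc"}


def process(raw_line: str) -> Optional[dict]:
    """Parse, filter, and enrich one log line; return None to drop it."""
    # Parse: accept JSON objects, wrap anything else so it is not lost.
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        event = None
    if not isinstance(event, dict):
        event = {"level": "unknown", "message": raw_line}

    # Filter: drop debug noise before it reaches storage.
    if event.get("level") == "debug":
        return None

    # Enrich: attach a copy of the metadata to every surviving event.
    event["kubernetes"] = dict(K8S_METADATA)
    return event
```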

4. Storage#

Two main approaches:

Indexed (Elasticsearch):

Full-text search, fast queries
Expensive storage (inverted indexes)
Best for: operational debugging, search-heavy workloads

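Log-optimized (Grafana Loki, ClickHouse):

Store logs cheaply and compressed, often in object storage
Skip the full-text index: Loki indexes only metadata (labels), while ClickHouse relies on columnar storage and a sparse primary index
Best for: high volume, cost-sensitive, label- or column-based queries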
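A toy comparison makes the cost difference concrete. The sketch below is not how Elasticsearch or Loki work internally; it only contrasts indexing every token of the content with indexing a handful of labels and scanning the rest at query time. The sample data and function names are invented for the example.

```python
from collections import defaultdict

LOGS = [
    ({"service": "payment-service", "level": "error"}, "payment failed card_declined"),
    ({"service": "payment-service", "level": "info"}, "payment succeeded"),
    ({"service": "auth-service", "level": "error"}, "login failed bad_password"),
]


def build_inverted_index(logs):
    """Indexed approach: map every token in the content to log positions."""
    index = defaultdict(set)
    for pos, (_labels, line) in enumerate(logs):
        for token in line.split():
            index[token].add(pos)
    return index


def build_label_index(logs):
    """Label approach: map only (label, value) pairs to log positions."""
    index = defaultdict(set)
    for pos, (labels, _line) in enumerate(logs):
        for pair in labels.items():
            index[pair].add(pos)
    return index


def label_query(logs, index, labels, contains):
    """Narrow by labels via the index, then scan the remaining lines."""
    candidates = set(range(len(logs)))
    for pair in labels.items():
        candidates &= index.get(pair, set())
    return [logs[p][1] for p in sorted(candidates) if contains in logs[p][1]]
```

The inverted index grows with the vocabulary of the log content, while the label index grows only with the number of distinct label values. The price is the scan inside label_query.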

5. Query and visualization#

Kibana — Dashboards, visualizations, alerting for Elasticsearch
Grafana — Unified dashboards for logs, metrics, and traces
Custom UI — Build your own with the query API

Log levels#

Use them consistently:

Level	When to use
ERROR	Something failed, needs attention
WARN	Unexpected but handled, might become an error
INFO	Normal operations worth recording
DEBUG	Detailed diagnostic info, disabled in production

Rule: If you'd wake someone up at 3 AM for it, it's ERROR. If not, it's WARN or INFO.
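One way to keep DEBUG out of production without code changes is to read the threshold from the environment. This is a sketch using Python's standard logging module; the LOG_LEVEL variable name and the "app" logger name are assumptions for illustration.

```python
import logging
import os

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(default_level: str = "INFO") -> logging.Logger:
    """Set the logger threshold from LOG_LEVEL (hypothetical env var name)."""
    name = os.environ.get("LOG_LEVEL", default_level).upper()
    # Unknown values fall back to INFO rather than failing at startup.
    level = LEVELS.get(name, logging.INFO)
    logger = logging.getLogger("app")
    logger.setLevel(level)
    return logger
```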

Retention and cost#

Logs are expensive to store. Tier your retention:

Tier	Duration	Storage	Use case
Hot	7 days	SSD/Elasticsearch	Active debugging
Warm	30 days	HDD/cheaper storage	Recent investigations
Cold	1 year	Object storage (S3)	Compliance, audit
Archive	7 years	Glacier	Legal requirements
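The tiering rule can be sketched as a lookup by log age. This assumes the durations in the table are cumulative ages rather than time spent in each tier, and it ignores leap days; tier_for is a hypothetical helper, not part of any storage product.

```python
from datetime import timedelta
from typing import Optional

# (tier, maximum age) pairs from the table, ordered youngest first.
TIERS = [
    ("hot", timedelta(days=7)),
    ("warm", timedelta(days=30)),
    ("cold", timedelta(days=365)),
    ("archive", timedelta(days=7 * 365)),
]


def tier_for(age: timedelta) -> Optional[str]:
    """Return the tier for a log of this age, or None once it can be deleted."""
    for name, max_age in TIERS:
        if age <= max_age:
            return name
    return None
```

In practice the storage layer usually enforces this itself, for example through index lifecycle or object lifecycle rules, so a function like this mostly serves to document the policy.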

Alerting#

Don't just store logs — act on them:

Error rate spike — Alert when error rate exceeds 5% of requests
Specific patterns — "OOM killed", "disk full", "connection refused"
Absence detection — Alert when a service stops logging (it might be dead)
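The three rules above can be sketched as plain functions. The function names and the way counts and timestamps are supplied are assumptions for illustration; a real setup would express these as alert rules in the monitoring stack.

```python
PATTERNS = ("OOM killed", "disk full", "connection refused")


def error_rate_exceeded(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Error rate spike: True when errors exceed the threshold share of requests."""
    if requests == 0:
        return False
    return errors / requests > threshold


def matching_patterns(line: str) -> list[str]:
    """Specific patterns: return every known pattern found in the line."""
    lowered = line.lower()
    return [p for p in PATTERNS if p.lower() in lowered]


def silent_services(last_seen: dict[str, float], now: float, max_silence_s: float) -> list[str]:
    """Absence detection: services whose last log is older than max_silence_s."""
    return sorted(s for s, ts in last_seen.items() if now - ts > max_silence_s)
```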

Common mistakes#

Logging too much. DEBUG-level logging in production can generate terabytes of data. Be selective.

Not including trace IDs. Without correlation IDs, you can't follow a request across services.
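Within one Python service, a context variable plus a logging filter can stamp the trace ID onto every record. This is a minimal sketch: trace_id_var, TraceIdFilter, and handle_request are hypothetical names, and passing the ID between services (typically in a request header) is left out.

```python
import contextvars
import logging

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def handle_request(incoming_trace_id: str, logger: logging.Logger) -> None:
    """Hypothetical handler: bind the ID, log, then restore the old value."""
    token = trace_id_var.set(incoming_trace_id)
    try:
        logger.info("payment_started")
    finally:
        trace_id_var.reset(token)
```

Attach the filter to a handler and include %(trace_id)s in the format string, or read record.trace_id in a JSON formatter, and every line logged during the request carries the same ID.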

Logging PII. Names, emails, and credit card numbers in logs can violate GDPR (and, for card data, PCI DSS). Redact at collection time.
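A rough sketch of pattern-based redaction is below. The regular expressions are deliberately simple assumptions: the card pattern matches any run of 13 to 19 digits, so it will also catch long numeric IDs unless a checksum test is added, and names cannot be found by pattern at all.

```python
import re

# Simplified patterns for illustration; real redaction needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional separators


def redact(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return CARD.sub("[REDACTED_CARD]", text)
```

Structured logs make this easier: when the sensitive values live in known fields, the collector can drop or hash those fields instead of pattern-matching free text.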

No retention policy. Logs grow forever until storage runs out at 3 AM on a Friday.

Key takeaways#

Structured logging (JSON) over free-text — searchable, aggregatable
Centralize with Fluentd/Vector → Kafka → Elasticsearch/Loki
Tier retention — hot (7d), warm (30d), cold (1y), archive (7y)
Always include trace_id — correlate logs across services
Redact PII at collection time, not after storage
Alert on patterns — don't just store logs, act on them


