Logging Architecture & the ELK Stack: From Chaos to Observability
When production breaks at 3 AM, logs are your first responder. But grepping through scattered text files across 50 servers doesn't scale. You need a logging architecture.
The Problem with Ad-Hoc Logging#
Reality without centralized logging:
SSH into server-1 → grep error → nothing
SSH into server-2 → grep error → nothing
SSH into server-3 → grep error → found it... but what caused it?
SSH into server-1 → grep the correlation ID → log already rotated
Total time: 45 minutes for a 2-minute fix
Structured Logging#
The foundation of any logging architecture. Stop writing free-text logs.
# Bad — unstructured
logger.info(f"User {user_id} placed order {order_id} for ${amount}")
# Good — structured (JSON)
logger.info("order_placed", extra={
    "user_id": user_id,
    "order_id": order_id,
    "amount": amount,
    "currency": "USD",
    "items_count": len(items),
    "payment_method": "stripe",
})
Output:
{
  "timestamp": "2026-03-28T14:23:01.442Z",
  "level": "INFO",
  "message": "order_placed",
  "user_id": "u_8f3k2",
  "order_id": "ord_9x2m1",
  "amount": 149.99,
  "currency": "USD",
  "items_count": 3,
  "service": "order-service",
  "trace_id": "abc123def456"
}
Structured logs are queryable. You can filter by user_id, aggregate by payment_method, alert on amount > 10000.
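Output like the above requires a JSON formatter wired into the logger. A minimal sketch using only Python's stdlib `logging` — the `JsonFormatter` class here is an illustrative helper, not a stdlib API (in practice, libraries such as `python-json-logger` or `structlog` do this for you):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line, merging in extra={} fields."""
    # Attribute names every LogRecord carries by default, so we can spot extras
    RESERVED = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message"}

    def format(self, record):
        doc = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S") + "Z",
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via extra={...} become attributes on the record
        doc.update({k: v for k, v in vars(record).items() if k not in self.RESERVED})
        return json.dumps(doc)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_placed", extra={"user_id": "u_8f3k2", "amount": 149.99})
```

Fields like `service` and `trace_id` from the example output would typically be injected the same way, either per call or via a logging filter shared by all services.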
Log Levels#
Use them consistently across all services:
| Level | When | Example |
|---|---|---|
| TRACE | Granular debugging | Function entry/exit, variable values |
| DEBUG | Development diagnostics | Cache hit/miss, query plans |
| INFO | Normal operations | Request served, job completed |
| WARN | Degraded but functional | Retry succeeded, fallback used |
| ERROR | Operation failed | Payment declined, timeout |
| FATAL | System unusable | DB connection lost, OOM |
# Production log level strategy
production:
  default: INFO
  noisy-services: WARN
  critical-paths: DEBUG   # sampled at 10%
staging:
  default: DEBUG
development:
  default: TRACE
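The "DEBUG sampled at 10%" line above can be implemented as a standard `logging` filter. A sketch, assuming random per-record sampling is acceptable (the class and logger names are illustrative):

```python
import logging
import random

class SampledDebugFilter(logging.Filter):
    """Pass INFO and above untouched; pass DEBUG records only at a sampled rate."""
    def __init__(self, debug_sample_rate=0.1):
        super().__init__()
        self.debug_sample_rate = debug_sample_rate

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True                        # never drop INFO/WARN/ERROR/FATAL
        return random.random() < self.debug_sample_rate

# Keep ~10% of DEBUG on the critical path, as in the strategy above
logging.getLogger("critical-path").addFilter(SampledDebugFilter(0.1))
```

For correlated debugging it is often better to sample by `trace_id` rather than per record, so a whole request's DEBUG lines are kept or dropped together.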
Centralized Logging Architecture#
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Service A │ │ Service B │ │ Service C │
│ (stdout) │ │ (stdout) │ │ (stdout) │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────┐
│ Log Shipper / Agent │
│ (Filebeat, Fluentd, or Vector) │
└─────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Buffer / Queue (optional) │
│ (Kafka, Redis) │
└─────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Log Aggregator / Processor │
│ (Logstash, Fluentd) │
└─────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Storage & Index │
│ (Elasticsearch, Loki, ClickHouse) │
└─────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Visualization & Alerting │
│ (Kibana, Grafana, AlertManager) │
└─────────────────────────────────────────────┘
The ELK Stack#
Elasticsearch + Logstash + Kibana — the most popular open-source logging stack.
Elasticsearch — Storage & Search#
Full-text search engine built on Apache Lucene. Stores logs as JSON documents in indices.
// Index template for application logs
PUT _index_template/app-logs
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    },
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "level":     { "type": "keyword" },
        "service":   { "type": "keyword" },
        "message":   { "type": "text" },
        "trace_id":  { "type": "keyword" },
        "user_id":   { "type": "keyword" }
      }
    }
  }
}
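Because `trace_id` and `level` are mapped as `keyword`, correlating one request across services is a single filtered search. A sketch that just builds the request body (the helper name and defaults are illustrative; you would POST this to `app-logs-*/_search`):

```python
def trace_query(trace_id, levels=("ERROR", "WARN")):
    """Build an Elasticsearch bool query matching one trace, in time order."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"trace_id": trace_id}},     # exact match on keyword field
                    {"terms": {"level": list(levels)}},
                ]
            }
        },
        "sort": [{"timestamp": "asc"}],
    }
```

Using `filter` rather than `must` skips relevance scoring, which is what you want for log lookups: faster, and cacheable.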
Logstash — Processing Pipeline#
# logstash.conf
input {
  beats { port => 5044 }
}

filter {
  json { source => "message" }

  # Enrich with geo data
  geoip { source => "client_ip" }

  # Drop noisy health checks
  if [message] =~ /health_check/ { drop {} }

  # Add environment tag
  mutate { add_field => { "environment" => "production" } }
}

output {
  elasticsearch {
    hosts => ["http://es-node:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
Kibana — Visualization#
Build dashboards for error rates, latency distributions, top error messages, and service health.
EFK Variant#
Replace Logstash with Fluentd — lighter on resources, Kubernetes-native, and backed by a large plugin ecosystem.
# Fluentd config for Kubernetes
<source>
  @type tail
  path /var/log/containers/*.log
  tag kubernetes.*
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc
  port 9200
  index_name fluentd-${tag}
  <buffer tag>
    @type file
    flush_interval 5s
  </buffer>
</match>
Log Shippers Compared#
| Shipper | Language | Memory | Best For |
|---|---|---|---|
| Filebeat | Go | ~10 MB | Simple file tailing to ES |
| Fluentd | Ruby/C | ~40 MB | Kubernetes, plugin ecosystem |
| Fluent Bit | C | ~1 MB | Edge, IoT, resource-constrained |
| Vector | Rust | ~15 MB | High throughput, flexible routing |
Vector is the rising star — Rust-based, built for high throughput, with a declarative config:
# vector.toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]
[transforms.parse_json]
type = "remap"
inputs = ["app_logs"]
source = '. = parse_json!(.message)'
[transforms.filter_errors]
type = "filter"
inputs = ["parse_json"]
condition = '.level == "ERROR" || .level == "FATAL"'
[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoints = ["http://es:9200"]
bulk.index = "app-logs-%Y-%m-%d"
[sinks.error_alerts]
type = "http"
inputs = ["filter_errors"]
uri = "https://hooks.slack.com/services/xxx"
encoding.codec = "json"
Storage Options Beyond Elasticsearch#
| Storage | Strengths | Cost | Query Language |
|---|---|---|---|
| Elasticsearch | Full-text search, mature | High (RAM-hungry) | KQL, Lucene |
| Grafana Loki | Label-based, cheap storage | Low | LogQL |
| ClickHouse | Analytical queries, compression | Medium | SQL |
| AWS CloudWatch | Managed, AWS-native | Per-GB ingestion | Insights QL |
Loki is gaining traction for cost-sensitive teams — it indexes only labels, not full text:
# LogQL query
{service="order-service", level="ERROR"} |= "timeout" | json | duration > 5s
Alerting on Log Patterns#
# ElastAlert rule — alert on error spike
name: Error Rate Spike
type: spike
index: app-logs-*
timeframe:
  minutes: 10
spike_height: 3        # alert when the rate is 3x the reference window
spike_type: up
filter:
- term:
    level: "ERROR"
alert:
- slack
slack_webhook_url: "https://hooks.slack.com/..."
slack_channel: "#alerts"
Key alerting patterns:
- Rate spike: Errors jump 3x above baseline
- New error: Error message never seen before
- Absence: Expected log missing (cron didn't run)
- Pattern: Specific sequence of events (login → fail → fail → fail → lockout)
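The rate-spike pattern boils down to comparing the latest window's count against a trailing baseline, which is roughly what the ElastAlert `spike` rule does internally. A self-contained sketch (the window counts and class name are illustrative; real alerting would query the log store):

```python
from collections import deque

class SpikeDetector:
    """Flag when a window's error count exceeds spike_height x the trailing baseline."""
    def __init__(self, baseline_windows=6, spike_height=3.0):
        self.history = deque(maxlen=baseline_windows)
        self.spike_height = spike_height

    def observe(self, window_count):
        """Feed one window's error count; return True if it is a spike."""
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            is_spike = baseline > 0 and window_count > self.spike_height * baseline
        else:
            is_spike = False   # not enough history to judge yet
        self.history.append(window_count)
        return is_spike

d = SpikeDetector()
counts = [10, 12, 9, 11, 10, 8, 40]         # last window jumps ~4x baseline
flags = [d.observe(c) for c in counts]      # only the last window is flagged
```

New-error and absence detection follow the same shape: keep a small rolling state (seen messages, last-seen timestamps) and compare each window against it.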
Retention Policies & Cost Management#
Logs are expensive. A medium-sized system can easily generate 1-10 TB of log data per day.
Retention strategy (tiered):
├── Hot (SSD): Last 3 days → full indexing, fast queries
├── Warm (HDD): 3-30 days → reduced replicas, slower queries
├── Cold (S3): 30-90 days → compressed, restore-on-demand
└── Frozen: 90-365 days → S3 Glacier, compliance only
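Back-of-the-envelope arithmetic shows why the tiers matter. The per-GB-month prices below are illustrative assumptions, not vendor quotes:

```python
ingest_gb_per_day = 1000                           # ~1 TB/day of logs (assumption)
price_cents = {"hot": 10, "warm": 3, "cold": 1}    # cents per GB-month, illustrative
days_in_tier = {"hot": 3, "warm": 27, "cold": 60}  # the 3 / 30 / 90-day boundaries

# Steady-state data resident in each tier, priced per month (integer cents)
monthly_cents = {t: ingest_gb_per_day * days_in_tier[t] * price_cents[t]
                 for t in price_cents}
tiered_total = sum(monthly_cents.values()) / 100               # dollars/month
all_hot = ingest_gb_per_day * 90 * price_cents["hot"] / 100
# tiered ≈ $1,710/month vs ≈ $9,000/month keeping all 90 days hot
```

Even with made-up prices the shape of the result holds: most log volume is old and rarely queried, so it should sit on the cheapest storage that still meets your restore requirements.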
Cost optimization:
├── Sample verbose logs (keep 10% of DEBUG in prod)
├── Drop known-noisy logs at shipper level
├── Compress and batch before shipping
├── Use index lifecycle management (ILM)
└── Consider Loki for non-search workloads
// Elasticsearch ILM policy
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm":   { "min_age": "3d",  "actions": { "shrink": { "number_of_shards": 1 } } },
      "cold":   { "min_age": "30d", "actions": { "searchable_snapshot": { "snapshot_repository": "s3-repo" } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}
Key Takeaways#
- Structure everything — JSON logs with consistent fields across all services
- Centralize early — don't wait for an outage to build logging infrastructure
- Ship smart — Vector or Fluent Bit at the edge, buffer through Kafka for reliability
- Store in tiers — hot/warm/cold with ILM to control costs
- Alert on patterns — rate spikes, new errors, and absent expected logs
Design observable systems with codelit.io — your visual architecture companion.
Article 192 of the Codelit engineering blog series.