Logging Architecture & the ELK Stack: From Chaos to Observability
When production breaks at 3 AM, logs are your first responder. But grepping through scattered text files across 50 servers doesn't scale. You need a logging architecture.
The Problem with Ad-Hoc Logging#
Reality without centralized logging:
SSH into server-1 → grep error → nothing
SSH into server-2 → grep error → nothing
SSH into server-3 → grep error → found it... but what caused it?
SSH into server-1 → grep the correlation ID → log already rotated
Total time: 45 minutes for a 2-minute fix
Structured Logging#
The foundation of any logging architecture. Stop writing free-text logs.
# Bad — unstructured
logger.info(f"User {user_id} placed order {order_id} for ${amount}")
# Good — structured (JSON)
logger.info("order_placed", extra={
    "user_id": user_id,
    "order_id": order_id,
    "amount": amount,
    "currency": "USD",
    "items_count": len(items),
    "payment_method": "stripe",
})
Output:
{
  "timestamp": "2026-03-28T14:23:01.442Z",
  "level": "INFO",
  "message": "order_placed",
  "user_id": "u_8f3k2",
  "order_id": "ord_9x2m1",
  "amount": 149.99,
  "currency": "USD",
  "items_count": 3,
  "service": "order-service",
  "trace_id": "abc123def456"
}
Structured logs are queryable. You can filter by user_id, aggregate by payment_method, alert on amount > 10000.
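Output like the above requires a JSON formatter wired into the logger. A minimal sketch using only Python's stdlib `logging` — the `JsonFormatter` class here is an illustrative helper, not a stdlib API (in practice, libraries such as `python-json-logger` or `structlog` do this for you):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line, merging in extra={} fields."""
    # Attribute names every LogRecord carries by default, so we can spot extras
    RESERVED = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message"}

    def format(self, record):
        doc = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S") + "Z",
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via extra={...} become attributes on the record
        doc.update({k: v for k, v in vars(record).items() if k not in self.RESERVED})
        return json.dumps(doc)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_placed", extra={"user_id": "u_8f3k2", "amount": 149.99})
```

Fields like `service` and `trace_id` from the example output would typically be injected the same way, either per call or via a logging filter shared by all services.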
Log Levels#
Use them consistently across all services:
| Level | When | Example |
|---|---|---|
| TRACE | Granular debugging | Function entry/exit, variable values |
| DEBUG | Development diagnostics | Cache hit/miss, query plans |
| INFO | Normal operations | Request served, job completed |
| WARN | Degraded but functional | Retry succeeded, fallback used |
| ERROR | Operation failed | Payment declined, timeout |
| FATAL | System unusable | DB connection lost, OOM |
# Production log level strategy
production:
  default: INFO
  noisy-services: WARN
  critical-paths: DEBUG   # sampled at 10%
staging:
  default: DEBUG
development:
  default: TRACE
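The "DEBUG sampled at 10%" line above can be implemented as a standard `logging` filter. A sketch, assuming random per-record sampling is acceptable (the class and logger names are illustrative):

```python
import logging
import random

class SampledDebugFilter(logging.Filter):
    """Pass INFO and above untouched; pass DEBUG records only at a sampled rate."""
    def __init__(self, debug_sample_rate=0.1):
        super().__init__()
        self.debug_sample_rate = debug_sample_rate

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True                        # never drop INFO/WARN/ERROR/FATAL
        return random.random() < self.debug_sample_rate

# Keep ~10% of DEBUG on the critical path, as in the strategy above
logging.getLogger("critical-path").addFilter(SampledDebugFilter(0.1))
```

For correlated debugging it is often better to sample by `trace_id` rather than per record, so a whole request's DEBUG lines are kept or dropped together.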
Centralized Logging Architecture#
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Service A │ │ Service B │ │ Service C │
│ (stdout) │ │ (stdout) │ │ (stdout) │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────┐
│ Log Shipper / Agent │
│ (Filebeat, Fluentd, or Vector) │
└─────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Buffer / Queue (optional) │
│ (Kafka, Redis) │
└─────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Log Aggregator / Processor │
│ (Logstash, Fluentd) │
└─────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Storage & Index │
│ (Elasticsearch, Loki, ClickHouse) │
└─────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Visualization & Alerting │
│ (Kibana, Grafana, AlertManager) │
└─────────────────────────────────────────────┘
The ELK Stack#
Elasticsearch + Logstash + Kibana — the most popular open-source logging stack.
Elasticsearch — Storage & Search#
Full-text search engine built on Apache Lucene. Stores logs as JSON documents in indices.
// Index template for application logs
PUT _index_template/app-logs
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    },
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "level":     { "type": "keyword" },
        "service":   { "type": "keyword" },
        "message":   { "type": "text" },
        "trace_id":  { "type": "keyword" },
        "user_id":   { "type": "keyword" }
      }
    }
  }
}
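Because `trace_id` and `level` are mapped as `keyword`, correlating one request across services is a single filtered search. A sketch that just builds the request body (the helper name and defaults are illustrative; you would POST this to `app-logs-*/_search`):

```python
def trace_query(trace_id, levels=("ERROR", "WARN")):
    """Build an Elasticsearch bool query matching one trace, in time order."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"trace_id": trace_id}},     # exact match on keyword field
                    {"terms": {"level": list(levels)}},
                ]
            }
        },
        "sort": [{"timestamp": "asc"}],
    }
```

Using `filter` rather than `must` skips relevance scoring, which is what you want for log lookups: faster, and cacheable.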
Logstash — Processing Pipeline#
# logstash.conf
input {
  beats { port => 5044 }
}

filter {
  json { source => "message" }

  # Enrich with geo data
  geoip { source => "client_ip" }

  # Drop noisy health checks
  if [message] =~ /health_check/ { drop {} }

  # Add environment tag
  mutate { add_field => { "environment" => "production" } }
}

output {
  elasticsearch {
    hosts => ["http://es-node:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
Kibana — Visualization#
Build dashboards for error rates, latency distributions, top error messages, and service health.
EFK Variant#
Replace Logstash with Fluentd — lighter on resources, Kubernetes-native, and backed by a large plugin ecosystem.
# Fluentd config for Kubernetes
<source>
  @type tail
  path /var/log/containers/*.log
  tag kubernetes.*
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc
  port 9200
  index_name fluentd-${tag}
  <buffer tag>
    @type file
    flush_interval 5s
  </buffer>
</match>
Log Shippers Compared#
| Shipper | Language | Memory | Best For |
|---|---|---|---|
| Filebeat | Go | ~10 MB | Simple file tailing to ES |
| Fluentd | Ruby/C | ~40 MB | Kubernetes, plugin ecosystem |
| Fluent Bit | C | ~1 MB | Edge, IoT, resource-constrained |
| Vector | Rust | ~15 MB | High throughput, flexible routing |
Vector is the rising star — Rust-based, built for high throughput, with a declarative config:
# vector.toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]
[transforms.parse_json]
type = "remap"
inputs = ["app_logs"]
source = '. = parse_json!(.message)'
[transforms.filter_errors]
type = "filter"
inputs = ["parse_json"]
condition = '.level == "ERROR" || .level == "FATAL"'
[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["parse_json"]
endpoints = ["http://es:9200"]
bulk.index = "app-logs-%Y-%m-%d"
[sinks.error_alerts]
type = "http"
inputs = ["filter_errors"]
uri = "https://hooks.slack.com/services/xxx"
encoding.codec = "json"
Storage Options Beyond Elasticsearch#
| Storage | Strengths | Cost | Query Language |
|---|---|---|---|
| Elasticsearch | Full-text search, mature | High (RAM-hungry) | KQL, Lucene |
| Grafana Loki | Label-based, cheap storage | Low | LogQL |
| ClickHouse | Analytical queries, compression | Medium | SQL |
| AWS CloudWatch | Managed, AWS-native | Per-GB ingestion | Insights QL |
Loki is gaining traction for cost-sensitive teams — it indexes only labels, not full text:
# LogQL query
{service="order-service", level="ERROR"} |= "timeout" | json | duration > 5s
Alerting on Log Patterns#
# ElastAlert rule — alert on error spike
name: Error Rate Spike
type: spike
index: app-logs-*
timeframe:
  minutes: 10
spike_height: 3        # alert when the rate is 3x the reference window
spike_type: up
filter:
- term:
    level: "ERROR"
alert:
- slack
slack_webhook_url: "https://hooks.slack.com/..."
slack_channel: "#alerts"
Key alerting patterns:
- Rate spike: Errors jump 3x above baseline
- New error: Error message never seen before
- Absence: Expected log missing (cron didn't run)
- Pattern: Specific sequence of events (login → fail → fail → fail → lockout)
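The rate-spike pattern boils down to comparing the latest window's count against a trailing baseline, which is roughly what the ElastAlert `spike` rule does internally. A self-contained sketch (the window counts and class name are illustrative; real alerting would query the log store):

```python
from collections import deque

class SpikeDetector:
    """Flag when a window's error count exceeds spike_height x the trailing baseline."""
    def __init__(self, baseline_windows=6, spike_height=3.0):
        self.history = deque(maxlen=baseline_windows)
        self.spike_height = spike_height

    def observe(self, window_count):
        """Feed one window's error count; return True if it is a spike."""
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            is_spike = baseline > 0 and window_count > self.spike_height * baseline
        else:
            is_spike = False   # not enough history to judge yet
        self.history.append(window_count)
        return is_spike

d = SpikeDetector()
counts = [10, 12, 9, 11, 10, 8, 40]         # last window jumps ~4x baseline
flags = [d.observe(c) for c in counts]      # only the last window is flagged
```

New-error and absence detection follow the same shape: keep a small rolling state (seen messages, last-seen timestamps) and compare each window against it.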
Retention Policies & Cost Management#
Logs are expensive. A medium-sized system can easily generate 1-10 TB of log data per day.
Retention strategy (tiered):
├── Hot (SSD): Last 3 days → full indexing, fast queries
├── Warm (HDD): 3-30 days → reduced replicas, slower queries
├── Cold (S3): 30-90 days → compressed, restore-on-demand
└── Frozen: 90-365 days → S3 Glacier, compliance only
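Back-of-the-envelope arithmetic shows why the tiers matter. The per-GB-month prices below are illustrative assumptions, not vendor quotes:

```python
ingest_gb_per_day = 1000                           # ~1 TB/day of logs (assumption)
price_cents = {"hot": 10, "warm": 3, "cold": 1}    # cents per GB-month, illustrative
days_in_tier = {"hot": 3, "warm": 27, "cold": 60}  # the 3 / 30 / 90-day boundaries

# Steady-state data resident in each tier, priced per month (integer cents)
monthly_cents = {t: ingest_gb_per_day * days_in_tier[t] * price_cents[t]
                 for t in price_cents}
tiered_total = sum(monthly_cents.values()) / 100               # dollars/month
all_hot = ingest_gb_per_day * 90 * price_cents["hot"] / 100
# tiered ≈ $1,710/month vs ≈ $9,000/month keeping all 90 days hot
```

Even with made-up prices the shape of the result holds: most log volume is old and rarely queried, so it should sit on the cheapest storage that still meets your restore requirements.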
Cost optimization:
├── Sample verbose logs (keep 10% of DEBUG in prod)
├── Drop known-noisy logs at shipper level
├── Compress and batch before shipping
├── Use index lifecycle management (ILM)
└── Consider Loki for non-search workloads
// Elasticsearch ILM policy
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm":   { "min_age": "3d",  "actions": { "shrink": { "number_of_shards": 1 } } },
      "cold":   { "min_age": "30d", "actions": { "searchable_snapshot": { "snapshot_repository": "s3-repo" } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}
Key Takeaways#
- Structure everything — JSON logs with consistent fields across all services
- Centralize early — don't wait for an outage to build logging infrastructure
- Ship smart — Vector or Fluent Bit at the edge, buffer through Kafka for reliability
- Store in tiers — hot/warm/cold with ILM to control costs
- Alert on patterns — rate spikes, new errors, and absent expected logs
Design observable systems with codelit.io — your visual architecture companion.
Article 192 of the Codelit engineering blog series.