# Observability Pipeline Architecture: Collecting, Processing & Routing Telemetry at Scale
Modern systems generate enormous volumes of telemetry — logs, metrics, and traces. Without a well-designed pipeline, you either drown in data costs or fly blind when incidents hit. An observability pipeline sits between your applications and your backends, giving you control over what data goes where.
## Why You Need a Pipeline
Sending telemetry directly from applications to backends creates problems:

```text
App → Datadog        (vendor lock-in)
App → Elasticsearch  (tight coupling)
App → Prometheus     (no transformation)
```

With a pipeline:

```text
App → Pipeline → Datadog        (sampled traces)
               → S3             (full archive)
               → Prometheus     (aggregated metrics)
               → Elasticsearch  (filtered logs)
```
The pipeline gives you sampling, filtering, transformation, and routing — all without changing application code.
## Core Pipeline Components

Every observability pipeline has three stages:
### 1. Collection (Receivers)

Agents or SDKs emit telemetry into the pipeline through receivers.

```yaml
# OpenTelemetry Collector - receivers
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
  prometheus:
    config:
      scrape_configs:
        - job_name: "my-service"
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8080"]
  filelog:
    include: ["/var/log/app/*.log"]
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.time
          layout: "%Y-%m-%dT%H:%M:%S"
```
### 2. Processing (Processors)

Processors transform, enrich, filter, and sample data before it leaves the pipeline.

```yaml
# OpenTelemetry Collector - processors
processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  attributes:
    actions:
      - key: environment
        value: "production"
        action: upsert
  filter:
    logs:
      exclude:
        match_type: strict
        bodies:
          - "health check"
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```
### 3. Export (Exporters)

Exporters route processed data to one or more destinations.

```yaml
# OpenTelemetry Collector - exporters
exporters:
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
  awss3:
    s3uploader:
      region: "us-east-1"
      s3_bucket: "telemetry-archive"
      s3_prefix: "logs"
```
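None of these components do anything until they are wired into named pipelines in the collector's `service` section. A minimal sketch combining the snippets above (the pipeline names and component ordering here are one reasonable arrangement, not the only one):

```yaml
# Wire receivers -> processors -> exporters into per-signal pipelines.
# memory_limiter should run first; batch should run late.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, attributes, filter, batch]
      exporters: [awss3]
```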
## Pipeline Tools Compared

| Tool | Strengths | Best For |
|---|---|---|
| OTel Collector | Vendor-neutral, handles logs, metrics, and traces | Full OTLP pipeline |
| Vector | Performance, flexible transforms | High-volume log routing |
| Fluentd | Plugin ecosystem, mature | Kubernetes log collection |
| Fluent Bit | Lightweight, low memory | Edge and IoT |
| Cribl Stream | GUI, enterprise features | Complex enterprise routing |
## Vector Pipeline Example

Vector excels at high-throughput log processing with a declarative config:

```toml
# vector.toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[transforms.parse_json]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(.message)
.timestamp = parse_timestamp!(.timestamp, format: "%+")
.environment = get_env_var("ENV") ?? "unknown"
'''

[transforms.filter_noise]
type = "filter"
inputs = ["parse_json"]
condition = '.level != "debug" || .environment == "staging"'

[transforms.sample_info]
type = "sample"
inputs = ["filter_noise"]
rate = 10
exclude.type = "vrl"
exclude.source = '.level == "error" || .level == "warn"'

[sinks.elasticsearch]
type = "elasticsearch"
inputs = ["sample_info"]
endpoints = ["http://elasticsearch:9200"]
bulk.index = "app-logs-%Y-%m-%d"

[sinks.s3_archive]
type = "aws_s3"
inputs = ["filter_noise"]
bucket = "log-archive"
key_prefix = "app-logs/year=%Y/month=%m/day=%d/"
compression = "gzip"
```
## Sampling Strategies

Sampling is the single most impactful cost-reduction technique.

**Head sampling** — decide at trace creation:
- Simple, low overhead
- May miss interesting traces
- Good for high-volume, low-criticality services
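The head-sampling decision can be sketched in a few lines of Python. Hashing the trace ID instead of rolling a random number makes the decision deterministic, so every service that sees the same trace ID reaches the same keep/drop verdict and traces stay complete. (The function name and hash choice here are illustrative, not from any particular SDK.)

```python
import hashlib

def head_sample(trace_id: str, rate_percent: float) -> bool:
    """Decide at trace creation whether to keep a trace.

    Deterministic: the same trace_id always yields the same verdict,
    regardless of which service evaluates it.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash onto a bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100
    return bucket < rate_percent

# A 100% rate keeps everything; a 0% rate keeps nothing.
assert head_sample("trace-abc123", 100.0) is True
assert head_sample("trace-abc123", 0.0) is False
```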
**Tail sampling** — decide after the full trace completes:
- Can keep all error traces and slow traces
- Requires buffering complete traces in memory
- Higher resource usage on the collector

**Priority sampling** — combine both:
- High priority (always keep): errors, SLO violations, manual debug flags
- Medium priority (sample at 25%): authenticated user requests
- Low priority (sample at 5%): health checks, internal RPCs
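A priority sampler of that shape boils down to two steps: classify the completed trace into a tier, then apply that tier's rate. A minimal Python sketch, where the tier names, thresholds, and trace-dict fields are assumptions for illustration rather than a real SDK API:

```python
import random

# Tier rates match the priority scheme above (hypothetical values).
SAMPLE_RATES = {"high": 1.0, "medium": 0.25, "low": 0.05}

def classify(trace: dict) -> str:
    """Assign a priority tier from attributes of a completed trace."""
    if trace.get("status") == "ERROR" or trace.get("debug_flag"):
        return "high"
    if trace.get("authenticated"):
        return "medium"
    return "low"

def keep(trace: dict) -> bool:
    """Always keep high-priority traces; sample the rest by tier rate."""
    rate = SAMPLE_RATES[classify(trace)]
    return rate >= 1.0 or random.random() < rate

# Errors are always kept, regardless of the random draw.
assert keep({"status": "ERROR"}) is True
```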
## Multi-Destination Routing

Route different signals to different backends based on content. A connector acts as an exporter in one pipeline and a receiver in others, so an input pipeline must feed it:

```yaml
# OTel Collector - routing by log severity
connectors:
  routing:
    default_pipelines: [logs/standard]
    table:
      # severity_number 13 = WARN; 17 = ERROR
      - statement: route() where severity_number >= 13
        pipelines: [logs/errors]
      - statement: route() where attributes["service.name"] == "payments"
        pipelines: [logs/critical]

service:
  pipelines:
    logs/in:
      receivers: [otlp]
      exporters: [routing]
    logs/errors:
      receivers: [routing]
      exporters: [otlp/pagerduty, elasticsearch]
    logs/critical:
      receivers: [routing]
      exporters: [otlp/dedicated, awss3]
    logs/standard:
      receivers: [routing]
      exporters: [awss3]
```
## Deployment Patterns

**Sidecar** — one collector per pod:
- Strong isolation, simple config
- Higher resource overhead

**DaemonSet** — one collector per node:
- Efficient resource usage
- Shared by all pods on the node

**Gateway** — centralized collector pool:
- Advanced processing (tail sampling needs full traces)
- Single point for routing decisions
- Scales independently from application pods

Most production setups use a two-tier approach: lightweight DaemonSet agents forward to a gateway tier that handles sampling, enrichment, and routing.
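The agent side of that two-tier wiring can be sketched as follows; the gateway hostname is an assumption for illustration, and the agent deliberately does only cheap work (memory protection, batching) before forwarding everything upstream:

```yaml
# DaemonSet agent: collect locally, do cheap work, forward to the gateway
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
  batch: {}
exporters:
  otlp:
    # hypothetical in-cluster gateway Service
    endpoint: "otel-gateway.observability.svc:4317"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```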
## Key Design Principles

- **Buffer aggressively** — disk-backed queues prevent data loss during backend outages
- **Filter early** — drop noise at the agent level, not the backend
- **Sample traces, not logs** — keep complete traces or drop them entirely
- **Archive everything** — send raw data to cheap object storage before sampling
- **Decouple format from backend** — use OTLP as your internal format and convert at export
- **Monitor the pipeline itself** — export internal metrics for queue depth, drop rate, and latency
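For the last principle, the OTel Collector can expose its own internal metrics for scraping. A minimal sketch (exact `service.telemetry` keys vary between collector versions, so treat this as indicative):

```yaml
# Collector self-monitoring: expose internal metrics on :8888
service:
  telemetry:
    metrics:
      level: detailed
      address: "0.0.0.0:8888"
    logs:
      level: info
```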
## Conclusion
An observability pipeline transforms telemetry from a cost center into a strategic asset. By adding collection, processing, and routing layers between your applications and backends, you gain the flexibility to control costs, avoid vendor lock-in, and ensure the right data reaches the right place.
Article #401 — part of the Codelit engineering blog. Explore all articles at codelit.io.