Observability Cost Optimization: Tame Logs, Metrics & Traces Without Breaking the Budget
Observability spend is one of the fastest-growing line items in engineering budgets. Teams instrument everything, ship terabytes of telemetry, and then face six- or seven-figure invoices from their observability vendor. The data is valuable — but most of it is never queried. This guide covers concrete strategies to cut observability costs without sacrificing the ability to debug production incidents.
Why Observability Gets Expensive#
Observability costs are driven by three variables: volume (how much data you ingest), cardinality (how many unique time series or label combinations you create), and retention (how long you keep data). Most teams optimize none of these, and costs grow linearly — or worse — with traffic.
Cost = Ingest Volume     × Price per GB
     + Active Series     × Price per Series
     + Storage Duration  × Price per GB/month
     + Query Volume      × Price per Query
Every observability vendor weights these differently, which makes apples-to-apples comparison difficult and vendor lock-in expensive.
Log Volume Control#
Logs are typically the largest contributor to observability cost — often 60-80% of total spend.
Strategies to Reduce Log Volume#
- Set log levels correctly. DEBUG and TRACE should never run in production by default. Use dynamic log level adjustment (via feature flags or config) to enable verbose logging only when investigating an issue.
- Drop noise at the source. Health check logs, successful auth logs, and repetitive cron output add volume without value. Filter them in the logging pipeline before ingestion.
- Sample high-volume logs. For endpoints handling thousands of requests per second, log a statistically significant sample (e.g., 1 in 100) rather than every request. Always log errors and slow requests at 100%.
- Structured logging. Unstructured text logs resist compression and indexing. Structured JSON logs compress 2-5x better and enable field-level retention policies.
- Pipeline-level aggregation. Use an observability pipeline (Vector, Fluent Bit, Cribl) to aggregate repetitive log patterns into counts before forwarding to the backend.
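The 1-in-100 sampling rule can be sketched as a logging filter. In practice sampling usually happens in the agent (Fluent Bit, Vector, OTel Collector) rather than in-process, so treat this as an illustration of the policy, not a production pattern:

```python
import logging
import random

class SampledFilter(logging.Filter):
    """Keep all WARNING-and-above records; sample INFO-and-below.

    Illustrative sketch of "sample the noise, keep the errors".
    """
    def __init__(self, sample_rate=0.01, rng=random.random):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = rng  # injectable for deterministic testing

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # errors and warnings are always kept
        return self.rng() < self.sample_rate  # ~1 in 100 of the rest
```

Attaching this filter to a high-volume logger cuts routine INFO traffic by ~99% while leaving the error signal untouched.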
Log Pipeline Architecture#
Application ──► Agent (Fluent Bit / OTel Collector)
                  │
                  ├── Filter: drop health checks
                  ├── Sample: 1% of 2xx request logs
                  ├── Transform: extract metrics from logs
                  └── Route: errors → hot tier, info → cold tier
                             │                     │
                             ▼                     ▼
                       Primary Store         Object Storage
                      (indexed, fast)        (cheap, slow)
Metric Cardinality Explosion#
Metrics are priced per active time series. A single metric with high-cardinality labels can generate millions of series. For example, a request_duration histogram with labels for user_id, endpoint, status_code, and region on a system with 1M users creates an astronomical number of series.
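To see how fast the multiplication gets out of hand, here is the arithmetic for that example. The per-label counts and bucket count are illustrative assumptions:

```python
# Hypothetical label cardinalities for a request_duration histogram
label_cardinality = {
    "user_id": 1_000_000,  # unbounded label — the problem
    "endpoint": 50,
    "status_code": 5,
    "region": 10,
}
buckets = 12  # each histogram bucket is its own time series

series = buckets
for count in label_cardinality.values():
    series *= count

print(f"{series:,} time series")  # 30,000,000,000 — thirty billion
```

Dropping `user_id` alone collapses this to 30,000 series, a factor of one million.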
How to Prevent Cardinality Explosion#
- Never use unbounded values as labels. User IDs, request IDs, email addresses, and UUIDs must not be metric labels.
- Cap label values. If an endpoint label could have thousands of values, group long-tail endpoints into an "other" bucket.
- Use recording rules. Pre-aggregate high-cardinality metrics into lower-cardinality rollups that serve dashboards and alerts.
- Drop unused metrics. Audit which metrics are actually queried. Prometheus and Grafana Mimir expose API endpoints to identify unused series.
- Use histograms wisely. Each histogram bucket is a separate series. Use exponential histograms (OpenTelemetry) instead of fixed-bucket Prometheus histograms to reduce bucket count.
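The "cap label values" rule can be sketched as a tiny helper. The endpoint names below are hypothetical:

```python
def cap_label(value, allowed, other="other"):
    """Map long-tail label values into a shared 'other' bucket.

    `allowed` is the fixed set of values worth a dedicated series;
    everything else collapses into one bucket, bounding cardinality
    at len(allowed) + 1 regardless of traffic shape.
    """
    return value if value in allowed else other

# Hypothetical top endpoints worth individual series
TOP_ENDPOINTS = {"/api/users", "/api/orders", "/healthz"}
```

Apply the helper wherever the label is set, so no code path can emit an unbounded value.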
Cardinality Audit Checklist#
| Check | Action |
|---|---|
| Labels with more than 100 unique values | Replace with recorded metric or remove label |
| Metrics not queried in 30 days | Drop at pipeline or recording rule |
| Histogram with more than 20 buckets | Switch to exponential histogram |
| Duplicate metrics from multiple libraries | Consolidate to one instrumentation |
Trace Sampling Strategies#
Distributed traces are the most expensive signal per event because each trace contains multiple spans, and each span carries attributes, events, and links.
Sampling Approaches#
Head sampling makes the sampling decision at trace creation time. It is simple but blind — it cannot know whether a trace will be interesting.
Tail sampling buffers complete traces and decides after the fact. It keeps all error traces, slow traces, and a random sample of normal traces. Tail sampling requires a collector with enough memory to buffer in-flight traces.
Priority sampling lets application code flag specific traces as must-keep (debug sessions, canary deployments, flagged users).
Recommended Sampling Configuration#
Tail Sampling Processor (OTel Collector):
├── Always keep: status == ERROR
├── Always keep: duration > p99 threshold
├── Always keep: sampling.priority == 1
├── Probabilistic: keep 5% of remaining traces
└── Rate limiting: cap at 100 traces/sec per service
A 5% probabilistic sample combined with 100% error/slow capture typically reduces trace volume by 80-90% while preserving debugging capability.
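In production this policy would be expressed in the OTel Collector's tail sampling configuration; purely as an illustration, the decision order can be sketched in Python. The span field names and the p99 threshold are assumptions, and the per-service rate cap is omitted for brevity:

```python
import random

def keep_trace(trace, p99_ms=800.0, sample_rate=0.05, rng=random.random):
    """Apply the policies above in order: error, latency, priority,
    then a probabilistic fallback for everything else.

    `trace` is a list of span dicts; thresholds are illustrative.
    """
    if any(span.get("status") == "ERROR" for span in trace):
        return True  # always keep error traces
    duration = (max(s["end_ms"] for s in trace)
                - min(s["start_ms"] for s in trace))
    if duration > p99_ms:
        return True  # always keep slow traces
    if any(span.get("sampling.priority") == 1 for span in trace):
        return True  # application-flagged must-keep traces
    return rng() < sample_rate  # 5% of the uninteresting remainder
```

Note the ordering matters: probabilistic sampling only applies to traces that matched none of the always-keep rules, which is what makes the error/slow capture rate 100%.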
Data Tiering#
Not all telemetry needs the same query performance. Data tiering stores recent and critical data on fast (expensive) storage and moves older data to cheap storage.
Three-Tier Model#
| Tier | Retention | Storage | Query latency | Use case |
|---|---|---|---|---|
| Hot | 0-7 days | SSD / indexed | Milliseconds | Active debugging, dashboards |
| Warm | 7-30 days | HDD / partially indexed | Seconds | Recent incident investigation |
| Cold | 30-365 days | Object storage (S3) | Minutes | Compliance, trend analysis |
Most vendors support tiering natively (Grafana Loki, Elasticsearch ILM, Datadog Flex Logs). If your vendor charges the same rate for all retention, that is a negotiation lever — or a reason to evaluate alternatives.
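The three-tier table maps naturally onto a routing rule keyed on record age. A minimal sketch, assuming a 365-day total retention:

```python
def storage_tier(age_days):
    """Pick a storage tier from record age, mirroring the table above."""
    if age_days <= 7:
        return "hot"      # SSD, fully indexed
    if age_days <= 30:
        return "warm"     # HDD, partially indexed
    if age_days <= 365:
        return "cold"     # object storage, e.g. S3
    return "expired"      # past retention, eligible for deletion
```

In real deployments this logic lives in the storage backend's lifecycle policy (Loki retention, Elasticsearch ILM) rather than in application code.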
Retention Policies#
Define retention by signal type and severity:
- Error logs and traces: 90 days (incident postmortems often happen weeks later).
- Info/warn logs: 14-30 days.
- Debug logs: 1-3 days (or zero in production).
- Metrics (full resolution): 15-30 days.
- Metrics (downsampled to 5-minute intervals): 1 year.
- Metrics (downsampled to 1-hour intervals): 3-5 years for capacity planning.
Automate retention enforcement. Manual deletion is a policy that no one follows.
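One way to make the policy enforceable is to encode it as data and check record age against it. A sketch using the retention windows listed above (the dict keys are an assumed naming scheme):

```python
from datetime import datetime, timedelta, timezone

# Retention windows per (signal, severity), mirroring the policy above
RETENTION_DAYS = {
    ("logs", "error"): 90,
    ("logs", "info"): 30,
    ("logs", "debug"): 3,
    ("metrics", "raw"): 30,
    ("metrics", "5m"): 365,
    ("metrics", "1h"): 365 * 5,  # upper end of the 3-5 year range
}

def is_expired(signal, severity, written_at, now=None):
    """True when a record has outlived its retention window."""
    now = now or datetime.now(timezone.utc)
    limit = timedelta(days=RETENTION_DAYS[(signal, severity)])
    return now - written_at > limit
```

A scheduled job that deletes (or tiers down) everything `is_expired` returns true for is the automation the policy needs.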
Tools Comparison by Cost Model#
| Tool | Pricing model | Strengths | Watch out for |
|---|---|---|---|
| Datadog | Per host + per GB ingested | Unified platform, strong APM | Costs escalate fast with scale |
| Grafana Cloud | Active series + log GB | Open-source ecosystem, flexible | Self-hosted Grafana stack is cheaper but ops-heavy |
| New Relic | Per GB ingested (all signals) | Simple pricing, generous free tier | Large trace volumes get expensive |
| Elastic / OpenSearch | Self-hosted or cloud per node | Full control, no per-GB fees (self-hosted) | Cluster management overhead |
| Splunk | Per GB indexed | Powerful search, SPL language | Among the most expensive per GB |
| Honeycomb | Per event ingested | Best-in-class trace exploration | No built-in metrics or log management |
| SigNoz / Uptrace | Self-hosted, open source | Lowest cost at scale | Smaller community, DIY operations |
Cost Reduction Levers by Vendor#
- Datadog: Use Pipelines to filter logs before indexing. Use Metrics without Limits to reduce cardinality.
- Grafana Cloud: Use Loki's structured metadata instead of indexed labels. Use Adaptive Metrics.
- New Relic: Use drop rules to exclude noisy data at ingest. Negotiate per-GB rate at scale.
- Self-hosted: The infrastructure cost is lower but factor in engineering time for cluster management, upgrades, and capacity planning.
FinOps for Observability#
FinOps practices bring financial accountability to observability spending.
Implement Observability FinOps#
- Tag telemetry by team and service. Know which team generates which cost. Use OpenTelemetry resource attributes to tag every signal with `service.name` and `team`.
- Set per-team budgets. Allocate an observability budget to each team proportional to their service count. Alert when a team exceeds their budget.
- Showback reports. Publish monthly reports showing each team's observability cost breakdown by signal type. Visibility alone drives behavior change.
- Rate-limit at the pipeline. Configure the observability pipeline to drop or sample data when a service exceeds its allocated ingest rate. This prevents a single runaway service from blowing the budget.
- Right-size retention. Review retention policies quarterly. Data that seemed critical six months ago may now be irrelevant.
- Negotiate contracts annually. Committed-use discounts of 20-40% are standard. Bring competing quotes to the negotiation.
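A showback report is ultimately a group-by over cost-tagged ingest records. A minimal sketch, where the record shape and per-GB rates are hypothetical:

```python
from collections import defaultdict

def showback(records):
    """Aggregate ingest cost per (team, signal) for a monthly report.

    `records` are hypothetical pipeline events carrying the resource
    attributes discussed above (team, signal) plus billed GB and rate.
    """
    totals = defaultdict(float)
    for r in records:
        totals[(r["team"], r["signal"])] += r["gb"] * r["price_per_gb"]
    return dict(totals)

records = [
    {"team": "checkout", "signal": "logs",   "gb": 1200, "price_per_gb": 0.10},
    {"team": "checkout", "signal": "traces", "gb": 300,  "price_per_gb": 0.10},
    {"team": "search",   "signal": "logs",   "gb": 4000, "price_per_gb": 0.10},
]
```

Publishing the resulting table monthly is usually enough to start the conversation; the team generating 4 TB of logs rarely knows it until someone shows them the number.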
Cost Optimization Flywheel#
Instrument ──► Measure cost per team ──► Set budgets
    ▲                                        │
    │                                        ▼
Iterate ◄── Review quarterly ◄── Enforce with pipeline
Key Takeaways#
- Logs dominate observability cost. Filter, sample, and tier them aggressively.
- Cardinality explosion is the silent budget killer for metrics. Audit labels and drop unbounded values.
- Tail sampling preserves debugging capability while reducing trace volume by 80-90%.
- Data tiering and retention policies prevent paying hot-tier prices for cold data.
- No single vendor is cheapest at all scales. Model your workload against each pricing structure.
- FinOps for observability means tagging, budgeting, showback, and pipeline-level enforcement.
The goal is not to observe less — it is to observe smarter. Every byte of telemetry should earn its storage cost by contributing to faster incident resolution or deeper system understanding.
Article #303 on Codelit — Keep building, keep shipping.