Observability Cost Optimization: Tame Logs, Metrics & Traces Without Breaking the Budget
Observability spend is one of the fastest-growing line items in engineering budgets. Teams instrument everything, ship terabytes of telemetry, and then face six- or seven-figure invoices from their observability vendor. The data is valuable — but most of it is never queried. This guide covers concrete strategies to cut observability costs without sacrificing the ability to debug production incidents.
Why Observability Gets Expensive#
Observability costs are driven by three variables: volume (how much data you ingest), cardinality (how many unique time series or label combinations you create), and retention (how long you keep data). Most teams optimize none of these, and costs grow linearly — or worse — with traffic.
Cost = Ingest Volume     × Price per GB
     + Active Series     × Price per Series
     + Storage Duration  × Price per GB/month
     + Query Volume      × Price per Query
Every observability vendor weights these differently, which makes apples-to-apples comparison difficult and vendor lock-in expensive.
Log Volume Control#
Logs are typically the largest contributor to observability cost — often 60-80% of total spend.
Strategies to Reduce Log Volume#
- Set log levels correctly. DEBUG and TRACE should never run in production by default. Use dynamic log level adjustment (via feature flags or config) to enable verbose logging only when investigating an issue.
- Drop noise at the source. Health check logs, successful auth logs, and repetitive cron output add volume without value. Filter them in the logging pipeline before ingestion.
- Sample high-volume logs. For endpoints handling thousands of requests per second, log a statistically significant sample (e.g., 1 in 100) rather than every request. Always log errors and slow requests at 100%.
- Structured logging. Unstructured text logs resist compression and indexing. Structured JSON logs compress 2-5x better and enable field-level retention policies.
- Pipeline-level aggregation. Use an observability pipeline (Vector, Fluent Bit, Cribl) to aggregate repetitive log patterns into counts before forwarding to the backend.
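The 1-in-100 sampling rule can be sketched as a logging filter. In practice sampling usually happens in the agent (Fluent Bit, Vector, OTel Collector) rather than in-process, so treat this as an illustration of the policy, not a production pattern:

```python
import logging
import random

class SampledFilter(logging.Filter):
    """Keep all WARNING-and-above records; sample INFO-and-below.

    Illustrative sketch of "sample the noise, keep the errors".
    """
    def __init__(self, sample_rate=0.01, rng=random.random):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = rng  # injectable for deterministic testing

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # errors and warnings are always kept
        return self.rng() < self.sample_rate  # ~1 in 100 of the rest
```

Attaching this filter to a high-volume logger cuts routine INFO traffic by ~99% while leaving the error signal untouched.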
Log Pipeline Architecture#
Application ──► Agent (Fluent Bit / OTel Collector)
                  │
                  ├── Filter: drop health checks
                  ├── Sample: 1% of 2xx request logs
                  ├── Transform: extract metrics from logs
                  └── Route: errors → hot tier, info → cold tier
                             │                     │
                             ▼                     ▼
                       Primary Store         Object Storage
                      (indexed, fast)        (cheap, slow)
Metric Cardinality Explosion#
Metrics are priced per active time series. A single metric with high-cardinality labels can generate millions of series. For example, a request_duration histogram with labels for user_id, endpoint, status_code, and region on a system with 1M users creates an astronomical number of series.
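To see how fast the multiplication gets out of hand, here is the arithmetic for that example. The per-label counts and bucket count are illustrative assumptions:

```python
# Hypothetical label cardinalities for a request_duration histogram
label_cardinality = {
    "user_id": 1_000_000,  # unbounded label — the problem
    "endpoint": 50,
    "status_code": 5,
    "region": 10,
}
buckets = 12  # each histogram bucket is its own time series

series = buckets
for count in label_cardinality.values():
    series *= count

print(f"{series:,} time series")  # 30,000,000,000 — thirty billion
```

Dropping `user_id` alone collapses this to 30,000 series, a factor of one million.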
How to Prevent Cardinality Explosion#
- Never use unbounded values as labels. User IDs, request IDs, email addresses, and UUIDs must not be metric labels.
- Cap label values. If an endpoint label could have thousands of values, group long-tail endpoints into an "other" bucket.
- Use recording rules. Pre-aggregate high-cardinality metrics into lower-cardinality rollups that serve dashboards and alerts.
- Drop unused metrics. Audit which metrics are actually queried. Prometheus and Grafana Mimir expose API endpoints to identify unused series.
- Use histograms wisely. Each histogram bucket is a separate series. Use exponential histograms (OpenTelemetry) instead of fixed-bucket Prometheus histograms to reduce bucket count.
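The "cap label values" rule can be sketched as a tiny helper. The endpoint names below are hypothetical:

```python
def cap_label(value, allowed, other="other"):
    """Map long-tail label values into a shared 'other' bucket.

    `allowed` is the fixed set of values worth a dedicated series;
    everything else collapses into one bucket, bounding cardinality
    at len(allowed) + 1 regardless of traffic shape.
    """
    return value if value in allowed else other

# Hypothetical top endpoints worth individual series
TOP_ENDPOINTS = {"/api/users", "/api/orders", "/healthz"}
```

Apply the helper wherever the label is set, so no code path can emit an unbounded value.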
Cardinality Audit Checklist#
| Check | Action |
|---|---|
| Labels with more than 100 unique values | Replace with recorded metric or remove label |
| Metrics not queried in 30 days | Drop at pipeline or recording rule |
| Histogram with more than 20 buckets | Switch to exponential histogram |
| Duplicate metrics from multiple libraries | Consolidate to one instrumentation |
Trace Sampling Strategies#
Distributed traces are the most expensive signal per event because each trace contains multiple spans, and each span carries attributes, events, and links.
Sampling Approaches#
Head sampling makes the sampling decision at trace creation time. It is simple but blind — it cannot know whether a trace will be interesting.
Tail sampling buffers complete traces and decides after the fact. It keeps all error traces, slow traces, and a random sample of normal traces. Tail sampling requires a collector with enough memory to buffer in-flight traces.
Priority sampling lets application code flag specific traces as must-keep (debug sessions, canary deployments, flagged users).
Recommended Sampling Configuration#
Tail Sampling Processor (OTel Collector):
├── Always keep: status == ERROR
├── Always keep: duration > p99 threshold
├── Always keep: sampling.priority == 1
├── Probabilistic: keep 5% of remaining traces
└── Rate limiting: cap at 100 traces/sec per service
A 5% probabilistic sample combined with 100% error/slow capture typically reduces trace volume by 80-90% while preserving debugging capability.
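In production this policy would be expressed in the OTel Collector's tail sampling configuration; purely as an illustration, the decision order can be sketched in Python. The span field names and the p99 threshold are assumptions, and the per-service rate cap is omitted for brevity:

```python
import random

def keep_trace(trace, p99_ms=800.0, sample_rate=0.05, rng=random.random):
    """Apply the policies above in order: error, latency, priority,
    then a probabilistic fallback for everything else.

    `trace` is a list of span dicts; thresholds are illustrative.
    """
    if any(span.get("status") == "ERROR" for span in trace):
        return True  # always keep error traces
    duration = (max(s["end_ms"] for s in trace)
                - min(s["start_ms"] for s in trace))
    if duration > p99_ms:
        return True  # always keep slow traces
    if any(span.get("sampling.priority") == 1 for span in trace):
        return True  # application-flagged must-keep traces
    return rng() < sample_rate  # 5% of the uninteresting remainder
```

Note the ordering matters: probabilistic sampling only applies to traces that matched none of the always-keep rules, which is what makes the error/slow capture rate 100%.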
Data Tiering#
Not all telemetry needs the same query performance. Data tiering stores recent and critical data on fast (expensive) storage and moves older data to cheap storage.
Three-Tier Model#
| Tier | Retention | Storage | Query latency | Use case |
|---|---|---|---|---|
| Hot | 0-7 days | SSD / indexed | Milliseconds | Active debugging, dashboards |
| Warm | 7-30 days | HDD / partially indexed | Seconds | Recent incident investigation |
| Cold | 30-365 days | Object storage (S3) | Minutes | Compliance, trend analysis |
Most vendors support tiering natively (Grafana Loki, Elasticsearch ILM, Datadog Flex Logs). If your vendor charges the same rate for all retention, that is a negotiation lever — or a reason to evaluate alternatives.
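The three-tier table maps naturally onto a routing rule keyed on record age. A minimal sketch, assuming a 365-day total retention:

```python
def storage_tier(age_days):
    """Pick a storage tier from record age, mirroring the table above."""
    if age_days <= 7:
        return "hot"      # SSD, fully indexed
    if age_days <= 30:
        return "warm"     # HDD, partially indexed
    if age_days <= 365:
        return "cold"     # object storage, e.g. S3
    return "expired"      # past retention, eligible for deletion
```

In real deployments this logic lives in the storage backend's lifecycle policy (Loki retention, Elasticsearch ILM) rather than in application code.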
Retention Policies#
Define retention by signal type and severity:
- Error logs and traces: 90 days (incident postmortems often happen weeks later).
- Info/warn logs: 14-30 days.
- Debug logs: 1-3 days (or zero in production).
- Metrics (full resolution): 15-30 days.
- Metrics (downsampled to 5-minute intervals): 1 year.
- Metrics (downsampled to 1-hour intervals): 3-5 years for capacity planning.
Automate retention enforcement. Manual deletion is a policy that no one follows.
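One way to make the policy enforceable is to encode it as data and check record age against it. A sketch using the retention windows listed above (the dict keys are an assumed naming scheme):

```python
from datetime import datetime, timedelta, timezone

# Retention windows per (signal, severity), mirroring the policy above
RETENTION_DAYS = {
    ("logs", "error"): 90,
    ("logs", "info"): 30,
    ("logs", "debug"): 3,
    ("metrics", "raw"): 30,
    ("metrics", "5m"): 365,
    ("metrics", "1h"): 365 * 5,  # upper end of the 3-5 year range
}

def is_expired(signal, severity, written_at, now=None):
    """True when a record has outlived its retention window."""
    now = now or datetime.now(timezone.utc)
    limit = timedelta(days=RETENTION_DAYS[(signal, severity)])
    return now - written_at > limit
```

A scheduled job that deletes (or tiers down) everything `is_expired` returns true for is the automation the policy needs.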
Tools Comparison by Cost Model#
| Tool | Pricing model | Strengths | Watch out for |
|---|---|---|---|
| Datadog | Per host + per GB ingested | Unified platform, strong APM | Costs escalate fast with scale |
| Grafana Cloud | Active series + log GB | Open-source ecosystem, flexible | Self-hosted Grafana stack is cheaper but ops-heavy |
| New Relic | Per GB ingested (all signals) | Simple pricing, generous free tier | Large trace volumes get expensive |
| Elastic / OpenSearch | Self-hosted or cloud per node | Full control, no per-GB fees (self-hosted) | Cluster management overhead |
| Splunk | Per GB indexed | Powerful search, SPL language | Among the most expensive per GB |
| Honeycomb | Per event ingested | Best-in-class trace exploration | No built-in metrics or log management |
| SigNoz / Uptrace | Self-hosted, open source | Lowest cost at scale | Smaller community, DIY operations |
Cost Reduction Levers by Vendor#
- Datadog: Use Pipelines to filter logs before indexing. Use Metrics without Limits to reduce cardinality.
- Grafana Cloud: Use Loki's structured metadata instead of indexed labels. Use Adaptive Metrics.
- New Relic: Use drop rules to exclude noisy data at ingest. Negotiate per-GB rate at scale.
- Self-hosted: The infrastructure cost is lower but factor in engineering time for cluster management, upgrades, and capacity planning.
FinOps for Observability#
FinOps practices bring financial accountability to observability spending.
Implement Observability FinOps#
- Tag telemetry by team and service. Know which team generates which cost. Use OpenTelemetry resource attributes to tag every signal with `service.name` and `team`.
- Set per-team budgets. Allocate an observability budget to each team proportional to their service count. Alert when a team exceeds their budget.
- Showback reports. Publish monthly reports showing each team's observability cost breakdown by signal type. Visibility alone drives behavior change.
- Rate-limit at the pipeline. Configure the observability pipeline to drop or sample data when a service exceeds its allocated ingest rate. This prevents a single runaway service from blowing the budget.
- Right-size retention. Review retention policies quarterly. Data that seemed critical six months ago may now be irrelevant.
- Negotiate contracts annually. Committed-use discounts of 20-40% are standard. Bring competing quotes to the negotiation.
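A showback report is ultimately a group-by over cost-tagged ingest records. A minimal sketch, where the record shape and per-GB rates are hypothetical:

```python
from collections import defaultdict

def showback(records):
    """Aggregate ingest cost per (team, signal) for a monthly report.

    `records` are hypothetical pipeline events carrying the resource
    attributes discussed above (team, signal) plus billed GB and rate.
    """
    totals = defaultdict(float)
    for r in records:
        totals[(r["team"], r["signal"])] += r["gb"] * r["price_per_gb"]
    return dict(totals)

records = [
    {"team": "checkout", "signal": "logs",   "gb": 1200, "price_per_gb": 0.10},
    {"team": "checkout", "signal": "traces", "gb": 300,  "price_per_gb": 0.10},
    {"team": "search",   "signal": "logs",   "gb": 4000, "price_per_gb": 0.10},
]
```

Publishing the resulting table monthly is usually enough to start the conversation; the team generating 4 TB of logs rarely knows it until someone shows them the number.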
Cost Optimization Flywheel#
Instrument ──► Measure cost per team ──► Set budgets
    ▲                                        │
    │                                        ▼
Iterate ◄── Review quarterly ◄── Enforce with pipeline
Key Takeaways#
- Logs dominate observability cost. Filter, sample, and tier them aggressively.
- Cardinality explosion is the silent budget killer for metrics. Audit labels and drop unbounded values.
- Tail sampling preserves debugging capability while reducing trace volume by 80-90%.
- Data tiering and retention policies prevent paying hot-tier prices for cold data.
- No single vendor is cheapest at all scales. Model your workload against each pricing structure.
- FinOps for observability means tagging, budgeting, showback, and pipeline-level enforcement.
The goal is not to observe less — it is to observe smarter. Every byte of telemetry should earn its storage cost by contributing to faster incident resolution or deeper system understanding.
Article #303 on Codelit — Keep building, keep shipping.