Service Level Objectives (SLOs): Measure Reliability Like Google SRE
Uptime is not reliability. Users don't care whether your server is "up" — they care whether their requests succeed fast enough. Service Level Objectives give you a precise, measurable definition of "reliable enough."
SLI vs SLO vs SLA#
These three terms form a hierarchy:
| Term | What it is | Example |
|---|---|---|
| SLI (Service Level Indicator) | A metric that measures user experience | 99.2% of requests complete in under 300ms |
| SLO (Service Level Objective) | A target range for an SLI | 99.5% of requests must complete in under 300ms over 30 days |
| SLA (Service Level Agreement) | A contractual commitment with consequences | If availability drops below 99.9%, customer receives credits |
The relationship flows upward: SLIs feed SLOs, and SLOs underpin SLAs.
SLI (measurement) --> SLO (internal target) --> SLA (external contract)
Key insight: You can have SLOs without SLAs. Every team should have SLOs. Not every team needs SLAs.
Choosing the Right SLIs#
Bad SLIs measure infrastructure. Good SLIs measure what users experience.
Avoid These SLIs#
- CPU utilization
- Memory usage
- Pod restart count
- Disk IOPS
Use These Instead#
Availability SLI:
successful requests / total requests
Latency SLI:
requests faster than threshold / total requests
Correctness SLI:
requests returning correct data / total requests
Freshness SLI:
data updates arriving within threshold / total updates
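All four ratios share the same shape: good events divided by total events. A minimal sketch in Python (the request counts are invented for illustration):

```python
def sli(good_events: int, total_events: int) -> float:
    """Generic SLI: fraction of good events, between 0.0 and 1.0."""
    if total_events == 0:
        return 1.0  # no traffic: conventionally counts as meeting the SLI
    return good_events / total_events

# Availability SLI: successful requests / total requests
availability = sli(good_events=99_412, total_events=100_000)  # 0.99412

# Latency SLI: requests under the 300 ms threshold / total requests
latency = sli(good_events=99_200, total_events=100_000)       # 0.992
```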
SLI Selection by Service Type#
| Service Type | Primary SLI | Secondary SLI |
|---|---|---|
| API / Web | Availability + Latency | Error rate |
| Data pipeline | Freshness + Correctness | Throughput |
| Storage | Durability + Availability | Latency |
| Streaming | Throughput + Latency | Availability |
Error Budgets#
An error budget is the inverse of your SLO. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% — roughly 43 minutes of allowed downtime.
Error Budget = 1 - SLO target
99.9% SLO --> 0.1% budget --> 43.2 min/month
99.5% SLO --> 0.5% budget --> 3.6 hours/month
99.0% SLO --> 1.0% budget --> 7.2 hours/month
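The conversion above is simple arithmetic. A small helper (the function name is ours; it assumes downtime maps one-to-one onto the budget):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for a given SLO target."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1.0 - slo_target) * window_minutes

error_budget_minutes(0.999)  # ~43.2 min/month
error_budget_minutes(0.995)  # ~216 min = 3.6 hours
error_budget_minutes(0.990)  # ~432 min = 7.2 hours
```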
Error Budget Policy#
Define what happens as budget depletes:
| Budget Remaining | Action |
|---|---|
| 100-50% | Normal feature velocity |
| 50-25% | Prioritize reliability work |
| 25-10% | Feature freeze, reliability-only sprints |
| Under 10% | All hands on reliability |
| 0% (exhausted) | Full stop on deployments until budget recovers |
This creates a self-balancing system: ship fast when you have budget, slow down when reliability suffers.
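The policy table can be encoded directly; the thresholds below follow this article's example policy, not any standard:

```python
def budget_policy(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction, 0.0-1.0) to an action."""
    if budget_remaining <= 0.0:
        return "full stop on deployments"
    if budget_remaining < 0.10:
        return "all hands on reliability"
    if budget_remaining < 0.25:
        return "feature freeze, reliability-only sprints"
    if budget_remaining < 0.50:
        return "prioritize reliability work"
    return "normal feature velocity"

budget_policy(0.62)  # "normal feature velocity"
budget_policy(0.08)  # "all hands on reliability"
```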
Burn Rate Alerts#
Traditional threshold alerts fire too late. Burn rate alerts tell you how fast you're consuming your error budget.
Burn Rate = (error rate observed) / (error rate allowed by SLO)
A burn rate of 1.0 means you'll exhaust your budget exactly at the end of the window. A burn rate of 10.0 means you'll burn through your entire monthly budget in 3 days.
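Both the burn rate and the time to exhaustion fall straight out of the definitions. A sketch, assuming a constant error rate over the window:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return observed_error_rate / (1.0 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, the budget lasts window / rate days."""
    return window_days / rate

burn_rate(0.01, 0.999)    # ~10: burning ten times faster than allowed
days_to_exhaustion(10.0)  # 3.0 days to exhaust a 30-day budget
```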
Multi-Window Burn Rate Strategy#
Google SRE recommends alerting on two windows simultaneously to reduce false positives:
| Severity | Long Window | Short Window | Burn Rate | Budget Consumed |
|---|---|---|---|---|
| Page (critical) | 1 hour | 5 minutes | 14.4x | 2% in 1 hour |
| Page (high) | 6 hours | 30 minutes | 6x | 5% in 6 hours |
| Ticket (medium) | 3 days | 6 hours | 1x | 10% in 3 days |
Both windows must be breaching to trigger the alert, which prevents short spikes from paging anyone.
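The two-window AND condition itself is trivial; the real work is measuring the burn rate over each window. A sketch, where `long_rate` and `short_rate` are assumed to be burn rates already computed over the two windows:

```python
def should_page(long_rate: float, short_rate: float,
                threshold: float = 14.4) -> bool:
    """Fire only when BOTH windows exceed the burn-rate threshold."""
    return long_rate > threshold and short_rate > threshold

should_page(long_rate=15.0, short_rate=16.2)  # True: sustained fast burn
should_page(long_rate=15.0, short_rate=0.5)   # False: the spike has passed
```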
Burn Rate Alert in Prometheus#
```yaml
# Fast burn: 14.4x over 1 hour (pages immediately)
- alert: HighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
```
Building an SLO Dashboard#
Every SLO dashboard should show four things:
- Current SLI value — are we meeting the objective right now?
- Error budget remaining — how much room do we have?
- Burn rate — how fast are we consuming budget?
- Time series — SLI trend over the SLO window
```text
+--------------------------------------------------+
|  API Availability SLO: 99.9%                     |
|                                                  |
|  Current SLI: 99.94%        [=====|----]         |
|  Budget Remaining: 62%      [========|--]        |
|  Burn Rate: 0.8x (healthy)                       |
|  Window: 30 days (18 days remaining)             |
+--------------------------------------------------+
```
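Assuming roughly uniform traffic across the window, every panel can be derived from raw counts plus elapsed time. A sketch with hypothetical counts:

```python
def dashboard(good: int, total: int, slo: float,
              window_days: int = 30, elapsed_days: float = 12.0) -> dict:
    """Derive SLO dashboard panels from raw counts (uniform-traffic assumption)."""
    sli_now = good / total
    burn = (1.0 - sli_now) / (1.0 - slo)            # observed / allowed error rate
    consumed = burn * (elapsed_days / window_days)  # fraction of budget spent so far
    return {
        "sli": sli_now,
        "burn_rate": burn,
        "budget_remaining": 1.0 - consumed,
        "days_remaining": window_days - elapsed_days,
    }

dashboard(good=99_940, total=100_000, slo=0.999)
# 12 days in: SLI 0.9994, burn rate ~0.6x, ~76% of budget left
```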
Dashboard Anti-Patterns#
- Showing raw error counts instead of ratios
- Missing the SLO target line on graphs
- No burn rate visualization
- Mixing infrastructure metrics with SLIs
The Google SRE Approach#
Google's SRE book codified SLOs as the core reliability practice. Key principles:
- 100% is the wrong target. Pursuing perfect reliability is infinitely expensive and slows innovation.
- Users set the SLO. If users can't tell the difference between 99.99% and 99.999%, the tighter target wastes engineering effort.
- Error budgets create alignment. Product and SRE teams agree on the budget. No more "move fast" vs "don't break things" arguments.
- SLOs drive decisions. Launch reviews, architecture choices, and staffing all reference SLOs.
Practical SLO Targets#
| Service Tier | Suggested SLO | Error Budget (30 days) |
|---|---|---|
| Internal tool | 99.0% | 7.2 hours |
| B2B SaaS | 99.5% - 99.9% | 3.6 hours - 43 min |
| Consumer app | 99.9% | 43 minutes |
| Payment / auth | 99.95% | 21.6 minutes |
| Infrastructure | 99.99% | 4.3 minutes |
Consequences of SLO Breach#
When your SLO is breached (error budget exhausted), escalation kicks in:
Engineering consequences:
- Feature freeze until budget recovers
- Mandatory postmortem for every incident
- Architecture review for the failing service
- Increased testing and canary requirements
Organizational consequences:
- SLO breach appears in team health dashboards
- Leadership review if breaches are recurring
- Staffing adjustments (more SRE support)
- Possible re-architecture or service decomposition
What NOT to do:
- Blame individuals
- Loosen the SLO to avoid breaches
- Ignore the breach and keep shipping
Tools for SLO Management#
Nobl9#
A dedicated SLO platform that integrates with your existing observability stack:
- Connects to Datadog, Prometheus, New Relic, Splunk, and more
- Calculates error budgets automatically
- Provides burn rate alerting out of the box
- Supports composite SLOs across multiple services
Sloth (Open Source)#
Generates Prometheus alerting rules from SLO definitions:
```yaml
# sloth.yml
version: "prometheus/v1"
service: "api-gateway"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```
Run `sloth generate` and it produces multi-window, multi-burn-rate alerts automatically.
Other Tools#
- Datadog SLO Tracking — built into Datadog with SLO widgets
- Google Cloud SLO Monitoring — native GCP service
- Prometheus + Grafana — DIY with recording rules
- Dynatrace — SLO tiles with automatic baselining
Getting Started Checklist#
- Pick your most critical service
- Identify 1-2 SLIs that reflect user experience
- Set an SLO target (start conservative, tighten later)
- Calculate the error budget
- Set up burn rate alerts
- Build a dashboard
- Write an error budget policy
- Review monthly
This is article #354 in the Codelit engineering series. Explore more at codelit.io.