Service Level Objectives (SLOs): Measure Reliability Like Google SRE
Uptime is not reliability. Users don't care whether your server is "up" — they care whether their requests succeed fast enough. Service Level Objectives give you a precise, measurable definition of "reliable enough."
SLI vs SLO vs SLA#
These three terms form a hierarchy:
| Term | What it is | Example |
|---|---|---|
| SLI (Service Level Indicator) | A metric that measures user experience | 99.2% of requests complete in under 300ms |
| SLO (Service Level Objective) | A target range for an SLI | 99.5% of requests must complete in under 300ms over 30 days |
| SLA (Service Level Agreement) | A contractual commitment with consequences | If availability drops below 99.9%, customer receives credits |
The relationship flows upward: SLIs feed SLOs, and SLOs underpin SLAs.
SLI (measurement) --> SLO (internal target) --> SLA (external contract)
Key insight: You can have SLOs without SLAs. Every team should have SLOs. Not every team needs SLAs.
Choosing the Right SLIs#
Bad SLIs measure infrastructure. Good SLIs measure what users experience.
Avoid These SLIs#
- CPU utilization
- Memory usage
- Pod restart count
- Disk IOPS
Use These Instead#
Availability SLI:
successful requests / total requests
Latency SLI:
requests faster than threshold / total requests
Correctness SLI:
requests returning correct data / total requests
Freshness SLI:
data updates arriving within threshold / total updates
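All four ratios share the same shape: good events divided by total events. A minimal sketch in Python (the request counts are invented for illustration):

```python
def sli(good_events: int, total_events: int) -> float:
    """Generic SLI: fraction of good events, between 0.0 and 1.0."""
    if total_events == 0:
        return 1.0  # no traffic: conventionally counts as meeting the SLI
    return good_events / total_events

# Availability SLI: successful requests / total requests
availability = sli(good_events=99_412, total_events=100_000)  # 0.99412

# Latency SLI: requests under the 300 ms threshold / total requests
latency = sli(good_events=99_200, total_events=100_000)       # 0.992
```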
SLI Selection by Service Type#
| Service Type | Primary SLI | Secondary SLI |
|---|---|---|
| API / Web | Availability + Latency | Error rate |
| Data pipeline | Freshness + Correctness | Throughput |
| Storage | Durability + Availability | Latency |
| Streaming | Throughput + Latency | Availability |
Error Budgets#
An error budget is the inverse of your SLO. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% — roughly 43 minutes of allowed downtime.
Error Budget = 1 - SLO target
99.9% SLO --> 0.1% budget --> 43.2 min/month
99.5% SLO --> 0.5% budget --> 3.6 hours/month
99.0% SLO --> 1.0% budget --> 7.2 hours/month
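The conversion above is simple arithmetic. A small helper (the function name is ours; it assumes downtime maps one-to-one onto the budget):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for a given SLO target."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1.0 - slo_target) * window_minutes

error_budget_minutes(0.999)  # ~43.2 min/month
error_budget_minutes(0.995)  # ~216 min = 3.6 hours
error_budget_minutes(0.990)  # ~432 min = 7.2 hours
```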
Error Budget Policy#
Define what happens as budget depletes:
| Budget Remaining | Action |
|---|---|
| 100-50% | Normal feature velocity |
| 50-25% | Prioritize reliability work |
| 25-10% | Feature freeze, reliability-only sprints |
| Under 10% | All hands on reliability |
| 0% (exhausted) | Full stop on deployments until budget recovers |
This creates a self-balancing system: ship fast when you have budget, slow down when reliability suffers.
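The policy table can be encoded directly; the thresholds below follow this article's example policy, not any standard:

```python
def budget_policy(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction, 0.0-1.0) to an action."""
    if budget_remaining <= 0.0:
        return "full stop on deployments"
    if budget_remaining < 0.10:
        return "all hands on reliability"
    if budget_remaining < 0.25:
        return "feature freeze, reliability-only sprints"
    if budget_remaining < 0.50:
        return "prioritize reliability work"
    return "normal feature velocity"

budget_policy(0.62)  # "normal feature velocity"
budget_policy(0.08)  # "all hands on reliability"
```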
Burn Rate Alerts#
Traditional threshold alerts fire too late. Burn rate alerts tell you how fast you're consuming your error budget.
Burn Rate = (error rate observed) / (error rate allowed by SLO)
A burn rate of 1.0 means you'll exhaust your budget exactly at the end of the window. A burn rate of 10.0 means you'll burn through your entire monthly budget in 3 days.
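Both the burn rate and the time to exhaustion fall straight out of the definitions. A sketch, assuming a constant error rate over the window:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return observed_error_rate / (1.0 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, the budget lasts window / rate days."""
    return window_days / rate

burn_rate(0.01, 0.999)    # ~10: burning ten times faster than allowed
days_to_exhaustion(10.0)  # 3.0 days to exhaust a 30-day budget
```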
Multi-Window Burn Rate Strategy#
Google SRE recommends alerting on two windows simultaneously to reduce false positives:
| Severity | Long Window | Short Window | Burn Rate | Budget Consumed |
|---|---|---|---|---|
| Page (critical) | 1 hour | 5 minutes | 14.4x | 2% in 1 hour |
| Page (high) | 6 hours | 30 minutes | 6x | 5% in 6 hours |
| Ticket (medium) | 3 days | 6 hours | 1x | 10% in 3 days |
Both windows must be breaching to trigger the alert, which prevents short spikes from paging anyone.
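The two-window AND condition itself is trivial; the real work is measuring the burn rate over each window. A sketch, where `long_rate` and `short_rate` are assumed to be burn rates already computed over the two windows:

```python
def should_page(long_rate: float, short_rate: float,
                threshold: float = 14.4) -> bool:
    """Fire only when BOTH windows exceed the burn-rate threshold."""
    return long_rate > threshold and short_rate > threshold

should_page(long_rate=15.0, short_rate=16.2)  # True: sustained fast burn
should_page(long_rate=15.0, short_rate=0.5)   # False: the spike has passed
```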
Burn Rate Alert in Prometheus#
```yaml
# Fast burn: 14.4x over 1 hour (pages immediately)
- alert: HighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
```
Building an SLO Dashboard#
Every SLO dashboard should show four things:
- Current SLI value — are we meeting the objective right now?
- Error budget remaining — how much room do we have?
- Burn rate — how fast are we consuming budget?
- Time series — SLI trend over the SLO window
```text
+--------------------------------------------------+
|  API Availability SLO: 99.9%                     |
|                                                  |
|  Current SLI: 99.94%        [=====|----]         |
|  Budget Remaining: 62%      [========|--]        |
|  Burn Rate: 0.8x (healthy)                       |
|  Window: 30 days (18 days remaining)             |
+--------------------------------------------------+
```
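Assuming roughly uniform traffic across the window, every panel can be derived from raw counts plus elapsed time. A sketch with hypothetical counts:

```python
def dashboard(good: int, total: int, slo: float,
              window_days: int = 30, elapsed_days: float = 12.0) -> dict:
    """Derive SLO dashboard panels from raw counts (uniform-traffic assumption)."""
    sli_now = good / total
    burn = (1.0 - sli_now) / (1.0 - slo)            # observed / allowed error rate
    consumed = burn * (elapsed_days / window_days)  # fraction of budget spent so far
    return {
        "sli": sli_now,
        "burn_rate": burn,
        "budget_remaining": 1.0 - consumed,
        "days_remaining": window_days - elapsed_days,
    }

dashboard(good=99_940, total=100_000, slo=0.999)
# 12 days in: SLI 0.9994, burn rate ~0.6x, ~76% of budget left
```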
Dashboard Anti-Patterns#
- Showing raw error counts instead of ratios
- Missing the SLO target line on graphs
- No burn rate visualization
- Mixing infrastructure metrics with SLIs
The Google SRE Approach#
Google's SRE book codified SLOs as the core reliability practice. Key principles:
- 100% is the wrong target. Pursuing perfect reliability is infinitely expensive and slows innovation.
- Users set the SLO. If users can't tell the difference between 99.99% and 99.999%, the tighter target wastes engineering effort.
- Error budgets create alignment. Product and SRE teams agree on the budget. No more "move fast" vs "don't break things" arguments.
- SLOs drive decisions. Launch reviews, architecture choices, and staffing all reference SLOs.
Practical SLO Targets#
| Service Tier | Suggested SLO | Error Budget (30 days) |
|---|---|---|
| Internal tool | 99.0% | 7.2 hours |
| B2B SaaS | 99.5% - 99.9% | 3.6 hours - 43 min |
| Consumer app | 99.9% | 43 minutes |
| Payment / auth | 99.95% | 21.6 minutes |
| Infrastructure | 99.99% | 4.3 minutes |
Consequences of SLO Breach#
When your SLO is breached (error budget exhausted), escalation kicks in:
Engineering consequences:
- Feature freeze until budget recovers
- Mandatory postmortem for every incident
- Architecture review for the failing service
- Increased testing and canary requirements
Organizational consequences:
- SLO breach appears in team health dashboards
- Leadership review if breaches are recurring
- Staffing adjustments (more SRE support)
- Possible re-architecture or service decomposition
What NOT to do:
- Blame individuals
- Loosen the SLO to avoid breaches
- Ignore the breach and keep shipping
Tools for SLO Management#
Nobl9#
A dedicated SLO platform that integrates with your existing observability stack:
- Connects to Datadog, Prometheus, New Relic, Splunk, and more
- Calculates error budgets automatically
- Provides burn rate alerting out of the box
- Supports composite SLOs across multiple services
Sloth (Open Source)#
Generates Prometheus alerting rules from SLO definitions:
```yaml
# sloth.yml
version: "prometheus/v1"
service: "api-gateway"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```
Run `sloth generate` and it produces multi-window, multi-burn-rate alerts automatically.
Other Tools#
- Datadog SLO Tracking — built into Datadog with SLO widgets
- Google Cloud SLO Monitoring — native GCP service
- Prometheus + Grafana — DIY with recording rules
- Dynatrace — SLO tiles with automatic baselining
Getting Started Checklist#
- Pick your most critical service
- Identify 1-2 SLIs that reflect user experience
- Set an SLO target (start conservative, tighten later)
- Calculate the error budget
- Set up burn rate alerts
- Build a dashboard
- Write an error budget policy
- Review monthly
This is article #354 in the Codelit engineering series. Explore more at codelit.io.