# Observability as Code — Dashboards, Alerts & SLOs in Version Control

## The problem with click-ops observability
Someone builds a dashboard in Grafana. It works great. Then someone else modifies a panel and breaks a query. No one knows what changed. The on-call engineer creates an alert at 3 AM that never gets reviewed. Six months later you have 200 dashboards, half of them broken, and alert fatigue everywhere.
Observability as code treats dashboards, alerts, and SLOs the same way you treat application code: version-controlled, reviewed, tested, and deployed through CI/CD.
## What belongs in code
| Asset | Format | Tool |
|---|---|---|
| Dashboards | JSON / Jsonnet | Grafana, Terraform |
| Alert rules | YAML / HCL | Prometheus, Terraform |
| SLO definitions | YAML | Sloth, OpenSLO |
| Recording rules | YAML | Prometheus |
| Notification channels | HCL | Terraform |
| On-call schedules | HCL / YAML | Terraform, PagerDuty API |
If it can drift, it belongs in code.
## Dashboards as code

### Grafana JSON model
Every Grafana dashboard is a JSON document. Export it, commit it, and deploy it via the API or provisioning.
```json
{
  "dashboard": {
    "title": "API Latency Overview",
    "uid": "api-latency-v1",
    "panels": [
      {
        "title": "P99 Latency",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.5, "color": "yellow" },
                { "value": 1.0, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  },
  "overwrite": true
}
```
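For API-based deployment, a document like this is POSTed to Grafana's dashboard endpoint (`POST /api/dashboards/db`). A minimal sketch using only the standard library; the base URL and token are placeholders:

```python
import json
import urllib.request

def envelope(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap a bare dashboard model in the envelope /api/dashboards/db expects.
    (UI exports are bare models; the example above is already wrapped.)"""
    return {"dashboard": dashboard, "overwrite": overwrite}

def push_dashboard(base_url: str, token: str, payload: dict) -> int:
    """POST an enveloped dashboard payload to Grafana; returns the HTTP status."""
    req = urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example usage (placeholder URL and token):
# with open("dashboards/api-latency.json") as f:
#     push_dashboard("http://grafana:3000", "glsa_xxx", json.load(f))
```

A CI job can loop this over every JSON file in the repo, making deploys idempotent via `overwrite`.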
### Grafana provisioning
Mount dashboards from a directory and Grafana picks them up automatically:
```yaml
# provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Production"
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```
Commit the JSON files to your repo. A CI pipeline copies them to the provisioning path on deploy.
### Terraform for Grafana
Terraform's Grafana provider manages dashboards, folders, data sources, and alert rules declaratively:
```hcl
resource "grafana_dashboard" "api_latency" {
  config_json = file("${path.module}/dashboards/api-latency.json")
  folder      = grafana_folder.production.id
  overwrite   = true
}

resource "grafana_folder" "production" {
  title = "Production"
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"
}
```
Run `terraform plan` to preview changes and `terraform apply` to deploy. Every change is tracked in state.
## Alerts as code

### Prometheus alerting rules
Define alert rules in YAML files, loaded by Prometheus:
```yaml
# alerts/api-alerts.yaml
groups:
  - name: api.latency
    interval: 30s
    rules:
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "P99 latency above 1s for 5 minutes"
          runbook: "https://wiki.internal/runbooks/high-latency"
      - alert: ErrorRateHigh
        # sum() both sides: without it, vector matching pairs each 5xx series
        # only with itself, so the ratio is always 1
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error rate above 5% for 3 minutes"
```
### Terraform-managed alert rules
For Grafana-managed alerts (Grafana 9+):
```hcl
resource "grafana_rule_group" "api_alerts" {
  name             = "api-alerts"
  folder_uid       = grafana_folder.production.uid
  interval_seconds = 60

  rule {
    name      = "HighP99Latency"
    condition = "C"

    data {
      ref_id = "A"
      relative_time_range {
        from = 300
        to   = 0
      }
      datasource_uid = grafana_data_source.prometheus.uid
      model = jsonencode({
        expr = "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
      })
    }

    data {
      ref_id = "C"
      relative_time_range {
        from = 0
        to   = 0
      }
      datasource_uid = "-100" # built-in expression engine ("__expr__" on newer Grafana versions)
      model = jsonencode({
        type       = "threshold"
        conditions = [{ evaluator = { type = "gt", params = [1.0] } }]
      })
    }
  }
}
```
## SLO definitions in YAML

### OpenSLO spec
OpenSLO is a vendor-neutral standard for defining SLOs:
```yaml
# slos/api-availability.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: api-availability
  displayName: API Availability
spec:
  service: payment-api
  description: "Payment API should be available 99.9% of the time"
  budgetingMethod: Occurrences
  objectives:
    - displayName: Availability
      target: 0.999
      ratioMetrics:
        good:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total{status!~"5.."}[5m]))
        total:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total[5m]))
  timeWindow:
    - duration: 30d
      isRolling: true
```
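A target like 0.999 over a rolling 30-day window translates directly into an error budget, which is worth computing before committing to a target (a minimal sketch):

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of full downtime allowed by an availability target over a window."""
    return round((1 - target) * window_days * 24 * 60, 2)

print(error_budget_minutes(0.999, 30))   # 43.2 minutes per 30 days
print(error_budget_minutes(0.9999, 30))  # 4.32 minutes ("four nines")
```

If 43 minutes of downtime per month sounds too generous or impossibly tight, the YAML target should change before the SLO ships.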
### Sloth for Prometheus SLOs
Sloth generates Prometheus recording and alerting rules from SLO definitions:
```yaml
# slos/sloth/payment-api.yaml
version: "prometheus/v1"
service: "payment-api"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{service="payment-api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payment-api"}[{{.window}}]))
    alerting:
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```

Run `sloth generate -i slos/sloth/ -o rules/` to produce Prometheus-compatible rule files.
## Jsonnet and CUE for config generation
Raw JSON and YAML get repetitive. Templating languages help.
### Jsonnet for Grafana dashboards
Grafonnet is a Jsonnet library for generating Grafana dashboards:
```jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new('API Overview', uid='api-overview')
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target('rate(http_requests_total[5m])', legendFormat='{{method}} {{path}}')
  ),
  gridPos={ x: 0, y: 0, w: 12, h: 8 }
)
```
One Jsonnet file can generate dozens of consistent dashboards across services.
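For example, a top-level object comprehension can emit one dashboard file per service. A sketch (the service list is illustrative; render with `jsonnet -m out/`, which writes each field to its own file):

```jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;

local services = ['payment-api', 'orders-api', 'inventory-api'];

// One output file per service, e.g. payment-api-overview.json
{
  ['%s-overview.json' % svc]:
    dashboard.new('%s Overview' % svc, uid='%s-overview' % svc)
  for svc in services
}
```

Adding a panel to the shared template updates every service's dashboard in the next render.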
### CUE for validation
CUE provides type-safe configuration with built-in validation:
```cue
#AlertRule: {
	alert: string & =~"^[A-Z]"
	expr:  string
	for:   =~"^[0-9]+[msh]$"
	labels: severity: "info" | "warning" | "critical"
	annotations: {
		summary: string
		runbook: string & =~"^https://"
	}
}

rules: [...#AlertRule]
```
CUE catches invalid configurations before they reach production.
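Given that schema, `cue vet schema.cue rules.yaml` rejects a rule with an unknown severity or a non-HTTPS runbook (file names are illustrative):

```yaml
# rules.yaml: both annotated lines fail validation against #AlertRule
rules:
  - alert: HighP99Latency
    expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1.0
    for: 5m
    labels:
      severity: urgent    # rejected: not one of "info" | "warning" | "critical"
    annotations:
      summary: P99 latency above 1s
      runbook: http://wiki.internal/runbooks/high-latency    # rejected: must match ^https://
```

Running this check in CI turns "the runbook link is broken" from a 3 AM discovery into a failed pull request.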
## CI/CD pipeline for observability
A typical workflow:
- Developer edits a dashboard JSON or alert YAML in a feature branch
- CI validates syntax (`jsonnet fmt`, `promtool check rules`, `cue vet`)
- CI runs `promtool test rules` against unit tests for alerts
- PR review — team reviews observability changes like any code change
- Merge triggers deployment via Terraform apply or Grafana provisioning
- Drift detection — scheduled job compares live state to code, alerts on divergence
### Validating alerts in CI
```yaml
# .github/workflows/observability.yml
name: Validate Observability
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Prometheus rules
        run: promtool check rules alerts/*.yaml
      - name: Run alert unit tests
        run: promtool test rules tests/*.yaml
      - name: Validate Jsonnet
        run: jsonnet fmt --test dashboards/*.jsonnet
      - name: Terraform plan
        run: terraform plan -target=module.observability
```
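The unit-test step above needs a test file. A sketch of one for the HighP99Latency rule defined earlier, in promtool's unit-test format; the synthetic series put every observation between 1s and 2.5s so the computed p99 interpolates above the 1s threshold:

```yaml
# tests/api-alerts-test.yaml
rule_files:
  - ../alerts/api-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Cumulative histogram buckets: nothing below 1s, everything below 2.5s
      - series: 'http_request_duration_seconds_bucket{le="0.5"}'
        values: '0x10'
      - series: 'http_request_duration_seconds_bucket{le="1.0"}'
        values: '0x10'
      - series: 'http_request_duration_seconds_bucket{le="2.5"}'
        values: '0+60x10'
      - series: 'http_request_duration_seconds_bucket{le="+Inf"}'
        values: '0+60x10'
    alert_rule_test:
      - eval_time: 8m
        alertname: HighP99Latency
        exp_alerts:
          - exp_labels:
              severity: warning
              team: platform
```

If someone later edits the PromQL and the alert stops firing on this traffic shape, the pipeline fails before the change merges.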
## Preventing drift
Code-defined observability is useless if people bypass it with manual changes.
- Disable manual editing in Grafana for provisioned dashboards (they become read-only)
- Run drift detection — compare Terraform state to live resources on a schedule
- Alert on drift — if live dashboards diverge from code, notify the team
- Use RBAC — restrict who can create or modify dashboards and alerts in the UI
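The comparison at the heart of drift detection can be simple: fetch the live dashboard over the API, drop the fields Grafana rewrites on save, and diff against the file in the repo. A minimal sketch of that normalization step (the ignored field names are the usual server-managed ones):

```python
# Fields Grafana mutates on every save; changes here are not drift.
IGNORED_FIELDS = {"id", "version", "iteration"}

def normalize(dashboard: dict) -> dict:
    """Strip server-managed fields so only meaningful differences count."""
    return {k: v for k, v in dashboard.items() if k not in IGNORED_FIELDS}

def has_drifted(live: dict, in_repo: dict) -> bool:
    """True if the live dashboard no longer matches the version-controlled one."""
    return normalize(live) != normalize(in_repo)

repo = {"uid": "api-latency-v1", "title": "API Latency Overview", "version": 3}
live = {"uid": "api-latency-v1", "title": "API Latency Overview", "version": 9}
print(has_drifted(live, repo))                                # False
print(has_drifted(dict(live, title="Edited by hand"), repo))  # True
```

A scheduled job can run this over every dashboard UID and post drifting ones to the team channel.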
## Key takeaways
- Treat dashboards, alerts, and SLOs like application code — version-controlled and reviewed
- Grafana JSON + provisioning gives you reproducible dashboards across environments
- Terraform manages the full observability stack declaratively
- OpenSLO and Sloth standardize SLO definitions and generate alerting rules
- Jsonnet/CUE reduce repetition and add type safety to configs
- CI/CD validation catches broken queries and invalid thresholds before deployment
- Drift detection ensures live observability matches what is in the repo
Article #420 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.