# Observability as Code — Dashboards, Alerts & SLOs in Version Control

## The problem with click-ops observability
Someone builds a dashboard in Grafana. It works great. Then someone else modifies a panel and breaks a query. No one knows what changed. The on-call engineer creates an alert at 3 AM that never gets reviewed. Six months later you have 200 dashboards, half of them broken, and alert fatigue everywhere.
Observability as code treats dashboards, alerts, and SLOs the same way you treat application code: version-controlled, reviewed, tested, and deployed through CI/CD.
## What belongs in code
| Asset | Format | Tool |
|---|---|---|
| Dashboards | JSON / Jsonnet | Grafana, Terraform |
| Alert rules | YAML / HCL | Prometheus, Terraform |
| SLO definitions | YAML | Sloth, OpenSLO |
| Recording rules | YAML | Prometheus |
| Notification channels | HCL | Terraform |
| On-call schedules | HCL / YAML | Terraform, PagerDuty API |
If it can drift, it belongs in code.
## Dashboards as code

### Grafana JSON model
Every Grafana dashboard is a JSON document. Export it, commit it, and deploy it via the API or provisioning.
```json
{
  "dashboard": {
    "title": "API Latency Overview",
    "uid": "api-latency-v1",
    "panels": [
      {
        "title": "P99 Latency",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.5, "color": "yellow" },
                { "value": 1.0, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  },
  "overwrite": true
}
```
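For API-based deployment, a document like this is POSTed to Grafana's dashboard endpoint (`POST /api/dashboards/db`). A minimal sketch using only the standard library; the base URL and token are placeholders:

```python
import json
import urllib.request

def envelope(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap a bare dashboard model in the envelope /api/dashboards/db expects.
    (UI exports are bare models; the example above is already wrapped.)"""
    return {"dashboard": dashboard, "overwrite": overwrite}

def push_dashboard(base_url: str, token: str, payload: dict) -> int:
    """POST an enveloped dashboard payload to Grafana; returns the HTTP status."""
    req = urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example usage (placeholder URL and token):
# with open("dashboards/api-latency.json") as f:
#     push_dashboard("http://grafana:3000", "glsa_xxx", json.load(f))
```

A CI job can loop this over every JSON file in the repo, making deploys idempotent via `overwrite`.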
### Grafana provisioning
Mount dashboards from a directory and Grafana picks them up automatically:
```yaml
# provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Production"
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```
Commit the JSON files to your repo. A CI pipeline copies them to the provisioning path on deploy.
### Terraform for Grafana
Terraform's Grafana provider manages dashboards, folders, data sources, and alert rules declaratively:
```hcl
resource "grafana_dashboard" "api_latency" {
  config_json = file("${path.module}/dashboards/api-latency.json")
  folder      = grafana_folder.production.id
  overwrite   = true
}

resource "grafana_folder" "production" {
  title = "Production"
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"
}
```
Run `terraform plan` to preview changes and `terraform apply` to deploy. Every change is tracked in state.
## Alerts as code

### Prometheus alerting rules
Define alert rules in YAML files, loaded by Prometheus:
```yaml
# alerts/api-alerts.yaml
groups:
  - name: api.latency
    interval: 30s
    rules:
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "P99 latency above 1s for 5 minutes"
          runbook: "https://wiki.internal/runbooks/high-latency"
      - alert: ErrorRateHigh
        # sum() both sides: without it, vector matching pairs each 5xx series
        # only with itself, so the ratio is always 1
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error rate above 5% for 3 minutes"
```
### Terraform-managed alert rules
For Grafana-managed alerts (Grafana 9+):
```hcl
resource "grafana_rule_group" "api_alerts" {
  name             = "api-alerts"
  folder_uid       = grafana_folder.production.uid
  interval_seconds = 60

  rule {
    name      = "HighP99Latency"
    condition = "C"

    data {
      ref_id = "A"
      relative_time_range {
        from = 300
        to   = 0
      }
      datasource_uid = grafana_data_source.prometheus.uid
      model = jsonencode({
        expr = "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
      })
    }

    data {
      ref_id = "C"
      relative_time_range {
        from = 0
        to   = 0
      }
      datasource_uid = "-100" # built-in expression engine ("__expr__" on newer Grafana versions)
      model = jsonencode({
        type       = "threshold"
        conditions = [{ evaluator = { type = "gt", params = [1.0] } }]
      })
    }
  }
}
```
## SLO definitions in YAML

### OpenSLO spec
OpenSLO is a vendor-neutral standard for defining SLOs:
```yaml
# slos/api-availability.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: api-availability
  displayName: API Availability
spec:
  service: payment-api
  description: "Payment API should be available 99.9% of the time"
  budgetingMethod: Occurrences
  objectives:
    - displayName: Availability
      target: 0.999
      ratioMetrics:
        good:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total{status!~"5.."}[5m]))
        total:
          metricSource:
            type: Prometheus
            spec:
              query: sum(rate(http_requests_total[5m]))
  timeWindow:
    - duration: 30d
      isRolling: true
```
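A target like 0.999 over a rolling 30-day window translates directly into an error budget, which is worth computing before committing to a target (a minimal sketch):

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of full downtime allowed by an availability target over a window."""
    return round((1 - target) * window_days * 24 * 60, 2)

print(error_budget_minutes(0.999, 30))   # 43.2 minutes per 30 days
print(error_budget_minutes(0.9999, 30))  # 4.32 minutes ("four nines")
```

If 43 minutes of downtime per month sounds too generous or impossibly tight, the YAML target should change before the SLO ships.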
### Sloth for Prometheus SLOs
Sloth generates Prometheus recording and alerting rules from SLO definitions:
```yaml
# slos/sloth/payment-api.yaml
version: "prometheus/v1"
service: "payment-api"
slos:
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{service="payment-api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="payment-api"}[{{.window}}]))
    alerting:
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```

Run `sloth generate -i slos/sloth/ -o rules/` to produce Prometheus-compatible rule files.
## Jsonnet and CUE for config generation
Raw JSON and YAML get repetitive. Templating languages help.
### Jsonnet for Grafana dashboards
Grafonnet is a Jsonnet library for generating Grafana dashboards:
```jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new('API Overview', uid='api-overview')
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target('rate(http_requests_total[5m])', legendFormat='{{method}} {{path}}')
  ),
  gridPos={ x: 0, y: 0, w: 12, h: 8 }
)
```
One Jsonnet file can generate dozens of consistent dashboards across services.
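For example, a top-level object comprehension can emit one dashboard file per service. A sketch (the service list is illustrative; render with `jsonnet -m out/`, which writes each field to its own file):

```jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;

local services = ['payment-api', 'orders-api', 'inventory-api'];

// One output file per service, e.g. payment-api-overview.json
{
  ['%s-overview.json' % svc]:
    dashboard.new('%s Overview' % svc, uid='%s-overview' % svc)
  for svc in services
}
```

Adding a panel to the shared template updates every service's dashboard in the next render.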
### CUE for validation
CUE provides type-safe configuration with built-in validation:
```cue
#AlertRule: {
	alert: string & =~"^[A-Z]"
	expr:  string
	for:   =~"^[0-9]+[msh]$"
	labels: severity: "info" | "warning" | "critical"
	annotations: {
		summary: string
		runbook: string & =~"^https://"
	}
}

rules: [...#AlertRule]
```
CUE catches invalid configurations before they reach production.
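Given that schema, `cue vet schema.cue rules.yaml` rejects a rule with an unknown severity or a non-HTTPS runbook (file names are illustrative):

```yaml
# rules.yaml: both annotated lines fail validation against #AlertRule
rules:
  - alert: HighP99Latency
    expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1.0
    for: 5m
    labels:
      severity: urgent    # rejected: not one of "info" | "warning" | "critical"
    annotations:
      summary: P99 latency above 1s
      runbook: http://wiki.internal/runbooks/high-latency    # rejected: must match ^https://
```

Running this check in CI turns "the runbook link is broken" from a 3 AM discovery into a failed pull request.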
## CI/CD pipeline for observability
A typical workflow:
- Developer edits a dashboard JSON or alert YAML in a feature branch
- CI validates syntax (`jsonnet fmt`, `promtool check rules`, `cue vet`)
- CI runs `promtool test rules` against unit tests for alerts
- PR review — team reviews observability changes like any code change
- Merge triggers deployment via Terraform apply or Grafana provisioning
- Drift detection — scheduled job compares live state to code, alerts on divergence
### Validating alerts in CI
```yaml
# .github/workflows/observability.yml
name: Validate Observability
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Prometheus rules
        run: promtool check rules alerts/*.yaml
      - name: Run alert unit tests
        run: promtool test rules tests/*.yaml
      - name: Validate Jsonnet
        run: jsonnet fmt --test dashboards/*.jsonnet
      - name: Terraform plan
        run: terraform plan -target=module.observability
```
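The unit-test step above needs a test file. A sketch of one for the HighP99Latency rule defined earlier, in promtool's unit-test format; the synthetic series put every observation between 1s and 2.5s so the computed p99 interpolates above the 1s threshold:

```yaml
# tests/api-alerts-test.yaml
rule_files:
  - ../alerts/api-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Cumulative histogram buckets: nothing below 1s, everything below 2.5s
      - series: 'http_request_duration_seconds_bucket{le="0.5"}'
        values: '0x10'
      - series: 'http_request_duration_seconds_bucket{le="1.0"}'
        values: '0x10'
      - series: 'http_request_duration_seconds_bucket{le="2.5"}'
        values: '0+60x10'
      - series: 'http_request_duration_seconds_bucket{le="+Inf"}'
        values: '0+60x10'
    alert_rule_test:
      - eval_time: 8m
        alertname: HighP99Latency
        exp_alerts:
          - exp_labels:
              severity: warning
              team: platform
```

If someone later edits the PromQL and the alert stops firing on this traffic shape, the pipeline fails before the change merges.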
## Preventing drift
Code-defined observability is useless if people bypass it with manual changes.
- Disable manual editing in Grafana for provisioned dashboards (they become read-only)
- Run drift detection — compare Terraform state to live resources on a schedule
- Alert on drift — if live dashboards diverge from code, notify the team
- Use RBAC — restrict who can create or modify dashboards and alerts in the UI
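The comparison at the heart of drift detection can be simple: fetch the live dashboard over the API, drop the fields Grafana rewrites on save, and diff against the file in the repo. A minimal sketch of that normalization step (the ignored field names are the usual server-managed ones):

```python
# Fields Grafana mutates on every save; changes here are not drift.
IGNORED_FIELDS = {"id", "version", "iteration"}

def normalize(dashboard: dict) -> dict:
    """Strip server-managed fields so only meaningful differences count."""
    return {k: v for k, v in dashboard.items() if k not in IGNORED_FIELDS}

def has_drifted(live: dict, in_repo: dict) -> bool:
    """True if the live dashboard no longer matches the version-controlled one."""
    return normalize(live) != normalize(in_repo)

repo = {"uid": "api-latency-v1", "title": "API Latency Overview", "version": 3}
live = {"uid": "api-latency-v1", "title": "API Latency Overview", "version": 9}
print(has_drifted(live, repo))                                # False
print(has_drifted(dict(live, title="Edited by hand"), repo))  # True
```

A scheduled job can run this over every dashboard UID and post drifting ones to the team channel.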
## Key takeaways
- Treat dashboards, alerts, and SLOs like application code — version-controlled and reviewed
- Grafana JSON + provisioning gives you reproducible dashboards across environments
- Terraform manages the full observability stack declaratively
- OpenSLO and Sloth standardize SLO definitions and generate alerting rules
- Jsonnet/CUE reduce repetition and add type safety to configs
- CI/CD validation catches broken queries and invalid thresholds before deployment
- Drift detection ensures live observability matches what is in the repo
Article #420 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.