# Monitoring & Alerting Best Practices: Taming Alert Fatigue
Your monitoring dashboard glows green until it doesn't — and when it doesn't, you want the right person to know about the right problem with enough context to fix it in minutes, not hours. Poor alerting practices turn on-call rotations into burnout machines and bury real incidents under a mountain of noise.
## The Alert Fatigue Problem
Alert fatigue occurs when engineers receive so many notifications that they begin ignoring them. Studies from healthcare (where alarm fatigue is well-documented) show that between 72 % and 99 % of clinical alarms are false. Engineering teams face the same dynamic.
Common symptoms:
- Engineers mute channels or snooze pages reflexively.
- Mean time to acknowledge (MTTA) drifts upward week over week.
- Post-incident reviews reveal that the relevant alert fired but nobody noticed.
- Dashboards contain hundreds of charts that nobody looks at unless something is already broken.
### Why It Happens
- Threshold-based alerts on raw metrics — CPU over 80 % triggers a page even though the service is happily auto-scaling.
- Copy-paste alerting — Every new service inherits a boilerplate set of alerts with no tuning.
- No ownership — Alerts fire into a shared channel with no assigned responder.
- Missing severity tiers — Every alert is treated as equally urgent.
## SLO-Based Alerting
Service Level Objectives (SLOs) shift the focus from infrastructure symptoms to user-facing impact. Instead of asking "Is CPU high?" you ask "Are users experiencing errors or latency above our budget?"
### Defining SLOs
An SLO has three parts:
- SLI (Service Level Indicator) — The metric that represents user happiness. Examples: request success rate, p99 latency, data freshness.
- Target — The threshold. Example: 99.9 % of requests succeed over a 30-day rolling window.
- Error budget — The inverse of the target. A 99.9 % target gives you a 0.1 % error budget — roughly 43 minutes of total downtime per month.
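To make the budget arithmetic concrete, here is a minimal sketch (plain Python, no monitoring library assumed) that converts an availability target into allowed downtime:

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed by an availability target
    over a rolling window."""
    total_minutes = window_days * 24 * 60   # 43,200 for a 30-day window
    return (1 - target) * total_minutes

print(round(error_budget_minutes(0.999), 1))   # 43.2 -> the "roughly 43 minutes"
print(round(error_budget_minutes(0.9999), 2))  # 4.32 -> one more nine, 10x less slack
```

Note how each additional nine in the target divides the budget by ten, which is why targets beyond 99.9 % get expensive fast.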
### Burn-Rate Alerts
Rather than alerting when the SLI dips below the target instantaneously, burn-rate alerts fire when the error budget is being consumed faster than sustainable.
A common multi-window approach:
| Severity | Short window | Long window | Budget consumed |
|---|---|---|---|
| Page | 5 min | 1 hour | 2 % in 1 hour |
| Page | 30 min | 6 hours | 5 % in 6 hours |
| Ticket | 6 hours | 3 days | 10 % in 3 days |
The short window catches sudden spikes; the long window prevents alert-and-recover flapping. This approach, popularized by the Google SRE Workbook, dramatically reduces false positives.
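All three rows of the table come from one formula: the burn-rate multiplier is the fraction of budget consumed, scaled by the ratio of the SLO period to the detection window. A small sketch, assuming a 30-day SLO period:

```python
def burn_rate(budget_fraction: float, window_hours: float,
              period_hours: float = 30 * 24) -> float:
    """How many times faster than the sustainable pace the budget burns
    if budget_fraction of it is consumed within window_hours."""
    return budget_fraction * period_hours / window_hours

# The three rows of the table above:
print(burn_rate(0.02, 1))    # 14.4 -> page
print(burn_rate(0.05, 6))    # 6.0  -> page
print(burn_rate(0.10, 72))   # 1.0  -> ticket
```

A multiplier of 1.0 means the budget would be exactly exhausted at the end of the period, which is why the slowest tier only warrants a ticket.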
### Practical Example
```yaml
groups:
  - name: slo-error-burn
    rules:
      # Prometheus recording rules for the error ratio, one per alert window
      - record: sli:http_requests:error_ratio_5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - record: sli:http_requests:error_ratio_1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
      # Burn-rate alert: both windows must breach 14.4x the 0.1 % budget rate
      - alert: HighErrorBurnRate
        expr: |
          sli:http_requests:error_ratio_5m > 14.4 * 0.001
          and
          sli:http_requests:error_ratio_1h > 14.4 * 0.001
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning 14.4x faster than sustainable"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
```
The multiplier 14.4 corresponds to consuming 2 % of a 30-day budget in one hour: the budget covers 720 hours, so spending 2 % of it in a single hour is 0.02 × 720 = 14.4 times the sustainable rate.
## Alert Design Principles
### 1. Every Alert Must Be Actionable
If an engineer receives a page and the correct response is "wait and see," the alert should not be a page. Demote it to a ticket or a dashboard indicator.
### 2. Include Context
A good alert notification contains:
- What is broken (service, SLI, current value vs. threshold).
- Since when (timestamp of first breach).
- Where to look (link to dashboard, logs, traces).
- What to do (link to runbook).
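As an illustration, a notification body carrying all four pieces of context might be assembled like this; the field names and URLs are placeholders, not a specific vendor's payload format:

```python
from datetime import datetime, timezone

def format_page(service: str, sli: str, current: float, threshold: float,
                since: datetime, dashboard: str, runbook: str) -> str:
    """Assemble a context-rich page body (field names are illustrative)."""
    return (
        f"[{service}] {sli} at {current:.3%} (threshold {threshold:.3%})\n"
        f"Breaching since {since.isoformat()}\n"
        f"Dashboard: {dashboard}\n"
        f"Runbook:   {runbook}"
    )

msg = format_page(
    service="checkout", sli="error ratio", current=0.021, threshold=0.001,
    since=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
    dashboard="https://grafana.example/d/checkout",
    runbook="https://wiki.example/runbooks/checkout-errors",
)
print(msg)
```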
### 3. Assign Ownership
Every alert routes to a specific team. Unowned alerts are effectively ignored alerts.
### 4. Tier Your Severity
| Tier | Response | Channel | Example |
|---|---|---|---|
| P1 | Immediate (wake up) | Phone/SMS page | Payment processing down |
| P2 | Within 30 min | Push notification | Elevated latency above SLO burn rate |
| P3 | Next business day | Ticket | Disk usage trending toward capacity |
| P4 | Informational | Dashboard only | Cache hit rate decreased |
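A hypothetical routing table for these tiers, sketched in Python (the channel names are illustrative):

```python
# Map severity tiers to delivery channels; tiers mirror the table above.
ROUTES = {
    "P1": "phone",      # immediate page, wakes someone up
    "P2": "push",       # respond within 30 minutes
    "P3": "ticket",     # next business day
    "P4": "dashboard",  # informational only
}

def route(severity: str) -> str:
    # Fail closed: an unknown severity becomes a ticket rather than silence.
    return ROUTES.get(severity, "ticket")

print(route("P1"))  # phone
print(route("P9"))  # ticket
```

The fallback matters: a typo in a severity label should degrade to a tracked ticket, never to an alert that silently disappears.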
## Runbook Automation
A runbook starts as a human-readable document. As the response matures, you automate steps incrementally.
### Maturity Levels
- Wiki page — Step-by-step instructions an on-call engineer follows manually.
- Semi-automated — A script performs diagnostics and suggests a remediation that a human approves.
- Fully automated — The alert triggers a workflow that remediates without human intervention, logging every action for audit.
### Example Automation Flow
```
Alert fires: "Pod CrashLoopBackOff for checkout-service"
        |
        v
Automation gathers:
  - Pod logs (last 200 lines)
  - Recent deploys to the namespace
  - Memory/CPU usage trend
        |
        v
Decision tree:
  - OOMKilled?   --> Increase memory limit, apply, notify
  - Bad config?  --> Roll back last ConfigMap change, notify
  - Unknown?     --> Page on-call with gathered diagnostics
```
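The decision tree above could be sketched as a plain function; the diagnostic keys here are hypothetical stand-ins for whatever your automation actually collects:

```python
def choose_remediation(diagnostics: dict) -> str:
    """Pick a remediation from gathered diagnostics (keys are illustrative)."""
    if diagnostics.get("last_termination_reason") == "OOMKilled":
        return "increase-memory-limit"
    if diagnostics.get("configmap_changed_recently"):
        return "rollback-configmap"
    # Unknown cause: never remediate blindly, hand the evidence to a human.
    return "page-oncall-with-diagnostics"

print(choose_remediation({"last_termination_reason": "OOMKilled"}))
print(choose_remediation({}))
```

Keeping the "unknown" branch as a human escalation is what makes incremental automation safe: the workflow only acts on causes it can classify.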
Tools like Rundeck, Shoreline.io, and PagerDuty Automation Actions can orchestrate these workflows.
## PagerDuty and OpsGenie Integration
### Routing and Escalation
Both PagerDuty and OpsGenie support multi-tier escalation policies:
- Primary responder receives the page immediately.
- If unacknowledged after N minutes, secondary responder is paged.
- If still unacknowledged, engineering manager is paged.
- If the incident remains open past a threshold, an incident commander is looped in.
### Event Enrichment
Raw alert payloads are often cryptic. Enrichment adds:
- Links to relevant Grafana dashboards.
- Recent deployment history from CI/CD.
- Affected customer tier (enterprise vs. free).
- Suggested runbook steps.
PagerDuty Event Orchestration and OpsGenie Alert Policies can perform this enrichment automatically based on alert metadata.
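A sketch of what such an enrichment step might do; the URL patterns and field names are invented for illustration and are not either vendor's API:

```python
def enrich(alert: dict) -> dict:
    """Attach context to a raw alert payload (shape is illustrative)."""
    service = alert.get("service", "unknown")
    enriched = dict(alert)  # never mutate the original payload
    enriched["dashboard"] = f"https://grafana.example/d/{service}"
    enriched["runbook"] = f"https://wiki.example/runbooks/{service}"
    enriched["customer_tier"] = (
        "enterprise" if alert.get("enterprise_traffic") else "free"
    )
    return enriched

print(enrich({"service": "checkout", "enterprise_traffic": True}))
```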
### Noise Reduction Features
- Alert grouping — Correlate related alerts into a single incident (e.g., all pods in the same deployment).
- Transient alert suppression — Suppress alerts that auto-resolve within a short window.
- Maintenance windows — Silence alerts during planned maintenance.
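Alert grouping can be as simple as keying alerts by the deployment they belong to; a minimal sketch:

```python
from collections import defaultdict

def group_alerts(alerts: list) -> dict:
    """Correlate alerts from the same deployment into one incident."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[(alert["namespace"], alert["deployment"])].append(alert)
    return incidents

pods = [
    {"namespace": "prod", "deployment": "checkout", "pod": "checkout-1"},
    {"namespace": "prod", "deployment": "checkout", "pod": "checkout-2"},
]
print(len(group_alerts(pods)))  # 1 incident instead of 2 pages
```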
## Building a Healthy On-Call Culture
### Sustainable Rotations
- Minimum two people per rotation to allow for illness and time off.
- Limit rotation length — One week is common; longer stretches increase burnout.
- Follow-the-sun — Distribute on-call across time zones so nobody is paged at 3 AM regularly.
- Compensate on-call — extra pay, comp time, or a reduced workload the following week.
### On-Call Hygiene
- Weekly alert review — Triage every alert that fired. Was it actionable? Should it be tuned, demoted, or deleted?
- Toil budgets — Track the percentage of on-call time spent on repetitive manual work. Invest engineering time to automate the top offenders.
- Blameless post-incidents — Focus on systemic improvements, not individual mistakes.
### Metrics to Track
| Metric | Healthy target |
|---|---|
| Pages per on-call shift | Fewer than 2 per day |
| MTTA (mean time to acknowledge) | Under 5 minutes |
| MTTR (mean time to resolve) | Under 1 hour for P1 |
| False positive rate | Under 5 % of total alerts |
| Alert-to-incident ratio | Close to 1:1 after grouping |
## Putting It All Together
A mature monitoring and alerting pipeline looks like this:
1. Instrument — Emit metrics, logs, and traces from every service using OpenTelemetry or equivalent.
2. Store — Ship telemetry to a durable backend (Prometheus, Datadog, Grafana Cloud).
3. Define SLOs — Agree on SLIs and targets with product and engineering stakeholders.
4. Alert on burn rate — Use multi-window burn-rate rules instead of static thresholds.
5. Route intelligently — PagerDuty or OpsGenie with enrichment, grouping, and escalation.
6. Automate response — Start with runbooks, graduate to semi- and fully-automated remediation.
7. Review continuously — Weekly alert triage, monthly SLO review, quarterly on-call retrospective.
The goal is not zero alerts — it is that every alert matters. When your on-call engineer's phone buzzes, they should trust that it represents a real problem affecting real users, and they should have everything they need to fix it fast.
This is article #287 on Codelit.io — leveling up your engineering knowledge, one deep dive at a time. Explore more at codelit.io.