Incident Management Architecture: From Alert to Post-Mortem
Every production system will eventually fail. Incident management is not about preventing all failures — it is about detecting them fast, resolving them faster, and learning from every one.
The Incident Lifecycle#
Every incident follows the same fundamental stages, regardless of severity:
1. Detection#
Something is wrong. This can come from:
- Automated monitoring — alerts fire when metrics cross thresholds
- SLO-based detection — error budgets are burning too fast
- Customer reports — users notice before your systems do (this is bad)
- Internal reports — an engineer notices something unexpected
The goal: detect automatically before customers notice. If customers are your primary detection mechanism, your monitoring has gaps.
2. Triage#
Determine the severity and route to the right people. Key questions:
- How many users are affected?
- Is the impact growing or stable?
- Is there a workaround?
- Which system is involved?
Triage should take minutes, not hours. Pre-defined severity levels make this faster.
3. Response#
The right people are engaged and working the problem. This stage includes:
- Assembling the response team
- Opening a communication channel (war room)
- Diagnosing root cause
- Implementing a fix or mitigation
- Communicating status to stakeholders
4. Resolution#
The immediate impact is resolved. Users are no longer affected. But the incident is not over — you still need to:
- Verify the fix is holding
- Clean up any temporary mitigations
- Confirm monitoring is back to healthy state
5. Post-mortem#
Analyze what happened, why, and how to prevent recurrence. This is where the real value of incident management lives.
Severity Levels#
A shared severity framework eliminates ambiguity during triage. Here is a common model:
SEV-1 (Critical)#
- Complete service outage or data loss
- All or most users affected
- Revenue impact is immediate and significant
- Response: all hands, executive communication, war room immediately
SEV-2 (Major)#
- Significant degradation of core functionality
- Large subset of users affected
- Workarounds may exist but are not acceptable long-term
- Response: on-call team plus relevant experts, status page update
SEV-3 (Minor)#
- Partial degradation of non-critical functionality
- Small subset of users affected
- Workarounds are available and acceptable
- Response: on-call team handles during business hours
SEV-4 (Low)#
- Cosmetic issues or minor inconveniences
- Minimal user impact
- Response: tracked as a bug, fixed in normal sprint work
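The severity framework above can be encoded directly so triage tooling routes incidents consistently. A minimal sketch (all names and the exact policy fields are hypothetical, not from any specific tool):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResponsePolicy:
    page_now: bool            # page on-call immediately, any hour
    open_war_room: bool       # spin up a dedicated incident channel
    update_status_page: bool  # external customer communication

# One policy per severity level, mirroring the framework above.
POLICIES = {
    "SEV-1": ResponsePolicy(page_now=True, open_war_room=True, update_status_page=True),
    "SEV-2": ResponsePolicy(page_now=True, open_war_room=False, update_status_page=True),
    "SEV-3": ResponsePolicy(page_now=False, open_war_room=False, update_status_page=False),
    "SEV-4": ResponsePolicy(page_now=False, open_war_room=False, update_status_page=False),
}

def response_for(severity: str) -> ResponsePolicy:
    """Look up the pre-agreed response for a severity level."""
    return POLICIES[severity]
```

Encoding the policy as data rather than tribal knowledge is the point: during triage nobody debates whether a SEV-2 gets a status page update, they just apply the table.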
Calibrating severity#
The most common mistake is under-classifying severity. When in doubt, escalate. It is better to stand down a SEV-1 response than to let a real SEV-1 fester as a SEV-3.
Review severity definitions quarterly. Adjust thresholds as your user base and system complexity grow.
On-Call Rotation#
On-call is the foundation of incident response. Someone must always be available to respond.
Designing a healthy rotation#
- Rotation length — 1 week is most common. Shorter rotations (3-4 days) reduce burnout
- Primary and secondary — primary responder handles the page, secondary is backup
- Follow-the-sun — distribute on-call across time zones so nobody works nights
- Compensation — on-call time should be compensated (time off, pay, or both)
- Escalation paths — if primary does not acknowledge within 5 minutes, page secondary. If secondary does not acknowledge, page the engineering manager
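The escalation path above is simple enough to express as a table: each entry says who gets paged once the incident has gone unacknowledged for a given number of minutes. A sketch (the 10-minute threshold for the engineering manager is an assumption; the article only specifies 5 minutes for the secondary):

```python
# (minutes without acknowledgement, responder to page)
ESCALATION_PATH = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (10, "engineering manager"),  # assumed threshold, not from the article
]

def who_to_page(minutes_unacknowledged: int) -> str:
    """Return the most-escalated responder for the elapsed time."""
    target = ESCALATION_PATH[0][1]
    for after_minutes, responder in ESCALATION_PATH:
        if minutes_unacknowledged >= after_minutes:
            target = responder
    return target
```

Real schedulers like PagerDuty implement exactly this kind of timed escalation policy; the sketch just makes the timing explicit.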
On-call hygiene#
- Alert quality matters — every false alarm erodes trust and response speed
- Target fewer than 2 pages per on-call shift — more than that indicates noisy alerts or systemic issues
- Runbooks for every alert — the person paged should know exactly what to check first
- On-call handoff — outgoing on-call briefs incoming on active issues and recent changes
Avoiding burnout#
On-call burnout is real and dangerous. Burned-out engineers respond slower and make worse decisions. Signs to watch:
- Engineers trading away on-call shifts consistently
- Alert fatigue — acknowledging without investigating
- Increasing time-to-acknowledge trends
- Feedback surveys showing on-call dissatisfaction
Fix the causes: reduce alert noise, grow the rotation, compensate fairly.
War Rooms#
A war room is a dedicated communication space for incident response. It can be a Slack channel, a video call, or a physical room.
War room structure#
- Incident commander (IC) — coordinates the response, makes decisions, delegates tasks
- Technical lead — drives diagnosis and fix implementation
- Communications lead — updates stakeholders, status page, and customers
- Scribe — documents timeline, actions taken, and decisions made
War room rules#
- Stay focused on the incident. Side conversations happen elsewhere
- The IC makes decisions. Consensus is nice but speed matters more during an incident
- Communicate in the channel, not in DMs. Everyone needs the same information
- Update the timeline in real time. You will need it for the post-mortem
When to open a war room#
- All SEV-1 incidents: immediately
- SEV-2 incidents: if not resolved within 30 minutes
- Any incident involving multiple teams
- Any incident where the root cause is unclear after initial triage
Runbooks#
A runbook is a step-by-step guide for diagnosing and resolving a specific type of incident. Good runbooks are the difference between a 5-minute resolution and a 2-hour investigation.
What a runbook should contain#
- Alert context — what triggered this runbook and what it means
- Diagnostic steps — specific commands, queries, or dashboards to check
- Common causes — the top 3-5 reasons this alert fires, ranked by frequency
- Remediation steps — how to fix each common cause
- Escalation criteria — when to involve additional people or escalate severity
- Rollback procedures — how to undo recent changes if they caused the issue
Runbook best practices#
- Keep them current — a stale runbook is worse than no runbook because it wastes time and creates false confidence
- Link from alerts — every PagerDuty or OpsGenie alert should include a direct link to the relevant runbook
- Test them — during game days, have someone follow the runbook literally. If they get stuck, the runbook needs improvement
- Version control them — runbooks live in Git alongside the code they support
Example runbook structure#
```
Alert: API Latency P99 exceeds 2 seconds
Service: orders-api
Dashboard: https://grafana.internal/d/orders-api-latency

Step 1: Check if a deployment happened in the last 30 minutes
- Command: kubectl rollout history deployment/orders-api
- If yes: consider rollback (Step 6)

Step 2: Check database connection pool
- Dashboard: https://grafana.internal/d/orders-db-connections
- Healthy: connections below 80% of pool size
- If saturated: restart pods to reset connections (Step 7)

Step 3: Check downstream dependency health
- Dashboard: https://grafana.internal/d/dependency-health
- If payment-service is degraded: this is expected, engage payments team

...

Step 6: Rollback deployment
- Command: kubectl rollout undo deployment/orders-api
- Verify: watch latency dashboard for 5 minutes

Escalation: If none of the above resolves the issue within 30 minutes,
page the orders-api tech lead.
```
Blameless Post-Mortems#
The post-mortem is the most valuable part of incident management. Done well, it transforms failures into organizational learning. Done poorly, it becomes a blame exercise that discourages transparency.
Blameless does not mean accountability-free#
Blameless means:
- We assume people made the best decisions they could with the information available
- We focus on systemic causes, not individual mistakes
- We ask "what allowed this to happen" not "who caused this"
It does not mean:
- Nobody is responsible for follow-up actions
- We ignore patterns of recurring issues
- We skip the post-mortem because "nobody is to blame"
Post-mortem structure#
- Summary — one paragraph describing what happened and the impact
- Timeline — minute-by-minute account of detection, response, and resolution
- Root cause analysis — what actually caused the incident (use the "5 whys" technique)
- Contributing factors — what made detection or resolution slower
- What went well — what worked during the response (this is important for morale)
- Action items — specific, assigned, time-bound improvements
- Lessons learned — broader takeaways for the organization
The 5 Whys#
- Why did the service go down? — A bad configuration was deployed
- Why was a bad configuration deployed? — The config change was not tested
- Why was it not tested? — There is no staging environment for config changes
- Why is there no staging environment? — Config changes bypass the normal deployment pipeline
- Why do they bypass the pipeline? — The pipeline does not support config-only changes
Root cause: the deployment pipeline does not treat configuration changes as code.
Action item quality#
Bad action item: "Be more careful with config changes." Good action item: "Add config validation to the CI pipeline by April 15. Owner: Sarah."
Every action item must be:
- Specific — exactly what needs to happen
- Assigned — one person owns it
- Time-bound — a deadline exists
- Tracked — in the issue tracker, not just in the post-mortem document
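The four properties above can be checked mechanically. A sketch of a structured action item with a well-formedness check (the field names and `is_well_formed` helper are illustrative, not from any tracker's API):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str  # specific: exactly what needs to happen
    owner: str        # assigned: one person owns it
    due: date         # time-bound: a deadline exists
    tracker_id: str   # tracked: lives in the issue tracker

def is_well_formed(item: ActionItem) -> bool:
    """True only if all four properties are non-empty."""
    return all([
        item.description.strip(),
        item.owner.strip(),
        item.due is not None,
        item.tracker_id.strip(),
    ])
```

Under this check, "Be more careful with config changes" with no owner or ticket fails, while the Sarah example passes.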
SLO-Based Incident Detection#
Traditional threshold-based alerting (CPU > 90%, latency > 500ms) produces noise. SLO-based detection aligns alerts with what actually matters: user experience.
How it works#
- Define SLOs for your service (99.9% of requests complete in under 500ms)
- Calculate your error budget (0.1% of requests can fail per month)
- Monitor the burn rate — how fast you are consuming your error budget
- Alert when the burn rate exceeds a threshold
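The arithmetic behind these steps is straightforward. A sketch, using the 99.9% SLO from the example above (the request volume is an assumed illustration):

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests allowed to fail over the SLO period."""
    return int(total_requests * (1 - slo))

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors occur.

    1.0 means the budget is consumed exactly over the SLO period;
    anything higher exhausts it early.
    """
    budget_rate = 1 - slo
    return observed_error_rate / budget_rate
```

For example, with a 99.9% SLO and 10 million monthly requests, 10,000 requests may fail; an observed error rate of 1.44% corresponds to a burn rate of about 14.4x.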
Burn rate alerting#
A burn rate of 1x means you will exactly exhaust your error budget by the end of the SLO period (for example, a 30-day month). Common alert thresholds:
- 14.4x burn rate over 5 minutes — something is very wrong right now (page immediately)
- 6x burn rate over 30 minutes — significant degradation (page on-call)
- 3x burn rate over 6 hours — slow burn, will exhaust budget if unchecked (create a ticket)
- 1x burn rate over 3 days — trending toward budget exhaustion (review in standup)
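The multi-window thresholds above can be evaluated as a simple ordered table: check the most urgent condition first and return the first action whose threshold is exceeded. A sketch (the dictionary shape for measured burn rates is an assumption):

```python
# (burn-rate threshold, lookback window, action), most urgent first.
THRESHOLDS = [
    (14.4, "5m", "page immediately"),
    (6.0, "30m", "page on-call"),
    (3.0, "6h", "create ticket"),
    (1.0, "3d", "review in standup"),
]

def evaluate(burn_rates: dict) -> str:
    """Return the most urgent triggered action, or "" if none.

    burn_rates maps a lookback window to the burn rate measured
    over that window, e.g. {"5m": 2.0, "30m": 1.1}.
    """
    for threshold, window, action in THRESHOLDS:
        if burn_rates.get(window, 0.0) >= threshold:
            return action
    return ""
```

Production implementations (e.g. as described in Google's SRE guidance) usually pair each fast window with a shorter confirmation window to suppress brief spikes; the sketch omits that refinement.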
Why SLO-based detection is better#
- Fewer false positives — a CPU spike that does not affect users does not page anyone
- Severity is built in — burn rate directly maps to urgency
- Business-aligned — SLOs reflect what users actually care about
- Budget-aware — you know exactly how much room you have left
Incident Management Tools#
PagerDuty#
The most established incident management platform. Strengths:
- Robust on-call scheduling and escalation policies
- Integrations with every monitoring tool
- Event intelligence for alert grouping and noise reduction
- Incident workflows and automation
incident.io#
A newer platform focused on incident response workflows within Slack. Strengths:
- Native Slack integration — declare and manage incidents without leaving chat
- Automated post-mortem generation from Slack timeline
- Custom workflows triggered by incident fields
- Catalog of services and teams for fast routing
FireHydrant#
Full-lifecycle incident management. Strengths:
- Runbook automation — execute remediation steps from the incident timeline
- Change tracking — correlate incidents with recent deployments
- Retrospective templates and follow-up tracking
- Service catalog with dependency mapping
Choosing a tool#
- If you need robust on-call management first: PagerDuty
- If your team lives in Slack and wants minimal friction: incident.io
- If you want end-to-end lifecycle management: FireHydrant
- If budget is tight: Grafana OnCall (open source) plus a post-mortem template
Building an Incident Management Culture#
Tools and processes matter, but culture determines whether they work.
- Practice regularly — run game days and chaos engineering exercises
- Celebrate good incident response — not just incident-free periods
- Share post-mortems widely — the whole engineering org should learn from every incident
- Reward transparency — engineers who surface problems early should be praised, not punished
- Review metrics quarterly — MTTD, MTTR, incident count by severity, post-mortem completion rate
The organizations that handle incidents best are the ones that treat every incident as a learning opportunity, not a failure.