Incident Management Architecture: From Alert to Post-Mortem
Every production system will eventually fail. Incident management is not about preventing all failures — it is about detecting them fast, resolving them faster, and learning from every one.
The Incident Lifecycle#
Every incident follows the same fundamental stages, regardless of severity:
1. Detection#
Something is wrong. This can come from:
- Automated monitoring — alerts fire when metrics cross thresholds
- SLO-based detection — error budgets are burning too fast
- Customer reports — users notice before your systems do (this is bad)
- Internal reports — an engineer notices something unexpected
The goal: detect automatically before customers notice. If customers are your primary detection mechanism, your monitoring has gaps.
2. Triage#
Determine the severity and route to the right people. Key questions:
- How many users are affected?
- Is the impact growing or stable?
- Is there a workaround?
- Which system is involved?
Triage should take minutes, not hours. Pre-defined severity levels make this faster.
3. Response#
The right people are engaged and working the problem. This stage includes:
- Assembling the response team
- Opening a communication channel (war room)
- Diagnosing root cause
- Implementing a fix or mitigation
- Communicating status to stakeholders
4. Resolution#
The immediate impact is resolved. Users are no longer affected. But the incident is not over — you still need to:
- Verify the fix is holding
- Clean up any temporary mitigations
- Confirm monitoring is back to healthy state
5. Post-mortem#
Analyze what happened, why, and how to prevent recurrence. This is where the real value of incident management lives.
Severity Levels#
A shared severity framework eliminates ambiguity during triage. Here is a common model:
SEV-1 (Critical)#
- Complete service outage or data loss
- All or most users affected
- Revenue impact is immediate and significant
- Response: all hands, executive communication, war room immediately
SEV-2 (Major)#
- Significant degradation of core functionality
- Large subset of users affected
- Workarounds may exist but are not acceptable long-term
- Response: on-call team plus relevant experts, status page update
SEV-3 (Minor)#
- Partial degradation of non-critical functionality
- Small subset of users affected
- Workarounds are available and acceptable
- Response: on-call team handles during business hours
SEV-4 (Low)#
- Cosmetic issues or minor inconveniences
- Minimal user impact
- Response: tracked as a bug, fixed in normal sprint work
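The severity framework above can be encoded directly so triage tooling routes incidents consistently. A minimal sketch (all names and the exact policy fields are hypothetical, not from any specific tool):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResponsePolicy:
    page_now: bool            # page on-call immediately, any hour
    open_war_room: bool       # spin up a dedicated incident channel
    update_status_page: bool  # external customer communication

# One policy per severity level, mirroring the framework above.
POLICIES = {
    "SEV-1": ResponsePolicy(page_now=True, open_war_room=True, update_status_page=True),
    "SEV-2": ResponsePolicy(page_now=True, open_war_room=False, update_status_page=True),
    "SEV-3": ResponsePolicy(page_now=False, open_war_room=False, update_status_page=False),
    "SEV-4": ResponsePolicy(page_now=False, open_war_room=False, update_status_page=False),
}

def response_for(severity: str) -> ResponsePolicy:
    """Look up the pre-agreed response for a severity level."""
    return POLICIES[severity]
```

Encoding the policy as data rather than tribal knowledge is the point: during triage nobody debates whether a SEV-2 gets a status page update, they just apply the table.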
Calibrating severity#
The most common mistake is under-classifying severity. When in doubt, escalate. It is better to stand down a SEV-1 response than to let a real SEV-1 fester as a SEV-3.
Review severity definitions quarterly. Adjust thresholds as your user base and system complexity grow.
On-Call Rotation#
On-call is the foundation of incident response. Someone must always be available to respond.
Designing a healthy rotation#
- Rotation length — 1 week is most common. Shorter rotations (3-4 days) reduce burnout
- Primary and secondary — primary responder handles the page, secondary is backup
- Follow-the-sun — distribute on-call across time zones so nobody works nights
- Compensation — on-call time should be compensated (time off, pay, or both)
- Escalation paths — if primary does not acknowledge within 5 minutes, page secondary. If secondary does not acknowledge, page the engineering manager
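The escalation path above is simple enough to express as a table: each entry says who gets paged once the incident has gone unacknowledged for a given number of minutes. A sketch (the 10-minute threshold for the engineering manager is an assumption; the article only specifies 5 minutes for the secondary):

```python
# (minutes without acknowledgement, responder to page)
ESCALATION_PATH = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (10, "engineering manager"),  # assumed threshold, not from the article
]

def who_to_page(minutes_unacknowledged: int) -> str:
    """Return the most-escalated responder for the elapsed time."""
    target = ESCALATION_PATH[0][1]
    for after_minutes, responder in ESCALATION_PATH:
        if minutes_unacknowledged >= after_minutes:
            target = responder
    return target
```

Real schedulers like PagerDuty implement exactly this kind of timed escalation policy; the sketch just makes the timing explicit.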
On-call hygiene#
- Alert quality matters — every false alarm erodes trust and response speed
- Target fewer than 2 pages per on-call shift — more than that indicates noisy alerts or systemic issues
- Runbooks for every alert — the person paged should know exactly what to check first
- On-call handoff — outgoing on-call briefs incoming on active issues and recent changes
Avoiding burnout#
On-call burnout is real and dangerous. Burned-out engineers respond slower and make worse decisions. Signs to watch:
- Engineers trading away on-call shifts consistently
- Alert fatigue — acknowledging without investigating
- Increasing time-to-acknowledge trends
- Feedback surveys showing on-call dissatisfaction
Fix the causes: reduce alert noise, grow the rotation, compensate fairly.
War Rooms#
A war room is a dedicated communication space for incident response. It can be a Slack channel, a video call, or a physical room.
War room structure#
- Incident commander (IC) — coordinates the response, makes decisions, delegates tasks
- Technical lead — drives diagnosis and fix implementation
- Communications lead — updates stakeholders, status page, and customers
- Scribe — documents timeline, actions taken, and decisions made
War room rules#
- Stay focused on the incident. Side conversations happen elsewhere
- The IC makes decisions. Consensus is nice but speed matters more during an incident
- Communicate in the channel, not in DMs. Everyone needs the same information
- Update the timeline in real time. You will need it for the post-mortem
When to open a war room#
- All SEV-1 incidents: immediately
- SEV-2 incidents: if not resolved within 30 minutes
- Any incident involving multiple teams
- Any incident where the root cause is unclear after initial triage
Runbooks#
A runbook is a step-by-step guide for diagnosing and resolving a specific type of incident. Good runbooks are the difference between a 5-minute resolution and a 2-hour investigation.
What a runbook should contain#
- Alert context — what triggered this runbook and what it means
- Diagnostic steps — specific commands, queries, or dashboards to check
- Common causes — the top 3-5 reasons this alert fires, ranked by frequency
- Remediation steps — how to fix each common cause
- Escalation criteria — when to involve additional people or escalate severity
- Rollback procedures — how to undo recent changes if they caused the issue
Runbook best practices#
- Keep them current — a stale runbook is worse than no runbook because it wastes time and creates false confidence
- Link from alerts — every PagerDuty or OpsGenie alert should include a direct link to the relevant runbook
- Test them — during game days, have someone follow the runbook literally. If they get stuck, the runbook needs improvement
- Version control them — runbooks live in Git alongside the code they support
Example runbook structure#
```
Alert: API Latency P99 exceeds 2 seconds
Service: orders-api
Dashboard: https://grafana.internal/d/orders-api-latency

Step 1: Check if a deployment happened in the last 30 minutes
- Command: kubectl rollout history deployment/orders-api
- If yes: consider rollback (Step 6)

Step 2: Check database connection pool
- Dashboard: https://grafana.internal/d/orders-db-connections
- Healthy: connections below 80% of pool size
- If saturated: restart pods to reset connections (Step 7)

Step 3: Check downstream dependency health
- Dashboard: https://grafana.internal/d/dependency-health
- If payment-service is degraded: this is expected, engage payments team

...

Step 6: Rollback deployment
- Command: kubectl rollout undo deployment/orders-api
- Verify: watch latency dashboard for 5 minutes

Escalation: If none of the above resolves the issue within 30 minutes,
page the orders-api tech lead.
```
Blameless Post-Mortems#
The post-mortem is the most valuable part of incident management. Done well, it transforms failures into organizational learning. Done poorly, it becomes a blame exercise that discourages transparency.
Blameless does not mean accountability-free#
Blameless means:
- We assume people made the best decisions they could with the information available
- We focus on systemic causes, not individual mistakes
- We ask "what allowed this to happen" not "who caused this"
It does not mean:
- Nobody is responsible for follow-up actions
- We ignore patterns of recurring issues
- We skip the post-mortem because "nobody is to blame"
Post-mortem structure#
- Summary — one paragraph describing what happened and the impact
- Timeline — minute-by-minute account of detection, response, and resolution
- Root cause analysis — what actually caused the incident (use the "5 whys" technique)
- Contributing factors — what made detection or resolution slower
- What went well — what worked during the response (this is important for morale)
- Action items — specific, assigned, time-bound improvements
- Lessons learned — broader takeaways for the organization
The 5 Whys#
- Why did the service go down? — A bad configuration was deployed
- Why was a bad configuration deployed? — The config change was not tested
- Why was it not tested? — There is no staging environment for config changes
- Why is there no staging environment? — Config changes bypass the normal deployment pipeline
- Why do they bypass the pipeline? — The pipeline does not support config-only changes
Root cause: the deployment pipeline does not treat configuration changes as code.
Action item quality#
Bad action item: "Be more careful with config changes." Good action item: "Add config validation to the CI pipeline by April 15. Owner: Sarah."
Every action item must be:
- Specific — exactly what needs to happen
- Assigned — one person owns it
- Time-bound — a deadline exists
- Tracked — in the issue tracker, not just in the post-mortem document
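The four properties above can be checked mechanically. A sketch of a structured action item with a well-formedness check (the field names and `is_well_formed` helper are illustrative, not from any tracker's API):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str  # specific: exactly what needs to happen
    owner: str        # assigned: one person owns it
    due: date         # time-bound: a deadline exists
    tracker_id: str   # tracked: lives in the issue tracker

def is_well_formed(item: ActionItem) -> bool:
    """True only if all four properties are non-empty."""
    return all([
        item.description.strip(),
        item.owner.strip(),
        item.due is not None,
        item.tracker_id.strip(),
    ])
```

Under this check, "Be more careful with config changes" with no owner or ticket fails, while the Sarah example passes.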
SLO-Based Incident Detection#
Traditional threshold-based alerting (CPU > 90%, latency > 500ms) produces noise. SLO-based detection aligns alerts with what actually matters: user experience.
How it works#
- Define SLOs for your service (99.9% of requests complete in under 500ms)
- Calculate your error budget (0.1% of requests can fail per month)
- Monitor the burn rate — how fast you are consuming your error budget
- Alert when the burn rate exceeds a threshold
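The arithmetic behind these steps is straightforward. A sketch, using the 99.9% SLO from the example above (the request volume is an assumed illustration):

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests allowed to fail over the SLO period."""
    return int(total_requests * (1 - slo))

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors occur.

    1.0 means the budget is consumed exactly over the SLO period;
    anything higher exhausts it early.
    """
    budget_rate = 1 - slo
    return observed_error_rate / budget_rate
```

For example, with a 99.9% SLO and 10 million monthly requests, 10,000 requests may fail; an observed error rate of 1.44% corresponds to a burn rate of about 14.4x.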
Burn rate alerting#
A burn rate of 1x means you will exactly exhaust your error budget by the end of the SLO period (for example, a 30-day month). Common alert thresholds:
- 14.4x burn rate over 5 minutes — something is very wrong right now (page immediately)
- 6x burn rate over 30 minutes — significant degradation (page on-call)
- 3x burn rate over 6 hours — slow burn, will exhaust budget if unchecked (create a ticket)
- 1x burn rate over 3 days — trending toward budget exhaustion (review in standup)
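The multi-window thresholds above can be evaluated as a simple ordered table: check the most urgent condition first and return the first action whose threshold is exceeded. A sketch (the dictionary shape for measured burn rates is an assumption):

```python
# (burn-rate threshold, lookback window, action), most urgent first.
THRESHOLDS = [
    (14.4, "5m", "page immediately"),
    (6.0, "30m", "page on-call"),
    (3.0, "6h", "create ticket"),
    (1.0, "3d", "review in standup"),
]

def evaluate(burn_rates: dict) -> str:
    """Return the most urgent triggered action, or "" if none.

    burn_rates maps a lookback window to the burn rate measured
    over that window, e.g. {"5m": 2.0, "30m": 1.1}.
    """
    for threshold, window, action in THRESHOLDS:
        if burn_rates.get(window, 0.0) >= threshold:
            return action
    return ""
```

Production implementations (e.g. as described in Google's SRE guidance) usually pair each fast window with a shorter confirmation window to suppress brief spikes; the sketch omits that refinement.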
Why SLO-based detection is better#
- Fewer false positives — a CPU spike that does not affect users does not page anyone
- Severity is built in — burn rate directly maps to urgency
- Business-aligned — SLOs reflect what users actually care about
- Budget-aware — you know exactly how much room you have left
Incident Management Tools#
PagerDuty#
The most established incident management platform. Strengths:
- Robust on-call scheduling and escalation policies
- Integrations with every monitoring tool
- Event intelligence for alert grouping and noise reduction
- Incident workflows and automation
incident.io#
A newer platform focused on incident response workflows within Slack. Strengths:
- Native Slack integration — declare and manage incidents without leaving chat
- Automated post-mortem generation from Slack timeline
- Custom workflows triggered by incident fields
- Catalog of services and teams for fast routing
FireHydrant#
Full-lifecycle incident management. Strengths:
- Runbook automation — execute remediation steps from the incident timeline
- Change tracking — correlate incidents with recent deployments
- Retrospective templates and follow-up tracking
- Service catalog with dependency mapping
Choosing a tool#
- If you need robust on-call management first: PagerDuty
- If your team lives in Slack and wants minimal friction: incident.io
- If you want end-to-end lifecycle management: FireHydrant
- If budget is tight: Grafana OnCall (open source) plus a post-mortem template
Building an Incident Management Culture#
Tools and processes matter, but culture determines whether they work.
- Practice regularly — run game days and chaos engineering exercises
- Celebrate good incident response — not just incident-free periods
- Share post-mortems widely — the whole engineering org should learn from every incident
- Reward transparency — engineers who surface problems early should be praised, not punished
- Review metrics quarterly — MTTD, MTTR, incident count by severity, post-mortem completion rate
The organizations that handle incidents best are the ones that treat every incident as a learning opportunity, not a failure.