Disaster Recovery Architecture: High Availability, Failover Strategies & Multi-Region Design
Every production system will eventually fail. The question is not if but when — and whether your disaster recovery architecture is ready for it.
This guide covers the full spectrum: from defining RPO and RTO to designing multi-region active-active systems with automated failover.
HA vs DR: They Are Not the Same#
High availability (HA) keeps your system running during routine failures — a crashed process, a dead node, a full disk. It's about uptime within a single region or data center.
Disaster recovery (DR) is what happens when the entire region goes dark. It's about resuming operations after a catastrophic failure.
| Concern | High Availability | Disaster Recovery |
|---|---|---|
| Scope | Component/node failure | Region/site failure |
| Downtime goal | Seconds to minutes | Minutes to hours |
| Cost | Moderate | High |
| Automation | Expected | Often manual |
You need both. HA handles the daily turbulence. DR handles the earthquake.
RPO and RTO: The Two Numbers That Define Your DR Plan#
Recovery Point Objective (RPO) — How much data can you afford to lose? An RPO of 1 hour means you accept losing up to 1 hour of writes.
Recovery Time Objective (RTO) — How long can you be down? An RTO of 15 minutes means the system must be serving traffic again within 15 minutes of failure detection.
Timeline of a disaster:
```
Last backup            Disaster             Recovery complete
     |                     |                        |
     |<------- RPO ------->|<-------- RTO -------->|
     |     (data loss)     |       (downtime)      |
```
These numbers drive every architectural decision. A 0 RPO / 0 RTO system (zero data loss, zero downtime) costs orders of magnitude more than a 24h RPO / 4h RTO system.
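These budgets only matter if you measure real incidents against them. A minimal sketch of that comparison (the timestamps and targets below are illustrative, not from any real incident):

```python
from datetime import datetime, timedelta

def dr_budget_report(last_backup, failure_start, recovery_done,
                     rpo=timedelta(hours=1), rto=timedelta(minutes=15)):
    """Compare an incident's actual data-loss and downtime windows
    against the RPO/RTO targets."""
    data_loss = failure_start - last_backup   # writes since the last recovery point
    downtime = recovery_done - failure_start  # time until serving traffic again
    return {
        "data_loss": data_loss, "rpo_met": data_loss <= rpo,
        "downtime": downtime, "rto_met": downtime <= rto,
    }

report = dr_budget_report(
    last_backup=datetime(2024, 1, 1, 3, 0),    # hypothetical timestamps
    failure_start=datetime(2024, 1, 1, 3, 40),
    recovery_done=datetime(2024, 1, 1, 3, 50),
)
```

Running this after every drill (not just real disasters) turns RPO/RTO from aspirations into tracked metrics.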
DR Tiers: Cold, Warm, and Hot Standby#
Cold Standby#
Infrastructure is provisioned but powered off. On failure, you boot machines, restore from backups, and deploy. Cheapest option. RTO is measured in hours.
Warm Standby#
A scaled-down replica runs continuously. Data is replicated asynchronously. On failure, you scale up and redirect traffic. RTO drops to minutes.
Hot Standby#
A full-scale replica runs in parallel, receiving synchronous or near-synchronous replication. Failover is near-instant. This is the foundation for active-passive and active-active architectures.
```
Cold: [ Backups on S3 ] ── boot ── restore ── serve
Warm: [ Small replica ] ── scale up ── serve
Hot:  [ Full replica  ] ── flip traffic ── serve
```
Multi-Region Architecture#
Active-Passive#
One region handles all traffic. The passive region receives replicated data and stands ready. On failure, DNS or load balancer shifts traffic to the passive region.
Pros: simpler data consistency; lower cost than active-active.
Cons: the passive region is "wasted" capacity during normal operation, and failover is not instant.
Active-Active#
Both regions serve traffic simultaneously. Data is replicated bidirectionally. On failure, the surviving region absorbs the full load.
```
    ┌──────────────┐
    │  Global LB   │
    │ (Route 53 /  │
    │ Traffic Mgr) │
    └──┬───────┬───┘
       │       │
┌──────▼───┐ ┌─▼────────┐
│ Region A │ │ Region B │
│ (active) │ │ (active) │
│ App + DB │ │ App + DB │
└────┬─────┘ └────┬─────┘
     │            │
     └─── sync ───┘
 (bi-directional replication)
```
Pros: no wasted capacity; lower latency for geographically distributed users; near-zero RTO.
Cons: conflict resolution is hard, and split-brain scenarios require careful handling (CRDTs, last-writer-wins, or application-level merging).
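To make the conflict-resolution problem concrete, here is a sketch of last-writer-wins merging for bidirectionally replicated key-value records (the data and timestamps are invented; real systems would use hybrid logical clocks or version vectors rather than bare integers):

```python
def lww_merge(local, remote):
    """Last-writer-wins merge of replicated key-value stores, where each
    record is (value, timestamp). Ties break deterministically on the
    value so both regions converge to the same state."""
    merged = dict(local)
    for key, (value, ts) in remote.items():
        if key not in merged or (ts, value) > (merged[key][1], merged[key][0]):
            merged[key] = (value, ts)
    return merged

region_a = {"user:1": ("alice@old.example", 100)}
region_b = {"user:1": ("alice@new.example", 120), "user:2": ("bob", 90)}
merged = lww_merge(region_a, region_b)
```

The important property is symmetry: merging A into B and B into A must yield the same result, otherwise the regions never converge. LWW buys that simplicity by silently discarding the losing write, which is exactly the trade-off the article warns about.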
Failover Strategies#
DNS-Based Failover#
Tools like AWS Route 53 or Azure Traffic Manager use health checks to route traffic away from unhealthy endpoints. TTL settings control how fast clients pick up the change.
- Simple to implement
- Limited by DNS TTL propagation (30s–300s typical)
- Clients may cache stale records
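As a sketch of what DNS failover configuration looks like, the function below builds the primary/secondary record-change payload in the shape Route 53's ChangeResourceRecordSets API expects (the hostname, IPs, and health-check ID are hypothetical; in practice you would pass `change_batch` to boto3's `change_resource_record_sets`):

```python
def failover_record(name, ip, role, health_check_id=None, ttl=60):
    """Build one Route 53 failover record change. A low TTL (60s here)
    bounds how long clients can keep serving the stale endpoint."""
    record = {
        "Name": name, "Type": "A", "TTL": ttl,
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,                       # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:                         # primary must carry a health check
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

primary = failover_record("api.example.com", "198.51.100.10", "PRIMARY",
                          health_check_id="hc-primary")   # hypothetical ID
secondary = failover_record("api.example.com", "203.0.113.20", "SECONDARY")
change_batch = {"Changes": [primary, secondary]}
```

Route 53 serves the secondary record automatically once the primary's health check fails; no human action is required, which is the whole point.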
Load Balancer Failover#
A global load balancer (AWS Global Accelerator, Cloudflare LB) detects backend failures and reroutes in seconds, bypassing DNS caching entirely.
- Faster than DNS failover
- Requires anycast or global LB infrastructure
- Better for latency-sensitive workloads
Database Failover#
The hardest piece. Options include:
| Strategy | RPO | RTO | Complexity |
|---|---|---|---|
| Async replication + promotion | Minutes | Minutes | Low |
| Sync replication + auto-failover | ~0 | Seconds | Medium |
| Multi-master (active-active) | 0 | 0 | High |
PostgreSQL supports streaming replication with automatic failover via Patroni or pg_auto_failover. Cloud-managed options like Aurora Global Database or Cloud Spanner abstract this away.
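With async replication, the replica's lag at promotion time is exactly the data you lose, so failover tooling should compare lag against the RPO before promoting. A sketch of that decision (the 300-second default RPO is an illustrative assumption, not a recommendation):

```python
def promotion_decision(replica_lag_seconds, rpo_seconds=300,
                       primary_healthy=False):
    """Decide whether an async replica is safe to promote. Promoting a
    replica that lags beyond the RPO budget means accepting more data
    loss than the plan allows, so that path requires an explicit override."""
    if primary_healthy:
        return "no_failover_needed"
    if replica_lag_seconds <= rpo_seconds:
        return "promote"                 # data loss within budget
    return "promote_requires_override"   # would exceed the RPO

decision = promotion_decision(replica_lag_seconds=12)
```

Tools like Patroni make this kind of check configurable (e.g. a maximum-lag threshold for failover candidates) rather than leaving it to on-call judgment.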
Backup and Recovery Strategies#
Snapshots#
Point-in-time snapshots of volumes or databases. Fast to create, but your RPO equals the snapshot interval: fail just before the next snapshot and you lose everything written since the last one.
WAL Shipping (Write-Ahead Log)#
Continuously ship transaction logs to a standby or object storage. Enables Point-in-Time Recovery (PITR) — restore to any second within the retention window.
```
Primary DB ──── WAL stream ────►  S3 / GCS
                                      │
                                ┌─────▼─────┐
                                │  PITR to  │
                                │ any point │
                                └───────────┘
```
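PITR works by restoring the newest base backup taken at or before the target time, then replaying WAL forward to the exact target. A sketch of that restore-planning logic (timestamps are invented for illustration):

```python
from datetime import datetime

def pitr_plan(target, base_backups, last_wal_time):
    """Pick a point-in-time recovery plan: the newest base backup at or
    before the target, plus WAL replay up to the target. Returns None
    if the target falls outside the recoverable window."""
    candidates = [b for b in base_backups if b <= target]
    if not candidates or target > last_wal_time:
        return None  # before the first backup, or past the newest WAL record
    return {"base_backup": max(candidates), "replay_until": target}

backups = [datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 2, 0, 0)]
plan = pitr_plan(datetime(2024, 1, 2, 14, 30), backups,
                 last_wal_time=datetime(2024, 1, 2, 15, 0))
```

Note the two failure modes: a target older than your backup retention, and a target newer than the last shipped WAL segment. The second one is your effective RPO under WAL shipping.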
The 3-2-1 Rule#
- 3 copies of your data
- 2 different storage media
- 1 offsite (different region or provider)
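The rule is mechanical enough to enforce in code. A sketch of a 3-2-1 compliance check over a backup inventory (the inventory format is an assumption for illustration):

```python
def satisfies_3_2_1(copies):
    """Check a backup inventory against the 3-2-1 rule: at least 3
    copies, on at least 2 distinct media, with at least 1 offsite."""
    media = {c["medium"] for c in copies}
    offsite = [c for c in copies if c["offsite"]]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1

inventory = [
    {"medium": "nvme", "offsite": False},  # live data on local disk
    {"medium": "s3",   "offsite": False},  # same-region object storage
    {"medium": "s3",   "offsite": True},   # cross-region replica
]
compliant = satisfies_3_2_1(inventory)
```

Wiring a check like this into CI or a nightly job catches the common drift case: someone deletes the cross-region replication rule and nobody notices until restore day.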
Backup Testing#
A backup you have never restored is not a backup. Schedule regular restore drills. Automate them. Measure actual RTO against your target.
Chaos Engineering for DR Testing#
You cannot trust a DR plan that has never been exercised. Chaos engineering introduces controlled failures to validate resilience.
Practices:
- Game days — Scheduled DR simulations where you fail over an entire region and measure RTO/RPO.
- Fault injection — Kill processes, drop packets, corrupt disks using tools like Chaos Monkey, Litmus, or Gremlin.
- Steady-state hypothesis — Define what "normal" looks like, inject failure, and verify the system returns to steady state.
Chaos engineering loop:
```
Define steady state
        │
        ▼
Hypothesize (system survives X failure)
        │
        ▼
Inject failure
        │
        ▼
Observe ── meets hypothesis? ── Yes ──► confidence++
        │
        No
        │
        ▼
Fix weakness, repeat
```
Start small. Kill a single pod. Then escalate: kill a node, an AZ, a region. Each level builds confidence.
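The loop above fits in a few lines of harness code. This sketch uses a fake error-rate metric as the steady-state signal; real experiments would observe production telemetry and inject failures via a tool like Gremlin or Litmus:

```python
def chaos_experiment(steady_state, inject, observe):
    """Minimal chaos-experiment loop: verify the steady-state hypothesis,
    inject a failure, then check whether the system still satisfies it."""
    if not steady_state(observe()):
        return "aborted"  # never experiment against an unhealthy baseline
    inject()
    return "hypothesis_held" if steady_state(observe()) else "weakness_found"

# Simulated system: killing a pod nudges the error rate but HA absorbs it.
state = {"error_rate": 0.001}
def observe(): return dict(state)
def steady(metrics): return metrics["error_rate"] < 0.01
def kill_pod(): state["error_rate"] = 0.002

result = chaos_experiment(steady, kill_pod, observe)
```

The "aborted" branch matters in practice: injecting failures into an already-degraded system tells you nothing and can turn an incident into an outage.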
Runbook Automation#
Manual runbooks fail under pressure. People skip steps, misread instructions, and panic. Automate your DR runbooks:
- Infrastructure as Code — Terraform or Pulumi to provision standby environments
- Automated failover scripts — Triggered by monitoring alerts, not human judgment
- Orchestration tools — AWS Systems Manager, Rundeck, or PagerDuty Runbook Automation
- Validation checks — Automated smoke tests that confirm the failover target is healthy before shifting traffic
The gold standard: a single command (or zero commands) that fails over your entire stack and verifies it is serving correctly.
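A sketch of what that single command looks like under the hood: an ordered runbook that refuses to shift traffic until the standby passes its smoke test (the step names and health flag are illustrative):

```python
def run_failover(steps, smoke_test):
    """Execute an ordered DR runbook as (name, fn) pairs, gating the
    traffic shift on a standby smoke test. The returned log doubles as
    the incident timeline."""
    log = []
    for name, step in steps:
        if name == "shift_traffic" and not smoke_test():
            log.append("abort: standby failed smoke test")
            return log
        step()
        log.append(f"ok: {name}")
    return log

standby = {"healthy": True}
steps = [
    ("promote_standby", lambda: None),  # stand-ins for real orchestration calls
    ("shift_traffic",   lambda: None),
]
log = run_failover(steps, smoke_test=lambda: standby["healthy"])
```

The validation gate is the part teams skip and regret: shifting traffic to a standby that cannot serve turns a regional outage into a global one.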
Tools Quick Reference#
| Tool | Purpose |
|---|---|
| AWS Route 53 | DNS-based global failover |
| Azure Traffic Manager | DNS traffic routing + health checks |
| AWS Global Accelerator | Anycast-based global load balancing |
| Cloudflare LB | Global load balancing with fast failover |
| Patroni / pg_auto_failover | PostgreSQL HA + automatic failover |
| Aurora Global Database | Managed cross-region DB replication |
| Chaos Monkey / Gremlin / Litmus | Fault injection and chaos testing |
| Terraform / Pulumi | Infrastructure as Code for DR environments |
| Rundeck | Runbook automation |
Key Takeaways#
- Define your RPO and RTO first. Every design decision flows from these numbers.
- HA is not DR. You need both — local resilience and regional recovery.
- Choose the right standby tier (cold/warm/hot) based on your budget and RTO target.
- Active-active multi-region gives the best RTO but introduces data consistency challenges.
- Test your DR plan regularly. Chaos engineering and game days are not optional — they are how you find out your plan works before a real disaster does.
Your disaster recovery architecture is only as good as the last time you tested it. Build it, automate it, break it on purpose, and fix what fails.
Explore more system design deep dives and engineering guides at codelit.io.