Disaster Recovery Architecture: High Availability, Failover Strategies & Multi-Region Design
Every production system will eventually fail. The question is not if but when — and whether your disaster recovery architecture is ready for it.
This guide covers the full spectrum: from defining RPO and RTO to designing multi-region active-active systems with automated failover.
HA vs DR: They Are Not the Same#
High availability (HA) keeps your system running during routine failures — a crashed process, a dead node, a full disk. It's about uptime within a single region or data center.
Disaster recovery (DR) is what happens when the entire region goes dark. It's about resuming operations after a catastrophic failure.
| Concern | High Availability | Disaster Recovery |
|---|---|---|
| Scope | Component/node failure | Region/site failure |
| Downtime goal | Seconds to minutes | Minutes to hours |
| Cost | Moderate | High |
| Automation | Expected | Often manual |
You need both. HA handles the daily turbulence. DR handles the earthquake.
RPO and RTO: The Two Numbers That Define Your DR Plan#
Recovery Point Objective (RPO) — How much data can you afford to lose? An RPO of 1 hour means you accept losing up to 1 hour of writes.
Recovery Time Objective (RTO) — How long can you be down? An RTO of 15 minutes means the system must be serving traffic again within 15 minutes of failure detection.
Timeline of a disaster:
```
Last backup            Disaster             Recovery complete
     |                     |                        |
     |<------- RPO ------->|<-------- RTO -------->|
     |     (data loss)     |       (downtime)      |
```
These numbers drive every architectural decision. A 0 RPO / 0 RTO system (zero data loss, zero downtime) costs orders of magnitude more than a 24h RPO / 4h RTO system.
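These budgets only matter if you measure real incidents against them. A minimal sketch of that comparison (the timestamps and targets below are illustrative, not from any real incident):

```python
from datetime import datetime, timedelta

def dr_budget_report(last_backup, failure_start, recovery_done,
                     rpo=timedelta(hours=1), rto=timedelta(minutes=15)):
    """Compare an incident's actual data-loss and downtime windows
    against the RPO/RTO targets."""
    data_loss = failure_start - last_backup   # writes since the last recovery point
    downtime = recovery_done - failure_start  # time until serving traffic again
    return {
        "data_loss": data_loss, "rpo_met": data_loss <= rpo,
        "downtime": downtime, "rto_met": downtime <= rto,
    }

report = dr_budget_report(
    last_backup=datetime(2024, 1, 1, 3, 0),    # hypothetical timestamps
    failure_start=datetime(2024, 1, 1, 3, 40),
    recovery_done=datetime(2024, 1, 1, 3, 50),
)
```

Running this after every drill (not just real disasters) turns RPO/RTO from aspirations into tracked metrics.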
DR Tiers: Cold, Warm, and Hot Standby#
Cold Standby#
Infrastructure is provisioned but powered off. On failure, you boot machines, restore from backups, and deploy. Cheapest option. RTO is measured in hours.
Warm Standby#
A scaled-down replica runs continuously. Data is replicated asynchronously. On failure, you scale up and redirect traffic. RTO drops to minutes.
Hot Standby#
A full-scale replica runs in parallel, receiving synchronous or near-synchronous replication. Failover is near-instant. This is the foundation for active-passive and active-active architectures.
```
Cold: [ Backups on S3 ] ── boot ── restore ── serve
Warm: [ Small replica ] ── scale up ── serve
Hot:  [ Full replica  ] ── flip traffic ── serve
```
Multi-Region Architecture#
Active-Passive#
One region handles all traffic. The passive region receives replicated data and stands ready. On failure, DNS or load balancer shifts traffic to the passive region.
Pros: simpler data consistency; lower cost than active-active.
Cons: the passive region is "wasted" capacity during normal operation, and failover is not instant.
Active-Active#
Both regions serve traffic simultaneously. Data is replicated bidirectionally. On failure, the surviving region absorbs the full load.
```
    ┌──────────────┐
    │  Global LB   │
    │ (Route 53 /  │
    │ Traffic Mgr) │
    └──┬───────┬───┘
       │       │
┌──────▼───┐ ┌─▼────────┐
│ Region A │ │ Region B │
│ (active) │ │ (active) │
│ App + DB │ │ App + DB │
└────┬─────┘ └────┬─────┘
     │            │
     └─── sync ───┘
 (bi-directional replication)
```
Pros: no wasted capacity; lower latency for geographically distributed users; near-zero RTO.
Cons: conflict resolution is hard, and split-brain scenarios require careful handling (CRDTs, last-writer-wins, or application-level merging).
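To make the conflict-resolution problem concrete, here is a sketch of last-writer-wins merging for bidirectionally replicated key-value records (the data and timestamps are invented; real systems would use hybrid logical clocks or version vectors rather than bare integers):

```python
def lww_merge(local, remote):
    """Last-writer-wins merge of replicated key-value stores, where each
    record is (value, timestamp). Ties break deterministically on the
    value so both regions converge to the same state."""
    merged = dict(local)
    for key, (value, ts) in remote.items():
        if key not in merged or (ts, value) > (merged[key][1], merged[key][0]):
            merged[key] = (value, ts)
    return merged

region_a = {"user:1": ("alice@old.example", 100)}
region_b = {"user:1": ("alice@new.example", 120), "user:2": ("bob", 90)}
merged = lww_merge(region_a, region_b)
```

The important property is symmetry: merging A into B and B into A must yield the same result, otherwise the regions never converge. LWW buys that simplicity by silently discarding the losing write, which is exactly the trade-off the article warns about.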
Failover Strategies#
DNS-Based Failover#
Tools like AWS Route 53 or Azure Traffic Manager use health checks to route traffic away from unhealthy endpoints. TTL settings control how fast clients pick up the change.
- Simple to implement
- Limited by DNS TTL propagation (30s–300s typical)
- Clients may cache stale records
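As a sketch of what DNS failover configuration looks like, the function below builds the primary/secondary record-change payload in the shape Route 53's ChangeResourceRecordSets API expects (the hostname, IPs, and health-check ID are hypothetical; in practice you would pass `change_batch` to boto3's `change_resource_record_sets`):

```python
def failover_record(name, ip, role, health_check_id=None, ttl=60):
    """Build one Route 53 failover record change. A low TTL (60s here)
    bounds how long clients can keep serving the stale endpoint."""
    record = {
        "Name": name, "Type": "A", "TTL": ttl,
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,                       # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:                         # primary must carry a health check
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

primary = failover_record("api.example.com", "198.51.100.10", "PRIMARY",
                          health_check_id="hc-primary")   # hypothetical ID
secondary = failover_record("api.example.com", "203.0.113.20", "SECONDARY")
change_batch = {"Changes": [primary, secondary]}
```

Route 53 serves the secondary record automatically once the primary's health check fails; no human action is required, which is the whole point.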
Load Balancer Failover#
A global load balancer (AWS Global Accelerator, Cloudflare LB) detects backend failures and reroutes in seconds, bypassing DNS caching entirely.
- Faster than DNS failover
- Requires anycast or global LB infrastructure
- Better for latency-sensitive workloads
Database Failover#
The hardest piece. Options include:
| Strategy | RPO | RTO | Complexity |
|---|---|---|---|
| Async replication + promotion | Minutes | Minutes | Low |
| Sync replication + auto-failover | ~0 | Seconds | Medium |
| Multi-master (active-active) | 0 | 0 | High |
PostgreSQL supports streaming replication with automatic failover via Patroni or pg_auto_failover. Cloud-managed options like Aurora Global Database or Cloud Spanner abstract this away.
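With async replication, the replica's lag at promotion time is exactly the data you lose, so failover tooling should compare lag against the RPO before promoting. A sketch of that decision (the 300-second default RPO is an illustrative assumption, not a recommendation):

```python
def promotion_decision(replica_lag_seconds, rpo_seconds=300,
                       primary_healthy=False):
    """Decide whether an async replica is safe to promote. Promoting a
    replica that lags beyond the RPO budget means accepting more data
    loss than the plan allows, so that path requires an explicit override."""
    if primary_healthy:
        return "no_failover_needed"
    if replica_lag_seconds <= rpo_seconds:
        return "promote"                 # data loss within budget
    return "promote_requires_override"   # would exceed the RPO

decision = promotion_decision(replica_lag_seconds=12)
```

Tools like Patroni make this kind of check configurable (e.g. a maximum-lag threshold for failover candidates) rather than leaving it to on-call judgment.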
Backup and Recovery Strategies#
Snapshots#
Point-in-time snapshots of volumes or databases. Fast to create, but your RPO equals the snapshot interval: fail just before the next snapshot and you lose everything written since the last one.
WAL Shipping (Write-Ahead Log)#
Continuously ship transaction logs to a standby or object storage. Enables Point-in-Time Recovery (PITR) — restore to any second within the retention window.
```
Primary DB ──── WAL stream ────►  S3 / GCS
                                      │
                                ┌─────▼─────┐
                                │  PITR to  │
                                │ any point │
                                └───────────┘
```
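PITR works by restoring the newest base backup taken at or before the target time, then replaying WAL forward to the exact target. A sketch of that restore-planning logic (timestamps are invented for illustration):

```python
from datetime import datetime

def pitr_plan(target, base_backups, last_wal_time):
    """Pick a point-in-time recovery plan: the newest base backup at or
    before the target, plus WAL replay up to the target. Returns None
    if the target falls outside the recoverable window."""
    candidates = [b for b in base_backups if b <= target]
    if not candidates or target > last_wal_time:
        return None  # before the first backup, or past the newest WAL record
    return {"base_backup": max(candidates), "replay_until": target}

backups = [datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 2, 0, 0)]
plan = pitr_plan(datetime(2024, 1, 2, 14, 30), backups,
                 last_wal_time=datetime(2024, 1, 2, 15, 0))
```

Note the two failure modes: a target older than your backup retention, and a target newer than the last shipped WAL segment. The second one is your effective RPO under WAL shipping.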
The 3-2-1 Rule#
- 3 copies of your data
- 2 different storage media
- 1 offsite (different region or provider)
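The rule is mechanical enough to enforce in code. A sketch of a 3-2-1 compliance check over a backup inventory (the inventory format is an assumption for illustration):

```python
def satisfies_3_2_1(copies):
    """Check a backup inventory against the 3-2-1 rule: at least 3
    copies, on at least 2 distinct media, with at least 1 offsite."""
    media = {c["medium"] for c in copies}
    offsite = [c for c in copies if c["offsite"]]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1

inventory = [
    {"medium": "nvme", "offsite": False},  # live data on local disk
    {"medium": "s3",   "offsite": False},  # same-region object storage
    {"medium": "s3",   "offsite": True},   # cross-region replica
]
compliant = satisfies_3_2_1(inventory)
```

Wiring a check like this into CI or a nightly job catches the common drift case: someone deletes the cross-region replication rule and nobody notices until restore day.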
Backup Testing#
A backup you have never restored is not a backup. Schedule regular restore drills. Automate them. Measure actual RTO against your target.
Chaos Engineering for DR Testing#
You cannot trust a DR plan that has never been exercised. Chaos engineering introduces controlled failures to validate resilience.
Practices:
- Game days — Scheduled DR simulations where you fail over an entire region and measure RTO/RPO.
- Fault injection — Kill processes, drop packets, corrupt disks using tools like Chaos Monkey, Litmus, or Gremlin.
- Steady-state hypothesis — Define what "normal" looks like, inject failure, and verify the system returns to steady state.
Chaos engineering loop:
```
Define steady state
        │
        ▼
Hypothesize (system survives X failure)
        │
        ▼
Inject failure
        │
        ▼
Observe ── meets hypothesis? ── Yes ──► confidence++
        │
        No
        │
        ▼
Fix weakness, repeat
```
Start small. Kill a single pod. Then escalate: kill a node, an AZ, a region. Each level builds confidence.
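The loop above fits in a few lines of harness code. This sketch uses a fake error-rate metric as the steady-state signal; real experiments would observe production telemetry and inject failures via a tool like Gremlin or Litmus:

```python
def chaos_experiment(steady_state, inject, observe):
    """Minimal chaos-experiment loop: verify the steady-state hypothesis,
    inject a failure, then check whether the system still satisfies it."""
    if not steady_state(observe()):
        return "aborted"  # never experiment against an unhealthy baseline
    inject()
    return "hypothesis_held" if steady_state(observe()) else "weakness_found"

# Simulated system: killing a pod nudges the error rate but HA absorbs it.
state = {"error_rate": 0.001}
def observe(): return dict(state)
def steady(metrics): return metrics["error_rate"] < 0.01
def kill_pod(): state["error_rate"] = 0.002

result = chaos_experiment(steady, kill_pod, observe)
```

The "aborted" branch matters in practice: injecting failures into an already-degraded system tells you nothing and can turn an incident into an outage.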
Runbook Automation#
Manual runbooks fail under pressure. People skip steps, misread instructions, and panic. Automate your DR runbooks:
- Infrastructure as Code — Terraform or Pulumi to provision standby environments
- Automated failover scripts — Triggered by monitoring alerts, not human judgment
- Orchestration tools — AWS Systems Manager, Rundeck, or PagerDuty Runbook Automation
- Validation checks — Automated smoke tests that confirm the failover target is healthy before shifting traffic
The gold standard: a single command (or zero commands) that fails over your entire stack and verifies it is serving correctly.
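A sketch of what that single command looks like under the hood: an ordered runbook that refuses to shift traffic until the standby passes its smoke test (the step names and health flag are illustrative):

```python
def run_failover(steps, smoke_test):
    """Execute an ordered DR runbook as (name, fn) pairs, gating the
    traffic shift on a standby smoke test. The returned log doubles as
    the incident timeline."""
    log = []
    for name, step in steps:
        if name == "shift_traffic" and not smoke_test():
            log.append("abort: standby failed smoke test")
            return log
        step()
        log.append(f"ok: {name}")
    return log

standby = {"healthy": True}
steps = [
    ("promote_standby", lambda: None),  # stand-ins for real orchestration calls
    ("shift_traffic",   lambda: None),
]
log = run_failover(steps, smoke_test=lambda: standby["healthy"])
```

The validation gate is the part teams skip and regret: shifting traffic to a standby that cannot serve turns a regional outage into a global one.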
Tools Quick Reference#
| Tool | Purpose |
|---|---|
| AWS Route 53 | DNS-based global failover |
| Azure Traffic Manager | DNS traffic routing + health checks |
| AWS Global Accelerator | Anycast-based global load balancing |
| Cloudflare LB | Global load balancing with fast failover |
| Patroni / pg_auto_failover | PostgreSQL HA + automatic failover |
| Aurora Global Database | Managed cross-region DB replication |
| Chaos Monkey / Gremlin / Litmus | Fault injection and chaos testing |
| Terraform / Pulumi | Infrastructure as Code for DR environments |
| Rundeck | Runbook automation |
Key Takeaways#
- Define your RPO and RTO first. Every design decision flows from these numbers.
- HA is not DR. You need both — local resilience and regional recovery.
- Choose the right standby tier (cold/warm/hot) based on your budget and RTO target.
- Active-active multi-region gives the best RTO but introduces data consistency challenges.
- Test your DR plan regularly. Chaos engineering and game days are not optional — they are how you find out your plan works before a real disaster does.
Your disaster recovery architecture is only as good as the last time you tested it. Build it, automate it, break it on purpose, and fix what fails.
Explore more system design deep dives and engineering guides at codelit.io.