# Database Connection Failover — Keeping Applications Online When Databases Go Down
## Why database failover matters
Your application's uptime is bounded by your database's uptime. A single database instance is a single point of failure — hardware fails, patches require restarts, and network partitions happen. Database failover ensures that when the primary goes down, a standby takes over with minimal disruption.
The goal is not to prevent failures but to recover from them so quickly that users never notice.
## Failover fundamentals
### RTO and RPO
- Recovery Time Objective (RTO) — how long can the database be unavailable? Seconds? Minutes?
- Recovery Point Objective (RPO) — how much data can you afford to lose? Zero (synchronous replication) or a few seconds (asynchronous)?
These two numbers drive every architectural decision in your failover strategy.
### Replication modes
| Mode | RPO | Latency impact | Consistency |
|---|---|---|---|
| Synchronous | Zero | Higher (waits for replica ack) | Strong |
| Semi-synchronous | Near-zero | Moderate | Nearly strong |
| Asynchronous | Seconds | None | Eventual |
Synchronous replication guarantees zero data loss but adds write latency. Asynchronous replication is faster but risks losing the most recent transactions during failover.
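In PostgreSQL, for example, the replication mode is controlled by a pair of settings on the primary. A minimal sketch of synchronous replication in `postgresql.conf`, assuming a standby whose `application_name` is `replica1`:

```
# postgresql.conf on the primary — illustrative values
synchronous_commit = on                            # commits wait for standby ack
synchronous_standby_names = 'FIRST 1 (replica1)'   # which standbys count as synchronous
```

Setting `synchronous_standby_names` to an empty string switches the same setup back to asynchronous replication, trading the write-latency cost for a non-zero RPO.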
## Automatic failover
### How automatic failover works
- A monitoring system continuously checks the primary's health (heartbeat, query execution, replication lag)
- When the primary fails consecutive health checks, the monitor declares it dead
- The most up-to-date replica is promoted to primary
- Connection endpoints are updated to point to the new primary
- The old primary is fenced (prevented from accepting writes) to avoid split-brain
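The detection half of this loop can be sketched in a few lines of Python. The `check_primary` and `promote_best_replica` callables are hypothetical stand-ins for real health checks (heartbeat, test query) and promotion logic:

```python
import time

def monitor(check_primary, promote_best_replica, fall=3, interval=1.0):
    """Declare the primary dead after `fall` consecutive failed health
    checks, then trigger promotion. Returns the number of checks run.
    Illustrative sketch, not production code."""
    failures = 0
    checks = 0
    while True:
        checks += 1
        if check_primary():          # heartbeat / test query succeeded
            failures = 0             # any success resets the counter
        else:
            failures += 1
            if failures >= fall:     # primary declared dead
                promote_best_replica()
                return checks
        time.sleep(interval)
```

Requiring several consecutive failures (rather than one) avoids promoting a replica because of a single dropped packet, at the cost of a slightly longer detection time.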
### Avoiding split-brain
Split-brain occurs when both the old and new primary accept writes simultaneously. Prevention strategies:
- STONITH (Shoot The Other Node In The Head) — forcibly power off the old primary via IPMI or cloud API
- Fencing tokens — each primary gets a monotonically increasing token; replicas reject writes from stale tokens
- Quorum-based decisions — require a majority of nodes to agree before promoting a new primary
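The fencing-token idea can be shown with a minimal sketch: a storage node remembers the highest token it has seen and rejects writes carrying an older one. Class and method names here are hypothetical:

```python
class FencedStore:
    """Rejects writes from a demoted primary holding a stale token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token   # tokens only ever increase
        self.data[key] = value

store = FencedStore()
store.write(1, "k", "from old primary")   # old primary holds token 1
store.write(2, "k", "from new primary")   # after failover, new primary holds token 2
# any further write with token 1 now raises — the old primary is fenced off
```

The token itself comes from the consensus store that authorized the promotion, so even an old primary that is alive and reachable cannot corrupt data after failover.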
### PostgreSQL automatic failover
Patroni is the most popular tool for PostgreSQL automatic failover:
```yaml
# patroni.yml
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
```
Patroni uses a distributed consensus store (etcd, ZooKeeper, or Consul) to elect the leader. When the leader fails, Patroni promotes the most caught-up replica and updates the DCS.
### MySQL automatic failover
MySQL Group Replication with InnoDB Cluster provides built-in automatic failover:
- Paxos-based consensus for group membership
- Automatic conflict detection and resolution
- MySQL Router handles connection routing to the current primary
## Read replica promotion
When the primary fails, one replica must be promoted. The choice matters:
### Selection criteria
- Replication lag — choose the replica with the least lag (fewest missing transactions)
- Data centre locality — prefer a replica in the same AZ as the application servers
- Hardware capacity — ensure the promoted replica can handle write load
- Replication chain position — avoid promoting a cascading replica if a direct replica is available
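The first three criteria can be combined into a simple scoring function. This is a sketch with made-up field names, where replication lag dominates and direct, same-AZ replicas win ties (capacity checks are left out for brevity):

```python
def pick_promotion_target(replicas, app_az):
    """Choose the replica to promote: lowest lag first, then prefer
    direct (non-cascading) replicas, then the application's AZ.
    `replicas` is a list of dicts with hypothetical fields."""
    def score(r):
        return (
            r["lag_bytes"],        # fewest missing transactions
            r["is_cascading"],     # direct replicas before cascaded ones
            r["az"] != app_az,     # prefer the app servers' AZ
        )
    return min(replicas, key=score)

replicas = [
    {"name": "pg2", "lag_bytes": 0,    "is_cascading": False, "az": "us-east-1b"},
    {"name": "pg3", "lag_bytes": 0,    "is_cascading": False, "az": "us-east-1a"},
    {"name": "pg4", "lag_bytes": 4096, "is_cascading": True,  "az": "us-east-1a"},
]
print(pick_promotion_target(replicas, "us-east-1a")["name"])  # pg3
```

Tuple comparison gives the priority ordering for free: a replica with less lag always beats one with more, and the locality and topology fields only break ties.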
### Promotion steps (PostgreSQL)
- Stop replication on the target replica
- Run `pg_promote()` or create a trigger file
- The replica replays remaining WAL and opens for writes
- Reconfigure other replicas to follow the new primary
- Update connection routing
### Promotion steps (MySQL)
- Ensure the target replica has applied all relay logs
- Run `STOP REPLICA; RESET REPLICA ALL;`
- Enable writes: `SET GLOBAL read_only = OFF;`
- Point other replicas to the new primary with `CHANGE REPLICATION SOURCE TO`
## Connection retry strategies
Applications must handle connection failures gracefully. A naive retry can cause thundering herds or connect to a stale primary.
### Exponential backoff with jitter
```python
import random
import time

def connect_with_retry(max_retries=5, base_delay=0.1):
    for attempt in range(max_retries):
        try:
            return create_connection()
        except ConnectionError:
            delay = base_delay * (2 ** attempt)  # exponential backoff
            jitter = random.uniform(0, delay)    # jitter spreads out reconnects
            time.sleep(delay + jitter)
    raise ConnectionError("Failed to connect after retries")
```
### Connection string failover
Most database drivers support multiple hosts in the connection string:
```
# PostgreSQL libpq
postgresql://host1:5432,host2:5432,host3:5432/mydb?target_session_attrs=read-write

# MySQL Connector/J (JDBC)
jdbc:mysql://host1:3306,host2:3306,host3:3306/mydb?failoverReadOnly=false
```
The driver tries each host in order and connects to the first one that is available and writable.
### DNS-based failover
Use a DNS CNAME that always points to the current primary. After failover, update the CNAME. Applications reconnect transparently — but only after the DNS TTL expires.
## PgBouncer failover
PgBouncer is a lightweight connection pooler for PostgreSQL. In a failover scenario, PgBouncer itself must be highly available and must point to the current primary.
### Architecture patterns
Pattern 1: PgBouncer follows Patroni
Patroni updates the PgBouncer configuration and reloads it when the primary changes:
```bash
#!/bin/bash
# Patroni callback script: repoint PgBouncer when the primary changes
if [ "$1" = "on_start" ] || [ "$1" = "on_role_change" ]; then
    NEW_PRIMARY=$(patronictl list -f json | jq -r '.[] | select(.Role=="Leader") | .Host')
    sed -i "s/host=.*/host=${NEW_PRIMARY}/" /etc/pgbouncer/pgbouncer.ini
    kill -HUP "$(cat /var/run/pgbouncer/pgbouncer.pid)"
fi
```
Pattern 2: PgBouncer behind a VIP
A virtual IP floats between PgBouncer instances. If one PgBouncer fails, the VIP moves to another instance. Applications always connect to the VIP.
Pattern 3: PgBouncer per application pod
Deploy PgBouncer as a sidecar container alongside each application pod. The sidecar handles connection pooling and points to the current primary via DNS.
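A hedged sketch of the sidecar pattern as a Kubernetes pod-spec fragment — the image names and the `primary.db.internal` DNS name are placeholders, not a real deployment:

```yaml
# pod spec fragment (illustrative)
containers:
  - name: app
    image: myapp:latest
    env:
      - name: DATABASE_URL
        value: postgresql://localhost:6432/mydb   # talk to the local sidecar
  - name: pgbouncer
    image: pgbouncer:latest
    ports:
      - containerPort: 6432
    # pgbouncer.ini points at a DNS name that tracks the current
    # primary, e.g. primary.db.internal
```

The application only ever sees `localhost:6432`; after a failover, the sidecar re-resolves the DNS name and new connections land on the new primary without any application change.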
## HAProxy for database load balancing
HAProxy sits between applications and database instances, routing connections based on health checks and routing rules.
### Configuration
```
# haproxy.cfg
listen postgres_write
    bind *:5432
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server pg1 10.0.1.1:5432 check port 8008
    server pg2 10.0.1.2:5432 check port 8008
    server pg3 10.0.1.3:5432 check port 8008

listen postgres_read
    bind *:5433
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server pg1 10.0.1.1:5432 check port 8008
    server pg2 10.0.1.2:5432 check port 8008
    server pg3 10.0.1.3:5432 check port 8008
```
The health check endpoint (/primary or /replica) is served by Patroni's REST API. HAProxy only routes write traffic to the current primary and read traffic to healthy replicas.
### HAProxy high availability
HAProxy itself must be redundant:
- Deploy two HAProxy instances with keepalived managing a floating VIP
- Or run HAProxy as a Kubernetes service with multiple replicas
## Multi-AZ deployment
### AWS RDS Multi-AZ
RDS Multi-AZ deploys a synchronous standby replica in a different availability zone:
- Automatic failover in 60–120 seconds
- DNS endpoint automatically updates to the new primary
- Zero data loss (synchronous replication)
- No application changes required
### Custom multi-AZ with Patroni
For self-managed PostgreSQL:
- Deploy Patroni nodes across 3 AZs (one primary, two replicas)
- Run etcd across all 3 AZs for consensus
- Use synchronous replication to at least one AZ for zero RPO
- HAProxy or PgBouncer in each AZ for local connection routing
### Cross-region failover
For disaster recovery, replicate asynchronously to a different region:
- RPO is bounded by replication lag (typically seconds)
- RTO depends on promotion time plus DNS propagation
- Use Route 53 health checks with failover routing to automate DNS cutover
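The cross-region RTO is roughly the sum of the stages involved. A back-of-the-envelope sketch — all the numbers below are assumptions for illustration:

```python
def estimate_rto(detection_s, promotion_s, dns_ttl_s, reconnect_s):
    """Worst-case recovery time: detect the failure, promote the
    replica, wait out the DNS TTL, re-establish connections."""
    return detection_s + promotion_s + dns_ttl_s + reconnect_s

# illustrative budget: 30 s detection, 15 s promotion, 60 s TTL, 5 s reconnect
print(estimate_rto(30, 15, 60, 5))  # 110 seconds
```

A budget like this makes trade-offs visible: halving the DNS TTL often buys more RTO than any tuning of the detection loop.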
## Monitoring failover readiness
Track these metrics to ensure your failover will work when you need it:
- Replication lag — if lag exceeds your RPO, you will lose data during failover
- Failover test frequency — run chaos engineering exercises monthly
- Connection pool health — stale connections after failover cause errors
- DNS TTL alignment — ensure TTLs are low enough for your RTO target
- Split-brain detection — alert if multiple nodes claim to be primary
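A minimal readiness check over these metrics might look like the following sketch — the metric names, thresholds, and dict shape are assumptions, not a standard API:

```python
def failover_readiness_alerts(metrics, rpo_s, rto_s):
    """Return a list of alert strings for a metrics dict with
    hypothetical keys. An empty list means failover-ready."""
    alerts = []
    if metrics["replication_lag_s"] > rpo_s:
        alerts.append("replication lag exceeds RPO: data loss on failover")
    if metrics["dns_ttl_s"] > rto_s:
        alerts.append("DNS TTL too high for RTO target")
    if metrics["nodes_claiming_primary"] > 1:
        alerts.append("split-brain: multiple nodes claim to be primary")
    return alerts

print(failover_readiness_alerts(
    {"replication_lag_s": 12, "dns_ttl_s": 30, "nodes_claiming_primary": 1},
    rpo_s=5, rto_s=60))
# ['replication lag exceeds RPO: data loss on failover']
```

Wiring checks like these into an alerting system turns "will failover work?" from a hope into a continuously verified property.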
## Failover testing checklist
- Kill the primary process — does the replica promote automatically?
- Network-partition the primary — does fencing prevent split-brain?
- Measure actual RTO — does it meet your SLA?
- Verify zero data loss — do all committed transactions survive?
- Test application reconnection — do connections recover without manual intervention?
- Simulate cascading failure — what happens if the new primary also fails?