# Database Connection Failover — Keeping Applications Online When Databases Go Down
## Why database failover matters
Your application's uptime is bounded by your database's uptime. A single database instance is a single point of failure — hardware fails, patches require restarts, and network partitions happen. Database failover ensures that when the primary goes down, a standby takes over with minimal disruption.
The goal is not to prevent failures but to recover from them so quickly that users never notice.
## Failover fundamentals
### RTO and RPO
- Recovery Time Objective (RTO) — how long can the database be unavailable? Seconds? Minutes?
- Recovery Point Objective (RPO) — how much data can you afford to lose? Zero (synchronous replication) or a few seconds (asynchronous)?
These two numbers drive every architectural decision in your failover strategy.
### Replication modes
| Mode | RPO | Latency impact | Consistency |
|---|---|---|---|
| Synchronous | Zero | Higher (waits for replica ack) | Strong |
| Semi-synchronous | Near-zero | Moderate | Nearly strong |
| Asynchronous | Seconds | None | Eventual |
Synchronous replication guarantees zero data loss but adds write latency. Asynchronous replication is faster but risks losing the most recent transactions during failover.
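In PostgreSQL, for example, the replication mode is controlled by a pair of settings on the primary. A minimal sketch of synchronous replication in `postgresql.conf`, assuming a standby whose `application_name` is `replica1`:

```
# postgresql.conf on the primary — illustrative values
synchronous_commit = on                            # commits wait for standby ack
synchronous_standby_names = 'FIRST 1 (replica1)'   # which standbys count as synchronous
```

Setting `synchronous_standby_names` to an empty string switches the same setup back to asynchronous replication, trading the write-latency cost for a non-zero RPO.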
## Automatic failover
### How automatic failover works
- A monitoring system continuously checks the primary's health (heartbeat, query execution, replication lag)
- When the primary fails consecutive health checks, the monitor declares it dead
- The most up-to-date replica is promoted to primary
- Connection endpoints are updated to point to the new primary
- The old primary is fenced (prevented from accepting writes) to avoid split-brain
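The detection half of this loop can be sketched in a few lines of Python. The `check_primary` and `promote_best_replica` callables are hypothetical stand-ins for real health checks (heartbeat, test query) and promotion logic:

```python
import time

def monitor(check_primary, promote_best_replica, fall=3, interval=1.0):
    """Declare the primary dead after `fall` consecutive failed health
    checks, then trigger promotion. Returns the number of checks run.
    Illustrative sketch, not production code."""
    failures = 0
    checks = 0
    while True:
        checks += 1
        if check_primary():          # heartbeat / test query succeeded
            failures = 0             # any success resets the counter
        else:
            failures += 1
            if failures >= fall:     # primary declared dead
                promote_best_replica()
                return checks
        time.sleep(interval)
```

Requiring several consecutive failures (rather than one) avoids promoting a replica because of a single dropped packet, at the cost of a slightly longer detection time.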
### Avoiding split-brain
Split-brain occurs when both the old and new primary accept writes simultaneously. Prevention strategies:
- STONITH (Shoot The Other Node In The Head) — forcibly power off the old primary via IPMI or cloud API
- Fencing tokens — each primary gets a monotonically increasing token; replicas reject writes from stale tokens
- Quorum-based decisions — require a majority of nodes to agree before promoting a new primary
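The fencing-token idea can be shown with a minimal sketch: a storage node remembers the highest token it has seen and rejects writes carrying an older one. Class and method names here are hypothetical:

```python
class FencedStore:
    """Rejects writes from a demoted primary holding a stale token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token   # tokens only ever increase
        self.data[key] = value

store = FencedStore()
store.write(1, "k", "from old primary")   # old primary holds token 1
store.write(2, "k", "from new primary")   # after failover, new primary holds token 2
# any further write with token 1 now raises — the old primary is fenced off
```

The token itself comes from the consensus store that authorized the promotion, so even an old primary that is alive and reachable cannot corrupt data after failover.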
### PostgreSQL automatic failover
Patroni is the most popular tool for PostgreSQL automatic failover:
```yaml
# patroni.yml
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
```
Patroni uses a distributed consensus store (etcd, ZooKeeper, or Consul) to elect the leader. When the leader fails, Patroni promotes the most caught-up replica and updates the DCS.
### MySQL automatic failover
MySQL Group Replication with InnoDB Cluster provides built-in automatic failover:
- Paxos-based consensus for group membership
- Automatic conflict detection and resolution
- MySQL Router handles connection routing to the current primary
## Read replica promotion
When the primary fails, one replica must be promoted. The choice matters:
### Selection criteria
- Replication lag — choose the replica with the least lag (fewest missing transactions)
- Data centre locality — prefer a replica in the same AZ as the application servers
- Hardware capacity — ensure the promoted replica can handle write load
- Replication chain position — avoid promoting a cascading replica if a direct replica is available
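The first three criteria can be combined into a simple scoring function. This is a sketch with made-up field names, where replication lag dominates and direct, same-AZ replicas win ties (capacity checks are left out for brevity):

```python
def pick_promotion_target(replicas, app_az):
    """Choose the replica to promote: lowest lag first, then prefer
    direct (non-cascading) replicas, then the application's AZ.
    `replicas` is a list of dicts with hypothetical fields."""
    def score(r):
        return (
            r["lag_bytes"],        # fewest missing transactions
            r["is_cascading"],     # direct replicas before cascaded ones
            r["az"] != app_az,     # prefer the app servers' AZ
        )
    return min(replicas, key=score)

replicas = [
    {"name": "pg2", "lag_bytes": 0,    "is_cascading": False, "az": "us-east-1b"},
    {"name": "pg3", "lag_bytes": 0,    "is_cascading": False, "az": "us-east-1a"},
    {"name": "pg4", "lag_bytes": 4096, "is_cascading": True,  "az": "us-east-1a"},
]
print(pick_promotion_target(replicas, "us-east-1a")["name"])  # pg3
```

Tuple comparison gives the priority ordering for free: a replica with less lag always beats one with more, and the locality and topology fields only break ties.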
### Promotion steps (PostgreSQL)
- Stop replication on the target replica
- Run `pg_promote()` or create a trigger file
- The replica replays remaining WAL and opens for writes
- Reconfigure other replicas to follow the new primary
- Update connection routing
### Promotion steps (MySQL)
- Ensure the target replica has applied all relay logs
- Run `STOP REPLICA; RESET REPLICA ALL;`
- Enable writes: `SET GLOBAL read_only = OFF;`
- Point other replicas to the new primary with `CHANGE REPLICATION SOURCE TO`
## Connection retry strategies
Applications must handle connection failures gracefully. A naive retry can cause thundering herds or connect to a stale primary.
### Exponential backoff with jitter
```python
import random
import time

def connect_with_retry(max_retries=5, base_delay=0.1):
    for attempt in range(max_retries):
        try:
            return create_connection()
        except ConnectionError:
            delay = base_delay * (2 ** attempt)  # exponential backoff
            jitter = random.uniform(0, delay)    # jitter spreads out reconnects
            time.sleep(delay + jitter)
    raise ConnectionError("Failed to connect after retries")
```
### Connection string failover
Most database drivers support multiple hosts in the connection string:
```
# PostgreSQL libpq
postgresql://host1:5432,host2:5432,host3:5432/mydb?target_session_attrs=read-write

# MySQL Connector/J (JDBC)
jdbc:mysql://host1:3306,host2:3306,host3:3306/mydb?failoverReadOnly=false
```
The driver tries each host in order and connects to the first one that is available and writable.
### DNS-based failover
Use a DNS CNAME that always points to the current primary. After failover, update the CNAME. Applications reconnect transparently — but only after the DNS TTL expires.
## PgBouncer failover
PgBouncer is a lightweight connection pooler for PostgreSQL. In a failover scenario, PgBouncer itself must be highly available and must point to the current primary.
### Architecture patterns
Pattern 1: PgBouncer follows Patroni
Patroni updates the PgBouncer configuration and reloads it when the primary changes:
```bash
#!/bin/bash
# Patroni callback script: repoint PgBouncer when the primary changes
if [ "$1" = "on_start" ] || [ "$1" = "on_role_change" ]; then
    NEW_PRIMARY=$(patronictl list -f json | jq -r '.[] | select(.Role=="Leader") | .Host')
    sed -i "s/host=.*/host=${NEW_PRIMARY}/" /etc/pgbouncer/pgbouncer.ini
    kill -HUP "$(cat /var/run/pgbouncer/pgbouncer.pid)"
fi
```
Pattern 2: PgBouncer behind a VIP
A virtual IP floats between PgBouncer instances. If one PgBouncer fails, the VIP moves to another instance. Applications always connect to the VIP.
Pattern 3: PgBouncer per application pod
Deploy PgBouncer as a sidecar container alongside each application pod. The sidecar handles connection pooling and points to the current primary via DNS.
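A hedged sketch of the sidecar pattern as a Kubernetes pod-spec fragment — the image names and the `primary.db.internal` DNS name are placeholders, not a real deployment:

```yaml
# pod spec fragment (illustrative)
containers:
  - name: app
    image: myapp:latest
    env:
      - name: DATABASE_URL
        value: postgresql://localhost:6432/mydb   # talk to the local sidecar
  - name: pgbouncer
    image: pgbouncer:latest
    ports:
      - containerPort: 6432
    # pgbouncer.ini points at a DNS name that tracks the current
    # primary, e.g. primary.db.internal
```

The application only ever sees `localhost:6432`; after a failover, the sidecar re-resolves the DNS name and new connections land on the new primary without any application change.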
## HAProxy for database load balancing
HAProxy sits between applications and database instances, routing connections based on health checks and routing rules.
### Configuration
```
# haproxy.cfg
listen postgres_write
    bind *:5432
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server pg1 10.0.1.1:5432 check port 8008
    server pg2 10.0.1.2:5432 check port 8008
    server pg3 10.0.1.3:5432 check port 8008

listen postgres_read
    bind *:5433
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server pg1 10.0.1.1:5432 check port 8008
    server pg2 10.0.1.2:5432 check port 8008
    server pg3 10.0.1.3:5432 check port 8008
```
The health check endpoint (/primary or /replica) is served by Patroni's REST API. HAProxy only routes write traffic to the current primary and read traffic to healthy replicas.
### HAProxy high availability
HAProxy itself must be redundant:
- Deploy two HAProxy instances with keepalived managing a floating VIP
- Or run HAProxy as a Kubernetes service with multiple replicas
## Multi-AZ deployment
### AWS RDS Multi-AZ
RDS Multi-AZ deploys a synchronous standby replica in a different availability zone:
- Automatic failover in 60–120 seconds
- DNS endpoint automatically updates to the new primary
- Zero data loss (synchronous replication)
- No application changes required
### Custom multi-AZ with Patroni
For self-managed PostgreSQL:
- Deploy Patroni nodes across 3 AZs (one primary, two replicas)
- Run etcd across all 3 AZs for consensus
- Use synchronous replication to at least one AZ for zero RPO
- HAProxy or PgBouncer in each AZ for local connection routing
### Cross-region failover
For disaster recovery, replicate asynchronously to a different region:
- RPO is bounded by replication lag (typically seconds)
- RTO depends on promotion time plus DNS propagation
- Use Route 53 health checks with failover routing to automate DNS cutover
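The cross-region RTO is roughly the sum of the stages involved. A back-of-the-envelope sketch — all the numbers below are assumptions for illustration:

```python
def estimate_rto(detection_s, promotion_s, dns_ttl_s, reconnect_s):
    """Worst-case recovery time: detect the failure, promote the
    replica, wait out the DNS TTL, re-establish connections."""
    return detection_s + promotion_s + dns_ttl_s + reconnect_s

# illustrative budget: 30 s detection, 15 s promotion, 60 s TTL, 5 s reconnect
print(estimate_rto(30, 15, 60, 5))  # 110 seconds
```

A budget like this makes trade-offs visible: halving the DNS TTL often buys more RTO than any tuning of the detection loop.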
## Monitoring failover readiness
Track these metrics to ensure your failover will work when you need it:
- Replication lag — if lag exceeds your RPO, you will lose data during failover
- Failover test frequency — run chaos engineering exercises monthly
- Connection pool health — stale connections after failover cause errors
- DNS TTL alignment — ensure TTLs are low enough for your RTO target
- Split-brain detection — alert if multiple nodes claim to be primary
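A minimal readiness check over these metrics might look like the following sketch — the metric names, thresholds, and dict shape are assumptions, not a standard API:

```python
def failover_readiness_alerts(metrics, rpo_s, rto_s):
    """Return a list of alert strings for a metrics dict with
    hypothetical keys. An empty list means failover-ready."""
    alerts = []
    if metrics["replication_lag_s"] > rpo_s:
        alerts.append("replication lag exceeds RPO: data loss on failover")
    if metrics["dns_ttl_s"] > rto_s:
        alerts.append("DNS TTL too high for RTO target")
    if metrics["nodes_claiming_primary"] > 1:
        alerts.append("split-brain: multiple nodes claim to be primary")
    return alerts

print(failover_readiness_alerts(
    {"replication_lag_s": 12, "dns_ttl_s": 30, "nodes_claiming_primary": 1},
    rpo_s=5, rto_s=60))
# ['replication lag exceeds RPO: data loss on failover']
```

Wiring checks like these into an alerting system turns "will failover work?" from a hope into a continuously verified property.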
## Failover testing checklist
- Kill the primary process — does the replica promote automatically?
- Network-partition the primary — does fencing prevent split-brain?
- Measure actual RTO — does it meet your SLA?
- Verify zero data loss — do all committed transactions survive?
- Test application reconnection — do connections recover without manual intervention?
- Simulate cascading failure — what happens if the new primary also fails?