# Zero Downtime Database Migrations: Evolve Your Schema Without Breaking Production
Every production database must evolve. New features demand new columns, renamed fields, restructured tables. The challenge is doing this while thousands of queries per second keep flowing — no maintenance windows, no downtime.
## Why Traditional Migrations Break
A naive ALTER TABLE on a large table locks it:
```sql
-- This locks the users table for minutes or hours on large datasets
ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}';
```
During that lock:
- All reads and writes to the table block
- Connection pools fill up
- Application threads wait, then timeout
- Users see 500 errors or infinite spinners
For a table with 100M rows, a simple ADD COLUMN with a default value forces a full table rewrite in MySQL before 8.0 and PostgreSQL before 11, tying the table up for 10+ minutes. (Newer versions of both databases make this particular change instant, but many other ALTERs still rewrite.)
## Online Schema Change Tools

### pt-online-schema-change (Percona)
The original tool for MySQL online DDL:
```
# How it works internally:
1. Create shadow table _users_new (copy of the users schema)
2. ALTER the shadow table (instant, since it's empty)
3. Create triggers on the original table:
   - INSERT trigger → also INSERT into _users_new
   - UPDATE trigger → also UPDATE _users_new
   - DELETE trigger → also DELETE from _users_new
4. Copy rows in chunks from users → _users_new
5. Atomic rename: users → _users_old, _users_new → users
6. Drop _users_old
```
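The steps above can be simulated end to end in a few lines of Python, with SQLite standing in for MySQL so the mechanics are runnable (table, trigger, and column names here are invented for the demo):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users (name) VALUES ('ada'), ('grace'), ('edsger');

    -- Steps 1-2: shadow table already carries the new column
    CREATE TABLE _users_new (id INTEGER PRIMARY KEY, name TEXT,
                             preferences TEXT DEFAULT '{}');

    -- Step 3: triggers mirror ongoing writes into the shadow table
    CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
        INSERT INTO _users_new (id, name) VALUES (NEW.id, NEW.name);
    END;
    CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
        UPDATE _users_new SET name = NEW.name WHERE id = NEW.id;
    END;
    CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
        DELETE FROM _users_new WHERE id = OLD.id;
    END;
""")

# Step 4: copy existing rows in chunks (chunk size 2 for the demo);
# INSERT OR IGNORE skips any rows the triggers already mirrored
last_id = 0
while True:
    rows = db.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT 2",
        (last_id,)).fetchall()
    if not rows:
        break
    db.executemany(
        "INSERT OR IGNORE INTO _users_new (id, name) VALUES (?, ?)", rows)
    last_id = rows[-1][0]

# A write arrives mid-migration; the trigger keeps the tables in sync
db.execute("INSERT INTO users (name) VALUES ('barbara')")

# Steps 5-6: swap tables and drop the old one
db.executescript("""
    DROP TRIGGER users_ins; DROP TRIGGER users_upd; DROP TRIGGER users_del;
    ALTER TABLE users RENAME TO _users_old;
    ALTER TABLE _users_new RENAME TO users;
    DROP TABLE _users_old;
""")
print(db.execute("SELECT count(*), count(preferences) FROM users").fetchone())
# → (4, 4): all rows made it across, all carry the new column
```

The INSERT OR IGNORE in the chunk copy is what makes the copy safe to interleave with trigger writes: whichever path lands a row first wins, and the other is a no-op.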
Key flags to know:
```bash
pt-online-schema-change \
  --alter "ADD COLUMN preferences JSON" \
  --chunk-size=1000 \
  --max-lag=1s \
  --check-replication-filters \
  D=mydb,t=users
```
### gh-ost (GitHub)
GitHub's triggerless online schema migration:
```
# The gh-ost approach (no triggers):
1. Create ghost table _users_gho
2. ALTER the ghost table
3. Connect to a replica as a MySQL replication client
4. Read binlog events for the original table
5. Apply binlog events to the ghost table (async)
6. Copy existing rows in chunks
7. Cut-over: brief lock for the atomic rename
```
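The key difference from the trigger approach is that writes are observed after the fact, as a stream of change events, and replayed onto the ghost copy. A toy sketch of that idea, with dicts standing in for tables and a plain list standing in for the binlog (the event format is invented):

```python
source = {1: "ada", 2: "grace", 3: "edsger"}  # original table: id -> name
ghost = {}                                     # ghost table being built
binlog = []                                    # change events seen during the copy

def write(op, row_id, value=None):
    """Application write path: mutate source and emit a binlog-style event."""
    if op in ("insert", "update"):
        source[row_id] = value
    elif op == "delete":
        del source[row_id]
    binlog.append((op, row_id, value))

# Chunked row copy of the existing rows (no triggers involved)
for row_id in sorted(source):
    ghost[row_id] = source[row_id]

# Writes that arrive during/after the copy land only in the binlog
write("update", 2, "grace hopper")
write("insert", 4, "barbara")
write("delete", 3)

# The migrator replays binlog events onto the ghost table
for op, row_id, value in binlog:
    if op in ("insert", "update"):
        ghost[row_id] = value
    else:
        ghost.pop(row_id, None)

print(ghost == source)  # → True: cut-over is safe once the ghost caught up
```

Because each event fully describes the row change, replay is idempotent: re-applying the same event stream leaves the ghost table in the same state, which is what makes gh-ost pausable and resumable.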
Advantages over pt-online-schema-change:
- No triggers — triggers cause write amplification and locking
- Pausable — stop and resume the migration at any time
- Testable — run on a replica first with `--test-on-replica`
- Throttle-aware — automatically slows down under load
```bash
gh-ost \
  --alter="ADD COLUMN preferences JSON" \
  --database=mydb \
  --table=users \
  --chunk-size=1000 \
  --max-load=Threads_running=25 \
  --critical-load=Threads_running=100 \
  --execute
```
## The Expand-Contract Pattern
The safest pattern for schema evolution across any database:
### Phase 1: Expand
Add new structures alongside existing ones:
```sql
-- Add new column (nullable, no default — instant in most databases)
ALTER TABLE users ADD COLUMN email_normalized VARCHAR(255);
```
### Phase 2: Migrate
Backfill data and update application code:
```python
# Dual-write: application writes to both columns
def update_email(user_id, email):
    normalized = email.strip().lower()
    db.execute("""
        UPDATE users
        SET email = %s, email_normalized = %s
        WHERE id = %s
    """, (email, normalized, user_id))
```
### Phase 3: Contract
Remove the old structure once migration is complete:
```sql
-- Only after verifying all reads use the new column
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users RENAME COLUMN email_normalized TO email;
```
Each phase is a separate deployment. If anything breaks, you roll back one phase — not the entire migration.
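Compressed into one runnable script (against SQLite rather than PostgreSQL, and with the article's email example; SQLite needs version 3.35+ for DROP COLUMN), the three phases look like this:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.executemany("INSERT INTO users (email) VALUES (?)",
               [("Ada@Example.COM ",), ("grace@example.com",)])

# Phase 1: Expand - add the new column, nullable, no default
db.execute("ALTER TABLE users ADD COLUMN email_normalized TEXT")

# Phase 2: Migrate - backfill, while application code dual-writes
for row_id, email in db.execute("SELECT id, email FROM users").fetchall():
    db.execute("UPDATE users SET email_normalized = ? WHERE id = ?",
               (email.strip().lower(), row_id))

# Phase 3: Contract - only after all reads use the new column
db.execute("ALTER TABLE users DROP COLUMN email")
db.execute("ALTER TABLE users RENAME COLUMN email_normalized TO email")

print(db.execute("SELECT email FROM users ORDER BY id").fetchall())
# → [('ada@example.com',), ('grace@example.com',)]
```

In production each phase would be its own deployment with verification in between; collapsing them here is only to show the schema states each phase passes through.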
## Shadow Columns and Dual-Write
Shadow columns let you validate new data structures before committing:
```python
import json

class UserService:
    def save_user(self, user_data):
        # Write to both old and new columns
        self.db.execute("""
            INSERT INTO users (name, address_text, address_json)
            VALUES (%s, %s, %s)
        """, (
            user_data['name'],
            user_data['address'],                         # old format
            json.dumps(user_data['address_structured']),  # new format
        ))

    def verify_shadow_consistency(self):
        """Compare old and new columns for data integrity."""
        mismatches = self.db.execute("""
            SELECT id, address_text, address_json
            FROM users
            WHERE address_text != address_json->>'full_address'
        """)
        return mismatches
```
The dual-write verification loop:
1. Deploy v1: write the old column only
2. Deploy v2: write both columns (shadow write)
3. Verify: run consistency checks for 24-48 hours
4. Deploy v3: read from the new column, write both
5. Verify: monitor error rates for 24-48 hours
6. Deploy v4: write the new column only, then drop the old column
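One hypothetical way to drive that sequence is a single phase flag, so each deploy flips configuration rather than rewriting code paths. The phase names and in-memory "row" below are invented for illustration:

```python
PHASES = ("old_only", "dual_write", "read_new", "new_only")

class EmailStore:
    """Toy store routing reads/writes by migration phase (one row for demo)."""

    def __init__(self, phase="old_only"):
        assert phase in PHASES
        self.phase = phase
        self.row = {"email": None, "email_normalized": None}

    def write(self, email):
        normalized = email.strip().lower()
        if self.phase != "new_only":
            self.row["email"] = email                      # old column
        if self.phase != "old_only":
            self.row["email_normalized"] = normalized      # shadow column

    def read(self):
        if self.phase in ("old_only", "dual_write"):
            return self.row["email"]
        return self.row["email_normalized"]

store = EmailStore(phase="dual_write")
store.write("  Ada@Example.COM")
print(store.read())    # served from the old column: "  Ada@Example.COM"
store.phase = "read_new"
print(store.read())    # served from the shadow column: "ada@example.com"
```

Because every phase transition is a config change, rolling back a bad phase is a flag flip, not a redeploy of schema-coupled code.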
## Data Backfill Strategies
For tables with hundreds of millions of rows, backfill must be chunked:
```python
import time
import logging

log = logging.getLogger(__name__)

def backfill_preferences(batch_size=1000, sleep_between=0.1):
    """Backfill the preferences column without overwhelming the database."""
    last_id = 0
    while True:
        rows = db.execute("""
            SELECT id FROM users
            WHERE id > %s AND preferences IS NULL
            ORDER BY id
            LIMIT %s
        """, (last_id, batch_size))
        if not rows:
            break
        ids = [r['id'] for r in rows]
        db.execute("""
            UPDATE users
            SET preferences = '{}'::jsonb
            WHERE id = ANY(%s)
        """, (ids,))
        last_id = ids[-1]
        time.sleep(sleep_between)  # Throttle to reduce replica lag
    log.info(f"Backfill complete. Last ID processed: {last_id}")
```
Important backfill rules:
- Always use a cursor (WHERE id > last_id) — never OFFSET
- Sleep between batches to let replicas catch up
- Monitor replica lag and pause if it exceeds threshold
- Make it idempotent — safe to restart from any point
- Run during low traffic when possible
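The "monitor replica lag and pause" rule can be factored into a small helper that the backfill loop calls between batches. The lag probe is injected so it can wrap whatever real check you use (SHOW REPLICA STATUS, pg_stat_replication, etc.); here it is faked, and all names are invented:

```python
import time

def wait_for_replicas(get_lag_seconds, max_lag=5.0, poll_interval=0.01,
                      timeout=2.0):
    """Block until replica lag drops below max_lag; raise if it never does."""
    deadline = time.monotonic() + timeout
    while True:
        lag = get_lag_seconds()
        if lag < max_lag:
            return lag
        if time.monotonic() > deadline:
            raise TimeoutError(f"replica lag stuck at {lag:.1f}s")
        time.sleep(poll_interval)

# Fake probe: lag starts high and drains as the replica catches up
readings = iter([12.0, 8.0, 6.0, 3.0])
lag = wait_for_replicas(lambda: next(readings))
print(f"resuming backfill at {lag:.1f}s of lag")
```

In the backfill loop above, a call to `wait_for_replicas(...)` would replace the fixed `time.sleep(sleep_between)`, pausing only as long as replicas actually need.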
## Migration Testing Framework
Never run a migration in production without testing it first:
```python
class MigrationTestSuite:
    def test_migration_forward(self):
        """Apply migration and verify schema state."""
        self.apply_migration('020_add_preferences')
        assert self.column_exists('users', 'preferences')
        assert self.column_nullable('users', 'preferences')

    def test_migration_rollback(self):
        """Verify clean rollback."""
        self.apply_migration('020_add_preferences')
        self.rollback_migration('020_add_preferences')
        assert not self.column_exists('users', 'preferences')

    def test_migration_with_data(self):
        """Verify existing data survives migration."""
        self.seed_data('users', count=10000)
        self.apply_migration('020_add_preferences')
        assert self.row_count('users') == 10000
        assert self.null_count('users', 'preferences') == 10000

    def test_migration_performance(self):
        """Verify migration completes within time budget."""
        self.seed_data('users', count=1_000_000)
        start = time.time()
        self.apply_migration('020_add_preferences')
        duration = time.time() - start
        assert duration < 60, f"Migration took {duration}s (budget: 60s)"
```
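The suite above assumes helpers like `apply_migration` and `column_exists`; a minimal runnable version of the forward and rollback checks against SQLite might look like this (the migration registry and helper are invented, and SQLite 3.35+ is needed for DROP COLUMN):

```python
import sqlite3

MIGRATIONS = {
    "020_add_preferences": {
        "up": "ALTER TABLE users ADD COLUMN preferences TEXT",
        "down": "ALTER TABLE users DROP COLUMN preferences",
    },
}

def column_exists(db, table, column):
    # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
    return any(r[1] == column for r in db.execute(f"PRAGMA table_info({table})"))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)", [("a",), ("b",)])

# Forward: column appears, existing rows survive and hold NULL
db.execute(MIGRATIONS["020_add_preferences"]["up"])
assert column_exists(db, "users", "preferences")
assert db.execute("SELECT count(*) FROM users").fetchone()[0] == 2
assert db.execute(
    "SELECT count(*) FROM users WHERE preferences IS NULL").fetchone()[0] == 2

# Rollback: column disappears cleanly, data intact
db.execute(MIGRATIONS["020_add_preferences"]["down"])
assert not column_exists(db, "users", "preferences")
print("migration forward/rollback checks passed")
```

The same structure scales up: swap the in-memory database for a restored snapshot and the registry for your migration tool's files.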
### Testing with Production-Like Data
```bash
# Restore anonymized production snapshot
pg_restore --dbname=migration_test production_snapshot.dump

# Run migration
python manage.py migrate --database=migration_test

# Verify data integrity
python manage.py check_migration_integrity --database=migration_test

# Load test with migration applied
k6 run --env DB=migration_test load_test.js
```
## Migration Checklist
Before every production migration:
Pre-Migration:
- [ ] Migration tested on a production-size dataset
- [ ] Rollback script written and tested
- [ ] Backfill script is idempotent and chunked
- [ ] Application code handles both old and new schema
- [ ] Monitoring alerts configured for migration metrics

During Migration:
- [ ] Replica lag monitored (pause if > 5s)
- [ ] Connection pool utilization tracked
- [ ] Query latency dashboards open
- [ ] Rollback command ready to execute

Post-Migration:
- [ ] Data integrity verified
- [ ] Application error rates normal
- [ ] Old schema artifacts cleaned up
- [ ] Migration documented in runbook
## Key Takeaways
Zero downtime migrations are not optional for production systems:
- Use online schema change tools (gh-ost, pt-online-schema-change) for large table DDL
- Follow expand-contract to make every change reversible
- Dual-write with shadow columns to validate before committing
- Chunk backfills with cursor-based pagination and throttling
- Test migrations against production-size data before executing
The database is the hardest part of any system to change. Invest in migration tooling early — it pays dividends every time you ship a feature.
Article #319 in the Codelit engineering series. Want to master database operations, DevOps, and system design? Explore more at codelit.io.