# Zero Downtime Database Migrations: Evolve Your Schema Without Breaking Production
Every production database must evolve. New features demand new columns, renamed fields, restructured tables. The challenge is doing this while thousands of queries per second keep flowing — no maintenance windows, no downtime.
## Why Traditional Migrations Break
A naive ALTER TABLE on a large table locks it:
```sql
-- This locks the users table for minutes or hours on large datasets
ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}';
```
During that lock:
- All reads and writes to the table block
- Connection pools fill up
- Application threads wait, then timeout
- Users see 500 errors or infinite spinners
For a table with 100M rows, a simple ADD COLUMN with a default value forces a full table rewrite in MySQL before 8.0 and PostgreSQL before 11, tying the table up for 10+ minutes. (Newer versions of both databases make this particular change instant, but many other ALTERs still rewrite.)
## Online Schema Change Tools

### pt-online-schema-change (Percona)
The original tool for MySQL online DDL:
```
# How it works internally:
1. Create shadow table _users_new (copy of the users schema)
2. ALTER the shadow table (instant, since it's empty)
3. Create triggers on the original table:
   - INSERT trigger → also INSERT into _users_new
   - UPDATE trigger → also UPDATE _users_new
   - DELETE trigger → also DELETE from _users_new
4. Copy rows in chunks from users → _users_new
5. Atomic rename: users → _users_old, _users_new → users
6. Drop _users_old
```
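The steps above can be simulated end to end in a few lines of Python, with SQLite standing in for MySQL so the mechanics are runnable (table, trigger, and column names here are invented for the demo):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users (name) VALUES ('ada'), ('grace'), ('edsger');

    -- Steps 1-2: shadow table already carries the new column
    CREATE TABLE _users_new (id INTEGER PRIMARY KEY, name TEXT,
                             preferences TEXT DEFAULT '{}');

    -- Step 3: triggers mirror ongoing writes into the shadow table
    CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
        INSERT INTO _users_new (id, name) VALUES (NEW.id, NEW.name);
    END;
    CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
        UPDATE _users_new SET name = NEW.name WHERE id = NEW.id;
    END;
    CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
        DELETE FROM _users_new WHERE id = OLD.id;
    END;
""")

# Step 4: copy existing rows in chunks (chunk size 2 for the demo);
# INSERT OR IGNORE skips any rows the triggers already mirrored
last_id = 0
while True:
    rows = db.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT 2",
        (last_id,)).fetchall()
    if not rows:
        break
    db.executemany(
        "INSERT OR IGNORE INTO _users_new (id, name) VALUES (?, ?)", rows)
    last_id = rows[-1][0]

# A write arrives mid-migration; the trigger keeps the tables in sync
db.execute("INSERT INTO users (name) VALUES ('barbara')")

# Steps 5-6: swap tables and drop the old one
db.executescript("""
    DROP TRIGGER users_ins; DROP TRIGGER users_upd; DROP TRIGGER users_del;
    ALTER TABLE users RENAME TO _users_old;
    ALTER TABLE _users_new RENAME TO users;
    DROP TABLE _users_old;
""")
print(db.execute("SELECT count(*), count(preferences) FROM users").fetchone())
# → (4, 4): all rows made it across, all carry the new column
```

The INSERT OR IGNORE in the chunk copy is what makes the copy safe to interleave with trigger writes: whichever path lands a row first wins, and the other is a no-op.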
Key flags to know:
```bash
pt-online-schema-change \
  --alter "ADD COLUMN preferences JSON" \
  --chunk-size=1000 \
  --max-lag=1s \
  --check-replication-filters \
  D=mydb,t=users
```
### gh-ost (GitHub)
GitHub's triggerless online schema migration:
```
# The gh-ost approach (no triggers):
1. Create ghost table _users_gho
2. ALTER the ghost table
3. Connect to a replica as a MySQL replication client
4. Read binlog events for the original table
5. Apply binlog events to the ghost table (async)
6. Copy existing rows in chunks
7. Cut-over: brief lock for the atomic rename
```
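The key difference from the trigger approach is that writes are observed after the fact, as a stream of change events, and replayed onto the ghost copy. A toy sketch of that idea, with dicts standing in for tables and a plain list standing in for the binlog (the event format is invented):

```python
source = {1: "ada", 2: "grace", 3: "edsger"}  # original table: id -> name
ghost = {}                                     # ghost table being built
binlog = []                                    # change events seen during the copy

def write(op, row_id, value=None):
    """Application write path: mutate source and emit a binlog-style event."""
    if op in ("insert", "update"):
        source[row_id] = value
    elif op == "delete":
        del source[row_id]
    binlog.append((op, row_id, value))

# Chunked row copy of the existing rows (no triggers involved)
for row_id in sorted(source):
    ghost[row_id] = source[row_id]

# Writes that arrive during/after the copy land only in the binlog
write("update", 2, "grace hopper")
write("insert", 4, "barbara")
write("delete", 3)

# The migrator replays binlog events onto the ghost table
for op, row_id, value in binlog:
    if op in ("insert", "update"):
        ghost[row_id] = value
    else:
        ghost.pop(row_id, None)

print(ghost == source)  # → True: cut-over is safe once the ghost caught up
```

Because each event fully describes the row change, replay is idempotent: re-applying the same event stream leaves the ghost table in the same state, which is what makes gh-ost pausable and resumable.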
Advantages over pt-online-schema-change:
- No triggers — triggers cause write amplification and locking
- Pausable — stop and resume the migration at any time
- Testable — run on a replica first with `--test-on-replica`
- Throttle-aware — automatically slows down under load
```bash
gh-ost \
  --alter="ADD COLUMN preferences JSON" \
  --database=mydb \
  --table=users \
  --chunk-size=1000 \
  --max-load=Threads_running=25 \
  --critical-load=Threads_running=100 \
  --execute
```
## The Expand-Contract Pattern
The safest pattern for schema evolution across any database:
### Phase 1: Expand
Add new structures alongside existing ones:
```sql
-- Add new column (nullable, no default — instant in most databases)
ALTER TABLE users ADD COLUMN email_normalized VARCHAR(255);
```
### Phase 2: Migrate
Backfill data and update application code:
```python
# Dual-write: application writes to both columns
def update_email(user_id, email):
    normalized = email.strip().lower()
    db.execute("""
        UPDATE users
        SET email = %s, email_normalized = %s
        WHERE id = %s
    """, (email, normalized, user_id))
```
### Phase 3: Contract
Remove the old structure once migration is complete:
```sql
-- Only after verifying all reads use the new column
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users RENAME COLUMN email_normalized TO email;
```
Each phase is a separate deployment. If anything breaks, you roll back one phase — not the entire migration.
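Compressed into one runnable script (against SQLite rather than PostgreSQL, and with the article's email example; SQLite needs version 3.35+ for DROP COLUMN), the three phases look like this:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.executemany("INSERT INTO users (email) VALUES (?)",
               [("Ada@Example.COM ",), ("grace@example.com",)])

# Phase 1: Expand - add the new column, nullable, no default
db.execute("ALTER TABLE users ADD COLUMN email_normalized TEXT")

# Phase 2: Migrate - backfill, while application code dual-writes
for row_id, email in db.execute("SELECT id, email FROM users").fetchall():
    db.execute("UPDATE users SET email_normalized = ? WHERE id = ?",
               (email.strip().lower(), row_id))

# Phase 3: Contract - only after all reads use the new column
db.execute("ALTER TABLE users DROP COLUMN email")
db.execute("ALTER TABLE users RENAME COLUMN email_normalized TO email")

print(db.execute("SELECT email FROM users ORDER BY id").fetchall())
# → [('ada@example.com',), ('grace@example.com',)]
```

In production each phase would be its own deployment with verification in between; collapsing them here is only to show the schema states each phase passes through.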
## Shadow Columns and Dual-Write
Shadow columns let you validate new data structures before committing:
```python
import json

class UserService:
    def save_user(self, user_data):
        # Write to both old and new columns
        self.db.execute("""
            INSERT INTO users (name, address_text, address_json)
            VALUES (%s, %s, %s)
        """, (
            user_data['name'],
            user_data['address'],                         # old format
            json.dumps(user_data['address_structured']),  # new format
        ))

    def verify_shadow_consistency(self):
        """Compare old and new columns for data integrity."""
        mismatches = self.db.execute("""
            SELECT id, address_text, address_json
            FROM users
            WHERE address_text != address_json->>'full_address'
        """)
        return mismatches
```
The dual-write verification loop:
1. Deploy v1: write the old column only
2. Deploy v2: write both columns (shadow write)
3. Verify: run consistency checks for 24-48 hours
4. Deploy v3: read from the new column, write both
5. Verify: monitor error rates for 24-48 hours
6. Deploy v4: write the new column only, then drop the old column
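One hypothetical way to drive that sequence is a single phase flag, so each deploy flips configuration rather than rewriting code paths. The phase names and in-memory "row" below are invented for illustration:

```python
PHASES = ("old_only", "dual_write", "read_new", "new_only")

class EmailStore:
    """Toy store routing reads/writes by migration phase (one row for demo)."""

    def __init__(self, phase="old_only"):
        assert phase in PHASES
        self.phase = phase
        self.row = {"email": None, "email_normalized": None}

    def write(self, email):
        normalized = email.strip().lower()
        if self.phase != "new_only":
            self.row["email"] = email                      # old column
        if self.phase != "old_only":
            self.row["email_normalized"] = normalized      # shadow column

    def read(self):
        if self.phase in ("old_only", "dual_write"):
            return self.row["email"]
        return self.row["email_normalized"]

store = EmailStore(phase="dual_write")
store.write("  Ada@Example.COM")
print(store.read())    # served from the old column: "  Ada@Example.COM"
store.phase = "read_new"
print(store.read())    # served from the shadow column: "ada@example.com"
```

Because every phase transition is a config change, rolling back a bad phase is a flag flip, not a redeploy of schema-coupled code.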
## Data Backfill Strategies
For tables with hundreds of millions of rows, backfill must be chunked:
```python
import time
import logging

log = logging.getLogger(__name__)

def backfill_preferences(batch_size=1000, sleep_between=0.1):
    """Backfill the preferences column without overwhelming the database."""
    last_id = 0
    while True:
        rows = db.execute("""
            SELECT id FROM users
            WHERE id > %s AND preferences IS NULL
            ORDER BY id
            LIMIT %s
        """, (last_id, batch_size))
        if not rows:
            break
        ids = [r['id'] for r in rows]
        db.execute("""
            UPDATE users
            SET preferences = '{}'::jsonb
            WHERE id = ANY(%s)
        """, (ids,))
        last_id = ids[-1]
        time.sleep(sleep_between)  # Throttle to reduce replica lag
    log.info(f"Backfill complete. Last ID processed: {last_id}")
```
Important backfill rules:
- Always use a cursor (WHERE id > last_id) — never OFFSET
- Sleep between batches to let replicas catch up
- Monitor replica lag and pause if it exceeds threshold
- Make it idempotent — safe to restart from any point
- Run during low traffic when possible
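The "monitor replica lag and pause" rule can be factored into a small helper that the backfill loop calls between batches. The lag probe is injected so it can wrap whatever real check you use (SHOW REPLICA STATUS, pg_stat_replication, etc.); here it is faked, and all names are invented:

```python
import time

def wait_for_replicas(get_lag_seconds, max_lag=5.0, poll_interval=0.01,
                      timeout=2.0):
    """Block until replica lag drops below max_lag; raise if it never does."""
    deadline = time.monotonic() + timeout
    while True:
        lag = get_lag_seconds()
        if lag < max_lag:
            return lag
        if time.monotonic() > deadline:
            raise TimeoutError(f"replica lag stuck at {lag:.1f}s")
        time.sleep(poll_interval)

# Fake probe: lag starts high and drains as the replica catches up
readings = iter([12.0, 8.0, 6.0, 3.0])
lag = wait_for_replicas(lambda: next(readings))
print(f"resuming backfill at {lag:.1f}s of lag")
```

In the backfill loop above, a call to `wait_for_replicas(...)` would replace the fixed `time.sleep(sleep_between)`, pausing only as long as replicas actually need.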
## Migration Testing Framework
Never run a migration in production without testing it first:
```python
class MigrationTestSuite:
    def test_migration_forward(self):
        """Apply migration and verify schema state."""
        self.apply_migration('020_add_preferences')
        assert self.column_exists('users', 'preferences')
        assert self.column_nullable('users', 'preferences')

    def test_migration_rollback(self):
        """Verify clean rollback."""
        self.apply_migration('020_add_preferences')
        self.rollback_migration('020_add_preferences')
        assert not self.column_exists('users', 'preferences')

    def test_migration_with_data(self):
        """Verify existing data survives migration."""
        self.seed_data('users', count=10000)
        self.apply_migration('020_add_preferences')
        assert self.row_count('users') == 10000
        assert self.null_count('users', 'preferences') == 10000

    def test_migration_performance(self):
        """Verify migration completes within time budget."""
        self.seed_data('users', count=1_000_000)
        start = time.time()
        self.apply_migration('020_add_preferences')
        duration = time.time() - start
        assert duration < 60, f"Migration took {duration}s (budget: 60s)"
```
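The suite above assumes helpers like `apply_migration` and `column_exists`; a minimal runnable version of the forward and rollback checks against SQLite might look like this (the migration registry and helper are invented, and SQLite 3.35+ is needed for DROP COLUMN):

```python
import sqlite3

MIGRATIONS = {
    "020_add_preferences": {
        "up": "ALTER TABLE users ADD COLUMN preferences TEXT",
        "down": "ALTER TABLE users DROP COLUMN preferences",
    },
}

def column_exists(db, table, column):
    # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
    return any(r[1] == column for r in db.execute(f"PRAGMA table_info({table})"))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)", [("a",), ("b",)])

# Forward: column appears, existing rows survive and hold NULL
db.execute(MIGRATIONS["020_add_preferences"]["up"])
assert column_exists(db, "users", "preferences")
assert db.execute("SELECT count(*) FROM users").fetchone()[0] == 2
assert db.execute(
    "SELECT count(*) FROM users WHERE preferences IS NULL").fetchone()[0] == 2

# Rollback: column disappears cleanly, data intact
db.execute(MIGRATIONS["020_add_preferences"]["down"])
assert not column_exists(db, "users", "preferences")
print("migration forward/rollback checks passed")
```

The same structure scales up: swap the in-memory database for a restored snapshot and the registry for your migration tool's files.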
### Testing with Production-Like Data
```bash
# Restore anonymized production snapshot
pg_restore --dbname=migration_test production_snapshot.dump

# Run migration
python manage.py migrate --database=migration_test

# Verify data integrity
python manage.py check_migration_integrity --database=migration_test

# Load test with migration applied
k6 run --env DB=migration_test load_test.js
```
## Migration Checklist
Before every production migration:
Pre-Migration:
- [ ] Migration tested on a production-size dataset
- [ ] Rollback script written and tested
- [ ] Backfill script is idempotent and chunked
- [ ] Application code handles both old and new schema
- [ ] Monitoring alerts configured for migration metrics

During Migration:
- [ ] Replica lag monitored (pause if > 5s)
- [ ] Connection pool utilization tracked
- [ ] Query latency dashboards open
- [ ] Rollback command ready to execute

Post-Migration:
- [ ] Data integrity verified
- [ ] Application error rates normal
- [ ] Old schema artifacts cleaned up
- [ ] Migration documented in runbook
## Key Takeaways
Zero downtime migrations are not optional for production systems:
- Use online schema change tools (gh-ost, pt-online-schema-change) for large table DDL
- Follow expand-contract to make every change reversible
- Dual-write with shadow columns to validate before committing
- Chunk backfills with cursor-based pagination and throttling
- Test migrations against production-size data before executing
The database is the hardest part of any system to change. Invest in migration tooling early — it pays dividends every time you ship a feature.
Article #319 in the Codelit engineering series. Want to master database operations, DevOps, and system design? Explore more at codelit.io.