# Zero Downtime Migration: Strategies for Seamless Database and Service Transitions
Every minute of downtime costs money, trust, and users. Whether you are migrating a database schema, swapping a backing service, or moving between cloud providers, the goal is the same: zero downtime. This guide walks through the strategies, patterns, and tooling that make it possible.
## Why Zero-Downtime Matters
Traditional "stop-the-world" migrations lock users out while you alter tables, move data, and restart services. In a world of global traffic and SLA commitments, that approach is unacceptable:
- Revenue loss — E-commerce sites lose thousands of dollars per minute of outage.
- User trust — Repeated maintenance windows erode confidence.
- Cascading failures — Downstream services that depend on your API also go dark.
- Contractual penalties — Enterprise SLAs often mandate 99.95%+ uptime.
Zero-downtime migration eliminates the maintenance window entirely. Every change is deployed incrementally, validated, and reversible.
## Database Migration Strategies

### Expand-Contract (a.k.a. Parallel Change)
The expand-contract pattern is the backbone of zero-downtime schema evolution:
Phase 1 — EXPAND
- Add the new column / table alongside the old one.
- Deploy code that writes to BOTH old and new.

Phase 2 — MIGRATE
- Backfill existing rows into the new structure.
- Validate data integrity.

Phase 3 — CONTRACT
- Remove reads from the old column / table.
- Deploy code that writes only to the new structure.
- Drop the old column after a bake period.
Key rules:
- Never rename a column in place. Create the new column, migrate, then drop the old one.
- Never drop a column that running code still reads. Deploy the code change first.
- Keep migrations backward-compatible. The previous application version must still work after the schema change lands.
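As a concrete sketch of the expand phase, suppose a hypothetical `users` table is splitting a single `name` column into `first_name` / `last_name`. The write path targets both shapes in one statement; `FakeDB` simply records statements so the sketch stays self-contained:

```python
class FakeDB:
    """Minimal stand-in for a database cursor (illustration only)."""
    def __init__(self):
        self.statements = []

    def execute(self, sql, params):
        self.statements.append((sql, params))


def save_user(db, user_id, full_name):
    # EXPAND phase: write both the legacy `name` column and the new
    # `first_name` / `last_name` columns in the same statement, so the
    # old and new application versions both see consistent data.
    first, _, last = full_name.partition(" ")
    db.execute(
        "UPDATE users SET name = %s, first_name = %s, last_name = %s "
        "WHERE id = %s",
        (full_name, first, last, user_id),
    )
```

Because the old column keeps receiving writes, the previous app version can still be rolled out at any point during the bake period.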
### Shadow Writes
Shadow writes extend the expand phase by writing to a completely separate data store — useful when migrating between databases (e.g., PostgreSQL to DynamoDB):
```
┌────────────┐  write   ┌──────────────┐
│            │─────────▶│  Primary DB  │  (source of truth)
│  App Code  │          └──────────────┘
│            │  write   ┌──────────────┐
│            │─────────▶│  Shadow DB   │  (new target)
└────────────┘          └──────────────┘
       reads are served from the Primary only
```
- All writes go to both stores.
- Reads remain on the primary until you are confident the shadow store is consistent.
- A reconciliation job periodically compares rows and flags drift.
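The reconciliation job might look like the following minimal sketch, where plain dicts of `id → row` stand in for query results from the two stores:

```python
def reconcile(primary, shadow):
    """Return the ids whose rows differ between primary and shadow.

    `primary` and `shadow` are dicts of id -> row; real code would page
    through query results from each store instead. A row missing from
    the shadow store also counts as drift.
    """
    drift = []
    for row_id, row in primary.items():
        if shadow.get(row_id) != row:
            drift.append(row_id)
    return drift
```

Running this on a schedule and alerting when `drift` is non-empty gives the measurable consistency signal the pattern depends on.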
### Dual Reads
Once shadow writes are stable, you can introduce dual reads:
- Read from both stores in parallel.
- Return the primary result to the user.
- Log any discrepancy between primary and shadow.
- When the mismatch rate drops to zero over a sustained window, cut reads over to the new store.
This pattern gives you a concrete, measurable signal that the new store is ready.
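A minimal sketch of a dual-read wrapper, assuming hypothetical store accessors passed in as callables; note that a shadow failure is logged but never surfaces to the user:

```python
import logging

log = logging.getLogger("dual_read")


def dual_read(key, read_primary, read_shadow):
    """Read from both stores, log any discrepancy, return the primary result."""
    primary_result = read_primary(key)
    try:
        shadow_result = read_shadow(key)
        if shadow_result != primary_result:
            # The mismatch counter derived from these logs is the
            # cut-over readiness signal.
            log.warning("dual-read mismatch for %s: %r != %r",
                        key, primary_result, shadow_result)
    except Exception:
        # The shadow store must never affect the user-facing response.
        log.exception("shadow read failed for %s", key)
    return primary_result
```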
## Blue-Green Deployments for Migrations
Blue-green deployments are typically associated with application releases, but they work equally well for migration cut-overs:
```
         Load Balancer
        ┌──────┴──────┐
   ┌────▼────┐   ┌────▼────┐
   │  Blue   │   │  Green  │
   │ (old DB)│   │ (new DB)│
   └─────────┘   └─────────┘
```
- Blue runs the current schema and code.
- Green runs the new schema with the updated code.
- A replication pipeline keeps Green in sync with Blue.
- Flip the load balancer to Green.
- If anything fails, flip back to Blue within seconds.
The critical prerequisite is a reliable replication pipeline — tools like Debezium (CDC), AWS DMS, or custom Kafka consumers can fill this role.
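To illustrate the Green side of such a pipeline, here is a minimal sketch of applying change events to the new store. The event shape loosely follows Debezium's envelope (`op` codes `c`/`u`/`d` with `before`/`after` row images) but is reduced to what the example needs; a real pipeline carries far more metadata plus ordering and delivery guarantees:

```python
def apply_change_event(target, event):
    """Apply one simplified CDC-style change event to the green store.

    `target` is a dict keyed by row id, standing in for the new database.
    """
    op = event["op"]
    if op in ("c", "u"):
        # Create / update: upsert the "after" row image.
        row = event["after"]
        target[row["id"]] = row
    elif op == "d":
        # Delete: remove the row identified by the "before" image.
        target.pop(event["before"]["id"], None)
```

Replaying the full change stream this way is what keeps Green close enough to Blue that the load-balancer flip is safe.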
## Feature Flags During Migration
Feature flags let you decouple deployment from activation:
```python
if feature_flags.is_enabled("use_new_inventory_table", user):
    result = query_new_table(product_id)
else:
    result = query_old_table(product_id)
```
Benefits during migration:
- Gradual rollout — Enable the new path for 1 %, then 10 %, then 50 %, then 100 %.
- Instant rollback — Flip the flag off without a deploy.
- Targeted testing — Enable for internal users or a specific region first.
- Audit trail — Flag platforms (LaunchDarkly, Unleash, Flagsmith) log every state change.
Combine feature flags with shadow writes: the flag controls which store handles reads, while writes always go to both.
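Percentage rollouts rely on deterministic bucketing, so a given user sticks to one code path between requests. The sketch below shows the general idea (a hash-based bucket, not any specific SDK's algorithm):

```python
import hashlib


def is_enabled(flag_name, user_id, rollout_percent):
    """Deterministic percentage rollout.

    Hashing flag + user yields a stable bucket in [0, 100), so the same
    user always gets the same answer for a given flag, and raising
    `rollout_percent` only ever adds users to the enabled set.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < rollout_percent
```

Keying the hash on the flag name as well as the user means different flags roll out to different (uncorrelated) slices of the user base.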
## Data Backfill Patterns
Backfilling existing data into a new schema is often the riskiest phase. Patterns to keep it safe:
### Chunked Backfill
Process rows in small batches with a configurable delay:
```sql
-- Pseudocode: backfill in chunks of 1,000 rows.
-- (Standard SQL has no LIMIT on UPDATE, so chunk via a keyed subquery.)
LOOP
    UPDATE new_table
    SET col_x = transform(old_table.col_y)
    FROM old_table
    WHERE new_table.id = old_table.id
      AND new_table.id IN (
          SELECT id FROM new_table
          WHERE col_x IS NULL
          ORDER BY id
          LIMIT 1000
      );
    EXIT WHEN row_count = 0;
    SLEEP 100ms;  -- throttle to avoid lock contention
END LOOP;
```
### Lazy Backfill (Read-Repair)
Instead of a batch job, backfill on first access:
- Application reads the new column.
- If null, compute the value from the old column, write it, and return.
- A background sweep handles rows that are never read.
This spreads the load over time and guarantees that hot data is migrated first.
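The read-repair loop above can be sketched in a few lines, with dicts standing in for the two stores and `transform` for the hypothetical old-to-new conversion:

```python
def read_with_repair(row_id, new_store, old_store, transform):
    """Lazy backfill: serve from the new store, repairing it on a miss.

    If the new store has no value yet, compute it from the old store,
    persist it, and return it. Hot rows therefore migrate themselves on
    first access; a background sweep covers the cold remainder.
    """
    value = new_store.get(row_id)
    if value is None:
        value = transform(old_store[row_id])
        new_store[row_id] = value  # repair on first access
    return value
```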
### Event-Sourced Backfill
If your system uses event sourcing, replay the event log into the new projection. This is deterministic and inherently idempotent.
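A minimal sketch of replaying an event log into a fresh projection; the account-balance projection is a hypothetical example, and the event dicts stand in for whatever your event store emits:

```python
def replay(events, apply_event, initial=None):
    """Rebuild a projection by folding the event log from the start.

    Deterministic: the same events always produce the same state, so a
    crashed backfill can simply restart from event zero (idempotence
    comes for free because the projection is rebuilt, not patched).
    """
    state = {} if initial is None else initial
    for event in events:
        state = apply_event(state, event)
    return state


def apply_balance_event(state, event):
    # Hypothetical projection: running balance per account.
    account = event["account"]
    state[account] = state.get(account, 0) + event["delta"]
    return state
```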
## Rollback Strategies
Every migration plan needs a rollback plan that is tested before the migration begins.
| Strategy | Speed | Data Loss Risk | Complexity |
|---|---|---|---|
| Feature flag flip | Seconds | None | Low |
| Blue-green switch | Seconds | Minimal | Medium |
| Schema revert migration | Minutes | Possible | High |
| Point-in-time restore | Minutes–hours | Yes (to snapshot) | High |
Best practices:
- Always keep the old schema alive during the bake period. Dropping columns too early is the number-one cause of failed rollbacks.
- Version your migration scripts. Tools like Flyway, Alembic, and golang-migrate support reversible migrations.
- Automate rollback triggers. If error rates exceed a threshold, roll back automatically via your CI/CD pipeline.
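An automated trigger can be as simple as an error-rate check evaluated after each rollout step. In practice the counts would come from your metrics backend and the rollback would flip the flag via its API; the decision itself is a one-liner:

```python
def should_rollback(recent_errors, recent_requests, threshold=0.05):
    """Decide whether to roll back based on the windowed error rate.

    `threshold` of 0.05 means roll back above 5% errors; pick it per
    service. An empty window yields False rather than dividing by zero.
    """
    if recent_requests == 0:
        return False
    return recent_errors / recent_requests > threshold
```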
## Monitoring During Migration
You cannot safely migrate what you cannot observe. Instrument these signals:
### Application Metrics
- Error rate — by endpoint, by database call.
- Latency percentiles — p50, p95, p99 before, during, and after migration.
- Feature flag evaluation rate — confirms the rollout percentage matches expectations.
### Database Metrics
- Replication lag — critical for blue-green and shadow-write strategies.
- Lock wait time — long waits indicate schema changes are blocking production queries.
- Connection pool saturation — dual writes double the connection load.
### Data Integrity Checks
- Row count comparison — old store vs. new store.
- Checksum sampling — hash random rows and compare.
- Reconciliation job alerts — fire when mismatch count exceeds zero.
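Checksum sampling can be sketched as hashing a seeded random sample of rows; running the same function against both stores and comparing the returned digests flags drift without a full scan (dicts stand in for row fetches):

```python
import hashlib
import random


def sample_checksums(store, sample_size, seed=0):
    """Hash a deterministic random sample of rows.

    `store` is a dict of id -> row. The fixed `seed` makes both stores
    sample the same ids, so the two result dicts are directly comparable.
    """
    rng = random.Random(seed)
    ids = sorted(store)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    return {
        row_id: hashlib.sha256(repr(store[row_id]).encode()).hexdigest()
        for row_id in sample
    }
```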
### Alerting Posture
During migration, lower your alert thresholds. A 5 % latency increase that you would normally ignore could be the first sign of lock contention from a backfill job. Create a dedicated migration dashboard and have the team watch it in real time.
## Putting It All Together
A typical zero-downtime migration follows this sequence:
1. Plan — Document the old and new schemas, write reversible migration scripts, define success criteria.
2. Expand — Deploy the new schema alongside the old. Begin shadow writes.
3. Backfill — Migrate existing data in chunks. Validate with reconciliation jobs.
4. Dual read — Compare results from both stores. Monitor mismatch rate.
5. Cut over — Flip the feature flag to read from the new store. Watch dashboards.
6. Contract — Remove old code paths. Drop old columns after the bake period.
7. Celebrate — Zero users noticed.
Zero-downtime migration is not a single trick — it is a discipline that combines schema design, deployment strategy, observability, and feature management into a seamless workflow.
Plan, build, and ship with confidence at codelit.io.
This is article #165 in the Codelit engineering blog series.