# Zero Downtime Migration: Strategies for Seamless Database and Service Transitions
Every minute of downtime costs money, trust, and users. Whether you are migrating a database schema, swapping a backing service, or moving between cloud providers, the goal is the same: zero downtime. This guide walks through the strategies, patterns, and tooling that make it possible.
## Why Zero-Downtime Matters
Traditional "stop-the-world" migrations lock users out while you alter tables, move data, and restart services. In a world of global traffic and SLA commitments, that approach is unacceptable:
- Revenue loss — E-commerce sites lose thousands of dollars per minute of outage.
- User trust — Repeated maintenance windows erode confidence.
- Cascading failures — Downstream services that depend on your API also go dark.
- Contractual penalties — Enterprise SLAs often mandate 99.95%+ uptime.
Zero-downtime migration eliminates the maintenance window entirely. Every change is deployed incrementally, validated, and reversible.
## Database Migration Strategies

### Expand-Contract (a.k.a. Parallel Change)
The expand-contract pattern is the backbone of zero-downtime schema evolution:
Phase 1 — EXPAND
- Add the new column / table alongside the old one.
- Deploy code that writes to BOTH old and new.

Phase 2 — MIGRATE
- Backfill existing rows into the new structure.
- Validate data integrity.

Phase 3 — CONTRACT
- Remove reads from the old column / table.
- Deploy code that writes only to the new structure.
- Drop the old column after a bake period.
Key rules:
- Never rename a column in place. Create the new column, migrate, then drop the old one.
- Never drop a column that running code still reads. Deploy the code change first.
- Keep migrations backward-compatible. The previous application version must still work after the schema change lands.
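As a concrete sketch of the expand phase, suppose a hypothetical `users` table is splitting a single `name` column into `first_name` / `last_name`. The write path targets both shapes in one statement; `FakeDB` simply records statements so the sketch stays self-contained:

```python
class FakeDB:
    """Minimal stand-in for a database cursor (illustration only)."""
    def __init__(self):
        self.statements = []

    def execute(self, sql, params):
        self.statements.append((sql, params))


def save_user(db, user_id, full_name):
    # EXPAND phase: write both the legacy `name` column and the new
    # `first_name` / `last_name` columns in the same statement, so the
    # old and new application versions both see consistent data.
    first, _, last = full_name.partition(" ")
    db.execute(
        "UPDATE users SET name = %s, first_name = %s, last_name = %s "
        "WHERE id = %s",
        (full_name, first, last, user_id),
    )
```

Because the old column keeps receiving writes, the previous app version can still be rolled out at any point during the bake period.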
### Shadow Writes
Shadow writes extend the expand phase by writing to a completely separate data store — useful when migrating between databases (e.g., PostgreSQL to DynamoDB):
```
┌────────────┐  write   ┌──────────────┐
│            │─────────▶│  Primary DB  │  (source of truth)
│  App Code  │          └──────────────┘
│            │  write   ┌──────────────┐
│            │─────────▶│  Shadow DB   │  (new target)
└────────────┘          └──────────────┘
       reads are served from the Primary only
```
- All writes go to both stores.
- Reads remain on the primary until you are confident the shadow store is consistent.
- A reconciliation job periodically compares rows and flags drift.
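The reconciliation job might look like the following minimal sketch, where plain dicts of `id → row` stand in for query results from the two stores:

```python
def reconcile(primary, shadow):
    """Return the ids whose rows differ between primary and shadow.

    `primary` and `shadow` are dicts of id -> row; real code would page
    through query results from each store instead. A row missing from
    the shadow store also counts as drift.
    """
    drift = []
    for row_id, row in primary.items():
        if shadow.get(row_id) != row:
            drift.append(row_id)
    return drift
```

Running this on a schedule and alerting when `drift` is non-empty gives the measurable consistency signal the pattern depends on.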
### Dual Reads
Once shadow writes are stable, you can introduce dual reads:
- Read from both stores in parallel.
- Return the primary result to the user.
- Log any discrepancy between primary and shadow.
- When the mismatch rate drops to zero over a sustained window, cut reads over to the new store.
This pattern gives you a concrete, measurable signal that the new store is ready.
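A minimal sketch of a dual-read wrapper, assuming hypothetical store accessors passed in as callables; note that a shadow failure is logged but never surfaces to the user:

```python
import logging

log = logging.getLogger("dual_read")


def dual_read(key, read_primary, read_shadow):
    """Read from both stores, log any discrepancy, return the primary result."""
    primary_result = read_primary(key)
    try:
        shadow_result = read_shadow(key)
        if shadow_result != primary_result:
            # The mismatch counter derived from these logs is the
            # cut-over readiness signal.
            log.warning("dual-read mismatch for %s: %r != %r",
                        key, primary_result, shadow_result)
    except Exception:
        # The shadow store must never affect the user-facing response.
        log.exception("shadow read failed for %s", key)
    return primary_result
```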
## Blue-Green Deployments for Migrations
Blue-green deployments are typically associated with application releases, but they work equally well for migration cut-overs:
```
         Load Balancer
        ┌──────┴──────┐
   ┌────▼────┐   ┌────▼────┐
   │  Blue   │   │  Green  │
   │ (old DB)│   │ (new DB)│
   └─────────┘   └─────────┘
```
- Blue runs the current schema and code.
- Green runs the new schema with the updated code.
- A replication pipeline keeps Green in sync with Blue.
- Flip the load balancer to Green.
- If anything fails, flip back to Blue within seconds.
The critical prerequisite is a reliable replication pipeline — tools like Debezium (CDC), AWS DMS, or custom Kafka consumers can fill this role.
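To illustrate the Green side of such a pipeline, here is a minimal sketch of applying change events to the new store. The event shape loosely follows Debezium's envelope (`op` codes `c`/`u`/`d` with `before`/`after` row images) but is reduced to what the example needs; a real pipeline carries far more metadata plus ordering and delivery guarantees:

```python
def apply_change_event(target, event):
    """Apply one simplified CDC-style change event to the green store.

    `target` is a dict keyed by row id, standing in for the new database.
    """
    op = event["op"]
    if op in ("c", "u"):
        # Create / update: upsert the "after" row image.
        row = event["after"]
        target[row["id"]] = row
    elif op == "d":
        # Delete: remove the row identified by the "before" image.
        target.pop(event["before"]["id"], None)
```

Replaying the full change stream this way is what keeps Green close enough to Blue that the load-balancer flip is safe.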
## Feature Flags During Migration
Feature flags let you decouple deployment from activation:
```python
if feature_flags.is_enabled("use_new_inventory_table", user):
    result = query_new_table(product_id)
else:
    result = query_old_table(product_id)
```
Benefits during migration:
- Gradual rollout — Enable the new path for 1 %, then 10 %, then 50 %, then 100 %.
- Instant rollback — Flip the flag off without a deploy.
- Targeted testing — Enable for internal users or a specific region first.
- Audit trail — Flag platforms (LaunchDarkly, Unleash, Flagsmith) log every state change.
Combine feature flags with shadow writes: the flag controls which store handles reads, while writes always go to both.
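Percentage rollouts rely on deterministic bucketing, so a given user sticks to one code path between requests. The sketch below shows the general idea (a hash-based bucket, not any specific SDK's algorithm):

```python
import hashlib


def is_enabled(flag_name, user_id, rollout_percent):
    """Deterministic percentage rollout.

    Hashing flag + user yields a stable bucket in [0, 100), so the same
    user always gets the same answer for a given flag, and raising
    `rollout_percent` only ever adds users to the enabled set.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < rollout_percent
```

Keying the hash on the flag name as well as the user means different flags roll out to different (uncorrelated) slices of the user base.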
## Data Backfill Patterns
Backfilling existing data into a new schema is often the riskiest phase. Patterns to keep it safe:
### Chunked Backfill
Process rows in small batches with a configurable delay:
```sql
-- Pseudocode: backfill in chunks of 1,000 rows.
-- (Standard SQL has no LIMIT on UPDATE, so chunk via a keyed subquery.)
LOOP
    UPDATE new_table
    SET col_x = transform(old_table.col_y)
    FROM old_table
    WHERE new_table.id = old_table.id
      AND new_table.id IN (
          SELECT id FROM new_table
          WHERE col_x IS NULL
          ORDER BY id
          LIMIT 1000
      );
    EXIT WHEN row_count = 0;
    SLEEP 100ms;  -- throttle to avoid lock contention
END LOOP;
```
### Lazy Backfill (Read-Repair)
Instead of a batch job, backfill on first access:
- Application reads the new column.
- If null, compute the value from the old column, write it, and return.
- A background sweep handles rows that are never read.
This spreads the load over time and guarantees that hot data is migrated first.
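The read-repair loop above can be sketched in a few lines, with dicts standing in for the two stores and `transform` for the hypothetical old-to-new conversion:

```python
def read_with_repair(row_id, new_store, old_store, transform):
    """Lazy backfill: serve from the new store, repairing it on a miss.

    If the new store has no value yet, compute it from the old store,
    persist it, and return it. Hot rows therefore migrate themselves on
    first access; a background sweep covers the cold remainder.
    """
    value = new_store.get(row_id)
    if value is None:
        value = transform(old_store[row_id])
        new_store[row_id] = value  # repair on first access
    return value
```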
### Event-Sourced Backfill
If your system uses event sourcing, replay the event log into the new projection. This is deterministic and inherently idempotent.
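A minimal sketch of replaying an event log into a fresh projection; the account-balance projection is a hypothetical example, and the event dicts stand in for whatever your event store emits:

```python
def replay(events, apply_event, initial=None):
    """Rebuild a projection by folding the event log from the start.

    Deterministic: the same events always produce the same state, so a
    crashed backfill can simply restart from event zero (idempotence
    comes for free because the projection is rebuilt, not patched).
    """
    state = {} if initial is None else initial
    for event in events:
        state = apply_event(state, event)
    return state


def apply_balance_event(state, event):
    # Hypothetical projection: running balance per account.
    account = event["account"]
    state[account] = state.get(account, 0) + event["delta"]
    return state
```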
## Rollback Strategies
Every migration plan needs a rollback plan that is tested before the migration begins.
| Strategy | Speed | Data Loss Risk | Complexity |
|---|---|---|---|
| Feature flag flip | Seconds | None | Low |
| Blue-green switch | Seconds | Minimal | Medium |
| Schema revert migration | Minutes | Possible | High |
| Point-in-time restore | Minutes–hours | Yes (to snapshot) | High |
Best practices:
- Always keep the old schema alive during the bake period. Dropping columns too early is the number-one cause of failed rollbacks.
- Version your migration scripts. Tools like Flyway, Alembic, and golang-migrate support reversible migrations.
- Automate rollback triggers. If error rates exceed a threshold, roll back automatically via your CI/CD pipeline.
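An automated trigger can be as simple as an error-rate check evaluated after each rollout step. In practice the counts would come from your metrics backend and the rollback would flip the flag via its API; the decision itself is a one-liner:

```python
def should_rollback(recent_errors, recent_requests, threshold=0.05):
    """Decide whether to roll back based on the windowed error rate.

    `threshold` of 0.05 means roll back above 5% errors; pick it per
    service. An empty window yields False rather than dividing by zero.
    """
    if recent_requests == 0:
        return False
    return recent_errors / recent_requests > threshold
```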
## Monitoring During Migration
You cannot safely migrate what you cannot observe. Instrument these signals:
### Application Metrics
- Error rate — by endpoint, by database call.
- Latency percentiles — p50, p95, p99 before, during, and after migration.
- Feature flag evaluation rate — confirms the rollout percentage matches expectations.
### Database Metrics
- Replication lag — critical for blue-green and shadow-write strategies.
- Lock wait time — long waits indicate schema changes are blocking production queries.
- Connection pool saturation — dual writes double the connection load.
### Data Integrity Checks
- Row count comparison — old store vs. new store.
- Checksum sampling — hash random rows and compare.
- Reconciliation job alerts — fire when mismatch count exceeds zero.
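Checksum sampling can be sketched as hashing a seeded random sample of rows; running the same function against both stores and comparing the returned digests flags drift without a full scan (dicts stand in for row fetches):

```python
import hashlib
import random


def sample_checksums(store, sample_size, seed=0):
    """Hash a deterministic random sample of rows.

    `store` is a dict of id -> row. The fixed `seed` makes both stores
    sample the same ids, so the two result dicts are directly comparable.
    """
    rng = random.Random(seed)
    ids = sorted(store)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    return {
        row_id: hashlib.sha256(repr(store[row_id]).encode()).hexdigest()
        for row_id in sample
    }
```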
### Alerting Posture
During migration, lower your alert thresholds. A 5 % latency increase that you would normally ignore could be the first sign of lock contention from a backfill job. Create a dedicated migration dashboard and have the team watch it in real time.
## Putting It All Together
A typical zero-downtime migration follows this sequence:
1. Plan — Document the old and new schemas, write reversible migration scripts, define success criteria.
2. Expand — Deploy the new schema alongside the old. Begin shadow writes.
3. Backfill — Migrate existing data in chunks. Validate with reconciliation jobs.
4. Dual read — Compare results from both stores. Monitor mismatch rate.
5. Cut over — Flip the feature flag to read from the new store. Watch dashboards.
6. Contract — Remove old code paths. Drop old columns after the bake period.
7. Celebrate — Zero users noticed.
Zero-downtime migration is not a single trick — it is a discipline that combines schema design, deployment strategy, observability, and feature management into a seamless workflow.
Plan, build, and ship with confidence at codelit.io.
This is article #165 in the Codelit engineering blog series.