Blue-Green Deployment: Zero-Downtime Releases With Instant Rollback
Blue-Green Deployment#
Blue-green deployment eliminates downtime and deployment anxiety by running two identical production environments and switching traffic between them instantly.
How Blue-Green Works#
You maintain two environments:
Blue (current live) ← all traffic
Green (idle) ← no traffic
Deployment steps:
- Deploy new version to Green (no users affected)
- Run smoke tests against Green
- Switch traffic from Blue to Green
- Green is now live, Blue is idle
- If anything goes wrong, switch back to Blue instantly
Before switch:
  [Load Balancer] → Blue  (v1.0) ← live
                    Green (v1.1) ← staging
After switch:
  [Load Balancer] → Green (v1.1) ← live
                    Blue  (v1.0) ← rollback ready
The key insight: deployment and release are separate events. You deploy to Green without affecting anyone. You release by flipping the switch.
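To make the deploy/release split concrete, here is a minimal in-memory sketch. The `BlueGreenRouter` class and its method names are hypothetical and stand in for whatever actually routes your traffic; it only models which environment is live and which version each one holds.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Toy model of a router in front of two environments (hypothetical)."""
    versions: dict = field(default_factory=lambda: {"blue": "v1.0", "green": None})
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # Deploy: only the idle environment changes; live traffic is untouched.
        self.versions[self.idle] = version

    def release(self, smoke_test) -> bool:
        # Release: flip traffic only if the idle environment passes its checks.
        if not smoke_test(self.versions[self.idle]):
            return False
        self.live = self.idle
        return True

    def rollback(self) -> None:
        # Rollback is the same flip in the other direction.
        self.live = self.idle

    def serving(self) -> str:
        return self.versions[self.live]


router = BlueGreenRouter()
router.deploy("v1.1")                      # users still get v1.0
router.release(lambda v: v is not None)    # users now get v1.1
router.rollback()                          # users are back on v1.0
```

Note that `deploy` never touches `live`. That separation is the whole point: a bad deploy cannot reach users until `release` is called.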
DNS vs Load Balancer Switching#
DNS Switching#
Update the DNS record to point to the new environment's IP:
Before: app.example.com → 10.0.1.100 (Blue)
After: app.example.com → 10.0.2.100 (Green)
Pros:
- Simple to understand
- Works with any infrastructure
Cons:
- DNS TTL propagation — clients cache the old IP for minutes to hours
- No instant rollback — switching back has the same propagation delay
- Split traffic — during propagation, some users hit Blue, others hit Green
DNS switching is rarely used in practice for blue-green due to the propagation problem.
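A rough way to reason about the split-traffic window is sketched below. It assumes every resolver honors the TTL exactly and that cache refreshes were spread evenly over the TTL before the change. Neither holds perfectly in practice (some resolvers and clients cache longer), so treat the result as a lower bound on how long Blue keeps receiving traffic. The function name is hypothetical.

```python
def fraction_on_old_ip(seconds_since_change: float, ttl_seconds: float) -> float:
    """Estimated share of clients still resolving to the old (Blue) IP."""
    if ttl_seconds <= 0:
        return 0.0
    remaining = 1.0 - seconds_since_change / ttl_seconds
    # Clamp to [0, 1]: nobody is "more than fully" cached, and after one TTL
    # every well-behaved cache has expired.
    return min(1.0, max(0.0, remaining))
```

With a 300-second TTL, this model says about half of the clients are still on Blue 150 seconds after the record changes, and a rollback restarts the same clock.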
Load Balancer Switching (Recommended)#
Update the load balancer's target group:
Before: ALB → Target Group A (Blue instances)
After: ALB → Target Group B (Green instances)
Pros:
- Instant switch — takes effect in seconds
- Instant rollback — flip back just as fast
- No client caching — load balancer controls routing
- Health checks — LB verifies Green is healthy before sending traffic
This is the standard approach. AWS ALB, nginx, HAProxy, and cloud load balancers all support this.
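On an AWS ALB, the switch can be sketched as a single `modify_listener` call. The helper name `switch_traffic` and the ARN values are placeholders; the client is passed in so the function can be exercised without AWS credentials. This only changes the listener's default action, so listeners with additional rules would need those rules updated as well.

```python
def switch_traffic(elbv2_client, listener_arn: str, target_group_arn: str) -> None:
    """Point the listener's default action at a single target group."""
    elbv2_client.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   client = boto3.client("elbv2")
#   switch_traffic(client, LISTENER_ARN, GREEN_TARGET_GROUP_ARN)   # release
#   switch_traffic(client, LISTENER_ARN, BLUE_TARGET_GROUP_ARN)    # rollback
```

Release and rollback are the same call with a different target group, which is why rollback is as fast as the switch itself.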
Database Migration Challenges#
Blue-green is straightforward for stateless services. Databases make it hard.
The Problem#
Blue (v1) uses schema v1
Green (v2) needs schema v2
Both point to the same database
If you migrate the schema for Green, Blue breaks. If you don't migrate, Green can't run.
Solution: Expand and Contract#
Split schema changes into backward-compatible steps:
Phase 1 — Expand (before switch):
-- Add new column, keep old one
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Backfill new column
UPDATE users SET full_name = first_name || ' ' || last_name;
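Both Blue (v1) and Green (v2) work with this schema. Blue ignores full_name; Green uses it. Rows that Blue writes after the backfill will have an empty full_name, so re-run the backfill just before the switch or keep the columns in sync with a trigger. For rollback to stay safe, Green should also keep writing first_name and last_name until the contract phase.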
Phase 2 — Switch traffic to Green.
Phase 3 — Contract (after switch, Blue is idle):
-- Safe only once rollback to Blue (v1) is no longer needed
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
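The expand phase can be tried end to end with an in-memory SQLite database. This is a sketch under assumptions: the `v1_read`, `v2_read`, and `v2_write` helpers are hypothetical stand-ins for the Blue and Green application code, and the dual write in `v2_write` is an added precaution so that Blue still sees complete rows after a rollback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO users (first_name, last_name) VALUES ('Ada', 'Lovelace')")

# Expand: add the new column and backfill it; the old columns stay in place.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
conn.execute("UPDATE users SET full_name = first_name || ' ' || last_name")


def v1_read(conn):
    # Blue (v1) only knows the old columns and never selects full_name.
    return conn.execute(
        "SELECT first_name, last_name FROM users ORDER BY id"
    ).fetchall()


def v2_read(conn):
    # Green (v2) reads the new column.
    return conn.execute("SELECT full_name FROM users ORDER BY id").fetchall()


def v2_write(conn, first, last):
    # Green writes both representations so Blue stays usable for rollback.
    conn.execute(
        "INSERT INTO users (first_name, last_name, full_name) VALUES (?, ?, ?)",
        (first, last, f"{first} {last}"),
    )


v2_write(conn, "Grace", "Hopper")
```

After this runs, both `v1_read(conn)` and `v2_read(conn)` return two rows, which is the property the expand phase is meant to guarantee.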
Rules for Safe Migrations#
- Never rename columns — add new, migrate data, drop old
- Never drop columns in the expand phase
- Never change column types directly — add new column with new type
- Always make migrations reversible until the contract phase
Rollback#
Rollback is blue-green's superpower. Since the old environment is still running:
Green (v1.1) has a bug → switch traffic back to Blue (v1.0)
Total rollback time: seconds, not minutes or hours.
When Rollback Gets Complicated#
Rollback is instant for application code. But if Green has written data to the database in a new format:
Green (v1.1) writes orders with new "priority" field
Rollback to Blue (v1.0) which doesn't know about "priority"
Mitigations:
- Design for backward compatibility — old code should ignore unknown fields
- Keep a rollback window — monitor for 15-30 minutes before decommissioning Blue
- Feature flags — disable new features without full rollback
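The first mitigation, old code ignoring unknown fields, might look like the following in the v1.0 code path. `OrderV1` and `parse_order_v1` are hypothetical names; the idea is simply to keep the fields v1.0 understands and drop the rest instead of failing.

```python
from dataclasses import dataclass, fields


@dataclass
class OrderV1:
    id: int
    total: float


def parse_order_v1(record: dict) -> OrderV1:
    """Build a v1 order from a stored record, dropping fields v1 does not know."""
    known = {f.name for f in fields(OrderV1)}
    return OrderV1(**{k: v for k, v in record.items() if k in known})
```

A record written by Green such as `{"id": 7, "total": 19.99, "priority": "high"}` parses cleanly on Blue; the "priority" value is ignored rather than lost, because it is still in the stored record.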
Smoke Testing#
Before switching traffic, validate Green thoroughly:
Deploy to Green
→ Health check endpoints respond 200
→ Core API endpoints return valid data
→ Database connectivity verified
→ External service integrations working
→ Performance baseline met (response time, error rate)
→ Critical user flows pass (login, checkout, etc.)
Switch traffic to Green
Automated Smoke Test Example#
# smoke-tests.yml
tests:
  - name: Health check
    url: https://green.internal/health
    expect_status: 200
  - name: API response
    url: https://green.internal/api/v1/status
    expect_status: 200
    expect_body_contains: "ok"
  - name: Database connectivity
    url: https://green.internal/api/v1/db-check
    expect_status: 200
    max_response_time_ms: 500
  - name: Auth flow
    url: https://green.internal/api/v1/auth/test
    method: POST
    expect_status: 200
If any smoke test fails, abort the switch. Green stays idle, Blue continues serving.
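The smoke-tests.yml format above is not tied to a specific tool, so here is one possible runner for it, using only the Python standard library. It assumes the file has already been parsed into a list of dicts with the same keys (`name`, `url`, `method`, `expect_status`, `expect_body_contains`, `max_response_time_ms`). The function names are hypothetical, and the fetcher is injectable so the gate logic can be tested without a live Green environment.

```python
import time
import urllib.error
import urllib.request


def http_fetch(url, method="GET", timeout=5.0):
    """Return (status_code, body_text, elapsed_ms) for one request."""
    request = urllib.request.Request(url, method=method)
    started = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise; treat them as a normal result to compare.
        status = err.code
        body = err.read().decode("utf-8", "replace")
    elapsed_ms = (time.monotonic() - started) * 1000
    return status, body, elapsed_ms


def run_smoke_tests(tests, fetch=http_fetch):
    """Return the names of failed tests; an empty list means safe to switch."""
    failures = []
    for test in tests:
        try:
            status, body, elapsed_ms = fetch(test["url"], test.get("method", "GET"))
        except OSError:
            # Connection refused, DNS failure, timeout, and similar errors.
            failures.append(test["name"])
            continue
        ok = status == test["expect_status"]
        if "expect_body_contains" in test:
            ok = ok and test["expect_body_contains"] in body
        if "max_response_time_ms" in test:
            ok = ok and elapsed_ms <= test["max_response_time_ms"]
        if not ok:
            failures.append(test["name"])
    return failures
```

A deployment script would call `run_smoke_tests(tests)` and proceed to the traffic switch only when the returned list is empty.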
Tools#
AWS CodeDeploy#
Native blue-green support with ALB integration:
# appspec.yml
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:...:task-definition/app:2"
        LoadBalancerInfo:
          ContainerName: "app"
          ContainerPort: 8080
Hooks:
  - BeforeAllowTraffic: "LambdaSmokeTest"
  - AfterAllowTraffic: "LambdaValidation"
CodeDeploy handles traffic shifting and health checks, and rolls back automatically on failure if automatic rollback is enabled on the deployment group.
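The hook Lambdas named in the appspec have to report their result back to CodeDeploy, otherwise the deployment waits until the hook times out. A minimal sketch of such a handler follows. `make_hook_handler` and `check` are hypothetical names; the client is expected to be a boto3 CodeDeploy client, and the event keys are the ones CodeDeploy passes to lifecycle hook functions.

```python
def make_hook_handler(codedeploy_client, check):
    """Build a Lambda handler that reports a check result back to CodeDeploy."""

    def handler(event, context):
        try:
            passed = bool(check())
        except Exception:
            # Any error inside the check counts as a failed hook.
            passed = False
        status = "Succeeded" if passed else "Failed"
        codedeploy_client.put_lifecycle_event_hook_execution_status(
            deploymentId=event["DeploymentId"],
            lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
            status=status,
        )
        return status

    return handler


# Usage inside the Lambda module, assuming boto3 is available:
#   import boto3
#   handler = make_hook_handler(boto3.client("codedeploy"), my_smoke_check)
```

Reporting "Failed" from the BeforeAllowTraffic hook stops the deployment before any production traffic reaches the new task set.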
Kubernetes#
Kubernetes doesn't natively support blue-green, but you can implement it with Services:
# Switch by updating the selector
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: green  # Change to "blue" for rollback
  ports:
    - port: 80
      targetPort: 8080
Deploy the green version as a separate Deployment, update the Service selector, done.
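Rather than editing the manifest by hand, the selector change can be scripted. The sketch below only builds the `kubectl patch` command; running it is left to the caller, for example with `subprocess.run(cmd, check=True)`. The function name is hypothetical, and it assumes the pods of each Deployment carry a `version: blue` or `version: green` label as in the Service above.

```python
import json


def selector_patch_command(service: str, color: str, namespace: str = "default") -> list:
    """Build a kubectl command that repoints the Service at one color."""
    if color not in ("blue", "green"):
        raise ValueError(f"unknown color: {color!r}")
    # Only the version key is patched; the app selector is left as it is.
    patch = {"spec": {"selector": {"version": color}}}
    return [
        "kubectl", "patch", "service", service,
        "--namespace", namespace,
        "--patch", json.dumps(patch),
    ]
```

Calling `selector_patch_command("my-app", "green")` gives the release command, and the same call with `"blue"` gives the rollback.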
Argo Rollouts#
Purpose-built for advanced deployment strategies on Kubernetes:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
      scaleDownDelaySeconds: 600
Argo Rollouts adds: preview services, automated analysis, manual promotion gates, and configurable scale-down delays.
Cost Considerations#
Blue-green means double the infrastructure during deployment:
Always running: 1x environment (Blue OR Green)
During deploy: 2x environments (Blue AND Green)
After switch: 1x environment + idle standby
Reducing Cost#
- Scale down idle environment — keep it at minimum capacity, scale up before next deployment
- Use spot/preemptible instances for the idle environment
- Containerize — spin up Green only when deploying, tear it down after validation
- Serverless — with Lambda or Cloud Functions, you only pay for invocations, so idle Green costs nothing
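The effect of scaling down the idle environment can be estimated with a little arithmetic. This is a simplified model, not a pricing calculator: it assumes cost scales linearly with capacity, that the second environment runs at full size only during deploy windows, and it uses 730 hours as an average month. All names and figures are illustrative.

```python
def monthly_cost(env_cost: float, deploy_hours: float, idle_fraction: float,
                 hours_in_month: float = 730.0) -> float:
    """Cost of one live environment plus a second one that is full size during
    deploy windows and scaled to idle_fraction of full size the rest of the time."""
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    if not 0.0 <= deploy_hours <= hours_in_month:
        raise ValueError("deploy_hours must fit within the month")
    deploy_share = deploy_hours / hours_in_month
    second_env = env_cost * (deploy_share + (1.0 - deploy_share) * idle_fraction)
    return env_cost + second_env
```

Under these assumptions, an environment that costs 1000 per month with an always-on full-size standby (`idle_fraction=1.0`) comes to 2000, while tearing the standby down completely outside 73 hours of deploy windows (`idle_fraction=0.0`) comes to about 1100.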
Cost vs Other Strategies#
Blue-Green: 2x infra during deploy, instant rollback
Rolling Update: 1x infra, slower rollback
Canary: 1.01x infra, gradual rollback
Blue-green costs more but gives you the fastest, most reliable rollback.
When to Use Blue-Green#
Good fit:
- Applications that need zero-downtime deployments
- Systems where fast rollback is critical (e-commerce, fintech)
- Stateless services or services with backward-compatible schemas
Poor fit:
- Databases with complex schema migrations (consider canary instead)
- Very large infrastructures where doubling cost is prohibitive
- Systems with heavy local state or sticky sessions
Key Takeaways#
- Two identical environments, switch traffic instantly between them
- Load balancer switching over DNS — instant, no propagation delay
- Expand-and-contract for database migrations — never break backward compatibility
- Rollback in seconds by switching back to the old environment
- Smoke test Green before switching — health checks, API validation, performance baseline
- Use Argo Rollouts on Kubernetes for automated analysis and promotion gates