Canary Deployments: Ship to 1% Before You Ship to Everyone
A canary deployment routes a small percentage of traffic to the new version while the rest stays on the old one. If metrics look good, you increase the percentage. If they don't, you roll back — and only 1% of users were ever affected.
Canary vs Blue-Green#
Both strategies reduce deployment risk, but they work differently:
Blue-Green:
  [Load Balancer] ──100%──▶ [Blue (current)]
                  ───0%───▶ [Green (new)]
  Flip: 0% / 100% instantly

Canary:
  [Load Balancer] ──99%──▶ [v1 (current)]
                  ──1%───▶ [v2 (canary)]
  Gradual: 1% → 5% → 25% → 100%
| Aspect | Blue-Green | Canary |
|---|---|---|
| Traffic shift | All-at-once | Gradual |
| Blast radius | 100% if broken | 1-5% initially |
| Infrastructure | 2x capacity needed | Minimal extra capacity |
| Rollback speed | Instant (flip back) | Instant (shift to 0%) |
| Confidence building | None — binary | High — observe at each step |
| Complexity | Low | Medium-High |
Blue-green is simpler but gives no confidence window. Canary is the better choice when you need observable proof that the new version works before full rollout.
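The blast-radius row in the table is simple arithmetic: if a release breaks every request it serves, the number of users affected scales directly with the traffic share. A quick sketch (the 1M-user figure is purely hypothetical):

```python
def affected_users(total_users: int, traffic_share: float) -> float:
    """Upper bound on users hit by a fully broken release at a given split."""
    return total_users * traffic_share

# For a hypothetical 1M-user service:
# a blue-green flip exposes everyone; a 1% canary exposes ~10,000 users
blue_green = affected_users(1_000_000, 1.00)
canary = affected_users(1_000_000, 0.01)
```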
Traffic Splitting Strategy#
The standard progression:
Stage 1: 1% traffic → watch for 10 minutes
Stage 2: 5% traffic → watch for 15 minutes
Stage 3: 25% traffic → watch for 30 minutes
Stage 4: 50% traffic → watch for 30 minutes
Stage 5: 100% traffic → deployment complete
Why These Percentages Matter#
- 1% catches catastrophic failures (crashes, 5xx spikes) with minimal user impact
- 5% surfaces performance regressions visible under light load
- 25% reveals issues that only appear at moderate scale (connection pool exhaustion, cache contention)
- 50% validates behavior under near-production load distribution
- 100% full rollout — the canary is now production
Each stage should have a minimum bake time — the shortest duration you wait before promoting, even if metrics look perfect. This catches slow-building issues like memory leaks.
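The stage plan and bake times above can be sketched as a promotion loop. This is a hypothetical controller, not any specific tool's API; the `set_weight` and `metrics_healthy` hooks stand in for your traffic layer and metric gates:

```python
import time

# Stage plan from the progression above: (canary traffic %, bake minutes)
STAGES = [(1, 10), (5, 15), (25, 30), (50, 30), (100, 0)]

def run_canary(set_weight, metrics_healthy, sleep=time.sleep):
    """Promote through STAGES, waiting out each bake time before checking.

    set_weight(pct)   -- hypothetical hook routing pct% of traffic to the canary
    metrics_healthy() -- hypothetical hook returning True if all gates pass
    Returns True on full rollout, False if any stage triggered a rollback.
    """
    for weight, bake_minutes in STAGES:
        set_weight(weight)
        sleep(bake_minutes * 60)   # minimum bake time, even if metrics look perfect
        if not metrics_healthy():
            set_weight(0)          # rollback: shift all traffic back to stable
            return False
    return True
```

Injecting `sleep` makes the loop testable without waiting out real bake times.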
Metrics to Monitor#
Your canary is only as good as the metrics you watch:
Primary Metrics (Automated Gates)#
Latency:
  p50 canary vs baseline: delta must be less than 10%
  p99 canary vs baseline: delta must be less than 25%

Error Rate:
  5xx rate on the canary: must be less than 0.5% (absolute)
  Error rate delta: must be less than 0.1% above baseline

Throughput:
  Requests per second should be proportional to the traffic split;
  significant deviation suggests a routing issue.
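The latency and error-rate gates are straightforward to encode. A sketch, assuming canary and baseline metrics arrive as dicts (the field names are illustrative):

```python
def gates_pass(canary, baseline):
    """Apply the primary automated gates. Inputs are dicts with 'p50' and
    'p99' latency plus 'error_rate' as a fraction (0.005 == 0.5%)."""
    p50_delta = (canary["p50"] - baseline["p50"]) / baseline["p50"]
    p99_delta = (canary["p99"] - baseline["p99"]) / baseline["p99"]
    return (
        p50_delta < 0.10                                           # p50 delta under 10%
        and p99_delta < 0.25                                       # p99 delta under 25%
        and canary["error_rate"] < 0.005                           # absolute 5xx under 0.5%
        and canary["error_rate"] - baseline["error_rate"] < 0.001  # delta under 0.1%
    )
```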
Secondary Metrics (Manual Review)#
- CPU and memory utilization trends
- Downstream service error rates
- Database query latency changes
- Queue depth and processing lag
- Business metrics (conversion rate, checkout completion)
Custom Metrics#
Define domain-specific canary metrics:
- E-commerce: cart abandonment rate, payment success rate
- Streaming: buffering ratio, playback start time
- SaaS: API response time by endpoint, webhook delivery rate
Automated Canary Analysis with Kayenta#
Manual observation doesn't scale. Kayenta (by Netflix/Google) automates the statistical comparison between canary and baseline.
How Kayenta Works#
                     ┌──────────────────────┐
Metrics Store  ────▶ │       Kayenta        │ ────▶ Pass / Fail Score
(Prometheus,         │ 1. Fetch canary      │
 Datadog,            │    metrics           │
 Stackdriver)        │ 2. Fetch baseline    │
                     │    metrics           │
                     │ 3. Statistical       │
                     │    comparison        │
                     │ 4. Score (0-100)     │
                     └──────────────────────┘
Kayenta uses the Mann-Whitney U test to compare metric distributions. A score of 0-100 determines pass/fail:
- Score above 75: Canary is healthy, promote to next stage
- Score 50-75: Marginal, extend bake time
- Score below 50: Canary is degraded, trigger rollback
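The score bands above map to a small decision function (a sketch of the gating logic, not Kayenta's actual API):

```python
def classify_score(score):
    """Map a 0-100 canary score to an action using the bands above."""
    if score > 75:
        return "promote"       # healthy: advance to the next stage
    if score >= 50:
        return "extend-bake"   # marginal: gather more data
    return "rollback"          # degraded: shift traffic back to stable
```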
Configuration Example#
canaryConfig:
  metrics:
    - name: error-rate
      query: "rate(http_requests_total{status=~'5..'}[5m])"
      direction: increase   # higher is worse
      nanStrategy: replace
    - name: latency-p99
      query: "histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m]))"
      direction: increase
  thresholds:
    pass: 75
    marginal: 50
  lifetime: 30m
  analysisInterval: 5m
Rollback Triggers#
Define rollback triggers in two tiers: immediate triggers fire on the spot, delayed triggers fire only after sustained degradation.

Immediate Rollback:
├── Error rate exceeds 5% (absolute)
├── p99 latency exceeds 3x baseline
├── Pod crash loop detected
├── More than 3 consecutive health check failures
└── Kayenta score below 40

Delayed Rollback (after a grace period):
├── Error rate exceeds 1% for more than 5 minutes
├── Memory usage trending upward continuously
└── Kayenta score below 60 for 2 consecutive analyses
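A sketch of how both tiers might be evaluated on each analysis interval. The metric field names are illustrative, not from any specific tool:

```python
def should_rollback(now_min, snapshot, history):
    """Evaluate both trigger tiers. `snapshot` is the current metrics dict;
    `history` is a list of (minute, metrics) samples, newest last."""
    # Immediate triggers: any one fires a rollback on the spot
    if snapshot["error_rate"] > 0.05:
        return "immediate: error rate above 5%"
    if snapshot["p99"] > 3 * snapshot["baseline_p99"]:
        return "immediate: p99 above 3x baseline"
    if snapshot.get("crash_looping"):
        return "immediate: pod crash loop"
    if snapshot["health_failures"] > 3:
        return "immediate: consecutive health check failures"
    if snapshot["kayenta_score"] < 40:
        return "immediate: Kayenta score below 40"

    # Delayed triggers: sustained degradation over a grace period
    recent = [m for t, m in history if now_min - t <= 5]
    if recent and all(m["error_rate"] > 0.01 for m in recent):
        return "delayed: error rate above 1% for 5+ minutes"
    last_two = [m["kayenta_score"] for _, m in history[-2:]]
    if len(last_two) == 2 and all(s < 60 for s in last_two):
        return "delayed: Kayenta score below 60 twice"
    return None
```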
Rollback Mechanics#
Rollback triggered:
1. Shift 100% traffic back to stable version
2. Scale down canary pods
3. Send alert to on-call and deployment channel
4. Mark release as failed in deployment tracker
5. Preserve canary pods for debugging (optional)
The rollback must be faster than the failure. If your rollback takes 5 minutes but your error budget burns in 2 minutes, you need a faster mechanism (pre-provisioned stable pods, instant traffic shift via service mesh).
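The five steps can be sketched as a single function. The `mesh`, `k8s`, `alerts`, and `tracker` clients are hypothetical stand-ins for your service mesh, cluster API, paging system, and deployment tracker:

```python
def execute_rollback(mesh, k8s, alerts, tracker, keep_canary_pods=False):
    """Run the rollback sequence in order; the traffic shift comes first."""
    mesh.set_canary_weight(0)                      # 1. all traffic back to stable
    if not keep_canary_pods:
        k8s.scale("myapp-canary", replicas=0)      # 2. scale down canary pods
    alerts.page("canary rollback triggered")       # 3. alert on-call and channel
    tracker.mark_failed("myapp")                   # 4. mark release as failed
    # 5. with keep_canary_pods=True, the pods stay up for debugging
```

The traffic shift leads because it is the only step that actually stops user impact; everything after it is cleanup and bookkeeping.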
Tools for Canary Deployments#
Argo Rollouts#
Kubernetes-native progressive delivery controller:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: { duration: 10m }
        - setWeight: 5
        - pause: { duration: 15m }
        - setWeight: 25
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: { duration: 30m }
        - setWeight: 100
      canaryService: myapp-canary
      stableService: myapp-stable
Flagger#
Works with Istio, Linkerd, App Mesh, and NGINX:
apiVersion: flagger.app/v1beta1
kind: Canary
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 600
  analysis:
    interval: 1m
    threshold: 5      # max failed checks before rollback
    maxWeight: 50
    stepWeight: 10    # increase weight by 10% each interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500    # milliseconds
        interval: 1m
Istio (Service Mesh)#
Fine-grained traffic splitting at the network level:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  hosts:
    - myapp
  http:
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 95
        - destination:
            host: myapp
            subset: canary
          weight: 5
Istio enables header-based routing for internal canary testing before any public traffic:
Route Rule: if header "x-canary: true" → canary pods
else → stable pods
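The rule reduces to a small routing function. A sketch of the logic only; in practice Istio evaluates this in the sidecar proxy, not in your application:

```python
import random

def pick_subset(headers, canary_weight=0.0):
    """Return which subset serves this request: forced canary via the
    test header, otherwise a weighted coin flip matching the traffic split."""
    if headers.get("x-canary") == "true":
        return "canary"
    return "canary" if random.random() * 100 < canary_weight else "stable"
```

With `canary_weight=0`, only requests carrying the header ever reach the canary, which is exactly the internal-testing phase before any public traffic shifts.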
Progressive Delivery Pipeline#
The full lifecycle:
Code Merge
    │
    ▼
Build + Test (CI)
    │
    ▼
Deploy Canary (1 pod)
    │
    ▼
Shift 1% traffic ──▶ Analyze (10 min) ──▶ Pass? ──No──▶ Rollback
    │ Yes
    ▼
Shift 5% traffic ──▶ Analyze (15 min) ──▶ Pass? ──No──▶ Rollback
    │ Yes
    ▼
Shift 25% traffic ──▶ Analyze (30 min) ──▶ Pass? ──No──▶ Rollback
    │ Yes
    ▼
Shift 50% traffic ──▶ Analyze (30 min) ──▶ Pass? ──No──▶ Rollback
    │ Yes
    ▼
Shift 100% ──▶ Scale down old version ──▶ Done
Key Takeaways#
- Start at 1% — catch catastrophic failures with minimal blast radius
- Automate analysis — use Kayenta or built-in tool analysis, not human eyeballs
- Define rollback triggers upfront — don't decide thresholds during an incident
- Bake time matters — memory leaks and slow degradations need time to surface
- Canary complements, not replaces — still run tests, still do code review
- Monitor business metrics — a technically healthy canary can still hurt conversion rates