Blue-Green and Canary deploy
1) Purpose and key ideas
Blue-Green and Canary are zero-downtime release strategies that reduce rollout risk:
- Blue-Green: keep two parallel environments (Blue is active, Green is new) and switch traffic atomically. Rollback is fast: switch traffic back to Blue.
- Canary: expose the new version in stages (1% → 5% → 25% → 50% → 100%), monitor SLO metrics, and stop or roll back on degradation.
The general principle is to separate "artifact delivery" from "traffic exposure" and to automate observability and rollbacks.
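The staged Canary flow described above can be sketched as a small control loop. This is a minimal illustration, not a production controller: `set_weight` and `is_healthy` are hypothetical callbacks standing in for the traffic router and the SLO check.

```python
from typing import Callable, Iterable


def run_canary(
    stages: Iterable[int],
    set_weight: Callable[[int], None],
    is_healthy: Callable[[int], bool],
) -> bool:
    """Walk through canary stages; return to 0% on the first failed check.

    Returns True if the final stage was reached, False after a rollback.
    """
    for percent in stages:
        set_weight(percent)          # traffic exposure, separate from delivery
        if not is_healthy(percent):  # SLO check for the current stage
            set_weight(0)            # rollback: all traffic back to the old version
            return False
    return True


# In-memory stand-ins: the fake health check starts failing at 50%.
history: list[int] = []
ok = run_canary([1, 5, 25, 50, 100], history.append, lambda p: p < 50)
# history is now [1, 5, 25, 50, 0] and ok is False
```

Keeping the router and the health check behind callbacks mirrors the delivery/exposure split: the loop never deploys anything, it only moves traffic.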
2) When to choose
Blue-Green is suitable when:
- instant switching is required (strict RTO) and services are simple and stateless;
- there are strict release/freeze windows and clear smoke checks;
- holding double capacity for a long time is too expensive, but acceptable for a short window.
Canary is suitable when:
- changes are complex and need step-by-step validation on real traffic;
- telemetry is mature (SLOs, business metrics) and auto-stop is available;
- the blast radius must be strictly limited (fintech/iGaming flows).
Combined pattern: roll out Green and shift traffic to it through canary stages (Blue-Green as the frame, Canary as the traffic-shifting method).
3) Traffic routing architecture
Options for switching or shifting traffic:
1. L4/L7 load balancer (ALB/NLB, Cloud Load Balancer): weighted target groups.
2. API gateway/WAF: routing and weights by version, header, cookie, or region.
3. Service mesh (Istio/Linkerd/Consul): percentage-based splitting, fault injection, timeout/retry/limit controls.
4. Ingress/NGINX/Envoy: upstream weights and attribute-based routing.
5. Argo Rollouts/Flagger: operator/controller with automatic progression and integration with Prometheus/New Relic/Datadog.
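Weighted splits are often applied per request, so one user may bounce between versions. Where routing by header or cookie is used to keep users on one version, the idea can be sketched as a deterministic hash bucket. This is an illustrative assumption, not how any particular balancer or mesh implements it; `pick_version` and the `user_id` key are hypothetical.

```python
import hashlib


def pick_version(user_id: str, canary_percent: int) -> str:
    """Deterministically map a user to 'green' (canary) or 'blue' (stable)."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "green" if bucket < canary_percent else "blue"
```

Because the bucket depends only on the user ID, raising the percentage only adds users to Green; nobody who already saw the new version is moved back to Blue during promotion.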
4) Kubernetes: practical templates
4.1 Blue-Green (via Service selector)
Two Deployments: `app-blue` and `app-green`.
One Service, `app-svc`, whose selector points at the desired `version`.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: app-green, labels: { app: app, version: green } }
spec:
  replicas: 4
  selector: { matchLabels: { app: app, version: green } }
  template:
    metadata: { labels: { app: app, version: green } }
    spec:
      containers:
        - name: app
          image: ghcr.io/org/app:1.8.0
---
apiVersion: v1
kind: Service
metadata: { name: app-svc }
spec:
  selector: { app: app, version: blue }  # ← change to green to switch
  ports: [{ port: 80, targetPort: 8080 }]
```
Switching is an atomic change of the selector (or labels), combined with a controlled drain of existing connections.
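The selector switch can be prepared as a JSON merge patch. The sketch below only builds the patch and the `kubectl patch` argument list; it does not run anything. The helper names are hypothetical, and the label keys are taken from the manifest above.

```python
import json


def selector_patch(color: str) -> str:
    """Return a JSON merge patch that points the Service at one color."""
    if color not in ("blue", "green"):
        raise ValueError(f"unknown color: {color!r}")
    return json.dumps({"spec": {"selector": {"app": "app", "version": color}}})


def kubectl_patch_command(color: str, service: str = "app-svc") -> list[str]:
    """Build the argv for `kubectl patch`; the caller decides whether to run it."""
    return [
        "kubectl", "patch", "service", service,
        "--type=merge", "-p", selector_patch(color),
    ]
```

Rollback is the same call with `"blue"`, which is what makes the Blue-Green switch symmetric and fast.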
4.2 Canary (Istio VirtualService)
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: app }
spec:
  hosts: ["app.example.com"]
  http:
    - route:
        - destination: { host: app.blue.svc.cluster.local, subset: v1 }
          weight: 90
        - destination: { host: app.green.svc.cluster.local, subset: v2 }
          weight: 10
```
Change `weight` step by step; configure retries and timeouts on the VirtualService route and outlier detection in the DestinationRule.
4.3 Argo Rollouts (automatic canary progression)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: app }
spec:
  replicas: 6
  strategy:
    canary:
      canaryService: app-canary
      stableService: app-stable
      steps:
        - setWeight: 5
        - pause: { duration: 300 }  # 5 min observation
        - analysis:
            templates:
              - templateName: slo-guard
        - setWeight: 25
        - pause: { duration: 600 }
        - analysis:
            templates: [{ templateName: slo-guard }]
        - setWeight: 50
        - pause: {}  # wait for manual promotion
      trafficRouting:
        istio:
          virtualService:
            name: app
            routes: ["http-route"]
```
The analysis template is tied to metrics (see section 5).
5) SLO gates and auto rollback
Guarded metrics (examples):
- Technical: `p95_latency`, `5xx_rate`, `error_budget_burn`, CPU/memory throttling.
- Product: deposit conversion rate (CR), payment success rate, fraud scoring, ARPPU (over cold windows).
Stop rules (example):
- If the new version's `5xx_rate` is > 0.5% for 10 min: pause and roll back.
- If `p95_latency` rises > 20% above baseline: roll back.
- If the canary is progressing but the SLO error budget burns at > 2%/hour: hold.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: slo-guard }
spec:
  metrics:
    - name: http_5xx_rate
      interval: 1m
      # The Prometheus provider returns a vector, so index the first element.
      successCondition: result[0] < 0.005
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="app",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{app="app"}[5m]))
```
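The example stop rules from this section can also be written as a single decision function, which is handy for unit-testing thresholds before encoding them in an analysis template. This is a sketch under assumptions: the `CanarySample` fields are hypothetical names, and the 5xx rate is assumed to be already averaged over the 10-minute window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanarySample:
    rate_5xx_10m: float              # fraction over the last 10 min, 0.005 = 0.5%
    p95_ms: float                    # canary p95 latency
    baseline_p95_ms: float           # stable version p95 latency
    budget_burn_pct_per_hour: float  # SLO error budget burn rate


def evaluate_gate(s: CanarySample) -> str:
    """Return 'rollback', 'hold' or 'promote' for one observation."""
    if s.rate_5xx_10m > 0.005:
        return "rollback"                    # 5xx above 0.5% for the window
    if s.p95_ms > s.baseline_p95_ms * 1.20:
        return "rollback"                    # latency more than 20% over baseline
    if s.budget_burn_pct_per_hour > 2.0:
        return "hold"                        # budget burning too fast to promote
    return "promote"
```

Rollback conditions are checked before the hold condition so that a degraded canary is never merely paused.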
6) Data and compatibility (the most common cause of pain)
Use the expand → migrate → contract strategy:
- Expand: add new nullable columns/indexes and support both schemas.
- Migrate: dual writes/reads, backfill.
- Contract: remove old fields/code once 100% of traffic is on the new version.
- Events/queues: version the payload (v1/v2) and support idempotency.
- Cache/sessions: version the keys and ensure format compatibility.
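Payload versioning and idempotency can be sketched as a consumer that accepts both event versions during the migrate phase. The event shape is invented for illustration: v1 is assumed to carry `amount` in major units and v2 `amount_minor` in minor units.

```python
from typing import Any


def normalize_deposit(event: dict[str, Any]) -> dict[str, Any]:
    """Accept both payload versions and return the v2 shape."""
    version = event.get("version", 1)  # events without a version are treated as v1
    if version == 1:
        return {"id": event["id"], "amount_minor": round(event["amount"] * 100)}
    if version == 2:
        return {"id": event["id"], "amount_minor": event["amount_minor"]}
    raise ValueError(f"unsupported payload version: {version}")


class IdempotentConsumer:
    """Applies each deposit at most once, keyed by event id."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.total_minor = 0

    def handle(self, event: dict[str, Any]) -> bool:
        deposit = normalize_deposit(event)
        if deposit["id"] in self._seen:
            return False  # duplicate delivery, e.g. Blue and Green both published
        self._seen.add(deposit["id"])
        self.total_minor += deposit["amount_minor"]
        return True
```

While both versions run side by side, the same logical event may arrive in either format or twice; normalizing first and deduplicating by id keeps the result independent of which version produced it.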
7) Integration with CI/CD and GitOps
CI: build once, sign the image, generate an SBOM, run tests.
CD: promote the artifact through environments; Blue-Green/Canary are driven by manifests.
GitOps: merge request → the controller (Argo CD/Flux) applies weights/selectors.
Environments/approvals: production steps require a manual gate and an audit trail of decisions.
8) NGINX/Envoy and Cloud LBs: Quick Examples
8.1 NGINX (upstream weights)
```nginx
upstream app_upstream {
  server app-blue:8080 weight=90;
  server app-green:8080 weight=10;
}
server {
  location / { proxy_pass http://app_upstream; }
}
```
8.2 AWS ALB (weighted target groups)
TG-Blue: 90, TG-Green: 10 → change the weights via IaC/CLI.
Wire CloudWatch alarms to automated rollback scripts (set the weights back to Blue 100 / Green 0).
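A weight change on the ALB comes down to updating the listener's forward action. The sketch below only builds that action as a plain dictionary and makes no AWS calls. The structure follows the ELBv2 forward action shape as I understand it, intended as one element of `DefaultActions` for boto3's `elbv2` `modify_listener`; verify it against the current AWS documentation. The ARNs are placeholders.

```python
def weighted_forward_action(blue_arn: str, green_arn: str, green_percent: int) -> dict:
    """Build a forward action that splits traffic between two target groups."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_arn, "Weight": 100 - green_percent},
                {"TargetGroupArn": green_arn, "Weight": green_percent},
            ]
        },
    }


# Rollback is the same call with green_percent=0.
rollback_action = weighted_forward_action("arn:placeholder:blue", "arn:placeholder:green", 0)
```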
9) Security and compliance
Zero trust between versions: keep encryption secrets and rolling keys separate.
Policy-as-Code: forbid deploying unsigned images; no `latest` tag.
Secrets and configs are versioned artifacts; a rollback includes rolling back configs.
Audit: who raised the weight or switched the selector, when, and under which ticket.
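The "no `latest`" rule can be checked in a pipeline step before the manifest reaches the cluster. The sketch below is a simplified string check under assumptions: `image_violations` is a hypothetical helper, the trusted prefix is the `ghcr.io/org/` registry used elsewhere in this article, and signature verification is out of scope here.

```python
def image_violations(image: str, trusted_prefix: str = "ghcr.io/org/") -> list[str]:
    """Return policy problems for an image reference (empty list means OK)."""
    problems: list[str] = []
    if not image.startswith(trusted_prefix):
        problems.append("untrusted registry")
    if "@sha256:" in image:
        return problems                   # pinned by digest, tag rules do not apply
    name = image.rsplit("/", 1)[-1]       # last path segment carries the tag
    if ":" not in name:
        problems.append("no tag (defaults to latest)")
    elif name.rsplit(":", 1)[1] == "latest":
        problems.append("mutable tag 'latest'")
    return problems
```

Splitting on the last path segment avoids mistaking a registry port such as `localhost:5000` for a tag.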
10) Cost and capacity
Blue-Green requires double capacity for the duration of the release → plan a window.
Canary can run longer → account for the cost of telemetry/monitoring and of running two versions in parallel.
Optimization: autoscaling with HPA/VPA, short Blue-Green windows, off-peak (night) releases for "heavy" services.
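The cost difference can be estimated in extra replica-hours. The arithmetic below is a rough sketch under two assumptions that may not hold for a given service: the stable fleet is not scaled down during the release, and canary pods are sized in proportion to their traffic share.

```python
import math


def extra_replica_hours_blue_green(replicas: int, window_hours: float) -> float:
    """Blue-Green: a full second fleet runs for the whole release window."""
    return replicas * window_hours


def extra_replica_hours_canary(replicas: int, stages: list[tuple[int, float]]) -> float:
    """Canary: extra pods per stage, given (traffic percent, hours) pairs."""
    return sum(math.ceil(replicas * pct / 100) * hours for pct, hours in stages)


# 6 replicas: a 1-hour Blue-Green window vs. three canary stages.
blue_green_cost = extra_replica_hours_blue_green(6, 1)
canary_cost = extra_replica_hours_canary(6, [(5, 1), (25, 2), (50, 1)])
# blue_green_cost is 6; canary_cost is 1*1 + 2*2 + 3*1 = 8
```

In this example the longer canary ends up costing more replica-hours than the short Blue-Green window, which is the trade-off the section describes.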
11) Runbooks
1. Pause the promotion.
2. Reduce the Green weight to 0% (canary) or return the selector to Blue (blue-green).
3. Verify that errors and latency are back to baseline; drain connections.
4. Open an incident and collect artifacts (logs, traces, metric comparisons).
5. Fix and reproduce on staging, run smoke tests, restart the progression.
12) Anti-patterns
Rebuilding the artifact between stage and prod (violates "build once").
A "blind" canary without SLOs/metrics is a formality, not protection.
No feature flags: the release is forced to turn on new behavior for 100% of users at once.
Broken health checks/liveness probes → "stuck" pods and false stability.
Database compatibility left to chance: the contract breaks on switchover.
Mutable image tags or `latest` in prod.
13) Implementation checklist (0-45 days)
0-10 days
Choose a strategy for services: B/G, Canary or combined.
Enable image signing, health checks, readiness probes, and a no-`latest` policy.
Prepare SLO dashboards (latency/error rate/business metrics).
11-25 days
Automate weights (Istio/Argo Rollouts/ALB-weights).
Configure analysis templates, alerts and auto-rollback.
Template manifests (Helm/Kustomize), integrate with GitOps.
26-45 days
Implement the expand-migrate-contract strategy for the database.
Cover critical features with kill-switch flags.
Run a "game day": simulate a rollback and an incident.
14) Maturity metrics
Share of releases through Blue-Green/Canary (target > 90%).
Average switchover/rollback time (target < 3 min).
Share of releases with SLO auto-stop (and without incidents).
Service coverage by telemetry (traces/logs/metrics) > 95%.
Share of DB migrations following the expand-migrate-contract scheme > 90%.
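The first two metrics can be computed from release records. A minimal sketch, assuming a hypothetical `Release` record with a strategy label and an optional rollback duration:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Release:
    strategy: str                      # "blue-green", "canary" or "direct"
    rollback_seconds: Optional[float]  # None if no rollback happened


def progressive_share_pct(releases: list[Release]) -> float:
    """Percentage of releases that went through Blue-Green or Canary."""
    if not releases:
        raise ValueError("no releases to measure")
    progressive = [r for r in releases if r.strategy in ("blue-green", "canary")]
    return 100 * len(progressive) / len(releases)


def mean_rollback_minutes(releases: list[Release]) -> Optional[float]:
    """Average rollback time in minutes, or None if nothing was rolled back."""
    times = [r.rollback_seconds for r in releases if r.rollback_seconds is not None]
    return mean(times) / 60 if times else None
```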
15) Appendix: policy and pipeline templates
OPA (reject images outside the trusted registry)
```rego
package admission.image

deny[msg] {
  input.request.kind.kind == "Deployment"
  some c
  img := input.request.object.spec.template.spec.containers[c].image
  not startswith(img, "ghcr.io/org/")
  msg := sprintf("Image not from trusted registry: %v", [img])
}
```
Helm values for canary (simplified)
```yaml
canary:
  enabled: true
  steps:
    - weight: 5
      pause: 300
    - weight: 25
      pause: 600
    - weight: 50
      pause: 900
  sloGuards:
    max5xxPct: 0.5
    maxP95IncreasePct: 20
```
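GitHub Actions: weight promotion (pseudo)
```yaml
- name: Promote canary to 25%
  run: |
    # Patch both routes so the weights still sum to 100.
    kubectl patch virtualservice app \
      --type=json \
      -p='[{"op":"replace","path":"/spec/http/0/route/0/weight","value":75},
           {"op":"replace","path":"/spec/http/0/route/1/weight","value":25}]'
```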
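When promoting through a JSON patch, the stable and canary weights need to move together so that they keep summing to 100. A small generator for such a patch might look like the sketch below. The route indices (0 for stable, 1 for canary) are an assumption taken from the VirtualService example in section 4.2.

```python
import json


def weight_patch(canary_percent: int, stable_index: int = 0, canary_index: int = 1) -> str:
    """Return a JSON patch that sets both route weights consistently."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    base = "/spec/http/0/route"
    ops = [
        {"op": "replace", "path": f"{base}/{stable_index}/weight",
         "value": 100 - canary_percent},
        {"op": "replace", "path": f"{base}/{canary_index}/weight",
         "value": canary_percent},
    ]
    return json.dumps(ops)
```

The output can be passed to `kubectl patch virtualservice app --type=json -p "$PATCH"` from a pipeline step.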
16) Conclusion
Blue-Green and Canary are complementary rather than mutually exclusive strategies. Build them on signed artifacts, SLO observability, automated gates, and GitOps control. Separate delivery from traffic exposure, keep rollback fast, and maintain migration discipline, and releases become predictable, secure, and fast.