Blue-Green and Canary Deployments

1) Goal and key ideas

Blue-Green and Canary are zero-downtime release strategies that reduce rollout risk:

Blue-Green: keep two parallel versions (Blue is active, Green is new) and switch traffic atomically. Fast rollback: switch traffic back to Blue instantly.
Canary: enable the new version in stages (1% → 5% → 25% → 50% → 100%), monitor SLO metrics, and stop or roll back on degradation.

The general principle is to separate artifact delivery from traffic exposure, and to automate observability and rollbacks.
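The staged progression with an automatic stop can be sketched in a few lines of Python. This is a minimal illustration, not a real controller: `set_weight` and `slo_healthy` are hypothetical callbacks standing in for the traffic router and the metrics check.

```python
from typing import Callable, Iterable


def run_canary(
    steps: Iterable[int],
    set_weight: Callable[[int], None],
    slo_healthy: Callable[[], bool],
) -> bool:
    """Shift traffic step by step; return to 0% on the first SLO breach."""
    for weight in steps:
        set_weight(weight)          # expose `weight` percent of traffic to the new version
        if not slo_healthy():       # hypothetical gate: query metrics, compare with SLO
            set_weight(0)           # rollback: all traffic goes back to the stable version
            return False
    return True


# Usage with in-memory stand-ins: record every weight that was applied.
applied = []
ok = run_canary([1, 5, 25, 50, 100], applied.append, lambda: True)
print(ok, applied)
```

Delivery (the new version is already deployed) and exposure (the weight) stay separate: the loop only changes how much traffic the new version receives.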

2) When to choose

Blue-Green is suitable when:

you need instant switching (a hard RTO) for simple stateless services;
there are strict release/freeze windows and clear smoke checks;
holding double capacity for a long time is expensive, but affordable for a short window.

Canary is suitable when:

changes are complex and need step-by-step validation on real traffic;
telemetry is mature (SLOs, business metrics) and auto-stop is available;
limiting the blast radius is critical (fintech/iGaming flows).

Combo pattern: roll out Green and shift traffic to it in canary stages (Blue-Green as the frame, Canary as the way traffic is moved).

3) Traffic routing architecture

Options for switching/adding traffic:

1. L4/L7 balancer (ALB/NLB, Cloud Load Balancer) - weighted target groups.

2. API gateway/WAF - route/weight by versions, headers, cookies, regions.

3. Service Mesh (Istio/Linkerd/Consul) - percentage distribution, fault injection, timeout/retry/limit controls.

4. Ingress/NGINX/Envoy - upstream weights and attribute routing.

5. Argo Rollouts/Flagger - progressive-delivery controllers (operators): automatic progression, integration with Prometheus/New Relic/Datadog.

4) Kubernetes: practical templates

4.1 Blue-Green (via Service selector)

Two Deployments: `app-blue` and `app-green`.
One Service `app-svc` whose selector targets the desired `version`.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: app-green, labels: { app: app, version: green } }
spec:
  replicas: 4
  selector: { matchLabels: { app: app, version: green } }
  template:
    metadata: { labels: { app: app, version: green } }
    spec:
      containers:
        - name: app
          image: ghcr.io/org/app:1.8.0
---
apiVersion: v1
kind: Service
metadata: { name: app-svc }
spec:
  selector: { app: app, version: blue }  # ← to switch to green, change this value
  ports: [{ port: 80, targetPort: 8080 }]
```

Switching is an atomic change of the selector (or labels), combined with controlled connection draining.
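As a sketch of the switch itself, the snippet below builds the patch body that flips the Service selector and prints a `kubectl patch` command for it. It assumes the `app-svc` Service and the `app`/`version` labels used in this section; the helper name `selector_patch` is hypothetical.

```python
import json


def selector_patch(version: str) -> str:
    """Build a merge-patch body that points app-svc at the given version."""
    if version not in {"blue", "green"}:
        raise ValueError(f"unknown version: {version!r}")
    return json.dumps({"spec": {"selector": {"app": "app", "version": version}}})


# Cut over to Green; rolling back is the same call with "blue".
print(f"kubectl patch service app-svc -p '{selector_patch('green')}'")
```

Because cut-over and rollback are the same operation with a different value, the rollback path gets exercised on every release.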

4.2 Canary (Istio VirtualService)

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: app }
spec:
  hosts: ["app.example.com"]
  http:
    - route:
        - destination: { host: app.blue.svc.cluster.local, subset: v1 }
          weight: 90
        - destination: { host: app.green.svc.cluster.local, subset: v2 }
          weight: 10
```

Change `weight` step by step; add retries and timeouts to the VirtualService route, and outlier detection to the DestinationRule (which also defines the `v1`/`v2` subsets).

4.3 Argo Rollouts (automated canary)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: app }
spec:
  replicas: 6
  strategy:
    canary:
      canaryService: app-canary
      stableService: app-stable
      steps:
        - setWeight: 5
        - pause: { duration: 300 }  # 5 min observation
        - analysis:
            templates:
              - templateName: slo-guard
        - setWeight: 25
        - pause: { duration: 600 }
        - analysis:
            templates: [{ templateName: slo-guard }]
        - setWeight: 50
        - pause: {}
      trafficRouting:
        istio:
          virtualService:
            name: app
            routes: ["http-route"]
```

The `slo-guard` analysis template is backed by metrics (see section 5).

5) SLO gates and auto rollback

Protected metrics (examples):

Technical: `p95_latency`, `5xx_rate`, `error_budget_burn`, CPU/memory throttling.
Product: deposit conversion rate (CR), payment success rate, fraud scoring, ARPPU (on cold windows).

Stop policy (example):

If the new version's `5xx_rate` stays above 0.5% for 10 min - pause and roll back.
If `p95_latency` rises more than 20% above the baseline - roll back.
If canary promotion is in progress but the SLO error budget burns faster than 2%/hour - hold.
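The stop policy above can be expressed as a small decision function. This is a sketch under stated assumptions: the input shape (one 5xx-rate sample per minute, latency in milliseconds) and all names are hypothetical, and the thresholds simply mirror the example policy.

```python
from typing import Sequence

MAX_5XX_RATE = 0.005          # 0.5%, expressed as a fraction
BREACH_WINDOW_MIN = 10        # the breach must last this many minutes
MAX_P95_INCREASE = 0.20       # +20% over the baseline
MAX_BURN_PCT_PER_HOUR = 2.0   # error-budget burn limit


def decide(rates_5xx: Sequence[float], p95_ms: float,
           baseline_p95_ms: float, burn_pct_per_hour: float) -> str:
    """Return "rollback", "hold" or "promote" for the current canary step.

    rates_5xx holds one 5xx-rate sample per minute, oldest first.
    """
    window = rates_5xx[-BREACH_WINDOW_MIN:]
    if len(window) == BREACH_WINDOW_MIN and all(r > MAX_5XX_RATE for r in window):
        return "rollback"     # sustained 5xx breach
    if p95_ms > baseline_p95_ms * (1 + MAX_P95_INCREASE):
        return "rollback"     # latency regression against the baseline
    if burn_pct_per_hour > MAX_BURN_PCT_PER_HOUR:
        return "hold"         # keep the current weight, do not promote
    return "promote"


print(decide([0.001] * 10, p95_ms=210, baseline_p95_ms=200, burn_pct_per_hour=0.5))
```

Requiring a full window of breaching samples keeps a single noisy minute from triggering a rollback.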

Argo AnalysisTemplate (simplified):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: slo-guard }
spec:
  metrics:
    - name: http_5xx_rate
      interval: 1m
      # The Prometheus provider returns a vector, so compare its first element.
      successCondition: result[0] < 0.005
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="app",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{app="app"}[5m]))
```

6) Data and compatibility (the most common cause of pain)

Use the expand → migrate → contract strategy:

Expand: add new nullable columns/indexes; support both schemas.
Migrate: dual write/read, backfill.
Contract: remove old fields/code once the new version serves 100% of traffic.
Events/queues: version the payload (v1/v2); keep consumers idempotent.
Cache/sessions: version the keys; ensure format compatibility.
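The three phases can be sketched with an in-memory SQLite database. The `users` table and the move from `name` to `display_name` are hypothetical; a real migration would go through your migration tool. The contract step is shown only as a comment because it must wait until the old version no longer serves traffic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada'), ('Linus')")

# Expand: add a nullable column; the old version ignores it and keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Migrate (dual write): the new version writes both columns.
conn.execute(
    "INSERT INTO users (name, display_name) VALUES (?, ?)", ("Grace", "Grace")
)

# Migrate (backfill): copy old values into the new column for existing rows.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

# Dual read: prefer the new column, fall back to the old one.
rows = conn.execute(
    "SELECT COALESCE(display_name, name) FROM users ORDER BY id"
).fetchall()
print(rows)

# Contract (only after the new version serves 100% of traffic):
#   ALTER TABLE users DROP COLUMN name
```

At every point in this sequence both the old and the new application version can read and write the table, which is what makes a rollback safe.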

7) Integration with CI/CD and GitOps

CI: build once, sign the image, generate an SBOM, run tests.
CD: promote the artifact through environments; Blue-Green/Canary are driven by manifests.
GitOps: MR → controller (Argo CD/Flux) applies weights/selectors.
Environments/approvals: for production steps, a manual gate plus an audit trail of decisions.

8) NGINX/Envoy and Cloud LBs: Quick Examples

8.1 NGINX (upstream weights)

```nginx
upstream app_upstream {
    server app-blue:8080 weight=90;
    server app-green:8080 weight=10;
}
server {
    location / { proxy_pass http://app_upstream; }
}
```
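To show what a 90/10 weight split means per request, here is a simplified model of smooth weighted round-robin in Python. It is an illustration of the balancing idea, not NGINX's actual code; the function name is hypothetical and the server names are taken from the config above.

```python
from collections import Counter
from typing import Dict, List


def smooth_wrr(weights: Dict[str, int], requests: int) -> List[str]:
    """Pick a server per request so that picks are proportional to weights."""
    current = {server: 0 for server in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(requests):
        for server, weight in weights.items():
            current[server] += weight         # every server gains its weight
        best = max(current, key=current.get)  # the highest running score wins
        current[best] -= total                # the winner pays the total weight
        picks.append(best)
    return picks


picks = smooth_wrr({"app-blue:8080": 90, "app-green:8080": 10}, 100)
print(Counter(picks))
```

The split is per request, not per user: without session affinity the same client can hit Blue on one request and Green on the next.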

8.2 AWS ALB (weighted target groups)

TG-Blue: 90, TG-Green: 10 → change weights via IaC/CLI.
Link CloudWatch alarms to automated rollback scripts (set Green back to 0 and Blue to 100).
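A rollback script mostly needs to produce the weighted forward action for the listener. The sketch below only builds that structure; the helper name and the ARNs are placeholders, and the dictionary shape follows the ALB forward action with weighted target groups as I understand it, so check it against the current AWS API reference before use.

```python
from typing import Any, Dict, List


def forward_action(blue_arn: str, green_arn: str, green_weight: int) -> List[Dict[str, Any]]:
    """Build a listener default action that splits traffic Blue/Green."""
    if not 0 <= green_weight <= 100:
        raise ValueError("green_weight must be between 0 and 100")
    return [{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_arn, "Weight": 100 - green_weight},
                {"TargetGroupArn": green_arn, "Weight": green_weight},
            ]
        },
    }]


# Rollback: send everything back to Blue (placeholder ARNs).
rollback = forward_action("arn:blue-placeholder", "arn:green-placeholder", 0)
print(rollback)
```

With boto3 this list would typically be passed as `DefaultActions` to the `elbv2` client's `modify_listener` call, triggered from the alarm handler.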

9) Security and compliance

Zero trust between versions: keep secrets separate per version and rotate encryption keys.
Policy-as-Code: disallow deploying unsigned images; enforce `no latest`.
Secrets and configs are versioned artifacts; a rollback includes rolling back configs.
Audit: who raised the weight or switched the selector, when, and under which ticket.
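The `no latest` rule can also be checked early, for example as a CI lint step before manifests reach the cluster. The function below is a hypothetical sketch: it rejects image references without a tag or with the `latest` tag and accepts digest-pinned references. It does not verify signatures.

```python
def violates_tag_policy(image: str) -> bool:
    """Return True if the image reference is untagged or uses 'latest'."""
    if "@sha256:" in image:
        return False                       # pinned by digest: immutable
    last_segment = image.rsplit("/", 1)[-1]  # skip the registry host (it may contain a port)
    _name, sep, tag = last_segment.partition(":")
    return not sep or tag == "latest"      # no tag at all, or the mutable 'latest'


for ref in ("ghcr.io/org/app:1.8.0", "ghcr.io/org/app:latest", "ghcr.io/org/app"):
    print(ref, "->", "reject" if violates_tag_policy(ref) else "ok")
```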

10) Cost and capacity

Blue-Green requires double capacity for the duration of the release → plan a window.
Canary can run longer → cost of telemetry/monitoring and of running two versions in parallel.
Optimization: autoscaling with HPA/VPA, short Blue-Green windows, off-peak (night) releases for "heavy" services.

11) Runbooks

1. Pause the promotion.
2. Reduce the Green weight to 0% (canary) or return the selector to Blue (blue-green).
3. Check that errors/latency have returned to baseline; drain connections.
4. Open an incident, collect artifacts (logs, traces, metric comparisons).
5. Fix and reproduce on staging, run smoke tests, restart the progression.

12) Anti-patterns

Rebuilding an artifact between stage and prod (violation of "build once").
A "blind" canary without SLOs/metrics is a formality, not a defense.
No feature flags: the release has to turn on new behavior for 100% of users at once.
Broken health checks/liveness probes → stuck pods and false stability.
Database compatibility left to chance: the contract breaks on switchover.
Mutable image tags/`latest` in prod.

13) Implementation checklist (0-45 days)

0-10 days

Choose a strategy for services: B/G, Canary or combined.
Enable image signing, health checks, readiness probes, `no latest`.
Prepare SLO dashboards (latency/error rate/business metrics).

11-25 days

Automate weights (Istio/Argo Rollouts/ALB-weights).
Configure analysis templates, alerts and auto-rollback.
Template manifests (Helm/Kustomize), integrate with GitOps.

26-45 days

Implement the expand-migrate-contract strategy for the database.
Cover critical paths with kill-switch feature flags.
Run a "game day": simulate a rollback and an incident.

14) Maturity metrics

% of releases through Blue-Green/Canary (target > 90%).
Average switchover/rollback time (target < 3 min).
Share of releases with SLO auto-stop (and without incidents).
Service coverage by telemetry (traces/logs/metrics) > 95%.
Share of DB migrations following the expand-migrate-contract scheme > 90%.
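Two of these metrics can be computed directly from release records. The `Release` record shape and the sample data below are hypothetical; in practice the input would come from your deployment history.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Release:
    strategy: str                      # "blue-green", "canary" or "direct"
    rollback_seconds: Optional[float]  # None if the release was not rolled back


def progressive_share(releases: List[Release]) -> float:
    """Percentage of releases shipped via Blue-Green or Canary (non-empty input)."""
    progressive = sum(r.strategy in ("blue-green", "canary") for r in releases)
    return 100.0 * progressive / len(releases)


def mean_rollback_seconds(releases: List[Release]) -> Optional[float]:
    """Average rollback time over the releases that were rolled back."""
    times = [r.rollback_seconds for r in releases if r.rollback_seconds is not None]
    return mean(times) if times else None


history = [
    Release("canary", None),
    Release("blue-green", 120.0),
    Release("canary", 60.0),
    Release("direct", None),
]
print(progressive_share(history), mean_rollback_seconds(history))
```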

15) Attachments: Policy and Pipeline Templates

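OPA (allow images only from the trusted registry)

```rego
package admission.image

# Deny any Deployment whose container image is not from ghcr.io/org/.
# Note: this checks the registry prefix only; signature verification needs a separate control.
deny[msg] {
  input.request.kind.kind == "Deployment"
  some c
  img := input.request.object.spec.template.spec.containers[c].image
  not startswith(img, "ghcr.io/org/")
  msg := sprintf("Image not from trusted registry: %v", [img])
}
```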

Helm-values for canary (simplified)

```yaml
canary:
  enabled: true
  steps:
    - weight: 5
      pause: 300
    - weight: 25
      pause: 600
    - weight: 50
      pause: 900
  sloGuards:
    max5xxPct: 0.5
    maxP95IncreasePct: 20
```

GitHub Actions - weight promotion (pseudo)

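```yaml
- name: Promote canary to 25%
  # Route 0 is the stable destination, route 1 is the canary;
  # both weights are updated together so that they still sum to 100.
  run: |
    kubectl patch virtualservice app \
      --type=json \
      -p='[{"op":"replace","path":"/spec/http/0/route/0/weight","value":75},{"op":"replace","path":"/spec/http/0/route/1/weight","value":25}]'
```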

16) Conclusion

Blue-Green and Canary are complementary strategies, not mutually exclusive ones. Build them on signed artifacts, SLO observability, automatic gates and GitOps control. Separate delivery from traffic exposure, keep rollback fast and migrations disciplined, and releases become predictable, secure and fast.
