Release strategies: blue-green and canary
(Section: Technology and Infrastructure)
Brief Summary
Blue-green gives an instant switch between two full stacks (Blue and Green) with the simplest possible rollback. Canary gradually increases the share of traffic sent to the new version under the control of SLO gates (latency, error rate, business metrics). For iGaming, both are ways to release without downtime at the peak of tournaments and promotions while keeping p99 and quality stable.
1) When to choose
Blue-green - fast releases and minimal complexity, but you need a "double" cluster/resource budget. A good fit for API/frontend services without complex state migration.
Canary - for high-risk releases (new flows, critical changes); lets you catch degradation on 1-5% of traffic. Requires telemetry and automated gates.
2) Architectural principles
1. L7-level routing: load balancer/Ingress/service mesh (weighted traffic splitting, cookie/flag-based routing).
2. Isolated dependencies: configurations, feature flags, secrets, caches - separate per revision.
3. Data compatibility: forward-compatible (expand → migrate → contract) database migrations.
4. Observability: per-version labels in metrics/logs/traces.
5. Automated gates: p95/p99 comparison, error rate, business KPIs; automatic rollback.
3) Blue-green: basic pattern
Flow
1. Bring up Green (a copy of Blue) → warm up caches/connections.
2. Run health/smoke tests.
3. Switch traffic (DNS/LB/Ingress) to Green.
4. Keep Blue in a "warm" state as a fallback until the end of the release window.
Example: Ingress-level switching (idea)

```yaml
# Annotation/backend option; in prod this is usually managed by an operator/rollout controller
rules:
  - host: api.example.com
    http:
      paths:
        - path: /
          backend:
            service:
              name: api-green   # used to be api-blue
              port:
                number: 80
```
Pros/cons
+ Simple rollback (switch back to Blue).
+ Predictable release time.
- Requires duplicated resources.
- "Big bang" risk without a canary measurement.
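The health/smoke step of the flow above can be sketched as a tiny pre-switch gate. This is illustrative only: the endpoint paths and the `fetch` callable are assumptions, not part of any particular stack.

```python
"""Minimal smoke-test gate before switching traffic to Green.

Endpoint paths and the `fetch` signature are illustrative
assumptions; real code would issue HTTP GETs against the Green
stack before flipping the Ingress/LB."""
from typing import Callable

CHECKS = [
    ("/healthz", 200),      # liveness
    ("/readyz", 200),       # dependencies warmed up
    ("/api/v1/ping", 200),  # app-level smoke
]

def green_is_healthy(fetch: Callable[[str], int],
                     checks=CHECKS) -> bool:
    """Return True only if every check returns the expected status."""
    return all(fetch(path) == expected for path, expected in checks)

# Example with a stubbed fetcher:
statuses = {"/healthz": 200, "/readyz": 200, "/api/v1/ping": 200}
print(green_is_healthy(statuses.get))  # True: safe to switch
```

The point of returning a single boolean is that the switch step can be fully automated: the LB/Ingress flip happens only when the gate passes.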
4) Canary: Gradual build-up
Flow
1. Shadow traffic (optional) → 1% of real traffic → 5% → 25% → 50% → 100%.
2. At each stage, gates on SLO/business metrics.
3. On degradation, automatic rollback and preservation of diagnostic artifacts.
Example: Argo Rollouts (snippet)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: payments-api }
spec:
  strategy:
    canary:
      canaryService: payments-canary
      stableService: payments-stable
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: slo-latency
        - setWeight: 25
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: error-rate
        - setWeight: 50
        - pause: { duration: 20m }
        - setWeight: 100
```
Example: Flagger + Istio/NGINX (idea)

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata: { name: games-api }
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: games-api
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }
      - name: request-duration
        thresholdRange: { max: 300 }
    webhooks:
      - name: smoke
        url: http://tester/smoke
```
5) Warm-up and state management
Caches/sources: warm up Redis/HTTP caches/CDN; prepare warm-pool connections to the database/PSP.
ML/LLM models: load weights/indices/embeddings and the KV cache; send primary requests to warm up.
Files/artifacts: static content, templates, configs - deliver in advance to a local volume/sidecar.
Feature flags: roll out to 1-5% of the audience/segment, with an emergency kill switch.
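The cache warm-up item above can be sketched as an idempotent pre-traffic step. A dict stands in for Redis and `load_from_db` for the real DB/PSP lookup; both names and the key list are illustrative assumptions.

```python
"""Cache warm-up sketch for a new revision before it receives traffic.

The cache (a plain dict here) stands in for Redis; `load_from_db`
and the hot-key list are illustrative placeholders."""

HOT_KEYS = ["game:top100", "promo:active", "odds:featured"]

def load_from_db(key: str) -> str:
    # Stand-in for the real DB/PSP lookup
    return f"value-for-{key}"

def warm_up(cache: dict, keys=HOT_KEYS) -> int:
    """Pre-populate the cache; return how many keys were loaded."""
    loaded = 0
    for key in keys:
        if key not in cache:  # don't clobber fresher entries
            cache[key] = load_from_db(key)
            loaded += 1
    return loaded

cache: dict = {}
print(warm_up(cache))  # 3: all hot keys loaded before the traffic shift
print(warm_up(cache))  # 0: idempotent on re-run
```

Making warm-up idempotent matters because orchestrators may retry the step; a second run must not evict entries the revision has already refreshed.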
6) Databases: the "expand → migrate → contract" strategy
1. Expand: add nullable/new columns/indexes; support both versions.
2. Migrate: code uses the new schema; the old paths remain valid.
3. Contract: delete old fields/indexes after the rollout is complete.
Record the schema and client version in the logs; make all changes idempotent.
For heavy migrations, use background jobs, throttling, and agreed "stop-the-world" windows.
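The "expand" phase above can be sketched with sqlite3 as a stand-in database. The table and column names are illustrative; the point is that the migration is idempotent and adds a nullable column, so both the old and new code versions keep working during rollout.

```python
"""Sketch of the "expand" phase (expand → migrate → contract),
using sqlite3 as a stand-in database. Table/column names are
illustrative assumptions."""
import sqlite3

def expand(conn: sqlite3.Connection) -> None:
    """Idempotently add a nullable column; old code ignores it,
    new code starts writing it."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(players)")}
    if "loyalty_tier" not in cols:  # idempotent guard
        conn.execute("ALTER TABLE players ADD COLUMN loyalty_tier TEXT")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT)")
expand(conn)
expand(conn)  # safe to run twice
print([row[1] for row in conn.execute("PRAGMA table_info(players)")])
# ['id', 'name', 'loyalty_tier']
```

The "contract" phase would drop the old fields only after every client reports the new schema version in its logs.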
7) Observability and gates (SLO/SLA)
SRE metrics: p50/p95/p99, error rate, saturation (CPU/GPU/IO), queue depth, cold-start time.
Business metrics: payment conversion, bet success, time to withdrawal (TTW), promo responses.
Content/LLM quality: tokens/s, response length, toxicity, RAG score.
Gates: automatic promotion/rollback when thresholds are breached and/or when a key business metric drops.
```yaml
gate:
  - p95_latency_ms <= 250
  - error_rate_pct <= 1.0
  - payment_conv >= baseline - 0.3%
action:
  on_pass: promote
  on_fail: rollback
```
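The gate logic sketched above reduces to a pure function over canary metrics and the stable baseline, which makes it easy to unit-test. The metric names and the dict shape are illustrative assumptions; the thresholds follow the gate sketch in this section.

```python
"""Gate evaluation sketch: decide promote vs rollback from canary
metrics and the stable baseline. Metric names are illustrative."""

def gate_decision(canary: dict, baseline_conv: float) -> str:
    """Return 'promote' if all SLO/business thresholds hold,
    else 'rollback'. Thresholds mirror the gate sketch above."""
    ok = (canary["p95_latency_ms"] <= 250
          and canary["error_rate_pct"] <= 1.0
          and canary["payment_conv"] >= baseline_conv - 0.003)  # -0.3 p.p.
    return "promote" if ok else "rollback"

print(gate_decision(
    {"p95_latency_ms": 210, "error_rate_pct": 0.4, "payment_conv": 0.182},
    baseline_conv=0.184,
))  # promote
```

Keeping the decision a pure function also means the same code can run in CI against recorded metric snapshots, so gate regressions are caught before a release.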
8) Release orchestration and integration with CI/CD
GitOps: version/weight changes go through PRs to the manifest repository.
Automated smoke/e2e checks before traffic starts shifting.
Release plan: canary step schedule, owners, ChatOps channels, rollback windows.
Archive artifacts: routing configs, dashboard snapshots, metric-comparison logs.
9) Multi-region and edge
Order: start with the least critical region/PoP, then the main ones.
Latency-based routing: monitor local SLOs; don't mix traffic across regions without reason.
DR view: Blue in region A can serve as the DR site for Green in region B.
10) Release safety and compliance
Signed images/charts and SBOMs; verify signatures in admission policies.
Secrets: external managers only; independent versions for Blue/Green.
PII/regionality: do not route PII traffic through a foreign region; mask logs when comparing.
Audit: who promoted, which gates fired, when and why a rollback happened.
11) Configuration examples
NGINX: canary branch by cookie/header (idea)

```nginx
# Requests carrying the header "X-Canary: 1" go to the canary upstream
map $http_x_canary $canary {
    default 0;
    "1"     1;
}

upstream api_stable { server stable:80; }
upstream api_canary { server canary:80; }

server {
    location / {
        # "if" inside location is fragile in nginx; for percentage-based
        # splits prefer split_clients
        if ($canary) { proxy_pass http://api_canary; }
        proxy_pass http://api_stable;
    }
}
```
Feature flag: fractional rollout (pseudo)

```yaml
feature: new_checkout
rollout:
  percentage: 5
  criteria:
    country: ["TR", "BR", "MX"]
    cohort: "new-users"
  kill_switch: true
```
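A percentage rollout like the one above is typically implemented with stable hashing, so the same user gets the same variant on every request. This is a generic sketch, not any specific flag provider's API; the flag and user-id format are illustrative.

```python
"""Deterministic fractional rollout: bucket a user by a stable hash
of (flag, user_id). Same inputs always give the same answer, and
roughly `percentage`% of users land in the rollout."""
import hashlib

def in_rollout(flag: str, user_id: str, percentage: int) -> bool:
    """True if this user is inside the rollout percentage (0-100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99
    return bucket < percentage

# Sanity check: about 5% of a large user population is enabled.
hits = sum(in_rollout("new_checkout", f"user-{i}", 5) for i in range(10_000))
print(hits)  # roughly 500 of 10,000 (~5%)
```

Determinism is what makes the kill switch safe: setting `percentage` to 0 disables the feature for everyone instantly, with no per-user state to clean up.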
12) Runbooks (typical scenarios)
p99 growth on the canary: pause promotion → tune batch sizes/timeouts, disable heavy features via flags → restart a subset of pods.
Payment conversion drop: compare PSP routes/features, enable shadow logging, roll back to stable.
Database migration problem: freeze write traffic, enable read-only mode, roll back the schema (if possible), run emergency fix jobs.
PII incident: cut off the canary version, revoke secrets, file a report and audit trail.
13) Implementation checklist
1. Define the policy: where blue-green, where canary; which is considered "critical."
2. Configure weighted routing (Ingress/mesh/router).
3. Capture SLO threshold gates and auto rollbacks.
4. Implement expand→migrate→contract for the database; migration tests.
5. Enable warm-up of caches/models and warm-pool connections.
6. Enter GitOps and log all release actions.
7. Visualize the comparison of metrics (canary vs stable).
8. Run a game day: simulate a failed gate, a rollback, and a database problem.
9. Document the runbooks and the kill-switch.
10. Plan multi-region releases in sequence, not simultaneously.
14) Anti-patterns
Canary release without gates and telemetry → late detection of degradation.
Coupling the DB schema to the code rollout: breaking migrations shipped before the code is fully rolled out.
One shared cache/queue for Blue and Green without isolation → mutual impact.
DNS switching with a low TTL and no verification → traffic "flapping."
Secrets/configs shared by both revisions → complicated rollback.
Traffic to prod without shadow/smoke tests → big-bang risk.
No kill switch/feature flag for a quick shutdown.
Summary
Blue-green provides instant, simple switching; canary provides managed risk and early problem detection. In iGaming the two patterns are combined: canary for "sharp" changes plus blue-green as the baseline zero-downtime mechanism. Add SLO gates, GitOps, warm-up, database compatibility, and dependency isolation, and releases become predictable, rollbacks fast, and p99 and business metrics stable even at peak.