GH GambleHub

Release strategies: blue-green and canary

(Section: Technology and Infrastructure)

Brief Summary

Blue-green gives an instant switch between two full stacks (Blue/Green) with the simplest rollback. Canary is gradually increasing the share of traffic to the new version under the control of SLO gates (latency, error-rate, business metrics). For iGaming, this is a way to release without downtime at the peak of tournaments and promotions, while maintaining a stable p99 and quality.

1) When to choose

Blue-green - fast releases, minimal complexity, you need a "double" cluster/resource budget. Good for API/front without complex state migration.
Canary - high-risk releases (new flow, critical changes), allows you to "catch" degradation by 1-5% of traffic. Requires telemetry and automatic gates.

2) Architectural principles

1. L7 level routing: balancer/Ingress/service-mesh (weighted traffic modules, cookies/flag-routing).
2. Isolated dependencies: configurations, phicheflags, secrets, caches - separately for revisions.
3. Data compatibility: forward-compatible (expand→migrate→contract) database migrations.
4. Observability: individual labels/version labels in metrics/logs/tracks.
5. Autogates: p95/p99 comparison, error-rate, business KPI; automatic rollback.

3) Blue-green: basic pattern

Stream

1. Expand Green (a copy of Blue) → warm up the caches/connections.
2. Run health-/smoke-tests.
3. Switch traffic (DNS/LB/Ingress) to Green.
4. We keep Blue in a "warm" state as a fallback until the end of the window.

Example: Ingress level switching (idea)

yaml
Annotated/Backend Option - In Prod, it is usually controlled by the spec operator/rollout:
rules:
- host: api. example. com http:
paths:
- path: /
backend:
service:
name: api-green # used to be api-blue port:
number: 80

Pros/Cons

Simple rollback (returned Blue).
Predictable release time.
Requires duplication of resources.
Risk of "big bang" without canary measurement.

4) Canary: Gradual build-up

Stream

1. Shadow traffic (optional) → 1% of real traffic → 5% → 25% → 50% → 100%.
2. At each stage - gates by SLO/business metrics.
3. During degradation - auto-rollback and preservation of diagnostic artifacts.

Example: Argo Rollouts (snippet)

yaml apiVersion: argoproj. io/v1alpha1 kind: Rollout metadata: { name: payments-api }
spec:
strategy:
canary:
canaryService: payments-canary stableService: payments-stable steps:
- setWeight: 5
- pause: { duration: 5m }
- analysis:
templates:
- templateName: slo-latency
- setWeight: 25
- pause: { duration: 10m }
- analysis:
templates:
- templateName: error-rate
- setWeight: 50
- pause: { duration: 20m }
- setWeight: 100

Example: Flagger + Istio/NGINX (idea)

yaml apiVersion: flagger. app/v1beta1 kind: Canary metadata: { name: games-api }
spec:
targetRef:
apiVersion: apps/v1 kind: Deployment name: games-api service:
port: 80 analysis:
interval: 1m threshold: 5 metrics:
- name: request-success-rate thresholdRange: { min: 99 }
- name: request-duration thresholdRange: { max: 300 }
webhooks:
- name: smoke url: http://tester/smoke

5) Warm-up and condition management

Caches/sources: warm up the Redis/HTTP cache/CDN, prepare warm-pool connections to the database/PSP.

ML/LLM/models: loading weights/indices/embeddings, KV cache, primary requests for "warming up."

Files/artifacts: static content, templates, configs - submit in advance to local volume/sidecar.
Ficheflags: rollout at 1-5% of audience/segment, emergency-kill opportunity.

6) Databases: "expand → migrate → contract" strategy

1. Expand: add nullable/new columns/indexes, support both versions.
2. Migrate: code uses a new scheme; the old paths remain valid.
3. Contract: delete old fields/indexes after full unwind.

In the logs, fix the schema and client version; all changes are idempotent.
For heavy migrations - background jabs, throttling and "stop-the-world" windows as agreed.

7) Observability and gates (SLO/SLA)

SRE metrics: p50/p95/p99, error-rate, saturation (CPU/GPU/IO), queue-depth, cold start time.
Business metrics: payment conversion, bid success, withdrawal time (TTW), promo responses.
Content quality/LLM: tokens/s, response length, toxicity, RAG-score.
Gates: auto-promotion/rollback when going beyond the thresholds and/or when the "useful metric" falls.

Example of "policy" (pseudo):

gate:
p95_latency_ms <= 250 error_rate %  <= 1. 0 payment_conv  >= baseline - 0. 3%
action:
promote      rollback

8) Release orchestration and integration with CI/CD

GitOps: version/weight changes - via PR to the manifest repository.
Auto-checks smoke/e2e before traffic starts.
Release plan: canary step schedule, responsible, ChatOps channels, rollback windows.
Archiving artifacts: routing configs, dashboard snapshots, metric comparison logs.

9) Multi-region and edge

Order: first the "least critical" region/ROR, then the main ones.
Latency-based routing: monitor local SLOs; don't mix traffic for no reason.
DR-vision: Blue in region-A could be DR-site for Green in region-B.

10) Safety and compliance release

Signed Looks/Charts, SBOM; checking signatures in admission policies.
Secrets: external managers only; independent versions for Blue/Green.
PII/regionality: do not divert traffic from PII through a foreign region; mask the logs when comparing.
Audit: who promoted, which gates worked, where the rollback.

11) Configuration examples

NGINX: Canary Branch by Cookie/Header (Idea)

nginx map $http_x_canary $canary {
default 0;
"1"   1;
}

upstream api_stable { server stable:80; }
upstream api_canary { server canary:80; }

server {
location / {
if ($canary) { proxy_pass http://api_canary; }
proxy_pass http://api_stable;
}
}

Feature-flag "fractional rollout" (pseudo)

yaml feature: new_checkout rollout:
percentage: 5 criteria:
country: ["TR", "BR", "MX"]
cohort: "new-users"
kill_switch: true

12) Runbooks (typical scenarios)

The growth of p99 on the canary: stop the promotion → increase the batch/timeout, turn off heavy features through flags → restart some of the pods.
Payment conversion drop: compare PSP routes/features, enable shadow logging, rollback to stable.
Problem with database migration: freeze traffic for writing, enable read-only mode, rollback the schema (if possible), emergency fix jabs.
PII incident: cut off the canary version, revocation of secrets, report and audit.

13) Implementation checklist

1. Define the policy: where blue-green, where canary; which is considered "critical."

2. Configure weighted routing (Ingress/mesh/router).
3. Capture SLO threshold gates and auto rollbacks.
4. Implement expand→migrate→contract for the database; migration tests.
5. Enable warm-up of caches/models and warm-pool connections.
6. Enter GitOps and log all release actions.
7. Visualize the comparison of metrics (canary vs stable).
8. Spend game-day: Simulate a gate rollback/fail/database problem.
9. Document the runbooks and the kill-switch.
10. Plan multiregion releases in turn, not at the same time.

14) Anti-patterns

Canary release without gates and telemetry → late detection of degradation.
Blending DB Schema: Disruptive Migrations to Code Unwind.
One shared cache/queue for Blue and Green without isolation → mutual impact.
DNS switching with low TTL without verification - "flapping" traffic.
Secrets/configs common to both revisions → complex rollback.
Traffic to food without shadow/smoke is a big bang risk.
No kill-switch/feature-flag for quick shutdown.

Summary

Blue-green provides instant and easy switching, canary - managed risk and early problem detection. In iGaming, both patterns are combined: canary on "sharp" changes + blue-green as a basic mechanism without downtime. Add SLO gates, GitOps, warm-up, database compatibility and dependency isolation - and releases will be predictable, rollbacks are fast, and p99 and business metrics are stable even during peak periods.

Contact

Get in Touch

Reach out with any questions or support needs.We are always ready to help!

Telegram
@Gamble_GC
Start Integration

Email is required. Telegram or WhatsApp — optional.

Your Name optional
Email optional
Subject optional
Message optional
Telegram optional
@
If you include Telegram — we will reply there as well, in addition to Email.
WhatsApp optional
Format: +country code and number (e.g., +380XXXXXXXXX).

By clicking this button, you agree to data processing.