Rollbacks and stability recovery

(Section: Technology and Infrastructure)

Brief Summary

Rollback is a managed return to the latest stable version with minimal risk of data loss and SLO violations. A reliable process includes: SLO signals, clear gates and rollback criteria, a switching mechanism (GitOps/Ingress/mesh), a compatible data scheme, isolated configs/secrets/caches, runabook and post-incident improvement cycle.

1) When to roll back (start criteria)

SLO/business gates: p95/99 above the threshold, error-rate ↑, drop in payment/rate conversion, increase in PSP timeouts.
Tech signals: hearth crashes, memory leaks, queue growth, token degradation/sec (LLM), 5xx on Edge.
Data risk: incorrect migrations, inconsistency of replicas, orphan transactions/payments.
Safety/PII: suspected leak - immediate rollback/isolation.

Rule: if 2 + key metrics are outside thresholds> N minutes, rollback is triggered.

2) Types of rollbacks

1. Application: rollback of containers/package to the previous tag.
2. Feature: instant shutdown via feature flag/kill-switch.
3. Routing - Returns weight to the stable version (canary→stable) or Blue→Green.
4. Database: logical rollback (compensation), phased return of the scheme; PITR is a last resort.
5. Infrastructure: rolling back manifests/Terraform plan; returning network/WAF configurations.
6. Data/cache/queues: reset/disability/replay messages; version caches.

3) Architectural principles of safe rollback

Schema compatibility: expand→migrate→contract strategy (rollback is possible between expand and contract).
Isolated dependencies: separate secrets/configs/caches/queues for revisions.
Idempotent operations: repeated start of migrations and job - safe.
Immutability of artifacts: images, charts, SQL scripts - versioned and signed.
GitOps true: The current version and routing are committed to the manifest repository.

4) Rollback mechanics (Kubernetes/GitOps)

Argo Rollouts (weight return)

yaml apiVersion: argoproj. io/v1alpha1 kind: Rollout metadata: { name: api }
spec:
strategy:
canary:
steps:
- setWeight: 5
- pause: { duration: 10m }
in case of analysis failure → automatic rollback to stable

GitOps rollback (idea)


git revert <commit_with_bad_version>
git push # Argo CD/Flux revert cluster to previous revision

NGINX: fast switch on stable

nginx map $cookie_canary $to_canary { default 0; 1 1; }
upstream stable { server api-stable:80; }
upstream canary { server api-canary:80; }
server {
location / {
if ($to_canary) { proxy_pass http://canary; }
proxy_pass http ://stable; # removed canary cookie - instant rollback
}
}

5) Database Rolback and Data Protection

Expand → Migrate → Contract:

Expand - Add new fields/indexes, code supports old and new schema.
Migrate: the code starts writing to a new scheme, we do not break the old one.
Contract: delete the old only after stabilization.
PITR/snapshots: use only if logical compensation is not possible.
Compensations: separate scripts/jobs for fixing inserts/balances/payments.
Read-only windows: when criticized, we temporarily block the recording in order to "freeze" the state.

Example (SQL idea, excessively safe):

sql
-- expand
ALTER TABLE wallet ADD COLUMN bonus_balance NUMERIC DEFAULT 0 NULL;
CREATE INDEX CONCURRENTLY idx_wallet_bonus ON wallet(bonus_balance);

-- migrate in code, two-sided write
-- contract (after stabilization)
ALTER TABLE wallet DROP COLUMN legacy_bonus_balance;

6) Queues and caches on rollback

Version cache: keys prefixed with version ('v2:') → safe coexistence.
Disability: during rollback - mass cleaning 'v2:', return to 'v1:'.

Queues: parties/topics according to the version; replaying messages "from the checkpoint."

De-duplication/idempotency: idempotence keys for duplicate-free reprocessing.

7) SLO gates and auto rollbacks

Metrics: p95/99, error-rate, saturations (CPU/IO/GPU), queue depth, tokens/sec, payment conversion.

Policy (example):


if p95_latency_ms > 250 for 5m OR error_rate > 1. 5% for 3m OR payment_conv < baseline-0. 3%
then rollback release && open incident && freeze deploys

8) Runabooks (playbooks)

A) Post-release growth of p99 and 5xx

1. Stop promote (freeze canary/blue-green).
2. Switch traffic to stable revision.
3. Check cache hit/queue/PSP delay.
4. Remove diagnostics: logs, profiles, client/schema versions.
5. Communication: ChatOps, status channel, incident card.
6. Start corrective action: patch/hot fix/cancellation of feature.

B) Database migration error

1. Freeze writes (read-only, briefly).
2. Application rollback → stable version (compatible with old schema).
3. Execute compensation/rollback script.
4. Thaw record; observe drift/errors.

C) Payment Degradation (PSP)

1. Switch PSP routing to the previous route.
2. Rolling back processing release.
3. Reconcile all pending payments, repeat with idempotent keys.

D) LLM/recommendations degrade

1. Disable new model/parameters (feature flag).
2. Return the previous endpoint/weight; Clear new revision KV cache.
3. Check tokens/s, first latency token, toxicity.

9) Communications and freezing releases

Freeze window: after rollback - pause releases to RCA/fix.
Single channel: status updates, chronology of actions, who did what.
Stakeholders: Product/CS/Payments/Lawyers (at PII).

10) Post-incident: analysis and prevention

RCA (no charges): root cause, contributing factors, why gates didn't work (if they didn't).
Actions: migration tests, limits, feature gates, observability.

SLO threshold: adjustment if too "soft "/" hard. "

Documentation: update runabooks, add alerts, training (game-day).

11) Tools and templates

GitOps: Argo CD/Flux - 'revert '/' rollback' commit with version.
Progressive delivery: Argo Rollouts/Flagger - stop/roll back on metrics.
Edge/Ingress: weight routing, cookie routing, fast switch.
Feature flags: fractional rollout, kill-switch.
Migration DB: mig-frameworks with up/down, dry-run, throttling.
Observability: ready-made dashboards "release compare" (stable vs canary).

12) Checklist of readiness for rollbacks

1. Versioned and signed artifacts (images/charts/SQL).
2. Two-rail configs/secrets/caches/queues (version prefixes).
3. DB diagram by expand→migrate→contract.
4. Canary and blue-green releases with SLO gates and auto kickbacks.
5. Runabooks for key scenarios (payments/DB/cache/LLM).
6. ChatOps buttons: '/rollback ', '/freeze', '/promote '.
7. Audit and logging: who, when, what rolled back; diagnostic artifacts.
8. Game-day workouts: simulating dips and recoveries.
9. Business and support communications plan.
10. Stable vs new on one screen.

13) Anti-patterns

Disruptive migrations before code is rolled out (no backward compatibility).
Shared caches/queues without versions → dirty rollback.
No GitOps/Change History → Manual Edits to Prod.
Canary release without gates/telemetry → late detection.
Rollback without freeze and RCA → repeat the incident.
Monitoring only technical metrics without business metrics (payments/rates).
"Secrets common" to all revisions → it is difficult to isolate the incident.

Summary

Reliable rollback is not a "stop crane," but a process built into releases: versioning and compatibility, isolated dependencies, SLO gates, GitOps reality, automatic rollbacks and clear runabooks. This approach allows iGaming platforms to quickly regain stability, minimizing data and revenue losses, and turn each incident into a source of improvement.

Rollbacks and stability recovery

Brief Summary

GitOps rollback (idea)

Summary

Get in Touch

Quick Contact

The video will be updated soon

We are currently very busy with projects