Rollback Scenarios
(Section: Operations and Management)
1) Why do you need rollback scenarios
Even with perfect testing, some changes lead to degradation. A rollback is the managed operation of reverting to a pre-defined known-good version without data loss or compliance violations. Objectives: reduce MTTR, protect money and data, and maintain the trust of partners and regulators.
2) Classification of changes and rollback approaches
Code and containers: versioned artifacts → blue-green, canary, or rolling deployments with instant rollback to the previous image (see the registry sketch after this list).
Configurations/feature flags: feature toggle rollback, atomic switches with TTL and audit.
Database schemas: expand → migrate → contract, bidirectional migrations, "shadow" columns, backfill in the background.
Data/price lists/taxes: versioned artifacts ('fx_version', 'tax_rule_version', 'pricelist_version'), freeze and revert.
Integrations (PSP/KYC/content providers): switching routes/pools, fallback to the backup provider.
Infrastructure/networks/CDN: phased rollback of rules/routes, rollback of certificates/keys with dual loading (old + new).
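A minimal sketch of the "versioned artifacts" idea behind these approaches: a registry that records every published release as an immutable entry and can always name the previous known-good version to roll back to. The class and field names (ReleaseRegistry, known_good) are illustrative, not a specific tool's API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Artifact:
    version: str          # e.g. "2.14.0"
    digest: str           # content hash of the signed image/config
    known_good: bool = False

@dataclass
class ReleaseRegistry:
    """Keeps an ordered history of immutable artifacts per service."""
    history: dict[str, list[Artifact]] = field(default_factory=dict)

    def publish(self, service: str, artifact: Artifact) -> None:
        self.history.setdefault(service, []).append(artifact)

    def rollback_target(self, service: str) -> Artifact:
        """Return the most recent known-good artifact, excluding the current one."""
        *previous, _current = self.history[service]
        for artifact in reversed(previous):
            if artifact.known_good:
                return artifact
        raise LookupError(f"no known-good artifact recorded for {service}")

registry = ReleaseRegistry()
registry.publish("checkout", Artifact("2.13.0", "sha256:aaa", known_good=True))
registry.publish("checkout", Artifact("2.14.0", "sha256:bbb"))
print(registry.rollback_target("checkout").version)   # -> 2.13.0
```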
3) Architectural patterns for reversibility
Immutable releases: each release is a signed artifact (image/config) with the ability to instantly select the previous one.
Compatibility layers: schema-compat (add, not remove), tolerant-reader on the consumer side.
Double-write and shadow-reads: compare consistency before "switching" (see the sketch after this list).
Idempotence and sagas: compensation steps for cross-service transactions.
Feature flags: rapid kill switch / phased enablement instead of a "hot" redeploy.
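A minimal sketch of the double-write / shadow-read pattern, assuming two plain key-value stores stand in for the old and new storage: reads are still served from the old path, while shadow reads from the new path are only compared and logged before any cutover.

```python
import logging

log = logging.getLogger("shadow-read")

class DualStore:
    """Dual-write to old and new storage; serve reads from the old path and
    compare shadow reads from the new path before switching over."""

    def __init__(self, old_store: dict, new_store: dict):
        self.old = old_store
        self.new = new_store
        self.mismatches = 0

    def write(self, key: str, value) -> None:
        self.old[key] = value          # primary write
        self.new[key] = value          # shadow write (a failure here must not fail the request)

    def read(self, key: str):
        primary = self.old.get(key)
        shadow = self.new.get(key)     # shadow read: compared, never returned
        if shadow != primary:
            self.mismatches += 1
            log.warning("shadow mismatch for %s: old=%r new=%r", key, primary, shadow)
        return primary

store = DualStore(old_store={}, new_store={})
store.write("order:42", {"total": "19.90"})
store.read("order:42")
print("mismatches:", store.mismatches)   # 0 -> safe to consider switching reads
```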
4) Rollout strategies with return points
Canary N%: metrics/guardrails → auto-rollback on degradation; if successful, expand to 100% (see the sketch after this list).
Blue-green: two production stacks; switch traffic over and roll back instantly.
Rolling with pauses: update in batches with "pause points" and the ability to roll back to the previous wave.
Feature flags by cohort: "dark launch," whitelists, regional/tenant flags.
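A sketch of a canary rollout loop with pause points and auto-rollback; the shift_traffic, healthy, and rollback callables are placeholders for whatever traffic manager, metrics source, and release tooling is actually in use.

```python
from typing import Callable

def run_canary(
    stages: list[int],                     # traffic percentages, e.g. [1, 5, 25, 100]
    shift_traffic: Callable[[int], None],  # routes N% of traffic to the new version
    healthy: Callable[[], bool],           # guardrail check: SLO / error-rate / business metrics
    rollback: Callable[[], None],          # returns 100% of traffic to the previous version
) -> bool:
    """Expand the canary stage by stage; on any guardrail violation, roll back and stop."""
    for percent in stages:
        shift_traffic(percent)
        if not healthy():                  # "pause point": observe before the next wave
            rollback()
            return False
    return True

# Illustrative wiring; a real deployment would call the traffic manager / flag service here.
ok = run_canary(
    stages=[1, 5, 25, 100],
    shift_traffic=lambda p: print(f"canary at {p}%"),
    healthy=lambda: True,
    rollback=lambda: print("auto-rollback to previous release"),
)
print("promoted" if ok else "rolled back")
```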
5) Rolling back databases and migrations: safe patterns
Never run "destructive" migrations without expand → migrate → contract (see the migration sketch after this list):
1. Expand: add new columns/indexes/endpoints; the code writes to both versions.
2. Migrate: backfill and validations; reading "shadow" from the new structure.
3. Contract: disable old after stability.
Bidirectionality: each migration has a 'down()'; for large data sets, use a logical revert (flags, routing) instead of physical deletion.
Snapshots/point-in-time: PITR/snapshot of tables before a critical release.
Schema control: contract validators in CI/CD + "dry runs" on staging/replica.
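A sketch of the expand → migrate → contract flow expressed as reversible migration steps. The table, column names, and SQL are illustrative; the point is that every step ships with a 'down()' (or a logical revert) and that the contract step only runs after a PITR/snapshot point exists.

```python
# Each phase ships as its own reversible migration; SQL and names are illustrative.
MIGRATIONS = {
    "001_expand": {
        # Expand: additive only, old code keeps working.
        "up":   "ALTER TABLE orders ADD COLUMN currency_v2 TEXT;",
        "down": "ALTER TABLE orders DROP COLUMN currency_v2;",
    },
    "002_migrate": {
        # Migrate: backfill in batches; logically reversible because the old column stays intact.
        "up":   "UPDATE orders SET currency_v2 = currency WHERE currency_v2 IS NULL;",
        "down": "UPDATE orders SET currency_v2 = NULL;",
    },
    "003_contract": {
        # Contract: only after the release is stable; take a snapshot/PITR point first.
        "up":   "ALTER TABLE orders DROP COLUMN currency;",
        "down": "ALTER TABLE orders ADD COLUMN currency TEXT;",  # restores structure; data needs PITR
    },
}

def apply(conn, name: str, direction: str = "up") -> None:
    """Run one migration step; `conn` is any connection exposing execute()/commit()
    directly (sqlite3-style); adapt to a cursor for other drivers."""
    conn.execute(MIGRATIONS[name][direction])
    conn.commit()
```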
6) Catalog/price/tax rollback
Version price lists and tax rules; keep publication receipts.
Pin 'fx_version'/'tax_rule_version' in orders so that refunds do not break old receipts (see the sketch after this list).
On "PriceMismatch": force cache invalidation, revert to the previous artifact version, compensate per policy.
7) Integrations and external providers
PSP/KYC/content: keep backup routes, health probes, fast DNS/LB switching, and separate keys per provider.
Webhooks: enable buffering and queueing of deliveries; during rollback, replay from the "dead letter" queue with idempotency keys (see the replay sketch after this list).
Certificates/keys: dual loading (old + new), compatibility check before switching.
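A sketch of replaying a dead-letter queue with idempotency keys after a rollback; the in-memory `processed` set stands in for a persistent deduplication store.

```python
processed: set[str] = set()   # in practice a persistent store keyed by idempotency key

def handle_webhook(event: dict) -> bool:
    """Process a webhook exactly once per idempotency key."""
    key = event["idempotency_key"]
    if key in processed:
        return False              # duplicate delivery or duplicate replay: safely ignored
    # ... apply the side effect (credit account, update order, etc.) ...
    processed.add(key)
    return True

def replay_dead_letters(dead_letter_queue: list[dict]) -> int:
    """After the rollback, drain the dead-letter queue; duplicates are deduplicated."""
    replayed = 0
    for event in dead_letter_queue:
        if handle_webhook(event):
            replayed += 1
    return replayed

dlq = [
    {"idempotency_key": "evt-1", "type": "payment.captured"},
    {"idempotency_key": "evt-1", "type": "payment.captured"},   # duplicate delivery
    {"idempotency_key": "evt-2", "type": "payment.captured"},
]
print(replay_dead_letters(dlq))   # -> 2
```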
8) Automation of rollbacks ("runes") and guardrails
Runes (one-click buttons): Rollback Release, Disable Flag, Re-route, Flush Cache, Scale Back, Restore Schema.
Guardrails: a rollback can be triggered only by the IC/owner; actions are signed (DSSE), rate-limited, and gated by a confirmation checklist.
Auto-rollback: trigger conditions on SLO/percentiles/errors/financial signals (for example, Δ quote↔checkout ≠ 0); see the guardrail sketch below.
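A sketch of an auto-rollback guardrail check, with illustrative thresholds: any violated guardrail in the observation window triggers the rollback rune instead of waiting for a human decision.

```python
from dataclasses import dataclass

@dataclass
class Guardrails:
    max_error_rate: float = 0.02            # 2% errors
    max_p95_ms: float = 800.0               # latency budget
    max_quote_checkout_delta: float = 0.0   # Δ quote↔checkout must stay at zero

def should_auto_rollback(metrics: dict, g: Guardrails = Guardrails()) -> list[str]:
    """Return the list of violated guardrails; any violation triggers the rollback rune."""
    violations = []
    if metrics["error_rate"] > g.max_error_rate:
        violations.append("error-rate")
    if metrics["p95_ms"] > g.max_p95_ms:
        violations.append("p95 latency")
    if abs(metrics["quote_checkout_delta"]) > g.max_quote_checkout_delta:
        violations.append("quote↔checkout mismatch")
    return violations

window = {"error_rate": 0.001, "p95_ms": 950.0, "quote_checkout_delta": 0.0}
if violated := should_auto_rollback(window):
    print("auto-rollback:", ", ".join(violated))   # -> auto-rollback: p95 latency
```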
9) Communications and artifacts
In the release card: version, hashes, pre-flight checklist, rollback playbook, owner (see the sketch after this list).
During rollback: timestamps, cause, volume of affected traffic, artifacts (log links, before/after metrics).
External communications (status page/partners): concise and factual.
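A sketch of the release card and rollback record as structured data, so the artifacts listed above are captured in one place rather than reconstructed after the fact; all field names and the URL are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ReleaseCard:
    version: str
    artifact_digest: str
    preflight_checklist: list[str]
    rollback_playbook_url: str
    owner: str

@dataclass
class RollbackRecord:
    """Filled in while the rollback runs and attached to the release card afterwards."""
    started_at: datetime
    finished_at: datetime
    cause: str
    affected_traffic_pct: float
    artifacts: list[str] = field(default_factory=list)   # log links, before/after metric snapshots

card = ReleaseCard(
    version="2.14.0",
    artifact_digest="sha256:bbb",
    preflight_checklist=["migrations validated", "PITR point taken", "backup PSP route tested"],
    rollback_playbook_url="https://wiki.example/playbooks/checkout-rollback",   # illustrative URL
    owner="checkout-team",
)
```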
10) Rollback playbooks (reference)
Code/image degraded (P1): 1) re-route / blue-green back → 2) pin the rollback version → 3) block further rollouts → 4) post-incident review.
A flag causes an increase in errors: 1) disable the feature flag (100%) → 2) flush cache / enable fallbacks → 3) file a fix ticket.
A database migration causes timeouts: 1) stop the heavy backfill → 2) return reads to the old schema (dual-read off) → 3) reduce load/indexes → 4) evaluate 'down()' or a logical rollback.
PriceMismatch/FX/Tax: 1) roll back 'pricelist_version'/'tax_rule_version' → 2) disable the edge cache → 3) compensate and reconcile receipts.
PSP failure: 1) switch to a standby PSP → 2) quarantine "gray" transactions → 3) replay queues after stabilization (see the playbook sketch below).
Key/certificate broken: 1) return to the previous key (dual-key) → 2) rotate and republish.
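A sketch of a playbook expressed as ordered, idempotent steps, using the PSP-failure playbook above as the example; the step functions are placeholders for the real re-routing, quarantine, and replay actions.

```python
from typing import Callable

# Illustrative steps for the "PSP failure" playbook above; each step should be
# idempotent so the playbook can be re-run safely if it is interrupted.
def switch_to_standby_psp() -> None:
    print("route payments to the standby PSP")

def quarantine_gray_transactions() -> None:
    print("quarantine in-flight 'gray' transactions")

def replay_queues() -> None:
    print("replay queued webhooks/jobs after stabilization")

PSP_FAILURE_PLAYBOOK: list[tuple[str, Callable[[], None]]] = [
    ("re-route", switch_to_standby_psp),
    ("quarantine", quarantine_gray_transactions),
    ("replay", replay_queues),
]

def run_playbook(steps: list[tuple[str, Callable[[], None]]]) -> None:
    """Execute steps in order; an exception stops the run so the state stays diagnosable."""
    for name, step in steps:
        print(f"step: {name}")
        step()

run_playbook(PSP_FAILURE_PLAYBOOK)
```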
11) RACI
12) Quality and SLO metrics
Change Failure Rate (CFR) - share of releases that required a rollback (target ↓); see the calculation sketch after this list.
MTTR (with rollback) is the median time to return to stability.
Time-to-Rollback - from the trigger to the end of the rollback (P1 ≤ 15-20 minutes).
Δ - before/after metrics (p95, error-rate, E2E success).
Repeated rollbacks for the same root cause ≤ N per quarter.
Audit coverage: 100% rollbacks with artifacts and signatures.
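A sketch of how CFR, MTTR, and Time-to-Rollback can be computed from per-release records; the record fields and sample numbers are illustrative.

```python
from statistics import median

# Each record: whether the release needed a rollback, and key intervals in minutes.
releases = [
    {"rolled_back": False},
    {"rolled_back": True, "trigger_to_rollback_min": 12, "trigger_to_stable_min": 25},
    {"rolled_back": True, "trigger_to_rollback_min": 18, "trigger_to_stable_min": 40},
    {"rolled_back": False},
]

rollbacks = [r for r in releases if r["rolled_back"]]

cfr = len(rollbacks) / len(releases)                              # Change Failure Rate
mttr = median(r["trigger_to_stable_min"] for r in rollbacks)      # median time back to stability
time_to_rollback = median(r["trigger_to_rollback_min"] for r in rollbacks)

print(f"CFR={cfr:.0%}  MTTR={mttr} min  Time-to-Rollback={time_to_rollback} min")
```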
13) Security, privacy, compliance
WORM logs for releases/rollbacks; artifact retention per regulator requirements.
PII/finance: verify that a rollback does not re-open access to restricted zones or reinstate old policies.
SoD: the person who deploys ≠ the person who approves ≠ the person who initiates the rollback.
Credentials/secrets: dual-key rollover and instant return to the previous key (see the dual-key sketch below).
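A sketch of dual-key verification during rotation, using HMAC signatures as the example: both the new and the previous key stay loaded, so reverting to the old key is instant and nothing signed with either key breaks.

```python
import hashlib
import hmac

def sign(payload: bytes, key: bytes) -> str:
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_dual_key(payload: bytes, signature: str, new_key: bytes, old_key: bytes) -> bool:
    """During rotation both keys stay loaded: messages signed with either key verify,
    so returning to the previous key is a non-breaking, instant operation."""
    return any(
        hmac.compare_digest(signature, sign(payload, key))
        for key in (new_key, old_key)
    )

old, new = b"key-v1", b"key-v2"
msg = b'{"order_id": "A-1001"}'
sig_from_partner = sign(msg, old)               # partner still signs with the previous key
print(verify_dual_key(msg, sig_from_partner, new_key=new, old_key=old))   # -> True
```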
14) Financial and operational effects
Cost of downtime vs. cost of rollback: automate the decision through SLO guardrails.
SLA compensations/credits - templates in playbooks.
Egress/compute caps: a rollback can temporarily raise the load (replays, cache warm-up) - plan maintenance windows for it.
15) Pre-release checklist (go/no-go)
- Signed artifacts and return point (image/config/data version).
- Rollout plan and rollback playbook (in steps).
- Migration validated: expand→migrate→contract, PITR active.
- SLO thresholds/guardrails: auto-rollback conditions wired into the alerting system.
- Communication channels: IC/Owners/Comms on-call.
- Backward compatibility tests and "dry run" on staging.
- Backup routes for critical integrations.
- Communication Plan (Internal/External) and templates.
16) Checklist during a rollback (mid-incident)
- Acknowledge trigger and affected volume (region/tenant/channel).
- Record exactly what is being rolled back and to which version.
- Run the rollback rune (code/flag/route/data).
- Check SLI/SLO and business metrics (E2E, checkout, webhooks).
- Check directories/versions (FX/Tax/PriceList).
- Freeze the state: block new rollouts, collect artifacts.
- Communication: status page, partners, internal.
17) Frequent errors and anti-patterns
Rollback "manually" without artifacts and signatures.
Destructive migrations without bidirectionality and PITR.
A feature flag without a global kill switch.
No backup routes to PSP/KYC.
Flush cache without warming up → avalanche of cold requests.
Unreconciled quote ≠ checkout discrepancies after reverting a price list.
18) FAQ
When is a rollback better than an "in place" fix?
When an SLO is violated or money/data is at risk, it is faster and safer to return to a known stable version.
Is it possible to roll back "destructive" migrations?
Yes, if they are designed as expand→migrate→contract with 'down()'/PITR and a logical fallback.
How do I automate my rollback decision?
SLO guardrails (p95, error rate, Δ values, webhook success) + a risk matrix → auto-rune.
What to do with orders/transactions caught "in between"?
Idempotency keys, quarantine of "gray" operations, queue replays with deduplication.
Summary: Rollback scenarios are not improvisation but a pre-engineered ability to return quickly to stability. Version everything, keep the data schema reversible, use feature flags and canaries, automate runes, and capture artifacts and SLO guardrails. Then any release remains manageable, and the business stays predictably stable.