Rollback Scenarios
(Section: Operations and Management)
1) Why do you need rollback scenarios
Even with perfect testing, some changes lead to degradation. A rollback is the managed operation of reverting to a pre-defined known-good version without data loss or compliance violations. Objectives: reduce MTTR, protect money and data, and maintain the trust of partners and regulators.
2) Classification of changes and rollback approaches
Code and containers: versioned artifacts → blue-green, canary, or rolling deployments with instant rollback to the previous image (see the registry sketch after this list).
Configurations/feature flags: feature toggle rollback, atomic switches with TTL and audit.
Database schemas: expand → migrate → contract, bidirectional migrations, "shadow" columns, backfill in the background.
Data/price lists/taxes: versioned artifacts ('fx_version', 'tax_rule_version', 'pricelist_version'), freeze and revert.
Integrations (PSP/KYC/content providers): switching routes/pools, fallback to the backup provider.
Infrastructure/networks/CDN: phased rollback of rules/routes, rollback of certificates/keys with dual loading (old + new).
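A minimal sketch of the "versioned artifacts" idea behind these approaches: a registry that records every published release as an immutable entry and can always name the previous known-good version to roll back to. The class and field names (ReleaseRegistry, known_good) are illustrative, not a specific tool's API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Artifact:
    version: str          # e.g. "2.14.0"
    digest: str           # content hash of the signed image/config
    known_good: bool = False

@dataclass
class ReleaseRegistry:
    """Keeps an ordered history of immutable artifacts per service."""
    history: dict[str, list[Artifact]] = field(default_factory=dict)

    def publish(self, service: str, artifact: Artifact) -> None:
        self.history.setdefault(service, []).append(artifact)

    def rollback_target(self, service: str) -> Artifact:
        """Return the most recent known-good artifact, excluding the current one."""
        *previous, _current = self.history[service]
        for artifact in reversed(previous):
            if artifact.known_good:
                return artifact
        raise LookupError(f"no known-good artifact recorded for {service}")

registry = ReleaseRegistry()
registry.publish("checkout", Artifact("2.13.0", "sha256:aaa", known_good=True))
registry.publish("checkout", Artifact("2.14.0", "sha256:bbb"))
print(registry.rollback_target("checkout").version)   # -> 2.13.0
```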
3) Architectural patterns for reversibility
Immutable releases: each release is a signed artifact (image/config) with the ability to instantly select the previous one.
Compatibility layers: schema-compat (add, not remove), tolerant-reader on the consumer side.
Double-write and shadow-reads: compare consistency before "switching" (see the sketch after this list).
Idempotence and sagas: compensation steps for cross-service transactions.
Feature flags: rapid kill switch / phased enablement instead of a "hot" redeploy.
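A minimal sketch of the double-write / shadow-read pattern, assuming two plain key-value stores stand in for the old and new storage: reads are still served from the old path, while shadow reads from the new path are only compared and logged before any cutover.

```python
import logging

log = logging.getLogger("shadow-read")

class DualStore:
    """Dual-write to old and new storage; serve reads from the old path and
    compare shadow reads from the new path before switching over."""

    def __init__(self, old_store: dict, new_store: dict):
        self.old = old_store
        self.new = new_store
        self.mismatches = 0

    def write(self, key: str, value) -> None:
        self.old[key] = value          # primary write
        self.new[key] = value          # shadow write (a failure here must not fail the request)

    def read(self, key: str):
        primary = self.old.get(key)
        shadow = self.new.get(key)     # shadow read: compared, never returned
        if shadow != primary:
            self.mismatches += 1
            log.warning("shadow mismatch for %s: old=%r new=%r", key, primary, shadow)
        return primary

store = DualStore(old_store={}, new_store={})
store.write("order:42", {"total": "19.90"})
store.read("order:42")
print("mismatches:", store.mismatches)   # 0 -> safe to consider switching reads
```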
4) Rollout strategies with return points
Canary N%: metrics/guardrails → auto-rollback on degradation; if successful, expand to 100% (see the sketch after this list).
Blue-green: two production stacks; switch traffic over and roll back instantly.
Rolling with pauses: update in batches with "pause points" and the ability to roll back to the previous wave.
Feature flags by cohort: "dark launch," whitelists, regional/tenant flags.
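A sketch of a canary rollout loop with pause points and auto-rollback; the shift_traffic, healthy, and rollback callables are placeholders for whatever traffic manager, metrics source, and release tooling is actually in use.

```python
from typing import Callable

def run_canary(
    stages: list[int],                     # traffic percentages, e.g. [1, 5, 25, 100]
    shift_traffic: Callable[[int], None],  # routes N% of traffic to the new version
    healthy: Callable[[], bool],           # guardrail check: SLO / error-rate / business metrics
    rollback: Callable[[], None],          # returns 100% of traffic to the previous version
) -> bool:
    """Expand the canary stage by stage; on any guardrail violation, roll back and stop."""
    for percent in stages:
        shift_traffic(percent)
        if not healthy():                  # "pause point": observe before the next wave
            rollback()
            return False
    return True

# Illustrative wiring; a real deployment would call the traffic manager / flag service here.
ok = run_canary(
    stages=[1, 5, 25, 100],
    shift_traffic=lambda p: print(f"canary at {p}%"),
    healthy=lambda: True,
    rollback=lambda: print("auto-rollback to previous release"),
)
print("promoted" if ok else "rolled back")
```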
5) Rolling back databases and migrations: safe patterns
Never run "destructive" migrations without expand → migrate → contract (see the migration sketch after this list):
1. Expand: add new columns/indexes/endpoints; the code writes to both versions.
2. Migrate: backfill and validations; reading "shadow" from the new structure.
3. Contract: disable old after stability.
Bidirectionality: each migration has a 'down()'; for large data sets, use a logical revert (flags, routing) instead of physical deletion.
Snapshots/point-in-time: PITR/snapshot of tables before a critical release.
Schema control: contract validators in CI/CD + "dry runs" on staging/replica.
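A sketch of the expand → migrate → contract flow expressed as reversible migration steps. The table, column names, and SQL are illustrative; the point is that every step ships with a 'down()' (or a logical revert) and that the contract step only runs after a PITR/snapshot point exists.

```python
# Each phase ships as its own reversible migration; SQL and names are illustrative.
MIGRATIONS = {
    "001_expand": {
        # Expand: additive only, old code keeps working.
        "up":   "ALTER TABLE orders ADD COLUMN currency_v2 TEXT;",
        "down": "ALTER TABLE orders DROP COLUMN currency_v2;",
    },
    "002_migrate": {
        # Migrate: backfill in batches; logically reversible because the old column stays intact.
        "up":   "UPDATE orders SET currency_v2 = currency WHERE currency_v2 IS NULL;",
        "down": "UPDATE orders SET currency_v2 = NULL;",
    },
    "003_contract": {
        # Contract: only after the release is stable; take a snapshot/PITR point first.
        "up":   "ALTER TABLE orders DROP COLUMN currency;",
        "down": "ALTER TABLE orders ADD COLUMN currency TEXT;",  # restores structure; data needs PITR
    },
}

def apply(conn, name: str, direction: str = "up") -> None:
    """Run one migration step; `conn` is any connection exposing execute()/commit()
    directly (sqlite3-style); adapt to a cursor for other drivers."""
    conn.execute(MIGRATIONS[name][direction])
    conn.commit()
```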
6) Catalog/price/tax rollback
Version price lists and tax rules; keep publication receipts.
Pin 'fx_version'/'tax_rule_version' in orders so that refunds do not break old receipts (see the sketch after this list).
On "PriceMismatch": force cache invalidation, revert to the previous artifact version, compensate per policy.
7) Integrations and external providers
PSP/KYC/content: keep backup routes, health probes, fast DNS/LB switching, and separate keys per provider.
Webhooks: enable buffering and queueing of deliveries; during rollback, replay from the "dead letter" queue with idempotency keys (see the replay sketch after this list).
Certificates/keys: dual loading (old + new), compatibility check before switching.
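A sketch of replaying a dead-letter queue with idempotency keys after a rollback; the in-memory `processed` set stands in for a persistent deduplication store.

```python
processed: set[str] = set()   # in practice a persistent store keyed by idempotency key

def handle_webhook(event: dict) -> bool:
    """Process a webhook exactly once per idempotency key."""
    key = event["idempotency_key"]
    if key in processed:
        return False              # duplicate delivery or duplicate replay: safely ignored
    # ... apply the side effect (credit account, update order, etc.) ...
    processed.add(key)
    return True

def replay_dead_letters(dead_letter_queue: list[dict]) -> int:
    """After the rollback, drain the dead-letter queue; duplicates are deduplicated."""
    replayed = 0
    for event in dead_letter_queue:
        if handle_webhook(event):
            replayed += 1
    return replayed

dlq = [
    {"idempotency_key": "evt-1", "type": "payment.captured"},
    {"idempotency_key": "evt-1", "type": "payment.captured"},   # duplicate delivery
    {"idempotency_key": "evt-2", "type": "payment.captured"},
]
print(replay_dead_letters(dlq))   # -> 2
```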
8) Automation of rollbacks ("runes") and guardrails
Runes (one-click buttons): Rollback Release, Disable Flag, Re-route, Flush Cache, Scale Back, Restore Schema.
Guardrails: a rollback can be triggered only by the IC/owner; actions are signed (DSSE), rate-limited, and gated by a confirmation checklist.
Auto-rollback: trigger conditions on SLO/percentiles/errors/financial signals (for example, Δ quote↔checkout ≠ 0); see the guardrail sketch below.
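A sketch of an auto-rollback guardrail check, with illustrative thresholds: any violated guardrail in the observation window triggers the rollback rune instead of waiting for a human decision.

```python
from dataclasses import dataclass

@dataclass
class Guardrails:
    max_error_rate: float = 0.02            # 2% errors
    max_p95_ms: float = 800.0               # latency budget
    max_quote_checkout_delta: float = 0.0   # Δ quote↔checkout must stay at zero

def should_auto_rollback(metrics: dict, g: Guardrails = Guardrails()) -> list[str]:
    """Return the list of violated guardrails; any violation triggers the rollback rune."""
    violations = []
    if metrics["error_rate"] > g.max_error_rate:
        violations.append("error-rate")
    if metrics["p95_ms"] > g.max_p95_ms:
        violations.append("p95 latency")
    if abs(metrics["quote_checkout_delta"]) > g.max_quote_checkout_delta:
        violations.append("quote↔checkout mismatch")
    return violations

window = {"error_rate": 0.001, "p95_ms": 950.0, "quote_checkout_delta": 0.0}
if violated := should_auto_rollback(window):
    print("auto-rollback:", ", ".join(violated))   # -> auto-rollback: p95 latency
```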
9) Communications and artifacts
In the release card: version, hashes, pre-flight checklist, rollback playbook, owner (see the sketch after this list).
During rollback: timestamps, cause, volume of affected traffic, artifacts (log links, before/after metrics).
External communications (status page/partners): concise and factual.
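A sketch of the release card and rollback record as structured data, so the artifacts listed above are captured in one place rather than reconstructed after the fact; all field names and the URL are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ReleaseCard:
    version: str
    artifact_digest: str
    preflight_checklist: list[str]
    rollback_playbook_url: str
    owner: str

@dataclass
class RollbackRecord:
    """Filled in while the rollback runs and attached to the release card afterwards."""
    started_at: datetime
    finished_at: datetime
    cause: str
    affected_traffic_pct: float
    artifacts: list[str] = field(default_factory=list)   # log links, before/after metric snapshots

card = ReleaseCard(
    version="2.14.0",
    artifact_digest="sha256:bbb",
    preflight_checklist=["migrations validated", "PITR point taken", "backup PSP route tested"],
    rollback_playbook_url="https://wiki.example/playbooks/checkout-rollback",   # illustrative URL
    owner="checkout-team",
)
```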
10) Rollback playbooks (reference)
Code/image degraded (P1): 1) re-route / blue-green back → 2) pin the rollback version → 3) block further rollouts → 4) post-incident review.
A flag causes an increase in errors: 1) disable the feature flag (100%) → 2) flush cache / enable fallbacks → 3) file a fix ticket.
A database migration causes timeouts: 1) stop the heavy backfill → 2) return reads to the old schema (dual-read off) → 3) reduce load/indexes → 4) evaluate 'down()' or a logical rollback.
PriceMismatch/FX/Tax: 1) roll back 'pricelist_version'/'tax_rule_version' → 2) disable the edge cache → 3) compensate and reconcile receipts.
PSP failure: 1) switch to a standby PSP → 2) quarantine "gray" transactions → 3) replay queues after stabilization (see the playbook sketch below).
Key/certificate broken: 1) return to the previous key (dual-key) → 2) rotate and republish.
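A sketch of a playbook expressed as ordered, idempotent steps, using the PSP-failure playbook above as the example; the step functions are placeholders for the real re-routing, quarantine, and replay actions.

```python
from typing import Callable

# Illustrative steps for the "PSP failure" playbook above; each step should be
# idempotent so the playbook can be re-run safely if it is interrupted.
def switch_to_standby_psp() -> None:
    print("route payments to the standby PSP")

def quarantine_gray_transactions() -> None:
    print("quarantine in-flight 'gray' transactions")

def replay_queues() -> None:
    print("replay queued webhooks/jobs after stabilization")

PSP_FAILURE_PLAYBOOK: list[tuple[str, Callable[[], None]]] = [
    ("re-route", switch_to_standby_psp),
    ("quarantine", quarantine_gray_transactions),
    ("replay", replay_queues),
]

def run_playbook(steps: list[tuple[str, Callable[[], None]]]) -> None:
    """Execute steps in order; an exception stops the run so the state stays diagnosable."""
    for name, step in steps:
        print(f"step: {name}")
        step()

run_playbook(PSP_FAILURE_PLAYBOOK)
```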
11) RACI
12) Quality and SLO metrics
Change Failure Rate (CFR) - share of releases that required a rollback (target ↓); see the calculation sketch after this list.
MTTR (with rollback) is the median time to return to stability.
Time-to-Rollback - from the trigger to the end of the rollback (P1 ≤ 15-20 minutes).
Δ - before/after metrics (p95, error-rate, E2E success).
Repeated rollbacks for the same root cause ≤ N per quarter.
Audit coverage: 100% rollbacks with artifacts and signatures.
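A sketch of how CFR, MTTR, and Time-to-Rollback can be computed from per-release records; the record fields and sample numbers are illustrative.

```python
from statistics import median

# Each record: whether the release needed a rollback, and key intervals in minutes.
releases = [
    {"rolled_back": False},
    {"rolled_back": True, "trigger_to_rollback_min": 12, "trigger_to_stable_min": 25},
    {"rolled_back": True, "trigger_to_rollback_min": 18, "trigger_to_stable_min": 40},
    {"rolled_back": False},
]

rollbacks = [r for r in releases if r["rolled_back"]]

cfr = len(rollbacks) / len(releases)                              # Change Failure Rate
mttr = median(r["trigger_to_stable_min"] for r in rollbacks)      # median time back to stability
time_to_rollback = median(r["trigger_to_rollback_min"] for r in rollbacks)

print(f"CFR={cfr:.0%}  MTTR={mttr} min  Time-to-Rollback={time_to_rollback} min")
```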
13) Security, privacy, compliance
WORM logs for releases/rollbacks; artifact retention per regulator requirements.
PII/finance: verify that a rollback does not re-open access to restricted zones or reinstate old policies.
SoD: the person who deploys ≠ the person who approves ≠ the person who initiates the rollback.
Credentials/secrets: dual-key rollover and instant return to the previous key (see the dual-key sketch below).
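A sketch of dual-key verification during rotation, using HMAC signatures as the example: both the new and the previous key stay loaded, so reverting to the old key is instant and nothing signed with either key breaks.

```python
import hashlib
import hmac

def sign(payload: bytes, key: bytes) -> str:
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_dual_key(payload: bytes, signature: str, new_key: bytes, old_key: bytes) -> bool:
    """During rotation both keys stay loaded: messages signed with either key verify,
    so returning to the previous key is a non-breaking, instant operation."""
    return any(
        hmac.compare_digest(signature, sign(payload, key))
        for key in (new_key, old_key)
    )

old, new = b"key-v1", b"key-v2"
msg = b'{"order_id": "A-1001"}'
sig_from_partner = sign(msg, old)               # partner still signs with the previous key
print(verify_dual_key(msg, sig_from_partner, new_key=new, old_key=old))   # -> True
```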
14) Financial and operational effects
Cost of downtime vs. cost of rollback: automate the decision through SLO guardrails.
SLA compensations/credits - templates in playbooks.
Egress/compute caps: a rollback can temporarily raise the load (replays, cache warm-up) - plan maintenance windows for it.
15) Pre-release checklist (go/no-go)
- Signed artifacts and return point (image/config/data version).
- Rollout plan and rollback playbook (in steps).
- Migration validated: expand→migrate→contract, PITR active.
- SLO thresholds/guardrails: auto-rollback conditions wired into the alerting system.
- Communication channels: IC/Owners/Comms on-call.
- Backward compatibility tests and "dry run" on staging.
- Backup routes for critical integrations.
- Communication Plan (Internal/External) and templates.
16) Checklist during a rollback (mid-incident)
- Acknowledge trigger and affected volume (region/tenant/channel).
- Record exactly what is being rolled back and to which version.
- Run the rollback rune (code/flag/route/data).
- Check SLI/SLO and business metrics (E2E, checkout, webhooks).
- Check directories/versions (FX/Tax/PriceList).
- Freeze the state: block new rollouts, collect artifacts.
- Communication: status page, partners, internal.
17) Frequent errors and anti-patterns
Rollback "manually" without artifacts and signatures.
Destructive migrations without bidirectionality and PITR.
A feature flag without a global kill switch.
No backup routes to PSP/KYC.
Flush cache without warming up → avalanche of cold requests.
Unreconciled quote ≠ checkout discrepancies after reverting a price list.
18) FAQ
When is a rollback better than an "in place" fix?
When an SLO is violated or money/data is at risk, it is faster and safer to return to a known stable version.
Is it possible to roll back "destructive" migrations?
Yes, if they are designed as expand→migrate→contract with 'down()'/PITR and a logical fallback.
How do I automate my rollback decision?
SLO guardrails (p95, error rate, Δ values, webhook success) + a risk matrix → auto-rune.
What to do with orders/transactions caught "in between"?
Idempotency keys, quarantine of "gray" operations, queue replays with deduplication.
Summary: Rollback scenarios are not improvisation but a pre-engineered ability to return quickly to stability. Version everything, keep the data schema reversible, use feature flags and canaries, automate runes, and capture artifacts and SLO guardrails. Then any release remains manageable, and the business stays predictably stable.