Rollback strategies and atomic releases
1) Why you need a quick rollback
Even excellent test coverage does not guarantee an error-free production release. Rollback is the controlled return of a system to a previous stable state, triggered by SLO/business metrics or an incident signal. Objectives:
- Reduce MTTR to minutes.
- Limit the radius of impact (minimum affected users/transactions).
- Maintain data integrity and contract compatibility.
The key: Build releases so the rollback is a trivial action, not a mini-project.
2) The concept of "atomic release"
An atomic release is one where a new version/behavior can be enabled (and disabled again) by a single atomic operation with no lasting side effects.
Atomicity components:
- Immutable artifact (signed image/package).
- Versioned configs (promote versions, no manual edits).
- Separation of "delivery" from "activation" (routing/flags).
- Compatible data schema (both versions can run simultaneously).
- Rollback runbook: one clear step (change selector/weight/flag) + verification.
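The separation of delivery from activation can be illustrated with a small in-memory model. This is only a sketch: the class `ReleaseSwitch` and its methods are hypothetical names, and a real system would flip a Service selector, a route weight, or a flag instead of a field.

```typescript
// Hypothetical model: delivery registers a version, activation flips a single pointer.
class ReleaseSwitch {
  private delivered = new Set<string>();
  private active: string | null = null;
  private previous: string | null = null;

  // Delivery: the immutable artifact becomes available but receives no traffic yet.
  deliver(version: string): void {
    this.delivered.add(version);
  }

  // Activation: one pointer change; it fails without side effects if the version was never delivered.
  activate(version: string): void {
    if (!this.delivered.has(version)) {
      throw new Error(`version ${version} has not been delivered`);
    }
    this.previous = this.active;
    this.active = version;
  }

  // Rollback: flip the pointer back to the previously active version.
  rollback(): void {
    if (this.previous === null) {
      throw new Error("no previous version to roll back to");
    }
    const current = this.active;
    this.active = this.previous;
    this.previous = current;
  }

  current(): string | null {
    return this.active;
  }
}

const releaseSwitch = new ReleaseSwitch();
releaseSwitch.deliver("v1.9.2");
releaseSwitch.deliver("v1.9.3");
releaseSwitch.activate("v1.9.2");
releaseSwitch.activate("v1.9.3");
releaseSwitch.rollback();
console.log(releaseSwitch.current()); // prints "v1.9.2"
```

Because both versions stay delivered, the rollback needs no rebuild or redeploy; only the pointer moves.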
3) Inventory of rollback mechanisms
3.1 Traffic layer (fastest)
Blue-Green: Switch the Service/target group selector back to the stable version.
Canary: Lower the canary weight to 0% and freeze progression.
Gateway/NGINX/Service Mesh: Return to the previous weights/routes.
3.2 Pipeline level
Helm/Argo Rollouts: 'abort'/'rollback' to the previous revision.
GitOps: Revert the MR/commit in the manifest repository (the controller does the rest).
3.3 Application/features
Feature flags/kill switch: Turn off the risky path instantly.
Toggle configs: Go back to the previous config snapshot.
3.4 Data
Roll-forward migrations (preferred) + compatibility.
Point-in-time recovery (PITR) and backups for emergencies.
Compensation (Saga) and idempotency for reversible actions.
4) "expand → migrate → contract" pattern
For rollback to be safe, the data schema must allow the old and new versions to coexist.
1. Expand - add new fields/indexes (nullable) without breaking the old logic.
2. Migrate - dual write/read, back-fill, idempotent background jobs.
3. Contract - delete old fields/code after 100% rollout and a sustained observation window.
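The "migrate" step is where most rollback safety comes from, so here is a minimal sketch of dual write, tolerant read, and an idempotent back-fill. The row shape (`fullName` being split into `firstName`/`lastName`) and all function names are assumptions made for illustration, not part of any real schema.

```typescript
// Hypothetical row during the "migrate" phase: the old column plus new nullable columns.
interface UserRow {
  id: number;
  fullName: string;         // old field, still read by the previous version
  firstName: string | null; // new fields added in the "expand" step
  lastName: string | null;
}

function splitName(fullName: string): { firstName: string; lastName: string } {
  const parts = fullName.trim().split(/\s+/);
  return { firstName: parts[0] || "", lastName: parts.slice(1).join(" ") };
}

// Dual write: the new version fills both representations, so the old version keeps working.
function writeUser(id: number, firstName: string, lastName: string): UserRow {
  return { id, fullName: `${firstName} ${lastName}`.trim(), firstName, lastName };
}

// Tolerant read: prefer the new fields, fall back to the old one for rows not yet back-filled.
function readName(row: UserRow): { firstName: string; lastName: string } {
  if (row.firstName !== null && row.lastName !== null) {
    return { firstName: row.firstName, lastName: row.lastName };
  }
  return splitName(row.fullName);
}

// Idempotent back-fill: rows that already have the new fields are left untouched.
function backfill(rows: UserRow[]): number {
  let updated = 0;
  for (const row of rows) {
    if (row.firstName === null || row.lastName === null) {
      const parts = splitName(row.fullName);
      row.firstName = parts.firstName;
      row.lastName = parts.lastName;
      updated++;
    }
  }
  return updated;
}

const legacyRow: UserRow = { id: 1, fullName: "Ada Lovelace", firstName: null, lastName: null };
const dualWrittenRow = writeUser(2, "Alan", "Turing");
const allRows = [legacyRow, dualWrittenRow];

const nameBeforeBackfill = readName(legacyRow); // works even before the back-fill has run
const firstPass = backfill(allRows);            // updates only the legacy row
const secondPass = backfill(allRows);           // repeating the job changes nothing
console.log(nameBeforeBackfill, firstPass, secondPass);
```

Since the old field is never dropped during this phase, switching traffic back to the previous version at any point leaves it with data it can still read.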
5) SLO gating and auto rollback
Entry to each release stage must be "guarded" by metrics.
Technical SLOs: p95/p99 latency, 5xx rate, saturation (CPU/memory), error-budget burn.
Business metrics: conversion rate (CR) for deposits/cashouts, payment declines, fraud percentage, KYC errors.
Example rules:
- 5xx > 0.5% for 10 minutes → rollback.
- p95 more than 20% above baseline → hold + analysis.
- PSP error rate up by more than 0.3 p.p. → rollback + switch payment route.
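The example thresholds above can be expressed as a small decision function. This is a sketch under assumptions: the `GateSnapshot` shape and its field names are hypothetical, rates are fractions (0.005 = 0.5%), and the order in which rules are checked is one possible choice.

```typescript
type GateDecision = "proceed" | "hold" | "rollback" | "rollback_and_switch_route";

// Hypothetical metrics snapshot; field names are illustrative only.
interface GateSnapshot {
  error5xxRate: number;           // fraction of requests, e.g. 0.004 = 0.4%
  minutesAboveErrorLimit: number; // how long the 5xx rate has stayed above the limit
  p95Ms: number;
  baselineP95Ms: number;
  pspErrorRate: number;           // fraction of payment attempts
  baselinePspErrorRate: number;
}

function evaluateGate(s: GateSnapshot): GateDecision {
  // PSP errors up by more than 0.3 percentage points: roll back and switch the payment route.
  if (s.pspErrorRate - s.baselinePspErrorRate > 0.003) return "rollback_and_switch_route";
  // 5xx above 0.5% sustained for 10 minutes: roll back.
  if (s.error5xxRate > 0.005 && s.minutesAboveErrorLimit >= 10) return "rollback";
  // p95 more than 20% above baseline: hold the rollout and analyze.
  if (s.p95Ms > s.baselineP95Ms * 1.2) return "hold";
  return "proceed";
}

const healthySnapshot: GateSnapshot = {
  error5xxRate: 0.001,
  minutesAboveErrorLimit: 0,
  p95Ms: 210,
  baselineP95Ms: 200,
  pspErrorRate: 0.01,
  baselinePspErrorRate: 0.01,
};

console.log(evaluateGate(healthySnapshot));                                                     // "proceed"
console.log(evaluateGate({ ...healthySnapshot, p95Ms: 260 }));                                  // "hold"
console.log(evaluateGate({ ...healthySnapshot, error5xxRate: 0.02, minutesAboveErrorLimit: 12 })); // "rollback"
```

Keeping the rules in one pure function makes them easy to unit-test and to reuse in both alerts and the rollout controller.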
6) Examples: Kubernetes/Helm/Argo/NGINX
6.1 Blue-Green (K8s Service selector)
yaml
# Service points to "blue"; to switch to green, change the selector
apiVersion: v1
kind: Service
metadata: { name: app-svc }
spec:
  selector: { app: app, version: blue }
  ports: [{ port: 80, targetPort: 8080 }]
Rollback = return the selector to 'blue' (atomic, no rebuild).
6.2 Canary (Istio VirtualService weights)
yaml
http:
  - route:
      - destination: { host: app, subset: stable }
        weight: 100
      - destination: { host: app, subset: canary }
        weight: 0
Rollback = canary weight → 0, stable → 100.
6.3 Argo Rollouts: abort
bash
kubectl argo rollouts abort app # stop the rollout and return traffic to the stable version
6.4 Helm - rollback to revision
bash
helm history app -n prod
helm rollback app 17 -n prod # revert to revision 17
6.5 NGINX - upstream weight
nginx
upstream app {
    server blue:8080 weight=100;
    server green:8080 down; # rollback: all traffic to blue (NGINX rejects weight=0, so the server is marked down)
}
7) Feature flags and kill switch as a "parachute"
Kill switch for high-risk flows (deposits/payments/bonuses) - mandatory.
Stickiness: assign each user a "variant" via a hash of a stable key, so comparisons are predictable.
Fail-safe: if the flag server is unavailable, fall back to a safe default.
ts
const enabled = flags.bool("new_cashout_flow", false);
if (!enabled) return classicFlow(); // instant rollback: turn the flag off
return newFlow();
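Stickiness and the fail-safe default can be sketched as follows. The FNV-1a hash is simply one convenient deterministic hash chosen for this illustration, not something the text prescribes; `bucketOf`, `isEnabled`, and `PercentageSource` are hypothetical names.

```typescript
// FNV-1a (32-bit) over UTF-16 code units: deterministic, so a user always lands in the same bucket.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Bucket in [0, 100): the user is in the rollout when the bucket is below the rollout percentage.
function bucketOf(flagName: string, userId: string): number {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// Hypothetical flag source: returns the rollout percentage, or throws when the flag server is down.
type PercentageSource = (flagName: string) => number;

function isEnabled(
  flagName: string,
  userId: string,
  source: PercentageSource,
  safeDefault: boolean,
): boolean {
  try {
    return bucketOf(flagName, userId) < source(flagName);
  } catch (_err) {
    return safeDefault; // fail-safe: flag server unavailable
  }
}

const fullRollout: PercentageSource = () => 100;
const killSwitchOn: PercentageSource = () => 0; // kill switch = rollout percentage forced to 0
const brokenSource: PercentageSource = () => {
  throw new Error("flag server unavailable");
};

console.log(isEnabled("new_cashout_flow", "user123", fullRollout, false));  // true
console.log(isEnabled("new_cashout_flow", "user123", killSwitchOn, false)); // false
console.log(isEnabled("new_cashout_flow", "user123", brokenSource, false)); // false (safe default)
```

Hashing the flag name together with the user ID keeps assignments independent across flags, so the same users are not always the first to see every experiment.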
8) API and Event Contracts: How Not to "Break the Rollback"
Contract versioning (OpenAPI/gRPC/Avro): a new version adds fields and does not change the semantics of existing ones.
Event versioning: 'type = v2'; consumers are required to ignore unknown fields.
Outbox + idempotency: any repeated delivery of an event is safe because the consumer is idempotent.
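A consumer that follows both rules (ignore unknown fields, tolerate repeated delivery) might look like the sketch below. The event shape `DepositEventV1`, the `currency` field, and the class name are hypothetical; a production consumer would keep processed IDs in durable storage rather than in memory.

```typescript
// Hypothetical event contract: only the fields this consumer relies on are declared.
interface DepositEventV1 {
  eventId: string;
  type: string;
  amount: number;
}

class DepositConsumer {
  private processed = new Set<string>();
  private balance = 0;

  // Takes raw JSON so that fields added by newer producers are simply ignored.
  handle(rawJson: string): boolean {
    const parsed = JSON.parse(rawJson) as Partial<DepositEventV1>;
    if (typeof parsed.eventId !== "string" || typeof parsed.amount !== "number") {
      throw new Error("event is missing required fields");
    }
    if (this.processed.has(parsed.eventId)) {
      return false; // duplicate delivery: already applied, safe to skip
    }
    this.balance += parsed.amount;
    this.processed.add(parsed.eventId);
    return true;
  }

  total(): number {
    return this.balance;
  }
}

// A v2 producer added "currency"; the v1 consumer still processes the event.
const v2Event = JSON.stringify({ eventId: "evt-1", type: "v2", amount: 50, currency: "EUR" });

const depositConsumer = new DepositConsumer();
const firstDelivery = depositConsumer.handle(v2Event);  // applied
const secondDelivery = depositConsumer.handle(v2Event); // skipped as a duplicate
console.log(firstDelivery, secondDelivery, depositConsumer.total()); // true false 50
```

With this behavior, rolling the producer back from v2 to v1 (or replaying the outbox after a rollback) does not corrupt consumer state.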
9) Compensating transactions (Saga)
When a "hard" rollback of state is impossible (money has left, a message has been sent), use compensation:
- Posted debit → compensation: refund, reversal, or correcting entry.
- Record compensating operations in a journal and retry them until they succeed.
- Idempotency keys for each operation.
json
{
  "sagaId": "b7d2...",
  "action": "withdraw.execute",
  "idempotencyKey": "user123:withdraw:7845",
  "compensation": "withdraw.refund"
}
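A minimal, synchronous sketch of the compensation flow: steps run in order, and on failure the already executed steps are compensated in reverse. The `Ledger`, `SagaStep`, and `runSaga` names, the `saga-1` identifier, and the key format are all assumptions for illustration; retries of failed compensations are omitted for brevity.

```typescript
// Hypothetical ledger that applies each idempotency key at most once.
class Ledger {
  private applied = new Set<string>();
  balance = 100;

  apply(idempotencyKey: string, delta: number): void {
    if (this.applied.has(idempotencyKey)) return; // repeated request: no-op
    this.applied.add(idempotencyKey);
    this.balance += delta;
  }
}

interface SagaStep {
  action: string;
  execute: (idempotencyKey: string) => void;
  compensate: (idempotencyKey: string) => void;
}

type JournalState = "executed" | "compensated";

// Runs steps in order; on failure, compensates executed steps in reverse order.
function runSaga(sagaId: string, steps: SagaStep[], journal: Map<string, JournalState>): boolean {
  const done: { step: SagaStep; key: string }[] = [];
  for (const step of steps) {
    const key = `${sagaId}:${step.action}`;
    try {
      step.execute(key);
      journal.set(key, "executed");
      done.push({ step, key });
    } catch (_err) {
      for (const entry of done.slice().reverse()) {
        entry.step.compensate(`${entry.key}:compensation`);
        journal.set(entry.key, "compensated");
      }
      return false;
    }
  }
  return true;
}

const ledger = new Ledger();
const sagaJournal = new Map<string, JournalState>();
const sagaSteps: SagaStep[] = [
  {
    action: "withdraw.execute",
    execute: (key) => ledger.apply(key, -30),
    compensate: (key) => ledger.apply(key, 30), // plays the role of withdraw.refund
  },
  {
    action: "payout.send",
    execute: () => {
      throw new Error("payment provider rejected the payout");
    },
    compensate: () => {},
  },
];

const sagaSucceeded = runSaga("saga-1", sagaSteps, sagaJournal);
console.log(sagaSucceeded, ledger.balance, sagaJournal.get("saga-1:withdraw.execute"));
// false 100 compensated
```

The debit is never deleted; the refund is a second, separately keyed operation, which is what keeps the journal auditable.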
10) Configs and secrets: rollback as a version
Store configs as versioned artifacts (semver/commit SHA).
Rollback = revert the config to the previous version (GitOps revert), not "fix it by hand".
Secrets: via a secret store (KMS/Vault); rotation and versioning are part of the release.
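The idea of "rollback as a version" can be sketched with an append-only history in which a revert is itself a new revision, similar in spirit to a revert commit in a GitOps repository. `ConfigHistory` and the `timeoutMs` key are hypothetical.

```typescript
// Hypothetical append-only config history: every change, including a rollback, is a new revision.
interface ConfigRevision {
  revision: number;
  values: Readonly<Record<string, string>>;
  note: string;
}

class ConfigHistory {
  private revisions: ConfigRevision[] = [];

  commit(values: Record<string, string>, note: string): ConfigRevision {
    const rev: ConfigRevision = {
      revision: this.revisions.length + 1,
      values: { ...values },
      note,
    };
    this.revisions.push(rev);
    return rev;
  }

  head(): ConfigRevision {
    const last = this.revisions[this.revisions.length - 1];
    if (last === undefined) throw new Error("no revisions yet");
    return last;
  }

  // Rollback = a new revision whose content equals an earlier one; nothing is edited in place.
  revertTo(revision: number): ConfigRevision {
    const target = this.revisions.find((r) => r.revision === revision);
    if (target === undefined) throw new Error(`unknown revision ${revision}`);
    return this.commit({ ...target.values }, `revert to revision ${revision}`);
  }
}

const configHistory = new ConfigHistory();
configHistory.commit({ timeoutMs: "500" }, "initial");
configHistory.commit({ timeoutMs: "50" }, "lower timeout");
configHistory.revertTo(1);
console.log(configHistory.head()); // revision 3 with timeoutMs "500"
```

Because the bad revision stays in the history, the audit trail shows both what went wrong and when it was reverted.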
11) Rollback runbook (minimum)
1. Pause progression (canary/rollouts).
2. Return traffic (weights/selector).
3. Verify that SLO/business metrics are back to baseline.
4. Stabilize background jobs (stop migrations/back-fill if necessary).
5. Incident record and postmortem: artifacts (logs/traces/metrics), hypotheses, fixes.
6. Cleanup: turn off flags, remove leftover code, restore job schedules.
12) Automatic protection policies
Prohibit 'latest' and other mutable image tags.
Admission control: signed artifacts only.
CI gate: SAST/SCA/policy checks must be green for promotion.
Freeze windows: no releases or weight increases above X% during high-risk periods.
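The first policy can be approximated by a simple check on the image reference, shown here as a sketch. The denylist in `MUTABLE_TAGS` is an assumption (any tag can technically be re-pushed, so pinning by digest is the only strict guarantee), and signature verification is out of scope for this example.

```typescript
interface ImagePolicyResult {
  allowed: boolean;
  reason: string;
}

// Hypothetical denylist of tag names that are conventionally moved between builds.
const MUTABLE_TAGS = ["latest", "stable", "main", "master", "dev"];

function checkImageRef(ref: string): ImagePolicyResult {
  // A digest pins the exact image content.
  if (/@sha256:[0-9a-f]{64}$/.test(ref)) {
    return { allowed: true, reason: "pinned by digest" };
  }
  // Look for the tag only in the last path segment, so a registry port is not mistaken for a tag.
  const lastSegment = ref.slice(ref.lastIndexOf("/") + 1);
  const colon = lastSegment.indexOf(":");
  if (colon === -1) {
    return { allowed: false, reason: "no tag: resolves to 'latest'" };
  }
  const tag = lastSegment.slice(colon + 1);
  if (MUTABLE_TAGS.indexOf(tag) !== -1) {
    return { allowed: false, reason: `mutable tag '${tag}'` };
  }
  return { allowed: true, reason: `tag '${tag}'` };
}

console.log(checkImageRef("registry.example.com/app:latest"));      // denied: mutable tag
console.log(checkImageRef("registry.example.com:5000/app"));        // denied: no tag
console.log(checkImageRef("registry.example.com:5000/app:v1.9.3")); // allowed
```

A check like this is typically run in CI and repeated at admission time, so an unpinned image is rejected before it can become something you need to roll back.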
13) Frequent anti-patterns
Rolling the database back with "down" DDL migrations instead of keeping schemas compatible - long locks/downtime.
Big-bang migrations applied in place, without idempotency or back-fill.
Mixing "delivery" and "activation" - there is no way to return traffic quickly.
Manual edits to the production config without an audit trail.
No kill switch on payments/withdrawals.
Rebuilding the artifact for prod (violates "build once - run many").
No single rollback button, or the runbook has never been rehearsed.
14) Implementation checklist (0-45 days)
0-10 days
Enable Blue-Green/Canary on key services.
Deny 'latest', enable image signing and Helm/Argo history.
Connect SLO dashboards (latency, 5xx, key business signals).
11-25 days
Implement kill switches for high-risk flows.
Convert database migrations to expand-migrate-contract + idempotency.
Add auto-stop/rollback by SLO (Argo AnalysisTemplate/alerts).
26-45 days
Version configs (GitOps), roll back via MR revert.
Rehearse the runbook on a "game day" (simulated incident and rollback).
Introduce Saga compensation where a direct rollback is not possible.
15) Maturity metrics
Rollback MTTR: target < 5 min.
Share of releases where rollback = route/flag switch (no rebuild): > 90%.
Share of migrations that follow expand-migrate-contract: > 90%.
Kill-switch/flag coverage of services: > 95%.
Number of incidents caused by incompatible schemas/contracts → 0.
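The first two metrics are simple to compute once rollbacks are recorded. The sketch below assumes a hypothetical `ReleaseRecord` shape and measures the switch-only share among releases that were actually rolled back, which is one possible reading of that metric.

```typescript
// Hypothetical release record; field names are illustrative.
interface ReleaseRecord {
  rolledBack: boolean;
  rollbackMinutes?: number;   // time from rollback decision to restored baseline
  rollbackBySwitch?: boolean; // true when rollback was a route/flag switch with no rebuild
}

function rollbackStats(records: ReleaseRecord[]): { meanRollbackMinutes: number; switchShare: number } {
  const rollbacks = records.filter((r) => r.rolledBack);
  if (rollbacks.length === 0) {
    return { meanRollbackMinutes: 0, switchShare: 1 };
  }
  const totalMinutes = rollbacks.reduce((sum, r) => sum + (r.rollbackMinutes ?? 0), 0);
  const bySwitch = rollbacks.filter((r) => r.rollbackBySwitch === true).length;
  return {
    meanRollbackMinutes: totalMinutes / rollbacks.length,
    switchShare: bySwitch / rollbacks.length,
  };
}

const releaseLog: ReleaseRecord[] = [
  { rolledBack: false },
  { rolledBack: true, rollbackMinutes: 3, rollbackBySwitch: true },
  { rolledBack: true, rollbackMinutes: 4, rollbackBySwitch: true },
  { rolledBack: true, rollbackMinutes: 8, rollbackBySwitch: false }, // needed a rebuild
];

const stats = rollbackStats(releaseLog);
console.log(stats.meanRollbackMinutes); // 5
console.log(stats.switchShare);         // 2 of 3 rollbacks were a pure switch
```

In this sample the mean sits right at the 5-minute target, and the single rebuild-based rollback is what pulls both numbers away from the goals.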
16) Appendix: mini-templates
Argo AnalysisTemplate: stop on 5xx
yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: guard-5xx }
spec:
  metrics:
    - name: http_5xx_rate
      interval: 1m
      successCondition: result[0] < 0.005 # the Prometheus query returns a vector, so read its first element
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="app",status=~"5.."}[5m])) /
            sum(rate(http_requests_total{app="app"}[5m]))
Kubernetes: quick rollback of a Deployment
bash
kubectl rollout undo deploy/app -n prod
Helm: atomic release
bash
helm upgrade --install app chart/ \
  --atomic \
  --timeout 5m \
  --set image.tag=v1.9.3
NGINX: canary switch
nginx
map $cookie_canary $weight {
    default 0;
    "~on" 10; # enable 10% by cookie
}
17) Conclusion
Reliable rollback is not a "panic button" but a property of the architecture: immutable artifacts, separation of delivery and activation, a compatible data schema, feature flags, and SLO gating. Build releases to be atomic, rehearse the runbook, and automate safety gates; then any release can be reversed in minutes, without pain for the business or its users.