Change Management
1) Purpose and principles
The goal is to deliver change quickly and safely, reducing the risk of incidents, downtime, and regulatory violations.
Principles:
- Predictable & Reversible: Each change is planned, verifiable, and reversible.
- Risk-based: The depth of control depends on the risk (jurisdictions, money, PII).
- Small & Frequent: Small increments are easier to evaluate and roll back.
- Automation first: infrastructure as code, tests, validations, auto-checks.
- Single Source of Truth: a single RFC/ticket, a single calendar and a log of actions.
2) Scope
Product code (backend/frontend, mobile SDK).
Infrastructure (IaC, Kubernetes/VM/CDN/Edge).
Data (DB schemas, migrations, data marts/ETL).
Configurations and feature flags.
Integrations (PSP, KYC, game providers).
Security and access policies.
3) Roles and RACI
Change Owner - Responsible.
Release Curator/RelEng - release train coordination.
SRE/Ops - operations, SLO/SLA gate.
Security/Compliance - Review risk and compliance.
CAB (Change Advisory Board) - approval of normal/high-risk changes.
Business Stakeholders/Support - Informed.
4) Classification of changes
Standard (typical, pre-approved): frequent, low-risk, ready-made playbook (e.g. flag update, key rotation).
Normal: Require RFC, assessment, possible CAB, tests and rollback plan.
Emergency: urgent fixes for P1 incidents; minimal bureaucratic path, after-the-fact review/CAB.
5) Change lifecycle
1. Trigger (RFC): objective, scope, risk, affected services/regions, backout plan.
2. Risk assessment: Impact × Likelihood matrix, impact on SLO/compliance/value.
3. Planning: window, dependencies, migrations, communications, validation tests.
4. Validation: autotests, static analysis, security check, performance run.
5. Deployment: progressive strategy (see § 8), telemetry and guardrails.
6. Observation: SLO burn rate, alerts, business metrics (GGR/NGR, conversion).
7. Completion: result acceptance, documentation update, post-mortem for deviations.
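The Impact × Likelihood assessment from step 2 can be turned into a score that suggests a change class. The following is a minimal sketch: the three-level scale, the thresholds (2 and 6) and the names `Level`, `risk_score` and `change_class` are illustrative assumptions, not values prescribed by this process.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def risk_score(impact: Level, likelihood: Level) -> int:
    """Cell of the Impact x Likelihood matrix as a single number (1..9)."""
    return int(impact) * int(likelihood)


def change_class(impact: Level, likelihood: Level, has_playbook: bool = False) -> str:
    """Suggest a change class from the risk score (thresholds are hypothetical)."""
    score = risk_score(impact, likelihood)
    if score <= 2 and has_playbook:
        return "standard"      # pre-approved, low risk, ready-made playbook
    if score >= 6:
        return "normal+CAB"    # high risk: RFC plus CAB approval
    return "normal"            # RFC, assessment, tests and rollback plan
```

Emergency changes are deliberately left out: they are triggered by a P1 incident rather than by the score.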
6) RFC: minimum composition
Context: why change, influence hypothesis.
Scope: systems, regions, client versions.
Risk: matrix and failure scenarios, blast radius.
Deployment plan: step by step, with go/stop criteria.
Backout plan: commands/steps, start conditions, RTO/RPO expectations.
Test plan: what we check before/after (functionality, performance, security).
Communications: whom we notify, message templates.
Audit: links to tickets, commits, CI/CD artifacts.
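A completeness check over these sections can run automatically before an RFC goes to review. This is a sketch only: the `RFC` dataclass and its field names are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RFC:
    # One field per section of the minimum composition (names are illustrative).
    context: str = ""
    scope: str = ""
    risk: str = ""
    deployment_plan: str = ""
    backout_plan: str = ""
    test_plan: str = ""
    communications: str = ""
    audit_links: list[str] = field(default_factory=list)


def missing_sections(rfc: RFC) -> list[str]:
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(rfc) if not getattr(rfc, f.name)]
```

An RFC is ready for review when `missing_sections(rfc)` returns an empty list; the check says nothing about the quality of what was written.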
7) Change calendar and windows
Single calendar: all releases, migrations, feature shutdowns, external events (sports/marketing/holidays).
Freeze windows: major sales/championships/peak hours, tax reporting.
Interference policy: prevent conflicting changes to the same critical paths.
Regional waves: low-traffic "warm-up" regions first, then the main ones.
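The freeze-window and interference rules reduce to interval-overlap checks against the single calendar. A minimal sketch, assuming each calendar entry carries a time range and a set of critical paths it touches (the `Window` type and `calendar_conflicts` function are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    name: str
    start: datetime
    end: datetime
    critical_paths: frozenset[str] = frozenset()


def overlaps(a: Window, b: Window) -> bool:
    """Half-open interval overlap: [start, end)."""
    return a.start < b.end and b.start < a.end


def calendar_conflicts(change: Window, freezes: list[Window], scheduled: list[Window]) -> list[str]:
    """List reasons why the change cannot take its requested window."""
    problems = [f"freeze: {f.name}" for f in freezes if overlaps(change, f)]
    for other in scheduled:
        # Interference only matters when both changes touch the same critical path.
        if overlaps(change, other) and change.critical_paths & other.critical_paths:
            problems.append(f"conflict: {other.name}")
    return problems
```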
8) Technical deployment strategies
Canary: small share of traffic → comparison of metrics (p95 latency, error%, conversion).
Blue-Green: parallel environments, atomic route switching.
Progressive Delivery: Percentage rollout with automatic stop conditions.
Feature Flags: function switches, kill-switch, A/B.
Dark Launch/Shadow Traffic: validating on mirrored traffic without affecting users.
Step limits: gradual increase in QPS/concurrency.
Guardrails: automatic stop when p95/error% thresholds are exceeded, refunds/chargebacks increase, or authorizations/deposits fall.
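A percentage rollout with automatic stop conditions can be sketched as a loop over traffic steps that compares canary metrics with the baseline. The `Metrics` and `Guardrails` types, the `read_metrics` callback and the default thresholds are assumptions for illustration; a real pipeline would read them from telemetry and apply traffic shifts through its own routing layer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Metrics:
    p95_ms: float
    error_pct: float


@dataclass
class Guardrails:
    max_p95_increase_pct: float = 25.0  # allowed p95 growth vs. baseline
    max_error_pct: float = 2.0          # absolute error-rate ceiling

    def violations(self, baseline: Metrics, canary: Metrics) -> list[str]:
        reasons = []
        if canary.p95_ms > baseline.p95_ms * (1 + self.max_p95_increase_pct / 100):
            reasons.append("p95")
        if canary.error_pct > self.max_error_pct:
            reasons.append("error%")
        return reasons


def progressive_rollout(
    steps: list[int],
    read_metrics: Callable[[int], tuple[Metrics, Metrics]],
    guardrails: Guardrails,
) -> tuple[int, list[str]]:
    """Walk through traffic percentages; stop at the first guardrail violation.

    Returns the last healthy percentage and the reasons for stopping
    (an empty list means the rollout completed).
    """
    healthy_pct = 0
    for pct in steps:
        baseline, canary = read_metrics(pct)  # observe after shifting to pct
        reasons = guardrails.violations(baseline, canary)
        if reasons:
            return healthy_pct, reasons       # caller triggers the rollback
        healthy_pct = pct
    return healthy_pct, []
```

Business signals such as refunds, chargebacks or deposit rates would be added as further fields and checks in the same way.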
9) Data and schema changes
Compatibility: additive migrations → code that reads both the old and the new schema.
Phased migrations (expand → switch → contract): (1) add new fields/indexes; (2) switch the code; (3) delete the old fields.
Contract versioning: Avro/Protobuf schemas in a registry; backward/forward compatible.
Large-volume migrations: batches, pauses, idempotency, checkpoints and progress.
Disaster tolerance: RPO/RTO test, snapshots, recovery rehearsals.
BI data: changes to data marts/metrics go through MR/SR and the metrics dictionary (ID, formula).
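The large-volume migration pattern (batches, pauses, idempotency, checkpoints) can be sketched with SQLite from the Python standard library. The table `psp_limits` with columns `limit_major` and `limit_minor`, the checkpoint table and the function name are hypothetical; the backfill itself stands for the "expand" phase, where a new column is filled while the old one remains readable.

```python
import sqlite3
import time


def backfill_limits(conn: sqlite3.Connection, batch_size: int = 1000, pause_s: float = 0.0) -> int:
    """Fill limit_minor from limit_major in batches; safe to re-run after a crash."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS migration_checkpoint "
        "(name TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
    )
    row = conn.execute(
        "SELECT last_id FROM migration_checkpoint WHERE name = 'limits_backfill'"
    ).fetchone()
    last_id = row[0] if row else 0
    processed = 0
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM psp_limits WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )]
        if not ids:
            return processed
        with conn:  # one transaction per batch: data and checkpoint commit together
            conn.execute(
                "UPDATE psp_limits SET limit_minor = limit_major * 100 "
                "WHERE id BETWEEN ? AND ? AND limit_minor IS NULL",  # idempotent
                (ids[0], ids[-1]),
            )
            conn.execute(
                "INSERT OR REPLACE INTO migration_checkpoint VALUES ('limits_backfill', ?)",
                (ids[-1],),
            )
        last_id = ids[-1]
        processed += len(ids)
        time.sleep(pause_s)  # pause between batches to limit load on the database
```

Because the checkpoint is written in the same transaction as the batch, a restart resumes from the last committed batch instead of starting over.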
10) Configuration and secret management
Config as Data: versioned configs, schema validation, promotion across environments.
Secrets: key rotation, least privilege, access auditing.
Regional overrides: limits/partners (PSP/KYC) - through parameterization, not through forks of code.
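Regional parameterization plus schema validation can be as simple as merging an override map over a base config and checking the result before promotion. The keys in `SCHEMA`, the region names and `resolve_config` are illustrative assumptions; a production setup would more likely use a dedicated schema format.

```python
from typing import Any

# Hypothetical schema: expected keys and their types.
SCHEMA: dict[str, type] = {
    "psp_timeout_ms": int,
    "deposit_limit": int,
    "kyc_provider": str,
}


def resolve_config(
    base: dict[str, Any],
    overrides: dict[str, dict[str, Any]],
    region: str,
) -> dict[str, Any]:
    """Apply the region's overrides to the base config and validate the result."""
    merged = {**base, **overrides.get(region, {})}
    unknown = set(merged) - set(SCHEMA)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    for key, expected in SCHEMA.items():
        if key not in merged:
            raise ValueError(f"missing key: {key}")
        if not isinstance(merged[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return merged
```

The same code path serves every region; only the data differs, which is the point of avoiding code forks.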
11) Compliance and audit (iGaming context)
Traces of changes: who/when/what switched (flags, configs, routes, migrations).
Segregation of Duties: different people act as author, reviewer and deployer (SOX-like).
Regulatory reporting: recorded release versions, versioned calculations (GGR/NGR, bonuses), control of access to PII.
Providers: pinned SDK versions and provider certificates, SLA obligations.
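Segregation of duties can be enforced as a pipeline gate over the change audit trail. A minimal sketch, assuming the pipeline knows the author, the set of reviewers and the deployer of each change (the function name and the exact rule are assumptions):

```python
def sod_violations(author: str, reviewers: set[str], deployer: str) -> list[str]:
    """Check that author, reviewer and deployer are different people."""
    violations = []
    if deployer == author:
        violations.append("author deployed own change")
    # At least one reviewer must be neither the author nor the deployer.
    independent = reviewers - {author, deployer}
    if not independent:
        violations.append("no independent reviewer")
    return violations
```

A gate would block the deployment whenever the returned list is non-empty and record the result alongside the change log entry.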
12) Communications
Notification templates: before release (what/when/risks), during (status, % of traffic, metrics), after (results).
External messages: banners/status page when affecting customers.
Coordination: #release-war-room channel, release owner, update frequency.
13) Performance metrics
DORA: Deployment Frequency, Lead Time for Changes, Change Failure Rate (CFR), MTTR.
SLO Impact: Share of time in SLO before/after releases.
Backout Rate - The frequency of rollbacks by change category.
Release Debt: pending migrations/feature flags in limbo.
Business Impact: conversion, KYC TTV, PSP success rate, GGR/NGR drift during rollout.
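The four DORA metrics can be computed from a plain list of deployment records. This is a sketch under assumptions: the `Deployment` record, the choice of median for lead time and mean for MTTR, and the output units are illustrative, and MTTR is measured here from deployment to restoration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set when a failed change was recovered


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_minutes = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_median_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_minutes": mean(restore_minutes) if restore_minutes else None,
    }
```

Grouping the input by change category before calling the function gives the Backout Rate style breakdown per class of change.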
14) Anti-patterns
Big-bang releases: Lots of changes at a time - it's hard to understand the cause of regression.
Incompatible migrations: deleting/renaming fields without double reading.
Flags without owners and deadlines for removal: "eternal" branches of logic.
Releases without telemetry and stop criteria: "by eye" and late detection of damage.
Ignoring the calendar: overlaps with peak events/campaigns.
Manual steps without playbooks and auditing: high variability and risk.
15) Checklists
Before Start (RFC Ready)
- Change objective and KPIs are formulated
- Risk and blast radius assessed, change class selected
- Deployment plan and Backout are written step-by-step
- There is a test plan and results from staging/canary
- Communications and calendar updated, stakeholders notified
During rollout
- p95/error% metrics, business signals and logs are monitored in real time
- Progress steps are confirmed at checkpoints
- When guardrails trigger: auto-stop and rollback
After
- Release results recorded (changelog, versions, artifacts)
- Post-mortem for deviations (≤ 5 working days)
- Debts (flag deletion, final migrations) are logged with owners
16) Mini templates
RFC Template (Short):
- Objective/hypothesis
- Scope and impact (services, regions, data, customers)
- Impact × Likelihood and mitigation measures
- Rollout plan (steps, % of traffic, go/no-go criteria)
- Backout plan (steps, RTO/RPO, data)
- Test plan (functional/performance/security)
- Communications (channels, frequency)
- Artifacts (tickets, PR, build numbers)
Release announcement (example):
- Change: "Payments-Service v2.14 + psp_limits migration"
- Window: 2025-11-02 00:00-01:00 EET
- Affected regions: EU, LATAM (10% → 50% → 100%)
- Risks/guardrails: error% > 2% for 10 min - stop and rollback
- Contact: @Owner, @SRE-on-call, @Support-lead
Backout plan (example):
- Triggers: p95 > +25% for 10 min, PSP success < 97%
- Steps: (1) traffic → 0% on v2.14; (2) switch flags to v2.13; (3) migration rollback via snapshot/checkpoint; (4) smoke tests; (5) report.
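Triggers of the form "threshold exceeded for 10 min" need a sustained-breach check rather than a single-sample comparison, so that a brief spike does not cause a rollback. A minimal sketch, assuming one sample per minute; the function names are hypothetical, and treating the PSP success trigger as immediate (a single sample below 97%) is an assumption, since the template gives no duration for it.

```python
from typing import Callable, Iterable


def sustained_breach(samples: Iterable[float], breached: Callable[[float], bool], window: int) -> bool:
    """True if `breached` holds for at least `window` consecutive samples."""
    run = 0
    for value in samples:
        run = run + 1 if breached(value) else 0
        if run >= window:
            return True
    return False


def should_back_out(
    p95_ratio_per_min: list[float],       # canary p95 / baseline p95, one value per minute
    psp_success_pct_per_min: list[float],
    window_min: int = 10,
) -> bool:
    p95_trigger = sustained_breach(p95_ratio_per_min, lambda r: r > 1.25, window_min)
    psp_trigger = sustained_breach(psp_success_pct_per_min, lambda s: s < 97.0, 1)
    return p95_trigger or psp_trigger
```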
17) Integration with the release train
Release Train: fixed slots (e.g. 2× per week), SLA on the merge cutoff.
Hotfix policy: separate trains/branches, fast track to prod.
Versioning: semver, labels in artifacts and environments, SBOM.
18) The bottom line
Change management is not a brake on speed, but a mechanism for safe acceleration. Risk-based classification, good RFCs, progressive rollout, compatible data migrations, clear communications and measurable impact turn releases into a manageable, repeatable and auditable process.