Change Management

1) Purpose and principles

The goal is to deliver change quickly and safely, reducing the risk of incidents, downtime, and regulatory violations.

Principles:

Predictable & Reversible: Each change is planned, verifiable, and reversible.
Risk-based: The depth of control depends on the risk (jurisdictions, money, PII).
Small & Frequent: Small increments are easier to evaluate and roll back.
Automation first: infrastructure as code, tests, validations, auto-checks.
Single Source of Truth: a single RFC/ticket, a single calendar and a log of actions.

2) Scope

Product code (backend/frontend, mobile SDK).
Infrastructure (IaC, Kubernetes/VM/CDN/Edge).
Data (DB schemas, migrations, data marts/ETL).
Configurations and feature flags.
Integrations (PSP, KYC, game providers).
Security and access policies.

3) Roles and RACI

Change Owner - Responsible.
Release Manager/RelEng - release train coordination.
SRE/Ops - operations, SLO/SLA gate.
Security/Compliance - Review risk and compliance.
CAB (Change Advisory Board) - approval of normal/high-risk changes.
Business Stakeholders/Support - Informed.

4) Classification of changes

Standard (typical, pre-approved): frequent, low-risk, ready-made playbook (e.g. flag update, key rotation).
Normal: Require RFC, assessment, possible CAB, tests and rollback plan.
Emergency: urgent fixes for P1 incidents; minimal bureaucratic path, post-factum review by the CAB.

5) Change lifecycle

1. Trigger (RFC): objective, scope, risk, affected services/regions, backout plan.
2. Risk assessment: Impact × Likelihood matrix, impact on SLO/compliance/value.
3. Planning: window, dependencies, migrations, communications, validation tests.
4. Validation: autotests, static analysis, security check, performance run.
5. Deployment: progressive strategy (see § 8), telemetry and guardrails.
6. Observation: SLO burn rate, alerts, business metrics (GGR/NGR, conversion).
7. Completion: result acceptance, documentation update, post-mortem for deviations.
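
The risk assessment step can be sketched as a small scoring function. This is a minimal illustration, assuming a 3×3 matrix; the `Level` scale, the score thresholds and the control names are hypothetical and would need to be calibrated per organization.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def risk_score(impact: Level, likelihood: Level) -> int:
    """Impact × Likelihood on a 3×3 matrix, giving a score from 1 to 9."""
    return int(impact) * int(likelihood)


def required_controls(impact: Level, likelihood: Level,
                      touches_money_or_pii: bool = False) -> str:
    """Map a risk score to a depth of control (thresholds are assumptions)."""
    score = risk_score(impact, likelihood)
    if touches_money_or_pii:
        # Assumption: money/PII changes always take the strictest path.
        score = max(score, 6)
    if score >= 6:
        return "RFC + CAB approval"
    if score >= 3:
        return "RFC + peer review"
    return "standard playbook"
```

For example, `required_controls(Level.HIGH, Level.MEDIUM)` lands on the CAB path, while a low-impact, low-likelihood flag update stays on the standard playbook.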

6) RFC: minimum composition

Context: why change, influence hypothesis.
Scope: systems, regions, client versions.
Risk: matrix and failure scenarios, blast radius.
Deployment plan: step by step, with go/stop criteria.
Backout plan: commands/steps, start conditions, RTO/RPO expectations.
Test plan: what we check before/after (functionality, performance, security).
Communications: whom we notify, message templates.
Audit: links to tickets, commits, CI/CD artifacts.
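
The minimum composition above lends itself to an automated completeness check before an RFC enters review. A minimal sketch, assuming the RFC is available as structured data; the `RFC` class and its field names are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RFC:
    context: str = ""
    scope: str = ""
    risk: str = ""
    deployment_plan: str = ""
    backout_plan: str = ""
    test_plan: str = ""
    communications: str = ""
    audit_links: list[str] = field(default_factory=list)


def missing_sections(rfc: RFC) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    missing = []
    for f in fields(rfc):
        value = getattr(rfc, f.name)
        if isinstance(value, str):
            value = value.strip()
        if not value:
            missing.append(f.name)
    return missing
```

A CI job or ticket workflow could refuse to move the RFC forward while `missing_sections` returns a non-empty list.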

7) Change calendar and windows

Single calendar: all releases, migrations, feature shutdowns, external events (sports/marketing/holidays).
Freeze windows: major sales/championships/peak hours, tax reporting.
Conflict policy: prevent conflicting changes to the same critical paths.
Regional waves: first low-traffic "warm-up" regions, then the main ones.
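
Freeze windows and the conflict policy can be enforced mechanically when a change is scheduled. The sketch below is an assumption about how calendar entries might be modeled; `Window`, `paths` and `blockers` are hypothetical names, and all timestamps are assumed to be in one time zone.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    name: str
    start: datetime
    end: datetime
    paths: frozenset[str] = frozenset()  # critical paths the change touches


def overlaps(a: Window, b: Window) -> bool:
    """Half-open interval overlap: [start, end)."""
    return a.start < b.end and b.start < a.end


def blockers(change: Window, freezes: list[Window], scheduled: list[Window]) -> list[str]:
    """List reasons why the change cannot be scheduled in its window."""
    reasons = [f"freeze: {f.name}" for f in freezes if overlaps(change, f)]
    for other in scheduled:
        # Conflict only if the windows overlap AND they share a critical path.
        if overlaps(change, other) and change.paths & other.paths:
            reasons.append(f"conflict: {other.name}")
    return reasons
```

An empty result means the slot is free; anything else is shown to the change owner before the entry is accepted into the calendar.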

8) Technical deployment strategies

Canary: small share of traffic → comparison of metrics (p95 latency, error%, conversion).
Blue-Green: parallel environments, atomic route switching.
Progressive Delivery: Percentage rollout with automatic stop conditions.
Feature Flags: feature toggles, kill switch, A/B.
Dark Launch/Shadow Traffic: testing on shadow traffic without affecting users.
Step limits: gradual increase in QPS/concurrency.

Guardrails: automatic stop when p95/error% thresholds are exceeded, refunds/chargebacks rise, or authorizations/deposits fall.
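
A guardrail check for a canary or percentage rollout can be sketched as a comparison of canary metrics against the baseline, followed by a decision on the next traffic step. This is illustrative only: the `Metrics` fields and function names are hypothetical, the 25% and 2% defaults echo the templates in § 16, and the deposit-drop threshold is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    p95_ms: float
    error_pct: float
    deposit_success_pct: float


def guardrail_violations(baseline: Metrics, canary: Metrics,
                         max_p95_increase_pct: float = 25.0,
                         max_error_pct: float = 2.0,
                         max_deposit_drop_pp: float = 3.0) -> list[str]:
    """Return the guardrails the canary currently violates."""
    violations = []
    if canary.p95_ms > baseline.p95_ms * (1 + max_p95_increase_pct / 100):
        violations.append("p95 latency")
    if canary.error_pct > max_error_pct:
        violations.append("error rate")
    if baseline.deposit_success_pct - canary.deposit_success_pct > max_deposit_drop_pp:
        violations.append("deposit success")
    return violations


def next_traffic_pct(current_pct: int, violations: list[str],
                     steps: tuple[int, ...] = (10, 50, 100)) -> int:
    """Auto-stop (0%) on any violation, otherwise advance to the next step."""
    if violations:
        return 0
    return next((s for s in steps if s > current_pct), current_pct)
```

In practice the comparison would run on a schedule during each rollout step, and a result of 0 would trigger the backout plan rather than a silent traffic change.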

9) Data and schema changes

Compatibility: additive migrations → code that reads both the old and the new schema.
Phased (expand/contract) migrations: (1) add new fields/indexes → (2) switch code → (3) delete old.
Contract versioning: Avro/Protobuf schemas with a schema registry; backward/forward compatible.
Large-volume migrations: batches, pauses, idempotency, checkpoints and progress tracking.
Disaster recovery: RPO/RTO tests, snapshots, recovery rehearsals.
BI data: changes to data marts/metrics go through MR/SR and the metrics dictionary (ID, formula).
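
A large-volume backfill with batches, pauses, idempotency and checkpoints might look like the sketch below. It uses SQLite from the Python standard library purely so the example is runnable; the `psp_limits` columns (`limit_old`, `limit_new`), the `migration_checkpoint` table and the `JOB` key are hypothetical.

```python
import sqlite3
import time

JOB = "psp_limits_backfill"  # hypothetical checkpoint key


def backfill(conn: sqlite3.Connection, batch_size: int = 1000, pause_s: float = 0.0) -> int:
    """Copy limit_old into limit_new in id-ordered batches; returns the last processed id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS migration_checkpoint "
        "(job TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
    )
    row = conn.execute(
        "SELECT last_id FROM migration_checkpoint WHERE job = ?", (JOB,)
    ).fetchone()
    last_id = row[0] if row else 0  # resume from the checkpoint if one exists

    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM psp_limits WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )]
        if not ids:
            return last_id
        with conn:  # one transaction per batch: data and checkpoint commit together
            conn.execute(
                "UPDATE psp_limits SET limit_new = limit_old "
                "WHERE id BETWEEN ? AND ? AND limit_new IS NULL",  # safe to re-run
                (ids[0], ids[-1]),
            )
            conn.execute(
                "INSERT OR REPLACE INTO migration_checkpoint (job, last_id) VALUES (?, ?)",
                (JOB, ids[-1]),
            )
        last_id = ids[-1]
        time.sleep(pause_s)  # pause between batches to limit load on the database
```

Because each batch commits together with its checkpoint and the update skips rows that are already filled, an interrupted run can simply be started again.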

10) Configuration and secret management

Config as Data: versioned configs, schema validation, promotion across environments.
Secrets: key rotation, least privilege, access auditing.
Regional overrides: limits/partners (PSP/KYC) are handled through parameterization, not code forks.
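
Schema validation combined with regional overrides can be sketched as a merge followed by a strict type check. The schema keys (`psp`, `deposit_limit`, `kyc_required`) and the function name are hypothetical; a real setup would more likely use a dedicated schema language.

```python
from typing import Any

# Hypothetical schema: key -> required type.
SCHEMA: dict[str, type] = {"psp": str, "deposit_limit": int, "kyc_required": bool}


def resolve_config(base: dict[str, Any],
                   overrides: dict[str, dict[str, Any]],
                   region: str) -> dict[str, Any]:
    """Apply the region's overrides to the base config and validate the result."""
    merged = {**base, **overrides.get(region, {})}
    errors = [f"unknown key: {k}" for k in merged if k not in SCHEMA]
    for key, expected in SCHEMA.items():
        if key not in merged:
            errors.append(f"missing key: {key}")
        elif type(merged[key]) is not expected:  # strict: bool is not accepted as int
            errors.append(f"{key}: expected {expected.__name__}")
    if errors:
        raise ValueError(f"invalid config for {region}: {errors}")
    return merged
```

Validating the merged result rather than the override alone catches cases where a regional parameter is individually plausible but breaks the final configuration.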

11) Compliance and audit (iGaming context)

Change trail: who/when/what was switched (flags, configs, routes, migrations).
Segregation of Duties: different roles for author, reviewer and deployer (SOX-like).
Regulatory reports: pinned releases, version control of calculations (GGR/NGR, bonuses), control of access to PII.
Providers: pinned versions of SDKs/provider certificates, SLA obligations.
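
Segregation of duties can be verified automatically from the change trail. A minimal sketch, assuming each change record carries an author, a set of reviewers and a deployer; the function name and the violation messages are hypothetical.

```python
def sod_violations(author: str, reviewers: set[str], deployer: str) -> list[str]:
    """Check that author, reviewer and deployer are different people."""
    violations = []
    if not reviewers:
        violations.append("no reviewer")
    if author in reviewers:
        violations.append("author reviewed own change")
    if deployer == author:
        violations.append("author deployed own change")
    if deployer in reviewers:
        violations.append("reviewer deployed the change")
    return violations
```

Such a check could run as a deployment gate and write its result to the audit log, so that exceptions (for example, emergency changes) remain visible.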

12) Communications

Notification templates: before release (what/when/risks), during (status, % traffic, metrics), after (results).
External messages: banners/status page when customers are affected.
Coordination: #release-war-room channel, release owner, update frequency.

13) Performance metrics

DORA: Deployment Frequency, Lead Time for Changes, Change Failure Rate (CFR), MTTR.
SLO Impact: Share of time in SLO before/after releases.
Backout Rate: frequency of rollbacks by change category.
Release Debt: pending migrations/feature flags in limbo.
Business Impact: conversion, KYC TTV, PSP success rate, GGR/NGR drift during rollout.
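
The DORA metrics can be computed from a plain deployment log. The sketch below assumes a hypothetical `Deployment` record; it measures lead time from commit to deployment and, as a simplification, restore time from the failed deployment to recovery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Deployment frequency, median lead time, CFR and MTTR for one period."""
    if not deployments:
        raise ValueError("no deployments in the period")
    failures = [d for d in deployments if d.failed]
    restore_times = [hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "lead_time_h_median": median(hours(d.deployed_at - d.committed_at) for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_h": mean(restore_times) if restore_times else 0.0,
    }
```

Computing the same figures per change category (standard, normal, emergency) also yields the backout rate split mentioned above.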

14) Anti-patterns

Big-bang releases: many changes at once make it hard to find the cause of a regression.
Incompatible migrations: deleting/renaming fields without a dual-read period.
Flags without owners and removal deadlines: "eternal" branches of logic.
Releases without telemetry and stop criteria: judging "by eye" and detecting damage late.
Ignoring the calendar: overlaps with peak events/campaigns.
Manual steps without playbooks and auditing: high variability and risk.
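
The flag anti-pattern in particular is easy to detect automatically. A minimal sketch, assuming a flag registry that records an owner and a removal deadline; the `Flag` fields and the issue labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Flag:
    name: str
    owner: Optional[str] = None
    remove_by: Optional[date] = None


def flag_debt(flags: list[Flag], today: date) -> dict[str, list[str]]:
    """Return, per flag name, the reasons it counts as release debt."""
    debt = {}
    for flag in flags:
        issues = []
        if not flag.owner:
            issues.append("no owner")
        if flag.remove_by is None:
            issues.append("no removal deadline")
        elif flag.remove_by < today:
            issues.append("removal overdue")
        if issues:
            debt[flag.name] = issues
    return debt
```

The output maps directly onto the Release Debt metric and onto the "debts are logged with owners" item in the post-release checklist.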

15) Checklists

Before Start (RFC Ready)

Change objective and KPIs are formulated
Risk and blast radius assessed, change class selected
Deployment plan and Backout are written step-by-step
There is a test plan and results from staging/canary
Communications and calendar updated, stakeholders notified

During rollout

p95/error% metrics, business signals and logs are monitored in real time
Rollout steps are confirmed at checkpoints
When guardrails trigger: auto-stop and rollback

After

Release results recorded (changelog, versions, artifacts)
Post-mortem for deviations (≤ 5 working days)
Debts (flag deletion, final migrations) are logged with owners

16) Mini templates

RFC Template (Short):

Objective/hypothesis
Scope and impact (services, regions, data, customers)
Impact × Likelihood and mitigation measures
Rollout plan (steps, % traffic, go/no-go criteria)
Backout plan (steps, RTO/RPO, data)
Test plan (functional/performance/security)
Communications (channels, frequency)
Artifacts (tickets, PR, build numbers)

Calendar entry template:

Change: "Payments-Service v2.14 + psp_limits migration"
Window: 2025-11-02 00:00-01:00 EET
Affected regions: EU, LATAM (10%→50%→100%)
Risks/guardrails: error% > 2% for 10 min - stop and rollback
Contact: @Owner, @SRE-on-call, @Support-lead

Backout template:

Triggers: p95 > +25% for 10 min, PSP success < 97%
Steps: (1) traffic → 0% on v2.14; (2) switch flags to v2.13; (3) migration rollback via snapshot/checkpoint; (4) smoke tests; (5) report.
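
The triggers above combine a sustained condition (p95 above baseline +25% for 10 minutes) with an instantaneous one (PSP success below 97%). One way to evaluate them is sketched below; the function names and the sample format (timestamp in seconds, value) are assumptions.

```python
def sustained_breach(samples: list[tuple[float, float]], threshold: float,
                     window_s: float = 600.0) -> bool:
    """True if the value stays above threshold for at least window_s without a break.

    samples: (timestamp_s, value) pairs in ascending time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window_s:
                return True
        else:
            breach_start = None  # any healthy sample resets the window
    return False


def should_back_out(p95_samples: list[tuple[float, float]], baseline_p95_ms: float,
                    psp_success_pct: float) -> bool:
    """Backout if p95 exceeds baseline by 25% for 10 min, or PSP success drops below 97%."""
    return (sustained_breach(p95_samples, baseline_p95_ms * 1.25)
            or psp_success_pct < 97.0)
```

Requiring the breach to be continuous avoids rolling back on a single latency spike, while the PSP condition reacts immediately because it affects money directly.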

17) Integration with the release train

Release Train: fixed slots (e.g. 2× per week), SLA on merge cutoff.
Hotfix policy: separate trains/branches, fast track to prod.
Versioning: semver, labels in artifacts and environments, SBOM.

18) The bottom line

Change management is not a brake on speed, but a mechanism for safe acceleration. Risk-based classification, good RFCs, progressive rollout, compatible data migrations, clear communications and measurable effects turn releases into a manageable, repeatable and auditable process.
