
Maintenance windows

1) What a maintenance window is and why it is needed

A maintenance window is a pre-agreed time frame for activities that may affect availability or performance. The goal is controlled change with predictable risk, transparent communication, and evidence-based reporting.

Types:
  • Planned: releases, migrations, certificate/key rotations, database/broker upgrades.
  • Emergency: urgent security fixes, incident rollbacks.
  • Silent/Zero-impact: no user impact (hidden canaries, replicas, parallel rollout).
  • Provider-led: windows of external providers (PSP/KYC/CDN/Cloud).

2) Principles

SLO-first: the timing and format of a window are decided by its impact on SLIs and error budgets.
Minimal blast radius: canary → stepwise → full rollout.
Reversibility: each operation has a backout plan and a tested rollback.
Single source of truth: window calendar + ticket/RFC with the full data package.
Evidence: collect proof (logs, graphs, screenshots, artifact hashes).
Communication SLA: notify in advance, during the work, and upon completion.
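
The SLO-first principle can be illustrated with a small error-budget check. This is a minimal sketch under stated assumptions: the function names, the 30-day period, and the rule "a window may spend at most half of the remaining budget" are hypothetical and not part of the process described here.

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Total allowed downtime in minutes for an availability SLO over the period."""
    return (1.0 - slo_target) * period_days * 24 * 60


def window_fits_budget(slo_target: float, consumed_minutes: float,
                       expected_impact_minutes: float,
                       max_share: float = 0.5, period_days: int = 30) -> bool:
    """Assumed policy: the window's expected impact may use at most
    `max_share` of the error budget that is still unspent."""
    remaining = error_budget_minutes(slo_target, period_days) - consumed_minutes
    return remaining > 0 and expected_impact_minutes <= remaining * max_share
```

For a 99.9% SLO the 30-day budget is about 43.2 minutes, so a window with 5 minutes of expected read-only time fits while most of the budget is unspent and is rejected once only a few minutes remain.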

3) Planning: Timing and coverage

Window selection: low traffic, minimal impact for key cohorts (regions/VIP/partners).
Time zones: record in UTC + local time (for example, Europe/Kyiv).
Blackout periods: no work during peak seasons or events (matches, sales, release "windows of death").
Blast radius: clearly define who will be affected (services, regions, providers).
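
Recording a window in UTC plus local time and checking it against blackout periods can be sketched as follows. The helper names and the blackout dates are hypothetical; real periods would come from the window calendar. The sketch uses the standard `zoneinfo` module, which needs time zone data to be available on the host.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical blackout periods in UTC (for example, a sales weekend).
BLACKOUTS = [
    (datetime(2025, 11, 28, 0, 0, tzinfo=timezone.utc),
     datetime(2025, 12, 1, 0, 0, tzinfo=timezone.utc)),
]


def describe_window(start_utc, end_utc, tz_name="Europe/Kyiv"):
    """Render the window in UTC and in one local time zone."""
    local = ZoneInfo(tz_name)
    start_local = start_utc.astimezone(local)
    end_local = end_utc.astimezone(local)
    return (f"{start_utc:%Y-%m-%d %H:%M}-{end_utc:%H:%M} UTC "
            f"({tz_name} {start_local:%H:%M}-{end_local:%H:%M})")


def in_blackout(start_utc, end_utc, blackouts=BLACKOUTS):
    """True if the window overlaps any blackout period (half-open intervals)."""
    return any(start_utc < b_end and b_start < end_utc
               for b_start, b_end in blackouts)
```

For a window from 2025-11-05 00:00 to 02:00 UTC, `describe_window` yields the 02:00-04:00 Europe/Kyiv range used in the RFC template, and `in_blackout` returns `False` for the sample periods above.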

4) Approval process (RFC/CAB lite)

1. The initiator creates a ticket/RFC with a risk analysis and plan (see template below).
2. Risk assessment (Low/Med/High) and approval by the service owner + SRE/security.
3. Calendar: book the slot and check for conflicts (other windows/providers).
4. Comm plan: pre-agreed notifications and status page.
5. Go/No-Go meeting (24-48 hours before the window) for high-risk changes.
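
The calendar conflict check from step 3 can be sketched as a simple interval-overlap test. The `Window` type and its fields are assumptions made for illustration; a real calendar would likely also track regions and providers.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    rfc_id: str
    domain: str      # hypothetical grouping key, e.g. "payments/EU"
    start: datetime
    end: datetime


def find_conflicts(candidate: Window, booked: list[Window]) -> list[Window]:
    """Return booked windows in the same domain that overlap the candidate.

    Intervals are treated as half-open [start, end), so back-to-back
    windows do not count as a conflict.
    """
    return [w for w in booked
            if w.domain == candidate.domain
            and w.start < candidate.end
            and candidate.start < w.end]
```

Treating intervals as half-open is a design choice: it lets one window start exactly when the previous one ends.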

5) Preparation: safety gates

Pre-launch checks: staging tests pass, artifacts are signed, aggregate risk is within the acceptable level.
Canary: 1%→5%→25% by cohort/region; automatic SLO guardrails and auto-rollback.
Degradation flags and limits are ready.
Rollback/backout plan verified in a sandbox; rollback commands are documented.
Alert suppression: only for expected noise; SLO signals are never suppressed.
Access: JIT/JEA accounts for operations, with mandatory audit.
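
An automatic guardrail check could look like the sketch below. The thresholds mirror the sample RFC later in this document (error rate below 0.5%, p99 below 400 ms); the `Guardrails` type and function name are hypothetical, and how metrics are collected is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    max_error_rate: float = 0.005   # 0.5%, expressed as a fraction
    max_p99_ms: float = 400.0


def guardrail_violations(error_rate: float, p99_ms: float,
                         limits: Guardrails = Guardrails()) -> list[str]:
    """Return a list of violated guardrails; an empty list means 'green'.

    Both limits are strict: a value equal to the limit counts as a violation.
    """
    violations = []
    if error_rate >= limits.max_error_rate:
        violations.append(f"error rate {error_rate:.2%} >= {limits.max_error_rate:.2%}")
    if p99_ms >= limits.max_p99_ms:
        violations.append(f"p99 {p99_ms:.0f} ms >= {limits.max_p99_ms:.0f} ms")
    return violations
```

Any non-empty result would be the trigger for auto-rollback.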

6) Communications (timing and content)

T-14/7/2 days (planned): heads-up for clients/internal teams (what/when/impact/contacts).
T-60/30/15 minutes: reminders internally and on the status page.
During work: updates every 15-30 minutes (SEV-dependent) according to the template: Impact → Stage → Next update.
After: final "Completed/Partially completed/Rolled back," list of changes, SLO check.
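
The pre-window notification times follow directly from the window start. A minimal sketch, assuming a hypothetical `notification_schedule` helper and label format:

```python
from datetime import datetime, timedelta, timezone


def notification_schedule(start_utc: datetime, planned: bool = True):
    """Return (label, send_time) pairs for heads-up messages and reminders.

    Planned windows get T-14/7/2 day notices; every window gets
    T-60/30/15 minute reminders.
    """
    offsets = []
    if planned:
        offsets += [(f"T-{d}d", timedelta(days=d)) for d in (14, 7, 2)]
    offsets += [(f"T-{m}m", timedelta(minutes=m)) for m in (60, 30, 15)]
    return [(label, start_utc - delta) for label, delta in offsets]
```

For an emergency window, calling it with `planned=False` keeps only the minute-level reminders.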

7) Execution (reference scenario)

1. Freeze unrelated releases.
2. Switch to canary (restricted cohort) and observe SLI and p95/p99 metrics.
3. Increase the share stepwise while guardrails stay green.
4. Verify business SLIs (conversion, payment and registration success).
5. Verify functionality against the checklist (happy path + critical scenarios).
6. Make the release/no-release decision (IC/SRE/service owner).
7. Remove suppression and restore alert policies.
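
The canary and stepwise stages of this scenario can be sketched as a loop that stops and rolls back on the first failed check. The callbacks `set_traffic_share` and `guardrails_green` are hypothetical hooks into the deployment tooling and monitoring, and the final 100% stage is an assumption standing in for the full rollout.

```python
from typing import Callable, Iterable


def run_rollout(stages: Iterable[int],
                set_traffic_share: Callable[[int], None],
                guardrails_green: Callable[[], bool]) -> str:
    """Ramp traffic through the stages; roll back to 0% on the first violation.

    Returns "released" if every stage stayed green, otherwise "rolled_back".
    """
    for share in stages:
        set_traffic_share(share)
        if not guardrails_green():
            set_traffic_share(0)   # backout: remove the change from traffic
            return "rolled_back"
    return "released"
```

A call such as `run_rollout((1, 5, 25, 100), deployer.set_share, monitor.is_green)` would express the 1% → 5% → 25% → full sequence, where `deployer` and `monitor` are placeholders for real integrations.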

8) After the window: verification and reporting

Observation window (for example, 1-24 hours): tracking SLO and errors.
Window report: what was done, metrics, deviations, evidence, outcome.
If there were problems: AAR→RCA→CAPA (fix rules, tests, documentation).
Archive: ticket, artifacts, signatures, checksums.
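
Checksums for the archive can be produced with the standard library. A minimal sketch, assuming the evidence is stored as files; the function names and manifest shape are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Stream a file through SHA-256 so large logs do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths) -> dict[str, str]:
    """Map each evidence file path to its SHA-256 hex digest."""
    return {str(Path(p)): sha256_of(p) for p in paths}
```

Storing the manifest next to the ticket makes it possible to verify later that the archived evidence has not changed.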

9) Coordination with external providers

Confirmed slots and provider contacts; the window is published in their status system.
Fallback/routing to an alternative provider for the duration of the work.
A single war-room with the provider (chat/bridge) and an agreed update SLA.

10) Process Maturity Metrics

On-time rate: % of windows started and completed on time.
Change failure rate: % of windows with rollbacks or SLO impact.
Incident-during-MW: incidents that occurred during the window.
Communication SLA: share of updates sent on time.
Evidence completeness: % of windows with a full evidence package.
Customer impact: complaints/tickets per window, and the trend.
After 7/30 days: SLO stability and no regressions.
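
Several of these metrics are simple ratios over the window history. The sketch below assumes a hypothetical `WindowRecord` with boolean fields; a real record would be derived from tickets and reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowRecord:
    on_time: bool
    rolled_back: bool
    slo_impacted: bool
    evidence_complete: bool


def maturity_metrics(records: list[WindowRecord]) -> dict[str, float]:
    """Compute percentage metrics; returns an empty dict when there is no data."""
    total = len(records)
    if total == 0:
        return {}

    def pct(count: int) -> float:
        return round(100.0 * count / total, 1)

    return {
        "on_time_rate": pct(sum(r.on_time for r in records)),
        "change_failure_rate": pct(sum(r.rolled_back or r.slo_impacted for r in records)),
        "evidence_completeness": pct(sum(r.evidence_complete for r in records)),
    }
```

A window counts toward the change failure rate if it was rolled back or affected an SLO, matching the definition above.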

11) Checklists

Before the window

  • RFC/ticket is complete; risk assessment done; owner assigned.
  • Canary and backout plan checked; rollback commands tested.
  • JIT access issued; alerts configured (SLO alerts are not suppressed).
  • Calendar, status page, and notifications are prepared.
  • Unrelated releases and competing windows are frozen or rescheduled.
  • Providers confirmed; contacts and SLAs recorded.

During

  • Updates on schedule; war-room is active.
  • Guardrails on SLO and error spikes hold; on violation, auto-rollback.
  • Evidence is collected (screenshots, before/after graphs, action log).

After

  • SLOs stay green throughout the observation window.
  • Final report with evidence; status page updated.
  • CAPAs are issued (if there were deviations); documentation updated.

12) Templates

RFC template for a maintenance window

RFC: MW-2025-11-05-DB-Upgrade
Window: 2025-11-05 00:00-02:00 UTC (Europe/Kyiv 02:00-04:00)
Service/component: payments-db (PostgreSQL cluster A)
Type: Planned (High)
Goal: upgrade to 15.x for security and bug fixes
Blast radius: EU region, tenant EU, all write operations
Impact: p99 may grow up to 2× (up to 400 ms); short-term read-only (≤5 min)
Guardrails: error rate <0.5%, p99 <400 ms, SLO not breached
Plan: expand→migrate→contract; canary 1%/5%/25%; steps 1..N (with commands)
Backout: roll back replica/slots; DNS TTL unchanged; rollback time ≤10 min
Suppression: noisy database/replica alerts; SLO alerts stay active
Communications: T-7/T-2 days and T-60/15 minutes; war-room #mw-db-a
Owners: @db-tl, @sre-ic, @payments-pm
Evidence: before/after p95/p99 graphs, migration logs, checksums
Risk: High (data), confirmed by CAB
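
A completeness check for an RFC written in this "Key: value" form could be sketched as below. The list of required fields is an assumption chosen for illustration, not a rule stated by the template.

```python
# Hypothetical set of mandatory fields; adjust to the team's own policy.
REQUIRED_FIELDS = ("Window", "Service/component", "Type", "Blast radius",
                   "Impact", "Backout", "Communications", "Owners",
                   "Evidence", "Risk")


def missing_rfc_fields(text: str) -> list[str]:
    """Return required fields that are absent or have an empty value."""
    present = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")   # split on the first colon only
        if sep and value.strip():
            present.add(key.strip())
    return [field for field in REQUIRED_FIELDS if field not in present]
```

An empty result means the RFC carries every field on the list; anything else can block the slot booking until the ticket is completed.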

Client Notification Template (Brief)


Subject: Planned maintenance, 05.11.2025, 02:00–04:00 (Europe/Kyiv)
We will update the payment database. Short delays and read-only mode (up to 5 minutes) are possible.
Contacts: status.example.com, support@example.com

Suppression rules (idea)

suppress:
  - name: db-maintenance
    when: window("2025-11-05T00:00Z","2025-11-05T02:00Z")
    match: ["db.replica.lag", "db.connection.reset", "migration.progress"]
    keep: ["slo.payment.success", "api.availability"]
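
How such a rule might be evaluated is sketched below in Python. The rule structure, the `is_suppressed` function, and exact-name matching are assumptions; a real alerting system will have its own syntax and matching semantics.

```python
from datetime import datetime, timezone

# Hypothetical in-memory form of the rule above.
RULE = {
    "name": "db-maintenance",
    "start": datetime(2025, 11, 5, 0, 0, tzinfo=timezone.utc),
    "end": datetime(2025, 11, 5, 2, 0, tzinfo=timezone.utc),
    "match": {"db.replica.lag", "db.connection.reset", "migration.progress"},
    "keep": {"slo.payment.success", "api.availability"},
}


def is_suppressed(alert_name: str, fired_at: datetime, rule: dict = RULE) -> bool:
    """Suppress only listed noise alerts inside the window; never SLO alerts."""
    if alert_name in rule["keep"]:
        return False                      # SLO signals always pass through
    in_window = rule["start"] <= fired_at < rule["end"]
    return in_window and alert_name in rule["match"]
```

Checking the `keep` list first guarantees that a misconfigured `match` list cannot silence an SLO alert.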

13) Specifics for regulated domains

Immutable audit log: who approved, who executed, which commands, artifact hashes.
PII/Finance: masking in evidence, limited access to reports.
Notification deadlines for customers and partners follow the contracts.
Provider windows are documented with external SLAs and contacts.

14) Anti-patterns

A window without a backout plan or a verified rollback.
Suppressing SLO signals "just in case."
Competing windows in the same domain/region.
Comm silence: no before/during/after updates.
Manual edits in production without audit or scripts.
"Infinite" windows due to unclear success criteria.
Lack of evidence: nothing to confirm quality.

15) Implementation Roadmap (4-6 weeks)

1. Week 1: introduce a single calendar and RFC template; define blackout periods.
2. Week 2: standardize gates (canary, SLO guardrails, backout).
3. Week 3: automate suppression, release annotations, and the status page.
4. Week 4: reporting and maturity metrics; weekly MW review.
5. Weeks 5-6: integration with providers and audit archive; high-risk window simulation.

16) The bottom line

Properly organized maintenance windows deliver manageable, reversible, and provably safe changes. With SLO guardrails, canary rollouts, strict communications, and a full evidence package, the window turns from a "scary downtime" into a routine mechanism for improvement, without surprises for users and partners.
