Maintenance windows

1) What a maintenance window is and why it is needed

A maintenance window is a pre-agreed time frame for activities that may affect availability or performance. The goal is controlled change with predictable risk, transparent communication, and evidence-based reporting.

Types:

Planned: releases, migrations, certificate/key rotations, database/broker upgrades.
Emergency: urgent security fixes and incident rollbacks.
Silent/Zero-impact: no user impact (hidden canaries, replicas, parallel rollout).
Provider-led: windows of external providers (PSP/KYC/CDN/Cloud).

2) Principles

SLO-first: the time and format of the window are chosen based on the impact on SLIs and error budgets.
Minimal blast radius: canary → staged → full rollout.
Reversibility: each operation has a backout plan and a tested rollback.
Single source of truth: window calendar + ticket/RFC with the full data package.
Evidence: collect proof of every change (logs, graphs, screenshots, artifact hashes).
Communication SLA: updates in advance, during the work, and upon completion.
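The SLO-first principle can be made concrete with a little arithmetic. The sketch below assumes a 99.9% availability SLO over a 30-day period and a hypothetical rule that one window may spend at most 25% of the remaining error budget; the numbers and function names are illustrative and not part of any specific policy.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Total allowed 'bad' minutes for the period under the given SLO."""
    return (1.0 - slo) * period_days * 24 * 60


def window_fits_budget(expected_impact_min: float, budget_left_min: float,
                       max_share: float = 0.25) -> bool:
    """Approve only if the window spends at most max_share of the remaining budget."""
    return expected_impact_min <= budget_left_min * max_share


budget = error_budget_minutes(0.999)                # about 43.2 minutes per 30 days
print(round(budget, 1))
print(window_fits_budget(5, budget_left_min=30))    # 5 <= 7.5, so the window fits
print(window_fits_budget(5, budget_left_min=10))    # 5 > 2.5, so it should be rescheduled
```

A window with five minutes of expected read-only time is acceptable while 30 minutes of budget remain, but not when only 10 minutes are left.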

3) Planning: Timing and coverage

Window selection: low traffic, minimal impact for key cohorts (regions/VIP/partners).
Time zones: record in UTC + local time (for example, Europe/Kyiv).
Blackout periods: no work during peak seasons or events (matches, sales, critical release periods).
Blast radius: clearly define who will be affected (services, regions, providers).
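Recording a window in UTC plus local time and checking it against blackout periods can be automated. The sketch below uses the standard zoneinfo module; the blackout dates and the helper names (load_zone, in_blackout, describe) are hypothetical, and the fallback to the legacy "Europe/Kiev" key is an assumption for systems with an older tz database.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def load_zone(*names: str) -> ZoneInfo:
    """Return the first time zone that the local tz database knows."""
    for name in names:
        try:
            return ZoneInfo(name)
        except ZoneInfoNotFoundError:
            continue
    raise ZoneInfoNotFoundError(names)


# Older tz databases only know the legacy spelling "Europe/Kiev".
KYIV = load_zone("Europe/Kyiv", "Europe/Kiev")

start_utc = datetime(2025, 11, 5, 0, 0, tzinfo=timezone.utc)
end_utc = datetime(2025, 11, 5, 2, 0, tzinfo=timezone.utc)

# Hypothetical blackout period (for example, a sales event), stored in UTC.
BLACKOUTS = [(datetime(2025, 11, 28, 0, 0, tzinfo=timezone.utc),
              datetime(2025, 12, 2, 0, 0, tzinfo=timezone.utc))]


def in_blackout(start: datetime, end: datetime) -> bool:
    """True if [start, end) overlaps any blackout period."""
    return any(start < b_end and b_start < end for b_start, b_end in BLACKOUTS)


def describe(start: datetime, end: datetime) -> str:
    """Render the window in UTC followed by the local time range."""
    local_start, local_end = start.astimezone(KYIV), end.astimezone(KYIV)
    return (f"{start:%Y-%m-%d %H:%M}-{end:%H:%M} UTC "
            f"({KYIV.key} {local_start:%H:%M}-{local_end:%H:%M})")


print(describe(start_utc, end_utc))
print(in_blackout(start_utc, end_utc))
```

Storing everything in UTC and converting only for display avoids daylight saving surprises when a window is planned weeks ahead.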

4) Approval process (RFC/CAB lite)

1. The originator creates a ticket/RFC with a risk analysis and plan (see the template below).
2. Risk assessment (Low/Med/High) and approval by the service owner plus SRE/security.
3. Calendar: book the slot and check for conflicts (other windows, providers).
4. Comm plan: pre-approved notifications and status page updates.
5. Go/No-Go meeting 24-48 hours before the window for high-risk changes.
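The conflict check from the calendar step is a plain interval overlap test. The following is a minimal sketch, assuming each booked window carries a domain label; the Window class, its fields, and the sample RFC identifiers are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    rfc: str
    domain: str          # service domain or region that the window touches
    start: datetime
    end: datetime


def conflicts(candidate: Window, booked: list[Window]) -> list[Window]:
    """Return booked windows in the same domain that overlap the candidate."""
    return [w for w in booked
            if w.domain == candidate.domain
            and candidate.start < w.end and w.start < candidate.end]


booked = [
    Window("MW-1", "payments-eu", datetime(2025, 11, 5, 1, 0), datetime(2025, 11, 5, 3, 0)),
    Window("MW-2", "cdn-eu", datetime(2025, 11, 5, 0, 0), datetime(2025, 11, 5, 2, 0)),
]
candidate = Window("MW-3", "payments-eu", datetime(2025, 11, 5, 0, 0), datetime(2025, 11, 5, 2, 0))
print([w.rfc for w in conflicts(candidate, booked)])   # ['MW-1']
```

Two half-open intervals overlap exactly when each one starts before the other ends, so windows that merely touch (one ends at 02:00, the next starts at 02:00) are not reported as conflicts.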

5) Preparation: Safety gates

Pre-launch checks: staging tests passed, artifacts signed, aggregate risk within the acceptable level.
Canary: 1% → 5% → 25% by cohort/region; automatic SLO guardrails and auto-rollback.
Degradation flags and limits are ready.
Rollback/backout plan verified in a sandbox; rollback commands are documented.
Alert suppression: only for expected noise; SLO signals are never muted.
Access: JIT/JEA accounts for operations, with mandatory audit.
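The canary gate with automatic rollback can be sketched as a loop over traffic steps. The thresholds, the final 100% step, and the observe callback below are assumptions for illustration; a real implementation would query the monitoring system and trigger the documented rollback commands.

```python
from typing import Callable, Dict, List, Tuple

# Guardrail thresholds are illustrative; real values come from the service SLO.
GUARDRAILS = {"error_rate": 0.005, "p99_ms": 400.0}
STEPS = [1, 5, 25, 100]   # percent of traffic; 100 is the final full rollout


def guardrails_ok(metrics: Dict[str, float]) -> bool:
    """True while every observed metric stays below its threshold."""
    return all(metrics[name] < limit for name, limit in GUARDRAILS.items())


def run_canary(observe: Callable[[int], Dict[str, float]]) -> Tuple[str, List[int]]:
    """Walk through the steps; roll back as soon as a guardrail is violated."""
    completed: List[int] = []
    for percent in STEPS:
        if not guardrails_ok(observe(percent)):
            return "rolled_back", completed
        completed.append(percent)
    return "completed", completed


def fake_observe(percent: int) -> Dict[str, float]:
    """Stand-in for a metrics query: latency degrades at the 25% step."""
    return {"error_rate": 0.001, "p99_ms": 250.0 if percent < 25 else 480.0}


print(run_canary(fake_observe))   # ('rolled_back', [1, 5])
```

Because the check runs before a step is marked complete, the result also records the last share of traffic that was known to be healthy.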

6) Communications (timing and content)

T-14/7/2 days (planned): heads-up for clients and internal teams (what/when/impact/contacts).
T-60/30/15 minutes: internal reminders and status page updates.
During work: updates every 15-30 minutes (SEV-dependent) according to the template: Impact → Stage → Next update.
After: final status (Completed / Partially completed / Rolled back), list of changes, SLO check.
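The notification timing above translates directly into a schedule computed from the window start. This is a small sketch; the audience labels and the function name are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Offsets before the window start, taken from the timing described above.
EXTERNAL_OFFSETS = [timedelta(days=14), timedelta(days=7), timedelta(days=2)]
REMINDER_OFFSETS = [timedelta(minutes=60), timedelta(minutes=30), timedelta(minutes=15)]


def notification_schedule(start: datetime) -> list[tuple[datetime, str]]:
    """Return (send_time, audience) pairs sorted by send time."""
    plan = [(start - off, "clients and internal teams") for off in EXTERNAL_OFFSETS]
    plan += [(start - off, "internal reminder and status page") for off in REMINDER_OFFSETS]
    return sorted(plan)


start = datetime(2025, 11, 5, 0, 0, tzinfo=timezone.utc)
for when, audience in notification_schedule(start):
    print(f"{when:%Y-%m-%d %H:%M} UTC -> {audience}")
```

For a window starting on 2025-11-05 at 00:00 UTC, the first heads-up falls on 2025-10-22 and the last reminder at 23:45 UTC on 2025-11-04.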

7) Execution (reference scenario)

1. Freeze unrelated releases.
2. Switch to canary (restricted cohort) → observe SLIs and p95/p99 latency.
3. Increase the traffic share step by step while guardrails stay green.
4. Verify business SLIs (conversion, payment and registration success).
5. Verify functionality against the checklist (happy path + critical scenarios).
6. Release/No-release decision (IC/SRE/service owner).
7. Remove suppression and restore alert policies.

8) After the window: verification and reporting

Observation window (for example, 1-24 hours): tracking SLO and errors.
Window report: what was done, metrics, deviations, evidence, outcome.
If there were problems: AAR → RCA → CAPA (fix rules, tests, documentation).
Archive: ticket, artifacts, signatures, checksums.
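Checksums for the archive can be produced with the standard hashlib module. The sketch below builds a manifest that maps each evidence file to its SHA-256 digest; the directory layout and file name are hypothetical, and a temporary directory stands in for the real archive.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file so large logs do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir: Path) -> dict[str, str]:
    """Map each file (relative path) to its SHA-256 checksum."""
    return {str(p.relative_to(evidence_dir)): sha256_of(p)
            for p in sorted(evidence_dir.rglob("*")) if p.is_file()}


# Demo with a temporary directory standing in for the evidence archive.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "migration.log").write_bytes(b"step 1 ok\n")
    manifest = build_manifest(root)
    print(json.dumps(manifest, indent=2))
```

Storing the manifest next to the ticket lets a later audit confirm that logs and graphs were not altered after the window closed.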

9) Coordination with external providers

Confirmed slots and provider contacts; the window is listed in their status system.
Fallback/routing to an alternative provider for the duration of the work.
A single war-room with the provider (chat/bridge) and an agreed update SLA.

10) Process Maturity Metrics

On-time rate: % of windows started and completed on time.
Change failure rate: % of windows with rollbacks or SLO impact.
Incident-during-MW: incidents that occurred during the window.
Communication SLA: share of updates sent on time.
Evidence completeness: % of windows with a full evidence package.
Customer impact: complaints/tickets per window, and the trend.
After 7/30 days: SLO stability and no recurrences.
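Several of these metrics are simple ratios over the window history. The following sketch assumes each window is stored as a record with four boolean flags; the WindowRecord fields and the sample history are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowRecord:
    on_time: bool            # started and finished within the agreed slot
    rolled_back: bool
    slo_impacted: bool
    evidence_complete: bool


def pct(part: int, total: int) -> float:
    """Percentage rounded to one decimal; 0.0 for an empty history."""
    return round(100.0 * part / total, 1) if total else 0.0


def maturity_metrics(records: list[WindowRecord]) -> dict[str, float]:
    n = len(records)
    return {
        "on_time_rate": pct(sum(r.on_time for r in records), n),
        "change_failure_rate": pct(sum(r.rolled_back or r.slo_impacted for r in records), n),
        "evidence_completeness": pct(sum(r.evidence_complete for r in records), n),
    }


history = [
    WindowRecord(True, False, False, True),
    WindowRecord(True, True, False, True),
    WindowRecord(False, False, False, False),
    WindowRecord(True, False, False, True),
]
print(maturity_metrics(history))
```

With this sample history, three of four windows were on time and had full evidence (75.0%), and one of four ended in a rollback (25.0%).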

11) Checklists

Before the window

RFC/ticket is complete; risk assessment done; owner assigned.
Canary and backout plan verified; rollback commands tested.
JIT access issued; alerts are configured (SLO alerts are not suppressed).
Calendar, status page, and notifications are prepared.
Unrelated releases and competing windows are frozen or rescheduled.
Providers confirmed; contacts and SLAs are recorded.

During

Updates on schedule; war-room is active.
Guardrails on SLO and error spikes are respected; on violation, auto-rollback.
Evidence is collected (screenshots, before/after graphs, action log).

After

SLO stays in the green zone during the observation window.
Final report with evidence; status page updated.
CAPAs are issued (if there were deviations); documentation updated.

12) Templates

RFC template for a maintenance window


RFC: MW-2025-11-05-DB-Upgrade
Window: 2025-11-05 00:00-02:00 UTC (Europe/Kyiv 02:00-04:00)
Service/component: payments-db (PostgreSQL cluster A)
Type: Planned (High)
Target: Upgrade to 15.x for security and bug fixes
Blast radius: EU region, tenant EU, all write operations
Impact: p99 latency up to 2×, to 400 ms; short-term read-only mode (≤5 min)
Guardrails: error rate <0.5%, p99 <400 ms, SLO not breached
Plan: expand → migrate → contract; canary 1%/5%/25%; steps 1..N (with commands)
Backout: roll back replica/slots; DNS TTL unchanged; rollback time ≤10 min
Suppression: noisy database/replica alerts; SLO alerts stay active
Communications: T-7/T-2 days and T-60/15 minutes; war-room #mw-db-a
Owners: @db-tl, @sre-ic, @payments-pm
Evidence: before/after p95/p99 graphs, migration logs, checksums
Risk: High (data), confirmed by CAB
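A template like this is easy to check for completeness before the RFC goes to approval. The sketch below assumes the RFC is stored as plain "Key: value" lines; the required field names follow the template above in normalized English spelling, and the parser is deliberately naive.

```python
REQUIRED_FIELDS = [
    "Window", "Service/component", "Type", "Target", "Blast radius", "Impact",
    "Guardrails", "Plan", "Backout", "Suppression", "Communications",
    "Owners", "Evidence", "Risk",
]


def parse_rfc(text: str) -> dict[str, str]:
    """Split 'Key: value' lines; only the first colon separates key and value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(text: str) -> list[str]:
    """List required fields that are absent or left empty."""
    fields = parse_rfc(text)
    return [name for name in REQUIRED_FIELDS if name not in fields]


draft = """RFC: MW-2025-11-05-DB-Upgrade
Window: 2025-11-05 00:00-02:00 UTC
Service/component: payments-db
Type: Planned (High)
Backout:
Owners: @db-tl, @sre-ic
"""
print(missing_fields(draft))
```

An empty "Backout:" line is reported as missing, which matches the rule that no window is approved without a backout plan.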

Client Notification Template (Brief)


Subject: Planned work on 05.11.2025, 02:00–04:00 (Europe/Kyiv)
We will upgrade the payments database. Short delays and read-only mode (up to 5 minutes) are possible.
Contacts: status.example.com, support@example.com

Suppression rules (idea)

suppress:
  - name: db-maintenance
    when: window("2025-11-05T00:00Z","2025-11-05T02:00Z")
    match: ["db.replica.lag", "db.connection.reset", "migration.progress"]
    keep: ["slo.payment.success", "api.availability"]
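How such a rule might be evaluated is sketched below. The rule is mirrored as a Python dict with explicit start and end times instead of the window(...) expression, and the is_suppressed function is hypothetical; it does not correspond to any particular alerting product.

```python
from datetime import datetime, timezone

RULE = {
    "name": "db-maintenance",
    "start": datetime(2025, 11, 5, 0, 0, tzinfo=timezone.utc),
    "end": datetime(2025, 11, 5, 2, 0, tzinfo=timezone.utc),
    "match": ["db.replica.lag", "db.connection.reset", "migration.progress"],
    "keep": ["slo.payment.success", "api.availability"],
}


def is_suppressed(alert: str, fired_at: datetime, rule: dict = RULE) -> bool:
    """Suppress only expected noise inside the window; 'keep' always wins."""
    if alert in rule["keep"]:
        return False
    in_window = rule["start"] <= fired_at < rule["end"]
    return in_window and alert in rule["match"]


during = datetime(2025, 11, 5, 0, 30, tzinfo=timezone.utc)
after = datetime(2025, 11, 5, 2, 30, tzinfo=timezone.utc)
print(is_suppressed("db.replica.lag", during))        # True: expected noise
print(is_suppressed("slo.payment.success", during))   # False: SLO alerts stay active
print(is_suppressed("db.replica.lag", after))         # False: the window is over
```

Checking the keep list first guarantees that an SLO alert can never be muted by a rule, even if it is added to match by mistake.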

13) Specifics for regulated domains

Immutable audit log: who approved, who executed, which commands were run, artifact hashes.
PII/Finance: masking in evidence, restricted access to reports.
Notification deadlines for customers and partners follow the contracts.
Provider windows are documented with external SLAs and contacts.
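One common way to make an audit log tamper-evident is to chain entries by hash, so that each record commits to the one before it. The sketch below shows only that idea; the entry fields are illustrative, and real immutability additionally needs write-once storage or an external anchor for the latest hash, which is outside this example.

```python
import hashlib
import json

GENESIS = "0" * 64


def append_entry(log: list[dict], actor: str, action: str) -> None:
    """Append an entry whose hash covers its content and the previous hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "prev": prev}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})


def verify(log: list[dict]) -> bool:
    """Recompute every hash; an edited entry or a gap in the chain fails the check."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("actor", "action", "prev")}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


audit: list[dict] = []
append_entry(audit, "@sre-ic", "approved MW-2025-11-05-DB-Upgrade")
append_entry(audit, "@db-tl", "ran migration step 1")
print(verify(audit))                    # True
audit[0]["action"] = "approved something else"
print(verify(audit))                    # False
```

Truncating the end of the log is not detected by the chain alone, which is why the latest hash is usually stored somewhere the operators cannot rewrite.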

14) Anti-patterns

Window without a backout plan and verified rollback.
Muting SLO signals "just in case."
Competing windows in the same domain/region.
Comm silence: no before/during/after updates.
Manual edits in production without audit or scripts.
"Infinite" windows due to vague success criteria.
Lack of evidence: nothing to confirm quality.

15) Implementation Roadmap (4-6 weeks)

1. Week 1: introduce a single calendar and RFC template; define blackout periods.
2. Week 2: standardize gates (canary, SLO guardrails, backout).
3. Week 3: automate suppression, release annotations, and the status page.
4. Week 4: reporting and maturity metrics; weekly MW review.
5. Weeks 5-6: integration with providers and audit archive; high-risk window simulation.

16) The bottom line

Properly organized maintenance windows deliver manageable, reversible, and provably safe changes. With SLO guardrails, canary rollouts, strict communications, and a full set of evidence, the window turns from a "dreaded downtime" into a routine improvement mechanism with no surprises for users and partners.
