Real-time alerts
1) Purpose and principles
Purpose: notify the right people and systems promptly, accurately and in a targeted way about events that threaten SLOs, revenue or compliance, and trigger the correct actions (manual or automatic).
Principles: SLO-first, noise minimization, explainability, context, prioritization by business impact, "one signal - one clear action."
2) Signal taxonomy
SLO signals: error-budget burn rate for critical paths (login, deposit, bet, withdrawal).
KRI: early risk indicators (PSP auth-success drop by bank/GEO, consumer-lag growth, p99↑).
Events: dependency flags, failovers, manual switches, protection activation (rate limit, WAF).
Security/Compliance: spikes in sensitive operations, PII exports, SoD violations.
3) Alert levels and SLAs
4) Sources and context correlation
Telemetry: metrics/traces/logs, synthetics and RUM.
Directories: CMDB/service map, owners, dependencies.
Changes: releases, feature flags, migrations, planned work.
External providers: PSP/KYC/game studios/CDN/WAF statuses.
Each alert is enriched with context: what changed nearby (release/feature flag)? Which dependencies are red? Which segment is affected (GEO/PSP/bank/tenant)?
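A minimal sketch of this enrichment step, assuming in-memory inputs. The `Alert` and `Change` types, the 30-minute lookback and the "red"/"green" health values are illustrative assumptions, not part of any specific tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Change:
    kind: str        # e.g. "release", "feature_flag", "migration"
    service: str
    at: datetime


@dataclass
class Alert:
    service: str
    fired_at: datetime
    segment: dict                          # e.g. {"geo": "TR", "psp": "PSP-1"}
    context: dict = field(default_factory=dict)


def enrich(alert, changes, dependencies, health, lookback=timedelta(minutes=30)):
    """Attach recent changes, red dependencies and the affected segment."""
    deps = dependencies.get(alert.service, [])
    related = {alert.service, *deps}
    # Changes to the service or its dependencies shortly before the alert fired.
    alert.context["recent_changes"] = [
        c for c in changes
        if c.service in related
        and timedelta(0) <= alert.fired_at - c.at <= lookback
    ]
    # Direct dependencies currently reported as unhealthy.
    alert.context["red_dependencies"] = [d for d in deps if health.get(d) == "red"]
    alert.context["affected_segment"] = dict(alert.segment)
    return alert
```

In practice the change feed and the dependency map would come from the release system and the CMDB/service map; here they are plain lists and dicts to keep the example self-contained.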
5) SLO alert rules (core)
Burn rate: two windows (fast 1h and slow 6-24h). Page only when both windows exceed their thresholds at the same time.
Guardrails: p99/error-rate thresholds serve only as triggers for context analysis; they do not replace SLOs.
Impact: score "share of audience × money/min × regulatory exposure" → P1-P4 level.
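The two-window burn-rate rule can be sketched as follows. Burn rate here is the observed error ratio divided by the error budget (1 - SLO target). The default thresholds 14.4 and 6.0 are common illustrative values and an assumption of this sketch; real thresholds should be derived from the SLO period and the chosen windows.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Error-budget burn rate: observed error ratio / allowed error ratio."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / budget


def should_page(fast, slow, slo_target, fast_threshold=14.4, slow_threshold=6.0):
    """Page only if BOTH the fast (1h) and slow (6-24h) windows are burning.

    `fast` and `slow` are (errors, total) tuples for each window.
    """
    fast_burn = burn_rate(fast[0], fast[1], slo_target)
    slow_burn = burn_rate(slow[0], slow[1], slo_target)
    return fast_burn >= fast_threshold and slow_burn >= slow_threshold
```

Requiring both windows filters out short spikes (the slow window stays below threshold) and stale alerts after recovery (the fast window drops first).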
6) Noise reduction
Deduplication: grouping by service/tenant/cause; we create one incident instead of dozens of signals.
Hysteresis: N-of-M confirmations, minimum duration of anomaly.
Silences/Mutes: planned work, known incidents, "follow-the-sun" windows.
Rate limits and quotas: per source/label/tenant; protection against "storm."
Cardinality reduction: userId/sessionId is prohibited in alert labels.
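Two of the techniques above, sketched under simple assumptions: a deduplication key built only from service/tenant/cause (rejecting the prohibited high-cardinality labels), and N-of-M hysteresis over the most recent evaluations. The label names and class name are illustrative.

```python
from collections import deque

FORBIDDEN_LABELS = {"userId", "sessionId"}


def group_key(labels: dict) -> tuple:
    """Dedup key: alerts with the same service/tenant/cause join one incident."""
    bad = FORBIDDEN_LABELS.intersection(labels)
    if bad:
        raise ValueError(f"high-cardinality labels are not allowed: {sorted(bad)}")
    return (labels.get("service"), labels.get("tenant"), labels.get("cause"))


class NOfM:
    """Fire only when at least n of the last m evaluations breached."""

    def __init__(self, n: int, m: int):
        self.n = n
        self.window = deque(maxlen=m)   # keeps only the last m results

    def observe(self, breached: bool) -> bool:
        self.window.append(breached)
        return sum(self.window) >= self.n
```

For example, `NOfM(3, 5)` stays silent on a single breached evaluation and fires only once three of the last five evaluations have breached.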
7) Routing and Escalation
Routing by context: domain (Payments/Games/Core), environment (prod/stage), region, severity.
Escalation: t0 - on-call L1; t0 + X - L2/domain owner; t0 + Y - IC/management. The X and Y intervals depend on the priority (P1-P3).
Duplication by channels: pager + chat at P1; chat/ticket at P3.
Shift change: auto-transfer of context (timeline, performed actions, hypotheses).
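The escalation ladder can be sketched as a lookup from priority and unacknowledged time to the current target. The X/Y values below are hypothetical placeholders (the text does not fix them); the channel mapping covers only P1 and P3, the two levels the text specifies.

```python
from datetime import timedelta

# Hypothetical (X, Y) intervals per priority; tune to your own SLAs.
ESCALATION = {
    "P1": (timedelta(minutes=5), timedelta(minutes=15)),
    "P2": (timedelta(minutes=15), timedelta(minutes=45)),
    "P3": (timedelta(hours=1), timedelta(hours=4)),
}

# Channels as stated in the text: pager + chat at P1, chat/ticket at P3.
CHANNELS = {
    "P1": ["pager", "chat"],
    "P3": ["chat", "ticket"],
}


def escalation_target(priority: str, unacked_for: timedelta) -> str:
    """Who should hold the alert after it has been unacknowledged this long."""
    x, y = ESCALATION[priority]
    if unacked_for >= y:
        return "IC/management"
    if unacked_for >= x:
        return "L2/domain owner"
    return "on-call L1"
```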
8) Auto-remediation
Payments: PSP switching by health × fee × conversion, restricting banks/methods, retries with jitter.
Games/bets: enable cache wedge/limit write operations, show a queue page/waiting room on the frontend.
Infra: traffic evacuation, restart of degrading workers, scaling by lag.
Security/compliance: temporarily block PII exports, enforce dual control for P1 operations.
Every auto-action must have a rollback policy and criteria for returning to normal.
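As an illustration of PSP switching by health × fee × conversion, the sketch below turns the three factors into traffic weights. The scoring formula (health × (1 - fee) × conversion), the `min_health` cutoff and the `Psp` type are assumptions made for this example, not a prescribed model.

```python
from dataclasses import dataclass


@dataclass
class Psp:
    name: str
    health: float       # 0..1, e.g. from synthetic checks
    fee: float          # fraction of the amount, e.g. 0.025
    conversion: float   # 0..1, recent auth-success rate


def score(p: Psp) -> float:
    """Higher is better: healthy, cheap, well-converting providers win."""
    return p.health * (1.0 - p.fee) * p.conversion


def traffic_weights(psps, min_health=0.5):
    """Share of traffic per PSP; providers below min_health get none."""
    eligible = [p for p in psps if p.health >= min_health]
    total = sum(score(p) for p in eligible)
    if total == 0:
        return {}
    return {p.name: score(p) / total for p in eligible}
```

A rollback criterion for this action could be "PSP health back above the cutoff for N consecutive checks", at which point the provider re-enters the eligible set.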
9) Runbook-first experience
Each alert is associated with a runbook: goal, quick diagnostics (3-5 checks), fix/rollback steps, contact persons, links to dashboards and status page. In the chat/pager we show a short action card.
10) On-call policy
24×7 rotation, domain coverage (Payments/Game Core/SRE).
"Second on-call" for P1, a two-person rule in the war room.
Quiet-hours and follow-the-sun windows.
Training: quarterly exercises (tabletop/game-day), shadow shifts.
Post-incident credits (comp-time) to avoid burnout.
11) Integrations
Incident management: auto-creation of incident cards, update feeds, IC/CL roles, timers.
Status page: publishing P1/P2 (via Comms Lead) with templates and localization.
Releases: release-gates by SLI, auto-stop/rollback by alert.
Directories: owners, CMDB, provider contacts.
12) Alert examples (iGaming)
1. PSP-1 auth-success in TR drops by 25% within 10 min
P2, escalating to P1 when more than 30% of transactions are affected.
Auto-action: redistribute traffic to PSP-2/3; enable simplified 3DS; alert the partner manager.
2. p99 "bet→settle" > 3× the norm in EU
Likely causes: replication lag, worker queue backlog.
Auto-action: scale out workers, warm up the cache, temporarily turn off non-critical features.
3. PII export spike
P1 if there is no ticket/approval.
Auto-action: block downloads, notify Compliance, run an SoD check.
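Example 1 can be expressed as a small classification rule. This sketch assumes the 25% figure is a relative drop against a baseline rate and that the affected share of transactions is already known; both the function name and that interpretation are assumptions.

```python
def classify_auth_drop(baseline_rate: float, current_rate: float,
                       affected_share: float):
    """Return None (no alert), "P2" or "P1" for a PSP auth-success drop.

    baseline_rate / current_rate: auth-success ratios (0..1) over 10 minutes.
    affected_share: fraction of all transactions going through this segment.
    """
    if baseline_rate <= 0:
        return None
    drop = (baseline_rate - current_rate) / baseline_rate   # relative drop
    if drop < 0.25:
        return None
    # Escalate when more than 30% of transactions are affected.
    return "P1" if affected_share > 0.30 else "P2"
```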
13) Alerting Quality Metrics (KPI/KRI)
MTTA-Comms/MTTA-Ops: time to acknowledge/first action.
Precision/Recall (alert ↔ incident), False Alarm Rate.
Lead time before SLO violation, TTD (time to detect).
Pager fatigue: alerts/person/week, night pages, share of empty (non-actionable) pages.
Auto-fix rate: the share of problems closed by an auto-action without human involvement.
Aging: the share of P3/P4 alerts left open for more than X days.
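Precision and recall over the alert ↔ incident mapping can be computed from three counts. In this sketch the false alarm rate is taken as the share of alerts that matched no incident (1 - precision); other definitions exist, so treat that choice as an assumption.

```python
def alert_quality(true_alerts: int, false_alerts: int, missed_incidents: int) -> dict:
    """Signal-quality metrics for a review period.

    true_alerts: alerts that corresponded to a real incident (TP)
    false_alerts: alerts with no matching incident (FP)
    missed_incidents: incidents that no alert caught (FN)
    """
    fired = true_alerts + false_alerts
    incidents = true_alerts + missed_incidents
    precision = true_alerts / fired if fired else 0.0
    recall = true_alerts / incidents if incidents else 0.0
    false_alarm_rate = false_alerts / fired if fired else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_alarm_rate": false_alarm_rate,
    }
```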
14) Cost management
Quotas for alerts/sources, cutting off redundant labels.
Downsampling and metric aggregation; trace sampling by class.
Regular cost-review: $/alert, $/SLI-dashboard, "heavy" series.
15) Privacy and compliance
No PII in alert text or labels; tokenization of identifiers.
Access policies (RBAC/ABAC), SoD on alert configuration.
Audit of rule changes; versioning, tests and diffs.
16) Implementation Roadmap (6-10 weeks)
Weeks 1-2: SLI/KRI catalog, owner map, P1-P4 levels, first SLO rules (burn rate).
Weeks 3-4: dedup/hysteresis/silences, integration with the incident system and chats, runbook links.
Weeks 5-6: auto-actions for Payments/Queues, release gates, status-page feed.
Weeks 7-8: context (releases/feature flags/providers), PSP × bank × GEO heat maps, P1/P2 exercises.
Weeks 9-10: FinOps for alerting, KPI dashboards, review of thresholds and quotas, on-call training.
17) Artifacts and patterns
Alert Spec: metric/condition, windows, suppression, owner, runbook, auto-actions.
Routing Map: domain → channel → escalations, backup contacts.
Silence Policy: mute rules (planned work/known incidents), who may enable them.
On-call Handbook: rotations, shift changes, P1/P2 checklists, channels.
Post-Incident Pack: alert exports/timelines, signal quality analysis.
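An Alert Spec can be kept as a typed record and linted before it ships. The sketch below is one possible shape: the field names mirror the Alert Spec line above, while the `validate` checks (owner, runbook, windows, prohibited labels, rollback criteria on auto-actions) are illustrative assumptions.

```python
from dataclasses import dataclass, field

FORBIDDEN_LABELS = {"userId", "sessionId"}


@dataclass
class AlertSpec:
    name: str
    condition: str                       # metric/condition expression
    windows: list                        # e.g. ["1h", "6h"]
    owner: str
    runbook_url: str
    suppression: list = field(default_factory=list)
    auto_actions: list = field(default_factory=list)   # dicts: name, rollback
    labels: dict = field(default_factory=dict)


def validate(spec: AlertSpec) -> list:
    """Return a list of human-readable problems; empty means the spec passes."""
    problems = []
    if not spec.owner:
        problems.append("missing owner")
    if not spec.runbook_url:
        problems.append("missing runbook")
    if not spec.windows:
        problems.append("no evaluation windows")
    bad = sorted(FORBIDDEN_LABELS.intersection(spec.labels))
    if bad:
        problems.append("forbidden labels: " + ", ".join(bad))
    for action in spec.auto_actions:
        if not action.get("rollback"):
            problems.append(f"auto-action '{action.get('name')}' has no rollback criteria")
    return problems
```

Running such a check in CI on every rule change also gives the versioning, tests and diffs that the privacy and compliance section calls for.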
18) Antipatterns
Paging on "raw" p95/p99 without an SLO → noise and fatigue.
Dozens of signals about the same thing (no deduplication/correlation).
The alert does not have a runbook or owner.
Thresholds "set in stone" without seasonality/segmentation (GEO/PSP/bank/hour).
No way back after auto-actions (no rollback criteria).
Labels with PII and userId → risks and an explosion of cardinality.
Result
Truly useful alerting is an SLO-centric pipeline: context-aware burn-rate rules, smart noise reduction, clear routing and escalation, a runbook-first experience and safe auto-actions. Such a setup catches critical events before users do, reduces MTTR, protects revenue and at the same time shields the on-call team from "pager hell."