Notification and alert system
(Section: Operations and Management)
1) Purpose and principles
The goal is to deliver few signals, but the right ones: only relevant alerts, delivered in a timely manner to the responsible person or automation, with a clear next step.
Principles:
- Actionable by default: each alert has an owner, priority, response time, and an action button.
- SLO-first: Alerts are built around SLI/SLO, not arbitrary metrics.
- Noise control: dedup, correlation, storm suppression.
- Context-rich: metadata (region, tenant, version, trace_id) and link to runbook.
- Audit-ready: all alerts and reactions are acknowledged and recorded in an immutable log.
2) Signal sources
Tech telemetry: availability, p95/p99, error rate, queue lag, resource limits.
Business events: PriceMismatch, WebhookLag, RTP Drift, fraud signals.
Security/compliance: SoD violations, PII access, key/certificate expiration.
Scheduler/queues: tasks past SLA, DLQ avalanches, retry storms.
3) Classification and priorities
Guardrails: alerts are formulated against SLO/error budget (burn rate), not arbitrary metric thresholds.
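A minimal sketch of SLO burn-rate alerting. The multi-window rule and the 14.4 threshold follow common SRE practice; the exact windows, SLO, and thresholds here are illustrative assumptions, not prescribed values.

```python
# Sketch: page on error-budget burn rate, not raw metric thresholds.

def burn_rate(error_rate: float, error_budget: float) -> float:
    """How many times faster than allowed the error budget is burning."""
    return error_rate / error_budget

def should_page(short_rate: float, long_rate: float,
                slo: float = 0.999) -> bool:
    """Multi-window rule: page only if both the short (fast) and long
    (sustained) windows exceed the burn-rate threshold, which cuts
    flapping pages on brief spikes."""
    budget = 1.0 - slo                 # e.g. 0.1% allowed errors
    threshold = 14.4                   # ~2% of a 30-day budget burned in 1h
    return (burn_rate(short_rate, budget) >= threshold
            and burn_rate(long_rate, budget) >= threshold)

# A sustained 2% error rate against a 99.9% SLO burns the budget 20x too fast:
print(should_page(short_rate=0.02, long_rate=0.02))    # True
print(should_page(short_rate=0.02, long_rate=0.0005))  # False: brief spike
```

The two-window condition is what keeps a ten-second spike from paging anyone while a sustained burn still fires quickly.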
4) Routing and escalation (24×7)
Routing by context: 'region/tenant/product/provider/severity'.
Escalation ladder: on-call engineer → team lead → duty manager → Exec/Legal (for PII/finance).
On-call: rotation by role (SRE, App, Data, Security, Payments); backup contacts (chat/voice/SMS).
Silence windows: night, release, marketing; exceptions for P1.
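The routing and escalation ladder above can be sketched as a table keyed by severity and domain. All route names, roles, and timeouts below are hypothetical, not an actual org chart.

```python
# Sketch: context-based routing with an ordered escalation ladder.
from dataclasses import dataclass, field

@dataclass
class Route:
    first_responder: str
    escalation: list            # ordered ladder, walked on ack timeout
    ack_timeout_min: int        # escalate if not acknowledged in time

ROUTES = {
    ("P1", "payments"): Route("oncall-payments",
                              ["team-lead", "duty-manager", "exec-legal"], 5),
    ("P2", "payments"): Route("oncall-payments", ["team-lead"], 30),
    ("P3", "*"):        Route("team-channel", [], 0),   # no paging
}

def resolve(severity: str, domain: str) -> Route:
    """Exact (severity, domain) match first, then severity wildcard."""
    return ROUTES.get((severity, domain)) or ROUTES[(severity, "*")]

r = resolve("P1", "payments")
print(r.first_responder, "->", " -> ".join(r.escalation))
```

Silence windows would sit in front of `resolve`: a suppression check that drops non-P1 alerts during a configured window, with P1 always passing through.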
5) Noise reduction and correlations
Deduplication: by '(fingerprint, region, tenant, route)' and 'trace_id'.
Storm suppression: temporarily suppress duplicates while a P1 is active.
Correlations: grouping signals around the root cause (release/feature/provider).
Hysteresis: separate entry and exit thresholds to avoid flapping ("sawtooth" alerts).
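Dedup fingerprints and hysteresis can be sketched as follows; the threshold values are illustrative.

```python
# Sketch: dedup key from the (rule, region, tenant, route) tuple,
# plus a hysteresis gate with separate enter/exit thresholds.
import hashlib

def fingerprint(rule: str, region: str, tenant: str, route: str) -> str:
    """Stable dedup key: identical signals collapse to one alert."""
    key = f"{rule}|{region}|{tenant}|{route}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

class Hysteresis:
    """Fire above `enter`, clear only below `exit_` (exit_ < enter),
    so a metric hovering near one threshold does not flap."""
    def __init__(self, enter: float, exit_: float):
        assert exit_ < enter
        self.enter, self.exit_, self.firing = enter, exit_, False

    def update(self, value: float) -> bool:
        if self.firing and value < self.exit_:
            self.firing = False
        elif not self.firing and value >= self.enter:
            self.firing = True
        return self.firing

h = Hysteresis(enter=250.0, exit_=200.0)   # p95 latency in ms
print([h.update(v) for v in (240, 260, 230, 210, 190)])
# → [False, True, True, True, False]
```

With a single 250 ms threshold, the 230 and 210 samples would have cleared and re-fired the alert; the 200 ms exit threshold keeps it firing until the metric has genuinely recovered.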
6) Alert content (template)
Title: concise and substantive - "EU/Checkout: p95 > 250ms (SLO breach)".
Key fields: priority, time, region, tenant, version, trace_id, affected %, probable cause.
What to do now: the first 1-3 steps + a link to the runbook/buttons (Re-route, Rollback, Pause Promo).
Next communication: in N minutes, owner (IC/on-call).
7) Delivery channels
Chat/messenger: the main triage channel (bot cards with buttons).
Pager/voice/SMS: for P1.
Email: reports and non-urgent notifications (P3/Info).
Webhooks: integration with ticketing/orchestrators.
Status page: external notification of customers and partners.
8) Integrations and action buttons
Incident bot: creates a card, assigns an IC, opens a video bridge, starts timers.
Runbook auto-actions ("runes"): Re-route, Rollback, Raise Limit, Flush Cache, Disable Webhooks, Enable Safe Mode.
Rights: rune launches are restricted to roles; all actions are signed and logged.
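Role-gated, signed rune launches can be sketched like this. The role names, action allow-list, and key handling are illustrative; a real system would pull the signing key from a secrets manager, not a constant.

```python
# Sketch: every rune launch is checked against an RBAC allow-list and
# appended to a log with an HMAC signature over the entry.
import hashlib
import hmac
import json
import time

ALLOWED = {"Rollback": {"sre", "team-lead"},
           "Re-route": {"sre", "payments-oncall"}}
SIGNING_KEY = b"demo-key"          # illustrative; use a secrets manager
AUDIT_LOG: list = []

def launch(action: str, actor: str, role: str) -> bool:
    if role not in ALLOWED.get(action, set()):
        return False                       # RBAC denies the launch
    entry = {"action": action, "actor": actor, "role": role,
             "ts": time.time()}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["sig"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    AUDIT_LOG.append(entry)                # append-only audit trail
    return True

print(launch("Rollback", "alice", "sre"))      # True: allowed, logged, signed
print(launch("Rollback", "mallory", "viewer")) # False: denied, not logged
```

Denied attempts returning `False` without a log entry is a simplification; in practice denials are usually logged too, for the audit trail.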
9) Multi-region and multi-tenant
Independent SLOs/thresholds per region; a local incident does not turn the whole world red.
Visibility filters: partners/tenants see only their own.
Jurisdictional requirements: notification texts, languages, time zones.
10) Policies, schedules, silence windows
Alert policy: owners, thresholds, channels, escalations, templates.
Calendars: working/non-working hours, release/marketing windows.
Change freeze: ease thresholds or suppress non-P1 alerts during major promotions.
11) Audit and legal fixation
Receipts: for critical alerts - 'receipt_hash' and a DSSE signature.
WORM logs: immutable storage of events and reactions (who acknowledged, who did what).
Chain-of-custody: tracing escalations and decisions.
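The receipt and chain-of-custody idea can be sketched as a hash-chained log: each receipt_hash covers the previous one, so any tampering breaks the chain. This is a minimal illustration of the WORM property; DSSE signing of each receipt is omitted here.

```python
# Sketch: hash-chained audit receipts for alerts and reactions.
import hashlib
import json

def append_receipt(chain: list, event: dict) -> dict:
    """Append an event; its receipt_hash covers the previous hash."""
    prev = chain[-1]["receipt_hash"] if chain else "genesis"
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    entry = {"event": event,
             "receipt_hash": hashlib.sha256(payload.encode()).hexdigest()}
    chain.append(entry)
    return entry

def verify(chain: list) -> bool:
    """Recompute every hash; any edited event breaks the chain."""
    prev = "genesis"
    for e in chain:
        payload = json.dumps({"prev": prev, "event": e["event"]},
                             sort_keys=True)
        if hashlib.sha256(payload.encode()).hexdigest() != e["receipt_hash"]:
            return False
        prev = e["receipt_hash"]
    return True

log: list = []
append_receipt(log, {"alert": "P1", "ack_by": "alice"})
append_receipt(log, {"action": "Rollback", "by": "alice"})
print(verify(log))                       # True
log[0]["event"]["ack_by"] = "mallory"    # tamper with history
print(verify(log))                       # False: chain broken
```

In production the chain would live in append-only (WORM) storage and each receipt would additionally carry a DSSE signature, so tampering is both detectable and attributable.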
12) Notification System Metrics and SLO
MTTA (time to acknowledge): P1 ≤ 5-10 min; P2 ≤ 30 min.
Page rate/on-call load: signals per shift within the target range.
False positive rate: target ≤ 10-15%.
Correlation efficiency: the proportion of grouped signals ≥ 80%.
Delivery SLO: chat ≥ 99.9%, SMS/voice ≥ 99.5%.
Time-to-Action: p95 time from alert to rune execution.
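A quick sketch of computing MTTA and its p95 against the P1 target above; the sample delays are made up.

```python
# Sketch: MTTA (mean time to acknowledge) and its 95th percentile.
from statistics import quantiles

ack_delays_min = [2, 3, 4, 4, 6, 7, 3, 5]   # alert -> acknowledge, minutes

mtta = sum(ack_delays_min) / len(ack_delays_min)
p95 = quantiles(ack_delays_min, n=100)[94]  # 95th percentile

print(f"MTTA={mtta:.2f} min, p95={p95:.1f} min, "
      f"P1 target (<=10 min) met: {mtta <= 10}")
```

Tracking the percentile alongside the mean matters: a good average can hide a long tail of slow acknowledgements that still breaches the P1 target.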
13) Dashboards and reports
Operational: active incidents, burn-rate, region/tenant map, alert queue.
Alert quality: noise, false positives, threshold revisions, blind spots.
On-call load: paging frequency, response time, "out of hours."
Post-incident: rune effectiveness, cause recurrence.
14) iGaming/fintech specifics
Payments/PSP: P1 - provider outage or rise in authorization failures; auto-route to the backup PSP.
RTP & Limits: alerts on observed RTP drift, limit breaches, suspicious win patterns.
Affiliates/webhooks: delivery lag, growth in duplicates, drop in confirmed receipts.
Price/FX/Tax: storefront↔checkout mismatch, out-of-sync artifact versions.
Responsible gaming: RG triggers and their timely escalation to Support/Compliance.
15) RACI
16) Implementation checklist
- Define the North-Star and SLI/SLO; tie alerts to burn rate.
- Create a policy directory: thresholds, channels, escalations, silence windows.
- Implement dedup, correlation, hysteresis, storm suppression.
- Configure multi-region and multi-tenant visibility rules.
- Connect action buttons and runbooks; restrict launch rights.
- Enable WORM logs/receipts, trace_id propagation, and rune audit.
- Build quality dashboards (noise, FP, MTTA, page rate).
- Run a GameDay: PSP outage, WebhookLag, PriceMismatch, RTP Drift.
- Review thresholds regularly; A/B test thresholds on noisy metrics.
- Report monthly on on-call load and improvements.
17) Playbooks (reference)
PSP Outage (P1): auto-route to the backup PSP, lower client timeouts, quarantine "gray" transactions, status updates every 15 minutes.
WebhookLag (P2): scale up workers/batches, prioritize queues, temporarily pause optional endpoints.
PriceMismatch (P1/P2): force-disable the cache, reconcile 'fx_version/tax_rule_version', roll back the artifact, issue compensations.
RTP Drift (P2): pause bonuses/promos, audit profiles, extend the monitoring window.
Security: SoD/MFA failure (P1/P2): block the operation, JIT recheck, forensics and Legal if necessary.
18) FAQ
How to reduce false positives?
SLO-oriented rules, correlations, hysteresis, training windows, and regular threshold revisions.
What is more important - coverage or accuracy?
For P1 - accuracy and speed (better fewer alerts, but critical ones). For P3 - coverage of trends and costs.
Do I need phone paging?
Yes, for P1; chat may be unavailable or muted.
How not to burn out the on-call team?
Page rate limits, load redistribution, follow-the-sun, monthly noise reviews.
Summary: the notification and alert system is a controlled pipeline from signal to action. Build it on SLOs, dampen noise, route by context, provide action buttons, and record everything for audit and legal purposes. This reduces MTTA, takes load off on-call, and increases business resilience even under sharp spikes and provider failures.