Notification and alert system
(Section: Operations and Management)
1) Purpose and principles
The goal is to deliver few signals, but the right ones: only relevant alerts, delivered in a timely manner to the responsible person or automation, with a clear next step.
Principles:
- Actionable by default: each alert has an owner, priority, response time, and an action button.
- SLO-first: Alerts are built around SLI/SLO, not arbitrary metrics.
- Noise control: dedup, correlation, storm suppression.
- Context-rich: metadata (region, tenant, version, trace_id) and link to runbook.
- Audit-ready: all alerts and reactions are acknowledged and recorded in an immutable log.
2) Signal sources
Tech telemetry: availability, p95/p99, error rate, queue lag, resource limits.
Business events: PriceMismatch, WebhookLag, RTP Drift, fraud signals.
Security/compliance: SoD violations, PII access, key/certificate expiration.
Scheduler/queues: tasks past SLA, DLQ avalanches, retry storms.
3) Classification and priorities
Guardrails: alerts are formulated against SLO/error budget (burn rate), not arbitrary metric thresholds.
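A minimal sketch of SLO burn-rate alerting. The multi-window rule and the 14.4 threshold follow common SRE practice; the exact windows, SLO, and thresholds here are illustrative assumptions, not prescribed values.

```python
# Sketch: page on error-budget burn rate, not raw metric thresholds.

def burn_rate(error_rate: float, error_budget: float) -> float:
    """How many times faster than allowed the error budget is burning."""
    return error_rate / error_budget

def should_page(short_rate: float, long_rate: float,
                slo: float = 0.999) -> bool:
    """Multi-window rule: page only if both the short (fast) and long
    (sustained) windows exceed the burn-rate threshold, which cuts
    flapping pages on brief spikes."""
    budget = 1.0 - slo                 # e.g. 0.1% allowed errors
    threshold = 14.4                   # ~2% of a 30-day budget burned in 1h
    return (burn_rate(short_rate, budget) >= threshold
            and burn_rate(long_rate, budget) >= threshold)

# A sustained 2% error rate against a 99.9% SLO burns the budget 20x too fast:
print(should_page(short_rate=0.02, long_rate=0.02))    # True
print(should_page(short_rate=0.02, long_rate=0.0005))  # False: brief spike
```

The two-window condition is what keeps a ten-second spike from paging anyone while a sustained burn still fires quickly.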
4) Routing and escalation (24×7)
Routing by context: 'region/tenant/product/provider/severity'.
Escalation ladder: on-call engineer → team lead → duty manager → Exec/Legal (for PII/finance).
On-call: rotation by role (SRE, App, Data, Security, Payments); backup contacts (chat/voice/SMS).
Silence windows: night, release, marketing; exceptions for P1.
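The routing and escalation ladder above can be sketched as a table keyed by severity and domain. All route names, roles, and timeouts below are hypothetical, not an actual org chart.

```python
# Sketch: context-based routing with an ordered escalation ladder.
from dataclasses import dataclass, field

@dataclass
class Route:
    first_responder: str
    escalation: list            # ordered ladder, walked on ack timeout
    ack_timeout_min: int        # escalate if not acknowledged in time

ROUTES = {
    ("P1", "payments"): Route("oncall-payments",
                              ["team-lead", "duty-manager", "exec-legal"], 5),
    ("P2", "payments"): Route("oncall-payments", ["team-lead"], 30),
    ("P3", "*"):        Route("team-channel", [], 0),   # no paging
}

def resolve(severity: str, domain: str) -> Route:
    """Exact (severity, domain) match first, then severity wildcard."""
    return ROUTES.get((severity, domain)) or ROUTES[(severity, "*")]

r = resolve("P1", "payments")
print(r.first_responder, "->", " -> ".join(r.escalation))
```

Silence windows would sit in front of `resolve`: a suppression check that drops non-P1 alerts during a configured window, with P1 always passing through.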
5) Noise reduction and correlations
Deduplication: by '(fingerprint, region, tenant, route)' and 'trace_id'.
Storm suppression: temporarily suppress duplicates while a P1 is active.
Correlations: grouping signals around the root cause (release/feature/provider).
Hysteresis: separate entry and exit thresholds to avoid flapping ("sawtooth" alerts).
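Dedup fingerprints and hysteresis can be sketched as follows; the threshold values are illustrative.

```python
# Sketch: dedup key from the (rule, region, tenant, route) tuple,
# plus a hysteresis gate with separate enter/exit thresholds.
import hashlib

def fingerprint(rule: str, region: str, tenant: str, route: str) -> str:
    """Stable dedup key: identical signals collapse to one alert."""
    key = f"{rule}|{region}|{tenant}|{route}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

class Hysteresis:
    """Fire above `enter`, clear only below `exit_` (exit_ < enter),
    so a metric hovering near one threshold does not flap."""
    def __init__(self, enter: float, exit_: float):
        assert exit_ < enter
        self.enter, self.exit_, self.firing = enter, exit_, False

    def update(self, value: float) -> bool:
        if self.firing and value < self.exit_:
            self.firing = False
        elif not self.firing and value >= self.enter:
            self.firing = True
        return self.firing

h = Hysteresis(enter=250.0, exit_=200.0)   # p95 latency in ms
print([h.update(v) for v in (240, 260, 230, 210, 190)])
# → [False, True, True, True, False]
```

With a single 250 ms threshold, the 230 and 210 samples would have cleared and re-fired the alert; the 200 ms exit threshold keeps it firing until the metric has genuinely recovered.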
6) Alert content (template)
Title: concise and substantive - "EU/Checkout: p95 > 250ms (SLO breach)".
Key fields: priority, time, region, tenant, version, trace_id, affected %, probable cause.
What to do now: the first 1-3 steps + a link to the runbook/buttons (Re-route, Rollback, Pause Promo).
Next communication: in N minutes, owner (IC/on-call).
7) Delivery channels
Chat/messenger: the main triage channel (bot cards with buttons).
Pager/voice/SMS: for P1.
Email: reports and non-urgent notifications (P3/Info).
Webhooks: integration with ticketing/orchestrators.
Status page: external notification of customers and partners.
8) Integrations and action buttons
Incident bot: creates a card, assigns an IC, opens a video bridge, starts timers.
Runbook auto-actions ("runes"): Re-route, Rollback, Raise Limit, Flush Cache, Disable Webhooks, Enable Safe Mode.
Rights: rune launches are restricted to roles; all actions are signed and logged.
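Role-gated, signed rune launches can be sketched like this. The role names, action allow-list, and key handling are illustrative; a real system would pull the signing key from a secrets manager, not a constant.

```python
# Sketch: every rune launch is checked against an RBAC allow-list and
# appended to a log with an HMAC signature over the entry.
import hashlib
import hmac
import json
import time

ALLOWED = {"Rollback": {"sre", "team-lead"},
           "Re-route": {"sre", "payments-oncall"}}
SIGNING_KEY = b"demo-key"          # illustrative; use a secrets manager
AUDIT_LOG: list = []

def launch(action: str, actor: str, role: str) -> bool:
    if role not in ALLOWED.get(action, set()):
        return False                       # RBAC denies the launch
    entry = {"action": action, "actor": actor, "role": role,
             "ts": time.time()}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["sig"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    AUDIT_LOG.append(entry)                # append-only audit trail
    return True

print(launch("Rollback", "alice", "sre"))      # True: allowed, logged, signed
print(launch("Rollback", "mallory", "viewer")) # False: denied, not logged
```

Denied attempts returning `False` without a log entry is a simplification; in practice denials are usually logged too, for the audit trail.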
9) Multi-region and multi-tenant
Independent SLOs/thresholds per region; a local incident does not turn the whole world red.
Visibility filters: partners/tenants see only their own.
Jurisdictional requirements: notification texts, languages, time zones.
10) Policies, schedules, silence windows
Alert policy: owners, thresholds, channels, escalations, templates.
Calendars: working/non-working hours, release/marketing windows.
Change freeze: ease thresholds or suppress non-P1 alerts during major promotions.
11) Audit and legal fixation
Receipts: for critical alerts - 'receipt_hash' and a DSSE signature.
WORM logs: immutable storage of events and reactions (who acknowledged, who did what).
Chain-of-custody: tracing escalations and decisions.
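The receipt and chain-of-custody idea can be sketched as a hash-chained log: each receipt_hash covers the previous one, so any tampering breaks the chain. This is a minimal illustration of the WORM property; DSSE signing of each receipt is omitted here.

```python
# Sketch: hash-chained audit receipts for alerts and reactions.
import hashlib
import json

def append_receipt(chain: list, event: dict) -> dict:
    """Append an event; its receipt_hash covers the previous hash."""
    prev = chain[-1]["receipt_hash"] if chain else "genesis"
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    entry = {"event": event,
             "receipt_hash": hashlib.sha256(payload.encode()).hexdigest()}
    chain.append(entry)
    return entry

def verify(chain: list) -> bool:
    """Recompute every hash; any edited event breaks the chain."""
    prev = "genesis"
    for e in chain:
        payload = json.dumps({"prev": prev, "event": e["event"]},
                             sort_keys=True)
        if hashlib.sha256(payload.encode()).hexdigest() != e["receipt_hash"]:
            return False
        prev = e["receipt_hash"]
    return True

log: list = []
append_receipt(log, {"alert": "P1", "ack_by": "alice"})
append_receipt(log, {"action": "Rollback", "by": "alice"})
print(verify(log))                       # True
log[0]["event"]["ack_by"] = "mallory"    # tamper with history
print(verify(log))                       # False: chain broken
```

In production the chain would live in append-only (WORM) storage and each receipt would additionally carry a DSSE signature, so tampering is both detectable and attributable.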
12) Notification System Metrics and SLO
MTTA (time to acknowledge): P1 ≤ 5-10 min; P2 ≤ 30 min.
Page rate/on-call load: signals per shift within the target range.
False positive rate: target ≤ 10-15%.
Correlation efficiency: the proportion of grouped signals ≥ 80%.
Delivery SLO: chat ≥ 99.9%, SMS/voice ≥ 99.5%.
Time-to-Action: p95 time from alert to rune execution.
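A quick sketch of computing MTTA and its p95 against the P1 target above; the sample delays are made up.

```python
# Sketch: MTTA (mean time to acknowledge) and its 95th percentile.
from statistics import quantiles

ack_delays_min = [2, 3, 4, 4, 6, 7, 3, 5]   # alert -> acknowledge, minutes

mtta = sum(ack_delays_min) / len(ack_delays_min)
p95 = quantiles(ack_delays_min, n=100)[94]  # 95th percentile

print(f"MTTA={mtta:.2f} min, p95={p95:.1f} min, "
      f"P1 target (<=10 min) met: {mtta <= 10}")
```

Tracking the percentile alongside the mean matters: a good average can hide a long tail of slow acknowledgements that still breaches the P1 target.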
13) Dashboards and reports
Operational: active incidents, burn-rate, region/tenant map, alert queue.
Alert quality: noise, false positives, threshold revisions, blind spots.
On-call load: paging frequency, response time, "out of hours."
Post-incident: rune effectiveness, cause recurrence.
14) iGaming/fintech specifics
Payments/PSP: P1 - provider outage or rise in authorization failures; auto-route to the backup PSP.
RTP & Limits: alerts on observed RTP drift, limit breaches, suspicious win patterns.
Affiliates/webhooks: delivery lag, growth in duplicates, drop in confirmed receipts.
Price/FX/Tax: storefront↔checkout mismatch, out-of-sync artifact versions.
Responsible gaming: RG triggers and their timely escalation to Support/Compliance.
15) RACI
16) Implementation checklist
- Define the North-Star and SLI/SLO; tie alerts to burn rate.
- Create a policy directory: thresholds, channels, escalations, silence windows.
- Implement dedup, correlation, hysteresis, storm suppression.
- Configure multi-region and multi-tenant visibility rules.
- Connect action buttons and runbooks; restrict launch rights.
- Enable WORM logs/receipts, trace_id propagation, and rune audit.
- Build quality dashboards (noise, FP, MTTA, page rate).
- Run a GameDay: PSP outage, WebhookLag, PriceMismatch, RTP Drift.
- Review thresholds regularly; A/B test thresholds on noisy metrics.
- Report monthly on on-call load and improvements.
17) Playbooks (reference)
PSP Outage (P1): auto-route to the backup PSP, lower client timeouts, quarantine "gray" transactions, status updates every 15 minutes.
WebhookLag (P2): scale up workers/batches, prioritize queues, temporarily pause optional endpoints.
PriceMismatch (P1/P2): force-disable the cache, reconcile 'fx_version/tax_rule_version', roll back the artifact, issue compensations.
RTP Drift (P2): pause bonuses/promos, audit profiles, extend the monitoring window.
Security: SoD/MFA failure (P1/P2): block the operation, JIT recheck, forensics and Legal if necessary.
18) FAQ
How to reduce false positives?
SLO-oriented rules, correlations, hysteresis, training windows, and regular threshold revisions.
What is more important - coverage or accuracy?
For P1 - accuracy and speed (better fewer alerts, but critical ones). For P3 - coverage of trends and costs.
Do I need phone paging?
Yes, for P1; chat may be unavailable or muted.
How not to burn out the on-call team?
Page rate limits, load redistribution, follow-the-sun, monthly noise reviews.
Summary: the notification and alert system is a controlled pipeline from signal to action. Build it on SLOs, dampen noise, route by context, provide action buttons, and record everything for audit and legal purposes. This reduces MTTA, takes load off on-call, and increases business resilience even under sharp spikes and provider failures.