
Real-time alerts

1) Purpose and principles

Purpose: to notify the right people and systems promptly, accurately, and selectively about events that threaten SLOs, revenue, or compliance, and to trigger the correct response (manual or automatic).

Principles: SLO-first, minimal noise, explainability, context, prioritization by business impact, and "one signal, one clear action."


2) Signal taxonomy

SLO signals: burn rate of the error budget for critical paths (login, deposit, bet, withdrawal).
KRI: early risk indicators (PSP auth-success drop by bank/GEO, consumer-lag growth, rising p99).
Events: dependency flags, failovers, manual switches, activation of protections (rate limiting, WAF).
Security/Compliance: spikes in sensitive operations, PII exports, SoD violations.


3) Alert levels and SLAs

Level | Example | Channel | Reaction | First-response SLA
P1 | Deposits/bets unavailable in a region; PII leak | Pager (call/push), on-call war room | Immediate auto-actions + on-call | ≤ 5 min
P2 | Severe p99 degradation; PSP problem at some banks | Pager/priority chat | Intervention within the window | ≤ 15 min
P3 | Local degradation; workaround available | Chat/ticket | Scheduled remediation | ≤ 60 min
P4 | Notifications/trends | Ticket/email | Analysis/plan | As scheduled

4) Sources and context correlation

Telemetry: metrics/traces/logs, synthetics and RUM.
Directories: CMDB/service map, owners, dependencies.
Changes: releases, feature flags, migrations, planned work.
External providers: PSP/KYC/game studios/CDN/WAF statuses.
Each alert is enriched with context: what changed nearby (release/feature flag)? Which dependencies are red? Which segment is affected (GEO/PSP/bank/tenant)?


5) SLO alert rules (core)

Burn rate: two windows (fast 1h and slow 6-24h). Page only when both windows exceed their thresholds at the same time.
Guardrails: p99/error-rate thresholds serve only as triggers for context analysis; they do not replace SLOs.
Impact: the score "share of audience × money per minute × regulatory exposure" maps to a level P1-P4.
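
The two-window burn-rate rule can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Window` type, the function names, and the default thresholds (14.4 for the fast window, 6.0 for the slow one) are assumptions chosen for the example and are not taken from this document.

```python
from dataclasses import dataclass


@dataclass
class Window:
    errors: int  # failed requests observed in the window
    total: int   # all requests observed in the window


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if window.total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (window.errors / window.total) / budget


def should_page(fast: Window, slow: Window, slo_target: float,
                fast_threshold: float = 14.4,   # assumed value
                slow_threshold: float = 6.0) -> bool:  # assumed value
    """Page only when BOTH the fast and the slow window exceed their thresholds."""
    return (burn_rate(fast, slo_target) > fast_threshold
            and burn_rate(slow, slo_target) > slow_threshold)
```

With a 99.9% SLO, a fast window at 2% errors and a slow window at 0.8% errors both burn the budget well above the thresholds, so the pager fires; if only the fast window is elevated, the spike stays a context signal.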


6) Noise reduction

Deduplication: group by service/tenant/cause; open one incident instead of dozens of signals.
Hysteresis: N-of-M confirmations, minimum anomaly duration.
Silences/Mutes: planned work, known incidents, "follow-the-sun" windows.
Rate limits and quotas: per source/label/tenant; protection against alert storms.
Cardinality reduction: userId/sessionId are prohibited in alert labels.
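
Deduplication and N-of-M hysteresis from the list above might look like this. It is a sketch under assumptions: the class `NOfM`, the helpers `dedup_key` and `group_signals`, and the label names `service`, `tenant`, `cause` are hypothetical and only mirror the grouping keys named in the text.

```python
from collections import defaultdict, deque


class NOfM:
    """Fires only when at least n of the last m samples breached the threshold."""

    def __init__(self, n: int, m: int) -> None:
        self.n = n
        self.samples = deque(maxlen=m)  # keeps only the last m observations

    def observe(self, breached: bool) -> bool:
        self.samples.append(breached)
        return sum(self.samples) >= self.n


def dedup_key(labels: dict) -> tuple:
    """Grouping key: signals with the same service/tenant/cause share one incident."""
    return (labels.get("service"), labels.get("tenant"), labels.get("cause"))


def group_signals(signals: list) -> dict:
    """Collapse raw signals into incidents keyed by dedup_key."""
    incidents = defaultdict(list)
    for signal in signals:
        incidents[dedup_key(signal)].append(signal)
    return dict(incidents)
```

A gate such as `NOfM(3, 5)` ignores a single noisy sample and clears again once breaches fall out of the window, which gives the hysteresis effect without a separate "resolve" threshold.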


7) Routing and Escalation

Routing by context: domain (Payments/Games/Core), environment (prod/stage), region, severity.
Escalation: t0 - on-call L1; t0 + X - L2/domain owner; t0 + Y - IC/management. X and Y depend on the level (P1-P3).
Channel duplication: pager + chat for P1; chat/ticket for P3.
Shift change: automatic handover of context (timeline, actions taken, hypotheses).
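
The t0 / t0 + X / t0 + Y escalation chain can be expressed as a small lookup. The concrete X and Y delays below are hypothetical placeholders (the text only says they depend on the level), and `escalation_targets` is an illustrative name.

```python
from datetime import timedelta

# Hypothetical (X, Y) delays per level; tune to your own first-response SLAs.
ESCALATION_DELAYS = {
    "P1": (timedelta(minutes=5), timedelta(minutes=15)),
    "P2": (timedelta(minutes=15), timedelta(minutes=45)),
    "P3": (timedelta(minutes=60), timedelta(hours=4)),
}


def escalation_targets(level: str, elapsed: timedelta,
                       acknowledged: bool = False) -> list:
    """Return everyone who should have been notified `elapsed` after t0."""
    x, y = ESCALATION_DELAYS[level]
    targets = ["on-call L1"]  # always notified at t0
    if acknowledged:
        return targets        # an acknowledged alert stops escalating
    if elapsed >= x:
        targets.append("L2/domain owner")
    if elapsed >= y:
        targets.append("IC/management")
    return targets
```

For example, an unacknowledged P1 six minutes after t0 has reached on-call L1 and the L2/domain owner, but not yet the IC.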


8) Auto-remediation

Payments: PSP switching by health × fee × conversion, restricting banks/methods, retries with jitter.
Games/bets: enable a cache layer and limit write operations, show a queue page/waiting room on the front end.
Infra: traffic evacuation, restart of degrading workers, lag-based scaling.
Security/Compliance: temporarily block PII export, require dual control for P1 operations.
Every auto-action comes with a rollback policy and revert criteria.
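
One way to read "PSP switching by health × fee × conversion" is as a score used to pick the next provider. The formula below, health × conversion × (1 − fee), and the `min_health` cutoff are assumptions made for illustration; the document does not define the exact weighting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Psp:
    name: str
    health: float      # 0..1, e.g. from synthetic checks
    conversion: float  # 0..1, recent auth-success rate
    fee: float         # fraction of the amount charged by the provider


def score(psp: Psp) -> float:
    """Assumed scoring: healthier, better-converting, cheaper providers rank higher."""
    return psp.health * psp.conversion * (1.0 - psp.fee)


def pick_psp(candidates: list, min_health: float = 0.5) -> Optional[Psp]:
    """Choose the best-scoring PSP among those above the health cutoff."""
    eligible = [p for p in candidates if p.health >= min_health]
    if not eligible:
        return None  # nothing safe to switch to; leave the decision to on-call
    return max(eligible, key=score)
```

Returning `None` instead of forcing a switch keeps the auto-action safe: when no provider passes the cutoff, the case goes to a human rather than to the least-bad option.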


9) Runbook-first experience

Each alert is associated with a runbook: goal, quick diagnostics (3-5 checks), fix/rollback steps, contact persons, links to dashboards and status page. In the chat/pager we show a short action card.


10) On-call policy

24×7 rotation with domain coverage (Payments/Game Core/SRE).
"Second on-call" for P1, a two-person rule in the war room.
Quiet-hours and follow-the-sun windows.
Training: quarterly exercises (tabletop/game-day), shadow shifts.
Post-incident credits (comp-time) to avoid burnout.


11) Integrations

Incident management: automatic creation of incident cards, update feeds, IC/CL roles, timers.
Status page: publishing P1/P2 (via Comms Lead) with templates and localization.
Releases: release-gates by SLI, auto-stop/rollback by alert.
Directories: owners, CMDB, provider contacts.


12) Alert examples (iGaming)

1. Auth-success at PSP-1 (TR) drops by 25% in 10 min

P2 → P1 when more than 30% of transactions are affected.
Auto-action: redistribute traffic to PSP-2/3; enable simplified 3DS; alert the partner manager.

2. p99 of "bet → settle" > 3× the norm in the EU

Causes: replication lag, worker queue backlog.
Auto-action: scale out workers, warm up the cache, temporarily turn off non-critical features.

3. Spike in PII exports

P1 if there is no ticket/approval.
Auto-action: block downloads, notify Compliance, run an SoD check.


13) Alerting Quality Metrics (KPI/KRI)

MTTA-Comms/MTTA-Ops: time to first response/first action.
Precision/Recall (alert ↔ incident), False Alarm Rate.
Lead time before an SLO violation, TTD (time to detect).
Pager fatigue: alerts/person/week, night pages, percentage of "empty" (false) pages.
Auto-fix rate: the share of problems resolved by auto-actions without human involvement.
Aging: the share of P3/P4 alerts open for more than X days.
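
Precision, recall, and false alarm rate for the alert ↔ incident mapping can be computed from a simple log. The input shape is an assumption: each alert carries the ID of the incident it matched, or `None` if it matched nothing, and the false alarm rate is taken here as the share of alerts with no matching incident.

```python
def alert_quality(alert_matches: list, incident_ids: set) -> dict:
    """alert_matches: one entry per alert, an incident ID or None (false alarm).
    incident_ids: all real incidents in the same period."""
    total_alerts = len(alert_matches)
    true_alerts = sum(1 for m in alert_matches if m is not None)
    false_alerts = total_alerts - true_alerts
    # An incident counts as detected if at least one alert pointed at it.
    detected = {m for m in alert_matches if m is not None} & incident_ids
    return {
        "precision": true_alerts / total_alerts if total_alerts else 0.0,
        "recall": len(detected) / len(incident_ids) if incident_ids else 0.0,
        "false_alarm_rate": false_alerts / total_alerts if total_alerts else 0.0,
    }
```

Low precision points at noisy rules (pager fatigue), while low recall points at incidents that users or support found before the alerting did.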


14) Cost management

Quotas for alerts/sources, trimming redundant labels.
Downsampling and metric aggregation; trace sampling by class.
Regular cost-review: $/alert, $/SLI-dashboard, "heavy" series.


15) Privacy and compliance

No PII in alert text or labels; identifiers are tokenized.
Access policies (RBAC/ABAC), SoD for alert configuration.
Audit of rule changes, versioning, tests, and diffs.
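
Tokenization of identifiers can be done with a keyed hash, so that an alert can reference an account without exposing the raw ID. This is a sketch under assumptions: the function name `tokenize`, the truncation length, and the use of HMAC-SHA256 are illustrative choices, and the secret would come from a secrets manager rather than from code.

```python
import hashlib
import hmac


def tokenize(identifier: str, secret: bytes, length: int = 16) -> str:
    """Deterministic, non-reversible token for an identifier (keyed SHA-256)."""
    digest = hmac.new(secret, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]  # truncated for readability in alert text
```

Tokens are still high-cardinality values, so they belong in the alert body or incident card rather than in labels.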


16) Implementation Roadmap (6-10 weeks)

Weeks 1-2: SLI/KRI catalog, owner map, P1-P4 levels, first SLO rules (burn rate).
Weeks 3-4: dedup/hysteresis/silences, integration with the incident system and chats, runbook links.
Weeks 5-6: auto-actions for Payments/Queues, release gates, status-page feed.
Weeks 7-8: context (releases/feature flags/providers), PSP × bank × GEO heat maps, P1/P2 exercises.
Weeks 9-10: FinOps for alerting, KPI dashboards, revision of thresholds and quotas, on-call training.


17) Artifacts and patterns

Alert Spec: metric/condition, windows, suppression, owner, runbook, auto-actions.
Routing Map: domain → channel → escalation, backup contacts.
Silence Policy: mute rules (planned work/known incidents), who may enable them.
On-call Handbook: rotations, shift changes, P1/P2 checklists, channels.
Post-Incident Pack: alert exports/timelines, signal quality analysis.
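
An Alert Spec can be kept as a small typed record with a lint step that rejects specs without an owner or runbook. The field names follow the list above; the `AlertSpec` class, the `problems` method, and the example values are hypothetical.

```python
from dataclasses import dataclass, field

FORBIDDEN_LABELS = {"userId", "sessionId"}  # high-cardinality / PII labels
LEVELS = {"P1", "P2", "P3", "P4"}


@dataclass
class AlertSpec:
    name: str
    condition: str   # metric/condition expression, opaque to this sketch
    windows: tuple   # e.g. ("1h", "6h")
    level: str
    owner: str = ""
    runbook: str = ""
    suppression: list = field(default_factory=list)
    auto_actions: list = field(default_factory=list)
    labels: dict = field(default_factory=dict)

    def problems(self) -> list:
        """Return lint findings; an empty list means the spec may be merged."""
        found = []
        if self.level not in LEVELS:
            found.append("unknown level: " + self.level)
        if not self.owner:
            found.append("missing owner")
        if not self.runbook:
            found.append("missing runbook")
        bad = FORBIDDEN_LABELS.intersection(self.labels)
        if bad:
            found.append("forbidden labels: " + ", ".join(sorted(bad)))
        return found
```

Running this check in CI on every rule change would cover two of the antipatterns at once: alerts without a runbook or owner, and labels that carry userId.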


18) Antipatterns

Paging on "raw" p95/p99 without an SLO → noise and fatigue.
Dozens of signals about the same thing (no deduplication/correlation).
An alert without a runbook or owner.
Thresholds "set in stone" without seasonality/segmentation (GEO/PSP/bank/hour).
No revert after auto-actions (no rollback criteria).
Labels with PII and userId → risks and a cardinality explosion.


Result

Truly useful alerting is an SLO-centric pipeline: context-aware rules based on burn rate, smart noise reduction, clear routing and escalation, a runbook-first experience, and safe auto-actions. Such a pipeline catches critical events before users do, reduces MTTR, protects revenue, and spares the on-call team the "pager hell" routine.
