
Alarm and notification system

1) Role and goals

An alerting system is not just "sending messages" but a decision-making loop: it surfaces deviations in time, suggests actions, and balances timeliness against silence.

Objectives:
  • Reduce MTTD/MTTR through prioritization and clear playbooks.
  • Reduce alert fatigue through noise cancellation.
  • Offer actions directly from the notification (ack, snooze, runbook, auto-action).
  • Respect privacy and consent (opt-in/opt-out, log retention).

2) Taxonomy of events and levels

2.1 Event types

Metrics/anomalies (SRE, product, finance).
Business rules (limits, fraud, KYC, payments).
System (deploy, degradation, licenses).
User (behavioral triggers, RG/responsible gaming).

2.2 Severity levels

Critical - immediate response, risk of loss/safety.
High - significant deterioration of KPI/SLO.
Medium - action required during business hours.
Low/Info - observation/context, automatically rolled up into digests.

2.3 Priority

An 'Impact × Urgency' matrix maps each signal to P1..P4. Priority is then bound to delivery channels and response SLAs.
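A minimal sketch of such a matrix lookup. The three-level scales, the cell values, and the channel/SLA bindings are illustrative assumptions, not values given in this document.

```python
# Hypothetical Impact x Urgency matrix; cell values are illustrative only.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("medium", "high"): "P2",
    ("high", "low"): "P3",
    ("medium", "medium"): "P3",
    ("low", "high"): "P3",
    ("medium", "low"): "P4",
    ("low", "medium"): "P4",
    ("low", "low"): "P4",
}

# Assumed binding of each priority to a channel and an ack SLA (minutes).
PRIORITY_ROUTING = {
    "P1": {"channel": "pager", "ack_sla_min": 10},
    "P2": {"channel": "chat", "ack_sla_min": 30},
    "P3": {"channel": "inapp", "ack_sla_min": 240},
    "P4": {"channel": "digest", "ack_sla_min": None},
}


def prioritize(impact: str, urgency: str) -> dict:
    """Return the priority plus its channel and SLA for an impact/urgency pair."""
    priority = PRIORITY_MATRIX[(impact, urgency)]
    return {"priority": priority, **PRIORITY_ROUTING[priority]}
```

Keeping the matrix as data rather than branching logic makes it easy to review and to tune per domain.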

3) Architecture and flows

Signal producers → Event bus → Normalization (enrich, dedup) → Correlation → Rules (policy engine) → Routing → Delivery channels → Preference center → Logs/analytics.

Key components:
  • Enricher: adds tenant, role, region, playbook links.
  • Deduper: groups recurring events by key.
  • Correlator: merges related signals into an incident.
  • Policy Engine: YAML/DSL rules, quiet hours, escalations.
  • Delivery: in-app, email, push, SMS, webhook, chat integration.

4) Rules and policies (YAML example)

```yaml
policies:
  - id: p_sre_critical
    match: { domain: "infra", severity: "critical" }
    route:
      primary: { channel: "pager", targets: ["oncall_sre"] }
      fallback: { channel: "sms", delay: "2m" }
    suppress:
      flapping: { window: "10m", threshold: 5 }  # suppress frequent flapping
      duplicates: { key: ["service", "cluster", "error_code"], ttl: "15m" }
    escalate:
      after: "10m"
      to: ["sre_manager"]
    auto_assign: true
  - id: p_product_medium
    match: { domain: "product", severity: "medium", kpi: "conversion" }
    route:
      primary: { channel: "inapp", audience: "product_owners" }
    digest:
      window: "1h"
      max_items: 10
quiet_hours:
  tz: "Europe/Kyiv"
  ranges: ["22:00-07:00"]  # during quiet hours: only digests and the P1 pager
```
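How a policy engine selects a rule is not specified above. One plausible reading is "first policy whose `match` fields all equal the event's fields". The sketch below assumes that semantics and mirrors the two policies as plain Python dicts; `find_policy` is a hypothetical helper name.

```python
from typing import Optional

# The two policies from the YAML example, reduced to the fields used for matching.
POLICIES = [
    {"id": "p_sre_critical",
     "match": {"domain": "infra", "severity": "critical"}},
    {"id": "p_product_medium",
     "match": {"domain": "product", "severity": "medium", "kpi": "conversion"}},
]


def find_policy(event: dict, policies: list) -> Optional[dict]:
    """Return the first policy whose match fields all equal the event's fields.

    Assumption: first match wins and extra event fields are ignored.
    """
    for policy in policies:
        if all(event.get(field) == value for field, value in policy["match"].items()):
            return policy
    return None


event = {"domain": "infra", "severity": "critical", "service": "gateway"}
matched = find_policy(event, POLICIES)
print(matched["id"] if matched else "no policy matched")
```

With first-match semantics the order of policies matters, so more specific rules should come before broader ones.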

5) Deduplication, correlation, suppression of flapping

Dedup: group by 'dedup_key = hash(service|metric|dim)'; TTL ≥ flapping window.
Correlation: combine related signals by topology (service → dependency), time (± N min), and context (release, incident).
Flapping: a threshold of "N events per M minutes" produces a single "flapping detected" signal with a suggestion to raise hysteresis or suppress.
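The dedup and flapping rules above can be sketched as follows. This is a minimal in-memory illustration: the class names, the "|" separator, the truncated SHA-256 key, and the fixed (non-sliding) TTL are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict, deque


def dedup_key(service: str, metric: str, dim: str) -> str:
    """Stable key for grouping repeats of the same signal (assumed '|' separator)."""
    raw = "|".join((service, metric, dim))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class Deduper:
    """Suppress repeats of a key for ttl_sec after the first delivered event."""

    def __init__(self, ttl_sec: float):
        self.ttl_sec = ttl_sec
        self._first_seen = {}

    def is_duplicate(self, key: str, now: float) -> bool:
        first = self._first_seen.get(key)
        if first is not None and now - first < self.ttl_sec:
            return True  # still inside the TTL window: suppress
        self._first_seen[key] = now  # new or expired key: start a new window
        return False


class FlappingDetector:
    """Report flapping when >= threshold events for a key fall within window_sec."""

    def __init__(self, threshold: int, window_sec: float):
        self.threshold = threshold
        self.window_sec = window_sec
        self._events = defaultdict(deque)

    def record(self, key: str, now: float) -> bool:
        events = self._events[key]
        events.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while events and now - events[0] > self.window_sec:
            events.popleft()
        return len(events) >= self.threshold
```

Timestamps are passed in explicitly so the logic can be tested without a real clock. A production version would also need eviction of stale keys and shared state across workers.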

6) Routing and RACI

Responsible: who gets the first notification/page.
Accountable: who receives the escalation once the SLA is exceeded.
Consulted: who to mention in the thread/chat channel.
Informed: who receives the digest/summary.
Assign by role and context (tenant, region, product stream).

7) Delivery channels and nuances

| Channel | When to use | Features/Limitations |
|---|---|---|
| In-app | Operational but non-critical; actions | Rich UI, CTA, context |
| Email | Digests, reports, non-critical | May be lost/filtered |
| Push | Mobile on-call team | Length limit, quiet hours |
| SMS/Pager | P1/P0 critical alerts | Paid, terse, no attachments |
| Webhook | Integrations (Jira, Slack, Ops) | HMAC signatures, retries, idempotency |
| Chat (Slack) | Incident thread, collaboration | Text commands (ack, assign) |

Retries: on 5xx/429/timeout, back off with jitter and respect 'Retry-After'. Idempotency: send 'X-Notification-Id' on webhooks.
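A sketch of the retry and idempotency rule, assuming "full jitter" exponential backoff and a 'Retry-After' value given in seconds (the header may also carry an HTTP date, which is not handled here). The `send` callable, `deliver`, and `backoff_delay` are hypothetical names. `send` stands in for whatever HTTP client is used and returns `(status, retry_after)`.

```python
import random
import time


def is_retryable(status: int) -> bool:
    """Retry on 429 and any 5xx response."""
    return status == 429 or 500 <= status < 600


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Delay before the next try: Retry-After if given, else full-jitter backoff."""
    if retry_after is not None:
        return float(retry_after)  # assumption: value is in seconds
    return rng() * min(cap, base * 2 ** attempt)


def deliver(send, notification_id, payload, max_attempts=5, sleep=time.sleep):
    """Call send() until a non-retryable status; return it, or None if exhausted."""
    # The same id is sent on every attempt so the receiver can drop duplicates.
    headers = {"X-Notification-Id": notification_id}
    for attempt in range(max_attempts):
        try:
            status, retry_after = send(payload, headers)
        except TimeoutError:
            status, retry_after = None, None  # treat a timeout as retryable
        if status is not None and not is_retryable(status):
            return status
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, retry_after))
    return None
```

Injecting `sleep` and `rng` keeps the function testable without waiting and without randomness.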

8) Preferences Center

Opt-in/Opt-out by event type, level, channel.
Quiet hours, manual snooze for 15/30/60 min.
Threshold/sensitivity (e.g. ≥ 3 σ anomaly).
Language/locale, time/currency format.
Role binding: presets for SRE/Product/Finance.
Transparency: show why the user received the signal (link to the rule).
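Quiet hours usually span midnight (for example 22:00-07:00), which is easy to get wrong. The sketch below assumes the caller has already converted "now" to the user's local time, that snooze applies to every priority, and that P1 bypasses quiet hours. The last two are assumptions, not rules stated in this document.

```python
from datetime import datetime, time, timedelta


def in_quiet_hours(local_time: time, start: time, end: time) -> bool:
    """True if local_time falls in [start, end), including ranges past midnight."""
    if start <= end:
        return start <= local_time < end
    return local_time >= start or local_time < end  # range wraps past midnight


def should_deliver(priority, local_now, quiet_start, quiet_end, snoozed_until=None):
    """Decide whether to deliver now or hold the signal for a digest."""
    if snoozed_until is not None and local_now < snoozed_until:
        return False  # manual snooze is still active
    if priority == "P1":
        return True  # assumption: P1 is delivered even during quiet hours
    return not in_quiet_hours(local_now.time(), quiet_start, quiet_end)


quiet_start, quiet_end = time(22, 0), time(7, 0)
late_evening = datetime(2025, 11, 3, 23, 0)
print(should_deliver("P3", late_evening, quiet_start, quiet_end))
print(should_deliver("P1", late_evening, quiet_start, quiet_end))
```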

9) Content design: message structure

Pattern for critical signal (P1):
  • Title: brief, with the trigger: "[P1] [PSP_TR] Sharp rise in 3DS failures (+12%)."
  • Context: period, affected segments/region, data source.
  • Reason/hypothesis: "Associated with the release of PSP_X 18:20 UTC."
  • SLA/deadline: "Escalation in 10 min."
  • CTA: "Open playbook," "Enable fallback PSP_Y," "Ack (30 min)."
  • Links: graph, incident thread, metrics, runbook.
  • Metadata: 'trace_id', 'incident_id', 'dedup_key'.

Tone: factual, no dramatization; give numbers with units and avoid unexplained abbreviations.
Localization: variables become placeholders, translations are stored in resource files, and numbers/dates are formatted by locale.

10) Actions from notifications (Actionable)

Ack/Snooze with time parameters.
Assign/Invite to the incident thread.
Runbook: open the resolution steps with context pre-filled.
One-click remediation (where safe): switch route, raise limit, restart job (with confirmation and audit).
Create ticket (Jira/GitHub) with pre-filled fields.

11) Signal quality: metrics and targets

Precision ≥ 80% for P1/P2.
Recall (the proportion of detected incidents among all incidents) ≥ 70%.
Noise: average signals/hour per user (target ceiling).
Ack-time p50/p95, Escalation rate, Snooze rate (as a noise indicator).
MTTD/MTTA/MTTR (broken down by domain and channel).
Silenced-but-should-alert (gaps due to rules) is a separate dashboard.
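For reference, precision and recall can be computed from three counts: alerts that matched a real incident (true positives), alerts that did not (false positives), and incidents that raised no alert (false negatives). A small sketch follows. The function name and the choice to return `None` when a ratio is undefined are assumptions.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall); a value is None when its denominator is zero."""
    alerts = true_positives + false_positives        # everything that fired
    incidents = true_positives + false_negatives     # everything that should have fired
    precision = true_positives / alerts if alerts else None
    recall = true_positives / incidents if incidents else None
    return precision, recall


# Example: 40 useful alerts, 10 false alarms, 10 missed incidents.
precision, recall = precision_recall(40, 10, 10)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example both values are 0.80, which would meet the targets listed above.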

12) Noise control: techniques

Hysteresis and sliding windows for thresholds.
Smoothing (EWMA) before detection.
Aggregation: one batch/digest with top contributors instead of 30 small notifications.
Contextual limits: at most N notifications per hour per channel per user.
Auto-feedback: if the user clicks Snooze 3 times in a row, suggest raising the threshold or changing the channel.
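Two of these techniques, EWMA smoothing and hysteresis, can be sketched in a few lines. The smoothing factor and the raise/clear thresholds below are illustrative assumptions. In practice they are tuned per metric.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average; the first value seeds the series."""
    smoothed, current = [], None
    for value in values:
        current = value if current is None else alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


def hysteresis_alerts(values, raise_at, clear_at):
    """Return (index, state) transitions: fire at >= raise_at, resolve at <= clear_at.

    Keeping clear_at below raise_at stops a metric hovering near one threshold
    from toggling the alert on every sample.
    """
    firing = False
    transitions = []
    for index, value in enumerate(values):
        if not firing and value >= raise_at:
            firing = True
            transitions.append((index, "firing"))
        elif firing and value <= clear_at:
            firing = False
            transitions.append((index, "resolved"))
    return transitions


failure_rate = [5.0, 8.0, 7.5, 8.2, 6.5, 5.5, 8.1]
print(hysteresis_alerts(failure_rate, raise_at=8.0, clear_at=6.0))
# → [(1, 'firing'), (5, 'resolved'), (6, 'firing')]
```

With a single threshold at 8.0 the same series would flip state four times in the first five samples. Feeding `ewma(failure_rate)` into the detector instead of the raw series reduces sensitivity to single-sample spikes further.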

13) Security, privacy, compliance

HMAC signature for webhooks, rotation of secrets, 'X-Key-Id'.
RBAC/ABAC: signal visibility by role/tenant.
PII minimization, masks in logs, auditing actions (ack/assign/runbook).
Consent and the reason for the notification (rule/policy) are included in the payload.
Retention/TTL for notification logs, Legal Hold for incidents.
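A sketch of HMAC signing with key rotation via 'X-Key-Id', using Python's standard `hmac` module. The hex encoding behind the `sha256=` prefix and the in-memory key map are assumptions. The actual signature encoding is not specified in this document.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Return an HMAC-SHA256 signature of the raw request body."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"  # assumed format: prefix plus hex digest


def verify(secrets_by_key_id: dict, key_id: str, body: bytes, signature: str) -> bool:
    """Check a signature against the secret selected by X-Key-Id."""
    secret = secrets_by_key_id.get(key_id)
    if secret is None:
        return False  # unknown or retired key id
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(sign(secret, body), signature)


# During rotation both the old and the new secret stay valid for a while.
secrets = {"k1": b"old-secret", "k2": b"new-secret"}
body = b'{"notification_id": "ntf_91ab"}'
signature = sign(secrets["k2"], body)
print(verify(secrets, "k2", body, signature))
```

Signing the raw bytes, rather than a re-serialized object, avoids mismatches caused by key ordering or whitespace.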

14) Schemas and payloads

Event (internal)

```json
{
  "id": "sig_01HX",
  "domain": "payments",
  "severity": "high",
  "priority": "P2",
  "title": "3DS failure rate rose to 8.2% (+3.1 pp)",
  "occurred_at": "2025-11-03T17:55:00Z",
  "context": { "psp": "PSP_X", "country": "TR", "release_id": "rel_241103_1820" },
  "metrics": { "baseline": 5.1, "current": 8.2, "delta_pp": 3.1 },
  "dedup_key": "payments|PSP_X|TR|3DS_FAILURE",
  "runbook": "rbk_psp_3ds_spike",
  "slo": { "ack_deadline_sec": 600 }
}
```

Notification (channel-agnostic)

```json
{
  "notification_id": "ntf_91ab",
  "signal_id": "sig_01HX",
  "targets": ["oncall_payments"],
  "channels": ["inapp", "slack", "webhook"],
  "cta": [
    { "id": "ack", "label": "Confirm (30 min)", "payload": { "ttl": "30m" } },
    { "id": "runbook", "label": "Open playbook", "payload": { "id": "rbk_psp_3ds_spike" } },
    { "id": "fallback", "label": "Enable fallback PSP_Y", "confirm": true }
  ],
  "hmac": "sha256=AbCd..."
}
```

15) UX patterns in the product

Inbox: Critical/High/Other tabs with count badges.
Incident feed: correlated signals, timeline of actions, "what was done."
Filters: role, domain, region, time, "unacknowledged only."
Quick actions in the list (ack/snooze/assign).
Explain: "why you see it" (rule, thresholds, data).
Digests: morning/evening, localized by TZ.

16) Test plan

Unit: dedup keys, hysteresis, flapping, serialization of payloads.
Integration: routing, quiet hours, escalations, channel retries.
E2E: P1 scenario from anomaly to ticket closure; P2 during quiet hours → digest.
Chaos: connection loss (SMTP/SMS), delays, signal avalanche, clock skew.
A11y/i18n: screen-readers, keyboard ack/snooze, localization of numbers/dates.

17) Quality dashboards

Precision/Recall by domain.
Ack time p50/p95 and the share of signals acknowledged on time.
Noise per user/hour and top noise rules.

Escalation rate and "false escalations."

Suppressed vs Delivered (how much is suppressed/digested).
User feedback: message ratings, comments on noise.

18) Checklists

Design

  • Event taxonomy and levels are consistent
  • Quiet hours/escalation policies are described
  • Dedup/Correlation/Flapping configured
  • Channels, retries, webhook idempotency
  • Preference Center (opt-in/out, snooze)
  • Content templates and localization
  • Playbooks and one-click actions (audited)
  • Quality metrics and dashboards

Operation

  • Quarterly threshold tuning
  • A/B testing of rules (thresholds, windows, digests)
  • Regular "top noise" and CAPA reviews
  • Channel secret rotation (HMAC, SMTP, SMS)
  • Scheduled game-day tests

19) Implementation plan (3 iterations)

Iteration 1 - Baseline (2-3 weeks)

Taxonomy, severity/priority, preference center (in-app + email).
Dedup, simple key/time correlation, quiet hours.
Message templates, playbooks, ack/snooze/assign.

Iteration 2 - Reliability and Noise Reduction (3-4 weeks)

Flapping/hysteresis, digests, chat integrations, and webhooks (HMAC, retries).
Escalation according to SLA, quality dashboards (precision/recall, noise).
One-click remediation (with confirmation and audit).

Iteration 3 - Optimization and Scale (Continuous)

Correlation by topology/releases, auto-suggestions of thresholds.

A/B testing of rules, forecasts of "when the threshold will trigger."

Noise reviews and regular game days.

20) Mini-FAQ

How to deal with alert fatigue?
Dedup, correlation, hysteresis, digests, and a preference center, plus regular noise reviews and A/B tests of thresholds.

Is ML needed for anomalies?
Useful, but start with deterministic rules and explainable thresholds. Treat ML as an add-on, always paired with Explain.

Why do users get "extra" emails?
Check rule matches, quiet hours, and the "why delivered" audit; set per-channel/hour limits and use digests.

Summary

A strong signal system combines smart filtering and correct prioritization with one-click actions. Formalize the taxonomy and policies, implement dedup/correlation/hysteresis, give users control (preferences, snooze), and ensure reliable delivery and transparency about "why I got it." Then signals become a control tool rather than a source of noise.
