Alarm and notification system

1) Role and goals

A signal system is not just "sending messages"; it is a decision-making loop: it surfaces deviations in time, suggests actions, and balances timeliness against silence.

Objectives:

Reduce MTTD/MTTR through prioritization and clear playbooks.
Reduce alert fatigue through noise cancellation.
Offer actions directly from the notification (ack, snooze, runbook, auto-action).
Respect privacy and consent (opt-in/opt-out, log retention).

2) Taxonomy of events and levels

2.1 Event types

Metrics/anomalies (SRE, product, finance).
Business rules (limits, fraud, KYC, payments).
System (deploy, degradation, licenses).
User (behavioral triggers, RG/responsible gaming).

2.2 Severity levels

Critical - immediate response, risk of loss/safety.
High - significant deterioration of KPI/SLO.
Medium - action required during business hours.
Low/Info - observation/context, automatically rolled up into digests.

2.3 Priority

'Impact × Urgency' matrix → P1..P4, mapped to channels and response SLAs.
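The text does not say how the matrix is filled in, so the following is only a sketch under assumed values: a three-level scale for both impact and urgency, and illustrative cut-offs on their product. The names `Level` and `priority` are hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map an Impact x Urgency pair to P1..P4 (cut-offs are illustrative)."""
    score = impact * urgency  # possible values: 1, 2, 3, 4, 6, 9
    if score >= 9:
        return "P1"
    if score >= 6:
        return "P2"
    if score >= 3:
        return "P3"
    return "P4"
```

A real deployment would likely keep the matrix in configuration, so that domain owners can adjust individual cells rather than a single formula.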

3) Architecture and flows

Signal producers → Event bus → Normalization (enrich, dedup) → Correlation → Rules (policy engine) → Routing → Delivery channels → Preference center → Logs/analytics.

Key components:

Enricher: adds tenant, role, region, playbook links.
Deduper: groups recurring events by key.
Correlator: merges related signals into an incident.
Policy Engine: YAML/DSL rules, quiet hours, escalations.
Delivery: in-app, email, push, SMS, webhook, chat integration.

4) Rules and policies (YAML example)

yaml
policies:
  - id: p_sre_critical
    match: { domain: "infra", severity: "critical" }
    route:
      primary: { channel: "pager", targets: ["oncall_sre"] }
      fallback: { channel: "sms", delay: "2m" }
    suppress:
      flapping: { window: "10m", threshold: 5 }  # suppress frequent flapping
      duplicates: { key: ["service", "cluster", "error_code"], ttl: "15m" }
    escalate:
      after: "10m"
      to: ["sre_manager"]
      auto_assign: true
  - id: p_product_medium
    match: { domain: "product", severity: "medium", kpi: "conversion" }
    route:
      primary: { channel: "inapp", audience: "product_owners" }
    digest:
      window: "1h"
      max_items: 10
    quiet_hours:
      tz: "Europe/Kyiv"
      ranges: ["22:00-07:00"]  # during this window: digests, pager for P1 only
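A quiet-hours range such as 22:00-07:00 wraps past midnight, which is easy to get wrong in a policy engine. The sketch below shows one way to evaluate it. It assumes the caller has already converted the current time to the policy's time zone, and it reads the policy comment as "P1 bypasses quiet hours, everything else waits for the digest." The function names are hypothetical.

```python
from datetime import time


def parse_range(spec: str) -> tuple[time, time]:
    """Parse "HH:MM-HH:MM" into (start, end) times."""
    start_s, end_s = spec.split("-")
    return time.fromisoformat(start_s.strip()), time.fromisoformat(end_s.strip())


def in_quiet_hours(now: time, ranges: list[str]) -> bool:
    """True if `now` (local to the policy time zone) falls in any range."""
    for spec in ranges:
        start, end = parse_range(spec)
        if start <= end:
            # Same-day range, e.g. 13:00-14:00.
            if start <= now < end:
                return True
        elif now >= start or now < end:
            # Range wraps past midnight, e.g. 22:00-07:00.
            return True
    return False


def should_page(priority: str, now: time, ranges: list[str]) -> bool:
    """Assumption: P1 always pages; other priorities are held for the digest."""
    return priority == "P1" or not in_quiet_hours(now, ranges)
```

For example, `should_page("P3", time(23, 0), ["22:00-07:00"])` is false, while the same call with `"P1"` is true.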

5) Deduplication, correlation, suppression of flapping

Dedup: group by 'dedup_key = hash(service|metric|dim)'; TTL ≥ flapping window.
Correlation: combine related signals by topology (service → dependency), time (± N min) and context (release, incident).
Flapping: a threshold of "N events per M minutes" collapses the stream into a single "flapping detected" signal with a suggestion to raise hysteresis or suppress.
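The dedup and flapping rules can be sketched as a small in-memory filter. This is a minimal illustration, not a production design: the class name `SignalFilter`, the verdict strings, and the truncated SHA-256 key are assumptions, and a real system would keep this state in a shared store. The defaults mirror the 15-minute TTL and the "5 events per 10 minutes" threshold from the policy example.

```python
import hashlib
from collections import defaultdict, deque


def dedup_key(service: str, metric: str, dim: str) -> str:
    """Stable key for grouping repeats of the same signal."""
    raw = "|".join((service, metric, dim))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class SignalFilter:
    def __init__(self, dedup_ttl_s: float = 900, flap_window_s: float = 600,
                 flap_threshold: int = 5):
        self.dedup_ttl_s = dedup_ttl_s
        self.flap_window_s = flap_window_s
        self.flap_threshold = flap_threshold
        self._last_delivered = {}           # key -> time of last delivered signal
        self._recent = defaultdict(deque)   # key -> event times inside the window
        self._flapping = set()              # keys already reported as flapping

    def process(self, key: str, now: float) -> str:
        """Return 'deliver', 'duplicate', 'flapping' or 'suppressed'."""
        window = self._recent[key]
        window.append(now)
        while now - window[0] > self.flap_window_s:
            window.popleft()

        if len(window) >= self.flap_threshold:
            if key in self._flapping:
                return "suppressed"         # flapping was already reported once
            self._flapping.add(key)
            return "flapping"               # emit a single "flapping detected"
        self._flapping.discard(key)

        last = self._last_delivered.get(key)
        if last is not None and now - last < self.dedup_ttl_s:
            return "duplicate"
        self._last_delivered[key] = now
        return "deliver"
```

Feeding the same key once a minute yields one delivery, a few duplicates, a single "flapping" verdict at the fifth event, and suppression afterwards until the window empties.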

6) Routing and RACI

Responsible: who gets the first notification/page.
Accountable: who receives the escalation once the SLA expires.
Consulted: who to mention in the thread/chat channel.
Informed: who receives the digest/summary.
Assign by role and context (tenant, region, product stream).

7) Delivery channels and nuances

Channel	When to use	Features/Limitations
In-app	Operational, but non-critical; actions	Rich UI, CTA, context
Email	Digests, reports, non-critical	May be lost/filtered
Push	On-call staff on mobile	Length limit, quiet hours
SMS/Pager	P0/P1 critical	Paid, terse, no attachments
Webhook	Integrations (Jira, Slack, Ops)	HMAC signatures, retries, idempotency
Chat (Slack)	Incident thread, collaboration	Text commands (ack, assign)

Retries: 5xx/429/timeout → backoff + jitter; respect 'Retry-After'. Idempotency: 'X-Notification-Id' on webhooks.
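The retry rule can be illustrated with a transport-agnostic sketch. Here `send` is a hypothetical callable that performs one delivery attempt and returns `(status, response_headers)`. The "full jitter" backoff variant, the base/cap values, and the handling of only the integer-seconds form of 'Retry-After' are assumptions made for the example.

```python
import random
import time
from typing import Callable, Mapping, Optional


def is_retryable(status: int) -> bool:
    """429 and any 5xx are worth retrying; other 4xx are not."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  retry_after: Optional[str] = None) -> float:
    """Exponential backoff with full jitter; an integer Retry-After wins."""
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    return random.uniform(0, min(cap, base * 2 ** attempt))


def deliver(send: Callable[[Mapping[str, str]], tuple[int, Mapping[str, str]]],
            notification_id: str, max_attempts: int = 5,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    # The same id goes out on every attempt so the receiver can deduplicate.
    headers = {"X-Notification-Id": notification_id}
    for attempt in range(max_attempts):
        try:
            status, resp_headers = send(headers)
        except TimeoutError:
            status, resp_headers = None, {}   # a timeout is treated as retryable
        if status is not None and status < 400:
            return True
        if status is not None and not is_retryable(status):
            return False                      # e.g. 400/401/404: give up
        if attempt + 1 < max_attempts:
            sleep(backoff_delay(attempt, retry_after=resp_headers.get("Retry-After")))
    return False
```

Because the id header is constant across attempts, a receiver that remembers recently seen 'X-Notification-Id' values can safely ignore a repeat that arrives after a lost response.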

8) Preferences Center

Opt-in/Opt-out by event type, level, channel.
Quiet hours, manual snooze for 15/30/60 min.
Threshold/sensitivity (e.g. ≥ 3σ anomaly).
Language/locale, time/currency format.
Role binding: presets for SRE/Product/Finance.
Transparency: show why the user received the signal (link to the rule).

9) Content design: message structure

Pattern for critical signal (P1):

Title: brief, with the trigger: "[P1] [PSP_TR] Sharp rise in 3DS failures (+12%)."
Context: period, affected segments/region, data source.
Reason/hypothesis: "Associated with the PSP_X release at 18:20 UTC."
SLA/deadline: "Escalation in 10 min."
CTA: "Open playbook," "Enable fallback PSP_Y," "Ack (30 min)."
Links: graph, incident thread, metrics, runbook.
Metadata: 'trace_id', 'incident_id', 'dedup_key'.

Tone: facts, no dramatization; give numbers with units; avoid unexplained abbreviations.
Localization: variables → placeholders; translations are stored in resources; numbers/dates are formatted by locale.

10) Actions from notifications (Actionable)

Ack/Snooze with time parameters.
Assign/Invite to the incident thread.
Runbook: open the resolution steps with context pre-filled.
One-click remediation (where safe): switch route, raise limit, restart job (with confirmation and audit).
Create ticket (Jira/GitHub) with pre-filled fields.

11) Signal quality: metrics and targets

Precision ≥ 80% for P1/P2.
Recall (the proportion of detected incidents among all incidents) ≥ 70%.
Noise: average signals/hour per user (target ceiling).
Ack-time p50/p95, Escalation rate, Snooze rate (as a noise indicator).
MTTD/MTTA/MTTR (in terms of domains and channels).
Silenced-but-should-alert (misses caused by suppression rules): tracked on a separate dashboard.
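As a rough illustration of how the precision and recall targets might be computed, the sketch below assumes each delivered alert has been labeled during review with the incident it matched, or `None` for a false alarm. That labeling scheme and the function name are assumptions, not something the text prescribes.

```python
from typing import Iterable, Optional


def precision_recall(alerts: Iterable[Optional[str]],
                     incidents: set[str]) -> tuple[float, float]:
    """alerts: per delivered alert, the matched incident id or None."""
    alerts = list(alerts)
    true_alerts = [a for a in alerts if a in incidents]
    # Precision: share of delivered alerts that pointed at a real incident.
    precision = len(true_alerts) / len(alerts) if alerts else 0.0
    # Recall: share of real incidents that produced at least one alert.
    recall = len(set(true_alerts)) / len(incidents) if incidents else 0.0
    return precision, recall
```

With five alerts of which three match two out of four known incidents, this gives a precision of 0.6 and a recall of 0.5, both below the stated targets.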

12) Noise control: techniques

Hysteresis and sliding windows for thresholds.
Smoothing (EWMA) before detection.
Aggregation: instead of 30 small ones - one batch/digest with top contributors.
Context limits: maximum N notifications/hour/channel/user.
Auto-feedback: if the user clicks Snooze 3 times in a row → suggest raising the threshold/changing the channel.
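Two of these techniques, EWMA smoothing and hysteresis, combine naturally in one detector. The sketch below is illustrative only: the class name, the separate raise/clear thresholds, and the smoothing factor `alpha` are assumptions chosen for the example.

```python
from typing import Optional


class HysteresisDetector:
    """Raise above `raise_at`, clear only once back below `clear_at`."""

    def __init__(self, raise_at: float, clear_at: float, alpha: float = 0.3):
        if clear_at >= raise_at:
            raise ValueError("clear_at must be below raise_at")
        self.raise_at = raise_at
        self.clear_at = clear_at
        self.alpha = alpha
        self.smoothed: Optional[float] = None
        self.firing = False

    def update(self, value: float) -> Optional[str]:
        """Feed one sample; return 'raise' or 'clear' on a state change."""
        if self.smoothed is None:
            self.smoothed = value
        else:
            # EWMA: new = alpha * sample + (1 - alpha) * previous
            self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed
        if not self.firing and self.smoothed >= self.raise_at:
            self.firing = True
            return "raise"
        if self.firing and self.smoothed <= self.clear_at:
            self.firing = False
            return "clear"
        return None
```

The gap between the two thresholds is what prevents a metric hovering around a single limit from generating a raise/clear pair on every sample.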

13) Security, privacy, compliance

HMAC signature for webhooks, rotation of secrets, 'X-Key-Id'.
RBAC/ABAC: signal visibility by role/tenant.
PII minimization, masks in logs, auditing actions (ack/assign/runbook).
Consent and reasons for notification (rule/policy) - in payload.
Retention/TTL for notification logs; Legal Hold for incidents.
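Webhook signing with key rotation might look like the sketch below. The 'X-Key-Id' header comes from the list above; the 'X-Signature' header name, the hex encoding after the "sha256=" prefix, and the in-memory `keyring` dictionary are assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Mapping


def sign(body: bytes, secret: bytes) -> str:
    """HMAC-SHA256 over the raw request body, hex-encoded."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(body: bytes, headers: Mapping[str, str],
           keyring: Mapping[str, bytes]) -> bool:
    """Check the signature against the secret selected by X-Key-Id."""
    secret = keyring.get(headers.get("X-Key-Id", ""))
    if secret is None:
        return False                      # unknown or retired key id
    expected = sign(body, secret)
    provided = headers.get("X-Signature", "")
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))
```

During rotation, the receiver keeps both the old and the new secret in the keyring until the sender has switched to the new key id, then drops the old entry.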

14) Schemas and payloads

Event (internal)

json
{
  "id": "sig_01HX",
  "domain": "payments",
  "severity": "high",
  "priority": "P2",
  "title": "3DS failure rate rose to 8.2% (+3.1 pp)",
  "occurred_at": "2025-11-03T17:55:00Z",
  "context": { "psp": "PSP_X", "country": "TR", "release_id": "rel_241103_1820" },
  "metrics": { "baseline": 5.1, "current": 8.2, "delta_pp": 3.1 },
  "dedup_key": "payments|PSP_X|TR|3DS_FAILURE",
  "runbook": "rbk_psp_3ds_spike",
  "slo": { "ack_deadline_sec": 600 }
}

Notification (channel-agnostic)

json
{
  "notification_id": "ntf_91ab",
  "signal_id": "sig_01HX",
  "targets": ["oncall_payments"],
  "channels": ["inapp", "slack", "webhook"],
  "cta": [
    { "id": "ack", "label": "Confirm (30 min)", "payload": { "ttl": "30m" } },
    { "id": "runbook", "label": "Open playbook", "payload": { "id": "rbk_psp_3ds_spike" } },
    { "id": "fallback", "label": "Enable fallback PSP_Y", "confirm": true }
  ],
  "hmac": "sha256=AbCd..."
}

15) UX patterns in the product

Inbox: Critical/High/Other tabs with count badges.
Incident feed: correlated signals, timeline of actions, "what was done."
Filters: role, domain, region, time, "unacknowledged only."
Quick actions in the list (ack/snooze/assign).
Explain: "why you are seeing this" (rule, thresholds, data).
Digests: morning/evening, localized by time zone.

16) Test plan

Unit: dedup keys, hysteresis, flapping, serialization of payloads.
Integration: routing, quiet hours, escalations, channel retries.
E2E: P1 scenario from anomaly to ticket closure; P2 in quiet hours → digest.
Chaos: channel outages (SMTP/SMS), delays, signal avalanche, clock skew.
A11y/i18n: screen-readers, keyboard ack/snooze, localization of numbers/dates.

17) Quality dashboards

Precision/Recall by domain.
Ack time p50/p95 and share acknowledged on time.
Noise per user/hour and the noisiest rules.
Escalation rate and "false escalations."
Suppressed vs Delivered (how much is suppressed/digested).
User feedback: per-message ratings, comments on noise.

18) Checklists

Design

Event taxonomy and levels are consistent
Quiet hours/escalation policies are described
Dedup/Correlation/Flapping configured
Channels, retries, webhook idempotency
Preference Center (opt-in/out, snooze)
Content templates and localization
Playbooks and one-click actions (audited)
Quality metrics and dashboards

Operation

Quarterly threshold tuning
A/B testing of rules (thresholds, windows, digests)
Regular "top noise" and CAPA reviews
Channel secret rotation (HMAC, SMTP, SMS)
Scheduled game-day tests

19) Implementation plan (3 iterations)

Iteration 1 - Baseline (2-3 weeks)

Taxonomy, severity/priority, preference center (in-app + email).
Dedup, simple key/time correlation, quiet hours.
Message templates, playbooks, ack/snooze/assign.

Iteration 2 - Reliability and Noise Reduction (3-4 weeks)

Flapping/hysteresis, digests, chat integrations, and webhooks (HMAC, retries).
Escalation according to SLA, quality dashboards (precision/recall, noise).
One-click remediation (with confirmation and audit).

Iteration 3 - Optimization and Scale (Continuous)

Correlation by topology/releases, automatic threshold suggestions.
A/B testing of rules, forecasting "when the threshold will trigger."
Noise reviews and regular game days.

20) Mini-FAQ

How to deal with alert fatigue?
Use dedup, correlation, hysteresis, digests, and a preference center, plus regular noise reviews and A/B testing of thresholds.

Is ML needed for anomalies?
It is useful, but start with deterministic rules and explainable thresholds. Treat ML as an add-on, always paired with Explain.

Why do users get "extra" emails?
Check rule matches, quiet hours, and the "why delivered" audit; set per-channel/hourly limits and digests.

Summary

A strong signal system combines smart filtering, correct prioritization, and one-click actions. Formalize the taxonomy and policies, implement dedup/correlation/hysteresis, give users control (preferences, snooze), and provide reliable delivery plus "why I got this" transparency. Signals then become a control tool rather than a source of noise.
