Alarm and notification system
1) Role and goals
A signal system is not just "sending messages" but a decision-making loop: it surfaces deviations promptly, suggests actions, and balances timeliness against silence.
Objectives:
- Reduce MTTD/MTTR through prioritization and clear playbooks.
- Reduce alert fatigue through noise cancellation.
- Give actions directly from the notification (ack, snooze, runbook, auto-action).
- Observe privacy and consent (opt-in/opt-out, log storage).
2) Taxonomy of events and levels
2.1 Event types
Metrics/anomalies (SRE, product, finance).
Business rules (limits, fraud, KYC, payments).
System (deploy, degradation, licenses).
User (behavioral triggers, RG/responsible gaming).
2.2 Severity levels
Critical - immediate response, risk of loss/safety.
High - significant deterioration of KPI/SLO.
Medium - Action required during business hours.
Low/Info - observation/context, automatically rolled up into digests.
2.3 Priority
'Impact × Urgency' matrix → P1..P4. Tie each priority to delivery channels and response SLAs.
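The matrix can be expressed as a small lookup. The sketch below is illustrative only: the three-step 'Level' scale and the score cut-offs are assumptions, not values from this guide, and should be replaced with the matrix your team agrees on.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map an Impact x Urgency pair to P1..P4 (cut-offs are assumptions)."""
    score = impact * urgency  # possible values: 1, 2, 3, 4, 6, 9
    if score >= 9:
        return "P1"
    if score >= 6:
        return "P2"
    if score >= 3:
        return "P3"
    return "P4"


if __name__ == "__main__":
    for impact in Level:
        print(impact.name, [priority(impact, urgency) for urgency in Level])
```

Keeping the mapping in one function makes it easy to bind each resulting priority to a channel and a response SLA in the policy layer.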
3) Architecture and flows
Signal producers → Event bus → Normalization (enrich, dedup) → Correlation → Rules (policy engine) → Routing → Delivery channels → Preference center → Logs/analytics.
Key components:
- Enricher: adds tenant, role, region, playbook links.
- Deduper: groups recurring events by key.
- Correlator: merges related signals into an incident.
- Policy Engine: YAML/DSL rules, quiet hours, escalations.
- Delivery: in-app, email, push, SMS, webhook, chat integration.
4) Rules and policies (YAML example)
yaml
policies:
  - id: p_sre_critical
    match: { domain: "infra", severity: "critical" }
    route:
      primary: { channel: "pager", targets: ["oncall_sre"] }
      fallback: { channel: "sms", delay: "2m" }
    suppress:
      flapping: { window: "10m", threshold: 5 }  # suppress rapid flapping
      duplicates: { key: ["service", "cluster", "error_code"], ttl: "15m" }
    escalate:
      after: "10m"
      to: ["sre_manager"]
      auto_assign: true
  - id: p_product_medium
    match: { domain: "product", severity: "medium", kpi: "conversion" }
    route:
      primary: { channel: "inapp", audience: "product_owners" }
    digest:
      window: "1h"
      max_items: 10
quiet_hours:
  tz: "Europe/Kyiv"
  ranges: ["22:00-07:00"]  # during this window only P1/pager is delivered; the rest goes to digests
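A quiet-hours range such as 22:00-07:00 wraps past midnight, which is easy to get wrong. The sketch below shows one way to evaluate it. The function names are hypothetical, and it assumes the caller has already converted the current time to the policy's time zone and that only P1 bypasses quiet hours.

```python
from __future__ import annotations

from datetime import time


def parse_range(text: str) -> tuple[time, time]:
    """Parse "HH:MM-HH:MM" into a (start, end) pair."""
    start, end = text.split("-")
    return time.fromisoformat(start.strip()), time.fromisoformat(end.strip())


def in_quiet_hours(now: time, ranges: list[str]) -> bool:
    """True if `now` (local to the policy time zone) falls inside any range."""
    for text in ranges:
        start, end = parse_range(text)
        if start <= end:
            # Same-day range, e.g. 13:00-14:00.
            if start <= now < end:
                return True
        elif now >= start or now < end:
            # Range wraps past midnight, e.g. 22:00-07:00.
            return True
    return False


def should_deliver_now(priority: str, now: time, ranges: list[str]) -> bool:
    """Assumed rule: P1 always goes out, everything else waits for the digest."""
    return priority == "P1" or not in_quiet_hours(now, ranges)


if __name__ == "__main__":
    quiet = ["22:00-07:00"]
    print(should_deliver_now("P1", time(23, 30), quiet))
    print(should_deliver_now("P2", time(23, 30), quiet))
```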
5) Deduplication, correlation, suppression of flapping
Dedup: group by 'dedup_key = hash(service, metric, dim)'; TTL ≥ flapping window.
Correlation: combine related signals by topology (service → dependency), time (± N min), and context (release, incident).
Flapping: when the "N events in M minutes" threshold is crossed, emit a single "flapping detected" signal with a suggestion to raise hysteresis or suppress.
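A minimal in-memory sketch of dedup and flapping suppression, assuming a single process and caller-supplied timestamps in seconds. The 'Suppressor' class, its return values, and the truncated SHA-256 key are illustrative choices, not part of any existing system.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict, deque


def dedup_key(service: str, metric: str, dim: str) -> str:
    """Stable key for grouping repeats of the same signal."""
    raw = "|".join((service, metric, dim))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class Suppressor:
    """Classifies each event as emit / duplicate / flapping / suppressed."""

    def __init__(self, ttl_sec: float, flap_window_sec: float, flap_threshold: int):
        self.ttl_sec = ttl_sec
        self.flap_window_sec = flap_window_sec
        self.flap_threshold = flap_threshold
        self._last_emitted: dict[str, float] = {}
        self._recent: dict[str, deque] = defaultdict(deque)
        self._flapping: set[str] = set()

    def observe(self, key: str, now: float) -> str:
        recent = self._recent[key]
        recent.append(now)
        # Drop timestamps that fell out of the flapping window.
        while recent and now - recent[0] > self.flap_window_sec:
            recent.popleft()
        if len(recent) >= self.flap_threshold:
            if key in self._flapping:
                return "suppressed"  # already reported as flapping
            self._flapping.add(key)
            return "flapping"  # one "flapping detected" signal
        self._flapping.discard(key)
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.ttl_sec:
            return "duplicate"
        self._last_emitted[key] = now
        return "emit"


if __name__ == "__main__":
    s = Suppressor(ttl_sec=900, flap_window_sec=600, flap_threshold=5)
    key = dedup_key("checkout", "error_rate", "eu")
    for t in (0, 60, 120, 180, 240, 300, 2000):
        print(t, s.observe(key, t))
```

The example values (15-minute TTL, 5 events in 10 minutes) mirror the policy example in section 4 and keep the TTL at least as long as the flapping window.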
6) Routing and RACI
Responsible: who gets the first notification/page.
Accountable: who is escalated to once the SLA expires.
Consulted: who to mention in the thread/chat channel.
Informed: who receives the digest/summary.
Assign by role and context (tenant, region, product stream).
7) Delivery channels and nuances
Retries: on 5xx/429/timeout, retry with backoff + jitter and honor 'Retry-After'. Idempotency: send 'X-Notification-Id' on webhooks.
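The retry rule can be sketched as a delay calculator. This version uses exponential backoff with full jitter, which is one common variant and an assumption here. It also assumes 'Retry-After' has already been parsed into seconds (the header may instead carry an HTTP date). The names 'is_retryable' and 'retry_delay' are hypothetical.

```python
import random
from typing import Optional


def is_retryable(status: Optional[int]) -> bool:
    """Retry on timeouts (status None), 429, and any 5xx response."""
    return status is None or status == 429 or 500 <= status <= 599


def retry_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng.uniform(0.0, ceiling)  # full jitter spreads out retry storms
    if retry_after is not None:
        # Never retry sooner than the server asked us to.
        delay = max(delay, retry_after)
    return delay


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(retry_delay(attempt), 2))
```

Sending the same 'X-Notification-Id' on every attempt lets the receiver discard repeats, so retries stay safe.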
8) Preferences Center
Opt-in/Opt-out by event type, level, channel.
Quiet hours, manual snooze for 15/30/60 min.
Threshold/sensitivity (e.g. ≥ 3σ anomaly).
Language/locale, time/currency format.
Role binding: presets for SRE/Product/Finance.
Transparency: show why the user received the signal (link to the rule).
9) Content design: message structure
Pattern for critical signal (P1):
- Title: brief, with the trigger: "[P1] [PSP_TR] Sharp rise in 3DS failures (+12%)."
- Context: period, affected segments/region, data source.
- Reason/hypothesis: "Associated with the release of PSP_X 18:20 UTC."
- SLA/deadline: "Escalation in 10 min."
- CTA: "Open playbook", "Enable fallback PSP_Y", "Ack (30 min)".
- Links: graph, incident-thread, metrics, runbook.
- Metadata: 'trace_id', 'incident_id', 'dedup_key'.
Tone: facts, no dramatization; give numbers with units; avoid unexplained abbreviations.
Localization: variables → placeholders, translations are stored in resources; numbers/dates - by locale.
10) Actions from notifications (Actionable)
Ack/Snooze with time parameters.
Assign/Invite to the incident thread.
Runbook: open the resolution steps with context prefilled.
One-click remediation (where safe): switch route, raise limit, restart job (with confirmation and audit).
Create ticket (Jira/GitHub) with prefilled fields.
11) Signal quality: metrics and targets
Precision ≥ 80% for P1/P2.
Recall (the proportion of detected incidents among all incidents) ≥ 70%.
Noise: average signals/hour per user (target ceiling).
Ack-time p50/p95, Escalation rate, Snooze rate (as a noise indicator).
MTTD/MTTA/MTTR (broken down by domain and channel).
Silenced-but-should-alert (gaps due to rules) is a separate dashboard.
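Precision and recall can be computed from three counters. The sketch below assumes each real incident is matched by at most one alert, so "true alerts" equals "detected incidents". The function names are hypothetical, and the default targets are the 80% and 70% figures listed above.

```python
from __future__ import annotations


def precision_recall(
    true_alerts: int, false_alerts: int, missed_incidents: int
) -> tuple[float, float]:
    """Return (precision, recall) for one domain and reporting period."""
    fired = true_alerts + false_alerts
    incidents = true_alerts + missed_incidents
    precision = true_alerts / fired if fired else 0.0
    recall = true_alerts / incidents if incidents else 0.0
    return precision, recall


def meets_targets(
    precision: float,
    recall: float,
    min_precision: float = 0.80,
    min_recall: float = 0.70,
) -> bool:
    """Compare the measured values with the stated targets."""
    return precision >= min_precision and recall >= min_recall


if __name__ == "__main__":
    p, r = precision_recall(true_alerts=40, false_alerts=8, missed_incidents=12)
    print(f"precision={p:.2f} recall={r:.2f} ok={meets_targets(p, r)}")
```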
12) Noise control: techniques
Hysteresis and sliding windows for thresholds.
Smoothing (EWMA) before detection.
Aggregation: instead of 30 small notifications, send one batch/digest with top contributors.
Context limits: maximum N notifications/hour/channel/user.
Auto-feedback: if the user clicks Snooze 3 times in a row → suggest raising the threshold or changing the channel.
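EWMA smoothing and hysteresis combine naturally in one detector: the smoothed value must rise above one threshold to open an alert and fall below a lower one to close it. The sketch below is a minimal illustration. The class name and the example thresholds are assumptions.

```python
from typing import Optional


class HysteresisDetector:
    """EWMA-smoothed threshold with separate raise and clear levels."""

    def __init__(self, alpha: float, raise_at: float, clear_at: float):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        if clear_at > raise_at:
            raise ValueError("clear_at must not exceed raise_at")
        self.alpha = alpha
        self.raise_at = raise_at
        self.clear_at = clear_at
        self.smoothed: Optional[float] = None
        self.alerting = False

    def update(self, value: float) -> bool:
        """Feed one sample and return whether the alert is currently active."""
        if self.smoothed is None:
            self.smoothed = float(value)  # seed the average with the first sample
        else:
            self.smoothed = self.alpha * value + (1.0 - self.alpha) * self.smoothed
        if not self.alerting and self.smoothed >= self.raise_at:
            self.alerting = True
        elif self.alerting and self.smoothed <= self.clear_at:
            self.alerting = False
        return self.alerting


if __name__ == "__main__":
    detector = HysteresisDetector(alpha=0.5, raise_at=8.0, clear_at=6.0)
    for sample in (5, 12, 6, 4):
        print(sample, detector.update(sample))
```

The gap between 'raise_at' and 'clear_at' is what prevents a value hovering near the threshold from toggling the alert on every sample.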
13) Security, privacy, compliance
HMAC signature for webhooks, rotation of secrets, 'X-Key-Id'.
RBAC/ABAC: signal visibility by role/tenant.
PII minimization, masks in logs, auditing actions (ack/assign/runbook).
Consent and reasons for notification (rule/policy) - in payload.
Retention/TTL for notification logs; Legal Hold for incidents.
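Webhook signing with key rotation can be sketched with the standard 'hmac' module. The sketch assumes the signature covers the raw request body, is hex-encoded with a 'sha256=' prefix, and that 'X-Key-Id' selects which shared secret to use. The encoding and the helper names are assumptions.

```python
from __future__ import annotations

import hashlib
import hmac


def sign(body: bytes, secret: bytes) -> str:
    """Return a 'sha256=<hex>' signature over the raw body."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify(
    body: bytes, signature: str, key_id: str, secrets_by_key_id: dict[str, bytes]
) -> bool:
    """Check a signature against the secret selected by the key id."""
    secret = secrets_by_key_id.get(key_id)
    if secret is None:
        return False  # unknown or already retired key
    expected = sign(body, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


if __name__ == "__main__":
    # During rotation both the old and the new secret stay valid for a while.
    secrets = {"k1": b"old-secret", "k2": b"new-secret"}
    body = b'{"notification_id": "ntf_91ab"}'
    signature = sign(body, secrets["k2"])
    print(verify(body, signature, "k2", secrets))
```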
14) Schemas and payloads
Event (internal)
json
{
  "id": "sig_01HX",
  "domain": "payments",
  "severity": "high",
  "priority": "P2",
  "title": "3DS failure rate rose to 8.2% (+3.1 pp)",
  "occurred_at": "2025-11-03T17:55:00Z",
  "context": { "psp": "PSP_X", "country": "TR", "release_id": "rel_241103_1820" },
  "metrics": { "baseline": 5.1, "current": 8.2, "delta_pp": 3.1 },
  "dedup_key": "payments PSP_X TR 3DS_FAILURE",
  "runbook": "rbk_psp_3ds_spike",
  "slo": { "ack_deadline_sec": 600 }
}
Notification (channel-agnostic)
json
{
  "notification_id": "ntf_91ab",
  "signal_id": "sig_01HX",
  "targets": ["oncall_payments"],
  "channels": ["inapp", "slack", "webhook"],
  "cta": [
    {"id": "ack", "label": "Confirm (30 min)", "payload": {"ttl": "30m"}},
    {"id": "runbook", "label": "Open playbook", "payload": {"id": "rbk_psp_3ds_spike"}},
    {"id": "fallback", "label": "Enable fallback PSP_Y", "confirm": true}
  ],
  "hmac": "sha256=AbCd..."
}
15) UX patterns in the product
Inbox: Critical/High/Other tabs with count badges.
Incident feed: correlated signals, timeline of actions, "what was done."
Filters: role, domain, region, time, "unacknowledged only."
Quick actions in the list (ack/snooze/assign).
Explain: "why you see it" (rule, thresholds, data).
Digests: morning/evening, localized by TZ.
16) Test plan
Unit: dedup keys, hysteresis, flapping, serialization of payloads.
Integration: routing, quiet hours, escalations, channel retries.
E2E: scenario P1 from anomaly to ticket closure; P2 in quiet hours → digest.
Chaos: channel outage (SMTP/SMS), delays, signal avalanche, clock skew.
A11y/i18n: screen-readers, keyboard ack/snooze, localization of numbers/dates.
17) Quality dashboards
Precision/Recall by domain.
Ack time p50/p95 and share of signals acknowledged on time.
Noise per user/hour and top noise rules.
Escalation rate and "false escalations."
Suppressed vs Delivered (how much is suppressed/digested).
User feedback: message ratings, comments on noise.
18) Checklists
Design
- Event taxonomy and levels are consistent
- Quiet hours/escalation policies are described
- Dedup/Correlation/Flapping configured
- Channels, retries, webhook idempotency
- Preference Center (opt-in/out, snooze)
- Content templates and localization
- Playbooks and one-click actions (audited)
- Quality metrics and dashboards
Operation
- Quarterly threshold tuning
- A/B tests of rules (threshold, windows, digest)
- Regular "top noise" and CAPA reviews
- Channel secret rotation (HMAC, SMTP, SMS)
- Scheduled game-day tests
19) Implementation plan (3 iterations)
Iteration 1 - Baseline (2-3 weeks)
Taxonomy, severity/priority, preference center (in-app + email).
Dedup, simple key/time correlation, quiet hours.
Message templates, playbooks, ack/snooze/assign.
Iteration 2 - Reliability and Noise Reduction (3-4 weeks)
Flapping/hysteresis, digests, chat integrations, and webhooks (HMAC, retries).
Escalation according to SLA, quality dashboards (precision/recall, noise).
One-click remediation (with confirmation and audit).
Iteration 3 - Optimization and Scale (Continuous)
Correlation by topology/releases, auto-suggestions of thresholds.
A/B tests of rules, forecasting "when the threshold will fire."
Noise reviews and regular game days.
20) Mini-FAQ
How to deal with alert fatigue?
Dedup, correlation, hysteresis, digests, and a preference center, plus regular noise reviews and A/B tests of thresholds.
Is ML needed for anomalies?
Useful, but start with deterministic rules and explainable thresholds. Treat ML as an add-on, always paired with Explain.
Why do users get "extra" emails?
Check rule matches, quiet hours, and the "why delivered" audit; set per-channel/hour limits and digests.
Summary
A strong signal system combines smart filtering, correct prioritization, and one-click actions. Formalize the taxonomy and policies, implement dedup/correlation/hysteresis, give users control (preferences, snooze), and provide reliable delivery with transparency about "why I got it." Then signals become a management tool rather than a source of noise.