
Preventing alert overload

1) Problem and purpose

Alert fatigue occurs when the system sends too many irrelevant or non-actionable notifications. The result: pages get ignored, MTTA/MTTR grow, and real incidents are missed.
The goal: make signals rare, meaningful, and actionable by tying them to SLOs and playbooks.


2) Signal taxonomy (channel = consequences)

Page (P0/P1) - wakes a person up; only when manual action is required right now and a runbook exists.
Ticket (P2) - asynchronous work within hours/days; wakes nobody up, but is tracked against an SLA.
Dash-only (P3) - observation/trend with no active response; creates no noise.
Silent Sentry - background metrics/audit (for RCA/post-mortems).

💡 Rule: a signal starts one level lower until it has been proven that it needs to be higher.

3) Designing the "correct" alert

Each alert must have (see the spec sketch after this list):
  • Objective/hypothesis (what we protect: SLO, security, money, compliance).
  • Trigger conditions (threshold, window, source quorum).
  • Runbook/Playbook (short step ID + link).
  • Owner (team/role group).
  • Completion criteria (when to close, auto-resolution).
  • Impact class (user impact / platform / security / cost).
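
A sketch of what this spec can look like as code, to feed the Alert-as-Code flow in section 8; the field names below are illustrative assumptions, not a fixed schema.

from dataclasses import dataclass

@dataclass
class AlertSpec:
    objective: str        # what we protect: SLO, security, money, compliance
    trigger: str          # threshold, window, source quorum
    runbook: str          # short step ID + link, e.g. "rb://payments/slo-burn"
    owner: str            # team / role group
    resolve_when: str     # completion criteria / auto-resolution condition
    impact_class: str     # user-impact | platform | security | cost
    severity: str = "P2"  # starts one level lower until proven otherwise (section 2 rule)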

4) SLO-oriented monitoring

SLI/SLO → primary signals: availability, latency, success of business operations.

Burn-rate alerts: two windows (short + long), e.g. (the arithmetic is sketched after this list):
  • Short: 5% of the budget in 1 hour → Page.
  • Long: 2% of the budget in 6 hours → Ticket.
  • Cohort alerts: by region/provider/VIP segment - fewer false global alarms.
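
A minimal Python sketch of the two-window arithmetic, assuming a rolling 30-day SLO window and a 99.9% target; both numbers are illustrative and should be replaced with your own SLO parameters.

SLO_WINDOW_HOURS = 30 * 24   # assumed rolling 30-day error budget window

def burn_rate(error_ratio: float, slo_target: float) -> float:
    # How many times faster than "allowed" the error budget is burning.
    return error_ratio / (1.0 - slo_target)

def threshold(budget_fraction: float, window_hours: float) -> float:
    # Burn rate that consumes `budget_fraction` of the budget in `window_hours`.
    return budget_fraction * SLO_WINDOW_HOURS / window_hours

def classify(err_1h: float, err_6h: float, slo_target: float = 0.999) -> str:
    if burn_rate(err_1h, slo_target) >= threshold(0.05, 1):   # 5% of budget in 1 hour
        return "Page"
    if burn_rate(err_6h, slo_target) >= threshold(0.02, 6):   # 2% of budget in 6 hours
        return "Ticket"
    return "OK"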

5) Noise reduction techniques

1. Quorum probes: triggered only if ≥2 independent sources (different regions/providers) confirm the problem.
2. Deduplication: aggregation keys such as service + region + error code.
3. Hysteresis/duration: "in the red zone for ≥ N minutes" to filter out spikes.
4. Rate limit: no more than X alerts/hour/service; if exceeded, one page plus a summary (a sketch of dedup + rate limiting follows this list).
5. Auto-snooze/intelligent suppression: a repeated alert within window T is converted to a Ticket until the root cause is eliminated.
6. Event correlation: one "master alert" instead of dozens of symptoms (e.g. "DB unavailable" suppresses the 5xx alerts from microservices).
7. Maintenance windows: scheduled work automatically suppresses the expected signals.
8. Anomaly + guardrails: anomaly detections raise only a Ticket unless confirmed by an SLO signal.
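
For illustration, a minimal sketch of techniques 2 and 4 (deduplication by aggregation key plus a per-key rate limit); it is not tied to any specific alerting tool, and the key format and limits are assumptions.

import time
from collections import defaultdict

RATE_LIMIT = 6                 # max notifications per key per hour ("X" above, assumed)
WINDOW_SEC = 3600

_recent = defaultdict(list)    # dedup_key -> timestamps of notifications already sent

def dedup_key(event: dict) -> str:
    # Aggregation key: service + region + error code
    return f"{event['service']}:{event['region']}:{event['code']}"

def should_notify(event: dict) -> bool:
    now = time.time()
    key = dedup_key(event)
    # Keep only timestamps inside the rate-limit window.
    _recent[key] = [t for t in _recent[key] if now - t < WINDOW_SEC]
    if len(_recent[key]) >= RATE_LIMIT:
        return False           # over the limit: fold into the hourly summary instead
    _recent[key].append(now)
    return True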


6) Routing and priorities

Priorities: P0 (Page, 15 min updates), P1 (Page, 30 min), P2 (Ticket, 4-8 h), P3 (observation).
Routing by labels: service/env/region/tenant → the corresponding on-call rotation (a routing sketch follows this list).
Time-based escalation: no ack within 5 min → escalate to the next level → Duty Manager/IC.
Quiet Hours: night hours for non-critical signals; Pages are not allowed for P2/P3.
Fatigue policy: if an engineer gets > N pages per shift, downgrade noisy signals to P2, redistribute the load, and escalate the signal-noise problem.
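
A minimal routing sketch based on the labels, priorities, and quiet-hours rule above; the rotation names and the quiet-hours window are assumptions for illustration.

from datetime import datetime

ROUTES = {                      # (service, env) -> on-call rotation (assumed names)
    ("payments", "prod"): "oncall-payments",
    ("wallet", "prod"): "oncall-wallet",
}
QUIET_HOURS = range(0, 7)       # 00:00-06:59 local time (assumed window)

def route(alert: dict, now: datetime) -> str:
    target = ROUTES.get((alert["service"], alert["env"]), "oncall-platform")
    sev = alert["severity"]
    if sev in ("P0", "P1"):
        return f"page:{target}"                         # paging is always allowed
    if sev == "P2":
        if now.hour in QUIET_HOURS:
            return f"ticket:{target}:deliver-at-08:00"  # no night-time noise for P2
        return f"ticket:{target}"
    return "dashboard-only"                             # P3: observation only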


7) Quality of alerts: agreed targets

Actionability ≥ 80%: the vast majority of pages lead to runbook action.
False-positive rate ≤ 5% for Page signals.
Time-to-Fix-Alert ≤ 7 days: a defective alert must be corrected or removed within a week.
Ownership 100%: every alert has an owner and a repository holding its definition.


8) Alert as Code life cycle

1. Create PR (purpose description, conditions, runbook, owner, test plan).
2. Sandbox/Shadow: shadow alert writes to chat/log, but does not page.
3. Canary: limited audience on-call, measure FP/TP.
4. Prod: enable with rate-limit; observe for 2-4 weeks.
5. Weekly review: quality metrics, edits/retirements.
6. Deprecate: if the signal duplicates a higher-level one or is not actionable. (A promotion-gate sketch follows this list.)
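
A minimal CI-gate sketch for this flow, assuming alert definitions live in a repository with an explicit lifecycle stage; the field names are assumptions about that schema.

STAGES = ["shadow", "canary", "prod", "deprecated"]

def can_promote(current: str, requested: str) -> bool:
    # shadow -> canary -> prod -> deprecated, one step per PR
    return STAGES.index(requested) - STAGES.index(current) == 1

def lint(alert: dict) -> list[str]:
    errors = []
    for required in ("id", "owner", "runbook", "stage", "test_plan"):
        if required not in alert:
            errors.append(f"missing required field: {required}")
    if alert.get("stage") not in STAGES:
        errors.append(f"unknown lifecycle stage: {alert.get('stage')}")
    return errors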


9) Maturity metrics (show on dashboard)

Alerts per on-call hour (median / 95th percentile).
% actionable (runbook steps were actually taken) and false-positive rate.
MTTA/MTTR for pages and the page→ticket downgrade rate (should stay low).
Top talkers (services/rules that generate ≥20% of the noise).
Mean time to fix a defective alert.
Burn-rate coverage: the share of services with two-window SLO alerts. (A computation sketch follows this list.)
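
A minimal sketch of computing several of these metrics from an exported list of alert events; the field names (rule, actionable, false_positive, fired_at, acked_at) are assumptions about what the alerting tool can export.

from collections import Counter
from statistics import median

def maturity(events: list[dict], oncall_hours: float) -> dict:
    total = len(events) or 1
    per_rule = Counter(e["rule"] for e in events)
    top_talkers = [r for r, n in per_rule.most_common() if n / total >= 0.20]
    mtta = median(e["acked_at"] - e["fired_at"] for e in events) if events else 0.0
    return {
        "alerts_per_oncall_hour": total / oncall_hours,
        "pct_actionable": sum(e["actionable"] for e in events) / total,
        "false_positive_rate": sum(e["false_positive"] for e in events) / total,
        "mtta_seconds": mtta,
        "top_talkers": top_talkers,
    }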


10) Checklist "Hygiene of alerts"

  • Alert is tied to SLO/SLI or business/security.
  • There is a runbook and owner; contact and war-room channel are specified.
  • Two windows (short/long) and a quorum of sources are configured.
  • Dedup, rate-limit, auto-resolve, and auto-snooze are enabled.
  • Maintenance windows and suppressions are specified for releases/migrations.
  • Shadow/Canary stages passed; FP/TP measured.
  • The alert is included in quality-metrics reporting.

11) Mini templates

Alert specification (YAML idea)

id: payments-slo-burn
severity: P1
owner: team-payments@sre
purpose: "Protect the payment success SLO"
signal:
  type: burn_rate
  sli: payment_success_ratio
  windows:
    short: {duration: 1h, threshold: 5%}
    long: {duration: 6h, threshold: 2%}
confirmations:
  quorum:
    - synthetic_probe: eu,us
    - rum: conversion_funnel
routing:
  page: oncall-payments
  escalate_after: 5m
controls:
  dedup_key: "service=payments,region={{region}}"
  rate_limit: "1/10m"
  auto_snooze_after: "3 pages/1h"
runbook: "rb://payments/slo-burn"
maintenance:
  suppress_when: ["release:payments", "db_migration"]
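
A small loading-and-validation sketch for such a spec, assuming PyYAML is available; the required-field list is an assumption and should match your own schema.

import yaml

REQUIRED = ("id", "severity", "owner", "purpose", "signal", "routing", "runbook")

def load_spec(path: str) -> dict:
    with open(path) as f:
        spec = yaml.safe_load(f)
    missing = [k for k in REQUIRED if k not in spec]
    if missing:
        raise ValueError(f"alert spec {path} is missing fields: {missing}")
    return spec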

Standard update text (to reduce noise)


Impact: drop in payment success_ratio in EU (-3.2% vs SLO, 20 min).
Diagnosis: confirmed by quorum (EU+US synthetics); RUM shows increased failures at step 2.
Actions: shifted 30% of traffic to PSP-B, enabled degrade-UX; next update at 20:30.

12) Processes: Weekly "Alert Review"

Agenda (30-45 min):

1. Top-talkers → edit/delete.

2. FP/TP on Page signals → adjust thresholds/windows/quorum.

3. Candidates for downgrade (Page→Ticket) and vice versa.

4. Time-to-Fix-Alert status - delays are escalated to service owners.

5. Checking coverage with SLO alerts and the presence of runbooks.


13) Link to releases and operations

Release annotations automatically add temporary suppressions (sketched below).
Change windows: in the first 30 minutes after a release, only SLO signals may page.
Playbooks include a "downgrade/suppress non-key alerts" step so the team can focus on the root cause.
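
A minimal sketch of turning a release annotation into a temporary suppression window; the 30-minute duration mirrors the change-window rule above, and the event format is an assumption.

from datetime import datetime, timedelta

suppressions: list[dict] = []   # in practice this state lives in the alert manager

def on_release(service: str, started: datetime) -> None:
    suppressions.append({
        "match": f"release:{service}",
        "until": started + timedelta(minutes=30),   # only SLO signals page meanwhile
        "reason": "post-release change window",
    })

def is_suppressed(label: str, now: datetime) -> bool:
    return any(s["match"] == label and now < s["until"] for s in suppressions)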


14) Safety and compliance

Security signals (intrusion/leak/abnormal access) go to separate channels, with no quiet hours.
An audit log of all suppressions/quiet windows: who, when, why, and the expiry date.
Immutability requirement for critical alerts (signed event records); a sketch follows below.
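
A minimal sketch of a signed audit record for suppressions and quiet windows; HMAC over the record is one simple way to make tampering detectable, and the key handling here is a placeholder assumption.

import hashlib, hmac, json, time

AUDIT_KEY = b"replace-with-a-managed-secret"   # placeholder; use a real secret store

def audit_record(who: str, what: str, why: str, until: str) -> dict:
    record = {"who": who, "what": what, "why": why, "until": until,
              "at": int(time.time())}
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()
    return record

def verify(record: dict) -> bool:
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])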


15) Anti-patterns

"Every graph = alert" → avalanche.
Threshold "! = 0 errors" in sales.
One probe/one region as source of truth.
Page without runbook/owner.
Perpetual "temporary suppressions" with no term.
"Fix it later" defective alerts - accumulate for years.
Mixing release noise with production incidents.


16) Implementation Roadmap (4-6 weeks)

1. Inventory: export all alerts, assign owners and channels.
2. SLO core: introduce burn-rate rules with two windows for critical services.
3. Noise control: enable quorum, dedup, and rate-limit; start a weekly review.
4. Runbook coverage: back 100% of Page signals with playbooks.
5. Fatigue policy: page limits per shift, Quiet Hours, load redistribution.
6. Automation: Alert-as-Code, Shadow/Canary, reporting on quality metrics.


17) The bottom line

Silence is not a lack of monitoring; it is well-designed signals tied to SLOs and processes. Quorum, double windows, dedup, and strict routing turn alerts into rare, accurate, and actionable signals. The team gets to sleep, users are happy, and incidents stay under control.
