GambleHub

Alerts and Notifications: PagerDuty, Opsgenie

1) Why a separate alerting platform

The goal is to deliver an immediate, relevant signal to the right person or team and to start the incident process: acknowledgment (ack), escalation, communication, postmortem. PagerDuty and Opsgenie provide:
  • Routing by services/tags/environments.
  • Escalation policies and schedules (on-call, follow-the-sun).
  • Event deduplication/correlation.
  • Quiet windows (maintenance/freeze) and mute rules.
  • Integrations with monitoring, CI/CD, and ChatOps.

The flow this supports: SLO threshold → alert → person/machine → runbook → rollback/fix → postmortem.

2) Signal model and severity

Recommended scale:
  • critical (page) - SLO violation or money-path error (deposit/withdrawal), availability drop, SLO burn rate.
  • high (page/ticket) - significant degradation without a clear SLO breach.
  • medium (ticket) - capacity issues, backend degradation, elevated retries.
  • low (inform) - trends, warnings.

Rule: page only on an SLO breach or an explicit business trigger.
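
The scale and paging rule above can be collapsed into a single routing table. A minimal Python sketch; the action names ("page", "ticket", "inform") and the 'event_class' values are illustrative assumptions, reusing the payload field from section 10:

```python
# Illustrative mapping of the severity scale above to notification actions.
# Action names and event_class values are this sketch's own vocabulary.
SEVERITY_ACTIONS = {
    "critical": "page",    # SLO violation / money-path error
    "high": "page",        # significant degradation; may also open a ticket
    "medium": "ticket",    # capacity, backend degradation, retries
    "low": "inform",       # trends, warnings
}

def action_for(alert: dict) -> str:
    """Page only on SLO breaches or explicit business triggers; otherwise downgrade."""
    action = SEVERITY_ACTIONS.get(alert.get("severity", "low"), "inform")
    # Enforce the rule: a page must be backed by an SLO or business trigger.
    if action == "page" and alert.get("event_class") not in ("slo_burn", "business"):
        return "ticket"
    return action

print(action_for({"severity": "critical", "event_class": "slo_burn"}))  # page
print(action_for({"severity": "critical", "event_class": "cpu_high"}))  # ticket
```

The downgrade branch is what keeps hardware-only alerts (section 16's anti-pattern) from paging anyone.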

3) Routing architecture

1. Source (Prometheus/Alertmanager, Grafana, cloud monitoring, custom webhooks).
2. Gateway (PagerDuty/Opsgenie service/integration).
3. Policies: routes by tags ('service', 'env', 'region'), severity, payload.
4. Escalation: a sequence of on-call levels (L1 → L2 → manager).
5. Communications: ChatOps channels, status pages, mailing lists.

Example of key tags (standardize)

'service', 'env', 'region', 'version', 'runbook', 'release_id', 'route', 'tenant' (if B2B/multi-tenant).

4) On-call and escalation schedules

Schedules: primary/secondary, roles (SRE, DBRE, Sec).
Rotations: day/night, follow-the-sun, weekend.
Overrides: leave/illness.
Escalation: ack timeout 5-10 min → next layer. During business hours, escalate to the owning team; outside of them, to the platform on-call.

Tip: keep escalation steps short at night (less fatigue) and longer during the day (more context is available).
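
The escalation logic above can be sketched in a few lines. A minimal Python example; the layer names and the 5-minute night / 10-minute day ack timeouts are illustrative assumptions, not platform defaults:

```python
from datetime import datetime, timedelta

# Assumed escalation chain; real chains come from the platform's schedules.
LAYERS = ["L1 primary", "L2 secondary", "manager"]

def ack_timeout(now: datetime) -> timedelta:
    # Shorter steps at night (less fatigue), longer in the day (more context).
    return timedelta(minutes=5) if now.hour < 8 or now.hour >= 20 else timedelta(minutes=10)

def escalation_schedule(paged_at: datetime) -> list[tuple[str, datetime]]:
    """Return (layer, deadline) pairs: if no ack by the deadline, page the next layer."""
    schedule, t = [], paged_at
    for layer in LAYERS:
        t = t + ack_timeout(paged_at)
        schedule.append((layer, t))
    return schedule

for layer, deadline in escalation_schedule(datetime(2025, 11, 3, 2, 0)):
    print(layer, deadline.time())  # 5-minute steps, since 02:00 is a night page
```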

5) Integration with Alertmanager (basic pattern)

yaml
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_ROUTING_KEY}
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}error{{ end }}'
        class: '{{ .CommonLabels.service }}'
        component: '{{ .CommonLabels.env }}'
        group: '{{ .CommonLabels.region }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          service: '{{ .CommonLabels.service }}'
          env: '{{ .CommonLabels.env }}'
          runbook: '{{ .CommonAnnotations.runbook }}'
          release: '{{ .CommonAnnotations.release }}'
route:
  receiver: pagerduty
  group_by: ["service", "env", "region"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h

Opsgenie (Alertmanager integration)

yaml
receivers:
  - name: opsgenie
    opsgenie_configs:
      - api_key: ${OPSGENIE_API_KEY}
        responders:
          - name: "SRE Primary"
            type: team
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
        details:
          trace: '{{ .CommonLabels.trace_id }}'
          runbook: '{{ .CommonAnnotations.runbook }}'

6) Noise, dedup, and correlation

Dedup key: use a stable fingerprint (for example, service + route + code).
Grouping: 'group_by' service/environment so that a 5xx cascade does not spawn dozens of pages.
Mutes/quiet windows: during migrations, releases, and load tests.
Suppression by cause: if there is already a P1 incident for 'api-gateway@prod', suppress child P2/P3 alerts.
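
The stable fingerprint mentioned above can be computed by hashing only the fields that define "the same problem". A minimal Python sketch; the field names follow the payload schema in section 10, and the 16-character truncation is an arbitrary choice:

```python
import hashlib

def fingerprint(alert: dict) -> str:
    """Stable dedup key from service + route + code, as suggested above.
    Volatile fields (trace_id, timestamps) deliberately do not participate."""
    key = "|".join(str(alert.get(k, "")) for k in ("service", "route", "code"))
    return hashlib.sha256(key.encode()).hexdigest()[:16]

a = {"service": "payments-api", "route": "/withdraw", "code": "502", "trace_id": "aaa"}
b = {"service": "payments-api", "route": "/withdraw", "code": "502", "trace_id": "bbb"}
assert fingerprint(a) == fingerprint(b)  # differing trace_id does not break dedup
```

Because the key ignores per-occurrence fields, a retry storm on one route collapses into a single open alert instead of dozens of pages.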

Anti-pattern: paging on CPU/memory with no confirmed impact on the SLO.

7) Connection with releases and auto-actions

When a canary degrades, PagerDuty/Opsgenie receive an alert from the SLO gate → webhook into CI/CD → pause/rollback (Argo Rollouts/Helm).
The alert contains 'release_id', 'image.tag', a link to the pipeline, and the rollback runbook.

Example of runbook link in annotations


runbook: https://runbooks.company/rollback/api-gateway#canary

8) ChatOps and Communications

Auto-create an incident channel in Slack/Teams, linked to a ticket.
Slash commands: `ack`, `assign @user`, `status set`, `postmortem start`.
Status page: updates automatically on P1/P2.

9) Incident lifecycle (minimum)

1. Trigger (alert from SLO/sensors).
2. Page (primary on-call).
3. Ack (confirmation, TTA).
4. Communicate (channel/status).
5. Mitigate (rollback/feature-flag/isolation).
6. Resolve (TTR).
7. Postmortem (timeline, reasons, actions, lessons, task owner).

Role-kit: IC (incident commander), Ops lead, Comms, Scribe.

10) Payload fields (normalize)

json
{
  "service": "payments-api",
  "env": "prod",
  "region": "eu-central-1",
  "severity": "critical",
  "event_class": "slo_burn",
  "summary": "Withdraw 5xx > 0.5% for 10m",
  "runbook": "https://runbooks/payments/withdraw-5xx",
  "release_id": "rel-2025-11-03-14-20",
  "image": "ghcr.io/org/payments:1.14.2",
  "trace_id": "8a4f0c2e9b1f42d7",
  "annotations": { "canary": "25%" }
}
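
Enforcing this normalized shape at the gateway is what makes routing reliable. A minimal Python validator; the choice of which fields are mandatory is an assumption drawn from this article's routing and anti-pattern sections:

```python
# Fields assumed mandatory for routing/acting, per sections 3, 10, and 16.
REQUIRED = ("service", "env", "severity", "summary", "runbook")

def validate_payload(p: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is routable."""
    problems = [f"missing field: {k}" for k in REQUIRED if not p.get(k)]
    if p.get("severity") not in ("critical", "high", "medium", "low"):
        problems.append(f"unknown severity: {p.get('severity')!r}")
    return problems

payload = {
    "service": "payments-api", "env": "prod", "severity": "critical",
    "summary": "Withdraw 5xx > 0.5% for 10m",
    "runbook": "https://runbooks/payments/withdraw-5xx",
}
assert validate_payload(payload) == []          # routable
print(validate_payload({"severity": "urgent"})) # missing fields + unknown severity
```

Rejecting (or flagging) non-conforming payloads at ingestion keeps the "cannot be routed/acted upon" anti-pattern out of the on-call queue.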

11) Integration of signal sources

Prometheus/Alertmanager - the main source of SLO/RED alerts.
Grafana Alerting - simpler for dashboard/business metrics.
OpenTelemetry/SpanMetrics - latency/errors per route.
K8s events - cluster failures (control plane, PDB violations).
DB/queues - lag/locks/replication.
Application webhooks - domain signals (PSP errors, fraud surges).

12) Policies and Compliance

RBAC for creating/modifying policies, schedules, and mutes.
Audit: who acked/assigned/changed status, with timestamps.
PII minimization in payloads (ticket ID instead of the user's email/phone).
DR plan: what to do when PagerDuty/Opsgenie itself is unavailable (fallback channel).

13) Case Studies (PagerDuty vs Opsgenie)

Capability               | PagerDuty                    | Opsgenie
Escalations/schedules    | Mature, flexible             | Mature, flexible
Incident roles/templates | Strong Incident Workflows    | Incident Templates/Stakeholders
Auto-channels/comms      | Good integrations            | Deep Slack/MS Teams
Pricing/licenses         | Often pricier, many add-ons  | Usually cheaper at the start
Tag routing              | Strong (Service Directory)   | Strong (Routing Rules)
Both platforms cover 95% of the same scenarios; choose by cost, UX, and your stack integrations.

14) Quiet windows and freezes

Freeze: ban paging during planned release windows, leaving only P1.
Mute by tags: 'env=stage', 'region=dr', 'service=batch'.
Temporary mute: during database migrations/load tests, always with an explicit owner.
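
The freeze rule above ("only P1 pages during a release window") fits in a few lines. A Python sketch; the window list and the P1/P3 priority scale are illustrative assumptions:

```python
from datetime import datetime

# Assumed planned release window; real windows come from the platform's
# maintenance-window API or a change calendar.
FREEZE_WINDOWS = [
    (datetime(2025, 11, 3, 14, 0), datetime(2025, 11, 3, 16, 0)),
]

def should_page(priority: str, now: datetime) -> bool:
    """During a freeze window, suppress everything below P1."""
    in_freeze = any(start <= now < end for start, end in FREEZE_WINDOWS)
    return priority == "P1" or not in_freeze

assert should_page("P1", datetime(2025, 11, 3, 15, 0))      # P1 always pages
assert not should_page("P3", datetime(2025, 11, 3, 15, 0))  # muted during freeze
assert should_page("P3", datetime(2025, 11, 3, 17, 0))      # freeze is over
```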

15) Performance metrics (SRE/DORA for alerts)

MTTA/MTTR (broken down by team/service/shift).
% of alerts with a runbook (target ≥ 95%).
Share of page alerts driven by SLO (target ≥ 90%).
Useful-to-noisy alert ratio (target ≥ 3:1).
% of auto-actions (pause/rollback via webhook) - should grow over time.
Burn-down of postmortem action items within 14/30 days.
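
MTTA and MTTR are just averages over incident timestamps. A minimal Python sketch; the incident record shape (triggered/acked/resolved datetimes) is this example's assumption, since each platform exports these fields differently:

```python
from datetime import datetime, timedelta
from statistics import mean

# Assumed export shape: one record per incident with three timestamps.
incidents = [
    {"triggered": datetime(2025, 11, 3, 14, 20), "acked": datetime(2025, 11, 3, 14, 24),
     "resolved": datetime(2025, 11, 3, 14, 50)},
    {"triggered": datetime(2025, 11, 3, 18, 0), "acked": datetime(2025, 11, 3, 18, 6),
     "resolved": datetime(2025, 11, 3, 18, 40)},
]

def mtta(items) -> timedelta:
    """Mean time to acknowledge: trigger → ack."""
    return timedelta(seconds=mean((i["acked"] - i["triggered"]).total_seconds() for i in items))

def mttr(items) -> timedelta:
    """Mean time to resolve: trigger → resolve."""
    return timedelta(seconds=mean((i["resolved"] - i["triggered"]).total_seconds() for i in items))

print(mtta(incidents))  # 0:05:00
print(mttr(incidents))  # 0:35:00
```

The same records, grouped by team, service, or shift before averaging, yield the breakdowns listed above.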

16) Anti-patterns

Paging on hardware metrics (CPU, disk) with no user impact.
No 'group_by' → a "storm" of alerts.
No quiet windows - releases paint everything red.
Payloads without 'service/env/runbook' - impossible to route or act on.
No single severity scale and rules (each source differs).
"Eternal" warnings that nobody fixes (alert debt).

17) Implementation checklist (0-45 days)

0-10 days

Align severity scale and standardize tags/annotations.
Create services in PagerDuty/Opsgenie, configure schedules and basic escalations.
Connect Alertmanager/Grafana; enable 'group_by' and dedup.

11-25 days

Introduce SLO alerts (multi-window burn rate), add runbook links.
Configure ChatOps: auto channels, ack/assign commands.
Enable quiet windows on releases/migrations.

26-45 days

Integrate auto-pause/rollback for canaries (webhooks).
Introduce MTTA/MTTR reporting and alert hygiene (noise cleanup).
Standardize postmortem and control over action items.

18) Ready snippets

Grafana Alerting → PagerDuty (JSON body mapping)

json
{
  "routing_key": "${PAGERDUTY_ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "{{ .RuleName }}: {{ index .Labels \"service\" }}",
    "severity": "{{ if eq (index .Labels \"severity\") \"critical\" }}critical{{ else }}error{{ end }}",
    "source": "grafana",
    "component": "{{ index .Labels \"env\" }}",
    "group": "{{ index .Labels \"region\" }}"
  },
  "links": [
    { "href": "{{ .DashboardURL }}", "text": "Dashboard" },
    { "href": "{{ index .Labels \"runbook\" }}", "text": "Runbook" }
  ]
}

Webhook from alert → Argo Rollouts pause

bash
curl -X POST "$ARGO_API/rollouts/pause" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"name":"api-gateway","namespace":"prod"}'

Opsgenie - Routing Rule (pseudo)

yaml
if:
  tags: ["service:payments", "env:prod"]
  severity: ["P1", "P2"]
then:
  route_to: "SRE-Payments"
  notify: ["Primary OnCall", "Secondary"]

19) Conclusion

A strong alerting setup is process plus discipline: SLO-oriented severity, sound routing and escalation, uniform tags and payloads, quiet windows, ChatOps, and automatic actions (pause/rollback). Choose PagerDuty or Opsgenie based on budget and UX, but keep the same rules for noise, on-call duty, and ownership. Then pages will be rare, accurate, and useful, and incidents short and manageable.
