Alerts and Notifications: PagerDuty, Opsgenie
1) Why a separate alerting platform
The goal is to deliver an immediate, relevant signal to the right person or team and to start the incident process: acknowledgement (ack), escalation, communication, postmortem. PagerDuty and Opsgenie provide:
- Routing by services/tags/environments.
- Escalation and schedules (on duty, follow-the-sun).
- Event deduplication/correlation.
- Quiet windows (maintenance/freeze) and mute rules.
- Integrations with monitoring, CI/CD and ChatOps.
The supported loop: SLO threshold → alert → person/machine → runbook → rollback/fix → postmortem.
2) Signal model and severity
Recommended scale:
- critical (page) - SLO violation or a money-path error (deposit/withdrawal), availability drop, burn rate.
- high (page/ticket) - significant degradation without an obvious SLO breach.
- medium (ticket) - capacity issues, backend degradation, retries.
- low (inform) - trends, warnings.
Rule: page only on an SLO breach or an explicit business trigger.
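The page/ticket/inform decision above can be sketched as a small function; the names (`Action`, `action_for`) and the `slo_breach` flag are illustrative, not a vendor API.

```python
from enum import Enum

class Action(Enum):
    PAGE = "page"      # wake someone up
    TICKET = "ticket"  # goes to the team's queue
    INFORM = "inform"  # chat/e-mail only

def action_for(severity: str, slo_breach: bool = False) -> Action:
    """Page only on an SLO breach or an explicit critical business trigger."""
    if severity == "critical":
        return Action.PAGE
    if severity == "high":
        # high pages only when tied to an SLO breach; otherwise it is a ticket
        return Action.PAGE if slo_breach else Action.TICKET
    if severity == "medium":
        return Action.TICKET
    return Action.INFORM

assert action_for("critical") is Action.PAGE
assert action_for("high") is Action.TICKET
assert action_for("high", slo_breach=True) is Action.PAGE
assert action_for("low") is Action.INFORM
```

Encoding the scale as code keeps every source honest: a new integration cannot invent its own paging rules.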
3) Routing architecture
1. Source (Prometheus/Alertmanager, Grafana, cloud monitoring, own webhooks).
2. Gateway (PagerDuty/Opsgenie service/integration).
3. Policies: routes by tags (`service`, `env`, `region`), severity, payload.
4. Escalation: a sequence of on-call levels (L1 → L2 → manager).
5. Communications: ChatOps channels, status pages, mailings.
Example key tags (standardize):
`service`, `env`, `region`, `version`, `runbook`, `release_id`, `route`, `tenant` (if B2B/multi-tenant).
4) On-call and escalation schedules
Schedules: primary/secondary, roles (SRE, DBRE, Sec).
Rotations: day/night, follow-the-sun, weekend.
Overrides: leave/illness.
Escalation: ack timeout 5-10 min → next layer. During working hours escalate to the owning team; outside them, to the on-call platform.
Tip: keep escalation steps short at night (less fatigue) and longer during the day (more context is available).
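A minimal sketch of the ack-timeout escalation, assuming a fixed timeout per layer; the layer names and the 7-minute default are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Illustrative escalation chain (L1 → L2 → manager, as in section 4).
LAYERS = ["L1 primary", "L2 secondary", "manager"]

def escalation_target(paged_at: datetime, now: datetime,
                      ack_timeout: timedelta = timedelta(minutes=7)) -> str:
    """Return who should currently hold the page if nobody has acked yet."""
    elapsed = now - paged_at
    # Each full ack_timeout without an ack moves the page one layer up.
    level = min(int(elapsed // ack_timeout), len(LAYERS) - 1)
    return LAYERS[level]

t0 = datetime(2025, 11, 3, 2, 0, tzinfo=timezone.utc)
assert escalation_target(t0, t0 + timedelta(minutes=3)) == "L1 primary"
assert escalation_target(t0, t0 + timedelta(minutes=10)) == "L2 secondary"
assert escalation_target(t0, t0 + timedelta(minutes=30)) == "manager"
```

In practice the platform owns this timer; modeling it makes the "short steps at night" tuning concrete: shrink `ack_timeout` for the night rotation.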
5) Integration with Alertmanager (basic pattern)
```yaml
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_ROUTING_KEY}
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}error{{ end }}'
        class: '{{ .CommonLabels.service }}'
        component: '{{ .CommonLabels.env }}'
        group: '{{ .CommonLabels.region }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          service: '{{ .CommonLabels.service }}'
          env: '{{ .CommonLabels.env }}'
          runbook: '{{ .CommonAnnotations.runbook }}'
          release: '{{ .CommonAnnotations.release }}'

route:
  receiver: pagerduty
  group_by: ["service", "env", "region"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
```
Opsgenie (native receiver)

```yaml
receivers:
  - name: opsgenie
    opsgenie_configs:
      - api_key: ${OPSGENIE_API_KEY}
        responders:
          - name: "SRE Primary"
            type: team
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else }}P3{{ end }}'
        details:
          trace: '{{ .CommonLabels.trace_id }}'
          runbook: '{{ .CommonAnnotations.runbook }}'
```
6) Noise, dedup, and correlation
Dedup key: use a stable fingerprint (for example, service + route + code).
Grouping: `group_by` service/environment so that a 5xx cascade does not spawn dozens of pages.
Mutes/quiet windows: during migrations/releases/load tests.
Cause-based suppression: if there is already a P1 incident for `api-gateway@prod`, suppress its child P2/P3 alerts.
Anti-pattern: paging on CPU/memory with no confirmed impact on an SLO.
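The stable-fingerprint recipe above can be sketched like this; hashing `service|route|code` is one possible implementation, not the platforms' built-in algorithm.

```python
import hashlib

def dedup_key(service: str, route: str, code: str) -> str:
    """Stable fingerprint so repeats of the same failure collapse into one alert."""
    raw = f"{service}|{route}|{code}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

# The same failure always maps to the same key...
assert dedup_key("payments-api", "/withdraw", "5xx") == \
       dedup_key("payments-api", "/withdraw", "5xx")
# ...while a different route opens a separate incident.
assert dedup_key("payments-api", "/deposit", "5xx") != \
       dedup_key("payments-api", "/withdraw", "5xx")
```

The key property is stability: the fingerprint must not include volatile fields (timestamps, pod names), or dedup silently stops working.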
7) Connection with releases and auto-actions
When a canary degrades, the SLO gate fires an alert into PagerDuty/Opsgenie → a webhook into CI/CD → pause/rollback (Argo Rollouts/Helm).
The alert carries `release_id`, `image.tag`, a link to the pipeline, and the rollback runbook.
Example runbook link in annotations:
runbook: https://runbooks.company/rollback/api-gateway#canary
8) ChatOps and Communications
Auto-creating an incident channel in Slack/Teams, linking to a ticket.
Slash commands: `ack`, `assign @user`, `status set`, `postmortem start`.
Status page: updated automatically on P1/P2 incidents.
9) Incident lifecycle (minimum)
1. Trigger (alert from SLO/sensors).
2. Page (primary on-call).
3. Ack (confirmation, TTA).
4. Communicate (channel/status).
5. Mitigate (rollback/feature-flag/isolation).
6. Resolve (TTR).
7. Postmortem (timeline, reasons, actions, lessons, task owner).
Role-kit: IC (incident commander), Ops lead, Comms, Scribe.
10) Payload fields (normalize)
```json
{
  "service": "payments-api",
  "env": "prod",
  "region": "eu-central-1",
  "severity": "critical",
  "event_class": "slo_burn",
  "summary": "Withdraw 5xx > 0.5% for 10m",
  "runbook": "https://runbooks/payments/withdraw-5xx",
  "release_id": "rel-2025-11-03-14-20",
  "image": "ghcr.io/org/payments:1.14.2",
  "trace_id": "8a4f0c2e9b1f42d7",
  "annotations": { "canary": "25%" }
}
```
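A quick gate that rejects events missing the normalized fields above before they reach the router; the `REQUIRED` set and function name are assumptions for illustration.

```python
# Fields every event must carry to be routable and actionable (section 10).
REQUIRED = {"service", "env", "region", "severity", "summary", "runbook"}
SCALE = {"critical", "high", "medium", "low"}

def validate_payload(payload: dict) -> list:
    """Return a list of problems; an empty list means the event is routable."""
    problems = ["missing field: " + k for k in sorted(REQUIRED - payload.keys())]
    if payload.get("severity") not in SCALE:
        problems.append("severity outside the agreed scale")
    return problems

event = {
    "service": "payments-api", "env": "prod", "region": "eu-central-1",
    "severity": "critical", "summary": "Withdraw 5xx > 0.5% for 10m",
    "runbook": "https://runbooks/payments/withdraw-5xx",
}
assert validate_payload(event) == []
assert "missing field: runbook" in validate_payload({"service": "x"})
```

Running this check at the gateway turns "payloads without service/env/runbook" from an anti-pattern into a hard rejection.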
11) Integration of signal sources
Prometheus/Alertmanager is the main source of SLO/RED.
Grafana Alerting is easier for dashboards/business metrics.
OpenTelemetry/SpanMetrics - latency/error by route.
K8s events - cluster failures (control-plane, PDB violations).
DB/Queues - lag/locks/replication.
Application webhooks - domain signals (PSP error, fraud surge).
12) Policies and Compliance
RBAC for creating/modifying policies, schedules, and mutes.
Audit: who acked/assigned/changed status, with timestamps.
PII minimization in payloads (ticket ID instead of user's email/phone).
DR-plan: what do we do when PagerDuty/Opsgenie is unavailable (fallback channel).
13) Case Studies (PagerDuty vs Opsgenie)
14) Quiet windows and freezes
Freeze: ban paging during planned release windows, leaving only P1.
Mute by tags: `env=stage`, `region=dr`, `service=batch`.
Temporary mutes: during database migrations/load tests, with an explicit owner.
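A tag-based temporary mute with an owner and expiry can be sketched as follows; the `MUTES` rule shape is invented, not an Opsgenie/PagerDuty schema.

```python
from datetime import datetime, timezone

# Each mute names its owner and an explicit expiry, so nothing stays muted forever.
MUTES = [
    {"match": {"env": "stage"}, "owner": "sre-team",
     "until": datetime(2025, 11, 4, 18, 0, tzinfo=timezone.utc)},
]

def is_muted(labels: dict, now: datetime) -> bool:
    """Muted if some unexpired rule's tags are all present in the alert's labels."""
    return any(
        m["until"] > now and all(labels.get(k) == v for k, v in m["match"].items())
        for m in MUTES
    )

now = datetime(2025, 11, 4, 12, 0, tzinfo=timezone.utc)
assert is_muted({"env": "stage", "service": "batch"}, now)
assert not is_muted({"env": "prod", "service": "payments-api"}, now)
```

The expiry field is the point: a mute without a deadline is just a silently disabled alert.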
15) Performance metrics (SRE/DORA for alerts)
MTTA/MTTR (broken down by teams/services/shifts).
% of alerts with runbook (target ≥ 95%).
Share of page-alerts by SLO (target ≥ 90%).
Useful-to-noisy alert ratio (target ≥ 3:1).
% of auto-actions (pause/rollback via webhook) - grow.
Burn-down of postmortem action items within 14/30 days.
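MTTA/MTTR from the list above reduce to simple means over incident timestamps; the sample data below is invented purely to show the arithmetic.

```python
from statistics import mean

# Minutes from trigger to ack and from trigger to resolve, per incident.
incidents = [
    {"ack_min": 4, "resolve_min": 35},
    {"ack_min": 9, "resolve_min": 60},
    {"ack_min": 2, "resolve_min": 25},
]

mtta = mean(i["ack_min"] for i in incidents)      # mean time to acknowledge
mttr = mean(i["resolve_min"] for i in incidents)  # mean time to resolve

assert mtta == 5
assert mttr == 40
```

Break these down by team, service, and shift (as the text suggests); a global average hides the one rotation that is drowning.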
16) Anti-patterns
Paging on hardware metrics (CPU, disk) without user impact.
Missing `group_by` → an alert "storm".
No quiet windows - releases paint everything red.
Payloads without `service`/`env`/`runbook` - impossible to route or act on.
No single severity scale and rules - each source behaves differently.
"Eternal" warnings that nobody fixes (alert debt).
17) Implementation checklist (0-45 days)
0-10 days
Align severity scale and standardize tags/annotations.
Create services in PagerDuty/Opsgenie, configure schedules and basic escalations.
Wire up Alertmanager/Grafana, enable `group_by` and dedup.
11-25 days
Introduce SLO alerts (multi-window burn rate), add runbook links to annotations.
Configure ChatOps: auto channels, ack/assign commands.
Enable quiet windows on releases/migrations.
26-45 days
Integrate auto-pause/rollback for canaries (webhooks).
Introduce MTTA/MTTR reporting and alert hygiene (noise cleanup).
Standardize postmortem and control over action items.
18) Ready snippets
Grafana Alerting → PagerDuty (JSON body mapping)
```json
{
  "routing_key": "${PAGERDUTY_ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "{{ .RuleName }}: {{ index .Labels \"service\" }}",
    "severity": "{{ if eq (index .Labels \"severity\") \"critical\" }}critical{{ else }}error{{ end }}",
    "source": "grafana",
    "component": "{{ index .Labels \"env\" }}",
    "group": "{{ index .Labels \"region\" }}"
  },
  "links": [
    { "href": "{{ .DashboardURL }}", "text": "Dashboard" },
    { "href": "{{ index .Labels \"runbook\" }}", "text": "Runbook" }
  ]
}
```
Webhook from alert → Argo Rollouts pause
```bash
curl -X POST "$ARGO_API/rollouts/pause" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"name":"api-gateway","namespace":"prod"}'
```
Opsgenie - routing rule (pseudo-config)

```yaml
if:
  tags: ["service:payments", "env:prod"]
  severity: ["P1", "P2"]
then:
  route_to: "SRE-Payments"
  notify: ["Primary OnCall", "Secondary"]
```
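The pseudo-rule above can be unit-tested as a plain function before committing it to the platform; all names here are illustrative.

```python
def route(alert: dict):
    """Mirror of the pseudo-rule: payments+prod at P1/P2 goes to SRE-Payments."""
    tags = set(alert.get("tags", []))
    severity = alert.get("severity")
    if {"service:payments", "env:prod"} <= tags and severity in {"P1", "P2"}:
        return {"route_to": "SRE-Payments",
                "notify": ["Primary OnCall", "Secondary"]}
    return None  # falls through to the next rule / default route

hit = route({"tags": ["service:payments", "env:prod"], "severity": "P1"})
assert hit["route_to"] == "SRE-Payments"
# Stage traffic at the same severity does not match this rule.
assert route({"tags": ["service:payments", "env:stage"], "severity": "P1"}) is None
```

Keeping routing logic testable like this catches tag typos (`service:payment` vs `service:payments`) before they silently drop pages in production.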
19) Conclusion
A strong alerting loop is process plus discipline: SLO-oriented severity, sound routing and escalation, uniform tags and payloads, quiet windows, ChatOps, and automatic actions (pause/rollback). Choose PagerDuty or Opsgenie based on budget and UX, but keep the same rules for noise, on-call duty, and ownership - then pages will be rare, accurate, and useful, and incidents short and manageable.