Operations and Management → Incident Prediction
Predicting incidents
1) Why you need it
Incidents rarely explode out of nowhere. Before a failure, the platform gives off signals: accelerating p99 growth, slow error-budget burn, queue lags, rising retries toward a specific downstream, approaching provider quotas. Systematic incident prediction shifts the response from "firefighting" to "early intervention," reducing MTTR, Change Failure Rate, and revenue losses.
Objectives:
- Identify precursor patterns and automatically initiate preventive actions.
- Reduce the share of P1/P2 incidents by shifting left (pre-incident detect rate).
- Build predictions into release, failover, and capacity pre-provisioning processes.
2) Leading indicators
Platform/infra:
- Accelerating p95/p99 (gradient), latency "tails", growing variance.
- Queues/streams: growing `lag` with a positive derivative; HPA at maximum.
- DB/cache: `active_conns/max_conns`, `replication_lag`, `evictions`, falling `cache_hit`.
- Network: mTLS/handshake errors, rising outbound 5xx/timeouts.
Dependencies/providers:
- `outbound_error_rate`/`retry_rate` toward a specific provider, `circuit_open`, `quota_usage > 0.9`.
- Provider SLA: planned maintenance windows, degradation.
Business/load:
- Abnormal load (campaigns/matches), RPS/TPS spikes, unusual regional/channel mixes.
- Deposit/bet conversion drops alongside p99 growth → a quasi-proxy incident.
SLO signals:
- Error-budget burn rate above threshold (e.g., > 4× for 10-15 minutes).
- Frequent minor SLO violations (micro-degradations) as a marker of approaching failure.
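The burn-rate signal above can be computed directly from window counts. A minimal sketch (the 99.9% SLO and the sample counts are illustrative):

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.
    1.0 = the budget burns exactly on schedule; 4.0 = four times too fast."""
    if requests == 0:
        return 0.0
    allowed = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / allowed

# 0.4% observed errors against a 99.9% SLO burns the budget at ~4x speed,
# which crosses the example "4x for 10-15 minutes" threshold.
rate = burn_rate(errors=40, requests=10_000, slo_target=0.999)
```

In practice you would evaluate this over two windows (e.g., 5 min and 1 h) and alert only when both exceed the threshold, to filter short spikes.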
3) Data sources and data marts
Online telemetry: Prometheus/OTel (metrics, logs, traces).
Incident events: tickets/statuses/postmortems (ground truth for the target).
Change plans/facts: releases, feature flags, migrations, provider windows.
Directories: dependency map, quotas, owners.
DWH snapshots: marts for training/validation (synchronized windows!).
Quality requirements: ≥99% completeness, timezone alignment to the hour/minute, uniform p95/p99 definitions.
4) Prediction approaches
4.1 Non-parametric/rules (quick start)
Threshold alerts on rate of change: `deriv(p99)`, short-window `z-score`.
Composite conditions: `lag↑ + HPA=max + circuit_open(to="PSP-X")`.
SLO-burn gates: stop the release/canary when burn rate > X.
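A rate-of-change rule like `deriv(p99)` can be sketched as a least-squares slope over the last few samples; the 15 ms-per-sample threshold here is illustrative, not a recommendation:

```python
def p99_slope(samples):
    """Least-squares slope of a short p99 window (ms per sample)."""
    n = len(samples)
    mx = (n - 1) / 2                     # mean of indices 0..n-1
    my = sum(samples) / n
    num = sum((x - mx) * (y - my) for x, y in zip(range(n), samples))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def rule_fires(p99_window, slope_threshold=15.0):
    """Fire the pre-incident rule when p99 rises faster than the threshold."""
    return p99_slope(p99_window) > slope_threshold
```

A steadily rising window such as `[210, 220, 245, 280, 330]` fires; a flat one does not, which is the whole point of alerting on the gradient rather than the level.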
4.2 Anomaly detection
Seasonal baselines (STL/Prophet-like ideas), rolling median + MAD.
Multivariate: joint anomaly across `p99 + retry + circuit_open + quota`.
Change-point detection: CUSUM/BOCPD for trend shifts.
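The rolling median + MAD detector mentioned above fits in a few lines; the 3.5 threshold is a common default, not a tuned value:

```python
import statistics

def mad_zscore(history, value):
    """Robust z-score: distance from the rolling median in MAD units.
    The 1.4826 factor rescales MAD to be comparable with a stddev."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    if mad == 0:
        return 0.0
    return (value - med) / (1.4826 * mad)

def is_anomaly(history, value, threshold=3.5):
    """Flag points further than `threshold` robust sigmas from baseline."""
    return abs(mad_zscore(history, value)) > threshold
```

Median/MAD are preferred over mean/stddev here because a single outlier in the history does not inflate the baseline.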
4.3 ML models (supervised)
Classification "incident within T+K?" over a feature window (e.g., 10-30 min before).
Features: statistics, derivatives, seasonal residuals, one-hot providers/regions, release flags.
Labels: `incident{severity∈[P1,P2]}` within the interval [t, t+K].
Explainability: SHAP/permutation importance for trust and operability.
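The windowing and labeling scheme above can be sketched on a single metric; the window/horizon values and feature choices are illustrative, and a real pipeline would feed these rows into any classifier:

```python
from dataclasses import dataclass

@dataclass
class TrainingRow:
    features: list  # window statistics at minute t
    label: int      # 1 if a P1/P2 incident starts within `horizon` minutes

def build_dataset(metric, incident_starts, window=10, horizon=15):
    """metric: per-minute values of one signal; incident_starts: minute
    indices where P1/P2 incidents began (the ground truth). Each row
    summarizes the trailing `window` minutes at time t and is labeled by
    whether an incident starts in (t, t + horizon]."""
    starts = set(incident_starts)
    rows = []
    for t in range(window, len(metric)):
        w = metric[t - window:t]
        feats = [sum(w) / window, max(w), w[-1] - w[0]]  # mean, max, crude slope
        label = int(any(s in starts for s in range(t + 1, t + horizon + 1)))
        rows.append(TrainingRow(feats, label))
    return rows
```

Note that windows overlapping an already-running incident get label 0 here; prediction targets the onset, not the ongoing failure.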
4.4 SRE-first hybrid
Model → risk score (0-1) → action policy (feature flags/failover/pre-scale), with HITL for critical actions.
5) Feature engineering
Sliding windows (1/5/15 min): mean, p95/p99, std, max, slope.
Relative indicators: `p99/baseline_1d`, `error_rate_delta`.
Cohort features: provider, region, game/match type, device/channel.
Load features: RPS, payload size, number of open WS connections.
System: `hpa_desired/max`, `db_conn_ratio`, `redis_evictions > 0`.
Event flags: "release in progress," "canary 10%," "provider window."
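The sliding-window and relative-indicator features above can be computed like this; the field names are illustrative, and p99 uses a simple nearest-rank estimate:

```python
import math
import statistics

def window_features(values, baseline_1d):
    """Feature vector for one sliding window of a metric.
    `rel_to_baseline` is the relative indicator mean / baseline_1d."""
    mean = statistics.fmean(values)
    p99 = sorted(values)[math.ceil(0.99 * len(values)) - 1]  # nearest rank
    return {
        "mean": mean,
        "p99": p99,
        "std": statistics.pstdev(values),
        "max": max(values),
        "slope": values[-1] - values[0],
        "rel_to_baseline": mean / baseline_1d if baseline_1d else 0.0,
    }
```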
6) Prediction mechanics and actions
Decision chain:
1. Risk scoring every N seconds per domain (Payments/Bets/Games/KYC).
2. Alert policy:
- risk ≥ 0.8 + confirming signals → page the domain owner;
- risk 0.6-0.8 → warning + preparation of mitigations.
3. Auto-actions:
- pre-scale (HPA minReplicas↑), enable caches, limit heavy features;
- switch to the backup provider/route;
- pause/rollback the canary;
- lower the retry limit toward the "narrow" downstream.
4. HITL: a human confirms actions at the "change in business behavior" level.
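The tiered mapping from risk score to actions can be sketched as a small policy function; the tier boundaries and action names below are illustrative, not a fixed API:

```python
def action_policy(risk: float, confirmed: bool) -> list:
    """Map a domain risk score (0-1) to a tier of responses.
    `confirmed` = additional signals corroborate the risk score."""
    if risk >= 0.8 and confirmed:
        # critical tier: page a human and start safe, reversible actions
        return ["page_domain_owner", "pre_scale", "prepare_failover"]
    if risk >= 0.6:
        # warning tier: prepare mitigations without paging anyone
        return ["warn", "prepare_mitigations"]
    return []
```

A high score without confirming signals deliberately drops to the warning tier, which is what keeps false pages down.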
7) Integration into daily processes
Releases: predictive gates on canaries (before/after comparison and risk scoring).
Failover: automatically prepare/warm up the backup route when provider risk rises.
Capacity: "early uplift" when headroom falls and lag grows.
Alerts: a separate "pre-incident" feed + annotations in dashboards.
8) Observability and dashboards
Risk Overview: risk by domain and provider, trends, feature contributions.
Lead Signals: top-N precursors (p99 gradient, lag, open breakers).
Actions & Outcomes: what was triggered, effect on p95/errors, averted incidents.
Model Health: precision/recall/latency, feature drift, auto-action frequency.
9) Prediction quality metrics
Recall @ P1/P2 (sensitivity to critical incidents).
Precision (fewer "false pages").
Lead Time (median "minutes before the fact").
Intervention Win-rate (share of cases where the action reduced risk/cost).
Alert Fatigue Index (alerts per shift per person).
Drift Score (statistical differences between feature distributions and the training period).
Default targets: Recall(P1) ≥ 0.7, Precision ≥ 0.6, median Lead Time ≥ 8-10 min.
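Precision, recall, and lead time can be computed by matching alert timestamps to incident timestamps. A minimal sketch, assuming minute-granularity timestamps and a single matching window:

```python
import statistics

def evaluate(alert_times, incident_times, match_window=30):
    """Match each incident to the earliest alert fired within
    `match_window` minutes before it; alerts matched by no incident
    count as false positives. Returns (precision, recall, median lead)."""
    matched, lead_times = set(), []
    for inc in incident_times:
        candidates = [a for a in alert_times if 0 <= inc - a <= match_window]
        if candidates:
            first = min(candidates)
            matched.add(first)
            lead_times.append(inc - first)
    recall = len(lead_times) / len(incident_times) if incident_times else 0.0
    precision = len(matched) / len(alert_times) if alert_times else 0.0
    lead = statistics.median(lead_times) if lead_times else 0.0
    return precision, recall, lead
```

Running this over a quarter of alert/incident history gives the three headline numbers to compare against the targets above.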
10) Model risk management (MLOps/governance)
Versioning of data/code/artifacts, reproducibility.
Champion/Challenger: the new model runs in parallel, with offline/online comparison.
Drift: PSI/KL divergence, automated threshold recalibration, a "model is stale" alert.
Explainability: for each decision, store feature importances and a link to the underlying data.
Security/ethics: access control, PII masking, auto-actions governed by policies.
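The PSI drift check mentioned above can be sketched as follows; the 10-bin histogram and the 0.1/0.25 cutoffs are the common rule of thumb, not a standard:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index of one feature: a live sample vs the
    training-period sample. Rule of thumb: < 0.1 stable, 0.1-0.25
    moderate drift, > 0.25 retrain/recalibrate."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def shares(xs):
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        # floor at a tiny share so log() stays defined for empty buckets
        return [max(c / len(xs), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```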
11) Sample rules and policies
SLO-burn and canary (concept):
policy:
  if slo_burn_rate{service="payments"} > 4 for 10m and release_phase in ["canary", "post-deploy_30m"]:
    action: pause_release_and_rollback
    notify: squad-payments
Provider composite risk:
risk_psp_x = sigmoid(
  1.2 * z(outbound_p99_ms) +
  1.5 * z(outbound_error_rate) +
  0.8 * z(retry_rate) +
  1.0 * I(quota_usage > 0.9) +
  0.7 * I(circuit_open = 1)
)
if risk_psp_x > 0.8 for 5m -> route_to_psp_y + reduce_features
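An executable form of this composite risk, assuming the z-scores are computed upstream against a rolling baseline (the weights are the illustrative ones from the concept above):

```python
import math

def indicator(cond: bool) -> float:
    """I(...) from the rule: 1.0 when the condition holds, else 0.0."""
    return 1.0 if cond else 0.0

def risk_psp_x(p99_z, err_z, retry_z, quota_usage, circuit_open):
    """Weighted sum of provider health signals squashed to (0, 1)."""
    s = (1.2 * p99_z
         + 1.5 * err_z
         + 0.8 * retry_z
         + 1.0 * indicator(quota_usage > 0.9)
         + 0.7 * indicator(circuit_open))
    return 1.0 / (1.0 + math.exp(-s))  # sigmoid -> risk score
```

With all signals at baseline the score sits at 0.5, so in practice the weights (or an added bias term) need calibrating so that "healthy" maps well below the 0.8 trigger.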
Lag storm in streaming:
if (consumer_lag > 5e6 and deriv(consumer_lag) > 5e4) and hpa_desired == hpa_max:
action: scale_consumers + throttle_producers + enable_batching
12) Implementation checklist (30-60 days)
- Catalog of signals and incident ground truth (severity, timelines).
- Baselines and seasonality for key metrics (pre/post release).
- Early-signal rules (p99 and lag gradients, burn rate).
- Risk/Lead Signals/Actions dashboards.
- Integration with feature flags/canaries, HPA pre-scaling.
- ML classifier pilot in a single domain (e.g., Payments).
- HITL policies and an auto-action log.
- Quality metrics and alerts on model drift/health.
13) Anti-patterns
"Crystal ball": a complex ML model without baselines and simple rules.
No actionability: predicting "bad" without doing anything automatically.
Ignoring seasonality/event calendars (matches/tournaments) → false alarms.
Mixed time zones → incorrect metric/incident windows.
No explainability → mistrust, teams disabling the predictor.
A single global threshold for all domains/regions → low accuracy.
14) Domain specifics (iGaming)
Payments: providers/quotas, rising `retry_rate` and `circuit_open` → early failover.
Bets: delayed odds updates, WS fan-out growth → limit broadcasts.
Games/Live: connection spikes, studio limits → degrade UI/enable caches.
KYC/AML: webhook delays, verification queues → HITL and deferred processing.
15) Example metrics and alerts (ideas)
ALERT PreIncidentRiskHigh
  IF risk_score{domain="payments"} > 0.8 FOR 5m
  LABELS {severity="critical", team="payments"}
ALERT LeadSignalP99Slope
  IF deriv(api_p99_ms{service="bets"}[5m]) > 15 AND api_p99_ms > baseline_1d * 1.2 FOR 10m
  LABELS {severity="warning", team="bets"}
ALERT ProviderEarlyQuota
  IF usage_quota_ratio{provider="psp_x"} > 0.85 FOR 10m
  LABELS {severity="info", team="integrations"}
ALERT StreamLagStorm
  IF (kafka_consumer_lag{topic="ledger"} > 5e6 AND rate(kafka_consumer_lag[5m]) > 5e4)
    AND hpa_desired == hpa_max FOR 10m
  LABELS {severity="critical", team="streaming"}
16) Prediction program KPIs
Pre-Incident Detect Rate.
Avg Lead Time before the incident.
Reduction in P1/P2 QoQ.
MTTR (expected ↓ thanks to early context).
False Alarm Rate / Alert Fatigue (steadily ↓).
Cost Avoidance.
17) Fast start (recipe)
1. Enable gradient rules on p99/lag and SLO burn;
2. Add composite conditions for providers;
3. Wire the predictor into feature flags and pre-scaling;
4. Report prediction → action → effect;
5. Pilot ML in one domain; scale out once Precision/Recall improve.
18) FAQ
Q: Where to start without ML?
A: Seasonal baselines + gradients + composite rules. This gives a noticeable Recall boost without added complexity.
Q: How do we avoid drowning in false positives?
A: Combine signals, introduce hysteresis and confirmation windows, tune per-domain/region thresholds, and track Precision and Alert Fatigue.
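The hysteresis-plus-confirmation idea can be sketched as a small gate; the on/off thresholds and confirmation count below are illustrative:

```python
class HysteresisGate:
    """Fire only after `confirm` consecutive samples at or above `on`;
    clear only when risk drops below `off` (off < on prevents flapping)."""

    def __init__(self, on=0.8, off=0.6, confirm=3):
        self.on, self.off, self.confirm = on, off, confirm
        self.count = 0
        self.active = False

    def update(self, risk: float) -> bool:
        if self.active:
            if risk < self.off:          # only a clear drop deactivates
                self.active, self.count = False, 0
        else:
            self.count = self.count + 1 if risk >= self.on else 0
            if self.count >= self.confirm:
                self.active = True       # confirmed: fire the alert
        return self.active
```

A single spike above 0.8 does not fire, and a brief dip to 0.7 does not clear an active alert, which is exactly the flapping this technique suppresses.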
Q: Which actions should be automated first?
A: Safe and reversible ones: pre-scaling, enabling caches/degradation modes, pausing/rolling back canaries, switching providers on confirmed signals.