GH GambleHub

Detection of anomalies in operations

1) Why

Anomalies are early markers of incidents and financial loss. In iGaming these include drops in authorization success, bursts of timeouts, growing queues, falling KYC conversion, jumps in bet deviations, and game-provider errors. The goal is to detect before users do, localize the cause, and launch automatic or operator reactions.

2) Signals and observation domains

Payments/finance: authorization success rate by PSP/bank/GEO, soft/hard declines, clearing time, early chargeback indicators.
Game core: p95/p99 latency of bets and settlements, error rate, balance discrepancies, outliers in odds/lines.
Infrastructure: API latency/5xx, saturation (CPU/RAM/IO), DB replication lag, queue consumer lag, cache hit/eviction rates.
KYC/AML: verification queues, TAT (turnaround time), share of manual checks.
Front/RUM: TTFB/LCP, JS errors, geo-specific degradation.
Security/fraud: bursts of deposits/registrations/withdrawals, velocity anomalies, atypical patterns.

3) Types of anomalies

Point: a one-time spike/dip (e.g., a 20% drop in auth success in the EU).
Contextual: abnormal for this hour/day/event (a night peak is OK; the same peak in daytime is not).
Collective: a sequence of small deviations that together form an incident (creeping p99 growth).
Change-point: the series settles at a new level (after a release/config change/provider switch).

4) Detection methods (simple to complex)

1. Threshold rules: static or dynamic (sliding-window percentile, median ± k·MAD).
2. Seasonal decomposition (STL): trend/seasonality → analysis of the residual with IQR/MAD.
3. Control charts (CUSUM/EWMA): sensitive to small shifts in mean/variance.
4. Change-point detection: BOCPD, ruptures/PELT; pinpoint the moments a series changes regime.
5. Multivariate anomalies: Mahalanobis distance, Isolation Forest/LOF over feature sets (latency, error rate, lag, hit ratio).
6. Streaming methods: ADWIN, SSD, sketch statistics; low latency with bounded memory.
7. Forecast + delta: ARIMA/ETS/Prophet/GBM → compare the actual value against the confidence interval (especially for business series).
8. Semi-supervised ML: training on the "norm" (One-Class SVM/Autoencoder); useful when labels are scarce.
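Method 1 above (dynamic threshold via sliding-window median ± k·MAD) can be sketched in a few lines. This is a minimal illustration, not a tuned detector; the window size and k are illustrative assumptions.

```python
from statistics import median

def mad_anomalies(series, window=30, k=5.0):
    """Flag points deviating more than k·MAD from the sliding-window median."""
    flags = []
    for i, x in enumerate(series):
        hist = series[max(0, i - window):i]
        if len(hist) < window // 2:          # not enough history yet
            flags.append(False)
            continue
        med = median(hist)
        mad = median(abs(v - med) for v in hist) or 1e-9  # guard against zero MAD
        # 1.4826 rescales MAD to be consistent with the std dev of a normal series
        flags.append(abs(x - med) > k * 1.4826 * mad)
    return flags
```

Because MAD uses medians rather than means, a single prior outlier does not inflate the threshold, which makes the rule robust on noisy payment and latency series.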

In practice, combine 2-3 methods and aggregate by voting or by priority (rule of thumb: seasonal STL + CUSUM + a forecast band).
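The CUSUM chart from the combination above is small enough to show in full. A hedged sketch: the target mean, slack k, and decision threshold h are assumptions you would calibrate per series.

```python
def cusum(series, target, k=0.5, h=5.0):
    """Two-sided CUSUM: returns indices where the cumulative deviation
    above or below `target` (minus slack k) crosses the threshold h."""
    s_hi = s_lo = 0.0
    alarms = []
    for i, x in enumerate(series):
        s_hi = max(0.0, s_hi + (x - target - k))   # accumulates drift above target
        s_lo = max(0.0, s_lo + (target - x - k))   # accumulates drift below target
        if s_hi > h or s_lo > h:
            alarms.append(i)
            s_hi = s_lo = 0.0                      # reset after an alarm
    return alarms
```

Unlike a point threshold, CUSUM accumulates small deviations, so it catches the "collective" anomaly type (creeping p99 growth) described in section 3.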

5) Pipeline anomalies: from data to action

1. Collection → normalization: unified series (OTel/metrics), single granularity (10-60 sec).
2. Features and context: GEO/PSP/bank/channel, flags such as "business hours?", "match/tournament?", releases/feature flags, planned maintenance.
3. Seasonality and calendar: models aware of weekends/prime time/matches/holidays.
4. Detector: the selected methods (threshold/statistical/ML/streaming) with per-segment parameters.
5. Noise suppression: hysteresis and confirmation across several windows (N-of-M), incident deduplication.
6. Enrichment and prioritization: impact assessment (SLO, money/min, audience share), P1-P4 assignment.
7. Reaction: auto-actions (PSP failover, feature degradation, autoscaling by lag), creating an incident and a war room, updating the status page.
8. Logging and audit: what fired and why, threshold/model versions, communications.
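The noise-suppression step (N-of-M confirmation plus hysteresis) might look like the following sketch; n, m, and the clear window are hypothetical defaults.

```python
from collections import deque

def n_of_m_alert(raw_flags, n=3, m=5, clear_after=3):
    """Fire when at least n of the last m detector flags are True; clear
    only after `clear_after` consecutive quiet windows (hysteresis that
    prevents flapping on a noisy detector)."""
    window = deque(maxlen=m)
    quiet = 0
    active = False
    states = []
    for flag in raw_flags:
        window.append(flag)
        if not active and sum(window) >= n:
            active = True
            quiet = 0
        elif active:
            quiet = 0 if flag else quiet + 1
            if quiet >= clear_after:
                active = False
        states.append(active)
    return states
```

The asymmetry is deliberate: entering the alert state requires repeated evidence, while leaving it requires sustained quiet, so a single noisy window can neither open nor close an incident.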

6) Calibration of thresholds and quality

Precision/Recall/F1 for "anomaly ↔ incident."

Time-to-Detect (TTD): the goal is to detect before users/support do (before MTTA).
False Alarm Rate: target ≤ 5-10% for P1/P2.
Lead Time: the window between detection and the SLO violation; it gives auto-actions a chance to act.
Drift monitoring: retraining/recalibration on a schedule and on season/architecture changes.
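Computing the "anomaly ↔ incident" Precision/Recall/F1 requires a matching rule; one simple convention (with an assumed tolerance window) is sketched below.

```python
def detection_quality(alerts, incidents, tolerance=300):
    """alerts: list of unix timestamps; incidents: list of (start, end) windows.
    An alert is a true positive if it falls within an incident window
    (± tolerance seconds); an incident counts as caught if any alert hits it."""
    def hits(a, s, e):
        return s - tolerance <= a <= e + tolerance
    tp_alerts = [a for a in alerts if any(hits(a, s, e) for s, e in incidents)]
    caught = [any(hits(a, s, e) for a in alerts) for s, e in incidents]
    precision = len(tp_alerts) / len(alerts) if alerts else 0.0
    recall = sum(caught) / len(incidents) if incidents else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note the asymmetry: precision is counted per alert (noise cost), recall per incident (coverage), which matches how on-call fatigue and missed incidents are actually felt.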

7) Anomaly catalog (iGaming-examples)

7.1 Payments

Auth-success drop for PSP-X in TR/EU: context is a specific BIN/bank, window 5-10 min.
Soft-decline growth under normal traffic: a possible 3DS/issuer problem.
Clearing delays: risk of cash gaps.
Reactions: routing to an alternative PSP (health × fee × conversion), retry with jitter, enabling simplified 3DS, a comms package to partners.
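The health × fee × conversion routing heuristic mentioned above could be scored like this; the weights, field names, and PSP entries are purely illustrative.

```python
def route_psp(candidates):
    """Pick the PSP maximizing expected net conversion:
    health (0..1, from probes) × auth conversion × (1 - fee)."""
    def score(psp):
        return psp["health"] * psp["conversion"] * (1.0 - psp["fee"])
    return max(candidates, key=score)

psps = [  # illustrative numbers, not real providers
    {"name": "psp_a", "health": 0.99, "conversion": 0.92, "fee": 0.021},
    {"name": "psp_b", "health": 0.70, "conversion": 0.95, "fee": 0.015},
]
best = route_psp(psps)
```

Even though psp_b converts better and charges less, its degraded health probe dominates the product, so traffic fails over to psp_a until health recovers.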

7.2 Betting/Gaming

p99 jump on bet placement: check the replica/cache/queue.
Gap between expected GGR and the norm: contextual anomalies around tournaments/sports events.
Reactions: cache warmup, load redistribution, shedding non-critical features.

7.3 Infra/Data

Replication lag ↑ and lock waits: database overload.
Consumer-lag jumps: partition skew or a hot key.
Reactions: autoscaling, repartitioning, producer rate limits.

7.4 KYC/AML

Verification time ↑: the provider is degrading.
Reactions: fallback provider/manual queue, notify Compliance.

7.5 Front/RUM

LCP/JS errors in a specific browser/version: a release regression.
Reactions: canary rollback, feature flag off, message on the status page.

8) SLO-aware alerting

An anomaly signal becomes an alert only if it affects the error budget or predicts its burn rate.
Two windows: fast (1 h) and slow (6-24 h); immediate paging for high-impact P1 only.
Every alert is bound to a runbook and an owner role.
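The two-window rule can be expressed as a multiwindow burn-rate check in the spirit of the SRE literature; the burn-rate factors here are the commonly cited defaults, treated as assumptions to tune.

```python
def should_page(err_fast, err_slow, slo=0.999,
                fast_factor=14.4, slow_factor=6.0):
    """Page only if BOTH the fast (e.g. 1 h) and slow (e.g. 6 h) error
    rates burn the error budget faster than their respective factors.
    err_fast/err_slow are error ratios (errors / total) per window."""
    budget = 1.0 - slo   # allowed error ratio, e.g. 0.001 for 99.9%
    return (err_fast / budget >= fast_factor and
            err_slow / budget >= slow_factor)
```

Requiring both windows suppresses short spikes (fast window fires, slow does not) while still reacting quickly to a sustained burn, which is exactly the anti-flapping property section 5 asks for.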

9) Solution architecture

Ingestion: OTel/metrics → Kafka/stream → processing framework (Flink/Spark/Kafka Streams).
Feature engineering: aggregates, seasonal indicators, one-hot encoding by PSP/bank/GEO.
Detectors: libraries of statistical methods + models (online/mini-batch) with versioning.
Results store: an anomaly timeline (events) with context, integrated with incident management.
Decision service: prioritization, auto-reactions, publishing to the status page/channels.
Observability: model-quality graphs, drift alarms, ingestion cost.

10) Cost and privacy

Cost-aware: sampling of input series, downsampling of history, aggregation; separate QoS classes.
PII: do not log userId in metrics; for analysis use tokenization/masking and SoD access; export via a workflow with TTL/encryption.
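One common way to implement the tokenization above is deterministic keyed hashing, sketched here; the key-rotation policy and truncation length are assumptions, not a prescribed scheme.

```python
import hashlib
import hmac

def tokenize_user_id(user_id: str, secret: bytes) -> str:
    """Deterministic keyed tokenization: the same user maps to the same
    token (so joins and cohort analysis still work), but the raw id never
    enters metrics or logs. Rotating `secret` per retention window puts a
    TTL on linkability across periods."""
    return hmac.new(secret, user_id.encode(), hashlib.sha256).hexdigest()[:16]
```

HMAC rather than a plain hash matters: without the secret key, an attacker with the metric store cannot brute-force tokens back to userIds from a known id list.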

11) Processes and roles

Responsible: SRE/Observability/Payments Risk in their domains.
Accountable: Head of Ops/SRE.
Consulted: Data Science, Product, Compliance, Security.
Informed: Support, Partner Management, Finance.
Rituals: weekly calibration of thresholds/rules, monthly retro on false/missed signals.

12) Dashboards

Exec: anomaly map by domain, false/true alarm trends, TTD and lead time, revenue/SLO impact.
Ops/SRE: detection feed with context (releases/flags/planned maintenance), STL residual distributions, change-point cards.
Payments/Risk: PSP × GEO × bank heatmaps, decline funnels, auto-routing and the effect of measures.
Front/RUM: browser × version × GEO, release regressions, VIP experience.

13) KPI/KRI

TTD (min) and Lead Time (min) before the SLO violation.
Precision/Recall/F1 of incident linkage.
False Alarm Rate and pager quota (on-call fatigue).
Share of auto-reactions that closed the problem without manual intervention.
MTTR reduction after rollout.
Cost/value: $/alert and savings from avoided losses.

14) Implementation Roadmap (8-12 weeks)

Weeks 1-2: SLI/KPI inventory, selection of priority series (payments/bets/queues/DB), basic thresholds and STL.
Weeks 3-4: streaming (Kafka + Flink/Streams), context (GEO/PSP/releases), hysteresis and dedup.
Weeks 5-6: change-point + CUSUM, forecast bands for business series, incident-platform integration, runbooks.
Weeks 7-8: auto-reactions (PSP failover, feature degradation, autoscaling by lag), dashboards and quality metrics.
Weeks 9-10: multivariate models (Isolation Forest/Autoencoder) in pilot domains, drift monitoring.
Weeks 11-12: cost optimization, A/B threshold calibration, a monthly review cadence, and team training.

15) Artifact patterns

Anomaly Spec: signal, segmentation (GEO/PSP/bank), method, thresholds, windows, hysteresis, owner, runbook, auto-reactions.
Change-Point Report: time, component, before/after levels, correlations (releases/feature flags/planned maintenance).
Quality Dashboard Definition: quality metrics, target boundaries, review period.
Auto-Action Policy: auto-action conditions and limits, return criteria, audit.

16) Antipatterns

Universal static thresholds without seasonality or segmentation.
No hysteresis → flapping and pager fatigue.
Alerts outside the SLO/money context → lots of noise, little value.
Black-box ML without explainability and logging.
No linkage to releases/feature flags/planned maintenance.
Ignoring ingestion/storage costs for auxiliary series.

Summary

Anomaly detection is a process and a platform, not just a model: the right signals and context → robust methods (STL/CUSUM/CPD/forecast) → noise reduction and prioritization by SLO/revenue → auto-reactions and clear runbooks → a closed loop of quality and cost. Such a circuit catches problems before users do, reduces MTTR, and protects the business flows of iGaming platforms.
