Anomaly detection

Anomaly detection is the identification of unusual observations, patterns, or changes in data that deviate from the "norm" and can signal failures, fraud, security incidents, data errors, or rare business events. What follows is a systematic overview, from problem formulation to operating detectors and managing alerts.

1) Types of anomalies and problem formulations

Point anomalies: single observations outside the norm (a surge in deposits for one user).
Contextual: context-dependent deviations (high load at night is fine, the same load during the day is an anomaly).
Collective: a group of individually ordinary points forming an unusual sequence (a series of small transactions).
Structural: regime changes (change-points, new seasonality).
Data quality anomalies: missing values, duplicates, spliced records, misaligned timestamps, "flatlined" sensors.

Training modes:

Supervised: labeled anomalies are available (rare, expensive).
Semi-supervised (one-class): we learn the "norm"; everything else is anomalous.
Unsupervised: we look for "rare/distant" points without labels.

2) Data and preparation

Boundaries of normal: horizons and seasonality (hour/day/week), calendar events, weekends, promotions.
Features: lags, rolling statistics (mean/median/EMA), quantile features, category encodings, rarity counters, 7/30/90-day window aggregates.
Cleaning: deduplication, time zone correction, resampling to a uniform frequency, gap handling (interpolation/forward-fill/imputation models).
Standardization/robustness: RobustScaler/ranks/winsorization for resistance to outliers.
Point-in-time correctness: no future leaks when generating features.
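
The point-in-time rule can be illustrated with a minimal sketch using only the Python standard library. The function name rolling_features and the window size are hypothetical choices made for this illustration; the only idea taken from the text is that the features for index t are built from values strictly before t.

```python
from statistics import mean, median

def rolling_features(values, window=7):
    """For each index t, summarize the `window` values strictly before t."""
    rows = []
    for t in range(len(values)):
        past = values[max(0, t - window):t]  # excludes values[t]: no future leak
        if len(past) < window:
            rows.append(None)  # not enough history yet
            continue
        rows.append({
            "lag_1": past[-1],
            "roll_mean": mean(past),
            "roll_median": median(past),
        })
    return rows

series = [10, 12, 11, 13, 12, 11, 12, 95, 12]
feats = rolling_features(series, window=7)
print(feats[7])  # features for the spike at index 7 do not see the spike itself
```

Because the slice ends at t, changing values[t] never changes the features computed for t, which is the property a backtest relies on.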

3) Detection methods

3.1. Statistics and rules

z-score/robust z (median, MAD), IQR/box plot, exponential smoothing with confidence bands.
Control charts (Shewhart, CUSUM, EWMA): for production processes and streaming metrics.
Quantile thresholds (dynamic, per window), seasonal quantile thresholds.
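
As a minimal sketch of the robust z-score mentioned above, the snippet below uses the median and MAD from the standard library. The function name robust_z_scores and the sample data are illustrative assumptions; the constant 0.6745 and the cutoff of 3.5 follow the commonly used "modified z-score" convention and should be tuned per metric.

```python
from statistics import median

def robust_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        # Returning zeros is an assumption of this sketch; handle separately in practice.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

data = [10, 12, 11, 13, 12, 11, 12, 95, 12]
scores = robust_z_scores(data)
flagged = [i for i, z in enumerate(scores) if abs(z) > 3.5]
print(flagged)  # [7]
```

Unlike the classic z-score, the median and MAD are barely moved by the spike itself, so a single extreme value cannot hide by inflating the scale estimate.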

3.2. Distances, densities, clusters

kNN distance, Local Outlier Factor (LOF): local rarity.
DBSCAN/HDBSCAN: noise points outside clusters.
PCA/Robust PCA: anomalies show up as high residual error (SPE statistic) or high Hotelling’s T².
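
The kNN-distance idea can be sketched in a few lines of standard-library Python. This is a brute-force illustration, not LOF and not an indexed nearest-neighbor search; the function name knn_distance_scores, the value of k, and the toy points are assumptions.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor (brute force, O(n^2))."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = knn_distance_scores(points, k=2)
most_anomalous = scores.index(max(scores))
print(most_anomalous)  # 4
```

A larger k makes the score less sensitive to small tight groups of anomalies; LOF goes one step further by comparing each point's density with that of its neighbors.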

3.3. Ensembles and trees

Isolation Forest: isolates rare points in short paths (few random splits).
Randomized thresholding/bagging over basic rules: fast baselines for production.

3.4. Reconstruction and probabilistic

Autoencoder/VAE (including LSTM/Transformer for sequences): anomaly = high reconstruction error.
Probabilistic forecasting: an observation outside the predicted interval is a signal.
Bayesian models/normalizing flows: explicit uncertainty.
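
The "outside the predicted interval" signal can be illustrated with a deliberately simple stand-in for a probabilistic forecast: an empirical per-slot quantile band built from past cycles. This is a sketch under stated assumptions, not a trained forecasting model; the function names, the period of 4, and the 5-95% band are hypothetical choices.

```python
from statistics import quantiles

def seasonal_quantile_band(history, period, low_pct=5, high_pct=95):
    """Per-slot (e.g. per-hour) empirical quantile band from past cycles."""
    band = []
    for slot in range(period):
        slot_values = history[slot::period]  # same slot across all past cycles
        cuts = quantiles(slot_values, n=100, method="inclusive")  # 99 cut points
        band.append((cuts[low_pct - 1], cuts[high_pct - 1]))
    return band

def outside_band(values, band, start_slot=0):
    """Flag values that fall outside the band of their own slot."""
    period = len(band)
    flags = []
    for i, v in enumerate(values):
        lo, hi = band[(start_slot + i) % period]
        flags.append(v < lo or v > hi)
    return flags

history = [10, 50, 80, 20,  11, 52, 78, 22,  9, 49, 82, 19,
           10, 51, 79, 21,  12, 50, 81, 20]  # five cycles of length 4
band = seasonal_quantile_band(history, period=4)
flags = outside_band([10, 51, 20, 21], band)
print(flags)  # [False, False, True, False]
```

The value 20 is normal for the last slot but anomalous in the third one, which is exactly the contextual case: the same number is judged against the interval for its own position in the cycle.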

3.5. Time series and regime changes

ARIMA/ETS/Prophet/TBATS: forecast + deviation.
Change-point detection: BOCPD, RuLSIF/divergence criteria, Pruned Exact Linear Time (PELT).
Matrix Profile/discord discovery: search for "the most dissimilar subsequences."
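
To make the change-point idea concrete, here is a minimal sketch that finds a single mean shift by exhaustive search over split positions, minimizing the within-segment squared error. It is not BOCPD or PELT (those handle multiple change-points and, in PELT's case, add a penalty and pruning); the function name and the min_size parameter are assumptions for this illustration.

```python
def best_single_change_point(values, min_size=2):
    """Return (split_index, cost_reduction) for the best single mean-shift split."""
    def sse(segment):
        m = sum(segment) / len(segment)
        return sum((v - m) ** 2 for v in segment)

    total = sse(values)  # cost with no change-point
    best_idx, best_cost = None, total
    for i in range(min_size, len(values) - min_size + 1):
        cost = sse(values[:i]) + sse(values[i:])
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx, total - best_cost

series = [5, 5, 6, 5, 5, 6, 9, 10, 9, 10, 9, 10]
idx, gain = best_single_change_point(series)
print(idx)  # 6: the new regime starts at index 6
```

In practice the cost reduction should be compared with a penalty or a significance threshold, otherwise noise alone will always yield some "best" split.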

3.6. Multivariate and graph

Multivariate time series: VAR, TCN/TFT, LSTM-VAE; cross-correlations and joint confidence intervals.
Graphs: anomalous subpaths/nodes (for example, in network traffic or payment chains).

4) Method selection: practical matrix

Scenario	Data	Recommendation
Sales metrics, telemetry	Stream, seasonality	EWMA/CUSUM + quantile bands; then Isolation Forest as a second layer
Fraud/transactions	Imbalanced tabular data	LOF/Isolation Forest as a baseline → Autoencoder/VAE; add domain rules
Sales/Market	Daily series	Prophet/TBATS + quantile intervals; change-point detection for structural breaks
Data quality	Raw logs	Quality rules + statistics; alerts on schema changes/NULLs/duplicates
Event streams	Real time	Online versions of CUSUM/EWMA + lightweight one-class models; latency budget
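
For the streaming rows of the table, an online EWMA detector can be sketched as a small class that keeps only a running mean and variance. The class name EwmaDetector, its parameters (alpha, n_sigmas, warmup, min_std), and the choice not to update the baseline on flagged points are all assumptions of this sketch.

```python
import math

class EwmaDetector:
    """Online EWMA control band: flags points outside mean +/- n_sigmas * std."""

    def __init__(self, alpha=0.2, n_sigmas=3.0, warmup=5, min_std=1e-6):
        self.alpha = alpha
        self.n_sigmas = n_sigmas
        self.warmup = warmup      # number of points seen before alerts are allowed
        self.min_std = min_std    # floor so a constant series does not give a zero-width band
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Score x against the current band, then fold it into the state."""
        if self.mean is None:
            self.mean = float(x)
            self.n = 1
            return False
        deviation = x - self.mean
        std = max(math.sqrt(self.var), self.min_std)
        is_anomaly = self.n >= self.warmup and abs(deviation) > self.n_sigmas * std
        if not is_anomaly:  # do not let flagged points drag the baseline
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.n += 1
        return is_anomaly

detector = EwmaDetector()
stream = [10, 11, 10, 12, 11, 10, 11, 30, 11, 10]
flags = [detector.update(x) for x in stream]
print(flags.index(True))  # 7
```

Skipping the update on flagged points keeps one spike from widening the band, but it also means a genuine level shift keeps alerting until the state is reset; that trade-off is where change-point detection comes in.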

5) Quality assessment for rare anomalies

Imbalance: ROC-AUC may be misleading; focus on PR-AUC, precision@k, recall at FPR ≤ x%, F1, Matthews correlation coefficient (MCC).
Time metrics: Average Time To Detect (ATTD), share of "early detections."
Stability: share of flapping alerts (frequent on/off switching), average length of "quiet" periods.
Cost-based: cost matrix (false positive/false negative), value of incidents averted.
Validation: time-based splits, out-of-time windows, group splits (by user/device), backtests.
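
The ranking metrics above can be computed without any library. The sketch below implements precision@k and average precision (a common step-wise estimate of PR-AUC); the function names and the toy labels/scores are illustrative, and ties between scores are not treated specially.

```python
def precision_at_k(labels, scores, k):
    """Share of true anomalies among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

def average_precision(labels, scores):
    """Mean of precision values taken at the rank of each true anomaly."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
scores = [0.1, 0.4, 0.9, 0.2, 0.8, 0.7, 0.3, 0.05, 0.15, 0.35]
p_at_3 = precision_at_k(labels, scores, k=3)
ap = average_precision(labels, scores)
print(round(p_at_3, 3), round(ap, 3))  # 0.667 0.867
```

precision@k maps directly onto analyst capacity: if a shift can review k alerts, it is the share of those reviews that will not be wasted.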

6) Threshold strategies and calibration

Static thresholds: simple, but break under seasonality.
Dynamic: per-segment/per-hour quantiles that adapt to peak loads and quiet hours.
Score percentiles: 99.5th/99.9th for high precision; can be computed per bucket or category.
Score calibration: isotonic regression/temperature scaling for probabilities; alert smoothing (debounce, "N of M").
Hysteresis: different thresholds for entering/exiting the anomaly state.
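
Hysteresis and "N of M" smoothing are easy to sketch as two small functions over a sequence of scores or flags. The function names, the thresholds 0.8/0.5, and the 2-of-3 rule are assumptions chosen only to make the behavior visible.

```python
from collections import deque

def hysteresis_states(scores, enter_at=0.8, exit_at=0.5):
    """Enter the anomaly state at enter_at, leave it only at or below exit_at."""
    state, out = False, []
    for s in scores:
        if not state and s >= enter_at:
            state = True
        elif state and s <= exit_at:
            state = False
        out.append(state)
    return out

def n_of_m(flags, n=2, m=3):
    """Raise an alert only if at least n of the last m raw flags are set."""
    window, out = deque(maxlen=m), []
    for f in flags:
        window.append(bool(f))
        out.append(sum(window) >= n)
    return out

scores = [0.2, 0.85, 0.7, 0.6, 0.45, 0.7, 0.9]
states = hysteresis_states(scores)
print(states)  # [False, True, True, True, False, False, True]

raw = [True, False, False, True, True, False, True]
print(n_of_m(raw))  # [False, False, False, False, True, True, True]
```

With a single threshold of 0.8 the same scores would switch the alert off right after the first point; the lower exit threshold keeps one incident open instead of producing flapping.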

7) Interpretability and RCA (root cause analysis)

Global: gain/permutation importance, PCA loadings, segment profiles, per-component contribution to reconstruction error.
Local: SHAP/LIME on the scores or on surrogate models.
Series attribution: contribution of trend/seasonality/regressors (holidays, campaigns).
Drill-down: "anomalous segment → anomalous feature → anomalous objects."
Causality: difference-in-differences/counterfactuals to separate a marketing effect from a "true" anomaly.

8) Production and MLOps

Serving: synchronous (low latency, gRPC/REST) and asynchronous (batch/microbatch).
Feature store: online/offline consistency, point-in-time correctness, SLA for feature generation.
Versioning: models, thresholds, schemas, configs; store artifacts and data snapshots.
Alerting: prioritization (P1-P3), deduplication, suppression windows (nights/holidays), auto-close on normalization.
Fail-safe: automatic degradation to rules/simple detectors, timeouts, QPS limits.
Shadow/Canary: comparison of the new detector with the current one; rollout offline → shadow → canary → full.
Feedback loop: alert labeling interface, semi-automatic relabeling and retraining.

9) Alert-fatigue reduction

Bundling: group alerts that are close in time/segment into one incident.
SLO on alerts: targets for precision and for the number of alerts per shift.
Escalation policy: priority rises with duration/scale.
Rate limiting: no more than N alerts per window; "quiet period" after triggering.
Two-level scheme: cheap coarse detector (high recall) + expensive precision verifier.
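
Bundling can be sketched as a single pass over time-ordered alerts that merges an alert into the open incident of its segment when the gap is small enough. The alert shape (timestamp in seconds, segment name), the function name bundle_alerts, and the 300-second gap are hypothetical.

```python
def bundle_alerts(alerts, max_gap=300):
    """Group (timestamp_seconds, segment) alerts into incidents per segment."""
    incidents = []
    open_by_segment = {}
    for ts, segment in sorted(alerts):
        current = open_by_segment.get(segment)
        if current is not None and ts - current["end"] <= max_gap:
            current["end"] = ts          # extend the open incident
            current["count"] += 1
        else:
            current = {"segment": segment, "start": ts, "end": ts, "count": 1}
            open_by_segment[segment] = current
            incidents.append(current)
    return incidents

alerts = [(0, "eu"), (60, "eu"), (100, "us"), (200, "eu"), (2000, "eu")]
incidents = bundle_alerts(alerts)
print(len(incidents))  # 3 incidents instead of 5 separate alerts
```

The same structure extends naturally to rate limiting: instead of opening a new incident after the gap, drop or queue alerts once a per-window counter is exceeded.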

10) Implementation checklist

Types of anomalies and business value of their detection identified
Seasonality/calendar taken into account; contextual features built
Fast baseline + potentially more complex method selected
Threshold strategy (dynamic/per-segment) and hysteresis defined
Metrics: PR-AUC, ATTD, cost metrics, segment reports
Interpretation and RCA plan; drill-down dashboards
Alert policies, suppression, deduplication
Logging of scores, versions, input features; replay backtests
Retraining procedures and drift control (PSI/JS divergence)
Documentation: data contracts, SLOs, runbooks

11) Typical patterns

"Forecast + deviation": train a probabilistic forecast (5-95% quantiles); signal when the observation leaves the interval.
"Reconstructor": Autoencoder/Robust PCA → alert on high reconstruction error.
"Isolator": Isolation Forest for tabular/multi-feature data; fast, few hyperparameters.
"Local rarity": LOF/kNN distance; good for segments with different densities.
"Regime change": BOCPD/PELT + cause validation (release, promotion, incident).
"Two-stage": rule-based filter → ML verifier (fewer false positives).

12) Detector monitoring

Quality: PR-AUC/precision@k/ATTD over a sliding window, share of confirmed alerts.
Data: missing values, lags, unusual cardinality, bursts of events.
Drift: PSI/KL/JS on key features and scores, target drift (if labels exist).
Operations: inference latency, QPS, fault tolerance, share of requests served in degraded mode.
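
PSI for a single numeric feature or score can be computed as below. This is a minimal sketch: equal-width bins taken from the baseline, clamping of out-of-range values into the edge bins, and the eps floor for empty bins are all assumptions, and other binning schemes (for example baseline quantiles) are equally common.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index: sum((a - e) * ln(a / e)) over bin shares."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant baseline: everything lands in the edge bins

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = list(range(100))
shifted = [v + 30 for v in baseline]
print(psi(baseline, baseline))        # 0.0
print(psi(baseline, shifted) > 0.25)  # True
```

A frequently quoted rule of thumb treats values below roughly 0.1 as stable and above roughly 0.25 as a significant shift, but the cutoffs depend on bin count and sample size and are worth validating on your own history.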

13) Labeling and active learning

Labeling strategies: top-k by score, diversity across clusters, "borderline" cases.
Synthetic data: controlled anomaly injection for stress tests.
Active learning: we ask analysts to label ambiguous incidents.
Weak supervision: rules/heuristics as weak labels + label aggregators.
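
Controlled anomaly injection can be sketched as a function that adds spikes of known size at known positions and returns the matching labels, so any detector can be scored against ground truth. The function name inject_spikes, the spike size in standard deviations, and the fixed seed are assumptions of this illustration; only point spikes are covered.

```python
import random

def inject_spikes(values, n_anomalies=3, magnitude=5.0, seed=0):
    """Return (corrupted_copy, labels) with spikes of +/- magnitude * std added."""
    rng = random.Random(seed)  # seeded for reproducible stress tests
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    positions = sorted(rng.sample(range(len(values)), n_anomalies))
    corrupted = list(values)  # the input series is left untouched
    for pos in positions:
        sign = rng.choice((-1, 1))
        corrupted[pos] += sign * magnitude * std
    labels = [1 if i in positions else 0 for i in range(len(values))]
    return corrupted, labels

base = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.5, 8.5, 10.0, 10.0]
corrupted, labels = inject_spikes(base, n_anomalies=3, magnitude=5.0, seed=42)
print(sum(labels))  # 3
```

Sweeping magnitude downward until recall drops gives a rough picture of the smallest deviation the detector can still catch.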

14) Safety, Ethics, Compliance

Privacy: field minimization, pseudonymization, role-based access.
Transparency: explainability of alert causes and automation actions.
Audit: decision log, reproducibility of thresholds/versions/data.
Fairness: bias control by segment (especially for anti-fraud/scoring).

Mini Glossary

Change-point: the moment when the distribution/regime of a series changes.
PR-AUC: area under the precision-recall curve; informative when positives are rare.
PSI: population stability index, a distribution drift metric.
Matrix Profile/Discord: a way to find the "most dissimilar" subsequence.

Summary

An effective anomaly detection loop is not a single "smart" algorithm but a combination: the right context (seasonality/calendar), robust features, a well-thought-out threshold policy, interpretability and RCA, disciplined operations (SLOs/alert policies), and a cycle of improvement through feedback. This approach reduces false alarms and increases the real value of anomaly detection, from early detection of failures to loss prevention.
