Model monitoring
1) Why
The goal is to maintain the quality and safety of the model's decisions in production while complying with SLA/SLO, RG/AML/Legal requirements and budgets. Monitoring should detect degradation early (data, calibration, latency, cost), minimize the expected cost of errors, and ensure reproducibility and auditability.
2) Monitoring areas (map)
1. Availability and performance: latency p95/p99, error-rate, RPS, autoscale.
2. Prediction quality: PR-AUC/KS (on online labels), calibration (ECE), expected-cost @ threshold.
3. Drift and stability: PSI/KL across features and scores, changes in distributions/categories.
4. Coverage and completeness: share of successfully served requests, share of "empty" features, cache hit-rate.
5. Slice/Fairness: metrics by market/provider/device/account age.
6. Guardrails (RG/AML): policy violations, intervention frequencies, false positives/negatives.
7. Cost: cost/request, cost/feature, GPU/CPU time, small-files/IO (for batch/near-RT).
8. Data/contracts: feature schema, versions, online/offline equivalence.
3) SLI/SLO (reference targets for iGaming)
Latency p95: personalization ≤ 150 ms, RG/AML alerts ≤ 5 s e2e.
Availability: ≥ 99.9%.
Error-rate 5xx: ≤ 0.5% in a 5-minute window.
Coverage: ≥ 99% of requests receive a valid score and decision.
Label freshness for online evaluation: D+1 (daily); for fast proxies ≤ 1 hour.
Drift PSI: per feature/score < 0.2 (warning from 0.1).
Calibration ECE: ≤ 0.05.
Expected-cost_live: no higher than the baseline model + X% (target X is chosen by the business).
4) Signals and formulas
4.1 Drift
PSI: sum over bins of (prod share − train share) · ln(prod share / train share), train vs prod (sketch below).
KL divergence: sensitive to thin tails; monitor for key features/scores.
KS for scores (when labels are available): the maximum CDF difference between positives and negatives.
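A minimal sketch of the PSI signal above, assuming quantile bins fitted on the training sample; the bin count and smoothing epsilon are illustrative choices, not a fixed standard.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of one feature/score: train (expected) vs prod (actual)."""
    # Bin edges come from the training distribution; outer edges opened to +/- inf.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# PSI > 0.2 corresponds to the alert threshold in sections 3 and 7.
```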
4.2 Calibration
ECE (expected calibration error): bin predictions by probability and take the count-weighted average gap between the mean predicted probability and the observed positive rate per bin (sketch below).
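A minimal ECE sketch consistent with that definition; equal-width probability bins are assumed, and inputs are taken to be numpy-compatible arrays of labels and predicted probabilities.

```python
import numpy as np

def ece(y_true, y_prob, bins=10):
    """Expected Calibration Error with equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, bins - 1)
    err = 0.0
    for b in range(bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            err += mask.sum() / len(y_prob) * gap
    return err

# ECE > 0.05 breaches the calibration SLO from section 3; > 0.07 fires ECE_Bad.
```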
4.3 Expected-Cost
Minimize \(C = c_{fp}\cdot FPR + c_{fn}\cdot FNR\) at the working threshold; compute it online over a sliding window as delayed labels arrive.
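A direct implementation of this cost (a sketch; the guards against empty classes are an assumption), intended to be evaluated over a trailing window of requests whose labels have already arrived.

```python
import numpy as np

def expected_cost(y_true, y_pred, c_fp, c_fn):
    """C = c_fp * FPR + c_fn * FNR for binary decisions at a fixed threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    neg, pos = (~y_true).sum(), y_true.sum()
    fpr = (y_pred & ~y_true).sum() / max(neg, 1)   # guard against empty classes
    fnr = (~y_pred & y_true).sum() / max(pos, 1)
    return c_fp * fpr + c_fn * fnr

# Online: feed the last N hours/days of requests with matured labels.
```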
5) Label sources
Online labels (fast proxies): deposit event within 7 days, click/conversion, completed RG case.
Delayed labels: chargeback/fraud (45-90 days), long-term churn/LTV.
Rules: join strictly as-of-time; never use events "from the future" (see the sketch below).
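One way to enforce the as-of-time rule when backfilling training/evaluation data, using pandas.merge_asof so each scored row only sees events at or before its own timestamp; the frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: scoring requests and deposit events used as a feature/label source.
requests = pd.DataFrame({
    "player_id": [1, 1, 2],
    "scored_at": pd.to_datetime(["2024-05-01", "2024-05-10", "2024-05-03"]),
})
deposits = pd.DataFrame({
    "player_id": [1, 2],
    "deposited_at": pd.to_datetime(["2024-05-05", "2024-05-20"]),
})

# As-of join: a request may only see deposit events up to its own timestamp.
joined = pd.merge_asof(
    requests.sort_values("scored_at"),
    deposits.sort_values("deposited_at"),
    left_on="scored_at", right_on="deposited_at",
    by="player_id", direction="backward",
)
# Rows with deposited_at == NaT had no event in the past at scoring time.
```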
6) Dashboards (minimum composition)
1. Operating: RPS, p50/p95/p99 latency, 4xx/5xx, saturation, autoscaling.
2. Quality: score-distribution, PR-AUC (on proxy labels), ECE, expected-cost, KS.
3. Drift: PSI/KL for top features, new/unseen categories, missing-rate, feature-fetch latency.
4. Slice/Fairness: PR-AUC/ECE/expected-cost by market/provider/device.
5. Guardrails: RG/AML violations, interventions/1k requests, false-stop rate.
6. Cost: cost/request, CPU/GPU time, cache hit-rate, external lookups.
7) Alerting (example rules)
HighP95Latency: p95 > 150 ms (5 min) → page SRE/MLOps.
ErrorBurst: 5xx > 0.5% (5 min) → rollback script ready.
PSI_Drift: PSI(amount_base) > 0.2 (15 min) → warm-up retrain.
ECE_Bad: ECE > 0.07 (30 min) → rebuild calibration/thresholds.
ExpectedCost_Up: +X% vs the baseline (1 day) → consider rollback/threshold revision.
Slice_Failure: PR-AUC in market R dropped > Y% (1 day) → ticket to the domain owner.
Guardrails_Breach: share of aggressive offers > cap → immediate kill-switch.
8) Logging and tracing
Request logs (minimum): `request_id`, `trace_id`, `model_id/version`, `feature_version`, `feature_stats` (missing %, extremes), `score`, `decision`, `threshold`, `policy_id`, `guard_mask`, `latency_ms`, `cost_estimate`, (optional) explanations (SHAP top-k); see the example entry below.
OTel traces: spans `feature_fetch` → `preprocess` → `score` → `postprocess` → `guardrail`.
PII: aliases/tokens only; masking per policy, key residency.
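A hypothetical decision-log entry covering the minimum fields listed above; all identifiers and values are illustrative, not a fixed schema.

```python
import json

log_entry = {
    "request_id": "req-7f3a", "trace_id": "4bf92f35",          # hypothetical IDs
    "model_id": "deposit_propensity", "model_version": "1.4.2",
    "feature_version": "fs_v12",
    "feature_stats": {"missing_pct": 0.8, "extremes": ["amount_base"]},
    "score": 0.83, "threshold": 0.62, "decision": "offer",
    "policy_id": "policy_eea_v3", "guard_mask": ["rg_cooldown"],
    "latency_ms": 41, "cost_estimate": 0.0004,
}
print(json.dumps(log_entry))  # emit as structured JSON to the log pipeline
```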
9) Online quality assessment
Sliding windows for PR-AUC/KS on fast labels (hourly/daily); see the sketch below.
Delayed labels: D+7/D+30/D+90 retrospective reports, expected-cost adjustments.
Calibration: isotonic/Platt re-fit on D+1, automatic refresh of the calibration artifact.
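A sketch of the sliding-window PR-AUC on proxy labels, assuming a dataframe of scored requests whose fast labels have already matured; the 24-hour window and column names are assumptions.

```python
import pandas as pd
from sklearn.metrics import average_precision_score

def rolling_pr_auc(df, window="24h"):
    """PR-AUC (average precision) over a trailing window.
    df columns: scored_at (datetime), score (float), label (0/1 proxy label)."""
    cutoff = df["scored_at"].max() - pd.Timedelta(window)
    recent = df[df["scored_at"] >= cutoff]
    if recent["label"].nunique() < 2:   # need both classes in the window
        return None
    return average_precision_score(recent["label"], recent["score"])
```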
10) Decision threshold and policy
We keep the threshold as a config in the registry; online, we track expected-cost and adjust the threshold within a permitted range (rate-limited), as sketched below.
Safety caps: upper/lower bounds on actions; manual override for compliance.
Threshold backtesting: nightly simulation on the previous day's data.
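A minimal sketch of the rate-limited threshold update: the cost-optimal threshold is approached in bounded steps and clamped to a safety band; all numeric limits are illustrative.

```python
def update_threshold(current, cost_optimal,
                     lower=0.30, upper=0.80, max_step=0.02):
    """Move the serving threshold toward the cost-optimal one,
    never outside [lower, upper] and never by more than max_step per cycle."""
    step = max(-max_step, min(max_step, cost_optimal - current))
    return min(upper, max(lower, current + step))

# Example: the cost-optimal threshold jumps from 0.55 to 0.70,
# but the served threshold moves only one step this cycle.
print(round(update_threshold(0.55, 0.70), 2))  # -> 0.57
```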
11) Slice & Fairness
Segments: market/jurisdiction, provider, device/ASN, account age, deposit volume.
Metrics: PR-AUC, ECE, expected-cost, FPR/TPR differences (equalized odds), disparate impact (sketch below).
Actions: per-slice calibration/thresholds, retraining with sample weights, feature review.
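A sketch of the per-slice metrics: group labelled, scored requests by segment and compare PR-AUC and FPR across slices; the dataframe layout and the slice column are assumptions.

```python
import pandas as pd
from sklearn.metrics import average_precision_score

def slice_metrics(df, slice_col, threshold):
    """Per-slice PR-AUC and FPR; df needs label (0/1), score, and the slice column."""
    rows = []
    for name, g in df.groupby(slice_col):
        if g["label"].nunique() < 2:
            continue  # skip slices without both classes
        pred = g["score"] >= threshold
        neg = g["label"] == 0
        rows.append({
            slice_col: name,
            "pr_auc": average_precision_score(g["label"], g["score"]),
            "fpr": (pred & neg).sum() / max(neg.sum(), 1),
            "n": len(g),
        })
    return pd.DataFrame(rows)

# Large PR-AUC/FPR gaps between markets feed the Slice_Failure alert.
```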
12) Equivalence online/offline
Feature equality test: MAE/MAPE on a control sample; alert when divergence exceeds the threshold (see the sketch below).
Versioning: `feature_spec_version`, `logic_version`; WORM archive.
Schema contracts: breaking changes are not allowed without dual-write (v1/v2).
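A sketch of the online/offline equivalence check: compute MAPE per feature on a control sample served by both paths and flag the divergent ones. It assumes both frames are aligned on the same control request ids, and the 1% threshold is illustrative.

```python
import numpy as np
import pandas as pd

def equivalence_report(online: pd.DataFrame, offline: pd.DataFrame,
                       mape_threshold=0.01):
    """Compare the same control rows computed online vs offline, per feature."""
    report = {}
    for col in online.columns:
        denom = np.maximum(np.abs(offline[col].to_numpy(dtype=float)), 1e-9)
        diff = np.abs(online[col].to_numpy(dtype=float) - offline[col].to_numpy(dtype=float))
        mape = float(np.mean(diff / denom))
        report[col] = {"mape": mape, "alert": mape > mape_threshold}
    return report

# Any feature with alert=True blocks release until the contract is reconciled.
```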
13) Guardrails (RG/AML)
Pre-/post-filtering of actions, frequency limits, cooldowns, block lists (see the sketch below).
Logs of `policy_id/propensity/mask/decision`; violation reports.
Time-to-intervention and false-intervention rate metrics.
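A sketch of a post-filter guardrail with a cooldown and a frequency cap; the limits, state fields and decision labels are hypothetical.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=24)     # illustrative limits
MAX_OFFERS_PER_WEEK = 3

def apply_guardrails(decision, player_state, now=None):
    """Post-filter: suppress an 'offer' decision that violates RG limits."""
    now = now or datetime.utcnow()
    if decision != "offer":
        return decision, None
    if player_state.get("rg_blocked"):
        return "suppress", "rg_blocklist"
    last = player_state.get("last_offer_at")
    if last and now - last < COOLDOWN:
        return "suppress", "cooldown"
    if player_state.get("offers_this_week", 0) >= MAX_OFFERS_PER_WEEK:
        return "suppress", "frequency_cap"
    return decision, None

# The returned reason is written to guard_mask in the decision log (section 8).
```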
14) Incidents and runbook
Scenarios and steps:
1. Latency↑/5xx↑: check external feature providers → enable cache/timeouts → scale → roll back if necessary.
2. PSI/ECE/Expected-cost deteriorated: freeze traffic (canary↓), enable fallback thresholds/model, run retrain.
3. Slice failure: temporary slice-specific threshold, ticket to the domain owner.
4. Guardrails breach: kill-switch, case audit, post-mortem.
15) Cost and performance
Profiling: fraction of time spent in feature-fetch vs score vs IO.
Cache strategies: TTL/eviction, hot features in RAM, cold ones fetched lazily.
Model quantization/optimization: FP16/INT8 while maintaining quality.
Chargeback: cost/request, cost/feature by team/market.
16) Examples (fragments)
Expected-cost threshold (pseudocode):

```python
import numpy as np

# expected_cost(y_true, y_pred, c_fp, c_fn) as sketched in section 4.3
thr_grid = np.linspace(0.01, 0.99, 99)
costs = [expected_cost(y_true, y_prob >= t, c_fp, c_fn) for t in thr_grid]
thr_best = thr_grid[np.argmin(costs)]
```
Prometheus (metric ideas):

```text
model_inference_latency_ms_bucket
feature_fetch_latency_ms_bucket
model_request_total{code}
model_score_distribution_bucket
psi_feature_amount_base
ece_calibration
expected_cost_live
slice_pr_auc{slice="EEA_mobile"}
```
Alert (idea):

```text
ALERT DriftDetected
  IF psi_feature_amount_base > 0.2 FOR 15m
```
17) Processes and RACI
R (Responsible): MLOps (observability/alerts/registry), Data Science (quality metrics/calibration/threshold), Data Eng (features/contracts/equivalence).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/DPO (PII/RG/AML/DSAR), Security (KMS/Audit), SRE (SLO/Incidents), Finance (Cost).
I (Informed): Product/Marketing/Operations/Support.
18) Roadmap
MVP (2-4 weeks):
1. Basic SLI/SLO (latency/5xx/coverage) + dashboard.
2. PSI for the top 10 features and the score distribution; ECE and expected-cost on proxy labels.
3. Decision logs + OTel traces; online/offline equivalence test.
4. Alerts HighP95Latency/PSI_Drift/ECE_Bad + runbooks.
Phase 2 (4-8 weeks):
- Slice/fairness panels, nightly backfill of metrics on delayed labels.
- Auto-recalibration and threshold simulator.
- Cost-dashboard and quotas/limits on features/replays.
- Drift-triggered auto-retrain/release with canary control.
- WORM archives of quality reports and artifacts.
- Chaos monitoring tests and DR exercises.
19) Delivery checklist
- SLI/SLO agreed and monitored in shadow/canary for ≥ 24 hours.
- PSI/KL, ECE, expected-cost and PR-AUC are computed online; thresholds and alerts are configured.
- Slice/fairness panels are enabled; segment owners are assigned.
- Logs/traces are complete (decisions, thresholds, masks); PII masking and residency requirements are met.
- Online/offline equivalence test is green; feature schemas are under contract.
- Runbooks and one-click rollback tested; kill-switch for guardrails in place.
- Costs fit within budgets; caches/quotas/limits are active.
- WORM archive of metrics/artifacts and quality reports is maintained.
20) Anti-patterns and risks
No online labels and no retrospective evaluation.
Monitoring ROC-AUC only, without expected-cost and calibration.
Ignoring slice/fairness → hidden failures in regions/devices.
No online/offline feature equivalence → "double reality."
No guardrails: toxic offers, RG/AML violations.
No rollback/DR plans, no WORM archive.
21) The bottom line
Model monitoring is an early-warning and risk/cost-management system, not a "look once a week" routine. Define SLOs, measure drift/calibration/expected-cost, track slices and guardrails, keep rollback/kill-switch buttons at hand, and automate reports and retraining. That way the models stay useful, ethical and compliant through any turbulence in data and traffic.