Model monitoring
1) Why
The goal is to maintain the quality and safety of the model's decisions in production while complying with SLA/SLO, RG/AML/Legal requirements and budgets. Monitoring should detect degradation early (data, calibration, latency, cost), minimize the expected cost of errors, and ensure reproducibility and auditability.
2) Monitoring areas (map)
1. Availability and performance: latency p95/p99, error-rate, RPS, autoscale.
2. Prediction quality: PR-AUC/KS (on online labels), calibration (ECE), expected-cost @ threshold.
3. Drift and stability: PSI/KL across features and scores, changes in distributions/categories.
4. Coverage and completeness: share of successfully served requests, share of "empty" features, cache hit-rate.
5. Slice/Fairness: metrics by market/provider/device/account age.
6. Guardrails (RG/AML): policy violations, intervention frequencies, false positives/negatives.
7. Cost: cost/request, cost/feature, GPU/CPU time, small-files/IO (for batch/near-RT).
8. Data/contracts: feature schema, versions, online/offline equivalence.
3) SLI/SLO (reference targets for iGaming)
Latency p95: personalization ≤ 150 ms, RG/AML alerts ≤ 5 s e2e.
Availability: ≥ 99.9%.
Error-rate 5xx: ≤ 0.5% in a 5-minute window.
Coverage: ≥ 99% of requests receive a valid score and decision.
Label freshness for online evaluation: D+1 (daily); for fast proxies ≤ 1 hour.
Drift PSI: per feature/score < 0.2 (warning from 0.1).
Calibration ECE: ≤ 0.05.
Expected-cost_live: no higher than the baseline model + X% (target X is chosen by the business).
4) Signals and formulas
4.1 Drift
PSI: sum over bins of (prod share − train share) · ln(prod share / train share), train vs prod (sketch below).
KL divergence: sensitive to thin tails; monitor for key features/scores.
KS for scores (when labels are available): the maximum CDF difference between positives and negatives.
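A minimal sketch of the PSI signal above, assuming quantile bins fitted on the training sample; the bin count and smoothing epsilon are illustrative choices, not a fixed standard.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of one feature/score: train (expected) vs prod (actual)."""
    # Bin edges come from the training distribution; outer edges opened to +/- inf.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# PSI > 0.2 corresponds to the alert threshold in sections 3 and 7.
```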
4.2 Calibration
ECE (expected calibration error): bin predictions by probability and take the count-weighted average gap between the mean predicted probability and the observed positive rate per bin (sketch below).
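A minimal ECE sketch consistent with that definition; equal-width probability bins are assumed, and inputs are taken to be numpy-compatible arrays of labels and predicted probabilities.

```python
import numpy as np

def ece(y_true, y_prob, bins=10):
    """Expected Calibration Error with equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, bins - 1)
    err = 0.0
    for b in range(bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            err += mask.sum() / len(y_prob) * gap
    return err

# ECE > 0.05 breaches the calibration SLO from section 3; > 0.07 fires ECE_Bad.
```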
4.3 Expected-Cost
Minimize \(C = c_{fp}\cdot FPR + c_{fn}\cdot FNR\) at the working threshold; compute it online over a sliding window as delayed labels arrive.
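A direct implementation of this cost (a sketch; the guards against empty classes are an assumption), intended to be evaluated over a trailing window of requests whose labels have already arrived.

```python
import numpy as np

def expected_cost(y_true, y_pred, c_fp, c_fn):
    """C = c_fp * FPR + c_fn * FNR for binary decisions at a fixed threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    neg, pos = (~y_true).sum(), y_true.sum()
    fpr = (y_pred & ~y_true).sum() / max(neg, 1)   # guard against empty classes
    fnr = (~y_pred & y_true).sum() / max(pos, 1)
    return c_fp * fpr + c_fn * fnr

# Online: feed the last N hours/days of requests with matured labels.
```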
5) Label sources
Online labels (fast proxies): deposit event within 7 days, click/conversion, completed RG case.
Delayed labels: chargeback/fraud (45-90 days), long-term churn/LTV.
Rules: join strictly as-of-time; never use events "from the future" (see the sketch below).
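One way to enforce the as-of-time rule when backfilling training/evaluation data, using pandas.merge_asof so each scored row only sees events at or before its own timestamp; the frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: scoring requests and deposit events used as a feature/label source.
requests = pd.DataFrame({
    "player_id": [1, 1, 2],
    "scored_at": pd.to_datetime(["2024-05-01", "2024-05-10", "2024-05-03"]),
})
deposits = pd.DataFrame({
    "player_id": [1, 2],
    "deposited_at": pd.to_datetime(["2024-05-05", "2024-05-20"]),
})

# As-of join: a request may only see deposit events up to its own timestamp.
joined = pd.merge_asof(
    requests.sort_values("scored_at"),
    deposits.sort_values("deposited_at"),
    left_on="scored_at", right_on="deposited_at",
    by="player_id", direction="backward",
)
# Rows with deposited_at == NaT had no event in the past at scoring time.
```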
6) Dashboards (minimum composition)
1. Operating: RPS, p50/p95/p99 latency, 4xx/5xx, saturation, autoscaling.
2. Quality: score-distribution, PR-AUC (on proxy labels), ECE, expected-cost, KS.
3. Drift: PSI/KL for top features, new/unseen categories, missing-rate, feature-fetch latency.
4. Slice/Fairness: PR-AUC/ECE/expected-cost by market/provider/device.
5. Guardrails: RG/AML violations, interventions/1k requests, false-stop rate.
6. Cost: cost/request, CPU/GPU time, cache hit-rate, external lookups.
7) Alerting (example rules)
HighP95Latency: p95 > 150 ms (5 min) → page SRE/MLOps.
ErrorBurst: 5xx > 0.5% (5 min) → rollback script ready.
PSI_Drift: PSI(amount_base) > 0.2 (15 min) → warm-up retrain.
ECE_Bad: ECE > 0.07 (30 min) → rebuild calibration/thresholds.
ExpectedCost_Up: +X% vs the baseline (1 day) → consider rollback/threshold revision.
Slice_Failure: PR-AUC in market R dropped > Y% (1 day) → ticket to the domain owner.
Guardrails_Breach: share of aggressive offers > cap → immediate kill-switch.
8) Logging and tracing
Request logs (minimum): `request_id`, `trace_id`, `model_id/version`, `feature_version`, `feature_stats` (missing %, extremes), `score`, `decision`, `threshold`, `policy_id`, `guard_mask`, `latency_ms`, `cost_estimate`, (optional) explanations (SHAP top-k); see the example entry below.
OTel traces: spans `feature_fetch` → `preprocess` → `score` → `postprocess` → `guardrail`.
PII: aliases/tokens only; masking per policy, key residency.
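A hypothetical decision-log entry covering the minimum fields listed above; all identifiers and values are illustrative, not a fixed schema.

```python
import json

log_entry = {
    "request_id": "req-7f3a", "trace_id": "4bf92f35",          # hypothetical IDs
    "model_id": "deposit_propensity", "model_version": "1.4.2",
    "feature_version": "fs_v12",
    "feature_stats": {"missing_pct": 0.8, "extremes": ["amount_base"]},
    "score": 0.83, "threshold": 0.62, "decision": "offer",
    "policy_id": "policy_eea_v3", "guard_mask": ["rg_cooldown"],
    "latency_ms": 41, "cost_estimate": 0.0004,
}
print(json.dumps(log_entry))  # emit as structured JSON to the log pipeline
```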
9) Online quality assessment
Sliding windows for PR-AUC/KS on fast labels (hourly/daily); see the sketch below.
Delayed labels: D+7/D+30/D+90 retrospective reports, expected-cost adjustments.
Calibration: isotonic/Platt re-fit on D+1, automatic refresh of the calibration artifact.
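A sketch of the sliding-window PR-AUC on proxy labels, assuming a dataframe of scored requests whose fast labels have already matured; the 24-hour window and column names are assumptions.

```python
import pandas as pd
from sklearn.metrics import average_precision_score

def rolling_pr_auc(df, window="24h"):
    """PR-AUC (average precision) over a trailing window.
    df columns: scored_at (datetime), score (float), label (0/1 proxy label)."""
    cutoff = df["scored_at"].max() - pd.Timedelta(window)
    recent = df[df["scored_at"] >= cutoff]
    if recent["label"].nunique() < 2:   # need both classes in the window
        return None
    return average_precision_score(recent["label"], recent["score"])
```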
10) Decision threshold and policy
We keep the threshold as a config in the registry; online, we track expected-cost and adjust the threshold within a permitted range (rate-limited), as sketched below.
Safety caps: upper/lower bounds on actions; manual override for compliance.
Threshold backtesting: nightly simulation on the previous day's data.
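A minimal sketch of the rate-limited threshold update: the cost-optimal threshold is approached in bounded steps and clamped to a safety band; all numeric limits are illustrative.

```python
def update_threshold(current, cost_optimal,
                     lower=0.30, upper=0.80, max_step=0.02):
    """Move the serving threshold toward the cost-optimal one,
    never outside [lower, upper] and never by more than max_step per cycle."""
    step = max(-max_step, min(max_step, cost_optimal - current))
    return min(upper, max(lower, current + step))

# Example: the cost-optimal threshold jumps from 0.55 to 0.70,
# but the served threshold moves only one step this cycle.
print(round(update_threshold(0.55, 0.70), 2))  # -> 0.57
```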
11) Slice & Fairness
Segments: market/jurisdiction, provider, device/ASN, account age, deposit volume.
Metrics: PR-AUC, ECE, expected-cost, FPR/TPR differences (equalized odds), disparate impact (sketch below).
Actions: per-slice calibration/thresholds, retraining with sample weights, feature review.
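A sketch of the per-slice metrics: group labelled, scored requests by segment and compare PR-AUC and FPR across slices; the dataframe layout and the slice column are assumptions.

```python
import pandas as pd
from sklearn.metrics import average_precision_score

def slice_metrics(df, slice_col, threshold):
    """Per-slice PR-AUC and FPR; df needs label (0/1), score, and the slice column."""
    rows = []
    for name, g in df.groupby(slice_col):
        if g["label"].nunique() < 2:
            continue  # skip slices without both classes
        pred = g["score"] >= threshold
        neg = g["label"] == 0
        rows.append({
            slice_col: name,
            "pr_auc": average_precision_score(g["label"], g["score"]),
            "fpr": (pred & neg).sum() / max(neg.sum(), 1),
            "n": len(g),
        })
    return pd.DataFrame(rows)

# Large PR-AUC/FPR gaps between markets feed the Slice_Failure alert.
```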
12) Equivalence online/offline
Feature equality test: MAE/MAPE on a control sample; alert when divergence exceeds the threshold (see the sketch below).
Versioning: `feature_spec_version`, `logic_version`; WORM archive.
Schema contracts: breaking changes are not allowed without dual-write (v1/v2).
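A sketch of the online/offline equivalence check: compute MAPE per feature on a control sample served by both paths and flag the divergent ones. It assumes both frames are aligned on the same control request ids, and the 1% threshold is illustrative.

```python
import numpy as np
import pandas as pd

def equivalence_report(online: pd.DataFrame, offline: pd.DataFrame,
                       mape_threshold=0.01):
    """Compare the same control rows computed online vs offline, per feature."""
    report = {}
    for col in online.columns:
        denom = np.maximum(np.abs(offline[col].to_numpy(dtype=float)), 1e-9)
        diff = np.abs(online[col].to_numpy(dtype=float) - offline[col].to_numpy(dtype=float))
        mape = float(np.mean(diff / denom))
        report[col] = {"mape": mape, "alert": mape > mape_threshold}
    return report

# Any feature with alert=True blocks release until the contract is reconciled.
```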
13) Guardrails (RG/AML)
Pre-/post-filtering of actions, frequency limits, cooldowns, block lists (see the sketch below).
Logs of `policy_id/propensity/mask/decision`; violation reports.
Time-to-intervention and false-intervention rate metrics.
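A sketch of a post-filter guardrail with a cooldown and a frequency cap; the limits, state fields and decision labels are hypothetical.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=24)     # illustrative limits
MAX_OFFERS_PER_WEEK = 3

def apply_guardrails(decision, player_state, now=None):
    """Post-filter: suppress an 'offer' decision that violates RG limits."""
    now = now or datetime.utcnow()
    if decision != "offer":
        return decision, None
    if player_state.get("rg_blocked"):
        return "suppress", "rg_blocklist"
    last = player_state.get("last_offer_at")
    if last and now - last < COOLDOWN:
        return "suppress", "cooldown"
    if player_state.get("offers_this_week", 0) >= MAX_OFFERS_PER_WEEK:
        return "suppress", "frequency_cap"
    return decision, None

# The returned reason is written to guard_mask in the decision log (section 8).
```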
14) Incidents and runbook
Scenarios and steps:
1. Latency↑/5xx↑: check external feature providers → enable cache/timeouts → scale → roll back if necessary.
2. PSI/ECE/Expected-cost deteriorated: freeze traffic (canary↓), enable fallback thresholds/model, run retrain.
3. Slice failure: temporary slice-specific threshold, ticket to the domain owner.
4. Guardrails breach: kill-switch, case audit, post-mortem.
15) Cost and performance
Profiling: fraction of time spent in feature-fetch vs score vs IO.
Cache strategies: TTL/eviction, hot features in RAM, cold ones fetched lazily.
Model quantization/optimization: FP16/INT8 while maintaining quality.
Chargeback: cost/request, cost/feature by team/market.
16) Examples (fragments)
Expected-cost threshold (pseudocode):

```python
import numpy as np

# expected_cost(y_true, y_pred, c_fp, c_fn) as sketched in section 4.3
thr_grid = np.linspace(0.01, 0.99, 99)
costs = [expected_cost(y_true, y_prob >= t, c_fp, c_fn) for t in thr_grid]
thr_best = thr_grid[np.argmin(costs)]
```
Prometheus (metric ideas):

```text
model_inference_latency_ms_bucket
feature_fetch_latency_ms_bucket
model_request_total{code}
model_score_distribution_bucket
psi_feature_amount_base
ece_calibration
expected_cost_live
slice_pr_auc{slice="EEA_mobile"}
```
Alert (idea):

```text
ALERT DriftDetected
  IF psi_feature_amount_base > 0.2 FOR 15m
```
17) Processes and RACI
R (Responsible): MLOps (observability/alerts/registry), Data Science (quality metrics/calibration/threshold), Data Eng (features/contracts/equivalence).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/DPO (PII/RG/AML/DSAR), Security (KMS/Audit), SRE (SLO/Incidents), Finance (Cost).
I (Informed): Product/Marketing/Operations/Support.
18) Roadmap
MVP (2-4 weeks):
1. Basic SLI/SLO (latency/5xx/coverage) + dashboard.
2. PSI for the top 10 features and the score distribution; ECE and expected-cost on proxy labels.
3. Decision logs + OTel traces; online/offline equivalence test.
4. Alerts HighP95Latency/PSI_Drift/ECE_Bad + runbooks.
Phase 2 (4-8 weeks):
- Slice/fairness panels, nightly backfill of metrics on delayed labels.
- Auto-recalibration and threshold simulator.
- Cost-dashboard and quotas/limits on features/replays.
- Drift-triggered auto-retrain/release with canary control.
- WORM archives of quality reports and artifacts.
- Chaos monitoring tests and DR exercises.
19) Delivery checklist
- SLI/SLO agreed and monitored in shadow/canary for ≥ 24 hours.
- PSI/KL, ECE, expected-cost and PR-AUC are computed online; thresholds and alerts are configured.
- Slice/fairness panels are enabled; segment owners are assigned.
- Logs/traces are complete (decisions, thresholds, masks); PII masking and residency requirements are met.
- Online/offline equivalence test is green; feature schemas are under contract.
- Runbooks and one-click rollback tested; kill-switch for guardrails in place.
- Costs fit within budgets; caches/quotas/limits are active.
- WORM archive of metrics/artifacts and quality reports is maintained.
20) Anti-patterns and risks
No online labels and no retrospective evaluation.
Monitoring ROC-AUC only, without expected-cost and calibration.
Ignoring slice/fairness → hidden failures in regions/devices.
No online/offline feature equivalence → "double reality."
No guardrails: toxic offers, RG/AML violations.
No rollback/DR plans, no WORM archive.
21) The bottom line
Model monitoring is an early-warning and risk/cost-management system, not a "look once a week" routine. Define SLOs, measure drift/calibration/expected-cost, track slices and guardrails, keep rollback/kill-switch buttons at hand, and automate reports and retraining. That way the models stay useful, ethical and compliant through any turbulence in data and traffic.