Machine learning in iGaming
1) Business cases and value
Product/revenue: LTV forecasting, churn prediction, deposit/purchase propensity, dynamic missions/quests, next-best-action/offer.
Marketing/CRM: look-alike, segmentation, real-time triggers, bonus optimization (ABO - Abuse-resistant Bonus Optimization).
Risk/Compliance: anti-fraud/AML (velocity, structuring, graph features), Responsible Gaming (RG) risk scoring, intervention triggers.
Operations/SRE: incident prediction, capacity/traffic forecasting, provider anomalies.
Finance: GGR/NGR forecasting, FX sensitivity, counterparty manipulation detection.
Typical impact: +3-7% Net Revenue from personalization, −20-40% fraud loss, −10-25% churn, RG response SLA < 5 s online.
2) Feature Engineering
Sources: gameplay, payments/PSP, authentication, devices/ASN/geo, RG/KYC/KYB, marketing UTM, provider logs, support/texts.
Basic features:
- Behavioral windows: number of bets/deposits and amounts per 10 min/hour/day, recency/frequency/monetary (see the sketch after this list).
- Sequences: chains of games, time since last activity, session characteristics.
- Geo/device: country/market, ASN, device/browser type.
- Graph: player-card-device-IP connections, components/centralities (fraud rings).
- Contextual: time of day/day of the week/market holidays, provider/genre/game volatility.
- RG/AML: limits, self-exclusions, screening flags, PEP/sanctions (via cache/asynchronous lookups).
- Normalize currencies and time (UTC + market locale).
- Historize dimensions (SCD II).
- Align online and offline transformations (shared Feature Store code).
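A minimal sketch of the windowed behavioral aggregates above, using pandas; the column names (user_pseudo_id, event_time, type, amount_base) mirror the SQL examples later in this document, and the toy data are purely illustrative.
```python
import pandas as pd

# Toy events table: pseudonymized user id, UTC timestamp, event type, base-currency amount.
events = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime([
        "2024-05-01 10:00", "2024-05-01 10:04", "2024-05-01 10:09",
        "2024-05-01 11:00", "2024-05-01 11:30",
    ], utc=True),
    "type": ["deposit", "bet", "deposit", "bet", "deposit"],
    "amount_base": [50.0, 5.0, 20.0, 10.0, 100.0],
})

def rolling_features(df: pd.DataFrame, window: str = "10min") -> pd.DataFrame:
    """Per-user rolling counts/sums over a time window (e.g. deposits in the last 10 minutes)."""
    df = df.sort_values("event_time").set_index("event_time")
    out = []
    for user, g in df.groupby("user_pseudo_id"):
        is_deposit = (g["type"] == "deposit").astype(int)
        feats = pd.DataFrame({
            "user_pseudo_id": user,
            "deposits_cnt_10m": is_deposit.rolling(window).sum(),
            "deposits_sum_10m": g["amount_base"].where(g["type"] == "deposit", 0.0).rolling(window).sum(),
        }, index=g.index)
        out.append(feats)
    return pd.concat(out).reset_index()

print(rolling_features(events))
```
In production the same definition would live in the Feature Store so that offline training and online streaming share one piece of code.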
3) Architecture: offline ↔ online
3.1 Offline loop
Lakehouse: Bronze → Silver (normalization/enrichment) → Gold (datasets).
Feature Store (offline): feature definition registry, point-in-time joins (sketched below), materialization of training sets.
Training: containers with pinned dependencies; experiment tracking (metrics/artifacts/data).
Validation: k-fold/temporal split, backtest, off-policy evaluation.
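To make the point-in-time join concrete, here is a small pandas sketch using pd.merge_asof; the tables and column names are illustrative, not a real Feature Store API.
```python
import pandas as pd

# Label table: one row per (user, as-of time) with the target observed after the as-of point.
labels = pd.DataFrame({
    "user_pseudo_id": ["u1", "u2"],
    "asof_time": pd.to_datetime(["2024-05-02", "2024-05-03"], utc=True),
    "label_churn_30d": [1, 0],
})

# Feature snapshots: values as they were known at computation time.
features = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u2"],
    "computed_at": pd.to_datetime(["2024-04-30", "2024-05-05", "2024-05-01"], utc=True),
    "dep_30d": [120.0, 300.0, 40.0],
})

# Point-in-time join: for each label row, take the latest feature value computed
# strictly before the as-of time, so no future information leaks into training.
train = pd.merge_asof(
    labels.sort_values("asof_time"),
    features.sort_values("computed_at"),
    left_on="asof_time",
    right_on="computed_at",
    by="user_pseudo_id",
    direction="backward",
    allow_exact_matches=False,
)
print(train[["user_pseudo_id", "asof_time", "dep_30d", "label_churn_30d"]])
```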
3.2 Online loop
Ingest → Stream Processing: Flink/Spark/Beam with windows/watermarks, idempotency.
Feature Store (online): low-latency cache (Redis/Scylla) + snapshots backfilled from the offline store.
Serving: REST/gRPC endpoints, scoring graph, A/B routing, canary releases.
Real-time marts: ClickHouse/Pinot for dashboards/rules.
4) Model classes and approaches
Classification/scoring: churn/deposit/fraud/RG (LogReg, XGBoost/LightGBM, TabNet, CatBoost); a baseline training sketch follows this list.
Ranking/recommendations: factorization/list-ranking (LambdaMART), seq2rec (RNN/Transformers), contextual bandits.
Anomalies: Isolation Forest, One-Class SVM, AutoEncoder, Prophet/TSfresh for time series.
Graph: Node2Vec/GraphSAGE/GNN for fraud rings.
Causality: uplift models, T-learner/X-learner, DoWhy/CausalML.
NLP/ASR: tickets/chats, classification of complaints, sentiment, topics.
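As a sketch of the baseline classification setup (the XGBoost option from the list above), trained here on synthetic tabular features; a real pipeline would pull features from the Feature Store and use a temporal split rather than a random one.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score
from xgboost import XGBClassifier

rng = np.random.default_rng(42)

# Synthetic features standing in for Feature Store outputs (dep_30d, bets_7d, recency, ...).
n = 5000
X = rng.normal(size=(n, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 2.0).astype(int)  # ~10% positives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# scale_pos_weight compensates for class imbalance (churners/fraudsters are usually the minority).
model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    scale_pos_weight=(y_train == 0).sum() / max((y_train == 1).sum(), 1),
    eval_metric="aucpr",
)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", round(roc_auc_score(y_test, proba), 3))
print("PR-AUC :", round(average_precision_score(y_test, proba), 3))
```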
5) Quality metrics
Classification: ROC-AUC/PR-AUC, F1 at operational thresholds, expected cost (weighted FP/FN), KS for risk scoring.
Recommendations: NDCG@K, MAP@K, coverage/diversity, CTR/CVR online.
TS/Forecast: MAPE/SMAPE, WAPE, P50/P90 error, PI coverage.
RG/AML: precision/recall within SLA, mean time-to-intervention.
Business impact: Net Revenue uplift, fraud savings, campaign ROI, % bonus abuse.
6) Evaluation and experiments
Offline: temporal split, backtest by week/market/tenant.
Online: A/B/n, CUPED/diff-in-diff, sequential tests.
Off-policy: IPS/DR for personalization policies.
Statistical power: sample-size calculation based on variance and MDE.
```python
cost_fp = 5.0   # cost of a false alarm
cost_fn = 50.0  # cost of missed fraud
threshold = pick_by_expected_cost(scores, labels, cost_fp, cost_fn)
```
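pick_by_expected_cost above is not a standard library function; a minimal NumPy implementation of the idea might look like this:
```python
import numpy as np

def pick_by_expected_cost(scores, labels, cost_fp, cost_fn):
    """Return the threshold that minimizes expected cost = cost_fp * FP + cost_fn * FN."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_cost = 0.5, np.inf
    for t in np.unique(scores):
        pred = (scores >= t).astype(int)
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Example with the costs from the snippet above.
scores = np.array([0.10, 0.35, 0.40, 0.80, 0.95])
labels = np.array([0, 1, 0, 1, 1])
print(pick_by_expected_cost(scores, labels, cost_fp=5.0, cost_fn=50.0))
```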
7) Privacy, ethics, compliance
PII minimization: pseudonyms, isolated mapping tables, CLS/RLS.
Residency: separate EEA/UK/BR environments; no cross-regional joins without a legal basis.
DSAR/RTBF: deletion/redaction in features and logs; Legal Hold for cases/reporting.
Fairness/bias: feature audits, disparate impact analysis, control of proxy variables.
Explainability: SHAP/feature importance, model cards (owner, date, data, metrics, risks).
Security: KMS/CMK, secrets kept out of logs, WORM archives of releases.
8) MLOps: lifecycle
1. Data & Features: schemas/contracts, DQ rules (completeness/uniqueness/range/temporal), lineage.
2. Training: containers, auto-tuning, experiment tracking.
3. Validation: schema compatibility tests, bias/fairness, performance tests.
4. Release (CI/CD/CT): canary/phased rollouts, feature flags, "dark launch."
5. Serving: autoscaling, caching, gRPC/REST, timeouts/retries.
6. Monitoring: data/prediction drift (PSI/KL; see the PSI sketch after this list), latency p95, error rate, coverage, "silent metrics."
7. Retraining: on a schedule or triggered by drift/metric degradation.
8. Incidents: runbook, model rollback, fallback (rule/simple model).
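A minimal sketch of the PSI drift check referenced in the monitoring step; the quantile binning and the 0.2 alert threshold are common conventions, not fixed rules.
```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference (training) and a live score distribution."""
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    edges[0] -= 1e-9  # make the first bin inclusive of the minimum
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(7)
train_scores = rng.beta(2.0, 5.0, size=10_000)  # reference (training-time) score distribution
live_scores = rng.beta(2.5, 4.0, size=10_000)   # drifted live distribution
print(f"PSI = {psi(train_scores, live_scores):.3f}")  # rule of thumb: > 0.2 signals significant drift
```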
9) Feature Store (consistency kernel)
Offline: point-in-time computation, leakage prevention, versioning of feature formulas.
Online: low latency (≤ 10-30 ms), TTL, consistency with offline.
Contracts: name/description, owner, SLA, formula, online/offline consistency tests.
```yaml
name: deposits_sum_10m
owner: ml-risk
slo: {latency_ms_p95: 20, availability: 0.999}
offline:
  source: silver.payments
  transform: "SUM(amount_base) OVER 10m BY user_pseudo_id"
online:
  compute: "streaming_window: 10m"
tests:
  - compare_online_offline_max_abs_diff: 0.5
```
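A sketch of how the compare_online_offline_max_abs_diff test from the contract above could be enforced in CI; fetch_offline and fetch_online are placeholders for the real Feature Store clients, with hard-coded toy values.
```python
import pandas as pd

TOLERANCE = 0.5  # max_abs_diff from the feature contract above

def fetch_offline(feature: str, user_ids: list) -> pd.Series:
    """Placeholder: read the materialized offline value per user (e.g. from the Gold layer)."""
    return pd.Series({"u1": 120.0, "u2": 40.0}, name=feature)

def fetch_online(feature: str, user_ids: list) -> pd.Series:
    """Placeholder: read the same feature from the online store (e.g. Redis) at the same as-of time."""
    return pd.Series({"u1": 120.3, "u2": 39.9}, name=feature)

def test_online_offline_consistency(feature: str = "deposits_sum_10m") -> None:
    users = ["u1", "u2"]
    offline = fetch_offline(feature, users)
    online = fetch_online(feature, users)
    max_abs_diff = (offline - online).abs().max()
    assert max_abs_diff <= TOLERANCE, f"{feature}: online/offline skew {max_abs_diff} > {TOLERANCE}"

test_online_offline_consistency()
print("consistency check passed")
```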
10) Online scoring and rules
Hybrid ML + rules: the model provides scores and explanations; rules enforce hard guards (ethics/law).
Combination: CEP patterns (structuring/velocity/device switching) + ML scoring.
SLA: p95 end-to-end 50-150ms for personalization, ≤ 2-5s for RG/AML alerts.
```python
features = feature_store.fetch(user_id)
score = model.predict(features)

if score > T_RG:
    trigger_intervention(user_id, reason="RG_HIGH_RISK", score=score)
elif score > T_BONUS:
    send_personal_offer(user_id, offer=choose_offer(score, seg))
```
11) Training data: samples and labels
Event windows: t0 is the reference point, the label is observed over t0 + Δ (deposit/blacklisting/fraud).
Leakage-control: point-in-time join, exclusion of future events.
Balancing: stratification/class weights, focal loss for rare classes (see the sketch after this list).
Ethics: exclude sensitive attributes/proxies, control their influence.
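A small sketch of the balancing step using scikit-learn class weights; a focal loss would instead be implemented as a custom objective in the chosen booster and is omitted here.
```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Highly imbalanced labels, as is typical for fraud or RG events.
y = np.array([0] * 980 + [1] * 20)

# Balanced class weights: inversely proportional to class frequency.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(weights, 2))))  # e.g. {0: 0.51, 1: 25.0}

# Per-sample weights can then be passed to most tabular learners via sample_weight=...
sample_weight = np.where(y == 1, weights[1], weights[0])
print(sample_weight[:3], sample_weight[-3:])
```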
12) Economics and performance
Feature cost: track cost per feature and cost per request; avoid heavy online joins.
Caching: hot features in RAM, cold features loaded lazily.
Materialization: aggregate offline; keep only critical features online.
Quotas: limits on replays and backtests over time windows; chargeback by team.
13) SQL/Pseudo Code Examples
Point-in-time sample for churn (30 days of silence):
```sql
WITH asof_days AS (
  SELECT DISTINCT user_pseudo_id, DATE(event_time) AS asof
  FROM silver.fact_events
),
agg AS (
  SELECT d.user_pseudo_id,
         d.asof,
         SUM(e.amount_base) FILTER (WHERE e.type = 'deposit'
             AND e.event_time >= d.asof - INTERVAL '30' DAY
             AND e.event_time <  d.asof) AS dep_30d,
         COUNT(*) FILTER (WHERE e.type = 'bet'
             AND e.event_time >= d.asof - INTERVAL '7' DAY
             AND e.event_time <  d.asof) AS bets_7d
  FROM asof_days d
  JOIN silver.fact_events e ON e.user_pseudo_id = d.user_pseudo_id
  GROUP BY d.user_pseudo_id, d.asof
)
SELECT a.user_pseudo_id, a.asof, a.dep_30d, a.bets_7d,
       CASE WHEN NOT EXISTS (
         SELECT 1 FROM silver.fact_events e
         WHERE e.user_pseudo_id = a.user_pseudo_id
           AND e.event_time >  a.asof
           AND e.event_time <= a.asof + INTERVAL '30' DAY
       ) THEN 1 ELSE 0 END AS label_churn_30d
FROM agg a;
```
Online deposit window (Flink SQL, 10 min):
```sql
SELECT user_id,
       TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS win_start,
       COUNT(*) AS deposits_10m,
       SUM(amount_base) AS sum_10m
FROM stream.payments
GROUP BY user_id, TUMBLE(event_time, INTERVAL '10' MINUTE);
```
14) Implementation Roadmap
MVP (4-6 weeks):
1. Signal catalog and Feature Store v1 (5-10 features for Payments/Gameplay).
2. Basic churn/deposit model (XGBoost) + A/B for 10-20% of traffic.
3. Online serving with cache (p95 < 150 ms) and canary releases.
4. Drift/quality monitoring, model card, rollback runbook.
Phase 2 (6-12 weeks):
- RG/AML scoring, graph features, real-time triggers.
- Uplift models for bonuses, contextual bandits, off-policy evaluation.
- Automated retraining on drift/calendar triggers, documentation automation.
- Personalization of the game catalog (seq2rec), multi-objective optimization (revenue/responsibility).
- Multi-regional serving, SLAs/quotas, chargeback for features/inference.
- Fairness audits and stress tests, DR drills and WORM release repositories.
15) RACI
R (Responsible): MLOps (platform/serving), Data Science (models/experiments), Data Eng (features/pipelines).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/DPO (PII/RG/AML/DSAR), Security (KMS/secrets), SRE (SLO/cost), Finance (impact/ROI), Legal.
I (Informed): Product/Marketing/Operations/Support.
16) Pre-release checklist
- Features aligned online/offline; consistency tests passed.
- Model card (owner, data, metrics, risks, fairness) is filled in.
- Canary release/feature flags; SLA and latency/error/drift alerts.
- PII/DSAR/RTBF/Legal Hold policies enforced; logs are pseudonymized.
- Incident/rollback runbook; fallback strategy.
- Experiments are formalized (hypotheses, metrics, duration, MDE).
- Inference and feature costs are budgeted; quotas and limits are in place.
17) Anti-patterns
Online/offline feature discrepancies → training/serving skew.
Synchronous calls to external APIs in the hot path without caching and timeouts.
Opaque metric definitions; no model cards.
Drift without monitoring; no retraining.
PII in analytics and training without CLS/RLS/minimization.
"One big model for everything" without domain decomposition.
18) The bottom line
ML in iGaming is not a set of "magic" models but a discipline: consistent data and features, reproducible offline training, reliable online serving, strict MLOps, transparent metrics, and ethics/compliance. Following this guide, you will build a system that consistently increases revenue and retention, reduces risk, and meets regulatory requirements - at scale, quickly, and predictably.