
Supervised and Unsupervised Learning

1) Why and when

Supervised: there is a label → we predict a probability/class/value. Use it when the "correct answer" is well defined and there is historical data: churn, deposit within 7 days, RG/AML risk, probability of responding to an offer, LTV forecast.
Unsupervised: there are no labels → we find structure/clusters/anomalies/latent factors: player segmentation, fraud rings, thematic profiles of games, detection of provider failures, feature compression.

Selection rule: if the business decision depends on a specific probabilistic forecast → supervised; if the goal is to discover unknown patterns/signals or to reduce the dimensionality of the data → unsupervised. In practice, the two are combined.

2) Typical iGaming cases

Supervised

Churn/reactivation: binary classification (will churn / will not churn), uplift models for intervention impact.
Propensity to deposit/purchase: probability of the event within horizon T.
RG/AML: risk score, probability of structuring, suspicious-session detection.
Bonus anti-abuse: the likelihood of fraudulent promo use.
Recommendations (ranking): probability of a click/bet on a game (listwise/pointwise).

Unsupervised

Player segmentation: k-means, GMM, HDBSCAN by RFM/behavior/genre.
Anomalies: Isolation Forest, LOF, AutoEncoder on payments/game patterns.
Graph analysis: clustering on the "player-device-card-IP" graph.
Dimensionality reduction: PCA/UMAP for visualization and feature engineering.
Topic models: NMF/LDA for game descriptions/support chats.

3) Data and features

Point-in-time joins to exclude data leakage.
Feature windows: 10 min / 1 h / 1 day / 7 days / 30 days (recency, frequency, monetary); a pandas sketch follows this list.
Context: market/jurisdiction/DST/holidays, provider/genre, device/ASN.
Graph features: number of unique cards/IPs/devices, centrality.
Currency/time-zone normalization, SCD Type 2 for users/games/providers.
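
A minimal pandas sketch of leakage-safe window features, reusing the column names from the SQL fragment in section 9.1 (user_pseudo_id, event_time, amount_base):

python
import pandas as pd

def rfm_features(events: pd.DataFrame, asof: pd.Timestamp, days: int) -> pd.DataFrame:
    # only events strictly before asof enter the window: no leakage
    w = events[(events["event_time"] >= asof - pd.Timedelta(days=days))
               & (events["event_time"] < asof)]
    return w.groupby("user_pseudo_id").agg(
        recency_days=("event_time", lambda s: (asof - s.max()).days),
        frequency=("event_time", "count"),
        monetary=("amount_base", "sum"),
    )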

4) Algorithms and metrics

Supervised

Algorithms: LogReg, XGBoost/LightGBM/CatBoost, TabNet; for ranking: LambdaMART/GBDT; for time series: Prophet/ETS/gradient-boosted TS.
Metrics: ROC-AUC/PR-AUC, F1 at the operating threshold, KS (risk), NDCG/MAP@K (recommendations), MAPE/WAPE (forecasts), expected cost with FP/FN weights (see the sketch below).
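
A sketch of these offline metrics with scikit-learn; the threshold and FP/FN costs are illustrative, and y_true/scores are assumed to be NumPy arrays:

python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

def offline_report(y_true, scores, threshold, cost_fp=5.0, cost_fn=50.0):
    pred = (scores >= threshold).astype(int)
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    return {
        "roc_auc": roc_auc_score(y_true, scores),
        "pr_auc": average_precision_score(y_true, scores),
        "f1_at_threshold": f1_score(y_true, pred),
        "expected_cost": cost_fp * fp + cost_fn * fn,
    }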

Unsupervised

Clustering: k-means/GMM (number of clusters via elbow/silhouette; see the sketch after this list), HDBSCAN (density-based).
Anomalies: Isolation Forest/LOF/AutoEncoder; metrics: precision@k on expert-labeled cases, AUC-PR on synthetic anomalies.
Dimensionality reduction: PCA/UMAP for feature design and visualization.
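
A sketch of silhouette-based selection of the number of clusters (scikit-learn; the candidate range is an assumption):

python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X, k_range=range(2, 13)):
    # fit k-means per candidate k and keep the best average silhouette
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                            random_state=42).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)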

5) Combined approaches

Semi-supervised: pseudo-labels for the unlabeled part of the data (self-training; see the sketch after this list), consistency regularization.
Self-supervised: contrastive/masked tasks (session/game embeddings) → used downstream in supervised models.
Active learning: the system proposes candidates for labeling (maximum uncertainty/diversity) → saves AML/RG experts' time.
Weak supervision: heuristics/rules/distant supervision form "weak" labels, which are then calibrated.
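
A minimal self-training sketch, assuming NumPy arrays and binary labels; the confidence cut-off and round count are illustrative (scikit-learn also ships SelfTrainingClassifier for the same purpose):

python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, conf=0.95, rounds=3):
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(rounds):
        proba = model.predict_proba(X_unlab)
        keep = proba.max(axis=1) >= conf          # high-confidence predictions only
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, model.classes_[proba[keep].argmax(axis=1)]])
        X_unlab = X_unlab[~keep]                  # pseudo-labeled rows leave the pool
        model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    return model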

6) Process: from offline to online serving

1. Offline: collection/preparation → split by time/markets → training/validation → backtest.
2. Metric semantics: unified formulas (for example, churn_30d) and fixed time windows.
3. Feature Store: identical feature formulas online and offline; parity tests.
4. Online serving: gRPC/REST endpoints, latency SLAs, A/B routing/canary releases.
5. Monitoring: data/prediction drift (PSI/KL; a PSI sketch follows this list), p95 latency, business-metric errors, alerts.
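
A sketch of the PSI computation behind such drift alerts; the binning scheme and alert thresholds are common rules of thumb, not fixed by this document:

python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf              # catch out-of-range values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 alert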

7) Privacy and compliance

PII minimization: pseudonymization, mapping isolation, CLS/RLS.
Residency: separate pipelines/encryption keys per region (EEA/UK/BR).
DSAR/RTBF: delete/redact features and logs; document the legal grounds for exceptions.
Legal hold: freezing of investigation/reporting artifacts.
Fairness: audit of proxy features, impact reports (SHAP; see the sketch below), RG intervention policies.
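
A sketch of a SHAP impact report, assuming a trained tree-ensemble model and the shap package; model and X_sample are placeholders:

python
import shap

explainer = shap.TreeExplainer(model)      # works for XGBoost/LightGBM/CatBoost
shap_values = explainer.shap_values(X_sample)
shap.summary_plot(shap_values, X_sample)   # global feature-impact overview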

8) Economics and performance

Track the cost of feature computation (cost/feature) and inference (cost/request).
Materialize offline aggregates; compute online only the critical windows.
Cache permissions/scoring results with a short TTL; asynchronous lookups with timeouts (see the sketch after this list).
Quotas and budgets for replays/backtests; chargeback per team/model.
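
A sketch of short-TTL caching of scoring results, assuming the cachetools package; sizes and TTL are illustrative:

python
from cachetools import TTLCache

score_cache = TTLCache(maxsize=100_000, ttl=60)   # entries expire after 60 s

def cached_score(user_id, compute_score):
    # compute_score is any scoring callable; pair with asyncio.wait_for for lookup timeouts
    if user_id not in score_cache:
        score_cache[user_id] = compute_score(user_id)
    return score_cache[user_id]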

9) Examples (fragments)

9.1 Point-in-time sampling for churn_30d

sql
WITH base AS (                       -- all (user, day) anchor points
  SELECT user_pseudo_id, DATE(event_time) AS asof
  FROM silver.fact_events
  GROUP BY user_pseudo_id, DATE(event_time)
),
feat AS (                            -- features computed strictly before asof (no leakage)
  SELECT b.user_pseudo_id, b.asof,
         SUM(CASE WHEN e.type = 'deposit' AND e.event_time >= b.asof - INTERVAL '30' DAY
                   AND e.event_time < b.asof THEN amount_base ELSE 0 END) AS dep_30d,
         COUNT(CASE WHEN e.type = 'bet' AND e.event_time >= b.asof - INTERVAL '7' DAY
                     AND e.event_time < b.asof THEN 1 END) AS bets_7d
  FROM base b
  JOIN silver.fact_events e USING (user_pseudo_id)
  GROUP BY b.user_pseudo_id, b.asof
),
label AS (                           -- label: no activity in the 30 days after asof
  SELECT f.user_pseudo_id, f.asof,
         CASE WHEN NOT EXISTS (
           SELECT 1 FROM silver.fact_events x
           WHERE x.user_pseudo_id = f.user_pseudo_id
             AND x.event_time > f.asof AND x.event_time <= f.asof + INTERVAL '30' DAY
         ) THEN 1 ELSE 0 END AS churn_30d
  FROM feat f
)
SELECT * FROM feat JOIN label USING (user_pseudo_id, asof);

9.2 Payment anomalies (pseudocode, Isolation Forest)

python
import numpy as np
from sklearn.ensemble import IsolationForest

X = build_features(payments_last_7d)      # sum/frequency/novelty/BIN/ASN/time features
model = IsolationForest(contamination=0.01).fit(X)
scores = -model.decision_function(X)      # higher score = more anomalous
alerts = np.where(scores > THRESHOLD)[0]  # AML case candidates

9.3 k-means segmentation (RFM + genres)

python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# standardized RFM values plus genre shares per player
X = scale(np.c_[R, F, M, share_slots, share_live, share_sports])
km = KMeans(n_clusters=8, n_init=20, random_state=42).fit(X)
segments = km.labels_

9.4 Cost-based threshold for a binary model

python
threshold = pick_by_expected_cost(scores, labels, cost_fp=5.0, cost_fn=50.0)
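
pick_by_expected_cost is not defined in the fragment; a minimal sketch of one possible implementation, scanning observed scores as candidate thresholds and minimizing expected cost (scores and labels are assumed to be NumPy arrays):

python
import numpy as np

def pick_by_expected_cost(scores, labels, cost_fp, cost_fn):
    best_t, best_cost = 0.5, float("inf")
    for t in np.unique(scores):              # every observed score is a candidate
        pred = scores >= t
        fp = np.sum(pred & (labels == 0))    # false positives at this threshold
        fn = np.sum(~pred & (labels == 1))   # false negatives at this threshold
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t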

10) Evaluation, validation and experiments

Offline: temporal splits (train/val/test by time/markets), backtesting, bootstrap confidence intervals.
Online: A/B/n, sequential tests, CUPED/diff-in-diff.
Off-policy: IPS/DR for personalization policies.
Calibration: Platt/isotonic for well-calibrated probabilities (see the sketch after this list).
Degradation control: alerts on business metrics and PR-AUC/KS.
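
A calibration sketch with scikit-learn's CalibratedClassifierCV; base_model and the data splits are placeholders:

python
from sklearn.calibration import CalibratedClassifierCV

# 'isotonic' needs enough data; 'sigmoid' (Platt) is safer on small samples
calibrated = CalibratedClassifierCV(base_model, method="isotonic", cv=3)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]  # calibrated probabilities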

11) RACI

R (Responsible): Data Science (models/experiments), MLOps (platform/serving), Data Eng (features/pipelines).
A (Accountable): Head of Data/CDO.
C (Consulted): Compliance/DPO (PII/RG/AML), Security (KMS/secrets), SRE (SLOs/cost), Finance (ROI).
I (Informed): Product/Marketing/Operations/Support.

12) Implementation Roadmap

MVP (4-6 weeks):

1. Catalog of targets/labels and signals (churn_30d, propensity_7d, risk_rg).

2. Feature Store v1 (5-10 features), basic XGBoost models, offline metrics dashboards.

3. k-means segmentation (8 clusters) + segment descriptions; Isolation Forest for payments.

4. Online serving with caching, p95 < 150 ms; A/B on 10-20% of traffic.

Phase 2 (6-12 weeks):
  • Active/semi-supervised learning for label scarcity (AML/RG), self-supervised game/session embeddings.
  • Canary releases, drift monitoring, automatic retraining.
  • A unified semantic layer for metrics and online/offline feature parity.
Phase 3 (12-20 weeks):
  • Graph features and fraud rings; uplift models for bonuses.
  • Multi-regional serving, quotas/chargeback; WORM archive of releases.
  • Fairness audits, stress tests, incident runbooks.

13) Pre-production checklist

  • Point-in-time sampling and anti-leakage tests.
  • Probability calibration; expected-cost threshold selected.
  • Model cards (owner, data, metrics, risks, fairness).
  • Feature Store online/offline parity tests.
  • Drift/latency/error monitoring, alerts and auto-rollback.
  • PII/DSAR/RTBF/legal-hold policies; logs are pseudonymized.
  • A/B plan and statistical power calculated; rollback runbook ready.

14) Anti-patterns

Mixing future events into labels (leakage); no point-in-time sampling.
"One model for all" instead of domain decomposition.
Uncalibrated probabilities → incorrect business thresholds.
Flying blind: no online drift/quality monitoring.
Overcomplicating the online path (heavy external joins without caching and timeouts).
Segments without business interpretation or an owner.

15) The bottom line

Supervised learning provides measurable forecasts and risk/revenue management; unsupervised learning provides structure and signals where there are no labels. Combining them (semi-/self-supervised, active learning) with data discipline (point-in-time sampling, a Feature Store), compliance and MLOps gives an iGaming platform a steady increase in Net Revenue, less fraud and timely RG interventions, with reproducibility, cost control and audit readiness.
