Data clustering
1) Why cluster an iGaming platform
Personalization without manual tags: group players by behavior to target offers, limits, and UX.
Operations and risk: identify "thin files," atypical payment patterns, and fraud clusters.
Product and content: segments by favorite providers/mechanics (crash/slots/live) and lifecycle stage.
Analytics and strategic insights: how the segment mix shifts by market/campaign/season.
2) Data and feature space
2.1 Sources
Gaming behavior: frequency/length of sessions, bets/min, volatility, favorite genres/providers.
Payments: deposit/withdrawal frequency and amounts, methods (Papara/PIX/card), chargebacks/anomalies.
Marketing/CRM: acquisition channels, response to bonuses/quests, push responses.
Devices/platforms: OS, version, client stability, network type.
RG/compliance: self-exclusion flags, limits, support calls (without PII).
2.2 Feature engineering
Windowed aggregates: 7/28/90 days; normalization per active day.
Standardization/robust scaling: z-score/robust scaler (IQR), log transform for long-tailed features.
Categories → embeddings/one-hot: providers/channels/countries.
Dimensionality reduction: PCA/UMAP for denoising and visualization, but keep the raw vector for interpretation.
Zero-PII: tokens instead of identifiers; personal fields are prohibited.
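The windowing and scaling steps above can be sketched as follows (a minimal pandas example; the column names, window, and data are illustrative, not from a real feature store):

```python
import numpy as np
import pandas as pd

# Toy event-level frame: one row per (player_token, day); names are illustrative.
rng = np.random.default_rng(0)
events = pd.DataFrame({
    "player_token": ["t1"] * 10 + ["t2"] * 10,
    "day": list(range(10)) * 2,
    "stake_sum": rng.lognormal(2.0, 1.0, 20),
})

def window_features(df, window_days):
    """Aggregate the last `window_days` days per player, normalized per active day."""
    recent = df[df["day"] > df["day"].max() - window_days]
    agg = recent.groupby("player_token").agg(
        stake_total=("stake_sum", "sum"),
        active_days=("day", "nunique"),
    )
    agg["stake_per_active_day"] = agg["stake_total"] / agg["active_days"]
    # Long tails: log-transform, then robust-scale by median/IQR instead of mean/std.
    x = np.log1p(agg["stake_per_active_day"])
    iqr = x.quantile(0.75) - x.quantile(0.25)
    agg["stake_robust_z"] = (x - x.median()) / (iqr if iqr > 0 else 1.0)
    return agg

feats_7d = window_features(events, 7)
print(feats_7d[["stake_per_active_day", "stake_robust_z"]])
```

The same function would be run per window (7/28/90 days) and the resulting columns concatenated into the feature vector.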
3) Algorithms and when to use them
k-means/Mini-Batch k-means - fast baseline for big data; assumes roughly spherical clusters.
GMM - soft assignment (membership probabilities), useful for borderline players.
DBSCAN/HDBSCAN - find arbitrarily shaped clusters and "noise" (anomalies); DBSCAN is sensitive to eps.
Hierarchical (Ward/average) - dendrograms for a "tree" of segments, good at moderate N.
Spectral - for non-spherical clusters; expensive at large N.
SOM (Kohonen maps) - interpretable 2D maps of behavioral patterns.
Mixed data: k-prototypes, k-modes, Gower distance.
Hint: Start with Mini-Batch k-means (speed) + HDBSCAN (noise/anomalies) and compare stability.
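A minimal sketch of the suggested baseline pair on synthetic data (DBSCAN stands in here for HDBSCAN, which requires the separate `hdbscan` package or scikit-learn ≥ 1.3; all numbers are toy values):

```python
import numpy as np
from sklearn.cluster import DBSCAN, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the standardized behavioral feature matrix.
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

# Fast baseline: Mini-Batch k-means (assumes roughly spherical clusters).
km = MiniBatchKMeans(n_clusters=4, random_state=42, n_init=10).fit(X)

# Density baseline: DBSCAN flags outliers with label -1 (HDBSCAN removes eps tuning).
db = DBSCAN(eps=0.5, min_samples=10).fit(X)
noise_share = float(np.mean(db.labels_ == -1))
print(f"k-means clusters: {len(set(km.labels_))}, DBSCAN noise share: {noise_share:.2%}")
```

Comparing the two partitions (and the density model's noise set) is a cheap first stability check before investing in heavier algorithms.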
4) How to choose k and evaluate quality
Internal metrics: Silhouette (higher is better), Davies-Bouldin (lower is better), Calinski-Harabasz.
Stability: re-clustering on bootstrap samples, Adjusted Rand Index/NMI between partitions.
External validity: KPI separation (GGR/NGR, retention, offer conversion, fraud FPR) across clusters.
Business interpretation: clusters should have clear profiles and actions. If not, revisit features/scaling/algorithm.
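The internal metrics and the bootstrap-stability check can be combined into one sweep, sketched here on synthetic data (the k range and number of bootstrap rounds are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

X, _ = make_blobs(n_samples=800, centers=5, cluster_std=1.0, random_state=0)

scores = {}
for k in range(3, 9):
    labels = MiniBatchKMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = (silhouette_score(X, labels),         # higher is better
                 davies_bouldin_score(X, labels),     # lower is better
                 calinski_harabasz_score(X, labels))  # higher is better
best_k = max(scores, key=lambda k: scores[k][0])

# Stability: re-cluster bootstrap samples, compare to the reference partition via ARI.
rng = np.random.default_rng(0)
ref = MiniBatchKMeans(n_clusters=best_k, random_state=0, n_init=10).fit_predict(X)
aris = []
for seed in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)
    boot = MiniBatchKMeans(n_clusters=best_k, random_state=seed, n_init=10).fit_predict(X[idx])
    aris.append(adjusted_rand_score(ref[idx], boot))
print(f"best_k={best_k}, mean bootstrap ARI={np.mean(aris):.2f}")
```

In practice the metric-optimal k is only a candidate; the business-interpretation check above still decides whether the partition ships.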
5) Profiles and explainability
Cluster profile: feature medians/quantiles, top games/providers, devices, payment methods, channels.
Difference vs. the population: Δ in percentage points/σ, radar-chart visualization.
Local explainers: SHAP/permutation importance for boundaries between clusters (via a classifier trained to predict cluster_id).
Name clusters descriptively: "High-rollers crash," "Bonus-hunters slots," "Casual weekend live."
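A minimal profiling sketch: per-cluster medians and their delta against the population median, the numeric core of a segment passport (synthetic data; the cluster ids and features are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "cluster_id": rng.choice(["s_high_roller_crash", "s_bonus_hunter_slots"], 500),
    "stake_per_min": rng.lognormal(1.0, 0.5, 500),
    "bonus_usage_rate": rng.beta(2, 5, 500),
})

features = ["stake_per_min", "bonus_usage_rate"]
pop_median = df[features].median()
profile = df.groupby("cluster_id")[features].median()
delta = profile - pop_median  # positive = cluster sits above the population median
print(delta.round(3))
```

The `delta` frame is what feeds the radar chart: each row is a cluster, each column an axis.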
6) Operation (online/offline)
Offline clustering once a day/week → publication of segment "passports."
Online assignment: nearest center (k-means), probability (GMM), "noise" (HDBSCAN) → fallback rules.
Drift: monitor PSI/KS on key features, migration between clusters, and "noise" rates.
Lifecycle: revision every 1-3 months; a MAJOR version bump when features/scalers change.
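PSI, the drift metric named above, can be computed per feature roughly as follows (a common binned formulation; the 1-sigma shift is a simulated example, and the usual 0.1/0.2 alert bands are conventional rules of thumb):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a current sample of one feature."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full support
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # guard against log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)   # reference window
stable = rng.normal(0, 1, 10_000)     # same distribution, should score near zero
drifted = rng.normal(1.0, 1, 10_000)  # simulated 1-sigma shift in a key feature
print(f"stable PSI={psi(baseline, stable):.3f}, drifted PSI={psi(baseline, drifted):.3f}")
```

The same function run per key feature against the training window gives the drift signal compared to `drift_psi_max` in the serving policy.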
7) Integrations and actions
Personalization: offers/frequency limits, selection of providers and tournament mechanics.
CRM/channels: push/email frequencies, time windows, language/tone.
Marketing: budget by segment, creatives, LTV forecasting; "nudge" vs "value" strategies.
RG/risk: soft interventions for risk clusters, manual review for anomalies.
Antifraud: clusters of atypical payment paths/devices → increased scoring.
8) Privacy and compliance
k-anonymity of reports (minimum N objects per slice).
Zero-PII in features/logs/dashboards, tokenization; DSAR deletion by token.
Geo/tenant isolation: train/store segments in the license region.
Fairness check: verify differences across sensitive dimensions (country/payment method/device).
Usage policy: "aggressive" offers to the RG cluster are prohibited.
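A minimal sketch of the k-anonymity rule: suppress report slices below a minimum distinct-player count before publication (the threshold of 50 mirrors `min_group_size` in the serving-policy template; data and column names are illustrative):

```python
import pandas as pd

MIN_GROUP_SIZE = 50  # k-anonymity threshold, matching min_group_size in the serving policy

def suppress_small_slices(report, group_cols):
    """Drop report slices with fewer than MIN_GROUP_SIZE distinct players before publishing."""
    sizes = report.groupby(group_cols)["player_token"].transform("nunique")
    return report[sizes >= MIN_GROUP_SIZE]

report = pd.DataFrame({
    "player_token": [f"t{i}" for i in range(120)],
    "country": ["TR"] * 100 + ["BR"] * 20,  # the BR slice is below the threshold
    "segment": ["s_live_social"] * 120,
})
safe = suppress_small_slices(report, ["country", "segment"])
print(sorted(safe["country"].unique()))  # only slices with >= 50 players survive
```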
9) Success metrics
Operational: share of online assignments <X ms, centroid stability, migration/unassigned rate.
Business: offer-conversion uplift, ARPPU/LTV by segment, lower anti-fraud FPR, faster RG response.
Model quality: silhouette ↑, DB ↓, stability ↑, distinguishable KPIs across clusters.
10) Pipeline (reference)
Bronze → Silver → Gold → Serve
1. Ingest events/payments/devices → cleaning/joins.
2. Feature Store: window calculation (7/28/90d), standardization, masks/tokens.
3. Dim-reduction (PCA/UMAP) for visualization (not for serving).
4. Clustering (offline), evaluation of metrics, generation of "passports."
5. Online assignment API: nearest center/probabilities/"noise."
6. Monitoring: drift, migrations, frequency of "noise," KPI by segment.
7. Release: semver, shadow/canary, rollback; segment catalog in BI.
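The online-assignment step (step 5) with its fallback policy can be sketched as a nearest-centroid lookup with a confidence cutoff; the centroids, feature values, and distance threshold below are all hypothetical:

```python
import numpy as np

# Hypothetical published centroids from the offline run (standardized feature space).
CENTROIDS = {
    "s_high_roller_crash": np.array([2.1, 1.8]),
    "s_bonus_hunter_slots": np.array([-0.4, 0.9]),
    "s_live_social": np.array([0.2, -1.1]),
}
MAX_DISTANCE = 2.5  # beyond this, confidence is too low -> route to fallback rules

def assign(features):
    """Nearest-centroid online assignment with an unknown/fallback policy."""
    dists = {sid: float(np.linalg.norm(features - c)) for sid, c in CENTROIDS.items()}
    sid = min(dists, key=dists.get)
    return sid if dists[sid] <= MAX_DISTANCE else "fallback_rules"

print(assign(np.array([2.0, 1.7])))    # close to the high-roller centroid
print(assign(np.array([10.0, 10.0])))  # far from every centroid -> fallback
```

A GMM-based assigner would swap the distance threshold for a minimum posterior probability (the `min_confidence` knob in the serving-policy template).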
11) Segment examples (iGaming)
Bonus-hunters slots: high share of freespins/cashback, short sessions, many withdrawal declines → soft promo limits, transparent terms.
Crash-risk takers: short intense sessions, rapid stake escalation → frequency limits/cool-downs.
Live-social: long evening sessions on live, high CTR on social campaigns → curated streams and live events.
Thin-file newcomers: 1-2 deposits, few rounds → welcome tutorials, KYC support.
Anomaly-payments: frequent changes of wallets/methods, geo mismatches → enhanced anti-fraud.
12) Artifact patterns
12.1 Segment catalog (fragment)
```yaml
version: 1.4.0
segments:
  - id: s_high_roller_crash
    name: "High-rollers crash"
    size_share: 0.07
    centroid:
      stake_per_min_z: 2.1
      volatility_z: 1.8
      session_len_min: 6.4
    actions: ["limit_bet_growth", "vip_care", "rg_cooldown_soft"]
  - id: s_bonus_hunter_slots
    name: "Bonus-hunters slots"
    size_share: 0.19
    centroid:
      bonus_usage_rate: 0.63
      withdraw_decline_rate: 0.21
    actions: ["clear_terms", "frequency_cap", "onboarding_quest"]
```
12.2 Serving policy
```yaml
serving:
  assigner: "nearest_centroid"  # or gmm_prob
  p95_latency_ms: 50
  min_confidence: 0.6
  unknown_policy: "fallback_rules"
privacy:
  pii_in_features: false
  min_group_size: 50
monitoring:
  drift_psi_max: 0.2
  migration_rate_warn: 0.25
```
12.3 Cluster passport (BI)
```yaml
cluster_id: s_live_social
share: 0.23
kpi:
  d30_retention: 0.42
  arppu: 27.4
behavior:
  sessions_evening_share: 0.68
  provider_top: ["Evolution", "Pragmatic Live"]
crm:
  push_ctr: 0.11
  promo_sensitivity: "medium"
rg_flags: ["cooldown_hint"]
```
13) Implementation Roadmap
0-30 days (MVP)
1. Build feature marts (7/28/90d), standardize, strip PII.
2. Mini-Batch k-means for 5-9 clusters + basic HDBSCAN for "noise."
3. Cluster passports, online assigner, migration/drift dashboard.
4. Two product experiments: segment-specific offers and push frequency.
30-90 days
1. GMM for soft assignment; mixed data types (k-prototypes).
2. Automatic retraining every N days, shadow → canary; alerts on PSI/migrations.
3. Interpretability (SHAP cards), segment BI catalog and CRM/recommendation API.
3-6 months
1. Geo/tenant-specific segments; combining with device/payment graph.
2. Long-term cohorts + transition matrices (Markov) for LTV planning.
3. Segment-level RG/AML policies; external privacy/ethics audit.
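The Markov-transition idea for LTV planning (item 2 of the 3-6-month block) amounts to estimating a row-stochastic matrix of segment-to-segment moves and multiplying it out for longer horizons; a toy sketch with invented counts:

```python
import numpy as np
import pandas as pd

# Toy month-over-month segment transition counts (rows: from, columns: to).
segments = ["thin_file", "casual", "high_roller"]
counts = pd.DataFrame(
    [[50, 30, 5],
     [10, 70, 15],
     [2, 8, 40]],
    index=segments, columns=segments,
)
P = counts.div(counts.sum(axis=1), axis=0)  # row-stochastic transition matrix
P2 = P @ P                                  # two-month-ahead transition probabilities
print(P2.round(3))
```

Multiplying a cohort's current segment distribution by powers of `P` gives the expected future segment mix, which can then be priced with per-segment ARPPU for LTV planning.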
14) Anti-patterns
Choosing k "by eye" and evaluating only silhouette without business checks.
Mixing PII and behavioral features; lack of k-anonymity in reports.
No online assigner → segments sit idle in BI without driving actions.
Overfitting to a single season/promotion; no monitoring of migrations.
Using clusters for "aggressive" marketing without RG guardrails.
One set of segments for all countries/brands without local features.
15) RACI
Data Platform (R): feature marts, pipelines, monitoring, version registry.
Data Science (R): algorithm choice, k/metrics, interpretation.
Product/CRM (A): segment-driven actions, experiments.
Risk/RG (C): restriction and human-in-the-loop (HITL) policies for high-risk segments.
Security/DPO (A/R): privacy, tokenization, k-anonymity.
BI (C): dashboards, catalogs, documentation.
16) Related Sections
Segmented Targeting, Recommendation Systems, Player Profiling, Reducing Bias, Performance Benchmarking, Analytics and Metrics API, MLOps: Model Exploitation, Data Ethics and Transparency.
Summary
Clustering is not just a UMAP plot but a production tool: clean PII-free features, stable metrics, understandable segment passports, an online assigner, and concrete actions in CRM/product/RG. Regularly audited and monitored for drift, it turns behavioral chaos into manageable strategies for growth, safety, and responsibility.