Data clustering
1) Why cluster an iGaming platform
Personalization without manual tags: group players by behavior to target offers, limits, and UX.
Operations and risk: identify "thin files," atypical payment patterns, and fraud clusters.
Product and content: segments by favorite providers/mechanics (crash/slots/live) and lifecycle stage.
Analytics and strategic insights: how the segment mix shifts by market/campaign/season.
2) Data and feature space
2.1 Sources
Gaming behavior: frequency/length of sessions, bets/min, volatility, favorite genres/providers.
Payments: deposit/withdrawal frequency and amounts, methods (Papara/PIX/card), chargebacks/anomalies.
Marketing/CRM: acquisition channels, response to bonuses/quests, push responses.
Devices/platforms: OS, version, client stability, network type.
RG/compliance: self-exclusion flags, limits, support calls (without PII).
2.2 Feature engineering
Windowed aggregates: 7/28/90 days; normalization per active day.
Standardization/robust scaling: z-score/robust scaler (IQR), log transform for long-tailed features.
Categories → embeddings/one-hot: providers/channels/countries.
Dimensionality reduction: PCA/UMAP for denoising and visualization, but keep the raw vector for interpretation.
Zero-PII: tokens instead of identifiers; personal fields are prohibited.
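The windowing and scaling steps above can be sketched as follows (a minimal pandas example; the column names, window, and data are illustrative, not from a real feature store):

```python
import numpy as np
import pandas as pd

# Toy event-level frame: one row per (player_token, day); names are illustrative.
rng = np.random.default_rng(0)
events = pd.DataFrame({
    "player_token": ["t1"] * 10 + ["t2"] * 10,
    "day": list(range(10)) * 2,
    "stake_sum": rng.lognormal(2.0, 1.0, 20),
})

def window_features(df, window_days):
    """Aggregate the last `window_days` days per player, normalized per active day."""
    recent = df[df["day"] > df["day"].max() - window_days]
    agg = recent.groupby("player_token").agg(
        stake_total=("stake_sum", "sum"),
        active_days=("day", "nunique"),
    )
    agg["stake_per_active_day"] = agg["stake_total"] / agg["active_days"]
    # Long tails: log-transform, then robust-scale by median/IQR instead of mean/std.
    x = np.log1p(agg["stake_per_active_day"])
    iqr = x.quantile(0.75) - x.quantile(0.25)
    agg["stake_robust_z"] = (x - x.median()) / (iqr if iqr > 0 else 1.0)
    return agg

feats_7d = window_features(events, 7)
print(feats_7d[["stake_per_active_day", "stake_robust_z"]])
```

The same function would be run per window (7/28/90 days) and the resulting columns concatenated into the feature vector.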
3) Algorithms and when to use them
k-means/Mini-Batch k-means - fast baseline for big data; assumes roughly spherical clusters.
GMM - soft assignment (membership probabilities), useful for borderline players.
DBSCAN/HDBSCAN - find arbitrarily shaped clusters and "noise" (anomalies); DBSCAN is sensitive to eps.
Hierarchical (Ward/average) - dendrograms for a "tree" of segments, good at moderate N.
Spectral - for non-spherical clusters; expensive at large N.
SOM (Kohonen maps) - interpretable 2D maps of behavioral patterns.
Mixed data: k-prototypes, k-modes, Gower distance.
Hint: Start with Mini-Batch k-means (speed) + HDBSCAN (noise/anomalies) and compare stability.
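A minimal sketch of the suggested baseline pair on synthetic data (DBSCAN stands in here for HDBSCAN, which requires the separate `hdbscan` package or scikit-learn ≥ 1.3; all numbers are toy values):

```python
import numpy as np
from sklearn.cluster import DBSCAN, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the standardized behavioral feature matrix.
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

# Fast baseline: Mini-Batch k-means (assumes roughly spherical clusters).
km = MiniBatchKMeans(n_clusters=4, random_state=42, n_init=10).fit(X)

# Density baseline: DBSCAN flags outliers with label -1 (HDBSCAN removes eps tuning).
db = DBSCAN(eps=0.5, min_samples=10).fit(X)
noise_share = float(np.mean(db.labels_ == -1))
print(f"k-means clusters: {len(set(km.labels_))}, DBSCAN noise share: {noise_share:.2%}")
```

Comparing the two partitions (and the density model's noise set) is a cheap first stability check before investing in heavier algorithms.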
4) How to choose k and evaluate quality
Internal metrics: Silhouette (higher is better), Davies-Bouldin (lower is better), Calinski-Harabasz.
Stability: re-clustering on bootstrap samples, Adjusted Rand Index/NMI between partitions.
External validity: KPI separation (GGR/NGR, retention, offer conversion, fraud FPR) across clusters.
Business interpretation: clusters should have clear profiles and actions. If not, revisit features/scaling/algorithm.
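The internal metrics and the bootstrap-stability check can be combined into one sweep, sketched here on synthetic data (the k range and number of bootstrap rounds are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

X, _ = make_blobs(n_samples=800, centers=5, cluster_std=1.0, random_state=0)

scores = {}
for k in range(3, 9):
    labels = MiniBatchKMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = (silhouette_score(X, labels),         # higher is better
                 davies_bouldin_score(X, labels),     # lower is better
                 calinski_harabasz_score(X, labels))  # higher is better
best_k = max(scores, key=lambda k: scores[k][0])

# Stability: re-cluster bootstrap samples, compare to the reference partition via ARI.
rng = np.random.default_rng(0)
ref = MiniBatchKMeans(n_clusters=best_k, random_state=0, n_init=10).fit_predict(X)
aris = []
for seed in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)
    boot = MiniBatchKMeans(n_clusters=best_k, random_state=seed, n_init=10).fit_predict(X[idx])
    aris.append(adjusted_rand_score(ref[idx], boot))
print(f"best_k={best_k}, mean bootstrap ARI={np.mean(aris):.2f}")
```

In practice the metric-optimal k is only a candidate; the business-interpretation check above still decides whether the partition ships.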
5) Profiles and explainability
Cluster profile: feature medians/quantiles, top games/providers, devices, payment methods, channels.
Difference vs. the population: Δ in percentage points/σ, radar-chart visualization.
Local explainers: SHAP/permutation importance for boundaries between clusters (via a classifier trained to predict cluster_id).
Name clusters descriptively: "High-rollers crash," "Bonus-hunters slots," "Casual weekend live."
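A minimal profiling sketch: per-cluster medians and their delta against the population median, the numeric core of a segment passport (synthetic data; the cluster ids and features are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "cluster_id": rng.choice(["s_high_roller_crash", "s_bonus_hunter_slots"], 500),
    "stake_per_min": rng.lognormal(1.0, 0.5, 500),
    "bonus_usage_rate": rng.beta(2, 5, 500),
})

features = ["stake_per_min", "bonus_usage_rate"]
pop_median = df[features].median()
profile = df.groupby("cluster_id")[features].median()
delta = profile - pop_median  # positive = cluster sits above the population median
print(delta.round(3))
```

The `delta` frame is what feeds the radar chart: each row is a cluster, each column an axis.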
6) Operation (online/offline)
Offline clustering once a day/week → publication of segment "passports."
Online assignment: nearest center (k-means), probability (GMM), "noise" (HDBSCAN) → fallback rules.
Drift: monitor PSI/KS on key features, migration between clusters, and "noise" rates.
Lifecycle: revision every 1-3 months; a MAJOR version bump when features/scalers change.
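PSI, the drift metric named above, can be computed per feature roughly as follows (a common binned formulation; the 1-sigma shift is a simulated example, and the usual 0.1/0.2 alert bands are conventional rules of thumb):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a current sample of one feature."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full support
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # guard against log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)   # reference window
stable = rng.normal(0, 1, 10_000)     # same distribution, should score near zero
drifted = rng.normal(1.0, 1, 10_000)  # simulated 1-sigma shift in a key feature
print(f"stable PSI={psi(baseline, stable):.3f}, drifted PSI={psi(baseline, drifted):.3f}")
```

The same function run per key feature against the training window gives the drift signal compared to `drift_psi_max` in the serving policy.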
7) Integrations and actions
Personalization: offers/frequency limits, selection of providers and tournament mechanics.
CRM/channels: push/email frequencies, time windows, language/tone.
Marketing: budget by segment, creatives, LTV forecasting; "nudge" vs "value" strategies.
RG/risk: soft interventions for risk clusters, manual review for anomalies.
Antifraud: clusters of atypical payment paths/devices → increased scoring.
8) Privacy and compliance
k-anonymity of reports (minimum N objects per slice).
Zero-PII in features/logs/dashboards, tokenization; DSAR deletion by token.
Geo/tenant isolation: train/store segments in the license region.
Fairness check: verify differences across sensitive dimensions (country/payment method/device).
Usage policy: "aggressive" offers to the RG cluster are prohibited.
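A minimal sketch of the k-anonymity rule: suppress report slices below a minimum distinct-player count before publication (the threshold of 50 mirrors `min_group_size` in the serving-policy template; data and column names are illustrative):

```python
import pandas as pd

MIN_GROUP_SIZE = 50  # k-anonymity threshold, matching min_group_size in the serving policy

def suppress_small_slices(report, group_cols):
    """Drop report slices with fewer than MIN_GROUP_SIZE distinct players before publishing."""
    sizes = report.groupby(group_cols)["player_token"].transform("nunique")
    return report[sizes >= MIN_GROUP_SIZE]

report = pd.DataFrame({
    "player_token": [f"t{i}" for i in range(120)],
    "country": ["TR"] * 100 + ["BR"] * 20,  # the BR slice is below the threshold
    "segment": ["s_live_social"] * 120,
})
safe = suppress_small_slices(report, ["country", "segment"])
print(sorted(safe["country"].unique()))  # only slices with >= 50 players survive
```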
9) Success metrics
Operational: share of online assignments <X ms, centroid stability, migration/unassigned rate.
Business: offer-conversion uplift, ARPPU/LTV by segment, lower anti-fraud FPR, faster RG response.
Model quality: silhouette ↑, DB ↓, stability ↑, distinguishable KPIs across clusters.
10) Pipeline (reference)
Bronze → Silver → Gold → Serve
1. Ingest events/payments/devices → cleaning/joins.
2. Feature Store: window calculation (7/28/90d), standardization, masks/tokens.
3. Dim-reduction (PCA/UMAP) for visualization (not for serving).
4. Clustering (offline), evaluation of metrics, generation of "passports."
5. Online assignment API: nearest center/probabilities/"noise."
6. Monitoring: drift, migrations, frequency of "noise," KPI by segment.
7. Release: semver, shadow/canary, rollback; segment catalog in BI.
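The online-assignment step (step 5) with its fallback policy can be sketched as a nearest-centroid lookup with a confidence cutoff; the centroids, feature values, and distance threshold below are all hypothetical:

```python
import numpy as np

# Hypothetical published centroids from the offline run (standardized feature space).
CENTROIDS = {
    "s_high_roller_crash": np.array([2.1, 1.8]),
    "s_bonus_hunter_slots": np.array([-0.4, 0.9]),
    "s_live_social": np.array([0.2, -1.1]),
}
MAX_DISTANCE = 2.5  # beyond this, confidence is too low -> route to fallback rules

def assign(features):
    """Nearest-centroid online assignment with an unknown/fallback policy."""
    dists = {sid: float(np.linalg.norm(features - c)) for sid, c in CENTROIDS.items()}
    sid = min(dists, key=dists.get)
    return sid if dists[sid] <= MAX_DISTANCE else "fallback_rules"

print(assign(np.array([2.0, 1.7])))    # close to the high-roller centroid
print(assign(np.array([10.0, 10.0])))  # far from every centroid -> fallback
```

A GMM-based assigner would swap the distance threshold for a minimum posterior probability (the `min_confidence` knob in the serving-policy template).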
11) Segment examples (iGaming)
Bonus-hunters slots: high share of freespins/cashback, short sessions, many withdrawal declines → soft promo limits, transparent terms.
Crash-risk takers: short intense sessions, rapid stake escalation → frequency limits/cool-downs.
Live-social: long evening sessions on live, high CTR on social campaigns → curated streams and live events.
Thin-file newcomers: 1-2 deposits, few rounds → welcome tutorials, KYC support.
Anomaly-payments: frequent changes of wallets/methods, geo mismatches → enhanced anti-fraud.
12) Artifact patterns
12.1 Segment catalog (fragment)
```yaml
version: 1.4.0
segments:
  - id: s_high_roller_crash
    name: "High-rollers crash"
    size_share: 0.07
    centroid:
      stake_per_min_z: 2.1
      volatility_z: 1.8
      session_len_min: 6.4
    actions: ["limit_bet_growth", "vip_care", "rg_cooldown_soft"]
  - id: s_bonus_hunter_slots
    name: "Bonus-hunters slots"
    size_share: 0.19
    centroid:
      bonus_usage_rate: 0.63
      withdraw_decline_rate: 0.21
    actions: ["clear_terms", "frequency_cap", "onboarding_quest"]
```
12.2 Serving policy
```yaml
serving:
  assigner: "nearest_centroid"  # or gmm_prob
  p95_latency_ms: 50
  min_confidence: 0.6
  unknown_policy: "fallback_rules"
privacy:
  pii_in_features: false
  min_group_size: 50
monitoring:
  drift_psi_max: 0.2
  migration_rate_warn: 0.25
```
12.3 Cluster passport (BI)
```yaml
cluster_id: s_live_social
share: 0.23
kpi:
  d30_retention: 0.42
  arppu: 27.4
behavior:
  sessions_evening_share: 0.68
  provider_top: ["Evolution", "Pragmatic Live"]
crm:
  push_ctr: 0.11
  promo_sensitivity: "medium"
rg_flags: ["cooldown_hint"]
```
13) Implementation Roadmap
0-30 days (MVP)
1. Build feature marts (7/28/90d), standardize, strip PII.
2. Mini-Batch k-means for 5-9 clusters + basic HDBSCAN for "noise."
3. Cluster passports, online assigner, migration/drift dashboard.
4. Two product experiments: segment-specific offers and push frequency.
30-90 days
1. GMM for soft assignment; mixed data types (k-prototypes).
2. Automatic retraining every N days, shadow → canary; alerts on PSI/migrations.
3. Interpretability (SHAP cards), segment BI catalog and CRM/recommendation API.
3-6 months
1. Geo/tenant-specific segments; combining with device/payment graph.
2. Long-term cohorts + transition matrices (Markov) for LTV planning.
3. Segment-level RG/AML policies; external privacy/ethics audit.
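The Markov-transition idea for LTV planning (item 2 of the 3-6-month block) amounts to estimating a row-stochastic matrix of segment-to-segment moves and multiplying it out for longer horizons; a toy sketch with invented counts:

```python
import numpy as np
import pandas as pd

# Toy month-over-month segment transition counts (rows: from, columns: to).
segments = ["thin_file", "casual", "high_roller"]
counts = pd.DataFrame(
    [[50, 30, 5],
     [10, 70, 15],
     [2, 8, 40]],
    index=segments, columns=segments,
)
P = counts.div(counts.sum(axis=1), axis=0)  # row-stochastic transition matrix
P2 = P @ P                                  # two-month-ahead transition probabilities
print(P2.round(3))
```

Multiplying a cohort's current segment distribution by powers of `P` gives the expected future segment mix, which can then be priced with per-segment ARPPU for LTV planning.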
14) Anti-patterns
Choosing k "by eye" and evaluating only silhouette without business checks.
Mixing PII and behavioral features; lack of k-anonymity in reports.
No online assigner → segments sit idle in BI without driving actions.
Overfitting to a single season/promotion; no monitoring of migrations.
Using clusters for "aggressive" marketing without RG guardrails.
One set of segments for all countries/brands without local features.
15) RACI
Data Platform (R): feature marts, pipelines, monitoring, version registry.
Data Science (R): algorithm choice, k/metrics, interpretation.
Product/CRM (A): segment-driven actions, experiments.
Risk/RG (C): restriction and human-in-the-loop (HITL) policies for high-risk segments.
Security/DPO (A/R): privacy, tokenization, k-anonymity.
BI (C): dashboards, catalogs, documentation.
16) Related Sections
Segmented Targeting, Recommendation Systems, Player Profiling, Reducing Bias, Performance Benchmarking, Analytics and Metrics API, MLOps: Model Exploitation, Data Ethics and Transparency.
Summary
Clustering is not just a UMAP plot but a production tool: clean PII-free features, stable metrics, understandable segment passports, an online assigner, and concrete actions in CRM/product/RG. Regularly audited and monitored for drift, it turns behavioral chaos into manageable strategies for growth, safety, and responsibility.