GH GambleHub

Data segmentation

Segmentation divides a set of objects (users, transactions, products, events) into homogeneous groups for targeting, personalization, analysis and risk management. Good segmentation increases margins, reduces costs, and makes decisions explainable.

1) Goals and objectives

Marketing and growth: personalized offers, contact frequency, anti-spam policy.
Monetization: price discrimination, bundles, VIP service.
Risk and compliance: control levels, KYC/AML triggers, scoring of suspicious patterns.
Product and experience: scenario-based onboarding, content/game recommendations, dynamic limits.
Operations: prioritization of support, distribution of limits and quotas.

Define up front the segmentation unit (user/session/merchant), the horizon (7/30/90 days), the refresh frequency (online/daily/weekly) and the target KPIs.

2) Segment taxonomy

Demographics/geo: country, language, platform.
Behavioral: activity, frequency, depth, time of day, favorite categories.
Value-based: ARPU/ARPPU, LTV quantiles, margin.
Stage: onboarding, mature, dormant, reactivated.
RFM: Recency, Frequency, Monetary with bins/quantiles.
Cohort: by enrollment date/first payment/source.
Risk segments: chargeback-risk, bonus-abuse-risk, abnormal activity.
Life cycle: propensity-to-churn, propensity-to-buy, next-best-action.
Contextual: device/channel/regional rules.
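
The RFM item above can be sketched as follows. This is a minimal illustration, assuming three quantile bins and a hypothetical input layout (`user_id → (last purchase date, purchase count, total spend)`); the function names are illustrative and not part of any library.

```python
from datetime import date

def quantile_bin(value, sorted_values, n_bins=3):
    """1-based bin from the share of values <= value (ceil without floats)."""
    rank = sum(1 for v in sorted_values if v <= value)
    return max(1, -(-rank * n_bins // len(sorted_values)))

def rfm_scores(users, today, n_bins=3):
    """users: {user_id: (last_purchase_date, purchase_count, total_spend)}."""
    recency = {u: (today - d).days for u, (d, _, _) in users.items()}
    rec_sorted = sorted(recency.values())
    freq_sorted = sorted(f for _, f, _ in users.values())
    mon_sorted = sorted(m for _, _, m in users.values())
    scores = {}
    for u, (_, f, m) in users.items():
        # Fewer days since the last purchase is better, so the recency bin is inverted.
        r_bin = n_bins + 1 - quantile_bin(recency[u], rec_sorted, n_bins)
        scores[u] = (
            r_bin,
            quantile_bin(f, freq_sorted, n_bins),
            quantile_bin(m, mon_sorted, n_bins),
        )
    return scores

# Hypothetical sample data for illustration only.
users = {
    "a": (date(2025, 10, 14), 10, 500.0),
    "b": (date(2025, 9, 1), 3, 80.0),
    "c": (date(2025, 6, 1), 1, 10.0),
}
scores = rfm_scores(users, today=date(2025, 10, 15))
```

Each user receives an (R, F, M) triple where a higher number is better on every axis; the triples can then be grouped into named segments.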

3) Data and preparation

Point-in-time correctness: features are computed only from data that was already available at the scoring moment (the "past").

Aggregates by window: 7/30/90-day sums/frequencies/quantiles.
Normalization: robust scaling (median/MAD), log transformations for long tails.
Categories: one-hot/target/hash encoding; handling of rare values.
Quality: missing values, duplicates, schema drift, time zone alignment.
Semantics: explicit business rules (for example, deposits ≥ 1) before ML segmentation.
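
Two of the preparation steps above, sketched under simple assumptions: a window aggregate that respects the point-in-time cutoff, and median/MAD scaling after a log transform. The event layout `(timestamp, amount)` and the function names are hypothetical.

```python
import math
from datetime import datetime, timedelta
from statistics import median

def window_sum(events, as_of, days):
    """Sum amounts in [as_of - days, as_of); events at or after as_of are ignored."""
    start = as_of - timedelta(days=days)
    return sum(amount for ts, amount in events if start <= ts < as_of)

def robust_scale(values):
    """Center on the median and divide by the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]  # no spread around the median: nothing to scale
    return [(v - med) / mad for v in values]

def log_then_scale(values):
    """log1p compresses the long tail before robust scaling."""
    return robust_scale([math.log1p(v) for v in values])

# Hypothetical events for one user.
events = [
    (datetime(2025, 9, 1), 5.0),
    (datetime(2025, 10, 1), 20.0),
    (datetime(2025, 10, 14), 50.0),
    (datetime(2025, 10, 15), 999.0),  # at the cutoff, so it must not leak into features
]
as_of = datetime(2025, 10, 15)
sums = {d: window_sum(events, as_of, d) for d in (7, 30, 90)}
```

The half-open window keeps the event that happens exactly at the scoring moment out of the feature, which is the simplest way to avoid leakage in this sketch.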

4) Segmentation methods

4.1. White-box rules and thresholds

Simple conditions: "VIP if LTV ≥ X and frequency ≥ Y."

Pros: easy to understand, quick to implement as a policy.
Cons: brittle under drift; hard to maintain as the number of rules grows.
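
A rule of the form "VIP if LTV ≥ X and frequency ≥ Y" might look like the sketch below. The threshold values, field names and the `REGULAR` label are placeholders chosen for illustration, not values from the text.

```python
def assign_segment(user, ltv_threshold=1000.0, freq_threshold=8):
    """White-box rule: VIP requires both a high LTV and a high 30-day frequency."""
    if user["ltv"] >= ltv_threshold and user["frequency_30d"] >= freq_threshold:
        return "VIP"
    return "REGULAR"

# Hypothetical users.
high = assign_segment({"ltv": 2500.0, "frequency_30d": 12})
rich_but_rare = assign_segment({"ltv": 2500.0, "frequency_30d": 2})
```

Keeping the thresholds as parameters makes periodic revision a configuration change rather than a code change.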

4.2. Clustering (unsupervised)

k-means/k-medoids: quick baseline on numeric features.
GMM: soft assignments, probabilistic segments.
HDBSCAN/DBSCAN: free-form clusters + "noise" as anomalies.
Spectral/EM on mixed types: for complex geometries.
Feature learning → cluster: first embeddings (autoencoder/transformer), then clustering in latent space.
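
For orientation, the k-means baseline can be written in a few lines of plain Python. This is a teaching sketch with random initialization from the data and a fixed iteration count; in practice a library implementation would normally be used, and features should be scaled first.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)  # fixed seed so runs are comparable
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Two well-separated hypothetical groups in a 2-feature space.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

The cluster numbers themselves are arbitrary; only the grouping is meaningful, which is why clusters are described and named after the fact.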

4.3. Supervised segmentation (target-driven)

Train a model on the target KPI (for example, LTV or risk), then build segments from prediction quantiles, SHAP profiles and decision trees.
Pros: segments are tied to a business goal; uplift is easy to check.
Cons: risk of overfitting; rigorous validation is needed.
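
Once a model produces per-user predictions, the quantile step is simple. The sketch below assumes the predictions already exist as a `user_id → score` mapping and uses three illustrative labels; the model itself is out of scope here.

```python
def quantile_segments(predictions, labels=("low", "mid", "high")):
    """Rank users by predicted value and cut the ranking into equal-sized bands."""
    ranked = sorted(predictions, key=predictions.get)
    n = len(ranked)
    return {
        uid: labels[min(i * len(labels) // n, len(labels) - 1)]
        for i, uid in enumerate(ranked)
    }

# Hypothetical predicted LTV scores.
predictions = {"a": 0.1, "b": 0.5, "c": 0.9, "d": 0.2, "e": 0.7, "f": 0.95}
segments = quantile_segments(predictions)
```

Band boundaries should be fitted on the training period and then frozen, otherwise the same user can change segment just because the population shifted.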

4.4. Frequency motifs and rules

RFM matrices, association rules (support/lift), frequent sequences (PrefixSpan): especially useful for product navigation and bundles.
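
Support, confidence and lift for a single candidate rule can be computed directly from baskets, as in this sketch. The basket contents are hypothetical; mining all frequent itemsets would need a dedicated algorithm and is not shown.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent <= b)
    n_c = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Hypothetical baskets of product categories.
baskets = [{"A", "B"}, {"A", "B"}, {"A"}, {"B", "C"}]
support, confidence, lift = rule_metrics(baskets, {"A"}, {"B"})
```

A lift below 1, as in this toy data, means the two items co-occur less often than independence would predict, so the pair is a weak bundle candidate.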

4.5. Graph/network segments

Communities of linked entities (shared devices, payment methods, referrals); GNNs to enrich features.
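
The simplest form of graph segment is a connected component: users who share a device or payment method, directly or through a chain. The sketch below uses union-find for that; it is a stand-in for real community detection (modularity-based methods and the like), and the link format is an assumption.

```python
from collections import defaultdict

def linked_groups(links):
    """links: iterable of (user_id, shared_key), e.g. ("u1", "device:A")."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for user, key in links:
        # Users and keys live in separate namespaces to avoid accidental collisions.
        parent[find(("user", user))] = find(("key", key))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "user":
            groups[find(node)].add(node[1])
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

# Hypothetical links: u1-u2 share a device, u2-u3 share a card, u4 is isolated.
links = [
    ("u1", "device:A"),
    ("u2", "device:A"),
    ("u2", "card:X"),
    ("u3", "card:X"),
    ("u4", "device:B"),
]
groups = linked_groups(links)
```

Group size and the number of shared keys can then be fed back as features into the risk segments.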

5) Choosing an approach: quick matrix

Situation | Data | Recommendation
Need a managed policy | Table + business rules | Rule-based + periodic revision
Search for "natural" groups | Many numerical features | k-means/GMM, then describe the clusters
Strong nonlinearity | Mixed/high-dimensional | Embeddings → HDBSCAN
Direct target (LTV/risk) | Labels/target available | Supervised segmentation on predictions
Networks/relationships | Graph | Community detection + graph features

6) Segmentation quality assessment

Internal metrics (no reference):
  • Silhouette/Davies-Bouldin/Calinski-Harabasz: compactness and separability.
  • Stability: Jaccard/ARI between restarts/bootstraps.
  • Informativeness: between-segment variance of key features.
External/Business Metrics:
  • Homogeneity by KPI: differences in LTV/conversion/risk between segments.
  • Actionability: the proportion of segments for which the response to interventions differs.
  • Uplift/A/B: gain from segment-level targeting vs. blanket targeting.
  • Coverage: % of users in "live" segments (not just "noise").
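
One way to quantify informativeness and KPI homogeneity is the share of a KPI's total variance that lies between segments (eta squared). The sketch below assumes one KPI value and one segment label per user; the function name is illustrative.

```python
def between_segment_variance_share(values, segments):
    """Between-segment sum of squares divided by the total sum of squares."""
    n = len(values)
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    groups = {}
    for value, segment in zip(values, segments):
        groups.setdefault(segment, []).append(value)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0

# Hypothetical KPI values with a useful and a useless segmentation.
kpi = [1.0, 1.0, 9.0, 9.0]
useful = between_segment_variance_share(kpi, ["a", "a", "b", "b"])
useless = between_segment_variance_share(kpi, ["a", "b", "a", "b"])
```

Values near 1 mean the segments explain most of the KPI spread; values near 0 mean segment membership says little about the KPI.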

7) Validation and robustness

Temporal CV: checking the stability of segments over time (rolling windows).
Group validation: do not mix users/devices between train/val.
Replication: rerun in neighboring markets/channels.
Drift: PSI/JS divergence over features and segment distribution; alert thresholds.
Fixed seeds/initialization: needed to compare segmentation versions.
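
PSI over a segment (or binned feature) distribution is the sum of (actual − expected) · ln(actual / expected) across bins. A minimal sketch, assuming both distributions are given as share dictionaries and using a small floor to avoid taking the log of zero:

```python
import math

def psi(expected_share, actual_share, eps=1e-6):
    """Population Stability Index between two share distributions."""
    total = 0.0
    for key in set(expected_share) | set(actual_share):
        e = max(expected_share.get(key, 0.0), eps)  # floor guards against log(0)
        a = max(actual_share.get(key, 0.0), eps)
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical segment shares: training period vs. current period.
baseline = {"A": 0.5, "B": 0.5}
current = {"A": 0.7, "B": 0.3}
drift = psi(baseline, current)
```

The resulting value can be compared against whatever alert threshold the team has agreed on.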

8) Interpretability

Segment passports: description of rules/centroids, key features (top-SHAP/permutation), audience portrait, KPI profile.
Visualization: UMAP/t-SNE with segment colors, a grid of metrics by segment.
Labels for activation: human-readable names ("High-Value Infrequent," "Risky Newcomers").

9) Operational implementation

Feature store: the same feature computation functions online and offline.
Rescoring: SLA and frequency (online at login, once daily, on event).
API/batch export: user ID → segment/probability/timestamps.
Versioning: `SEG_MODEL_vX`, data contract, training set freeze date.
Policies: for each segment - rules of action (offer/limits/support priority).
Fail-safe: default segment upon degradation (no feature/timeouts).
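
The fail-safe and versioning items can be combined in a thin scoring wrapper. In this sketch the required feature names, the `UNKNOWN` default label and the record fields are assumptions; `scorer` stands for whatever rule or model actually assigns the segment.

```python
from datetime import datetime, timezone

DEFAULT_SEGMENT = "UNKNOWN"      # assumed fallback label
MODEL_VERSION = "SEG_MODEL_vX"   # placeholder version tag from the text

def score_user(user_id, features, scorer, required=("ltv", "frequency_30d")):
    """Return a versioned scoring record; fall back to the default on any failure."""
    try:
        if features is None or any(features.get(k) is None for k in required):
            raise ValueError("missing features")
        segment, degraded = scorer(features), False
    except Exception:
        # Missing features, timeouts and scorer errors all degrade to the default.
        segment, degraded = DEFAULT_SEGMENT, True
    return {
        "user_id": user_id,
        "segment": segment,
        "version": MODEL_VERSION,
        "degraded": degraded,
        "scored_at": datetime.now(timezone.utc).isoformat(),
    }

def always_vip(features):
    return "VIP"

def timing_out(features):
    raise TimeoutError("feature service did not answer")

ok = score_user("u1", {"ltv": 10.0, "frequency_30d": 3}, always_vip)
missing = score_user("u2", {"ltv": None, "frequency_30d": 3}, always_vip)
failed = score_user("u3", {"ltv": 10.0, "frequency_30d": 3}, timing_out)
```

The `degraded` flag makes the share of fallback assignments easy to monitor.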

10) Experimentation and decision-making

A/B/n by segment: we test different offers/limits on the same segment grid.
Uplift: targeting effect vs. control (Qini/AUUC, uplift@k).
Budget allocation: distribute the budget across segments subject to margin/risk limits.
Guardrails: FPR/FNR for risk segments, contact rate and audience fatigue.
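
The most basic per-segment uplift estimate is the conversion rate in treatment minus the conversion rate in control. The sketch below assumes rows of `(segment, group, converted)` from a randomized test; it does not compute Qini or AUUC and ignores confidence intervals.

```python
from collections import defaultdict

def uplift_by_segment(rows):
    """rows: iterable of (segment, group, converted), group in {"treatment", "control"}."""
    stats = defaultdict(lambda: {"treatment": [0, 0], "control": [0, 0]})
    for segment, group, converted in rows:
        stats[segment][group][0] += 1               # users in the group
        stats[segment][group][1] += int(converted)  # conversions in the group
    result = {}
    for segment, groups in stats.items():
        (n_t, c_t), (n_c, c_c) = groups["treatment"], groups["control"]
        if n_t and n_c:  # skip segments that lack one of the arms
            result[segment] = c_t / n_t - c_c / n_c
    return result

# Hypothetical experiment log.
rows = (
    [("A", "treatment", True)] * 3 + [("A", "treatment", False)]
    + [("A", "control", True)] + [("A", "control", False)] * 3
    + [("B", "treatment", True), ("B", "treatment", False)]
    + [("B", "control", True), ("B", "control", False)]
)
uplift = uplift_by_segment(rows)
```

Segments whose uplift is indistinguishable from zero are candidates for exclusion from the campaign budget.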

11) Ethics, privacy, compliance

Data minimization: we use the required minimum, pseudonymization.
Fairness: compare error rates and policy strictness across sensitive groups; exclude protected attributes from the rules or apply fairness corrections.
Right to explanation: document the segment assignment logic.
Audit: log of versions, input features, decisions and results of campaigns by segments.

12) Artifact templates

Segment passport

Code/Version: `SEG_HVIF_v3`

Description: "High value, rare activity"

Criteria/Center: `LTV_quantile ≥ 0.9`, `Recency_days ∈ [15,45]`, `Frequency_30d ∈ [1,3]`

Size/reach: 4.8% of users (last 30 days)

KPI profile: ARPPU ↑ 2.4× the median; churn risk: average

Recommendations: soft re-engage offers, cross-sell premium products, frequency limit 1/7d

Risks: excessive discounts → discount dependence

Owner: CRM/Monetization

Date/validity: 2025-10-15; quarterly revision

Segmentation Contract

Source feature: `fs.user_activity_v5`

Schedule: nightly batch at 02:00 UTC; online update on the `purchase` event

Service: `segmentor.api/v1/score` (p95 ≤ 120 ms)

Logs: `seg_scoring_log` (feature hash, version, score, segment)

Alerts: "UNKNOWN" share > 2%; PSI on key features > 0.2; segment imbalance > 10 pp per day
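
The alert thresholds in the contract translate into a small check. This sketch reads "segment imbalance > 10 pp per day" as a day-over-day change in a segment's share, which is an interpretation rather than something the contract states; the input layout and alert names are hypothetical.

```python
def contract_alerts(today_share, yesterday_share, psi_by_feature):
    """Return the list of triggered alerts for one daily check."""
    alerts = []
    if today_share.get("UNKNOWN", 0.0) > 0.02:      # "UNKNOWN" share > 2%
        alerts.append("unknown_share")
    for feature, value in sorted(psi_by_feature.items()):
        if value > 0.2:                             # PSI on key features > 0.2
            alerts.append(f"psi:{feature}")
    for segment in sorted(set(today_share) | set(yesterday_share)):
        change = today_share.get(segment, 0.0) - yesterday_share.get(segment, 0.0)
        if abs(change) > 0.10:                      # assumed: > 10 pp day over day
            alerts.append(f"imbalance:{segment}")
    return alerts

# Hypothetical daily snapshot.
yesterday = {"VIP": 0.10, "REGULAR": 0.88, "UNKNOWN": 0.02}
today = {"VIP": 0.10, "REGULAR": 0.65, "UNKNOWN": 0.25}
alerts = contract_alerts(today, yesterday, {"ltv": 0.05, "frequency_30d": 0.31})
```

An empty list means the daily check passed.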

13) Pre-release checklist

  • Segmentation impact goals and KPIs agreed
  • Unit, windows and refresh frequency defined
  • There is a baseline (rule-based) and an ML variant; uplift comparison
  • Segment documentation + visualization and human-readable labels
  • A/B tests, guardrails and drift alerts configured
  • Versioning, data contracts, incident runbooks
  • Per-segment and default-fallback action policies

Summary

Segmentation is not a "one-time clustering" but a control loop: correct data and windows, transparent segments, linkage to KPIs, rigorous validation, operational SLOs, and drift monitoring. Add complexity (embeddings, graphs, supervised approaches) only where it gives a measurable uplift and remains explainable for business and compliance.
