Data segmentation
Data segmentation
Segmentation is the division of many objects (users, transactions, products, events) into homogeneous groups for targeting, personalization, analysis and risk management. Good segmentation increases margins, reduces costs, and makes decisions explicable.
1) Goals and objectives
Marketing and growth: personalized offers, contact frequency, anti-spam policy.
Monetization: price discrimination, bundles, VIP service.
Risk and compliance: control levels, KYC/AML triggers, scoring of suspicious patterns.
Product and experience: onboarding by scripts, content/game recommendations, dynamic limits.
Operations: prioritization of support, distribution of limits and quotas.
We formulate the segmentation unit (user/session/merchant), horizon (7/30/90 days), conversion frequency (online/daily/weekly) and target KPIs.
2) Segment taxonomy
Demographics/geo: country, language, platform.
Behavioral: activity, frequency, depth, time of day, favorite categories.
Value-based: ARPU/ARPPU, LTV quantiles, marginality.
Stage: onboarding, mature, "sleeping," returned.
RFM: Recency, Frequency, Monetary with bins/quantiles.
Cohort: by enrollment date/first payment/source.
Risk segments: chargeback-risk, bonus-abuse-risk, abnormal activity.
Life cycle: propensity-to-churn, propensity-to-buy, next-best-action.
Contextual: device/channel/regional rules.
3) Data and preparation
Point-in-time correctness: signs are counted from the available "past."
Aggregates by window: 7/30/90-day sums/frequencies/quantiles.
Normalization: robast scaling (median/MAD), log transformations for long tails.
Categories: one-hot/target/hash; control of "rare" values.
Quality: omissions, duplicates, drift of circuits, synchronization of time zones.
Semantics: explicit business rules (for example, deposit ≥1) before ML segmentation.
4) Segmentation methods
4. 1. White-box rules and thresholds
Simple conditions: "VIP if LTV ≥ X and frequency ≥ Y."
Pros: understandable, quickly implemented as a policy.
Cons: fragility when drifting, complexity of support when the number of rules grows.
4. 2. Clustering (unsupervised)
k-means/k-medoids: quick baseline on numeric features.
GMM: soft accessories, probabilistic segments.
HDBSCAN/DBSCAN: free-form clusters + "noise" as anomalies.
Spectral/EM on mixed types: for complex geometries.
Feature learning → cluster: first embeddings (autoencoder/transformer), then clustering in latent space.
4. 3. Supervise-segmentation (target-driven)
We train the model on the target KPI (for example, LTV/risk), and build segments according to prediction quantiles, SHAP profiles and decision trees.
Pros: segments are "tied" to a business goal, it is easy to check uplift.
Cons: risk of "fit"; rigorous validation is needed.
4. 4. Frequency motifs and rules
RFM matrices, associative rules (support/lift), frequent sequences (PrefixSpan) - especially for product navigation and bundles.
4. 5. Graph/Network Segments
Communication communities (devices, payment methods, referrals); GNN to enrich traits.
5) Choice of approach: fast matrix
6) Segmentation quality assessment
Internal metrics (no reference):- Silhouette/Davies-Bouldin/Calinski-Harabasz: compactness and separability.
- Stability: Jaccard/ARI between restarts/bootstraps.
- Informativity: intersegment variance of key features.
- Homogeneity by KPI: differences in LTV/conversion/risk between segments.
- Actionability: the proportion of segments for which the response to interventions differs.
- Uplift/A/B: segment targeting gain vs total targeting.
- Coverage:% of users in "live" segments (not just "noise").
7) Validation and robustness
Temporal CV: checking the stability of segments over time (rolling windows).
Group validation: do not mix users/devices between train/val.
Replication - Run in neighboring markets/channels.
Drift: PSI/JS-div by features and segment distribution; thresholds on alerts.
Stable sides/initialization: to compare segmentation versions.
8) Interpretability
Segment passports: description of rules/centroids, key features (top-SHAP/permutation), audience portrait, KPI profile.
Visualization: UMAP/t-SNE with segment colors, "lattice" of metrics by segment.
Rules for activation: human tabs ("High-Value Infrequent," "Risky Newcomers").
9) Operational implementation
Fichestor: uniform online/offline feature calculation functions.
Rescoring: SLA and frequency (online at entry, once daily, at event).
API/batch export: user ID → segment/probability/timestamps.
Versioning: 'SEG _ MODEL _ vX', data contract, training set freeze date.
Policies: for each segment - rules of action (offer/limits/support priority).
Fail-safe: default segment upon degradation (no feature/timeouts).
10) Experimentation and decision-making
A/B/n by segment: we test different offers/limits on the same segment grid.
Uplift: targeting effect vs control (Qini/AUUC, uplift @ k).
Budget allocation: we distribute the budget by segments by margin/risk limits.
Guardrails: FPR/FNR for risk segments, contact rate and audience fatigue.
11) Ethics, privacy, compliance
Data minimization: we use the required minimum, pseudonymization.
Fairness: compare errors and "rigidity" of policies by sensitive segments; exclude Protected Attributes from the rules, or apply fairness corrections.
Right to explain: Document segment assignment logic.
Audit: log of versions, input features, decisions and results of campaigns by segments.
12) Artifact patterns
Segment passport
Code/Version: 'SEG _ HVIF _ v3'
Description: "High value, rare activity"
Criteria/Center: 'LTV _ quantile ≥ 0. 9`, `Recency_days ∈ [15,45]`, `Frequency_30d ∈ [1,3]`
Size/reach: 4. 8% of users (last 30 days)
KPI profile: ARPPU ↑ 2. 4 × of median, Churn-risk average
Recommendations: soft re-engage offers, cross-sell premium products, frequency limit 1/7d
Risks: excessive discounts → "addiction"
Owner: CRM/Monetization
Date/validity: 2025-10-15; quarterly revision
Segmentation Contract
Source feature: 'fs. user_activity_v5`
Schedule: night batch 02:00 UTC; online update on the 'purchase' event
Service: 'segmentor. api/v1/score` (p95 ≤ 120 мс)
Logs: 'seg _ scoring _ log' (feature hash, version, speed, segment)
Alerts: "UNKNOWN" share> 2%; PSI by key features> 0. 2; segment imbalance> 10 pp per day
13) Pre-release checklist
- Segmentation impact goals and KPIs agreed
- Unit, windows and conversion frequency defined
- There is a baseline (rule-based) and an ML variant; uplift comparison
- Segment Documentation + Visualization and Human Tabs
- Tuned A/B, guardrails and drift alerts
- Versioning, data contracts, incident runibooks
- Per-segment and default-fallback action policies
Total
Segmentation is not a "one-time clustering" but a control loop: correct data and windows, transparent segments, linkage to KPIs, rigorous validation, operational SLOs, and drift monitoring. Add complexity (embeddings, graphs, supervise approach) only where it gives a measurable uplift and remains explainable for business and compliance.