Data segmentation

Segmentation is the division of a set of objects (users, transactions, products, events) into homogeneous groups for targeting, personalization, analysis, and risk management. Good segmentation increases margins, reduces costs, and makes decisions explainable.

1) Goals and objectives

Marketing and growth: personalized offers, contact frequency, anti-spam policy.
Monetization: price discrimination, bundles, VIP service.
Risk and compliance: control levels, KYC/AML triggers, scoring of suspicious patterns.
Product and experience: scenario-based onboarding, content/game recommendations, dynamic limits.
Operations: prioritization of support, distribution of limits and quotas.

Define the segmentation unit (user/session/merchant), the horizon (7/30/90 days), the recalculation frequency (online/daily/weekly), and the target KPIs.

2) Segment taxonomy

Demographics/geo: country, language, platform.
Behavioral: activity, frequency, depth, time of day, favorite categories.
Value-based: ARPU/ARPPU, LTV quantiles, margin.
Stage: onboarding, mature, "dormant," reactivated.
RFM: Recency, Frequency, Monetary with bins/quantiles.
Cohort: by enrollment date/first payment/source.
Risk segments: chargeback-risk, bonus-abuse-risk, abnormal activity.
Life cycle: propensity-to-churn, propensity-to-buy, next-best-action.
Contextual: device/channel/regional rules.
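
As an illustration of the RFM item above, here is a minimal sketch of quantile-based RFM scoring using only the standard library. The user aggregates, the three-bin choice, and the helper names `quantile_edges` and `bin_score` are assumptions made for the example, not part of any existing pipeline.

```python
from bisect import bisect_right
from statistics import quantiles

def quantile_edges(values, n_bins):
    """Inner cut points that split the values into n_bins quantile bins."""
    return quantiles(values, n=n_bins, method="inclusive")

def bin_score(value, edges, reverse=False):
    """Score 1..n_bins by position among the edges; reverse=True scores low values high."""
    score = bisect_right(edges, value) + 1
    n_bins = len(edges) + 1
    return n_bins + 1 - score if reverse else score

# Hypothetical per-user aggregates: (recency_days, frequency_30d, monetary_30d).
users = {
    "u1": (2, 14, 320.0),
    "u2": (40, 1, 15.0),
    "u3": (7, 6, 90.0),
    "u4": (21, 2, 500.0),
    "u5": (1, 20, 45.0),
    "u6": (60, 1, 5.0),
}

N_BINS = 3  # assumed bin count; 5 is also common
recency, frequency, monetary = zip(*users.values())
r_edges = quantile_edges(recency, N_BINS)
f_edges = quantile_edges(frequency, N_BINS)
m_edges = quantile_edges(monetary, N_BINS)

# Recency is reversed: fewer days since the last activity gives a higher score.
rfm = {
    uid: (bin_score(r, r_edges, reverse=True), bin_score(f, f_edges), bin_score(m, m_edges))
    for uid, (r, f, m) in users.items()
}
```

The resulting (R, F, M) triples can be used directly as a segment grid or collapsed into named groups.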

3) Data and preparation

Point-in-time correctness: features are computed only from data that was available at scoring time (the "past"), so no future information leaks into them.

Aggregates by window: 7/30/90-day sums/frequencies/quantiles.
Normalization: robust scaling (median/MAD), log transformations for long tails.
Categories: one-hot/target/hash encoding; control of "rare" values.
Quality: missing values, duplicates, schema drift, time zone synchronization.
Semantics: explicit business rules (for example, deposit ≥ 1) before ML segmentation.
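
The point-in-time, window, and normalization items above can be sketched as follows. This is a simplified illustration: the event list, the `as_of` cutoff, and the function names are assumptions, and a production feature pipeline would normally do this in SQL or a feature store.

```python
import math
from datetime import datetime, timedelta
from statistics import median

def window_sum(events, as_of, days):
    """Sum amounts of events in [as_of - days, as_of); events at or after as_of are ignored."""
    start = as_of - timedelta(days=days)
    return sum(amount for ts, amount in events if start <= ts < as_of)

def robust_scale(values):
    """Center by the median and scale by the MAD (median absolute deviation)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = mad if mad > 0 else 1.0  # guard against constant features
    return [(v - med) / scale for v in values]

# Hypothetical payment events for one user: (timestamp, amount).
events = [
    (datetime(2025, 1, 1), 10.0),
    (datetime(2025, 1, 20), 30.0),
    (datetime(2025, 2, 5), 99.0),  # after the cutoff, must not leak into features
]
as_of = datetime(2025, 1, 25)
features = {f"sum_{d}d": window_sum(events, as_of, d) for d in (7, 30, 90)}

# Long-tailed feature across users: log1p first, then robust scaling.
spend = [1, 2, 3, 4, 100]
scaled = robust_scale(spend)
log_scaled = robust_scale([math.log1p(v) for v in spend])
```

Comparing `scaled` with `log_scaled` shows how the log transform pulls the outlier closer to the rest of the distribution before clustering.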

4) Segmentation methods

4.1. White-box rules and thresholds

Simple conditions: "VIP if LTV ≥ X and frequency ≥ Y."

Pros: easy to understand, quick to implement as a policy.
Cons: brittle under drift; maintenance gets harder as the number of rules grows.
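
A rule set like the "VIP if LTV ≥ X and frequency ≥ Y" condition might be expressed as an ordered list of thresholds. The numeric thresholds and the segment names other than VIP are placeholders for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    segment: str
    min_ltv: float
    min_frequency: int

# Ordered from most to least demanding; the first matching rule wins.
# The numbers stand in for the X and Y thresholds and are purely illustrative.
RULES = [
    Rule("VIP", min_ltv=1000.0, min_frequency=8),
    Rule("REGULAR", min_ltv=100.0, min_frequency=2),
]
DEFAULT_SEGMENT = "BASE"  # hypothetical catch-all segment

def assign_segment(ltv, frequency):
    """Return the first segment whose thresholds are both satisfied."""
    for rule in RULES:
        if ltv >= rule.min_ltv and frequency >= rule.min_frequency:
            return rule.segment
    return DEFAULT_SEGMENT
```

Keeping the rules as data rather than nested `if` statements makes periodic revision and versioning easier as the rule count grows.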

4.2. Clustering (unsupervised)

k-means/k-medoids: quick baseline on numeric features.
GMM: soft assignments, probabilistic segments.
HDBSCAN/DBSCAN: free-form clusters + "noise" as anomalies.
Spectral/EM on mixed types: for complex geometries.
Feature learning → cluster: first embeddings (autoencoder/transformer), then clustering in latent space.
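
To make the k-means baseline concrete, here is a small pure-Python sketch. It uses a deterministic farthest-point initialization instead of random restarts so that the example is reproducible; that choice, the toy points, and the function names are assumptions. In practice a library implementation on scaled features would be the usual route.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from all centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until labels stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Toy, already-scaled features (for example, activity and spend).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
```

The centroids returned here are what a segment description would be built from once each cluster has been profiled.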

4.3. Supervised segmentation (target-driven)

We train a model on the target KPI (for example, LTV/risk) and build segments from prediction quantiles, SHAP profiles, and decision trees.
Pros: segments are "tied" to a business goal, so uplift is easy to check.
Cons: risk of overfitting; rigorous validation is needed.
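
One way to build and sanity-check prediction-quantile segments is sketched below: users are ranked by predicted score, split into near-equal groups, and the actual KPI on a holdout set is averaged per group. The scores and holdout values are made up for the example, and the model that produced them is out of scope here.

```python
def quantile_segments(scores, n_segments):
    """Rank-based split into near-equal groups; segment 0 holds the lowest predictions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    segments = [0] * len(scores)
    for rank, i in enumerate(order):
        segments[i] = rank * n_segments // len(scores)
    return segments

def mean_by_segment(segments, actuals, n_segments):
    """Average actual KPI per segment; None for empty segments."""
    sums = [0.0] * n_segments
    counts = [0] * n_segments
    for seg, value in zip(segments, actuals):
        sums[seg] += value
        counts[seg] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical holdout data: model predictions and the LTV actually observed later.
predicted = [0.1, 0.9, 0.4, 0.8, 0.2, 0.6]
actual_ltv = [5, 120, 30, 90, 10, 40]

segments = quantile_segments(predicted, n_segments=3)
kpi_profile = mean_by_segment(segments, actual_ltv, n_segments=3)
```

A KPI profile that rises monotonically across segments on holdout data is a quick signal that the segments reflect the target rather than an overfitted model.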

4.4. Frequency motifs and rules

RFM matrices, association rules (support/lift), frequent sequences (PrefixSpan): especially useful for product navigation and bundles.
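
The support and lift measures mentioned above can be computed directly from transaction sets. The baskets and item names below are invented for the example; a real pipeline would mine candidate itemsets first (for example with Apriori) rather than score a single hand-picked rule.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent); assumes the antecedent occurs at least once."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Ratio of observed co-occurrence to what independence would predict; > 1 means attraction."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (support(transactions, antecedent) * support(transactions, consequent))

# Hypothetical baskets of product categories.
baskets = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"c"},
    {"a", "b"},
]

rule_support = support(baskets, {"a", "b"})
rule_confidence = confidence(baskets, {"a"}, {"b"})
rule_lift = lift(baskets, {"a"}, {"b"})
```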

4.5. Graph/network segments

Communities of linked entities (devices, payment methods, referrals); GNNs to enrich features.
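
As a deliberately simple stand-in for community detection, the sketch below groups users that are linked through shared devices or payment methods by finding connected components with union-find. Real community detection (for example Louvain) is more selective; the node naming scheme and the edges are assumptions.

```python
from collections import defaultdict

def connected_components(edges):
    """Group nodes linked by any chain of edges, using union-find with path halving."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())

# Hypothetical links: a user is connected to the devices and cards it has used.
edges = [
    ("user:1", "device:A"),
    ("user:2", "device:A"),
    ("user:2", "card:X"),
    ("user:3", "card:X"),
    ("user:4", "device:B"),
]

# Keep only user nodes in each component to get user communities.
communities = sorted(
    sorted(n for n in group if n.startswith("user:"))
    for group in connected_components(edges)
)
```

Component size or the number of shared payment methods can then be fed back into the feature set as graph features.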

5) Choice of approach: quick matrix

Situation	Data	Recommendation
Need a managed policy	Table + business rules	Rule-based + periodic revision
Search for "natural" groups	Many numerical features	k-means/GMM, then describe the clusters
Strong nonlinearity	Mixed/high-dimensional	Embeddings → HDBSCAN
Direct target (LTV/risk)	Labels/target available	Supervised segmentation on predictions
Networks/connections	Graph	Community detection + graph features

6) Segmentation quality assessment

Internal metrics (no reference):

Silhouette/Davies-Bouldin/Calinski-Harabasz: compactness and separability.
Stability: Jaccard/ARI between restarts/bootstraps.
Informativeness: between-segment variance of key features.
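
The stability check via ARI can be illustrated with a small standard-library implementation of the Adjusted Rand Index, which compares two labelings of the same objects regardless of how the cluster IDs are numbered. The label vectors are toy data, and the function assumes at least two objects.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two partitions of the same objects: 1.0 = identical, about 0 = chance level."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)

# Toy labelings from three clustering runs over the same six users.
run_1 = [0, 0, 0, 1, 1, 1]
run_2 = [1, 1, 1, 0, 0, 0]  # same grouping, cluster IDs swapped
run_3 = [0, 0, 1, 1, 2, 2]  # a different grouping

same = adjusted_rand_index(run_1, run_2)
different = adjusted_rand_index(run_1, run_3)
```

Averaging the ARI over pairs of bootstrap or restart runs gives a single stability number per candidate segmentation.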

External/Business Metrics:

Homogeneity by KPI: differences in LTV/conversion/risk between segments.
Actionability: the proportion of segments for which the response to interventions differs.
Uplift/A/B: gain from segment targeting vs. targeting everyone.
Coverage: % of users in "live" segments (not just "noise").

7) Validation and robustness

Temporal CV: checking the stability of segments over time (rolling windows).
Group validation: do not mix users/devices between train/val.
Replication: rerun in neighboring markets/channels.
Drift: PSI/JS divergence on features and segment distribution; alert thresholds.
Fixed seeds/initialization: to compare segmentation versions.
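
For the drift item, a minimal PSI (Population Stability Index) calculation over pre-binned counts could look like this. The bin counts are invented, the `eps` floor for empty bins is an assumption, and the 0.2 alert level is taken from the alert example in the segmentation contract in section 12.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two distributions over the same bins."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the shares so empty bins do not produce log(0) or division by zero.
        e_share = max(expected / expected_total, eps)
        a_share = max(actual / actual_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value

# Hypothetical counts per feature bin (or per segment) at training time and today.
baseline = [400, 300, 200, 100]
current = [100, 200, 300, 400]

PSI_ALERT = 0.2
no_drift = psi(baseline, baseline)
drift = psi(baseline, current)
alert = drift > PSI_ALERT
```

The same function works for segment shares: pass the number of users per segment in the reference period and in the current one.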

8) Interpretability

Segment passports: description of rules/centroids, key features (top-SHAP/permutation), audience portrait, KPI profile.
Visualization: UMAP/t-SNE with segment colors, a grid of metrics by segment.
Labels for activation: human-readable names ("High-Value Infrequent," "Risky Newcomers").

9) Operational implementation

Feature store: the same feature calculation functions online and offline.
Rescoring: SLA and frequency (online at login, once daily, on event).
API/batch export: user ID → segment/probability/timestamps.
Versioning: `SEG_MODEL_vX`, data contract, training set freeze date.
Policies: for each segment - rules of action (offer/limits/support priority).
Fail-safe: default segment on degradation (missing features/timeouts).
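
The export and fail-safe items might be combined in a scoring wrapper like the one below. It is a sketch under assumptions: the required feature names, the record layout, and both toy models are hypothetical, while the `UNKNOWN` default and the `SEG_MODEL_vX` tag echo names used elsewhere in this text.

```python
from datetime import datetime, timezone

MODEL_VERSION = "SEG_MODEL_vX"  # placeholder version tag
DEFAULT_SEGMENT = "UNKNOWN"
REQUIRED_FEATURES = ("ltv", "frequency_30d")  # hypothetical feature names

def score_user(user_id, features, model):
    """Return a scoring record; fall back to the default segment on any failure."""
    try:
        missing = [name for name in REQUIRED_FEATURES if features.get(name) is None]
        if missing:
            raise ValueError(f"missing features: {missing}")
        segment, probability = model(features)
    except Exception:
        # Deliberately broad: missing features, timeouts, and model errors all degrade the same way.
        segment, probability = DEFAULT_SEGMENT, None
    return {
        "user_id": user_id,
        "segment": segment,
        "probability": probability,
        "version": MODEL_VERSION,
        "scored_at": datetime.now(timezone.utc).isoformat(),
    }

def toy_model(features):
    """Stand-in for a real model: returns (segment, probability)."""
    if features["ltv"] >= 1000 and features["frequency_30d"] >= 8:
        return "VIP", 0.9
    return "BASE", 0.7

def broken_model(features):
    """Simulates a dependency that times out."""
    raise TimeoutError("feature service timed out")

ok = score_user("u1", {"ltv": 1500, "frequency_30d": 10}, toy_model)
no_features = score_user("u2", {"ltv": 1500}, toy_model)
timed_out = score_user("u3", {"ltv": 10, "frequency_30d": 1}, broken_model)
```

Logging the fallback reason alongside the record would make the share of default assignments easier to monitor.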

10) Experimentation and decision-making

A/B/n by segment: we test different offers/limits on the same segment grid.
Uplift: targeting effect vs. control (Qini/AUUC, uplift@k).
Budget allocation: allocate the budget across segments subject to margin/risk limits.
Guardrails: FPR/FNR for risk segments, contact rate and audience fatigue.
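
The simplest per-segment uplift estimate is the difference in conversion rate between treated and control users inside each segment. The sketch below shows only that difference on invented data; it is not a Qini/AUUC computation and includes no significance test.

```python
def segment_uplift(rows):
    """rows: iterable of (segment, treated, converted). Returns conversion-rate uplift per segment."""
    stats = {}
    for segment, treated, converted in rows:
        groups = stats.setdefault(segment, {True: [0, 0], False: [0, 0]})
        groups[treated][0] += 1               # users in this arm
        groups[treated][1] += int(converted)  # conversions in this arm
    uplift = {}
    for segment, groups in stats.items():
        (n_treated, conv_treated), (n_control, conv_control) = groups[True], groups[False]
        if n_treated and n_control:  # both arms are needed for a comparison
            uplift[segment] = conv_treated / n_treated - conv_control / n_control
    return uplift

# Hypothetical experiment log: segment A responds to the offer, segment B does not.
rows = (
    [("A", True, True)] * 3 + [("A", True, False)] * 1
    + [("A", False, True)] * 1 + [("A", False, False)] * 3
    + [("B", True, True)] * 1 + [("B", True, False)] * 3
    + [("B", False, True)] * 1 + [("B", False, False)] * 3
)

uplift = segment_uplift(rows)
```

Segments with near-zero uplift are candidates for being excluded from the offer, which is where the budget allocation step starts.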

11) Ethics, privacy, compliance

Data minimization: use only the required minimum of data; apply pseudonymization.
Fairness: compare errors and policy "strictness" across sensitive groups; exclude protected attributes from the rules or apply fairness corrections.
Right to explanation: document the segment assignment logic.
Audit: log of versions, input features, decisions and results of campaigns by segments.

12) Artifact templates

Segment passport

Code/Version: `SEG_HVIF_v3`

Description: "High value, rare activity"

Criteria/Center: `LTV_quantile ≥ 0.9`, `Recency_days ∈ [15,45]`, `Frequency_30d ∈ [1,3]`

Size/reach: 4.8% of users (last 30 days)

KPI profile: ARPPU ↑ 2.4× the median, average churn risk

Recommendations: soft re-engage offers, cross-sell premium products, frequency limit 1/7d

Risks: excessive discounts → discount dependence ("addiction")

Owner: CRM/Monetization

Date/validity: 2025-10-15; quarterly revision

Segmentation Contract

Feature source: `fs.user_activity_v5`

Schedule: nightly batch at 02:00 UTC; online update on the `purchase` event

Service: `segmentor.api/v1/score` (p95 ≤ 120 ms)

Logs: `seg_scoring_log` (feature hash, version, score, segment)

Alerts: "UNKNOWN" share > 2%; PSI on key features > 0.2; segment imbalance > 10 pp per day
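
Two of the alerts in this contract could be checked as sketched below. The sketch reads "segment imbalance" as the day-over-day change in a segment's share, measured in percentage points; that reading, the function names, and the sample assignments are assumptions.

```python
from collections import Counter

def segment_shares(labels):
    """Share of users per segment."""
    total = len(labels)
    return {segment: count / total for segment, count in Counter(labels).items()}

def contract_alerts(yesterday, today, unknown_max=0.02, shift_max_pp=10.0):
    """Return alert names for the UNKNOWN-share and daily segment-shift thresholds."""
    alerts = []
    previous, current = segment_shares(yesterday), segment_shares(today)
    if current.get("UNKNOWN", 0.0) > unknown_max:
        alerts.append("unknown_share")
    for segment in sorted(set(previous) | set(current)):
        shift_pp = abs(current.get(segment, 0.0) - previous.get(segment, 0.0)) * 100
        if shift_pp > shift_max_pp:
            alerts.append(f"segment_shift:{segment}")
    return alerts

# Hypothetical segment assignments on two consecutive days.
yesterday = ["VIP"] * 20 + ["BASE"] * 80
today = ["VIP"] * 35 + ["BASE"] * 60 + ["UNKNOWN"] * 5

alerts = contract_alerts(yesterday, today)
quiet = contract_alerts(yesterday, yesterday)
```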

13) Pre-release checklist

Segmentation impact goals and KPIs agreed
Unit, windows, and recalculation frequency defined
Baseline (rule-based) and ML variant available; uplift compared
Segment documentation, visualization, and human-readable labels
A/B tests, guardrails, and drift alerts configured
Versioning, data contracts, incident runbooks
Per-segment and default-fallback action policies

Summary

Segmentation is not a "one-time clustering" but a control loop: correct data and windows, transparent segments, linkage to KPIs, rigorous validation, operational SLOs, and drift monitoring. Add complexity (embeddings, graphs, supervised approaches) only where it gives a measurable uplift and remains explainable for business and compliance.
