Dimensionality reduction
1) Why an iGaming platform should reduce dimensionality
ML speed and stability: fewer features → faster fit/serve, lower risk of overfitting.
Visualization: 2D/3D projections to detect segments, drift and anomalies.
Noise → signal: generalized factors (behavioral/payment) are more robust to outliers.
Cost: fewer online features → cheaper to store, transfer and serve.
Privacy: replace raw sensitive features with aggregated factors.
2) "Selection" vs "Construction" of signs
Feature selection: filters/wrappers/model weights - save a subset of the original features.
Feature extraction-Calculate new factors (projections/embeddings).
Combine: first, basic selection (leakage, constants, mutual information), then - the construction of factors.
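A minimal sketch of the "selection first" step, assuming scikit-learn and a labeled training frame; the function name and thresholds are illustrative, not part of the platform:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

def basic_selection(X: pd.DataFrame, y, var_eps: float = 1e-3,
                    corr_max: float = 0.95) -> list:
    # 1) Drop near-zero-variance features.
    keep = X.columns[VarianceThreshold(threshold=var_eps).fit(X).get_support()]
    # 2) Drop one feature from each highly correlated pair (keep the first).
    corr = X[keep].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    keep = [c for c in keep if not (upper[c] > corr_max).any()]
    # 3) Rank the survivors by mutual information with the target.
    mi = mutual_info_classif(X[keep], y, random_state=42)
    return [c for _, c in sorted(zip(mi, keep), key=lambda t: -t[0])]
```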
3) Methods: short map
3.1 Linear
PCA/SVD: orthogonal components that maximize explained variance. Fast, interpretable (loadings).
Factor Analysis (FA): latent factors + specific errors; good for behavioral "scales."
NMF: non-negative additive parts ("themes"/"motifs" of payments/games); interpretable because all weights are ≥ 0.
3.2 Non-linear
t-SNE: local structure and clusters in 2D/3D; visualization only (not for serving).
UMAP: preserves local and part of the global structure, faster than t-SNE; suitable as clustering preprocessing.
Autoencoders (AE/VAE): train an encoder → latent vector; can be updated online/incrementally.
Isomap/LE: less common in production (expensive and finicky).
3.3 Categorical/mixed
Category embeddings (game/provider/channel/device) + PCA/UMAP over the embedding matrix.
Gower distance → MDS/UMAP for mixed types.
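A sketch of the mixed-type route, assuming the third-party `gower` and `umap-learn` packages; the synthetic frame is illustrative only:

```python
import numpy as np
import pandas as pd
import gower   # third-party package for Gower distance on mixed types
import umap

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "bets_per_min": rng.gamma(2.0, 1.5, 500),              # numeric
    "device": rng.choice(["ios", "android", "web"], 500),  # categorical
    "vip": rng.choice([True, False], 500, p=[0.1, 0.9]),   # boolean
})
D = gower.gower_matrix(df)                  # 500x500 Gower distance matrix
emb = umap.UMAP(metric="precomputed", n_neighbors=30,
                random_state=42).fit_transform(D)
print(emb.shape)                            # (500, 2)
```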
4) Pipeline (reference)
1. Data hygiene: PII masking, tokenization, missing-value imputation, winsorizing tails.
2. Scaling: Standard/Robust scaler; log transforms for count features.
3. Selection: remove near-zero-variance features and one of each highly correlated pair (|corr| > 0.95); rank by mutual information.
4. Reduction method: PCA/UMAP/AE; fix the random seed and config (see the sketch after this list).
5. Evaluation: metrics (below), stability, visualizations.
6. Serving: serialize transforms (ONNX/PMML/model registry), time-travel for re-projections.
7. Monitoring: latent factor drift, PSI, kNN-topology preservation.
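A minimal sketch of steps 2–6 with scikit-learn; `X_train` is an assumed hygiene-checked matrix, and the file name is illustrative:

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

# Robust scaling + PCA keeping enough components for >=95% cumulative
# explained variance (n_components in (0, 1) selects by variance share).
proj = Pipeline([
    ("scale", RobustScaler()),
    ("pca", PCA(n_components=0.95, random_state=42)),
])
Z = proj.fit_transform(X_train)      # X_train: hygiene-checked matrix (assumed)
print(proj["pca"].n_components_)     # k actually selected

# Serialize the fitted transform for serving / the registry.
joblib.dump(proj, "projection_pca_1.0.0.joblib")
```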
5) Quality metrics
Explained variance (PCA): choose k by a cumulative-variance threshold (e.g., 90–95%).
Reconstruction error (AE/NMF): MSE/Poisson; SSIM for images (if CV).
Trustworthiness/Continuity (UMAP/t-SNE): 0 to 1; how well local neighborhoods are preserved.
kNN preservation: share of neighbors common to the pre- and post-projection kNN graphs (see the sketch after this list).
Downstream impact: quality of clustering/classification after the transform (F1/AUC, silhouette).
Stability: Rand/NMI between restarts, sensitivity to seed/hyperparameters.
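A sketch of the kNN-preservation metric; `X` (original features) and `Z` (projection) are assumed, and trustworthiness ships with scikit-learn:

```python
import numpy as np
from sklearn.manifold import trustworthiness
from sklearn.neighbors import NearestNeighbors

def knn_preservation(X, Z, k: int = 10) -> float:
    """Share of k nearest neighbors shared before (X) and after (Z) projection."""
    idx_x = NearestNeighbors(n_neighbors=k + 1).fit(X) \
        .kneighbors(X, return_distance=False)[:, 1:]   # [:, 1:] drops self
    idx_z = NearestNeighbors(n_neighbors=k + 1).fit(Z) \
        .kneighbors(Z, return_distance=False)[:, 1:]
    shared = [len(set(a) & set(b)) / k for a, b in zip(idx_x, idx_z)]
    return float(np.mean(shared))

# trustworthiness(X, Z, n_neighbors=10)  # 1.0 = local structure fully kept
```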
6) Practical recipes for tasks
6.1 Player clustering
UMAP → HDBSCAN: reliably surfaces segments such as "live/social," "bonus-hunters," "crash-risk" (sketch below).
PCA baseline for quick interpretation (loadings reveal "bets/min," "volatility," "evening pattern").
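A sketch of the UMAP → HDBSCAN recipe with the parameters from template 11.2; assumes the `umap-learn` and `hdbscan` packages and a scaled feature matrix `X`:

```python
import umap
import hdbscan
from sklearn.metrics import silhouette_score

emb = umap.UMAP(n_neighbors=30, min_dist=0.05, metric="cosine",
                random_state=42).fit_transform(X)   # X: scaled matrix (assumed)
labels = hdbscan.HDBSCAN(min_cluster_size=120, min_samples=15).fit_predict(emb)

mask = labels != -1                                 # HDBSCAN marks noise as -1
print("clusters:", labels.max() + 1,
      "silhouette:", silhouette_score(emb[mask], labels[mask]))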
6.2 Anti-fraud and payments
NMF on the (player × payment method) matrix reveals the "motifs" of payment routes; then k-means/GMM on the factor weights.
AE on deposit/withdrawal behavior: feed the latent vector to an anomaly model (Isolation Forest/OC-SVM), as sketched below.
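A sketch of the latent-vector-to-anomaly-model hand-off; `Z_latent` is assumed to be produced upstream by a trained AE encoder, and the 1% alert quantile is illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Z_latent (n_players x latent_dim): AE encoder output, assumed given.
iforest = IsolationForest(n_estimators=200, contamination=0.01,
                          random_state=42).fit(Z_latent)
scores = -iforest.score_samples(Z_latent)      # higher = more anomalous
alerts = scores > np.quantile(scores, 0.99)    # top 1% to the review queue
```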
6.3 Recommender systems
SVD/ALS embeddings (player↔game/provider) + PCA/UMAP for denoising and similarity scoring.
6.4 Texts/reviews
Sentence embeddings → UMAP: visualization of themes and bursts of negativity (see Sentiment analysis).
NMF on TF-IDF: interpretable complaint "themes" (withdrawals, KYC, lags); a sketch follows.
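A sketch of the TF-IDF → NMF theme extraction with scikit-learn; `docs` (a list of review texts) and the component count are assumed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_tfidf = tfidf.fit_transform(docs)             # docs: review texts (assumed)
nmf = NMF(n_components=8, init="nndsvda", random_state=42)
W = nmf.fit_transform(X_tfidf)                  # doc x theme weights

terms = tfidf.get_feature_names_out()
for i, comp in enumerate(nmf.components_):      # top words name each theme
    top = terms[comp.argsort()[-8:][::-1]]
    print(f"theme_{i}:", ", ".join(top))
```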
7) Online updates, incrementality and drift
IncrementalPCA/streaming AE: update components without full retraining (see the sketch after this list).
Warm-start UMAP: update on new batches (beware of distorting the global structure).
Drift: monitor PSI/KS per factor and kNN-topology drift; thresholds → canary/rollback.
Versioning: `projection@MAJOR.MINOR.PATCH`; a MAJOR bump means incomparable spaces, keep dual-serve.
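A minimal sketch of the incremental update with scikit-learn's IncrementalPCA; `batch_iter()` and `new_batch` are assumed to yield scaled numpy arrays:

```python
from sklearn.decomposition import IncrementalPCA

# Each batch must contain at least n_components rows for partial_fit.
ipca = IncrementalPCA(n_components=16)
for batch in batch_iter():          # batch_iter(): scaled batches (assumed)
    ipca.partial_fit(batch)         # online update of the components
Z_new = ipca.transform(new_batch)   # project fresh data without refitting
```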
8) Privacy and compliance
Zero-PII input; reduced factors are stored separately from the source data.
k-anonymity of data marts (minimum N objects per slice), as sketched below.
Differential privacy (optional) in PCA/AE: noise in gradients/coordinates.
DSAR: ability to remove a subject's contribution (delete rows, recompute factors on the next batch).
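A sketch of the k-anonymity gate for published slices; the function name and `n_min` threshold are illustrative:

```python
import pandas as pd

def k_anonymous_slices(df: pd.DataFrame, dims: list, n_min: int = 25):
    """Publish only aggregate slices backed by at least n_min subjects."""
    g = df.groupby(dims)
    agg = g.mean(numeric_only=True).join(g.size().rename("n"))
    return agg[agg["n"] >= n_min]   # small slices are suppressed entirely
```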
9) Interpretation of factors
Loadings (PCA/FA): top features → human-readable names ("betting intensity," "night activity," "bonus sensitivity"); see the sketch after this list.
NMF parts: sets of features with positive weights → "payment/game motifs."
AE: linear approximation around a point (Jacobian) + surrogate model for local explainability.
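A sketch of extracting naming material from loadings; assumes a fitted scikit-learn PCA and the original feature names:

```python
import numpy as np

def top_loadings(pca, feature_names, k: int = 5) -> None:
    """Print the top-|loading| features per component as naming material."""
    for i, comp in enumerate(pca.components_):
        idx = np.argsort(np.abs(comp))[-k:][::-1]   # largest |loading| first
        print(f"pc{i + 1}:", [(feature_names[j], round(float(comp[j]), 2))
                              for j in idx])
```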
10) Integrations
Clustering: UMAP/PCA space → HDBSCAN/k-means.
Anomalies: AE reconstruction error/latent distance → alerts.
Recommendations: compact embeddings for similarity and ANN search.
Analytics API: expose aggregates and factors instead of raw sensitive features.
11) Templates (ready to use)
11.1 PCA config
```yaml
projection:
  method: "pca"
  n_components: "auto_0.95"   # cumulative explained variance ≥ 95%
  scaler: "robust"
  random_state: 42
serve:
  format: "onnx"
  p95_latency_ms: 5
monitoring:
  drift_psi_max: 0.2
privacy:
  pii_in: false
```
11.2 UMAP→HDBSCAN config
```yaml
umap:
  n_neighbors: 30
  min_dist: 0.05
  metric: "cosine"
  random_state: 42
cluster:
  method: "hdbscan"
  min_cluster_size: 120
  min_samples: 15
evaluate:
  metrics: ["silhouette", "trustworthiness", "knn_preservation"]
```
11.3 AE config (serving)
```yaml
autoencoder:
  encoder: [256, 128, 64]
  latent_dim: 16
  activation: "gelu"
  dropout: 0.1
  optimizer: "adamw"
  loss: "mse"
  early_stop_patience: 10
serve:
  route: "light|heavy"        # route by latent complexity
  cache_embeddings: true
```
11.4 Projection datasheet (BI)
```yaml
version: "proj_pca_1.3.0"
explained_variance_cum: 0.932
top_components:
  - id: pc1
    name: "betting intensity"
    top_features: ["bets_per_min", "volatility", "session_len"]
  - id: pc2
    name: "night activity"
    top_features: ["evening_share", "dow_weekend", "live_share"]
usage:
  downstream: ["clusters_v4", "fraud_iforest_v2", "reco_ann_v3"]
```
12) Implementation Roadmap
0–30 days (MVP)
1. Feature hygiene (scaling, missing values, correlations), Zero-PII.
2. PCA with a 95% variance threshold; 2D UMAP visualization for segment analysis.
3. Metrics: explained variance, trustworthiness, downstream uplift.
4. Register the transform in the registry; factor-drift dashboard.
30–90 days
1. AE for payments/behavior; NMF for review topics.
2. Incremental updates (IncrementalPCA/AE); canary on version changes.
3. Integration with clustering/anti-fraud/recommendations; alerts on kNN-topology drift.
3–6 months
1. Geo-/tenant-specific projections; budget-aware serving (INT8/FP16).
2. Factor-interpretation reports for product teams.
3. DP variants for regulation-sensitive markets.
13) Anti-patterns
Using t-SNE for production serving (unstable and incomparable across runs).
Mixing PII with factors; logging source features without masking.
Ignoring scaling/missing values → spurious components.
Choosing k by eye, without a variance/metric curve and downstream validation.
Rebuilding the projection without versioning and dual-serve → broken models downstream.
Interpreting a UMAP picture as "ground truth" without stability checks.
14) RACI
Data Platform (R): pipelines, registry, drift monitoring.
Data Science (R): selection/tuning of methods, interpretation of factors.
Product/CRM (A): use of factors in segmentation/offers.
Risk/RG (C): rules for using factors, protection against "aggressive" targeting.
Security/DPO (A/R): privacy, k-anonymity, DSAR.
15) Related Sections
Data Clustering, Recommender Systems, Anomaly and Correlation Analysis, Feedback Sentiment Analysis, NLP and Text Processing, DataOps Practices, MLOps: Model Operations, Data Ethics and Transparency.
Summary
Dimensionality reduction is a production ML tool, not just "pretty point clouds": strict feature hygiene, structure-preservation metrics, stable and versioned transforms. In iGaming, such projections speed up training and serving, improve segmentation and anomaly detection, save budget and help preserve privacy.