Feature Engineering and Feature Selection
1) Purpose and principles
Objective: construct stable, interpretable, and economical features that are consistent between offline and online serving.
Principles:
- Point-in-time: features are computed only from data available at decision time, with no future information (anti-leakage).
- Domain-first: features reflect business mechanics (deposits, sessions, game genres, RG/AML).
- Reuse & contracts: the Feature Store tracks versions, owners, formulas, and SLOs.
- Cost-aware: account for latency and compute/storage cost → materialize only features that pay for themselves.
- Observability: monitor drift/stability/calibration; run online/offline equivalence tests.
2) Feature taxonomy for iGaming
RFM/behavioral: recency/frequency/monetary by windows (10m/1h/1d/7d/30d).
Session: durations, pauses, device/ASN changes, action velocity.
Financial: deposits/withdrawals/chargebacks, shares of payment methods, FX normalization.
Gaming: genre profiles, provider volatility, RTP clusters, win-streak.
Marketing: channels/UTM, campaign responses, saturation/cooldown.
RG/AML: limits, self-exclusion flags, velocity patterns, BIN/IP reuse.
Geo/time: local calendars/holidays, time zone, evening/night activity.
Graph: user-card-device-ip links, centrality/components, fraud rings.
NLP/texts: themes and tone of tickets/chats; key complaints.
Operational: lag/provider errors, session stability (for SRE models).
3) Windows and aggregates (point-in-time)
Typical windows: 10m/1h/24h/7d/30d. For each window - count/sum/mean/std/last/max/min, ratio and rate.
SQL template (30 days of deposits, no future data):

```sql
SELECT u.user_pseudo_id, t.asof,
       SUM(CASE WHEN e.type = 'deposit'
                 AND e.event_time >= t.asof - INTERVAL '30' DAY
                 AND e.event_time <  t.asof THEN e.amount_base ELSE 0 END) AS dep_30d,
       COUNT(CASE WHEN e.type = 'bet'
                   AND e.event_time >= t.asof - INTERVAL '7' DAY
                   AND e.event_time <  t.asof THEN 1 END) AS bets_7d
FROM silver.fact_events e
JOIN (SELECT user_pseudo_id, DATE(event_time) AS asof
      FROM silver.fact_events GROUP BY 1, 2) t USING (user_pseudo_id)
JOIN dim.users_scd u ON u.user_pseudo_id = t.user_pseudo_id
  AND t.asof >= u.valid_from AND (u.valid_to IS NULL OR t.asof < u.valid_to)
GROUP BY 1, 2;
```
4) Categorical encodings
One-Hot/Hashing: for rare/high-cardinality categories (games, providers).
Target Encoding (TE): target means with k-fold/leave-one-out and time-aware anti-leakage.
WOE/IV (risk scoring): monotonic bins with IV and stability control.
A time-aware TE sketch (helpers such as split_by_time are pseudocode):

```python
# Time-aware target encoding: fit the mapping only on past folds
for fold in time_folds:
    train_idx, val_idx = split_by_time(fold)          # train strictly before val
    train, val = data.loc[train_idx], data.loc[val_idx]
    global_mean = train["label"].mean()               # fallback for unseen ids
    te_map = train.groupby("provider_id")["label"].mean()
    val["provider_te"] = val["provider_id"].map(te_map).fillna(global_mean)
```
5) Normalization and scaling
Min-max/Robust/Z-score - fit on the training window; persist the parameters as artifacts.
Log transforms for long-tailed sums/stakes.
Box-Cox/Yeo-Johnson - when symmetrization is needed.
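A minimal sketch of "fit on the training window, persist, reuse": a hand-rolled robust scaler (median/IQR) whose parameters are serialized as a JSON artifact, so the online path applies exactly the same transform. The function names and the log-then-scale order are illustrative assumptions.

```python
import json
import numpy as np

def fit_robust_scaler(x: np.ndarray) -> dict:
    """Fit robust-scaling parameters (median/IQR) on the training window only."""
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    iqr = (q75 - q25) or 1.0  # guard against zero spread
    return {"median": float(q50), "iqr": float(iqr)}

def transform(x: np.ndarray, params: dict) -> np.ndarray:
    """Apply the persisted parameters; never refit outside the training window."""
    return (x - params["median"]) / params["iqr"]

# Fit on the training window, persist params as a model artifact,
# then reuse the same params online and on validation data.
train = np.array([10.0, 12.0, 11.0, 200.0])   # long-tailed deposit amounts
params = fit_robust_scaler(np.log1p(train))   # log first for heavy tails
artifact = json.dumps(params)                 # store next to the model
scaled = transform(np.log1p(train), json.loads(artifact))
```

Persisting the parameters (rather than the fitted object) keeps the offline and online transforms trivially identical, which is the point of the "save parameters in artifacts" rule.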
6) Temporal and seasonal features
Calendar: day of the week, hour, market holidays (reference calendar), pay-days.
Rolling: moving averages / exponential smoothing (EMA), deltas (t vs t−1).
Event-based: time since the last deposit/win/loss, "cooling-off" periods.
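The calendar, EMA, delta, and time-since-last-event features above can be sketched in pandas; the column names and the EMA span are assumptions for illustration:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "event_time": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 12:00", "2024-01-03 09:00", "2024-01-02 08:00"]),
    "deposit": [50.0, 30.0, 20.0, 100.0],
}).sort_values(["user_id", "event_time"])

# Calendar features
events["dow"] = events["event_time"].dt.dayofweek
events["hour"] = events["event_time"].dt.hour

# EMA of deposits per user (span=3 is an arbitrary choice here)
events["deposit_ema"] = (events.groupby("user_id")["deposit"]
                               .transform(lambda s: s.ewm(span=3).mean()))

# Delta t vs t-1 and "hours since last deposit"
events["deposit_delta"] = events.groupby("user_id")["deposit"].diff()
events["hrs_since_last"] = (events.groupby("user_id")["event_time"].diff()
                                  .dt.total_seconds() / 3600)
```

Note that every feature here only looks backwards (diff/ewm over past rows), which keeps the point-in-time guarantee.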
7) Graph features (fraud/AML)
Vertices: user/card/device/IP. Edges: transactions/sessions/shared attributes.
Features: component size, degree, betweenness, PageRank, triads, identifier re-occurrence.
Pattern: nightly batch builds a graph → embedding/centrality → online cache.
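As a minimal, dependency-free sketch of the batch step: a union-find pass over user-card-device-IP edges that yields component size (ring candidates) and degree per node. The edge list is invented for illustration; centrality measures like PageRank would come from a proper graph library.

```python
from collections import defaultdict

# Edges of the user–card–device–ip graph (shared identifiers); illustrative data
edges = [
    ("u1", "card_A"), ("u2", "card_A"),   # two users share one card
    ("u2", "dev_X"), ("u3", "dev_X"),     # device reuse links u2 and u3
    ("u4", "card_B"),                     # isolated pair
]

# Union-find over nodes to get connected components
parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

degree = defaultdict(int)
for a, b in edges:
    union(a, b)
    degree[a] += 1
    degree[b] += 1

# Component size per node: large components flag candidate fraud rings
members = defaultdict(int)
for node in parent:
    members[find(node)] += 1
comp_size = {node: members[find(node)] for node in parent}
```

In the pattern described above, this computation runs in the nightly batch and the resulting per-user values are pushed to the online cache.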
8) NLP features (support/chats/reviews)
Basic: TF-IDF/NMF topics, sentiment, text length, complaint frequency.
Advanced: embeddings (Sentence-BERT) → averaging on tickets per window.
PII: pre- and post-masking (email, PAN, phones) by policy.
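A minimal sketch of the pre-masking step: regex-based substitution of emails, card numbers (PAN), and phone numbers before text enters any feature pipeline. The patterns are deliberately rough illustrations, not a complete PII policy.

```python
import re

# Hypothetical masking pass applied before texts reach feature pipelines.
# Order matters: emails first, then PANs, then remaining phone-like digit runs.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PAN":   re.compile(r"\b(?:\d[ -]?){13,19}\b"),   # card numbers, rough match
    "PHONE": re.compile(r"\+?\d[\d ()-]{8,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace each PII match with a typed placeholder like <EMAIL>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

masked = mask_pii("Contact john.doe@mail.com, card 4111 1111 1111 1111")
```

Post-masking (re-checking model inputs and logs) follows the same idea; production systems typically combine regexes with checksum validation (e.g. Luhn for PANs) to cut false positives.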
9) Geo/ASN and devices
IP→Geo/ASN: cache and refresh periodically; no synchronous online lookups without a timeout/cache.
Features: ASN/DeviceID stability, change frequency, distance between logins.
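ASN stability and change frequency can be sketched as two aggregates over a login history; the column names and sample data are assumptions:

```python
import pandas as pd

# Illustrative login history: user 1 hops between ASNs, user 2 is stable
logins = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "asn":     ["AS100", "AS100", "AS200", "AS100", "AS300", "AS300"],
})

g = logins.groupby("user_id")["asn"]
stability = pd.DataFrame({
    # share of logins from the most frequent ASN (1.0 = perfectly stable)
    "asn_top_share": g.agg(lambda s: s.value_counts(normalize=True).iloc[0]),
    # number of ASN switches between consecutive logins
    "asn_switches": g.agg(lambda s: (s != s.shift()).sum() - 1),
})
```

The same shape works for DeviceID stability; distance between logins additionally needs the cached IP→Geo coordinates.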
10) Anti-Leakage and online/offline reconciliation
Point-in-time join, no future events in windows/labels.
One transformation codebase (shared library) for offline and online.
Equivalence test: on a sample of T records, compare online feature values with the offline recomputation (MAE/MAPE).
Example feature contract:

```yaml
name: deposits_sum_10m
owner: ml-risk
slo: {latency_ms_p95: 20, availability: 0.999}
offline:
  source: silver.payments
  transform: "SUM(amount_base) OVER 10m BY user_pseudo_id"
online:
  compute: "streaming_window: 10m"
tests:
  - compare_online_offline_max_abs_diff: 0.5
```
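The equivalence test itself can be a small comparison of logged online values against the offline recomputation; the report shape and threshold semantics are assumptions mirroring the contract above:

```python
import numpy as np

def equivalence_report(offline: np.ndarray, online: np.ndarray,
                       max_abs_diff: float = 0.5) -> dict:
    """Compare logged online feature values with the offline recomputation."""
    diff = np.abs(online - offline)
    mae = float(diff.mean())
    # MAPE only where the offline value is non-zero, to avoid division by zero
    nz = offline != 0
    mape = float((diff[nz] / np.abs(offline[nz])).mean()) if nz.any() else 0.0
    return {"mae": mae, "mape": mape,
            "passed": bool(diff.max() <= max_abs_diff)}

offline = np.array([100.0, 50.0, 0.0, 10.0])   # batch recomputation
online = np.array([100.1, 49.8, 0.0, 10.3])    # values logged at serving time
report = equivalence_report(offline, online)
```

Running this per feature in CI (against a fresh sample) catches drift between the streaming and batch implementations before it reaches the model.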
11) Feature selection
11.1 Filter
Constants/near-zero variance, correlation pruning (|ρ| > 0.95), mutual information, IV.
11.2 Wrapper
RFE/Sequential FS: for small feature sets / logistic regression.
Stability Selection: stability under bootstrap sampling.
11.3 Embedded
L1/Lasso/ElasticNet: sparsification.
Trees/GBDT: importance/SHAP for selection and business interpretation.
Group Lasso: group selection (sets of bin-features of one variable).
A typical filter → embedded → stability pipeline (helper functions are pseudocode):

```python
X = preprocess(raw)                        # one-hot/TE/scaling
X = drop_const_and_corr(X, thr=0.95)       # drop constants and |corr| > 0.95
rank_mi = mutual_info_rank(X, y)           # filter: mutual information
keep1 = topk(rank_mi, k=200)

model = LGBMClassifier(...)                # embedded: GBDT importance
model.fit(X[keep1], y)
shap_vals = shap.TreeExplainer(model).shap_values(X[keep1])
final = stable_topk_by_shap(shap_vals, k=60, bootstrap=20)  # stability step
```
12) Stability, drift and calibration
Drift: PSI/KS on features and scores; alerts when thresholds are exceeded.
Stability: watch for fragile TE/WOE encodings (high cardinality, shifts).
Calibration: Platt/Isotonic; reliability reports.
Slice analysis: markets/providers/devices - metrics and expected cost of errors.
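The PSI drift check mentioned above can be sketched as a self-contained function; the bin count and alert thresholds (0.1 warn, 0.25 alert are common conventions) are assumptions:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # widen the outer edges so out-of-range current values are still counted
    edges[0] = min(expected.min(), actual.min()) - 1e-9
    edges[-1] = max(expected.max(), actual.max()) + 1e-9
    e_cnt, _ = np.histogram(expected, edges)
    a_cnt, _ = np.histogram(actual, edges)
    e_pct = np.clip(e_cnt / len(expected), 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_cnt / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
base = rng.normal(0, 1, 10_000)            # baseline feature distribution
psi_same = psi(base, rng.normal(0, 1, 10_000))   # stable: small PSI
psi_shift = psi(base, rng.normal(1, 1, 10_000))  # 1σ shift: large PSI
```

The same function applies to model scores; the KS statistic is a complementary check on the cumulative distributions.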
13) Cost engineering and performance
Cost per Feature (CPF): CPU/IO/network/storage → model budget.
Materialization: heavy offline, light online; TTL/cache for hot features.
Remote lookups: async + cache only; p95 < 20-30 ms per online feature.
Chargeback: attribute feature/inference cost to each team.
14) Feature Store (consistency kernel)
Registry: name, formula, owner, SLO, tests, versions.
Online/Offline synchronization: one transformation code, equality test.
Logs/audits: who changed the formula; effect of version on model metrics.
15) Examples
ClickHouse: per-minute betting aggregates:

```sql
CREATE MATERIALIZED VIEW mv_bets_1m
ENGINE = SummingMergeTree()
PARTITION BY toDate(event_time)
ORDER BY (toStartOfMinute(event_time), user_pseudo_id)
AS
SELECT toStartOfMinute(event_time) AS ts_min,
       user_pseudo_id,
       sum(stake_base) AS stake_sum_1m,
       count() AS bets_1m
FROM stream.game_events
GROUP BY ts_min, user_pseudo_id;
```
Anti-correlation drop (SQL idea):

```sql
-- compute pairwise correlations and drop pairs with ρ > 0.95, keeping the "cheaper" feature
```
WOE binning (sketch, pseudocode helpers):

```python
bins = monotonic_binning(x, y, max_bins=10)
woe = compute_woe(bins)
iv = compute_iv(bins)
```
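To make the sketch concrete, here is a self-contained WOE/IV computation; it swaps the monotonic binning for simple quantile bins and uses 0.5-count smoothing, both assumptions for illustration:

```python
import numpy as np

def woe_iv(x: np.ndarray, y: np.ndarray, n_bins: int = 5):
    """WOE and IV over quantile bins (monotonic binning simplified away)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(x, edges[1:-1])          # bin index 0..n_bins-1
    woe, iv = np.zeros(n_bins), 0.0
    n_good, n_bad = (y == 0).sum(), (y == 1).sum()
    for b in range(n_bins):
        good = max(((idx == b) & (y == 0)).sum(), 0.5)  # 0.5 smoothing
        bad = max(((idx == b) & (y == 1)).sum(), 0.5)
        woe[b] = np.log((good / n_good) / (bad / n_bad))
        iv += ((good / n_good) - (bad / n_bad)) * woe[b]
    return woe, float(iv)

# Synthetic risk feature: higher x → higher probability of the bad label
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 5000)
y = (rng.random(5000) < 1 / (1 + np.exp(-2 * x))).astype(int)
woe, iv = woe_iv(x, y)
```

WOE decreasing monotonically across bins (and a healthy IV) is exactly the stability property the monotonic-binning step is meant to enforce.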
16) Processes and RACI
R (Responsible): Data Eng (pipelines/Feature Store), Data Science (design feature/selection/metrics).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/DPO (PII, residency), Risk/AML/RG (policy), SRE (SLO/cost), Security.
I (Informed): Product/Marketing/Operations/Support.
17) Roadmap
MVP (3-5 weeks):
1. Catalog of the top 50 features (Payments/Gameplay) with point-in-time formulas.
2. Feature Store v1 (online/offline) + equivalence tests.
3. Basic selection: constants/correlations → MI → L1/SHAP shortlist (up to 60 features).
4. Feature drift monitoring and cost dashboards.
Phase 2 (5-10 weeks):
- TE/WOE with time-aware validation, graph and calendar features.
- Slice analysis and fairness, probability calibration.
- Materialization of heavy offline features, online cache, quotas.
- Auto-generation of documentation, stability-selection in CI.
- Auto-deactivation of "expensive and useless" features (CPF↑, contribution↓).
- A/B comparison of feature sets, expected-cost reports.
18) Pre-production checklist
- All features have specifications (owner, formula, versions, SLO).
- Passed point-in-time and online/offline equivalence tests.
- Filter → embedded (SHAP/L1) → stability selection completed.
- Drift and reliability monitoring configured; thresholds and alerts are set.
- CPF/latency fit into the budget; heavy features materialized.
- PII policies met (CLS/RLS, tokenization, residency).
- Documentation and use cases have been added to the catalog.
19) Anti-patterns and risks
Leakage (future events, post-promo effects in labels).
Inconsistent online/offline formulas.
Exploding one-hot from high-cardinality categories without hashing/TE.
"Expensive" features without a measurable increase in quality.
Lack of slice/fairness analysis - hidden degradation.
TE/WOE without time-aware cross-validation → overfitting.
20) The bottom line
Feature Engineering is a managed discipline: point-in-time, business sense, reproducibility, monitoring and economics. Strong features + strict selection (filter/wrapper/embedded) and a single Feature Store give stable, interpretable and cheap models that improve Net Revenue, reduce fraud and support RG - transparently and compliantly.