Feature Engineering and Feature Selection
1) Purpose and principles
Objective: construct stable, interpretable, and economical features that are consistent between offline and online serving.
Principles:
- Point-in-time: features are computed only from data available at decision time, with no future information (anti-leakage).
- Domain-first: features reflect business mechanics (deposits, sessions, game genres, RG/AML).
- Reuse & contracts: the Feature Store tracks versions, owners, formulas, and SLOs.
- Cost-aware: account for latency and compute/storage cost → materialize only features that pay for themselves.
- Observability: monitor drift/stability/calibration; run online/offline equivalence tests.
2) Feature taxonomy for iGaming
RFM/behavioral: recency/frequency/monetary by windows (10m/1h/1d/7d/30d).
Session: durations, pauses, device/ASN changes, action velocity.
Financial: deposits/withdrawals/chargebacks, shares of payment methods, FX normalization.
Gaming: genre profiles, provider volatility, RTP clusters, win-streak.
Marketing: channels/UTM, campaign responses, saturation/cooldown.
RG/AML: limits, self-exclusion flags, velocity patterns, BIN/IP reuse.
Geo/time: local calendars/holidays, time zone, evening/night activity.
Graph: user-card-device-ip links, centrality/components, fraud rings.
NLP/texts: themes and tone of tickets/chats; key complaints.
Operational: lag/provider errors, session stability (for SRE models).
3) Windows and aggregates (point-in-time)
Typical windows: 10m/1h/24h/7d/30d. For each window - count/sum/mean/std/last/max/min, ratio and rate.
SQL template (30 days of deposits, no future data):

```sql
SELECT u.user_pseudo_id, t.asof,
       SUM(CASE WHEN e.type = 'deposit'
                 AND e.event_time >= t.asof - INTERVAL '30' DAY
                 AND e.event_time <  t.asof THEN e.amount_base ELSE 0 END) AS dep_30d,
       COUNT(CASE WHEN e.type = 'bet'
                   AND e.event_time >= t.asof - INTERVAL '7' DAY
                   AND e.event_time <  t.asof THEN 1 END) AS bets_7d
FROM silver.fact_events e
JOIN (SELECT user_pseudo_id, DATE(event_time) AS asof
      FROM silver.fact_events GROUP BY 1, 2) t USING (user_pseudo_id)
JOIN dim.users_scd u ON u.user_pseudo_id = t.user_pseudo_id
  AND t.asof >= u.valid_from AND (u.valid_to IS NULL OR t.asof < u.valid_to)
GROUP BY 1, 2;
```
4) Categorical encodings
One-Hot/Hashing: for rare/high-cardinality categories (games, providers).
Target Encoding (TE): target means with k-fold/leave-one-out and time-aware anti-leakage.
WOE/IV (risk scoring): monotonic bins with IV and stability control.
A time-aware TE sketch (helpers such as split_by_time are pseudocode):

```python
# Time-aware target encoding: fit the mapping only on past folds
for fold in time_folds:
    train_idx, val_idx = split_by_time(fold)          # train strictly before val
    train, val = data.loc[train_idx], data.loc[val_idx]
    global_mean = train["label"].mean()               # fallback for unseen ids
    te_map = train.groupby("provider_id")["label"].mean()
    val["provider_te"] = val["provider_id"].map(te_map).fillna(global_mean)
```
5) Normalization and scaling
Min-max/Robust/Z-score - fit on the training window; persist the parameters as artifacts.
Log transforms for long-tailed sums/stakes.
Box-Cox/Yeo-Johnson - when symmetrization is needed.
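A minimal sketch of "fit on the training window, persist, reuse": a hand-rolled robust scaler (median/IQR) whose parameters are serialized as a JSON artifact, so the online path applies exactly the same transform. The function names and the log-then-scale order are illustrative assumptions.

```python
import json
import numpy as np

def fit_robust_scaler(x: np.ndarray) -> dict:
    """Fit robust-scaling parameters (median/IQR) on the training window only."""
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    iqr = (q75 - q25) or 1.0  # guard against zero spread
    return {"median": float(q50), "iqr": float(iqr)}

def transform(x: np.ndarray, params: dict) -> np.ndarray:
    """Apply the persisted parameters; never refit outside the training window."""
    return (x - params["median"]) / params["iqr"]

# Fit on the training window, persist params as a model artifact,
# then reuse the same params online and on validation data.
train = np.array([10.0, 12.0, 11.0, 200.0])   # long-tailed deposit amounts
params = fit_robust_scaler(np.log1p(train))   # log first for heavy tails
artifact = json.dumps(params)                 # store next to the model
scaled = transform(np.log1p(train), json.loads(artifact))
```

Persisting the parameters (rather than the fitted object) keeps the offline and online transforms trivially identical, which is the point of the "save parameters in artifacts" rule.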
6) Temporal and seasonal features
Calendar: day of the week, hour, market holidays (reference calendar), pay-days.
Rolling: moving averages / exponential smoothing (EMA), deltas (t vs t−1).
Event-based: time since the last deposit/win/loss, "cooling-off" periods.
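The calendar, EMA, delta, and time-since-last-event features above can be sketched in pandas; the column names and the EMA span are assumptions for illustration:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "event_time": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 12:00", "2024-01-03 09:00", "2024-01-02 08:00"]),
    "deposit": [50.0, 30.0, 20.0, 100.0],
}).sort_values(["user_id", "event_time"])

# Calendar features
events["dow"] = events["event_time"].dt.dayofweek
events["hour"] = events["event_time"].dt.hour

# EMA of deposits per user (span=3 is an arbitrary choice here)
events["deposit_ema"] = (events.groupby("user_id")["deposit"]
                               .transform(lambda s: s.ewm(span=3).mean()))

# Delta t vs t-1 and "hours since last deposit"
events["deposit_delta"] = events.groupby("user_id")["deposit"].diff()
events["hrs_since_last"] = (events.groupby("user_id")["event_time"].diff()
                                  .dt.total_seconds() / 3600)
```

Note that every feature here only looks backwards (diff/ewm over past rows), which keeps the point-in-time guarantee.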
7) Graph features (fraud/AML)
Vertices: user/card/device/IP. Edges: transactions/sessions/shared attributes.
Features: component size, degree, betweenness, PageRank, triads, identifier re-occurrence.
Pattern: nightly batch builds a graph → embedding/centrality → online cache.
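As a minimal, dependency-free sketch of the batch step: a union-find pass over user-card-device-IP edges that yields component size (ring candidates) and degree per node. The edge list is invented for illustration; centrality measures like PageRank would come from a proper graph library.

```python
from collections import defaultdict

# Edges of the user–card–device–ip graph (shared identifiers); illustrative data
edges = [
    ("u1", "card_A"), ("u2", "card_A"),   # two users share one card
    ("u2", "dev_X"), ("u3", "dev_X"),     # device reuse links u2 and u3
    ("u4", "card_B"),                     # isolated pair
]

# Union-find over nodes to get connected components
parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

degree = defaultdict(int)
for a, b in edges:
    union(a, b)
    degree[a] += 1
    degree[b] += 1

# Component size per node: large components flag candidate fraud rings
members = defaultdict(int)
for node in parent:
    members[find(node)] += 1
comp_size = {node: members[find(node)] for node in parent}
```

In the pattern described above, this computation runs in the nightly batch and the resulting per-user values are pushed to the online cache.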
8) NLP features (support/chats/reviews)
Basic: TF-IDF/NMF topics, sentiment, text length, complaint frequency.
Advanced: embeddings (Sentence-BERT) → averaging on tickets per window.
PII: pre- and post-masking (email, PAN, phones) by policy.
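A minimal sketch of the pre-masking step: regex-based substitution of emails, card numbers (PAN), and phone numbers before text enters any feature pipeline. The patterns are deliberately rough illustrations, not a complete PII policy.

```python
import re

# Hypothetical masking pass applied before texts reach feature pipelines.
# Order matters: emails first, then PANs, then remaining phone-like digit runs.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PAN":   re.compile(r"\b(?:\d[ -]?){13,19}\b"),   # card numbers, rough match
    "PHONE": re.compile(r"\+?\d[\d ()-]{8,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace each PII match with a typed placeholder like <EMAIL>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

masked = mask_pii("Contact john.doe@mail.com, card 4111 1111 1111 1111")
```

Post-masking (re-checking model inputs and logs) follows the same idea; production systems typically combine regexes with checksum validation (e.g. Luhn for PANs) to cut false positives.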
9) Geo/ASN and devices
IP→Geo/ASN: cache and refresh periodically; no synchronous online lookups without a timeout/cache.
Features: ASN/DeviceID stability, change frequency, distance between logins.
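ASN stability and change frequency can be sketched as two aggregates over a login history; the column names and sample data are assumptions:

```python
import pandas as pd

# Illustrative login history: user 1 hops between ASNs, user 2 is stable
logins = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "asn":     ["AS100", "AS100", "AS200", "AS100", "AS300", "AS300"],
})

g = logins.groupby("user_id")["asn"]
stability = pd.DataFrame({
    # share of logins from the most frequent ASN (1.0 = perfectly stable)
    "asn_top_share": g.agg(lambda s: s.value_counts(normalize=True).iloc[0]),
    # number of ASN switches between consecutive logins
    "asn_switches": g.agg(lambda s: (s != s.shift()).sum() - 1),
})
```

The same shape works for DeviceID stability; distance between logins additionally needs the cached IP→Geo coordinates.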
10) Anti-Leakage and online/offline reconciliation
Point-in-time join, no future events in windows/labels.
One transformation codebase (shared library) for offline and online.
Equivalence test: on a sample of T records, compare online feature values with the offline recomputation (MAE/MAPE).
Example feature contract:

```yaml
name: deposits_sum_10m
owner: ml-risk
slo: {latency_ms_p95: 20, availability: 0.999}
offline:
  source: silver.payments
  transform: "SUM(amount_base) OVER 10m BY user_pseudo_id"
online:
  compute: "streaming_window: 10m"
tests:
  - compare_online_offline_max_abs_diff: 0.5
```
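The equivalence test itself can be a small comparison of logged online values against the offline recomputation; the report shape and threshold semantics are assumptions mirroring the contract above:

```python
import numpy as np

def equivalence_report(offline: np.ndarray, online: np.ndarray,
                       max_abs_diff: float = 0.5) -> dict:
    """Compare logged online feature values with the offline recomputation."""
    diff = np.abs(online - offline)
    mae = float(diff.mean())
    # MAPE only where the offline value is non-zero, to avoid division by zero
    nz = offline != 0
    mape = float((diff[nz] / np.abs(offline[nz])).mean()) if nz.any() else 0.0
    return {"mae": mae, "mape": mape,
            "passed": bool(diff.max() <= max_abs_diff)}

offline = np.array([100.0, 50.0, 0.0, 10.0])   # batch recomputation
online = np.array([100.1, 49.8, 0.0, 10.3])    # values logged at serving time
report = equivalence_report(offline, online)
```

Running this per feature in CI (against a fresh sample) catches drift between the streaming and batch implementations before it reaches the model.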
11) Feature selection
11.1 Filter
Constants/near-zero variance, correlation pruning (|ρ| > 0.95), mutual information, IV.
11.2 Wrapper
RFE/Sequential FS: for small feature sets / logistic regression.
Stability Selection: stability under bootstrap sampling.
11.3 Embedded
L1/Lasso/ElasticNet: sparsification.
Trees/GBDT: importance/SHAP for selection and business interpretation.
Group Lasso: group selection (sets of bin-features of one variable).
A typical filter → embedded → stability pipeline (helper functions are pseudocode):

```python
X = preprocess(raw)                        # one-hot/TE/scaling
X = drop_const_and_corr(X, thr=0.95)       # drop constants and |corr| > 0.95
rank_mi = mutual_info_rank(X, y)           # filter: mutual information
keep1 = topk(rank_mi, k=200)

model = LGBMClassifier(...)                # embedded: GBDT importance
model.fit(X[keep1], y)
shap_vals = shap.TreeExplainer(model).shap_values(X[keep1])
final = stable_topk_by_shap(shap_vals, k=60, bootstrap=20)  # stability step
```
12) Stability, drift and calibration
Drift: PSI/KS on features and scores; alerts when thresholds are exceeded.
Stability: watch for fragile TE/WOE encodings (high cardinality, shifts).
Calibration: Platt/Isotonic; reliability reports.
Slice analysis: markets/providers/devices - metrics and expected cost of errors.
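The PSI drift check mentioned above can be sketched as a self-contained function; the bin count and alert thresholds (0.1 warn, 0.25 alert are common conventions) are assumptions:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # widen the outer edges so out-of-range current values are still counted
    edges[0] = min(expected.min(), actual.min()) - 1e-9
    edges[-1] = max(expected.max(), actual.max()) + 1e-9
    e_cnt, _ = np.histogram(expected, edges)
    a_cnt, _ = np.histogram(actual, edges)
    e_pct = np.clip(e_cnt / len(expected), 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_cnt / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
base = rng.normal(0, 1, 10_000)            # baseline feature distribution
psi_same = psi(base, rng.normal(0, 1, 10_000))   # stable: small PSI
psi_shift = psi(base, rng.normal(1, 1, 10_000))  # 1σ shift: large PSI
```

The same function applies to model scores; the KS statistic is a complementary check on the cumulative distributions.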
13) Cost engineering and performance
Cost per Feature (CPF): CPU/IO/network/storage → model budget.
Materialization: heavy offline, light online; TTL/cache for hot features.
Remote lookups: async + cache only; p95 < 20-30 ms per online feature.
Chargeback: attribute feature/inference cost to each team.
14) Feature Store (consistency kernel)
Registry: name, formula, owner, SLO, tests, versions.
Online/Offline synchronization: one transformation code, equality test.
Logs/audits: who changed the formula; effect of version on model metrics.
15) Examples
ClickHouse: per-minute betting aggregates:

```sql
CREATE MATERIALIZED VIEW mv_bets_1m
ENGINE = SummingMergeTree()
PARTITION BY toDate(event_time)
ORDER BY (toStartOfMinute(event_time), user_pseudo_id)
AS
SELECT toStartOfMinute(event_time) AS ts_min,
       user_pseudo_id,
       sum(stake_base) AS stake_sum_1m,
       count() AS bets_1m
FROM stream.game_events
GROUP BY ts_min, user_pseudo_id;
```
Anti-correlation drop (SQL idea):

```sql
-- compute pairwise correlations and drop pairs with ρ > 0.95, keeping the "cheaper" feature
```
WOE binning (sketch, pseudocode helpers):

```python
bins = monotonic_binning(x, y, max_bins=10)
woe = compute_woe(bins)
iv = compute_iv(bins)
```
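To make the sketch concrete, here is a self-contained WOE/IV computation; it swaps the monotonic binning for simple quantile bins and uses 0.5-count smoothing, both assumptions for illustration:

```python
import numpy as np

def woe_iv(x: np.ndarray, y: np.ndarray, n_bins: int = 5):
    """WOE and IV over quantile bins (monotonic binning simplified away)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(x, edges[1:-1])          # bin index 0..n_bins-1
    woe, iv = np.zeros(n_bins), 0.0
    n_good, n_bad = (y == 0).sum(), (y == 1).sum()
    for b in range(n_bins):
        good = max(((idx == b) & (y == 0)).sum(), 0.5)  # 0.5 smoothing
        bad = max(((idx == b) & (y == 1)).sum(), 0.5)
        woe[b] = np.log((good / n_good) / (bad / n_bad))
        iv += ((good / n_good) - (bad / n_bad)) * woe[b]
    return woe, float(iv)

# Synthetic risk feature: higher x → higher probability of the bad label
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 5000)
y = (rng.random(5000) < 1 / (1 + np.exp(-2 * x))).astype(int)
woe, iv = woe_iv(x, y)
```

WOE decreasing monotonically across bins (and a healthy IV) is exactly the stability property the monotonic-binning step is meant to enforce.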
16) Processes and RACI
R (Responsible): Data Eng (pipelines/Feature Store), Data Science (design feature/selection/metrics).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/DPO (PII, residency), Risk/AML/RG (policy), SRE (SLO/cost), Security.
I (Informed): Product/Marketing/Operations/Support.
17) Roadmap
MVP (3-5 weeks):
1. Catalog of the top 50 features (Payments/Gameplay) with point-in-time formulas.
2. Feature Store v1 (online/offline) + equivalence tests.
3. Basic selection: constants/correlations → MI → L1/SHAP shortlist (up to 60 features).
4. Feature drift monitoring and cost dashboards.
Phase 2 (5-10 weeks):
- TE/WOE with time-aware validation, graph and calendar features.
- Slice analysis and fairness, probability calibration.
- Materialization of heavy offline features, online cache, quotas.
- Auto-generation of documentation, stability-selection in CI.
- Auto-deactivation of "expensive and useless" features (CPF↑, contribution↓).
- A/B comparison of feature sets, expected-cost reports.
18) Pre-production checklist
- All features have specifications (owner, formula, versions, SLO).
- Passed point-in-time and online/offline equivalence tests.
- Filter → embedded (SHAP/L1) → stability selection completed.
- Drift and reliability monitoring configured; thresholds and alerts are set.
- CPF/latency fit into the budget; heavy features materialized.
- PII policies met (CLS/RLS, tokenization, residency).
- Documentation and use cases have been added to the catalog.
19) Anti-patterns and risks
Leakage (future events, post-promo effects in labels).
Inconsistent online/offline formulas.
Exploding one-hot from high-cardinality categories without hashing/TE.
"Expensive" features without a measurable increase in quality.
Lack of slice/fairness analysis - hidden degradation.
TE/WOE without time-aware cross-validation → overfitting.
20) The bottom line
Feature Engineering is a managed discipline: point-in-time, business sense, reproducibility, monitoring and economics. Strong features + strict selection (filter/wrapper/embedded) and a single Feature Store give stable, interpretable and cheap models that improve Net Revenue, reduce fraud and support RG - transparently and compliantly.