Model training
1) Purpose and principles
The goal of training is to obtain a sustainable, reproducible, and cost-effective model that improves business metrics (Net Revenue, churn↓, fraud↓) while complying with RG/AML/Legal.
Principles:
- Problem→Metric→Data: define the task and the operational metric/cost of errors first, then build the dataset.
- Point-in-time: No feature/label uses the future.
- Reproducibility: fixed seeds/versions, artifact control.
- Simplicity first: start with baseline models/features; add complexity only with proven benefit.
- Privacy by design: PII-minimization, residency, audit.
2) Formalization of task and metrics
Classification: churn/deposit/fraud/RG → PR-AUC, F1 at the operating threshold, KS, expected cost.
Regression/forecast: LTV/GGR → WAPE/SMAPE, P50/P90 error, PI coverage.
Ranking/recommendations: NDCG @ K, MAP @ K, coverage/diversity.
Online metrics: uplift in Net Revenue, CTR/CVR, time-to-intervention (RG), abuse rate.
The operating threshold minimizes expected cost (pseudocode):
```python
best_thr = argmin_thr(cost_fp * FPR(thr) + cost_fn * FNR(thr))
```
3) Datasets and partitions
Point-in-time joins and SCD-aware dimensions.
Class imbalance: stratified sampling, class_weight, focal loss, oversampling rare events.
Time/market/tenant partitions: keep a gap between train↔val↔test to prevent leakage.
```sql
SELECT * FROM ds WHERE event_time < '2025-07-01'                                   -- train
UNION ALL SELECT * FROM ds WHERE event_time BETWEEN '2025-07-01' AND '2025-08-15'  -- val
UNION ALL SELECT * FROM ds WHERE event_time > '2025-08-15'                         -- test
```
4) Feature preparation
Windows and aggregates: 10m/1h/1d/7d/30d, R/F/M, rates/ratios.
Categories: hashing/one-hot; target encoding (time-aware).
Normalization/scaling: parameters from train, save in artifacts.
Graph/NLP/geo features: compute in batch, publish to the Feature Store (online/offline).
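As a minimal sketch of the time-aware target encoding mentioned above: an expanding, past-only mean, so no future labels leak into the feature. The smoothing parameters (`prior`, `alpha`) are illustrative choices, and rows are assumed sorted by event time.

```python
from collections import defaultdict

def time_aware_target_encode(rows, prior=0.5, alpha=10.0):
    """Expanding-mean target encoding: each row sees only PAST targets
    for its category, so no future information leaks into the feature.
    `rows` must be sorted by event time: (category, target) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, target in rows:
        n = counts[cat]
        # Smoothed mean of past targets, shrunk toward the global prior.
        encoded.append((sums[cat] + alpha * prior) / (n + alpha))
        sums[cat] += target
        counts[cat] += 1
    return encoded

# Rows sorted by time: category "A" churns twice, then once not.
enc = time_aware_target_encode([("A", 1), ("A", 1), ("A", 0), ("B", 0)])
```

Note how the first row of each category falls back to the prior (0.5 here), which is exactly the behavior that prevents cold-start leakage.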
5) Basic algorithms
GBDT: XGBoost/LightGBM/CatBoost is a strong baseline for tabular data.
Logistic regression/ElasticNet: interpretable/cheap.
Recommenders: LambdaMART, factorization machines, seq2rec.
Anomalies: Isolation Forest, AutoEncoder.
Time series: Prophet/ETS/GBDT with calendar features.
6) Regularization and overfitting prevention
GBDT: `max_depth`, `num_leaves`, `min_data_in_leaf`, `subsample`, `colsample_bytree`, `lambda_l1/l2`.
NN: dropout/weight decay/early stopping.
Early stopping: on the validation metric, with patience and a minimum-improvement threshold.
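The patience/minimum-improvement rule above can be sketched in plain Python; the metric values below are made up and assumed higher-is-better:

```python
def early_stop_index(val_metrics, patience=3, min_delta=1e-4):
    """Return the index of the best epoch under an early-stopping rule:
    stop once the validation metric (higher is better) has not improved
    by at least `min_delta` for `patience` consecutive epochs."""
    best, best_i, waited = float("-inf"), 0, 0
    for i, m in enumerate(val_metrics):
        if m > best + min_delta:
            best, best_i, waited = m, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i

# PR-AUC per epoch: improves, then plateaus -> keep epoch 2 and stop.
best_epoch = early_stop_index([0.60, 0.64, 0.66, 0.66, 0.659, 0.66])
```

GBDT libraries implement the same logic internally; the point of `min_delta` is to ignore noise-level "improvements" that would otherwise reset the patience counter.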
7) Selection of hyperparameters
Grid/Random for draft search; Bayesian/Hyperband for fine tuning.
Constraints: iteration/time/cost budget; guard against overfitting to val (cross-validate on multiple time splits).
```python
for params in sampler():
    model = LGBMClassifier(**params, random_state=SEED)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
              eval_metric="average_precision",
              callbacks=[lgb.early_stopping(stopping_rounds=200)])
    log_trial(params, pr_auc=pr_auc(model, X_val, y_val), cost=cost())
8) Probability calibration
Platt/isotonic calibration on a holdout; store the calibration function as an artifact.
Check ECE/reliability diagrams; re-derive thresholds from expected cost.
9) Interpretability and explanations
Global: feature importance/SHAP, permutation importance.
Local: SHAP for unit solutions (RG/AML cases).
Document the risks and acceptability of using explanations online.
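Permutation importance can be sketched without any ML library; the toy `predict` model and `acc` metric below are illustrative stand-ins:

```python
import random

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Global importance: shuffle one feature column at a time and
    measure how much the metric degrades versus the baseline score."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the feature-target link for column j
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - metric(y, [predict(row) for row in Xp]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model that only uses feature 0; accuracy as the metric.
predict = lambda row: 1 if row[0] > 0 else 0
acc = lambda y, p: sum(a == b for a, b in zip(y, p)) / len(y)
X = [[1, 5], [-1, 5], [1, 5], [-1, 5]]
y = [1, 0, 1, 0]
imp = permutation_importance(predict, X, y, acc)
```

The ignored constant feature gets importance 0, while shuffling the used feature can only hurt the score, so its importance is non-negative.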
10) Reproducibility and artifacts
Seed everywhere: data/model/fit/split.
Artifacts: data version, feature pipeline, weights, calibration, thresholds, configs.
Deterministic builds: fixed containers/dependencies.
11) Tracking experiments
We register: git commit, dataset/feature versions, model config, metrics (offline/online), artifacts, and comments.
Rules for naming experiments, tags (domain/market/model).
12) Offline → online transfer
Unified transformation code (Feature Store); online/offline equivalence test.
Serving: REST/gRPC, timeouts/retries/cache; canary/staged rollouts.
Threshold/policy: configurable (feature flags), audit and roll-back.
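One way to sketch the online/offline equivalence test: run both transforms over a sample of raw events and collect mismatches. The `dep_rate` feature and the transform functions here are hypothetical.

```python
def feature_parity_mismatches(offline_fn, online_fn, sample_events, tol=1e-9):
    """Equivalence check before rollout: the offline (batch) and online
    (serving) transforms must produce the same feature vector for the
    same raw event, within a numeric tolerance."""
    mismatches = []
    for event in sample_events:
        off = offline_fn(event)
        on = online_fn(event)
        if off.keys() != on.keys() or any(
            abs(off[k] - on[k]) > tol for k in off
        ):
            mismatches.append((event, off, on))
    return mismatches

# Hypothetical transforms computing a deposit-velocity feature.
offline = lambda e: {"dep_rate": e["deps"] / max(e["days"], 1)}
online = lambda e: {"dep_rate": e["deps"] / max(e["days"], 1)}
bad = feature_parity_mismatches(offline, online, [{"deps": 30, "days": 7}])
```

A CI gate would fail the deploy if the mismatch list is non-empty.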
13) Monitoring and drift
Data/features: PSI/KL; alerts when thresholds are exceeded.
Calibration and metrics: ECE, PR-AUC/KS on streaming labels.
Business metrics: uplift Net Revenue, fraud saved, RG interventions, SLA.
Retrain triggers: drift, seasonality, releases, model expiry.
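A PSI sketch over fixed breakpoints; the bin edges and the small epsilon for empty bins are illustrative choices:

```python
import math

def psi(expected, actual, breakpoints):
    """Population Stability Index between a reference (training) sample
    and a live sample, over fixed bin breakpoints. Rule of thumb:
    PSI < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    def frac(values, lo, hi):
        eps = 1e-6  # avoid log(0) on empty bins
        return max(sum(lo <= v < hi for v in values) / len(values), eps)

    edges = [float("-inf")] + list(breakpoints) + [float("inf")]
    total = 0.0
    for lo, hi in zip(edges, edges[1:]):
        e, a = frac(expected, lo, hi), frac(actual, lo, hi)
        total += (a - e) * math.log(a / e)
    return total

ref = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
same = psi(ref, ref, breakpoints=[3, 6, 9])  # identical samples -> 0
```

In production the reference histogram is fixed at training time and the live sample is a sliding window of serving traffic.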
14) Privacy, residency, fairness
PII minimization: pseudonymization, CLS/RLS, segregated mappings.
Residency: per-region catalogs/keys (EEA/UK/BR); ban cross-regional joins without justification.
Fairness: slice analysis (market/device/account age), disparate impact, equalized odds; correction of features/thresholds/weights.
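The disparate-impact check can be sketched as a ratio of positive-decision rates across slices; the markets and decisions below are made up:

```python
def disparate_impact(decisions, groups, protected, reference):
    """Disparate-impact ratio: positive-decision rate in the protected
    slice divided by the rate in the reference slice. The common
    'four-fifths rule' flags ratios below 0.8."""
    def rate(g):
        slice_ = [d for d, grp in zip(decisions, groups) if grp == g]
        return sum(slice_) / len(slice_)
    return rate(protected) / rate(reference)

# Hypothetical per-market approval decisions.
decisions = [1, 1, 0, 1, 1, 0, 0, 1]
markets = ["BR", "BR", "BR", "BR", "EEA", "EEA", "EEA", "EEA"]
ratio = disparate_impact(decisions, markets, protected="EEA", reference="BR")
```

Here EEA's approval rate (0.5) is two thirds of BR's (0.75), so the 0.8 threshold would flag this slice for feature/threshold/weight correction.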
15) Cost-engineering
Training cost: CPU/GPU hours, I/O, number of runs.
Cost of inference: latency/cost per request; limits on online features and model size.
Materialization: heavy features offline; online features fast and cached.
Chargeback: Experimental/replay budgets.
16) Examples (fragments)
LightGBM (classification, Python sketch):
```python
params = dict(
    objective="binary", metric="average_precision",
    num_leaves=64, learning_rate=0.05, feature_fraction=0.8,
    bagging_fraction=0.8, lambda_l1=1.0, lambda_l2=2.0,
)
model = lgb.train(params, train_data,
                  valid_sets=[valid_data],
                  callbacks=[lgb.early_stopping(stopping_rounds=200),
                             lgb.log_evaluation(period=100)])
save_artifacts(model, scaler, feature_spec, cal_model)
```
Point-in-time sampling (SQL idea):
```sql
SELECT a.user_pseudo_id, a.asof, f.dep_30d, f.bets_7d, lbl.churn_30d
FROM features_at_asof f
JOIN asof_index a USING (user_pseudo_id, asof)
JOIN labels lbl USING (user_pseudo_id, asof);
```
Expected cost estimate and threshold selection:
```python
thr_grid = np.linspace(0.01, 0.99, 99)
costs = [expected_cost(y_val, y_proba >= t, cost_fp, cost_fn) for t in thr_grid]
t_best = thr_grid[np.argmin(costs)]
```
17) Processes and RACI
R (Responsible): Data Science (models/experiments), Data Eng (datasets/features/Feature Store), MLOps (serving/monitoring/CI-CD-CT).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/DPO (PII/RG/AML/DSAR), Security (KMS/secrets/audit), SRE (SLO/cost), Finance (ROI).
I (Informed): Product/Marketing/Operations/Support.
18) Implementation Roadmap
MVP (3-6 weeks):
1. Catalog of tasks and metrics (expected cost), point-in-time datasets.
2. Baseline models (LogReg/GBDT) + calibration + model cards.
3. Tracking experiments, fixed seeds/artifacts, reproducible builds.
4. Canary online rollout, thresholds as config, metric/drift alerts.
Phase 2 (6-12 weeks):
- Bayesian/Hyperband tuning, slice analysis/fairness, retrain triggers.
- Feature/inference cost accounting, cache/TTL, chargeback.
- Documentation of metric/threshold formulas, what-if simulations.
- Multi-regional pipelines, DR/exercises, WORM-archive of releases.
- Auto-generated quality/calibration reports, event-triggered automatic retraining.
- A/B/n experiments with sequential testing and automatic shutdown.
19) Pre-release checklist
- Task and metric aligned with the business; cost of errors computed.
- Dataset is point-in-time; time/market partitioning with no leakage.
- Hyperparameter selection/regularization, early stopping, probability calibration in place.
- Model card: data, features, metrics, risks, fairness, owner.
- Artifacts saved (weights, feature pipeline, calibration, thresholds).
- Online/offline equivalence test passed; rollout behind a feature flag.
- Monitoring drift/calibration/business metrics; retrain/rollback plans.
- PII/DSAR/RTBF policies, residency, and access auditing are followed.
- The cost of training/inference is included in the budget; SLA alerts.
20) Anti-patterns and risks
Leakage: features/labels from the future, unsynchronized SCD.
Over-tuning on a single val sample: no time-based splits or cross-validation.
No calibration and cost thresholds.
Online/offline feature mismatch: different results in production.
Ignoring fairness/slices: hidden failures in specific markets/devices.
Unbounded reruns and expensive features: rising cost without benefit.
21) The bottom line
Model training is a manageable process: a clear task and metric, point-in-time discipline, intelligent tuning with regularization, calibration and reproducibility, a transparent transfer to online, and continuous monitoring of quality, cost, and risk. By following this playbook, you get models that predictably improve product, retention, and compliance: quickly, ethically, and reliably.