Data Labeling and Model Quality
1) Purpose and principles
Goal: obtain reproducible labels and measurable model quality, without leakage and with compliance taken into account.
Principles:
- Schema-first: formalized ontologies, class dictionaries, and criteria.
- Point-in-time: labels are built from information available at the time of the decision.
- Quality-as-code: instructions, tests, checklists, and samples live in the repository.
- Privacy-by-design: PII minimization, DSAR/RTBF, residency.
- Cost-aware: account for the cost of labeling and the expected cost of errors.
2) Ontology and label scheme
Define the labeling object, classes, exceptions, and sources of truth.
Example (AML/Antifraud):
- Object: transaction/session.
- Classes: `legit`, `fraud_suspected`, `fraud_confirmed`, `unknown`.
- Exceptions: chargeback without evidence → `unknown`.
- Sources: case management, chargeback registries, providers/bank.
```yaml
task: aml_classification
object: "payment_transaction"
labels:
  - legit
  - fraud_suspected
  - fraud_confirmed
  - unknown
guidelines_version: "1.3.0"
positive_class: "fraud_confirmed"
exclusions:
  - "dispute opened but no evidence -> unknown"
sources_of_truth:
  - "case_system.resolution"
  - "issuer.chargeback_code"
```
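A minimal sketch of how the scheme above could be enforced in code, assuming it has already been parsed into a plain dictionary. The `SCHEME` dict mirrors the YAML; `validate_label` and `is_positive` are hypothetical helper names, not part of any library.

```python
# Hypothetical in-memory copy of the label scheme shown above.
SCHEME = {
    "task": "aml_classification",
    "labels": ["legit", "fraud_suspected", "fraud_confirmed", "unknown"],
    "positive_class": "fraud_confirmed",
    "guidelines_version": "1.3.0",
}


def validate_label(label: str, scheme: dict = SCHEME) -> str:
    """Return the label unchanged if the scheme allows it, else raise."""
    if label not in scheme["labels"]:
        raise ValueError(f"label {label!r} is not in scheme {scheme['task']}")
    return label


def is_positive(label: str, scheme: dict = SCHEME) -> bool:
    """True only for the scheme's positive class."""
    return validate_label(label, scheme) == scheme["positive_class"]
```

Rejecting unknown label strings at write time keeps typos such as `fraud_suspended` out of the dataset instead of surfacing them later as a phantom class.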
3) Guidelines
Structure:
1. Description of the task and business context.
2. Class definitions with positive/negative examples and borderline cases.
3. Source priority rules (ground truth > heuristic > opinion).
4. `unknown` and escalation criteria.
5. Privacy policies (masking, tokens instead of IDs).
6. FAQ and labeling checklist.
Fragment of instructions (fraud):
- `fraud_confirmed`: proven chargeback or closed case with the FRAUD tag.
- `fraud_suspected`: deposit ≥3
- `legit`: no flags and no confirmed cases within the 60-day window.
- `unknown`: conflicting signals or insufficient data.
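The source priority rule and the `unknown` fallback can be sketched as a small resolver. This is an illustration under assumptions: the source type names (`ground_truth`, `heuristic`, `opinion`) and the function `resolve_label` are hypothetical, and real guidelines may define conflicts differently.

```python
# Priority follows the source priority rule: lower number wins.
PRIORITY = {"ground_truth": 0, "heuristic": 1, "opinion": 2}


def resolve_label(votes: list[tuple[str, str]]) -> str:
    """Pick a label from (source_type, label) votes.

    The highest-priority source type wins; if sources of that same
    type disagree, or there are no votes, the result is 'unknown'.
    """
    if not votes:
        return "unknown"
    best = min(PRIORITY[source] for source, _ in votes)
    candidates = {label for source, label in votes if PRIORITY[source] == best}
    return candidates.pop() if len(candidates) == 1 else "unknown"
```

For example, a closed case marked as fraud overrides an annotator's `legit` opinion, while two heuristics that disagree produce `unknown` and can be routed to escalation.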
4) Label sources and point-in-time rules
Auto labels: rules/cases, chargebacks, self-exclusion (RG), bet outcomes.
Ground truth: investigation results and regulatory outcomes.
Point-in-time: do not use events after the decision point (t0).
Delays: for example, a chargeback appears after 45-90 days, so the label "matures" over time.
SQL "no future" template:
```sql
SELECT e.id, e.event_time AS asof,
       CASE WHEN EXISTS (
         SELECT 1 FROM cases c
         WHERE c.tx_id = e.id
           AND c.decision_time <= e.event_time + INTERVAL '90' DAY
           AND c.result = 'FRAUD_CONFIRMED'
       ) THEN 'fraud_confirmed'
       ELSE 'legit'
       END AS label
FROM silver.payments e;
```
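One gap in the template is that a transaction younger than 90 days falls into `legit` even though its label has not matured yet. The sketch below shows one possible way to handle that in Python. The `pending` state and the function name `point_in_time_label` are assumptions for illustration and are not part of the label scheme.

```python
from datetime import datetime, timedelta


def point_in_time_label(event_time, case_decisions, now, maturity_days=90):
    """Label one transaction using only cases decided inside the window.

    case_decisions: list of (decision_time, result) tuples for this transaction.
    Returns 'fraud_confirmed', 'legit', or 'pending' while the window is open.
    """
    window_end = event_time + timedelta(days=maturity_days)
    for decision_time, result in case_decisions:
        if result == "FRAUD_CONFIRMED" and decision_time <= window_end:
            return "fraud_confirmed"
    # Without a confirmed case, 'legit' is only safe once the window has closed.
    return "legit" if now >= window_end else "pending"
```

Rows that come back as `pending` would typically be left out of training until the maturity window has passed.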
5) Samples: stratification and balance
Rare events: use stratified sampling by market/provider/date; oversample rare classes or use focal loss.
Validation splits: keep a holdout by week/market/tenant.
Sanctions/PII: exclude direct ID fields from training sets.
```sql
-- Check class shares by market
SELECT market, label, COUNT(*) AS n FROM dataset GROUP BY market, label;
```
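Stratified sampling itself can be sketched with the standard library alone. The function name `stratified_sample` and the row layout (a list of dicts) are assumptions for illustration. A production pipeline would more likely do this in SQL or a dataframe library.

```python
import random
from collections import defaultdict


def stratified_sample(rows, keys, per_stratum, seed=0):
    """Draw up to `per_stratum` rows from every stratum defined by `keys`.

    rows: list of dicts; keys: column names that define a stratum.
    A fixed seed keeps the sample reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[k] for k in keys)].append(row)
    sample = []
    for stratum in sorted(strata):  # sorted for a deterministic order
        members = strata[stratum]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

Capping each stratum rather than sampling proportionally keeps rare classes (for example, confirmed fraud in a small market) from disappearing from the sample.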
6) Annotator consistency (IRR)
Measure inter-annotator agreement: Cohen's κ (2 annotators) or Krippendorff's α (N annotators, various scale types).
Benchmarks:
- κ < 0.4: poor consistency → revise instructions/examples.
- 0.4–0.6: acceptable for complex tasks.
- > 0.6: good.
- > 0.8: excellent.
Also track: coverage (share of items labeled), κ/α by class and slice, `unknown` share, average labeling time, top errors.
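For the two-annotator case, Cohen's κ is simple enough to compute directly. The sketch below uses the standard definition κ = (p_o − p_e) / (1 − p_e). The function name `cohen_kappa` is chosen here for illustration, and Krippendorff's α is not covered.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

Computing κ per class and per slice, not only overall, shows where the guidelines are ambiguous.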
7) QA loop and gold standards
Golden set: 1-5% of labeled items, double-checked and used as the benchmark.
Honey-pot tasks: hidden known cases in the task stream.
Second look: escalation/arbitration for disputed examples.
Labeling regression tests: revalidation after guideline updates.
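Scoring annotators against golden or honey-pot tasks could look like the sketch below. The tuple layout `(annotator_id, task_id, label)` and the function name `golden_accuracy` are assumptions made for this example.

```python
from collections import defaultdict


def golden_accuracy(annotations, golden):
    """Per-annotator accuracy on tasks that have a golden label.

    annotations: iterable of (annotator_id, task_id, label).
    golden: dict task_id -> trusted label (golden set or honey-pot).
    """
    hits, seen = defaultdict(int), defaultdict(int)
    for annotator_id, task_id, label in annotations:
        if task_id in golden:  # ordinary tasks are ignored
            seen[annotator_id] += 1
            hits[annotator_id] += label == golden[task_id]
    return {a: hits[a] / seen[a] for a in seen}
```

Annotators who never saw a golden task are absent from the result, which is itself a signal that the honey-pot rate is too low for them.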
8) Active, weak and semi-supervised learning
Active Learning: Selection of "uncertain" examples (maximum entropy/diversity).
Weak Supervision: heuristics/distant supervision + noise model for labels.
Semi-Supervised: pseudo-labels with a temperature threshold and subsequent verification.
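```python
# Active learning loop (sketch): the helper functions are placeholders.
U = unlabeled_pool()                           # unlabeled examples
scores, conf = model.predict(U)                # predictions and confidence
C = pick_top_k_by_uncertainty(U, conf, k=500)  # least confident examples
labels = annotate(C)                           # human annotation
train(model, L + labels)                       # retrain on L ∪ labels
```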
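The uncertainty selection step can be made concrete with maximum entropy. This is one possible implementation, assuming the model returns a class probability vector per example. The signature differs slightly from the loop above, which passes a single confidence value.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_top_k_by_uncertainty(items, probabilities, k):
    """Return the k items whose predictions have the highest entropy."""
    ranked = sorted(zip(items, probabilities),
                    key=lambda pair: entropy(pair[1]), reverse=True)
    return [item for item, _ in ranked[:k]]
```

Pure entropy ranking tends to pick near-duplicates, so in practice it is often combined with a diversity criterion, as the text notes.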
9) Anti-leakage and time control
Point-in-time joins for features and labels.
Ban labels/features from the future (after `asof`).
Separate online/offline pipelines with a transformation equivalence test.
Data and logic versioning (`logic_version`, `data_version`, `asof_date`).
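A leakage test can be as simple as checking that no feature value became known after `asof`. The sketch below assumes each row carries a per-feature timestamp. The row layout and the name `find_leaks` are hypothetical.

```python
from datetime import datetime


def find_leaks(rows):
    """Return (id, feature names) for rows with features observed after `asof`.

    rows: list of dicts with 'id', 'asof', and 'feature_times'
    (feature name -> timestamp at which that value became known).
    """
    leaks = []
    for row in rows:
        late = [name for name, ts in row["feature_times"].items()
                if ts > row["asof"]]
        if late:
            leaks.append((row["id"], sorted(late)))
    return leaks
```

Running such a check in CI, with an assertion that the result is empty, turns the "no future" rule into an automated test.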
10) Model quality metrics
Select metrics based on the business cost of errors:
- Classification: PR-AUC/ROC-AUC, F1@k, Recall@k, expected cost (FP/FN weights).
- Risk scoring: KS/ROC-AUC, Brier score, calibration (ECE), PSI/CSI for drift.
- Recommendations: NDCG/MAP@K, coverage/diversity, novelty.
- Anomalies: Precision@k, AUCPR on a synthetic/gold set.
```python
# Pseudocode: choose the threshold that minimizes the expected cost.
best_thr = argmin_thr(cost_fp * FPR(thr) + cost_fn * FNR(thr))
```
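The expected-cost threshold search can be written out as a brute-force scan over observed scores. This is a sketch under assumptions: binary labels (1 = positive), an example is flagged when its score is at or above the threshold, and `best_threshold` is a name chosen here.

```python
def best_threshold(scores, labels, cost_fp, cost_fn):
    """Threshold that minimizes cost_fp * FPR + cost_fn * FNR.

    scores: predicted probabilities; labels: 1 = positive, 0 = negative.
    An example is flagged when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    best_thr, best_cost = None, float("inf")
    # Candidates: every observed score, plus one above the maximum (flag nothing).
    for thr in sorted(set(scores)) + [max(scores) + 1e-9]:
        fp = sum(s >= thr and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < thr and y == 1 for s, y in zip(scores, labels))
        cost = cost_fp * fp / negatives + cost_fn * fn / positives
        if cost < best_cost:
            best_thr, best_cost = thr, cost
    return best_thr, best_cost
```

With a missed fraud case priced five times higher than a false alarm (`cost_fp=1`, `cost_fn=5`), the search favors lower thresholds that trade extra reviews for fewer misses.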
11) Slice analysis and fairness
Slices: market, provider, device/ASN, account age, deposit size, time of day.
Fairness: disparate impact (ratio), equalized odds (difference in FPR/TPR).
Actions: re-engineering features, per-slice calibration, threshold revision, training weights.
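Both fairness measures reduce to comparing a few rates between two slices. The sketch below assumes binary predictions and labels (1 = flagged / positive). The function names `rates` and `fairness_report` are hypothetical.

```python
def rates(preds, labels):
    """Selection rate, TPR and FPR for binary predictions and labels."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return {
        "selection_rate": sum(preds) / len(preds),
        "tpr": tp / pos if pos else float("nan"),
        "fpr": fp / neg if neg else float("nan"),
    }


def fairness_report(group_a, group_b):
    """Compare two slices, each given as a (preds, labels) pair."""
    a, b = rates(*group_a), rates(*group_b)
    ratio = (a["selection_rate"] / b["selection_rate"]
             if b["selection_rate"] else float("nan"))
    return {
        # Disparate impact: ratio of selection rates between the slices.
        "disparate_impact": ratio,
        # Equalized odds: gaps in TPR and FPR should both be near zero.
        "tpr_gap": abs(a["tpr"] - b["tpr"]),
        "fpr_gap": abs(a["fpr"] - b["fpr"]),
    }
```

A slice with no positives or no negatives yields `nan` for the affected gap, which signals that the slice is too small to evaluate.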
12) Production quality monitoring
Data/prediction drift: PSI/KL over features/rates.
Calibration: ECE, reliability diagrams.
Threshold stability: alert if expected cost rises by more than X% or PR-AUC drops.
Schemas/contracts: catch breaking changes (schema registry).
Feedback loop: fast manual labels from incidents (case closings, RG outcomes).
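PSI for a single numeric feature or score can be sketched as follows. The binning choice (equal-width bins over the baseline range) and the `eps` floor for empty bins are assumptions of this example. Quantile bins are an equally common choice.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample.

    Bin edges are equal-width over the baseline range; values outside
    that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

Identical samples give a PSI of zero, and every bin contributes a non-negative term, so the value only grows as the distributions diverge.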
13) Privacy, Security, Compliance
PII minimization: pseudonyms, separate protected mapping.
Residency: separate pipelines/keys (EEA/UK/BR); no cross-region joins without justification.
DSAR/RTBF: computable projections and selective edits.
Legal Hold: WORM archives for cases and reporting packages.
Logs: immutable audit of access and exports.
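Pseudonymization with per-region keys can be illustrated with a keyed hash (HMAC-SHA256) from the Python standard library. The key values and the `REGION_KEYS` mapping below are placeholders for this sketch. Real keys would be held in a KMS and never appear in code.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, key: bytes) -> str:
    """Deterministic keyed token for an identifier (HMAC-SHA256).

    The same id and key always give the same token, so joins still work,
    but the token cannot be reproduced without the key.
    """
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical per-region keys; in practice these would come from a KMS.
REGION_KEYS = {"EEA": b"demo-key-eea", "UK": b"demo-key-uk"}

token_eea = pseudonymize("user-42", REGION_KEYS["EEA"])
token_uk = pseudonymize("user-42", REGION_KEYS["UK"])
```

Because the keys differ by region, the same user gets unrelated tokens in EEA and UK datasets, which makes accidental cross-region joins fail by construction.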
14) Organizing the labeling process
Tools: task tracker, example queue, context preview, PII masking, hotkeys.
Speed and quality control: annotator KPIs (speed, accuracy on the golden set), training and certification.
Versioning: `guidelines_version`, `annotator_id`, `reviewer_id`, timestamps.
Documentation: dataset card (owner, source, windows, rules, metrics).
15) Sample templates
Dataset card (YAML):
```yaml
name: aml_tx_2025q1_pt
owner: ml-risk
asof_range: ["2024-10-01", "2024-12-31"]
positive_label: fraud_confirmed
guidelines_version: "1.3.0"
feature_window: "[-30d, 0d)"
holdout: ["2024-12-15", "2024-12-31"]
pii_policy: "tokenized_user_ids; masked_pan; no_raw_ip"
```
QA labeling rules:
```yaml
qa:
  min_kappa: 0.6
  golden_accuracy_min: 0.9
  max_unknown_share: 0.15
  reannotation_on_disagreement: true
```
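These thresholds could drive an automated gate on each labeled batch. The sketch below copies the three numeric rules into a dictionary. The metric key names (`kappa`, `golden_accuracy`, `unknown_share`) and the function `qa_violations` are assumptions for illustration.

```python
# Thresholds copied from the QA rules above.
QA_RULES = {"min_kappa": 0.6, "golden_accuracy_min": 0.9, "max_unknown_share": 0.15}


def qa_violations(metrics: dict, rules: dict = QA_RULES) -> list[str]:
    """Return the list of violated rules; an empty list means the batch passes."""
    problems = []
    if metrics["kappa"] < rules["min_kappa"]:
        problems.append("kappa below min_kappa")
    if metrics["golden_accuracy"] < rules["golden_accuracy_min"]:
        problems.append("golden accuracy below golden_accuracy_min")
    if metrics["unknown_share"] > rules["max_unknown_share"]:
        problems.append("unknown share above max_unknown_share")
    return problems
```

Returning every violated rule, not only the first, gives Label Ops the full picture for a failed batch in one pass.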
Confusion matrix (SQL idea):
```sql
SELECT pred, label, COUNT(*) AS n
FROM eval_predictions
GROUP BY pred, label;
```
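The same pair counts can be turned into per-class precision and recall. This is a small sketch, assuming predictions and labels share one label vocabulary. The function names are chosen for this example.

```python
from collections import Counter


def confusion_counts(preds, labels):
    """Count (pred, label) pairs, like the GROUP BY above."""
    return Counter(zip(preds, labels))


def precision_recall(counts, positive):
    """Precision and recall for one class from the pair counts."""
    tp = counts[(positive, positive)]
    predicted = sum(n for (p, _), n in counts.items() if p == positive)
    actual = sum(n for (_, y), n in counts.items() if y == positive)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall
```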
16) Implementation Roadmap
MVP (2-4 weeks):
1. Ontology and v1 instructions, gold set (≥1000 examples per domain).
2. Annotation flow with PII masking, weekly κ metric.
3. Baseline model + offline evaluation (PR-AUC, expected cost), point-in-time sampling.
4. Monitoring of feature/rate drift; registry of datasets and guideline versions.
Phase 2 (4-8 weeks):
- Active/weak-supervision pipeline, auto-triage of `unknown`.
- Slice analysis and fairness reports, probability calibration.
- DSAR/RTBF procedures for labeled sets, Legal Hold for cases.
- Full QA automation (golden/honey-pots), labeling regression tests.
- Catalog of datasets and "model quality" cards; expected-cost thresholding.
- Internal chargeback for labeling/inference costs, SLAs for label updates.
17) RACI
R (Responsible): Data Science (ontology, metrics), Label Ops (process/QA), Data Eng (samples/PII/storage).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/DPO (PII/residency/DSAR), Risk/AML/RG (policy), Security (KMS/audit).
I (Informed): Product/Marketing/Operations/Support.
18) Pre-production checklist
- Ontology and guidelines approved, version fixed.
- Sample quality: stratification, holdout by time/market.
- κ/α ≥ target threshold; golden-accuracy target met.
- Point-in-time collection of features and labels; leakage test passed.
- Metrics selected by expected cost; slice analysis and fairness checked.
- Drift/calibration monitoring on; alerts are set up.
- PII/DSAR/RTBF and Legal Hold policies enforced; audit enabled.
19) Anti-patterns and risks
Labeling without clear criteria → low κ, noisy labels.
Leakage from the future (after-the-fact features/labels).
Imbalanced samples; ROC-AUC used without accounting for cost.
No golden set/QA and no labeling regression tests.
Unmasked PII in datasets and residency violations.
No slice analysis → hidden degradation in regions/providers.
20) The bottom line
Model quality starts with label quality. A strict ontology, instructions with examples, point-in-time discipline, QA loops, and metrics that account for the cost of errors are the basis of reproducible ML in iGaming. By embedding these practices in the data pipeline and MLOps, you get robust, ethical, and compliant models that improve business results without surprises.