Data Markup and Model Quality

1) Purpose and principles

The goal: to obtain reproducible labels and measurable quality of models without leukage and taking into account compliance.

Principles:

Schema-first: formalized ontologies, class dictionaries, and criteria.
Point-in-time: Labels are built from information available at the time of the decision.
Quality-as-code: instructions, tests, checklists and selections - in the repository.
Privacy-by-design: PII minimization, DSAR/RTBF, residency.
Cost-aware: consider the cost of markup and expected cost.

2) Ontology and label scheme

Define the markup object, classes, exceptions, and sources of truth: Example (AML/Antifraud):

Object: Transaction/Session.
Классы: `legit`, `fraud_suspected`, `fraud_confirmed`, `unknown`.
Exceptions: chargeback without evidence → 'unknown'.
Sources: case management, chargeback registries, providers/bank.

YAML diagram:

yaml task: aml_classification object: "payment_transaction"
labels:
- legit
- fraud_suspected
- fraud_confirmed
- unknown guidelines_version: "1. 3. 0"
positive_class: "fraud_confirmed"
exclusions:
- "dispute opened but no evidence -> unknown"
sources_of_truth:
- "case_system. resolution"
- "issuer. chargeback_code"

3) Guidelines

Structure:

1. Description of the task and business context.

2. Class definitions with positive/negative examples and borderline cases.

3. Source priority rules (true> heuristic> opinion).

4. 'Unknown'and escalation criteria.

5. Privacy policies (masking, tokens instead of ID).

6. FAQ and markup checklist.

Fragment of instructions (fraud):

'fraud _ confirmed ': proven chargeback/closed case with FRAUD tag.
'fraud _ suspended ': deposit ≥3
'legit ': There are no flags and no confirmed cases in the 60 days window.
'unknown ': Conflicting characteristics or insufficient data.

4) Label sources and point-in-time rules

Auto labels: rules/cases, chargeback, self-exclusion (RG), outcome bets.
Ground: result of investigation/regulatory outcomes.
Point-in-time-Do not use events after the decision point (t0).

Delays: for example, chargeback appears after 45-90 days → the label "matures."

SQL "no future" template:

sql
SELECT e. id, e. event_time AS asof,
CASE WHEN EXISTS (
SELECT 1 FROM cases c
WHERE c. tx_id = e. id
AND c. decision_time <= e. event_time + INTERVAL '90' DAY
AND c. result = 'FRAUD_CONFIRMED'
) THEN 'fraud_confirmed'
ELSE 'legit'
END AS label
FROM silver. payments e;

5) Samples: stratification and balance

Rare events: use stratified sampling by market/provider/date; oversampling rare classes or focal loss.
Validation layers: hold holdout by week/market/tenant.
Sanctions/PII: Exclude direct ID fields from training sets.

Sampling bias control:

sql
-- Verification of class shares by market
SELECT market, label, COUNT() FROM dataset GROUP BY market, label;

6) Tracer Consistency (IRR)

Measure inter-annotation agreement: Cohen's κ (2 annotators )/Krippendorff's α (N annotators, different scale type).

Landmarks:

κ < 0. 4 - poor consistency → revise instructions/examples.

0. 4–0. 6 - acceptable for complex tasks;> 0. 6 - good;> 0. 8 is excellent.

Marking quality card:

Coverage (how many are marked), κ/ α by class and slice, 'unknown' share, average time, top errors.

7) QA circuit and gold standards

Golden set: 1-5% marked - double-checked benchmark.
Honey-pot tasks: hidden known cases in the task stream.
Second look: escalations/arbitrage on controversial examples.
Marking regression tests: revalidation after updating guides.

8) Active, weak and semi-supervised learning

Active Learning: Selection of "uncertain" examples (maximum entropy/diversity).
Weak Supervision: heuristics/distant supervision + noise model for labels.
Semi-Supervised: pseudo-labels with a temperature threshold and subsequent verification.

Pipeline:

python
U = unlabeled_pool()
scores, conf = model. predict(U)
C = pick_top_k_by_uncertainty(U, conf, k=500)
labels = annotate (C) # person train (model, L ∪ labels) # additional training

9) Anti-Leukage and Time Control

Point-in-time join for features and labels.
Banning labels/feature from the future (after 'asof').
Separate online/offline pipelines with transformation equivalence test.
Data and logic versioning ('logic _ version', 'data _ version', 'asof _ date').

10) Model quality metrics

Select metrics for the business cost of errors:

Classification: PR-AUC/ROC-AUC, F1 @ k, Recall @ k, expected cost (FP/FN weights).
Risk scoring: KS/ROC-AUC, Brier, calibration (ECE), PSI/CSI for drift.
Recommendations: NDCG/MAP @ K, coverage/diversity, novelty.
Anomalies: Precision @ k, AUCPR on synthetic/gold set.

Expected-Cost (pseudocode):

python best_thr = argmin_thr(cost_fpFPR(thr) + cost_fnFNR(thr))

11) Slice analysis and fairness

Slices: market, provider, device/ASN, account age, deposit size, time of day.
Fairness: disparate impact (ratio), equalized odds (разница FPR/TPR).
Actions: reassembly of features, calibration by slices, revision of thresholds, training weights.

12) Production quality monitoring

Data/prediction drift: PSI/KL over features/rates.
Calibration: ECE, reliability-charts.
Threshold stability: alert if expected cost ↑> X% or PR-AUC ↓.
Schemes/contracts: catch breaking changes (schema registry).
Feedback loop: fast manual incident labels (case-closings, RG-outcomes).

13) Privacy, Security, Compliance

PII minimization: pseudonyms, separate protected mapping.
Residency: Separate pipelines/keys (EEA/UK/BR); banning cross-regional joins without reason.
DSAR/RTBF: computable projections and selective edits.
Legal Hold: WORM archives for cases and reporting packages.
Logs: unalterable access/export audit.

14) Organization of marking process

Tools: task tracker, example queue, context preview, PII masking, hotkeys.
Speed and quality control: KPI of the annotator (speed, accuracy in golden), training and certification.
Versioning: 'guidelines _ version', 'annotator _ id', 'reviewer _ id', timestamps.
Documentation: set card (owner, source, windows, rules, metrics).

15) Sample templates

Datacet Card (YAML):

yaml name: aml_tx_2025q1_pt owner: ml-risk asof_range: ["2024-10-01", "2024-12-31"]
positive_label: fraud_confirmed guidelines_version: "1. 3. 0"
feature_window: "[-30d, 0d)"
holdout: ["2024-12-15", "2024-12-31"]
pii_policy: "tokenized_user_ids; masked_pan; no_raw_ip"

QA marking rules:

yaml qa:
min_kappa: 0. 6 golden_accuracy_min: 0. 9 max_unknown_share: 0. 15 reannotation_on_disagreement: true

Confusion matrix (SQL idea):

sql
SELECT pred, label, COUNT() n
FROM eval_predictions
GROUP BY pred, label;

16) Implementation Roadmap

MVP (2-4 weeks):

1. Ontology and v1 instructions, gold set (≥1000 examples per domain).

2. Annotation flow with PII masking, κ metric for each week.

3. Baseline model + offline estimate (PR-AUC, expected cost), point-in-time sampling.

4. Monitoring the drift of features/rates; register of datasets and guide versions.

Phase 2 (4-8 weeks):

Active/weak-supervision pipeline, auto-triage 'unknown'.
Slice analysis and fairness reports, probability calibration.
DSAR/RTBF procedures for marked sets, Legal Hold for cases.

Phase 3 (8-12 weeks):

Full QA automation (golden/honey-pots), markup regression tests.
Catalog of datasets and "model quality" cards; expected-cost thresholding.
Chargeback by markup/inference cost, SLA by label updates.

17) RACI

R (Responsible): Data Science (ontology, metrics), Label Ops (process/QA), Data Eng (samples/PII/storage).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/DPO (PII/residency/DSAR), Risk/AML/RG (policy), Security (KMS/audit).
I (Informed): Product/Marketing/Operations/Support.

18) Pre-sale checklist

Ontology and guides approved, version fixed.
Qualitative sample: stratification, holdout by time/market.
κ/ α ≥ target threshold golden-accuracy complied with.
Point-in-time collection of features and labels; test for the absence of leukage passed.
Metrics selected by expected cost, slice analysis, and fairness.
Drift/calibration monitoring on; alerts are set up.
PII/DSAR/RTBF and Legal Hold policies enforced; audit enabled.

19) Anti-patterns and risks

Markup without clear criteria → low κ, noisy labels.
Lakage from the future (post-factual signs/labels).
Unbalanced samples, ROC-AUC metric excluding cost.
Lack of golden/QA and regression markup tests.
PII in unmasked and residency datasets.
No slice analysis → hidden degradation on regions/providers.

20) The bottom line

Model quality starts with label quality. Strict ontology, instructions with examples, point-in-time discipline, QA contours and metrics that take into account the cost of errors are the basis of reproducible ML in iGaming. By embedding these practices in the data pipeline and MLOps, you get sustainable, ethical and compliant models that improve business results without surprises.

Data Markup and Model Quality

Get in Touch

Quick Contact

The video will be updated soon

We are currently very busy with projects