Data auditing and versioning
1) Why do you need it
Auditing and versioning create reproducibility: you can explain any figure, repeat the calculation and safely develop models/showcases. In iGaming, this is critical for finance (GGR/NET), payments, KYC/AML, Responsible Gaming and regulatory reporting.
Objectives:- Tracing: who changed the data/schema/logic and why.
- Reproducibility: Which version of the data/code/model generated the report.
- Release security: rollback and predictability of changes.
- Compliance: provable logs for regulators and internal audits.
2) Concepts and version levels
1. Schema Version - Field/Type/Semantic Evolution (SEMVER).
2. Dataset Version-Snapshot/slice at a time "true" for report/training.
3. Data Product Version: formulas, filters, aggregations.
4. ML feature/model version: date/code/hyperparameters/feature/data (end-to-end).
5. Pipeline version: transformation code, configs, dependencies.
6. Data contract version: producer/consumer requirements (scheme, SLA, quality).
3) Audit: what to log
Who: subject (user/service), role/attributes (RBAC/ABAC).
What: Table/Showcase/Model/Scheme/Contract.
When: exact time, tz, correlation id.
Why: link to task/ticket/release note, reason.
Than: code/model version, commit hash, container image.
How has it changed: before/after (diff), row volume (rows affected), integrity control (hash/signature).
Context: environment (prod/stage), domain, data sensitivity (class).
Audit logs are append-only/WORM, signed, and available in SIEM.
4) Versioning policy (recommendations)
SEMVER: `MAJOR. MINOR. PATCH`
MAJOR - incompatible schema/semantics changes.
MINOR - reversibly compatible additions (new fields/columns with nullable, new vNext showcases).
PATCH - fixes without changing the contract (quality-fix, backfill).
Deviation-procedure: obsolescence window, warnings in the/CI directory, date of disconnection.
Release Notes: one page per release: what, why, risks, rollback plan.
5) Techniques in storage and streams
Time-travel/Snapshots: storing table versions; ability to execute the query "as it was on T-0."
SCD (Slowly Changing Dimensions): types 1/2/3 for dimensions (games, providers, players).
CDC/CDF (Change Data/Capture & Feed): incremental changes for facts (rates, payments, KYC).
Audit Fact-A separate fact table with edit/add/delete events.
Integrity control: batch/file hashes, package signatures, aggregate reconciliations.
6) Evolution of circuits and Data Contracts
Contract as code: schema, types, mandatory fields, allowed values, SLA freshness, DQ rules.
Compatibility: added → MINOR field; changed the type/semantics → MAJOR with migration and dual-write.
CI gate: PR changing scheme is blocked if compatibility is broken or there is no Release Notes.
Directory/Registry: stores active/obsolete versions and owners.
7) Versioning in BI and metrics
Certified "gold" showcases: fixed KPI semantics (GGR, ARPPU, retention).
Dual-run: a new version of the showcase is built in parallel (v2), comparison of metrics (tolerance bands).
Commit Reports - Each export/dashboard references a'dataset _ version' and a'definition _ version'.
Calendar sections: "dey-kat," "month-to-date" - are fixed on the data version.
8) Versioning in ML/MLOps
Model Registry: model, date, quality metrics, training data (dataset_version), feature versions (feature_set_version).
Feature Store: versioned feature groups; prohibition of "hot" fields without an explicit version.
Repro set: training code (commit), environment (Docker/conda lock), sid.
Champion-Challenger: parallel versions in sales, reports on quality, fairness and privacy.
Rollback: quick rollback to the previous stable model and feature set.
9) Rollback, backfill and fixes
Rollback plan: for each MAJOR/MINOR version - clear return steps.
Backfill playbook: source of truth, date range, order of recalculation, checksums, labels "recomputed = true."
Edit visibility: v2 replaces v1 only after comparison; all "historical" reports continue to reference their versions.
10) Safety and compliance in the audit
Event/package signing: producer signs, consumer verifies.
PII sanitation: the audit stores tokens that are not raw PII.
Legal Hold: No deletion of version/logs for the duration of the investigation.
DSAR: versions find and upload subject records by token; historical snapshots are taken into account.
11) Metrics and SLO
Repro Rate is the percentage of reports played from the data version/code ≥ the target threshold.
Coverage:% of tables with time-travel/audit log enabled.
Schema Compatibility Pass: rate of successful compatibility checks in CI.
Dual-run Delta: variance v1/v2 within tolerance.
Rollback MTTR: average version rollback time.
Audit Integrity - percentage of events signed and verified.
Backfill Success - percentage of recalculations completed correctly.
12) iGaming Patterns (Cases)
GGR correction retroactively: the supplier has recalculated RTP - we make backfill of facts for the period, fix 'recomputed _ at', publish Release Notes, compare v1/v2; we do not rewrite the reports for the past months, but mark "the corrected version is available."
Anti-fraud rules: we change the semantics of features - MAJOR, dual-run models and showcases, rollback to champion when regressing.
KYC/AML: added new provider statuses - MINOR with nullable; include compatibility tests in contracts.
RG signals: clarified the logic of the "series of losses" - MINOR + Release Notes and impact monitoring.
13) Tools and artifacts (categories)
Catalog/Lineage/Registry: set/schematic/storefront versions, owners, connections, contracts.
Orchestrator & CI/CD: compatibility gates, dual-run, release notes publishing.
Storage with time-travel: storage of snapshots/logs.
Signing & Checksums: batch signature, batch checksums.
Model/Feature Registry: feature/model versions, champion-challenger reports.
14) Templates (ready to use)
14. 1 Release Notes
Version: 'payments _ gold v2. 1. 0`
Type: MINOR (new fields' psp _ country ',' method _ group ')
Reason: PSP/Country Reporting Unification
Risks: Impact on display case'risk _ signals'
Validation: dual-run 14 days, delta ≤ 0. 2% GGR
Rollback: switch to 'v2. 0. 3 'via orchestrator flag
Deploy date/owner/ticket
14. 2 Kit version passport
Dataset: `game_rounds_silver`
Version: '2025-11-01T00: 00: 00Z' (snapshot id)
Schema: 'schema @ 1. 7. 0 '(contract reference)
Source: Provider Feeds A/B (commit...)
Integrity checksum signed manifest
DQ: Completeness 99. 9%, freshness ≤ 15 min
Uses: 'games _ perf _ gold v3. x`, `rg_signals v1. x`
14. 3 Change audit report
Event: update schema 'kyc _ status' → 'kyc _ status, v2'
user/service, 'Data-Engineer' role
When: '2025-11-01 09:32:10 + 02'
Why: Ticket # 3421 (new provider statuses)
Diff: + 'status _ reason' (nullable), enum extended
Checks: CI semver pass, MINOR contract
Caption: 'sig =...', hash diff: 'sha256 =...'
14. 4 Versioning policy (fragment)
MAJOR: breaks compatibility; dual-write ≥ 30 days; mandatory rollback plan.
MINOR: reversibly compatible; Warnings in the directory A/B storefronts 7-14 days.
PATCH: quality fixes/recalculations; Release Notes required.
Archiving: we store snapshots for regulation ≥ N months; WORM for audit.
15) Processes (end-to-end)
1. Initiative: change ticket + linedge impact score.
2. Engineering Contract/Schema Update + Release Notes.
3. Validation: CI compatibility checks, DQ tests, dual-run.
4. Deploy: by flag, canary; Publish the version to the catalog.
5. Monitoring: delta v1/v2, KPI, complaints.
6. Backfill: By regression playbook.
7. Post-mortem: if incident, update policy/tests.
16) RACI (example)
Policies and Standards: CDO (A), Data Governance Council (R/A), DPO/Sec (C).
Contracts/schemes: Domain Owners (A), Data Stewards (R), Platform/Eng (C).
Orchestration/Storage: Platform/Eng (R), SRE (C).
BI/metrics: Analytics Lead (R), Product/Finance (C).
ML versions: ML Lead (A), DS (R), Platform (C).
Audit/Logs: SecOps (R), Internal Audit (C).
17) Implementation Roadmap
0-30 days (MVP)
Enable time-travel/snapshots for critical tables (payments, game_rounds, kyc).
Run immutable audit logs and signature of ingestion packages.
Accept SEMVER policy and Release Notes template.
Catalog: add 'owner', 'schema _ version', 'dataset _ version' to top showcases.
30-90 days
Enter dual-run for all MINOR/MAJOR; automatic v1/v2 comparison.
Associate contracts with compatibility and DQ CI gates.
Backfill/rollback regulation; train teams.
Model/Feature Registry with a full set of dannyye→fichi→model→inferens links.
3-6 months
Full audit log coverage, WORM storage, reports for regulators.
Automated Release Notes from diff + lineage.
Repro Rate/Schema Compatibility/Rollback MTTR reports in dashboards.
Quarterly reviews of KPI versions and "freezing" of definitions.
18) Anti-patterns
Changing KPI semantics without a new version/release note.
Recalculations "quietly" without a backfill plan and'recomputed 'marks.
Storage of raw PII in audit logs.
Lack of dual-run and instant window replacement.
"Eternal" models/showcases without specifying the version and sources.
19) Related Sections
Data Management, Data Origin and Path, Access Control, Tokenization, Security and Encryption, Model Monitoring, Ethics and DSAR, Federated Learning, Confidential ML.
Result
Auditing and versioning turn data and models into a reliable product: each change is transparent, reproducible and reversible. For iGaming, this is the foundation of trust in KPIs, sustainability of compliance and speed of secure releases.