Data validation
1) Why does the iGaming platform need it?
Trust in reports and KPIs: GGR/NET, conversions, retention, RG signals.
ML/scoring reliability: correct features for anti-fraud/recommendations/RG.
Real-time operations: alerts on drift or event loss before payouts/UX are affected.
Compliance: no PII/secrets where they shouldn't be; provable traceability.
2) Where to validate: control levels
1. Ingestion (batch/stream): schema, types, required fields, idempotency/dedup.
2. Stream processing: windows/watermarks, ordering, gaps/delays, exactly-once.
3. ETL/ELT and transformations: relationships/joins, aggregates, business balances.
4. DWH/data marts (Gold): consistency between tables, freshness, key uniqueness.
5. Feature Store/online: feature ranges, offline↔online consistency.
6. BI/API: counts and filters, SLAs on latency/freshness, k-anonymity.
3) Types of checks (catalog)
Schema: type/nullable/enum/regex/JSON shape; incompatible changes → stop.
Domain: amount ≥ 0, currency ∈ {EUR, USD, TRY, BRL}, stake ≤ limit, country ∈ license.
Identity/keys: the primary key is unique, no foreign key is dangling.
Field quality: completeness, length, format (IBAN, BIN, e-mail token).
Statistics/baselines: frequencies, distributions, quantile corridors.
Anomalies: volume/fraction spikes, zeros/duplicates, schema drift.
Freshness: max(ts) no older than X; lag ingest→gold ≤ T.
Consistency: sum of parts = summary; multi-table reconciliation.
Privacy/security: Zero-PII outside the permitted zones; tokenization/masks.
Regulatory: RG/AML fields are present and plausible.
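The "Statistics/baselines" item in the catalog above mentions quantile corridors. A minimal Python sketch of that idea follows. The function names `quantile_corridor` and `outside_corridor`, the 1st/99th percentile bounds, and the sample numbers are assumptions for illustration only, not part of any specific framework.

```python
from statistics import quantiles

def quantile_corridor(baseline, low_pct=1, high_pct=99):
    """Return (low, high) bounds taken from percentiles of a baseline sample."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(baseline, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]

def outside_corridor(values, corridor):
    """Return the values that fall outside the (low, high) corridor."""
    low, high = corridor
    return [v for v in values if v < low or v > high]

# Hypothetical baseline: amounts 1..100 observed during a "normal" period.
baseline = list(range(1, 101))
corridor = quantile_corridor(baseline)
suspicious = outside_corridor([50, 250, -5], corridor)
```

In practice the baseline would be recomputed per market and time window, so that the corridor follows seasonality instead of acting as a fixed threshold.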
4) Data Contracts
A contract fixes the schema + quality rules + SLOs between the source and its consumers.
Minimum contract (fragment, YAML):

    dataset: payments_ingest_v2
    owner: team-payments
    schema:
      id: {type: string, pattern: "^[a-f0-9]{32}$", unique: true}
      ts: {type: timestamp, timezone: "UTC", nullable: false}
      amount: {type: "decimal(18,2)", min: 0.00}
      currency: {type: string, enum: ["EUR","USD","TRY","BRL"]}
      psp: {type: string, required: true}
    quality:
      freshness_max: "PT5M"
      completeness_min: 0.995
      duplicate_rate_max: 0.001
      pii_allowed: false
    slo:
      p95_ingest_latency_ms: 30000
      success_rate: 0.995
Contract changes go through semver and migrations: `MAJOR` breaks compatibility, `MINOR` adds a field, `PATCH` corrects a description.
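The semver rule above can be checked mechanically by diffing two schema versions. The sketch below is one possible reading of it, under stated assumptions: schemas are plain dictionaries, the helper names `classify_change` and `_structural` are hypothetical, and any removed field or changed rule is conservatively treated as breaking.

```python
def _structural(rules):
    """Field rules without documentation-only keys."""
    return {k: v for k, v in rules.items() if k != "description"}

def classify_change(old, new):
    """Return 'MAJOR', 'MINOR' or 'PATCH' for a schema change.

    old/new map field name -> dict of rules (type, enum, description, ...).
    """
    removed = old.keys() - new.keys()
    added = new.keys() - old.keys()
    changed = {
        f for f in old.keys() & new.keys()
        if _structural(old[f]) != _structural(new[f])
    }
    if removed or changed:
        return "MAJOR"  # existing consumers may break
    if added:
        return "MINOR"  # new field, existing fields untouched
    return "PATCH"      # only descriptions (or nothing) changed

# Hypothetical first version of a schema.
v1 = {
    "id": {"type": "string"},
    "amount": {"type": "decimal(18,2)", "description": "Payment amount"},
}
with_new_field = {**v1, "psp": {"type": "string"}}
bump = classify_change(v1, with_new_field)
```

A CI job could run such a check on every contract change and reject a pull request whose declared version bump is smaller than the computed one.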
5) Expectations and policies
Expectations are declarative checks executed in pipelines (batch/stream).
Examples of expectations (YAML):

    expectations:
      - name: unique_primary_key
        check: "unique(id)"
        severity: "error"
      - name: amount_non_negative
        check: "amount >= 0"
        severity: "error"
      - name: currency_enum
        check: "currency in ['EUR','USD','TRY','BRL']"
        severity: "error"
      - name: ts_fresh_enough
        check: "now() - max(ts) <= interval '5 minutes'"
        severity: "warn"
      - name: pii_absent
        check: "no_plain_pii(columns: ['email','card','iban'])"
        severity: "error"
Response policy:
- `error` → batch quarantine, alert + ticket; downstream is blocked.
- `warn` → the batch passes, but a review task is created and the data is flagged for quality.
- `info` → monitoring only.
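The response policy above maps the worst failed severity to a batch-level action. A minimal Python sketch of that mapping is shown below. The `Expectation` class, the `evaluate` function, the action names, and the `min_batch_size` check are hypothetical; real expectation frameworks expose their own APIs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Expectation:
    name: str
    check: Callable[[list], bool]  # receives the whole batch, True means "passed"
    severity: str                  # "error", "warn" or "info"

def evaluate(batch, expectations):
    """Run all expectations; derive one action from the worst failed severity."""
    failed = [e for e in expectations if not e.check(batch)]
    severities = {e.severity for e in failed}
    if "error" in severities:
        action = "quarantine"      # alert + ticket, downstream blocked
    elif "warn" in severities:
        action = "pass_with_flag"  # passes, review task, quality flag
    else:
        action = "pass"            # "info" failures are only monitored
    return action, [e.name for e in failed]

CURRENCIES = {"EUR", "USD", "TRY", "BRL"}
EXPECTATIONS = [
    Expectation("unique_primary_key",
                lambda b: len({r["id"] for r in b}) == len(b), "error"),
    Expectation("amount_non_negative",
                lambda b: all(r["amount"] >= 0 for r in b), "error"),
    Expectation("currency_enum",
                lambda b: all(r["currency"] in CURRENCIES for r in b), "error"),
    # Hypothetical soft check, added only to show the "warn" path.
    Expectation("min_batch_size", lambda b: len(b) >= 2, "warn"),
]

bad_batch = [
    {"id": "1", "amount": 10, "currency": "EUR"},
    {"id": "1", "amount": -5, "currency": "EUR"},
]
action, failed_names = evaluate(bad_batch, EXPECTATIONS)
```

Here the duplicate `id` and the negative amount both fail with severity `error`, so the whole batch would be quarantined.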
6) Streaming: Specifics of checks
Watermarks/late data: allow lateness of `≤ 120s`, otherwise quarantine; compensate with finite windows.
Idempotency: event key + payload hash → dedup at the broker/stream level.
Exactly-once: transactional sinks (+ idempotent sinks) for critical flows (payments/rounds).
Volume counters: "expected" vs "received" per window; discrepancy → alert.
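Example (Scala; `DeduplicateWithin` and `emitToQuarantineIfLate` are user-defined helpers that are not shown here):

    import scala.concurrent.duration._  // enables 120.seconds
    // Assumption: Time is Flink's windowing Time class.
    import org.apache.flink.streaming.api.windowing.time.Time

    val allowedCurrencies = Set("EUR", "USD", "TRY", "BRL")

    // Drop repeated events with the same id seen within 10 minutes.
    val deduped = stream
      .keyBy(_.id)
      .process(new DeduplicateWithin(Time.minutes(10)))

    // Keep only records that pass the domain checks.
    val validated = deduped
      .filter(_.amount >= 0)
      .filter(e => allowedCurrencies.contains(e.currency))

    // Events later than the allowed lateness go to quarantine.
    emitToQuarantineIfLate(validated, allowedLateness = 120.seconds)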
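The same dedup and lateness logic can be illustrated without a streaming engine. The Python sketch below is a single-process simplification under stated assumptions: events are dictionaries with `id` and `ts` (epoch seconds), the watermark is simply the highest event time seen so far, and `process_events` is a hypothetical name. State is kept in an unbounded dictionary, which a production job would expire.

```python
def process_events(events, dedup_window_s=600, allowed_lateness_s=120):
    """Split events (in arrival order) into (accepted, quarantined)."""
    last_seen = {}              # event id -> event time of the accepted copy
    watermark = float("-inf")   # highest event time observed so far
    accepted, quarantined = [], []
    for ev in events:
        watermark = max(watermark, ev["ts"])
        if watermark - ev["ts"] > allowed_lateness_s:
            quarantined.append(ev)   # later than the allowed lateness
            continue
        prev = last_seen.get(ev["id"])
        if prev is not None and abs(ev["ts"] - prev) <= dedup_window_s:
            continue                 # duplicate inside the dedup window: drop
        last_seen[ev["id"]] = ev["ts"]
        accepted.append(ev)
    return accepted, quarantined

events = [
    {"id": "a", "ts": 1000},
    {"id": "a", "ts": 1000},   # duplicate of the first event
    {"id": "b", "ts": 1300},
    {"id": "c", "ts": 1100},   # 200 s behind the watermark
    {"id": "d", "ts": 1250},   # 50 s behind the watermark, still allowed
]
accepted, quarantined = process_events(events)
```

With these inputs, events `a`, `b` and `d` are accepted, the repeated `a` is dropped, and `c` goes to quarantine.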
7) DWH/SQL: invariants and reconciliations
SQL checks (example):

    -- uniqueness: every returned row is a duplicated key
    SELECT id, COUNT(*) AS c FROM gold.payments GROUP BY 1 HAVING COUNT(*) > 1;
    -- freshness: lag between now and the newest record
    SELECT NOW() - MAX(ts) AS lag FROM gold.payments;
    -- reconciliation of totals: detail rows vs. stored summary
    SELECT
      SUM(amount) AS by_rows,
      (SELECT total_amount FROM gold.payments_summary WHERE date = CURRENT_DATE) AS by_summary
    FROM gold.payments
    WHERE date = CURRENT_DATE;
Windowed reconciliation: daily `detail → summary` reconciliations, discrepancy reports, automatic ticket.
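A daily `detail → summary` reconciliation can be sketched in Python with the standard `sqlite3` module. Everything below is illustrative: the table layout, the integer `amount_cents` columns (chosen to avoid floating-point comparison), and the function name `reconcile` are assumptions, and SQLite stands in for the real warehouse.

```python
import sqlite3

def reconcile(conn, day):
    """Compare the sum of detail rows with the stored summary for one day."""
    by_rows = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM payments WHERE date = ?",
        (day,),
    ).fetchone()[0]
    row = conn.execute(
        "SELECT total_amount_cents FROM payments_summary WHERE date = ?",
        (day,),
    ).fetchone()
    by_summary = row[0] if row else None
    return {
        "date": day,
        "by_rows": by_rows,
        "by_summary": by_summary,
        "ok": by_summary is not None and by_rows == by_summary,
    }

# Demo data in an in-memory database (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, date TEXT, amount_cents INTEGER);
    CREATE TABLE payments_summary (date TEXT PRIMARY KEY, total_amount_cents INTEGER);
    INSERT INTO payments VALUES
        ('a', '2024-01-01', 1000), ('b', '2024-01-01', 2550), ('c', '2024-01-02', 400);
    INSERT INTO payments_summary VALUES ('2024-01-01', 3550), ('2024-01-02', 500);
""")
day1 = reconcile(conn, "2024-01-01")  # totals match
day2 = reconcile(conn, "2024-01-02")  # 400 vs 500: a discrepancy to report
```

A scheduler could run `reconcile` for each closed day and open a ticket whenever `ok` is false.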
8) Privacy and security
PII redaction by default: masks/tokens at ingestion; raw e-mails/cards/phone numbers are prohibited in logs.
Access policy: tables with PII live in a separate layer/catalog, with role-based access (RBAC/ABAC).
K-anonymity of reports: at least N rows per slice.
Leak detectors: regular checks for PII patterns, "secrets" (keys/tokens).
Jurisdictions: geo/tenant-isolation (country/brand/license), separate keys.
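Two of the controls above, leak detection and k-anonymity, are easy to prototype. The Python sketch below is deliberately rough: the regular expressions are simplified assumptions that will produce false positives and negatives, and the names `find_pii` and `small_slices` are hypothetical.

```python
import re
from collections import Counter

# Simplified patterns for illustration; production detectors need stricter rules
# (for example a Luhn check for card numbers).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "card": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}

def find_pii(text):
    """Return the sorted names of PII patterns found in a log line or value."""
    return sorted(name for name, rx in PII_PATTERNS.items() if rx.search(text))

def small_slices(rows, keys, k):
    """Return report slices (combinations of key values) with fewer than k rows."""
    counts = Counter(tuple(r[c] for c in keys) for r in rows)
    return {slice_: n for slice_, n in counts.items() if n < k}

leaks = find_pii("payment failed for john@example.com")
rows = [{"country": "TR", "brand": "x"}] * 5 + [{"country": "BR", "brand": "x"}] * 2
too_small = small_slices(rows, ["country", "brand"], k=3)
```

Slices returned by `small_slices` would be suppressed or merged before a report is published.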
9) Quality and SLO metrics
Quality dimensions (DQ):
- Freshness: lag since max(ts).
- Completeness: share of non-empty/expected records.
- Uniqueness: share of duplicate keys.
- Consistency: invariants and balances (cross-table).
- Accuracy: reconciliation against an external source or domain rules.
- Validity: conformance to types/enum/regex.
SLO examples:
- `Freshness payments_gold ≤ 5 min` (p95).
- `Completeness game_rounds ≥ 99.7%/day`.
- `Duplicate_rate ≤ 0.1‰`.
- `PII_leak = 0`.
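Three of the metrics above (freshness, completeness, duplicate rate) can be computed per batch and compared with SLO targets. This is a minimal Python sketch: the function names, the dictionary layout, and the choice to count completeness over distinct ids are assumptions, and the thresholds in the demo are examples only.

```python
from datetime import datetime, timedelta, timezone

def quality_metrics(rows, expected_count, now):
    """Compute freshness lag, completeness and duplicate rate for a non-empty batch."""
    ids = [r["id"] for r in rows]
    unique_ids = set(ids)
    return {
        "freshness_lag": now - max(r["ts"] for r in rows),
        "completeness": len(unique_ids) / expected_count,
        "duplicate_rate": (len(ids) - len(unique_ids)) / len(ids),
    }

def slo_breaches(m, max_lag, min_completeness, max_duplicate_rate):
    """Return the names of the metrics that violate their SLO."""
    breaches = []
    if m["freshness_lag"] > max_lag:
        breaches.append("freshness")
    if m["completeness"] < min_completeness:
        breaches.append("completeness")
    if m["duplicate_rate"] > max_duplicate_rate:
        breaches.append("duplicate_rate")
    return breaches

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": "a", "ts": now - timedelta(minutes=10)},
    {"id": "b", "ts": now - timedelta(minutes=7)},
    {"id": "c", "ts": now - timedelta(minutes=3)},
    {"id": "c", "ts": now - timedelta(minutes=3)},  # duplicate key
]
metrics = quality_metrics(rows, expected_count=4, now=now)
breaches = slo_breaches(metrics, timedelta(minutes=5), 0.995, 0.001)
```

The batch is fresh enough (3 minutes of lag) but misses one expected record and contains one duplicate, so two SLOs are breached.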
10) Alerts, tickets and runbook
Routing: Slack/PagerDuty → domain owner; samples and a diff are attached automatically.
Grouping: one incident per label set, e.g. `dataset=payments, brand=TR`.
Runbook (example "Freshness breach: payments_gold"):
1. Check the ingest log and the broker queue.
2. Compare "expected vs received" by PSP.
3. Enable retries / switch the PSP route.
4. Annotate the cause; rerun backfills; hold a post-mortem.
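The grouping rule above (one incident per label set) keeps a burst of failed checks from turning into a burst of pages. A minimal Python sketch, assuming each alert is a dictionary with a `check` name and a `labels` mapping; `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict

def group_alerts(alerts, label_keys=("dataset", "brand")):
    """Group alerts into incidents keyed by the values of the chosen labels."""
    incidents = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k) for k in label_keys)
        incidents[key].append(alert["check"])
    return dict(incidents)

alerts = [
    {"check": "ts_fresh_enough", "labels": {"dataset": "payments", "brand": "TR"}},
    {"check": "unique_primary_key", "labels": {"dataset": "payments", "brand": "TR"}},
    {"check": "ts_fresh_enough", "labels": {"dataset": "game_rounds", "brand": "TR"}},
]
incidents = group_alerts(alerts)
```

Three alerts collapse into two incidents here, each of which can then be routed to its domain owner.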
11) Versioning, tests and waiver process
Semver for quality rules: `quality@MAJOR.MINOR.PATCH`.
Unit tests for transformations (SQL/dbt/Python) and contract tests for sources.
Golden sets: known discrepancy/leak cases are mandatory in regression.
Waiver: a short-term permission to violate a rule (description, owner, expiry date, compensating measures).
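A waiver is only safe if it expires on its own. The Python sketch below shows one way to model that, assuming a waiver is scoped to a rule and a dataset; the `Waiver` class, its fields, and `is_waived` are hypothetical names that mirror the attributes listed above.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Waiver:
    rule: str          # name of the expectation being waived
    dataset: str
    owner: str
    expires: date      # last day on which the waiver is valid
    description: str
    compensation: str  # compensating measure agreed with the owner

def is_waived(rule, dataset, waivers, today):
    """True if an unexpired waiver covers this rule on this dataset."""
    return any(
        w.rule == rule and w.dataset == dataset and today <= w.expires
        for w in waivers
    )

waivers = [
    Waiver("ts_fresh_enough", "payments_ingest_v2", "team-payments",
           date(2024, 1, 15), "PSP migration window", "manual hourly check"),
]
active = is_waived("ts_fresh_enough", "payments_ingest_v2", waivers, date(2024, 1, 10))
expired = is_waived("ts_fresh_enough", "payments_ingest_v2", waivers, date(2024, 1, 16))
```

Once the expiry date passes, the rule fires again without anyone having to remember to remove the exception.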
12) Catalogs/artifacts (ready-made templates)
12.1 Dataset passport
YAML:

    dataset: gold.game_rounds
    owner: team-games
    steward: data-governance
    contracts: ["games_rounds_v3"]
    quality_slo:
      freshness_p95: "PT10M"
      completeness_min: 0.997
      uniqueness_max_dup: 0.0005
    alerts:
      channels: ["#dq-incidents","#games-ops"]
      severity_map: {error: "P1", warn: "P2"}
12.2 Quarantine policy
YAML:

    quarantine:
      storage: "s3://quarantine/payments/"
      retention: "P30D"
      access: ["team-payments","data-governance"]
      auto_reprocess:
        cron: "*/15 * * * *"
        max_attempts: 3
12.3 Expectations for the Feature Store
YAML:

    featureset: fs_payments_online_v1
    checks:
      - name: feature_freshness
        check: "now() - max(feature_ts) <= interval '60 seconds'"
        severity: "error"
      - name: range_amount_avg
        check: "amount_avg in [0, 2000]"
        severity: "warn"
      - name: enum_device
        check: "device in ['ios','android','web']"
        severity: "error"
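Besides range and enum checks, a Feature Store also needs the offline↔online consistency check mentioned among the control levels. A minimal Python sketch, assuming both stores can be sampled into dictionaries keyed by `(entity_id, feature_name)`; the function name `feature_mismatches` and the tolerance are hypothetical.

```python
import math

def feature_mismatches(offline, online, rel_tol=1e-6):
    """Return keys whose online value is missing or differs from the offline value."""
    bad = []
    for key, offline_value in offline.items():
        online_value = online.get(key)
        if online_value is None or not math.isclose(
            offline_value, online_value, rel_tol=rel_tol
        ):
            bad.append(key)
    return sorted(bad)

offline = {
    ("u1", "amount_avg"): 120.5,
    ("u2", "amount_avg"): 80.0,
    ("u3", "amount_avg"): 10.0,
}
online = {
    ("u1", "amount_avg"): 120.5,
    ("u2", "amount_avg"): 95.0,   # diverged value
    # u3 is missing from the online store
}
mismatches = feature_mismatches(offline, online)
```

The share of mismatched keys in a sample can itself be tracked as a metric with its own threshold.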
13) The specifics of iGaming: ready-made cases
Payments/PSP: reconciliation of deposits/withdrawals against PSP reports; missing statuses → batch quarantine; alert on a rising `decline_rate`.
Game providers: a drop in `rounds_per_min` vs baseline + schema drift from the provider → block transformations for provider A, show a status banner.
RG/AML: mandatory fields (limits, self-exclusion, KYC statuses); overdue KYC → payment-block flag, ticket to compliance.
Marketing/CRM: validity of campaign parameters and UTM tags, event dedup; k-anonymity in data marts.
14) Implementation Roadmap
0-30 days (MVP)
1. Introduce contracts for key datasets: payments, game_rounds, users, features.
2. Catalog of expectations (10-15 basic) + quarantine + alerts.
3. Dashboard Freshness/Completeness/Uniqueness; incident report.
4. Runbooks for `Freshness`, `Duplicates`, `Schema drift`.
30-90 days
1. Cross-table reconciliations and balances; waiver process and semver for rules.
2. Stream validation (late data, dedup, watermarks); PII detectors.
3. Integration with CI/CD: contract tests for sources and transformations.
4. Quality SLOs in the OKRs of domain teams.
3-6 months
1. AIOps-based threshold suggestions; automatic root-cause localization.
2. Cross-brand/geo quality policies and compliance reports.
3. Post-mortems of P1 incidents → new golden sets and rules.
4. Integration with stream alerting and anomaly analysis (a single loop).
15) RACI
Data Governance (A/R): standards, contracts, rule auditing.
Domain Owners (R): domain expectations and invariants.
Data Platform (R): expectations framework, quarantine, alerts, monitoring.
Security/DPO (A/R): privacy/PII/k-anonymity, geo/tenant-isolation.
SRE/Observability (C): incident routing, SLO/SLI.
Product/Finance (C): business balances, incident priorities.
16) Anti-patterns
Validation "only in DWH": late, expensive, painful.
No quarantine: dirty data flows into Gold/ML and erodes trust.
Hard thresholds that ignore seasonality/hours/markets → alert storm.
No owner and no semver for rules → chaos of exceptions.
Logs with PII and "screenshots to the common channel."
One-off "cleanup days" instead of a continuous process.
17) Related Sections
DataOps Practices, Data Auditing and Versioning, Data Origin and Path, Data Stream Alerts, Anomaly and Correlation Analysis, Access Control, Data Security and Encryption, Data Retention Policies, MLOps: Model Exploitation.
Summary
Validation is not a filter at the end but an end-to-end quality contract: from ingestion and streaming to data marts and online features. Clear expectations, quarantines, alerts and SLOs turn data into a reliable asset: reports are correct, models are stable, payments are secure, and compliance stays calm.