
Data validation

1) Why does the iGaming platform need it?

Trust in reports and KPIs: GGR/NET, conversions, retention, RG signals.
ML/scoring reliability: correct features for anti-fraud/recommendations/RG.
Real-time operations: alerts on drift or event loss before payouts/UX are affected.
Compliance: no PII/secrets where they shouldn't be; provable traceability.

2) Where to validate: control levels

1. Ingestion (batch/stream): schema, types, required fields, idempotency/dedup.
2. Stream processing: windows/watermarks, ordering, gaps/delays, exactly-once.
3. ETL/ELT and transformations: joins, aggregates, business balances.
4. DWH/data marts (Gold): consistency between tables, freshness, uniqueness of keys.
5. Feature Store/online: feature ranges, offline↔online consistency.
6. BI/API: counts and filters, SLAs on latency/freshness, k-anonymity.

3) Types of checks (catalog)

Schema: type/nullable/enum/regex/JSON shape; incompatible changes → stop.
Domain: amounts ≥ 0, currency ∈ {EUR, USD, TRY, BRL}, bet ≤ limit, country ∈ license.
Identity/keys: the primary key is unique, no dangling foreign keys.
Field quality: completeness, length, format (IBAN, BIN, e-mail token).
Statistics/baselines: frequencies, distributions, quantile corridors.
Anomalies: volume/share spikes, zeros/duplicates, schema drift.
Freshness: max(ts) no older than X; lag ingest→gold ≤ T.
Consistency: sum of parts = total; multi-table reconciliation.
Privacy/security: zero PII outside the permitted zones; tokenization/masks.
Regulatory: RG/AML fields are present and plausible.
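
As an illustration of the "quantile corridor" check from the catalog above, here is a minimal Python sketch. It assumes the corridor is taken from the 1st and 99th percentiles of a historical sample; the function names, the percentile choice, and the sample data are hypothetical and not part of any specific framework.

```python
from statistics import quantiles

def quantile_corridor(baseline, low_pct=1, high_pct=99):
    """Return (low, high) bounds taken from percentiles of a baseline sample."""
    cuts = quantiles(baseline, n=100, method="inclusive")  # 99 cut points
    return cuts[low_pct - 1], cuts[high_pct - 1]

def outside_share(values, corridor):
    """Share of values that fall outside the corridor."""
    low, high = corridor
    outside = [v for v in values if v < low or v > high]
    return len(outside) / len(values)

# Hypothetical data: historical amounts vs. today's batch.
baseline = list(range(1, 101))
corridor = quantile_corridor(baseline)
today = [10, 20, 50, 500, 1000]
share = outside_share(today, corridor)
```

A pipeline could then compare `share` with a tolerated fraction and raise a `warn` or `error` accordingly; the tolerated fraction itself is a tuning decision per dataset.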

4) Data Contracts

A contract fixes the schema + quality rules + SLOs between a source and its consumers.

Minimum contract (fragment):
```yaml
dataset: payments_ingest_v2
owner: team-payments
schema:
  id: {type: string, pattern: "^[a-f0-9]{32}$", unique: true}
  ts: {type: timestamp, timezone: "UTC", nullable: false}
  # quoted so the comma inside decimal(18,2) does not split the flow mapping
  amount: {type: "decimal(18,2)", min: 0.00}
  currency: {type: string, enum: ["EUR", "USD", "TRY", "BRL"]}
  psp: {type: string, required: true}
quality:
  freshness_max: "PT5M"
  completeness_min: 0.995
  duplicate_rate_max: 0.001
  pii_allowed: false
slo:
  p95_ingest_latency_ms: 30000
  success_rate: 0.995
```

Contract changes go through semver and migrations: `MAJOR` breaks compatibility, `MINOR` adds a field, `PATCH` corrects a description.
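
To show how the schema part of such a contract might be enforced per record, here is a minimal Python sketch. It hard-codes the rules of the `payments_ingest_v2` fragment rather than parsing the YAML; the function name `contract_violations` and the error messages are assumptions made for illustration.

```python
import re
from datetime import datetime, timezone
from decimal import Decimal

ID_PATTERN = re.compile(r"^[a-f0-9]{32}$")
CURRENCIES = {"EUR", "USD", "TRY", "BRL"}

def contract_violations(record: dict) -> list[str]:
    """Return human-readable violations of the schema rules; empty list if valid."""
    errors = []
    if not ID_PATTERN.fullmatch(str(record.get("id", ""))):
        errors.append("id: must be 32 lowercase hex characters")
    ts = record.get("ts")
    offset = ts.utcoffset() if isinstance(ts, datetime) else None
    if offset is None or offset.total_seconds() != 0:
        errors.append("ts: must be a timezone-aware UTC timestamp")
    amount = record.get("amount")
    if not (isinstance(amount, Decimal) and amount.is_finite() and amount >= 0):
        errors.append("amount: must be a non-negative decimal")
    if record.get("currency") not in CURRENCIES:
        errors.append("currency: not in the allowed enum")
    if not record.get("psp"):
        errors.append("psp: required")
    return errors

good = {
    "id": "a" * 32,
    "ts": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
    "amount": Decimal("10.00"),
    "currency": "EUR",
    "psp": "psp-1",
}
bad = {"id": "XYZ", "ts": datetime(2024, 5, 1), "amount": Decimal("-1"),
       "currency": "GBP", "psp": ""}
```

Dataset-level rules from the `quality` block (freshness, completeness, duplicate rate) need the whole batch and are not covered by this per-record check.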

5) Expectations and policies

Expectations are declarative checks executed in pipelines (batch/stream).

Examples of expectations (YAML):
```yaml
expectations:
  - name: unique_primary_key
    check: "unique(id)"
    severity: "error"
  - name: amount_non_negative
    check: "amount >= 0"
    severity: "error"
  - name: currency_enum
    check: "currency in ['EUR','USD','TRY','BRL']"
    severity: "error"
  - name: ts_fresh_enough
    check: "now() - max(ts) <= interval '5 minutes'"
    severity: "warn"
  - name: pii_absent
    check: "no_plain_pii(columns: ['email','card','iban'])"
    severity: "error"
```

Response policy:
  • `error` → partition/batch quarantine, alert + ticket; downstream is blocked.
  • `warn` → the data passes, but an investigation task is created; the data is marked with a quality flag.
  • `info` → monitoring only.
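
The response policy can be sketched as a small routing function. This is an illustrative Python sketch, not a specific framework's API: `CheckResult`, `plan_response`, and the action names in the returned dictionary are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckResult:
    name: str
    severity: str  # "error", "warn" or "info"
    passed: bool

def plan_response(results):
    """Map failed checks to the actions of the response policy."""
    failed = [r for r in results if not r.passed]
    has_error = any(r.severity == "error" for r in failed)
    warned = [r.name for r in failed if r.severity == "warn"]
    return {
        "quarantine_batch": has_error,       # error → quarantine
        "block_downstream": has_error,       # error → downstream block
        "alert_and_ticket": [r.name for r in failed if r.severity == "error"],
        "investigation_tasks": warned,       # warn → data passes, task created
        "mark_quality_flag": bool(warned),   # warn → quality marking
        "monitor_only": [r.name for r in failed if r.severity == "info"],
    }

results = [
    CheckResult("unique_primary_key", "error", passed=True),
    CheckResult("ts_fresh_enough", "warn", passed=False),
]
plan = plan_response(results)
```

Keeping the decision in one pure function makes the policy easy to unit-test separately from the code that actually moves data or opens tickets.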

6) Streaming: Specifics of checks

Watermarks/late data: allowed lateness `≤ 120s`, otherwise quarantine; compensate via final windows.
Idempotency: event key + payload hash → dedup at the broker/stream level.
Exactly-once: transactional writes (+ idempotent sinks) for critical flows (payments/rounds).
Volume counters: "expected" vs "received" per window; discrepancy → alert.

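Flink rule pattern (pseudo):
```scala
// Pseudocode: DeduplicateWithin and emitToQuarantineIfLate are
// illustrative helpers assumed to be defined elsewhere.
val allowedCurrencies = Set("EUR", "USD", "TRY", "BRL")

// Drop repeated events with the same id seen within 10 minutes.
val deduped = stream
  .keyBy(_.id)
  .process(new DeduplicateWithin(Time.minutes(10)))

// Domain checks: non-negative amount, currency from the allowed enum.
val validated = deduped
  .filter(_.amount >= 0)
  .filter(e => allowedCurrencies.contains(e.currency))

emitToQuarantineIfLate(validated, allowedLateness = 120.seconds)
```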
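
The dedup and lateness rules above can also be sketched without a streaming engine. The following Python sketch assumes an in-memory store and timestamps in seconds; `Deduplicator`, `fingerprint`, and `is_late` are hypothetical names, and a production job would keep this state in the engine's keyed state rather than in a dictionary.

```python
import hashlib
import json

def fingerprint(event_id, payload):
    """Event key + payload hash, stable under key reordering in the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{event_id}|{body}".encode()).hexdigest()

class Deduplicator:
    """Remembers fingerprints for ttl_seconds and reports repeats."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # fingerprint -> first-seen time

    def is_duplicate(self, event_id, payload, now):
        # Evict fingerprints older than the TTL before checking.
        self.seen = {k: t for k, t in self.seen.items() if now - t <= self.ttl}
        key = fingerprint(event_id, payload)
        if key in self.seen:
            return True
        self.seen[key] = now
        return False

def is_late(event_ts, watermark, allowed_lateness=120):
    """True if the event is behind the watermark by more than the allowance."""
    return watermark - event_ts > allowed_lateness
```

Events for which `is_late` returns `True` would go to quarantine instead of the main flow, matching the `≤ 120s` rule.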

7) DWH/SQL: invariants and reconciliations

SQL checks (example):
```sql
-- uniqueness: any row returned is a duplicated key
SELECT id, COUNT(*) AS c
FROM gold.payments
GROUP BY id
HAVING COUNT(*) > 1;

-- freshness: lag of the newest record
SELECT NOW() - MAX(ts) AS lag FROM gold.payments;

-- reconciliation of totals: detail rows vs. the daily summary
SELECT
  SUM(amount) AS by_rows,
  (SELECT total_amount FROM gold.payments_summary WHERE date = CURRENT_DATE) AS by_summary
FROM gold.payments
WHERE date = CURRENT_DATE;
```

Windowed reconciliation: daily `detail → summary` reconciliations, discrepancy reports, automatic tickets.
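
A daily `detail → summary` reconciliation can be sketched in a few lines of Python. This sketch assumes amounts are stored as integers in minor units (cents) to avoid rounding noise; the function name `reconcile`, the report fields, and the sample data are hypothetical.

```python
from collections import defaultdict

def reconcile(detail_rows, summary, tolerance=0):
    """Compare per-day sums of detail rows with summary totals.

    detail_rows: iterable of (date, amount_minor)
    summary: dict of date -> total_minor
    Returns one report entry per day that is missing or off by more than tolerance.
    """
    totals = defaultdict(int)
    for day, amount in detail_rows:
        totals[day] += amount

    report = []
    for day in sorted(set(totals) | set(summary)):
        by_rows = totals.get(day, 0)
        by_summary = summary.get(day)
        diff = None if by_summary is None else by_rows - by_summary
        if diff is None or abs(diff) > tolerance:
            report.append({"date": day, "by_rows": by_rows,
                           "by_summary": by_summary, "diff": diff})
    return report

detail = [("2024-05-01", 1000), ("2024-05-01", 250), ("2024-05-02", 500)]
summary = {"2024-05-01": 1250, "2024-05-02": 400}
discrepancies = reconcile(detail, summary)
```

Each entry in `discrepancies` carries enough context to be attached to an automatic ticket.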

8) Privacy and security

PII redaction by default: masks/tokens at ingestion; raw e-mails/cards/phone numbers are prohibited in logs.
Permission policy: tables with PII live in a separate layer/catalog, with role-based access (RBAC/ABAC).
K-anonymity of reports: minimum N rows in slice.
Leak detectors: regular checks for PII patterns, "secrets" (keys/tokens).
Jurisdictions: geo/tenant-isolation (country/brand/license), separate keys.
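
Two of the controls above, leak detection and k-anonymity, are simple enough to sketch. The Python below is a simplified sketch: the regular expressions are deliberately loose, the Luhn checksum is used only to cut false positives on card-like numbers, and the names `find_pii` and `suppress_small_slices` are hypothetical.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_pii(text: str) -> list[str]:
    """Return the kinds of plain PII found in a log line or field value."""
    hits = []
    if EMAIL.search(text):
        hits.append("email")
    for match in CARD.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_ok(digits):
            hits.append("card")
            break
    return hits

def suppress_small_slices(counts: dict, k: int) -> dict:
    """k-anonymity for reports: drop slices with fewer than k rows."""
    return {key: n for key, n in counts.items() if n >= k}
```

A scheduled job could run `find_pii` over samples of logs and non-PII tables and treat any hit as a `PII_leak` incident.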

9) Quality and SLO metrics

Quality dimensions:
  • Freshness - lag of max(ts).
  • Completeness - share of non-empty/expected records.
  • Uniqueness - duplicate keys.
  • Consistency - invariants and balances (cross-table).
  • Accuracy - reconciliation against an external source or domain rules.
  • Validity - conformance to types/enum/regex.
SLO examples:
  • `Freshness payments_gold ≤ 5 min` (p95).
  • `Completeness game_rounds ≥ 99.7%/day`.
  • `Duplicate_rate ≤ 0.1‰`.
  • `PII_leak = 0`.
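
Three of these dimensions reduce to short formulas. The Python sketch below shows one possible way to compute them; the function names and the convention that an empty input counts as "no violation" are assumptions.

```python
from datetime import datetime, timedelta, timezone

def freshness_lag(timestamps, now):
    """Freshness: how far the newest record lags behind 'now'."""
    return now - max(timestamps)

def completeness(received: int, expected: int) -> float:
    """Completeness: share of expected records that actually arrived."""
    return received / expected if expected else 1.0

def duplicate_rate(keys) -> float:
    """Uniqueness: share of rows that repeat an already seen key."""
    keys = list(keys)
    return 1 - len(set(keys)) / len(keys) if keys else 0.0

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
lag = freshness_lag([now - timedelta(minutes=9), now - timedelta(minutes=3)], now)
freshness_ok = lag <= timedelta(minutes=5)        # SLO: freshness ≤ 5 min
completeness_ok = completeness(997, 1000) >= 0.997
dup_rate = duplicate_rate(["a", "b", "b", "c"])
```

The p95 in the freshness SLO would be computed over many such lag samples collected across a reporting period, which this sketch does not show.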

10) Alerts, tickets and runbook

Routing: Slack/PagerDuty → domain owner; samples and a diff are attached automatically.
Grouping: one incident per label set, e.g. `dataset = payments, brand = TR`.
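
Grouping by label set can be sketched as follows. This Python sketch assumes each alert is a dictionary with a `labels` mapping and a `check` name; that shape and the function name `group_alerts` are hypothetical.

```python
from collections import defaultdict

def group_alerts(alerts, label_keys=("dataset", "brand")):
    """Collapse alerts that share the same label values into one incident."""
    incidents = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k) for k in label_keys)
        incidents[key].append(alert["check"])
    return dict(incidents)

alerts = [
    {"check": "ts_fresh_enough", "labels": {"dataset": "payments", "brand": "TR"}},
    {"check": "unique_primary_key", "labels": {"dataset": "payments", "brand": "TR"}},
    {"check": "currency_enum", "labels": {"dataset": "payments", "brand": "BR"}},
]
incidents = group_alerts(alerts)
```

Here three alerts become two incidents, so the on-call owner for the TR brand receives one notification with both failing checks.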

Runbook (example: "Freshness breach: payments_gold"):

1. Check the ingest log and the broker queue.
2. Compare "expected vs received" by PSP.
3. Enable retries / switch the PSP route.
4. Annotate the cause; rerun backfills; write a post-mortem.

11) Versioning, tests and waiver process

Semver of quality rules: `quality@MAJOR.MINOR.PATCH`.
Unit tests of transformations (SQL/dbt/Python) and contract tests for sources.
Golden sets: known discrepancy/leak cases are mandatory in regression tests.
Waiver: short-term permission to violate a rule (description, owner, expiry date, compensating measures).
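
A waiver is easiest to keep short-term when its expiry is enforced in code. The Python sketch below is one possible shape; the `Waiver` fields follow the list above, while the class name, the function name, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Waiver:
    rule: str
    owner: str
    description: str
    expires: date
    compensating_measures: str

def is_waived(rule: str, waivers, today: date) -> bool:
    """A failing rule is suppressed only while an unexpired waiver covers it."""
    return any(w.rule == rule and today <= w.expires for w in waivers)

waivers = [
    Waiver(
        rule="ts_fresh_enough",
        owner="team-payments",
        description="PSP migration window",
        expires=date(2024, 5, 10),
        compensating_measures="manual hourly reconciliation",
    )
]
```

Once the expiry date passes, the rule starts failing again without anyone having to remember to remove the waiver.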

12) Catalogs/artifacts (ready-made templates)

12.1 Dataset passport

```yaml
dataset: gold.game_rounds
owner: team-games
steward: data-governance
contracts: ["games_rounds_v3"]
quality_slo:
  freshness_p95: "PT10M"
  completeness_min: 0.997
  uniqueness_max_dup: 0.0005
alerts:
  channels: ["#dq-incidents", "#games-ops"]
  severity_map: {error: "P1", warn: "P2"}
```

12.2 Quarantine policy

```yaml
quarantine:
  storage: "s3://quarantine/payments/"
  retention: "P30D"
  access: ["team-payments", "data-governance"]
  auto_reprocess:
    cron: "*/15 * * * *"
    max_attempts: 3
```

12.3 Expectations for the Feature Store

```yaml
featureset: fs_payments_online_v1
checks:
  - name: feature_freshness
    check: "now() - max(feature_ts) <= interval '60 seconds'"
    severity: "error"
  - name: range_amount_avg
    check: "amount_avg in [0, 2000]"
    severity: "warn"
  - name: enum_device
    check: "device in ['ios','android','web']"
    severity: "error"
```

13) The specifics of iGaming: ready-made cases

Payments/PSP: reconciliation of deposits/withdrawals against PSP reports; missing statuses → batch quarantine; alert on `decline_rate` growth.
Game providers: drop in `rounds_per_min` vs baseline + schema drift from the provider → block provider A's transformations, show a status banner.
RG/AML: mandatory fields (limits, self-exclusion, KYC statuses); expired KYC → payment-block flag, ticket to compliance.
Marketing/CRM: validity of campaign parameters and UTM tags, event dedup; k-anonymity in data marts.

14) Implementation Roadmap

0-30 days (MVP)

1. Introduce contracts for key datasets: payments, game_rounds, users, features.
2. Catalog of expectations (10-15 basic ones) + quarantine + alerts.
3. Freshness/Completeness/Uniqueness dashboard; incident report.
4. Runbooks for `Freshness`, `Duplicates`, `Schema drift`.

30-90 days

1. Cross-table reconciliations and balances; waiver process and semver for rules.
2. Stream validation (late data, dedup, watermarks); PII detectors.
3. Integration with CI/CD: contract tests for sources and transformations.
4. Quality SLOs in domain team OKRs.

3-6 months

1. AIOps threshold hints; auto-localization of causes.
2. Cross-brand/geo quality policy and compliance reports.
3. Post-mortems of P1 incidents → extension of golden sets and rules.
4. Integration with stream alerting and anomaly analysis (a single loop).

15) RACI

Data Governance (A/R): standards, contracts, rule auditing.
Domain Owners (R): domain expectations and invariants.
Data Platform (R): expectations framework, quarantine, alerts, monitoring.
Security/DPO (A/R): privacy/PII/k-anonymity, geo/tenant-isolation.
SRE/Observability (C): incident routing, SLO/SLI.
Product/Finance (C): business balances, incident priorities.

16) Anti-patterns

Validation "only in DWH" - late, expensive, painful.
No quarantine - "dirt" reaches Gold/ML and breaks trust.
Hard thresholds without seasonality/hours/markets → alert storm.
No owner and no semver for rules → chaos of exceptions.
Logs with PII and "screenshots in the common channel."
One-off "cleanup days" instead of a continuous quality loop.
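
As a contrast to hard thresholds, a seasonality-aware check can compare each value with a baseline for the same hour. The Python sketch below uses a per-hour mean and standard deviation with a z-score cut-off; this is one simple technique among many, and the function names, the `z=3.0` default, and the sample history are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hourly_baseline(history):
    """history: iterable of (hour_of_day, value) -> {hour: (mean, std)}."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    return {h: (mean(vs), pstdev(vs)) for h, vs in buckets.items()}

def is_anomalous(hour, value, baseline, z=3.0):
    """Flag values further than z standard deviations from that hour's mean."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > z * max(sigma, 1e-9)

# Hypothetical rounds-per-minute history: quiet at 03:00, busy at 20:00.
history = [(3, 100), (3, 110), (3, 90), (20, 1000), (20, 1100), (20, 900)]
baseline = hourly_baseline(history)
```

With this baseline, a value of 120 is normal at 03:00 but anomalous at 20:00, which a single fixed threshold cannot express.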

17) Related Sections

DataOps Practices, Data Auditing and Versioning, Data Origin and Path, Data Stream Alerts, Anomaly and Correlation Analysis, Access Control, Data Security and Encryption, Data Retention Policies, MLOps: Model Exploitation.

Summary

Validation is not a filter at the end, but an end-to-end quality contract: from ingestion and streams to data marts and online features. Clear expectations, quarantines, alerts and SLOs turn data into a reliable asset: reports are correct, models are stable, payments are secure, and compliance has no surprises.
