Data quality control

1) Purpose and principles

Why: reliable reports (GGR/taxes), anti-fraud and RG models, compliance uploads, products and personalization.

Principles:

Schema-first & Contracts: All sources are required to publish contract data.
DQ-as-code: rules in the repository, versions, tests and reviews.
Observation-by-default: metrics/logging/lineage.
Privacy-by-design: PII minimum, masking and RLS/CLS.
Cost-aware: prioritization of critical rules, smart samples.

2) Taxonomy of quality measurements

Completeness - Percentage of required fields/rows.
Validity-Matches types/ranges/reference books.
Uniqueness: no duplicate keys/events.

Consistency: referential integrity, business invariants

Accuracy-Approaches the "true" source (summary reconciliations).
Timeline/Freshness - Material delay.
Lineage Integrity: preserving the origin/versions of transformations.

Quality and criticality KPIs (critical/major/minor) are defined for each domain.

3) Contracts and schemes (source of truth)

Data contracts: JSON Schema/Avro/OpenAPI/AsyncAPI, hosted by Registry.
Stability: backward-compatible changes - adding nullable; breaking - new version + double entry.
Traceability: in events - 'event _ id', 'trace _ id', 'schema _ version', 'source'.

4) DQ-as-code: artifact structure

Store the rules in Git along with the pipelines:


/dq/
rules/
silver. payments. yaml gold. ggr_daily. yaml checks/
sql/
python/
policies/
severities. yaml notifications/
routes. yaml

Rules: declarative YAML/SQL;

Severity: mapping → alert channels/escalation levels;

CI: circuit linters, compatibility tests, dry-run/simulator.

5) Example Rules (YAML)

yaml table: silver. payments owner: data-payments slo:
freshness_minutes: 15 completeness_percent: 99. 5 rules:
- name: amount_positive severity: critical type: range column: amount min: 0. 01
- name: currency_in_whitelist severity: major type: in_set column: currency set: [EUR, USD, GBP, TRY, BRL]
- name: unique_tx severity: critical type: unique columns: [transaction_id]
- name: fk_user_exists severity: critical type: foreign_key column: user_pseudo_id ref_table: dim. users ref_column: user_pseudo_id
- name: ts_monotonicity severity: minor type: temporal expression: "ts between date_sub(now(), interval 90 day) and now()"

6) SQL tests (samples)

Uniqueness of keys

sql
SELECT transaction_id, COUNT() AS c
FROM silver. payments
GROUP BY transaction_id
HAVING COUNT() > 1;

Required Field Completeness

sql
SELECT COUNT() AS nulls
FROM silver. payments
WHERE amount IS NULL OR currency IS NULL OR ts IS NULL;

References/Consistency

sql
SELECT p. currency
FROM silver. payments p
LEFT JOIN ref. currencies r ON p. currency = r. code
WHERE r. code IS NULL;

7) Streaming DQ (real-time)

Ingest-validation: schema validation, size-limits, types and enum's.
On-stream checks: dedup '(event_id, source)', allowed lateness, currency/amount validity.
Boundaries: critical errors → DLQ + alert; not critical → tag, but skip (with the 'dq _ flag' flag).
Metrics: completeness/lag/dup-rate by party.

8) Handling errors and exceptions

DLQ/Quarantine: Sick records are held, available for correction.
Exception records: exclusion card (owner, date, reason, area).
Auto-fallback: Use the last correct snapshot of the display case.
Closing SLA: critical - ≤ 24-48 hours; major - ≤ 5 employees days.

9) Coordination with privacy and compliance

PII minimization: do not check "raw" PII in analytical layers; Use aliases.
RLS/CLS-Checks are performed based on field masking.
Regionalisation: Rules take into account 'jurisdiction' (EEA/UK/BR).
Legal Hold: No rewriting of archives as part of hold.

10) Observability, SLI/SLO and alerts

Recommended SLI/SLOs:

Freshness p95 (Silver): ≤ 15 min
Completeness (critical types): ≥ 99. 5%.
Validity (schema): ≥ 99. 9%.
Duplicate rate: ≤ 0. 1%.
DQ incident MTTR: ≤ 24–48 ч.

Alerts: pager for critical, anti-aliasing, maintenance windows.

11) Dashboards (minimum set)

Freshness/Completeness heat map by domain and market.
Top N tables by incident rate and by cost of corrections.
DQ funnel: ingest → silver → gold (loss/correction).
Linedge map for critical reports (regulator/GGR/RG/AML).
Map of "legacy" schemas and clients (SDK/schema versions).

12) Processes and RACI

R (Responsible): Data Engineering (rules on tables), Domain Owners (semantics).
A (Accountable): Head of Data/CDO.
C (Consulted): Compliance/Legal/DPO, Architecture, SRE.
I (Informed): BI/Продукт/Маркетинг/Финансы/Операции.

Rule lifecycle: offer → review → "dark run" → inclusion → monitoring → retrospective.

13) Reconciliation and Accuracy

Checksums/transactions: vault with OLTP/providers (PSP/KYC).
Two-loop comparisons: independent pipeline for selective validation.
Tolerances are percentage thresholds by metrics (e.g. GGR variance ≤ 0. 2%).
Daily acts: audit reconciliation reports.

14) Cost and prioritization

Run critical rules more often (streaming/hourly), minor - daily.
Use fetches and materialized checks for heavy tables.
Track cost/query and cost/GB, apply clustering/indexing.
Allocate a budget for DQ in the context of teams (chargeback).

15) Templates for Gold storefronts (GGR Daily example)

yaml table: gold. ggr_daily owner: fin-analytics slo:
ready_by_local_time: "06:00"
rules:
- name: ggr_not_negative severity: critical type: range column: ggr min: 0. 0
- name: market_known severity: major type: in_set column: market set_ref: ref. markets
- name: fx_source_present severity: major type: not_null column: fx_source
- name: completeness_by_market severity: critical type: completeness partition_keys: [event_date, market]
expected_rows_expression: "ref. expected_activity(event_date, market)"

16) Quality Incidents: Management and Communication

Ticketing: auto-creation of tasks with attached selections and metrics.
Comm templates: notifying product owners/regulators when affected.

Post-mortem: root cause (schema drift, upstream bug, load), CAPA actions, control of "regression return."

17) Implementation Roadmap

MVP (2-4 weeks):

1. Catalog of critical tables (Payments, Gameplay, GGR, Compliance).

2. YAML rules for 10-15 key checks + CI validation.

3. Freshness/Completeness dashboard and alerts for critical.

4. DLQ/Quarantine + runbook fixes.

Phase 2 (4-8 weeks):

Rule extension (FK/accuracy), dry-run simulator, A/B inclusions.
Lineage integration, exception arrangements and SLAs.
Streaming DQ on ingest for "noisy" sources.

Phase 3 (8-12 weeks):

Autogeneration of documentation by rules, cost metrics.
"Control contours" (independent reconciliation), weekly retrospective.
Rule-as-Code platform SDKs, a registry of standard domain checks.

18) Pre-sale checklist

Contracts and schemas in Registry, compatibility tests pass.
YAML rules frozen, severity/escalation assigned.
Dashboards and alerts are active; SLOs are defined and agreed.
DLQ/Quarantine is available, runbooks are documented.
Exception/reconciliation procedures agreed with Legal/Compliance.
Measurement of the cost of inspections and limits on heavy requests.

19) Frequent mistakes and how to avoid them

Raw data without contracts: enter schema-first and consumer-tests.
"Manual" checks: translate into DQ-as-code and CI.
No prioritization: separate critical/major/minor and alert channels.
There is no DLQ: there is nothing to work with errors - add quarantine.
Ignore cost: Profile queries, use materialization.
No post-mortems: errors are repeated - enter CAPA and regression control.

20) The bottom line

The data quality control system is not a set of scattered checks, but a managed program: contracts and schemes, DQ-as-code, observability and SLO, incident and reconciliation discipline. By following this article, you will receive reproducible, verifiable and cost-effective data sufficient for regulatory reporting, product solutions and real-time risk detectors.

Data quality control

Required Field Completeness

References/Consistency

Get in Touch

Quick Contact

The video will be updated soon

We are currently very busy with projects