Data quality control
1) Purpose and principles
Why: reliable reports (GGR/taxes), anti-fraud and RG models, compliance uploads, products and personalization.
Principles:- Schema-first & Contracts: All sources are required to publish contract data.
- DQ-as-code: rules in the repository, versions, tests and reviews.
- Observation-by-default: metrics/logging/lineage.
- Privacy-by-design: PII minimum, masking and RLS/CLS.
- Cost-aware: prioritization of critical rules, smart samples.
2) Taxonomy of quality measurements
Completeness - Percentage of required fields/rows.
Validity-Matches types/ranges/reference books.
Uniqueness: no duplicate keys/events.
Consistency: referential integrity, business invariants
Accuracy-Approaches the "true" source (summary reconciliations).
Timeline/Freshness - Material delay.
Lineage Integrity: preserving the origin/versions of transformations.
Quality and criticality KPIs (critical/major/minor) are defined for each domain.
3) Contracts and schemes (source of truth)
Data contracts: JSON Schema/Avro/OpenAPI/AsyncAPI, hosted by Registry.
Stability: backward-compatible changes - adding nullable; breaking - new version + double entry.
Traceability: in events - 'event _ id', 'trace _ id', 'schema _ version', 'source'.
4) DQ-as-code: artifact structure
Store the rules in Git along with the pipelines:
/dq/
rules/
silver. payments. yaml gold. ggr_daily. yaml checks/
sql/
python/
policies/
severities. yaml notifications/
routes. yaml
Rules: declarative YAML/SQL;
Severity: mapping → alert channels/escalation levels;
CI: circuit linters, compatibility tests, dry-run/simulator.
5) Example Rules (YAML)
yaml table: silver. payments owner: data-payments slo:
freshness_minutes: 15 completeness_percent: 99. 5 rules:
- name: amount_positive severity: critical type: range column: amount min: 0. 01
- name: currency_in_whitelist severity: major type: in_set column: currency set: [EUR, USD, GBP, TRY, BRL]
- name: unique_tx severity: critical type: unique columns: [transaction_id]
- name: fk_user_exists severity: critical type: foreign_key column: user_pseudo_id ref_table: dim. users ref_column: user_pseudo_id
- name: ts_monotonicity severity: minor type: temporal expression: "ts between date_sub(now(), interval 90 day) and now()"
6) SQL tests (samples)
Uniqueness of keys
sql
SELECT transaction_id, COUNT() AS c
FROM silver. payments
GROUP BY transaction_id
HAVING COUNT() > 1;
Required Field Completeness
sql
SELECT COUNT() AS nulls
FROM silver. payments
WHERE amount IS NULL OR currency IS NULL OR ts IS NULL;
References/Consistency
sql
SELECT p. currency
FROM silver. payments p
LEFT JOIN ref. currencies r ON p. currency = r. code
WHERE r. code IS NULL;
7) Streaming DQ (real-time)
Ingest-validation: schema validation, size-limits, types and enum's.
On-stream checks: dedup '(event_id, source)', allowed lateness, currency/amount validity.
Boundaries: critical errors → DLQ + alert; not critical → tag, but skip (with the 'dq _ flag' flag).
Metrics: completeness/lag/dup-rate by party.
8) Handling errors and exceptions
DLQ/Quarantine: Sick records are held, available for correction.
Exception records: exclusion card (owner, date, reason, area).
Auto-fallback: Use the last correct snapshot of the display case.
Closing SLA: critical - ≤ 24-48 hours; major - ≤ 5 employees days.
9) Coordination with privacy and compliance
PII minimization: do not check "raw" PII in analytical layers; Use aliases.
RLS/CLS-Checks are performed based on field masking.
Regionalisation: Rules take into account 'jurisdiction' (EEA/UK/BR).
Legal Hold: No rewriting of archives as part of hold.
10) Observability, SLI/SLO and alerts
Recommended SLI/SLOs:- Freshness p95 (Silver): ≤ 15 min
- Completeness (critical types): ≥ 99. 5%.
- Validity (schema): ≥ 99. 9%.
- Duplicate rate: ≤ 0. 1%.
- DQ incident MTTR: ≤ 24–48 ч.
Alerts: pager for critical, anti-aliasing, maintenance windows.
11) Dashboards (minimum set)
Freshness/Completeness heat map by domain and market.
Top N tables by incident rate and by cost of corrections.
DQ funnel: ingest → silver → gold (loss/correction).
Linedge map for critical reports (regulator/GGR/RG/AML).
Map of "legacy" schemas and clients (SDK/schema versions).
12) Processes and RACI
R (Responsible): Data Engineering (rules on tables), Domain Owners (semantics).
A (Accountable): Head of Data/CDO.
C (Consulted): Compliance/Legal/DPO, Architecture, SRE.
I (Informed): BI/Продукт/Маркетинг/Финансы/Операции.
Rule lifecycle: offer → review → "dark run" → inclusion → monitoring → retrospective.
13) Reconciliation and Accuracy
Checksums/transactions: vault with OLTP/providers (PSP/KYC).
Two-loop comparisons: independent pipeline for selective validation.
Tolerances are percentage thresholds by metrics (e.g. GGR variance ≤ 0. 2%).
Daily acts: audit reconciliation reports.
14) Cost and prioritization
Run critical rules more often (streaming/hourly), minor - daily.
Use fetches and materialized checks for heavy tables.
Track cost/query and cost/GB, apply clustering/indexing.
Allocate a budget for DQ in the context of teams (chargeback).
15) Templates for Gold storefronts (GGR Daily example)
yaml table: gold. ggr_daily owner: fin-analytics slo:
ready_by_local_time: "06:00"
rules:
- name: ggr_not_negative severity: critical type: range column: ggr min: 0. 0
- name: market_known severity: major type: in_set column: market set_ref: ref. markets
- name: fx_source_present severity: major type: not_null column: fx_source
- name: completeness_by_market severity: critical type: completeness partition_keys: [event_date, market]
expected_rows_expression: "ref. expected_activity(event_date, market)"
16) Quality Incidents: Management and Communication
Ticketing: auto-creation of tasks with attached selections and metrics.
Comm templates: notifying product owners/regulators when affected.
Post-mortem: root cause (schema drift, upstream bug, load), CAPA actions, control of "regression return."
17) Implementation Roadmap
MVP (2-4 weeks):1. Catalog of critical tables (Payments, Gameplay, GGR, Compliance).
2. YAML rules for 10-15 key checks + CI validation.
3. Freshness/Completeness dashboard and alerts for critical.
4. DLQ/Quarantine + runbook fixes.
Phase 2 (4-8 weeks):- Rule extension (FK/accuracy), dry-run simulator, A/B inclusions.
- Lineage integration, exception arrangements and SLAs.
- Streaming DQ on ingest for "noisy" sources.
- Autogeneration of documentation by rules, cost metrics.
- "Control contours" (independent reconciliation), weekly retrospective.
- Rule-as-Code platform SDKs, a registry of standard domain checks.
18) Pre-sale checklist
- Contracts and schemas in Registry, compatibility tests pass.
- YAML rules frozen, severity/escalation assigned.
- Dashboards and alerts are active; SLOs are defined and agreed.
- DLQ/Quarantine is available, runbooks are documented.
- Exception/reconciliation procedures agreed with Legal/Compliance.
- Measurement of the cost of inspections and limits on heavy requests.
19) Frequent mistakes and how to avoid them
Raw data without contracts: enter schema-first and consumer-tests.
"Manual" checks: translate into DQ-as-code and CI.
No prioritization: separate critical/major/minor and alert channels.
There is no DLQ: there is nothing to work with errors - add quarantine.
Ignore cost: Profile queries, use materialization.
No post-mortems: errors are repeated - enter CAPA and regression control.
20) The bottom line
The data quality control system is not a set of scattered checks, but a managed program: contracts and schemes, DQ-as-code, observability and SLO, incident and reconciliation discipline. By following this article, you will receive reproducible, verifiable and cost-effective data sufficient for regulatory reporting, product solutions and real-time risk detectors.