Data integrity
1) What is data integrity
Data integrity is the set of properties and controls that keep data correct, consistent, and complete throughout its lifecycle, from sources and transformations to data marts, APIs, and exports. The goal is that the same query returns the same answer when repeated, and that any changes are traceable and verifiable.
2) Types of integrity and where they live
Entity integrity: unique primary keys, no duplicates.
Referential integrity: valid FK links, no orphaned ("hanging") references.
Domain integrity: valid ranges and formats (type, length, reference lists); see the DDL sketch after this list.
Business rules: subject-area invariants (balance ≥ 0, transaction amount ≥ 0, etc.).
Temporal: monotonic and consistent timestamps, correct time zones.
Access policies: RLS/CLS must not violate the logical consistency of visible data.
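A minimal DDL sketch (ANSI-style SQL; table and column names follow the payments example used later and are illustrative) showing how the first three types map onto declarative constraints:

```sql
-- Entity integrity: the primary key forbids duplicate txn_id values.
-- Referential integrity: the foreign key forbids "hanging" user_id references.
-- Domain integrity: CHECK constraints enforce ranges and allowed values.
create table users (
  user_id text primary key
);

create table payments (
  txn_id     text primary key,
  user_id    text not null references users (user_id),
  amount     decimal(18,2) check (amount >= 0),
  currency   text check (currency in ('USD', 'EUR', 'TRY', 'UAH')),
  event_time timestamp not null
);
```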
3) Data contracts and schemas (source of truth)
We define formal contracts for datasets and events and enforce them at ingestion and after each transformation.
Example (YAML, simplified):

```yaml
dataset: payments
primary_key: txn_id
foreign_keys:
  - fk: user_id -> users.user_id
schema:
  - {name: txn_id, type: string, unique: true}
  - {name: user_id, type: string, not_null: true}
  - {name: amount, type: decimal(18,2), min: 0}
  - {name: currency, type: string, in: [USD, EUR, TRY, UAH]}
  - {name: event_time, type: timestamp, tz: UTC}
dq_rules:
  - "duplicates(txn_id) = 0"
  - "ref_integrity(user_id, users.user_id) = true"
  - "sum(amount) >= 0"
evolution:
  semver: [MAJOR, MINOR, PATCH]
  breaking_changes_require: approval:data-governance
```
4) Transactional guarantees and isolation
ACID for OLTP: atomicity, consistency, isolation, durability.
Isolation levels: Read Committed / Repeatable Read / Serializable; choose based on the risk of dirty, non-repeatable, and phantom reads (see the transaction sketch after this list).
OLAP and lakehouse: atomic table commits (transaction log), idempotent sinks, and schema evolution with compatibility checks.
Consistency of KPI formulas: a semantic layer gives one source of truth for reports and APIs.
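A minimal sketch (PostgreSQL syntax; the accounts table is illustrative) of an atomic transfer under SERIALIZABLE isolation:

```sql
-- atomic transfer: either both updates commit or neither does
begin isolation level serializable;
update accounts set balance = balance - 100 where account_id = 'A';
update accounts set balance = balance + 100 where account_id = 'B';
-- a check (balance >= 0) constraint turns an overdraft into a rollback instead of bad data
commit;
```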
5) Distributed systems: ordering, redelivery, idempotency
Event ordering: use 'event_time' + 'ingested_at', watermarks, and lateness tolerance; build aggregates on event time.
Redelivery (at-least-once): a global 'event_id', idempotency-key tables, upsert/merge by a stable key (see the MERGE sketch after this list).
Out-of-order events: window recalculation, delay strategies, compensation.
Effectively exactly-once: the transport can be at-least-once as long as the receiver is idempotent.
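A minimal idempotent-sink sketch (ANSI MERGE; staging_payments and the column list are illustrative): re-delivered events with the same event_id update in place instead of creating duplicates.

```sql
merge into fact_payments t
using staging_payments s
  on t.event_id = s.event_id
when matched then
  update set amount = s.amount, currency = s.currency, event_time = s.event_time
when not matched then
  insert (event_id, user_id, amount, currency, event_time)
  values (s.event_id, s.user_id, s.amount, s.currency, s.event_time);
```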
6) Integrity validation (DQ) at each layer
We include integrity rules in CI/CD and in pipeline runtime:
- Freshness / Completeness / Uniqueness / Valid Values / Referential Integrity.
- Anomalies: bursts of duplicates, time gaps, sharp shifts in distributions.
- Control of KPI formulas: versioning of calculations and tests for matching results (golden sets).
- Export control: do not publish datasets with violations (quarantine).
```yaml
expect_column_values_to_be_unique: {column: txn_id}
expect_column_values_to_not_be_null: {column: user_id}
expect_column_values_to_be_in_set: {column: currency, value_set: [USD, EUR, TRY, UAH]}
```
7) Financial and operational integrity
Double-entry: debits and credits must balance; summary reconciliations at cut-off.
Sum invariants: payout amount = write-off amount + fees + adjustments (see the reconciliation sketch after this list).
Operational invariants: SLA/guardrail metrics do not break business rules (for example, auto-repair does not create duplicates).
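A reconciliation sketch for the sum invariant above (illustrative settlements table and columns): any returned row is a violated invariant.

```sql
-- expect 0 rows: payout must equal write-off + fees + adjustments per settlement day
select settlement_date,
       sum(payout_amount)                                    as payout_total,
       sum(writeoff_amount + fee_amount + adjustment_amount) as expected_total
from settlements
group by settlement_date
having sum(payout_amount) <> sum(writeoff_amount + fee_amount + adjustment_amount);
```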
8) Lineage, auditing and reproducibility
Lineage: from source to data mart/feature; visibility of transformations and owners.
Audit trails: who changed what, when and why; schema/formula/job versions.
Snapshots/checkpoints: the ability to recalculate and confirm past reports.
Reproducibility: the same query on the same slice → the same result (versions and layers).
9) Security and privacy without loss of integrity
RLS/CLS: row/column filters must not violate invariants (for example, the sum over the visible subset should match the declared total).
Masking/tokenization: deterministic strategies so that deduplication and referential integrity are preserved (see the sketch after this list).
Encryption: in transit and at rest (applied after compression); key management and access auditing.
DSAR/Retention: deletion/anonymization must not break referential links (cascading policies).
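A deterministic pseudonymization sketch (PostgreSQL syntax; the pepper value and names are illustrative): the same input always produces the same token, so joins and deduplication keep working across tables.

```sql
-- keyed hash as a stable token; keep the pepper in a secrets manager, not in SQL
select user_id,
       encode(sha256(convert_to('pepper-2025-01' || user_id, 'UTF8')), 'hex') as user_token
from users;
```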
10) Self-healing and automatic repair
Quarantine: isolate suspicious batches; consumers read from a "clean" branch.
Replay/Backfill: re-process a window from an immutable raw log.
Reconcile: cross-layer and cross-system reconciliations (raw↔curated↔marts; source↔DWH).
Dedup/Compaction/Rebuild: system procedures for repairing indexes/aggregates (a dedup sketch follows this list).
Policy-as-code: "what anomaly → what action → thresholds → escalation."
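A dedup-repair sketch (ANSI window functions; names are illustrative): keep one row per txn_id, the earliest by ingestion time, and rebuild the table from that selection.

```sql
create table fact_payments_dedup as
select txn_id, user_id, amount, currency, event_time, ingested_at
from (
  select p.*,
         row_number() over (partition by txn_id order by ingested_at) as rn
  from fact_payments p
) ranked
where rn = 1;
-- then swap atomically (rename/exchange), depending on the engine
```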
11) Modeling and storage practices
Stable keys: surrogate PKs (UUID/ULID), immutable natural keys in reference tables.
Normalization↔denormalization: FK relationships in sources, denormalized marts with versioned logic.
SCD1/SCD2: managed history for dimensions.
Sorting/clustering: improves RLE/zone maps and simplifies reconciliations.
Hashes and checksums: verifying the integrity of files/batches (a fingerprint sketch follows this list).
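A batch-fingerprint sketch (PostgreSQL syntax; names are illustrative): cheap per-partition counts, sums, and an ID hash that can be compared between raw and curated layers.

```sql
select load_date,
       count(*)                                     as row_count,
       sum(amount)                                  as amount_total,
       md5(string_agg(txn_id, ',' order by txn_id)) as id_fingerprint
from fact_payments
group by load_date;
```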
12) Integrity over time and in reporting
Formula versions: the January 2025 report should be reproducible with the version-X formula it was published with.
Cut-off and period closure: freezing marts and archived slices.
Late-arriving facts: backfill and recalculation mechanics with a report version mark (see the sketch after this list).
Documenting overrides: manual adjustments only through an audited process.
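A versioned-report sketch (illustrative report_monthly_revenue table): recomputing a closed period after late-arriving facts writes a new report_version instead of silently overwriting the published figure.

```sql
insert into report_monthly_revenue (period, kpi_version, report_version, revenue, computed_at)
select date '2025-01-01', 'kpi_v3', 2, sum(amount), current_timestamp
from fact_payments
where event_time >= timestamp '2025-01-01 00:00:00'
  and event_time <  timestamp '2025-02-01 00:00:00';
```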
13) Integrations and APIs
API contract: schemas, types, required fields, error codes; versioning (v1/v2).
Validation at ingress: reject bad payloads; do not "repair silently."
Idempotent POST: an idempotency key makes retries safe (see the sketch after this list).
Export to files: batch consistency, hashes, signatures.
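An idempotency-key sketch for POST handling (PostgreSQL syntax; names are illustrative): the first insert wins, and a retried request with the same key returns the stored response instead of repeating the side effect.

```sql
create table idempotency_keys (
  idempotency_key text primary key,
  response_body   jsonb,
  created_at      timestamptz default now()
);

insert into idempotency_keys (idempotency_key, response_body)
values ('key-from-client-header', '{"status": "accepted"}')
on conflict (idempotency_key) do nothing;
-- 0 rows inserted => duplicate request: look up and return the stored response
```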
14) Antipatterns
SELECT * in production queries and views: breaks even with MINOR schema evolution.
FK "on paper" only: no real referential checks enforced.
Silent data corrections without audit and reporting.
Mixing time zones and time formats in one dataset.
Ad-hoc KPI overrides without versions and logs.
Single deduplication key without fallback strategies.
DSAR deletion without cascading link checking.
15) Implementation Roadmap
1. Inventory & criticality: dataset/event map, owners, risks, invariants.
2. Contracts and schemas: formalize types/constraints/FKs; CI compatibility checks.
3. DQ in pipeline: Freshness/Completeness/Uniqueness/RI, quarantine, alerts.
4. Transactional basis: atomic-sink, upsert/merge, SCD history, formula versioning.
5. Lineage and audit: data catalog, traces, change logs, access logs.
6. Repair policies: replay/backfill/dedup/reconcile as code; runbooks and an SLO for data MTTR.
7. Security/privacy: RLS/CLS, masking, encryption, DSAR processes.
8. Reporting: cut-off, freeze-slices, KPI versioning.
16) Pre-release checklist (data mart)
- PK/FK and domain constraints are set and pass tests.
- Schema/formula versioning is enabled; schema-diff green.
- DQ rules (freshness/completeness/uniqueness/ranges/RI) are green.
- Idempotent writes: upsert/merge, idempotency key (for events).
- Time: 'event_time' and 'ingested_at', TZ = UTC; late-data policy.
- Lineage and audit are visible; quarantine and alerts are enabled.
- RLS/CLS/masking does not violate invariants and RI.
- DSAR/Retention tested; cut-off/archive is ready.
17) Mini templates
SQL: referential integrity check
```sql
select count(*) as orphans
from fact_payments f
left join dim_users u on f.user_id = u.user_id
where u.user_id is null;
-- expect orphans = 0
```
quarantine/repair policy (pseudo-YAML)
```yaml
policy: payments_integrity
detect:
  - rule: duplicates(txn_id) > 0
  - rule: ref_integrity(user_id, users.user_id) = false
auto_actions:
  - quarantine_partition: {date: today}
  - trigger_replay: {window: "last_2h"}
escalate_if:
  - condition: violations_persist > 30m
    page: "oncall-data"
```
SCD2 dimension schema
```sql
-- dim_user_status (SCD2)
create table dim_user_status (
  user_id text, status text,
  valid_from timestamp, valid_to timestamp, is_current boolean
);
```
18) The bottom line
Data integrity is not a single check but an end-to-end system of guarantees: formal contracts and constraints, transactional and distributed invariants, validation and automated repair, lineage and audit, privacy and access rights. When these elements work together, data becomes a reliable basis for decisions, and incidents become rare, short, and predictable.