Data integrity
1) What is data integrity
Data integrity is the set of properties and controls that keep data correct, consistent, and complete throughout its lifecycle, from sources and transformations to data marts, APIs, and exports. The goal is that the same query returns the same answer when repeated, and that any changes are traceable and verifiable.
2) Types of integrity and where they live
Entity integrity: unique primary keys, no duplicates.
Referential integrity: valid FK links, no orphaned ("hanging") references.
Domain integrity: valid ranges and formats (type, length, reference lists); see the DDL sketch after this list.
Business rules: subject-area invariants (balance ≥ 0, transaction amount ≥ 0, etc.).
Temporal: monotonic and consistent timestamps, correct time zones.
Access policies: RLS/CLS must not violate the logical consistency of visible data.
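A minimal DDL sketch (ANSI-style SQL; table and column names follow the payments example used later and are illustrative) showing how the first three types map onto declarative constraints:

```sql
-- Entity integrity: the primary key forbids duplicate txn_id values.
-- Referential integrity: the foreign key forbids "hanging" user_id references.
-- Domain integrity: CHECK constraints enforce ranges and allowed values.
create table users (
  user_id text primary key
);

create table payments (
  txn_id     text primary key,
  user_id    text not null references users (user_id),
  amount     decimal(18,2) check (amount >= 0),
  currency   text check (currency in ('USD', 'EUR', 'TRY', 'UAH')),
  event_time timestamp not null
);
```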
3) Data contracts and schemas (source of truth)
We define formal contracts for datasets and events and enforce them at ingestion and after each transformation.
Example (YAML, simplified):

```yaml
dataset: payments
primary_key: txn_id
foreign_keys:
  - fk: user_id -> users.user_id
schema:
  - {name: txn_id, type: string, unique: true}
  - {name: user_id, type: string, not_null: true}
  - {name: amount, type: decimal(18,2), min: 0}
  - {name: currency, type: string, in: [USD, EUR, TRY, UAH]}
  - {name: event_time, type: timestamp, tz: UTC}
dq_rules:
  - "duplicates(txn_id) = 0"
  - "ref_integrity(user_id, users.user_id) = true"
  - "sum(amount) >= 0"
evolution:
  semver: [MAJOR, MINOR, PATCH]
  breaking_changes_require: approval:data-governance
```
4) Transactional guarantees and isolation
ACID for OLTP: atomicity, consistency, isolation, durability.
Isolation levels: Read Committed / Repeatable Read / Serializable; choose based on the risk of dirty, non-repeatable, and phantom reads (see the transaction sketch after this list).
OLAP and lakehouse: atomic table commits (transaction log), idempotent sinks, and schema evolution with compatibility checks.
Consistency of KPI formulas: a semantic layer gives one source of truth for reports and APIs.
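A minimal sketch (PostgreSQL syntax; the accounts table is illustrative) of an atomic transfer under SERIALIZABLE isolation:

```sql
-- atomic transfer: either both updates commit or neither does
begin isolation level serializable;
update accounts set balance = balance - 100 where account_id = 'A';
update accounts set balance = balance + 100 where account_id = 'B';
-- a check (balance >= 0) constraint turns an overdraft into a rollback instead of bad data
commit;
```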
5) Distributed systems: ordering, redelivery, idempotency
Event ordering: use 'event_time' + 'ingested_at', watermarks, and lateness tolerance; build aggregates on event time.
Redelivery (at-least-once): a global 'event_id', idempotency-key tables, upsert/merge by a stable key (see the MERGE sketch after this list).
Out-of-order events: window recalculation, delay strategies, compensation.
Effectively exactly-once: the transport can be at-least-once as long as the receiver is idempotent.
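A minimal idempotent-sink sketch (ANSI MERGE; staging_payments and the column list are illustrative): re-delivered events with the same event_id update in place instead of creating duplicates.

```sql
merge into fact_payments t
using staging_payments s
  on t.event_id = s.event_id
when matched then
  update set amount = s.amount, currency = s.currency, event_time = s.event_time
when not matched then
  insert (event_id, user_id, amount, currency, event_time)
  values (s.event_id, s.user_id, s.amount, s.currency, s.event_time);
```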
6) Integrity validation (DQ) at each layer
We include integrity rules in CI/CD and in pipeline runtime:
- Freshness / Completeness / Uniqueness / Valid Values / Referential Integrity.
- Anomalies: bursts of duplicates, time gaps, sharp shifts in distributions.
- Control of KPI formulas: versioning of calculations and tests for matching results (golden sets).
- Export control: do not publish datasets with violations (quarantine).
```yaml
expect_column_values_to_be_unique: {column: txn_id}
expect_column_values_to_not_be_null: {column: user_id}
expect_column_values_to_be_in_set: {column: currency, value_set: [USD, EUR, TRY, UAH]}
```
7) Financial and operational integrity
Double-entry: debits and credits must balance; summary reconciliations at cut-off.
Sum invariants: payout amount = write-off amount + fees + adjustments (see the reconciliation sketch after this list).
Operational invariants: SLA/guardrail metrics do not break business rules (for example, auto-repair does not create duplicates).
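A reconciliation sketch for the sum invariant above (illustrative settlements table and columns): any returned row is a violated invariant.

```sql
-- expect 0 rows: payout must equal write-off + fees + adjustments per settlement day
select settlement_date,
       sum(payout_amount)                                    as payout_total,
       sum(writeoff_amount + fee_amount + adjustment_amount) as expected_total
from settlements
group by settlement_date
having sum(payout_amount) <> sum(writeoff_amount + fee_amount + adjustment_amount);
```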
8) Lineage, auditing and reproducibility
Lineage: from source to data mart/feature; visibility of transformations and owners.
Audit trails: who changed what, when and why; schema/formula/job versions.
Snapshots/checkpoints: the ability to recalculate and confirm past reports.
Reproducibility: the same query on the same slice → the same result (versions and layers).
9) Security and privacy without loss of integrity
RLS/CLS: row/column filters must not violate invariants (for example, the sum over the visible subset should match the declared total).
Masking/tokenization: deterministic strategies so that deduplication and referential integrity are preserved (see the sketch after this list).
Encryption: in transit and at rest (applied after compression); key management and access auditing.
DSAR/Retention: deletion/anonymization must not break referential links (cascading policies).
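A deterministic pseudonymization sketch (PostgreSQL syntax; the pepper value and names are illustrative): the same input always produces the same token, so joins and deduplication keep working across tables.

```sql
-- keyed hash as a stable token; keep the pepper in a secrets manager, not in SQL
select user_id,
       encode(sha256(convert_to('pepper-2025-01' || user_id, 'UTF8')), 'hex') as user_token
from users;
```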
10) Self-healing and automatic repair
Quarantine: isolate suspicious batches; consumers read from a "clean" branch.
Replay/Backfill: re-process a window from an immutable raw log.
Reconcile: cross-layer and cross-system reconciliations (raw↔curated↔marts; source↔DWH).
Dedup/Compaction/Rebuild: system procedures for repairing indexes/aggregates (a dedup sketch follows this list).
Policy-as-code: "what anomaly → what action → thresholds → escalation."
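A dedup-repair sketch (ANSI window functions; names are illustrative): keep one row per txn_id, the earliest by ingestion time, and rebuild the table from that selection.

```sql
create table fact_payments_dedup as
select txn_id, user_id, amount, currency, event_time, ingested_at
from (
  select p.*,
         row_number() over (partition by txn_id order by ingested_at) as rn
  from fact_payments p
) ranked
where rn = 1;
-- then swap atomically (rename/exchange), depending on the engine
```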
11) Modeling and storage practices
Stable keys: surrogate PKs (UUID/ULID), immutable natural keys in reference tables.
Normalization↔denormalization: FK relationships in sources, denormalized marts with versioned logic.
SCD1/SCD2: managed history for dimensions.
Sorting/clustering: improves RLE/zone maps and simplifies reconciliations.
Hashes and checksums: verifying the integrity of files/batches (a fingerprint sketch follows this list).
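A batch-fingerprint sketch (PostgreSQL syntax; names are illustrative): cheap per-partition counts, sums, and an ID hash that can be compared between raw and curated layers.

```sql
select load_date,
       count(*)                                     as row_count,
       sum(amount)                                  as amount_total,
       md5(string_agg(txn_id, ',' order by txn_id)) as id_fingerprint
from fact_payments
group by load_date;
```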
12) Integrity over time and in reporting
Formula versions: the January 2025 report should be reproducible with the version-X formula it was published with.
Cut-off and period closure: freezing marts and archived slices.
Late-arriving facts: backfill and recalculation mechanics with a report version mark (see the sketch after this list).
Documenting overrides: manual adjustments only through an audited process.
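A versioned-report sketch (illustrative report_monthly_revenue table): recomputing a closed period after late-arriving facts writes a new report_version instead of silently overwriting the published figure.

```sql
insert into report_monthly_revenue (period, kpi_version, report_version, revenue, computed_at)
select date '2025-01-01', 'kpi_v3', 2, sum(amount), current_timestamp
from fact_payments
where event_time >= timestamp '2025-01-01 00:00:00'
  and event_time <  timestamp '2025-02-01 00:00:00';
```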
13) Integrations and APIs
API contract: schemas, types, required fields, error codes; versioning (v1/v2).
Validation at ingress: reject bad payloads; do not "repair silently."
Idempotent POST: an idempotency key makes retries safe (see the sketch after this list).
Export to files: batch consistency, hashes, signatures.
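An idempotency-key sketch for POST handling (PostgreSQL syntax; names are illustrative): the first insert wins, and a retried request with the same key returns the stored response instead of repeating the side effect.

```sql
create table idempotency_keys (
  idempotency_key text primary key,
  response_body   jsonb,
  created_at      timestamptz default now()
);

insert into idempotency_keys (idempotency_key, response_body)
values ('key-from-client-header', '{"status": "accepted"}')
on conflict (idempotency_key) do nothing;
-- 0 rows inserted => duplicate request: look up and return the stored response
```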
14) Antipatterns
SELECT * in production queries and views: breaks even with MINOR schema evolution.
FK "on paper" only: no real referential checks enforced.
Silent data corrections without audit and reporting.
Mixing time zones and time formats in one dataset.
Ad-hoc KPI overrides without versions and logs.
Single deduplication key without fallback strategies.
DSAR deletion without cascading link checking.
15) Implementation Roadmap
1. Inventory & criticality: dataset/event map, owners, risks, invariants.
2. Contracts and schemas: formalize types/constraints/FKs; CI compatibility checks.
3. DQ in pipeline: Freshness/Completeness/Uniqueness/RI, quarantine, alerts.
4. Transactional basis: atomic-sink, upsert/merge, SCD history, formula versioning.
5. Lineage and audit: data catalog, traces, change logs, access logs.
6. Repair policies: replay/backfill/dedup/reconcile as code; runbooks and an SLO for data MTTR.
7. Security/privacy: RLS/CLS, masking, encryption, DSAR processes.
8. Reporting: cut-off, freeze-slices, KPI versioning.
16) Pre-release checklist (data mart)
- PK/FK and domain constraints are set and pass tests.
- Schema/formula versioning is enabled; schema-diff green.
- DQ rules (freshness/completeness/uniqueness/ranges/RI) are green.
- Idempotent writes: upsert/merge, idempotency key (for events).
- Time: 'event_time' and 'ingested_at', TZ = UTC; late-data policy.
- Lineage and audit are visible; quarantine and alerts are enabled.
- RLS/CLS/masking does not violate invariants and RI.
- DSAR/Retention tested; cut-off/archive is ready.
17) Mini templates
SQL: referential integrity check
```sql
select count(*) as orphans
from fact_payments f
left join dim_users u on f.user_id = u.user_id
where u.user_id is null;
-- expect orphans = 0
```
quarantine/repair policy (pseudo-YAML)
```yaml
policy: payments_integrity
detect:
  - rule: duplicates(txn_id) > 0
  - rule: ref_integrity(user_id, users.user_id) = false
auto_actions:
  - quarantine_partition: {date: today}
  - trigger_replay: {window: "last_2h"}
escalate_if:
  - condition: violations_persist > 30m
    page: "oncall-data"
```
SCD2 dimension schema
```sql
-- dim_user_status (SCD2)
create table dim_user_status (
  user_id text, status text,
  valid_from timestamp, valid_to timestamp, is_current boolean
);
```
18) The bottom line
Data integrity is not a single check but an end-to-end system of guarantees: formal contracts and constraints, transactional and distributed invariants, validation and automated repair, lineage and audit, privacy and access rights. When these elements work together, data becomes a reliable basis for decisions, and incidents become rare, short, and predictable.