Self-healing data
1) Definition and objectives
Self-healing data is an approach to data engineering in which defects are detected automatically, and corrective actions (repair, re-delivery, rollback, re-consolidation, re-indexing) are performed without human intervention or with minimal intervention (human-in-the-loop for sensitive cases).
Goals: Lower data MTTR, increased trust, resilience to drift and failure, predictable cost of ownership.
2) Typical glitches to be treated for
Schemes and contracts: incompatible changes, missing columns, type conflicts.
Quality/integrity: duplicates, omissions, uniqueness/referential integrity violations.
Time and freshness: injection delays, "holes" in the windows, desynchronization of TZ/locales.
Identifiers and keys: changing the ID generator, collisions, floating natural keys.
Order of events: late events, reordering, re-delivery (at-least-once).
Storages: degradation of batches, broken files/blocks, distortion of sharding.
Rights/security: incorrect masks/encryption, PII leaks in uploads.
3) Pillars of self-healing
1. Data contracts (schemas + rules) with automatic tests.
2. Idempotent pipelines (restart without double effects).
3. Journaling and reproducibility (raw/bronze unchangeable, lineage).
4. Repair mechanisms (replay, backfill, compaction, merge-repair, rebuild).
5. Observability and SLO (freshness, completeness, uniqueness, latency).
6. Decision-making policies (when we auto-fix, when we escalate).
4) Contracts and quality tests
The contract describes: scheme, acceptable ranges, uniqueness, RLS/masking, SLA freshness.
Example (YAML style):yaml dataset: payments schema:
- name: txn_id; type: string; unique: true
- name: user_id; type: string; not_null: true
- name: amount; type: decimal(18,2); min: 0
- name: created_at; type: timestamp; tz: UTC freshness_sla: 15m constraints:
- "count(distinct txn_id) = count()"
- "pct_null(user_id) < 0. 1%"
privacy:
- mask: card_pan -> BIN6LAST4 actions_on_violation:
- auto_quarantine_partition
- backfill_missing_window
- notify_owner_and_open_ticket
Tests are executed at every step: injection → staging → showcase. Violation of the rules activates auto-repair (see below) and/or quarantine.
5) Idempotence and determinism
Upsert/Merge by stable keys (SCD2 for history, SCD1 for slices).
Deterministic transformations: one input → one output with the same parameters.
Versioning - Fix the code/schema/layer version and the data label (watermark).
Idempotent sink: recording via staging + atomic swap/rename.
Exactly-once in meaning: acceptable "at-least-once" transport + idempotent receiver.
6) Repair toolkit
Replay/Backfill: redelivery for window't ∈ [T0, T1] 'from unalterable log (raw).
Reconciliation: comparison of aggregates/keys between layers (raw ↔ curated ↔ marts) and between systems (source ↔ DWH).
Deduplication: window-dedup by key (txn_id, event_id) + distance heuristic (fuzzy for dirty keys).
Compaction: transferring small files to large parties (Parquet/ORC), re-indexing.
Merge-repair: when records conflict, priority predicates (by source/time/version).
Rebuild indexes/materializations: recalculating aggregates/cube/roll-up.
Quarantine/Shadow: Suspicious parties isolate themselves; consumers read a "clean" thread.
Schema mediation: automatic projection selector (filling defaults, computable columns) for minor changes.
7) Storage protection and integrity
Check amounts and block validation (CRC, parity).
Quorum storage (RAFT/Paxos-compatible systems, quorum reads/writes).
Erasure coding for cost-effective redundancy.
Object store versioning (undelete).
Atomic commit в Lakehouse (transaction log, ACID-таблицы: Delta/Iceberg/Hudi).
8) Order of events and "dirty reality"
Late events: keep lateness-window, use watermark 'and; recalculating windows.
Redelivery: dedup by global 'event _ id', idempotency-keys tables.
Offset time: normalizing TZ, storing'ingested _ at'and'event _ time'.
Out-of-order: event_time-based aggregates with watermark adjustment.
9) Decision logic (policy engine)
Rule: "what anomaly → what action → what thresholds → who is the owner."
Example (pseudo):yaml policy: payments_freshness detect: freshness_delay > 15m auto_actions:
- trigger: backfill(last_60m)
- if: gap_persisted > 30m then: quarantine_partition(date=today, hour=current_hour)
escalate:
- if: gap_persisted > 60m -> page_oncall:data guardrails:
- do_not_expose_unverified_to_marts
10) Observability and SLO for data
SLO set:- Freshness of display cases ≤ 15 min.
- Completeness> 99. 5% on key fields.
- Uniqueness: duplicates <0. 01%.
- Calculation latency: p95 <5 min.
- Repair stability: MTTR-data <30 min.
Metrics and alerts: exhibit in Prometheus/Grafana; Build a priority tape of data incidents.
11) Reconciliation (practices)
Check aggregates: 'count/sum/min/max' between layers/systems on sliding window.
Key reconciliation: symmetric difference of sets' Δ = (A\B) ∪ (B\A) '.
Periodic "audit job": comparison with the source, selective check at the source.
Payments/finance: double-entry, daily cut-off reconciliations, adjustment log.
12) Circuit management and evolution
SemVer for schemes: MAJOR (breaks )/MINOR (adds )/PATCH (fixes).
Contracts in CI/CD: schema-diff, compatibility, autogeneration of migrations.
Backfill hook: with MINOR add defaults/calculated fields, recalculate showcases.
Flexible projections: Readers read subsets of columns; prohibit "SELECT."
13) Security, privacy, compliance
RLS/CLS: row/column filters, especially in repair branches and exports.
PII-based tokenization for sustainable deduplication.
Access/export audit: who saw what they exported to, where they sent it.
DSAR/Retention: auto-deletion/anonymization in repair processes; kickbacks take into account legal requirements.
14) Cost and performance
Cost-aware backfill: limiting the width of windows (for example, sliding 3-7 days).
Materializations and caches: recalculation of only changed batches (incremental).
Prioritization: first critical showcases (finance, risks), then analytical.
Off-peak repairs: night windows/low priority in the scheduler.
15) Testing and incident simulations
Chaos-data-testing: Deliberately break partitions/circuits on the stage.
Fake delays: Fake batches, out-of-order, duplicates.
Golden datasets: benchmarks for post-repair reconciliation.
GameDays: regular team training on runbooks.
16) Antipatterns
"Invisible" fixes: silent edits without auditing or reporting.
Untested backfills: no truth source/formula version.
Heavy live requests to OLTP during repairs: you finish the prod.
SELECT in consumers: breaks with any MINOR change.
The only key for deduplication is the absence of fallback keys/hash signatures.
17) Implementation Roadmap
1. Discovery: critical sets/metrics, risks, owners; dependency map.
2. Contracts and tests: formalize schemes/rules in CI; publish glossary.
3. Idempotency: rewrite key pipelines in upsert/merge, atomic sink.
4. Raw log and lineage: immutable layer, full metadata, watermark 'and.
5. Repair mechanics: backfill/replay, dedup, compaction, quarantine; policy engine.
6. Observability and SLO: quality dashboards, alerts, priority tape.
7. Chaos-data and training: regular exercises + runbook 'and.
8. Cost optimization: incremental recalculations, window prioritization.
18) Pre-release checklist
- Data contracts and quality tests cover critical sets.
- Pipelines are idempotent; there are atomic commit and pullbacks.
- Backfill/replay and quarantine are configured, escalation policies are spelled out.
- Freshness/Completeness/Uniqueness/Latency metrics and alerts in prod.
- Included audit of edits/repairs; stores versions of formulas and storefronts.
- DSAR/Retention is followed for repairs and rollbacks.
- There is a runbook 'and, conducted exercises, MTTR-target fixed.
- The cost of backfills is limited by budget guards.
19) Examples of auto-actions (templates)
"Window freshness failure X" → backfill (last_2h) → if not ok in 30 minutes → quarantine + on-call page.
"Duplicate txn_id spike" → include strict dedup + source reconciliation → cause report.
"MINOR schema change" → generate a calculated default field → rebuild aggregates.
"Loss of batches" → restore → verification of the check amount from the versioned object.
Bottom line: self-healing data is not one "repair script," but a system architecture: formal contracts, idempotent pipelines, reliable logging, automated repair mechanics and transparent observability with strict SLOs. Such a system not only repairs itself, but also turns incidents into manageable events with an understandable cost and recovery time.