Data origin (lineage)
1) What lineage is and why it is needed
Data Lineage is a formal record of "where the data came from, how it was transformed, where and by whom it was used." The result is a directed graph of dependencies with attributes (time, versions, owners, transformations, access policies, quality), which makes the data system understandable and auditable.
Business value:
- Transparency of metrics (finance, product, risk): "why is the number X = 1,234?"
- Quick impact analysis of changes (schema/job): "what will break if..."
- Compliance and auditing (GDPR/ISO/SOC): a provable path for every field.
- Faster onboarding and less toil (self-service knowledge).
- Quality improvement: targeted checks where the risk is higher.
2) Coverage areas and levels of detail
Pipeline level (pipeline/job): which jobs/orchestrators produced which datasets.
Dataset level (table/view/topic/file): inputs → outputs, versions/snapshots.
Column/feature level: how each field is calculated and from which sources.
Consumption layer: BI reports, APIs, ML models, dashboards and alerts.
For critical entities (money, regulation), column-level detailing is required.
3) Lineage Data Model - Key Entities
Dataset: `{urn, type, schema, owners, pii_class, retention, tags}`
Job/Task: `{urn, code_ref, version, runtime, schedule, owners}`
Run/Execution: `{run_id, job_urn, start/end, status, inputs[], outputs[], code_sha, infra}`
Field: `{dataset_urn, name, type, derivation}` (derivation: expression/AST/operator).
Policy: `{dataset_urn/field, access_rules, masking, consent_scope}`
Quality Check: `{check_id, scope, rule, severity, result}`
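A minimal sketch of two of these entities as Python dataclasses, assuming the attribute names listed above. The `edges` helper and the rule "every input feeds every output" are illustrative assumptions, not part of the model itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Dataset:
    urn: str
    type: str                          # e.g. "table", "view", "topic", "file"
    owners: Tuple[str, ...] = ()
    pii_class: Optional[str] = None
    retention: Optional[str] = None
    tags: Tuple[str, ...] = ()


@dataclass
class Run:
    run_id: str
    job_urn: str
    status: str
    inputs: List[str] = field(default_factory=list)    # dataset URNs read
    outputs: List[str] = field(default_factory=list)   # dataset URNs written
    code_sha: Optional[str] = None


def edges(run: Run) -> List[Tuple[str, str, str]]:
    """Dataset-level edges implied by one run (assumption: each input feeds each output)."""
    return [(src, run.job_urn, dst) for src in run.inputs for dst in run.outputs]
```

Each edge is an `(input, job, output)` triple, so the job that produced a dependency stays attached to it.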
4) Lineage sources: active vs passive assembly
Active (event-based): instrumenting orchestrators/engines (Spark/dbt/SQL engines/Kafka) to emit events: "job started/finished, inputs/outputs, column mapping."
Pros: accuracy, freshness, minimal post-hoc parsing.
Passive (inference): parsing DAGs, SQL/DDL, query logs, and catalog/storage logs; dependencies are reconstructed after the fact.
Pros: fast coverage of legacy systems; cons: lower accuracy at column level.
Usually a hybrid is used: active events where possible, with passive analysis as a safety net.
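One way to picture the hybrid approach is a merge in which event-based edges take precedence over inferred ones. The sketch below assumes edges are plain `(source, target)` pairs; the `"active"`/`"passive"` provenance labels are hypothetical.

```python
def merge_edges(active, passive):
    """Merge two edge collections, remembering where each edge came from.

    Edges seen in both sources are marked "active", because events emitted
    by the engine are assumed to be more trustworthy than inference.
    """
    merged = {}
    for edge in passive:
        merged[edge] = "passive"
    for edge in active:
        merged[edge] = "active"   # overrides a passive entry for the same edge
    return merged
```

Keeping the provenance on each edge lets a UI show inferred dependencies differently from confirmed ones.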
5) Solution architecture (reference)
Producers (orchestrators/engines) → Lineage event bus → Normalizer → Graph storage → Index/search → UI/API/alerts → Export/catalog.
Events: unified (job/run/dataset/column-lineage), with URNs and semantic versions.
Graph storage: column-level graph (for example, based on a graph database or relational + inverted index).
UI: interactive visualization of shortest paths, impact/root-cause, "quality signals" on edges and nodes.
Integrations: data catalog, quality system (DQ), access control (ABAC), audit (append-only logs).
6) Identifiers and versioning
URN/Global ID for every dataset/job/field: stable, human-readable, including platform/namespace/name/version.
SchemaVersion and code version (code SHA, image digest).
Time-travel lineage: the graph can be queried as of a past point in time, which makes investigations reproducible.
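A minimal sketch of building and parsing such identifiers. The six-part layout `urn:<kind>:<platform>:<namespace>:<name>:<version>` is an assumption made for illustration; a real deployment may order or encode the parts differently.

```python
from typing import NamedTuple


class Urn(NamedTuple):
    kind: str        # "dataset", "job", "field"
    platform: str
    namespace: str
    name: str
    version: str

    def __str__(self) -> str:
        return "urn:" + ":".join(self)


def parse_urn(text: str) -> Urn:
    """Parse the assumed six-part layout; reject anything else."""
    parts = text.split(":")
    if len(parts) != 6 or parts[0] != "urn" or not all(parts[1:]):
        raise ValueError(f"malformed URN: {text!r}")
    return Urn(*parts[1:])
```

Strict parsing at ingestion time keeps malformed identifiers from creating duplicate nodes in the graph.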
7) Column-level lineage: how to make it reliable
SQL parsing with AST construction and normalization of aliases/CTEs/views.
Annotations in transformation code (dbt tests, pragma-style comments, UDF metadata).
Events from engines: explicit mappings such as `target.col = f(src.a, src.b)`.
Semantic rules: UDF/aggregation operations are marked as "lossy" (granularity is lost) or "sensitive-preserving" (PII tags are carried over).
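Once column mappings are available, tracing a field back to its sources is a graph walk. The sketch below assumes a hypothetical in-memory mapping from each output column to its `inputs` and a `lossy` flag; it reports the original source columns and whether any step on the way lost granularity.

```python
def trace_column(column, lineage):
    """Return (sorted source columns, lossy) for `column`.

    lineage: dict mapping an output column to {"inputs": [...], "lossy": bool}.
    A column with no entry is treated as a source (assumption).
    """
    sources, seen, stack = set(), set(), [column]
    lossy = False
    while stack:
        col = stack.pop()
        if col in seen:
            continue
        seen.add(col)
        entry = lineage.get(col)
        if entry is None:
            sources.add(col)
            continue
        lossy = lossy or entry.get("lossy", False)
        stack.extend(entry["inputs"])
    return sorted(sources), lossy
```

The `seen` set guards against cycles, which can appear when inferred and event-based mappings disagree.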
8) Linking lineage to privacy and security
Privacy by Design: fields carry the labels `pii_class`, `consent_scope`, `retention`. When columns are propagated downstream, labels travel with them according to rules (for example, after `email → hash_email` the result remains PII-derived).
PII tokenization: lineage records the fact of tokenization/detokenization and the token service nodes; any detokenization is an audit event.
Encryption: for AEAD/FPE fields, lineage captures the "crypto state" and the key scope (tenant/scope) without disclosing the key.
Audit and WORM: lineage events and policy changes are stored in an immutable log (append-only with hash chains).
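The append-only log with hash chains can be illustrated with a short sketch. It assumes events are JSON-serializable dicts and uses SHA-256; the entry layout (`prev`, `event`, `hash`) is hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    payload = json.dumps(event, sort_keys=True)  # canonical key order
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append `event`, chaining it to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "event": event, "hash": _digest(prev, event)})


def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

A chain like this only detects tampering; preventing it still depends on WORM storage or an external anchor for the latest hash.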
9) Data quality and lineage-based SLOs
Checks on edges: freshness, completeness, uniqueness/keys, drift of distributions.
SLO/SLI: "95% of jobs feeding financial-report metrics complete by 06:00 UTC."
Root-cause: the graph plus execution times quickly identify the "first broken node."
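A sketch of the "first broken node" search, assuming a hypothetical `upstream` map (node to its direct inputs) and a `status` map with `"FAILED"` for broken nodes. A node counts as a root cause here when it failed and none of its direct inputs did.

```python
def first_broken(target, upstream, status):
    """Return failed ancestors of `target` (itself included) with no failed direct input."""
    seen, stack, roots = set(), [target], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, [])
        failed = status.get(node) == "FAILED"
        if failed and not any(status.get(p) == "FAILED" for p in parents):
            roots.append(node)
        stack.extend(parents)
    return sorted(roots)
```

With a report that depends on `pnl`, which depends on `txn` and `fx`, a failure in `fx` that cascades to `pnl` and the report yields only `fx`.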
10) Impact analysis and change management
For a planned schema/logic change: traverse downstream from the affected column to get a list of impacted reports/models/API clients.
Breaking changes policy: mandatory notification of the owners of downstream artifacts, a grace period, parallel versions (`v1`/`v2`), and a sunset-date flag.
Automatic PR/tickets with a list of consumers and a migration checklist.
11) Integration with orchestrators and engines
Orchestrators: `RunStarted`/`RunCompleted` events with inputs/outputs are emitted before/after the job.
SQL/ELT: connectors to engines (warehouse, lakehouse) to obtain the actual execution plan and column mapping.
Stream processing: lineage of messages (topic → topic, key/headers), Avro/Protobuf schemas, schema evolution through a registry.
ML: lineage of features/datasets, model versions, training artifacts, feature sources.
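For orchestrators, emitting events before and after a job fits naturally into a context manager. This is a sketch under assumptions: `emit` is any callable that accepts a dict (for example, a bus client's send method), and the event field names are illustrative.

```python
from contextlib import contextmanager
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc).isoformat()


@contextmanager
def lineage_run(emit, job_urn, run_id, inputs, outputs):
    """Emit RunStarted on entry and RunCompleted (SUCCESS/FAILED) on exit."""
    base = {"run_id": run_id, "job": job_urn,
            "inputs": list(inputs), "outputs": list(outputs)}
    emit({"event": "RunStarted", "at": _now(), **base})
    try:
        yield
    except Exception:
        emit({"event": "RunCompleted", "status": "FAILED", "at": _now(), **base})
        raise  # the job's own error handling still sees the exception
    else:
        emit({"event": "RunCompleted", "status": "SUCCESS", "at": _now(), **base})
```

Wrapping the job body in `with lineage_run(...)` means a completion event is emitted even when the job raises, which avoids "run without completion event" gaps for ordinary failures.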
12) Modeling of label propagation rules (data contracts)
Dataset contract: schema + field semantics (keys, PII, aggregability, licenses/legal grounds, retention).
Propagation rules:
- `SELECT a, b FROM T` → the labels of `a`, `b` are carried over unchanged.
- `hash(email)` → label `PII-derived (pseudonymized)`, with detokenization prohibited.
- `SUM(amount)` → individual-level detail is lost; joins on the result field are not allowed.
Contracts are validated in CI (non-compliance blocks the build), and violations are recorded as audit events.
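A CI check of a contract can be as simple as comparing the declared fields with the observed schema. The sketch below assumes both are dicts of the hypothetical shape `{field: {"type": ..., "pii": ...}}`; a non-empty result would fail the build.

```python
def contract_violations(contract, actual):
    """Return human-readable differences between a contract and the observed schema."""
    problems = []
    for name, spec in contract.items():
        got = actual.get(name)
        if got is None:
            problems.append(f"missing field: {name}")
            continue
        for key in ("type", "pii"):   # attributes compared in this sketch
            if got.get(key) != spec.get(key):
                problems.append(
                    f"{name}.{key}: expected {spec.get(key)!r}, got {got.get(key)!r}"
                )
    return problems
```

Fields present in the data but absent from the contract are ignored here; whether to treat them as violations is a policy decision.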
13) Performance and scale
Incremental ingestion of lineage events; deduplication by `(run_id, job_urn)`.
Column-level storage: separate the hot index (last 30-90 days) from the archive; keep snapshots.
Caching paths for frequent requests (short paths to "golden" metrics).
Sharding by namespaces/tenants; protection against "monster nodes" (fan-out limits).
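Deduplication and a fan-out guard can be sketched together. The event shape (`run_id`, `job_urn`, `outputs`) and the `max_fan_out` threshold are assumptions; `seen` stands in for whatever persistent key store the ingestion service uses.

```python
def ingest(events, seen, max_fan_out=1000):
    """Drop duplicate run events and flag runs with unusually many outputs.

    seen: mutable set of (run_id, job_urn) keys already ingested.
    Returns (accepted events, keys of oversized runs).
    """
    accepted, oversized = [], []
    for ev in events:
        key = (ev["run_id"], ev["job_urn"])
        if key in seen:            # replayed or retried delivery
            continue
        seen.add(key)
        accepted.append(ev)
        if len(ev.get("outputs", [])) > max_fan_out:
            oversized.append(key)  # candidate "monster node"; review before indexing
    return accepted, oversized
```

Because the key is checked before any write, replaying the same batch is harmless, which makes at-least-once delivery from the event bus acceptable.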
14) Visualization and UX
Modes:
- Path to metric: "what the metric is built from."
- Impact from source: "who will be affected by the change."
- Field lineage: "how the field is calculated."
Overlays: job statuses, quality, PII tags, retention, owners.
Actions: open a contract, create a migration ticket, subscribe to change alerts.
15) Security of access to the graph
ABAC: Node/edge visibility is restricted to tenants/roles.
Redaction: hiding sensitive field names (or aliasing them) in the UI for unprivileged roles.
mTLS/OIDC for the API; lineage events are signed with service identities.
WORM and read control: reading critical graph segments is also logged.
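A sketch of tenant-based visibility combined with redaction. The node attributes (`tenant`, `sensitive`), the user shape, and the `pii_reader` role name are all hypothetical.

```python
def visible_nodes(nodes, user):
    """Filter nodes by tenant and mask sensitive names for roles without PII access."""
    result = []
    for node in nodes:
        if node["tenant"] not in user["tenants"]:
            continue                      # ABAC: other tenants' nodes are invisible
        name = node["name"]
        if node.get("sensitive") and "pii_reader" not in user["roles"]:
            name = "<redacted>"           # keep the node, hide the field name
        result.append({**node, "name": name})
    return result
```

Redacting the name instead of dropping the node keeps paths through the graph intact, so impact analysis still works for users who may not see the field itself.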
16) Operation: SLO, monitoring, alerts
Graph SLO: event latency < 5 min; coverage > 98% of critical pipelines; 100% of "golden metrics" have column-level lineage.
Alerts: broken chains, runs without completion events, inconsistent schemas, orphaned datasets, fan-out growth/cycles.
Reports: weekly "state of lineage coverage," top 10 risk nodes.
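Two of these signals, coverage and orphaned datasets, reduce to set arithmetic. The sketch assumes pipelines and datasets are identified by strings and edges are `(source, target)` pairs; the function names are hypothetical.

```python
def coverage(critical_pipelines, pipelines_with_lineage):
    """Share of critical pipelines that emit lineage (1.0 when none are critical)."""
    critical = set(critical_pipelines)
    if not critical:
        return 1.0
    return len(critical & set(pipelines_with_lineage)) / len(critical)


def orphaned(datasets, edges):
    """Datasets that appear in no edge, neither as a source nor as a target."""
    connected = {node for edge in edges for node in edge}
    return sorted(set(datasets) - connected)
```

Comparing `coverage(...)` against the 0.98 target and alerting on a non-empty `orphaned(...)` result would cover two items from the list above.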
17) Privacy and compliance (cross-links)
GDPR/PbD: store legal bases for processing and retention periods as tags; lineage enables fast path-finding for DSARs and supports the "right to erasure" through cascaded crypto-deletion of the corresponding segments.
Secret management: credentials for raw data sources never enter lineage in plaintext; only a reference to the role/policy is stored.
Audit/immutable logs: all lineage events are signed and committed to the append-only store (see the corresponding article).
18) Checklists
Before starting:
- URN conventions are defined for datasets/jobs/fields.
- Enabled emission of lineage events from orchestrators and engines.
- SQL/DDL parser and schema normalizer work.
- Data-contracts and PII/retention propagation rules are approved.
- Configured WORM event log and graph backups.
- BI/ML are connected as lineage consumers (reports, models, features).
- Lineage coverage for critical domains ≥ 98%, column-level for "money" = 100%.
- Alerts for broken chains, orphaned datasets, and schema drift are on.
- Quarterly audits of PII tags and contracts.
- A documented process for breaking changes and for notifying consumers.
19) Mini recipes
RunCompleted event (pseudo-JSON):

    {
      "event": "RunCompleted",
      "run": {
        "id": "run_2025-10-31T14:20:00Z_42",
        "job": "urn:job:etl:finance:close_books_v3",
        "status": "SUCCESS",
        "code_sha": "b3f9…",
        "started_at": "2025-10-31T14:05:00Z",
        "ended_at": "2025-10-31T14:19:52Z"
      },
      "inputs": [
        "urn:dataset:lake:bank_txn_v2",
        "urn:dataset:warehouse:fx_rates_d+1"
      ],
      "outputs": [
        "urn:dataset:warehouse:pnl_daily_v3"
      ],
      "column_lineage": [
        {
          "output": "pnl_daily_v3.pnl_usd",
          "expr": "SUM(txn.amount_local * fx.rate)",
          "inputs": ["bank_txn_v2.amount_local", "fx_rates_d+1.rate"],
          "lossy": true
        }
      ]
    }

PII propagation rule (idea):

    if input.field.pii in {"email", "phone", "id"} and transform in {"hash", "tokenize"}:
        output.field.pii = "pseudonymized"
    elif transform in {"aggregate", "anonymize_k"}:
        output.field.pii = "anonymous"
    else:
        output.field.pii = input.field.pii

Impact query "what will break":

    affected = downstream(urn: "urn:dataset:warehouse:users_v4", depth=4)
    filter affected where kind in {"dashboard", "model", "api"} and owner not in {"team-exp"}

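The query above could be realized as a breadth-first walk with a depth limit followed by a filter. This is a sketch, assuming a hypothetical in-memory `graph` (node to its direct consumers) and a `meta` dict holding `kind` and `owner` for each node.

```python
from collections import deque


def downstream(graph, start, depth):
    """Nodes reachable from `start` in at most `depth` hops, in BFS order."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue                      # do not expand past the depth limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, dist + 1))
    return found


def impacted(graph, meta, start, depth=4,
             kinds=("dashboard", "model", "api"), skip_owners=("team-exp",)):
    """Consumer artifacts affected by a change to `start`, minus excluded owners."""
    return [n for n in downstream(graph, start, depth)
            if meta[n]["kind"] in kinds and meta[n]["owner"] not in skip_owners]
```

The depth limit bounds the work on highly connected sources, at the cost of missing consumers further than `depth` hops away.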
20) Frequent mistakes and how to avoid them
Lineage "as a picture" without a formal model. Events, schemas, and URNs are needed; otherwise the graph does not scale.
No column-level lineage where "money" is involved. Calculations cannot be explained without the column level.
Incomplete events (without `code_sha`/schema versions). Reproducibility is then impossible.
Ignoring privacy. PII tags must live with the fields and travel with them.
One large graph database without sharding. Split by namespaces and store snapshots.
Blind faith in parsers. In ambiguous cases, rely on active events from the engines.
21) Runbooks
Incident: Metric "jumped."
1. Open "Path to metric" → check the latest `Run` nodes on the path.
2. Check code/schema versions, check DQ status on edges.
3. If a broken link is found, create a ticket for the owner, enable the temporary "hold" of the metric publication.
4. After the fix, record the RCA and link it to the affected graph nodes.
Change: modifying a source schema.
1. Request downstream impact.
2. Send notifications to owners, create migration PRs.
3. Bring up a parallel `v_next`; keep both versions until the sunset date.
4. Retire `v_prev`; update the contracts and the lineage graph.
Related articles:
- "Privacy by Design (GDPR)"
- "PII Data Tokenization"
- "Secret Management"
- "Audit and immutable logs"
- "At Rest/In Transit Encryption"
- "Key Management and Rotation"