Origin and path of data
1) What is Data Lineage
Data Lineage is the "life story" of data: from its place of birth (the source), through transformations and transfers, to data marts, reports, and models. Lineage answers questions such as:
- Where did the numbers in this report come from?
- Which tables/fields will be affected by the schema change?
- Why did the KPI change at 9 pm yesterday?
- Which data ended up in a specific ML model and version?
For iGaming this is critical because of regulation, financial reporting (GGR/NET), anti-fraud, KYC/AML, responsible gaming, and the high pace of product changes.
2) Lineage levels and granularity
1. Business lineage - linking metrics and business terms (from the glossary) to data marts/formulas.
2. Technical lineage (table-level) - relationships between tables/jobs/transformation packages.
3. Field/column-level - which source columns form each destination column, and by what rules.
4. Runtime lineage (operational) - actual runs: times, volumes, code/schema versions, artifact hashes.
5. End-to-end - the full path from provider/PSP/CRM to report/dashboard/model.
6. Cross-domain/Mesh - connections between domain data products under contracts.
3) Key value
Trust and audit: explainability of reports and models, rapid investigation of incidents.
Impact analysis: safe changes to schemas/logic, predictable releases.
Onboarding speed: New analysts and engineers understand the landscape faster.
Compliance: PII traceability, Legal Hold, reporting to regulators.
Cost optimization: identification of dead pipelines and duplicate data marts.
4) Objects and artifacts
Graph entities: Source (game provider, PSP, CRM), Topic/Stream, Raw/Staging, Bronze/Silver/Gold, DWH, ML features, BI model, Dashboard.
Relationships: transformations (SQL/ELT), jobs (Airflow/dbt/...), models (versioned), contracts (Avro/Proto/JSON Schema).
Attributes: owner, domain, classification, schema version, quality control, freshness, SLO/SLI.
5) Sources of truth for lineage
Static: parsing SQL/configs (dbt, ETL) → build dependencies (see the sketch below).
Dynamic/runtime: metadata collected at execution time (instrumentation in the orchestrator, query logs).
Event-based: lineage events emitted when publishing/reading messages on the bus (Kafka/Pulsar), plus contract validation.
Manual (keep to a minimum): describes complex business logic that cannot be extracted automatically.
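As an illustration of the static approach, here is a minimal sketch that extracts table references from a SQL statement and turns them into (source, target) edges. The regex, the sample query, and the mart name are hypothetical simplifications; a real pipeline would rely on a proper SQL parser or the dbt manifest.

```python
import re
from typing import List, Set, Tuple

# Naive extraction of upstream tables from a SQL statement.
# A real pipeline should use a proper SQL parser; this regex only
# catches simple "FROM schema.table" / "JOIN schema.table" cases.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)

def upstream_tables(sql: str) -> Set[str]:
    """Return the set of tables referenced in FROM/JOIN clauses."""
    return set(TABLE_REF.findall(sql))

def build_edges(target: str, sql: str) -> List[Tuple[str, str]]:
    """Build (source, target) lineage edges for one transformation."""
    return [(src, target) for src in sorted(upstream_tables(sql))]

# Hypothetical transformation that builds a gold mart.
sql = """
SELECT p.player_id, SUM(b.bet_amount) AS turnover
FROM silver.bets b
JOIN silver.players p ON p.player_id = b.player_id
GROUP BY p.player_id
"""
print(build_edges("gold.game_performance", sql))
# [('silver.bets', 'gold.game_performance'), ('silver.players', 'gold.game_performance')]
```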
6) Lineage and Data Contracts
The contract pins down the schema, semantics, and SLA.
A compatibility check (semver) and idempotency are required.
Lineage keeps a link to the contract/version and to the fact that the check passed (CI/CD + runtime).
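A minimal sketch of such a compatibility check, assuming contracts are represented as simple field-name → type mappings; the field lists are hypothetical, and schema registries provide equivalent checks for Avro/Protobuf out of the box.

```python
from typing import Dict, List

def backward_compatible(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return a list of breaking changes between two schema versions.

    Backward compatibility here means: consumers written against the old
    schema can still read data produced under the new one.
    """
    breaking = []
    for field, old_type in old.items():
        if field not in new:
            breaking.append(f"field removed: {field}")
        elif new[field] != old_type:
            breaking.append(f"type changed: {field} {old_type} -> {new[field]}")
    return breaking

# Hypothetical contract versions for a payments topic.
v1 = {"payment_id": "string", "amount": "decimal", "currency": "string"}
v2 = {"payment_id": "string", "amount": "float", "channel": "string"}

issues = backward_compatible(v1, v2)
if issues:
    print("MAJOR version bump required:", issues)
else:
    print("Non-breaking change: MINOR/PATCH bump is enough")
```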
7) Lineage in iGaming: Domain Examples
Game events → RTP aggregates, volatility, retention, the Game Performance gold mart.
Payments/withdrawals/chargebacks → GGR/NET reports, anti-fraud signals.
KYC/AML → statuses, checks, alerts → compliance cases and reporting.
Responsible Gaming → limits/self-exclusion → risk scoring and intervention triggers.
Marketing/CRM → campaigns, bonuses, wagering → impact on LTV/ARPPU.
8) Graph visualization
Recommendations:
- Two modes: a "landscape map" (macro) and an "end-to-end trace" (micro) from field to field.
- Filters: by domain, owner, classification (PII), environment (prod/stage), time.
- Overlays: freshness, volumes, DQ errors, schema versions.
- Quick actions: "Show dependents", "Who consumes this column?", "Path to the KPI dashboard".
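The quick actions above map directly onto graph queries. A minimal sketch, assuming networkx as the graph backend and hypothetical node names:

```python
import networkx as nx

# Directed lineage graph: an edge A -> B means "B is built from A".
g = nx.DiGraph()
g.add_edges_from([
    ("silver.bets.bet_amount", "gold.game_performance.turnover"),
    ("gold.game_performance.turnover", "bi.dashboard.ggr_overview"),
    ("silver.bets.bet_amount", "ml.features.player_turnover_7d"),
])
# Optional overlay attributes for visualization (domain, PII, freshness, ...).
nx.set_node_attributes(g, {"silver.bets.bet_amount": {"domain": "gaming", "pii": False}})

def dependents(graph: nx.DiGraph, node: str) -> set:
    """Everything downstream of `node` (answers 'who consumes this column?')."""
    return nx.descendants(graph, node)

def upstream(graph: nx.DiGraph, node: str) -> set:
    """Everything upstream of `node` (answers 'where did this number come from?')."""
    return nx.ancestors(graph, node)

print(dependents(g, "silver.bets.bet_amount"))
print(upstream(g, "bi.dashboard.ggr_overview"))
```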
9) Impact analysis and change management
Before changing a schema or logic, run a what-if analysis: which jobs/marts/dashboards/models will be affected.
Auto-generate tickets for the owners of dependent artifacts.
Dual-write/blue-green pattern for data marts: v2 is populated in parallel, metrics are compared (see the sketch below), then traffic is switched.
Backfill playbooks: how to load historical data and how to verify consistency.
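A minimal sketch of the metric-comparison step in the blue-green pattern, assuming the daily aggregates from both mart versions are already available as plain dictionaries; the metric names and the tolerance are hypothetical.

```python
from typing import Dict

def compare_marts(v1: Dict[str, float], v2: Dict[str, float],
                  rel_tolerance: float = 0.001) -> Dict[str, str]:
    """Compare the same aggregate metrics computed from the v1 and v2 marts.

    Switching traffic to v2 is only allowed when every metric is within
    the relative tolerance.
    """
    verdicts = {}
    for metric, old in v1.items():
        new = v2.get(metric)
        if new is None:
            verdicts[metric] = "missing in v2"
        elif old == 0:
            verdicts[metric] = "ok" if new == 0 else "diverged"
        else:
            drift = abs(new - old) / abs(old)
            verdicts[metric] = "ok" if drift <= rel_tolerance else f"diverged ({drift:.2%})"
    return verdicts

# Hypothetical daily aggregates from the old and the parallel new mart.
v1 = {"ggr": 125_400.0, "net": 98_700.0, "active_players": 18_230.0}
v2 = {"ggr": 125_400.0, "net": 98_820.0, "active_players": 18_230.0}
print(compare_marts(v1, v2))  # 'net' diverges by ~0.12% and blocks the switch
```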
10) Lineage and data quality (DQ)
Associate DQ rules with graph nodes/fields: validity, uniqueness, consistency, timeliness.
When a check fails, show "red segments" on the affected paths and alert the owners (see the sketch below).
Keep a history of DQ incidents and their impact on KPIs.
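A sketch of the "red segments" idea: each node carries its latest DQ verdict, and every edge downstream of a failing node is flagged. It reuses networkx from the earlier sketch; node names and statuses are hypothetical.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("raw.payments", "silver.payments"),
    ("silver.payments", "gold.ggr_daily"),
    ("gold.ggr_daily", "bi.dashboard.finance"),
])
# Latest DQ verdict per node (validity/uniqueness/consistency/timeliness rolled up).
dq_status = {"raw.payments": "pass", "silver.payments": "fail",
             "gold.ggr_daily": "pass", "bi.dashboard.finance": "pass"}
nx.set_node_attributes(g, dq_status, name="dq")

def red_segments(graph: nx.DiGraph) -> list:
    """Edges downstream of any failing node; these are drawn red and alerted on."""
    red = set()
    for node, status in graph.nodes(data="dq"):
        if status == "fail":
            affected = nx.descendants(graph, node) | {node}
            red |= {(u, v) for u, v in graph.edges() if u in affected}
    return sorted(red)

print(red_segments(g))
# [('gold.ggr_daily', 'bi.dashboard.finance'), ('silver.payments', 'gold.ggr_daily')]
```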
11) Lineage for ML/AI
Traceability - dataset → features → training code → model (version) → inference.
Record commits, training parameters, framework versions, and validation data (see the sketch below).
Lineage helps investigate drift, metric regression, and reproduce results.
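A minimal sketch of what such a record might look like for one training run: a content hash of the dataset, the git commit, the runtime version, and the hyperparameters. The paths, parameters, and output file are hypothetical; an ML registry such as MLflow captures the same information.

```python
import hashlib
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Content hash of the training dataset, so the exact data can be matched later."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def training_run_record(dataset: Path, params: dict) -> dict:
    """Collect everything needed to reproduce and audit this training run."""
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True).stdout.strip()
    return {
        "started_at": datetime.now(timezone.utc).isoformat(),
        "dataset": {"path": str(dataset), "sha256": file_sha256(dataset)},
        "code": {"git_commit": commit, "python": sys.version.split()[0]},
        "params": params,
    }

if __name__ == "__main__":
    record = training_run_record(Path("features/churn_train.parquet"),
                                 {"model": "xgboost", "max_depth": 6, "eta": 0.1})
    Path("runs").mkdir(exist_ok=True)
    Path("runs/churn_model_v12.json").write_text(json.dumps(record, indent=2))
```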
12) Lineage and Privacy/Compliance
Label PII/financial fields with country, applicable law (GDPR/local), and processing basis.
Mark the nodes where masking/pseudonymization/anonymization is applied.
For DSAR/right-to-be-forgotten requests, track which marts/backups contain the subject's data.
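A sketch of the DSAR question "where can this subject's data end up?", answered by propagating PII labels downstream through the lineage graph; networkx is again an assumed backend and the node names are hypothetical.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("raw.kyc_documents", "silver.kyc_profiles"),
    ("silver.kyc_profiles", "gold.compliance_cases"),
    ("silver.kyc_profiles", "ml.features.risk_score"),
    ("raw.game_rounds", "gold.game_performance"),
])
pii_sources = {"raw.kyc_documents"}  # nodes labeled as containing PII

def pii_exposed_nodes(graph: nx.DiGraph, sources: set) -> set:
    """All nodes that may carry PII unless masking is applied on the way."""
    exposed = set(sources)
    for src in sources:
        exposed |= nx.descendants(graph, src)
    return exposed

# Every node here has to be checked (or covered by masking/pseudonymization)
# when handling a DSAR / right-to-be-forgotten request.
print(sorted(pii_exposed_nodes(g, pii_sources)))
```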
13) Metrics (SLO/SLI) for Lineage
Coverage: % of tables/fields with column-level lineage (computed as in the sketch below).
Freshness SLI: the share of nodes meeting their freshness SLA.
DQ pass rate: the share of successful checks on critical paths.
MTTD/MTTR for data incidents.
Change lead time: the average time to agree on and safely release a schema change.
Dead assets: the share of unused marts/jobs.
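A small sketch of computing the coverage and freshness SLIs from exported node metadata; the node list, fields, and thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# Hypothetical node metadata exported from the lineage/catalog tool.
nodes = [
    {"name": "gold.ggr_daily", "column_lineage": True,
     "last_updated": now - timedelta(hours=2), "freshness_sla_hours": 6},
    {"name": "gold.game_performance", "column_lineage": True,
     "last_updated": now - timedelta(hours=9), "freshness_sla_hours": 6},
    {"name": "gold.bonus_costs", "column_lineage": False,
     "last_updated": now - timedelta(hours=1), "freshness_sla_hours": 24},
]

coverage = sum(n["column_lineage"] for n in nodes) / len(nodes)
fresh = sum(
    (now - n["last_updated"]) <= timedelta(hours=n["freshness_sla_hours"])
    for n in nodes
) / len(nodes)

print(f"column-level coverage: {coverage:.0%}")  # 67%
print(f"freshness SLI:         {fresh:.0%}")     # 67%
```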
14) Tools (categories)
Catalog/Glossary/Lineage: single metadata graph, import from SQL/orchestrators/bus.
Orchestration: collecting runtime metadata, task statuses, SLAs.
Schema Registry/Contracts - compatibility checks, version policies.
DQ/Observability: rules, anomalies, freshness, volumes.
Sec/Access: PII labels, RBAC/ABAC, auditing.
ML Registry: versioning of models, artifacts, and datasets.
15) Templates (ready to use)
15.1 Lineage node passport
- Name/Domain/Environment:
- Owner/Steward:
- Classification: Public/Internal/Confidential/Restricted (PII)
- Source/Inputs: Tables/Topics + Contract Versions
- Transformation: SQL/job/repo + commit
- Outputs/Consumers: marts/dashboards/models
- Observability signals: freshness, volume, anomalies
- Incident history: links to tickets/post-mortem
15.2 Mapping card (column-level)
- Source field: schema.table.col (type, nullable)
- Target field: schema.table.col (type, nullable)
- Transformation rule: expression/function/lookup
- Quality context: checks, ranges, reference data
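Both templates can be kept as structured records instead of free text. A minimal sketch with Python dataclasses (all field values are hypothetical) that could be serialized to YAML/JSON inside the catalog:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodePassport:
    """Template 15.1: passport of a lineage node."""
    name: str
    domain: str
    environment: str
    owner: str
    classification: str            # Public / Internal / Confidential / Restricted (PII)
    inputs: List[str] = field(default_factory=list)         # tables/topics + contract versions
    transformation: str = ""       # SQL/job/repo + commit
    consumers: List[str] = field(default_factory=list)      # marts/dashboards/models
    observability: List[str] = field(default_factory=list)  # freshness, volume, anomalies
    incidents: List[str] = field(default_factory=list)      # ticket/post-mortem links

@dataclass
class ColumnMapping:
    """Template 15.2: column-level mapping card."""
    source: str          # schema.table.col
    target: str          # schema.table.col
    rule: str            # expression/function/lookup
    quality_checks: List[str] = field(default_factory=list)

mapping = ColumnMapping(
    source="silver.bets.bet_amount",
    target="gold.game_performance.turnover",
    rule="SUM(bet_amount) GROUP BY player_id",
    quality_checks=["not_null", "non_negative"],
)
```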
15.3 Incident Investigation Playbook
1. Identify the affected KPI/dashboard.
2. Walk upstream toward the source.
3. Check freshness/volumes/DQ at each node.
4. Find the last code/schema change.
5. Compare production/stage/yesterday's data.
6. Assign the fix and backfill.
7. Post-mortem and a rule for the future.
16) Processes and integrations
On-change: Each merge into the repo that changes the schema/SQL triggers a lineage rebuild and impact analysis.
On-run: each successful/failed job writes runtime metadata to the graph (see the sketch below).
Access-hooks: Access requests show the path to PII and responsible owners.
Governance rituals: weekly review of critical paths, monthly report on SLO.
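A sketch of the on-run hook: a decorator that emits runtime metadata for every job execution so it can be attached to the corresponding graph node. The `publish` function is a hypothetical stand-in for whatever sink is actually used (an OpenLineage client, a catalog API, a plain table).

```python
import functools
import json
import time
from datetime import datetime, timezone

def publish(event: dict) -> None:
    """Hypothetical sink: replace with an OpenLineage client, catalog API, or a DB insert."""
    print(json.dumps(event))

def track_run(job_name: str):
    """Decorator that emits runtime lineage metadata for each job execution."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            t0 = time.monotonic()
            status, rows = "failed", None
            try:
                result = fn(*args, **kwargs)
                status, rows = "success", getattr(result, "row_count", None)
                return result
            finally:
                publish({
                    "job": job_name,
                    "status": status,
                    "started_at": started,
                    "duration_s": round(time.monotonic() - t0, 3),
                    "rows": rows,
                })
        return wrapper
    return decorator

@track_run("gold.game_performance.build")
def build_game_performance():
    ...  # the actual transformation goes here

build_game_performance()  # emits: {"job": "gold.game_performance.build", "status": "success", ...}
```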
17) Implementation Roadmap
0-30 days (MVP)
1. Identify critical KPIs/dashboards and their end-to-end paths.
2. Connect SQL parsing/jobs for tabular lineage.
3. Introduce the node/edge passports and minimum freshness metrics.
4. Describe the PII tags in the key paths (KYC, payments).
60-90 days
1. Move to column-level lineage for the top data marts.
2. Integrate orchestrator runtime metadata (time, volume, statuses).
3. Associate DQ rules with the graph and enable alerts.
4. Visualization: filters by domain/owner/PII, freshness overlays.
3-6 months
1. Contracts and a schema registry on the event bus (game/payment feeds).
2. Full ML lineage trace (data → features → model → inference).
3. Impact analysis in CI → automatic tickets to dependency owners.
4. Column-level coverage ≥70% of active marts; SLO reporting.
18) Patterns and anti-patterns
Patterns:
- Graph-first: a single metadata graph as the "compass" for changes.
- Contract-aware lineage: association with schema versions and validation results.
- Observability overlay: freshness/volumes/DQ on top of the graph.
- Product-thinking: Domain owners publish certified "data products."
- "Picture for picture's sake" without automatic collection and support.
- Hand-held mind-maps instead of parsing and runtime-truth.
- Lack of column detailing in critical KPI paths.
- Linage without binding with accesses/PII and DSAR/Legal Hold processes.
19) Practical checklists
Before releasing data changes
- Contract updated, compatibility passed
- Dependency impact analysis completed
- v2 mart built in parallel, metrics compared
- Backfill and rollback plan documented
Weekly review
- Critical paths are green in freshness
- No orphaned jobs/marts
- DQ incidents closed and documented
- Column-level coverage above the target threshold
Result
Lineage turns chaotic data flows into a manageable map of the terrain: you can see where data came from, who is responsible for it, where the risks are, and how to change things safely. For iGaming, it is the foundation of trust in KPIs, fast experimentation, and mature compliance.