Data lifecycle
1) Purpose and principles
The goal is to enable predictable, compliant, and cost-effective movement of data from inception to final disposition, supporting analytical, operational, and regulatory scenarios.
Basic principles:
- Data as a Product: each dataset has an owner, a contract, an SLO, and documentation.
- Schema-first: schemas are mandatory; changes go through versioning.
- Privacy-by-Design: PII minimization, pseudonymization, regional storage.
- Observability-by-Default: metrics, access logging, lineage.
- Cost-aware: storage tiers, TTL, sampling, compression.
2) Life cycle phases
2.1 Create/Collect
Sources: products (web/mobile), backends, payments, KYC/AML providers, games/studios, marketing, operating logs.
Identifiers: `event_id`, `user.pseudo_id`, `session_id`, `trace_id`.
Contracts: JSON/Avro schemas, AsyncAPI/OpenAPI.
Input quality: schema validation, mandatory fields, size limits, anti-duplication.
Privacy: tokenization of sensitive fields, geo-routed ingest (EEA/UK/BR).
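The intake checks above can be sketched in Python; the field names are taken from the identifier list, while the 32 KB size limit and the function shape are illustrative assumptions, not values from this document:

```python
import json

# Assumed contract: required fields per the identifier list above,
# plus a hypothetical payload size limit.
REQUIRED_FIELDS = {"event_id", "user.pseudo_id", "session_id", "trace_id"}
MAX_PAYLOAD_BYTES = 32 * 1024

def validate_event(event: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the event passes."""
    errors = []
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    size = len(json.dumps(event).encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        errors.append(f"payload {size} B exceeds limit {MAX_PAYLOAD_BYTES} B")
    return errors
```

In a real pipeline these checks would run against the registered Avro/JSON schema rather than a hardcoded field set.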
2.2 Ingest & Raw
Transport: HTTP/gRPC → Edge → bus (Kafka/Redpanda).
Raw layer (Bronze): append-only, immutable payloads (for forensics), partitioning by time/market/tenant.
Policies: dedup by `(event_id, source)`, DLQ for malformed events, Legal Hold tags.
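A minimal sketch of the dedup/DLQ policy, assuming events carry `event_id` and `source` fields and the caller supplies a schema check; real systems would track seen keys in a durable store, not in memory:

```python
def route_events(events, schema_check, seen=None):
    """Route events: drop duplicates by (event_id, source), send invalid ones to a DLQ."""
    seen = set() if seen is None else seen
    accepted, dlq = [], []
    for ev in events:
        key = (ev.get("event_id"), ev.get("source"))
        if None in key or not schema_check(ev):
            dlq.append(ev)          # malformed -> dead-letter queue for inspection
        elif key in seen:
            continue                # duplicate -> drop, keeping Bronze append-only and clean
        else:
            seen.add(key)
            accepted.append(ev)
    return accepted, dlq
```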
2.3 Processing and cleaning (Refine)
Normalization (Silver): typing, deduplication, reference data, FX rates/timezones, enrichment.
Quality (DQ): completeness/uniqueness/ranges/referential integrity.
Reprocessing: idempotent pipelines, time-travel, controlled backfills.
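Idempotent reprocessing can be illustrated as a key-based upsert, so running a backfill twice yields the same rows; this is only a sketch — real pipelines would use the table format's MERGE and time-travel features:

```python
def upsert_partition(existing, reprocessed, key="event_id"):
    """Idempotent backfill: reprocessed rows replace existing rows with the same key,
    so repeated runs produce identical results (no double counting)."""
    merged = {row[key]: row for row in existing}
    merged.update({row[key]: row for row in reprocessed})
    return list(merged.values())
```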
2.4 Serve and Use
Gold marts: BI/reporting (GGR, RG, AML), product and risk models, real-time marts.
Access: SQL/Trino, semantic metrics layer, API/GraphQL, Feature Store.
Freshness SLA: for example, daily Gold marts are ready by 06:00 local time.
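The freshness SLA reduces to comparing a mart's materialization time against the deadline; the 06:00 cutoff matches the example above, everything else in this sketch is illustrative:

```python
from datetime import datetime, time

def freshness_ok(materialized_at: datetime, deadline: time = time(6, 0)) -> bool:
    """Daily-mart freshness check: materialization must complete
    before the local deadline (06:00 per the SLA example)."""
    return materialized_at.time() <= deadline
```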
2.5 Share and Publish
Internal consumers: Analytics, Product, Risk, Compliance, Marketing, Finance.
External offloads: regulators, partners/providers; immutable packages (PDF/CSV/JSON + hash).
Monitored channels: signed artifacts, audited downloads/exports.
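A sketch of an immutable export package: canonical JSON plus its SHA-256 digest, so a regulator or partner can verify integrity. Cryptographic signing, mentioned above, is omitted here for brevity:

```python
import hashlib
import json

def package_export(records: list[dict]) -> dict:
    """Build an immutable export artifact: canonical JSON payload plus its SHA-256."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":")).encode()
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}

def verify_export(artifact: dict) -> bool:
    """Recompute the digest; any tampering with the payload breaks the match."""
    return hashlib.sha256(artifact["payload"]).hexdigest() == artifact["sha256"]
```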
2.6 Archive/Retain
Retention policies: by data type and jurisdiction (e.g. regulatory: 5-7 years).
Storage tiers: hot/warm/cold, WORM/Object Lock for immutability.
Archive indexing: catalogs, version/market labels, fast metadata search.
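Tier assignment by age might look like the sketch below; the 30-day and 365-day boundaries are hypothetical and would in practice come from the retention policy per data type and jurisdiction:

```python
from datetime import date

def storage_tier(created: date, today: date) -> str:
    """Pick a storage tier from record age (illustrative thresholds)."""
    age = (today - created).days
    if age <= 30:
        return "hot"    # frequently queried, low-latency storage
    if age <= 365:
        return "warm"   # infrequent access, cheaper storage class
    return "cold"       # archival, possibly WORM/Object Lock
```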
2.7 Delete and Dispose
Routine deletion: TTL/retention; secure erasure, index updates.
Legal operations: DSAR/RTBF (right to be forgotten), exceptions for legal retention obligations, Legal Hold (deletion freeze).
Verification: deletion reports, audit log, cross-replica checks.
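The RTBF flow with a Legal Hold exception and a deletion report can be sketched as follows; the field names (`pseudo_id`, `legal_hold`) are assumptions for illustration:

```python
def process_rtbf(request_user, records):
    """Apply an RTBF request: delete the user's records unless they carry a
    legal_hold tag; return surviving records plus a report for the audit log."""
    kept, deleted = [], 0
    for r in records:
        if r["pseudo_id"] == request_user and not r.get("legal_hold", False):
            deleted += 1          # would trigger secure erasure and index updates
        else:
            kept.append(r)        # held or belonging to other users
    return kept, {"user": request_user, "deleted": deleted}
```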
3) Classification and catalogue
Sensitivity categories: public/internal/confidential/restricted.
Domains: Payments, Gameplay, Compliance/AML, RG, Marketing, Ops, Finance.
Data catalog: description, owner, freshness SLA, schemas, lineage, access levels.
Tags: `jurisdiction`, `tenant`, `pii_class`, `retention_class`, `legal_hold`.
4) Lakehouse model and schemas
Bronze/Silver/Gold: clear rules for transformation and responsibility.
Formats: Parquet + table format with ACID (Delta/Iceberg/Hudi).
Schema evolution: semantic versioning, backward/forward compatibility, dual-write migrations for breaking changes.
Registry: Schema Registry, CI validation of contracts, consumer-driven tests.
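A rough backward-compatibility check between two schema versions, modeled here as field-name-to-type mappings; real registries apply much richer rules (defaults, unions, promotions), this only captures the core idea:

```python
def backward_compatible(old: dict, new: dict) -> bool:
    """New readers must still handle old data: the new schema may add
    fields, but must not drop or retype any field of the old schema."""
    for field, ftype in old.items():
        if field not in new or new[field] != ftype:
            return False
    return True
```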
5) Data quality (DQ)
Quality metrics:
- Completeness: the percentage of events/rows actually received.
- Validity: the share of records that pass schema validation.
- Uniqueness: duplicate control.
- Consistency: agreement with reference data and relationships.
- Freshness: arrival/materialization lag.
DQ rules as code (YAML/SQL tests), dashboards, SLO alerts.
Auto-fallback on degradation (last known good snapshot).
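The Completeness and Uniqueness metrics above reduce to simple ratios; a sketch assuming rows keyed by `event_id` and an expected row count supplied by the source contract:

```python
def dq_report(rows, expected_count, key="event_id"):
    """Compute the completeness and uniqueness DQ metrics described above."""
    keys = [r[key] for r in rows]
    return {
        "completeness": len(rows) / expected_count if expected_count else 0.0,
        "uniqueness": len(set(keys)) / len(keys) if keys else 1.0,
    }
```

In practice such rules would live as YAML/SQL test definitions executed by the DQ framework, with the ratios feeding SLO alerts.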
6) Privacy and compliance
PII minimization: store pseudo-IDs; keep mappings in an isolated perimeter.
Masking and RLS/CLS: at the column/row level; dynamic policies.
Regionalization: data residency by market; separate catalogs/encryption keys.
DSAR/RTBF: controlled projections, selective redaction, audited disclosures.
Legal Hold: freeze marks, unchanging archives, access logging.
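Pseudonymization can be implemented as a keyed hash, so analytics sees stable pseudo-IDs while the key (and any id-to-pseudo-id mapping) stays inside the isolated perimeter; this is a sketch, not a vetted cryptographic design:

```python
import hashlib
import hmac

def pseudonymize(user_id: str, key: bytes) -> str:
    """Keyed pseudonymization (HMAC-SHA256): deterministic per key,
    so the same user maps to the same pseudo-ID within one perimeter."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```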
7) Access and security
Authentication/authorization: SSO, RBAC/ABAC, attributes of jurisdictions and roles.
Encryption: TLS in-transit; at-rest via KMS/CMK; key rotation.
Access logs: who/what/when/where; alerts for mass exports/scans.
Separation of duties: different roles for prod/analytics/admins/reviewers.
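An ABAC decision combining role and jurisdiction attributes might be sketched like this; the attribute names and policy shape are assumptions, real systems would use a policy engine:

```python
def allowed(user_attrs: dict, resource_attrs: dict) -> bool:
    """ABAC sketch: grant access only when the user's role is permitted
    for the resource AND the jurisdiction attributes match."""
    return (
        user_attrs.get("role") in resource_attrs.get("roles", ())
        and user_attrs.get("jurisdiction") == resource_attrs.get("jurisdiction")
    )
```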
8) Lineage and observability
Technical lineage: source → transformations → marts → reports.
Operational lineage: links with releases, feature flags, models, AML/RG rules.
Platform metrics: throughput, lag, failure-rate, cost/query, cost/GB.
Tracing: propagate `trace_id` from applications through to marts/alerts.
9) Time models and retroprocesses
Event-time vs processing-time: prefer event-time; watermarks/allowed lateness.
Backfill and reprocessing: idempotent pipelines, time-travel, control of "double counting."
Saving states: TTL, snapshots, disaster recovery.
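The watermark/allowed-lateness rule reduces to a single comparison; a sketch with an assumed 10-minute default for lateness:

```python
from datetime import datetime, timedelta

def accept_event(event_time: datetime, watermark: datetime,
                 allowed_lateness: timedelta = timedelta(minutes=10)) -> bool:
    """Event-time handling: events behind the watermark are accepted only
    within the allowed lateness; anything older goes to a late-data path."""
    return event_time >= watermark - allowed_lateness
```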
10) Economics and cost control
Partitioning (date/market/tenant), clustering/Z-ordering.
Sampling for high-frequency analytics (not for transactions/compliance).
Multi-layer storage (hot/warm/cold), automatic TTL.
Budget/chargeback by team, limits on heavy queries and backfills.
11) Processes and RACI
R (Responsible): Data Platform (ingest/storage/orchestration), Data Engineering (transformation), Domain owners (Contracts/DQ/SLO).
A (Accountable): Head of Data/Chief Data Officer.
C (Consulted): Compliance/Legal/DPO, Architecture, SRE, Security.
I (Informed): BI/Product/Marketing/Finance/Operations.
12) SLO/SLI (sample targets)
13) Dashboards
Freshness heat map by domain/market.
Completeness/Validity by stream.
Storage and query cost (by layer and team).
Lineage map for critical reports (regulatory, GGR, RG/AML).
DSAR/RTBF queues, Legal Hold statuses.
14) Retention policy templates (example)
The actual dates are determined by Legal/DPO and local law.
15) Documentation and standards
Data Product page: owner, purpose, SLA, schemas, DQ rules, contacts.
Change log: schema/logic versions, impact analysis, migrations.
Runbooks: reprocessing, backfill, emergency scenarios, freeze switch.
16) Implementation Roadmap
MVP (4-6 weeks):
1. Data catalog and classification (top domains), basic schemas and registry.
2. Lakehouse Bronze/Silver, ingestion with validation and deduplication.
3. One or two Gold marts (e.g. GGR and conversion).
4. Minimum DQ rules and Freshness/Completeness dashboard.
5. Retention policies and RBAC access.
Phase 2 (6-12 weeks):
- Lineage, semantic metrics layer, DSAR/RTBF procedures.
- Regionalisation (EEA/UK), WORM for regulatory artefacts, Legal Hold.
- Cost optimization, SLO alerts, budget reporting.
- Data Mesh (domain products), consumer-driven contracts and tests.
- Automated impact analysis for schema/logic changes, replays.
- Single compliance panel (regulatory, access, DQ, lineage).
17) Pre-production checklist
- Schemas approved, contracts in the registry, compatibility tests passing.
- DQ rules are active, alerts are configured, SLOs are set.
- RBAC/ABAC roles checked, access logs enabled.
- Retention/deletion/archive policies have been validated by Legal/DPO.
- DSAR/RTBF/Legal Hold procedures are documented and tested.
- Lineage/metrics/cost are displayed in dashboards.
- Runbooks for backfill/reprocessing/DR are ready.
18) Frequent mistakes and how to avoid them
No unified classification and catalog: introduce mandatory Data Product cards.
Raw data without schemas: schema-first + CI validation.
No deletability: design TTL and RTBF processes from the start.
Mixing PII with analytics: store mappings separately, apply masking.
Gold without an owner and SLO: assign an owner and freshness targets.
Unmanaged cost: batching, compression, tiered storage, quotas.
19) Glossary (brief)
DSAR/RTBF: data subject access request / right to be forgotten.
Legal Hold: a deletion freeze for legal reasons.
Lineage: traceability of data origin and transformations.
Data Product: a managed product unit of data with SLAs.
DQ: data-quality rules and metrics.
Lakehouse: a data lake combined with ACID tables.
20) The bottom line
The data lifecycle is a managed, ordered system, not just a file dump. Clear contracts and schemas, classification and a catalog, measurable quality, privacy and security, a cost-effective storage architecture, and transparent lineage make data a reliable asset that supports product, compliance, and analytics without surprises or hidden risks.