Data flow architecture
1) Purpose and principles
Objectives: deliver correct, timely, and compliant data for analytics, reporting, anti-fraud, personalization, and ML.
Principles:
- Data as a Product: clear owners, contracts, SLOs, and versioning.
- Schema-first: schemas are mandatory; evolution follows defined rules.
- Privacy-by-Design: PII minimization, pseudonymization, access control.
- Observability-by-Default: traces, metrics, lineage, quality profiles.
- Cost-aware: tiered storage, sampling of noisy events, compression.
2) Source and Event Landscape
Transactional: deposits/withdrawals, bets/payouts, bonuses, chargebacks.
User: sessions, clicks, conversions, RG limits, KYC statuses.
Operational: application logs, performance metrics, alerts.
Providers: PSP/KYC/sanctions/game studios (aggregators).
Reference: game catalogs, country/currency directories, tariffs/taxes.
Example canonical event:
```json
{
  "event_time": "2025-10-31T19:20:11Z",
  "event_type": "payment.deposit",
  "schema_version": "1.3.0",
  "user": {"id": "U-123", "country": "EE", "age_band": "18-24"},
  "payment": {"amount": 200.00, "currency": "EUR", "method": "card", "psp_ref": "PSP-222"},
  "ctx": {"ip": "198.51.100.10", "session_id": "s-2233", "trace_id": "f4c2..."}
}
```
3) High-level reference architecture
1. Ingest layer
Gateways (HTTP/gRPC), CDC connectors (from OLTP), queues/buses (Kafka/Redpanda), telemetry collectors.
Validation, normalization, PII redaction at the edge, contract enforcement.
2. Streaming layer
Stream jobs (Flink/Spark Structured Streaming/Beam) with deduplication, watermarks, stateful aggregates.
Fan-out to storage and online services (feature store, anti-fraud).
3. Batch layer
Orchestration (Airflow/Dagster), incremental loads, backfills and reprocessing, SCD handling.
4. Storage (Lakehouse)
Bronze: raw events (append-only, immutable).
Silver: cleaned, conformed tables with quality checks and deduplication.
Gold: marts for specific use cases (BI/regulator/ML).
Table formats with ACID (Delta/Iceberg/Hudi), hot/warm/cold layering.
5. Serving and access
BI/SQL (Trino/Presto/DuckDB), semantic layer (metrics layer), API/GraphQL, Feature Store for online/offline consistency.
6. Governance and safety
Catalog/lineage, DQ rules, policy engine for access (RBAC/ABAC), masking/tokenization, WORM archive for reports.
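The PII redaction the ingest layer performs at the edge can be sketched as follows. This is a minimal illustration, not a production implementation: the function names (`pseudonymize`, `redact_event`) and the hardcoded key are assumptions, and a real system would fetch the key from KMS and cover many more fields.

```python
import hashlib
import hmac

# Hypothetical ingest-side redaction: replace the direct user identifier with a
# keyed pseudo-ID and coarsen the IP before the event reaches the bus.
SECRET_KEY = b"rotate-me-via-kms"  # illustrative only; in practice fetched from KMS

def pseudonymize(user_id: str) -> str:
    # HMAC keeps the pseudo-ID stable per user yet non-reversible without the key.
    return "P-" + hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

def redact_event(event: dict) -> dict:
    out = dict(event)
    user = dict(out.get("user", {}))
    if "id" in user:
        user["pseudo_id"] = pseudonymize(user.pop("id"))
    out["user"] = user
    ctx = dict(out.get("ctx", {}))
    if "ip" in ctx:
        ctx["ip"] = ".".join(ctx["ip"].split(".")[:3]) + ".0"  # truncate to /24
    out["ctx"] = ctx
    return out

evt = {"user": {"id": "U-123"}, "ctx": {"ip": "198.51.100.10"}}
clean = redact_event(evt)
```

Keeping the pseudo-ID deterministic lets downstream joins still work without any table ever holding the raw identifier.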
4) Contracts and schemas
Data contracts: OpenAPI/AsyncAPI/JSON Schema/Avro.
Evolution: semantic versioning; backward-compatible changes mean adding nullable fields; breaking changes only via `/v2` with dual-write during the migration period.
Registers: Schema Registry, domain directory (Payments, Gameplay, Marketing).
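Contract enforcement at ingest can be sketched with a hand-rolled field/type check. This is a toy stand-in under stated assumptions: the `CONTRACT` shape and `validate` helper are invented for illustration; a real pipeline would validate against a Schema Registry with Avro or JSON Schema.

```python
# Minimal contract check for a payment.deposit-style event.
# A real deployment would pull the schema from a Schema Registry instead.
CONTRACT = {
    "event_time": str,
    "event_type": str,
    "schema_version": str,
    "payment": dict,
}

def validate(event: dict) -> list[str]:
    errors = []
    for field, typ in CONTRACT.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], typ):
            errors.append(f"bad type for {field}: expected {typ.__name__}")
    # Backward-compatible evolution: unknown extra fields are allowed, never rejected.
    return errors

ok = validate({"event_time": "2025-10-31T19:20:11Z", "event_type": "payment.deposit",
               "schema_version": "1.3.0", "payment": {"amount": 200.0}})
bad = validate({"event_type": 42})
```

Rejecting on missing/mistyped fields while tolerating unknown ones is exactly what makes nullable-field additions backward compatible.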
5) Integration patterns
CDC (Change Data Capture): from OLTP to the bus (Debezium), partitioning by domain key.
Outbox/Inbox: guaranteed delivery of events emitted by domain logic.
Exactly-Once/Effectively-Once: transactional state, idempotent sinks, deduplication keys.
Late Data & Watermarks: handling late events; windows with allowed lateness.
Reprocessing: idempotent pipelines, time-travel, snapshot fixes.
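The Outbox pattern above can be sketched in a few lines. This is an in-memory SQLite illustration, not a reference implementation: table names and the `deposit`/`relay` helpers are assumptions, and a real relay would be Debezium or a polling publisher pushing to Kafka.

```python
import json
import sqlite3

# Outbox sketch: the domain change and the event are committed in ONE transaction,
# so a crash can never lose the event or publish it without the state change.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (user_id TEXT PRIMARY KEY, balance REAL);
CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0);
INSERT INTO accounts VALUES ('U-123', 0.0);
""")

def deposit(user_id: str, amount: float) -> None:
    with conn:  # single transaction for state change + outbox row
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE user_id = ?",
                     (amount, user_id))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"event_type": "payment.deposit",
                                  "user_id": user_id, "amount": amount}),))

def relay() -> list[dict]:
    # A separate process polls unpublished rows and pushes them to the bus.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    conn.executemany("UPDATE outbox SET published = 1 WHERE id = ?",
                     [(r[0],) for r in rows])
    conn.commit()
    return [json.loads(r[1]) for r in rows]

deposit("U-123", 200.0)
events = relay()
```

The relay gives at-least-once delivery; combined with deduplication keys downstream it yields the effectively-once semantics listed above.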
6) Lakehouse model: bronze/silver/gold
Bronze (raw):
- Partitioned by time (event_date) and market (jurisdiction).
- Append-only; original payload retained for forensics.
Silver (cleaned):
- Normalized types, reference-data lookups, deduplication by `(event_id, event_time)`.
- FK verification, currency/timezone standardization, enrichment.
Gold (marts):
- Denormalized marts (GGR, RG scoring, LTV, cohort tables).
- Refresh SLAs, aggregates for BI and reporting.
7) Data Quality
Rules: schema validation, ranges, uniqueness, completeness, referential integrity.
Profiling: distributions, cardinality, feature drift.
Monitoring: p50/p95 pipeline latency, drop rate, error budget.
Degradation policy: automatic fallback (last good snapshot), alerts, and t-tests on metrics.
An example rule set for the Silver payments table:
```yaml
table: silver.payments
rules:
  - name: amount_positive
    type: range
    column: amount
    min: 0.01
  - name: currency_valid
    type: in_set
    column: currency
    set: [EUR, USD, GBP, TRY, BRL]
  - name: unique_tx
    type: unique
    columns: [transaction_id]
slo:
  freshness_minutes: 15
  completeness_percent: 99.5
```
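Rules of that shape can be executed by a small checker. The sketch below is a toy interpreter for the three rule types in the YAML example; the `run_rules` function and rule dictionaries are illustrative, not the API of any specific DQ tool.

```python
def run_rules(rows: list[dict], rules: list[dict]) -> dict[str, bool]:
    # Evaluates range / in_set / unique rules against in-memory rows.
    results = {}
    for rule in rules:
        if rule["type"] == "range":
            results[rule["name"]] = all(r[rule["column"]] >= rule["min"] for r in rows)
        elif rule["type"] == "in_set":
            results[rule["name"]] = all(r[rule["column"]] in rule["set"] for r in rows)
        elif rule["type"] == "unique":
            keys = [tuple(r[c] for c in rule["columns"]) for r in rows]
            results[rule["name"]] = len(keys) == len(set(keys))
    return results

rules = [
    {"name": "amount_positive", "type": "range", "column": "amount", "min": 0.01},
    {"name": "currency_valid", "type": "in_set", "column": "currency",
     "set": {"EUR", "USD", "GBP", "TRY", "BRL"}},
    {"name": "unique_tx", "type": "unique", "columns": ["transaction_id"]},
]
rows = [
    {"transaction_id": "t1", "amount": 200.0, "currency": "EUR"},
    {"transaction_id": "t2", "amount": 50.0, "currency": "XXX"},  # violates in_set
]
report = run_rules(rows, rules)
```

In production the same rule definitions would feed the degradation policy: a failed rule trips an alert and, where configured, the fallback to the last good snapshot.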
8) Privacy and compliance
PII minimization and masking: store a pseudo-ID; keep lookup mappings separate.
Regionalization: geo-local buckets/catalogs (EEA/UK/BR), data residency.
Legal operations: DSAR/RTBF (recomputable projections and selective edits), Legal Hold, immutable report archives.
Access logging: audit reads of sensitive tables, break-glass and JIT access.
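The "separate lookup mapping" approach makes RTBF tractable, as a minimal sketch shows. The `pii_vault` dictionary and `handle_rtbf` helper are hypothetical names standing in for a real mapping table and deletion workflow.

```python
# Analytics tables hold only pseudo-IDs; a single mapping table holds the PII.
# A right-to-be-forgotten request then touches exactly one place.
pii_vault = {"P-9f1c": {"email": "user@example.com", "name": "Alice"}}
events = [{"user_pseudo_id": "P-9f1c", "amount": 200.0}]  # no direct PII here

def handle_rtbf(pseudo_id: str) -> None:
    # Deleting the mapping orphans the pseudo-ID: events remain usable for
    # aggregate analytics but can no longer be linked back to a person.
    pii_vault.pop(pseudo_id, None)

handle_rtbf("P-9f1c")
```

The same separation is what makes the immutable (WORM) report archives compatible with RTBF: the archives never contained direct identifiers to begin with.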
9) Observability and management
Lineage: automatically trace dependencies from source to mart.
Pipeline metrics: throughput, lag, failure rate, cost/GB, cost/query.
Tracing (OTel): `trace_id` from applications is propagated into events → end-to-end request paths can be reconstructed.
Alerts: SLO budgets, freshness/volume/cardinality anomalies.
10) Access and security model
Data categories: public/internal/confidential/restricted.
Policies: row/column-level security; dynamic masking (PAN/IBAN/email).
Key management: KMS/CMK, at-rest/in-transit encryption, rotation.
Segregation of duties: separate roles of prod/analyst/admin/reviewer.
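Dynamic masking per role can be illustrated with a small sketch. The patterns and role names (`analyst`, `restricted_viewer`) are assumptions for the example; real engines apply column-level policies in the query layer rather than application code.

```python
# Role-aware dynamic masking sketch: analysts see masked values, a restricted
# role sees cleartext. Masking patterns are illustrative, not production-grade.
def mask_pan(pan: str) -> str:
    # Keep BIN (first 6) and last 4, star out the middle.
    return pan[:6] + "*" * (len(pan) - 10) + pan[-4:]

def mask_email(email: str) -> str:
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

def render(value: str, kind: str, role: str) -> str:
    if role == "restricted_viewer":
        return value  # unmasked access is itself audited (see access logging)
    if kind == "pan":
        return mask_pan(value)
    if kind == "email":
        return mask_email(value)
    return value

masked = render("4111111111111111", "pan", "analyst")
```

Keeping the decision in one policy function mirrors the RBAC/ABAC engine: the data never leaves storage unmasked for roles that do not need it.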
11) Data Mesh and Product Approach
Domains: Payments, Gameplay, Marketing, Risk, Compliance.
Data Product: owner, freshness SLA, field dictionary, tests, versions, consumption metric.
Contracts between domains: versioned, backward-compatible, consumer-driven tests.
12) Feature Store and ML flows
Feature registry: feature description, sources, transformations, SLO.
Online/offline consistency: a single transformation codebase; online materialization latency ≤ 200-500 ms.
Drift monitoring: PSI/KS, auto-alerts and model rollbacks, PII control.
Journal of experiments: metadata, versions, reproducibility, model maps.
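The PSI drift check mentioned above can be sketched directly. The 0.2 alert threshold is a common rule of thumb, assumed here rather than prescribed; tune it per feature.

```python
import math

# Population Stability Index (PSI) over pre-binned probability distributions.
# Assumed rule of thumb: PSI > 0.2 suggests significant drift and triggers an alert.
def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time distribution
identical = psi(baseline, baseline)    # no drift
shifted = psi(baseline, [0.10, 0.20, 0.30, 0.40])  # clear shift
```

A scheduled job would compute this per feature against the training distribution and page on-call (or trigger a model rollback) when the threshold is crossed.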
13) Finmodel and cost optimization
Partitioning and Z-order/Cluster by frequent predicates.
Cold storage and TTL for unused tables, VACUUM.
Materialized views only for stable query patterns.
Quotas and budgets for heavy jobs; chargeback by team.
14) Regional and multi-tenant topology
Multi-region active-active: replication of topics and tables, independent pipeline perimeters.
Failover/DR: RPO/RTO targets, orchestrator metadata snapshots, recovery drills.
Multi-tenancy: catalog/key/quota isolation, tenant_id tagging.
15) Processes and RACI (in brief)
R: Data Platform (ingest, storage, orchestration), Data Engineering (transformation).
A: Head of Data / Chief Data Officer.
C: Compliance/Legal/DPO, Architecture, SRE.
I: BI/Analytics, Product, Marketing, Finance.
16) SLO/SLI for flows
Freshness: p95 Silver latency ≤ 15 min; Gold (daily) ready by 06:00 local time.
Completeness: ≥ 99.5% of events within window T.
Validity: DQ check error rate < 0.5% of volume.
Serving availability: ≥ 99.9% for BI/Feature API.
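The freshness SLI can be computed with a nearest-rank p95 over per-event lag, as in this sketch. The function name and the sample lags are illustrative; a real SLI would be derived from pipeline metrics, not an in-memory list.

```python
import math
from datetime import timedelta

# Freshness SLI sketch: p95 of (processing time minus event time) over a batch,
# compared against the 15-minute Silver target. Numbers are illustrative.
def p95_lag_minutes(lags: list[timedelta]) -> float:
    ordered = sorted(lags)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank p95
    return ordered[idx].total_seconds() / 60

lags = [timedelta(minutes=m) for m in [1, 2, 2, 3, 3, 4, 5, 5, 6, 40]]
p95 = p95_lag_minutes(lags)
slo_met = p95 <= 15  # one straggler is enough to blow the p95 budget
```

Note how the percentile target punishes tail latency: nine fast events do not compensate for one 40-minute straggler, which is exactly the behavior you want from a freshness SLO.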
17) Table and partitioning templates
```sql
-- Bronze: deposit events
CREATE TABLE bronze.payment_deposits (
  event_time TIMESTAMP,
  event_id STRING,
  user_pseudo_id STRING,
  amount DECIMAL(18,2),
  currency STRING,
  psp_ref STRING,
  payload VARIANT
)
PARTITION BY DATE(event_time)
CLUSTER BY (currency);

-- Silver: normalized model
CREATE TABLE silver.payments AS
SELECT event_id,
       CAST(event_time AS TIMESTAMP) AS ts,
       user_pseudo_id,
       amount,
       currency,
       psp_ref
FROM bronze.payment_deposits
QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) = 1;
```
18) Orchestration and DevX
Infra-as-Code: pipeline repositories, tests, reviews, GitOps.
Data Contracts CI: schema linters, DQ tests before deploy.
Backfill framework: safe reprocessing with read/write constraints and idempotency.
Catalogs and templates: cookiecutter generators, best practices.
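Backfill idempotency usually comes down to partition-level overwrite, sketched below. The in-memory `store` and `backfill` helper are stand-ins for a table-format operation such as a partition overwrite or MERGE.

```python
# Idempotent backfill sketch: each run fully overwrites its target partitions
# instead of appending, so a retried or overlapping backfill cannot double-count.
store: dict[str, list[dict]] = {}  # partition key (event_date) -> rows

def backfill(partition: str, rows: list[dict]) -> None:
    store[partition] = list(rows)  # overwrite, never append

backfill("2025-10-31", [{"event_id": "e1", "amount": 200.0}])
backfill("2025-10-31", [{"event_id": "e1", "amount": 200.0}])  # safe rerun
total = sum(r["amount"] for rows in store.values() for r in rows)
```

Because the write is keyed by partition, the orchestrator can retry a failed backfill task blindly, which is what makes automated retro-processing safe.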
19) Implementation Roadmap
MVP (4-6 weeks):
1. Event bus + ingest of 2-3 key sources (OLTP CDC, API gateway).
2. Lakehouse Bronze/Silver, ACID format, catalog and basic DQ rules.
3. 1-2 Gold cases (daily GGR and conversion funnel).
4. Lag/completeness metrics, basic lineage, RBAC, and PII masking.
Phase 2 (6-12 weeks):
- Streaming jobs (p95 latency ≤ 5 min), Feature Store, RG/AML marts.
- Semantic metrics layer, reporting SLAs, cost dashboards.
- Regionalization (EEA/UK), DSAR/RTBF procedures, Legal Hold for artifacts.
- Data Mesh: product domains, consumer-driven contracts.
- ML operations with drift monitoring, online/offline auto-reconciliation.
- Automated impact analysis of schema changes and "what-if" cost modeling.
20) Frequent mistakes and how to avoid them
Raw payloads without schemas: adopt schema-first, a registry, and CI validation.
No deduplication: use event keys and idempotent sinks in Silver.
PII mixed with analytics: keep mappings separate and mask fields.
Gold without an owner: assign an owner, SLOs, and consumption metrics.
No reprocessing strategy: add time-travel, logic versioning, double-counting controls.
Uncontrolled cost: batching, compression, TTL, cost observability.
21) Glossary (brief)
CDC - change data capture from OLTP.
Outbox - publish domain events transactionally.
Watermark - an estimate of stream completeness for windowing.
Lakehouse - data lake + ACID tables.
Data Product - a productized unit of data with an owner and SLOs.
Feature Store - consistent serving of ML features.
22) Bottom line
The data flow architecture is a managed system of agreements: clear contracts, observability, security, and controlled cost. By following the patterns described (schema-first, bronze/silver/gold, CDC + Outbox, DQ and lineage, privacy-by-design), the platform reliably supplies business, compliance, and ML with quality data at predictable SLOs and a well-understood cost of ownership.