Data flow architecture
1) Purpose and principles
Objectives: deliver correct, timely, and compliant data for analytics, reporting, anti-fraud, personalization, and ML.
Principles:
- Data as a Product: clear owners, contracts, SLOs, and versioning.
- Schema-first: schemas are mandatory; evolution follows defined rules.
- Privacy-by-Design: PII minimization, pseudonymization, access control.
- Observability-by-Default: traces, metrics, lineage, quality profiles.
- Cost-aware: tiered storage, sampling of noisy events, compression.
2) Source and Event Landscape
Transactional: deposits/withdrawals, bets/payouts, bonuses, chargebacks.
User: sessions, clicks, conversions, RG limits, KYC statuses.
Operational: application logs, performance metrics, alerts.
Providers: PSP/KYC/sanctions/game studios (aggregators).
Reference: game catalogs, country/currency directories, tariffs/taxes.
Example canonical event:
```json
{
  "event_time": "2025-10-31T19:20:11Z",
  "event_type": "payment.deposit",
  "schema_version": "1.3.0",
  "user": {"id": "U-123", "country": "EE", "age_band": "18-24"},
  "payment": {"amount": 200.00, "currency": "EUR", "method": "card", "psp_ref": "PSP-222"},
  "ctx": {"ip": "198.51.100.10", "session_id": "s-2233", "trace_id": "f4c2..."}
}
```
3) High-level reference architecture
1. Ingest layer
Gateways (HTTP/gRPC), CDC connectors (from OLTP), queues/buses (Kafka/Redpanda), telemetry collectors.
Validation, normalization, PII redaction at the edge, contract enforcement.
2. Streaming layer
Stream jobs (Flink/Spark Structured Streaming/Beam) with deduplication, watermarks, stateful aggregates.
Fan-out to storage and online services (feature store, anti-fraud).
3. Batch layer
Orchestration (Airflow/Dagster), incremental loads, backfills and reprocessing, SCD handling.
4. Storage (Lakehouse)
Bronze: raw events (append-only, immutable).
Silver: cleaned, conformed tables with quality checks and deduplication.
Gold: marts for specific use cases (BI/regulator/ML).
Table formats with ACID (Delta/Iceberg/Hudi), hot/warm/cold layering.
5. Serving and access
BI/SQL (Trino/Presto/DuckDB), semantic layer (metrics layer), API/GraphQL, Feature Store for online/offline consistency.
6. Governance and safety
Catalog/lineage, DQ rules, policy engine for access (RBAC/ABAC), masking/tokenization, WORM archive for reports.
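The PII redaction the ingest layer performs at the edge can be sketched as follows. This is a minimal illustration, not a production implementation: the function names (`pseudonymize`, `redact_event`) and the hardcoded key are assumptions, and a real system would fetch the key from KMS and cover many more fields.

```python
import hashlib
import hmac

# Hypothetical ingest-side redaction: replace the direct user identifier with a
# keyed pseudo-ID and coarsen the IP before the event reaches the bus.
SECRET_KEY = b"rotate-me-via-kms"  # illustrative only; in practice fetched from KMS

def pseudonymize(user_id: str) -> str:
    # HMAC keeps the pseudo-ID stable per user yet non-reversible without the key.
    return "P-" + hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

def redact_event(event: dict) -> dict:
    out = dict(event)
    user = dict(out.get("user", {}))
    if "id" in user:
        user["pseudo_id"] = pseudonymize(user.pop("id"))
    out["user"] = user
    ctx = dict(out.get("ctx", {}))
    if "ip" in ctx:
        ctx["ip"] = ".".join(ctx["ip"].split(".")[:3]) + ".0"  # truncate to /24
    out["ctx"] = ctx
    return out

evt = {"user": {"id": "U-123"}, "ctx": {"ip": "198.51.100.10"}}
clean = redact_event(evt)
```

Keeping the pseudo-ID deterministic lets downstream joins still work without any table ever holding the raw identifier.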
4) Contracts and schemas
Data contracts: OpenAPI/AsyncAPI/JSON Schema/Avro.
Evolution: semantic versioning; backward-compatible changes mean adding nullable fields; breaking changes only via `/v2` with dual-write during the migration period.
Registers: Schema Registry, domain directory (Payments, Gameplay, Marketing).
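Contract enforcement at ingest can be sketched with a hand-rolled field/type check. This is a toy stand-in under stated assumptions: the `CONTRACT` shape and `validate` helper are invented for illustration; a real pipeline would validate against a Schema Registry with Avro or JSON Schema.

```python
# Minimal contract check for a payment.deposit-style event.
# A real deployment would pull the schema from a Schema Registry instead.
CONTRACT = {
    "event_time": str,
    "event_type": str,
    "schema_version": str,
    "payment": dict,
}

def validate(event: dict) -> list[str]:
    errors = []
    for field, typ in CONTRACT.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], typ):
            errors.append(f"bad type for {field}: expected {typ.__name__}")
    # Backward-compatible evolution: unknown extra fields are allowed, never rejected.
    return errors

ok = validate({"event_time": "2025-10-31T19:20:11Z", "event_type": "payment.deposit",
               "schema_version": "1.3.0", "payment": {"amount": 200.0}})
bad = validate({"event_type": 42})
```

Rejecting on missing/mistyped fields while tolerating unknown ones is exactly what makes nullable-field additions backward compatible.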
5) Integration patterns
CDC (Change Data Capture): from OLTP to the bus (Debezium), partitioning by domain key.
Outbox/Inbox: guaranteed delivery of events emitted by domain logic.
Exactly-Once/Effectively-Once: transactional state, idempotent sinks, deduplication keys.
Late Data & Watermarks: handling late events; windows with allowed lateness.
Reprocessing: idempotent pipelines, time-travel, snapshot fixes.
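The Outbox pattern above can be sketched in a few lines. This is an in-memory SQLite illustration, not a reference implementation: table names and the `deposit`/`relay` helpers are assumptions, and a real relay would be Debezium or a polling publisher pushing to Kafka.

```python
import json
import sqlite3

# Outbox sketch: the domain change and the event are committed in ONE transaction,
# so a crash can never lose the event or publish it without the state change.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (user_id TEXT PRIMARY KEY, balance REAL);
CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0);
INSERT INTO accounts VALUES ('U-123', 0.0);
""")

def deposit(user_id: str, amount: float) -> None:
    with conn:  # single transaction for state change + outbox row
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE user_id = ?",
                     (amount, user_id))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"event_type": "payment.deposit",
                                  "user_id": user_id, "amount": amount}),))

def relay() -> list[dict]:
    # A separate process polls unpublished rows and pushes them to the bus.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    conn.executemany("UPDATE outbox SET published = 1 WHERE id = ?",
                     [(r[0],) for r in rows])
    conn.commit()
    return [json.loads(r[1]) for r in rows]

deposit("U-123", 200.0)
events = relay()
```

The relay gives at-least-once delivery; combined with deduplication keys downstream it yields the effectively-once semantics listed above.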
6) Lakehouse model: bronze/silver/gold
Bronze (raw):
- Partitioned by time (event_date) and market (jurisdiction).
- Append-only; original payload retained for forensics.
Silver (cleaned):
- Normalized types, reference-data lookups, deduplication by `(event_id, event_time)`.
- FK verification, currency/timezone standardization, enrichment.
Gold (marts):
- Denormalized marts (GGR, RG scoring, LTV, cohort tables).
- Refresh SLAs, aggregates for BI and reporting.
7) Data Quality
Rules: schema validation, ranges, uniqueness, completeness, referential integrity.
Profiling: distributions, cardinality, feature drift.
Monitoring: p50/p95 pipeline latency, drop rate, error budget.
Degradation policy: automatic fallback (last good snapshot), alerts, and t-tests on metrics.
An example rule set for the Silver payments table:
```yaml
table: silver.payments
rules:
  - name: amount_positive
    type: range
    column: amount
    min: 0.01
  - name: currency_valid
    type: in_set
    column: currency
    set: [EUR, USD, GBP, TRY, BRL]
  - name: unique_tx
    type: unique
    columns: [transaction_id]
slo:
  freshness_minutes: 15
  completeness_percent: 99.5
```
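Rules of that shape can be executed by a small checker. The sketch below is a toy interpreter for the three rule types in the YAML example; the `run_rules` function and rule dictionaries are illustrative, not the API of any specific DQ tool.

```python
def run_rules(rows: list[dict], rules: list[dict]) -> dict[str, bool]:
    # Evaluates range / in_set / unique rules against in-memory rows.
    results = {}
    for rule in rules:
        if rule["type"] == "range":
            results[rule["name"]] = all(r[rule["column"]] >= rule["min"] for r in rows)
        elif rule["type"] == "in_set":
            results[rule["name"]] = all(r[rule["column"]] in rule["set"] for r in rows)
        elif rule["type"] == "unique":
            keys = [tuple(r[c] for c in rule["columns"]) for r in rows]
            results[rule["name"]] = len(keys) == len(set(keys))
    return results

rules = [
    {"name": "amount_positive", "type": "range", "column": "amount", "min": 0.01},
    {"name": "currency_valid", "type": "in_set", "column": "currency",
     "set": {"EUR", "USD", "GBP", "TRY", "BRL"}},
    {"name": "unique_tx", "type": "unique", "columns": ["transaction_id"]},
]
rows = [
    {"transaction_id": "t1", "amount": 200.0, "currency": "EUR"},
    {"transaction_id": "t2", "amount": 50.0, "currency": "XXX"},  # violates in_set
]
report = run_rules(rows, rules)
```

In production the same rule definitions would feed the degradation policy: a failed rule trips an alert and, where configured, the fallback to the last good snapshot.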
8) Privacy and compliance
PII minimization and masking: store a pseudo-ID; keep lookup mappings separate.
Regionalization: geo-local buckets/catalogs (EEA/UK/BR), data residency.
Legal operations: DSAR/RTBF (recomputable projections and selective edits), Legal Hold, immutable report archives.
Access logging: audit reads of sensitive tables, break-glass and JIT access.
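The "separate lookup mapping" approach makes RTBF tractable, as a minimal sketch shows. The `pii_vault` dictionary and `handle_rtbf` helper are hypothetical names standing in for a real mapping table and deletion workflow.

```python
# Analytics tables hold only pseudo-IDs; a single mapping table holds the PII.
# A right-to-be-forgotten request then touches exactly one place.
pii_vault = {"P-9f1c": {"email": "user@example.com", "name": "Alice"}}
events = [{"user_pseudo_id": "P-9f1c", "amount": 200.0}]  # no direct PII here

def handle_rtbf(pseudo_id: str) -> None:
    # Deleting the mapping orphans the pseudo-ID: events remain usable for
    # aggregate analytics but can no longer be linked back to a person.
    pii_vault.pop(pseudo_id, None)

handle_rtbf("P-9f1c")
```

The same separation is what makes the immutable (WORM) report archives compatible with RTBF: the archives never contained direct identifiers to begin with.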
9) Observability and management
Lineage: automatically trace dependencies from source to mart.
Pipeline metrics: throughput, lag, failure rate, cost/GB, cost/query.
Tracing (OTel): `trace_id` from applications is propagated into events → end-to-end request paths can be reconstructed.
Alerts: SLO budgets, freshness/volume/cardinality anomalies.
10) Access and security model
Data categories: public/internal/confidential/restricted.
Policies: row/column-level security; dynamic masking (PAN/IBAN/email).
Key management: KMS/CMK, at-rest/in-transit encryption, rotation.
Segregation of duties: separate roles of prod/analyst/admin/reviewer.
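Dynamic masking per role can be illustrated with a small sketch. The patterns and role names (`analyst`, `restricted_viewer`) are assumptions for the example; real engines apply column-level policies in the query layer rather than application code.

```python
# Role-aware dynamic masking sketch: analysts see masked values, a restricted
# role sees cleartext. Masking patterns are illustrative, not production-grade.
def mask_pan(pan: str) -> str:
    # Keep BIN (first 6) and last 4, star out the middle.
    return pan[:6] + "*" * (len(pan) - 10) + pan[-4:]

def mask_email(email: str) -> str:
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

def render(value: str, kind: str, role: str) -> str:
    if role == "restricted_viewer":
        return value  # unmasked access is itself audited (see access logging)
    if kind == "pan":
        return mask_pan(value)
    if kind == "email":
        return mask_email(value)
    return value

masked = render("4111111111111111", "pan", "analyst")
```

Keeping the decision in one policy function mirrors the RBAC/ABAC engine: the data never leaves storage unmasked for roles that do not need it.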
11) Data Mesh and Product Approach
Domains: Payments, Gameplay, Marketing, Risk, Compliance.
Data Product: owner, freshness SLA, field dictionary, tests, versions, consumption metric.
Contracts between domains: versioned, backward-compatible, consumer-driven tests.
12) Feature Store and ML flows
Feature registry: feature description, sources, transformations, SLO.
Online/offline consistency: a single transformation codebase; online materialization latency ≤ 200-500 ms.
Drift monitoring: PSI/KS, auto-alerts and model rollbacks, PII control.
Journal of experiments: metadata, versions, reproducibility, model maps.
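The PSI drift check mentioned above can be sketched directly. The 0.2 alert threshold is a common rule of thumb, assumed here rather than prescribed; tune it per feature.

```python
import math

# Population Stability Index (PSI) over pre-binned probability distributions.
# Assumed rule of thumb: PSI > 0.2 suggests significant drift and triggers an alert.
def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time distribution
identical = psi(baseline, baseline)    # no drift
shifted = psi(baseline, [0.10, 0.20, 0.30, 0.40])  # clear shift
```

A scheduled job would compute this per feature against the training distribution and page on-call (or trigger a model rollback) when the threshold is crossed.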
13) Finmodel and cost optimization
Partitioning and Z-order/Cluster by frequent predicates.
Cold storage and TTL for unused tables, VACUUM.
Materialized views only for stable query patterns.
Quotas and budgets for heavy jobs; chargeback by team.
14) Regional and multi-tenant topology
Multi-region active-active: replication of topics and tables, independent pipeline perimeters.
Failover/DR: RPO/RTO targets, orchestrator metadata snapshots, recovery drills.
Multi-tenancy: catalog/key/quota isolation, tenant_id tagging.
15) Processes and RACI (in brief)
R: Data Platform (ingest, storage, orchestration), Data Engineering (transformation).
A: Head of Data / Chief Data Officer.
C: Compliance/Legal/DPO, Architecture, SRE.
I: BI/Analytics, Product, Marketing, Finance.
16) SLO/SLI for flows
Freshness: p95 Silver latency ≤ 15 min; Gold (daily) ready by 06:00 local time.
Completeness: ≥ 99.5% of events within window T.
Validity: DQ check error rate < 0.5% of volume.
Serving availability: ≥ 99.9% for BI/Feature API.
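The freshness SLI can be computed with a nearest-rank p95 over per-event lag, as in this sketch. The function name and the sample lags are illustrative; a real SLI would be derived from pipeline metrics, not an in-memory list.

```python
import math
from datetime import timedelta

# Freshness SLI sketch: p95 of (processing time minus event time) over a batch,
# compared against the 15-minute Silver target. Numbers are illustrative.
def p95_lag_minutes(lags: list[timedelta]) -> float:
    ordered = sorted(lags)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank p95
    return ordered[idx].total_seconds() / 60

lags = [timedelta(minutes=m) for m in [1, 2, 2, 3, 3, 4, 5, 5, 6, 40]]
p95 = p95_lag_minutes(lags)
slo_met = p95 <= 15  # one straggler is enough to blow the p95 budget
```

Note how the percentile target punishes tail latency: nine fast events do not compensate for one 40-minute straggler, which is exactly the behavior you want from a freshness SLO.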
17) Table and partitioning templates
```sql
-- Bronze: deposit events
CREATE TABLE bronze.payment_deposits (
  event_time TIMESTAMP,
  event_id STRING,
  user_pseudo_id STRING,
  amount DECIMAL(18,2),
  currency STRING,
  psp_ref STRING,
  payload VARIANT
)
PARTITION BY DATE(event_time)
CLUSTER BY (currency);

-- Silver: normalized model
CREATE TABLE silver.payments AS
SELECT event_id,
       CAST(event_time AS TIMESTAMP) AS ts,
       user_pseudo_id,
       amount,
       currency,
       psp_ref
FROM bronze.payment_deposits
QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) = 1;
```
18) Orchestration and DevX
Infra-as-Code: pipeline repositories, tests, reviews, GitOps.
Data Contracts CI: schema linters, DQ tests before deploy.
Backfill framework: safe reprocessing with read/write constraints and idempotency.
Catalogs and templates: cookiecutter generators, best practices.
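Backfill idempotency usually comes down to partition-level overwrite, sketched below. The in-memory `store` and `backfill` helper are stand-ins for a table-format operation such as a partition overwrite or MERGE.

```python
# Idempotent backfill sketch: each run fully overwrites its target partitions
# instead of appending, so a retried or overlapping backfill cannot double-count.
store: dict[str, list[dict]] = {}  # partition key (event_date) -> rows

def backfill(partition: str, rows: list[dict]) -> None:
    store[partition] = list(rows)  # overwrite, never append

backfill("2025-10-31", [{"event_id": "e1", "amount": 200.0}])
backfill("2025-10-31", [{"event_id": "e1", "amount": 200.0}])  # safe rerun
total = sum(r["amount"] for rows in store.values() for r in rows)
```

Because the write is keyed by partition, the orchestrator can retry a failed backfill task blindly, which is what makes automated retro-processing safe.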
19) Implementation Roadmap
MVP (4-6 weeks):
1. Event bus + ingest of 2-3 key sources (OLTP CDC, API gateway).
2. Lakehouse Bronze/Silver, ACID format, catalog and basic DQ rules.
3. 1-2 Gold cases (daily GGR and conversion funnel).
4. Lag/completeness metrics, basic lineage, RBAC, and PII masking.
Phase 2 (6-12 weeks):
- Streaming jobs (p95 latency ≤ 5 min), Feature Store, RG/AML marts.
- Semantic metrics layer, reporting SLAs, cost dashboards.
- Regionalization (EEA/UK), DSAR/RTBF procedures, Legal Hold for artifacts.
- Data Mesh: product domains, consumer-driven contracts.
- ML operations with drift monitoring, online/offline auto-reconciliation.
- Automated impact analysis of schema changes and "what-if" cost modeling.
20) Frequent mistakes and how to avoid them
Raw payloads without schemas: adopt schema-first, a registry, and CI validation.
No deduplication: use event keys and idempotent sinks in Silver.
PII mixed with analytics: keep mappings separate and mask fields.
Gold without an owner: assign an owner, SLOs, and consumption metrics.
No reprocessing strategy: add time-travel, logic versioning, double-counting controls.
Uncontrolled cost: batching, compression, TTL, cost observability.
21) Glossary (brief)
CDC - change data capture from OLTP.
Outbox - publish domain events transactionally.
Watermark - an estimate of stream completeness for windowing.
Lakehouse - data lake + ACID tables.
Data Product - a productized unit of data with an owner and SLOs.
Feature Store - consistent serving of ML features.
22) Bottom line
The data flow architecture is a managed system of agreements: clear contracts, observability, security, and controlled cost. By following the patterns described (schema-first, bronze/silver/gold, CDC + Outbox, DQ and lineage, privacy-by-design), the platform reliably supplies business, compliance, and ML with quality data at predictable SLOs and a well-understood cost of ownership.