Data enrichment
1) Purpose and business value
Enrichment turns "raw" events into useful facts by adding context and signs:- Finance/reporting: fx-normalization of amounts, binding to markets/tax rates, calculation of GGR/NGR.
- Compliance/AML/RG: risk scores, sanctions/PEP tags, RG limits, behavioral characteristics.
- Marketing/product: traffic sources, segments, missions/quests, personalization.
- SRE/operations: geo/ASN for traffic, client/device type, feature flags and releases.
The key result is improved accuracy of models, quality of reports and speed of decision-making.
2) Enrichment sources (example catalogue)
Reference/catalogs: games, providers, markets/jurisdictions, currencies, tax tables, holiday calendar.
KYC/KYB/RG: verification levels, statuses, self-exclusions, limits, age groups.
AML/sanctions/PEP: screening hits, lists, risk levels.
Networks and devices: IP→geo/ASN, device/OS/browser, device fingerprint.
Payment providers (PSP): BIN tables, methods, MCC, risk tags.
FX/time: exchange rates at the event date, local time zones/DST.
Content and marketing: sources/campaigns/UTM, affiliates, segments.
Models and heuristics: pre-trained scoring, embedding, categorical mapping.
3) Types of enrichment
Lookup-joint: point mapping by key (game_id, BIN, ip_range, user_pseudo_id).
Dimension attach: attaching dimensions (dim.) to facts.
Derived fields: computed columns (amount_base, local_time, tax_rate).
Aggregations/velocity: counters for windows (N rates/min, amount of deposits/hour).
Risk/behavioral features: "time since last event," share-of-wallet, nightly activity.
Geo/ASN/Device: country code, region, operator, device/browser type.
Semantic mappings: provider/game classification, player clusters.
ML Features for Online/Offline Modeling (Feature Store)
4) Where to enrich: Batch vs Stream
Stream (real-time): antifraud, RG triggers, SRE alerts - p95 delays ≤ 2-5 s; lookup to caches (Redis/Scylla), asynchronous requests to providers with timeouts.
Batch (micro-batch/daily): Gold showcases (GGR/RG/AML), reconciliations, reports - stability and completeness are more important than latency.
Hybrid: fast online feature + nightly re-enrichment (reconciliation/accuracy).
5) Architectural reference
1. Bronze - raw events (append-only).
2. Silver (clean/conform) - normalization, keys, primary lookup 'and (fx, geo, dim.) .
3. Enrichment Layer - extended characteristics, window aggregates, risk labels.
4. Feature Store - characteristic register (online/offline consistency).
5. Gold - showcases for BI/regulator/models; immutable artifacts.
6. Services - API/GraphQL, reported exports, real-time alerts.
Components: Kafka/Redpanda, Flink/Spark/Beam, Redis/Scylla (lookup), ClickHouse/Pinot (live reading), Lakehouse (Delta/Iceberg/Hudi).
6) Contracts and schemes
Schema-first: 'event _ time', 'schema _ version', stable keys (user_pseudo_id, game_id, transaction_id).
Enrichment marks: 'enrichment. version`, `enrichment. sources`, `fx_source`, `geo_source`, `model_version`.
Versioning: new features are added as nullable; breaking changes - through '/v2 'and double entry.
7) Enrichment examples (SQL/pseudocode)
7. 1 FX normalization and local time
sql
SELECT p.transaction_id,
p.amount_orig,
p.currency,
r.rate AS fx_rate_used,
p.amount_orig r.rate AS amount_base,
p.event_time,
convert_timezone(m.tz, 'UTC', p.event_time) AS local_time,
r.fx_source
FROM bronze.payment_events p
JOIN dim.fx_rates r
ON r.date = DATE(p.event_time) AND r.ccy_from = p.currency AND r.ccy_to = 'EUR'
JOIN dim.markets m ON m.code = p.market;
7. 2 Geo/ASN by IP (pseudocode)
python geo = geo_db.lookup(ip)
asn = asn_db.lookup(ip)
record["geo_country"] = geo.country record["asn"] = asn.number record["enrichment"]["geo_source"] = "mmdb:2025-10-01"
7. 3 Window signs of deposit speed (stream)
sql
SELECT user_pseudo_id,
TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS win_start,
COUNT() AS deposits_10m,
SUM(amount_base) AS deposits_sum_10m
FROM silver.payments
GROUP BY user_pseudo_id, TUMBLE(event_time, INTERVAL '10' MINUTE);
7. 4 Interfacing with RG limits
sql
SELECT b., r.daily_deposit_limit, r.self_exclusion
FROM silver.bets b
LEFT JOIN dim.rg_limits r USING (user_pseudo_id);
8) Quality of enrichment (DQ)
Minimum rules:- FX: 'fx _ rate _ used' not NULL, 'fx _ source' from whitelist, calculated 'amount _ base ≥ 0'.
- Geo/ASN: the share of successful lookups ≥ 98% (by market), 'country' in the directory.
- RG/AML labels: 'valid _ from/valid _ to' (SCD II) do not intersect; the absence of "holes" in history.
- Aggregates/windows: correctness of windows (no double counting), completeness ≥ 99. 5%.
- Model versions: 'model _ version' present, feature drift control.
yaml table: enriched.payments rules:
- name: fx_present type: not_null column: fx_rate_used severity: critical
- name: country_known type: in_set column: geo_country set_ref: ref.countries severity: major
- name: rg_scd_valid type: scd_validity columns: [valid_from, valid_to]
severity: major
9) Privacy and compliance
PII minimization: enrich by pseudo-ID, real identifiers - in a separate loop.
Geo-localization and residency: routing by region (EEA/UK/BR), separate encryption keys.
DSAR/RTBF: enriched projections must support "concealment "/redaction; keep the legal basis for exceptions.
Legal Hold: freeze deletions for reportable artifacts/cases.
10) Observability and lineage
Linage: from raw event → lookup/aggregates → displays/models; fix source versions ('fx _ source', 'geo _ source', 'bin _ source').
SLI/SLO: freshness p95 (Silver) ≤ 15 мин; successful geo-lookups ≥ 98%; The proportion of records with completed key characteristics ≥ 99% latency enrich stream p95 ≤ 2-5 s.
Dashboards: heat map completeness by sources, map of versions of reference books/models, monitor of "expensive" joins, drift of signs.
11) Cost and performance
Caches/materialization: frequent lookup 'and in Redis/Scylla; periodic snapshots.
Compact signs: store aggregates (not "raw" lists); use Parquet/column formats.
Partitioning: by date/market/tenant; clustering by frequently filtered fields.
Adaptive frequency: heavy enrich-jobs - at night; realtime - critical only.
Chargeback: cost/query and cost/GB accounting by team/feature.
12) Patterns and anti-patterns
Patterns:- Dimension Lookup + SCD II for RG/KYC/providers.
- Async Enrichment with timeouts and fallback (label "unknown" + repeat).
- Feature Store with online/offline negotiation and repeatability tests.
- Rule-as-Code for enrichment (threshold/categorical maps).
- Rigid binding to external APIs in a hot path without cache.
- Unlabeled source versions ('fx _ source', 'geo _ source').
- Denormalization "all with everything" in Silver (cost/complexity explosions).
- Introduction of PII into analytical layers.
13) Processes and RACI
R (Responsible): Data Engineering (pipelines enrich/stream), Domain Owners (feature semantics), MLOps (Feature Store).
A (Accountable): Head of Data / Chief Data Officer.
C (Consulted): Compliance/Legal/DPO, Finance (FX/налоги), Risk (RG/AML), SRE.
I (Informed): BI/Product/Marketing/Operations.
14) Implementation Roadmap
MVP (2-4 weeks):1. Enrichment source catalogue (fx, geo, markets, RG/KYC).
2. Silver-normalization + basic lookup 'and (fx/geo/dim.) .
3. The first aggregates are velocity (deposits/rates) and enriched. v1 tables.
4. Dashboard completeness/freshness, source versions.
Phase 2 (4-8 weeks):- Connection of sanctions/PEP/BCL, PSP BIN tables, device fingerprint.
- Feature Store (core features) + online cache, realtime enrichment Flink.
- DQ rules for enrich layer, lineage and dry-run simulations.
- Personalization (missions/quests) and RG/AML detectors online.
- Value management (quotas, materialization, Z-order), multi-region.
- Automatic generation of feature and catalog documentation.
15) Quality checklist before sale
- Consistent keys and schemas, source versions signed.
- DQ rules on fx/geo/RG/sanctions/windows; alerts and SLOs.
- Caches/timeouts and fallbacks for external lookups.
- Lineage and cost/performance dashboards.
- DSAR/RTBF/Legal Hold procedures for enriched tables.
- Documentation of characteristics (owner, formulas, SLO, impact).
16) Frequent mistakes and how to avoid them
Untagged reference/model versions: Always fix '_ source' and 'model _ version'.
Calculating fx "retroactively": use the rate at the time of the event; store the FX source.
PII blending: Tokenize and isolate mappings.
Double counting in units: check windows and dedup.
Synchronous external calls without cache: enter async + cache/retrai.
There is no repeatability of features: a single online/offline transformation code, compliance tests.
17) Glossary (brief)
Lookup/Dimension attach - attach the reference book to the fact by key.
Feature Store - characteristics register and serving for ML.
SCD II - measurement historization with validity intervals.
FX - exchange rates and normalization of amounts.
ASN - autonomous network system; useful for anti-fraud and geo-analytics.
18) The bottom line
Enrichment is the discipline of turning events into knowledge: consistent keys and schemes, controlled lookup and aggregates, versioned sources, default privacy, DQ and observability. By following the described patterns, you will receive reproducible, economical and compliant showcases and signs ready for reporting, personalization and real-time risk detectors.