Data normalization
1) Purpose
Normalization eliminates duplicates and anomalies of updates, sets uniform directories and keys, makes data consistent and cheap to maintain. In iGaming, this is critical for GGR/NGR, AML/RG analyses, regulatory reporting, antifraud, and ML.
2) Where we normalize
Bronze (raw): not normalized - storage as is (append-only) for forensics.
Silver (clean/conform): basic normalization (3NF/BCNF, directories, keys, SCD).
Gold (serve): target storefronts - controlled denormalization for reading/BI is possible.
3) Basic principles
1. Schema-first-All tables have explicit schemas and keys.
2. Single identifiers: 'user _ pseudo _ id', 'session _ id', 'game _ id', 'provider _ id', 'transaction _ id'.
3. Common directories: currencies, markets/jurisdictions, KYC/RG statuses, game providers, traffic channels.
4. Time and currency: store 'event _ time' (UTC) and normalized 'amount _ base' + 'fx _ source'.
5. Evolution: semantic versions, only compatible changes without "silent" breaks.
6. PII minimization: user - via pseudo-ID; mapping is stored separately, access is restricted.
4) Normal forms quickly
1NF: atomic values, no arrays in columns (arrays → child-tables).
2NF-Attributes depend on the entire compound key.
3NF: no transitive dependencies (attribute depends only on key).
BCNF: each determinant is a key. Use for "kernel" (payments/gameplay).
Practice: Silver-models of payments and gaming activity keep at least 3NF; more stringent BCNF - for reference books and reference tables.
5) Reference domain model (Silver)
5. 1 Reference books
`dim. users' (pseudo-ID, country, age range, RG statuses).
`dim. games' (game_id, provider_id, genre, RTP, volatility).
`dim. providers' (provider_id, type, license).
`dim. markets' (jurisdiction code, regulator).
`dim. fx_rates` (date, ccy_from, ccy_to, rate, fx_source).
5. 2 Facts (narrow event/transaction tables)
`fact. payments` (transaction_id, user_pseudo_id, amount_orig, currency, amount_base, market, event_time, psp_ref, method).
`fact. bets` (bet_id, user_pseudo_id, game_id, stake_base, stake_ccy, outcome, event_time).
`fact. payouts` (payout_id, user_pseudo_id, game_id, amount_base, event_time).
Links: facts ↔ guides on stable keys. We duplicate all amounts in the "source currency" and in the "base" (amount_base), fixing 'fx _ source'.
6) Slowly changing measurements (SCD)
Type I (overwrite): spelling/non-critical corrections.
Type II (history): 'valid _ from/valid _ to/is _ current', audit changes (for example, RG status changes).
Type III (alternative column): "before/after" for short comparisons.
Recommendation: for RG/KYC/marketing channel - SCD II; for game reference books (RTP) - SCD II with impact validation.
Example of SCD II (simplified):sql
CREATE TABLE dim. users_scd (
user_pseudo_id STRING,
country STRING,
rg_status STRING,
valid_from TIMESTAMP,
valid_to TIMESTAMP,
is_current BOOLEAN
);
7) Deduplication and Keys
Surrogate keys (BIGINT/UUID) for internal links.
Natural keys (for example, 'transaction _ id' from PSP) - to be validated and stored separately.
Dedup by '(event_id, source)' to ingest + by business keys in Silver.
sql
CREATE TABLE silver. payments AS
SELECT EXCEPT(rn) FROM (
SELECT p., ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY event_time) rn
FROM bronze. payment_events p
)
WHERE rn = 1;
8) Currency standardization and timezones
'event _ time '- always UTC; for shop windows, add the locale/timezone of the market.
Currencies: 'amount _ orig' and 'amount _ base' (for example, EUR) + 'fx _ source', 'fx _ rate _ used'.
Daily fixing of courses: 'dim. fx_rates' with source and hash signature.
sql
SELECT t. transaction_id,
t. amount_orig,
t. currency,
r. rate AS fx_rate_used,
t. amount_orig r. rate AS amount_base
FROM bronze. payment_events t
JOIN dim. fx_rates r
ON r. date = DATE(t. event_time) AND r. ccy_from = t. currency AND r. ccy_to = 'EUR';
9) Consistency of reference books
Unified directory register (games, providers, markets, currencies).
DQ validators: 'in _ set', FK references, uniqueness, consistency of SCD.
Autogeneration of "thin" dimencias from external sources (game providers, countries, PSP).
10) When to denormalize
Denormalization is allowed in Gold for:- stable "wide" reports (GGR, risk showcases);
- Acceleration of BI queries/dashboards
- realtime storefronts (ClickHouse/Pinot) under SLA readings.
- Silver remains the source of truth.
- Denormalized fields - calculated/copied from Silver; versioning logic.
- Any denormalization is documented and tested for correctness.
11) Star and snowflake model
Star: one fact + flat measurements - easier and faster reading, more expensive writing/matching.
Snowflake: measurements are normalized (connected subdirectories) - fewer duplicates, more complex queries.
Recommendation: in Gold more often "star," in Silver - normalized "snowflakes."
12) Evolution of schemes (safe changes)
Back-compatible: adding nullable columns; new reference values with flags.
Breaking: renaming/typing/semantic shifts - only through '/v2 'and double entry for the migration period.
Contracts: JSON/Avro schemes in registry, consumer-tests for compatibility.
13) DQ controls for normalization
Minimum set:- Keys are unique: 'transaction _ id', 'bet _ id'.
- Reference integrity: FK on 'dim.'.
- Currencies: 'currency' from whitelist, 'fx _ rate _ used' not NULL, 'amount _ base> = 0'.
- Time: 'event _ time' in a reasonable window; no "future" events.
- SCD-correct: non-overlapping ranges' valid _ from/valid _ to '.
14) Examples of SQL models
Actual rates (3NF):sql
CREATE TABLE silver. fact_bets (
bet_id STRING PRIMARY KEY,
user_pseudo_id STRING NOT NULL,
game_id STRING NOT NULL,
stake_ccy DECIMAL(18,2) NOT NULL,
currency CHAR(3) NOT NULL,
stake_base DECIMAL(18,2) NOT NULL,
market CHAR(2) NOT NULL,
event_time TIMESTAMP NOT NULL
);
Star for GGR (Gold):
sql
CREATE VIEW gold. ggr_daily AS
SELECT
DATE(b. event_time) AS event_date,
m. market,
g. provider_id,
SUM(b. stake_base) AS stakes_eur,
SUM(p. amount_base) AS payouts_eur,
SUM(b. stake_base) - SUM(p. amount_base) AS ggr_eur
FROM silver. fact_bets b
LEFT JOIN silver. fact_payouts p
ON p. user_pseudo_id = b. user_pseudo_id
AND p. game_id = b. game_id
AND DATE(p. event_time) = DATE(b. event_time)
JOIN dim. markets m ON m. code = b. market
JOIN dim. games g ON g. game_id = b. game_id
GROUP BY 1,2,3;
15) Privacy and compliance
Aliasing a user in Silver; connection with real ID - in a separate protected circuit.
RLS/CLS and field masking (e-mail/PAN not available in analytics).
Regionalization of directories/keys, DPO control for schema extension.
16) Observability and lineage
Data lineage from Bronze → Silver → Gold, version of transformations and contracts.
Metrics: completeness, validity, FK errors, duplicates, "holes" in time, request cost.
Alerts at breaks in directories and FX sources.
17) RACI
R: Data Engineering (Silver/Gold models), Data Platform (circuit register, DQ).
A: Head of Data/Architecture.
C: Compliance/DPO (PII/retention), Finance (FX/GGR), Risk (RG/AML).
I: BI/Product/Marketing/Operations.
18) Implementation Roadmap
MVP (2-4 weeks):1. Directory register (markets, currencies, providers, games).
2. Silver-models' fact. payments`, `fact. bets', 'dim.' (3HF), SCD II for'dim. users`.
3. Currency normalization/time zone, basic DQ rules (FK/uniqueness/in_set).
4. First Gold Showcase (GGR Daily) and reconciliation tests.
Phase 2 (4-8 weeks):- Expansion of SCD, coverage of game events, provider conformal models.
- Schema compatibility autotests, migration simulator, metadata catalog.
- Key/party optimization, clustering/Z-order.
- Denormalization policies for Gold, SLA/value; star/snowflake templates.
- Auto-generation of documentation, lineage graph in dashboards.
- Regional directories and encryption keys, DR exercises.
19) Quality checklist
- Single keys and directories are approved.
- Silver in 3NF, SCD applied to "slow" measurements.
- Currencies/timezones are normalized; 'fx _ source' is fixed.
- DQ rules (FK/uniqueness/range/in_set) are active.
- Denormalizations documented, correctness tests passed.
- Linage and freshness/fullness metrics are visible on dashboards.
20) Frequent mistakes and how to avoid them
PII blending in analytics: separate mappings, use CLS/RLS.
Insufficient normalization of Silver: lead to 3NF, otherwise expensive support and reconciliation errors.
FX "per report": Rates should be captured on an event, not "backdated."
No SCD for key dimensions: lost RG/KYC/channel history.
Gold renormalization: redundant joins → managed denormalization.
Opaque evolution of schemes: use registry and consumer-tests.
21) The bottom line
Normalization is a Silver-level discipline: uniform keys and reference books, 3NF/BCNF for facts and measurements, correct history (SCD) and standardization of time/currencies. With such a "skeleton," Gold cases become predictable, reports are comparable, and the cost of ownership is controlled.