GH GambleHub

Data normalization

1) Purpose

Normalization eliminates duplicates and anomalies of updates, sets uniform directories and keys, makes data consistent and cheap to maintain. In iGaming, this is critical for GGR/NGR, AML/RG analyses, regulatory reporting, antifraud, and ML.

2) Where we normalize

Bronze (raw): not normalized - storage as is (append-only) for forensics.
Silver (clean/conform): basic normalization (3NF/BCNF, directories, keys, SCD).
Gold (serve): target storefronts - controlled denormalization for reading/BI is possible.

3) Basic principles

1. Schema-first-All tables have explicit schemas and keys.
2. Single identifiers: 'user _ pseudo _ id', 'session _ id', 'game _ id', 'provider _ id', 'transaction _ id'.
3. Common directories: currencies, markets/jurisdictions, KYC/RG statuses, game providers, traffic channels.
4. Time and currency: store 'event _ time' (UTC) and normalized 'amount _ base' + 'fx _ source'.
5. Evolution: semantic versions, only compatible changes without "silent" breaks.
6. PII minimization: user - via pseudo-ID; mapping is stored separately, access is restricted.

4) Normal forms quickly

1NF: atomic values, no arrays in columns (arrays → child-tables).
2NF-Attributes depend on the entire compound key.
3NF: no transitive dependencies (attribute depends only on key).
BCNF: each determinant is a key. Use for "kernel" (payments/gameplay).

Practice: Silver-models of payments and gaming activity keep at least 3NF; more stringent BCNF - for reference books and reference tables.

5) Reference domain model (Silver)

5. 1 Reference books

`dim. users' (pseudo-ID, country, age range, RG statuses).
`dim. games' (game_id, provider_id, genre, RTP, volatility).
`dim. providers' (provider_id, type, license).
`dim. markets' (jurisdiction code, regulator).
`dim. fx_rates` (date, ccy_from, ccy_to, rate, fx_source).

5. 2 Facts (narrow event/transaction tables)

`fact. payments` (transaction_id, user_pseudo_id, amount_orig, currency, amount_base, market, event_time, psp_ref, method).
`fact. bets` (bet_id, user_pseudo_id, game_id, stake_base, stake_ccy, outcome, event_time).
`fact. payouts` (payout_id, user_pseudo_id, game_id, amount_base, event_time).

Links: facts ↔ guides on stable keys. We duplicate all amounts in the "source currency" and in the "base" (amount_base), fixing 'fx _ source'.

6) Slowly changing measurements (SCD)

Type I (overwrite): spelling/non-critical corrections.
Type II (history): 'valid _ from/valid _ to/is _ current', audit changes (for example, RG status changes).
Type III (alternative column): "before/after" for short comparisons.

Recommendation: for RG/KYC/marketing channel - SCD II; for game reference books (RTP) - SCD II with impact validation.

Example of SCD II (simplified):
sql
CREATE TABLE dim. users_scd (
user_pseudo_id STRING,
country STRING,
rg_status STRING,
valid_from TIMESTAMP,
valid_to  TIMESTAMP,
is_current BOOLEAN
);

7) Deduplication and Keys

Surrogate keys (BIGINT/UUID) for internal links.
Natural keys (for example, 'transaction _ id' from PSP) - to be validated and stored separately.
Dedup by '(event_id, source)' to ingest + by business keys in Silver.

Payment deduction (example):
sql
CREATE TABLE silver. payments AS
SELECT EXCEPT(rn) FROM (
SELECT p., ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY event_time) rn
FROM bronze. payment_events p
)
WHERE rn = 1;

8) Currency standardization and timezones

'event _ time '- always UTC; for shop windows, add the locale/timezone of the market.
Currencies: 'amount _ orig' and 'amount _ base' (for example, EUR) + 'fx _ source', 'fx _ rate _ used'.
Daily fixing of courses: 'dim. fx_rates' with source and hash signature.

Normalization of amounts (example):
sql
SELECT t. transaction_id,
t. amount_orig,
t. currency,
r. rate AS fx_rate_used,
t. amount_orig r. rate AS amount_base
FROM bronze. payment_events t
JOIN dim. fx_rates r
ON r. date = DATE(t. event_time) AND r. ccy_from = t. currency AND r. ccy_to = 'EUR';

9) Consistency of reference books

Unified directory register (games, providers, markets, currencies).
DQ validators: 'in _ set', FK references, uniqueness, consistency of SCD.
Autogeneration of "thin" dimencias from external sources (game providers, countries, PSP).

10) When to denormalize

Denormalization is allowed in Gold for:
  • stable "wide" reports (GGR, risk showcases);
  • Acceleration of BI queries/dashboards
  • realtime storefronts (ClickHouse/Pinot) under SLA readings.
Rules:
  • Silver remains the source of truth.
  • Denormalized fields - calculated/copied from Silver; versioning logic.
  • Any denormalization is documented and tested for correctness.

11) Star and snowflake model

Star: one fact + flat measurements - easier and faster reading, more expensive writing/matching.
Snowflake: measurements are normalized (connected subdirectories) - fewer duplicates, more complex queries.

Recommendation: in Gold more often "star," in Silver - normalized "snowflakes."

12) Evolution of schemes (safe changes)

Back-compatible: adding nullable columns; new reference values with flags.
Breaking: renaming/typing/semantic shifts - only through '/v2 'and double entry for the migration period.
Contracts: JSON/Avro schemes in registry, consumer-tests for compatibility.

13) DQ controls for normalization

Minimum set:
  • Keys are unique: 'transaction _ id', 'bet _ id'.
  • Reference integrity: FK on 'dim.'.
  • Currencies: 'currency' from whitelist, 'fx _ rate _ used' not NULL, 'amount _ base> = 0'.
  • Time: 'event _ time' in a reasonable window; no "future" events.
  • SCD-correct: non-overlapping ranges' valid _ from/valid _ to '.

14) Examples of SQL models

Actual rates (3NF):
sql
CREATE TABLE silver. fact_bets (
bet_id STRING PRIMARY KEY,
user_pseudo_id STRING NOT NULL,
game_id STRING NOT NULL,
stake_ccy DECIMAL(18,2) NOT NULL,
currency CHAR(3) NOT NULL,
stake_base DECIMAL(18,2) NOT NULL,
market CHAR(2) NOT NULL,
event_time TIMESTAMP NOT NULL
);
Star for GGR (Gold):
sql
CREATE VIEW gold. ggr_daily AS
SELECT
DATE(b. event_time) AS event_date,
m. market,
g. provider_id,
SUM(b. stake_base) AS stakes_eur,
SUM(p. amount_base) AS payouts_eur,
SUM(b. stake_base) - SUM(p. amount_base) AS ggr_eur
FROM silver. fact_bets b
LEFT JOIN silver. fact_payouts p
ON p. user_pseudo_id = b. user_pseudo_id
AND p. game_id = b. game_id
AND DATE(p. event_time) = DATE(b. event_time)
JOIN dim. markets m ON m. code = b. market
JOIN dim. games g  ON g. game_id = b. game_id
GROUP BY 1,2,3;

15) Privacy and compliance

Aliasing a user in Silver; connection with real ID - in a separate protected circuit.
RLS/CLS and field masking (e-mail/PAN not available in analytics).
Regionalization of directories/keys, DPO control for schema extension.

16) Observability and lineage

Data lineage from Bronze → Silver → Gold, version of transformations and contracts.
Metrics: completeness, validity, FK errors, duplicates, "holes" in time, request cost.
Alerts at breaks in directories and FX sources.

17) RACI

R: Data Engineering (Silver/Gold models), Data Platform (circuit register, DQ).
A: Head of Data/Architecture.
C: Compliance/DPO (PII/retention), Finance (FX/GGR), Risk (RG/AML).
I: BI/Product/Marketing/Operations.

18) Implementation Roadmap

MVP (2-4 weeks):

1. Directory register (markets, currencies, providers, games).

2. Silver-models' fact. payments`, `fact. bets', 'dim.' (3HF), SCD II for'dim. users`.

3. Currency normalization/time zone, basic DQ rules (FK/uniqueness/in_set).

4. First Gold Showcase (GGR Daily) and reconciliation tests.

Phase 2 (4-8 weeks):
  • Expansion of SCD, coverage of game events, provider conformal models.
  • Schema compatibility autotests, migration simulator, metadata catalog.
  • Key/party optimization, clustering/Z-order.
Phase 3 (8-12 weeks):
  • Denormalization policies for Gold, SLA/value; star/snowflake templates.
  • Auto-generation of documentation, lineage graph in dashboards.
  • Regional directories and encryption keys, DR exercises.

19) Quality checklist

  • Single keys and directories are approved.
  • Silver in 3NF, SCD applied to "slow" measurements.
  • Currencies/timezones are normalized; 'fx _ source' is fixed.
  • DQ rules (FK/uniqueness/range/in_set) are active.
  • Denormalizations documented, correctness tests passed.
  • Linage and freshness/fullness metrics are visible on dashboards.

20) Frequent mistakes and how to avoid them

PII blending in analytics: separate mappings, use CLS/RLS.
Insufficient normalization of Silver: lead to 3NF, otherwise expensive support and reconciliation errors.

FX "per report": Rates should be captured on an event, not "backdated."

No SCD for key dimensions: lost RG/KYC/channel history.
Gold renormalization: redundant joins → managed denormalization.
Opaque evolution of schemes: use registry and consumer-tests.

21) The bottom line

Normalization is a Silver-level discipline: uniform keys and reference books, 3NF/BCNF for facts and measurements, correct history (SCD) and standardization of time/currencies. With such a "skeleton," Gold cases become predictable, reports are comparable, and the cost of ownership is controlled.

Contact

Get in Touch

Reach out with any questions or support needs.We are always ready to help!

Telegram
@Gamble_GC
Start Integration

Email is required. Telegram or WhatsApp — optional.

Your Name optional
Email optional
Subject optional
Message optional
Telegram optional
@
If you include Telegram — we will reply there as well, in addition to Email.
WhatsApp optional
Format: +country code and number (e.g., +380XXXXXXXXX).

By clicking this button, you agree to data processing.