GH GambleHub

Working with Historical Data

1) Purpose and principles

Purpose: Store and process past states so that reports, models, and investigations are reproducible, accurate, and compliant.

Principles:
  • Time-aware by design-Explicit time models in schemas and queries.
  • Reproducibility: The same report for date D always produces the same result.
  • Auditability: lineage, immutable layers, WORMs where needed.
  • Cost-aware: archive layers, compression, cold storage with understandable SLAs.
  • Privacy-by-design: PII management for retrospective transactions and legal requests.

2) Time models

Event-time: time of the actual event (rate, deposit).
Processing-time when the system has processed the record (may differ).
Bitemporal: storing both event- and processing-time for retroactive edits.
Validity intervals: 'valid _ from', 'valid _ to', 'is _ current'.

As-of queries: data sampling "as they knew at the time of T."

Field Template:
sql event_time TIMESTAMP, -- event time processed_at TIMESTAMP, -- TIMESTAMP valid_from processing time, -- start of version validity valid_to TIMESTAMP, -- end of validity (NULL if current)
is_current   BOOLEAN

3) Storage layers and formats

Lakehouse: Bronze (raw append-only) → Silver (clean/SCD/normalization) → Gold (showcases).
ACID-форматы: Delta/Iceberg/Hudi (MERGE/Upsert, time-travel, snapshots).
Tiered storage: hot/warm/cold + WORM for regulatory artifacts.
Partitioning: 'event _ date', 'market', 'tenant'; clustering/Z-order by frequent predicates (user/game/provider).

4) Measurement History (SCD)

SCD I: overwrite - for non-critical edits.
SCD II: full story; recommended for RG/KYC/traffic channels/game attributes.
SCD III: "before/after" - rare comparison cases.

Example SCD II:
sql
MERGE INTO dim. users_scd t
USING stage. users u
ON t. user_pseudo_id = u. user_pseudo_id AND t. is_current = TRUE
WHEN MATCHED AND (t. rg_status <> u. rg_status OR t. country <> u. country) THEN
UPDATE SET is_current = FALSE, valid_to = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (user_pseudo_id, country, rg_status, valid_from, valid_to, is_current)
VALUES (u. user_pseudo_id, u. country, u. rg_status, CURRENT_TIMESTAMP, NULL, TRUE);

5) Fact story: snapshots and bitemporal

Snapshots: a snapshot of end-of-day/month aggregates (such as wallet balance) - speed up the re-creation of historical reports.
Bitemporal facts: fix event-time and processing-time to distinguish late fixes from retrospective calculations.
Exactly-once history: dedup by 'event _ id' + idempotent MERGE.

6) Time-travel and reproducibility

Time-travel: reading tables "at the time of T" for debugging, incidents, reconciliations.
Logic versioning: transformation artifacts (SQL/DBT versions, containers) and "logic_version" labels in output tables.
Frozen outputs: Gold reporting artifacts are captured and not rewritten, hash and export log are available.

Example of as-of request:
sql
SELECT
FROM silver. fact_bets VERSION AS OF 1678901234567
WHERE event_date = DATE '2025-10-31';

7) Backfill и Reprocessing

Backfill: Primary/preload historical range.
Reprocessing: recalculation after fixing bugs or changing business rules.

Gardereils:
  • Idempotency (MERGE/upsert), ranges, quotas, dry-run with metric comparison.
  • Marking the result: 'recalc _ reason', 'logic _ version', 'reprocessed _ at'.
Runbook:

1. Freeze current Gold; 2) DLQ/DQ verification; 3) Silver run; 4) comparison of metrics; 5) Gold rebuild; 6) publication and signature.

8) Reconciliation

Checksums: reconciliation of sales volumes/quantities with OLTP, PSP/providers.
Loop check: independent pipeline on sample (A/B comparison).
Tolerances such as GGR ≤ 0 discrepancy. 2% for the day.

SQL samples:
sql
-- Duplicates
SELECT transaction_id, COUNT() c
FROM silver. payments
GROUP BY transaction_id
HAVING COUNT() > 1;

-- Unknown Currencies/Markets
SELECT p. currency
FROM silver. payments p
LEFT JOIN ref. currencies r ON r. code = p. currency
WHERE r. code IS NULL;

9) Currencies, time, calendar: historical correctness

FX on the date of the event: fix 'fx _ rate _ used' and 'fx _ source'.
Local Market Time: DST/Timezones via Calendar Directory.
Holidays/seasonality: a separate calendar table, used in models and reports.

FX normalization example:
sql
SELECT p. transaction_id,
p. amount_orig,
r. rate AS fx_rate_used,
p. amount_orig r. rate AS amount_base,
r. fx_source
FROM bronze. payment_events p
JOIN dim. fx_rates r
ON r. date = DATE(p. event_time) AND r. ccy_from = p. currency AND r. ccy_to = 'EUR';

10) PII, Compliance and Legal Hold

PII minimization: pseudonymization, separate protected mapping.
DSAR/RTBF: computable projections and selective edits of historical layers; legal storage exceptions are documented.
Legal Hold: flags "freeze" deletions on ranges/objects, WORM for reportable artifacts.
Audit: immutable access and export logs.

11) DQ and lineage for history

DQ-as-code (example):
yaml table: silver. fact_bets slo:
completeness_percent: 99. 5 freshness_minutes: 60 rules:
- name: unique_bet type: unique columns: [bet_id]
severity: critical
- name: market_known type: in_set column: market set_ref: ref. markets
- name: ts_in_range type: temporal expression: "event_time BETWEEN date_sub(now(), interval 5 year) AND now()"

Lineage: fix versions of inputs/transformations/outputs; dependency graph is required for retroformations.

12) Performance and cost

Partitioning: by date/market/tenant; aggressive clustering by 'user _ pseudo _ id '/' game _ id', if we filter often.
Formats: Parquet + statistics/compression; regular VACUUM/OPTIMIZE.
Materialization: precompute for "expensive" historical aggregations; snapshots for quarterly/annual reporting.
Archiving: converting old batches to cold storage (SLAs for recovery are documented).
Sampling: only for research tasks, not for regulation/finance.

13) Historical Features for ML

Feature registry: each feature has a formula, owner, SLO, 'model _ version'.
Online/offline consistency: one transformation codebase, repeatability tests.
Characteristic drift: PSI/KS by period, storage of historical distributions.

14) Query patterns

As-of: reproducibility of reports.
Cohort analysis: cohorts of registrations/first deposits, rolling windows.
Slowly changing facts: корректные join’ы с SCD II (`event_time BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31')`).

Example of join'a with SCD II:
sql
SELECT b. bet_id, u. rg_status
FROM silver. fact_bets b
JOIN dim. users_scd u
ON u. user_pseudo_id = b. user_pseudo_id
AND b. event_time >= u. valid_from
AND (u. valid_to IS NULL OR b. event_time < u. valid_to);

15) Processes and RACI

R (Responsible): Data Engineering (models/SCD/backfill), Data Platform (ACID/archive), Finance/Compliance (reconciliations/storage requirements).
A (Accountable): Head of Data/CDO.
C (Consulted): Legal/DPO (DSAR/RTBF/Legal Hold), SRE (cost/SLA), Architecture.
I (Informed): BI/Product/Marketing/Operations.

16) Implementation Roadmap

MVP (3-5 weeks):

1. ACID tables with time-travel (Delta/Iceberg/Hudi) and basic partitioning.

2. SCD II for key dimensions (users/games/providers).

3. Daily snapshots of critical aggregates (GGR Daily).

4. DQ-as-code (uniqueness/in_set/temporal) + lineage-graph.

Phase 2 (5-10 weeks):
  • Bitemporal facts, as-of API/SQL templates, runbooks backfill/reprocessing.
  • FX/calendar/DST-enrichment, OLTP↔DWH/provaydery reconciliations.
  • Cold storage archiving, WORM for reporting packages, Legal Hold.
Phase 3 (10-16 weeks):
  • Complete automation of "replay & what-if," comparison of metrics and regression alerts.
  • Historical features and drift control ML, chargeback on storage cost.
  • Documentation "as-of" metrics and reproducible reports.

17) Pre-sale checklist

  • Tables support time-travel; VACUUM/RETENTION policies are consistent.
  • SCD II is implemented for critical measurements; join's tested.
  • Pictures of key units on D/M are available and checked with sparkles.
  • DQ rules are active; lineage displays inputs/outputs and logic versions.
  • DSAR/RTBF/Legal Hold tested on historical layers.
  • Cold storage archiving and recovery documented and verified.
  • Cost/GB, cold share, recovery SLA

18) Frequent mistakes and how to avoid them

No explicit time model: add event/processing/validity.
FX "retroactive": always the course at the time of the event, store 'fx _ source'.
Invalid join's with SCD: use validity interval, not'is _ current'.
Mutating Gold showcases: reportable outputs must be immutable (or versioned).
No lineage/DQ: no provability and checkpoints - enter them from day one.
Unmanageable cost: turn off hot parties, vacuum, convert to cold.

19) Glossary

As-of Query - data request "as they looked at the time of T."

Bitemporal - simultaneous fixation of event and processing time.
Snapshot - materialized snapshot of status/aggregates at the end of the period.
Time-travel - reading historical versions of tables.
WORM - Write Once Read Many.

20) The bottom line

Working with historical data is not just "long storage," but the discipline of time: explicit event/processing/bitemporal, SCD and snapshots models, reproducible as-of requests, rigorous reconciliations and compliance controls, observability, and a cost-effective storage architecture. By following this guide, you will have a solid historical foundation for reporting, analytics and ML that is resilient to audit and changes in business logic.

Contact

Get in Touch

Reach out with any questions or support needs.We are always ready to help!

Telegram
@Gamble_GC
Start Integration

Email is required. Telegram or WhatsApp — optional.

Your Name optional
Email optional
Subject optional
Message optional
Telegram optional
@
If you include Telegram — we will reply there as well, in addition to Email.
WhatsApp optional
Format: +country code and number (e.g., +380XXXXXXXXX).

By clicking this button, you agree to data processing.