Working with Historical Data
1) Purpose and principles
Purpose: Store and process past states so that reports, models, and investigations are reproducible, accurate, and compliant.
Principles:- Time-aware by design-Explicit time models in schemas and queries.
- Reproducibility: The same report for date D always produces the same result.
- Auditability: lineage, immutable layers, WORMs where needed.
- Cost-aware: archive layers, compression, cold storage with understandable SLAs.
- Privacy-by-design: PII management for retrospective transactions and legal requests.
2) Time models
Event-time: time of the actual event (rate, deposit).
Processing-time when the system has processed the record (may differ).
Bitemporal: storing both event- and processing-time for retroactive edits.
Validity intervals: 'valid _ from', 'valid _ to', 'is _ current'.
As-of queries: data sampling "as they knew at the time of T."
Field Template:sql event_time TIMESTAMP, -- event time processed_at TIMESTAMP, -- TIMESTAMP valid_from processing time, -- start of version validity valid_to TIMESTAMP, -- end of validity (NULL if current)
is_current BOOLEAN
3) Storage layers and formats
Lakehouse: Bronze (raw append-only) → Silver (clean/SCD/normalization) → Gold (showcases).
ACID-форматы: Delta/Iceberg/Hudi (MERGE/Upsert, time-travel, snapshots).
Tiered storage: hot/warm/cold + WORM for regulatory artifacts.
Partitioning: 'event _ date', 'market', 'tenant'; clustering/Z-order by frequent predicates (user/game/provider).
4) Measurement History (SCD)
SCD I: overwrite - for non-critical edits.
SCD II: full story; recommended for RG/KYC/traffic channels/game attributes.
SCD III: "before/after" - rare comparison cases.
sql
MERGE INTO dim. users_scd t
USING stage. users u
ON t. user_pseudo_id = u. user_pseudo_id AND t. is_current = TRUE
WHEN MATCHED AND (t. rg_status <> u. rg_status OR t. country <> u. country) THEN
UPDATE SET is_current = FALSE, valid_to = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (user_pseudo_id, country, rg_status, valid_from, valid_to, is_current)
VALUES (u. user_pseudo_id, u. country, u. rg_status, CURRENT_TIMESTAMP, NULL, TRUE);
5) Fact story: snapshots and bitemporal
Snapshots: a snapshot of end-of-day/month aggregates (such as wallet balance) - speed up the re-creation of historical reports.
Bitemporal facts: fix event-time and processing-time to distinguish late fixes from retrospective calculations.
Exactly-once history: dedup by 'event _ id' + idempotent MERGE.
6) Time-travel and reproducibility
Time-travel: reading tables "at the time of T" for debugging, incidents, reconciliations.
Logic versioning: transformation artifacts (SQL/DBT versions, containers) and "logic_version" labels in output tables.
Frozen outputs: Gold reporting artifacts are captured and not rewritten, hash and export log are available.
sql
SELECT
FROM silver. fact_bets VERSION AS OF 1678901234567
WHERE event_date = DATE '2025-10-31';
7) Backfill и Reprocessing
Backfill: Primary/preload historical range.
Reprocessing: recalculation after fixing bugs or changing business rules.
- Idempotency (MERGE/upsert), ranges, quotas, dry-run with metric comparison.
- Marking the result: 'recalc _ reason', 'logic _ version', 'reprocessed _ at'.
1. Freeze current Gold; 2) DLQ/DQ verification; 3) Silver run; 4) comparison of metrics; 5) Gold rebuild; 6) publication and signature.
8) Reconciliation
Checksums: reconciliation of sales volumes/quantities with OLTP, PSP/providers.
Loop check: independent pipeline on sample (A/B comparison).
Tolerances such as GGR ≤ 0 discrepancy. 2% for the day.
sql
-- Duplicates
SELECT transaction_id, COUNT() c
FROM silver. payments
GROUP BY transaction_id
HAVING COUNT() > 1;
-- Unknown Currencies/Markets
SELECT p. currency
FROM silver. payments p
LEFT JOIN ref. currencies r ON r. code = p. currency
WHERE r. code IS NULL;
9) Currencies, time, calendar: historical correctness
FX on the date of the event: fix 'fx _ rate _ used' and 'fx _ source'.
Local Market Time: DST/Timezones via Calendar Directory.
Holidays/seasonality: a separate calendar table, used in models and reports.
sql
SELECT p. transaction_id,
p. amount_orig,
r. rate AS fx_rate_used,
p. amount_orig r. rate AS amount_base,
r. fx_source
FROM bronze. payment_events p
JOIN dim. fx_rates r
ON r. date = DATE(p. event_time) AND r. ccy_from = p. currency AND r. ccy_to = 'EUR';
10) PII, Compliance and Legal Hold
PII minimization: pseudonymization, separate protected mapping.
DSAR/RTBF: computable projections and selective edits of historical layers; legal storage exceptions are documented.
Legal Hold: flags "freeze" deletions on ranges/objects, WORM for reportable artifacts.
Audit: immutable access and export logs.
11) DQ and lineage for history
DQ-as-code (example):yaml table: silver. fact_bets slo:
completeness_percent: 99. 5 freshness_minutes: 60 rules:
- name: unique_bet type: unique columns: [bet_id]
severity: critical
- name: market_known type: in_set column: market set_ref: ref. markets
- name: ts_in_range type: temporal expression: "event_time BETWEEN date_sub(now(), interval 5 year) AND now()"
Lineage: fix versions of inputs/transformations/outputs; dependency graph is required for retroformations.
12) Performance and cost
Partitioning: by date/market/tenant; aggressive clustering by 'user _ pseudo _ id '/' game _ id', if we filter often.
Formats: Parquet + statistics/compression; regular VACUUM/OPTIMIZE.
Materialization: precompute for "expensive" historical aggregations; snapshots for quarterly/annual reporting.
Archiving: converting old batches to cold storage (SLAs for recovery are documented).
Sampling: only for research tasks, not for regulation/finance.
13) Historical Features for ML
Feature registry: each feature has a formula, owner, SLO, 'model _ version'.
Online/offline consistency: one transformation codebase, repeatability tests.
Characteristic drift: PSI/KS by period, storage of historical distributions.
14) Query patterns
As-of: reproducibility of reports.
Cohort analysis: cohorts of registrations/first deposits, rolling windows.
Slowly changing facts: корректные join’ы с SCD II (`event_time BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31')`).
sql
SELECT b. bet_id, u. rg_status
FROM silver. fact_bets b
JOIN dim. users_scd u
ON u. user_pseudo_id = b. user_pseudo_id
AND b. event_time >= u. valid_from
AND (u. valid_to IS NULL OR b. event_time < u. valid_to);
15) Processes and RACI
R (Responsible): Data Engineering (models/SCD/backfill), Data Platform (ACID/archive), Finance/Compliance (reconciliations/storage requirements).
A (Accountable): Head of Data/CDO.
C (Consulted): Legal/DPO (DSAR/RTBF/Legal Hold), SRE (cost/SLA), Architecture.
I (Informed): BI/Product/Marketing/Operations.
16) Implementation Roadmap
MVP (3-5 weeks):1. ACID tables with time-travel (Delta/Iceberg/Hudi) and basic partitioning.
2. SCD II for key dimensions (users/games/providers).
3. Daily snapshots of critical aggregates (GGR Daily).
4. DQ-as-code (uniqueness/in_set/temporal) + lineage-graph.
Phase 2 (5-10 weeks):- Bitemporal facts, as-of API/SQL templates, runbooks backfill/reprocessing.
- FX/calendar/DST-enrichment, OLTP↔DWH/provaydery reconciliations.
- Cold storage archiving, WORM for reporting packages, Legal Hold.
- Complete automation of "replay & what-if," comparison of metrics and regression alerts.
- Historical features and drift control ML, chargeback on storage cost.
- Documentation "as-of" metrics and reproducible reports.
17) Pre-sale checklist
- Tables support time-travel; VACUUM/RETENTION policies are consistent.
- SCD II is implemented for critical measurements; join's tested.
- Pictures of key units on D/M are available and checked with sparkles.
- DQ rules are active; lineage displays inputs/outputs and logic versions.
- DSAR/RTBF/Legal Hold tested on historical layers.
- Cold storage archiving and recovery documented and verified.
- Cost/GB, cold share, recovery SLA
18) Frequent mistakes and how to avoid them
No explicit time model: add event/processing/validity.
FX "retroactive": always the course at the time of the event, store 'fx _ source'.
Invalid join's with SCD: use validity interval, not'is _ current'.
Mutating Gold showcases: reportable outputs must be immutable (or versioned).
No lineage/DQ: no provability and checkpoints - enter them from day one.
Unmanageable cost: turn off hot parties, vacuum, convert to cold.
19) Glossary
As-of Query - data request "as they looked at the time of T."
Bitemporal - simultaneous fixation of event and processing time.
Snapshot - materialized snapshot of status/aggregates at the end of the period.
Time-travel - reading historical versions of tables.
WORM - Write Once Read Many.
20) The bottom line
Working with historical data is not just "long storage," but the discipline of time: explicit event/processing/bitemporal, SCD and snapshots models, reproducible as-of requests, rigorous reconciliations and compliance controls, observability, and a cost-effective storage architecture. By following this guide, you will have a solid historical foundation for reporting, analytics and ML that is resilient to audit and changes in business logic.