Indexing Analytical Repositories
1) Why indexing an iGaming platform
Analytics speed: reports on GGR/NET, conversions, RG/AML and A/B experiments fit into SLA.
Cost: fewer bytes to scan → lower calculation/warehouse bill.
Reliability: stable p95/p99 dashboard latency and API metrics.
Scale: dozens of brands/markets/PSP/providers without "full scan" hellish value.
2) Load model (before indexing)
Факты: `payments`, `game_rounds`, `sessions`, `bonus_events`.
Dimensions: 'dim _ user' (without PII), 'dim _ provider', 'dim _ psp', 'dim _ country'.
Requests: "last N days," aggregations by 'brand/country/provider/psp', filters by status field, join by surrogate-keys, search by JSON attributes (payment method, device), top-K/percentile.
We select indices based on selectivity, cardinality and frequency of use.
3) Types of indices and when to take them
3. 1 Classics
B-tree: equality/ranges for highly selective columns ('user _ surrogate _ id', 'occurred _ at', 'amount').
Hash: pure equality; less often in analytics (against ranges weak).
Bitmap: low cardinality and frequent connected filters ('country', 'kyc _ level', 'rg _ state', 'brand'). Different for summing masks.
3. 2 Column specificity
Min-max (data skipping): automatic statistics "minimum/maximum" in parquet stripes/parts → the engine skips blocks. Works better when sorted by filtered fields.
Bloom indexes: quick probabilistic tests of belonging to a value in a block (useful for 'user _ id', 'transaction _ id', 'psp').
BRIN (Block Range Index): cheap "pointers" to block ranges if the data is naturally ordered (time). Cheap but effective for time-series.
3. 3 Advanced/Specialized
GiST/GIN (inverted): JSON/arrays/text, filters by nested attributes ('metadata. method = 'Papara'`, `device. os in [...]`).
Join/Projection (ClickHouse/MPP): materials for accelerating join/agg (pre-join key is stored next to the fact, preliminary aggregations).
Vector (ANN): search for similar embeddings (recommendations/anti-fraud behavior) - IVF/HNSW/Flat as "nearest neighbor index."
Z-ordering/Z-order (lakehouse/Databricks )/Cluster keys (Snowflake )/ORDER BY (ClickHouse): multidimensional clustering of data on disk for better data skipping.
4) Partitioning, sorting, clustering
Parties (date/country/brand): large (day/week) to avoid the "curse of small files." We select fields with high selectivity in WHERE/access rights.
Sorting within a party: 'ORDER BY (occurred_at, brand, psp)' or Z-order by '(brand, country, provider)' - this is how min-max and bloom work better.
Cluster/Recluster: periodic reclassification to maintain locality.
TTL and retention: automatic deletion of old parts/segments.
5) Materialized views and projections
MV for hot slices: 'payments _ 7d _ by _ brand _ psp', 'rounds _ 1d _ by _ provider'. We support streaming upserts.
ClickHouse/Aggregate tables-Presets, roll-up levels (chas→den→nedelya).
Result cache: query result cache/warehouse result cache for repeatable dashboards (validated by query token and data freshness).
6) Semi-structured data (JSON/VARIANT)
Indexes by path: inverted/GIN index on json paths ('$ .device. os`, `$.psp. details. method`).
Materialization of important attributes in columns: for stable filters (payment method, device, application version).
Key statistics: collecting distributions for a selective plan.
7) Data Lakes: Iceberg/Delta/Hudi
Manifest indexes: metadata about parquet files (min-max, null-count, bloom) → partition pruning + file skipping.
File compression/merging: regular merge of small files into the "optimal" size (128-1024 MB).
Clustering/Z-order: repackaging files for correlating fields (e.g. 'brand, country, occurred _ at').
Delete/Update indexes: position deltas and bloom to speed up merge-on-read.
8) How to choose indices: practical checklist
1. Collect the top N requests (90% of the load) → the/join/group filter fields.
2. For each field, evaluate the selectivity 'sel = 1 - distinct (value )/rows' and cardinality.
3. Batch by time + 1-2 measurements with stable filters/accesses.
4. Sort/cluster keys to match filters and join keys.
5. Add bloom for point id, bitmap for low cardinality.
6. Hot aggregations → MV/projections.
7. JSON paths → inverted indexes + materialization.
8. On the lakes - compaction and clustering on a schedule.
9. Enter SLO: p95 latency, bytes scanned/request, skipped data rate.
9) Support and maintenance
ANALYZE/statistics: update cardinalities and histograms; otherwise, the optimizer is "blind."
VACUUM/OPTIMIZE/RECLUSTER: defragmentation and reclassification.
Monitoring the use of indexes: "covering rate," "unused index list," "bytes scanned/bytes skipped."
Auto-advisers: periodic recommendations for cluster keys and sorting based on query log.
Regression tests: before depleting new keys - comparing request profile and cost.
10) Metrics and SLO indexing
Technical: p95/p99 latency, scanned bytes/query, skipped bytes%, files touched, cache hit-rate.
Economy: $/request, $/dashboard, $/TB scan.
Operations: compaction time, reclassification queue, share of "small files."
Quality of plans: the proportion of queries using indices/projections, the accuracy of cardinalities.
11) iGaming cases (ready-made recipes)
11. 1 Payments/PSPs: Drops/Refusals
Party: 'by day'. Sort '(brand, country, occurred_at)'.
Bloom: `transaction_id`, `user_id`. Bitmap: `psp`, `status`.
MV: `payments_7d_by_brand_psp(status, declines)`.
Result: p95 ↓ with 8. 2s to 1. 1s, scanned bytes ↓ на 87%.
11. 2 Game Rounds: Provider/Game
Z-order / ORDER BY: `(provider, game_id, occurred_at)`.
Projection/agg: `rounds_1d_by_provider_game`.
BRIN (if Postgres-like storage): by 'occurred _ at'.
Result: top-K games/hour - sub-second on hot cache.
11. 3 RG/AML Restriction/Self Exclusion Events
Bitmap: `rg_state`, `kyc_level`. JSON-path GIN: `$.reason`.
MV: "active restrictions for 30 days" + materialization of the user-level without PII.
The result: quick samples for compliance without full scan billion events.
11. 4 Antifraud: routes and devices
Materializing JSON→kolonki: 'device. os`, `device. model`, `payment. method`.
Bloom: `graph_device_id`. Cluster: `(brand, country, device. os)`.
Vector index: embeddings "7d deposit behavior" → fast k-NN for similar anomalies.
12) Security and privacy
Zero-PII in indexed fields and plan logs.
On-disk encryption: Indexes/statistics are encrypted in the same way as data.
K-anonymity of aggregates: MV/projections are published only by ≥N groups.
Geo/tenant-isolation: parties/keys include 'brand/country/license'.
Legal Hold: indices/manivests also fall into the "freeze."
13) Anti-patterns
Index "all in a row" → volume explosion and write-amplification.
Small parties (hour/minutes) → storm of planks and "small files."
Sort keys that do not match the zero data skipping → filters.
Lack of statistics → bad plans, full scan.
JSON without path indices and without materializing hot attributes.
Ignore compaction and recluster → degradation in 2-4 weeks.
14) Templates (ready to use)
14. 1 Clustering/Indexing Policy (YAML)
yaml dataset: gold. payments partition_by: ["date"]
order_by: ["brand","country","occurred_at"]
indexes:
bloom: ["transaction_id","user_surrogate_id"]
bitmap: ["psp","status","rg_state"]
materialized_views:
- name: mv_payments_7d_brand_psp group_by: ["brand","psp","status"]
window: "7d"
slo:
p95_latency_ms: 1200 scanned_bytes_per_query_max_mb: 256 maintenance:
compact_small_files: true recluster_cron: "0 /6 "
privacy:
pii_in_index: false
14. 2 Lake Compaction Plan (Iceberg/Delta)
yaml compaction:
target_file_size_mb: 512 small_file_threshold_mb: 64 zorder_by: ["brand","country","occurred_at"]
run_every: "PT6H"
max_concurrency: 4
14. 3 Indexes for JSON fields
sql
-- GIN/inverted index on device attributes
CREATE INDEX idx_device_json ON gold. sessions
USING GIN ((device_json));
-- Materialization of critical pathways
ALTER TABLE gold. sessions ADD COLUMN device_os TEXT;
UPDATE gold. sessions SET device_os = device_json->>'os';
CREATE BITMAP INDEX idx_device_os ON gold. sessions(device_os);
14. 4 Index Monitoring SLOs
yaml monitoring:
skipped_bytes_share_min: 0. 70 index_usage_rate_min: 0. 85 stats_freshness_max_hours: 24 small_files_share_max: 0. 10
15) Implementation Roadmap
0-30 days (MVP)
1. Collecting top N requests and scan profiles.
2. Partitioning by date + sort matched with filters.
3. Enable data skipping (min-max) and bloom for id fields.
4. One MV for hot metric (payments 7d).
5. Dashboard SLI: p95, scanned bytes, skipped share, small files.
30-90 days
1. JSON paths: inverted indexes + materialization.
2. Lake: Compaction and Z-order/clustering by 2-3 keys.
3. Key/Projection Auto Advisor; regular ANALYZE.
4. Revision of batches (day→week) where "small files."
3-6 months
1. MV/projection catalog with versioning and SLA.
2. Vector indices for recommendations/anti-fraud.
3. Unified SLO policy and budgets $/request; degradation alerts.
4. Index privacy audit, geo/tenant isolation.
16) RACI
Data Platform (R): parties/indexes/compacts, auto-advisers, monitoring.
Analytics/BI (R): MV/projections for dashboards, query profiling.
Domain Owners (C): criteria for hot slices and filters.
Security/DPO (A/R): privacy, PII policies, geo/tenant keys.
SRE/Observability (C): SLO/alerting, kapasiti for compactions.
Finance (C): budgets $/query and savings from indexes.
17) Related Sections
Data Schemas and Evolution, Data Validation, DataOps Practices, Anomaly and Correlation Analysis, Analytics and Metrics API, Data Clustering, Dimension Reduction, MLOps: Model Exploitation.
Total
Analytical storage indexing is a strategy, not "create an index on everything." Correct partitions and sorting, data skipping and bloom, thoughtful MV/projections and regular compression give fast and predictable queries at a controlled cost and without risk to privacy. For iGaming, this means operational solutions for payments, providers and RG/AML - within SLA and budget.