Real-time analytics
1) Purpose and business value
Real-time analytics (RTA) provides reactions in seconds, not hours:
- AML/Anti-fraud: deposit structuring, velocity attacks, risky transactions.
- Responsible Gaming (RG): exceeding limits, risk patterns, self-exclusion.
- SRE/Operations: early detection of SLA degradation, error bursts, cluster overheating.
- Product and marketing: personalization triggers, missions/quests, real-time segmentation.
- Operational reporting: near-real-time GGR/NGR, dashboards of halls/providers.
Targets: p95 end-to-end 0.5–5 s, completeness ≥ 99.5%, availability ≥ 99.9%.
2) Reference architecture
1. Ingest/Edge — `/events/batch` (HTTP/2/3), gRPC, OTel Collector; schema validation, anti-duplication, geo-routing.
2. Event bus - Kafka/Redpanda (partitioned by `user_id/tenant/market`, DLQ, retention 3-7 days).
3. Stream processing - Flink/Spark Structured Streaming/Beam: stateful operators, CEP, watermarks, allowed lateness, dedup.
4. Online enrichment - Redis/Scylla/ClickHouse lookups (RG limits, KYC, BIN→MCC, IP→Geo/ASN), asynchronous calls with timeouts and fallbacks.
5. Serving - ClickHouse/Pinot/Druid (operational data marts at 1-5 min latency), Feature Store (online features), webhooks/ticketing/SOAR.
6. Lakehouse - Bronze/Silver/Gold for long-term consolidation, replay and reconciliation.
7. Observability - pipeline metrics, tracing (OTel), logs, lineage and cost dashboards.
3) Signals and taxonomy
Payments: `payment.deposit/withdraw/chargeback`.
Gaming: `game.bet/payout`, sessions.
Authentication and behavior: `auth.login/failure`, device switch, velocity.
Operations: latency, error rate, pod restarts, saturation.
Compliance: sanctions screening, RG flags, DSAR events.
Each type has a domain owner, a schema, a freshness SLO, and a late data policy.
4) Windows, watermarks and late data
Windows: tumbling (fixed), hopping, session.
Watermark: the "known up to this time" boundary (usually 2-5 min).
Late events: emit corrections, flag `late = true`; route heavily delayed events to the DLQ.
```sql
SELECT user_id,
       TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS win_start,
       COUNT(*) AS deposits_10m,
       SUM(amount_base) AS sum_10m
FROM stream.payments
GROUP BY user_id, TUMBLE(event_time, INTERVAL '10' MINUTE);
```
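The same event-time logic can be sketched in plain Python: a toy tumbling-window accumulator with a watermark that routes too-late events to a corrections path. This is an illustration of the semantics, not a Flink replacement; field names (`event_time`, `amount_base`) follow the SQL above, the rest are hypothetical.

```python
from collections import defaultdict

WINDOW_MS = 10 * 60 * 1000  # 10-minute tumbling window

def window_start(event_time_ms: int) -> int:
    """Align an event-time timestamp to the start of its tumbling window."""
    return event_time_ms - (event_time_ms % WINDOW_MS)

def aggregate(events, watermark_lag_ms=2 * 60 * 1000):
    """Count/sum deposits per (user_id, window); events whose window is fully
    behind the watermark go to a late/corrections path instead."""
    agg = defaultdict(lambda: {"deposits": 0, "sum": 0.0})
    max_ts = 0
    late = []
    for e in events:
        max_ts = max(max_ts, e["event_time"])
        watermark = max_ts - watermark_lag_ms
        if e["event_time"] < window_start(watermark):
            late.append(e)  # would be emitted as a correction or sent to DLQ
            continue
        key = (e["user_id"], window_start(e["event_time"]))
        agg[key]["deposits"] += 1
        agg[key]["sum"] += e["amount_base"]
    return dict(agg), late
```

Note that the watermark only advances with observed event time, so an out-of-order event inside the allowed lag is still counted normally.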
5) CEP and stateful aggregations
Keys: `user_id`, `device_id`, `payment.account_id`.
State: sliding counters/sums, bloom filters for deduplication, TTL.
CEP patterns: structuring (amounts below threshold, ≥N times within window T), device switch, RG fatigue.
```python
# Pseudocode: structuring rule over a 10-minute sliding window
recent = deposits(user_id, last_minutes=10)
if (len(recent) >= 3
        and sum(d.amount for d in recent) > THRESH
        and all(d.amount < REPORTING_THRESHOLD for d in recent)):
    emit_alert("AML_STRUCTURING", user_id, snapshot())
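The rule above can be made concrete as a small stateful detector that keeps a per-user sliding window of deposits. A minimal sketch, assuming in-memory state and illustrative thresholds (the real values are regulatory/tenant configuration, not these numbers):

```python
from collections import deque

class StructuringDetector:
    """Sliding-window structuring check: N or more deposits, each under the
    reporting threshold, whose sum exceeds a limit within the window."""

    def __init__(self, window_ms=600_000, min_count=3,
                 sum_thresh=2_500.0, reporting_threshold=1_000.0):
        self.window_ms = window_ms
        self.min_count = min_count
        self.sum_thresh = sum_thresh
        self.reporting_threshold = reporting_threshold
        self.deposits = {}  # user_id -> deque[(event_time, amount)]

    def on_deposit(self, user_id, event_time, amount):
        q = self.deposits.setdefault(user_id, deque())
        q.append((event_time, amount))
        # Evict entries that fell out of the sliding window (state TTL).
        while q and q[0][0] <= event_time - self.window_ms:
            q.popleft()
        amounts = [a for _, a in q]
        if (len(amounts) >= self.min_count
                and sum(amounts) > self.sum_thresh
                and all(a < self.reporting_threshold for a in amounts)):
            return ("AML_STRUCTURING", user_id, list(q))
        return None
```

In production this state would live in the stream processor's keyed state backend with TTL, not in a Python dict.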
6) Exactly-Once, order and idempotence
At-least-once delivery on the bus + dedup by `event_id` in processing (TTL 24-72 h).
Order: partitioning by keys (local order is guaranteed).
Sink: transactional commits (2-phase) or idempotent upsert/merge.
Outbox/Inbox: transactional publishing of domain events from OLTP.
7) Online enrichment and Feature Store
Lookup: RG limits, KYC statuses, BIN→MCC, IP→Geo/ASN, markets/taxes, FX at the time of the event.
Asynchronous calls: sanctions-screening APIs with timeouts; on error, `unknown` + retry/cache.
Feature Store: online/offline consistency; a single transformation codebase.
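The timeout-plus-fallback pattern for async lookups can be sketched with `asyncio`: a cache hit skips the call, a timeout or connection error degrades to `unknown` rather than blocking the hot path. The function and field names are hypothetical:

```python
import asyncio

async def enrich(event, sanctions_lookup, cache, timeout_s=0.05):
    """Attach a sanctions status to an event via an async lookup with a
    hard timeout; fall back to 'unknown' on failure (retried out-of-band)."""
    user = event["user_id"]
    if user in cache:
        event["sanctions_status"] = cache[user]
        return event
    try:
        status = await asyncio.wait_for(sanctions_lookup(user), timeout_s)
        cache[user] = status
        event["sanctions_status"] = status
    except (asyncio.TimeoutError, ConnectionError):
        event["sanctions_status"] = "unknown"  # never block the pipeline
    return event
```

Downstream rules must treat `unknown` explicitly (e.g. hold a withdrawal for review) instead of silently assuming "clear".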
8) Real-time data marts and serving
ClickHouse/Pinot/Druid: per-second/per-minute aggregates, materialized views, latency SLA of 1-5 min.
API/GraphQL: low latency for dashboards/widgets.
Alerts: webhooks/Jira/SOAR with enriched context (trace_id, last events).
```sql
CREATE MATERIALIZED VIEW mv_ggr_1m
ENGINE = AggregatingMergeTree()
PARTITION BY toDate(ts_min)
ORDER BY (ts_min, market, provider_id) AS
SELECT toStartOfMinute(event_time) AS ts_min,
       market,
       provider_id,
       sumState(stake_base) AS s_stake,
       sumState(payout_base) AS s_payout
FROM stream.game_events
GROUP BY ts_min, market, provider_id;
```
9) Metrics, SLI/SLO and dashboards
Recommended SLIs/SLOs:
- p95 ingest→alert ≤ 2 s (critical rules), ≤ 5 s (others).
- Window completeness within T ≥ 99.5%; schema validity ≥ 99.9%; trace coverage ≥ 98%.
- Stream service availability ≥ 99.9%; late-ratio ≤ 1%.
- Lag per partition/topic; operator busy time; state size.
- Funnel "event→rule→case", precision/recall per domain.
- Heatmap of late/completeness; hot-key map.
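Turning raw pipeline counters into these SLIs is mostly arithmetic; a minimal sketch (counter names and the source of `expected`, e.g. source-side sequence numbers or reconciliation, are assumptions):

```python
def compute_slis(received: int, expected: int, late: int, duplicates: int) -> dict:
    """Derive completeness, late-ratio and dup-rate SLIs from counters and
    compare them against the SLO targets listed above."""
    completeness = received / expected if expected else 1.0
    late_ratio = late / received if received else 0.0
    dup_rate = duplicates / received if received else 0.0
    return {
        "completeness": round(completeness, 4),
        "late_ratio": round(late_ratio, 4),
        "dup_rate": round(dup_rate, 4),
        "completeness_ok": completeness >= 0.995,  # SLO: ≥ 99.5%
        "late_ok": late_ratio <= 0.01,             # SLO: ≤ 1%
    }
```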
10) Streaming DQ (Quality)
Ingest validations: schema/enums/size limits, anti-duplication.
In-stream: completeness/dup-rate/late-ratio, window correctness (no double counting).
Reaction policies: critical → DLQ + pager; major/minor → tagging + report.
```yaml
stream: payments
rules:
  - name: schema_valid
    type: schema
    severity: critical
  - name: currency_whitelist
    type: in_set
    column: currency
    set: [EUR, USD, GBP, TRY, BRL]
  - name: dedup_window
    type: unique
    keys: [event_id]
    window_minutes: 1440
```
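A rules file like this is only useful with an interpreter in the stream; a toy evaluator for a subset of the rule types (only `in_set` here, the `schema` and `unique` types would plug into the schema registry and the dedup store respectively) might look like:

```python
def apply_rules(event: dict, rules: list) -> list:
    """Evaluate declarative DQ rules against one event; returns
    (severity, rule_name) pairs for each violation."""
    violations = []
    for r in rules:
        if r["type"] == "in_set":
            if event.get(r["column"]) not in r["set"]:
                violations.append((r.get("severity", "major"), r["name"]))
        # 'schema' and 'unique' rule types are delegated to the schema
        # registry and the dedup state store in a real pipeline.
    return violations
```

Critical violations would route the event to the DLQ and page; major/minor ones tag the event and feed the DQ report.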
11) Privacy, security and residency
PII minimization: ID aliasing, sensitive field masking, PAN/IBAN tokenization.
Data residency: regional pipelines (EEA/UK/BR), individual KMS keys.
DSAR/RTBF: selective redaction in downstream marts; Legal Hold for cases/reports.
Audit: immutable logs of access/rule changes, release logging.
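ID aliasing and PAN masking from the list above can be sketched as follows: a keyed HMAC gives a deterministic pseudonym (stable for joins, irreversible without the per-region key, and destroying the key supports RTBF for derived data), and masking keeps only BIN and last four digits. A sketch with illustrative parameters:

```python
import hashlib
import hmac

def alias_user_id(user_id: str, secret_key: bytes, region: str = "eea") -> str:
    """Deterministic pseudonymization via HMAC-SHA256 with a per-region key;
    truncated to 16 hex chars for readability in marts (an assumption)."""
    mac = hmac.new(secret_key, f"{region}:{user_id}".encode(), hashlib.sha256)
    return mac.hexdigest()[:16]

def mask_pan(pan: str) -> str:
    """Keep only the BIN (first 6) and last 4 digits of a card number."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    return f"{digits[:6]}{'*' * (len(digits) - 10)}{digits[-4:]}"
```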
12) Economics and productivity
Sharding/keys: avoid "hot" keys (salting/composite keys), balanced partitions.
State: TTL, compact snapshots, RocksDB/state-backend tuning.
Pre-aggregation: reduce early in the pipeline for noisy topics.
Sampling: only for non-critical metrics (never transactions/compliance).
Chargeback: per-topic/per-job budgets, quotas for replay and heavy queries.
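Key salting, mentioned above as the fix for hot keys, is mechanically simple: append a small random suffix so one hot key spreads over several sub-partitions, then merge the partial aggregates back by the unsalted key in a second stage. A sketch (the separator and salt count are assumptions):

```python
import random

N_SALTS = 8  # number of sub-partitions a hot key is spread across

def salted_key(user_id: str) -> str:
    """Spread writes for one key over N_SALTS sub-keys; downstream
    aggregates must be re-merged by the unsalted key."""
    return f"{user_id}#{random.randrange(N_SALTS)}"

def unsalt(key: str) -> str:
    """Recover the original key for the second-stage merge."""
    return key.rsplit("#", 1)[0]
```

The cost is a second aggregation stage, which is usually far cheaper than a single skewed partition stalling the whole job.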
13) Processes and RACI
R: Streaming Platform (infrastructure/releases), Domain Analytics (rules/features), MLOps (scoring/Feature Store).
A: Head of Data/Risk/Compliance by domain.
C: DPO/Legal (PII/retention), SRE (SLO/incidents), Architecture.
I: Product, Support, Marketing, Finance.
14) Implementation Roadmap
MVP (2-4 weeks):
1. Kafka/Redpanda + 2 critical topics (for example, `payments`, `auth`).
2. Flink job with watermark, deduplication and 1 CEP rule (AML or RG).
3. Operational showcase at ClickHouse/Pinot (1-5 min), lag/completeness dashboards.
4. Incident channel (webhooks/Jira), basic SLOs and alerts.
Phase 2 (4-8 weeks):
- Online enrichment (Redis/Scylla), Feature Store, asynchronous lookups.
- Rules managed as code, canary/A-B, streaming DQ.
- Regional pipelines, DSAR/RTBF procedures, Legal Hold for cases.
- Multi-region active-active, replay & what-if simulator, auto-threshold calibration.
- Gold streaming marts (GGR/RG/AML), near-real-time reporting.
- Cost-dashboards, chargeback, DR-exercises.
15) Examples (fragments)
Flink CEP — device switch:

```sql
MATCH_RECOGNIZE (
  PARTITION BY user_id
  ORDER BY event_time
  MEASURES
    FIRST(A.device_id) AS d1,
    LAST(B.device_id) AS d2,
    COUNT(*) AS cnt
  PATTERN (A B+)
  DEFINE
    B AS B.device_id <> PREV(B.device_id) AND B.ip_asn <> PREV(B.ip_asn)
) MR
```
Kafka Streams - idempotent filter:

```java
if (seenStore.putIfAbsent(eventId, now()) == null) {
    context.forward(event);
}
```
16) Pre-production checklist
- Schemas/contracts in the Registry; back-compat tests green.
- Watermark/allowed lateness, dedup and DLQ enabled.
- SLOs and alerts configured (lag/late/dup/state size).
- Enrichment with caches and timeouts; fallback to "unknown".
- RBAC/dual-control on rules/models; change log enabled.
- Documentation of rules/marts; runbooks for replay/rollback.
17) Frequent mistakes and how to avoid them
Ignoring event time: without watermarks, metrics "drift."
No deduplication: false alerts, double counting.
Hot keys: partition skew → salting/resharding.
Synchronous external APIs in the hot path: async + cache only.
Unmanaged cost: pre-aggregation, state TTL, quotas, cost monitoring.
No simulator: rollouts without replay → regression.
18) The bottom line
Real-time analytics is not "fast BI" but a managed pipeline with contracts, stateful logic, CEP, watermarks, online enrichment, and strict SLOs. With these practices in place, the platform delivers accurate signals and decisions within seconds while maintaining compliance, product scenarios, and operational resilience at a controlled cost.