Streaming and streaming analytics
1) Purpose and value
The streaming loop enables on-the-fly decision making:
- Anti-fraud/AML: detection of deposit structuring, velocity attacks, provider anomalies.
- Responsible Gaming (RG): limit breaches, risk patterns, self-exclusion.
- Operations/SRE: SLA degradation, error bursts, early incident signals.
- Product/marketing: personalization events, missions/quests, real-time segmentation.
- Near-real-time reporting: GGR/NGR data marts, operational panels.
Target characteristics: p95 end-to-end latency 0.5-5 s, completeness ≥ 99.5%, managed cost.
2) Reference architecture
1. Ingest/Edge
`/events/batch` (HTTP/2/3), gRPC, OTel Collector.
Schema validation, anti-duplicates, geo-routing.
2. Event bus
Kafka/Redpanda (partitioned by `user_id`/`tenant`/`market`).
Retention 3-7 days, compression, a DLQ/"quarantine" topic for malformed messages.
3. Streaming
Flink / Spark Structured Streaming / Beam.
Stateful operators, CEP, watermarks, allowed lateness, deduplication.
Enrichment (Redis/Scylla/ClickHouse lookups), asynchronous I/O with timeouts.
4. Serving / operational data marts
ClickHouse/Pinot/Druid for minute/second aggregations and dashboards.
Feature Store (online) for scoring models.
Alert topics → SOAR/ticketing/webhooks.
5. Long-term storage (Lakehouse)
Bronze (raw), Silver (clean), Gold (serve) — Parquet + Delta/Iceberg/Hudi.
Replay/backtests, time-travel.
6. Observability
Pipeline metrics, tracing (OTel), logs, lineage.
3) Schemes and contracts
Schema-first: JSON/Avro/Protobuf + Registry, `schema_version` in every event.
Evolution: backward-compatible changes add new nullable fields; breaking changes go to `/v2` with dual publication.
Required fields: `event_time` (UTC), `event_id`, `trace_id`, `user.pseudo_id`, `market`, `source`.
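The required-field contract above can be enforced at the ingest edge. A minimal Python sketch, assuming events arrive as dicts with `user.pseudo_id` flattened to `user_pseudo_id` (that flattening, and the error-string format, are assumptions for illustration):

```python
from datetime import datetime, timedelta

# Required envelope fields from the contract in this section; `user.pseudo_id`
# is flattened to `user_pseudo_id` in this dict-based sketch.
REQUIRED = ("event_time", "event_id", "trace_id", "user_pseudo_id",
            "market", "source", "schema_version")

def validate_envelope(event: dict) -> list:
    """Return a list of contract violations; an empty list means the event passes."""
    errors = ["missing:" + f for f in REQUIRED if f not in event]
    ts = event.get("event_time")
    if isinstance(ts, str):
        try:
            parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
            if parsed.utcoffset() != timedelta(0):  # contract says UTC only
                errors.append("event_time:not_utc")
        except ValueError:
            errors.append("event_time:unparseable")
    return errors
```

Events that fail critical checks would be routed to the DLQ/quarantine topic rather than dropped.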
4) Windows, watermarks and late data
Windows:
- Tumbling, hopping, session.
- Watermark: the event-time "knowledge" threshold; e.g. 2-5 minutes.
- Late data: corrective re-emits flagged "late = true"; strongly lagging events go to the DLQ.
```sql
SELECT user_id,
       TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS win_start,
       COUNT(*) AS deposits_10m,
       SUM(amount_base) AS sum_10m
FROM stream.payments
GROUP BY user_id, TUMBLE(event_time, INTERVAL '10' MINUTE);
```
5) Stateful aggregations and CEP
Keys: `user_id`, `device_id`, `payment.account_id`.
State: sliding sums/counters, sessions, Bloom filters for deduplication.
CEP patterns: structuring (amounts below threshold, ≥ N times within window T), device switching, RG fatigue.
```python
# Pseudocode: AML structuring — many sub-threshold deposits in a short window
if (deposits.count(last="10min") >= 3
        and deposits.sum(last="10min") > THRESH
        and all(d.amount < REPORTING_THRESHOLD for d in deposits.last("10min"))):
    emit_alert("AML_STRUCTURING", user_id, window_snapshot())
```
6) Exactly-Once, order and idempotence
Bus: at-least-once delivery; partition keys preserve per-key order.
Idempotence: `event_id` + dedup state (TTL 24-72 h).
Sink: transactional commits (two-phase) or idempotent upsert/merge.
Outbox/Inbox: guaranteed publication of domain events from OLTP.
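The `event_id` + TTL dedup state above can be sketched in Python; an in-memory dict stands in for the real state backend, and the injectable `clock` is a testing convenience, not part of any real API:

```python
import time

class TtlDeduper:
    """In-memory sketch of `event_id` dedup state with TTL; a production job
    would keep this in the stream processor's state backend (e.g. RocksDB)
    with native state TTL instead of manual eviction."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable for tests
        self.seen = {}       # event_id -> first-seen timestamp

    def is_duplicate(self, event_id: str) -> bool:
        now = self.clock()
        # Lazily evict expired entries so state stays bounded.
        for k in [k for k, t in self.seen.items() if now - t > self.ttl]:
            del self.seen[k]
        if event_id in self.seen:
            return True
        self.seen[event_id] = now
        return False
```

Downstream logic processes an event only when `is_duplicate` returns `False`, making at-least-once delivery effectively exactly-once for side effects keyed by `event_id`.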
7) Real-time enrichment
Lookup: Redis/Scylla (RG limits, KYC status, BIN→MCC, IP→Geo/ASN).
Asynchronous calls: sanctions/PEP APIs with timeouts and a fallback to "unknown."
FX/timezone: normalization of amounts and local market time (`fx_source`, `tz`).
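The timeout-plus-fallback pattern for external lookups can be sketched with `asyncio`; here `lookup` stands in for a hypothetical async sanctions-screening client, not a real API:

```python
import asyncio

async def enrich_sanctions(user_id: str, lookup, timeout_s: float = 0.2) -> str:
    """Call an external screening lookup with a hard timeout; on timeout or
    connection error fall back to "unknown" so the hot path never blocks on
    a slow dependency. `lookup` is an assumed async callable."""
    try:
        return await asyncio.wait_for(lookup(user_id), timeout=timeout_s)
    except (asyncio.TimeoutError, OSError):
        return "unknown"
```

Events enriched with "unknown" can be re-checked out of band; the decision to degrade rather than block is the point of the pattern.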
8) Serving and real-time data marts
ClickHouse/Pinot/Druid: per-minute/per-second aggregations, materialized views.
Gold stream: operational GGR/RG/AML tables, SLA of ≤ 1-5 min delay.
API/GraphQL: low latency for dashboards and external integrations.
```sql
CREATE MATERIALIZED VIEW mv_ggr_1m
ENGINE = AggregatingMergeTree()
PARTITION BY toDate(ts_min)
ORDER BY (ts_min, market, provider_id) AS
SELECT toStartOfMinute(event_time) AS ts_min,
       market,
       provider_id,
       sumState(stake_base) AS s_stake,
       sumState(payout_base) AS s_payout
FROM stream.game_events
GROUP BY ts_min, market, provider_id;
```
9) Observability and SLO
SLI/SLO targets:
- p95 ingest→alert ≤ 2 s (critical), ≤ 5 s (the rest).
- Window completeness within T ≥ 99.5%.
- Schema errors ≤ 0.1%; share of events with `trace_id` ≥ 98%.
- Stream service availability ≥ 99.9%.
- Partition/topic lags, operator busy time, state size.
- Funnel event→rule→case, map of "hot" keys, late-ratio.
- Cost: cost/GB, cost/query, cost of checkpoints/replays.
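The late-ratio and dup-rate SLIs above can be computed per window. A minimal sketch, assuming events are dicts with `event_id` and a numeric `event_time`, and that "late" means an event time behind the current watermark:

```python
def window_slis(events, watermark_ts):
    """Per-window SLIs: share of events whose event_time is behind the
    watermark (late-ratio) and share of repeated event_ids (dup-rate)."""
    total = len(events)
    if total == 0:
        return {"late_ratio": 0.0, "dup_rate": 0.0}
    late = sum(1 for e in events if e["event_time"] < watermark_ts)
    dups = total - len({e["event_id"] for e in events})
    return {"late_ratio": late / total, "dup_rate": dups / total}
```

These values would be exported as pipeline metrics and alerted on against the SLO thresholds listed above.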
10) Privacy and compliance
PII minimization: ID pseudonymization, field masking, PAN/IBAN tokenization.
Data residency: regional pipelines (EEA/UK/BR), individual encryption keys.
Legal operations: DSAR/RTBF propagated to downstream data marts, Legal Hold for cases/reports.
Audit: access logs, immutable archives of decisions.
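ID pseudonymization can be implemented as a keyed hash: the mapping is deterministic (the same user always gets the same pseudo-ID, so joins still work) but irreversible without the key. A minimal sketch; in production the secret would live in KMS, never in code:

```python
import hashlib
import hmac

def pseudonymize(user_id: str, secret: bytes) -> str:
    """Deterministic keyed pseudonym via HMAC-SHA256: same input + key gives
    the same pseudo_id, but the raw ID cannot be recovered without the key.
    Truncated to 32 hex chars purely for compactness (an assumption)."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:32]
```

Rotating the key re-pseudonymizes the population, which is one way to honor RTBF at the identifier level.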
11) Economics and productivity
Keys and sharding: avoid "hot" keys (salting/composite keys).
State: sensible TTLs, snapshots, RocksDB/state-backend tuning.
Pre-aggregation: upstream reduction for noisy streams.
Sampling: acceptable for non-critical metrics (never for transactions/compliance).
Chargeback: budgets per topic/job, quotas, per-team cost allocation.
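Key salting can be sketched as follows: the bucket is derived from `event_id`, so traffic for a known-hot user spreads across sub-partitions, at the cost of per-key ordering for that user (downstream must merge the partial aggregates). Function name and bucket count are illustrative:

```python
import hashlib

def salted_key(user_id: str, event_id: str, hot_users: set, buckets: int = 8) -> str:
    """Partition key with salting: cold users keep their plain key (preserving
    per-key order); known-hot users are spread over `buckets` sub-partitions."""
    if user_id not in hot_users:
        return user_id
    bucket = int(hashlib.sha256(event_id.encode("utf-8")).hexdigest(), 16) % buckets
    return "%s#%d" % (user_id, bucket)
```

The `hot_users` set would come from a lag/skew monitor rather than being hard-coded.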
12) Streaming DQ (Quality)
Ingest validation (schema, enums, size), dedup by `(event_id, source)`.
On the stream: completeness/dup-rate/late-ratio, window control (no double counting).
Reaction policies: critical → DLQ + alert; major/minor → tag and clean up later.
```yaml
stream: payments
rules:
  - name: schema_valid
    type: schema
    severity: critical
  - name: currency_whitelist
    type: in_set
    column: currency
    set: [EUR, USD, GBP, TRY, BRL]
  - name: dedup_window
    type: unique
    keys: [event_id]
    window_minutes: 1440
```
13) Access security and release control
RBAC/ABAC: separate roles for reading streams and for changing rules/models.
Dual control: rule and model rollouts require "two keys" (two approvals).
Canary/A/B: dark runs of rules and models, precision/recall monitoring.
Secrets: KMS/CMK, regular rotation, no secrets in logs.
14) Processes and RACI
R (Responsible): Streaming Platform (infra/releases), Domain Analytics (rules/features), MLOps (scoring).
A (Accountable): Head of Data/Risk/Compliance by domain.
C (Consulted): DPO/Legal (PII/retention), SRE (SLO/Incidents), Architecture.
I (Informed): Product, Support, Marketing, Finance.
15) Implementation Roadmap
MVP (2-4 weeks):
1. Kafka/Redpanda + two critical topics (`payments`, `auth`).
2. Flink job with watermark, deduplication and one CEP rule (AML or RG).
3. ClickHouse/Pinot data mart with 1-5 min latency, dashboards for lag/completeness.
4. Incident channel (webhooks/Jira), basic SLOs and alerts.
Phase 2 (4-8 weeks):
- Online enrichment (Redis/Scylla), Feature Store, asynchronous lookups.
- Rule management as code, canary releases, A/B.
- Streaming DQ, regionalization of pipelines, DSAR/RTBF procedures.
- Multi-region active-active, what-if replay simulator, auto-calibration of thresholds.
- Full Gold-stream data marts (GGR/RG/AML), near-real-time reporting.
- Value dashboards, chargeback, DR exercises.
16) Examples (fragments)
Flink CEP — device switch:
```sql
MATCH_RECOGNIZE (
  PARTITION BY user_id
  ORDER BY event_time
  MEASURES
    FIRST(A.device_id) AS d1,
    LAST(B.device_id)  AS d2,
    COUNT(*)           AS cnt
  PATTERN (A B+)
  DEFINE
    B AS B.device_id <> PREV(B.device_id) AND B.ip_asn <> PREV(B.ip_asn)
) MR
```
Kafka Streams — idempotent filter:
```java
if (seenStore.putIfAbsent(eventId, now()) == null) {
    context.forward(event);
}
```
17) Pre-production checklist
- Schemas and contracts in the Registry; back-compat tests green.
- Watermarks/allowed lateness, dedup, and DLQ enabled.
- SLOs and alerts configured (lag/late/dup/state size).
- Enrichment with caches and timeouts, fallback to "unknown."
- RBAC/dual control on rules/models; all changes logged.
- Documentation for rules, data marts, and runbooks, plus replay/rollback procedures.
18) Frequent mistakes and how to avoid them
Ignoring event time: without watermarks, metrics "drift."
No deduplication: false alerts and double counting.
Hot keys: partition skew → salting/resharding.
Synchronous external APIs in the hot path: async + cache only.
Unmanaged cost: pre-aggregation, state TTLs, quotas, cost dashboards.
No simulator: rollouts without replay lead to regressions.
19) Glossary (brief)
CEP - Complex Event Processing.
Watermark - the event-time threshold after which a window is considered complete.
Allowed Lateness - tolerance of late events.
Stateful Operator - an operator with a saved state.
Feature Store - consistent feature serving (online/offline).
20) The bottom line
Streaming and streaming analytics form a managed system: contracts, windows and watermarks, stateful logic and CEP, enrichment and real-time data marts, SLOs and observability, privacy and cost under control. By following the practices described, the platform gains reliable risk detectors, operational panels, and personalization with predictable latency and cost.