Real-time signal processing
1) Purpose and business value
A real-time stream is needed to respond "here and now":- Antifraud/AML: structuring deposits, "mulling," velocity attacks.
- Responsible Gaming (RG): exceeding limits, risk patterns of behavior.
- Risk/Compliance: online registration/transaction sanction screening.
- Personalization: Bonus/mission triggers, reactive campaigns.
- Operations/SRE: SLA degradation, flurries of errors, anomalies of metrics.
Key objectives: low latency (p95 0. 5-5 s), high completeness (≥99. 5%), surge resistance.
2) Signal taxonomy
Transactional: 'payment. deposit/withdraw/chargeback`.
Gaming: 'game. bet/payout`, `game. session_start/stop`.
Authentication: 'auth. login/failure ', device change/geo.
Behavioral: rate of bets, exponential growth of the amount, night activity.
Operating rooms: 'api. latency`, `error. rate ', a "storm" of hearth restarts.
Each type has a schema, domain owner, criticality, SLO, and late data rules.
3) Real-time loop reference architecture
1. Ingest and bus: HTTP/gRPC → Edge → Kafka/Redpanda (partitioning by 'user _ id/tenant').
2. Streaming-движок: Flink/Spark Structured Streaming/Beam; stateful statements, CEP.
3. Online enrichment: lookup tables (Redis/Scylla/ClickHouse Read-Only), provider cache (sanctions/CUS).
- Alert topics/cue (case management, SOAR).
- Fichestor online (scoring models).
- Gold stream showcases (operational dashboards).
- "Warm" storage for fast analytics (ClickHouse/Pinot/Druid).
- 5. Archive/forensics: immutable folding in Lake (Parquet, time-travel).
- 6. Observability: tracing/metrics/logs + lineage.
4) Windows, watermarks and "late data"
Window views:- Tumbling: fixed windows (e.g. 1 min) - simple aggregates.
- Hopping: overlapping (e.g. step 30 s, window 2 min) - "smooth" metrics.
- Session: inactivity gaps - behavioral analysis.
- Watermarks: the "knowledge of time" boundary for event-time; allow lateness (e.g. 2 min).
- Belated strategies: additional issue of adjustments, postscript "late = true," DLQ.
5) Stateful statements and deduplication
Key: by 'user _ id', 'payment. account_id`, `device_id`.
Status: adders, sliding counters, bloom filters for idempotency.
Dedup: storing '(event_id, seen_at)' in state/kv; TTL = 24-72 hours.
Exactly-Once: transactional sink 'and (2-phase), idempotent upsert operations.
6) Stream enrichment
Lookup joys: RG limits, user risk rate, KYC level, geo/ASN.
Asynchronous calls: sanctions registry/anti-fraud providers (async I/O, timeouts and fallback).
Currency normalization/timezone: unification to UTC and base currency; fix 'fx _ source'.
7) CEP: detecting complex patterns
Examples of rules:- Structuring: deposit ≥3 for 10 minutes, each
X. - Device-switch: 3 different devices in 15 minutes + IP/ASN change.
- RG-fatigue: total bets for 1 hour> limit + loss ≥ Y.
- Ops-storm: p95 latency> 2 × base, 5xx> 3% in 5-min window.
CEP is conveniently expressed in Flink CEP/SQL or event template libraries.
8) Online features and models
Feature pipelines: counters, velocity-metrics, "time since the last event," share-of-wallet.
Online/offline consistency: one transformation codebase; transience tests.
Scoring: light models (logit/GBDT) synchronously; heavy - asynchronously through the queue.
Drift control: PSI/KS and alerts; "dark launches" for new models.
9) Delivery guarantees and procedure
At-least-once in the tire + idempotency at the reception.
Key partitioning provides a local order.
Retries & backpressure: exponential retrays with jitter, automatic pressure control.
10) SLO/SLI (recommended)
11) Observability of real-time contour
Pipeline metrics: throughput, lag per partition, busy time, checkpoint duration.
Signal quality: completeness, duplication rate, late ratio.
Dashboards: heat map of lags by topic, alert funnel (sobytiye→pravilo→keys), hot key map.
Tracing: associate alert with initiating events (trace_id).
12) Security and privacy
PII minimization: tokenization of identifiers, masking of sensitive fields.
Geo-residency: regional conveyors (EEA/UK/BR).
Audit: unchangeable decision logs (who, what, why), Legal Hold for cases.
Access: RBAC to rules/models, double control on kickouts.
13) Cost and performance
Hot keys: redistribution (key salting), composite keys.
Condition: reasonable TTL, incremental materialization, RocksDB tuning.
Windows: optimal size and allowed lateness; pre-aggregation layers for "noisy" streams.
Sampling: on non-critical flows and at the metric level (not on transactions/compliance).
14) Examples (simplified)
Flink SQL - structured deposits (10-min window, step 1 min):sql
CREATE VIEW deposits AS
SELECT user_id, amount, ts
FROM kafka_deposits
MATCH_RECOGNIZE (
PARTITION BY user_id
ORDER BY ts
MEASURES
FIRST(A. ts) AS start_ts,
SUM(A. amount) AS total_amt,
COUNT() AS cnt
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A{3,})
WITHIN INTERVAL '10' MINUTE
) MR
WHERE total_amt > 500 AND cnt >= 3;
Anti-velocity pseudo code by bid:
python key = event. user_id window = sliding(minutes=5, step=30) # hopping window count = state. counter(key, window)
sum_amt = state. sum(key, window)
if count > 30 or sum_amt > THRESH:
emit_alert("RG_VELOCITY", key, snapshot(state))
Kafka Streams event_id:
java if (!kvStore.putIfAbsent(event. getId(), now())) {
forward(event); // unseen -> process
}
15) Processes and RACI
R (Responsible): Streaming Platform (info, status, releases), Domain Analytics (rules/features).
A (Accountable): Head of Data/Risk/Compliance by its domains.
C (Consulted): DPO/Legal (PII/retention), SRE (SLO/Incidents), Architecture.
I (Informed): Product/Support/Marketing.
16) Implementation Roadmap
MVP (2-4 weeks):1. 2-3 critical signals (e.g. 'payment. deposit`, `auth. login`, `game. bet`).
2. Kafka + Flink, basic dedup and watermark; one CEP rule for anti-fraud and one for RG.
3. ClickHouse/Pinot for operational storefronts; dashboards lag/completeness.
4. Incident channel (webhook/Jira) and manual triage.
Phase 2 (4-8 weeks):- Online fichestor, scoring light models; asynchronous lookups (sanctions/CCL).
- Rule management as code, canary rolls, A/B rules.
- Regionalization and PII controls, Legal Hold for cases.
- Signal catalog, auto-generation of documentation, replay & what-if simulator.
- Auto-calibration of thresholds (Bayesian/quantile), precision/recall metrics online.
- DR-exercises, multi-region active-active, chargeback models by command.
17) Quality checklist before sale
- Schemes and contracts, validation in ingest.
- Windows configured, watermarks, allowed lateness + DLQ.
- Dedup and idempotent sink 'i.
- Lag/throughput/state size metrics, SLO alerts.
- Security: RBAC on rules/models, PII masking.
- Documentation: owner, SLO, examples, dependency maps.
- Rollback procedures and frieze button.
18) Frequent mistakes and how to avoid them
Ignore event-time: use watermarks, otherwise the metrics will "slide."
No deduplication - duplicates will produce false alerts → type idempotency.
Hot keys: distortion of parties → salting/resharding.
Windows too hard: loss of late → allowed lateness + corrective emissions.
PII blending: Separate tokenization and analytic flow.
No simulator: Test rules on a "replay" before rolling out.
19) Glossary (brief)
CEP - Complex Event Processing, pattern detection.
Watermark - time threshold for window readiness.
Allowed Lateness - admission of late events.
Stateful Operator is a persistent operator.
Feature Store - store of online/offline characteristics for ML.
20) The bottom line
Real-time signal processing is a controlled pipeline with clear circuits, windows and watermarks, stateful logic, online enrichment and strict SLOs. By following these practices, you get fast and reliable risk detectors, sustainable personalisation triggers and operational dashboards that scale sparingly and compliantly.