GambleHub

Streaming

What is Streaming

Streaming is continuous processing of unbounded sequences of events (transaction logs, clicks, payments, telemetry) with minimal delay and guarantees that derived state stays correct. Unlike batch processing, which takes everything accumulated over a period, a stream processes data as it arrives, maintains state, and accounts for event time.

Key concepts

An event is an immutable fact with an 'event_time' and a unique 'event_id'.
Event time vs processing time: the former comes from the source; the latter is when the operator actually saw the event.

Windows - group events by time: Tumbling, Hopping/Sliding, Session.
Watermarks - an estimate that "events before T have already arrived," allowing windows to close and bounding how long to wait for late data.
Lateness - events with an 'event_time' below the current watermark; finalization rules are often applied.
State - local tables / keyed state for aggregates, joins, deduplication.
Backpressure - builds up when downstream throughput is exceeded; controlled via the protocol and buffering.
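
As an illustration of the tumbling case, here is a minimal Python sketch (the function name and the epoch-seconds representation are assumptions for the example) that assigns an event to its window by event time:

```python
from datetime import datetime, timezone

def tumbling_window(event_ts: float, size_s: int) -> tuple[float, float]:
    """Assign an epoch-seconds event time to its tumbling window [start, end)."""
    start = event_ts - (event_ts % size_s)
    return start, start + size_s

# An event at 10:03:45 falls into the 10:03:00-10:04:00 one-minute window.
ts = datetime(2024, 1, 1, 10, 3, 45, tzinfo=timezone.utc).timestamp()
start, end = tumbling_window(ts, 60)
```

The same bucketing function doubles as the upsert key (window_start) for an idempotent sink.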

Architectural basis

1. Source: event broker (Kafka/NATS/Pulsar), CDC from DB, queues, files/log collectors.
2. Streaming engine: computes windows, aggregates, joins, patterns (CEP); manages state and checkpoints.
3. Sink: OLTP/OLAP database, search engine, cache, topics, storages for showcases/reports.
4. Schema registry: controlling payload evolution and compatibility.
5. Observability: metrics, tracing, logs, dashboards of lag and watermarks.

Time semantics and order

Always prefer event time: it is the only invariant under delays and outages.
Events can arrive out of order; ordering is guaranteed only within a partition key.

Watermarks allow you to:
  • close windows and emit results;
  • bound how long to wait for delayed events ('allowed_lateness').
For late events, use retractions/upserts: recompute aggregates and emit corrective events.
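
A minimal sketch of the bounded-out-of-orderness watermark described above, in Python (the class and method names are illustrative, not any specific engine's API):

```python
class Watermark:
    """Bounded-out-of-orderness watermark: max seen event time minus a fixed lag."""

    def __init__(self, max_lag_s: float):
        self.max_lag_s = max_lag_s
        self.max_event_ts = float("-inf")

    def observe(self, event_ts: float) -> None:
        # The watermark only ever advances, even if events arrive out of order.
        self.max_event_ts = max(self.max_event_ts, event_ts)

    def current(self) -> float:
        return self.max_event_ts - self.max_lag_s

    def is_late(self, event_ts: float) -> bool:
        # Late = event time already below the current watermark.
        return event_ts < self.current()
```

A window ending at T can be closed once `current()` passes T; events flagged by `is_late` go through the retraction/upsert path.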

State and reliability

Keyed state: aggregate data (sums, counters, deduplication structures) is partitioned by key.
Checkpoint/Savepoint - periodic state snapshots for recovery; a savepoint is a managed snapshot for migrating between code versions.

Exactly-once effects are achieved via:
  • a transactional "read-process-write" (committing the sink together with the read position);
  • idempotent sinks (upsert/merge) plus deduplication tables;
  • versioned aggregates (optimistic concurrency).
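
The idempotent-sink variant can be sketched as a test-and-set against a deduplication table (a hypothetical in-memory model for illustration, not a real sink API):

```python
def apply_once(store: dict, applied: set, event_id: str, key: str, delta: int) -> str:
    """Idempotent aggregate update: redelivery of the same event_id has no effect."""
    if event_id in applied:
        return "ALREADY_APPLIED"
    store[key] = store.get(key, 0) + delta
    applied.add(event_id)
    return "APPLIED"
```

With this shape, at-least-once delivery from the broker still yields an exactly-once effect in the store.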

Windows, aggregations, joins

Windows:
  • Tumbling: simple periodic reports (per minute, per hour).
  • Hopping/Sliding: "sliding" metrics (over 5 minutes in 1-minute steps).
  • Session: natural for user sessions and anti-fraud.

Aggregations: sum/count/avg/approx-distinct (HyperLogLog), percentiles (TDigest/CKMS).
Stream-Stream join: requires buffering both sides by key and time; respect 'allowed_skew'.
Stream-Table join (KTable): attaches a dictionary or current state (for example, "active user limits").
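
The Stream-Table join above can be sketched as a lookup into keyed state (the field names user_id and limit are assumptions for the example):

```python
def stream_table_join(events, table):
    """Enrich each event with the current table row for its key; drop misses.

    In production, misses would typically be routed to a side output or DLQ
    instead of being silently dropped.
    """
    for ev in events:
        row = table.get(ev["user_id"])
        if row is not None:
            yield {**ev, "limit": row["limit"]}
```

The "table" side is itself materialized from a changelog stream, so the join sees the latest known state per key.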

Working with lagged and duplicate data

Deduplication: by 'event_id' or '(producer_id, sequence)'; store the "seen" keys with a TTL ≥ the redelivery window.
Late events: allow window post-processing for 'X' after closing (retractions/upserts).
False duplicates: apply aggregates idempotently and record "ALREADY_APPLIED" in the logs.
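
A minimal sketch of TTL-based deduplication, assuming a single-threaded in-memory store (a production version would keep this in engine-managed keyed state with native TTL support):

```python
class Deduplicator:
    """Keeps seen event_ids with a TTL at least as long as the redelivery window."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.seen: dict[str, float] = {}  # event_id -> first-seen timestamp

    def is_duplicate(self, event_id: str, now_s: float) -> bool:
        # Evict expired entries, then test-and-set the key.
        self.seen = {k: t for k, t in self.seen.items() if now_s - t < self.ttl_s}
        if event_id in self.seen:
            return True
        self.seen[event_id] = now_s
        return False
```

Note the trade-off: once the TTL expires, a redelivered event is no longer recognized, which is why the TTL must cover the broker's redelivery window.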

Scale and performance

Key sharding: provides parallelism; watch for hot keys.
Backpressure: limit parallelism; use batching and compression when publishing.
Watermarks: don't be too aggressive - tight watermarks reduce waiting but increase the share of late updates.
State: choose the backend (RocksDB / state store / in-memory) based on size and access patterns; clean up with TTLs.
Autoscaling: by lag, CPU, state size, GC time.
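
Key sharding can be sketched as a stable hash of the key (sha256 here is an arbitrary illustrative choice; real brokers apply their own partitioners):

```python
import hashlib

def partition_for(key: str, partitions: int) -> int:
    """Stable key -> partition mapping: all events for a key land on one partition,
    which is what makes per-key ordering and keyed state possible."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

A hot key maps to a single partition no matter how many partitions exist, which is exactly why hot keys cap parallelism.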

Reliability and restarts

An idempotent sink, or a transactional commit together with the offset, is the basis of correctness.

Reprocessing after a restart is allowed; the effect must remain exactly-once.

DLQ/parking lot: route problem records to a separate stream with the failure reason attached; provide for reprocessing.

Observability (what to measure)

Lag per source (in time and in messages).
Watermark vs. current event time, and the share of late events.
Operator throughput/latency; end-to-end p95/p99.
State size / RocksDB I/O; checkpoint rate/duration.
DLQ rate; deduplication/retry percentage.
CPU/GC/heap, pause times.

Safety and compliance

Data classification: mark PII/PCI in schemas, store the minimum, encrypt state and snapshots.
Access control: separate ACLs for topics/state tables and for sinks.
Retention: aligned with legal requirements (GDPR / right to be forgotten).
Audit: log 'event_id', 'trace_id', and the outcome: 'APPLIED' / 'ALREADY_APPLIED' / 'REJECTED'.

Implementation patterns

1. CDC → normalization → domain events: don't broadcast raw database changes; map them to meaningful business facts.
2. Outbox for producers: the business fact and its event are written in one database transaction.
3. Core vs Enriched: keep a minimal payload in the critical flow; enrich asynchronously.
4. Replay-friendliness: projections/read models must be rebuildable from the log.
5. Idempotency by design: operation/event keys, upsert schemas, aggregate versions.
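
The outbox pattern from the list above can be sketched with SQLite standing in for the production database (the table names and the payment.created event type are assumptions for the example):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, amount REAL)")
conn.execute(
    "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,"
    " event_type TEXT, payload TEXT)"
)

def record_payment(conn: sqlite3.Connection, payment_id: str, amount: float) -> None:
    """Insert the business fact and its outbox event in one DB transaction.

    A separate relay later reads `outbox` and publishes to the broker, so the
    event can never exist without the fact, or vice versa.
    """
    with conn:  # both inserts commit atomically, or neither does
        conn.execute(
            "INSERT INTO payments (id, amount) VALUES (?, ?)", (payment_id, amount)
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("payment.created", json.dumps({"id": payment_id, "amount": amount})),
        )

record_payment(conn, "p-1", 100.0)
```

The monotonic `seq` column also gives the relay a natural read position for at-least-once publishing.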

Testing

Unit/Property-based: invariants of aggregates and transformations.
Stream tests: a fixed event stream with out-of-order arrivals and duplicates → checks of windows and deduplication.
Golden windows: reference windows/aggregates and allowable late adjustments.

Fault-injection: fail between "effect recorded" and "offset committed."

Replay tests: rebuilding the read model from the beginning of the log = current state.
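
An order-and-duplicate invariance check can be sketched like this: with dedup by event_id, the windowed result must not depend on arrival order (the field names and values are illustrative):

```python
def windowed_sum(events, size_s: int = 60):
    """Dedup by event_id, then sum amounts per tumbling window.

    Because dedup makes the pipeline idempotent, the result is invariant
    under reordering and redelivery - the property stream tests assert.
    """
    seen, sums = set(), {}
    for ev in events:
        if ev["event_id"] in seen:
            continue
        seen.add(ev["event_id"])
        start = ev["ts"] - ev["ts"] % size_s
        sums[start] = sums.get(start, 0) + ev["amount"]
    return sums

events = [
    {"event_id": "a", "ts": 10, "amount": 5},
    {"event_id": "b", "ts": 70, "amount": 3},
    {"event_id": "a", "ts": 10, "amount": 5},  # redelivered duplicate
]
```

Running the pipeline over the stream and its reversal and comparing the results is the core of the stream test above.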

Cost and optimization

Windows and watermarks affect latency/resources: the longer the window and the larger the 'allowed_lateness', the larger the state.
Codecs and compression: balance CPU/network.
Batching output: fewer network calls and transactions.
Filtering early ("pushdown"): discard excess as close to the source as possible.

Antipatterns

Tying to processing time where event time is needed → incorrect analytics.
Lack of idempotency in the sink → double effects on restarts.
Global "mega-keys": one hot partition breaks parallelism.
Raw CDC as public events: leaked DB schemas, fragile evolution.
No DLQ: "poison" messages block the entire pipeline.
A fixed hard delay instead of watermarks: either endless waiting or data loss.

Examples of domains

Payments/Finance

A 'payment.' event stream, anti-fraud windows (session + CEP), dedup by 'operation_id'.
Exactly-once effect when posting to the accounting ledger (upsert + version).

Marketing/Advertising

Sliding windows of CTR/conversions, joining clicks and impressions with tolerance '±Δt', aggregation for bidding.

iGaming/online services

Real-time balance/limits, missions/achievements (session windows), anti-fraud patterns and alerts.

Mini templates (pseudo code)

Window with watermarks and late updates

stream
  .withEventTime(tsExtractor)
  .withWatermark(maxAllowedLag=2m)
  .keyBy(user_id)
  .window(TUMBLING, size=1m, allowedLateness=30s)
  .aggregate(sum, retractions=enable)
  .sink(upsert_table)  // idempotent upsert by (user_id, window_start)

Transactional sink with offset commit

begin tx
  upsert target_table using event, key=(k)
  update consumer_offsets set pos=:pos where consumer=:c
commit

Production checklist

  • Event-time and watermark strategy defined; windows and 'allowed_lateness' selected.
  • Idempotent sink, or transactional commit together with the offset.
  • Schema registry and compatibility modes are enabled; additive evolution.
  • Metrics: lag, watermark, p95/p99, DLQ, state size, checkpoint duration.
  • Tests: out-of-order, duplicates, restarts, replay.
  • PII/retention policies for state and snapshots.
  • Scaling plan and backpressure strategies.
  • Documentation of window contracts and adjustments (late updates).

FAQ

Is event time required?
If metric correctness and consistency matter, yes. Processing time suits technical counters/monitoring but distorts analytics.

Is exactly-once needed?
Selectively, for critical effects. More often, at-least-once plus an idempotent sink is enough.

How to choose windows?
Build on business SLAs: "last 5 minutes" → hopping, "user sessions" → session, "minute reports" → tumbling.

What to do with late data?
Allow a limited 'allowed_lateness' and emit adjustments (upsert/retract). Client-facing read models must support updates.

Summary

Beyond low latency, streaming is a discipline of time, state, and contracts. The right choice of event time, windows, and watermarks, plus idempotent effects, observability, and tests, makes a pipeline reliable, reproducible, and economical - giving the business answers here and now, not after the nightly batch.
