WebSocket streams and events
TL;DR
A working stream = trusted channel (WSS) + resumable offsets + idempotent events + strict limits and backpressure. Do: JWT authentication, per-topic authorization, heartbeats, seq/offset + resume token, at-least-once delivery + dedup. For scale: user/tenant sharding, sticky routing, and a queue (Kafka/NATS/Redis Streams) as the source of truth.
1) iGaming business cases (what we really stream)
Balance/limits: instant balance changes, RG limits, locks.
Bets/rounds/results: confirmation, status, calculation of winnings.
Tournaments/leaderboards: positions, timers, prize events.
Payments: payout/refund status, KYC/AML flags, as notifications only (critical operations stay in REST + webhooks).
Service events: chat messages, push banners, session statuses, maintenance.
2) Protocol and connection
WSS only (TLS 1.2+/1.3). By default, at most 1 active connection per device/session.
Ping/Pong: the client sends 'ping' every 20-30 seconds; the response timeout is 10 seconds. The server drops the connection after 3 consecutive timeouts.
Compression: 'permessage-deflate'; frame size limit (for example, ≤ 64 KB).
Payload format: JSON for external clients, Protobuf/MsgPack for internal/mobile.
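The heartbeat rule above can be sketched as a small server-side tracker. This is a minimal illustration, not a definitive implementation: the class name 'HeartbeatMonitor' is hypothetical, and it assumes that one "timeout" means a full window of ping interval plus response timeout (30 s + 10 s) with no ping received.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks client pings for one connection (all times in seconds)."""
    ping_interval: float = 30.0  # upper bound of the 20-30 s ping period
    timeout: float = 10.0        # grace period after a ping is due
    max_missed: int = 3          # consecutive misses before the drop
    last_ping: float = 0.0

    def on_ping(self, now: float) -> None:
        # Any received ping resets the miss count, because misses are
        # derived from the time elapsed since the last ping.
        self.last_ping = now

    def missed(self, now: float) -> int:
        # Assumption: one miss per full (interval + timeout) window.
        window = self.ping_interval + self.timeout
        return int((now - self.last_ping) // window)

    def should_drop(self, now: float) -> bool:
        return self.missed(now) >= self.max_missed
```

A gateway would call 'on_ping' from its read loop and poll 'should_drop' from a timer; with the defaults the connection is dropped 120 s after the last ping.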
3) Authentication and authorization
JWT in the handshake via query/header ('Sec-WebSocket-Protocol' / 'Authorization'); short token TTL (≤ 15 min); refresh out-of-band (REST).
Tenant-scoped claims: `sub`, `tenant`, `scopes`, `risk_flags`.
Topic/channel ACLs: clients may subscribe only to allowed topics (for example: 'user:{id}', 'tournament:{id}', 'game:{table}').
Reconnect when the token expires: "soft window" of 60 s.
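A topic ACL check against the JWT claims might look like the sketch below. The policy itself is an assumption made for illustration: 'user:{id}' is allowed only when the id equals 'sub', and 'tournament:' / 'game:' topics require hypothetical scopes named 'tournament:read' and 'game:read'. The function name 'authorize_topics' is also hypothetical, and the token is assumed to be already verified.

```python
def authorize_topics(claims: dict, topics: list[str]) -> tuple[list[str], list[str]]:
    """Split requested topics into (allowed, denied) for already-verified claims."""
    allowed: list[str] = []
    denied: list[str] = []
    scopes = set(claims.get("scopes", []))
    for topic in topics:
        kind, _, ident = topic.partition(":")
        if kind == "user":
            # A user may only follow their own personal topic.
            ok = ident != "" and ident == str(claims.get("sub"))
        elif kind in ("tournament", "game"):
            # Hypothetical scope naming: "<kind>:read".
            ok = ident != "" and f"{kind}:read" in scopes
        else:
            ok = False  # unknown topic kinds are denied by default
        (allowed if ok else denied).append(topic)
    return allowed, denied
```

Denied topics would go back to the client in the nack with a reason, while allowed ones are subscribed.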
4) Subscription model
The client sends commands after connect:
{ "op":"subscribe", "topics":["user:123", "tournament:456"], "resume_from":"1748852201:987654" }
{ "op":"unsubscribe", "topics":["tournament:456"] }
'resume_from' is the offset (see § 5) to resume from when the client reconnects.
The server responds with ack/nack; topics that fail the ACL check are returned in 'nack' with a 'reason'.
5) Delivery guarantees and resume
Goal: at-least-once per channel + idempotency on the client.
Each event has a monotonic 'seq' within a partition (usually user/room) and a global 'event_id' for deduplication.
On reconnect, the client sends 'resume_from' = the last confirmed 'seq' (or the broker 'offset'). The server loads the missed events from the "source of truth" (Kafka/NATS/Redis Streams).
If the lag exceeds retention (for example, 24 hours), the server sends a 'snapshot' of the state and a new 'seq'.
On the client:
- Store 'last_seq' / 'event_id' in durable storage (IndexedDB/Keychain).
- Dedup by 'event_id', skip events with 'seq ≤ last_seq', detect gaps → automatic 'resync' (snapshot request).
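The client-side rules above can be sketched as follows. This is a minimal example under stated assumptions: 'seq' has the 'partition:offset' form shown in the subscribe command, offsets are contiguous per topic (so any jump larger than 1 is a gap), and the names 'parse_seq' and 'Deduplicator' are hypothetical. Durable storage is left out; a real client would persist 'last_offset'.

```python
from collections import OrderedDict


def parse_seq(seq: str) -> tuple[int, int]:
    """Split 'partition:offset' into two integers."""
    partition, offset = seq.split(":")
    return int(partition), int(offset)


class Deduplicator:
    def __init__(self, max_ids: int = 10_000) -> None:
        self.last_offset: dict[str, int] = {}  # topic -> last applied offset
        self.seen_ids: OrderedDict[str, None] = OrderedDict()
        self.max_ids = max_ids  # bound on remembered event_ids

    def classify(self, event: dict) -> str:
        """Return 'apply', 'duplicate' or 'gap' for an incoming event."""
        _, offset = parse_seq(event["seq"])
        topic = event["topic"]
        last = self.last_offset.get(topic)
        if event["event_id"] in self.seen_ids or (last is not None and offset <= last):
            return "duplicate"
        if last is not None and offset > last + 1:
            # Nothing is recorded: the caller should request a resync snapshot.
            return "gap"
        self.last_offset[topic] = offset
        self.seen_ids[event["event_id"]] = None
        if len(self.seen_ids) > self.max_ids:
            self.seen_ids.popitem(last=False)  # evict the oldest id
        return "apply"
```

Because duplicates are recognized both by 'event_id' and by offset, redelivery after a resume leaves client state unchanged.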
6) Message schema (envelope)
{
  "ts": "2025-11-03T12:34:56.789Z",
  "topic": "user:123",
  "seq": "1748852201:987654", // partition:offset
  "event_id": "01HF..", // UUID/KSUID
  "type": "balance.updated",
  "data": { "currency": "EUR", "delta": -5.00, "balance": 125.37 },
  "trace_id": "4e3f..", // for correlation
  "signature": "base64(hmac(...))" // optional, for partners
}
'type' is the domain taxonomy (see the event dictionary).
PII/PCI: exclude/mask at the gateway level.
7) Backpressure, quotas and protection against "expensive" clients
Server → Client: per-connection send queue with a sliding window. When it is full: drop subscriptions to "noisy" topics, or disconnect with code '1013' or 'policy_violation' ('1008').
Client → Server: limits on 'subscribe/unsubscribe' (for example, ≤ 10/sec), topic list limit (≤ 50), minimum resubscription interval.
Rate limits by IP/tenant/key. Anomalies → temporary blocking.
Priority: vital events (balance, RG limits) go through a priority queue.
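One way to picture the server-to-client side is a bounded queue with a priority lane. The sketch below is a simplification, not the scheme described above verbatim: instead of resetting whole subscriptions, it sheds the oldest non-vital event when full and signals a disconnect only when nothing can be shed. The class name 'SendQueue' and the type names in 'VITAL_TYPES' are hypothetical.

```python
from collections import deque
from typing import Optional

# Hypothetical event types treated as vital (balance, RG limits).
VITAL_TYPES = {"balance.updated", "rg.limit.updated"}


class SendQueue:
    """Bounded per-connection queue: vital events are sent first."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self.vital: deque = deque()
        self.normal: deque = deque()
        self.dropped = 0  # feeds a drop-rate metric

    def __len__(self) -> int:
        return len(self.vital) + len(self.normal)

    def push(self, event: dict) -> bool:
        """Enqueue; return False if the connection should be closed (1013)."""
        if len(self) >= self.capacity:
            if not self.normal:
                return False  # only vital events queued: client is too slow
            self.normal.popleft()  # shed the oldest non-vital event
            self.dropped += 1
        lane = self.vital if event["type"] in VITAL_TYPES else self.normal
        lane.append(event)
        return True

    def pop(self) -> Optional[dict]:
        """Next event to send, vital lane first; None when empty."""
        if self.vital:
            return self.vital.popleft()
        if self.normal:
            return self.normal.popleft()
        return None
```

A shed event leaves a hole in the sequence, so this design relies on the client's gap detection and resync to recover.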
8) Security and protection
WAF/bot profile on the handshake endpoint; Origin allowlist.
mTLS between edge gateway and stream nodes.
DoS protection: SYN cookies at L4, limits on the number of open WS connections and on the keep-alive interval.
Anti-replay: a 'timestamp' in the optional payload signature (for partners), with a validity window of 5 min.
Tenant isolation: physical/logical sharding, keys/tokens per-tenant.
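The optional partner signature with an anti-replay window could be sketched like this. Several details are assumptions, since the text does not fix them: HMAC-SHA256 as the algorithm, canonical JSON (sorted keys, compact separators) as the signed bytes, and 'timestamp' as Unix seconds inside the payload. The function names 'sign' and 'verify' are hypothetical.

```python
import base64
import hashlib
import hmac
import json

REPLAY_WINDOW_S = 300  # the 5-minute validity window


def sign(payload: dict, key: bytes) -> str:
    """Return base64(HMAC-SHA256(key, canonical JSON of payload))."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(hmac.new(key, body, hashlib.sha256).digest()).decode("ascii")


def verify(payload: dict, signature: str, key: bytes, now: float) -> bool:
    """Reject stale or future-dated payloads, then compare in constant time."""
    if abs(now - payload["timestamp"]) > REPLAY_WINDOW_S:
        return False
    expected = sign(payload, key).encode("ascii")
    return hmac.compare_digest(expected, signature.encode("utf-8"))
```

Keeping the key per tenant, as the isolation point above suggests, limits the impact of a single leaked key.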
9) Transport architecture
Gateway (edge): TLS termination, authN/Z, quotas, routing by partition.
Stream nodes: stateless workers with sticky routing by 'hash(user_id) % N'.
Event broker: Kafka/NATS/Redis Streams as the source of truth and replay buffer.
State service: stores snapshots (balance, tournament positions).
Multi-region: active-active; GSLB to the nearest region; the home region is fixed at login; on failover, a "cold" resume from another region.
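Sticky routing by 'hash(user_id) % N' needs a hash that is stable across processes. A minimal sketch, assuming SHA-256 as the hash (the text does not name one) and a hypothetical function name 'node_for'. Python's built-in 'hash()' is not suitable here because string hashing is randomized per process.

```python
import hashlib


def node_for(user_id: str, node_count: int) -> int:
    """Map a user to a stream node index in [0, node_count)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # The first 8 bytes are plenty for an even spread across nodes.
    return int.from_bytes(digest[:8], "big") % node_count
```

With plain modulo, changing N remaps most users; consistent hashing is the usual alternative when nodes are added or removed often.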
10) Order, consistency, idempotency
Ordering is guaranteed within a partition (user/room), not globally.
Consistency: an event may arrive before the REST response; the UX must tolerate an intermediate state (optimistic UI + reconciliation).
Idempotency: reprocessing the same 'event_id' does not change the state of the client.
11) Errors, reconnect and storms
Close codes: '1000' (normal), '1008' (policy), '1011' (internal), '1013' (server overload).
Client exponential backoff + jitter: 1s, 2s, 4s... max 30s.
During mass reconnects ("thundering herd"), the server returns 'retry_after' and "gray" responses prompting clients to use the SSE fallback for read-only data.
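The client reconnect delay can be sketched as below. The jitter strategy is an assumption, since the text only says "jitter": this uses "full jitter" (a uniform draw between 0 and the exponential ceiling) and adds the server's 'retry_after' on top. The function name 'backoff_delay' and its parameters are hypothetical.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  retry_after: float = 0.0, rng=random) -> float:
    """Delay in seconds before reconnect attempt number `attempt` (0-based)."""
    # Ceiling grows 1s, 2s, 4s, ... and is capped at 30s.
    # The exponent is clamped so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 16)))
    # Full jitter spreads clients over [0, ceiling] to avoid synchronized retries.
    return retry_after + rng.uniform(0.0, ceiling)
```

Adding the jitter after 'retry_after', rather than taking the maximum of the two, keeps clients from all returning at the same instant once the server hint elapses.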
12) Cache and snapshots
Each subscription can start with a snapshot of the current state, followed by a stream of diff events.
Schema versioning via 'data_version' and compatibility (adding fields does not break clients).
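The snapshot-then-diffs flow can be illustrated with a small client-side reducer. This is a sketch under assumptions: the message shapes follow the 'snapshot' and 'event' examples used in this document, only the 'balance.updated' type is handled, and the function name 'apply_message' is hypothetical.

```python
def apply_message(state: dict, msg: dict) -> dict:
    """Return the new client state after one server message."""
    if msg["op"] == "snapshot":
        # A snapshot replaces the state wholesale and resets the position.
        return {**msg["state"], "seq": msg["seq"]}
    if msg["op"] == "event" and msg["type"] == "balance.updated":
        data = msg["data"]
        # Take the absolute balance from the event rather than summing
        # deltas, so a replayed event cannot skew the total.
        return {**state, "balance": data["balance"],
                "currency": data["currency"], "seq": msg["seq"]}
    # Unknown ops/types are ignored so newer servers do not break old clients.
    return state
```

Ignoring unknown message types and extra fields is one simple way to honor the compatibility rule above.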
13) Observability and SLO
Metrics:
- Connections: active, established/sec, distribution by tenant/region.
- Delivery: p50/p95 latency from broker to client, drop rate, resend rate.
- Reliability: share of successful resumes without a snapshot, gap detector.
- Errors: 4xx/5xx on handshake, close codes, limit hits.
- Load: RPS of 'subscribe' commands, queue size, CPU/NET.
SLO:
- WS establishment p95 ≤ 500 ms (within a region).
- End-to-end event latency p95 ≤ 300 ms (within a user partition).
- Resume success ≥ 99%, message loss = 0 (under at-least-once).
- Stream endpoint uptime ≥ 99.95%.
14) Schema and version management
Dictionary of events with owners, examples and semantics.
"Soft" evolution: only add optional fields; remove fields only after an '@deprecated' period.
Contract tests against client SDKs, linters on JSON Schema/Protobuf.
15) Incident playbooks (embed in your shared playbook)
Latency growth: switch partitions to backup nodes, increase the broker batch size, enable prioritization of vital events.
Reconnect storm: activate 'retry_after', temporarily raise handshake limits, enable SSE fallback.
Token leak: JWKS rotation, revocation of affected tokens, forced reconnect with re-auth.
Loss of a broker partition: switch to snapshot mode, replay after recovery.
16) API Mini Specification (Simplified)
Handshake (HTTP GET → WS):
GET /ws?tenant=acme&client=web
Headers:
Authorization: Bearer <JWT>
X-Trace-Id: <uuid>
Client commands:
{ "op":"subscribe", "topics":["user:123"], "resume_from":"1748852201:42" }
{ "op":"unsubscribe", "topics":["user:123"] }
{ "op":"ping", "ts":"2025-11-03T12:34:56Z" }
Server responses:
{ "op":"ack", "id":"subscribe:user:123" }
{ "op":"event", "topic":"user:123", "seq":"1748852201:43", "type":"balance.updated", "data":{...} }
{ "op":"snapshot", "topic":"user:123", "seq":"1748852201:42", "state":{...} }
{ "op":"error", "code":"acl_denied", "reason":"no access to topic tournament:456" }
{ "op":"pong", "ts":"..." }
17) UAT checklist
- Resume from the offset after 1/10/60 minutes of client downtime.
- Dedup: redelivery of the same 'event_id' does not change state.
- Gap detector → automatic 'snapshot' and realignment.
- Quotas and backpressure: an overloaded client receives a policy disconnect.
- Multi-region: regional failover while preserving the offset.
- Security: token rotation, expired JWT, attempts to subscribe outside the ACL.
- RG/balance event arrives before/after the REST response: the UI "stitches" them together correctly.
18) Common mistakes
No 'seq/offset' and resume: you lose events and trust.
Mixing critical payment commands into WS mutations: use REST instead.
No backpressure/quotas: "hung" connections and runaway memory growth.
Global ordering is expensive and unnecessary; ordering within a partition is enough.
Logging PII in events: privacy and PCI/GDPR violations.
No event dictionary or versioning: clients break.
Summary
WebSocket streams deliver reactive UX and operational signals when they are built as a resumable, protected and limited channel: WSS + mTLS/JWT, ACL on topics, seq/offset + resume, at-least-once with deduplication, backpressure/quotas, broker as the source of truth, observability and SLO. That keeps streams fast for the user and manageable for the platform, without compromises on security and money.