
PSP-X latency & loss

(Section: Technology and Infrastructure)

Brief Summary

Chaos engineering is the scientific method applied to production: you formulate a stability hypothesis, disrupt the environment in a controlled way, and prove that user value (SLO/business metrics) is preserved. For iGaming this means payment flows (PSPs), game initialization, lead queues, multi-region setups and peak loads, tested under delays, failures and retry "storms" before they reach live users.

1) Principles of chaos engineering

1. From steady state to hypothesis. Define the baseline: availability, p95/p99, TTW, payment conversion.
2. Small blast radius. Experiment first in staging/canary: 1-5% of traffic, 1-2 pods, a single region.
3. Observability first. Metrics/logs/traces plus experiment annotations.
4. Guardrails and abort. Hard SLO/business-KPI thresholds for automatic shutdown.
5. Repeatability and automation. Experiments as code (IaC/GitOps), game-day plan.
6. Blameless culture. The experiment is not a search for blame, but a search for weaknesses.

2) Steady-state and success metrics

Tech SLIs: API p95/p99, error-rate, saturation (CPU/IO), queue lag (withdrawals/deposits), provider latency.
Business SLIs: attempt→success conversion, TTW p95, game-init success, share of PSP failures by error code.

Hypothesis (example):
💡 "At 5% packet loss and + 200ms RTT to PSP-X, deposit conversion will decrease <0. 3 pp, p95 '/deposit' ≤ 250 ms, and TTW will remain ≤ 3 minutes thanks to retraces, degradation mode and smart routing"

3) Classes of experiments (what to "break")

Network: latency/jitter/packet loss/blackhole, DNS failures, MTU anomalies.
Resources: CPU throttle, memory pressure/OOM, disk IOPS/space, file-descriptor exhaustion.
Processes and nodes: kill/evict pods, node failure, zone/region failure.
Dependencies: PSP timeouts/errors, unavailable game provider, CDN/cache degradation.
Queues/streaming: Kafka lag growth, consumer pause, partition/leader loss.
Data/DB: replication delays, index degradation, read-only mode.
Releases/feature flags: failed migrations, bad configs, kill-switch.
Front/RUM: LCP/INP degradation, client crashes at peak.
Data/ML: stale features, growing model latency, falling tokens/s, quality degradation.

4) Process: from hypothesis to improvement

1. Formulate a hypothesis (specific SLO/business KPI + expected behavior of protection).
2. Design the experiment: type of failure, duration, blast radius, guardrails/abort.
3. Prepare observability: release/experiment compare dashboards, annotations.
4. Run under IM/TL control, notify on-call/business (if affected).
5. Measure the result: SLOs, p95/p99, TTW, conversion, lags, retries.
6. Form action items: limits, timeouts, retries with jitter, outlier ejection, PDB/HPA/KEDA, rollback flow.
7. Automate (include in the regular game-day package / CI infrastructure checks).

5) Guardrails and stop criteria

Abort immediately if:
  • the SLO fast-burn alert fires (e.g., a burn rate of 14× the error budget over an hour),
  • payment conversion drops by more than 0.3 p.p.,
  • TTW p95 > 3 min for 10-15 minutes in a row,
  • error-rate > 1.5% and still growing across two consecutive windows.
Communications: a pre-approved channel/status template and a "red button" in ChatOps ('/experiment abort').
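
To make these thresholds enforceable rather than advisory, they should trigger the abort automatically. A minimal sketch, assuming Prometheus/Alertmanager routes the alert to the ChatOps abort webhook and assuming a 99.9% availability SLO; the metric names and burn-rate expression are illustrative:

groups:
  - name: chaos-guardrails
    rules:
      - alert: ChaosExperimentAbort
        # error ratio on /deposit above 14x the error budget of a 99.9% SLO
        expr: >
          sum(rate(http_requests_total{route="/deposit", code=~"5.."}[5m]))
          / sum(rate(http_requests_total{route="/deposit"}[5m]))
          > 14 * 0.001
        for: 5m
        labels:
          severity: critical
          action: chaos-abort          # Alertmanager route wired to '/experiment abort'
        annotations:
          summary: "Fast error-budget burn during a chaos experiment: abort"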

6) Experimental examples (Kubernetes/cloud)

6.1 Network delays to the PSP (canary scope)

Purpose: verify retries/timeouts/routing.
Injection: +200 ms RTT and 3% packet loss, only on the 'payments-api' → 'pspX' path.

Pseudo-manifest (sketch of the network-chaos idea):
apiVersion: chaos/v1
kind: NetworkChaos
metadata: { name: psp-latency-canary }
spec:
  selector: { labelSelectors: { app: payments-api, track: canary } }
  direction: to
  target:
    selector: { namespace: prod, ipBlocks: ["10.23.0.0/16"] }   # pspX egress addresses
  action: delay
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "0.5"
  loss:
    loss: "3"
    correlation: "0.3"
  duration: "10m"
  mode: one   # minimum blast radius

Expected: p95 of '/deposit' < 250 ms, error-rate < 1%, conversion ≥ baseline − 0.3 pp; on degradation, the PSP route switches automatically.
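
The protections this experiment exercises (bounded retries, per-call timeouts, rerouting) usually live in mesh or gateway configuration. A minimal sketch, assuming an Istio-style mesh and a hypothetical 'psp-x' egress service; the budgets are illustrative, not the platform's real values:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: psp-x
  namespace: payments
spec:
  hosts: ["psp-x.payments.svc.cluster.local"]
  http:
    - route:
        - destination:
            host: psp-x.payments.svc.cluster.local
      timeout: 3s                      # overall budget for a single deposit call
      retries:
        attempts: 2                    # bounded retries to avoid a retry storm
        perTryTimeout: 1s
        retryOn: "5xx,reset,connect-failure"

Keeping 'attempts' low and 'perTryTimeout' well inside the overall timeout is exactly what prevents the retry storm this experiment is designed to expose.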

6.2 Node failure and PDB

Purpose: Check PDB/anti-affinity/HPA.
Injection: drain/terminate one node with 'games-api' pods.

Expected: no loss of availability, the p99 peak stays within SLO, the autoscaler picks up the signal, and the PDB prevents too many pods being evicted at once.
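
For reference, the guardrail under test can be as small as this PodDisruptionBudget sketch for a hypothetical 'games-api' deployment (the label and threshold are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: games-api-pdb
  namespace: prod
spec:
  minAvailable: 2            # a voluntary drain may never drop the service below two replicas
  selector:
    matchLabels:
      app: games-api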

6.3 Kafka lag and KEDA

Purpose: withdrawals stay stable while messages accumulate.
Injection: pause the consumers for 5-10 minutes, then resume them.

Expected: KEDA scales the workers out, TTW p95 stays ≤ 3 min once the backlog is drained, and there are no duplicates (idempotency keys).
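
A sketch of the lag-based scaling this scenario validates, assuming KEDA with a Kafka trigger; the deployment, topic, consumer group and threshold are hypothetical:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: withdrawal-workers
  namespace: prod
spec:
  scaleTargetRef:
    name: withdrawal-worker          # hypothetical consumer deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 120
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.prod.svc:9092
        consumerGroup: withdrawals
        topic: withdrawal-requests
        lagThreshold: "500"          # target lag per replica before scaling out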

6.4 Game provider DNS failure

Purpose: fallback/caching/retries.
Injection: NXDOMAIN/timeout for the domain 'providerA.example'.

Expected: fast fallback to 'providerB', degradation mode and a status banner in the UI; 'game init success' ≥ 99.5%.
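
One way to inject this failure in Kubernetes is a DNS fault; a minimal sketch assuming Chaos Mesh's DNSChaos, with a hypothetical 'game-gateway' label for the pods that resolve provider domains:

apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: provider-a-nxdomain
  namespace: prod
spec:
  action: error                # answer matching lookups with an error (NXDOMAIN-like)
  mode: all
  patterns:
    - providerA.example        # only the targeted provider domain is affected
  selector:
    namespaces: [prod]
    labelSelectors:
      app: game-gateway
  duration: "10m"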

6.5 Read-only DB

Purpose: behavior when writes fail.
Injection: switch the database to read-only for 10-15 min.

Expected: the code handles write errors gracefully, critical routes degrade in a controlled way, queues buffer the requests, and there are no losses or double debits.
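
How the fault is injected depends on the database. As a minimal sketch, assuming PostgreSQL and a Kubernetes secret holding the connection URL (the job name, image and secret are hypothetical), new sessions can be made to reject writes by default:

apiVersion: batch/v1
kind: Job
metadata:
  name: db-readonly-fault
  namespace: prod
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: inject
          image: postgres:16
          # revert after the window by running the same job with 'off'
          command: ["psql", "$(DATABASE_URL)",
                    "-c", "ALTER SYSTEM SET default_transaction_read_only = on;",
                    "-c", "SELECT pg_reload_conf();"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef: { name: payments-db, key: url }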

7) Automation and GitOps

Experiments as code: store scripts/parameters/guardrails in Git, review via PR.
Game-day plan: schedule, owners, metrics, abort conditions, communication checklist.
Annotations in Grafana: start/end of the experiment, config, final SLOs.
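
As one concrete form of "experiments as code", the run itself can live in Git and be applied by the GitOps pipeline. A minimal sketch assuming Chaos Mesh's Schedule resource; the cron window and selectors are placeholders:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: psp-latency-gameday
  namespace: prod
spec:
  schedule: "0 10 * * 2"           # off-peak window agreed with the business
  type: NetworkChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces: [prod]
      labelSelectors:
        app: payments-api
        track: canary
    delay:
      latency: "200ms"
      jitter: "50ms"
    duration: "10m"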

8) Observability during chaos

Exemplars: from p95/p99 down to a specific 'trace_id'.
Logs: fields `experiment_id`, `fault_type`, `retry_attempt`, `degrade_mode=true`.
Traces: external calls are tagged 'fault.injected=true'; retries/timeouts are visible.
Dashboards: an "SLO card", release/experiment comparison, Payments / Game init / Queues.

9) The specifics of iGaming: what to check first

1. Payments and TTW: PSP timeouts, route fallback, idempotency.
2. Initialization of games: inaccessibility/slowness of studios, CDN failures.
3. Lead/bonus queues: lag growth, reprocessing.
4. Multi-region: zone failure/POP, change of leader, database replication.
5. Peaks: auto-scale, rate-limit, circuit-breaker, warm-up caches.
6. RG/Compliance: correct logging in case of failures, no PII in telemetry.

10) Governance

Calendar and windows: experiments outside of peak tournaments, coordination with business.
Roles: Experiment Lead, Observer (SRE), Business Rep; the IM on call.
Data policies: no PII in artifacts; WORM stores for auditing.
Legal boundaries: Exclude scenarios that violate SLA without agreement.

11) Game-day: script template



12) Typical findings and actions

Too aggressive retries → request storms → add timeouts/jitter/limits.
No outlier ejection → a poisoned instance ruins p99 → enable ejection (see the sketch after this list).
Fragile migrations → read-only mode breaks the flow → expand→migrate→contract + feature flags.
Wrong HPA signal → late scaling → switch to RPS/lag-based metrics.
Cache shared across versions → rollbacks corrupt data → version the cache keys.
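
For the outlier-ejection finding above, "enable culling" can be a small traffic-policy change; a minimal sketch assuming an Istio-style mesh, with illustrative thresholds:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: games-api-outlier
  namespace: prod
spec:
  host: games-api.prod.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5      # eject an instance after five consecutive 5xx responses
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50       # never eject more than half the pool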

13) Chaos practice maturity checklist

1. Steady-state and SLO are described, dashboards are ready.
2. Experiments as code, review/audit in Git.
3. Guardrails/abort automated (Alertmanager/ChatOps).
4. Observability: exemplars, trace/log correlation, annotations.
5. Game-day quarterly, scenarios cover payments/games/queues/multi-region.
6. Post-experimental action items are part of the sprint plan; performance monitoring.
7. Retry/timeout/circuit-breaker threshold policies live in the config repo.
8. Security/PII policies enforced, artifacts without sensitive data.
9. Auto-remediation driven by SLOs (rollback/scale/reroute) is tested with chaos.
10. Process metrics: % of runs completed without abort, MTTR during exercises, reduction of incidents per class.

14) Anti-patterns

"Breaking everything in the prod" without SLO/guardrails/observability.
Experiments without hypotheses and measurable targets.
Big blast radius on first launch.
Retries without timeouts/jitter → cascading failures.
Chaos instead of prevention: treat symptoms, ignore root causes.
Absence of RCA/action items after exercise.
Experiments during peak hours without business approval.

Summary

Chaos engineering is a methodical proof of resilience: you reproduce real failures in advance, measure the impact on SLOs and business metrics, and harden the architecture, from retries and circuit breakers to multi-region orchestration and auto-remediation. With regular game days and guardrail discipline, an iGaming platform keeps its p95/p99, conversion and TTW even during the hottest periods.