PSP-X latency & loss
(Section: Technology and Infrastructure)
Brief Summary
Chaos Engineering is the scientific method applied to production: you formulate a stability hypothesis, disrupt the environment in a controlled way, and prove that user value (SLO/business metrics) is preserved. For iGaming this means payment flows (PSPs), game initialization, withdrawal queues, multi-region setups and peak loads - under delays, failures and retry storms - before any of it hits live users.
1) Principles of chaos engineering
1. From steady state to hypothesis. Define the baseline: availability, p95/p99, TTW, payment conversion.
2. Small blast radius. Experiment first in staging/canary: 1-5% of traffic, 1-2 pods, one region.
3. Observability first. Metrics/logs/traces + experiment annotations.
4. Guardrails and abort. Hard SLO/business-KPI thresholds for automatic shutdown.
5. Repeatability and automation. Experiments as code (IaC/GitOps), game-day plan.
6. Blameless culture. The experiment is not a search for blame, but a search for weaknesses.
2) Steady-state and success metrics
Tech SLIs: API p95/p99, error rate, saturation (CPU/IO), queue lag (withdrawals/deposits), provider latency.
Business SLIs: attempt→success conversion, TTW p95, game-init success rate, share of PSP failures by error code.
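A minimal sketch of how these SLIs could be captured as Prometheus recording rules, assuming a Prometheus-based stack; the metric and label names (`http_request_duration_seconds_bucket`, `payment_attempts_total`, `kafka_consumergroup_lag`) are illustrative assumptions, not the platform's actual telemetry.

```yaml
# Hypothetical Prometheus recording rules for the steady-state SLIs above.
# Metric and label names are assumptions for illustration only.
groups:
  - name: steady-state-slis
    rules:
      - record: sli:payments_api_latency_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{app="payments-api"}[5m])) by (le))
      - record: sli:deposit_conversion:ratio
        expr: |
          sum(rate(payment_attempts_total{status="success"}[15m]))
          /
          sum(rate(payment_attempts_total[15m]))
      - record: sli:withdrawal_queue:lag
        expr: sum(kafka_consumergroup_lag{group="withdrawals"})
```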
3) Classes of experiments (what to break)
Network: latency/jitter/packet loss/blackhole, DNS failures, MTU anomalies.
Resources: CPU throttle, memory pressure/OOM, disk IOPS/space, file-descriptor exhaustion.
Processes and sites: kill/evict pods, node failure, zone/region failure.
Dependencies: PSP timeouts/errors, unavailable game provider, CDN/cache degradation.
Queues/streaming: Kafka lag growth, consumer pause, partition/leader loss.
Data/DB: replication delays, index degradation, read-only mode.
Releases/feature flags: failed migrations, bad configs, kill-switch.
Frontend/RUM: LCP/INP degradation, client crashes at peak.
Data/ML: stale features, rising model latency, dropping tokens/s, quality degradation.
4) Process: from hypothesis to improvement
1. Formulate a hypothesis (specific SLO/business KPI + expected behavior of protection).
2. Design the experiment: type of failure, duration, blast radius, guardrails/abort.
3. Prepare observability: release/experiment compare dashboards, annotations.
4. Run under IM/TL control, notify on-call/business (if affected).
5. Measure the result: SLO, p95/p99, TTW, conversion, lags, retries.
6. Form action items: limits, timeouts, retries with jitter, outlier ejection, PDB/HPA/KEDA, rollback flow.
7. Automate (include in the regular game-day package / CI infrastructure checks); see the experiment-descriptor sketch below.
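One way to make steps 2-5 reviewable is a declarative experiment descriptor stored in Git. The schema below is a hypothetical sketch, not the format of any specific chaos tool; all names and thresholds are illustrative.

```yaml
# Hypothetical experiment-as-code descriptor (illustrative schema, not a specific tool).
experiment: psp-latency-canary
hypothesis: >
  With +200ms RTT to pspX, /deposit p95 stays < 250ms and conversion
  drops by no more than 0.3 p.p. thanks to timeouts and route fallback.
blast_radius:
  environment: prod-canary
  traffic_share: 5%
  duration: 10m
guardrails:                          # abort conditions, see section 5
  - slo_fast_burn: 14x_per_hour
  - conversion_drop_pp: 0.3
  - ttw_p95_minutes: 3
abort:
  channel: "#chaos-gameday"
  command: "/experiment abort"
owners:
  lead: sre-oncall
  observer: payments-team
```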
5) Guardrails and stop criteria
Abort immediately if:
- SLO fast-burn is triggered (e.g. 14× the hourly error budget),
- payment conversion drops by more than 0.3 p.p.,
- TTW p95 > 3 min for 10-15 minutes in a row,
- error rate > 1.5% and growing across two windows.
Communications: pre-approved channel/status template, a "red button" in ChatOps (`/experiment abort`); a sketch of automating the first criterion follows below.
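Assuming a Prometheus/Alertmanager setup, the fast-burn abort could be automated roughly as follows; the metric names, the `/deposit` route label and the 0.1% error budget are assumptions for illustration only.

```yaml
# Sketch of an automated abort guardrail as a Prometheus alerting rule.
# Metric names and the 0.1% error budget are illustrative assumptions.
groups:
  - name: chaos-guardrails
    rules:
      - alert: ChaosAbortFastBurn
        # Hourly error rate exceeds 14x the assumed error budget (fast-burn).
        expr: |
          (sum(rate(http_requests_total{code=~"5..",route="/deposit"}[1h]))
           / sum(rate(http_requests_total{route="/deposit"}[1h]))) > 14 * 0.001
        for: 5m
        labels:
          severity: critical
          action: abort_experiment   # routed to the ChatOps "red button"
        annotations:
          summary: "SLO fast-burn during chaos experiment - abort immediately"
```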
6) Example experiments (Kubernetes/cloud)
6.1 Network latency to a PSP (canary degradation)
Purpose: verify retries/timeouts/routing.
Injection: +200 ms RTT and 3% packet loss on the `payments-api` → `pspX` path only.
```yaml
apiVersion: chaos/v1
kind: NetworkChaos
metadata: { name: psp-latency-canary }
spec:
  selector: { labelSelectors: { app: payments-api, track: canary } }
  direction: to
  target:
    selector: { namespace: prod, ipBlocks: ["10.23.0.0/16"] }  # pspX egress addresses
  action: delay
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "0.5"
  loss:
    loss: "3"
    correlation: "0.3"
  duration: "10m"
  mode: one  # minimum blast radius
```
Expected: p95 for `/deposit` < 250 ms, error rate < 1%, conversion ≥ baseline − 0.3 p.p.; on degradation, the PSP route auto-switches.
6.2 Node failure and PDB
Purpose: Check PDB/anti-affinity/HPA.
Injection: drain/terminate one node running `games-api` pods.
Expected: no loss of availability, peak p99 stays within SLO, the autoscaler compensates for the lost capacity, and the PDB prevents a "double whammy."
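A minimal PodDisruptionBudget backing the expectation above; the app label and the `minAvailable` threshold are illustrative assumptions.

```yaml
# Minimal PodDisruptionBudget for games-api (illustrative values).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: games-api-pdb
spec:
  minAvailable: 2            # never voluntarily drain below two ready replicas
  selector:
    matchLabels:
      app: games-api
```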
6.3 Kafka lag and KEDA
Purpose: withdrawals remain stable while messages accumulate.
Injection: pause the consumers for 5-10 minutes, then resume.
Expected: KEDA scales out the workers, TTW p95 stays ≤ 3 min after the backlog drains, no duplicates (idempotency keys).
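A sketch of the KEDA configuration that would produce this behavior, assuming a Kafka-lag trigger; deployment, broker, topic and consumer-group names are assumptions for illustration.

```yaml
# KEDA ScaledObject scaling withdrawal workers on Kafka consumer lag.
# Deployment, broker, topic and group names are illustrative assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: withdrawals-worker
spec:
  scaleTargetRef:
    name: withdrawals-worker       # target Deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 120
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.prod.svc:9092
        consumerGroup: withdrawals
        topic: withdrawals
        lagThreshold: "500"        # target lag per replica that triggers scale-out
```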
6.4 Game-provider DNS failure
Purpose: verify fallback/caching/retries.
Injection: NXDOMAIN/timeout for the domain `providerA.example`.
Expected: fast fallback to `providerB`; the UI shows degraded mode and a status banner; game-init success ≥ 99.5%.
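In the same pseudo-manifest style as 6.1 (Chaos Mesh's DNSChaos is one real implementation of this fault class), the injection might look like the sketch below; the selector and pattern are illustrative.

```yaml
# DNS fault for providerA.example, same pseudo-manifest style as 6.1.
apiVersion: chaos/v1
kind: DNSChaos
metadata: { name: provider-a-nxdomain }
spec:
  selector: { labelSelectors: { app: games-api } }
  action: error                  # return NXDOMAIN instead of resolving
  patterns:
    - providerA.example
  duration: "10m"
  mode: all
```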
6.5 Read-only DB
Purpose: verify behavior when writes are unavailable.
Injection: switch the database to read-only mode for 10-15 min.
Expected: the code handles write errors correctly, critical routes are limited, queues hold the requests, no lost or duplicate debits.
7) Automation and GitOps
Experiments as code: store scripts/parameters/guardrails in Git, review via PR.
Game-day plan: schedule, owners, metrics, abort conditions, communication checklist.
Annotations in Grafana: start/end of the experiment, config, final SLOs.
8) Observability during chaos
Exemplars: from p95/p99 to a specific `trace_id`.
Logs: fields `experiment_id`, `fault_type`, `retry_attempt`, `degrade_mode=true`.
Traces: external calls are tagged `fault.injected=true`; retries and timeouts are visible.
Dashboards: "SLO card", release/experiment comparison, Payments / Game init / Queues.
9) The specifics of iGaming: what to check first
1. Payments and TTW: PSP timeouts, route fallback, idempotency.
2. Initialization of games: inaccessibility/slowness of studios, CDN failures.
3. Withdrawal/bonus queues: lag growth, reprocessing.
4. Multi-region: zone/POP failure, leader change, database replication.
5. Peaks: auto-scaling, rate limits, circuit breakers, cache warm-up.
6. RG/Compliance: correct logging in case of failures, no PII in telemetry.
10) Governance
Calendar and windows: experiments outside of peak tournaments, coordination with business.
Roles: Experiment Lead, Observer (SRE), Business Rep; incident manager (IM) on standby.
Data policies: no PII in artifacts; WORM stores for auditing.
Legal boundaries: Exclude scenarios that violate SLA without agreement.
11) Game-day: script template
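A possible shape for such a template, reusing the plan elements from section 7; all names, dates, channels and timings below are illustrative assumptions.

```yaml
# Illustrative game-day script template (all names/dates/timings are examples).
gameday: "Q3 payments resilience"
date: 2024-09-12
window: "10:00-12:00 UTC (off-peak, business approved)"
roles:
  experiment_lead: "@alice"
  observer_sre: "@bob"
  business_rep: "@carol"
scenarios:
  - "6.1 PSP latency + loss (canary)"
  - "6.3 Kafka lag + KEDA scale-out"
metrics_to_watch: [deposit p95, conversion, TTW p95, consumer lag]
abort_conditions: "see section 5 guardrails"
communications:
  channel: "#chaos-gameday"
  status_template: "pre-approved"
debrief: "blameless review + action items within 48h"
```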
12) Typical finds and actions
Overly aggressive retries → request storms → add timeouts/jitter/limits.
No outlier ejection → one bad instance drags down p99 → enable ejection (see the mesh-policy sketch after this list).
Fragile migrations → read-only mode breaks the flow → adopt expand→migrate→contract plus feature flags.
Wrong HPA signal → late scaling → switch to RPS/lag-based metrics.
Cache shared across versions → rollbacks corrupt data → version the cache keys.
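If the platform runs an Envoy-based mesh such as Istio (an assumption, not stated in the source), outlier ejection and a bounded retry budget could be expressed as a DestinationRule; the host and thresholds are illustrative.

```yaml
# Istio DestinationRule: eject misbehaving instances and cap pending requests/retries.
# Host name and thresholds are illustrative assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-api-outlier
spec:
  host: payments-api.prod.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRetries: 3                # global outstanding-retry budget
    outlierDetection:
      consecutive5xxErrors: 5        # eject after 5 consecutive 5xx responses
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50         # never eject more than half the pool
```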
13) Chaos practice maturity checklist
1. Steady-state and SLO are described, dashboards are ready.
2. Experiments as code, review/audit in Git.
3. Guardrails/abort automated (Alertmanager/ChatOps).
4. Observability: exemplars, trace/log correlation, annotations.
5. Game-day quarterly, scenarios cover payments/games/queues/multi-region.
6. Post-experiment action items are part of the sprint plan; their completion is tracked.
7. Retry/timeout/circuit-breaker threshold policies live in the config repo.
8. Security/PII policies enforced, artifacts without sensitive data.
9. Auto-remediation driven by SLOs (rollback/scale/reroute) is tested with chaos.
10. Process metrics: % of experiments completed without abort, MTTR during exercises, reduction of incidents per class.
14) Anti-patterns
"Breaking everything in the prod" without SLO/guardrails/observability.
Experiments without hypotheses and measurable targets.
Big blast radius on first launch.
Retries without timeouts/jitter → cascading failures.
Chaos instead of prevention: treat symptoms, ignore root causes.
Absence of RCA/action items after exercise.
Experiments during peak hours without business approval.
Summary
Chaos engineering is a methodical proof of resilience: you reproduce real failures in advance, measure the impact on SLOs and business metrics, and harden the architecture - from retries and circuit breakers to multi-region orchestration and auto-remediation. With regular game-days and guardrail discipline, an iGaming platform keeps its p95/p99, conversion and TTW on target even during the hottest periods.