Health-check mechanisms

1) Why

Health-checks are the first barrier against cascading failures: they correctly remove nodes from rotation, prevent retry storms, simplify degradation and accelerate recovery, preserving SLOs and reducing MTTR.


2) Basic types of checks

Liveness - the process is "alive" (no deadlock/leak/panic). Failure → restart the instance.
Readiness - the service can serve traffic within its target SLOs (pools are up, the cache is warm, dependent resources are healthy). Failure → exclude from load balancing, but do not restart.
Startup - the service has finished bootstrapping and can be handed over to liveness/readiness checks (long bootstrap, migrations, warmup). Protects against premature restarts.
Deep-health (domain-specific): business invariants (a bet passes end-to-end, a deposit is authorized by the active PSP). Used as a degradation signal, not for immediate restart.
External/Synthetic: active probes from outside (API path, front-end scenario, PSP/KYC endpoint) - measure availability as users see it.


3) Probe design: general rules

1. Cheap liveness: do not call external dependencies; check the event loop, heap/FD, watchdog.
2. Readiness by SLO: check the local resources needed to serve traffic (database pools, warm cache, limits). External dependencies - through non-blocking "can-serve?" signals.
3. Latency budget: each probe has its own time budget (for example, ≤100-200 ms); if it is exceeded, report "degraded" rather than a 5xx on liveness.
4. Backoff & jitter: probe intervals of 5-15 seconds, timeouts of 1-2 seconds, with exponential delay on errors to avoid synchronized storms.
5. Hysteresis: N consecutive successes/failures before a status change (e.g. `successThreshold = 2`, `failureThreshold = 3`).
6. Versioning: endpoints `/healthz`, `/readyz`, `/startupz` are stable; deep-checks live under `/health/...` with named checks.
7. No secrets or PII: responses contain only statuses and short codes.
8. Explainability: JSON with a list of sub-checks: `{"status": "degraded", "checks": [{"name": "db", "ok": true, "latencyMs": 18}, {"name": "psp.eu", "ok": false, "reason": "timeout"}]}`.
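
The hysteresis rule can be sketched as a small state machine. This is a minimal illustration, not a reference implementation: the class name `HysteresisGate` and its fields are hypothetical, and the defaults simply mirror the example thresholds from the list above.

```python
from dataclasses import dataclass


@dataclass
class HysteresisGate:
    """Flips a health status only after N consecutive contrary probe results."""
    success_threshold: int = 2   # consecutive successes needed to become healthy
    failure_threshold: int = 3   # consecutive failures needed to become unhealthy
    healthy: bool = False        # start pessimistic until probes pass
    _streak: int = 0             # consecutive results contradicting the current status

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return the (possibly updated) status."""
        if ok == self.healthy:
            # The result agrees with the current status: reset the contrary streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
        return self.healthy
```

A single failed probe between successes resets the streak, so a healthy instance is not taken out of rotation by one slow response, which is the flapping protection the rule is after.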


4) Examples of deep-checks by layer

4.1 DB/Cache/Storage

DB: a short `SELECT 1` transaction against the read replica plus a pool check; latency/replication-lag thresholds.
Cache: `GET`/`SET` on a test key + hit-ratio guard (low hit ratio → warning).
Object Storage: HEAD of an existing object (no download).
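
One possible way to run such checks as named sub-checks under a shared latency budget is sketched below. The function `run_checks`, the helper `_timed`, the `budget_s` parameter and the `"ok"` status value are assumptions made for illustration; the check callables themselves (a `SELECT 1`, a cache `GET`/`SET`, a storage HEAD) would be supplied by the service.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict


def _timed(fn: Callable[[], None]) -> int:
    """Run one check and return its latency in milliseconds."""
    t0 = time.monotonic()
    fn()
    return round((time.monotonic() - t0) * 1000)


def run_checks(checks: Dict[str, Callable[[], None]], budget_s: float = 0.2) -> dict:
    """Run named sub-checks in parallel under one shared time budget."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    deadline = time.monotonic() + budget_s
    futures = {name: pool.submit(_timed, fn) for name, fn in checks.items()}
    results = []
    for name, future in futures.items():
        try:
            latency = future.result(timeout=max(0.0, deadline - time.monotonic()))
            results.append({"name": name, "ok": True, "latencyMs": latency})
        except FutureTimeout:
            results.append({"name": name, "ok": False, "reason": "timeout"})
        except Exception as exc:
            # Only a short code is reported, never the exception message.
            results.append({"name": name, "ok": False, "reason": type(exc).__name__})
    pool.shutdown(wait=False)  # do not block the response on a stuck check
    status = "ok" if all(r["ok"] for r in results) else "degraded"
    return {"status": status, "checks": results}
```

A check that overruns the budget is reported as a timeout instead of delaying the whole response. Its worker thread still runs to completion in the background, so real checks should also carry their own client-side timeouts.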

4.2 Queues/Streaming

Broker: ping-topic publish + consume within local partition; consumer-lag thresholds.
DLQ: No spike in dead-letter messages per window.

4.3 External providers (PSP/KYC/AML)

PSP: lightweight auth probe (non-monetary), verification of contract/certificate/quotas; if no safe probe exists, use proxy metrics (authorization success rate over the last 5-10 minutes by bank/GEO).
KYC/AML: health API and queue SLAs; on degradation, switch to an alternative flow/provider.

4.4 API/Front

Synthetics: transactional path (login → deposit initiation → sandbox bet) in EU/LATAM/APAC.
RUM signal: share of JS/HTTP errors and LCP/TTFB values, used as "outside-in" triggers.


5) Platform integration

5.1 Kubernetes / Cloud

`startupProbe` protects the bootstrap phase (migrations/cache warmup).
`livenessProbe` is minimalistic; `readinessProbe` takes pools/cache/local queues into account.
Parameters: `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `failureThreshold`, `successThreshold`.
PodDisruptionBudget and maxUnavailable are set with readiness in mind.
HPA/KEDA: scaling by queues/SLI; readiness affects routing.
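
For `startupProbe`, the time a container is given to start is roughly `failureThreshold × periodSeconds`, so the threshold can be derived from the worst-case bootstrap time. The helper below is a small sketch of that arithmetic; the function name and the default period are assumptions, and `initialDelaySeconds` is ignored for simplicity.

```python
import math


def startup_failure_threshold(max_bootstrap_s: float, period_s: int = 10) -> int:
    """Smallest failureThreshold whose budget (threshold * period) covers bootstrap."""
    return max(1, math.ceil(max_bootstrap_s / period_s))


# Example: a service whose migrations and warmup may take up to 5 minutes.
threshold = startup_failure_threshold(300, period_s=10)
budget_s = threshold * 10  # total startup budget granted by the probe, in seconds
```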

5.2 Balancers/gateways/mesh

Health-routing at the L7 level (HTTP 200/429/503 semantics).
Outlier detection (Envoy/mesh) - ejection from the pool by error rate/latency percentiles.
Circuit-breaker: limits on concurrent requests/connections to a dependency, integrated with health signals.

5.3 Autoscaling and degradation

Readiness = FALSE → traffic is removed, but the pod is alive (can warm up).
Deep-degrade (PSP down) → feature flags for graceful mode (for example, temporarily hide payment methods, enable waiting-room).
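
The deep-degrade rule could be wired to feature flags roughly as follows. This is a sketch under stated assumptions: the function `payment_mode`, the keys `visibleProviders` and `waitingRoom`, and the mode names `"normal"` and `"degraded"` are hypothetical; only `"failover"` appears in this document's own examples.

```python
from typing import Dict


def payment_mode(psp_ok: Dict[str, bool], primary: str) -> dict:
    """Translate per-PSP deep-check results into graceful-mode flags."""
    healthy = [name for name, ok in psp_ok.items() if ok]
    if psp_ok.get(primary, False):
        mode = "normal"      # primary PSP is up: no degradation needed
    elif healthy:
        mode = "failover"    # primary is down, but an alternative can take traffic
    else:
        mode = "degraded"    # nothing healthy: hide payments, hold users
    return {
        "routerMode": mode,
        "visibleProviders": healthy,   # payment methods the front end may show
        "waitingRoom": not healthy,    # enable the waiting-room only as a last resort
    }
```

The pods keep serving non-payment traffic in every mode; the deep-check only changes what is offered to the user, never whether the instance is restarted.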


6) Timeout and retry policies

Timeout < SLO budget: `timeout = min(⅓ p99, 1-2 s)` for synchronous dependencies.
Idempotency: mandatory for retries; use idempotency keys.
Exponential backoff + jitter: prevents synchronized retry waves.
Retry budgets: caps per request/tenant, protection against "retry storms."
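
The backoff and budget rules can be combined in a few lines. The sketch below uses the "full jitter" variant of exponential backoff and a deliberately simplified, non-windowed budget; the names `backoff_delay`, `RetryBudget` and `call_with_retries` and all default values are assumptions for illustration, and `fn` is expected to be idempotent.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


class RetryBudget:
    """Allows retries only while they stay under `ratio` of observed requests."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 3):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries >= allowed:
            return False  # budget exhausted: fail fast instead of piling on
        self.retries += 1
        return True


def call_with_retries(fn: Callable, budget: RetryBudget, max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep):
    """Call an idempotent fn, retrying with jittered backoff while budget allows."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1 or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt))
```

A production budget would normally be computed over a sliding time window and kept per tenant or per dependency; the counters here only show where the cap is enforced.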


7) Status signals and alerting

Green/Yellow/Red: summary statuses on the service dashboard.
Burn-rate alerts by SLO: fast (1 h) and slow (6-24 h).
Correlation hints: annotations for releases/feature flags/planned work.
Auto-actions: on a "red" deep-check, enable the provider fallback and increase trace sampling.
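
Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a value of 1 means the budget is being spent exactly on schedule. The sketch below evaluates a fast and a slow window; the function names and the thresholds 14.4 and 6 are assumptions (commonly quoted values for a 30-day SLO) and should be tuned per service.

```python
from typing import List, Tuple


def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - slo)."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def burn_alerts(fast: Tuple[int, int], slow: Tuple[int, int], slo: float,
                fast_threshold: float = 14.4, slow_threshold: float = 6.0) -> List[str]:
    """Return which windows fire; each window is an (errors, total) pair."""
    fired = []
    if burn_rate(*fast, slo) >= fast_threshold:
        fired.append("fast")   # e.g. the 1 h window: page immediately
    if burn_rate(*slow, slo) >= slow_threshold:
        fired.append("slow")   # e.g. the 6-24 h window: ticket or slow page
    return fired
```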


8) Smart strategies for iGaming

Payment-aware readiness: the readiness of the betting service takes into account the state of the PSP router and the limits on banks/GEO.
Odds/Lines publishing: the publisher's readiness depends on the aggregate lag per line source and on the propagation time to the edge cache.
Tournament spikes: a temporary policy of more aggressive outlier-detection and waiting-room.


9) Antipatterns

Liveness that calls the database/PSP → mass restarts caused by an external problem.
One "universal" health endpoint without separating startup/readiness/liveness.
Hard timeouts without backoff/jitter → retry storm.
No hysteresis → routing flapping.
A deep-check that triggers restarts (its purpose is diagnostics and routing, not restart).
Hidden 5xx in health endpoints (masking real status).


10) Interface templates

/startupz → `200 OK {"uptimeSec": ..., "version":"..."}`

Checks: init scripts, migrations completed, keys and configs loaded.

/healthz (liveness) → `200 OK {"heapOk": true,"fdOk":true,"eventLoop":"ok"}`

Checks: event loop, process resources, absence of panic/OOM flags.

/readyz (readiness) →

`200 OK/503 {"canServe": true,"db":{"ok":true,"latencyMs":12},"cache":{"ok":true},"queue":{"ok":true,"lag":0},"localQuota":{"ok":true}}`

/health/payments (deep) →

`200/206/503 {"psp.eu": {"ok":false,"reason":"timeout"}, "psp.alt":{"ok":true}, "routerMode":"failover"}`
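
As an illustration of the `/readyz` template, the status code can be derived directly from the sub-checks. The function name `readyz_response` is hypothetical and the sketch is framework-agnostic: it returns a status code and a JSON body that any HTTP handler could send.

```python
import json
from typing import Dict, Tuple


def readyz_response(checks: Dict[str, dict]) -> Tuple[int, str]:
    """Build the /readyz status code and JSON body from named sub-checks."""
    # The instance can serve only if every local sub-check reports ok.
    can_serve = all(check.get("ok", False) for check in checks.values())
    body = {"canServe": can_serve, **checks}
    return (200 if can_serve else 503), json.dumps(body)


status, body = readyz_response({
    "db": {"ok": True, "latencyMs": 12},
    "cache": {"ok": True},
    "queue": {"ok": True, "lag": 0},
    "localQuota": {"ok": True},
})
```

Returning 503 with the same JSON shape keeps the endpoint explainable: the balancer acts on the status code, while operators read the failing sub-check from the body.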


11) Health-circuit quality metrics (KPI/KRI)

Pod transition time from `NotReady` to `Ready` (warmup SLO).
Readiness flapping frequency per service.
% of pods restarted by mistake (root cause: an external dependency).
MTTR of incidents where health mechanisms played a role (before/after).
Share of automatic failovers/feature degradations handled without on-call.
Synthetics accuracy vs RUM (false positives/misses).


12) Implementation Roadmap (4-8 weeks)

Weeks 1-2: inventory of critical paths; roll out startup/liveness/readiness; introduce JSON responses with sub-checks and hysteresis.
Weeks 3-4: add deep-checks for database/cache/broker; synthetics for login/deposit/bet in 2-3 GEOs; enable outlier detection on the gateway/mesh.
Weeks 5-6: payment-aware readiness and PSP fallback; waiting-room for the front end; autoscaling by lag/queues; burn-rate alerts.
Weeks 7-8: chaos days (disabling PSP/database replicas), backoff/jitter verification; timeout fine-tuning, PDB; KPI report and adjustments.


13) Artifacts

Health Spec (per service): list of checks, time budgets, hysteresis, actions with red status.
Runbooks: "Readiness = FALSE: what do we do?", "PSP fallback: steps and criteria for switching back."
Routing Policy: outlier-detection rules, circuit-breakers, percentile thresholds.
Synthetic Playbook: scripts and geographies, SLO synthetics, schedule.
Release Gate: block the release when deep-checks of key dependencies are red.


Result

A well-designed health-check loop is a layered system of signals: lightweight liveness for process viability, readiness for the ability to serve traffic, startup for a safe start, and domain-specific deep-checks for managed degradation and routing. Combined with autoscaling, outlier routing, synthetics and SLO alerting, it lowers the risk of cascading failures, shortens MTTR and stabilizes the business-critical paths of iGaming platforms.