Health-check mechanisms
1) Why
Health checks are the first barrier against cascading failures: they remove unhealthy nodes from rotation, prevent retry storms, simplify graceful degradation and accelerate recovery, preserving SLOs and reducing MTTR.
2) Basic types of checks
Liveness - the process is "alive" (no deadlock/leak/panic). On failure → restart the instance.
Readiness - the service can serve traffic within target SLOs (pools are up, the cache is warm, dependent resources are healthy). On failure → remove from load balancing, but do not restart.
Startup - the service is ready to move on to liveness/readiness (long bootstrap, migrations, warmup). Protects against premature restarts.
Deep-health (domain-specific): business invariants (a bet passes end-to-end, a deposit is authorized by the active PSP). Used as degradation signals, not for immediate restarts.
External/Synthetic: active probes from outside (API path, front-end script, PSP/KYC endpoint) - measure availability as users see it.
3) Probe design: general rules
1. Cheap liveness: do not touch external dependencies; check the event loop, heap/FD, watchdog.
2. Readiness by SLO: check the local resources required to serve traffic (database pools, warm cache, limits). External dependencies - via non-blocking "can-serve?" signals.
3. Latency budget: each probe has its own SLA (for example, ≤100-200 ms); if exceeded - "degraded," but no 5xx on liveness.
4. Backoff & jitter: probe intervals 5-15 s, timeout 1-2 s, with exponential delay on errors to avoid synchronized storms.
5. Hysteresis: N consecutive successes/failures to change status (e.g. `successThreshold = 2`, `failureThreshold = 3`).
6. Versioning: the `/healthz`, `/readyz`, `/startupz` endpoints are stable; deep-checks live under `/health/...` with named checks.
7. No secrets or PII: responses contain only statuses and short codes.
8. Explainability: JSON with a list of sub-checks: `{"status":"degraded","checks":[{"name":"db","ok":true,"latencyMs":18},{"name":"psp.eu","ok":false,"reason":"timeout"}]}`.
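Rule 5 (hysteresis) can be sketched as a small state machine. The threshold names mirror the Kubernetes probe fields; everything else is an illustrative assumption, not a prescribed implementation:

```python
class HysteresisProbe:
    """Flips status only after N consecutive same-type results,
    preventing routing flapping on a single bad probe."""

    def __init__(self, success_threshold=2, failure_threshold=3):
        self.success_threshold = success_threshold
        self.failure_threshold = failure_threshold
        self.healthy = True
        self._successes = 0
        self._failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return the (possibly updated) status."""
        if ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

A single timeout never changes routing; only a run of failures (or recoveries) crossing the threshold does.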
4) Examples of deep-checks by layer
4.1 DB/Cache/Storage
DB: a short `SELECT 1` transaction against the read replica plus a pool check; latency/replication-lag thresholds.
Cache: `GET`/`SET` of a test key + a hit-ratio guard (low hit rate → warning).
Object Storage: HEAD of an existing object (no download).
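The DB check above can be sketched as follows, with sqlite3 standing in for a real replica; the latency budget and the sub-check shape are assumptions:

```python
import sqlite3
import time

def check_db(conn, latency_budget_ms=100):
    """Run a cheap `SELECT 1` and compare round-trip time to the budget.
    Returns a sub-check dict suitable for a /readyz-style response."""
    start = time.monotonic()
    try:
        conn.execute("SELECT 1").fetchone()
    except Exception as exc:
        return {"name": "db", "ok": False, "reason": type(exc).__name__}
    latency_ms = (time.monotonic() - start) * 1000
    return {
        "name": "db",
        "ok": latency_ms <= latency_budget_ms,
        "latencyMs": round(latency_ms, 1),
    }
```

The same shape extends naturally to replication-lag checks: add a `lagMs` field and compare it against its own threshold.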
4.2 Queues/Streaming
Broker: publish + consume on a ping topic within a local partition; consumer-lag thresholds.
DLQ: no spike in dead-letter messages per window.
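The DLQ rule ("no spike per window") can be sketched as a sum over a sliding window of scrapes; the window size and threshold are assumptions to tune per topic:

```python
from collections import deque

class DlqSpikeGuard:
    """Tracks dead-letter counts per scrape and flags a spike when the
    sum over the sliding window exceeds the threshold."""

    def __init__(self, window=5, max_per_window=10):
        self.window = deque(maxlen=window)  # oldest scrape drops off automatically
        self.max_per_window = max_per_window

    def observe(self, dead_letters: int) -> bool:
        """Record one scrape; return True while the DLQ is healthy."""
        self.window.append(dead_letters)
        return sum(self.window) <= self.max_per_window
```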
4.3 External providers (PSP/KYC/AML)
PSP: a lightweight auth probe (non-monetary), verification of contract/certificate/quotas; if no safe probe exists, use proxy metrics (authorization success rate over 5-10 minutes, by bank/GEO).
KYC/AML: health API and queue SLAs; on degradation - switch to an alternative flow/provider.
4.4 API/Front
Synthetics: the transaction path (login → deposit initiation → sandbox bet) in EU/LATAM/APAC.
RUM signal: JS/HTTP error rate and LCP/TTFB - triggers "from the outside."
5) Platform integration
5.1 Kubernetes / Cloud
`startupProbe` protects bootstrap (migrations/cache warmup).
`livenessProbe` is minimalistic; `readinessProbe` accounts for pools/cache/local queues.
Parameters: `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `failureThreshold`, `successThreshold`.
PodDisruptionBudget and `maxUnavailable` that take readiness into account.
HPA/KEDA: scaling by queues/SLIs; readiness affects routing.
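Under the rules above, a probe block in a pod spec might look like this; the paths match the endpoint templates in this document, while the ports and timing numbers are illustrative assumptions:

```yaml
# Slow bootstrap: up to 30 × 10 s before liveness takes over.
startupProbe:
  httpGet: { path: /startupz, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30
# Minimal liveness: process health only, no external dependencies.
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
# Readiness with hysteresis: 3 failures to leave rotation, 2 successes to return.
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
  successThreshold: 2
```

Note that Kubernetes requires `successThreshold: 1` for liveness and startup probes; hysteresis on the success side applies only to readiness.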
5.2 Balancers/gateways/mesh
Health-based routing at L7 (HTTP 200/429/503 semantics).
Outlier detection (Envoy/mesh) - ejection from the pool by error-rate/latency percentiles.
Circuit breaker: limits on concurrent requests/connections to a dependency, integrated with health signals.
5.3 Autoscaling and degradation
Readiness = FALSE → traffic is removed, but the pod stays alive (can warm up).
Deep degradation (PSP down) → feature flags for graceful mode (for example, temporarily hide payment methods, enable a waiting room).
6) Timeout and retry policies
Timeout < SLO budget: `timeout = min(⅓ × p99, 1-2 s)` for synchronous dependencies.
Idempotence: mandatory for retries; use idempotency keys.
Exponential backoff + jitter: prevents synchronized retry waves.
Retry budgets: per-request/per-tenant caps, protection against retry storms.
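The backoff and budget rules above, sketched together; base delay, cap and budget ratio are assumptions to tune per dependency:

```python
import random

def backoff_delays(base=0.1, cap=2.0, attempts=5, rng=random.random):
    """Full-jitter exponential backoff: each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], de-synchronizing retriers."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]

class RetryBudget:
    """Caps retries as a fraction of recent requests, so a failing
    dependency cannot amplify load into a retry storm."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_retry(self) -> bool:
        """Allow a retry only while retries stay within the budget ratio."""
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True
```

With full jitter, a thousand clients hitting the same failing dependency spread their retries across the whole delay window instead of returning in lockstep.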
7) Status signals and alerting
Green/Yellow/Red: summary statuses on the service dashboard.
Burn-rate alerts by SLO: fast (1 h) and slow (6-24 h).
Correlation hints: annotations for releases/feature flags/planned activities.
Auto-actions: on a "red" deep-check - enable the provider fallback, increase trace sampling.
8) Smart strategies for iGaming
Payment-aware readiness: the betting service's readiness accounts for the PSP router state and per-bank/GEO limits.
Odds/Lines publishing: publisher readiness depends on aggregate lag across line sources and propagation time in the edge cache.
Tournament spikes: a temporary policy of more aggressive outlier detection and a waiting room.
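Payment-aware readiness from the list above can be sketched as an aggregation over PSP router state; the function name, input shape and thresholds are hypothetical:

```python
def betting_readiness(psp_router: dict, min_active_psps=1) -> dict:
    """Ready only while enough PSPs are active; 'degraded' keeps the node
    in rotation while signalling reduced payment capacity."""
    active = [name for name, state in psp_router.items() if state.get("ok")]
    if psp_router and len(active) == len(psp_router):
        status = "ok"
    elif len(active) >= min_active_psps:
        status = "degraded"
    else:
        status = "down"
    return {"status": status, "canServe": status != "down", "activePsps": active}
```

The "degraded" middle state is what lets feature flags hide specific payment methods without pulling the whole node out of rotation.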
9) Antipatterns
Liveness that calls the database/PSP → mass restarts due to an external problem.
A single "universal" health endpoint without startup/readiness/liveness separation.
Hard timeouts without backoff/jitter → retry storms.
No hysteresis → routing flapping.
A deep-check that triggers restarts (its purpose is diagnostics and routing, not restarting).
Hidden 5xx on health endpoints (masking the real status).
10) Interface templates
/startupz → `200 OK {"uptimeSec": ..., "version":"..."}`
Checks: init scripts, migrations completed, keys and configs loaded.
/healthz (liveness) → `200 OK {"heapOk":true,"fdOk":true,"eventLoop":"ok"}`
Checks: event loop, process resources, absence of panic/OOM flags.
/readyz (readiness) →
`200 OK/503 {"canServe":true,"db":{"ok":true,"latencyMs":12},"cache":{"ok":true},"queue":{"ok":true,"lag":0},"localQuota":{"ok":true}}`
/health/payments (deep) →
`200/206/503 {"psp.eu":{"ok":false,"reason":"timeout"},"psp.alt":{"ok":true},"routerMode":"failover"}`
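The /readyz template above can be produced by a small aggregator over named sub-checks; this is a sketch with stubbed checks, where the function name and input shape are assumptions:

```python
import json

def readyz(checks: dict) -> tuple:
    """Run named sub-check callables and aggregate into the /readyz shape:
    200 when all pass, 503 otherwise; the body always explains each check."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:
            # A crashing check is reported, never hidden behind a bare 5xx.
            results[name] = {"ok": False, "reason": type(exc).__name__}
    can_serve = all(r.get("ok") for r in results.values())
    body = {"canServe": can_serve, **results}
    return (200 if can_serve else 503), json.dumps(body)
```

Wiring each sub-check as a callable keeps the endpoint explainable: a failing dependency shows up as a named entry with a reason code, matching rule 8 above.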
11) Health-circuit quality metrics (KPI/KRI)
Time for a pod to go from `NotReady` to `Ready` (warmup SLO).
Readiness flapping frequency per service.
% of pods restarted mistakenly (root cause - an external dependency).
MTTR of incidents where health mechanisms played a role (before/after).
Share of automatic failover/feature-degrade without on-call.
Synthetics accuracy vs RUM (false positives/misses).
12) Implementation Roadmap (4-8 weeks)
Weeks 1-2: critical-path inventory; implement startup/liveness/readiness; introduce JSON responses with sub-checks and hysteresis.
Weeks 3-4: add deep-checks for database/cache/broker; synthetics for login/deposit/bet in 2-3 GEOs; enable outlier detection on the gateway/mesh.
Weeks 5-6: payment-aware readiness and PSP fallback; waiting room for the front end; autoscaling by lag/queues; burn-rate alerts.
Weeks 7-8: chaos days (disabling PSP/database replicas), backoff/jitter verification; timeout fine-tuning, PDB; KPI report and correction.
13) Artifacts
Health Spec (per service): list of checks, time budgets, hysteresis, actions on red status.
Runbooks: "Readiness = FALSE: what do we do?", "PSP fallback: steps and return criteria."
Routing Policy: outlier-detection rules, circuit-breakers, percentile thresholds.
Synthetic Playbook: scripts and geographies, SLO synthetics, schedule.
Release Gate: block releases when deep-checks of key dependencies are red.
Result
A well-designed health-check loop is a layered system of signals: cheap liveness for process viability, readiness for the ability to serve traffic, startup for a safe start, and domain-specific deep-checks for managed degradation and routing. Combined with autoscaling, outlier routing, synthetics and SLO alerting, it reduces the risk of cascading failures, lowers MTTR and stabilizes the business-critical paths of iGaming platforms.