Health-check mechanisms
1) Why
Health checks are the first barrier against cascading failures: they remove unhealthy nodes from rotation, prevent retry storms, simplify graceful degradation and accelerate recovery, preserving SLOs and reducing MTTR.
2) Basic types of checks
Liveness - the process is "alive" (no deadlock/leak/panic). On failure → restart the instance.
Readiness - the service can serve traffic within target SLOs (pools are up, the cache is warm, dependent resources are healthy). On failure → remove from load balancing, but do not restart.
Startup - the service is ready to move on to liveness/readiness (long bootstrap, migrations, warmup). Protects against premature restarts.
Deep-health (domain-specific): business invariants (a bet passes end-to-end, a deposit is authorized by the active PSP). Used as degradation signals, not for immediate restarts.
External/Synthetic: active probes from outside (API path, front-end script, PSP/KYC endpoint) - measure availability as users see it.
3) Probe design: general rules
1. Cheap liveness: do not touch external dependencies; check the event loop, heap/FD, watchdog.
2. Readiness by SLO: check the local resources required to serve traffic (database pools, warm cache, limits). External dependencies - via non-blocking "can-serve?" signals.
3. Latency budget: each probe has its own SLA (for example, ≤100-200 ms); if exceeded - "degraded," but no 5xx on liveness.
4. Backoff & jitter: probe intervals 5-15 s, timeout 1-2 s, with exponential delay on errors to avoid synchronized storms.
5. Hysteresis: N consecutive successes/failures to change status (e.g. `successThreshold = 2`, `failureThreshold = 3`).
6. Versioning: the `/healthz`, `/readyz`, `/startupz` endpoints are stable; deep-checks live under `/health/...` with named checks.
7. No secrets or PII: responses contain only statuses and short codes.
8. Explainability: JSON with a list of sub-checks: `{"status":"degraded","checks":[{"name":"db","ok":true,"latencyMs":18},{"name":"psp.eu","ok":false,"reason":"timeout"}]}`.
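Rule 5 (hysteresis) can be sketched as a small state machine. The threshold names mirror the Kubernetes probe fields; everything else is an illustrative assumption, not a prescribed implementation:

```python
class HysteresisProbe:
    """Flips status only after N consecutive same-type results,
    preventing routing flapping on a single bad probe."""

    def __init__(self, success_threshold=2, failure_threshold=3):
        self.success_threshold = success_threshold
        self.failure_threshold = failure_threshold
        self.healthy = True
        self._successes = 0
        self._failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return the (possibly updated) status."""
        if ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```

A single timeout never changes routing; only a run of failures (or recoveries) crossing the threshold does.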
4) Examples of deep-checks by layer
4.1 DB/Cache/Storage
DB: a short `SELECT 1` transaction against the read replica plus a pool check; latency/replication-lag thresholds.
Cache: `GET`/`SET` of a test key + a hit-ratio guard (low hit rate → warning).
Object Storage: HEAD of an existing object (no download).
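The DB check above can be sketched as follows, with sqlite3 standing in for a real replica; the latency budget and the sub-check shape are assumptions:

```python
import sqlite3
import time

def check_db(conn, latency_budget_ms=100):
    """Run a cheap `SELECT 1` and compare round-trip time to the budget.
    Returns a sub-check dict suitable for a /readyz-style response."""
    start = time.monotonic()
    try:
        conn.execute("SELECT 1").fetchone()
    except Exception as exc:
        return {"name": "db", "ok": False, "reason": type(exc).__name__}
    latency_ms = (time.monotonic() - start) * 1000
    return {
        "name": "db",
        "ok": latency_ms <= latency_budget_ms,
        "latencyMs": round(latency_ms, 1),
    }
```

The same shape extends naturally to replication-lag checks: add a `lagMs` field and compare it against its own threshold.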
4.2 Queues/Streaming
Broker: publish + consume on a ping topic within a local partition; consumer-lag thresholds.
DLQ: no spike in dead-letter messages per window.
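The DLQ rule ("no spike per window") can be sketched as a sum over a sliding window of scrapes; the window size and threshold are assumptions to tune per topic:

```python
from collections import deque

class DlqSpikeGuard:
    """Tracks dead-letter counts per scrape and flags a spike when the
    sum over the sliding window exceeds the threshold."""

    def __init__(self, window=5, max_per_window=10):
        self.window = deque(maxlen=window)  # oldest scrape drops off automatically
        self.max_per_window = max_per_window

    def observe(self, dead_letters: int) -> bool:
        """Record one scrape; return True while the DLQ is healthy."""
        self.window.append(dead_letters)
        return sum(self.window) <= self.max_per_window
```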
4.3 External providers (PSP/KYC/AML)
PSP: a lightweight auth probe (non-monetary), verification of contract/certificate/quotas; if no safe probe exists, use proxy metrics (authorization success rate over 5-10 minutes, by bank/GEO).
KYC/AML: health API and queue SLAs; on degradation - switch to an alternative flow/provider.
4.4 API/Front
Synthetics: the transaction path (login → deposit initiation → sandbox bet) in EU/LATAM/APAC.
RUM signal: JS/HTTP error rate and LCP/TTFB - triggers "from the outside."
5) Platform integration
5.1 Kubernetes / Cloud
`startupProbe` protects bootstrap (migrations/cache warmup).
`livenessProbe` is minimalistic; `readinessProbe` accounts for pools/cache/local queues.
Parameters: `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `failureThreshold`, `successThreshold`.
PodDisruptionBudget and `maxUnavailable` that take readiness into account.
HPA/KEDA: scaling by queues/SLIs; readiness affects routing.
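Under the rules above, a probe block in a pod spec might look like this; the paths match the endpoint templates in this document, while the ports and timing numbers are illustrative assumptions:

```yaml
# Slow bootstrap: up to 30 × 10 s before liveness takes over.
startupProbe:
  httpGet: { path: /startupz, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30
# Minimal liveness: process health only, no external dependencies.
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
# Readiness with hysteresis: 3 failures to leave rotation, 2 successes to return.
readinessProbe:
  httpGet: { path: /readyz, port: 8080 }
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
  successThreshold: 2
```

Note that Kubernetes requires `successThreshold: 1` for liveness and startup probes; hysteresis on the success side applies only to readiness.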
5.2 Balancers/gateways/mesh
Health-based routing at L7 (HTTP 200/429/503 semantics).
Outlier detection (Envoy/mesh) - ejection from the pool by error-rate/latency percentiles.
Circuit breaker: limits on concurrent requests/connections to a dependency, integrated with health signals.
5.3 Autoscaling and degradation
Readiness = FALSE → traffic is removed, but the pod stays alive (can warm up).
Deep degradation (PSP down) → feature flags for graceful mode (for example, temporarily hide payment methods, enable a waiting room).
6) Timeout and retry policies
Timeout < SLO budget: `timeout = min(⅓ × p99, 1-2 s)` for synchronous dependencies.
Idempotence: mandatory for retries; use idempotency keys.
Exponential backoff + jitter: prevents synchronized retry waves.
Retry budgets: per-request/per-tenant caps, protection against retry storms.
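The backoff and budget rules above, sketched together; base delay, cap and budget ratio are assumptions to tune per dependency:

```python
import random

def backoff_delays(base=0.1, cap=2.0, attempts=5, rng=random.random):
    """Full-jitter exponential backoff: each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], de-synchronizing retriers."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]

class RetryBudget:
    """Caps retries as a fraction of recent requests, so a failing
    dependency cannot amplify load into a retry storm."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_retry(self) -> bool:
        """Allow a retry only while retries stay within the budget ratio."""
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True
```

With full jitter, a thousand clients hitting the same failing dependency spread their retries across the whole delay window instead of returning in lockstep.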
7) Status signals and alerting
Green/Yellow/Red: summary statuses on the service dashboard.
Burn-rate alerts by SLO: fast (1 h) and slow (6-24 h).
Correlation hints: annotations for releases/feature flags/planned activities.
Auto-actions: on a "red" deep-check - enable the provider fallback, increase trace sampling.
8) Smart strategies for iGaming
Payment-aware readiness: the betting service's readiness accounts for the PSP router state and per-bank/GEO limits.
Odds/Lines publishing: publisher readiness depends on aggregate lag across line sources and propagation time in the edge cache.
Tournament spikes: a temporary policy of more aggressive outlier detection and a waiting room.
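Payment-aware readiness from the list above can be sketched as an aggregation over PSP router state; the function name, input shape and thresholds are hypothetical:

```python
def betting_readiness(psp_router: dict, min_active_psps=1) -> dict:
    """Ready only while enough PSPs are active; 'degraded' keeps the node
    in rotation while signalling reduced payment capacity."""
    active = [name for name, state in psp_router.items() if state.get("ok")]
    if psp_router and len(active) == len(psp_router):
        status = "ok"
    elif len(active) >= min_active_psps:
        status = "degraded"
    else:
        status = "down"
    return {"status": status, "canServe": status != "down", "activePsps": active}
```

The "degraded" middle state is what lets feature flags hide specific payment methods without pulling the whole node out of rotation.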
9) Antipatterns
Liveness that calls the database/PSP → mass restarts due to an external problem.
A single "universal" health endpoint without startup/readiness/liveness separation.
Hard timeouts without backoff/jitter → retry storms.
No hysteresis → routing flapping.
A deep-check that triggers restarts (its purpose is diagnostics and routing, not restarting).
Hidden 5xx on health endpoints (masking the real status).
10) Interface templates
/startupz → `200 OK {"uptimeSec": ..., "version":"..."}`
Checks: init scripts, migrations completed, keys and configs loaded.
/healthz (liveness) → `200 OK {"heapOk":true,"fdOk":true,"eventLoop":"ok"}`
Checks: event loop, process resources, absence of panic/OOM flags.
/readyz (readiness) →
`200 OK/503 {"canServe":true,"db":{"ok":true,"latencyMs":12},"cache":{"ok":true},"queue":{"ok":true,"lag":0},"localQuota":{"ok":true}}`
/health/payments (deep) →
`200/206/503 {"psp.eu":{"ok":false,"reason":"timeout"},"psp.alt":{"ok":true},"routerMode":"failover"}`
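The /readyz template above can be produced by a small aggregator over named sub-checks; this is a sketch with stubbed checks, where the function name and input shape are assumptions:

```python
import json

def readyz(checks: dict) -> tuple:
    """Run named sub-check callables and aggregate into the /readyz shape:
    200 when all pass, 503 otherwise; the body always explains each check."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:
            # A crashing check is reported, never hidden behind a bare 5xx.
            results[name] = {"ok": False, "reason": type(exc).__name__}
    can_serve = all(r.get("ok") for r in results.values())
    body = {"canServe": can_serve, **results}
    return (200 if can_serve else 503), json.dumps(body)
```

Wiring each sub-check as a callable keeps the endpoint explainable: a failing dependency shows up as a named entry with a reason code, matching rule 8 above.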
11) Health-circuit quality metrics (KPI/KRI)
Time for a pod to go from `NotReady` to `Ready` (warmup SLO).
Readiness flapping frequency per service.
% of pods restarted mistakenly (root cause - an external dependency).
MTTR of incidents where health mechanisms played a role (before/after).
Share of automatic failover/feature-degrade without on-call.
Synthetics accuracy vs RUM (false positives/misses).
12) Implementation Roadmap (4-8 weeks)
Weeks 1-2: critical-path inventory; implement startup/liveness/readiness; introduce JSON responses with sub-checks and hysteresis.
Weeks 3-4: add deep-checks for database/cache/broker; synthetics for login/deposit/bet in 2-3 GEOs; enable outlier detection on the gateway/mesh.
Weeks 5-6: payment-aware readiness and PSP fallback; waiting room for the front end; autoscaling by lag/queues; burn-rate alerts.
Weeks 7-8: chaos days (disabling PSP/database replicas), backoff/jitter verification; timeout fine-tuning, PDB; KPI report and correction.
13) Artifacts
Health Spec (per service): list of checks, time budgets, hysteresis, actions on red status.
Runbooks: "Readiness = FALSE: what do we do?", "PSP fallback: steps and return criteria."
Routing Policy: outlier-detection rules, circuit-breakers, percentile thresholds.
Synthetic Playbook: scripts and geographies, SLO synthetics, schedule.
Release Gate: block releases when deep-checks of key dependencies are red.
Result
A well-designed health-check loop is a layered system of signals: cheap liveness for process viability, readiness for the ability to serve traffic, startup for a safe start, and domain-specific deep-checks for managed degradation and routing. Combined with autoscaling, outlier routing, synthetics and SLO alerting, it reduces the risk of cascading failures, lowers MTTR and stabilizes the business-critical paths of iGaming platforms.