Liveness/Readiness probes
2) Design principles
1. Separate semantics.
Readiness: the instance's ability to serve requests from outside (takes critical dependencies into account).
Liveness: detection of an "unrecoverable" process state.
2. Fail-fast, but not false-fast. Tune the timeouts and `failureThreshold` so that short spikes do not cause unnecessary restarts.
3. No heavy operations in probes. A check should be fast (≤100-200 ms) and free of side effects.
4. Graceful degradation. When dependencies are partially unavailable, Readiness = OK as long as a safe fallback exists (cached or coarser-grained data).
5. Deterministic I/O. Statuses depend only on the current state, not on "random" external checks.
3) Semantics of health endpoints
3.1 HTTP approach (recommended)
`GET /healthz/liveness` → 200 if the process is "alive" (the event loop is spinning, GC is not stuck, the watchdog "heart" is beating).
`GET /healthz/readiness` → 200 if the instance is ready for critical-class traffic. Checks: connection pools, local caches, availability of the core business logic.
`GET /healthz/startup` → 200 after initialization (migrations, cache warm-up, model loading).
- Do not call external databases/APIs in liveness: this leads to "suicides" during dependency incidents.
- In readiness, you may check critical dependencies, but with timeouts and degradation: if a valid fallback exists, do not fail the probe.
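The "with timeouts" rule for readiness checks can be sketched with the standard library alone. This is a minimal illustration, not a prescribed implementation: `bounded_check` and `slow_db_ping` are hypothetical names, and the 0.3 s sleep only stands in for a dependency that hangs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Small shared pool so a probe never spawns unbounded threads.
_pool = ThreadPoolExecutor(max_workers=2)


def bounded_check(check, timeout_seconds):
    """Run check() in a worker thread; a timeout or an exception counts as failure."""
    future = _pool.submit(check)
    try:
        return bool(future.result(timeout=timeout_seconds))
    except Exception:  # includes concurrent.futures.TimeoutError
        return False


def slow_db_ping():
    """Hypothetical dependency check that takes too long to answer."""
    time.sleep(0.3)
    return True


if __name__ == "__main__":
    print(bounded_check(lambda: True, 0.05))   # fast, healthy dependency
    print(bounded_check(slow_db_ping, 0.05))   # exceeds the budget, reported as failed
```

Note that the worker thread is not cancelled on timeout: a check that hangs keeps occupying a pool slot until it returns, so the check itself should also carry its own client-side timeout.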
3.2 gRPC Health Checking
Use the standard `grpc.health.v1.Health/Check` with service-scoped states (`SERVING`, `NOT_SERVING`). For Kubernetes, use gRPC probes (or an HTTP proxy).
3.3 Internal triggers
Watchdog "soft" stop: on SIGTERM, set Readiness = FAIL → drain the queues within `terminationGracePeriodSeconds` → exit.
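The soft-stop sequence can be sketched as follows, using only the Python standard library. `DrainState`, `on_sigterm`, and `install_handler` are hypothetical names for illustration; a real service would wire the in-flight counter into its own request or queue handling.

```python
import signal
import threading
import time


class DrainState:
    """Tracks readiness and in-flight work for a soft stop (illustrative names)."""

    def __init__(self):
        self.ready = True
        self.in_flight = 0
        self._lock = threading.Lock()

    def begin_request(self):
        with self._lock:
            if not self.ready:
                return False  # draining: refuse new work
            self.in_flight += 1
            return True

    def end_request(self):
        with self._lock:
            self.in_flight -= 1

    def start_drain(self):
        # Plain attribute write, no lock: safe to call from a signal handler.
        self.ready = False

    def wait_drained(self, grace_seconds, poll=0.01):
        """Return True once in-flight work hits zero, False if the grace period ends first."""
        deadline = time.monotonic() + grace_seconds
        while True:
            with self._lock:
                if self.in_flight == 0:
                    return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)


state = DrainState()


def on_sigterm(signum, frame):
    """Flip readiness to FAIL; the main loop then waits for the drain and exits."""
    state.start_drain()


def install_handler():
    """Call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, on_sigterm)
```

The readiness endpoint would report FAIL whenever `state.ready` is false, and the main loop would call `state.wait_drained(...)` with a value below `terminationGracePeriodSeconds` before exiting.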
4) Timings and thresholds (tuning)
Key fields of Kubernetes probes: `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `successThreshold`, `failureThreshold`.
- readiness: `period=5s, timeout=0.2–0.5s, failure=2`
- liveness: `period=10s, timeout=0.2–0.5s, failure=3`
- startup: `period=5s, failure=60` (up to ~5 min)
- readiness/liveness are activated only after startup succeeds
- readiness reflects readiness to process work (connection to the broker, whether DLQ degradation is present),
- liveness: the internal heartbeat loop.
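The thresholds above translate into reaction windows by simple arithmetic: roughly `periodSeconds × failureThreshold`. The helper name `detection_window` is hypothetical, and the estimate deliberately ignores `timeoutSeconds` and `initialDelaySeconds`, so treat the numbers as a rough budget rather than an exact bound.

```python
def detection_window(period_seconds, failure_threshold):
    """Rough time from the first failed probe until the threshold is reached."""
    return period_seconds * failure_threshold


profiles = {
    "readiness": detection_window(5, 2),   # removed from endpoints after ~10 s
    "liveness": detection_window(10, 3),   # restarted after ~30 s
    "startup": detection_window(5, 60),    # up to ~300 s (5 min) to initialize
}

for name, seconds in profiles.items():
    print(f"{name}: ~{seconds} s")
```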
Backoff on failures: in the application, use exponential backoff when reconnecting to dependencies; otherwise readiness will flap in a sawtooth pattern.
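A minimal sketch of capped exponential backoff with jitter for reconnects. The function names and default values (`base`, `factor`, `cap`, `attempts`) are assumptions chosen for illustration, and only `ConnectionError` is treated as retryable here.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=6):
    """Yield capped exponential delays: base, base*factor, ... never above cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def reconnect(connect, sleep=time.sleep, rng=random.random, **kwargs):
    """Call connect() until it succeeds, sleeping a jittered delay between tries."""
    for delay in backoff_delays(**kwargs):
        try:
            return connect()
        except ConnectionError:
            sleep(rng() * delay)  # full jitter: uniform in [0, delay)
    return connect()  # final attempt; let the error propagate to the caller
```

Jitter keeps many instances from reconnecting in lockstep after a shared dependency recovers; `sleep` and `rng` are injectable only to make the behaviour testable.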
5) Configurations (fragments)
5.1 Kubernetes, HTTP probes
```yaml
livenessProbe:
  httpGet: { path: /healthz/liveness, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet: { path: /healthz/readiness, port: 8080 }
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 2
startupProbe:
  httpGet: { path: /healthz/startup, port: 8080 }
  periodSeconds: 5
  failureThreshold: 60
```
5.2 Kubernetes, gRPC probe
```yaml
readinessProbe:
  grpc:
    port: 9090
    service: my.app.Service
  periodSeconds: 5
  timeoutSeconds: 1
```
5.3 Graceful shutdown
```yaml
terminationGracePeriodSeconds: 30   # pod-level field
lifecycle:                          # container-level field
  preStop:
    exec:
      # POST matches the drain handler, which is registered as a POST route
      command: ["/bin/sh","-c","curl -s -X POST localhost:8080/healthz/drain && sleep 5"]
```
`/healthz/drain` inside the service switches Readiness to FAIL (stop accepting new traffic) and gives active requests time to complete.
6) Dependencies and degradation
Critical (the service cannot operate without them): the authorization database for `/login`, the payment gateway for `/pay`. They may be checked in readiness with a timeout ≤80% of the probe's `timeoutSeconds`.
Non-critical: analytics, email, the cache layer (if data can still be loaded without it). Do not include them in readiness; use a fallback.
Feature flags: under partial degradation, disable the dependent features while keeping Readiness = OK.
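The critical/non-critical split can be expressed as a small decision function. This is a sketch under stated assumptions: the `Dependency` fields, `check_budget`, and the dependency names are hypothetical, and the `healthy` flags are assumed to be updated elsewhere (for example by time-bounded checks or background monitors).

```python
from dataclasses import dataclass


def check_budget(probe_timeout_seconds, fraction=0.8):
    """Time a dependency check may spend: at most 80% of the probe timeout."""
    return probe_timeout_seconds * fraction


@dataclass
class Dependency:
    name: str
    critical: bool
    healthy: bool = True        # updated elsewhere, not inside the probe
    has_fallback: bool = False  # a safe degraded mode exists


def readiness(deps):
    """Return (http_status, degraded_names) from the current dependency states."""
    blocking = [d.name for d in deps
                if d.critical and not d.healthy and not d.has_fallback]
    degraded = [d.name for d in deps
                if not d.healthy and d.name not in blocking]
    return (503 if blocking else 200), degraded


if __name__ == "__main__":
    deps = [
        Dependency("auth_db", critical=True),
        Dependency("email", critical=False, healthy=False),
    ]
    print(readiness(deps))  # email is down, but readiness stays OK
```

The `degraded` list is what a feature-flag layer would consume to switch off dependent features while the instance keeps receiving traffic.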
7) Queues and background handlers
Consumers/Workers:
- Readiness = OK if the subscription/connection to the broker is established and there is capacity to process.
- When the DLQ overflows or lag grows → Readiness may remain OK (if messages are still accepted and queued), but the "freshness/lag" SLI fires; alert on the data.
- Liveness: monitor the poll loop/heartbeat with a deadlock detector.
Idempotency: speeds up recovery after a liveness restart.
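A worker's liveness can be driven by a heartbeat that the poll loop updates on every iteration. The sketch below assumes a hypothetical `Heartbeat` class and a 30-second staleness limit; the clock is injectable only so the behaviour can be tested without waiting.

```python
import time


class Heartbeat:
    """Liveness signal for a poll loop: alive while beats keep arriving."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Call once per poll-loop iteration."""
        self._last_beat = self._clock()

    def is_alive(self):
        """True if the last beat is recent enough; the liveness endpoint reads this."""
        return (self._clock() - self._last_beat) <= self._max_age


if __name__ == "__main__":
    hb = Heartbeat(max_age_seconds=30)
    hb.beat()
    print(hb.is_alive())
```

Because the check only compares timestamps, it touches no external dependency and stays within the "process life only" rule for liveness.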
8) Sidecar/mesh/ingress
When using a service mesh (Istio/Linkerd), probes may pass through the sidecar:
- Enable `readinessGates` (K8s) to account for sidecar status,
- Ensure that probes are not blocked by mTLS (or add exceptions).
- Ingress/Envoy/Nginx: proxy `/healthz/` locally only; do not expose internals to the outside.
9) Security and privacy
Health endpoints should not disclose configs, library versions, or error strings: only "OK/FAIL" plus a minimal cause code.
Restrict outside access (NetworkPolicy/ACL). For public access, expose only a liveness ping without details.
Log health checks at the DEBUG level, with throttling.
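One way to keep responses free of internal details is to map internal failure kinds to opaque cause codes before building the body. The codes and the `health_body` helper below are hypothetical; the point is that raw error strings never reach the response.

```python
import json

# Hypothetical mapping from internal failure kinds to opaque cause codes.
CAUSE_CODES = {"db_timeout": "E_DEP", "init_pending": "E_INIT"}


def health_body(ok, internal_cause=None):
    """Build a response body exposing only the status and an opaque cause code."""
    if ok:
        return json.dumps({"status": "OK"})
    code = CAUSE_CODES.get(internal_cause, "E_UNKNOWN")
    return json.dumps({"status": "FAIL", "cause": code})


if __name__ == "__main__":
    print(health_body(True))
    print(health_body(False, "db_timeout"))
```

Anything not present in the mapping collapses to a single generic code, so an unexpected exception message cannot leak by accident.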
10) Observability and SLO
Export metrics: `health_readiness{status}`, `health_liveness{status}`, probe processing time.
Tie Readiness flaps to availability SLOs (removal from endpoints → 5xx/connection reset).
- "Frequent liveness restarts > N/hour": a symptom of deadlocks/leaks.
- "Readiness flaps > X/15 min": a symptom of dependency/network problems.
- Correlation with deploys (`service.version`).
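The "flaps > X/15 min" rule amounts to counting status transitions in a sliding window. The sketch below assumes a hypothetical `FlapDetector` fed with each observed readiness status and a timestamp in seconds; in practice this logic usually lives in the alerting system rather than in the service.

```python
from collections import deque


class FlapDetector:
    """Counts readiness status transitions inside a sliding time window."""

    def __init__(self, window_seconds=900, max_transitions=4):
        self._window = window_seconds
        self._max = max_transitions
        self._transitions = deque()
        self._last_status = None

    def observe(self, status, now):
        """Record a status sample; return True if the flap threshold is exceeded."""
        if self._last_status is not None and status != self._last_status:
            self._transitions.append(now)
        self._last_status = status
        # Drop transitions that have fallen out of the window.
        while self._transitions and now - self._transitions[0] > self._window:
            self._transitions.popleft()
        return len(self._transitions) > self._max
```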
11) Testing
Unit/Contract: the `/healthz/` endpoints return the correct statuses when each dependency is disabled.
Chaos: when the database/cache/broker is disabled, Readiness should fail or switch to the fallback strictly according to the model. Liveness must not trigger while the process is "alive."
Load/Soak: under load, health endpoints must remain fast (they should not compete for resources with request handling).
Canary: Check Readiness stability before increasing traffic.
12) Frequent mistakes and how to avoid them
Liveness checks databases/external APIs. The result is endless restarts during incidents. Solution: limit liveness to "process life."
Heavy checks in probes. They lead to false failures. Solution: light checks plus separate background health monitors.
No startup probe. Slow starts are "killed" by liveness. Solution: add a startup probe with a wide window.
No graceful shutdown. Occasional 5xx during deploys. Solution: preStop plus removal from load balancing.
Flap storms. Thresholds are too aggressive. Solution: raise `failureThreshold`, increase `timeoutSeconds`, add backoff.
The same endpoint for everything. Semantics get mixed. Solution: separate `liveness`/`readiness`/`startup` endpoints.
13) Mini Implementation Patterns
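Simple HTTP handler (pseudocode):
```python
from fastapi import FastAPI, Response  # FastAPI-style routing is assumed here

app = FastAPI()
# core_is_ready(), db, set_readiness() and INIT_DONE are provided by the application.

@app.get("/healthz/liveness")
def liveness():
    return Response(status_code=200)  # no dependency checks: answering means alive

@app.get("/healthz/readiness")
def readiness():
    ok_core = core_is_ready()        # local pools/caches/initialization
    ok_db = db.ping(timeout_ms=50)   # only if the DB is critical
    return Response(status_code=200 if (ok_core and ok_db) else 503)

@app.get("/healthz/startup")
def startup():
    return Response(status_code=200 if INIT_DONE else 503)

@app.post("/healthz/drain")
def drain():
    set_readiness(False)             # stop accepting new traffic
    return Response(status_code=200)
```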
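gRPC health (idea):
```go
// import "google.golang.org/grpc/health" and healthpb "google.golang.org/grpc/health/grpc_health_v1"
healthServer := health.NewServer()
healthpb.RegisterHealthServer(grpcServer, healthServer) // grpcServer is the application's *grpc.Server
healthServer.SetServingStatus("my.app.Service", healthpb.HealthCheckResponse_SERVING) // or HealthCheckResponse_NOT_SERVING
```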
readinessGates (relevant with a mesh):
```yaml
spec:
  readinessGates:
    - conditionType: "proxy.istio.io/ready"
```
14) Checklists
Before production
- Liveness/readiness/startup endpoints are separated, and their semantics are documented.
- Liveness does not touch external dependencies; Readiness checks only critical ones, with timeouts and a fallback.
- `initialDelaySeconds`/`periodSeconds`/`timeoutSeconds`/`failureThreshold` are tuned for the service profile.
- Graceful shutdown is enabled: `preStop` plus removal from load balancing.
- Health metrics/logs are connected; alerts on restarts/flaps are set up.
- Dependency failure and slow-start tests have passed.
Operations
- Weekly report on restarts and readiness flaps.
- Tune thresholds after incidents; correlate with releases.
- Regular chaos tests that disable dependencies.
- Keep the semantics up to date when dependency criticality changes.
15) FAQ
Q: Can a single probe cover everything?
A: Not recommended. Separate `startup`, `readiness`, and `liveness`: this reduces false positives and speeds up RCA.
Q: Should I check the cache in readiness?
A: If there is a correct (albeit slower) mode without the cache, do not fail readiness; just switch to degraded mode.
Q: What to do about frequent liveness restarts?
A: Rule out a deadlock/leak first; then loosen the thresholds and add a watchdog in the application.
Q: How do we account for multi-tenancy?
A: Readiness should reflect the ability to serve any tenant's traffic. For problems specific to a single tenant, do not change readiness; signal them with separate SLIs/alerts.