Liveness/Readiness probes

2) Design principles

1. Separate semantics.

Readiness: the instance's ability to serve external requests (takes critical dependencies into account).
Liveness: detection of an "incurable" process state, one that only a restart can fix.
2. Fail fast, but not falsely. Tune the timeouts and `failureThreshold` so that short bursts do not lead to unnecessary restarts.
3. No heavy operations in probes. A check should be fast (≤100-200 ms) and free of side effects.
4. Graceful degradation. If dependencies are partially unavailable, Readiness = OK as long as there is a safe fallback (cache/coarsening).
5. Deterministic I/O. Statuses depend only on the current state, not on "random" external tests.

3) Semantics of health endpoints

3.1 HTTP approach (recommended)

`GET /healthz/liveness` → 200 if the process is "alive" (the event loop is spinning, GC is not stuck, the watchdog "heart" is beating).
`GET /healthz/readiness` → 200 if the instance is ready for critical-class traffic. Checks: connection pool, local caches, availability of the business-logic core.
`GET /healthz/startup` → 200 after initialization (migrations/cache warm-up/model loading).

Rules:

Do not call external databases/APIs in liveness: this leads to "suicides" during dependency incidents.
In readiness, you can check critical dependencies, but with timeouts and degradation: if a valid fallback exists, do not fail readiness.
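The liveness rule above can be illustrated with a heartbeat that the main loop updates and the probe handler only reads. This is a minimal sketch, not a prescribed implementation: the `Heartbeat` class, the `max_silence_s` parameter, and `liveness_status` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


class Heartbeat:
    """Tracks the last time the main loop reported progress (hypothetical helper)."""

    def __init__(self, max_silence_s: float, clock: Callable[[], float] = time.monotonic):
        self._max_silence_s = max_silence_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        # Called by the event loop / worker loop on every iteration.
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        # Purely local check: no database, no network, no side effects.
        return (self._clock() - self._last_beat) <= self._max_silence_s


def liveness_status(hb: Heartbeat) -> int:
    """HTTP status for /healthz/liveness: 200 while the heart beats, 503 otherwise."""
    return 200 if hb.is_alive() else 503
```

Because the handler never leaves the process, a database or API outage cannot turn into a restart loop; only a stalled loop does.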

3.2 gRPC Health Checking

Use the standard `grpc.health.v1.Health/Check` with service-scoped states (`SERVING`, `NOT_SERVING`). For Kubernetes, use gRPC probes (or an HTTP proxy).

3.3 Internal triggers

Watchdog "soft" stop: on SIGTERM, set Readiness = FAIL → finish in-flight work and drain queues within `terminationGracePeriodSeconds` → exit.
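The soft-stop sequence could look roughly like the sketch below. The `Lifecycle` class and its attribute names are assumptions made for illustration; the main loop is expected to poll `draining`, finish its queued work, and exit.

```python
import signal


class Lifecycle:
    """Process-wide flags flipped by SIGTERM and read by the readiness handler (hypothetical)."""

    def __init__(self) -> None:
        self.accepting = True   # Readiness = OK while True
        self.draining = False   # main loop should finish queued work and exit

    def on_sigterm(self, signum, frame) -> None:
        # Plain attribute writes only: no locks and no I/O inside a signal handler.
        self.accepting = False
        self.draining = True

    def readiness_status(self) -> int:
        return 200 if self.accepting else 503


def install(lifecycle: Lifecycle) -> None:
    # Call once from the main thread during startup.
    signal.signal(signal.SIGTERM, lifecycle.on_sigterm)
```

Failing readiness first lets the load balancer stop routing new requests before the process starts shutting down.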

4) Timings and thresholds (tuning)

Key fields of Kubernetes probes:

`initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `successThreshold`, `failureThreshold`.

Recommended starting profiles:

Web/API with fast start:

readiness: `period=5s, timeout=0.2–0.5s, failure=2`
liveness: `period=10s, timeout=0.2–0.5s, failure=3`

Note that Kubernetes `timeoutSeconds` is an integer with a minimum of 1, so a sub-second budget has to be enforced inside the handler itself.

Hard start (JIT/models/warm-up):

startup: `period=5s, failure=60` (up to ~5 min)
readiness/liveness are activated only after startup succeeds

Batch/consumer:

readiness reflects the ability to process messages (connection to the broker, whether DLQ degradation is present),
liveness: internal heartbeat loop.
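The "up to ~5 min" figure in the hard-start profile is simple arithmetic: the startup probe tolerates roughly `periodSeconds × failureThreshold` before the container is restarted. A small sketch, with hypothetical helper names, for sizing the window from a measured worst-case start:

```python
import math


def startup_window_s(period_s: int, failure_threshold: int, initial_delay_s: int = 0) -> int:
    """Approximate time the startup probe tolerates before the container is restarted."""
    return initial_delay_s + period_s * failure_threshold


def failure_threshold_for(worst_case_start_s: float, period_s: int) -> int:
    """Smallest failureThreshold whose window covers the slowest observed start."""
    return math.ceil(worst_case_start_s / period_s)
```

For example, `startup_window_s(5, 60)` gives 300 seconds, and a service whose slowest observed start takes 280 seconds needs `failure_threshold_for(280, 5)`, which is 56, so 60 leaves a small margin.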

Backoff on failures: in the application, use exponential backoff when reconnecting to dependencies, otherwise readiness will flap in a "sawtooth" pattern.
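A minimal sketch of such a backoff schedule follows. The function name and the default values (`base_s`, `factor`, `cap_s`, `jitter`) are illustrative assumptions, not recommendations from this article.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    base_s: float = 0.5,
    factor: float = 2.0,
    cap_s: float = 30.0,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield reconnect delays that grow geometrically up to cap_s.

    jitter is a fraction of the current delay (0.2 means +/-20%) used to
    spread reconnect attempts from many instances over time.
    """
    rng = rng or random.Random()
    delay = base_s
    while True:
        spread = delay * jitter
        yield min(cap_s, delay + rng.uniform(-spread, spread))
        delay = min(cap_s, delay * factor)
```

With `base_s=1, factor=2, cap_s=8` and no jitter, the first five delays are 1, 2, 4, 8, 8 seconds. The reconnect loop would sleep for `next(delays)` after each failed attempt and create a fresh generator after a success.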

5) Configurations (fragments)

5.1 Kubernetes, HTTP probes

yaml
livenessProbe:
  httpGet: { path: /healthz/liveness, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3

readinessProbe:
  httpGet: { path: /healthz/readiness, port: 8080 }
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 2

startupProbe:
  httpGet: { path: /healthz/startup, port: 8080 }
  periodSeconds: 5
  failureThreshold: 60

5.2 Kubernetes, gRPC probe

yaml
readinessProbe:
  grpc:
    port: 9090
    service: my.app.Service
  periodSeconds: 5
  timeoutSeconds: 1

5.3 Graceful shutdown

yaml
terminationGracePeriodSeconds: 30   # pod-level field
lifecycle:                          # container-level field
  preStop:
    exec:
      command: ["/bin/sh","-c","curl -s -X POST localhost:8080/healthz/drain && sleep 5"]

`/healthz/drain` switches the service to Readiness = FAIL (stop accepting new requests) and gives active requests time to complete.

6) Dependencies and degradation

Critical (requests cannot be served without them): the authorization database for `/login`, the payment gateway for `/pay`. These can be checked in readiness with a timeout of ≤80% of the probe's `timeoutSeconds`.
Non-critical: analytics, email, the cache layer if a (slower) path without it exists. Do not include them in readiness; use a fallback.
Feature flags: under partial degradation, disable the dependent features while keeping Readiness = OK.
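The 80% rule can be sketched as a shared deadline for all critical checks. This is an illustrative sketch under stated assumptions: `readiness_status`, the `critical` mapping, and the pool size are hypothetical, and each check is assumed to be a callable returning `True` or `False`. Non-critical dependencies are simply not passed in.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Shared pool so that probes do not spawn new threads on every call.
_POOL = ThreadPoolExecutor(max_workers=4)


def readiness_status(
    critical: Dict[str, Callable[[], bool]],
    probe_timeout_s: float = 1.0,
) -> int:
    """Return 200 only if every critical check passes within 80% of the probe timeout."""
    deadline = time.monotonic() + 0.8 * probe_timeout_s
    # Checks run in parallel and share one deadline, so the total stays bounded.
    futures = {name: _POOL.submit(check) for name, check in critical.items()}
    for future in futures.values():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            if not future.result(timeout=remaining):
                return 503
        except Exception:
            # Timeout or an error inside the check: both mean "not ready".
            return 503
    return 200
```

A check that overruns the deadline keeps running in its worker thread, so the checks themselves should also carry their own client-side timeouts.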

7) Queues and background handlers

Consumers/Workers:

Readiness = OK if the subscription/connection to the broker is established and there is capacity to process.
On DLQ overflow or growing lag, Readiness may remain OK (if messages are still accepted and enqueued), but the "freshness/lag" SLI fires: alert on the data.
Liveness: monitor the poll cycle/heartbeat and use a deadlock detector.

Idempotency: speeds up recovery after a liveness-triggered restart.
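The split between the readiness signal and the lag alert for a consumer might be expressed as follows. The `ConsumerSignals` type, the `free_slots` capacity measure, and the threshold parameter are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerSignals:
    ready: bool      # feeds the readiness probe
    lag_alert: bool  # feeds the freshness/lag SLI alert, not the probe


def consumer_signals(
    connected: bool,
    free_slots: int,
    lag: int,
    lag_alert_threshold: int,
) -> ConsumerSignals:
    """Readiness depends on connectivity and capacity only; lag is reported separately."""
    ready = connected and free_slots > 0
    return ConsumerSignals(ready=ready, lag_alert=lag > lag_alert_threshold)
```

Keeping lag out of the readiness decision avoids restarting or unloading a consumer that is healthy but simply behind.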

8) Sidecar/mesh/ingress

When using a service mesh (Istio/Linkerd), probes may pass through the sidecar:

Use `readinessGates` (K8s) to take the sidecar status into account,
Ensure that probes are not blocked by mTLS (or add exceptions).
Ingress/Envoy/Nginx: proxy `/healthz/*` locally; do not expose internals externally.

9) Security and privacy

Health endpoints should not disclose configs, library versions, or error strings: only "OK/FAIL" plus a minimal cause code.
Restrict outside access (NetworkPolicy/ACL). For public access, allow only a liveness ping without details.
Log health checks at the DEBUG level, with throttling.
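One way to keep responses minimal is to map internal component names to a fixed set of cause codes and never echo exception text. The codes and the `health_body` helper below are hypothetical examples, not a standard.

```python
import json
from typing import Optional

# Hypothetical cause codes: stable, coarse, and free of internal details.
CAUSE_CODES = {"db": "DEP_DB", "broker": "DEP_BROKER", "init": "NOT_INITIALIZED"}


def health_body(ok: bool, failed_component: Optional[str] = None) -> str:
    """Build the response body: status plus, on failure, a coarse cause code."""
    if ok:
        return json.dumps({"status": "OK"})
    code = CAUSE_CODES.get(failed_component, "UNKNOWN")
    return json.dumps({"status": "FAIL", "code": code})
```

The detailed error (stack trace, host name, driver message) belongs in the service logs, where access is already restricted.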

10) Observability and SLO

Export metrics: `health_readiness{status}`, `health_liveness{status}`, probe processing time.
Correlate Readiness flaps with availability SLOs (removal from endpoints → 5xx/connection reset).

Alerts:

"Frequent liveness restarts > N/hour": a symptom of deadlocks/leaks.
"Readiness flaps > X/15 min": a symptom of dependency/network problems.
Correlation with deploys (`service.version`).
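The flap alert amounts to counting readiness transitions inside a sliding window. In practice this would usually be an alerting rule in the metrics system; the sketch below only shows the logic, and `FlapDetector` with its parameters is a hypothetical name.

```python
from collections import deque
from typing import Deque, Optional


class FlapDetector:
    """Counts readiness status changes inside a sliding time window."""

    def __init__(self, window_s: float = 900.0, max_flaps: int = 4) -> None:
        self._window_s = window_s          # 900 s = 15 min
        self._max_flaps = max_flaps        # the "X" in "X/15 min"
        self._last_status: Optional[bool] = None
        self._transitions: Deque[float] = deque()

    def observe(self, ready: bool, now_s: float) -> bool:
        """Record a readiness sample; return True if the flap alert should fire."""
        if self._last_status is not None and ready != self._last_status:
            self._transitions.append(now_s)
        self._last_status = ready
        # Drop transitions that fell out of the window.
        while self._transitions and now_s - self._transitions[0] > self._window_s:
            self._transitions.popleft()
        return len(self._transitions) > self._max_flaps
```

Counting transitions rather than failures distinguishes a flapping instance from one that is steadily down.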

11) Testing

Unit/Contract: the `/healthz/*` endpoints return the correct statuses when each dependency is disabled.

Chaos: when the database/cache/broker is disabled, Readiness should either fail or switch to the fallback, strictly according to the model. Liveness must not trigger while the process is "alive."

Load/Soak: under load, health endpoints must stay fast (no resource contention).
Canary: verify Readiness stability before increasing traffic.

12) Frequent mistakes and how to avoid them

Liveness checks databases/external APIs. Result: endless restarts during incidents. Solution: limit liveness to "process life."

Heavy checks in probes. Leads to false failures. Solution: light checks + separate background health monitors.
No startup probe. Slow starts are "killed" by liveness. Solution: add a startup probe with a wide window.
No graceful shutdown. Sporadic 5xx during deploys. Solution: preStop + removal from load balancing.
Flap storms. Thresholds are too aggressive. Solution: raise `failureThreshold`, increase `timeoutSeconds`, add backoff.
The same endpoint for everything. Mixed semantics. Solution: separate `liveness`/`readiness`/`startup` endpoints.

13) Mini Implementation Patterns

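Simple HTTP handler (sketch; FastAPI is an assumed framework, `core_is_ready` and `db` are application-specific placeholders):

python
from fastapi import FastAPI, Response

app = FastAPI()
INIT_DONE = False  # set to True when migrations/warm-up finish
ACCEPTING = True   # cleared by /healthz/drain

def status(ok: bool) -> Response:
    return Response(status_code=200 if ok else 503)

@app.get("/healthz/liveness")
def liveness():
    return status(True)  # process-only: no external calls

@app.get("/healthz/readiness")
def readiness():
    ok_core = core_is_ready()      # local pools/caches/initialization
    ok_db = db.ping(timeout=0.05)  # 50 ms budget; only if the DB is critical
    return status(ACCEPTING and ok_core and ok_db)

@app.get("/healthz/startup")
def startup():
    return status(INIT_DONE)

@app.post("/healthz/drain")
def drain():
    global ACCEPTING
    ACCEPTING = False  # stop accepting: readiness now returns 503
    return status(True)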

gRPC health (idea):

go
// use google.golang.org/grpc/health/grpc_health_v1; healthServer comes from health.NewServer()
healthServer.SetServingStatus("my.app.Service", grpc_health_v1.HealthCheckResponse_SERVING) // or HealthCheckResponse_NOT_SERVING

ReadinessGate (relevant with a mesh):

yaml
spec:
  readinessGates:
    - conditionType: "proxy.istio.io/ready"

14) Checklists

Before production

Liveness/readiness/startup endpoints are separated, and their semantics are documented.
Liveness does not touch external dependencies; Readiness checks only critical ones, with timeouts and a fallback.
`initialDelaySeconds`/`periodSeconds`/`timeoutSeconds`/`failureThreshold` are configured for the service profile.
Graceful shutdown is enabled: `preStop` + removal from load balancing.
Health metrics/logs are connected; alerts on restarts/flapping are set.
Dependency-failure and slow-start tests have passed.

Operation

Weekly report on restarts and readiness flaps.
Threshold tuning after incidents; correlation with releases.
Regular chaos tests that disable dependencies.
Review of probe semantics whenever dependency criticality changes.

15) FAQ

Q: Can a single probe cover everything?
A: Not recommended. Separate `startup`, `readiness`, and `liveness`: this reduces false positives and speeds up RCA.

Q: Should I check the cache in readiness?
A: If there is a correct (albeit slower) mode without the cache, do not fail readiness; just switch to degraded mode.

Q: What should I do about frequent liveness restarts?
A: Rule out a deadlock/leak first; then loosen the thresholds and add a watchdog in the application.

Q: How do we account for multi-tenancy?
A: Readiness should reflect the ability to serve any tenant's traffic. For problems specific to one tenant, do not change readiness; signal them with separate SLIs/alerts.
