Timeouts and circuit control
1) Why you need it
Systems fail not from a single "fatal" error but from accumulating delays and cascading retry storms. Timeouts bound waiting time and free resources, while circuit control (breaker + shedding + adaptive concurrency) keeps degradation from spreading along the dependency chain. The goal is to keep p95/p99 within target bounds and preserve availability under partial failures.
2) Basic definitions
2.1 Types of timeouts (L7/L4)
Connect timeout - establishing the TCP/TLS connection.
TLS/handshake timeout - the TLS handshake / HTTP/2 preface.
Write timeout - sending the request (including the body).
Read timeout - waiting for the first byte of the response and/or the whole body.
Idle/keep-alive timeout - an inactive connection.
Overall deadline - a hard limit for the entire request (end-to-end).
2.2 Deadline budget
Pick a target `deadline_total` and split it across stages: `ingress (gateway) + authZ + app + DB/cache + outbound PSP`. Example for a 400 ms total:
- gateway: 30 ms,
- application: 120 ms,
- DB: 120 ms,
- PSP: 100 ms,
- margin: 30 ms.
2.3 Propagation and cancellation
The `deadline`/`timeout` must be propagated down the call chain (context, headers, gRPC deadline). On expiry, cancel in-flight background work (abort/ctx cancel) and release locks/semaphores; see the sketch below.
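A minimal sketch in Go of splitting the budget from 2.2 and propagating it via `context`; `queryDB`/`callPSP` are hypothetical stage functions, and the numbers mirror the example above:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// handle gives the whole request a hard 400 ms deadline and derives
// per-stage sub-deadlines from it; child contexts can never outlive
// the parent, so the end-to-end budget always wins.
func handle(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, 400*time.Millisecond)
	defer cancel() // cancels all derived work on expiry or early return

	dbCtx, dbCancel := context.WithTimeout(ctx, 120*time.Millisecond)
	defer dbCancel()
	if err := queryDB(dbCtx); err != nil {
		return fmt.Errorf("db: %w", err)
	}

	pspCtx, pspCancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer pspCancel()
	return callPSP(pspCtx)
}

// Hypothetical stage functions that respect cancellation.
func queryDB(ctx context.Context) error { return work(ctx, 50*time.Millisecond) }
func callPSP(ctx context.Context) error { return work(ctx, 80*time.Millisecond) }

func work(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d): // simulated useful work
		return nil
	case <-ctx.Done(): // deadline expired: stop and propagate the error
		return ctx.Err()
	}
}

func main() {
	fmt.Println(handle(context.Background()))
}
```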
3) Timeout setting strategies
1. Top-down: derive the end-to-end deadline from the SLO and p95, then split it into sub-timeouts.
2. Identify "expensive" paths (file downloads, reports, external PSPs) and give them individual, longer but still bounded timeouts.
3. Differentiate by operation class:
   - idempotent (GET/status polling) - shorter, more aggressive;
   - writes/money movement - slightly longer, but with a single retry and idempotency.
4. Graduate by plan/tenant (enterprise may get longer timeouts but less parallelism).
4) Circuit breaker: models and parameters
4.1 Triggering policies
Failure rate - error rate ≥ X% over a window of N requests or N seconds.
Consecutive failures - M failures in a row.
Slow-call rate - the share of calls longer than threshold T.
Error classes: timeouts/5xx/connection resets count as failures; 4xx do not.
4.2 States
Closed - passes everything through, accumulating statistics.
Open - fails instantly (saves resources, does not hammer the dependency).
Half-open - lets a small sample (N requests) through to probe whether the dependency has recovered; a sketch of this state machine follows.
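A minimal sketch of this state machine in Go with a consecutive-failures policy; the threshold and cool-down are illustrative, and production code would also track slow calls and a failure-rate window:

```go
// Minimal circuit breaker sketch: consecutive-failures policy,
// closed -> open -> half-open transitions. Not production-ready.
package breaker

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open")

type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int           // M consecutive failures to open
	cooldown  time.Duration // wait before half-open probes
	openedAt  time.Time
	open      bool
}

func New(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // open: fail fast, protect the dependency
		}
		// half-open: let a probe through (simplified: no probe counting)
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.open = true
			b.openedAt = time.Now()
		}
		return err
	}
	// success closes the breaker and resets statistics
	b.failures = 0
	b.open = false
	return nil
}
```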
4.3 Useful additions
Bulkhead: a dedicated pool of threads/connections per dependency, so that one dependency cannot drain everything (sketch below).
Adaptive concurrency: automatic concurrency limiting (AIMD/Vegas-like algorithms) driven by observed latency.
Load shedding: early rejection/degradation when local resources run short (queues, CPU, GC pauses).
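A bulkhead sketch using a buffered channel as a per-dependency semaphore (fail-fast variant; a waiting variant would select on the request context instead of `default`):

```go
// Bulkhead sketch: a per-dependency semaphore built on a buffered
// channel; requests beyond the limit fail fast instead of queueing.
package bulkhead

import "errors"

var ErrFull = errors.New("bulkhead full")

type Bulkhead struct{ slots chan struct{} }

func New(limit int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, limit)}
}

func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}: // acquire a slot
		defer func() { <-b.slots }() // release on return
		return fn()
	default:
		return ErrFull // saturated: fail fast rather than queue
	}
}
```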
5) Interaction: timeouts, retries, limits
Deadline first, then retries: every attempt must fit within the overall deadline.
Backoff + jitter for retries; respect `Retry-After` and the retry budget.
Rate limiting: while the breaker is open, lower the limits so as not to feed the storm.
Idempotency: mandatory for write operations (to avoid duplicates after ambiguous timeouts).
Where to retry: preferably at the edge (client/gateway) rather than deep inside the chain; see the sketch below.
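A sketch of retries subordinated to the deadline: exponential backoff with full jitter, where every sleep races against `ctx.Done()` so the overall budget is never exceeded:

```go
// Retry sketch: bounded attempts with exponential backoff and full
// jitter, all inside the caller's deadline carried by ctx.
package retry

import (
	"context"
	"math/rand"
	"time"
)

func Do(ctx context.Context, attempts int, base time.Duration, fn func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		// full jitter: sleep a random duration in [0, base*2^i)
		backoff := time.Duration(rand.Int63n(int64(base << i)))
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err() // the overall deadline wins over further retries
		}
	}
	return err
}
```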
6) Practical target values (benchmarks)
Public read API: end-to-end `200-500 ms`, read timeout `100-300 ms`.
Critical write (payments): `300-800 ms` e2e; external PSP ≤ `250-400 ms`.
Connect/TLS: `50-150 ms` (anything higher signals a network/peering problem).
Idle: `30-90 s` (shorter for mobile clients to save battery).
Adjust the values for p95/p99 and regions.
7) Configs and examples
7.1 Envoy (cluster + route, pseudo)
```yaml
clusters:
  - name: payments_psp
    connect_timeout: 100ms
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 2000
          max_requests: 2000
          max_retries: 50
    outlier_detection:
      consecutive_5xx: 5
      interval: 5s
      base_ejection_time: 30s
      max_ejection_percent: 50

routes:
  - match: { prefix: "/api/v1/payments" }
    route:
      cluster: payments_psp
      timeout: 350ms        # per-request deadline
      idle_timeout: 30s
      retry_policy:
        retry_on: "reset,connect-failure,refused-stream,5xx"
        num_retries: 1
        per_try_timeout: 200ms
```
7.2 NGINX (perimeter)
```nginx
proxy_connect_timeout 100ms;
proxy_send_timeout    200ms;  # write
proxy_read_timeout    300ms;  # read (first byte / whole body)
keepalive_timeout     30s;
send_timeout          15s;

# Fast rejection under overload
limit_conn_zone $binary_remote_addr zone=addr:10m;
limit_conn addr 50;
```
7.3 gRPC (client, Go pseudo)
```go
ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
defer cancel()
resp, err := client.Pay(ctx, req) // the deadline propagates downstream
```
7.4 HTTP client (Go)
```go
client := &http.Client{
	Timeout: 350 * time.Millisecond, // overall per-request deadline
	Transport: &http.Transport{
		TLSHandshakeTimeout:   100 * time.Millisecond,
		ResponseHeaderTimeout: 250 * time.Millisecond,
		IdleConnTimeout:       30 * time.Second,
		MaxIdleConnsPerHost:   100,
	},
}
```
7.5 Resilience4j (Java, pseudo)
```yaml
resilience4j.circuitbreaker.instances.psp:
  slidingWindowType: TIME_BASED
  slidingWindowSize: 60
  failureRateThreshold: 50
  slowCallDurationThreshold: 200ms
  slowCallRateThreshold: 30
  permittedNumberOfCallsInHalfOpenState: 5
  waitDurationInOpenState: 30s
resilience4j.timelimiter.instances.psp:
  timeoutDuration: 350ms
```
8) Observability and alerting
8.1 Metrics
`http_client_requests{endpoint, status}`, `client_latency_bucket`
8.2 Traces
Spans: ingress → handler → DB/Redis → external.
Attributes: `timeout_ms_target`, `circuit_state`, `queue_time_ms`.
Exemplars: tie p99 spikes to concrete trace IDs.
8.3 Alerts
`p99_latency{critical}` > target for X minutes in a row.
`timeout_rate{dependency}` jumped above Y%.
Frequent `open` transitions / breaker flapping.
Growth of `shed_requests_total` together with high CPU/GC.
9) Adaptive Concurrency & Load Shedding
9.1 Idea
Automation reduces parallelism as latency tails grow (see the AIMD sketch after this list):
- AIMD: increase slowly, decrease sharply.
- Vegas-like: keep queue time in check.
- Token-based: each request "burns" a token; tokens are issued at the measured service rate.
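An AIMD limiter sketch: additive increase on fast responses, multiplicative decrease when observed latency crosses a threshold; the threshold and floor values are illustrative:

```go
// AIMD concurrency limiter sketch: the limit grows by one on a fast
// response and is halved when latency crosses a threshold.
package aimd

import (
	"sync"
	"time"
)

type Limiter struct {
	mu       sync.Mutex
	limit    int // current concurrency limit
	inFlight int
	minLimit int
	slowAt   time.Duration // latency threshold that triggers a cut
}

func New(start, min int, slowAt time.Duration) *Limiter {
	return &Limiter{limit: start, minLimit: min, slowAt: slowAt}
}

// TryAcquire admits a request if we are under the current limit.
func (l *Limiter) TryAcquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight >= l.limit {
		return false // shed: caller should return 429/503
	}
	l.inFlight++
	return true
}

// Release records the observed latency and adapts the limit.
func (l *Limiter) Release(latency time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.inFlight--
	if latency > l.slowAt {
		l.limit /= 2 // multiplicative decrease
		if l.limit < l.minLimit {
			l.limit = l.minLimit
		}
	} else {
		l.limit++ // additive increase
	}
}
```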
9.2 Implementation
Local semaphores per route; the goal is to keep `queue_time` below the threshold.
A global "fuse" (RPS/concurrency ceiling) on the gateway.
Under CPU/connection shortage, reject early before business logic runs (429/503 with `Retry-After`); a middleware sketch follows.
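A load-shedding middleware sketch that rejects before any business logic runs, returning 503 with `Retry-After`; the `Limiter` interface matches the AIMD sketch above:

```go
// Load-shedding middleware sketch: reject before business logic runs
// when the limiter (e.g. the AIMD limiter above) refuses admission.
package shed

import (
	"net/http"
	"time"
)

type Limiter interface {
	TryAcquire() bool
	Release(latency time.Duration)
}

func Middleware(l Limiter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.TryAcquire() {
			// early rejection: cheap for us, actionable for the client
			w.Header().Set("Retry-After", "1")
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
			return
		}
		start := time.Now()
		defer func() { l.Release(time.Since(start)) }()
		next.ServeHTTP(w, r)
	})
}
```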
10) Testing and chaos scenarios
Latency injection: artificially add 50-300 ms to a dependency.
Packet loss/duplication/drops (tc/tbf, Toxiproxy).
Knob turning: shrink connection pools, push load to saturation.
Kill/degrade one zone/shard (partial unavailability).
Checks: no retry storm breaks out; the breaker opens predictably; queues do not grow without bound.
11) Antipatterns
One global "read timeout" with no connect/TLS/per-stage detail.
No overall deadline → retries blow past the SLO.
Retries without jitter and without a retry budget.
"Eternal" connections without idle timeouts → file-descriptor leaks.
The breaker counts 4xx as fatal errors.
No cancel/abort → background work continues after the client has timed out.
Timeouts too long for mobile/unstable networks.
12) Specifics of iGaming/Finance
Critical writes (deposits/withdrawals): one short retry with an Idempotency-Key, then `202 Accepted` + polling instead of waiting indefinitely (sketch after this list).
PSP/banking: separate policies per provider/region (some are slower).
Responsible gambling and limits: for locks/reviews return a fast `423/409`; do not stretch out "hanging" transactions.
Reporting/aggregation: run asynchronously (batch + status resource).
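A sketch of the `202 Accepted` + polling pattern keyed by `Idempotency-Key`, assuming a hypothetical in-memory job store (`jobs`) and async worker (`process`); a real system would persist both:

```go
// 202-plus-polling sketch for a slow write: accept once per
// Idempotency-Key, return a status resource instead of blocking.
package payments

import (
	"net/http"
	"sync"
)

var (
	mu   sync.Mutex
	jobs = map[string]string{} // Idempotency-Key -> status
)

func Deposit(w http.ResponseWriter, r *http.Request) {
	key := r.Header.Get("Idempotency-Key")
	if key == "" {
		http.Error(w, "Idempotency-Key required", http.StatusBadRequest)
		return
	}
	mu.Lock()
	if _, seen := jobs[key]; !seen {
		jobs[key] = "pending"
		go process(key) // hypothetical async worker
	}
	mu.Unlock()
	// Do not hold the connection open: hand back a status resource.
	w.Header().Set("Location", "/api/v1/deposits/"+key)
	w.WriteHeader(http.StatusAccepted)
}

func process(key string) {
	// ... call the PSP with its own deadline, then:
	mu.Lock()
	jobs[key] = "done"
	mu.Unlock()
}
```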
13) Prod Readiness Checklist
- End-to-end deadline defined for critical routes (GET/POST).
- Budget split by stage; deadline propagation enabled.
- Connect/TLS/read/write/idle timeouts configured on the gateway and in clients.
- Circuit breaker with failure-rate and slow-call thresholds; correct half-open logic.
- Bulkheads per dependency; per-route concurrency limits.
- Load shedding before business logic executes under overload.
- Retry integration: backoff + jitter, retry budget, respect for `Retry-After`.
- Idempotent writes: `Idempotency-Key` and an outbox for events.
- Metrics: timeouts/slow calls/breaker state/queue time/concurrency.
- Chaos tests: injected delays/losses/failures, zone degradation.
- Client documentation: recommended timings, response codes, retry guidance.
14) TL;DR
Give each request a hard deadline, budget it across stages, and propagate it down the chain. Contain failures with a circuit breaker + bulkheads + adaptive concurrency + load shedding. Retry only within the deadline, with jitter and a budget; writes must be idempotent. Measure timeouts/slow calls, breaker state, and `queue_time`, and run chaos tests regularly.