Circuit Breaker and Retries
1) Why you need it
Networks are unreliable: latency fluctuates, nodes go down, limits are hit. Retries recover from transient failures, while a Circuit Breaker protects the system from cascading failures and self-inflicted DDoS. Combined with correct timeouts and limits, they preserve SLOs, stabilize tail latencies, and keep the cost of the "nines" under control.
2) Basic principles
First timeouts, then retries, then Circuit Breaker.
Retry only idempotent operations (GET, or POST/PUT made safe with an idempotency key).
Allocate a retry budget: ≤ 10-15% of the original RPS per route.
Localize failures: bulkheads (separate pools/quotas) + rate limits.
During degradation: fail fast, or serve a graceful-degradation stub.
3) Retry semantics
When to retry
Transient errors: timeouts, 5xx, network unavailability, 429 (honoring 'Retry-After').
Do not retry: clear business errors (4xx other than 429), operations with side effects and no idempotency (a payment without a key).
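The classification above can be sketched as a small predicate in Python; the status set and the `has_idempotency_key` flag are illustrative assumptions, not a library API:

```python
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}  # retry 429 only after honoring Retry-After

def is_retryable(status: int, method: str, has_idempotency_key: bool = False) -> bool:
    """Decide whether a failed HTTP call may be retried."""
    if method in ("GET", "HEAD") or has_idempotency_key:
        return status in TRANSIENT_STATUSES
    return False  # non-idempotent call without a key: never retry

# Usage
assert is_retryable(503, "GET")
assert not is_retryable(404, "GET")                       # business error, not transient
assert not is_retryable(503, "POST")                      # side effects without a key
assert is_retryable(503, "POST", has_idempotency_key=True)
```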
Strategies
Exponential backoff + jitter (full or equal jitter): smooths out retry storms.
Max attempts: 1-2 (rarely 3) - more is usually harmful.
Budget: a global retry counter per second per service, plus per-request "retry tokens."
Hedging (rare): a parallel duplicate of the request after a t-quantile delay (p95) - only for strictly idempotent reads.
Pseudocode for backoff + jitter:

```python
base = 100  # ms
for attempt in range(1, max_attempts + 1):
    try:
        return call()
    except Transient:
        if attempt == max_attempts:
            raise
        sleep_ms = min(cap_ms, base * 2 ** (attempt - 1))
        sleep(random(0, sleep_ms))  # full jitter
```
4) Timeouts and fail-fast
Client timeout < upstream timeout: so you do not accumulate "zombie" requests.
Separate connect timeout, read timeout, and the overall deadline.
Tail-aware timeouts: aim for p95/p99 plus a small margin.
Use a common deadline field (for example, gRPC 'deadline') and propagate it down the call chain.
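Deadline propagation can be sketched as follows: the caller fixes one absolute deadline, and every hop clamps its per-hop timeouts to what remains (a minimal sketch; the function and field names are illustrative, not a real framework API):

```python
import time

class DeadlineExceeded(Exception):
    pass

def remaining(deadline: float) -> float:
    """Seconds left until the absolute deadline (monotonic clock)."""
    return deadline - time.monotonic()

def hop_timeouts(deadline: float, connect_timeout: float = 0.5, read_timeout: float = 2.0):
    """Derive per-hop timeouts from the overall deadline; fail fast if it already passed."""
    left = remaining(deadline)
    if left <= 0:
        raise DeadlineExceeded()  # do not queue a "zombie" request
    return {
        "connect_timeout": min(left, connect_timeout),
        "read_timeout": min(left, read_timeout),
    }

# Usage: one overall 2 s budget, derived timeouts never exceed it
deadline = time.monotonic() + 2.0
opts = hop_timeouts(deadline)
assert 0 < opts["read_timeout"] <= 2.0
```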
5) Circuit Breaker: How it works
States:
- Closed: passes traffic, counts errors/latency.
- Open: immediately returns a fast failure (or a fallback response).
- Half-Open: lets through probe requests; closes on success.

Opening triggers:
- Errors/timeouts exceed X% over a window of N requests/seconds, or p99 exceeds the threshold.
- Use rolling statistics and a minimum call volume (for example, ≥ 50 requests).
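The state machine above can be sketched as a minimal in-process breaker (a sketch only; the window is a simple counter rather than a true rolling window, and all thresholds are illustrative):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, min_calls=50, open_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls          # minimum volume before tripping
        self.open_seconds = open_seconds    # how long to stay Open
        self.failures = 0
        self.calls = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Closed/Half-Open pass traffic; Open fails fast until the wait elapses."""
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "half-open"    # let probe requests through
                return True
            return False                    # fast failure
        return True

    def record(self, success: bool) -> None:
        if self.state == "half-open":
            # one probe decides: close on success, reopen on failure
            if success:
                self.state = "closed"
            else:
                self.state = "open"
                self.opened_at = time.monotonic()
            self.failures = self.calls = 0
            return
        self.calls += 1
        self.failures += 0 if success else 1
        if (self.calls >= self.min_calls
                and self.failures / self.calls >= self.failure_threshold):
            self.state = "open"
            self.opened_at = time.monotonic()
            self.failures = self.calls = 0

# Usage: ten straight failures over a ten-call window trip the breaker
cb = CircuitBreaker(min_calls=10)
for _ in range(10):
    cb.record(False)
assert cb.state == "open"
assert not cb.allow()   # fast failure while Open
```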
6) Bulkhead, quotas and divide and conquer
Separate connection pools per upstream and per feature.
Quotas on in-flight requests; excess requests fail fast.
Under resource shortage, degrade via feature flags.
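A bulkhead can be sketched as a per-upstream semaphore with fail-fast on exhaustion (a minimal sketch; pool sizes and names are illustrative):

```python
import threading

class Bulkhead:
    """Caps in-flight requests per upstream; excess calls fail fast instead of queueing."""
    def __init__(self, max_in_flight: int):
        self._sem = threading.Semaphore(max_in_flight)

    def run(self, fn):
        if not self._sem.acquire(blocking=False):  # no queueing: fail fast
            raise RuntimeError("bulkhead full")
        try:
            return fn()
        finally:
            self._sem.release()

# Usage: separate pools per upstream, so a slow payments dependency
# cannot exhaust capacity reserved for other features
payments_pool = Bulkhead(max_in_flight=2)
assert payments_pool.run(lambda: "ok") == "ok"
```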
7) Perimeter integration (Envoy/Istio/Nginx)
Envoy (retry + outlier detection + CB, sketch):

```yaml
routes:
- match: { prefix: "/api" }
  route:
    cluster: upstream_api
    timeout: 2s
    retry_policy:
      retry_on: "connect-failure,reset,retriable-4xx,5xx"
      num_retries: 2
      per_try_timeout: 600ms
      retry_back_off: { base_interval: 100ms, max_interval: 800ms }
    hedge_policy:
      hedge_on_per_try_timeout: true
      initial_requests: 1
      additional_request_chance: { numerator: 5, denominator: HUNDRED }  # 5%
clusters:
- name: upstream_api
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 500
      max_requests: 1000
      max_retries: 200
  outlier_detection:
    consecutive_5xx: 5
    interval: 5s
    base_ejection_time: 30s
    max_ejection_percent: 50
```
Istio (VirtualService retry, condensed example):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  hosts: ["payments"]
  http:
  - route: [{ destination: { host: payments } }]
    timeout: 2s
    retries:
      attempts: 2
      perTryTimeout: 600ms
      retryOn: "5xx,connect-failure,refused-stream,reset"
```
Nginx Ingress (annotations):

```yaml
nginx.ingress.kubernetes.io/proxy-connect-timeout: "2"
nginx.ingress.kubernetes.io/proxy-read-timeout: "2"
nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_503 http_504"
nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "2"
```
8) Libraries and code (stack snippets)
Java (Resilience4j):

```java
var cb = CircuitBreaker.ofDefaults("psp");
var retry = Retry.of("psp-retry",
    RetryConfig.custom()
        .maxAttempts(2)
        // exponential backoff with jitter supersedes a plain waitDuration
        .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(100, 2.0, 0.5))
        .retryExceptions(SocketTimeoutException.class, IOException.class)
        .build());

Supplier<Response> decorated =
    CircuitBreaker.decorateSupplier(cb,
        Retry.decorateSupplier(retry, () -> client.call()));

return Try.ofSupplier(decorated)
    .recover(BusinessException.class, fallback())
    .get();
```
Go (context deadline + backoff):

```go
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()

var lastErr error
for i := 0; i < 2; i++ {
	reqCtx, stop := context.WithTimeout(ctx, 600*time.Millisecond)
	lastErr = call(reqCtx)
	stop()
	if lastErr == nil {
		break
	}
	// full jitter, capped at 800 ms
	sleep := time.Duration(rand.Intn(1<<uint(7+i))) * time.Millisecond
	time.Sleep(min(sleep, 800*time.Millisecond))
}
if lastErr != nil {
	return fastFail()
}
```
Node.js (got + p-retry):

```js
import pRetry from 'p-retry';

await pRetry(() => got(url, { timeout: { connect: 500, request: 2000 } }), {
  retries: 2,
  factor: 2,
  randomize: true,
  minTimeout: 100,
  maxTimeout: 800,
  onFailedAttempt: e => { if (isBusiness(e)) throw e; } // do not retry business errors
});
```
9) Retry budget and SLO
Retry tokens: each retry spends a token; the pool is limited.
Tie it to the error budget: if the burn rate exceeds the threshold, disable retries, open CBs more aggressively, and enable degradation.
Canary releases: reduce attempts and tokens on canaries.
10) Hedging (caution)
Launch an additional request after the p95 latency has elapsed, canceling the loser.
Only for reads and "safe" idempotent operations; cap the share (≤ 1-5%).
Watch for increased load on the upstream.
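A hedged read can be sketched with `asyncio`: fire a duplicate after a p95-based delay and cancel the loser. This is a sketch only; the 200 ms delay stands in for a measured p95, and `fetch` is an assumed idempotent coroutine:

```python
import asyncio

async def hedged_get(fetch, hedge_delay: float = 0.2):
    """Start a second identical request after hedge_delay; keep whichever finishes first."""
    first = asyncio.create_task(fetch())
    try:
        # shield: a hedge timeout must not cancel the original request
        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_delay)
    except asyncio.TimeoutError:
        second = asyncio.create_task(fetch())
        done, pending = await asyncio.wait(
            {first, second}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()  # cancel the loser to shed upstream load
        return done.pop().result()

# Usage (simulated upstream; only safe for idempotent reads)
async def main():
    async def fetch():
        await asyncio.sleep(0.05)
        return "ok"
    return await hedged_get(fetch)

assert asyncio.run(main()) == "ok"
```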
11) Observability
RED metrics per route: Rate, Errors, Duration (p50/p95/p99).
CB metrics: state (open/half-open), opening rate, short-circuited/rejected requests.
Retries: attempts per request, retry rate, burned tokens.
Perimeter: outlier ejections, ejection rate.
Traces: annotate 'retry_attempt', 'cb_state', 'hedged=true'; propagate 'trace_id'.
12) Architecture integration
Bulkhead + CB for each critical upstream.
Queues/async processing: for long operations, instead of absurdly long timeouts.
Cache/stubs: for non-critical features when failing open.
Autoscaling: it does not compensate for bad retries - stop the storm first.
13) Anti-patterns
Retries without timeouts → hung connections and pool exhaustion.
Retrying non-idempotent operations (double charges).
Unbounded exponential growth without a cap or jitter.
A single CB for all upstreams → one failing dependency drags down the entire product.
Ignoring 429/'Retry-After'.
A client timeout longer than the upstream's (or none at all).
"Treating" business errors with retries.
14) Implementation checklist (0-30 days)
0-7 days
Identify routes and their idempotency.
Set timeouts (connect/read/overall), enable minimal retries (×1) and a default CB.
Separate pools/quotas (bulkheads) for the main upstreams.
8-20 days
Enable jitter and a global retry budget; add retry-rate alerts.
Configure outlier ejection at the perimeter; fast failure for low-priority features.
RED + CB/Retry dashboards, tagged traces.
21-30 days
Canary retry profiles (fewer attempts); run a game day for "upstream slow/flaps."
Document the policy: who/what retries, limits, exceptions.
Review p95/p99 and timeouts based on data, not by eye.
15) Maturity metrics
100% of routes have timeouts and a documented retry/no-retry policy.
Retry rate stays within budget (≤ 10-15%); no spikes during incidents.
CBs trip before the entire pool collapses; no cascading failures.
Traces show attempts/hedging; p99 is stable under peak load.
Canary releases use a conservative retry profile.
16) Short configuration examples
Resilience4j YAML (Spring Boot, sketch):

```yaml
resilience4j:
  circuitbreaker:
    instances:
      psp:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 100
        minimumNumberOfCalls: 50
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 5
  retry:
    instances:
      psp:
        maxAttempts: 2
        waitDuration: 200ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2.0
        retryExceptions:
          - java.net.SocketTimeoutException
          - java.io.IOException
```
Envoy rate limit (fragment sketch):

```yaml
rate_limits:
- actions:
  - generic_key: { descriptor_value: "api.payments" }
```
17) Conclusion
Resilience is a discipline: timeouts → retries (with jitter and a budget) → Circuit Breaker + bulkheads/quotas and fast rejection. Configure the perimeter (outlier ejection), put up RED/CB/Retry dashboards, codify the idempotency policy, and do not forget business SLIs. Then brief failures stay invisible, and real incidents do not turn into cascading outages.