Circuit Breaker and Degradation
Circuit Breaker (CB) is a resilience pattern that interrupts calls to a degraded dependency in order to localize the failure and protect upstream services and the user. Graceful degradation is the deliberate simplification of functionality when resources run short or failures occur (for example, returning cached/partial data, disabling "expensive" features) instead of a full outage.
The main goal is to preserve SLOs and user experience through controlled failures rather than cascading outages.
1) When to apply
An unstable dependency: growing p95/p99, timeouts, error responses.
External APIs with strict limits/penalties.
"Heavy" backends (search, recommendations, reports), where retries amplify the storm.
Hot paths at risk of exhausting pools (connections, threads).
2) CB states and transitions
The classic trio:
1. Closed - traffic flows; error/latency metrics are collected.
2. Open - calls are rejected immediately (fail-fast) and/or routed to a fallback.
3. Half-Open - a limited number of "trial" requests determine whether to close the breaker.
Opening triggers
Error/timeout threshold over a window (for example, ≥ 50% of the last N requests).
Latency threshold (e.g. p95 > target).
Combined policies (errors ∧ latency threshold exceeded).
Hold time (cool-down)
Fixed (for example, 10-60 seconds) or adaptive (exponential increase on repeated trips).
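The state machine, the window-based trigger, and the adaptive cool-down fit together roughly as in the following Python sketch; the class name, window size, and thresholds are illustrative assumptions, not a specific library's API.

```python
import time
from collections import deque
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, window=20, failure_ratio=0.5, cool_down=10.0, max_cool_down=60.0):
        self.window = deque(maxlen=window)        # last N outcomes; True = success, False = failure
        self.failure_ratio = failure_ratio        # e.g. >= 50% failures in a full window trips the breaker
        self.cool_down, self.max_cool_down = cool_down, max_cool_down
        self.state, self.opened_at, self.trips = State.CLOSED, 0.0, 0

    def allow_request(self) -> bool:
        if self.state is State.OPEN and time.monotonic() - self.opened_at >= self._hold_time():
            self.state = State.HALF_OPEN          # cool-down expired: admit trial traffic
        return self.state is not State.OPEN

    def record(self, success: bool) -> None:
        if self.state is State.HALF_OPEN:
            if success:
                self._close()                     # trial succeeded: resume normal traffic
            else:
                self._trip()                      # trial failed: reopen with a longer hold
            return
        self.window.append(success)
        failures = sum(1 for ok in self.window if not ok)
        if len(self.window) == self.window.maxlen and failures / len(self.window) >= self.failure_ratio:
            self._trip()

    def _hold_time(self) -> float:
        # Adaptive cool-down: base, 2x, 4x, ... capped at max_cool_down.
        return min(self.cool_down * (2 ** max(self.trips - 1, 0)), self.max_cool_down)

    def _trip(self) -> None:
        self.state, self.opened_at, self.trips = State.OPEN, time.monotonic(), self.trips + 1

    def _close(self) -> None:
        self.state, self.trips = State.CLOSED, 0
        self.window.clear()
```

The request path checks allow_request() before calling the dependency and reports the outcome back via record(success); a production implementation would also cap the number of concurrent trial requests in Half-Open.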
3) Timeouts, retries and jitter
Timeouts are always shorter than the upstream SLO and propagate as deadlines down the call chain.
Retry only idempotent operations; 1-2 attempts are enough in most cases.
Backoff + jitter (full jitter) prevents synchronized waves of retries (see the sketch after this list).
Hedging (backup requests) - use sparingly and only for highly critical reads.
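A minimal retry helper with exponential backoff and full jitter might look as follows; the operation signature, attempt count, and delays are illustrative assumptions.

```python
import random
import time

def call_with_retries(operation, attempts=3, base_delay=0.1, max_delay=2.0, timeout=0.5):
    """Retry an idempotent operation a small, bounded number of times."""
    for attempt in range(attempts):
        try:
            # The per-call timeout stays below the upstream SLO (assumed callable signature).
            return operation(timeout=timeout)
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise                              # retries exhausted: surface the error (or fall back)
            # Full jitter: draw each delay uniformly from [0, base * 2**attempt],
            # so clients that failed at the same moment do not retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
            time.sleep(delay)
```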
4) Bulkhead isolation and "fuses"
Separate connection/worker/queue pools per domain and traffic type (VIP, background tasks, public APIs).
Concurrency caps for "expensive" operations.
Admission control: reject cheaply, before any work is done, when the queue is full (sketch below).
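A sketch of per-class bulkheads with fail-fast admission control, assuming an in-process thread pool model; pool names and sizes are illustrative.

```python
import threading

POOLS = {
    "vip": threading.BoundedSemaphore(50),
    "public_api": threading.BoundedSemaphore(20),
    "background": threading.BoundedSemaphore(5),
}

class Rejected(Exception):
    """Raised when admission control refuses a request before any work is done."""

def run_isolated(traffic_class, operation):
    pool = POOLS[traffic_class]
    if not pool.acquire(blocking=False):      # admission control: fail fast, do not queue
        raise Rejected(f"{traffic_class} pool exhausted")
    try:
        return operation()                    # the "expensive" call runs inside its own bulkhead
    finally:
        pool.release()
```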
5) Fallback and degradation scenarios
Options
Cached/stale responses: 'stale-while-revalidate', serving data from an L2/L3 cache.
Read-only mode: block writes/commands, allow safe reads.
Surrogate responses: partial data (e.g., without recommendations/avatars).
Disabling features: temporarily hide non-critical widgets/features.
Feature flags: change behavior quickly without a release.
Rules
A fallback must be deterministic, fast, and data-safe.
Explicitly mark the degraded path in logs/traces/metrics (as in the sketch below).
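A minimal fallback sketch that serves a stale cached value and flags the degraded path; the client and cache interfaces are hypothetical placeholders.

```python
import logging

log = logging.getLogger("fallback")

def get_recommendations(user_id, client, cache, ttl_stale=3600):
    try:
        fresh = client.fetch(user_id)              # primary path (illustrative client API)
        cache.set(user_id, fresh, ttl=ttl_stale)   # keep a stale copy for bad days
        return {"data": fresh, "degraded": False}
    except (TimeoutError, ConnectionError):
        stale = cache.get(user_id)
        if stale is not None:
            log.warning("serving stale recommendations", extra={"user_id": user_id})
            return {"data": stale, "degraded": True}   # deterministic, fast, data-safe
        return {"data": [], "degraded": True}          # surrogate: empty but valid shape
```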
6) Prioritization and traffic shaping
VIP/paid plans get higher priority/quotas under resource shortage.
Rate limits and throttling reduce the load on degraded dependencies.
Load shedding: a soft reduction in quality (e.g. fewer results, downscaled images) until the system stabilizes (see the sketch below).
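One way to express priority-aware shedding is a small policy function; the load thresholds and plan names here are illustrative assumptions.

```python
from typing import Optional

def plan_result_count(load: float, plan: str, requested: int) -> Optional[int]:
    """Return how many results to serve, or None if the request should be shed."""
    if load < 0.7:
        return requested                          # normal operation: full quality for everyone
    if plan == "vip":
        return requested                          # paid/VIP plans keep full quality the longest
    if plan == "background":
        return None if load > 0.8 else min(requested, 10)   # shed background work first
    # Free/public traffic: progressively reduce quality instead of failing outright.
    return max(5, int(requested * (1.0 - load)))
```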
7) Observability and signaling
CB metrics
State (closed/open/half-open) and time spent in each state.
Share of failures broken down by cause: CB open, timeout, 5xx, retries exhausted.
p95/p99 latency before and after the breaker.
Number/percentage of requests served via fallback (see the metrics sketch at the end of this section).
Tracing
Span attributes: 'circuit=open', 'fallback=cache', 'admission=denied'.
Correlation with rate limits (429 / 'RateLimit-*' headers), queues, and connection pools.
Logs/Audits
Reason for opening/closing, thresholds, dependency IDs.
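The CB metrics above map naturally onto counters and gauges; a sketch using the prometheus_client library (one possible choice - metric names and labels are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram

CB_STATE = Gauge("circuit_breaker_state", "0=closed, 1=half-open, 2=open", ["dependency"])
FAILURES = Counter("dependency_failures_total", "Failures by cause",
                   ["dependency", "cause"])      # cause: cb_open, timeout, 5xx, retry_exhausted
FALLBACKS = Counter("fallback_responses_total", "Requests served via fallback", ["dependency"])
LATENCY = Histogram("dependency_latency_seconds", "Call latency", ["dependency"])

def on_call(dep: str, latency_s: float) -> None:
    LATENCY.labels(dependency=dep).observe(latency_s)

def on_failure(dep: str, cause: str) -> None:
    FAILURES.labels(dependency=dep, cause=cause).inc()

def on_state_change(dep: str, state_code: int) -> None:
    CB_STATE.labels(dependency=dep).set(state_code)
```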
8) Contracts and Protocol
HTTP
Fail-fast: '503 Service Unavailable' with 'Retry-After' (or '429' when rate limits apply).
Partial content/stale: '200'/'206' with degradation metadata (for example, 'X-Degraded: true').
Cache policies: 'Cache-Control: stale-if-error, stale-while-revalidate'.
gRPC
'UNAVAILABLE', 'DEADLINE_EXCEEDED'; retry semantics governed by client/proxy policies.
Deadline/timeout on the request context; the deadline is propagated down the chain.
Idempotency
'Idempotency-Key' for POST operations, with deduplication at the edge.
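A sketch of edge-side deduplication keyed by 'Idempotency-Key'; the handler, payload fields, and in-memory store are illustrative (production would use a shared store such as Redis with a TTL).

```python
import hashlib
import json

SEEN = {}   # Idempotency-Key -> first response; illustrative in-memory stand-in for a shared store

def handle_payment(headers: dict, body: dict) -> dict:
    # Fall back to a content hash if the client did not send a key (illustrative choice).
    key = headers.get("Idempotency-Key") or hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    if key in SEEN:
        return SEEN[key]                     # replay the original result, do not charge twice
    result = {"status": "accepted", "amount": body["amount"]}   # perform the real side effect here
    SEEN[key] = result
    return result
```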
9) Typical implementation (pseudocode)
```pseudo
onRequest(req):
    if circuit.isOpen(dep):
        return fallbackOrFail(req)           # fail-fast while the breaker is open
    try:
        with timeout(T):                     # T is shorter than the upstream SLO
            resp = call(dep, req)
        circuit.recordSuccess(dep, latency=resp.latency)
        return resp
    except TimeoutError, ServerError as e:   # timeouts and 5xx both count as failures
        circuit.recordFailure(dep)
        if circuit.shouldOpen(dep):
            circuit.open(dep, coolDown=adaptive())
        return fallbackOrFail(req)
```
Half-Open probing
```pseudo
onTimer():
    if circuit.state(dep) == OPEN and coolDownExpired():
        circuit.toHalfOpen(dep)

onRequestHalfOpen(req):
    if not circuit.allowTrial(dep):              # e.g. only 1 trial request is allowed
        return fallbackOrFail(req)
    try:
        resp = call(dep, req); circuit.close(dep); return resp   # success => close
    except failure:
        circuit.open(dep, coolDown=longer())                     # failure => reopen with a longer cool-down
        return fallbackOrFail(req)
```
10) Setting thresholds
Observation window: a sliding window of N seconds or N requests.
Error threshold: 20-50% within the window (depending on the traffic profile).
Latency threshold: p95 ≤ the target SLO (e.g., 300-500 ms); exceeding it counts as an "error" for the CB.
Adaptive cool-down: 10s → 30s → 60s on repeated trips (see the sketch below).
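These knobs can be collected into a small policy object; the values below merely mirror the ranges above and would be calibrated from load-test data.

```python
from dataclasses import dataclass

@dataclass
class BreakerPolicy:
    window_seconds: int = 30          # sliding observation window
    failure_ratio: float = 0.5        # 20-50% depending on the traffic profile
    latency_slo_ms: float = 400.0     # e.g. a p95 target of 300-500 ms
    cool_downs: tuple = (10, 30, 60)  # seconds, indexed by consecutive trips

    def is_failure(self, error: bool, latency_ms: float) -> bool:
        # Latency above the SLO target is counted as a failure for the breaker.
        return error or latency_ms > self.latency_slo_ms

    def cool_down(self, trips: int) -> int:
        # trips >= 1; repeated trips walk 10s -> 30s -> 60s and stay at the cap.
        return self.cool_downs[min(trips, len(self.cool_downs)) - 1]
```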
11) Testing and chaos practices
Chaos: injecting latency/errors into dependencies, DNS failures, packet loss.
Game days: deliberately opening the breaker in a production-like environment and verifying the fallback.
Canary: enable new CB/degradation policies for 1-5% of traffic first.
Error budget: allow such experiments only while the SLO error budget is not exhausted.
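A minimal fault-injection wrapper for such experiments; the rates and the wrapped call are illustrative, and in practice they would be gated by a feature flag and limited to canary traffic.

```python
import random
import time

def chaos_wrap(call, error_rate=0.05, extra_latency_s=0.3, latency_rate=0.10):
    """Wrap a dependency call so a fraction of requests is slowed down or fails."""
    def wrapped(*args, **kwargs):
        if random.random() < latency_rate:
            time.sleep(extra_latency_s)               # simulate a slow dependency
        if random.random() < error_rate:
            raise ConnectionError("chaos: injected dependency failure")
        return call(*args, **kwargs)
    return wrapped
```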
12) Integration with multi-tenancy
The CB state can be kept per dependency per tenant (to contain noisy tenants) or globally, depending on the load profile (see the sketch below).
Segment fallback data and caches by 'tenant_id'.
Priorities/quotas follow the plan (VIP tenants should not suffer because of Starter-tier behavior).
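Keying breakers per (dependency, tenant) can be as simple as the sketch below; it reuses the CircuitBreaker sketch from section 2, and the names are illustrative.

```python
BREAKERS = {}   # (dependency, tenant) -> CircuitBreaker

def breaker_for(dependency: str, tenant_id: str, per_tenant: bool = True):
    """Per-tenant breakers isolate noisy tenants; a global key suits low-traffic profiles."""
    key = (dependency, tenant_id if per_tenant else "*")
    if key not in BREAKERS:
        BREAKERS[key] = CircuitBreaker()   # the sketch class from section 2
    return BREAKERS[key]
```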
13) Pre-production checklist
- Timeouts and deadlines are end-to-end and consistent.
- Retries are bounded, applied only to idempotent operations, and use backoff + jitter.
- CB thresholds are backed by load-test data.
- Fallback paths exist, are fast and data-safe; cache policies are defined.
- Bulkhead isolation: separate pools/queues/limits.
- Metrics/traces/logs flag degradation and CB states.
- Response contract documentation (HTTP/gRPC) with sample headers/codes.
- Chaos scenarios and game days take place regularly; there is a runbook.
14) Typical mistakes
No timeouts → retries pile up "all the way down" and failures cascade.
A single global CB instead of selective breakers (per endpoint/method) → unnecessary rejections.
An open breaker without a fallback → "empty" screens instead of a degraded UX.
Retries without jitter → synchronized request storms.
A cool-down that is too long for transient failures, or too short for persistent ones → "flip-flop" state changes.
No bulkheads → exhaustion of shared pools and head-of-line blocking.
15) Quick strategy selection
High-value reads: CB + stale-response cache + hedging (used sparingly).
Writes/payments: strict timeouts, minimal retries, idempotency keys, no "dirty" fallbacks.
External APIs: CB with aggressive thresholds, adaptive cool-down, strict throttling.
Microservices with bursty load: bulkheads, concurrency caps, VIP prioritization.
Conclusion
Circuit Breaker and managed degradation are the architecture's "insurance": they turn chaotic failures into predictable behavior. Clear timeouts, bounded retries with jitter, isolated pools, well-designed fallback paths, and telemetry make the system resilient to dependency failures and keep SLOs intact even during peaks and incidents.