Resilience testing
1) Basic concepts and goals
Reliability: the probability of correct operation; resilience: the behavior during and after a failure.
SLO / error budget: the criteria for how much degradation is acceptable.
Steady-state hypothesis: a formal expectation of stable metrics (e.g. p95 < 200 ms, error rate < 0.5%). An experiment is considered successful if the hypothesis holds (a minimal check sketch follows this list).
Failure types: network (latency, loss/duplicates, connection resets), compute (CPU, memory), storage (I/O, disk exhaustion), dependencies (5xx, timeouts, rate limits), logical (partial outages, "slow degradation"), operational (releases, configs), "dark" failures (split-brain, clock skew).
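The steady-state hypothesis above can be encoded as an automated check. A minimal sketch, assuming a hypothetical fetch_metric hook into the monitoring backend; the thresholds mirror the example values:

```python
# Minimal sketch of a steady-state hypothesis as an automated check.
# `fetch_metric` is a hypothetical hook into your monitoring backend
# (Prometheus, Datadog, etc.) returning the current value of a metric.

from dataclasses import dataclass
from typing import Callable

@dataclass
class SteadyStateHypothesis:
    max_p95_latency_ms: float = 200.0   # p95 must stay below 200 ms
    max_error_rate: float = 0.005       # error rate must stay below 0.5%

def hypothesis_holds(h: SteadyStateHypothesis,
                     fetch_metric: Callable[[str], float]) -> bool:
    # The experiment is successful only if every metric stays within bounds,
    # checked before the injection, during it, and after recovery.
    return (fetch_metric("p95_latency_ms") < h.max_p95_latency_ms
            and fetch_metric("error_rate") < h.max_error_rate)
```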
2) The resilience pyramid
1. Unit tests of failure-handling logic (retries, idempotency, timeouts).
2. Component tests of adapters with fault injection (Testcontainers / tc-netem).
3. Integration/system tests with real network, database, and caches, under realistic load profiles.
4. Chaos experiments in pre-prod (and later, in limited form, in prod) driven by runbooks.
5. Game days: scenario exercises for the team (people + tools).
3) Observability as the foundation
SLIs: p50/p95/p99 latency, error rate, saturation (CPU/heap/FD/IOPS), drops/timeouts, queue depth.
Traces: for finding bottlenecks under failure.
Semantic resilience metrics: graceful-degradation success rate, shed-request rate, self-healing speed (MTTR).
Labeling experiments: 'chaos.experiment_id' and 'phase = inject/recover' tags in events/logs.
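A minimal sketch of such labeling as structured log events; the field names follow the convention above, while the logger setup itself is illustrative rather than any specific library's API:

```python
# Tag telemetry with experiment metadata so dashboards can separate
# the "inject" and "recover" phases of an experiment.

import json, logging, time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("chaos")

def emit(event: str, experiment_id: str, phase: str, **fields):
    # One JSON line per event: easy to filter by chaos.experiment_id in logs.
    log.info(json.dumps({
        "ts": time.time(),
        "event": event,
        "chaos.experiment_id": experiment_id,
        "phase": phase,          # "inject" or "recover"
        **fields,
    }))

emit("latency_injected", experiment_id="exp-42", phase="inject", delay_ms=300)
```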
4) Fault-injection catalogue
Network: delay/jitter, loss/duplicates/reordering, bandwidth limits, burst storms, TLS failures.
Host: CPU limits, memory leaks/limits, GC pauses, file-descriptor exhaustion, clock skew.
Storage: rising latency, EROFS, ENOSPC, replica degradation, leader loss.
Dependencies: 5xx/429, slowdowns, DNS flapping, expired certificates, rate limits, partial responses (see the wrapper sketch after this list).
Data: corrupted writes, gaps in streams, duplicate events, version conflicts.
Operations: bad releases, feature-flag mistakes, config drift, manual error (as part of the simulation).
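For the dependency entries of this catalogue, a fault-injecting wrapper around the client is often enough in component tests. A sketch, assuming a hypothetical client with a call(request) method:

```python
# Fault-injecting wrapper around a dependency client: adds latency and raises
# 5xx-style errors with a configurable probability.

import random, time

class FlakyDependency:
    def __init__(self, inner, extra_latency_s=0.4, error_rate=0.1):
        self.inner = inner                       # the real client being wrapped
        self.extra_latency_s = extra_latency_s
        self.error_rate = error_rate

    def call(self, request):
        time.sleep(self.extra_latency_s)         # injected slowdown
        if random.random() < self.error_rate:    # injected 5xx / timeout
            raise RuntimeError("injected dependency failure (simulated 503)")
        return self.inner.call(request)
```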
5) Resilience patterns (what to verify)
Retries with jitter and timeouts on every RPC.
Circuit breaker (open/half-open states, exponential recovery).
Bulkheads (isolating pools/queues per critical domain).
Load shedding (dropping low-priority requests under saturation).
Backpressure (signals upstream, concurrency limits).
Idempotency (idempotency keys on side-effecting operations).
Caching and stubs for when a source degrades.
Graceful degradation (lightweight responses, stale data, disabling features).
Timeout budgets (a per-request deadline propagated down the call chain; see the sketch after this list).
Atomicity/compensation (Saga / Outbox / Transactional Inbox).
Quorums and replication (R/W quorums, trading consistency for availability).
Anti-entropy/replay (recovery from gaps in event streams).
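A minimal sketch of the timeout-budget pattern referenced above, assuming each hop receives the remaining deadline and passes on what is left; Deadline and call_downstream are illustrative names, not a specific framework's API:

```python
# Propagated timeout budget: downstream calls never outlive the caller's deadline.

import time

class Deadline:
    def __init__(self, budget_s: float):
        self.expires_at = time.monotonic() + budget_s

    def remaining(self) -> float:
        return self.expires_at - time.monotonic()

def call_downstream(op, deadline: Deadline, reserve_s: float = 0.05):
    # Keep a little headroom for our own post-processing.
    remaining = deadline.remaining() - reserve_s
    if remaining <= 0:
        raise TimeoutError("timeout budget exhausted, failing fast")
    return op(timeout=remaining)   # `op` is any callable accepting a timeout
```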
6) Injection recipes and expectations (pseudocode)
Retry with jitter and circuit breaker
for attempt in 1..N:
    if breaker.open(): return fallback()
    res = call(dep, timeout = base * 0.8)
    if res.ok: return res
    if attempt == N: breaker.trip()
    else: sleep(exp_backoff(attempt) * jitter(0.5..1.5))
return fallback()
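A runnable Python sketch of the same recipe, with a minimal in-process breaker; call_dependency and fallback are placeholder hooks standing in for a real RPC client and a degraded response:

```python
# Retry with full jitter plus a minimal time-based circuit breaker.

import random, time

class CircuitBreaker:
    def __init__(self, cooldown_s=30.0):
        self.cooldown_s = cooldown_s
        self.opened_at = None

    def open(self) -> bool:
        return (self.opened_at is not None
                and time.monotonic() - self.opened_at < self.cooldown_s)

    def trip(self):
        self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def call_with_retries(call_dependency, fallback, attempts=3, base_timeout_s=1.0):
    for attempt in range(1, attempts + 1):
        if breaker.open():
            return fallback()
        try:
            return call_dependency(timeout=base_timeout_s * 0.8)
        except Exception:
            if attempt == attempts:
                breaker.trip()          # stop hammering a failing dependency
            else:
                # exponential backoff with jitter to avoid synchronized retry storms
                time.sleep((2 ** attempt) * 0.1 * random.uniform(0.5, 1.5))
    return fallback()
```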
Shedding and backpressure
if queue.depth() > HIGH or cpu.load() > 0.85:
    if request.priority < HIGH: return 503_SHED
limiter.acquire()  # constrain concurrency
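A sketch of the same shedding/backpressure idea in Python; the saturation signals (queue depth, CPU load) and the priority field on the request are placeholders for real inputs:

```python
# Bounded semaphore caps concurrency; low-priority work is shed under saturation.

import threading

MAX_CONCURRENCY = 64
limiter = threading.BoundedSemaphore(MAX_CONCURRENCY)

def process(request):
    """Placeholder for the real request handler."""
    return 200

def handle(request, queue_depth: int, cpu_load: float):
    saturated = queue_depth > 1000 or cpu_load > 0.85
    if saturated and request.priority < 2:     # shed low-priority traffic first
        return 503                             # retriable "shed" response
    if not limiter.acquire(timeout=0.05):      # backpressure: refuse, don't queue forever
        return 429
    try:
        return process(request)
    finally:
        limiter.release()
```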
Idempotency
key = hash("payout:" + external_id)
if store.exists(key): return store.get(key)
result = do_side_effect()
store.put(key, result, ttl = 30d)
return result
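A runnable sketch of idempotency keys using an in-memory dict in place of a durable store; in production this would be Redis or a database with a TTL, and the check-and-set would need to be atomic to avoid racing duplicates:

```python
# Idempotency key: the side effect runs at most once per external_id.

import hashlib

_store: dict[str, object] = {}   # stand-in for a durable store with TTL

def idempotent_payout(external_id: str, do_side_effect):
    key = hashlib.sha256(f"payout:{external_id}".encode()).hexdigest()
    if key in _store:             # duplicate request: return the stored result
        return _store[key]
    result = do_side_effect()     # executed only for previously unseen keys
    _store[key] = result
    return result
```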
7) Experiments: Scenarios and Hypotheses
7.1 "Slow dependency"
Injection: +400 ms added to the external API's p95.
Expectation: timeouts grow by ≤ X%, the breaker opens, fallback responses are served, the service's p99 stays within SLA, no retry cascade.
7.2 "Partial cache loss"
Injection: failure of 50% of Redis/cache shard nodes.
Expectation: the miss rate rises, but without an avalanche onto the source (request coalescing / TTL safeguards); automatic warm-up and recovery.
7.3 "Split-brain in the database"
Injection: loss of the leader, failover to a replica.
Expectation: brief write unavailability, quorum reads, no data loss, the Outbox does not lose messages.
7.4 "ENOSPC / disk full"
Injection: disk filled to 95-100%.
Expectation: emergency log rotation, non-critical features degrade gracefully, critical logs (WAL) stay safe, alerts fire and automatic cleanup kicks in.
7.5 "Traffic burst"
Injection: 3× RPS on a hot endpoint for 10 minutes.
Expectation: low-priority requests are shed, p95 stays stable on core paths, queue growth stays within limits, no DLQ storms.
7.6 "Clock skew"
Injection: node clock shifted by ±2 minutes.
Expectation: TTLs and signatures honor leeway, retries use monotonic timers, tokens remain valid within the acceptable drift.
8) Environments and experiment safety
Start in pre-prod, with synthetic data and configs/topology as close to production as possible.
In production: only controlled windows, feature flags, gradually increased blast radius, auto-rollback, and a "red button."
Guardrails: RPS/error limits, SLO guards, and blocking experiments/releases during critical incidents.
A runbook is mandatory: how to roll back, whom to call, where to look.
9) Automation and CI/CD
An experiment catalog as code (YAML/DSL): goals, injections, metrics, thresholds, rollback actions.
Smoke chaos on every release: short injections (e.g. 2 minutes of +200 ms latency on a dependency) in staging.
Nightly matrix runs: services × failure modes.
Release gate: deploys are blocked if resilience falls below the threshold (e.g. fallback coverage < 95% under the "slow dependency" scenario).
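A sketch of such a release gate, assuming experiment results are exported to a hypothetical chaos-results.json with a fallback_coverage field per scenario:

```python
# Fail the pipeline when a resilience threshold is not met.

import json, sys

def gate(results_path="chaos-results.json", min_fallback_coverage=0.95):
    with open(results_path) as f:
        results = json.load(f)
    coverage = results["slow_dependency"]["fallback_coverage"]
    if coverage < min_fallback_coverage:
        print(f"release gate failed: fallback coverage "
              f"{coverage:.0%} < {min_fallback_coverage:.0%}")
        sys.exit(1)                    # non-zero exit blocks the deploy
    print("release gate passed")

if __name__ == "__main__":
    gate()
```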
10) Data and consistency
Check compensations (Saga): partially executed operations must be brought back to a consistent state.
Test event duplicates, out-of-order delivery, gaps, and replays.
Verify domain invariants after failures: balances never go negative, transactions do not get stuck, limits are not violated.
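A sketch of an invariant check under duplicates and reordering; the Account class is an illustrative stand-in for real domain logic, and the event stream is synthetic:

```python
# Replay events with injected duplicates and shuffling, then assert invariants.

import random

class Account:
    def __init__(self):
        self.balance = 0
        self.applied: set[str] = set()

    def apply(self, event: dict):
        if event["id"] in self.applied:   # idempotency: skip duplicates
            return
        self.applied.add(event["id"])
        self.balance += event["amount"]

def test_invariants_under_duplicates_and_reordering():
    events = [{"id": f"e{i}", "amount": a} for i, a in enumerate([100, -30, -50])]
    noisy = events + random.sample(events, k=2)   # injected duplicates
    random.shuffle(noisy)                         # injected reordering
    acc = Account()
    for e in noisy:
        acc.apply(e)
    assert acc.balance == 20    # same result as a clean, ordered replay
    assert acc.balance >= 0     # domain invariant holds after the failure

if __name__ == "__main__":
    test_invariants_under_duplicates_and_reordering()
```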
11) Anti-patterns
Testing only the happy path and load without failures.
Retries without jitter → a retry storm under degradation.
No global timeout budget → cascading timeouts.
A single pool for all tasks → no isolation (bulkheads).
"Infinite" queues → unbounded latency growth and memory exhaustion.
No experiment telemetry → "blind" chaos practice.
Chaos in production without rollback, limits, or a responsible owner.
12) Architect checklist
1. Steady-state hypothesis and SLO defined?
2. Does every RPC have timeouts, retries with jitter, and a breaker?
3. Are bulkheads, limiters, backpressure, and load shedding implemented?
4. Is the cache resilient: request coalescing, stampede protection, warm-up?
5. Outbox/Saga for side effects, idempotency keys?
6. Are quorums, replication, and failover tested?
7. Is there an experiment catalog, nightly chaos runs, and CI/CD gates?
8. Do metrics/traces tag experiments, and are there dashboards?
9. Are runbooks and the "red button" ready, and is ownership assigned?
10. Regular game days featuring Dev/SRE/Support?
13) Mini Tools and Sample Scenarios (YAML Sketches)
Network (tc/netem)
```yaml
experiment: add-latency
target: svc:payments
inject:
  netem:
    delay_ms: 300
    jitter_ms: 50
    loss: 2%
duration: 10m
guardrails:
  error_rate: "< 1%"
  p95_latency: "< 400ms"
```
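For reference, a sketch of applying the same netem injection directly with tc (iproute2) and guaranteeing rollback; it requires root, and the device name eth0 is an assumption:

```python
# Apply the netem rule, run the experiment, and always remove the rule.

import subprocess

DEV = "eth0"

def inject_latency():
    subprocess.run(
        ["tc", "qdisc", "add", "dev", DEV, "root", "netem",
         "delay", "300ms", "50ms", "loss", "2%"],
        check=True)

def rollback():
    # Remove the qdisc even if the experiment aborts mid-way.
    subprocess.run(["tc", "qdisc", "del", "dev", DEV, "root", "netem"], check=False)

if __name__ == "__main__":
    inject_latency()
    try:
        pass  # run the experiment / observe metrics here
    finally:
        rollback()
```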
CPU/Heap
```yaml
inject:
  cpu_burn: { cores: 2, duration: 5m }
  heap_fill: { mb: 512 }
```
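A minimal sketch of a cpu_burn injector matching these parameters; real chaos tools achieve the same via stress-ng or cgroup limits:

```python
# Busy-loop on N cores for a fixed duration to simulate CPU pressure.

import multiprocessing, time

def burn(seconds: float):
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass  # spin to keep one core saturated

def cpu_burn(cores: int = 2, seconds: float = 300):
    procs = [multiprocessing.Process(target=burn, args=(seconds,)) for _ in range(cores)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    cpu_burn(cores=2, seconds=300)  # 5 minutes on 2 cores, as in the YAML sketch
```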
Dependency
```yaml
inject:
  dependency:
    name: currency-api
    mode: slow
    p95_add_ms: 500
fallback_expectation: "serve stale rates ≤ 15m old"
```
Conclusion
Resilience testing is not a "chaos stunt" but a discipline that makes a system predictable under failure. Clear hypotheses, telemetry, a catalog of controlled experiments, and patterns embedded in the architecture (timeouts, breakers, isolation, idempotency) turn potential incidents into controlled scenarios. The team gains confidence in releases, and users get a stable service even when things fail.