GambleHub

Chaos Engineering

1) Basic principles

Steady state as the baseline hypothesis. Clearly define the norm (for example: p95 < 200 ms, error rate < 0.3%, critical-flow success > 99.5%).
Isolated variables. Change one factor at a time wherever possible, so that cause and effect can be linked.
Gradual escalation. Start with small amplitudes in a safe environment, then expand coverage and intensity.
Guardrails. Explicit stop conditions based on SLOs, alerts, and the error budget.
Repeatability. Each experiment must be deterministically reproducible (scripts/manifests/IaC).
Ethics and safety. No real personal data or financial transactions in risky experiments.

2) What is "steady state"

Steady state is a set of observable metrics that describe user value and business invariants:
  • p50/p95/p99 latencies of key endpoints.
  • Success rate and critical-path conversion.
  • Error rate, timeouts, share of shed requests (dropped under saturation).
  • Self-healing speed (MTTR), resilience to retries (no retry storms).
  • Domain invariants: no negative balances, exactly-once payment execution, consistent daily reporting, etc.
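As a minimal sketch (thresholds and function names are illustrative, not from any particular tool), a steady-state check over a window of latency samples and request counts might look like:

```python
from dataclasses import dataclass

@dataclass
class SteadyState:
    """Hypothetical thresholds; tune to your own SLOs."""
    p95_ms: float = 200.0
    p99_ms: float = 450.0
    max_error_rate: float = 0.003  # 0.3%

def percentile(samples, q):
    """Nearest-rank percentile over a non-empty list of latencies."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]

def is_steady(latencies_ms, errors, total, state=SteadyState()):
    """True if the observation window satisfies the steady-state hypothesis."""
    if total == 0 or not latencies_ms:
        return False
    error_rate = errors / total
    return (percentile(latencies_ms, 0.95) <= state.p95_ms
            and percentile(latencies_ms, 0.99) <= state.p99_ms
            and error_rate <= state.max_error_rate)
```

The same predicate serves both as the pre-experiment baseline check and as the post-recovery verification.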

3) Injection catalog (what we break)

Network: latency, jitter, packet loss/duplication, bandwidth limits, TLS breaks, DNS flapping.
Compute: CPU overload, memory/GC pressure, file-descriptor exhaustion, clock skew.
Storage: high p95 I/O, ENOSPC, leader/replica failure, split-brain, slow fsync.
Dependencies: 5xx/429, "slow success," degraded external APIs, rate limits.
Data: message duplicates/gaps, out-of-order delivery, dirty records, version conflicts.
Operations: bad release/config, buggy feature flag, expired certificate, key rotation.
People and processes: unavailable owners, delayed manual intervention, incorrect runbook.
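To make the catalog concrete, here is a small illustration (all names hypothetical) of a dependency-level injector that wraps a client callable and adds latency, jitter, and a share of error responses:

```python
import random
import time

def inject_faults(fn, extra_latency_s=0.0, jitter_s=0.0, error_rate=0.0,
                  rng=random.Random(42)):
    """Wrap `fn` so every call is delayed and a share of calls fail.

    extra_latency_s -- fixed added latency per call
    jitter_s        -- uniform random extra delay on top
    error_rate      -- probability of raising instead of calling `fn`
    """
    def wrapped(*args, **kwargs):
        time.sleep(extra_latency_s + rng.uniform(0.0, jitter_s))
        if rng.random() < error_rate:
            raise RuntimeError("injected fault")
        return fn(*args, **kwargs)
    return wrapped
```

In practice the same idea sits behind sidecar/proxy-level injection; wrapping the client is the cheapest stage-environment variant.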

4) Experiment design (template)

1. Hypothesis: "With +300 ms added to the currency service, p99 of the main API stays < 450 ms, the breaker opens, and stale data no older than 15 minutes is served."
2. Injection: failure profile (type/amplitude/duration) and target scope.
3. Metric/log tags: mark everything with `chaos.experiment_id` and `phase=inject|recover`.
4. Guardrails: abort if `error_rate > 2%` or p99 > SLA × 2 for more than 1 minute.
5. Results/output: list of observations, bugs, improvements, a work plan, and a re-run.
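The guardrail step above ("abort if error_rate > 2% or p99 > SLA × 2 for more than 1 minute") can be sketched as a small monitor; the class and parameter names are illustrative:

```python
import time

class Guardrail:
    """Signal abort when a threshold breach is sustained for `hold_s` seconds."""
    def __init__(self, max_error_rate=0.02, max_p99_ms=2 * 450.0,
                 hold_s=60.0, clock=time.monotonic):
        self.max_error_rate = max_error_rate
        self.max_p99_ms = max_p99_ms
        self.hold_s = hold_s
        self.clock = clock
        self._breach_since = None  # when the current breach started

    def should_abort(self, error_rate, p99_ms):
        breached = (error_rate > self.max_error_rate
                    or p99_ms > self.max_p99_ms)
        now = self.clock()
        if not breached:
            self._breach_since = None  # recovered: reset the timer
            return False
        if self._breach_since is None:
            self._breach_since = now
        return now - self._breach_since >= self.hold_s
```

Feeding it a fake clock makes the abort logic itself testable, which matters: a broken guardrail is the one bug a chaos experiment must not have.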

5) Observability: what is mandatory

Tracing: the request path through dependencies, with degraded segments marked.
Resource metrics: CPU, heap/GC, FDs, disk IOPS/latency, network bandwidth, queue depth.
Business metrics: conversion/operation success, share of compensated transactions.
Event logs: breaker opening/closing, retries and their budget, database leader switches.

Experiment panel: a live dashboard with guardrail thresholds and an emergency abort "red button."
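One simple way to make every event log attributable to an experiment is to stamp each record with the `chaos.experiment_id` and `phase` tags mentioned in the template; the helper below is a hypothetical sketch using JSON lines:

```python
import json

def chaos_log(event, experiment_id, phase, **fields):
    """Emit a JSON log line tagged with the experiment id and phase.

    `phase` is expected to be "inject" or "recover", matching the
    `phase=inject|recover` convention from the experiment template.
    """
    record = {"event": event,
              "chaos.experiment_id": experiment_id,
              "phase": phase,
              **fields}
    print(json.dumps(record, sort_keys=True))
    return record
```

With these tags in place, a dashboard can filter every trace, metric, and log line down to a single run and phase.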

6) Guardrails and security

Technical: upper limits on error rate/latency, a drop in the share of successful operations, DLQ growth.

Organizational: a time window, on-call engineers involved, the principle of "one zone, one experiment."

Data/compliance: synthetic or anonymized datasets only; no tests that could lead to regulatory violations.
Rollback: a ready rollback procedure, flag disable, or soft traffic drain.

7) Resilience patterns that should show up

Timeout budgets and retries with jitter (no retry storms).
Circuit breaker with half-open state and exponential recovery.
Bulkheads: isolating pools by criticality (payments vs. analytics).
Backpressure and rate limiting: predictable shedding of low-priority traffic.

Cache with request coalescing, protection against warm-up storms.

Idempotent side effects and sagas with compensating actions.
Quorums, failover, and anti-entropy for data recovery.
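The circuit-breaker pattern above can be shown as a minimal sketch; the state names follow the standard closed/open/half-open model, while the thresholds and method names are illustrative:

```python
import time

class CircuitBreaker:
    """closed -> open after `max_failures`; half-open after `cooldown_s`;
    one successful probe in half-open closes the breaker again."""
    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        if self.state() == "open":
            return fallback()          # fail fast, serve degraded answer
        try:
            res = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None          # probe succeeded: close again
        return res
```

A production breaker would add exponential growth of the cooldown and limit concurrent half-open probes, but the state transitions are exactly what a chaos experiment should observe.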

8) Sample scenarios (sketches)

8.1 Slow dependency (YAML)

experiment: slow-downstream
target: svc:api
inject:
  dependency:
    name: currency
    mode: add_latency
    p95_ms: 300
  duration: 10m
guardrails:
  error_rate: "< 1.5%"
  p99_latency: "< 450ms"
expectations:
  breaker_open: true
  stale_data_served: "<= 15m"

8.2 Loss of the DB leader

Injection: stop the leader / force a re-election.
Expectation: temporary write pause, quorum reads, WAL/outbox preserved, replication auto-restores, no duplicate writes.

8.3 ENOSPC on the log disk

Injection: fill the disk to 95-100%.
Expectation: emergency log rotation, critical logs preserved, non-critical features disabled, alert and auto-remediation.

8.4 Burst traffic + shedding

Injection: ×3 RPS for 5 minutes on a hot endpoint.
Expectation: low-priority requests shed, stable p95 for the core, no retry cascade.
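The expected shedding behavior in scenario 8.4 can be sketched as a queue-depth-based admission check; the thresholds and names are hypothetical:

```python
from collections import deque

HIGH = 2  # illustrative priority level

class Shedder:
    """Reject low-priority work when the queue saturates; reject everything
    at the hard depth limit."""
    def __init__(self, max_depth=100, shed_threshold=80):
        self.queue = deque()
        self.max_depth = max_depth
        self.shed_threshold = shed_threshold

    def admit(self, request, priority):
        depth = len(self.queue)
        if depth >= self.max_depth:
            return "shed"                            # hard limit: shed all
        if depth >= self.shed_threshold and priority < HIGH:
            return "shed"                            # saturation: shed low prio
        self.queue.append(request)
        return "accepted"
```

During the burst experiment, the observable outcome is exactly this split: low-priority traffic sheds early while the "core" keeps a stable p95.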

9) Automation in CI/CD

Chaos smoke on stage for each release (short injections at safe amplitudes).
Nightly runs over the experiment catalog (matrix of services × failure types).
Gates: the release is blocked if resilience falls below the threshold (for example, successful fallbacks < 95%).
Artifacts: report, traces, CPU/heap flamegraphs, snapshots of metrics and configs.
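The release gate above can be expressed as a small check over the chaos-smoke report; the report shape and field names here are assumptions for illustration:

```python
def resilience_gate(report, min_fallback_success=0.95, max_error_rate=0.02):
    """Return (passed, reasons) for a chaos-smoke report.

    `report` is assumed to look like:
      {"fallbacks": {"attempted": 40, "succeeded": 39}, "error_rate": 0.004}
    """
    reasons = []
    fb = report["fallbacks"]
    rate = fb["succeeded"] / fb["attempted"] if fb["attempted"] else 1.0
    if rate < min_fallback_success:
        reasons.append(f"fallback success {rate:.1%} < {min_fallback_success:.0%}")
    if report["error_rate"] > max_error_rate:
        reasons.append(f"error rate {report['error_rate']:.2%} above limit")
    return (not reasons, reasons)
```

A CI step would run this against the nightly or per-release report and fail the pipeline when `passed` is false, attaching `reasons` to the build log.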

10) Game Days

Regular team exercises with "live" scenarios:
  • Roles: experiment leader, metrics observer, rollback operator, business representative.
  • Scenarios: cache degradation, partial AZ failure / region failover, a "bad release," unavailability of an external provider.
  • Results: gaps found in runbooks, improved alerts, adjusted SLOs and retry budgets.

11) Chaos for data, events and ML

Data streams: tests for duplicates, gaps, out-of-order delivery, delays; validation of idempotent consumers and DLQ strategies.
Storage: index degradation, hot partitions, lock conflicts, replication lag.
ML: feature delays, rollback to a baseline model, degraded input data quality (drift); the system should degrade gracefully rather than fail.
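An idempotent consumer of the kind these stream tests validate can be sketched as follows; the message fields (`id`, `key`, `version`, `value`) are an assumed shape, not a specific broker's API:

```python
class IdempotentConsumer:
    """Apply each message at most once; drop stale out-of-order updates per key."""
    def __init__(self):
        self.seen_ids = set()
        self.state = {}        # key -> (version, value)

    def handle(self, msg):
        """msg: dict with a unique 'id', ordering 'version', 'key', 'value'."""
        if msg["id"] in self.seen_ids:
            return "duplicate"   # redelivery: already processed
        self.seen_ids.add(msg["id"])
        current = self.state.get(msg["key"])
        if current is not None and msg["version"] <= current[0]:
            return "stale"       # out-of-order: a newer version already applied
        self.state[msg["key"]] = (msg["version"], msg["value"])
        return "applied"
```

A chaos run for the pipeline then replays duplicates and shuffles order and asserts that the final `state` is identical to the clean run.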

12) Anti-patterns

Chaos without observability: you are "blind" and the conclusions are speculative.
Injections straight into prod without stage runs and guardrails.
"One big experiment" on everything at once: it is unclear what exactly had the effect.
Haphazard chaos actions without hypotheses or retests after fixes.
Focusing only on infrastructure while business invariants are forgotten.
Ignoring people and processes: alerts, on-call, and runbooks are part of the system.

13) Maturity of practice (model)

1. Ad hoc: one-off injections run locally.
2. Stage chaos: a catalog of scenarios, repeated runs, dashboards.
3. Release chaos: chaos smoke in each release, gates, reports.
4. Production chaos with restrictions: low traffic, strict guardrails, a ready rollback.
5. Continuous resilience: auto-experiments, SLO management, improvements as a regular workflow.

14) Integration with architectural practices

Resilience testing: chaos experiments complement fault injection and degradation scenarios.
Load testing: combined load + failure experiments reveal cascades and retry storms.
Policy as Code/RBAC/ABAC: guardrails, rollback steps, and limits are expressed as policies.
Consent/privacy management: do not allow experiments that violate the data-processing regime.
Geo-architecture: chaos checks of region failover and data residency by jurisdiction.

15) Mini recipes (pseudocode)

Breaker + degradation


if breaker.open():
    return serve_stale(cache, max_age=15m)
try:
    res = call(dep, timeout=250ms)
    return res
except Timeout:
    breaker.trip()
    return serve_stale()

Limiter + shedding


if cpu.load() > 0.85 or queue.depth() > HIGH:
    if req.priority < HIGH:
        return 503_SHED
limiter.acquire()

Idempotent side effect


key = "payout:" + external_id
if kv.exists(key):
    return kv.get(key)
res = side_effect()
kv.put(key, res, ttl=30d)
return res

16) Architect checklist

1. Are steady state and guardrails defined?
2. Is there a scenario catalog (network/CPU/storage/dependencies/data/operations)?
3. Does observability cover resources, latency tails, and business invariants?
4. Are timeouts/retries/breakers/limiters/bulkheads enabled and parameterizable?
5. Are a runbook and a "red button" prepared?
6. Are there chaos smoke tests on stage and nightly experiments?
7. Are there "safe" windows and roles for game days?
8. Are experiments reproducible (IaC/scripts) and results versioned?
9. Are improvements tracked as tasks, with retests performed?
10. Are data and ML pipelines covered, not only HTTP?

Conclusion

Chaos Engineering turns "unforeseen incidents" into predictable scenarios. A steady-state hypothesis, controlled injections, strict guardrails, rich observability, and retest discipline are the tools that reduce release risk and increase trust in the platform. As a result, the team understands the system's boundaries, can degrade gracefully, and quickly restores service to users even under failure.
