Chaos Engineering: System Resilience
1) Why chaos engineering
The goal is to prove the resilience of the production architecture not with words and diagrams, but with experiments. We deliberately introduce controlled failures to:
- test hypotheses about system behavior and validate SLOs;
- uncover hidden SPOFs, incorrect timeouts/retries, and cascading effects;
- train teams: game days, runbook rehearsals, communications;
- build a culture of "resilience by default" rather than "hoping for the best."
Important: Chaos Engineering ≠ "break everything." This is a scientific method: steady-state → hypothesis → experiment → conclusions → improvement.
2) Basic experiment cycle
1. Steady-state (baseline): which SLIs are stable? For example, ≥99.95% of requests succeed within 500 ms (a recording-rule sketch follows this list).
2. Hypothesis: if one AZ is lost, p95 increases by less than 10% and availability stays ≥99.9%.
3. Experiment: a planned fault injection with a limited blast radius and stop criteria.
4. Observation: metrics/traces/logs, SLO burn rate, business SLIs (for example, successful deposits).
5. Improvements: record findings, adjust timeouts/limits/routing, update the runbook.
6. Automation/regression: repeat on a schedule, add to CI/CD and game-day calendars.
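Step 1 can be made concrete by pinning the steady-state SLIs as recording rules. A minimal sketch, assuming Prometheus and conventional `http_requests_total` / `http_request_duration_seconds_bucket` metrics; the metric, job, and rule names are placeholders, not part of any specific stack described here:

```yaml
# Hypothetical Prometheus recording rules that define the steady-state SLIs.
# Metric and label names are assumptions; adapt them to your instrumentation.
groups:
  - name: steady-state-sli
    rules:
      - record: sli:availability:ratio_5m
        expr: |
          sum(rate(http_requests_total{job="api",code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))
      - record: sli:latency:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
          )
```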
3) Safety first
Blast radius: start narrow - one pod/instance/route/namespace.
Guardrails: alerts on SLO burn rate (fast/slow), retry limits, QPS caps, an incident budget.
Stop criteria: "if error rate > X% or p99 > Y ms for N minutes - stop immediately and roll back."
Windows: on-call working hours, stakeholders notified, releases frozen.
Communication: IC/Tech lead/Comms roles, a dedicated channel (war room), message templates.
4) Failure classes and hypothesis ideas
Network: delay/jitter/loss, partial port drops, "flapping" connectivity between services/PSP.
Compute/nodes: process kills, CPU overload, file-descriptor exhaustion, narrow connection pools (see the StressChaos sketch after this list).
Storage and DB: disk latency growth, replica lag, loss of a shard/leader, split-brain.
Dependencies: degradation of external APIs, provider rate limits, 5xx/429 bursts.
Change management: a failed release, a bad feature flag, a partial rollout.
Perimeter: CDN degradation, DNS/Anycast drift, WAF/bot-protection failure.
Region/AZ: complete loss or a "partial" incident (slightly degraded and unpredictable).
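For the compute/nodes class, a minimal Chaos Mesh StressChaos sketch that puts CPU pressure on a single pod; the `prod` namespace and `app=worker` label are illustrative placeholders:

```yaml
# Sketch: CPU pressure on one pod via Chaos Mesh StressChaos.
# Namespace, labels, and load values are illustrative placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-pressure-worker
spec:
  mode: one
  selector:
    namespaces: ["prod"]
    labelSelectors:
      app: worker
  stressors:
    cpu:
      workers: 2   # number of stress workers
      load: 80     # target CPU load per worker, percent
  duration: "5m"
```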
5) Tools and techniques
Kubernetes: Chaos Mesh, Litmus, PowerfulSeal, kube-monkey.
Clouds: AWS Fault Injection Simulator (FIS) and analogous fault-injection services from other cloud providers.
Network/proxy: Toxiproxy (TCP toxics), tc/netem, iptables, Envoy fault injection (delay/abort), Istio fault injection.
Processes/nodes: stress-ng, cgroups/CPU throttling, disk fill.
Traffic routing: GSLB/DNS weights, canary/blue-green switching for controlled failover drills.
6) Sample scenarios (Kubernetes)
6.1 Delay/abort on a route (Istio VirtualService)
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-chaos
spec:
  hosts: ["api.internal"]
  http:
  - route:
    - destination:
        host: api-svc
    fault:
      delay:
        percentage: { value: 5 }
        fixedDelay: 500ms
      abort:
        percentage: { value: 1 }
        httpStatus: 503
```
Hypothesis: client timeouts/retries and circuit breakers will keep p95 <300 ms and error rate <0.5%.
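This hypothesis only holds if the client side actually has timeouts, retries, and a circuit breaker configured. A hedged sketch of such settings in Istio terms; resource names and numbers are assumptions, shown separately from the chaos VirtualService above for clarity:

```yaml
# Sketch: the retry/timeout/outlier-detection settings the 6.1 hypothesis relies on.
# Host, resource names, and thresholds are illustrative.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api-resilience
spec:
  host: api-svc
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 10
    outlierDetection:            # a simple circuit breaker
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-retries
spec:
  hosts: ["api.internal"]
  http:
  - route:
    - destination:
        host: api-svc
    timeout: 2s
    retries:
      attempts: 2
      perTryTimeout: 500ms
      retryOn: 5xx,reset,connect-failure
```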
6.2 Pod kill (Chaos Mesh)
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-api
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["prod"]
    labelSelectors:
      app: api
  duration: "2m"
```
Hypothesis: the load balancer and HPA compensate for the loss of one instance without p99 growing by more than 20%.
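What the hypothesis relies on can be stated explicitly: enough replicas, an autoscaler, and a disruption budget so routine operations do not stack on top of the experiment. A sketch with illustrative names and thresholds:

```yaml
# Sketch: autoscaling and a disruption budget backing the pod-kill hypothesis.
# Deployment name, replica counts, and targets are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # limits voluntary disruptions (drains, rollouts)
  selector:                # so they do not stack on top of the chaos kill
    matchLabels:
      app: api
```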
6.3 Network chaos (added latency toward the database)
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-latency
spec:
  action: delay
  mode: all
  selector:
    namespaces: ["prod"]
    labelSelectors:
      app: payments
  delay:
    latency: "120ms"
    jitter: "30ms"
    correlation: "25"
  direction: to
  target:
    mode: all
    selector:
      namespaces: ["prod"]
      labelSelectors:
        role: db
  duration: "5m"
```
Hypothesis: connection pools/timeouts/caches will limit the impact; payment p95 stays within the SLO.
6.4 Disk fill
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: disk-fill-logs
spec:
  action: fill
  mode: one
  selector:
    labelSelectors:
      app: ingest
  volumePath: /var/log
  size: "2Gi"
  duration: "10m"
```
Hypothesis: log rotation/quotas/alerts kick in before routes degrade.
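One of the quotas this hypothesis counts on can be expressed directly in the workload spec: ephemeral-storage limits, so a runaway log volume triggers eviction and alerts rather than starving the node. A sketch with placeholder names, image, and sizes:

```yaml
# Sketch: ephemeral-storage requests/limits for the log-heavy workload.
# Names, image, and sizes are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingest
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ingest
  template:
    metadata:
      labels:
        app: ingest
    spec:
      containers:
        - name: ingest
          image: registry.example.com/ingest:latest   # placeholder image
          resources:
            requests:
              ephemeral-storage: "1Gi"
            limits:
              ephemeral-storage: "4Gi"   # kubelet evicts the pod if exceeded
```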
7) Experiments outside Kubernetes
7.1 Toxiproxy (local/integration)
```bash
toxiproxy-cli create psp --listen 127.0.0.1:9999 --upstream psp.prod:443
toxiproxy-cli toxic add psp -t latency -a latency=200 -a jitter=50
toxiproxy-cli toxic add psp -t timeout -a timeout=1000
```
7.2 Envoy HTTP fault (perimeter/mesh)
```yaml
fault:
  delay:
    fixed_delay: 0.3s
    percentage: { numerator: 10, denominator: HUNDRED }
  abort:
    http_status: 503
    percentage: { numerator: 1, denominator: HUNDRED }
```
7.3 AWS FIS (example ideas)
Experiment ideas: "kill" N% of EC2 instances in an Auto Scaling Group, artificially raise EBS latency, disable the NAT gateway in one AZ.
Built-in stop conditions driven by CloudWatch alarms on SLO metrics.
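A hedged sketch of such an experiment template, as it could be passed to `aws fis create-experiment-template --cli-input-yaml`; the role ARN, alarm ARN, ASG tag value, and percentages are placeholders:

```yaml
# Sketch: FIS experiment template - terminate a share of ASG instances,
# with a CloudWatch alarm as the stop condition. ARNs and tags are placeholders.
description: "Terminate 30% of api ASG instances; stop on SLO alarm"
roleArn: "arn:aws:iam::123456789012:role/fis-experiment-role"
targets:
  api-instances:
    resourceType: "aws:ec2:instance"
    resourceTags:
      "aws:autoscaling:groupName": "api-asg"
    selectionMode: "PERCENT(30)"
actions:
  terminate-instances:
    actionId: "aws:ec2:terminate-instances"
    targets:
      Instances: "api-instances"
stopConditions:
  - source: "aws:cloudwatch:alarm"
    value: "arn:aws:cloudwatch:us-east-1:123456789012:alarm:slo-burn-rate-fast"
```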
8) Observability metrics during chaos
SLO/SLI: share of good requests, p95/p99, burn rate.
RED model for critical routes (Rate, Errors, Duration).
Pools: p95 connection wait time, utilization.
DB: replica lag, locks, drift of query p95.
Network: retransmits, RTT, DSCP/ECN behavior.
Business SLIs: transaction success (deposits/checkouts), % of refunds/errors.
Tracing: sampled traces (exemplars), correlation with release annotations.
9) Integration with SLO/Error-budget
Plan experiments within the error budget: chaos must not blow through quarterly goals.
Burn-rate alerts as an automatic kill switch (see the sketch below).
Reporting: how much budget was burned, which deviations from steady-state were observed.
10) Game days (exercises)
Scenario: a brief legend (e.g. "region East is lost"), injection steps, SLO targets, roles, timing.
Evaluation: actual RTO/RPO, quality of communications, runbook correctness.
Retro: a list of improvements with owners and deadlines, updated documentation/dashboards.
11) Automation and CI/CD
Smoke chaos: short staging tests on each release (e.g. one pod kill + a 200 ms delay on a route); see the pipeline sketch after this list.
Nightly/weekly: heavier scenarios (5-15 min) with a report.
Promotion gates: if p95/errors exceed the threshold on canary, auto-rollback.
A repository with a catalog of experiments (YAML + runbook + SLO thresholds).
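A hedged sketch of a smoke-chaos gate as a single GitHub Actions job (a fragment for a workflow's `jobs:` section; the manifest paths, the `PROM_URL` variable, the `sli:latency:p95_5m` recording rule from the earlier sketch, and the thresholds are placeholders; any CI system with kubectl and curl can do the same):

```yaml
# Sketch: smoke-chaos job on staging after deploy. Paths, URLs, and thresholds
# are placeholders; the deploy-staging job it depends on is assumed to exist.
smoke-chaos:
  runs-on: ubuntu-latest
  needs: deploy-staging
  steps:
    - uses: actions/checkout@v4
    - name: Inject short pod-kill + latency
      run: |
        kubectl apply -f chaos/smoke/pod-kill.yaml
        kubectl apply -f chaos/smoke/api-latency-200ms.yaml
        sleep 180
    - name: Check SLIs against thresholds
      run: |
        P95=$(curl -s "$PROM_URL/api/v1/query" \
          --data-urlencode 'query=sli:latency:p95_5m' | jq -r '.data.result[0].value[1]')
        echo "p95=${P95}s"
        awk -v p95="$P95" 'BEGIN { exit !(p95 < 0.3) }'   # fail the job if p95 >= 300 ms
    - name: Clean up chaos
      if: always()
      run: kubectl delete -f chaos/smoke/ --ignore-not-found
```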
12) Anti-patterns
"Breaking food without railings": no stop criteria, no on-call → the risk of a real incident.
One-time action instead of process.
Chaos without steady-state: It's not clear what counts as success/failure.
Excessive retrays → self-DDoS when injecting delays.
Ignoring business SLI: "Technarian" success when payments/orders fail.
Lack of post-analysis and improvement owners.
13) Implementation checklist (0-45 days)
0-10 days
Define steady-state SLIs (user-facing + business).
Select a tool (Chaos Mesh/Litmus/Toxiproxy/FIS).
Describe the guardrails: blast radius, stop criteria, windows, roles.
11-25 days
Run the first experiments: pod kill, a 100-200 ms delay on a critical upstream, dropping 1% of packets.
Configure burn-rate alerts; tie the kill switch to the stop criteria.
Hold the first game day, collect the retro and fixes.
26-45 days
Add AZ-level and dependency scenarios (external PSP, DB lag).
Automate nightly chaos on staging; prepare "seasonal" scenarios (peak loads).
Build a catalog of experiments and regular reports for management/SRE.
14) Maturity metrics
≥80% of critical routes have documented experiments and steady-state metrics.
The automatic kill switch triggers when p99/error-rate thresholds are exceeded.
Quarterly: an AZ/region-level game day; at least once a month: a targeted dependency scenario.
MTTR decreases after each improvement cycle; the "release ↔ incident" correlation weakens.
The share of "unexpected" findings during real failures trends toward zero.
Dashboards show resilience as KPIs (burn rate, recovery time, share of successful DR drills).
15) Examples of guardrails and stop triggers
Stop when: `http_req_failed > 1%` for 3 minutes, `p99 > 1000 ms` for 3 consecutive windows, `deposit_success < 99.5%`.
Blast-radius reduction: auto-rollback of the manifest, restore GSLB weights, disable fault injection.
Stop command: a single button/script that logs the reason.
16) Culture and processes
Chaos is part of the SRE rhythm, not an occasional stunt.
Transparent reporting, acknowledging vulnerabilities, corrective actions.
On-call training, simulated communications with customers/partners.
Alignment with SLA/SLO and budgets: chaos should strengthen reliability, not undermine it.
17) Conclusion
Chaos Engineering turns "hoping for the nines" into provable resilience. Define the steady-state, put guardrails in place, break things in small controlled doses, watch SLOs and business SLIs, and record and automate the improvements. Then real failures become managed events: predictable RTO, a protected error budget, and a team ready to act without panic.