Load and stress testing
1) Why you need it
Objectives:
- Confirm capacity (how many RPS/concurrent sessions the system can sustain within SLO).
- Find bottlenecks (CPU/IO/DB/network/locks/pools).
- Set up performance budgets and gates in CI/CD.
- Reduce release risk (p95/p99 regressions, error-rate spikes at peak).
- Plan capacity/cost (scale-out and headroom).
2) Types of perf tests
Load: realistic traffic close to peaks; SLO validation.
Stress: push to/above the limit → how the system degrades and where it breaks.
Spike: fast load jump → elasticity/autoscale.
Soak/Endurance: hours/days → leaks, fragmentation, latency drift.
Capacity/Scalability: how throughput/latency changes with scale-out; Amdahl's/Gustafson's laws.
Smoke perf: a short "smoke" run on each release (performance sanity check).
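A hedged k6 sketch of how these types can map onto scenario executors; the scenario names, rates, and durations are illustrative assumptions, and in practice each type usually lives in its own script (or is staggered with startTime):

```js
import http from 'k6/http';

export const options = {
  scenarios: {
    // Short sanity run on every release.
    smoke: { executor: 'constant-vus', vus: 5, duration: '2m' },
    // Steady realistic traffic near the expected peak.
    load: {
      executor: 'constant-arrival-rate',
      rate: 300, timeUnit: '1s', duration: '30m',
      preAllocatedVUs: 300, maxVUs: 1000,
    },
    // Sudden jump to test elasticity/autoscaling.
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 50, timeUnit: '1s',
      preAllocatedVUs: 100, maxVUs: 2000,
      stages: [
        { target: 1500, duration: '30s' }, // jump
        { target: 1500, duration: '5m' },  // hold at the peak
        { target: 50, duration: '1m' },    // back to baseline
      ],
    },
    // Long run to surface leaks and latency drift.
    soak: {
      executor: 'constant-arrival-rate',
      rate: 200, timeUnit: '1s', duration: '8h',
      preAllocatedVUs: 200, maxVUs: 500,
    },
  },
};

export default function () {
  http.get(`${__ENV.BASE_URL}/api/catalog?limit=20`);
}
```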
3) Traffic generation models
Fixed VUs/concurrency (closed model): 'N' users, each issuing requests in a loop → requests queue up on the client. Risk of hiding overload.
Arrival rate (open model): a stream of requests arriving at rate λ (req/s), as in real life. More correct for public APIs.
Little's Law: 'L = λ × W', where 'λ' is the throughput (arrival rate), 'W' is the average time a request spends in the system, and 'L' is the number of requests in flight.
For a pool/service, minimum parallelism ≈ 'λ × W' (add 20-50% headroom).
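A worked example of this sizing rule with hypothetical numbers, runnable in Node:

```js
// Hypothetical target: arrival rate λ = 500 req/s, average service time W = 80 ms.
const lambda = 500;   // req/s
const W = 0.08;       // seconds a request spends in the service

const inFlight = lambda * W;                 // Little's Law: L = λ × W = 40 requests in flight
const poolSize = Math.ceil(inFlight * 1.3);  // + ~30% headroom → 52 connections/workers

console.log({ inFlight, poolSize });          // { inFlight: 40, poolSize: 52 }
```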
4) Load profiles and scenarios
User journey mix: relative shares of the scenarios (login, browse, deposit, checkout...); see the sketch after this list.
Think-time: user pauses (distributions: exponential/lognormal).
Data profile: size of responses, payload, variability of parameters.
Correlation: link steps (cookies/tokens/ID) as in a real flow.
Cold/warm/hot cache: individual runs.
Read vs Write: balance of reads/writes, idempotency for retries.
Multi-region: RTT, distribution by POP/ASN.
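A minimal k6 sketch of such a mix with exponentially distributed think-time; the endpoints, weights, and the 1.5 s mean pause are assumptions for illustration:

```js
import http from 'k6/http';
import { sleep } from 'k6';

// Exponentially distributed think-time with a given mean (seconds).
function thinkTime(meanSec) {
  return -Math.log(1 - Math.random()) * meanSec;
}

// Journey mix: 60% browse catalog, 30% view product, 10% checkout.
const journeys = [
  { weight: 0.6, run: () => http.get(`${__ENV.BASE_URL}/api/catalog?limit=20`) },
  { weight: 0.3, run: () => http.get(`${__ENV.BASE_URL}/api/product/42`) },
  { weight: 0.1, run: () => http.post(`${__ENV.BASE_URL}/api/checkout`,
      JSON.stringify({ sku: 'A1', qty: 1 }),
      { headers: { 'Content-Type': 'application/json' } }) },
];

export default function () {
  // Pick a journey according to its share of traffic.
  let r = Math.random();
  for (const j of journeys) {
    if ((r -= j.weight) <= 0) { j.run(); break; }
  }
  sleep(thinkTime(1.5)); // ~1.5 s average pause between actions
}
```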
5) Test environment
Isolation: the test environment mirrors production in topology/settings (but do not hammer production itself).
Data: PII masking; volumes and indexes as in production.
Load generators: must not themselves be limited by CPU/network; use distributed runners and time synchronization.
Observability: metrics/traces/logs, synthetics at the perimeter, export of CPU/heap profiles.
6) Metrics and SLI
Throughput: RPS / transactions per second.
Latency: p50/p95/p99, TTFB, server time vs network.
Errors: share of 5xx/4xx/domain errors.
Saturation: CPU, load avg, GC, disk IOps/latency, network, pool wait.
Business SLIs: e.g. a deposit succeeds in ≤ 5 s, an order confirmation in ≤ 2 s.
Take thresholds from the SLO (for example, "99.95% of requests ≤ 300 ms") and monitor the burn rate during the run; a sketch of wiring a business SLI into thresholds follows below.
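One hedged way to express a business SLI of this kind in k6 is a custom Trend metric with its own threshold; the endpoint, payload, and limits below are assumptions:

```js
import http from 'k6/http';
import { Trend } from 'k6/metrics';

// Business SLI: time for a deposit to be confirmed, tracked separately from raw HTTP timings.
const depositDuration = new Trend('deposit_duration', true); // true = values are durations

export const options = {
  thresholds: {
    deposit_duration: ['p(95)<5000'],  // 95% of deposits confirmed within 5 s
    http_req_failed: ['rate<0.005'],   // overall error budget for the run
  },
};

export default function () {
  const started = Date.now();
  const res = http.post(`${__ENV.BASE_URL}/api/deposit`,
    JSON.stringify({ amount: 10 }),
    { headers: { 'Content-Type': 'application/json' } });
  if (res.status === 200) {
    depositDuration.add(Date.now() - started); // record only successful deposits
  }
}
```

The custom metric is evaluated by the same pass/fail mechanism as the built-in HTTP metrics, so the business SLI can gate the run directly.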
7) Finding bottlenecks (technique)
1. Warm up the system at 60-80% of the target load.
2. Increase the load in steps (ramp) → record where p95/p99 and the error rate start to grow.
3. Correlate the degradation with saturation signals:
- queues in pools (DB/HTTP),
- growth of waits/locks (DB),
- GC pauses/heap pressure,
- network retransmits/packet loss,
- disk latency/cache misses.
4. Localize: binary search along the request path, profilers (CPU/alloc/lock profiles); see the tagging sketch after this list.
5. Fix the bottleneck → tune → repeat the run.
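For the localization step, a sketch (assumed endpoints and limits) of tagging each step of the path in k6 so per-step p95 can be compared and the degrading segment isolated:

```js
import http from 'k6/http';

export const options = {
  thresholds: {
    // Sub-metrics by tag show which step of the path degrades first.
    'http_req_duration{step:login}':    ['p(95)<200'],
    'http_req_duration{step:catalog}':  ['p(95)<300'],
    'http_req_duration{step:checkout}': ['p(95)<800'],
  },
};

export default function () {
  http.post(`${__ENV.BASE_URL}/api/login`, JSON.stringify({ user: 'u', pass: 'p' }),
    { headers: { 'Content-Type': 'application/json' }, tags: { step: 'login' } });
  http.get(`${__ENV.BASE_URL}/api/catalog?limit=20`, { tags: { step: 'catalog' } });
  http.post(`${__ENV.BASE_URL}/api/checkout`, JSON.stringify({ sku: 'A1', qty: 1 }),
    { headers: { 'Content-Type': 'application/json' }, tags: { step: 'checkout' } });
}
```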
8) Behavior under stress
Graceful degradation: rate limits, circuit breakers, backpressure queues, "accepted for processing" (async) responses.
Retries: at most 1, only for idempotent operations, with jitter; retry budget ≤ 10% of RPS (see the sketch after this list).
Fail-open/Fail-closed: for non-critical dependencies, allow fail-open (cache/stubs).
Cascading failure: pool/quota isolation (bulkheads), fast timeouts, graceful disabling of features (feature flags).
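A minimal sketch of the retry rule above in plain JavaScript, not tied to any particular library; the function names, the 200 ms jitter cap, and the process-lifetime budget are simplifying assumptions:

```js
// At most one retry, full jitter, and a coarse global budget so retries stay ≤ 10% of traffic.
let attempts = 0;
let retries = 0;

async function callWithRetry(doRequest) {
  attempts += 1;
  try {
    return await doRequest(); // doRequest must be idempotent for a retry to be safe
  } catch (err) {
    const withinBudget = retries / Math.max(attempts, 1) < 0.10;
    if (!withinBudget) throw err;             // budget exhausted: fail fast, no self-DDoS
    retries += 1;
    const jitterMs = Math.random() * 200;     // full jitter before the single retry
    await new Promise((resolve) => setTimeout(resolve, jitterMs));
    return doRequest();
  }
}
```

In a real client the budget would be tracked over a sliding time window rather than the whole process lifetime.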
9) Tools (selection for the task)
k6 (JavaScript, open source, supports the open (arrival-rate) model, fast, convenient in CI).
JMeter (rich in ecosystem, GUI/CLI, plugins, but heavier).
Gatling (Scala DSL, high performance).
Locust (Python, scripting flexibility).
Vegeta/hey/wrk (micro-benches and quick check).
Rule of thumb: one "main" tool + a light CLI tool for smoke runs in PRs.
10) Examples (snippets)
10.1 k6 (open model with arrival rate)
```js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  scenarios: {
    open_model: {
      executor: 'ramping-arrival-rate',
      startRate: 200, timeUnit: '1s',
      preAllocatedVUs: 200, maxVUs: 2000,
      stages: [
        { target: 500, duration: '5m' }, // ramp to 500 rps
        { target: 800, duration: '5m' }, // stress
        { target: 0, duration: '1m' }
      ]
    }
  },
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<800'],
    http_req_failed: ['rate<0.005'],
  },
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/catalog?limit=20`);
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(Math.random() * 2); // think-time
}
```
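For example, the script can be run against a staging host with `k6 run -e BASE_URL=https://staging.example.com script.js`; the `-e` flag passes environment variables such as BASE_URL into the test (host and file name are placeholders).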
10.2 JMeter (profile idea)
Thread Group + Stepping Thread Group or Concurrency Thread Group (open-like model).
HTTP Request Defaults, Cookie Manager, CSV Data Set.
Backend Listener → InfluxDB/Grafana; Assertions by time/code.
10.3 Locust (Python)
```python
from locust import HttpUser, task, between

class WebUser(HttpUser):
    wait_time = between(0.2, 2.0)

    @task(5)
    def browse(self):
        self.client.get("/api/catalog?limit=20")

    @task(1)
    def buy(self):
        self.client.post("/api/checkout", json={"sku": "A1", "qty": 1})
```
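A typical headless run looks like `locust -f locustfile.py --headless -u 100 -r 10 --host https://staging.example.com`; the user count, spawn rate, and host here are placeholders.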
11) Data, correlation, preparation
Seed data: catalogs/reference data, users, balances, tokens, as in production.
PII masking/anonymization; generate synthetic data on top of real distributions.
Correlation: Extract IDs/tokens from responses (RegExp/JSONPath) and use in subsequent steps.
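A small k6 sketch of this pattern: extract a token from the login response body and reuse it in the next step (endpoints and field names are assumptions):

```js
import http from 'k6/http';

export default function () {
  const login = http.post(`${__ENV.BASE_URL}/api/login`,
    JSON.stringify({ user: 'demo', pass: 'demo' }),
    { headers: { 'Content-Type': 'application/json' } });

  const token = login.json('token'); // pull the token field out of the JSON response

  http.get(`${__ENV.BASE_URL}/api/account/balance`, {
    headers: { Authorization: `Bearer ${token}` }, // reuse it in the next step
  });
}
```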
12) Observability during runs
RED dashboards (Rate, Errors, Duration) per route.
Exemplars: jump from a metric straight to the corresponding trace (trace_id).
Error logs: sampling + aggregation, duplicates/idempotence.
System: CPU/GC/heap, disks/network, pool wait.
DB: top queries, locks, index scans, bloat.
13) Automation and performance gates
CI: short runs on merge (e.g. a 2-3 minute k6 run) with thresholds; see the gate sketch after this list.
Nightly/Weekly: long soak/stress runs in a separate environment; reports and trends.
Canary releases: SLO analysis (error rate, p95) as the promotion gate.
Regressions: baseline vs current build; alert when degradation exceeds X%.
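In k6 such a gate can be expressed directly as thresholds; adding abortOnFail stops a short CI run as soon as the budget is blown (the limits below are illustrative):

```js
import http from 'k6/http';

export const options = {
  vus: 20,
  duration: '2m', // short merge-time run
  thresholds: {
    http_req_duration: [{ threshold: 'p(95)<300', abortOnFail: true }],
    http_req_failed:   [{ threshold: 'rate<0.005', abortOnFail: true }],
  },
};

export default function () {
  http.get(`${__ENV.BASE_URL}/api/catalog?limit=20`);
}
```

A failed threshold also makes k6 exit with a non-zero status, which is what actually fails the CI job.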
14) Capacity planning and cost
Throughput→latency curves: identify the knee point, after which p99 grows sharply (see the sketch after this list).
Scale-out: measure scaling efficiency (ΔRPS / Δnodes).
Cost: "RPS per $/hour"; headroom for peak events + a DR reserve.
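A toy sketch of flagging the knee from ramp measurements: take per-plateau (RPS, p99) samples and report the last point before latency growth far outpaces throughput growth; the data and the heuristic constants are made up for illustration:

```js
// Each sample: sustained RPS at a plateau and the p99 latency (ms) observed there.
const samples = [
  { rps: 200, p99: 180 },
  { rps: 400, p99: 210 },
  { rps: 600, p99: 260 },
  { rps: 800, p99: 600 },   // latency starts to blow up here
  { rps: 900, p99: 1400 },
];

function findKnee(points) {
  for (let i = 1; i < points.length; i++) {
    const latencyGrowth = points[i].p99 / points[i - 1].p99;    // e.g. 2.3x
    const throughputGrowth = points[i].rps / points[i - 1].rps; // e.g. 1.33x
    // Crude heuristic: latency grows much faster than throughput → the curve has bent.
    if (latencyGrowth > 2 && throughputGrowth < 1.5) {
      return points[i - 1]; // last "healthy" point before the knee
    }
  }
  return null;
}

console.log(findKnee(samples)); // → { rps: 600, p99: 260 }
```

In practice the knee is usually read off the plotted curve; a script like this only automates flagging it in trend reports.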
15) Anti-patterns
Hammering production without controls, or testing an "empty" environment that does not resemble production.
A closed model with fixed VUs that hides overload.
Missing think-time/data variety → unrealistically high cache hit rates, or the opposite: a thundering herd against the origin.
A single "/ping" script instead of real user flows.
Lack of observability: "we only see RPS and average latency."
Uncontrolled retries → self-inflicted DDoS.
Mixing testing and optimization without recording hypotheses/changes.
16) Checklist (0-30 days)
0-7 days
Define SLI/SLO and target traffic profiles (mix, think-time, data).
Select the tool (k6/JMeter/Locust), raise the RED dashboards.
Prepare the environment and seed data; disable third-party rate limits/captchas.
8-20 days
Build scenarios: open-model (arrival rate), cold/warm/hot cache.
Run load → stress → spike; record the knee point and bottlenecks.
Implement performance gates in CI (micro-run).
21-30 days
Soak test for 4-24 h: memory leaks/GC drift, stabilization.
Document limits, the capacity plan, and "RPS→p95/errors" charts.
Prepare runbook "how to increase limits/scale" and "how to degrade."
17) Maturity metrics
Realistic profiles (mix, think-time, data) exist and cover ≥ 80% of traffic.
RED dashboards + tracing are connected for all tests.
Performance gates block releases on p95/error regressions.
Capacity and knee points are documented for key services.
Monthly soak/stress runs and progress reports.
Spike resilience is confirmed by autoscaling and the absence of cascading failures.
18) Conclusion
Load testing is a regular engineering practice, not a one-time "measurement." Model real users (open model), measure what reflects the client's experience (SLI/SLO), keep observability and gates in CI/CD, run stress/spike/soak tests, and record knee points. Then peak events and black swans become manageable scenarios, and performance becomes a predictable, measurable property of your platform.