Benchmarking and performance comparison
Brief Summary
Benchmarking is an experiment, not "run wrk for 5 minutes." Main principles:
1. Formulate a hypothesis and metrics.
2. Control variables (hardware, core, power, background noise).
3. Collect enough data (replicas, confidence intervals).
4. Profile: without profiling you cannot answer the "why."
5. Ensure reproducibility: scripts, pinned versions, archived artifacts.
Benchmark goals and business metrics
Throughput: RPS/QPS/CPS, writes/sec.
Latency: p50/p95/p99 and the shape of the tail.
Efficiency: cost per 1k RPS, watts per transaction, $ per millisecond of improvement.
Stability: jitter, run-to-run and node-to-node variability.
Elasticity: how metrics scale with N× resources (Amdahl/Gustafson limits; see the formulas below).
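For reference, the two scaling laws named above, with p the parallelizable fraction and N the resource multiple:

```latex
% Amdahl (fixed workload): speedup is capped at 1/(1-p) as N grows
S_{\mathrm{Amdahl}}(N) = \frac{1}{(1-p) + p/N}
% Gustafson (workload scaled with N): near-linear growth when p is high
S_{\mathrm{Gustafson}}(N) = (1-p) + pN
```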
Methodology: experimental design
Hypothesis: "Envoy with HTTP/3 will reduce p95 TTFB by 10-15% with the same RPS."
Unit of comparison: build/config/instance version of iron.
A/B diagram: parallel run on identical environment; or ABAB/Latin Square to reduce the impact of drift.
Number of repetitions: ≥ 10 short + 3 long runs per configuration for stable ratings.
Statistics: median, MAD, bootstrap confidence intervals; non-parametric tests (Mann-Whitney) for "tailed" distributions.
DoE (minimum): Change one variable at a time (OVAT) or factorial factoring for 2-3 factors (for example, TLS profile × HTTP version × kernel).
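A minimal sketch of generating a randomized full-factorial run plan (factor names and levels are illustrative):

```python
import itertools
import random

# Illustrative factors; substitute your own levels.
factors = {
    "tls_profile": ["modern", "compat"],
    "http_version": ["h2", "h3"],
    "kernel": ["6.1", "6.6"],
}

# Full factorial: every combination of levels becomes one configuration.
runs = [dict(zip(factors, combo)) for combo in itertools.product(*factors.values())]

# Repeat each configuration and randomize the order so that results
# are decoupled from time-of-day drift (temperature, background load).
plan = runs * 3
random.shuffle(plan)
for i, cfg in enumerate(plan, 1):
    print(f"run {i:02d}: {cfg}")
```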
Variable and noise control
CPU governor: `performance`; disable "power save."
Turbo/throttling: monitor frequencies, temperatures, and throttling (otherwise warm-up effects produce spurious wins); see the pre-flight sketch after this list.
NUMA/Hyper-Threading: pin IRQs and processes (`taskset`/`numactl`), measure memory locality.
C-states/IRQ balance: fix the settings; for network tests, pin IRQs to specific cores.
Background processes: clean node, turn off cron/backup/antivirus/updatedb.
Network: stable paths, fixed MTU/ECN/AQM, no channel flutter.
Data: identical datasets, cardinality, and distributions.
Cache: separate "cold" (first pass) and "warm" (repeat) modes, and label each explicitly.
Benchmark Classes
1) Micro benchmarks (function/algorithm)
Purpose: measure a specific piece of code or an algorithm.
Tools: built-in bench frameworks (Go `testing.B`, JMH, pytest-benchmark).
Rules: JIT warm-up; measure at the right resolution (nanoseconds, not milliseconds); isolate GC effects; fix random seeds. A pytest-benchmark sketch follows.
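A minimal pytest-benchmark sketch (the function under test is illustrative); the `benchmark` fixture handles warm-up, calibration, and repetition:

```python
# test_bench_serialize.py -- run with: pytest test_bench_serialize.py
# (requires: pip install pytest pytest-benchmark)
import json
import random

random.seed(42)  # fixed seed: identical payload on every run
PAYLOAD = [{"id": i, "score": random.random()} for i in range(1_000)]

def test_json_dumps(benchmark):
    # pytest-benchmark calibrates rounds/iterations and reports
    # min/mean/stddev, so single-shot timer noise is averaged out.
    result = benchmark(json.dumps, PAYLOAD)
    assert result  # sanity check: we benchmarked real work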
2) Meso benchmarks (component/service)
HTTP server, cache, broker, database on one node.
Tools: wrk/wrk2, k6 (open model), vegeta, ghz (gRPC), fio, sysbench, iperf3.
Rules: set connection/file-descriptor limits and pools; report CPU/IRQ/GC alongside results.
3) Macro benchmarks (e2e/request path)
Full path: CDN/edge → proxy → service → DB/cache → response.
Tools: k6/Locust/Gatling + RUM/OTel tracing; a realistic mix of routes.
Rules: stay close to reality ("dirty" data, lags of external systems); be careful with retries. A Locust sketch of a weighted route mix follows.
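A minimal Locust sketch of a weighted route mix (host, endpoints, and weights are illustrative):

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://api.example.com
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    wait_time = between(0.5, 2.0)  # think time between requests

    @task(8)  # weights approximate the production route mix
    def browse(self):
        self.client.get("/catalog")

    @task(1)
    def checkout(self):
        self.client.post("/orders", json={"item_id": 1, "qty": 1})
```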
Metrics by Layer
Test templates and commands
Network (TCP/UDP):
```bash
iperf3 -s                      # server
iperf3 -c <host> -P 8 -t 60    # client: 8 parallel streams, stable bandwidth
```
HTTP server (constant-rate load, wrk2):
```bash
wrk2 -t8 -c512 -d5m -R 20000 https://api.example.com/endpoint \
  --latency --timeout 2s
```
Open model (k6, arrival-rate):
```javascript
export const options = {
  scenarios: {
    open: {
      executor: 'constant-arrival-rate',
      rate: 1000,
      timeUnit: '1s',
      duration: '10m',
      preAllocatedVUs: 2000,
    },
  },
  thresholds: {
    http_req_failed: ['rate<0.003'],   // i.e. < 0.3% failed requests
    http_req_duration: ['p(95)<250'],  // p95 under 250 ms
  },
};
```
Disk (fio, 4k random read):
```bash
fio --name=randread --rw=randread --bs=4k --iodepth=64 --numjobs=4 \
  --size=4G --runtime=120 --group_reporting --filename=/data/testfile
```
Database (sysbench + PostgreSQL, sketch):
```bash
sysbench oltp_read_write --db-driver=pgsql --table-size=1000000 --threads=64 \
  --pgsql-host=... --pgsql-user=... --pgsql-password=... prepare
sysbench oltp_read_write --db-driver=pgsql --time=600 --threads=64 run
```
Memory/CPU (Linux perf + stress-ng):
```bash
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
  -- <your_binary> --bench
```
Statistics and validity
Replicates: at least 10 runs; treat outliers robustly (median/MAD).
Confidence intervals: bootstrap 95% CI for p95/p99 and for means.
Effect size: relative change with its CI (e.g., −12%, 95% CI [−15%; −9%]).
Practical significance: is a 10% drop in p95 worth +30% CPU?
Graphs: violin/ECDF plots for distributions; saturation curves (RPS → latency). A bootstrap/Mann-Whitney sketch follows.
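A minimal sketch of a bootstrap 95% CI for p95 plus a Mann-Whitney comparison of two runs (the latency samples here are synthetic; requires numpy/scipy):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

def bootstrap_p95_ci(samples, n_boot=10_000, alpha=0.05):
    """95% CI for the p95 of `samples` via percentile bootstrap."""
    samples = np.asarray(samples)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(samples, size=samples.size, replace=True)
        stats[i] = np.percentile(resample, 95)
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Synthetic latency samples (ms) standing in for baseline/candidate runs.
baseline  = rng.lognormal(mean=4.5, sigma=0.4, size=2_000)
candidate = rng.lognormal(mean=4.4, sigma=0.4, size=2_000)

print("baseline  p95 CI:", bootstrap_p95_ci(baseline))
print("candidate p95 CI:", bootstrap_p95_ci(candidate))

# Non-parametric comparison of whole distributions (robust to heavy tails).
stat, p_value = mannwhitneyu(baseline, candidate, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.0f}, p = {p_value:.4f}")
```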
Bottleneck profiling and localization
CPU: `perf`, `async-profiler`, eBPF/pyroscope; flamegraph before and after.
Alloc/GC: runtime profiles (Go pprof/Java JFR).
I/O: `iostat`, `blktrace`, `fio --lat_percentiles=1`.
Network: `ss -s`, `ethtool -S`, `dropwatch`, `tc -s qdisc`.
Database: `EXPLAIN (ANALYZE, BUFFERS)`, pg_stat_statements, slowlog.
Cache: top keys, TTLs, eviction causes.
Reporting and Artifacts
What to record:
- Build git SHA, compiler/optimization flags.
- Kernel/network configs (sysctl); driver/NIC/firmware versions.
- Topology (vCPU/NUMA/HT), governor, temperature/frequencies.
- Data: size, cardinality, distributions.
- What to publish: p50/p95/p99 tables, error/sec, throughput, resources (CPU/RAM/IO), CI.
- Artifacts: run scripts, graphs, flamegraph, raw JSON/CSV results, environment protocol.
Fair benchmarking
Identical limits (connection pools, keepalive, TLS chain, OCSP stapling).
Agreed timeouts/retries and HTTP version (h2/h3).
Thermal balance: warm up to equilibrium (so turbo boost does not favor the first run).
Fair caches: Either both "cold" or both "warm."
Network symmetry: same routes/MTU/ECN/AQM.
Time budget: DNS/TLS/connect time is either measured explicitly or excluded on both sides.
Anti-patterns
One run → "output."
Mixing of modes (part cold, part warm) in one series.
A closed model instead of an open one for the Internet load → false "stability."
Unaccounted retrays → "RPS grows" at the cost of takes and cascading 5xx.
Comparison on different glands/cores/power circuits.
No profiling → blind optimization.
Playing with GC/heap without profile analysis → tail regression.
Practical recipes
Minimum bench pipeline steps:
1. Capture the environment (`env_capture.sh` script; a sketch follows this list).
2. Warm up (5-10 min); record frequencies/temperatures.
3. Run N short repetitions + 1 long run.
4. Capture profiles (CPU/alloc/IO) at peak.
5. Compute CIs/graphs; collect artifacts.
6. Decide: accept/reject the hypothesis; plan the next steps.
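A minimal Python sketch of what `env_capture.sh` might record (the script itself is not shown in this text; the interface name `eth0` is an assumption):

```python
#!/usr/bin/env python3
"""Capture the benchmark environment into a JSON artifact."""
import json
import subprocess
import time

def sh(cmd: str) -> str:
    """Run a shell command, returning stdout or an error marker."""
    try:
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=10).stdout.strip()
    except Exception as exc:
        return f"<error: {exc}>"

snapshot = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
    "git_sha": sh("git rev-parse HEAD"),
    "kernel": sh("uname -r"),
    "cpu_model": sh("grep -m1 'model name' /proc/cpuinfo"),
    "governor": sh("cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"),
    "sysctl_net": sh("sysctl net.core.somaxconn net.ipv4.tcp_congestion_control"),
    "nic_driver": sh("ethtool -i eth0"),  # assumption: interface is eth0
}

with open("env_snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)
print(json.dumps(snapshot, indent=2))
```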
Capacity curve:
- Step RPS up in ~10% increments → record p95/errors → find the "knee" (see the sketch below).
- Plot RPS → latency and RPS → CPU: this shows the limit and the cost of each additional percent.
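A minimal sketch of locating the knee from step-load results; the 1.5× threshold is an illustrative heuristic, not a standard:

```python
# Each tuple: (offered RPS, measured p95 latency in ms) from a step-load run.
results = [
    (1000, 42), (2000, 44), (3000, 47), (4000, 52),
    (5000, 61), (6000, 95), (7000, 240), (8000, 900),
]

baseline_p95 = results[0][1]  # p95 at the lowest, clearly unsaturated step
KNEE_FACTOR = 1.5             # heuristic threshold: tune to your SLO

# The knee: first step where p95 leaves the near-linear region.
knee = next((rps for rps, p95 in results
             if p95 > KNEE_FACTOR * baseline_p95), None)

if knee is None:
    print("no knee within the tested range; raise the load ceiling")
else:
    print(f"knee near {knee} RPS; plan capacity below this point with headroom")
```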
iGaming/fintech specific
Cost per millisecond: Rank improvements by $ effect (conversion/churn/PSP limits).
Peaks (matches/tournaments): spike + plateau benchmarks with TLS/CDN/cache warming up.
Payments/PSP: measure end-to-end against sandbox limits, with idempotency and degradation behavior; track Time-to-Wallet via proxy metrics.
Anti-fraud/bot filters: include the rule profile in the macro bench (false-positive rate, added latency).
Leaderboards/jackpots: test hot keys/ranking, locks, atomicity.
Benchmarking checklist
- Hypothesis/metrics/success criterion.
- Variable monitoring (power/NUMA/IRQ/network/cache).
- Run plan (replicas, duration, warm-up).
- Cold/warm separation.
- Profiling enabled (CPU/alloc/IO/DB).
- Statistics: CI, significance tests, graphs.
- Artifacts and repro scripts in the repository (IaC for the bench).
- Report with "improvement cost" and recommendations.
- Performance regression tracking.
Mini-report (template)
Goal: reduce API p95 by 15% without increasing CPU by more than 10%.
Method: A/B, k6 open model at 1k RPS, 10 × 3 runs, warm cache.
Result: p95 −12% [−15%; −9%], CPU +6%, 5xx unchanged.
Flamegraph: ↓ JSON serialization (−30% CPU); the bottleneck shifted to the database.
Decision: accept the optimization; next step is batching database requests.
Artifacts: graphs, profiles, configs, raw JSON.
Bottom line
Good benchmarking is rigorous methodology + fair comparisons + statistical validity + profiling + reproducibility. Formulate hypotheses, control the environment, read confidence intervals, publish artifacts, and make decisions based on the cost of each improvement. That way you get not just a pretty number for a presentation, but a real gain in the speed and predictability of the platform.