Benchmarking and performance comparison
Brief Summary
Benchmarking is an experiment, not "run wrk for 5 minutes." Main principles:
1. Formulate a hypothesis and metrics.
2. Control variables (hardware, core, power, background noise).
3. Collect enough data (replicas, confidence intervals).
4. Profile: without profiling you cannot answer the "why."
5. Ensure reproducibility: scripts, pinned versions, archived artifacts.
Benchmark goals and business metrics
Throughput: RPS/QPS/CPS, writes/sec.
Latency: p50/p95/p99 and the shape of the tail.
Efficiency: cost per 1k RPS, watts per transaction, $ per millisecond of improvement.
Stability: jitter, run-to-run and node-to-node variability.
Elasticity: how metrics scale with N× resources (Amdahl/Gustafson limits; see the formulas below).
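For reference, the two scaling laws named above, with p the parallelizable fraction and N the resource multiple:

```latex
% Amdahl (fixed workload): speedup is capped at 1/(1-p) as N grows
S_{\mathrm{Amdahl}}(N) = \frac{1}{(1-p) + p/N}
% Gustafson (workload scaled with N): near-linear growth when p is high
S_{\mathrm{Gustafson}}(N) = (1-p) + pN
```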
Methodology: experimental design
Hypothesis: "Envoy with HTTP/3 will reduce p95 TTFB by 10-15% with the same RPS."
Unit of comparison: build/config/instance version of iron.
A/B diagram: parallel run on identical environment; or ABAB/Latin Square to reduce the impact of drift.
Number of repetitions: ≥ 10 short + 3 long runs per configuration for stable ratings.
Statistics: median, MAD, bootstrap confidence intervals; non-parametric tests (Mann-Whitney) for "tailed" distributions.
DoE (minimum): Change one variable at a time (OVAT) or factorial factoring for 2-3 factors (for example, TLS profile × HTTP version × kernel).
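A minimal sketch of generating a randomized full-factorial run plan (factor names and levels are illustrative):

```python
import itertools
import random

# Illustrative factors; substitute your own levels.
factors = {
    "tls_profile": ["modern", "compat"],
    "http_version": ["h2", "h3"],
    "kernel": ["6.1", "6.6"],
}

# Full factorial: every combination of levels becomes one configuration.
runs = [dict(zip(factors, combo)) for combo in itertools.product(*factors.values())]

# Repeat each configuration and randomize the order so that results
# are decoupled from time-of-day drift (temperature, background load).
plan = runs * 3
random.shuffle(plan)
for i, cfg in enumerate(plan, 1):
    print(f"run {i:02d}: {cfg}")
```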
Variable and noise control
CPU governor: `performance`; disable "power save."
Turbo/throttling: monitor frequencies, temperatures, and throttling (otherwise warm-up effects produce spurious wins); see the pre-flight sketch after this list.
NUMA/Hyper-Threading: pin IRQs and processes (`taskset`/`numactl`), measure memory locality.
C-states/IRQ balance: fix the settings; for network tests, pin IRQs to specific cores.
Background processes: clean node, turn off cron/backup/antivirus/updatedb.
Network: stable paths, fixed MTU/ECN/AQM, no channel flutter.
Data: identical datasets, cardinality, and distributions.
Cache: separate "cold" (first pass) and "warm" (repeat) modes, and label each explicitly.
Benchmark Classes
1) Micro benchmarks (function/algorithm)
Purpose: measure a specific piece of code or an algorithm.
Tools: built-in bench frameworks (Go `testing.B`, JMH, pytest-benchmark).
Rules: JIT warm-up; measure at the right resolution (nanoseconds, not milliseconds); isolate GC effects; fix random seeds. A pytest-benchmark sketch follows.
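A minimal pytest-benchmark sketch (the function under test is illustrative); the `benchmark` fixture handles warm-up, calibration, and repetition:

```python
# test_bench_serialize.py -- run with: pytest test_bench_serialize.py
# (requires: pip install pytest pytest-benchmark)
import json
import random

random.seed(42)  # fixed seed: identical payload on every run
PAYLOAD = [{"id": i, "score": random.random()} for i in range(1_000)]

def test_json_dumps(benchmark):
    # pytest-benchmark calibrates rounds/iterations and reports
    # min/mean/stddev, so single-shot timer noise is averaged out.
    result = benchmark(json.dumps, PAYLOAD)
    assert result  # sanity check: we benchmarked real work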
2) Meso benchmarks (component/service)
HTTP server, cache, broker, database on one node.
Tools: wrk/wrk2, k6 (open model), vegeta, ghz (gRPC), fio, sysbench, iperf3.
Rules: set connection/file-descriptor limits and pools; report CPU/IRQ/GC alongside results.
3) Macro benchmarks (e2e/request path)
Full path: CDN/edge → proxy → service → DB/cache → response.
Tools: k6/Locust/Gatling + RUM/OTel tracing; a realistic mix of routes.
Rules: stay close to reality ("dirty" data, lags of external systems); be careful with retries. A Locust sketch of a weighted route mix follows.
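A minimal Locust sketch of a weighted route mix (host, endpoints, and weights are illustrative):

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://api.example.com
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    wait_time = between(0.5, 2.0)  # think time between requests

    @task(8)  # weights approximate the production route mix
    def browse(self):
        self.client.get("/catalog")

    @task(1)
    def checkout(self):
        self.client.post("/orders", json={"item_id": 1, "qty": 1})
```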
Metrics by Layer
Test templates and commands
Network (TCP/UDP):
```bash
iperf3 -s                      # server
iperf3 -c <host> -P 8 -t 60    # client: 8 parallel streams, stable bandwidth
```
HTTP server (constant-rate load, wrk2):
```bash
wrk2 -t8 -c512 -d5m -R 20000 https://api.example.com/endpoint \
  --latency --timeout 2s
```
Open model (k6, arrival-rate):
```javascript
export const options = {
  scenarios: {
    open: {
      executor: 'constant-arrival-rate',
      rate: 1000,
      timeUnit: '1s',
      duration: '10m',
      preAllocatedVUs: 2000,
    },
  },
  thresholds: {
    http_req_failed: ['rate<0.003'],   // i.e. < 0.3% failed requests
    http_req_duration: ['p(95)<250'],  // p95 under 250 ms
  },
};
```
Disk (fio, 4k random read):
```bash
fio --name=randread --rw=randread --bs=4k --iodepth=64 --numjobs=4 \
  --size=4G --runtime=120 --group_reporting --filename=/data/testfile
```
Database (sysbench + PostgreSQL, sketch):
```bash
sysbench oltp_read_write --db-driver=pgsql --table-size=1000000 --threads=64 \
  --pgsql-host=... --pgsql-user=... --pgsql-password=... prepare
sysbench oltp_read_write --db-driver=pgsql --time=600 --threads=64 run
```
Memory/CPU (Linux perf + stress-ng):
```bash
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
  -- <your_binary> --bench
```
Statistics and validity
Replicates: at least 10 runs; treat outliers robustly (median/MAD).
Confidence intervals: bootstrap 95% CI for p95/p99 and for means.
Effect size: relative change with its CI (e.g., −12%, 95% CI [−15%; −9%]).
Practical significance: is a 10% drop in p95 worth +30% CPU?
Graphs: violin/ECDF plots for distributions; saturation curves (RPS → latency). A bootstrap/Mann-Whitney sketch follows.
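A minimal sketch of a bootstrap 95% CI for p95 plus a Mann-Whitney comparison of two runs (the latency samples here are synthetic; requires numpy/scipy):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

def bootstrap_p95_ci(samples, n_boot=10_000, alpha=0.05):
    """95% CI for the p95 of `samples` via percentile bootstrap."""
    samples = np.asarray(samples)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(samples, size=samples.size, replace=True)
        stats[i] = np.percentile(resample, 95)
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Synthetic latency samples (ms) standing in for baseline/candidate runs.
baseline  = rng.lognormal(mean=4.5, sigma=0.4, size=2_000)
candidate = rng.lognormal(mean=4.4, sigma=0.4, size=2_000)

print("baseline  p95 CI:", bootstrap_p95_ci(baseline))
print("candidate p95 CI:", bootstrap_p95_ci(candidate))

# Non-parametric comparison of whole distributions (robust to heavy tails).
stat, p_value = mannwhitneyu(baseline, candidate, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.0f}, p = {p_value:.4f}")
```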
Bottleneck profiling and localization
CPU: `perf`, `async-profiler`, eBPF/pyroscope; flamegraph before and after.
Alloc/GC: runtime profiles (Go pprof/Java JFR).
I/O: `iostat`, `blktrace`, `fio --lat_percentiles=1`.
Network: `ss -s`, `ethtool -S`, `dropwatch`, `tc -s qdisc`.
Database: `EXPLAIN (ANALYZE, BUFFERS)`, pg_stat_statements, slowlog.
Cache: top keys, TTLs, eviction causes.
Reporting and Artifacts
What to record:
- Build git SHA, compiler/optimization flags.
- Kernel/network configs (sysctl); driver/NIC/firmware versions.
- Topology (vCPU/NUMA/HT), governor, temperature/frequencies.
- Data: size, cardinality, distributions.
- What to publish: p50/p95/p99 tables, error/sec, throughput, resources (CPU/RAM/IO), CI.
- Artifacts: run scripts, graphs, flamegraph, raw JSON/CSV results, environment protocol.
Fair benchmarking
Identical limits (connection pools, keepalive, TLS chain, OCSP stapling).
Agreed timeouts/retries and HTTP version (h2/h3).
Thermal balance: warm up to equilibrium (so turbo boost does not favor the first run).
Fair caches: Either both "cold" or both "warm."
Network symmetry: same routes/MTU/ECN/AQM.
Time budget: DNS/TLS/connect time is either measured explicitly or excluded on both sides.
Anti-patterns
One run → "output."
Mixing of modes (part cold, part warm) in one series.
A closed model instead of an open one for the Internet load → false "stability."
Unaccounted retrays → "RPS grows" at the cost of takes and cascading 5xx.
Comparison on different glands/cores/power circuits.
No profiling → blind optimization.
Playing with GC/heap without profile analysis → tail regression.
Practical recipes
Minimum bench pipeline steps:
1. Capture the environment (`env_capture.sh` script; a sketch follows this list).
2. Warm up (5-10 min); record frequencies/temperatures.
3. Run N short repetitions + 1 long run.
4. Capture profiles (CPU/alloc/IO) at peak.
5. Compute CIs/graphs; collect artifacts.
6. Decide: accept/reject the hypothesis; plan the next steps.
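A minimal Python sketch of what `env_capture.sh` might record (the script itself is not shown in this text; the interface name `eth0` is an assumption):

```python
#!/usr/bin/env python3
"""Capture the benchmark environment into a JSON artifact."""
import json
import subprocess
import time

def sh(cmd: str) -> str:
    """Run a shell command, returning stdout or an error marker."""
    try:
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=10).stdout.strip()
    except Exception as exc:
        return f"<error: {exc}>"

snapshot = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
    "git_sha": sh("git rev-parse HEAD"),
    "kernel": sh("uname -r"),
    "cpu_model": sh("grep -m1 'model name' /proc/cpuinfo"),
    "governor": sh("cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"),
    "sysctl_net": sh("sysctl net.core.somaxconn net.ipv4.tcp_congestion_control"),
    "nic_driver": sh("ethtool -i eth0"),  # assumption: interface is eth0
}

with open("env_snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)
print(json.dumps(snapshot, indent=2))
```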
Capacity curve:
- Step RPS up in ~10% increments → record p95/errors → find the "knee" (see the sketch below).
- Plot RPS → latency and RPS → CPU: this shows the limit and the cost of each additional percent.
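A minimal sketch of locating the knee from step-load results; the 1.5× threshold is an illustrative heuristic, not a standard:

```python
# Each tuple: (offered RPS, measured p95 latency in ms) from a step-load run.
results = [
    (1000, 42), (2000, 44), (3000, 47), (4000, 52),
    (5000, 61), (6000, 95), (7000, 240), (8000, 900),
]

baseline_p95 = results[0][1]  # p95 at the lowest, clearly unsaturated step
KNEE_FACTOR = 1.5             # heuristic threshold: tune to your SLO

# The knee: first step where p95 leaves the near-linear region.
knee = next((rps for rps, p95 in results
             if p95 > KNEE_FACTOR * baseline_p95), None)

if knee is None:
    print("no knee within the tested range; raise the load ceiling")
else:
    print(f"knee near {knee} RPS; plan capacity below this point with headroom")
```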
iGaming/fintech specific
Cost per millisecond: Rank improvements by $ effect (conversion/churn/PSP limits).
Peaks (matches/tournaments): spike + plateau benchmarks with TLS/CDN/cache warming up.
Payments/PSP: measure end-to-end against sandbox limits, with idempotency and degradation behavior; track Time-to-Wallet via proxy metrics.
Anti-fraud/bot filters: include the rule profile in the macro bench (false-positive rate, added latency).
Leaderboards/jackpots: test hot keys/ranking, locks, atomicity.
Benchmarking checklist
- Hypothesis/metrics/success criterion.
- Variable monitoring (power/NUMA/IRQ/network/cache).
- Run plan (replicas, duration, warm-up).
- Cold/warm separation.
- Profiling enabled (CPU/alloc/IO/DB).
- Statistics: CI, significance tests, graphs.
- Artifacts and repro scripts in the repository (IaC for the bench).
- Report with "improvement cost" and recommendations.
- Performance regression tracking.
Mini-report (template)
Goal: reduce API p95 by 15% without increasing CPU by more than 10%.
Method: A/B, k6 open model at 1k RPS, 10 × 3 runs, warm cache.
Result: p95 −12% [−15%; −9%], CPU +6%, 5xx unchanged.
Flamegraph: ↓ JSON serialization (−30% CPU); the bottleneck shifted to the database.
Decision: accept the optimization; next step is batching database requests.
Artifacts: graphs, profiles, configs, raw JSON.
Bottom line
Good benchmarking is rigorous methodology + fair comparisons + statistical validity + profiling + reproducibility. Formulate hypotheses, control the environment, read confidence intervals, publish artifacts, and make decisions based on the cost of each improvement. That way you get not just a pretty number for a presentation, but a real gain in the speed and predictability of the platform.