
Benchmarking and performance comparison

Brief Summary

Benchmarking is an experiment, not a "run wrk for 5 minutes." Main principles:

1. Formulate a hypothesis and success metrics.

2. Control variables (hardware, kernel, power management, background noise).

3. Collect enough data (replicates, confidence intervals).

4. Profile - without profiling you cannot answer the "why."

5. Make runs reproducible: scripts, pinned versions, archived artifacts.

Benchmark goals and business metrics

Throughput: RPS/QPS/CPS, writes/sec.
Latency: p50/p95/p99, tail latency distribution.
Efficiency: cost per 1k RPS, watts per transaction, $ per millisecond of improvement.
Stability: jitter, run-to-run and node-to-node variability.
Elasticity: how metrics scale with N× resources (Amdahl/Gustafson as reference points).

Methodology: experimental design

Hypothesis: "Envoy with HTTP/3 will reduce p95 TTFB by 10-15% with the same RPS."

Unit of comparison: build version / config / hardware instance.
A/B design: parallel runs on an identical environment, or ABAB/Latin-square ordering to reduce the impact of drift (see the sketch after this list).
Number of repetitions: ≥ 10 short + 3 long runs per configuration for stable estimates.
Statistics: median, MAD, bootstrap confidence intervals; non-parametric tests (Mann-Whitney) for heavy-tailed distributions.
DoE (minimum): change one variable at a time (OVAT) or a factorial design for 2-3 factors (for example, TLS profile × HTTP version × kernel).
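
As a rough illustration of the interleaved design, the sketch below (a hypothetical `run.sh` wrapper writing into a `results/` directory) spreads thermal and clock drift evenly across both configurations instead of running all of A and then all of B:

```bash
# ABAB-style interleaving: one short run per configuration per repetition
mkdir -p results
for rep in $(seq 1 10); do
  for cfg in A B; do
    ./run.sh "$cfg" > "results/${cfg}_rep${rep}.json"   # hypothetical wrapper around the load tool
    sleep 60                                            # cool-down so runs do not warm each other up
  done
done
```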

Variable and noise control

CPU governor: `performance`; disable "power save."

Turbo/throttling: monitor frequencies, temperatures and throttling (otherwise warm-up will produce false gains).
NUMA/Hyper-Threading: pin IRQs and processes (`taskset`/`numactl`), measure memory locality (see the pinning sketch after this list).
C-states/IRQ balance: fix the settings; for network tests, pin IRQs to specific cores.
Background processes: a clean node; turn off cron/backup/antivirus/updatedb.
Network: stable paths, fixed MTU/ECN/AQM, no link flapping.
Data: identical datasets, cardinality and distributions.
Cache: separate "cold" (first pass) and "warm" (repeat) modes and label them explicitly.
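
A minimal pinning sketch, assuming an Intel host with `cpupower`, `numactl` and `taskset` available (core numbers, the sysfs path and the `./server` binary are illustrative):

```bash
sudo cpupower frequency-set -g performance                        # performance governor
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo   # disable turbo (Intel pstate; path varies by platform)
sudo systemctl stop irqbalance                                    # stop dynamic IRQ rebalancing for the run
numactl --cpunodebind=0 --membind=0 taskset -c 2-7 ./server &     # pin to NUMA node 0, cores 2-7
```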

Benchmark Classes

1) Micro benchmarks (function/algorithm)

Purpose: measure a specific piece of code or algorithm.
Tools: built-in bench frameworks (Go `testing.B`, JMH, pytest-benchmark).
Rules: JIT warm-up, nanosecond-scale timing rather than milliseconds, GC isolation, fixed seed (see the example below).
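
For example, with Go's built-in framework, replicates and a statistical comparison can be driven from the shell (the package path and benchmark name are placeholders; `benchstat` comes from golang.org/x/perf):

```bash
# 10 replicates per benchmark, with allocation stats
go test -bench=BenchmarkEncode -benchmem -count=10 ./internal/codec | tee new.txt
# compare against a baseline captured the same way on the previous build
benchstat old.txt new.txt
```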

2) Meso benchmarks (component/service)

HTTP server, cache, broker, database on one node.
Tools: wrk/wrk2, k6 (open model), vegeta, ghz (gRPC), fio, sysbench, iperf3.
Rules: set connection/file-descriptor limits and pools; report CPU/IRQ/GC alongside the results.

3) Macro benchmarks (e2e/request path)

Full path: CDN/edge → proxy → service → DB/cache → response.
Tools: k6/Locust/Gatling + RUM/OTel tracing; a realistic mix of routes.
Rules: stay close to reality ("dirty" data, latency of external systems); be careful with retries.

Metrics by Layer

Layer → key metrics:
  • Client/edge: DNS p95, TLS handshake p95, TTFB, HTTP/2/3 share.
  • Network: RTT/loss/jitter, ECN CE marks, goodput, PPS/CPS.
  • TLS/proxy: handshakes/s, resumption rate, cipher mix.
  • Application: p50/p95/p99, 5xx/429 rates, GC pauses, threads, queues.
  • Cache: hit ratio per layer, evictions, hot keys.
  • DB: QPS, query p95, locks, buffer/cache hit ratio, WAL/fsync.
  • Disk: IOPS, latency at 4k/64k block sizes, read/write mix, fsync cost.
  • GPU/ML: throughput (samples/s), latency, memory bandwidth, CUDA/ROCm utilization.

Test templates and commands

Network (TCP/UDP):

```bash
iperf3 -s                      # server side
iperf3 -c <host> -P 8 -t 60    # client: 8 parallel streams, 60 s, stable bandwidth
```

HTTP server (stable load, wrk2):

```bash
wrk2 -t8 -c512 -d5m -R 20000 https://api.example.com/endpoint \
  --latency --timeout 2s
```

Open model (k6, arrival rate):

```javascript
import http from 'k6/http';

export const options = {
  scenarios: {
    open: {
      executor: 'constant-arrival-rate',
      rate: 1000,
      timeUnit: '1s',
      duration: '10m',
      preAllocatedVUs: 2000,
    },
  },
  thresholds: {
    http_req_failed: ['rate<0.003'],   // < 0.3% errors
    http_req_duration: ['p(95)<250'],  // p95 under 250 ms
  },
};

export default function () {
  http.get('https://api.example.com/endpoint'); // endpoint reused from the wrk2 example above
}
```

Disk (fio, 4k random read):

```bash
fio --name=randread --rw=randread --bs=4k --iodepth=64 --numjobs=4 \
  --size=4G --runtime=120 --group_reporting --filename=/data/testfile
```

Database (sysbench + PostgreSQL, sample):

```bash
sysbench oltp_read_write --db-driver=pgsql --table-size=1000000 --threads=64 \
  --pgsql-host=... --pgsql-user=... --pgsql-password=... prepare
sysbench oltp_read_write --db-driver=pgsql --time=600 --threads=64 \
  --pgsql-host=... --pgsql-user=... --pgsql-password=... run
```

Memory/CPU (Linux perf + stress-ng):

```bash
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
  -- <your_binary> --bench
```
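
gRPC (ghz): ghz is listed among the meso-benchmark tools above; a minimal invocation might look like this (proto file, fully qualified method, target host and rates are placeholders):

```bash
ghz --insecure --proto ./api.proto --call mypkg.MyService.MyMethod \
  -c 64 --rps 2000 -z 5m grpc.internal:50051
```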

Statistics and validity

Replicates: at least 10 runs; handle outliers robustly (median/MAD).
Confidence intervals: bootstrap 95% CI for p95/p99 and for means (see the sketch below).
Effect size: the relative change and its CI (e.g. −12% [−15%; −9%]).
Practical significance: is a 10% drop in p95 worth +30% CPU?
Graphs: violin/ECDF plots for distributions, saturation curves (RPS→latency).
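
A rough bootstrap sketch for a p95 confidence interval, assuming raw per-request latencies in a file named `latencies.txt`, one value per line, and 1000 resamples:

```bash
N=$(wc -l < latencies.txt)
for i in $(seq 1 1000); do
  # resample with replacement, then take the empirical p95 of the resample
  shuf -r -n "$N" latencies.txt | sort -n | awk '{a[NR]=$1} END {print a[int(NR*0.95)]}'
done | sort -n | awk 'NR==25 {lo=$1} NR==975 {hi=$1} END {printf "p95 bootstrap 95%% CI: [%s; %s]\n", lo, hi}'
```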

Bottleneck profiling and localization

CPU: `perf`, `async-profiler`, eBPF/Pyroscope; flamegraph before and after (recipe below).
Alloc/GC: runtime profiles (Go pprof/Java JFR).
I/O: `iostat`, `blktrace`, `fio --lat_percentiles=1`.
Network: `ss -s`, `ethtool -S`, `dropwatch`, `tc -s qdisc`.
DB: `EXPLAIN (ANALYZE, BUFFERS)`, pg_stat_statements, slowlog.
Cache: top keys, TTL, eviction causes.
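
A before/after flamegraph recipe, assuming Brendan Gregg's FlameGraph scripts are cloned into `./FlameGraph` and `$PID` is the target service (sampling frequency and window are illustrative):

```bash
perf record -F 99 -g -p "$PID" -- sleep 30                                    # sample stacks for 30 s
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > before.svg
# repeat after the change and compare before.svg / after.svg side by side
```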

Reporting and Artifacts

What to record:
  • git SHA of the build, compilation/optimization flags.
  • Kernel/network configs (sysctl), driver/NIC/firmware versions.
  • Topology (vCPU/NUMA/HT), governor, temperatures/frequencies.
  • Data: size, cardinality, distributions.
What to publish: p50/p95/p99 tables, errors/sec, throughput, resources (CPU/RAM/IO), confidence intervals.
Artifacts: run scripts, graphs, flamegraphs, raw JSON/CSV results, environment protocol.

Fair benchmarking

Identical limits (connection pool, keepalive, TLS chain, OCSP stapling).
Agreed timeouts/retries and HTTP version (h2/h3).
Thermal balance: warm up to equilibrium (so transient turbo boost does not skew results).

Fair caches: Either both "cold" or both "warm."

Network symmetry: same routes/MTU/ECN/AQM.
Time budget: count DNS/TLS/connect explicitly or exclude it equally on both sides.

Anti-patterns

One run → a "conclusion."

Mixing of modes (part cold, part warm) in one series.

A closed load model instead of an open one for Internet traffic → false "stability."

Unaccounted retries → "RPS grows" at the cost of duplicate requests and cascading 5xx.
Comparisons across different hardware/kernels/power profiles.
No profiling → blind optimization.
Tuning GC/heap without profile analysis → tail-latency regressions.

Practical recipes

Minimum Bench Pipeline Steps:

1. Capture the environment (script `env_capture.sh`; see the sketch after this list).

2. Warm up (5-10 min), record frequencies/temperatures.

3. Run N short repetitions + 1 long run.

4. Capture profiles (CPU/alloc/IO) at peak load.

5. Calculate CI/graphs, collect artifacts.

6. Decision: accept or reject the hypothesis; define next steps.
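
A minimal sketch of what `env_capture.sh` from step 1 might record (the field set and output file name are illustrative; extend it with whatever your report requires):

```bash
#!/usr/bin/env bash
# Capture an environment protocol alongside the raw results
{
  echo "git_sha=$(git rev-parse HEAD 2>/dev/null)"
  echo "kernel=$(uname -r)"
  echo "governor=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null)"
  echo "cpu_model=$(grep -m1 'model name' /proc/cpuinfo | cut -d: -f2- | xargs)"
  numactl --hardware 2>/dev/null | head -n1
  sysctl net.ipv4.tcp_congestion_control net.core.somaxconn 2>/dev/null
} > "env_$(date +%Y%m%dT%H%M%S).txt"
```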

Capacity curve:
  • Step RPS in ~10% increments → record p95/errors → find the "knee."
  • Plot RPS→latency and RPS→CPU to see the capacity boundary and the cost of each further percent (see the stepping sketch below).
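
A stepping sketch with wrk2, assuming a rough target capacity of about 20k RPS so that each step is ~10% of it (rates, durations and the endpoint are illustrative):

```bash
STEP=2000                                  # ~10% of the estimated capacity
for i in $(seq 1 10); do
  RATE=$((STEP * i))
  wrk2 -t8 -c512 -d2m -R "$RATE" --latency \
    https://api.example.com/endpoint > "capacity_R${RATE}.txt"   # keep raw reports for plotting
done
# plot p95 and CPU against RATE from the saved reports to locate the "knee"
```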

iGaming/fintech specific

Cost per millisecond: rank improvements by their $ effect (conversion/churn/PSP limits).
Peaks (matches/tournaments): spike + plateau benchmarks with TLS/CDN/cache warm-up.
Payments/PSP: measure end-to-end with sandbox limits, idempotency and reactions to degradation; track Time-to-Wallet via proxy metrics.
Anti-fraud/bot filters: include the rule profile in the macro benchmark (false-positive rate, added latency).
Leaderboards/jackpots: test hot keys/ranking, locks, atomicity.

Benchmarking checklist

  • Hypothesis/metrics/success criterion.
  • Variable control (power/NUMA/IRQ/network/cache).
  • Run plan (replicates, duration, warm-up).
  • Cold/warm separation.
  • Profiling enabled (CPU/alloc/IO/DB).
  • Statistics: CI, significance tests, graphs.
  • Artifacts and repro scripts in the repository (IaC for the bench).
  • Report with "improvement cost" and recommendations.
  • Perf regression tracking for future changes.

Mini-report (template)

Goal: reduce API p95 by 15% without CPU growth above 10%.
Method: A/B, k6 open model at 1k RPS, 10 × 3 runs, warm cache.
Result: p95 −12% [−15%; −9%], CPU +6%, 5xx unchanged.
Flamegraph: JSON serialization down (−30% CPU); the bottleneck shifted to the database.
Decision: accept the optimization; next step is batching database requests.
Artifacts: graphs, profiles, configs, raw JSON.

Conclusion

Good benchmarking is rigorous methodology + fair comparison + statistical validity + profiling + reproducibility. Formulate hypotheses, control the environment, read confidence intervals, publish artifacts, and make decisions based on the cost of each improvement. That way you get not just a pretty number for a slide, but a real gain in the speed and predictability of the platform.
