Performance Metrics
1) Why performance metrics
Performance is the ability of a system to meet target SLOs for response time and throughput at a given cost. Without metrics it is impossible to:
- detect degradation before it turns into an incident,
- predict capacity and budget,
- compare alternatives (cache vs DB, gRPC vs REST),
- manage post-release regressions.
Principles: a single dictionary of metrics, aggregation by percentiles (p50/p90/p95/p99), separate accounting for "hot" and "cold" paths, context (version, region, provider, device).
2) Taxonomy of metrics
2.1 Basic SRE frameworks
Four golden signals: Latency, Traffic, Errors, Saturation.
RED (for microservices): Rate, Errors, Duration.
USE (for hardware): Utilization, Saturation, Errors.
2.2 Levels
Infrastructure: CPU, RAM, disk, network, containers, nodes.
Platform/Services: API endpoints, queues, caches, databases, event buses.
Customer experience: Web Vitals, mobile SDKs, streaming, CDN.
Data platform: ETL/ELT, streams, data marts, BI latency.
Business-critical flows: authorization, KYC, deposits/payments, game rounds.
3) Catalog of key metrics and formulas
3.1 API and microservices
RPS (Requests per second).
Latency p50/p95/p99 (ms) - preferably "end-to-end" and "backend-only."
Error Rate (%) = (5xx + validated 4xx) / all requests; see the sketch below.
Saturation: Average worker queue length, in-flight requests.
Cold Start Rate (for FaaS).
Throttling/Dropped Requests.
SLO example: p95 latency ≤ 250 ms with RPS up to 2k in the EU-East region; errors ≤ 0.5%.
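A minimal sketch of the Error Rate formula above, assuming a hypothetical is_validated_error flag on http_logs that marks 4xx responses confirmed as service faults rather than client mistakes:
sql
-- (5xx + validated 4xx) / all requests, per endpoint, over the last 5 minutes
SELECT endpoint,
       100.0 * SUM(CASE WHEN status >= 500
                          OR (status BETWEEN 400 AND 499 AND is_validated_error)
                        THEN 1 ELSE 0 END) / COUNT(*) AS error_pct
FROM http_logs
WHERE ts >= now() - interval '5 minutes'
GROUP BY 1;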
3.2 Databases
QPS/Transactions/s, avg/median query time, p95 query time.
Lock Waits / Deadlocks, Row/Index Hit Ratio, Buffer Cache Miss%.
RepLag (replication), Checkpoint/Flush time, Autovacuum lag.
Hot Keys/Skew - top N keys by load.
The "requests per core" formula: QPS / vCPU_core_count → a signal for sharding (see the sketch below).
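A sketch of the requests-per-core signal, assuming hypothetical db_qps_1m (per-minute QPS samples) and db_nodes (instance sizing) tables:
sql
-- Average QPS per vCPU core over the last 15 minutes; sustained growth is the sharding signal
SELECT s.instance,
       AVG(s.qps) / MAX(n.vcpu_cores) AS qps_per_core
FROM db_qps_1m s
JOIN db_nodes n USING (instance)
WHERE s.ts >= now() - interval '15 minutes'
GROUP BY 1;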
3.3 Cache and CDN
Hit Ratio (%), Evictions/s, Latency p95, Item Size percentiles.
Origin Offload (%) for CDN, TTFB, Stale-while-revalidate hit %.
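A sketch for Hit Ratio and Origin Offload, assuming a hypothetical cdn_access_logs table with cache_status and bytes columns:
sql
-- Request-level hit ratio and byte-level origin offload over the last hour
SELECT
  100.0 * SUM(CASE WHEN cache_status IN ('HIT', 'STALE') THEN 1 ELSE 0 END)
    / COUNT(*) AS hit_ratio_pct,
  100.0 * SUM(CASE WHEN cache_status <> 'MISS' THEN bytes ELSE 0 END)
    / NULLIF(SUM(bytes), 0) AS origin_offload_pct
FROM cdn_access_logs
WHERE ts >= now() - interval '1 hour';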
3.4 Queues/Streams
Ingress/egress msg/s, Consumer Lag, Rebalance rate.
Processing Time p95, DLQ Rate.
3.5 Infrastructure/Containers
CPU Utilization %, CPU Throttle %, Run Queue length.
Memory RSS/Working Set, OOM kills, Page Faults.
Disk IOPS/Latency/Throughput, Network RTT/retransmits.
Node Saturation: pods pending, pressure (CPU/Memory/IO).
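A sketch for CPU Throttle %, assuming a hypothetical container_cpu_stats table that stores per-interval deltas of the cgroup nr_periods/nr_throttled counters:
sql
-- Share of CFS periods in which the container was throttled, per pod, last 15 minutes
SELECT pod,
       100.0 * SUM(throttled_periods) / NULLIF(SUM(total_periods), 0) AS cpu_throttle_pct
FROM container_cpu_stats
WHERE ts >= now() - interval '15 minutes'
GROUP BY 1
ORDER BY 2 DESC;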
3.6 Web Client (UX)
Core Web Vitals: LCP, INP, CLS.
TTFB, FCP, TTI, Resource Timing (DNS, TLS, TTFB, download).
Error Rate (JS), Long Tasks, SPA route change time.
CDN Geo-Latency (percentile).
3.7 Mobile client
App Start time (cold/warm), ANR rate, Crash-free sessions %.
Network round-trips/session, Payload size, Battery drain/session.
Offline success rate.
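A sketch for Crash-free sessions %, assuming a hypothetical mobile_sessions table with one row per session and a had_crash flag:
sql
-- Share of today's sessions that ended without a crash
SELECT 100.0 * SUM(CASE WHEN had_crash THEN 0 ELSE 1 END) / COUNT(*) AS crash_free_sessions_pct
FROM mobile_sessions
WHERE started_at >= current_date;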
3.8 Data platform and reporting
Freshness Lag (now − data mart timestamp), Throughput rows/s, Job Success %.
Cost per TB processed, Skew by partition, Late events %.
BI Time-to-Render p95 for key dashboards.
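A sketch for Freshness Lag, assuming a hypothetical mart_load_log table that records loaded_at per data mart:
sql
-- How far the latest loaded batch of each data mart is behind the current time
SELECT mart_name,
       now() - MAX(loaded_at) AS freshness_lag
FROM mart_load_log
GROUP BY 1
ORDER BY 2 DESC;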
3.9 Domain-critical flow (iGaming as an example)
Auth p95, KYC TTV (Time-to-Verify), Deposit/Withdrawal p95.
Game Round Duration p95, RNG call latency, Provider RTT p95.
Payment PSP success rate, Chargeback investigation SLA.
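A sketch for PSP success rate and payment p95, assuming a hypothetical payment_attempts table with a status and duration_ms per attempt:
sql
-- Success rate and p95 duration per PSP over the last hour
SELECT psp_id,
       100.0 * SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*) AS success_pct,
       approx_percentile(duration_ms, 0.95) AS p95_ms
FROM payment_attempts
WHERE ts >= now() - interval '1 hour'
GROUP BY 1;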
4) Normalization, percentiles and attribution
Percentiles over averages: record p50/p90/p95/p99; averages smooth out the pain of the peaks.
Breakdowns: application version, region, provider, network type (4G/Wi-Fi), device.
Correlation: link "backend-only" and "real-user" metrics to build causal chains.
Exemplars/Traces: tie extreme percentiles to concrete traces.
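A sketch of percentile attribution, assuming the http_metrics table from section 13 also carries app_version and region labels:
sql
-- p95 latency broken down by version and region to localize a regression
SELECT app_version, region,
       approx_percentile(latency_ms, 0.95) AS p95_ms,
       COUNT(*) AS requests
FROM http_metrics
WHERE ts >= now() - interval '1 hour'
GROUP BY 1, 2
ORDER BY 3 DESC;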
5) Thresholds and alerts (approximate reference values)
Latency p95 (core API): warning > 250 ms, critical > 400 ms for 5 min in a row.
Error rate: warning > 0.5%, critical > 2% (per endpoint, not global).
DB RepLag: warning > 2 s, critical > 10 s.
Kafka consumer lag (time): warning > 30 s, critical > 2 min.
Web LCP (p75): warning > 2.5 s, critical > 4 s.
Mobile ANR: warning > 0.5%, critical > 1%.
ETL Freshness: warning > +15 min, critical > +60 min beyond the SLA.
Use static plus adaptive thresholds (seasonality, daily patterns), with deduplication and grouping of alerts by service/release.
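A sketch of the "5 min in a row" condition for the critical latency threshold, reusing the http_metrics table from section 13; the persistence check keeps a single bad minute from paging anyone:
sql
-- Endpoints whose p95 stayed above 400 ms in every one-minute bucket of the last 5 minutes
SELECT endpoint
FROM (
  SELECT endpoint,
         date_trunc('minute', ts) AS minute_bucket,
         approx_percentile(latency_ms, 0.95) AS p95_ms
  FROM http_metrics
  WHERE ts >= now() - interval '5 minutes'
  GROUP BY 1, 2
) per_minute
GROUP BY 1
HAVING COUNT(*) >= 5 AND MIN(p95_ms) > 400;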
6) Performance testing
Types: baseline, stress, soak (prolonged), chaos (degrading links/PSPs).
Load profiles: based on real transaction distributions, bursts, regional peaks.
Objectives: meeting SLOs at target RPS and operation mix, validating backpressure.
Run metrics: Throughput, Error %, p95 latency, GC pauses, CPU throttle, queue lag, cost per run.
Regression rule: a release passes if p95 does not degrade by more than 10% under an identical profile and per-request cost (CPU-ms/request) does not grow by more than 15% (see the gate sketch below).
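A sketch of the regression gate, assuming a hypothetical load_test_results table keyed by run_id ('baseline'/'candidate'); an empty result means the release passes:
sql
-- Endpoints violating the 10% p95 / 15% CPU-ms-per-request regression budget
SELECT c.endpoint,
       c.p95_ms, b.p95_ms AS baseline_p95_ms,
       c.cpu_ms_per_req, b.cpu_ms_per_req AS baseline_cpu_ms_per_req
FROM load_test_results c
JOIN load_test_results b
  ON b.endpoint = c.endpoint AND b.run_id = 'baseline'
WHERE c.run_id = 'candidate'
  AND (c.p95_ms > 1.10 * b.p95_ms
       OR c.cpu_ms_per_req > 1.15 * b.cpu_ms_per_req);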
7) Capacity planning and price/performance
Demand model: RPS by hour × average work per request (CPU-ms, IO ops); see the sketch below.
Headroom: 30-50% margin on critical paths, autoscaling driven by p95.
Cost KPIs: cost per 1k requests, cost per GB served, $ per 1 p.p. of LCP improvement.
Caching/denormalization: compute "cache ROI" = (CPU-ms saved − cache cost).
Warm and cold regions: offload to CDN/edge, read-only replicas.
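A sketch of the demand model, assuming a hypothetical hourly_demand_forecast table with rps and avg_cpu_ms_per_req per hour; the 1.4 factor encodes a 40% headroom within the 30-50% range above:
sql
-- Cores needed per forecast hour: RPS × CPU-ms/request is CPU-ms per second, /1000 gives cores
SELECT hour,
       rps * avg_cpu_ms_per_req / 1000.0 AS cores_needed,
       rps * avg_cpu_ms_per_req / 1000.0 * 1.4 AS cores_with_headroom
FROM hourly_demand_forecast
ORDER BY 3 DESC
LIMIT 24;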
8) Observability and profiling practices
Traces: distributed trace IDs across all hops; smart (tail-based) sampling.
Metrics: Prometheus/OpenTelemetry, a single naming convention for names and labels.
Logs: correlated with trace/span IDs, a budget for log noise, PII redaction.
Profilers: CPU/Heap/Alloc/Lock profiles, continuous profiling (eBPF).
Exemplars: tie p99 spikes to a specific span/SQL/PSP call.
9) Release and team metrics (for completeness)
DORA: Deployment Frequency, Lead Time, Change Failure Rate, MTTR.
SPACE: satisfaction, performance, activity, communication, efficiency.
These metrics are not about hardware, but they directly affect performance stability.
10) Anti-patterns
Chasing averages: ignoring p95/p99.
"Global" error rate: hides painful endpoints.
No attribution by version: client regressions cannot be caught.
Alert spam: thresholds without hysteresis and seasonality correction.
Blind optimization: no profiling or traces to guide it.
Mixing UX and backend latency: wrong conclusions about the customer experience.
11) Checklists
Unified metric standard
- Dictionary of metrics with formulas, units, owners
- Mandatory percentiles p50/p90/p95/p99
- Trace correlation and log correlation
- Tags: region, version, provider, device, network channel
- Thresholds with hysteresis and deduplication
Before Release
- Baseline p95/p99 on staging and in prod
- Canary traffic + A/B metric comparison
- Feature flag for fast rollback
- Observation runbook
Regularly
- Review of the top N slowest queries/SQL
- Audit of cache policies and TTLs
- Check of data freshness and database replication
- External provider degradation tests (PSP, KYC)
12) Mini playbooks (examples)
p95 degradation on /api/payments
1. Check error % and external PSP timeouts.
2. Check consumer lag on the callback queue.
3. Look at p99 exemplar traces: is the bottleneck SQL or an HTTP call?
4. Enable caching of lookup/limit data, eliminate N+1 queries.
5. Budget: temporarily raise worker resources by 20%, enable autoscaling.
6. Post-fix: an index on (psp_id, status, created_at), retries with jitter (index sketch below).
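A sketch of the post-fix index from step 6 (Postgres-style syntax; the payments table name is assumed):
sql
-- Build the index without blocking writes on the payments table
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_payments_psp_status_created
    ON payments (psp_id, status, created_at);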
Replication lag (RepLag) growth in the DB
1. Check heavy queries and long-running transactions.
2. Increase replication concurrency, tune checkpoints.
3. Offload reads to a cache/read-only replica.
4. In peak windows: partial denormalization + batching.
13) Examples of formulas/SQL (simplified)
Error Rate by Endpoint
sql
SELECT endpoint,
       100.0 * SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) / COUNT(*) AS error_pct
FROM http_logs
WHERE ts >= now() - interval '5 minutes'
GROUP BY 1
HAVING COUNT(*) > 500;
Latency p95 (TDigest/Approx)
sql
SELECT endpoint, approx_percentile(latency_ms, 0.95) AS p95_ms
FROM http_metrics
WHERE ts >= date_trunc('hour', now())
GROUP BY 1;
Consumer Lag (time)
sql
SELECT topic, consumer_group,
max(produced_ts) - max(consumed_ts) AS lag_interval
FROM stream_offsets
GROUP BY 1,2;
Web LCP p75
sql
SELECT approx_percentile(lcp_ms, 0.75) AS lcp_p75
FROM web_vitals
WHERE country = 'UA' AND device IN ('mobile','tablet')
AND ts >= current_date;
14) Embedding in dashboards and reporting
KPI cards: p95 latency, error%, RPS, saturation with WoW/DoD trends.
Top N "worst" endpoints/SQL/resources, clickable drill-down → trace.
Client version correlation: a "version → p95 LCP/INP → conversion" breakdown.
World Map: geo-latency (CDN), PSP latency by region.
SLO panel: share of time within SLO, SLO breaches, error budget.
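A sketch for the SLO panel, reusing the http_metrics table from section 13: the share of one-minute windows meeting the p95 ≤ 250 ms target over 30 days, and the remaining error budget for a 99.9% objective, in percentage points:
sql
WITH per_minute AS (
  SELECT date_trunc('minute', ts) AS minute_bucket,
         approx_percentile(latency_ms, 0.95) AS p95_ms
  FROM http_metrics
  WHERE ts >= now() - interval '30 days'
  GROUP BY 1
)
-- a 99.9% objective allows 0.1 p.p. of bad minutes; subtract what has already been spent
SELECT 100.0 * SUM(CASE WHEN p95_ms <= 250 THEN 1 ELSE 0 END) / COUNT(*) AS pct_within_slo,
       0.1 - 100.0 * SUM(CASE WHEN p95_ms > 250 THEN 1 ELSE 0 END) / COUNT(*) AS error_budget_left_pp
FROM per_minute;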
15) Summary
Performance metrics are a systems discipline: a single vocabulary, percentiles, attribution, good observability, and strict SLOs. By combining technical signals (latency, lags, cache hits) with product signals (KYC time, deposit p95, LCP), you manage both the quality of the experience and the cost of delivering it, predictably and at scale.