Performance Metrics
1) Why performance metrics
Performance is the ability of a system to meet target SLOs for response time and throughput at a given cost. Without metrics it is impossible to:
- detect degradation before it turns into an incident,
- predict capacity and budget,
- compare alternatives (cache vs DB, gRPC vs REST),
- manage post-release regressions.
Principles: a single dictionary of metrics, aggregation by percentiles (p50/p90/p95/p99), separate accounting for "hot" and "cold" paths, context (version, region, provider, device).
2) Taxonomy of metrics
2.1 Basic SRE frameworks
Four golden signals: Latency, Traffic, Errors, Saturation.
RED (for microservices): Rate, Errors, Duration.
USE (for hardware): Utilization, Saturation, Errors.
2.2 Levels
Infrastructure: CPU, RAM, disk, network, containers, nodes.
Platform/Services: API endpoints, queues, caches, databases, event buses.
Customer experience: Web Vitals, mobile SDKs, streaming, CDN.
Data platform: ETL/ELT, streams, data marts, BI latency.
Business-critical flows: authorization, KYC, deposits/payments, game rounds.
3) Catalog of key metrics and formulas
3.1 API and microservices
RPS (Requests per second).
Latency p50/p95/p99 (ms) - preferably "end-to-end" and "backend-only."
Error Rate (%) = (5xx + validated 4xx) / all requests; see the sketch below.
Saturation: Average worker queue length, in-flight requests.
Cold Start Rate (for FaaS).
Throttling/Dropped Requests.
SLO example: p95 latency ≤ 250 ms with RPS up to 2k in the EU-East region; errors ≤ 0.5%.
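A minimal sketch of the Error Rate formula above, assuming a hypothetical is_validated_error flag on http_logs that marks 4xx responses confirmed as service faults rather than client mistakes:
sql
-- (5xx + validated 4xx) / all requests, per endpoint, over the last 5 minutes
SELECT endpoint,
       100.0 * SUM(CASE WHEN status >= 500
                          OR (status BETWEEN 400 AND 499 AND is_validated_error)
                        THEN 1 ELSE 0 END) / COUNT(*) AS error_pct
FROM http_logs
WHERE ts >= now() - interval '5 minutes'
GROUP BY 1;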
3.2 Databases
QPS/Transactions/s, avg/median query time, p95 query time.
Lock Waits / Deadlocks, Row/Index Hit Ratio, Buffer Cache Miss%.
RepLag (replication), Checkpoint/Flush time, Autovacuum lag.
Hot Keys/Skew - top N keys by load.
The "requests per core" formula: QPS / vCPU_core_count → a signal for sharding (see the sketch below).
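A sketch of the requests-per-core signal, assuming hypothetical db_qps_1m (per-minute QPS samples) and db_nodes (instance sizing) tables:
sql
-- Average QPS per vCPU core over the last 15 minutes; sustained growth is the sharding signal
SELECT s.instance,
       AVG(s.qps) / MAX(n.vcpu_cores) AS qps_per_core
FROM db_qps_1m s
JOIN db_nodes n USING (instance)
WHERE s.ts >= now() - interval '15 minutes'
GROUP BY 1;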
3.3 Cache and CDN
Hit Ratio (%), Evictions/s, Latency p95, Item Size percentiles.
Origin Offload (%) for CDN, TTFB, Stale-while-revalidate hit %.
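A sketch for Hit Ratio and Origin Offload, assuming a hypothetical cdn_access_logs table with cache_status and bytes columns:
sql
-- Request-level hit ratio and byte-level origin offload over the last hour
SELECT
  100.0 * SUM(CASE WHEN cache_status IN ('HIT', 'STALE') THEN 1 ELSE 0 END)
    / COUNT(*) AS hit_ratio_pct,
  100.0 * SUM(CASE WHEN cache_status <> 'MISS' THEN bytes ELSE 0 END)
    / NULLIF(SUM(bytes), 0) AS origin_offload_pct
FROM cdn_access_logs
WHERE ts >= now() - interval '1 hour';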
3.4 Queues/Streams
Ingress/egress msg/s, Consumer Lag, Rebalance rate.
Processing Time p95, DLQ Rate.
3.5 Infrastructure/Containers
CPU Utilization %, CPU Throttle %, Run Queue length.
Memory RSS/Working Set, OOM kills, Page Faults.
Disk IOPS/Latency/Throughput, Network RTT/retransmits.
Node Saturation: pods pending, pressure (CPU/Memory/IO).
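A sketch for CPU Throttle %, assuming a hypothetical container_cpu_stats table that stores per-interval deltas of the cgroup nr_periods/nr_throttled counters:
sql
-- Share of CFS periods in which the container was throttled, per pod, last 15 minutes
SELECT pod,
       100.0 * SUM(throttled_periods) / NULLIF(SUM(total_periods), 0) AS cpu_throttle_pct
FROM container_cpu_stats
WHERE ts >= now() - interval '15 minutes'
GROUP BY 1
ORDER BY 2 DESC;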
3.6 Web Client (UX)
Core Web Vitals: LCP, INP, CLS.
TTFB, FCP, TTI, Resource Timing (DNS, TLS, TTFB, download).
Error Rate (JS), Long Tasks, SPA route change time.
CDN Geo-Latency (percentile).
3.7 Mobile client
App Start time (cold/warm), ANR rate, Crash-free sessions %.
Network round-trips/session, Payload size, Battery drain/session.
Offline success rate.
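A sketch for Crash-free sessions %, assuming a hypothetical mobile_sessions table with one row per session and a had_crash flag:
sql
-- Share of today's sessions that ended without a crash
SELECT 100.0 * SUM(CASE WHEN had_crash THEN 0 ELSE 1 END) / COUNT(*) AS crash_free_sessions_pct
FROM mobile_sessions
WHERE started_at >= current_date;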
3.8 Data platform and reporting
Freshness Lag (now − data mart timestamp), Throughput rows/s, Job Success %.
Cost per TB processed, Skew by partition, Late events %.
BI Time-to-Render p95 for key dashboards.
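A sketch for Freshness Lag, assuming a hypothetical mart_load_log table that records loaded_at per data mart:
sql
-- How far the latest loaded batch of each data mart is behind the current time
SELECT mart_name,
       now() - MAX(loaded_at) AS freshness_lag
FROM mart_load_log
GROUP BY 1
ORDER BY 2 DESC;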
3.9 Domain-critical flow (iGaming as an example)
Auth p95, KYC TTV (Time-to-Verify), Deposit/Withdrawal p95.
Game Round Duration p95, RNG call latency, Provider RTT p95.
Payment PSP success rate, Chargeback investigation SLA.
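A sketch for PSP success rate and payment p95, assuming a hypothetical payment_attempts table with a status and duration_ms per attempt:
sql
-- Success rate and p95 duration per PSP over the last hour
SELECT psp_id,
       100.0 * SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*) AS success_pct,
       approx_percentile(duration_ms, 0.95) AS p95_ms
FROM payment_attempts
WHERE ts >= now() - interval '1 hour'
GROUP BY 1;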
4) Normalization, percentiles and attribution
Percentiles over averages: record p50/p90/p95/p99; averages smooth out the pain of the peaks.
Breakdowns: application version, region, provider, network type (4G/Wi-Fi), device.
Correlation: link "backend-only" and "real-user" metrics to build causal chains.
Exemplars/Traces: tie extreme percentiles to concrete traces.
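A sketch of percentile attribution, assuming the http_metrics table from section 13 also carries app_version and region labels:
sql
-- p95 latency broken down by version and region to localize a regression
SELECT app_version, region,
       approx_percentile(latency_ms, 0.95) AS p95_ms,
       COUNT(*) AS requests
FROM http_metrics
WHERE ts >= now() - interval '1 hour'
GROUP BY 1, 2
ORDER BY 3 DESC;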
5) Thresholds and alerts (approximate reference values)
Latency p95 (core API): warning > 250 ms, critical > 400 ms for 5 min in a row.
Error rate: warning > 0.5%, critical > 2% (per endpoint, not global).
DB RepLag: warning > 2 s, critical > 10 s.
Kafka consumer lag (time): warning > 30 s, critical > 2 min.
Web LCP (p75): warning > 2.5 s, critical > 4 s.
Mobile ANR: warning > 0.5%, critical > 1%.
ETL Freshness: warning > +15 min, critical > +60 min beyond the SLA.
Use static plus adaptive thresholds (seasonality, daily patterns), with deduplication and grouping of alerts by service/release.
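A sketch of the "5 min in a row" condition for the critical latency threshold, reusing the http_metrics table from section 13; the persistence check keeps a single bad minute from paging anyone:
sql
-- Endpoints whose p95 stayed above 400 ms in every one-minute bucket of the last 5 minutes
SELECT endpoint
FROM (
  SELECT endpoint,
         date_trunc('minute', ts) AS minute_bucket,
         approx_percentile(latency_ms, 0.95) AS p95_ms
  FROM http_metrics
  WHERE ts >= now() - interval '5 minutes'
  GROUP BY 1, 2
) per_minute
GROUP BY 1
HAVING COUNT(*) >= 5 AND MIN(p95_ms) > 400;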
6) Performance testing
Types: baseline, stress, soak (prolonged), chaos (degrading links/PSPs).
Load profiles: based on real transaction distributions, bursts, regional peaks.
Objectives: meeting SLOs at target RPS and operation mix, validating backpressure.
Run metrics: Throughput, Error %, p95 latency, GC pauses, CPU throttle, queue lag, cost per run.
Regression rule: a release passes if p95 does not degrade by more than 10% under an identical profile and per-request cost (CPU-ms/request) does not grow by more than 15% (see the gate sketch below).
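A sketch of the regression gate, assuming a hypothetical load_test_results table keyed by run_id ('baseline'/'candidate'); an empty result means the release passes:
sql
-- Endpoints violating the 10% p95 / 15% CPU-ms-per-request regression budget
SELECT c.endpoint,
       c.p95_ms, b.p95_ms AS baseline_p95_ms,
       c.cpu_ms_per_req, b.cpu_ms_per_req AS baseline_cpu_ms_per_req
FROM load_test_results c
JOIN load_test_results b
  ON b.endpoint = c.endpoint AND b.run_id = 'baseline'
WHERE c.run_id = 'candidate'
  AND (c.p95_ms > 1.10 * b.p95_ms
       OR c.cpu_ms_per_req > 1.15 * b.cpu_ms_per_req);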
7) Capacity planning and price/performance
Demand model: RPS by hour × average work per request (CPU-ms, IO ops); see the sketch below.
Headroom: 30-50% margin on critical paths, autoscaling driven by p95.
Cost KPIs: cost per 1k requests, cost per GB served, $ per 1 p.p. of LCP improvement.
Caching/denormalization: compute "cache ROI" = (CPU-ms saved − cache cost).
Warm and cold regions: offload to CDN/edge, read-only replicas.
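A sketch of the demand model, assuming a hypothetical hourly_demand_forecast table with rps and avg_cpu_ms_per_req per hour; the 1.4 factor encodes a 40% headroom within the 30-50% range above:
sql
-- Cores needed per forecast hour: RPS × CPU-ms/request is CPU-ms per second, /1000 gives cores
SELECT hour,
       rps * avg_cpu_ms_per_req / 1000.0 AS cores_needed,
       rps * avg_cpu_ms_per_req / 1000.0 * 1.4 AS cores_with_headroom
FROM hourly_demand_forecast
ORDER BY 3 DESC
LIMIT 24;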
8) Observability and profiling practices
Traces: distributed trace IDs across all hops; smart (tail-based) sampling.
Metrics: Prometheus/OpenTelemetry, a single naming convention for names and labels.
Logs: correlated with trace/span IDs, a budget for log noise, PII redaction.
Profilers: CPU/Heap/Alloc/Lock profiles, continuous profiling (eBPF).
Exemplars: tie p99 spikes to a specific span/SQL/PSP call.
9) Release and team metrics (for completeness)
DORA: Deployment Frequency, Lead Time, Change Failure Rate, MTTR.
SPACE: satisfaction, performance, activity, communication, efficiency.
These metrics are not about hardware, but they directly affect performance stability.
10) Anti-patterns
Chasing averages: ignoring p95/p99.
"Global" error rate: hides painful endpoints.
No attribution by version: client regressions cannot be caught.
Alert spam: thresholds without hysteresis and seasonality correction.
Blind optimization: no profiling or traces to guide it.
Mixing UX and backend latency: wrong conclusions about the customer experience.
11) Checklists
Unified metric standard
- Dictionary of metrics with formulas, units, owners
- Mandatory percentiles p50/p90/p95/p99
- Trace correlation and log correlation
- Tags: region, version, provider, device, network channel
- Thresholds with hysteresis and deduplication
Before Release
- Baseline p95/p99 on staging and in prod
- Canary traffic + A/B metric comparison
- Feature flag for fast rollback
- Observation runbook
Regularly
- Review of the top N slowest queries/SQL
- Audit of cache policies and TTLs
- Check of data freshness and database replication
- External provider degradation tests (PSP, KYC)
12) Mini playbooks (examples)
p95 degradation on /api/payments
1. Check error % and external PSP timeouts.
2. Check consumer lag on the callback queue.
3. Look at p99 exemplar traces: is the bottleneck SQL or an HTTP call?
4. Enable caching of lookup/limit data, eliminate N+1 queries.
5. Budget: temporarily raise worker resources by 20%, enable autoscaling.
6. Post-fix: an index on (psp_id, status, created_at), retries with jitter (index sketch below).
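A sketch of the post-fix index from step 6 (Postgres-style syntax; the payments table name is assumed):
sql
-- Build the index without blocking writes on the payments table
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_payments_psp_status_created
    ON payments (psp_id, status, created_at);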
Replication lag (RepLag) growth in the DB
1. Check heavy queries and long-running transactions.
2. Increase replication concurrency, tune checkpoints.
3. Offload reads to a cache/read-only replica.
4. In peak windows: partial denormalization + batching.
13) Examples of formulas/SQL (simplified)
Error Rate by Endpoint
sql
SELECT endpoint,
       100.0 * SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) / COUNT(*) AS error_pct
FROM http_logs
WHERE ts >= now() - interval '5 minutes'
GROUP BY 1
HAVING COUNT(*) > 500;
Latency p95 (TDigest/Approx)
sql
SELECT endpoint, approx_percentile(latency_ms, 0.95) AS p95_ms
FROM http_metrics
WHERE ts >= date_trunc('hour', now())
GROUP BY 1;
Consumer Lag (time)
sql
SELECT topic, consumer_group,
max(produced_ts) - max(consumed_ts) AS lag_interval
FROM stream_offsets
GROUP BY 1,2;
Web LCP p75
sql
SELECT approx_percentile(lcp_ms, 0.75) AS lcp_p75
FROM web_vitals
WHERE country = 'UA' AND device IN ('mobile','tablet')
AND ts >= current_date;
14) Embedding in dashboards and reporting
KPI cards: p95 latency, error%, RPS, saturation with WoW/DoD trends.
Top N "worst" endpoints/SQL/resources, clickable drill-down → trace.
Client version correlation: a "version → p95 LCP/INP → conversion" breakdown.
World Map: geo-latency (CDN), PSP latency by region.
SLO panel: share of time within SLO, SLO breaches, error budget.
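A sketch for the SLO panel, reusing the http_metrics table from section 13: the share of one-minute windows meeting the p95 ≤ 250 ms target over 30 days, and the remaining error budget for a 99.9% objective, in percentage points:
sql
WITH per_minute AS (
  SELECT date_trunc('minute', ts) AS minute_bucket,
         approx_percentile(latency_ms, 0.95) AS p95_ms
  FROM http_metrics
  WHERE ts >= now() - interval '30 days'
  GROUP BY 1
)
-- a 99.9% objective allows 0.1 p.p. of bad minutes; subtract what has already been spent
SELECT 100.0 * SUM(CASE WHEN p95_ms <= 250 THEN 1 ELSE 0 END) / COUNT(*) AS pct_within_slo,
       0.1 - 100.0 * SUM(CASE WHEN p95_ms > 250 THEN 1 ELSE 0 END) / COUNT(*) AS error_budget_left_pp
FROM per_minute;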
15) Summary
Performance metrics are a systems discipline: a single vocabulary, percentiles, attribution, good observability, and strict SLOs. By combining technical signals (latency, lags, cache hits) with product signals (KYC time, deposit p95, LCP), you manage both the quality of the experience and the cost of delivering it, predictably and at scale.