Analytics API and performance metrics

1) Why it is needed

The API is the platform's "circulatory system." Without strict metrics, we cannot:

prove compliance with SLOs and SLAs,
manage throughput and the economics of requests,
quickly localize degradation (p99 tails, 5xx bursts),
prioritize optimizations by business impact.

Objective: a single observability model where each request is tracked from perimeter to DB with common identifiers and consistent SLIs.

2) Taxonomy of metrics

Technical: RPS, latency (p50/p95/p99), error rate (4xx/5xx), saturation (CPU, memory, file descriptors), queue time.
Product: successful operations/min, step conversion (200/total), rate-limited share (429), retries, user segments.
Cost: cost/request (CPU-ms + egress + database requests), cost of feature/endpoint, $/tenant, $/1k calls.

3) "Golden Signals": RED and USE

RED (for API):

Rate - requests/sec (by endpoint/tenant/plan).
Errors - 4xx/5xx/429 fractions and absolutes.
Duration - p50/p95/p99 end-to-end and by stages (ingress → app → DB → third-party).

USE (for resources):

Utilization - CPU/IO/network link load.
Saturation - queues (run-queue, backlog, DB wait).
Errors - driver errors/timeouts.

4) Basic definitions and formulas

Availability SLI: `1 − (5xx + gateway_timeout) / all_requests`.
Success SLI: `2xx / (all − 429_shadow)` (excluding shadow blocks).
Apdex: `(|t ≤ T| + 0.5·|T < t ≤ 4T|) / all`, where `T` is the target "good" threshold and `t` is the request latency.
Tail amplification: `p99_total − max(p99_stage_i)` - the contribution of queues/composition.
Error budget (month) at 99.9%: `budget = 0.1% × period_time`.
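
The definitions above translate almost directly into code. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, the Apdex function uses the standard definition (satisfied at or below T, tolerating between T and 4T), and the handling of empty inputs is an assumption.

```python
def availability_sli(total, errors_5xx, gateway_timeouts):
    """Availability SLI: 1 - (5xx + gateway_timeout) / all_requests."""
    if total == 0:
        return 1.0  # assumption: an idle interval counts as fully available
    return 1.0 - (errors_5xx + gateway_timeouts) / total


def apdex(latencies_ms, target_ms):
    """Apdex: (satisfied + 0.5 * tolerating) / all, with thresholds T and 4T."""
    if not latencies_ms:
        return 1.0  # assumption: no samples means nothing was unsatisfactory
    satisfied = sum(1 for t in latencies_ms if t <= target_ms)
    tolerating = sum(1 for t in latencies_ms if target_ms < t <= 4 * target_ms)
    return (satisfied + 0.5 * tolerating) / len(latencies_ms)


def error_budget_minutes(slo, period_days=30):
    """Allowed 'bad' minutes in the period: (1 - SLO) * period length."""
    return (1.0 - slo) * period_days * 24 * 60


if __name__ == "__main__":
    print(availability_sli(10_000, 7, 3))         # 10 bad requests out of 10,000
    print(apdex([50, 100, 200, 400, 900], 100))   # 2 satisfied, 2 tolerating, 1 frustrated
    print(round(error_budget_minutes(0.999), 1))  # budget in minutes for a 30-day month
```

For a 99.9% target over 30 days the budget works out to 0.001 × 43,200 min = 43.2 minutes.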

Recommended bucket boundaries for latency histograms: `[5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]`.
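
To illustrate why the bucket boundaries matter, here is a sketch of how a percentile can be estimated from bucket counts by linear interpolation inside the bucket that contains the target rank. Histogram-based backends use a similar idea, but this code is only an illustration: `observe` and `estimate_quantile` are hypothetical names, and the boundaries are the ones listed above, expressed in seconds.

```python
import math

# Upper bounds in seconds; the final +Inf bucket catches everything slower than 5 s.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, math.inf]


def observe(counts, value_s):
    """Increment the first bucket whose upper bound covers the value."""
    for i, bound in enumerate(BOUNDS):
        if value_s <= bound:
            counts[i] += 1
            return


def estimate_quantile(counts, q):
    """Estimate the q-quantile (0 < q <= 1) from per-bucket counts."""
    total = sum(counts)
    if total == 0:
        return math.nan
    rank = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if count and cumulative + count >= rank:
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            upper = BOUNDS[i]
            if math.isinf(upper):
                return lower  # cannot interpolate inside the open-ended bucket
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
    return BOUNDS[-2]


if __name__ == "__main__":
    counts = [0] * len(BOUNDS)
    for latency in [0.02] * 90 + [0.3] * 10:
        observe(counts, latency)
    print(estimate_quantile(counts, 0.50), estimate_quantile(counts, 0.99))
```

The estimate can never be more precise than the width of the bucket it falls into, which is why the boundaries should be dense around the SLO thresholds.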

5) SLI/SLO and burn-rate alerts

Example SLO (public API):

Availability: ≥ 99.9% over 30 days.
p95 latency: `GET /wallet/balance` < 150 ms; `POST /payments` < 400 ms.
`5xx` errors < 0.2%; `429` (hard) < 1% of total traffic.

Burn-rate alerts (two-window):

2% of the budget burned in 1 hour or 5% in 6 hours → page the on-call engineer.
10% in a day → prioritize an RCA.
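
The budget percentages above can be converted into burn-rate thresholds: a burn rate of 1 means the budget is spent exactly at the end of the period. The sketch below assumes a 30-day period and a 99.9% SLO, as in the example; the function names and the way the error ratios are obtained are hypothetical.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO period from the example above


def burn_rate_threshold(budget_fraction, window_hours, period_hours=PERIOD_HOURS):
    """Burn rate at which `budget_fraction` of the budget is spent in `window_hours`."""
    return budget_fraction * period_hours / window_hours


def observed_burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def should_page(error_ratio_1h, error_ratio_6h, slo=0.999):
    """Page if either window burns budget faster than its threshold."""
    fast = observed_burn_rate(error_ratio_1h, slo) >= burn_rate_threshold(0.02, 1)
    slow = observed_burn_rate(error_ratio_6h, slo) >= burn_rate_threshold(0.05, 6)
    return fast or slow


if __name__ == "__main__":
    print(burn_rate_threshold(0.02, 1))   # threshold for the 1-hour window
    print(burn_rate_threshold(0.05, 6))   # threshold for the 6-hour window
    print(should_page(error_ratio_1h=0.02, error_ratio_6h=0.001))
```

With these inputs, 2% per hour corresponds to a burn rate of about 14.4 and 5% per 6 hours to about 6.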

6) Metric set (what must be collected)

At the perimeter (gateway/WAF):

`http_requests_total{route,method,status,tenant,plan}`
`http_request_duration_seconds_bucket{route,...}` (histogram)
`retries_total{reason}`, `rate_limited_total{key,policy}`
Body sizes: `request_bytes`, `response_bytes`

In the application:

`db_client_requests_total{op,table}`, `db_latency_seconds_bucket{op}`
`cache_ops_total{op}`, cache hit rate
External calls: `outbound_calls_total{provider,op}`, latency/errors/timeouts
Queues/pools: lengths/delays, active workers
Resource USE: CPU, RSS, FD, GC pauses

Business labels: `tenant_id`, `region`, `kyc_level`, `plan`, `feature_flag`.

7) Tracing and correlation (OpenTelemetry)

W3C Trace Context (`traceparent`, `tracestate`) on all hops.
Spans by stage: ingress → authZ → app handler → DB/Redis → PSP/external.
Attributes/labels: `http.route`, `enduser.id`, `tenant.id`, `idempotency.key`, `risk.score`.
Exemplars: link points on latency/error graphs to specific trace IDs.
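
In practice an OpenTelemetry SDK handles propagation, but it helps to see what travels between hops. The sketch below builds and parses a version-00 `traceparent` header (`version-traceid-parentid-flags`, lowercase hex). The helper names are hypothetical, and headers of future versions with extra fields are deliberately rejected here for simplicity.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace-id and 8-byte span-id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return trace_id, parent_id and sampled flag, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid per the Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 1)}


def child_traceparent(header):
    """Header for the next hop: same trace-id, fresh span-id, same sampled flag."""
    parent = parse_traceparent(header)
    if parent is None:
        return new_traceparent()  # broken or missing context: start a new trace
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(incoming))
    print(child_traceparent(incoming))
```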

Sampling:

head-based 1-10% for "normal" paths,
tail-based for tails (take slow/erroneous),
adaptive for peaks and incidents.
Baggage: carry `tenant`/`risk` for slicing without labeling each event.
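
The sampling policy above could be combined roughly as follows: always keep slow or failed traces (the tail-based part) and keep a deterministic share of the rest (the head-based part). This is a sketch under assumptions: `head_rate`, `slow_ms` and the function names are hypothetical, and a real tail sampler decides after the whole trace has been collected.

```python
def head_sampled(trace_id, rate):
    """Deterministic head sampling: the same trace_id yields the same decision on every hop."""
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000  # uniform value in [0, 1)
    return bucket < rate


def keep_trace(trace_id, duration_ms, status, *, head_rate=0.05, slow_ms=400):
    """Keep every slow or failed trace, plus `head_rate` of normal traffic."""
    if status >= 500:
        return True  # erroneous requests are always kept
    if duration_ms >= slow_ms:
        return True  # latency tail is always kept
    return head_sampled(trace_id, head_rate)


if __name__ == "__main__":
    tid = "4bf92f3577b34da6a3ce929dffffffff"
    print(keep_trace(tid, duration_ms=12, status=200))
    print(keep_trace(tid, duration_ms=950, status=200))
```

An adaptive variant would raise `head_rate` during incidents and lower it during traffic peaks.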

8) Logs: structure and privacy

Structured JSON; required fields: `ts`, `trace_id`, `span_id`, `route`, `status`, `latency_ms`, `tenant`, `user_id_hash`.
PII policy: mask PAN/PII; never log secrets/tokens.
Log sampling: a high rate for 4xx/5xx/429 and slow requests.
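
A minimal sketch of a log line that follows these rules, using only the standard library. The required field names come from the list above; everything else is an assumption made for illustration: the `log_line` and `mask` helpers, the salted SHA-256 for `user_id_hash`, the list of secret keys, and a PAN pattern that only catches unbroken runs of 13-19 digits.

```python
import hashlib
import json
import re
import time

_PAN = re.compile(r"\b(\d{6})\d{3,9}(\d{4})\b")  # 13-19 digits, no separators
_SECRET_KEYS = {"authorization", "password", "token", "secret", "api_key"}


def hash_user_id(user_id, salt):
    """Stable pseudonym so logs can be joined without storing the raw ID."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]


def mask(value):
    """Keep the first 6 and last 4 digits of anything that looks like a PAN."""
    if isinstance(value, str):
        return _PAN.sub(lambda m: f"{m.group(1)}******{m.group(2)}", value)
    return value


def log_line(*, trace_id, span_id, route, status, latency_ms, tenant,
             user_id, salt, **extra):
    record = {
        "ts": time.time(),
        "trace_id": trace_id,
        "span_id": span_id,
        "route": route,
        "status": status,
        "latency_ms": latency_ms,
        "tenant": tenant,
        "user_id_hash": hash_user_id(user_id, salt),
    }
    for key, value in extra.items():
        if key.lower() in _SECRET_KEYS:
            continue  # secrets and tokens are dropped, not masked
        record[key] = mask(value)
    return json.dumps(record, separators=(",", ":"))


if __name__ == "__main__":
    print(log_line(trace_id="4bf92f3577b34da6a3ce929d0e0e4736", span_id="00f067aa0ba902b7",
                   route="/payments", status=402, latency_ms=87, tenant="acme",
                   user_id="u-42", salt="pepper",
                   note="card 4111111111111111 declined", token="abc"))
```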

9) Dashboard map (minimum set)

1. Exec-Summary: RPS, availability, error rate, p95/p99 latency, 429 share, error-budget burn rate.
2. Per-Route: top endpoints by RPS/errors/tail latency; version comparison (canary).
3. Per-Tenant/Plan: leaders by load/cost/errors.
4. Dependency Health: DB, cache, PSP/external - latency, errors, saturation.
5. Capacity: CPU/RAM/FD, queues, connection pool, GC, container limits.
6. Security/Abuse: 401/403, 429 by policy, geo/ASN slices, retry spikes.

10) Alerts (threshold and trend)

`error_rate{route}` > 2% (5 minutes) and RPS > N → pager.
`p99_latency{critical}` > target threshold (10 minutes).
`burn_rate` against the budget (see § 5).
DB `timeouts`/`deadlocks`, or `queue_time` growing above X ms.
External providers: `outbound_5xx_rate{provider}` > 1% + SLO-dependent.

11) Capacity planning and performance

Little's law: `L = λ·W` (average number of requests in the system = arrival rate × average time in the system).
The target p95 is decomposed into `ingress + app + DB + external + queue`.
Concurrency budget: cap the maximum number of simultaneous write operations.
Budget metric: "CPU ms per request"; keep 30-50% headroom over peak.
Interaction with rate limits: measure the share of requests at the quota "ceiling" and its effect on latency.
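
As a worked example of Little's law, the sketch below estimates how many requests are in flight at a given arrival rate and how many concurrent slots (workers or connections) that implies once headroom is added. Applying the 30-50% margin to concurrency rather than to CPU time is an assumption of this example, as are the function names and the input numbers.

```python
import math


def in_flight(rps, avg_latency_s):
    """Little's law: L = lambda * W."""
    return rps * avg_latency_s


def slots_needed(peak_rps, avg_latency_s, headroom=0.4):
    """Concurrent slots required at peak, with a safety margin on top."""
    required = in_flight(peak_rps, avg_latency_s) * (1 + headroom)
    return math.ceil(round(required, 9))  # round first to absorb float noise


if __name__ == "__main__":
    print(in_flight(2000, 0.125))                   # average requests in the system
    print(slots_needed(2000, 0.125, headroom=0.5))  # slots with 50% headroom
```

At 2,000 requests/s and 125 ms average latency there are 250 requests in flight on average, so a 50% margin calls for 375 slots.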

12) Load and synthetic checks

Types: baseline load, ×10 bursts, "steps," long plateaus, stress/chaos (node kills, network delays), synthetic checks for critical client scenarios.
Profiling: CPU/alloc/lock contention, N+1 (SQL/HTTP), slow code paths.
Regression control: comparison of p95/p99/errors before/after release (canary).

13) Cost-Observability

Cost metrics: `cpu_ms`, `egress_bytes`, `db_calls`, `$ per 1k requests`.
Allocation to endpoint/tenant/feature: billing tags from the orchestrator + load metrics → API unit-economics report.
Optimization algorithm: select the top endpoints by the product `traffic × cost × (p95 − target)`.
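
The ranking rule can be sketched in a few lines. The data class, its field names and the sample numbers are made up for illustration. Clamping the latency overshoot at zero is also an assumption (the formula in the text does not say how endpoints already within target are treated); without it, fast endpoints would get a negative score.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    route: str
    rps: float
    cost_per_1k_usd: float
    p95_ms: float
    target_p95_ms: float


def priority(e):
    """traffic x cost x (p95 - target), ignoring endpoints that meet their target."""
    overshoot = max(0.0, e.p95_ms - e.target_p95_ms)
    return e.rps * e.cost_per_1k_usd * overshoot


def top_candidates(endpoints, n=3):
    """Endpoints with the largest score first."""
    return sorted(endpoints, key=priority, reverse=True)[:n]


if __name__ == "__main__":
    stats = [  # illustrative numbers only
        EndpointStats("GET /wallet/balance", 500, 0.02, 180, 150),
        EndpointStats("POST /payments", 50, 0.30, 650, 400),
        EndpointStats("GET /health", 900, 0.001, 5, 50),
    ]
    for e in top_candidates(stats, n=2):
        print(e.route, priority(e))
```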

14) Per-tenant analytics and "fairness"

All key metrics are labeled with `tenant_id`/`plan`.
Share of "heavy" customers in p99 tails; individual limits/quotas and retry budgets.
Fair sharing: under overload, reduce the share of "noisy" tenants.
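
The text does not say which fair-sharing policy is meant. One common option is max-min fairness: small tenants get everything they ask for, and the remaining capacity is split evenly among the tenants that ask for more. The sketch below shows that technique under this assumption; `max_min_fair` and the tenant names are hypothetical.

```python
def max_min_fair(capacity, demands):
    """Max-min fair allocation of `capacity` (e.g. RPS) across tenant demands."""
    allocation = {tenant: 0.0 for tenant in demands}
    remaining = dict(demands)
    left = float(capacity)
    while remaining and left > 1e-9:
        share = left / len(remaining)
        satisfied = {t: d for t, d in remaining.items() if d <= share}
        if not satisfied:
            # Everyone still waiting wants more than an equal share: cap them all.
            for tenant in remaining:
                allocation[tenant] += share
            return allocation
        for tenant, demand in satisfied.items():
            allocation[tenant] += demand
            left -= demand
            del remaining[tenant]
    return allocation


if __name__ == "__main__":
    # One heavy tenant asks for twice the total capacity.
    print(max_min_fair(100, {"small-a": 10, "small-b": 20, "heavy": 200}))
```

In this example only the heavy tenant is throttled, which matches the intent of reducing the share of the tenants that cause the overload.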

15) Specifics of iGaming/Finance

Slices by `kyc_level`, `risk_tier`, `payment_method`.
SLIs for "money" paths (`POST /deposits`, `POST /withdrawals`): lower target p95, separate error budgets.
Time-to-Wallet (TTW) metrics, share of anti-fraud auto-blocks, payout conversion.
Audit: immutable logs for financial actions and anti-fraud decisions.

16) Instrumentation: Implementation Practices

Naming metrics (example):

`api_http_requests_total` (counter)
`api_http_request_duration_seconds` (histogram)
`api_outbound_requests_total`, `api_db_query_duration_seconds`
`api_rate_limited_total`, `api_retry_total{reason}`

Labels: `route`, `method`, `status_class`, `tenant`, `region`, `version`, `canary`, `provider`, `db_table`.
Cardinality: avoid free-form values (`user_id`); use buckets/hashes.
Exemplars: attach to the p95/p99 histograms so that a click opens the trace.
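
A sketch of how label values might be kept low-cardinality before they reach a metric: dynamic path segments are replaced by placeholders, statuses are collapsed into classes, and user IDs are hashed into a fixed number of buckets. All helper names are hypothetical. When the web framework exposes the matched route template, using it directly is usually more reliable than pattern matching on the path.

```python
import hashlib
import re

_UUID = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)


def normalize_route(path):
    """Replace numeric and UUID path segments with placeholders; drop the query string."""
    parts = []
    for segment in path.split("?")[0].split("/"):
        if segment.isdigit():
            parts.append("{id}")
        elif _UUID.match(segment):
            parts.append("{uuid}")
        else:
            parts.append(segment)
    return "/".join(parts)


def status_class(status):
    """503 -> '5xx', 200 -> '2xx'."""
    return f"{status // 100}xx"


def user_bucket(user_id, buckets=16):
    """Map an unbounded user_id onto a small, stable set of label values."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return f"b{int.from_bytes(digest[:4], 'big') % buckets:02d}"


if __name__ == "__main__":
    print(normalize_route("/users/12345/orders/9?page=2"))
    print(status_class(503), user_bucket("user-12345"))
```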

17) Antipatterns

Averages instead of percentiles; no split by status class.
Inconsistent `route`/`path` (dynamic IDs "baked" into labels).
Labels with high cardinality (raw user_id, IP).
No separate accounting of external providers (PSP/3rd-party).
"Noisy" alerts (a single window and a single threshold).
p99 excluding queue time (masks real degradation).
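
The first antipattern is easy to demonstrate on synthetic data: when a small share of requests is very slow, the average looks tolerable while p99 shows the real tail. The sample below is invented for illustration, and `percentile` is a simple nearest-rank helper, not a library function.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # 97 fast requests at 20 ms and 3 requests stuck for 2 s.
    latencies_ms = [20] * 97 + [2000] * 3
    print("mean:", statistics.mean(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p95: ", percentile(latencies_ms, 95))
    print("p99: ", percentile(latencies_ms, 99))
```

Here the mean is 79.4 ms and both p50 and p95 are 20 ms, yet p99 is 2,000 ms: the 3% of users hitting the slow path are invisible in the average.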

18) Prod Readiness Checklist

SLI/SLO and error-budget defined, agreed with business.
Unified latency histograms and status classes; p95/p99 on dashboards.
Full trace (OTel), log/metric/trace correlation.
Burn-rate alerts (two-window), p99 thresholds and error-rate.
Per-tenant/per-plan sections and cost reports.
Dashboards: Exec, Per-Route, Dependencies, Capacity, Abuse.
Load scenarios (burst/plateau/stress), profiling.
Retry policies with jitter; account for the effect of retries on RPS.
SLA/SLO documentation for partners and public customers.
Log retention/masking, PII protection.
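
For the checklist item on retries, the sketch below shows one widely used policy, exponential backoff with full jitter, together with a rough estimate of how retries inflate upstream RPS. The parameter values and function names are hypothetical, and the RPS estimate assumes that every attempt fails independently with the same probability.

```python
import random


def backoff_delays(attempts, base_s=0.1, cap_s=5.0, rng=random):
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**attempt)]."""
    return [rng.uniform(0, min(cap_s, base_s * 2 ** i)) for i in range(attempts)]


def effective_rps(base_rps, failure_ratio, max_retries):
    """Upstream RPS when each failed attempt is retried, up to max_retries times."""
    return base_rps * sum(failure_ratio ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    print(backoff_delays(5, rng=random.Random(42)))
    # During an incident with 50% failures and 2 retries, load grows by 75%.
    print(effective_rps(1000, 0.5, 2))
```

The second function illustrates why retries need their own budget: in a partial outage they add load exactly when the dependency is least able to absorb it.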

19) TL;DR

Build observability around SLI/SLO and the error budget, measure RED/USE, collect latency histograms with p95/p99 and queue time, propagate a single trace ID from perimeter to DB, use tail/adaptive sampling, and maintain per-tenant/cost slices and two-window burn-rate alerts. Plan capacity according to queueing laws and the impact on business metrics. Antipatterns: averages instead of percentiles, high cardinality, and unaccounted external dependencies.
