API analytics and performance metrics
1) Why it is needed
The API is the platform's "circulatory system." Without rigorous metrics, we cannot:
- prove that SLOs and SLAs are being met,
- manage throughput and the economics of requests,
- quickly localize degradation (p99 tails, 5xx bursts),
- prioritize optimizations by business impact.
Objective: a single observability model where each request is tracked from perimeter to DB with common identifiers and consistent SLIs.
2) Taxonomy of metrics
Technical: RPS, latency (p50/p95/p99), error rate (4xx/5xx), saturation (CPU, memory, file descriptors), queue time.
Product: successful operations/min, step conversion (200/total), rate-limited share (429), retries, user segments.
Cost: cost/request (CPU-ms + egress + database requests), cost of feature/endpoint, $/tenant, $/1k calls.
3) "Golden Signals": RED and USE
RED (for the API):
- Rate - requests/sec (by endpoint/tenant/plan).
- Errors - 4xx/5xx/429, as fractions and absolute counts.
- Duration - p50/p95/p99 end-to-end and by stage (ingress → app → DB → third-party).
USE (for resources):
- Utilization - CPU/IO/network channel load.
- Saturation - queues (run queue, backlog, DB wait).
- Errors - driver errors/timeouts.
4) Basic definitions and formulas
Availability SLI: `1 − (5xx + gateway_timeout) / all_requests`.
Success SLI: `2xx / (all − 429_shadow)` (excluding shadow 429 blocks).
Apdex: `(|t ≤ T| + 0.5·|T < t ≤ 4T|) / all`, where `T` is the target "good" latency threshold and `|…|` counts the requests in that range.
Tail amplification: `p99_total − max(p99_stage_i)`, the contribution of queues and composition to the tail.
Error budget (month) at 99.9%: `budget = 0.1% × period_time` (about 43 minutes over 30 days).
Recommended bucket boundaries for latency histograms: `[5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]`.
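The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and treating "no traffic" as fully available is an assumption you may want to change.

```python
from typing import Sequence


def availability_sli(total: int, errors_5xx: int, gateway_timeouts: int) -> float:
    """1 - (5xx + gateway_timeout) / all_requests."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return 1.0 - (errors_5xx + gateway_timeouts) / total


def apdex(latencies_s: Sequence[float], target_s: float) -> float:
    """(satisfied + 0.5 * tolerating) / all, with thresholds T and 4T."""
    if not latencies_s:
        return 1.0  # assumption: no samples counts as fully satisfied
    satisfied = sum(1 for t in latencies_s if t <= target_s)
    tolerating = sum(1 for t in latencies_s if target_s < t <= 4 * target_s)
    return (satisfied + 0.5 * tolerating) / len(latencies_s)


def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed 'bad' time for the period, e.g. slo=0.999 over 30 days."""
    return (1.0 - slo) * period_days * 24 * 60
```

For example, five requests with latencies 0.1, 0.2, 0.5, 1.0 and 3.0 s against `T = 0.25 s` give two satisfied and two tolerating requests, so Apdex is `(2 + 0.5·2) / 5 = 0.6`.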
5) SLI/SLO and burn-rate alerts
Example SLO (public API):
- Availability: ≥ 99.9% over 30 days.
- p95 latency: `GET /wallet/balance` < 150 ms; `POST /payments` < 400 ms.
- Errors: `5xx` < 0.2%; `429` (hard) < 1% of total traffic.
Burn-rate alerts:
- 2% of the budget burned in 1 hour or 5% in 6 hours → page the engineer.
- 10% in a day → prioritize an RCA.
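The "share of budget per window" thresholds can be converted into burn-rate multipliers. The sketch below assumes a 30-day SLO period and a 99.9% availability target; the function names are hypothetical. A burn rate of 1 means the budget is being spent exactly fast enough to run out at the end of the period.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo)


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_hours / window_hours


def should_page(error_ratio_1h: float, error_ratio_6h: float,
                slo: float = 0.999) -> bool:
    """Page if 2% of the budget burns in 1 hour or 5% burns in 6 hours."""
    fast = burn_rate(error_ratio_1h, slo) >= burn_rate_threshold(0.02, 1)
    slow = burn_rate(error_ratio_6h, slo) >= burn_rate_threshold(0.05, 6)
    return fast or slow
```

With a 30-day period, 2% in 1 hour corresponds to a burn rate of about 14.4, 5% in 6 hours to about 6, and 10% in a day to about 3.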
6) Metric set (what must be collected)
At the perimeter (gateway/WAF):
- `http_requests_total{route,method,status,tenant,plan}`
- `http_request_duration_seconds_bucket{route,...}` (histogram)
- `retries_total{reason}`, `rate_limited_total{key,policy}`
- Body sizes: `request_bytes`, `response_bytes`
In the application:
- `db_client_requests_total{op,table}`, `db_latency_seconds_bucket{op}`
- `cache_ops_total{op}`, cache hit rate
- External calls: `outbound_calls_total{provider,op}`, latency/errors/timeouts
- Queues/pools: lengths/delays, active workers
- Resource USE: CPU, RSS, FD, GC pauses
Business labels: `tenant_id`, `region`, `kyc_level`, `plan`, `feature_flag`.
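To show why duration is collected as `_bucket` series rather than as a precomputed percentile, here is a minimal sketch of a bucketed histogram with quantile estimation by linear interpolation inside a bucket (similar in spirit to how Prometheus-style histograms are queried). The class name and the choice to report the highest finite bound for the overflow bucket are assumptions made for illustration.

```python
import bisect
import math

# Upper bounds in seconds, matching the recommended buckets from section 4.
BUCKETS_S = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


class LatencyHistogram:
    def __init__(self, bounds=BUCKETS_S):
        self.bounds = list(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0

    def observe(self, seconds: float) -> None:
        # Index of the first bound >= seconds, i.e. "less than or equal" buckets.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += 1

    def quantile(self, q: float) -> float:
        """Estimate the q-quantile by interpolating inside the matching bucket."""
        if self.total == 0:
            return math.nan
        rank = q * self.total
        cumulative = 0
        for i, count in enumerate(self.counts):
            if count and cumulative + count >= rank:
                if i == len(self.bounds):
                    return self.bounds[-1]  # overflow: report highest finite bound
                lower = self.bounds[i - 1] if i > 0 else 0.0
                upper = self.bounds[i]
                return lower + (upper - lower) * (rank - cumulative) / count
            cumulative += count
        return self.bounds[-1]
```

The estimate is only as precise as the bucket it falls into, which is why the bucket boundaries should be dense around the SLO thresholds.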
7) Tracing and correlation (OpenTelemetry)
W3C Trace Context (`traceparent`, `tracestate`) on all hops.
Spans by stage: ingress → authZ → app handler → DB/Redis → PSP/external.
Attributes/labels: `http.route`, `enduser.id`, `tenant.id`, `idempotency.key`, `risk.score`.
Exemplars: link points on latency/error graphs to a specific trace ID.
Sampling:
- head-based, 1-10%, for "normal" paths,
- tail-based for the tails (keep slow/erroneous traces),
- adaptive for peaks and incidents.
Baggage: carry `tenant`/`risk` for breakdowns without tagging each event.
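As an illustration of propagating context across hops, the sketch below parses a `traceparent` header and builds the header for a child span. It is deliberately simplified: it only accepts the four-field layout (version, trace ID, parent ID, flags), ignores `tracestate`, and the helper names are hypothetical. In practice an OpenTelemetry SDK propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - 16-byte trace ID - 8-byte parent (span) ID - flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Keep the trace ID, generate a new span ID, preserve the sampled flag."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"
```

Because the trace ID stays constant across hops, logs and metric exemplars from the gateway, the app and outbound calls can all be joined on it.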
8) Logs: structure and privacy
Structured JSON; required fields are `ts`, `trace_id`, `span_id`, `route`, `status`, `latency_ms`, `tenant`, `user_id_hash`.
PII policy: mask PAN/PII; never log secrets or tokens.
Log sampling: keep a high sampling rate for 4xx/5xx/429 and slow requests.
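A minimal sketch of such a log line follows, assuming a keyed hash for the user ID and a naive PAN mask that keeps the first six and last four digits. The function names, the `message` field and the regex are illustrative assumptions; the regex will also match other 13-19 digit numbers, so a production masker needs stricter detection.

```python
import hashlib
import hmac
import json
import re
from datetime import datetime, timezone

# Naive PAN detector: 13-19 consecutive digits (an assumption, not a validator).
_PAN_RE = re.compile(r"\b(\d{6})\d{3,9}(\d{4})\b")


def mask_pan(text: str) -> str:
    """Keep the first 6 and last 4 digits, replace the middle with asterisks."""
    return _PAN_RE.sub(lambda m: f"{m.group(1)}******{m.group(2)}", text)


def hash_user_id(user_id: str, secret: str) -> str:
    """Keyed hash so the raw user ID never reaches the logs."""
    digest = hmac.new(secret.encode(), user_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]


def log_record(*, trace_id: str, span_id: str, route: str, status: int,
               latency_ms: float, tenant: str, user_id: str, secret: str,
               message: str = "") -> str:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "span_id": span_id,
        "route": route,
        "status": status,
        "latency_ms": latency_ms,
        "tenant": tenant,
        "user_id_hash": hash_user_id(user_id, secret),
    }
    if message:
        record["message"] = mask_pan(message)
    return json.dumps(record, separators=(",", ":"))
```

Each record is one JSON object per line, which keeps it easy to join with traces on `trace_id` and `span_id`.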
9) Dashboard map (minimum set)
1. Exec summary: RPS, availability, error rate, p95/p99 latency, 429 share, error-budget burn rate.
2. Per-route: top endpoints by RPS/errors/tail latency; comparison of versions (canary).
3. Per-Tenant/Plan: Load/Cost/Error Leaders.
4. Dependency Health: DB, cache, PSP/external - latency, errors, saturation.
5. Capacity: CPU/RAM/FD, queues, connection pool, GC, container limits.
6. Security/Abuse: 401/403, 429 by policy, geo/ASN breakdowns, retry spikes.
10) Alerts (threshold and trend)
`error_rate{route}` > 2% (5 minutes) and RPS > N → pager.
`p99_latency{critical}` > target threshold (10 minutes).
`burn_rate` against the budget (see § 5).
DB `timeouts`/`deadlocks`, or `queue_time` growing above X ms.
External providers: `outbound_5xx_rate{provider}` > 1%, plus SLO-dependent thresholds.
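The first rule combines an error-rate threshold with a traffic floor so that a handful of failed requests on a quiet route does not page anyone. A minimal sketch, assuming counters already aggregated over the window; `WindowStats` and `min_rps` are hypothetical names, and `min_rps` stands in for the unspecified N.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int        # total requests seen in the window
    errors: int          # failed requests seen in the window
    window_seconds: int  # e.g. 300 for a 5-minute window


def error_rate_alert(stats: WindowStats, min_rps: float,
                     max_error_rate: float = 0.02) -> bool:
    """Fire only when the error rate is high AND there is enough traffic."""
    if stats.requests == 0:
        return False
    rps = stats.requests / stats.window_seconds
    error_rate = stats.errors / stats.requests
    return error_rate > max_error_rate and rps > min_rps
```

For example, 90 errors out of 3000 requests in 5 minutes (3% at 10 RPS) fires with `min_rps=5`, while 5 errors out of 20 requests in the same window does not.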
11) Capacity planning and performance
Little's law: `L = λ·W` (average number of requests in the system = arrival rate × average time in the system).
The target p95 is decomposed into per-stage budgets: `ingress + app + DB + external + queue`.
Concurrency budget: cap the maximum number of simultaneous write operations.
Budget metric: "CPU ms per request"; keep 30-50% headroom above peak load.
Interaction with rate-limit: Measure the proportion of requests at the quota "ceiling" and the impact on latency.
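Little's law gives a quick way to size concurrency. The sketch below assumes a stable system and average values; the function names and the way headroom is applied (as a multiplier on peak concurrency) are illustrative assumptions.

```python
import math


def required_concurrency(arrival_rate_rps: float, avg_time_s: float) -> float:
    """Little's law: L = lambda * W (average requests in flight)."""
    return arrival_rate_rps * avg_time_s


def workers_with_headroom(peak_rps: float, avg_time_s: float,
                          headroom: float = 0.3) -> int:
    """In-flight capacity at peak plus a safety margin (0.3 = 30%)."""
    return math.ceil(required_concurrency(peak_rps, avg_time_s) * (1 + headroom))
```

For example, a peak of 100 RPS at an average of 0.25 s per request means 25 requests in flight on average; with 50% headroom that becomes 38 worker slots.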
12) Load and synthetic checks
Types: baseline load, ×10 bursts, "steps," long plateaus, stress/chaos (node kills, network delays), synthetic checks for critical client scenarios.
Profiling: CPU/alloc/lock contention, N+1 (SQL/HTTP), slow code paths.
Regression control: comparison of p95/p99/errors before/after release (canary).
13) Cost-Observability
Cost metrics: `cpu_ms`, `egress_bytes`, `db_calls`, `$ per 1k requests`.
Allocation to endpoint/tenant/feature: billing tags from the orchestrator + load metrics → API unit economics report.
Optimization algorithm: select the top endpoints by the product `traffic × cost × (p95 − target)`.
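The ranking can be sketched as follows. The `EndpointStats` fields are hypothetical, and clamping the latency overshoot at zero (so endpoints already within target score 0 instead of negative) is an assumption not stated in the formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EndpointStats:
    route: str
    rps: float               # traffic
    cost_per_request: float  # e.g. dollars per request
    p95_s: float             # measured p95 latency, seconds
    target_p95_s: float      # SLO target for p95, seconds


def priority(e: EndpointStats) -> float:
    """traffic x cost x (p95 - target), with the overshoot clamped at zero."""
    overshoot = max(0.0, e.p95_s - e.target_p95_s)
    return e.rps * e.cost_per_request * overshoot


def top_endpoints(stats: List[EndpointStats], n: int = 3) -> List[EndpointStats]:
    """Endpoints with the highest optimization priority first."""
    return sorted(stats, key=priority, reverse=True)[:n]
```

The product favors endpoints that are simultaneously busy, expensive and over their latency target, which is where an optimization pays back fastest.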
14) Per-tenant analytics and fairness
All key metrics are labeled with `tenant_id`/`plan`.
Share of "heavy" customers in the p99 tails; individual limits/quotas and retry budgets.
Fair sharing: under overload, reduce the share of "noisy" tenants.
15) Specifics of iGaming/Finance
Breakdowns by `kyc_level`, `risk_tier`, `payment_method`.
SLIs for "money" paths (`POST /deposits`, `POST /withdrawals`): stricter p95 targets, separate error budgets.
Time-to-Wallet (TTW) metrics, share of anti-fraud auto-blocks, payout conversion.
Audit: immutable logs for financial actions and anti-fraud decisions.
16) Instrumentation: Implementation Practices
Metric naming (example):
- `api_http_requests_total` (counter)
- `api_http_request_duration_seconds` (histogram)
- `api_outbound_requests_total`, `api_db_query_duration_seconds`
- `api_rate_limited_total`, `api_retry_total{reason}`
Labels: `route`, `method`, `status_class`, `tenant`, `region`, `version`, `canary`, `provider`, `db_table`.
Cardinality: avoid free-form values (user_id); use buckets or hashes.
Exemplars: attach them to the p95/p99 histograms so that a click on the chart opens the trace.
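One way to keep the `route` and `status_class` labels low-cardinality is to normalize them before they reach the metrics pipeline. The sketch below is a fallback for when the router's own route template is not available; the placeholder names `{id}` and `{uuid}` and the helper names are assumptions.

```python
import re

_UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
_NUMERIC_RE = re.compile(r"^\d+$")


def normalize_route(path: str) -> str:
    """Replace dynamic path segments with placeholders and drop the query string."""
    path = path.split("?", 1)[0]
    segments = []
    for segment in path.split("/"):
        if _UUID_RE.match(segment):
            segments.append("{uuid}")
        elif _NUMERIC_RE.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


def status_class(status: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 503 -> '5xx'."""
    return f"{status // 100}xx"
```

For example, `/wallet/12345/balance?currency=EUR` becomes `/wallet/{id}/balance`, so every wallet shares one time series instead of creating one per ID.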
17) Antipatterns
Averages instead of percentiles; no breakdown by status class.
Inconsistent `route`/`path` (dynamic IDs are baked into labels).
Labels with high cardinality (raw user_id, IP).
No separate accounting of external providers (PSP/3rd-party).
Noisy alerts (a single window and a single threshold).
p99 excluding queue time (masks real degradation).
18) Prod Readiness Checklist
- SLI/SLO and error-budget defined, agreed with business.
- Unified latency histograms and status classes; p95/p99 on dashboards.
- End-to-end tracing (OTel), log/metric/trace correlation.
- Burn-rate alerts (two-window), p99 and error-rate thresholds.
- Per-tenant/per-plan breakdowns and cost reports.
- Dashboards: Exec, Per-Route, Dependencies, Capacity, Abuse.
- Load scenarios (burst/plateau/stress), profiling.
- Retry policies with jitter; the effect of retries on RPS is accounted for.
- SLA/SLO documentation for partners and public customers.
- Log retention and masking, PII protection.
19) TL;DR
Build observability around SLI/SLO and the error budget: measure RED/USE, collect latency histograms with p95/p99 and queue time, propagate a single trace ID from the perimeter to the DB, use tail-based and adaptive sampling, maintain per-tenant and cost breakdowns, and alert on two-window burn rate. Plan capacity using queueing laws and the impact on business metrics. Antipatterns: averages instead of percentiles, high cardinality, and unaccounted external dependencies.