API analytics and performance metrics

1) Why it is needed

The API is the platform's "circulatory system." Without strict metrics, we cannot:
  • prove that SLOs and SLAs are being met,
  • manage throughput and the economics of requests,
  • quickly localize degradation (p99 tails, 5xx bursts),
  • prioritize optimizations by business impact.

Objective: a single observability model where each request is tracked from perimeter to DB with common identifiers and consistent SLIs.

2) Taxonomy of metrics

Technical: RPS, latency (p50/p95/p99), error rate (4xx/5xx), saturation (CPU, memory, file descriptors), queue time.
Product: successful operations/min, step conversion (200/total), rate-limited share (429), retries, user segments.
Cost: cost per request (CPU-ms + egress + database requests), cost per feature/endpoint, $/tenant, $/1k calls.

3) "Golden Signals": RED and USE

RED (for API):
  • Rate - requests/sec (by endpoint/tenant/plan).
  • Errors - share and absolute count of 4xx/5xx/429.
  • Duration - p50/p95/p99 end-to-end and by stages (ingress → app → DB → third-party).
USE (for resources):
  • Utilization - CPU/IO/network link load.
  • Saturation - queues (run-queue, backlog, DB wait).
  • Errors - driver errors/timeouts.

4) Basic definitions and formulas

Availability SLI: `1 − (5xx + gateway_timeout) / all_requests`.
Success SLI: `2xx / (all − 429_shadow)` (excluding shadow blocks).
Apdex: `(|t ≤ T| + 0.5·|T < t ≤ 4T|) / all`, where `T` is the target "good" threshold and `|…|` is the number of requests whose latency `t` falls in that range.
Tail amplification: `p99_total − max(p99_stage_i)` - the contribution of queues/composition.
Error budget (month) at 99.9%: `budget = 0.1% × period_time` (about 43 minutes for a 30-day period).

Recommended bucket boundaries for latency histograms: `[5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]`.
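The definitions above can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function names are hypothetical, and treating "no traffic" as fully available is an assumption.

```python
def availability_sli(total: int, errors_5xx: int, gateway_timeouts: int) -> float:
    """Availability SLI: 1 - (5xx + gateway_timeout) / all_requests."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return 1.0 - (errors_5xx + gateway_timeouts) / total


def apdex(latencies_ms: list, t_ms: float) -> float:
    """Apdex: (satisfied + 0.5 * tolerating) / all, for threshold T."""
    if not latencies_ms:
        return 1.0  # assumption: empty sample is reported as 1.0
    satisfied = sum(1 for x in latencies_ms if x <= t_ms)
    tolerating = sum(1 for x in latencies_ms if t_ms < x <= 4 * t_ms)
    return (satisfied + 0.5 * tolerating) / len(latencies_ms)


def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed 'bad' time in the period: (1 - SLO) * period_time."""
    return (1.0 - slo) * period_days * 24 * 60
```

For example, 10 failed requests out of 10,000 give an availability of 0.999, and a 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.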

5) SLI/SLO and burn-rate alerts

Example SLO (public API):
  • Availability: ≥ 99.9% over 30 days.
  • p95 latency: `GET /wallet/balance` < 150 ms; `POST /payments` < 400 ms.
  • Errors: `5xx` < 0.2%; `429` (hard) < 1% of total traffic.
Burn-rate alerts (two-window):
  • 2% of the budget burned in 1 hour or 5% in 6 hours → page the on-call engineer.
  • 10% in a day → prioritize an RCA.
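A sketch of how the thresholds above might be evaluated. It assumes a 30-day budget period and a 99.9% SLO, and that the error ratio for each window is already available from the metrics backend. The function names and the "page"/"rca"/"ok" labels are hypothetical.

```python
PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accumulating."""
    return error_ratio / (1.0 - slo)


def budget_consumed(error_ratio: float, slo: float, window_hours: float) -> float:
    """Fraction of the whole period's budget burned during the window."""
    return burn_rate(error_ratio, slo) * window_hours / PERIOD_HOURS


def alert_level(err_1h: float, err_6h: float, err_24h: float, slo: float = 0.999) -> str:
    # 2% of the budget in 1 hour or 5% in 6 hours -> page.
    if budget_consumed(err_1h, slo, 1) >= 0.02 or budget_consumed(err_6h, slo, 6) >= 0.05:
        return "page"
    # 10% of the budget in a day -> prioritize an RCA.
    if budget_consumed(err_24h, slo, 24) >= 0.10:
        return "rca"
    return "ok"
```

Under these assumptions, "2% in 1 hour" corresponds to a burn rate of about 14.4, "5% in 6 hours" to about 6, and "10% in a day" to about 3.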

6) Metric set (what must be collected)

At the perimeter (gateway/WAF):
  • `http_requests_total{route,method,status,tenant,plan}`
  • `http_request_duration_seconds_bucket{route,...}` (histogram)
  • `retries_total{reason}`, `rate_limited_total{key,policy}`
  • Body sizes: `request_bytes`, `response_bytes`
In the application:
  • `db_client_requests_total{op,table}`, `db_latency_seconds_bucket{op}`
  • `cache_ops_total{op}`, cache hit rate
  • External calls: `outbound_calls_total{provider,op}`, latency/errors/timeouts
  • Queues/pools: lengths/delays, active workers
  • Resource USE: CPU, RSS, FD, GC pauses

Business labels: `tenant_id`, `region`, `kyc_level`, `plan`, `feature_flag`.
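To illustrate why the `_bucket` histograms matter, here is a minimal sketch of estimating a percentile from cumulative bucket counts by linear interpolation inside the target bucket. Prometheus-style backends do something similar in spirit. The function name is hypothetical, and the boundaries are the ones recommended in § 4, expressed in seconds with a trailing +Inf bucket.

```python
import math

# Upper bounds in seconds; the last bucket is the catch-all +Inf bucket.
BOUNDS_S = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, math.inf]


def quantile_from_buckets(q: float, bounds: list, cumulative: list) -> float:
    """Estimate quantile q (0..1) from cumulative 'less or equal' bucket counts."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative):
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate inside the +Inf bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound
```

The estimate is only as precise as the bucket layout: a p95 that falls into the 100-250 ms bucket is resolved no better than that range, which is one reason to choose boundaries around the SLO thresholds.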

7) Tracing and correlation (OpenTelemetry)

W3C Trace Context (`traceparent`, `tracestate`) on all hops.
Spans per stage: ingress → authZ → app handler → DB/Redis → PSP/external.
Attributes/labels: `http.route`, `enduser.id`, `tenant.id`, `idempotency.key`, `risk.score`.
Exemplars: link points on latency/error graphs to specific trace IDs.
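For orientation, a `traceparent` header has the form `version-trace_id-parent_id-flags` in lowercase hex. The sketch below parses it and derives the header for the next hop. In practice an OpenTelemetry SDK propagator would normally handle this. The function names are hypothetical, only the version `00` four-field layout is handled, and marking brand-new traces as sampled is an assumption.

```python
import re
import secrets

_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return trace context as a dict, or None if the header is invalid."""
    m = _TRACEPARENT_RE.match(header.strip())
    if not m:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming) -> str:
    """Build the header for the next hop: same trace_id, fresh span id."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    # Assumption: traces started here are marked as sampled.
    flags = "01" if (ctx is None or ctx["sampled"]) else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The same `trace_id` is what goes into the logs' `trace_id` field and into exemplars, which is what makes metric → trace → log navigation possible.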

Sampling:
  • head-based, 1-10%, for "normal" paths,
  • tail-based for the tails (keep slow/failed traces),
  • adaptive for peaks and incidents.
Baggage: carry `tenant`/`risk` for slicing without tagging each event.
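The head-based and tail-based policies can be combined roughly as follows. This is a simplified sketch: real tail-based sampling is usually decided in a collector once the whole trace is complete, and the 5% rate, the 1-second "slow" threshold, and the function names are assumptions.

```python
def head_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic head-based decision: the same trace_id gives the same answer on every hop."""
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000  # value in [0, 1)
    return bucket < rate


def keep_trace(trace_id: str, status: int, duration_ms: float,
               head_rate: float = 0.05, slow_ms: float = 1000.0) -> bool:
    """Tail rule first (keep failed or slow traces), then the head-based rate."""
    if status >= 500:
        return True
    if duration_ms >= slow_ms:
        return True
    return head_sampled(trace_id, head_rate)
```

Deriving the head decision from the trace ID rather than from a random draw keeps all services consistent without any coordination. An adaptive policy could then adjust `head_rate` during peaks and incidents.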

8) Logs: structure and privacy

Structured JSON; required fields: `ts`, `trace_id`, `span_id`, `route`, `status`, `latency_ms`, `tenant`, `user_id_hash`.
PII policy: mask PAN/PII; never log secrets/tokens.
Log sampling: keep a high rate for 4xx/5xx/429 and for slow requests.
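A minimal sketch of a log formatter that follows these rules. The PAN pattern is deliberately naive: it masks any 13-19 digit run and may over-match. The list of secret keys and the function names are assumptions, not a complete PII policy.

```python
import hashlib
import hmac
import json
import re
import time

# Naive PAN matcher: keep the first 6 and last 4 digits of a 13-19 digit run.
_PAN_RE = re.compile(r"\b(\d{6})\d{3,9}(\d{4})\b")
# Hypothetical deny-list of keys that must never be logged.
_SECRET_KEYS = {"authorization", "password", "token", "api_key"}


def mask_pan(text: str) -> str:
    return _PAN_RE.sub(lambda m: f"{m.group(1)}******{m.group(2)}", text)


def user_hash(user_id: str, secret_key: str) -> str:
    """Keyed hash so the raw user_id never reaches the log pipeline."""
    digest = hmac.new(secret_key.encode(), user_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]


def log_line(*, trace_id, span_id, route, status, latency_ms, tenant,
             user_id, secret_key, extra=None) -> str:
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "trace_id": trace_id,
        "span_id": span_id,
        "route": route,
        "status": status,
        "latency_ms": latency_ms,
        "tenant": tenant,
        "user_id_hash": user_hash(user_id, secret_key),
    }
    for key, value in (extra or {}).items():
        if key.lower() in _SECRET_KEYS:
            continue  # drop secrets/tokens entirely
        record[key] = mask_pan(value) if isinstance(value, str) else value
    return json.dumps(record, separators=(",", ":"))
```

Calling `log_line(..., extra={"note": "card 4111111111111111 declined", "token": "abc"})` yields a record whose `note` contains `411111******1111` and which has no `token` field at all.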

9) Dashboard map (minimum set)

1. Exec-Summary: RPS, availability, error rate, p95/p99 latency, 429 share, error-budget burn rate.
2. Per-Route: top endpoints by RPS/errors/tail latency; comparison of versions (canary).
3. Per-Tenant/Plan: top tenants by load/cost/errors.
4. Dependency Health: DB, cache, PSP/external - latency, errors, saturation.
5. Capacity: CPU/RAM/FD, queues, connection pools, GC, container limits.
6. Security/Abuse: 401/403, 429 by policy, geo/ASN slices, retry spikes.

10) Alerts (threshold and trend)

`error_rate{route}` > 2% (5 minutes) and RPS > N → page.
`p99_latency{critical}` > target threshold (10 minutes).
`burn_rate` against the error budget (see § 5).
DB `timeouts`/`deadlocks`, or `queue_time` growing above X ms.
External providers: `outbound_5xx_rate{provider}` > 1%, plus the SLOs of the dependencies.

11) Capacity planning and performance

Little's law: `L = λ·W` (average number of requests in the system = arrival rate × average time in the system).
Target p95 is decomposed into stages: `ingress + app + DB + external + queue`.
Concurrency budget: cap the maximum number of simultaneous write operations.
Budget metric: "CPU ms per request"; keep 30-50% headroom above peak load.
Interaction with rate limiting: measure the share of requests at the quota "ceiling" and its impact on latency.
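A short worked example of Little's law applied to sizing. It assumes that "headroom" means peak load should occupy at most `1 − headroom` of the available concurrency slots. That reading, and the function names, are assumptions.

```python
import math


def in_flight(arrival_rps: float, avg_time_s: float) -> float:
    """Little's law: L = lambda * W (average number of requests in the system)."""
    return arrival_rps * avg_time_s


def required_slots(peak_rps: float, avg_time_s: float, headroom: float = 0.3) -> int:
    """Concurrency slots such that peak load uses at most (1 - headroom) of them."""
    return math.ceil(in_flight(peak_rps, avg_time_s) / (1.0 - headroom))
```

At a peak of 2,000 requests/s and 50 ms average time in the system, about 100 requests are in flight on average. With 30% headroom this suggests provisioning roughly 143 concurrent slots (workers, connections, or similar).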

12) Load and synthetic checks

Types: baseline load, ×10 bursts, step ramps, long plateaus, stress/chaos (node kills, network delays), synthetic checks for critical client scenarios.
Profiling: CPU/alloc/lock contention, N+1 (SQL/HTTP), slow code paths.
Regression control: compare p95/p99/errors before and after a release (canary).

13) Cost-Observability

Cost metrics: `cpu_ms`, `egress_bytes`, `db_calls`, `$ per 1k requests`.
Allocation to endpoint/tenant/feature: billing tags from the orchestrator + load metrics → API unit-economics report.
Optimization algorithm: pick the top endpoints by the product `traffic × cost × (p95 − target)`.
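The ranking rule can be sketched as below. The figures and the `GET /games` route are made up for illustration. Clamping endpoints that already meet their target to a score of zero is an assumption on top of the formula.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    route: str
    rps: float              # traffic
    cost_per_1k_usd: float  # cost
    p95_ms: float
    target_p95_ms: float


def priority(e: EndpointStats) -> float:
    """traffic * cost * (p95 - target); endpoints within target score 0 (assumption)."""
    overshoot = max(0.0, e.p95_ms - e.target_p95_ms)
    return e.rps * e.cost_per_1k_usd * overshoot


def top_endpoints(stats: list, n: int = 3) -> list:
    return sorted(stats, key=priority, reverse=True)[:n]


# Illustrative numbers only.
SAMPLE = [
    EndpointStats("GET /wallet/balance", rps=500, cost_per_1k_usd=0.02, p95_ms=180, target_p95_ms=150),
    EndpointStats("POST /payments", rps=50, cost_per_1k_usd=0.40, p95_ms=650, target_p95_ms=400),
    EndpointStats("GET /games", rps=900, cost_per_1k_usd=0.01, p95_ms=90, target_p95_ms=200),
]
```

With these sample values `POST /payments` ranks first despite its low traffic, because both its cost and its distance from the target are large.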

14) Per-tenant analytics and fairness

All key metrics are labeled with `tenant_id`/`plan`.
Share of "heavy" customers in the p99 tails; individual limits/quotas and retry budgets.
Fair shedding: under overload, reduce the share of "noisy" tenants.
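One possible way to express fair shedding is a max-min fair split of capacity: tenants with modest demand are served in full, and only the heaviest ones are capped. This is a swapped-in illustration (the text does not name a specific algorithm), and the function name is hypothetical.

```python
def fair_allocation(demand: dict, capacity: float) -> dict:
    """Max-min fair split of capacity (e.g. RPS) across tenants."""
    alloc = {}
    remaining = capacity
    pending = sorted(demand.items(), key=lambda kv: kv[1])  # smallest demand first
    while pending:
        share = remaining / len(pending)  # equal share of what is left
        tenant, want = pending[0]
        if want <= share:
            # Tenant fits within its fair share: serve it fully.
            alloc[tenant] = want
            remaining -= want
            pending.pop(0)
        else:
            # Everyone left wants more than the fair share: cap them equally.
            for name, _ in pending:
                alloc[name] = share
            break
    return alloc
```

For demands of 100, 300, and 900 RPS against a capacity of 1,000, the two smaller tenants keep their full traffic and only the largest one is cut to 600.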

15) Specifics of iGaming/Finance

Slices by `kyc_level`, `risk_tier`, `payment_method`.
SLIs for "money" paths (`POST /deposits`, `POST /withdrawals`): stricter (lower) p95 targets, separate error budgets.
Time-to-Wallet (TTW) metrics, share of automatic anti-fraud blocks, payout conversion.
Audit: immutable logs for financial actions and anti-fraud decisions.

16) Instrumentation: Implementation Practices

Naming metrics (example):
  • `api_http_requests_total` (counter)
  • `api_http_request_duration_seconds` (histogram)
  • `api_outbound_requests_total`, `api_db_query_duration_seconds`
  • `api_rate_limited_total`, `api_retry_total{reason}`

Labels: `route`, `method`, `status_class`, `tenant`, `region`, `version`, `canary`, `provider`, `db_table`.
Cardinality: avoid unbounded values (`user_id`); use buckets or hashes.
Exemplars: attach them to the p95/p99 histograms so that a click leads to the trace.
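To keep the `route` and `status_class` labels bounded, raw paths and status codes need to be normalized before they become label values. A minimal sketch follows. When the web framework exposes the matched route template, that is usually the better source. The `:id` placeholder and the function names are assumptions.

```python
import re

_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)


def route_label(path: str) -> str:
    """Replace dynamic segments (numeric IDs, UUIDs) with a placeholder."""
    segments = path.split("?")[0].strip("/").split("/")
    normalized = [
        ":id" if seg.isdigit() or _UUID_RE.match(seg) else seg
        for seg in segments
    ]
    return "/" + "/".join(normalized)


def status_class(status: int) -> str:
    """Collapse status codes into 2xx/3xx/4xx/5xx classes."""
    return f"{status // 100}xx"
```

With this, `/users/12345/wallet/balance?x=1` and `/users/67890/wallet/balance` map to the same `/users/:id/wallet/balance` series instead of creating one time series per user.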

17) Antipatterns

Averages instead of percentiles; no breakdown by status class.
Inconsistent `route`/`path` (dynamic IDs are baked into labels).
Labels with high cardinality (raw user_id, IP).
No separate accounting for external providers (PSP/3rd-party).
Noisy alerts (a single window and a single threshold).
p99 that excludes queue time (masks real degradation).

18) Prod Readiness Checklist

  • SLI/SLO and error budget defined and agreed with the business.
  • Unified latency histograms and status classes; p95/p99 on dashboards.
  • End-to-end tracing (OTel); log/metric/trace correlation.
  • Burn-rate alerts (two-window), p99 and error-rate thresholds.
  • Per-tenant/per-plan slices and cost reports.
  • Dashboards: Exec, Per-Route, Dependencies, Capacity, Abuse.
  • Load scenarios (burst/plateau/stress), profiling.
  • Retry policies with jitter; the effect of retries on RPS is accounted for.
  • SLA/SLO documentation for partners and public customers.
  • Log retention and masking, PII protection.

19) TL;DR

Build observability around SLI/SLO and the error budget, measure RED/USE, collect latency histograms with p95/p99 and queue time, propagate a single trace ID from the perimeter to the DB, use tail-based and adaptive sampling, keep per-tenant and cost slices, and alert on two-window burn rate. Plan capacity using queueing laws and the impact on business metrics. Antipatterns: averages instead of percentiles, high cardinality, and unaccounted external dependencies.
