Operation Metrics API
1) Purpose and area of responsibility
The Metrics API is a single point of access to the platform's operational and business metrics. It provides:
- consistent SLIs/SLOs (login, deposit, bet, withdrawal);
- KRIs (early risk indicators: PSP/KYC/queues/replication);
- business metrics (authorization success by GEO/PSP/bank, share of successful bets, p95/p99 durations of key paths);
- safe, cheap and predictable reads for dashboards, alerting, status pages and reporting.
2) Architectural principles
Read-heavy, write-few: the API only reads aggregates from the TSDB/cache.
SLO-first: response times are predictable; errors and degradation are signaled transparently.
Cost-aware: downsampling, quotas, canary features in the SDK.
Privacy-by-design: no PII in metadata/labels; tokens, geo-gate, SoD.
Multi-tenant: isolation by brand/region/environment.
3) Data model (surface)
Metric series = `metric_id` + `labels{}` + `timestamp` + `value` (+ optional `exemplar{trace_id=...}`).
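As an illustration of the series shape described above, the following TypeScript sketch models a single point and derives a stable series key from it. The interface names, the `seriesKey` helper and the choice of epoch milliseconds for `timestamp` are assumptions made for the example, not part of the API contract.

```ts
// One point of a metric series; field names follow the data model above.
interface Exemplar {
  trace_id: string;
}

interface MetricPoint {
  metric_id: string;
  labels: Record<string, string>;
  timestamp: number; // assumed unit: Unix epoch milliseconds
  value: number;
  exemplar?: Exemplar;
}

// Hypothetical helper: identifies the series a point belongs to.
// Labels are sorted so that the key does not depend on insertion order.
function seriesKey(p: MetricPoint): string {
  const labels = Object.keys(p.labels)
    .sort()
    .map((name) => `${name}=${p.labels[name]}`)
    .join(",");
  return `${p.metric_id}{${labels}}`;
}

const samplePoint: MetricPoint = {
  metric_id: "auth_success_rate",
  labels: { psp: "pspA", geo: "TR" },
  timestamp: 1761998400000,
  value: 0.97,
};
const sampleKey = seriesKey(samplePoint); // "auth_success_rate{geo=TR,psp=pspA}"
```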
3.1 Categories
SLI/SLO: `auth_success_rate`, `bet_settle_p99_ms`, `withdraw_tat_p95_ms`, `api_5xx_rate`.
KRI: `queue_consumer_lag`, `db_replication_lag`, `psp_soft_decline_rate`.
Business: `deposits_success_pct`, `bets_success_pct`, `kyc_pass_rate`.
Infra: `cpu_util`, `cache_hit_ratio`, `cdn_waf_block_rate`.
3.2 Labels (strictly limited)
`region`, `tenant`, `environment`, `service`, `psp`, `bank_group`, `geo`, `device`, `version`, `component`.
Prohibited: `userId`, `sessionId`, raw card/document numbers.
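A server-side or linter-side check of the label rules above could look roughly like the following sketch. The whitelist and the prohibited names are taken from the lists above; the function name and the error message format are assumptions.

```ts
// Whitelisted labels, as listed in section 3.2.
const ALLOWED_LABELS = new Set([
  "region", "tenant", "environment", "service", "psp",
  "bank_group", "geo", "device", "version", "component",
]);

// Explicitly prohibited identifiers (PII carriers).
const PROHIBITED_LABELS = new Set(["userId", "sessionId"]);

// Hypothetical validator: returns a list of problems, empty when the labels are acceptable.
function validateLabels(labels: Record<string, string>): string[] {
  const errors: string[] = [];
  for (const name of Object.keys(labels)) {
    if (PROHIBITED_LABELS.has(name)) {
      errors.push(`prohibited label: ${name}`);
    } else if (!ALLOWED_LABELS.has(name)) {
      errors.push(`label not in whitelist: ${name}`);
    }
  }
  return errors;
}
```

This only checks label names; detecting raw card or document numbers inside label values would need a separate check that is not shown here.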
4) Versioning and compatibility
Base path: `/v1/metrics/...`; incompatible changes go only into a new `vX`.
Adding labels/series is backward-compatible.
Changes in semantics are signaled through the `schema_version` field in the response and come with a grace period.
The schema catalog is published at `/v1/schemas`.
5) Endpoints (REST, similar in gRPC/GraphQL)
1. `GET /v1/metrics/query`
Parameters:
- `metric` (multi), `from`, `to`, `step` (resolution), `agg` (`avg|sum|min|max|p50|p95|p99`),
- `filter[label]=value` (multi), `group_by=label1,label2`,
- `downsample=1m|5m|1h`, `exemplars=true|false`, `limit` (number of series), `page`.
Response: an array of series `{metric, labels{}, points: [[ts, value]], exemplars?}`.
2. `POST /v1/metrics/bulk-query`
Body: up to 50 queries in one batch. Reduces the number of requests for complex dashboards.
3. `GET /v1/metrics/instant`
Current values at `ts` (or `now`) with the specified filters.
4. `GET /v1/metrics/catalog`
List of available metrics, descriptions, labels, allowed aggregations, SLO bindings.
5. `GET /v1/metrics/health`
The state of the API itself: p95 latency, cache resilience, cache hit ratio.
6. `GET /v1/metrics/slo`
Ready-made SLO views: error budget consumption (fast/slow), target statuses.
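Because `POST /v1/metrics/bulk-query` accepts up to 50 queries per batch, a dashboard with more panels has to split its queries. A minimal sketch of that splitting follows; the `toBatches` helper is hypothetical and not part of any published SDK.

```ts
// Batch limit stated for /bulk-query.
const MAX_BULK_QUERIES = 50;

// Splits a list of queries into consecutive batches of at most `size` items.
function toBatches<T>(queries: T[], size: number = MAX_BULK_QUERIES): T[][] {
  if (size < 1) throw new Error("batch size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < queries.length; i += size) {
    batches.push(queries.slice(i, i + size));
  }
  return batches;
}

// 120 queries become three batches of 50, 50 and 20.
const demoQueries = Array.from({ length: 120 }, (_, i) => ({ id: i }));
const demoBatches = toBatches(demoQueries);
```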
6) Sample requests
6.1 Success of PSP authorizations in TR, 1-minute grid, p95:
GET /v1/metrics/query?metric=auth_success_rate&from=2025-11-01T13:00:00Z&to=2025-11-01T16:00:00Z&step=1m&agg=p95&filter[geo]=TR&group_by=psp&downsample=1m
6.2 p99 "bet→settle" by region, with exemplars (trace examples):
GET /v1/metrics/query?metric=bet_settle_p99_ms&from=...&to=...&step=5m&group_by=region&exemplars=true
6.3 Instant SLO status for deposits in EU:
GET /v1/metrics/slo?domain=payments&region=EU&tenant=brandA
6.4 Batch of 3 queries (`POST /v1/metrics/bulk-query`) for one graph with layers.
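The query strings above can be assembled programmatically. The sketch below builds a `/v1/metrics/query` URL from typed parameters; the `QueryParams` interface and `buildQueryUrl` function are illustrative assumptions. Values are percent-encoded (so `:` in timestamps becomes `%3A`), while the `filter[label]` brackets are left as written in the examples.

```ts
interface QueryParams {
  metric: string[];
  from: string;
  to: string;
  step: string;
  agg?: string;
  filter?: Record<string, string>;
  group_by?: string[];
  downsample?: string;
  exemplars?: boolean;
}

// Hypothetical builder for GET /v1/metrics/query.
function buildQueryUrl(p: QueryParams): string {
  const parts: string[] = [];
  const add = (key: string, value: string): void => {
    parts.push(`${key}=${encodeURIComponent(value)}`);
  };
  for (const m of p.metric) add("metric", m);
  add("from", p.from);
  add("to", p.to);
  add("step", p.step);
  if (p.agg !== undefined) add("agg", p.agg);
  const filter = p.filter ?? {};
  for (const label of Object.keys(filter)) add(`filter[${label}]`, filter[label]);
  // Join with a literal comma, matching group_by=label1,label2.
  if (p.group_by && p.group_by.length > 0) {
    parts.push(`group_by=${p.group_by.map(encodeURIComponent).join(",")}`);
  }
  if (p.downsample !== undefined) add("downsample", p.downsample);
  if (p.exemplars !== undefined) add("exemplars", String(p.exemplars));
  return `/v1/metrics/query?${parts.join("&")}`;
}

// Request 6.1 expressed through the builder.
const url61 = buildQueryUrl({
  metric: ["auth_success_rate"],
  from: "2025-11-01T13:00:00Z",
  to: "2025-11-01T16:00:00Z",
  step: "1m",
  agg: "p95",
  filter: { geo: "TR" },
  group_by: ["psp"],
  downsample: "1m",
});
```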
7) Aggregations and percentiles
Percentiles p50/p95/p99 are calculated at the TSDB/aggregator level; with `downsample`, they are composed correctly (t-digest/HDR).
`group_by` is only allowed on whitelisted labels, to avoid blowing up cardinality.
`step` is validated: minimum 10s for realtime, 1m for public dashboards.
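The `step` rule above could be checked roughly as follows. The duration format (`10s`, `1m`, `1h`) follows the examples in this document; the `Audience` type and function names are assumptions.

```ts
type Audience = "realtime" | "public";

// Minimum step per audience, in seconds: 10s for realtime, 1m for public dashboards.
const MIN_STEP_SECONDS: Record<Audience, number> = { realtime: 10, public: 60 };

// Parses durations such as "10s", "5m", "1h" into seconds.
function parseDurationSeconds(text: string): number {
  const match = /^(\d+)(s|m|h)$/.exec(text);
  if (!match) throw new Error(`bad duration: ${text}`);
  const unitSeconds = { s: 1, m: 60, h: 3600 }[match[2] as "s" | "m" | "h"];
  return Number(match[1]) * unitSeconds;
}

// Returns true when the requested step is not finer than the allowed minimum.
function isStepAllowed(step: string, audience: Audience): boolean {
  return parseDurationSeconds(step) >= MIN_STEP_SECONDS[audience];
}
```

A request that fails this check would correspond to the `400` response for a bad `step`.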
8) Cache, downsampling and freshness
Multi-level cache: in-memory (up to 30-60 s), distributed (up to 5 min), CDN for public SLO views.
Downsampling: automatic for large windows (`> 24h`) → 5m/1h points.
Freshness headers: `X-Data-Freshness: 12s`, `X-Downsample: 1m`, `X-Partial: true|false`.
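A client can surface these headers to the dashboard, for example to show a "stale" or "partial" badge. The sketch below reads them from a plain header map; it assumes that `X-Data-Freshness` is always expressed in whole seconds as in the example (`12s`), which the text does not guarantee.

```ts
interface Freshness {
  freshnessSeconds: number | null; // null when the header is absent or not in "<n>s" form
  downsample: string | null;
  partial: boolean;
}

// Hypothetical reader for the freshness headers listed above.
function readFreshness(headers: Record<string, string>): Freshness {
  // HTTP header names are case-insensitive, so normalize them first.
  const lower: Record<string, string> = {};
  for (const name of Object.keys(headers)) lower[name.toLowerCase()] = headers[name];

  const match = /^(\d+)s$/.exec(lower["x-data-freshness"] ?? "");
  return {
    freshnessSeconds: match ? Number(match[1]) : null,
    downsample: lower["x-downsample"] ?? null,
    partial: lower["x-partial"] === "true",
  };
}

const freshnessDemo = readFreshness({
  "X-Data-Freshness": "12s",
  "X-Downsample": "1m",
  "X-Partial": "false",
});
// freshnessDemo: { freshnessSeconds: 12, downsample: "1m", partial: false }
```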
9) Multi-tenant and isolation
Each request must carry a `tenant` (in the token or labels).
ABAC/RBAC: a role/policy restricts access by `tenant`, `region`, `environment`, `metric_id`.
Showback/chargeback: `X-Query-Cost-Estimate` headers and usage counters.
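The access rule above (a policy restricting `tenant`, `region`, `environment` and `metric_id`) can be sketched as a simple attribute check. The `Policy` and `AccessRequest` shapes are assumptions for illustration; a real policy engine would likely support wildcards and deny rules.

```ts
// Hypothetical policy: explicit allow-lists per attribute.
interface Policy {
  tenants: string[];
  regions: string[];
  environments: string[];
  metricIds: string[];
}

interface AccessRequest {
  tenant: string;
  region: string;
  environment: string;
  metricId: string;
}

// Access is granted only when every attribute of the request is allowed by the policy.
function isAllowed(policy: Policy, req: AccessRequest): boolean {
  return (
    policy.tenants.includes(req.tenant) &&
    policy.regions.includes(req.region) &&
    policy.environments.includes(req.environment) &&
    policy.metricIds.includes(req.metricId)
  );
}

const paymentsViewer: Policy = {
  tenants: ["brandA"],
  regions: ["EU"],
  environments: ["prod"],
  metricIds: ["deposits_success_pct", "auth_success_rate"],
};
```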
10) Authentication and security
OAuth2/mTLS, service tokens with scopes.
SoD: access to metrics with possible regulatory risk (finance, RG) requires separate roles.
Rate limits: per client key and per `metric_id`.
PII sanitization: the server validates that requests contain no prohibited filters/labels.
11) Geo-Residency and Compliance
Data is read from regional storage (EU/LATAM/APAC) according to the residency policy.
Cross-regional queries are allowed only for aggregates without PII and with a `compliance_scope`.
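The cross-regional rule can be expressed as a small guard. This is a sketch under assumptions: the `RegionalQuery` shape and the `aggregatesOnly` flag are hypothetical, and only `compliance_scope` comes from the text.

```ts
interface RegionalQuery {
  regions: string[];          // regional storages the query would touch, e.g. ["EU", "LATAM"]
  aggregatesOnly: boolean;    // hypothetical flag: the query returns only PII-free aggregates
  compliance_scope?: string;  // required for cross-regional queries
}

// Single-region queries pass; cross-regional ones need aggregates plus a compliance scope.
function isResidencyCompliant(q: RegionalQuery): boolean {
  const distinctRegions = new Set(q.regions).size;
  if (distinctRegions <= 1) return true;
  return q.aggregatesOnly && q.compliance_scope !== undefined && q.compliance_scope !== "";
}
```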
12) Exemplars and correlation
With `exemplars=true`, the response includes, at percentile points, references to a couple of representative `trace_id` values (without PII) for fast RCA.
Correlation: `correlation_id` is available in the response metadata.
13) API SLA and errors
Response SLA: p95 ≤ 300 ms (cache), ≤ 1.5 s (cold path), availability ≥ 99.9%.
Codes:
- `400` - invalid request (too many `group_by` labels, bad `step`),
- `403` - insufficient rights/tenant,
- `409` - schema conflict,
- `429` - quota/rate limit,
- `502/504` - storage degradation (the headers carry recommendations for downsample/step),
- `206` - partial response (some shards are unavailable).
Diagnostic headers: `X-Query-Plan`, `X-Query-Cache`, `X-Query-Shards`, `X-RateLimit-Remaining`.
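On the client side, the status codes above map naturally onto a small set of reactions. The sketch below shows one possible mapping; the action names are assumptions chosen for the example, not SDK behavior.

```ts
// Hypothetical client reactions to the documented status codes.
type ClientAction =
  | "use"             // full response
  | "use_partial"     // 206: render, but mark the data as partial
  | "fix_request"     // 400: reduce group_by or correct step
  | "check_access"    // 403: role or tenant problem
  | "refresh_schema"  // 409: schema conflict, reload the schema/catalog
  | "retry_later"     // 429: respect quota and rate limit
  | "retry_coarser"   // 502/504: retry with larger step or downsample
  | "fail";

function actionForStatus(status: number): ClientAction {
  switch (status) {
    case 200: return "use";
    case 206: return "use_partial";
    case 400: return "fix_request";
    case 403: return "check_access";
    case 409: return "refresh_schema";
    case 429: return "retry_later";
    case 502:
    case 504: return "retry_coarser";
    default: return "fail";
  }
}
```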
14) Quotas, rate limits and backpressure
Default: 10 rps per client, 50 series per response, a 3-hour window, `step ≥ 10s`.
Burst tokens: for big-screen dashboards, in agreed windows.
Backpressure: the server may return `Retry-After`, advising the client to increase `step` or enable `downsample`.
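A client honoring `Retry-After` needs to turn the header into a wait time. In HTTP the header may carry either a number of seconds or an HTTP date, so the sketch below handles both; the function name and the 1-second fallback are assumptions.

```ts
// Converts a Retry-After header value into a delay in milliseconds.
// `nowMs` is passed in explicitly to keep the function deterministic and testable.
function retryDelayMs(retryAfter: string | undefined, nowMs: number, fallbackMs: number = 1000): number {
  if (retryAfter === undefined) return fallbackMs;
  const value = retryAfter.trim();
  // Form 1: delay in seconds, e.g. "5".
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return fallbackMs;
  return Math.max(0, dateMs - nowMs);
}
```

Before retrying, the client can also apply the server's advice by increasing `step` or enabling `downsample`, which makes the retried query cheaper.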
15) SDK and best practices
SDK: TypeScript/Go/Python. Defaults: aggressive caching, exponential backoff, `If-None-Match`.
Recommendations for clients:
- batch queries via `/bulk-query`;
- use `group_by` sparingly;
- for historical reviews, use `downsample=1h`;
- add timeouts ≤ 2 seconds and cancellation tokens.
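The exponential backoff mentioned among the SDK defaults can be sketched as follows. This is a generic "full jitter" variant, not the SDK's actual implementation; the base of 200 ms and the 2000 ms cap are assumed values.

```ts
// Delay before retry number `attempt` (0-based).
// The ceiling doubles with each attempt up to `capMs`; the actual delay is a random
// fraction of that ceiling (full jitter) so that clients do not retry in lockstep.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 200,
  capMs: number = 2000,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Ceilings for attempts 0..4 with the defaults: 200, 400, 800, 1600, 2000 ms.
```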
15.1 Mini example (TS)
```ts
// `client` is an already configured Metrics API SDK client.
const res = await client.query({
  metric: ["auth_success_rate"],
  from: "-3h", to: "now", step: "1m",
  agg: "p95",
  filter: { geo: "TR", tenant: "brandA" },
  group_by: ["psp"],
  downsample: "1m",
  exemplars: true,
  timeoutMs: 1800
});
```
16) Observability of API metrics
SLIs of the API itself: `p95_latency`, `error_rate`, `cache_hit_ratio`, `partial_response_rate`.
Usage KPIs: rps, average response size, most expensive metrics.
Alerts: error burn rate, spikes in `429`, cache hit ratio dropping below target.
Logs: structured, without PII; fields `tenant`, `metric_id`, `query_cost_class`.
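The burn-rate alert mentioned above relies on a simple ratio: the observed error rate divided by the error budget (1 minus the SLO target). The sketch below computes it and combines a fast and a slow window, echoing the fast/slow distinction in the SLO views; the function names and the two-window rule are assumptions about how such an alert might be wired.

```ts
// Burn rate = observed error rate / allowed error rate (the error budget).
// A burn rate of 1 consumes the budget exactly over the SLO period.
function burnRate(errorRate: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  if (budget <= 0) throw new Error("SLO target must be below 1");
  return errorRate / budget;
}

// Alert only when both the fast and the slow window exceed the threshold,
// which filters out short spikes that have already recovered.
function shouldAlert(fastWindowBurn: number, slowWindowBurn: number, threshold: number): boolean {
  return fastWindowBurn >= threshold && slowWindowBurn >= threshold;
}

// With the 99.9% availability target, an error rate of 0.5% burns the budget about 5x too fast.
const demoBurn = burnRate(0.005, 0.999);
```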
17) FinOps policies
Request classes: A (realtime dashboards), B (operational), C (analytics). Different quotas/TTL.
Cost: $/GB reads, $/request, $/graph. Monthly report on "heavy" metrics and labels.
Optimizations: server-side merging, pre-aggregates for popular SLO views, automatic hints to the client (suggested `step`/`downsample`).
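One way to produce such hints is to estimate how many points a query would return and pick the finest step that stays within a budget. The sketch below does this with plain arithmetic; the candidate steps, the point budget and the function names are assumptions, and the real cost model is not described in this document.

```ts
// Number of points returned: one per step per series.
function estimatePoints(windowSeconds: number, stepSeconds: number, seriesCount: number): number {
  return Math.ceil(windowSeconds / stepSeconds) * seriesCount;
}

// Picks the finest candidate step (in seconds) that keeps the response within maxPoints.
// Falls back to the coarsest candidate when none fits.
function suggestStepSeconds(
  windowSeconds: number,
  seriesCount: number,
  maxPoints: number,
  candidates: number[] = [10, 60, 300, 3600], // 10s, 1m, 5m, 1h
): number {
  for (const step of candidates) {
    if (estimatePoints(windowSeconds, step, seriesCount) <= maxPoints) return step;
  }
  return candidates[candidates.length - 1];
}

// A 3-hour window with 50 series at 1m yields 180 * 50 = 9000 points.
const threeHourPoints = estimatePoints(3 * 3600, 60, 50);
```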
18) Integrations
Status page: Reads ready-made SLO views.
Alerting: rules rely on `/slo` and `/instant`.
Incident bot: quick graph/slice snippets via short presets.
Workflow/release gates: releases are blocked when SLOs are red.
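A release gate built on the SLO views could be as small as the sketch below. The `SloView` shape and the `green`/`yellow`/`red` status values are assumptions; the actual `/v1/metrics/slo` response format is not specified here.

```ts
type SloStatus = "green" | "yellow" | "red";

// Hypothetical, simplified view of one SLO as returned by /v1/metrics/slo.
interface SloView {
  name: string;
  status: SloStatus;
}

interface GateDecision {
  allowed: boolean;
  blocking: string[]; // names of the SLOs that block the release
}

// The release is blocked when at least one SLO is red.
function releaseGate(views: SloView[]): GateDecision {
  const blocking = views.filter((v) => v.status === "red").map((v) => v.name);
  return { allowed: blocking.length === 0, blocking };
}

const gateDemo = releaseGate([
  { name: "deposits", status: "green" },
  { name: "withdrawals", status: "red" },
]);
// gateDemo: { allowed: false, blocking: ["withdrawals"] }
```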
19) Implementation Roadmap (6-10 weeks)
Weeks 1-2: metrics catalog, label whitelists, `/catalog`, schemas, `/query` prototype with cache and downsampling.
Weeks 3-4: `/bulk-query`, `/slo`, exemplars, RBAC/ABAC, quotas/rate limits.
Weeks 5-6: geo-sharding, CDN for public views, FinOps headers, dashboard of the API's own SLIs.
Weeks 7-8: SDK (TS/Go/Py), recommendations/query linter, canary tests.
Weeks 9-10: chaos drills (shard/cache failure), cost optimization, deprecation policy.
20) Artifacts
Metric Catalog: id, units, descriptions, available `agg`, valid labels.
Access Policy: roles, scopes, limits, SoD.
Query Style Guide: examples of correct/incorrect queries.
SLO Map: mapping of SLIs ↔ public targets.
Cost Report: most expensive queries/labels, optimization plan.
21) KPIs/KRIs of the Metrics API
p95/p99 latency, error rate, partial responses.
Cache hit ratio and CPU/IO savings.
Average response size and $/request.
The share of dashboards that have switched to `/bulk-query`.
Incidents caused by high-cardinality queries.
22) Antipatterns
Unrestricted `group_by` over dozens of labels → cardinality explosion.
Percentiles combined on the client → distorted values.
Queries over 30-90 days without downsampling → expensive and slow.
Mixing tenants/regions in one response without authorization.
Public panels without cache/CDN.
Changing the semantics of metrics without a new `vX` and a grace period.
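The percentile antipattern is easy to demonstrate with numbers. The sketch below compares averaging per-shard p95 values on the client with computing p95 over the merged samples. Merging raw samples stands in here for what a mergeable sketch (t-digest/HDR) does on the server; the shard data is invented for the example, and the nearest-rank percentile definition is one of several in use.

```ts
// Nearest-rank percentile: the ceil(q * n)-th smallest value.
function percentile(values: number[], q: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// Shard A: ten fast requests. Shard B: two slow requests. Values are latencies in ms.
const shardA = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10];
const shardB = [100, 200];

// Client-side "combination": averaging the per-shard p95 values.
const averagedP95 = (percentile(shardA, 0.95) + percentile(shardB, 0.95)) / 2; // 105

// Correct composition: p95 over all samples together.
const mergedP95 = percentile(shardA.concat(shardB), 0.95); // 200
```

The averaged value (105 ms) corresponds to no real request and understates the actual p95 (200 ms) by almost half, which is why percentiles should be requested from the API with the needed `group_by` rather than recombined on the client.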
Summary
The operations metrics API is a stable, secure and cost-effective read layer over telemetry: standardized schemas and percentiles, caching and downsampling, strict labels and access control, SLO views and exemplars for RCA, transparent SLAs and costs. This layer lets you build reliable dashboards, alerts, status communications and release gates without risking privacy, budget or performance.