Infrastructure KPIs and uptime
Why do you need it?
Infrastructure KPIs turn gut feelings about stability into measurable goals and help manage risk and prioritize work. The right metrics link technical SLIs to business results (conversion, Time-to-Wallet, LTV) and let you plan development, load, and the split between innovation and reliability work.
Basic concepts: SLI, SLO, SLA and error budget
SLI (Service Level Indicator) - a measured quality indicator: the proportion of successful requests, p95 latency, uptime per interval.
SLO (Service Level Objective) - the target for an SLI (for example, "success ≥ 99.9% over 30 days").
SLA (Service Level Agreement) - an external promise with penalties/credits. Always derived from the SLO, but not equal to it.
Error budget = '1 − SLO'. This is the maximum allowed failure rate per measurement window. It is used to make decisions about risky releases and experiments.
- Availability SLO 99.95% over 30 days → error budget 0.05% ≈ 21.6 minutes of "failure" in a 30-day window.
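The arithmetic behind the example above can be sketched in a few lines. This is a minimal illustration, assuming a time-based availability SLO expressed as a fraction; the function name 'error_budget_minutes' is hypothetical and not part of any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the allowed 'bad' minutes for an availability SLO over a window.

    slo is a fraction (0.9995 means 99.95%); window_days is the SLO window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    # Error budget = window length x (1 - SLO).
    return window_minutes * (1.0 - slo)


if __name__ == "__main__":
    for slo in (0.999, 0.9995, 0.9999):
        print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} min per 30 days")
```

Running the same function with a 90-day window shows how a longer window gives more absolute minutes for the same percentage target.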
Four golden signals and additional ones
1. Latency (p50/p90/p95/p99, tail is more important than average).
2. Errors (5xx/timeout/business errors).
3. Traffic/throughput (RPS/QPS, MBps).
4. Saturation (CPU/RAM/IO/FD/connections/GC/quotas).
Additional: cold start, queues/backlog, deploy time, SLO compliance.
SLI model for different types of services
HTTP/API
Availability: '(successful 2xx/3xx − logical errors) / (all requests)'
Latency: 'p95' for successful requests; set targets on hot routes.
Quality: the proportion of requests with a correct 'audience/scope' (no authZ errors).
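The availability and latency SLIs above could be computed from raw request records roughly as follows. This is a sketch under assumptions: the 'Request' record and its 'logical_error' flag are hypothetical, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int                  # HTTP status code
    latency_ms: float            # end-to-end latency in milliseconds
    logical_error: bool = False  # hypothetical flag: 2xx/3xx response that reports a business failure


def is_good(r: Request) -> bool:
    """A request counts as good if it is 2xx/3xx and carries no logical error."""
    return 200 <= r.status < 400 and not r.logical_error


def availability(requests: list[Request]) -> float:
    """(successful 2xx/3xx - logical errors) / all requests."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(1 for r in requests if is_good(r)) / len(requests)


def p95_success_latency(requests: list[Request]) -> float:
    """Nearest-rank p95 latency over successful requests only."""
    ok = sorted(r.latency_ms for r in requests if is_good(r))
    if not ok:
        raise ValueError("no successful requests in the window")
    rank = math.ceil(95 * len(ok) / 100)  # 1-based nearest rank
    return ok[rank - 1]
```

Keeping failed requests out of the latency SLI avoids fast failures making latency look better than it is.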
Queues/asynchronous
Message processing time: p95 end-to-end ≤ N seconds
Backlog: median <X, tail p99 <Y.
Delivery error: ≤ Z ppm.
DB/Cache
Operation latency: p95 get/put/commit.
Saturation: connection pool usage, cache hit-ratio.
Errors: timeouts, deadlocks, eviction storms.
CDN/Static
Hit ratio: at or above a target level; degradation → increased load on the origin.
PoP availability: with Anycast, the failure of one PoP is absorbed by neighboring PoPs.
Payments (Business SLI)
Time-to-Wallet p95, deposit/withdrawal success %, PSP failure rate.
Calculation of availability and uptime
Service availability = 'successful requests / all requests' (preferable to 'uptime minutes').
An alternative for infrastructure nodes is 'green time / window time'.
Calendar window: 28-31 days, sliding window: last 30/90 days.
Working hours/critical windows: for back-office systems, uptime can be measured against a schedule (for example, 08:00-22:00 local time).
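- Dependencies: if service A depends on B and C, then 'Availability(A) ≈ Av(B) × Av(C) × Av(A | B, C)', where 'Av(A | B, C)' is A's own availability when B and C are up. This is why it is important to set SLOs at the boundaries.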
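For the time-based alternative ('green time / window time'), a minimal sketch could look like this. It assumes outages are available as (start, end) timestamp pairs, which is an illustrative data shape rather than anything prescribed here; overlapping outages are counted once.

```python
from datetime import datetime


def time_based_uptime(window_start: datetime, window_end: datetime,
                      outages: list[tuple[datetime, datetime]]) -> float:
    """Return green time / window time, merging overlapping outages."""
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window must have positive length")
    # Clip outages to the window and drop those entirely outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in outages
        if end > window_start and start < window_end
    )
    down_seconds = 0.0
    covered_until = window_start  # downtime before this point is already counted
    for start, end in clipped:
        start = max(start, covered_until)
        if end > start:
            down_seconds += (end - start).total_seconds()
            covered_until = end
    return 1.0 - down_seconds / total
```

To measure a scheduled window such as working hours only, the same function can be called once per day with that day's working-hours window and the results averaged.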
Example SLO set (sample)
Gateway API: availability ≥ 99.95%/30d; p95 latency ≤ 120 ms; error rate ≤ 0.2%.
Checkout/Payments: deposit success ≥ 98.5%/30d; Time-to-Wallet p95 ≤ 90 s; PSP timeouts ≤ 0.3%.
Database: p95 read ≤ 10 ms; p95 write ≤ 25 ms; replica lag p95 ≤ 150 ms.
Cache: hit ratio ≥ 85%; eviction storms = 0/30d.
Payouts: p95 processing ≤ 5 min; fraud false positives ≤ 0.3%.
Error budget and change management
If 50% or more of the error budget is consumed before the midpoint of the window, a freeze on features/releases is introduced and the focus shifts to stabilization.
If the budget is spent slowly, you can speed up experiments/canaries.
Link budget consumption to specific releases/incidents via 'release_id'.
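The freeze rule described above can be expressed as a small check. This is a sketch only: the function names are hypothetical, the 50% threshold and the window midpoint come from the rule above, and real policies usually add further conditions (for example, what happens once the budget is fully spent).

```python
def budget_consumed(bad_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    budget = window_days * 24 * 60 * (1.0 - slo)
    return bad_minutes / budget


def should_freeze(consumed: float, window_elapsed: float,
                  threshold: float = 0.5) -> bool:
    """Freeze releases if >= threshold of the budget is gone before mid-window.

    consumed and window_elapsed are fractions: 0.4 means 40%.
    """
    if consumed < 0 or not 0.0 <= window_elapsed <= 1.0:
        raise ValueError("consumed must be >= 0 and window_elapsed within [0, 1]")
    return consumed >= threshold and window_elapsed < 0.5


if __name__ == "__main__":
    used = budget_consumed(bad_minutes=12.0, slo=0.9995)  # about 56% of 21.6 min
    print(should_freeze(used, window_elapsed=10 / 30))
```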
Alerting: how to avoid paging people at night for nothing
Alerts only for SLO degradation and vital symptoms, not for each metric.
Multi-window, multi-burn rate: short window (5-15 min) + long window (1-6 h).
Example: "Burn rate 14× over 5 minutes AND 6× over 1 hour" → on-call page.
Quiet hours for non-P1 signals; ownership routing.
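A multi-window burn-rate check along the lines of the example above might be sketched like this. The thresholds 14× and 6× are taken from the example; the function names and the (errors, total) input shape are assumptions for illustration.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the budgeted error ratio (1 - SLO)."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (errors / total) / (1.0 - slo)


def should_page(short_window: tuple[int, int], long_window: tuple[int, int],
                slo: float, short_threshold: float = 14.0,
                long_threshold: float = 6.0) -> bool:
    """Page only when BOTH windows burn faster than their thresholds.

    Each window is an (errors, total_requests) pair, e.g. for 5 min and 1 h.
    """
    return (burn_rate(*short_window, slo) >= short_threshold
            and burn_rate(*long_window, slo) >= long_threshold)
```

Requiring both windows keeps a brief spike from paging anyone (the long window stays calm) while still reacting quickly when a sustained burn starts.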
Dashboards and visualization practices
SLO panel: service compliance, remaining budget, dependency maps.
Latency panel: p50/p90/p95/p99, decomposition by routes/tenants/countries/ASN.
Error panel: codes/reasons, correlation with releases/feature flags.
Capacity panel: CPU/RAM/IO/network/FD/connections, trends and forecasts.
Business panel: conversion, Time-to-Wallet, deposits/withdrawals, impact of protections (WAF/anti-bot).
Incidents, MTTR and post-mortems
- Response KPIs: MTTD (detection), MTTA (acknowledgement), MTTR/MTTC (recovery/containment), % of incidents without a timely RCA.
- Playbooks: who escalates, how to toggle feature flags/blocks, how to roll back a release, how to communicate with the business.
- Postmortem (blameless): facts, timeline, root causes (technical/process), actions (immediate/long-term), regression tests, effect on the SLO.
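The response KPIs listed above can be derived from incident timestamps. The sketch below assumes a hypothetical 'Incident' record with four timestamps and measures MTTR from the start of the fault; some teams measure it from detection instead, so treat the definitions as one possible convention.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime       # when the fault began
    detected: datetime      # first alert fired
    acknowledged: datetime  # on-call accepted the page
    resolved: datetime      # service restored


def response_kpis(incidents: list[Incident]) -> dict[str, float]:
    """Return mean time to detect / acknowledge / recover, in minutes."""
    if not incidents:
        raise ValueError("no incidents to aggregate")

    def avg_minutes(pairs):
        return mean((end - start).total_seconds() / 60 for start, end in pairs)

    return {
        "mttd_min": avg_minutes((i.started, i.detected) for i in incidents),
        "mtta_min": avg_minutes((i.detected, i.acknowledged) for i in incidents),
        "mttr_min": avg_minutes((i.started, i.resolved) for i in incidents),
    }
```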
Performance, saturation and degradation
Headroom: target spare resource capacity (e.g. CPU < 70% p95, RAM < 75% p95).
Hot paths: profiling critical routes; 'p99' is more important than average.
Degradation modes: cache-only, read-only, load shedding of non-essential requests, rate limits/quotas.
Formulas and examples of calculations
1) Request-based availability
availability = (total_requests - error_requests) / total_requests
where 'error_requests' = 5xx + timeouts + business errors (configurable).
2) Error budget (minutes)
error_budget_minutes = window_minutes × (1 - SLO)
Example: 30 days (43,200 min), SLO 99.95% → 21.6 min.
3) Burn rate
burn_rate = observed_error_ratio / (1 - SLO)
If the SLO is 99.9% (budget 0.1%) and the error ratio is 1% → burn_rate = 10×.
4) Compound availability
A_total ≈ A_gw × A_auth × A_db × A_psp
Small drops in each component multiply and noticeably reduce the overall A.
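A quick way to see the multiplication effect is to compute the product for a serial chain. This sketch assumes the components fail independently and are all on the critical path; the component names and values in the demo are purely illustrative.

```python
from math import prod


def compound_availability(components: dict[str, float]) -> float:
    """Multiply the availabilities of serially dependent components."""
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability of {name!r} must be within [0, 1]")
    return prod(components.values())


if __name__ == "__main__":
    chain = {"gw": 0.9995, "auth": 0.9995, "db": 0.9999, "psp": 0.995}
    total = compound_availability(chain)
    print(f"A_total = {total:.4%} (weakest link: {min(chain.values()):.4%})")
```

The result is always at or below the weakest component, so an end-to-end SLO cannot be tighter than what the dependency chain supports.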
Measurement and exception policies
Unplanned downtime (incidents) - always counted.
Planned maintenance windows - excluded only if the SLA explicitly says so; for SLOs they are often not subtracted (or are tracked separately as 'planned_downtime').
Synthetics vs real users: it is useful to have both channels (RUM + synthetic checks).
Examples of artifacts
KQL/PromQL (ideas)
SLI error ratio (5xx + timeouts) over 5 minutes (PromQL):
sum(rate(http_requests_total{status=~"5..|timeout"}[5m]))
/
sum(rate(http_requests_total[5m]))
p95 latency by route (PromQL):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
Burn rate 5m/1h (PromQL; shown for the 5m window, replace [5m] with [1h] for the long window):
(
sum(rate(errors_total[5m])) / sum(rate(requests_total[5m]))
) / (1 - 0.999)
SQL (Payment Business SLI)
-- Per-minute payment success % and Time-to-Wallet p95 over the last 30 days
SELECT date_trunc('minute', finished_at) AS ts,
       100.0 * sum((status = 'SUCCESS')::int)::float / count(*) AS payment_success_pct,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (finished_at - started_at))) AS ttw_p95_sec
FROM payments
WHERE finished_at > now() - interval '30 days'
GROUP BY 1 ORDER BY 1;
Manage dependencies and cascades
SLO contracts between teams: gateway↔auth↔wallet↔PSP.
Degradation policies: when the dependency drops, the service goes into "simplified mode."
Feature flags: disabling non-critical functions, "gray release" to reduce latency tails.
Capacity Planning and Forecasts
RPS/MBps forecast based on trends and events (tournaments, matches, promotions).
Load testing by "golden paths," separate tests for PSP/payouts.
Peak headroom: target factor of 1.3×–2.0× the expected load.
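The headroom factor translates into instance counts with simple arithmetic. The sketch below assumes throughput scales linearly with the number of instances and that per-instance capacity comes from load testing; both are simplifications, and the function name is hypothetical.

```python
import math


def instances_needed(expected_peak_rps: float, per_instance_rps: float,
                     headroom_factor: float = 1.3) -> int:
    """Instances required to serve the expected peak with the given headroom."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if headroom_factor < 1.0:
        raise ValueError("headroom_factor below 1.0 leaves no reserve")
    target_rps = expected_peak_rps * headroom_factor
    return math.ceil(target_rps / per_instance_rps)


if __name__ == "__main__":
    # Illustrative numbers: 10,000 RPS expected at peak, 500 RPS per instance.
    for factor in (1.5, 2.0):
        print(factor, instances_needed(10_000, 500, factor))
```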
SLO/KPI implementation checklist
1. Identify critical user paths and negotiate SLI "from the customer's perspective."
2. Select SLO targets and window (30/90 days); calculate the error budget.
3. Build metric collection into gateways/services, normalize codes/reasons.
4. Configure burn-rate alerts (short + long window), routing and on-call.
5. Visualize SLO compliance, associate with releases/feature flags.
6. Create an error-budget-based change policy and a freeze process.
7. Run retrospectives and RCAs for each SLO breach; add regression tests.
8. Review SLOs quarterly for actual budget usage and business objectives.
Common mistakes
Measuring "uptime by ping" while ignoring application errors.
SLOs are set with excessive margin (99.999%): unattainable targets that guide no decisions.
Alerts on low-level metrics instead of user symptoms.
There is no dependency map → it is not clear where it is burning.
There is no connection between SLO and releases → it is not clear who "ate" the budget.
Ignoring p99 tails → good average but poor UX for VIP users.
iGaming/fintech specific
Scheduled peaks: matches/events/promotions - increase capacity in advance, warm up cache/CDN, include special limit profiles.
Business SLI: Time-to-Wallet, deposit/withdrawal success, "payout speed" p95; placed at the top level of dashboards.
PSP/partners: individual SLO/dashboards by provider, automatic route switching.
Anti-bot/anti-fraud: these should not consume the error budget - separate "legitimate blocks" from "technical errors."
Regulatory: log storage, reproducibility of SLO/SLA calculations, incident reports.
FAQ
Do I need to subtract planned work from the SLO?
Usually not: the SLO reflects what the user actually experiences. Exceptions can be specified for SLAs.
Why p95, not average?
The average masks the tails; UX is determined by the tails (p95/p99).
Can I have one SLO for the entire product?
You need an SLO tree: aggregate by product and children by critical paths/components.
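To illustrate the p95-versus-average answer above, here is a small sketch with synthetic latencies. The data is made up for the demonstration (90 fast requests and 10 very slow ones), and the percentile uses the nearest-rank method.

```python
import math
from statistics import mean


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values or not 0 < q <= 100:
        raise ValueError("need at least one value and 0 < q <= 100")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


# Synthetic sample: 90% of requests take 50 ms, 10% take 2000 ms.
latencies_ms = [50] * 90 + [2000] * 10

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p95: ", percentile(latencies_ms, 95))
```

The mean lands between the two groups and describes no real request, while p50 and p95 show that most users are fine and one in ten waits about two seconds.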
Summary
A strong infrastructure KPI system consists of user-centric SLIs, realistic SLOs, an error budget used as a change-control lever, smart alerting, and incident discipline with RCAs. Connect technical indicators with business metrics, automate collection and visualization - and the infrastructure will become predictable, and uptime will stay under control even in peak scenarios.