GH GambleHub

Uptime tracking

1) Why monitor uptime

Uptime is the share of time the service is available to users. It is the "first line" of observability: it lets you notice unavailability, network degradation, DNS/TLS failures, and routing or CDN problems immediately. For high-load and regulated systems (fintech, iGaming), uptime directly affects revenue, SLA compliance and penalty risk.

2) Terms and formulas

Availability SLI: 'SLI = (successful checks / all checks) × 100%'.
SLO: target availability over a window (usually 28-30 days), for example 99.9%.
SLA: external commitment; always ≤ the internal SLO.
MTBF/MTTR: mean time between failures / mean time to recovery.

Nines card (per ~30-day month, ~43,200 minutes):

99.0% → ~432 min of unavailability
99.9% → ~43 min
99.99% → ~4.3 min
99.999% → ~26 s
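
The downtime budget follows directly from the SLO: budget = window × (1 - SLO). A minimal Python sketch of this arithmetic, assuming a 30-day window:

  def downtime_budget_minutes(slo_percent: float, window_minutes: int = 30 * 24 * 60) -> float:
      """Minutes of allowed unavailability for a given SLO over the window."""
      return window_minutes * (1 - slo_percent / 100.0)

  # Reproduces the nines card above.
  for slo in (99.0, 99.9, 99.99, 99.999):
      print(f"{slo}% -> {downtime_budget_minutes(slo):.1f} min/month")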

3) What checks are needed (black box)

Run checks from external vantage points (different regions and providers) to see the service "through the eyes of the client."

1. ICMP (ping) - basic network/node reachability. Fast, but says nothing about business-level success.
2. TCP connect - is the port listening? Useful for brokers/databases/SMTP.
3. HTTP/HTTPS - status code, headers, response size, redirects, time to first byte.
4. TLS/certificates - validity period, chain, algorithms, SNI, protocol versions.
5. DNS - A/AAAA/CNAME records, NS health, propagation, DNSSEC.
6. gRPC - call status, deadline, metadata.
7. WebSocket/SSE - handshake, keeping the connection alive, echo message.
8. Proxy/routing/CDN - different PoPs, cache hit/miss, geo variants.
9. Transactional synthetic scenarios (clicks/forms): "login → search → deposit (sandbox)."
10. Heartbeat/cron monitoring - the service must "pulse" (call a hook once every N minutes); no signal means an alert.

Tips:
  • Set timeouts close to the real UX (for example, TTFB ≤ 300 ms, total ≤ 2 s).
  • Assert on content (a keyword or JSON field) so that a "200 OK" carrying an error is not counted as a success (see the sketch below).
  • Duplicate checks across independent providers and networks (different vantage points, different ASNs).
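
A minimal sketch of such a black-box HTTP check in Python (standard library only); the URL, the expected JSON field and the timeout are illustrative placeholders, and a production probe would add retries, scheduling from several regions and result reporting:

  import json
  import time
  import urllib.request

  URL = "https://example.com/healthz/public"   # hypothetical endpoint
  TIMEOUT_S = 2.0                              # total budget, close to the real UX
  EXPECTED = ("status", "ok")                  # content assertion: JSON field and value

  def check_http(url: str = URL) -> dict:
      """One probe: status code, latency and content assertion."""
      started = time.monotonic()
      try:
          with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
              payload = json.loads(resp.read())
              content_ok = payload.get(EXPECTED[0]) == EXPECTED[1]
              return {
                  "up": 200 <= resp.status < 300 and content_ok,
                  "status": resp.status,
                  "content_ok": content_ok,
                  "latency_s": round(time.monotonic() - started, 3),
              }
      except Exception as exc:  # timeout, DNS failure, TLS error, bad JSON...
          return {"up": False, "error": repr(exc),
                  "latency_s": round(time.monotonic() - started, 3)}

  if __name__ == "__main__":
      print(check_http())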

4) White-box checks and health endpoints

Liveness/readiness probes for the orchestrator (is the process alive? is it ready to receive traffic?).
Dependency health: DB, cache, event broker, external APIs (payments/KYC/AML).
Feature flags/degradation: when problems occur, gracefully disable non-critical paths.

White-box probes do not replace external checks: the service may be "healthy inside" yet unreachable for the user because of DNS/TLS/routing.
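
A minimal sketch of liveness/readiness endpoints in Python (standard library only); ping_db and ping_cache are hypothetical stubs standing in for real dependency clients:

  import json
  from http.server import BaseHTTPRequestHandler, HTTPServer

  def ping_db() -> bool:      # hypothetical stub: replace with a real "SELECT 1"
      return True

  def ping_cache() -> bool:   # hypothetical stub: replace with a real cache PING
      return True

  class HealthHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          if self.path == "/livez":
              # Liveness: the process is up and able to answer at all.
              self._reply(200, {"status": "ok"})
          elif self.path == "/readyz":
              # Readiness: critical dependencies are reachable.
              deps = {"db": ping_db(), "cache": ping_cache()}
              ready = all(deps.values())
              self._reply(200 if ready else 503,
                          {"status": "ok" if ready else "degraded", "deps": deps})
          else:
              self._reply(404, {"status": "not found"})

      def _reply(self, code: int, payload: dict):
          body = json.dumps(payload).encode()
          self.send_response(code)
          self.send_header("Content-Type", "application/json")
          self.send_header("Content-Length", str(len(body)))
          self.end_headers()
          self.wfile.write(body)

  if __name__ == "__main__":
      HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()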

5) Geography and multi-regionality

Run synthetics from key traffic regions and close to the providers of critical dependencies.
Quorum: record an incident only when the check fails in ≥ N regions (for example, 2 out of 3), which filters out local anomalies (see the sketch below).
Per-cohort thresholds: separate SLI/SLO for important segments (countries, VIPs, carriers).
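
A sketch of the quorum rule, assuming the latest result of one check type has already been collected per region (region names and the 2-out-of-3 threshold are illustrative):

  def quorum_failure(results: dict[str, bool], min_failed_regions: int = 2) -> bool:
      """Record an incident only if the check fails in >= N regions."""
      failed = [region for region, ok in results.items() if not ok]
      return len(failed) >= min_failed_regions

  # One flapping region does not open an incident; two failing regions do.
  print(quorum_failure({"eu-west": True, "us-east": False, "ap-south": True}))   # False
  print(quorum_failure({"eu-west": False, "us-east": False, "ap-south": True}))  # True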

6) Alert policy (minimum noise)

Multi-region + multi-probe: page only on a consistent failure (for example, HTTP and TLS failing simultaneously in ≥ 2 regions).
Debounce: require N consecutive failures or a 2-3 minute window before paging (sketched after the escalation list below).

Escalation:
  • L1: on-call (production services).
  • L2: network/platform/security, depending on the failure signature.
  • Auto-close: after M consecutive successful checks.
  • Quiet hours/exceptions: for non-critical internal services, tickets only, no pager.
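
The debounce and auto-close rules can be expressed as a small per-check state tracker; a sketch with illustrative thresholds (3 consecutive failures to page, 5 consecutive successes to auto-close):

  class AlertState:
      """Page after N consecutive failures, auto-close after M consecutive successes."""

      def __init__(self, fail_threshold: int = 3, recover_threshold: int = 5):
          self.fail_threshold = fail_threshold
          self.recover_threshold = recover_threshold
          self.fail_streak = 0
          self.ok_streak = 0
          self.paging = False

      def observe(self, check_ok: bool) -> str:
          if check_ok:
              self.ok_streak += 1
              self.fail_streak = 0
              if self.paging and self.ok_streak >= self.recover_threshold:
                  self.paging = False
                  return "auto-close"
          else:
              self.fail_streak += 1
              self.ok_streak = 0
              if not self.paging and self.fail_streak >= self.fail_threshold:
                  self.paging = True
                  return "page"
          return "noop"

  state = AlertState()
  for ok in [False, False, False, True, True, True, True, True]:
      print(state.observe(ok))   # noop, noop, page, noop, noop, noop, noop, auto-close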

7) Status page and communication

Public (client) and private (internal) status pages.
Automatic incidents from synthetics + manual annotations.
Message templates: Detected - Identified - Impact - Workaround - ETA - Resolved - Post-Mortem.
Planned maintenance windows: announce them in advance and account for them separately from the SLO.

8) Consideration of external dependencies

For each provider (payments, KYC, mailings, CDN, clouds), run dedicated checks from several regions.
Failover routes: automatic switching to an alternative provider driven by the synthetic signal (see the sketch below).
Separate SLOs at the provider level plus an integrated end-to-end SLO.
Agree on SLAs with providers (status webhooks, support priority).
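
A sketch of failover driven by the synthetic signal; provider names are hypothetical, and a real switch would also apply hysteresis and allow a manual override:

  # Preference order of providers; names are hypothetical.
  PROVIDERS = ["primary-psp", "backup-psp", "reserve-psp"]

  def pick_provider(synthetic_health: dict[str, bool]) -> str | None:
      """Route traffic to the first provider whose synthetic checks are green."""
      for name in PROVIDERS:
          if synthetic_health.get(name, False):
              return name
      return None  # everything is red: escalate via the incident process instead

  print(pick_provider({"primary-psp": False, "backup-psp": True, "reserve-psp": True}))
  # -> "backup-psp"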

9) Dashboards and key widgets

World map with the status of checks (by type: HTTP, DNS, TLS).
Timeline of incidents with release/flag annotations.
P50/P95/P99 TTFB/TTL/latency by region.
Availability by cohort (country/provider/device).
MTTR/MTBF, downtime minutes, and the burn-down trend of the monthly availability budget (sketched below).
Top failure causes (TLS expiry, DNS resolution, 5xx, timeouts).
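
The burn-down widget is just the remaining availability budget over time; a sketch of the underlying arithmetic (the daily downtime figures are made up for illustration):

  def budget_remaining(slo_percent: float, downtime_minutes_per_day: list[float],
                       window_minutes: int = 30 * 24 * 60) -> list[float]:
      """Remaining error-budget minutes after each day of the window."""
      budget = window_minutes * (1 - slo_percent / 100.0)
      remaining, spent = [], 0.0
      for day_downtime in downtime_minutes_per_day:
          spent += day_downtime
          remaining.append(round(budget - spent, 1))
      return remaining

  # 99.9% over 30 days gives ~43.2 min of budget; two incidents consume most of it.
  print(budget_remaining(99.9, [0, 12.0, 0, 0, 25.0]))  # [43.2, 31.2, 31.2, 31.2, 6.2]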

10) Incident process (typical scenario)

1. A multi-region/multi-type alert fires.
2. The on-call engineer acknowledges it, freezes releases, and notifies the owners.
3. Quick diagnostics: DNS/TLS/CDN status, recent releases, error graphs.
4. Workaround: change routes, fall back to alternative content/provider, enable degradation mode.
5. Recovery: verify that synthetics and real traffic are green.
6. Communicate on the status page; close the incident.
7. RCA and action items: fixes, tests, alerts, playbooks.

11) SLA/SLO Reporting

Monthly reports: uptime by service/region, downtime minutes, MTTR, causes.
Comparison against the SLA: credits/compensation where applicable.
Quarterly reviews: update thresholds, the distribution of synthetics, and the list of dependencies.

12) Check templates (examples)

HTTP API check:
  • Method: 'GET /healthz/public' (no secrets).
  • Timeout: 2 s, retries: 1.
  • Success: '2xx', header 'X-App-Version' present, JSON field '"status": "ok"'.
TLS check:
  • Expiry > 14 days, valid chain, protocols 'TLS 1.2+', correct SNI.
DNS check:
  • Response time ≤ 100 ms, A/AAAA records match the plan, no SERVFAIL/REFUSED.
Heartbeat:
  • Webhook '/beat/{service}' every 5 minutes; two missed signals in a row trigger an L2 alert (background tasks/ETL).
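
For the TLS template, a minimal sketch of the "more than 14 days to expiry" rule using the Python standard library (host and threshold are illustrative); the handshake itself also verifies the chain and hostname:

  import socket
  import ssl
  from datetime import datetime, timezone

  def days_until_expiry(host: str, port: int = 443, timeout: float = 5.0) -> float:
      """Handshake with full verification and return days left on the leaf certificate."""
      ctx = ssl.create_default_context()   # verifies chain and hostname by default
      with socket.create_connection((host, port), timeout=timeout) as sock:
          with ctx.wrap_socket(sock, server_hostname=host) as tls:
              cert = tls.getpeercert()
      not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
      not_after = not_after.replace(tzinfo=timezone.utc)
      return (not_after - datetime.now(timezone.utc)).total_seconds() / 86400

  if __name__ == "__main__":
      days_left = days_until_expiry("example.com")   # illustrative host
      print("OK" if days_left > 14 else f"ALERT: certificate expires in {days_left:.1f} days")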

13) Implementation checklist

  • Multi-region external checks (HTTP/TCP/DNS/TLS/deep scenarios).
  • White-box readiness/liveness probes for the orchestrator.
  • Separation of critical/non-critical paths, degradation flags.
  • Quorum and debounce in alerts, escalation and auto-close.
  • Public and internal status pages, message templates.
  • Dedicated checks and SLOs for external providers + automatic failover.
  • Dashboards: map, timeline, percentiles, downtime minutes, MTTR/MTBF.
  • Regular SLA/SLO reports and post-incident RCAs.

14) Frequent errors

Only ping/port checks without HTTP/content: monitoring is green while the service is actually unavailable.
A single monitoring location: false positive/negative conclusions.
No TLS/DNS monitoring: sudden outages due to expiry or misconfiguration.
Excess noise: alerts on single failures from the same region or check type.
No link to changes: no release or flag annotations on dashboards.
Unaccounted dependencies: the payment provider is down, but the overall status stays "green."

15) The bottom line

Uptime tracking is not just about "pinging URLs." It is a system of synthetic checks from real regions, sensible low-noise alerting, transparent communication through status pages, accounting for external dependencies, and rigorous reporting. Properly built uptime monitoring reduces MTTR, protects SLAs, and keeps the user experience predictable.
