Uptime and heartbeat monitoring
1) Why you need it
Early detection of outages at the perimeter and inside (edge ↔ core).
Confirmation of availability as users experience it (not just "are the pods alive").
SLA/SLO contractual reporting and legal obligations.
Monitoring of background processes (cron, ETL, payment runs) via heartbeats.
Methodologies: Golden Signals (latency/traffic/errors/saturation), RED, and the link to SLOs and the error budget.
2) Types of checks (synthetics)
ICMP: basic networking/IP availability.
TCP: port is alive/handshake (e.g. 443/5432).
TLS: validity/term/chain of certificates.
HTTP(S): response code, latency, headers, key substrings in the body.
DNS: resolution, TTL, NXDOMAIN/SERVFAIL.
Headless browser (user path): login → action → logout.
Custom probes: payment authorization in sandbox PSP, internal business synthetics (deposit simulation).
Tip: probe both edge and private endpoints (from inside the VPC/K8s); they are different risk domains.
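To make the HTTP check semantics concrete, here is a minimal sketch of how a probe evaluates its success criteria (status code, latency budget, body substring). The thresholds and the `ProbeResult` type are illustrative; real probes such as blackbox_exporter implement this natively.

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    status_code: int
    latency_ms: float
    body: str

def probe_ok(result: ProbeResult,
             expected_codes=range(200, 400),
             latency_budget_ms: float = 2000,
             must_contain: str = "") -> bool:
    """Success only if code, latency and body criteria all pass."""
    if result.status_code not in expected_codes:
        return False
    if result.latency_ms > latency_budget_ms:
        return False
    if must_contain and must_contain not in result.body:
        return False
    return True

print(probe_ok(ProbeResult(200, 120.0, '{"status":"ok"}'), must_contain='"ok"'))  # True
print(probe_ok(ProbeResult(503, 50.0, "maintenance")))                            # False
```

The same three criteria map directly onto blackbox-exporter's `valid_status_codes`, probe duration metrics and `fail_if_body_not_matches_regexp`.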
3) Uptime monitoring architecture
Probe agents per region (minimum 3 geo-locations).
Blackbox exporter for HTTP/TCP/TLS/DNS.
Path synthetics (sequential user-journey steps) run separately; keep the scripts under version control.
Prometheus/Mimir/Thanos: collecting metrics, SLO/alert rule.
Alertmanager/Pager: routing P1/P2, escalation.
Status Page: transparent updates for business/customers.
Logs/traces: drill-down by 'trace_id'/correlation.
4) Health-endpoints: design
/healthz (liveness) - "is the process alive."
/readyz (readiness) - "ready to receive traffic" (dependencies with thresholds).
/startupz - "finished initializing."
/check - extended business health (lightweight database/cache checks with timeouts and a circuit breaker).
Semantic health: return 200 only when critical dependencies are functional; on degradation → 503.
Rules: timeout ≤ 2-3 s, a bounded set of sub-checks, no PII in responses, cache the heavy parts.
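The readiness rules above can be sketched as a small aggregator: run each sub-check with a hard timeout and map the overall result to 200/503. The check names and the 2-second timeout are illustrative, not a prescribed API.

```python
from concurrent.futures import ThreadPoolExecutor

def readyz(checks: dict, timeout_s: float = 2.0):
    """checks: name -> zero-arg callable returning True/False.
    Returns (http_status, details). Details contain statuses only:
    no PII, no internal versions or configs."""
    details = {}
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, fut in futures.items():
            try:
                details[name] = "ok" if fut.result(timeout=timeout_s) else "fail"
            except Exception:  # sub-check timed out or raised
                details[name] = "fail"
    # 200 only when every critical dependency passed; otherwise degrade to 503
    status = 200 if all(v == "ok" for v in details.values()) else 503
    return status, details

status, details = readyz({"db": lambda: True, "cache": lambda: False})
print(status, details)  # 503 {'db': 'ok', 'cache': 'fail'}
```

In production the caching of heavy sub-checks (per the rules above) would sit between the handler and the callables.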
5) Heartbeat for job and workers
Dead man's switch model: if the tick does not arrive on time, alert.
Usage: cron/ETL/invoice jobs, off-chain payment checks, background workers.
- Push heartbeat over HTTP: the job does 'POST /heartbeat/<job>' on completion.
- Metrics pull: expose 'last_success_timestamp' and alert on "older than N minutes."
- Watchdog: a constant signal from the agent; if it disappears, alert "monitoring is broken."
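The metrics-pull variant reduces to one staleness comparison on the monitor side; a minimal sketch (the 900-second budget is illustrative):

```python
import time

def is_stale(last_success_ts: float, max_age_s: float = 900,
             now: float = None) -> bool:
    """True -> fire the 'heartbeat missed' alert for this job."""
    now = time.time() if now is None else now
    return (now - last_success_ts) > max_age_s

# Job last succeeded 30 minutes ago against a 15-minute budget -> alert.
print(is_stale(last_success_ts=1_700_000_000, now=1_700_001_800))  # True
# Job succeeded 5 minutes ago -> quiet.
print(is_stale(last_success_ts=1_700_001_500, now=1_700_001_800))  # False
```

This is exactly what the PromQL alert in section 6.3 evaluates server-side, which is why the pull model needs no extra infrastructure beyond the exporter.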
6) Configuration examples
6.1 Blackbox exporter (HTTP + TLS + DNS)

```yaml
modules:
  http_2xx:
    prober: http
    http:
      method: GET
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false
      fail_if_not_ssl: true
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      headers:
        User-Agent: "uptime-probe"
      body: ""
      tls_config:
        insecure_skip_verify: false
  tls_cert:
    prober: tcp
    tcp:
      query_response: []
      tls: true
      tls_config:
        insecure_skip_verify: false
  dns:
    prober: dns
    dns:
      query_name: "api.example.com"
      valid_rcodes: ["NOERROR"]
      preferred_ip_protocol: "ip4"
```
6.2 Prometheus: targets and jobs

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/healthz
          - https://pay.example.com/readyz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
```
6.3 Heartbeat job metrics (Prometheus exporter)
Expose the metric:

```
job_last_success_timestamp_seconds{job="settlement"} 1.730000e+09
```

Alert:

```promql
(time() - job_last_success_timestamp_seconds{job="settlement"}) > 900
```
6.4 Watchdog (dead man's switch)
In Alertmanager, route the always-firing 'Watchdog' alert to an external receiver → if the notification stops arriving, monitoring itself is broken.
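A minimal sketch of this pattern, assuming the conventional always-firing `Watchdog` rule and a hypothetical external dead-man's-switch URL (`dms.example.com` is a placeholder):

```yaml
# Prometheus rule: an alert that fires unconditionally.
groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always firing; its absence means the alerting pipeline is broken."
---
# Alertmanager: forward Watchdog to an external dead man's switch,
# which pages you when the notifications stop arriving.
route:
  routes:
    - matchers: ['alertname="Watchdog"']
      receiver: deadmansswitch
      repeat_interval: 1m
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: "https://dms.example.com/ping"  # hypothetical endpoint
```

The key property: the external switch is outside your monitoring stack, so it survives a total Prometheus/Alertmanager outage.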
7) PromQL examples for uptime
HTTP availability (0/1):

```promql
probe_success{job="blackbox-http"} == 1
```

p95 probe latency:

```promql
histogram_quantile(0.95, sum by (le, instance) (rate(probe_http_duration_seconds_bucket[5m])))
```

TLS certificate expires in < 7 days:

```promql
(min_over_time(probe_ssl_earliest_cert_expiry[5m]) - time()) < 7 * 24 * 3600
```

DNS failures (dns-module probes not succeeding; rcode validity is checked by the module itself):

```promql
avg_over_time(probe_success{job="blackbox-dns"}[5m]) < 1
```

Uptime SLI (rolling 28d):

```promql
avg_over_time(probe_success[28d])
```
8) Alerting: thresholds and anti-noise
Multi-region quorum: fire only if ≥2 regions see the drop.
Multi-window: 1-5 min (fast channel) + 30-60 min (sustained trend).
Sensitivity: debounce with 'for: 2-5m' against flapping.
Correlation: link uptime alerts to related metrics (edge, DNS, WAF, origin).
Maintenance windows: suppress alerts via 'maintenance = true' tags.

≥2 regions failing simultaneously:

```promql
count by (target) (max_over_time(probe_success[3m]) == 0) >= 2
```
9) Multi-region and multi-vendor checks
Minimum 3 geographies (EU/NA/APAC) and different ASNs.
Redundancy: your own probes + an external uptime provider.
IPv4/IPv6, HTTP/2/3, different CDN POPs and WAF profiles.
10) Security checks
Allowlist the probes' IP ranges on the WAF/LB.
Rate limits and captcha bypass for health endpoints/probes.
Header signature (HMAC) for private health endpoints.
Separate domains: public probes vs private ones (/internal/health).
Do not return internal versions/configs from /healthz; statuses only.
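The HMAC header idea above can be sketched in a few lines: the probe signs the request path with a shared secret, the service verifies in constant time. The secret value and the signing scheme (path-only, hex SHA-256) are illustrative assumptions.

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical shared secret, rotated out-of-band

def sign(path: str) -> str:
    """Value the probe puts into its signature header."""
    return hmac.new(SECRET, path.encode(), hashlib.sha256).hexdigest()

def verify(path: str, header_value: str) -> bool:
    """Constant-time comparison guards against timing attacks."""
    return hmac.compare_digest(sign(path), header_value)

token = sign("/internal/health")
print(verify("/internal/health", token))   # True
print(verify("/internal/health", "bogus")) # False
```

A production variant would usually also sign a timestamp to prevent replay; that is omitted here for brevity.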
11) SLO and uptime reporting
SLI availability: share of successful probes (HTTP 2xx/3xx).
SLO example: ≥ 99.95% over 28 days in a majority of regions.
Error budget: '1 − SLO' → gates releases.
Burn-rate alerts: fast/slow channels on the probe failure ratio.
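As a sketch of the fast channel: a burn-rate condition on the probe failure ratio, requiring both the short and the long window to burn. The 14.4 multiplier is the commonly used value for a 1h/5m fast window pair (it exhausts ~2% of a 28-day budget in an hour); tune it to your own SLO.

```promql
(1 - avg_over_time(probe_success{job="blackbox-http"}[5m])) > 14.4 * (1 - 0.9995)
and
(1 - avg_over_time(probe_success{job="blackbox-http"}[1h])) > 14.4 * (1 - 0.9995)
```

The slow channel repeats the same shape with, e.g., 6h/30m windows and a lower multiplier.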
12) Heartbeat for payment and critical jobs
Jobs "around money" (transfers, registries) need double control: heartbeat + business counters (how many records were processed).
Alerts on "silence" (no new events > N minutes) and on lag behind real time.
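A "silence" alert on a business counter might look like the following sketch; the metric name `settlement_records_processed_total` and the 15-minute window are hypothetical.

```promql
# No records processed in the last 15 minutes despite the job being expected to run
increase(settlement_records_processed_total[15m]) == 0
```

Combined with the heartbeat alert from section 6.3, this catches both "the job is not running" and "the job is running but doing nothing."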
13) Status pages
Separate components (API, payments, backends, CDN).
Automatic updates from alerts, manual comments via Comms role.
Incident history, post-mortem links, planned work.
14) Integration with incident process
Alert severity (SEV) derived from quorum rules + duration.
Auto-creation of an incident card, war-room, IC assignment.
Communication templates (internal/external), Legal Hold if necessary.
Post-verification: synthetics green for ≥ X minutes before "Resolved."
15) Performance and cost
Probe frequency: critical - every 30-60 s; secondary - every 1-5 min.
Storage: downsampling/recording rules for long windows.
External provider budget: restrict expensive browser scripts to a schedule.
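For long windows, a recording rule can pre-aggregate the SLI so the 28-day queries from section 7 stay cheap; a sketch (rule name is a convention, not prescribed):

```yaml
groups:
  - name: uptime-sli
    rules:
      - record: job:probe_success:avg5m
        expr: avg by (job, instance) (avg_over_time(probe_success[5m]))
```

The rolling SLI then becomes `avg_over_time(job:probe_success:avg5m[28d])` over far fewer samples.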
16) Quality checklist
- /healthz, /readyz, /startupz exist with clear semantics.
- Probes from ≥3 regions/ASNs, IPv4/IPv6.
- TLS/DNS checks with alerts at T-30/T-7/T-1 days.
- Heartbeats on all critical jobs (plus business "silence" alerts).
- Multi-window + quorum, no flapping.
- Drill-down: links to logs/traces/dashboards.
- Status page and communication templates.
- Documentation of SLO/metrics and owners.
17) Implementation plan (3 iterations)
1. Week 1: HTTP/TLS/DNS blackbox probes for critical domains, status page, basic alerts.
2. Week 2: multi-region probes, quorum rules, heartbeats for the top jobs, Watchdog.
3. Week 3: headless browser scripts (login/deposit), SLO reporting, integration with the incident process.
18) Mini-FAQ
Why are external probes better than internal ones?
External probes see the real user path (DNS/CDN/WAF); internal ones see the origin's state. You need both.
Do I need to check payment PSPs?
Yes: synthetics against the PSP sandbox plus status-page monitoring; on degradation - automatic smart routing.
How to reduce noise?
Quorum, multi-window, for-delays, maintenance suppression, clear SLO thresholds and ownership.
Summary
Uptime monitoring is more than ping. It is a system: multi-region synthetics + well-designed health endpoints + job heartbeats + SLOs/alerts + status pages. Standardize checks, reduce noise, protect the probes, and tie everything into the incident process - this reduces MTTR and preserves the error budget.