Uptime and heartbeat monitoring
1) Why you need it
Early detection of outages at the perimeter and inside (edge ↔ core).
Confirmation of availability as users experience it (not just "are the pods alive").
SLA/SLO contractual reporting and legal obligations.
Monitoring of background processes (cron, ETL, payment runs) via heartbeats.
Methodologies: Golden Signals (latency/traffic/errors/saturation), RED, and the link to SLOs and the error budget.
2) Types of checks (synthetics)
ICMP: basic networking/IP availability.
TCP: port is alive/handshake (e.g. 443/5432).
TLS: validity/term/chain of certificates.
HTTP(S): response code, latency, headers, key substrings in the body.
DNS: resolution, TTL, NXDOMAIN/SERVFAIL.
Headless browser (user path): login → action → logout.
Custom probes: payment authorization in sandbox PSP, internal business synthetics (deposit simulation).
Tip: probe both edge and private endpoints (from inside the VPC/K8s); they are different risk domains.
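To make the HTTP check semantics concrete, here is a minimal sketch of how a probe evaluates its success criteria (status code, latency budget, body substring). The thresholds and the `ProbeResult` type are illustrative; real probes such as blackbox_exporter implement this natively.

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    status_code: int
    latency_ms: float
    body: str

def probe_ok(result: ProbeResult,
             expected_codes=range(200, 400),
             latency_budget_ms: float = 2000,
             must_contain: str = "") -> bool:
    """Success only if code, latency and body criteria all pass."""
    if result.status_code not in expected_codes:
        return False
    if result.latency_ms > latency_budget_ms:
        return False
    if must_contain and must_contain not in result.body:
        return False
    return True

print(probe_ok(ProbeResult(200, 120.0, '{"status":"ok"}'), must_contain='"ok"'))  # True
print(probe_ok(ProbeResult(503, 50.0, "maintenance")))                            # False
```

The same three criteria map directly onto blackbox-exporter's `valid_status_codes`, probe duration metrics and `fail_if_body_not_matches_regexp`.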
3) Uptime monitoring architecture
Probe agents per region (minimum 3 geo-locations).
Blackbox exporter for HTTP/TCP/TLS/DNS.
Path synthetics (sequential user-journey steps) run separately; keep the scripts under version control.
Prometheus/Mimir/Thanos: collecting metrics, SLO/alert rule.
Alertmanager/Pager: routing P1/P2, escalation.
Status Page: transparent updates for business/customers.
Logs/traces: drill-down by 'trace_id'/correlation.
4) Health-endpoints: design
/healthz (liveness) - "is the process alive."
/readyz (readiness) - "ready to receive traffic" (dependencies with thresholds).
/startupz - "finished initializing."
/check - extended business health (lightweight database/cache checks with timeouts and a circuit breaker).
Semantic health: return 200 only when critical dependencies are functional; on degradation → 503.
Rules: timeout ≤ 2-3 s, a bounded set of sub-checks, no PII in responses, cache the heavy parts.
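The readiness rules above can be sketched as a small aggregator: run each sub-check with a hard timeout and map the overall result to 200/503. The check names and the 2-second timeout are illustrative, not a prescribed API.

```python
from concurrent.futures import ThreadPoolExecutor

def readyz(checks: dict, timeout_s: float = 2.0):
    """checks: name -> zero-arg callable returning True/False.
    Returns (http_status, details). Details contain statuses only:
    no PII, no internal versions or configs."""
    details = {}
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, fut in futures.items():
            try:
                details[name] = "ok" if fut.result(timeout=timeout_s) else "fail"
            except Exception:  # sub-check timed out or raised
                details[name] = "fail"
    # 200 only when every critical dependency passed; otherwise degrade to 503
    status = 200 if all(v == "ok" for v in details.values()) else 503
    return status, details

status, details = readyz({"db": lambda: True, "cache": lambda: False})
print(status, details)  # 503 {'db': 'ok', 'cache': 'fail'}
```

In production the caching of heavy sub-checks (per the rules above) would sit between the handler and the callables.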
5) Heartbeat for job and workers
Dead man's switch model: if the tick does not arrive on time, alert.
Usage: cron/ETL/invoice jobs, off-chain payment checks, background workers.
- Push heartbeat over HTTP: the job does 'POST /heartbeat/<job>' on completion.
- Metrics pull: expose 'last_success_timestamp' and alert on "older than N minutes."
- Watchdog: a constant signal from the agent; if it disappears, alert "monitoring is broken."
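The metrics-pull variant reduces to one staleness comparison on the monitor side; a minimal sketch (the 900-second budget is illustrative):

```python
import time

def is_stale(last_success_ts: float, max_age_s: float = 900,
             now: float = None) -> bool:
    """True -> fire the 'heartbeat missed' alert for this job."""
    now = time.time() if now is None else now
    return (now - last_success_ts) > max_age_s

# Job last succeeded 30 minutes ago against a 15-minute budget -> alert.
print(is_stale(last_success_ts=1_700_000_000, now=1_700_001_800))  # True
# Job succeeded 5 minutes ago -> quiet.
print(is_stale(last_success_ts=1_700_001_500, now=1_700_001_800))  # False
```

This is exactly what the PromQL alert in section 6.3 evaluates server-side, which is why the pull model needs no extra infrastructure beyond the exporter.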
6) Configuration examples
6.1 Blackbox exporter (HTTP + TLS + DNS)

```yaml
modules:
  http_2xx:
    prober: http
    http:
      method: GET
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false
      fail_if_not_ssl: true
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      headers:
        User-Agent: "uptime-probe"
      body: ""
      tls_config:
        insecure_skip_verify: false
  tls_cert:
    prober: tcp
    tcp:
      query_response: []
      tls: true
      tls_config:
        insecure_skip_verify: false
  dns:
    prober: dns
    dns:
      query_name: "api.example.com"
      valid_rcodes: ["NOERROR"]
      preferred_ip_protocol: "ip4"
```
6.2 Prometheus: targets and jobs

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/healthz
          - https://pay.example.com/readyz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
```
6.3 Heartbeat job metrics (Prometheus exporter)
Expose the metric:

```
job_last_success_timestamp_seconds{job="settlement"} 1.730000e+09
```

Alert:

```promql
(time() - job_last_success_timestamp_seconds{job="settlement"}) > 900
```
6.4 Watchdog (dead man's switch)
In Alertmanager, route the always-firing 'Watchdog' alert to an external receiver → if the notification stops arriving, monitoring itself is broken.
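A minimal sketch of this pattern, assuming the conventional always-firing `Watchdog` rule and a hypothetical external dead-man's-switch URL (`dms.example.com` is a placeholder):

```yaml
# Prometheus rule: an alert that fires unconditionally.
groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always firing; its absence means the alerting pipeline is broken."
---
# Alertmanager: forward Watchdog to an external dead man's switch,
# which pages you when the notifications stop arriving.
route:
  routes:
    - matchers: ['alertname="Watchdog"']
      receiver: deadmansswitch
      repeat_interval: 1m
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: "https://dms.example.com/ping"  # hypothetical endpoint
```

The key property: the external switch is outside your monitoring stack, so it survives a total Prometheus/Alertmanager outage.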
7) PromQL examples for uptime
HTTP availability (0/1):

```promql
probe_success{job="blackbox-http"} == 1
```

p95 probe latency:

```promql
histogram_quantile(0.95, sum by (le, instance) (rate(probe_http_duration_seconds_bucket[5m])))
```

TLS certificate expires in < 7 days:

```promql
(min_over_time(probe_ssl_earliest_cert_expiry[5m]) - time()) < 7 * 24 * 3600
```

DNS failures (dns-module probes not succeeding; rcode validity is checked by the module itself):

```promql
avg_over_time(probe_success{job="blackbox-dns"}[5m]) < 1
```

Uptime SLI (rolling 28d):

```promql
avg_over_time(probe_success[28d])
```
8) Alerting: thresholds and anti-noise
Multi-region quorum: fire only if ≥2 regions see the drop.
Multi-window: 1-5 min (fast channel) + 30-60 min (sustained trend).
Sensitivity: debounce with 'for: 2-5m' against flapping.
Correlation: link uptime alerts to related metrics (edge, DNS, WAF, origin).
Maintenance windows: suppress alerts via 'maintenance = true' tags.

≥2 regions failing simultaneously:

```promql
count by (target) (max_over_time(probe_success[3m]) == 0) >= 2
```
9) Multi-region and multi-vendor checks
Minimum 3 geographies (EU/NA/APAC) and different ASNs.
Redundancy: your own probes + an external uptime provider.
IPv4/IPv6, HTTP/2/3, different CDN POPs and WAF profiles.
10) Security checks
Allowlist the probes' IP ranges on the WAF/LB.
Rate limits and captcha bypass for health endpoints/probes.
Header signature (HMAC) for private health endpoints.
Separate domains: public probes vs private ones (/internal/health).
Do not return internal versions/configs from /healthz; statuses only.
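The HMAC header idea above can be sketched in a few lines: the probe signs the request path with a shared secret, the service verifies in constant time. The secret value and the signing scheme (path-only, hex SHA-256) are illustrative assumptions.

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical shared secret, rotated out-of-band

def sign(path: str) -> str:
    """Value the probe puts into its signature header."""
    return hmac.new(SECRET, path.encode(), hashlib.sha256).hexdigest()

def verify(path: str, header_value: str) -> bool:
    """Constant-time comparison guards against timing attacks."""
    return hmac.compare_digest(sign(path), header_value)

token = sign("/internal/health")
print(verify("/internal/health", token))   # True
print(verify("/internal/health", "bogus")) # False
```

A production variant would usually also sign a timestamp to prevent replay; that is omitted here for brevity.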
11) SLO and uptime reporting
SLI availability: share of successful probes (HTTP 2xx/3xx).
SLO example: ≥ 99.95% over 28 days in a majority of regions.
Error budget: '1 − SLO' → gates releases.
Burn-rate alerts: fast/slow channels on the probe failure ratio.
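As a sketch of the fast channel: a burn-rate condition on the probe failure ratio, requiring both the short and the long window to burn. The 14.4 multiplier is the commonly used value for a 1h/5m fast window pair (it exhausts ~2% of a 28-day budget in an hour); tune it to your own SLO.

```promql
(1 - avg_over_time(probe_success{job="blackbox-http"}[5m])) > 14.4 * (1 - 0.9995)
and
(1 - avg_over_time(probe_success{job="blackbox-http"}[1h])) > 14.4 * (1 - 0.9995)
```

The slow channel repeats the same shape with, e.g., 6h/30m windows and a lower multiplier.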
12) Heartbeat for payment and critical jobs
Jobs "around money" (transfers, registries) need double control: heartbeat + business counters (how many records were processed).
Alerts on "silence" (no new events > N minutes) and on lag behind real time.
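A "silence" alert on a business counter might look like the following sketch; the metric name `settlement_records_processed_total` and the 15-minute window are hypothetical.

```promql
# No records processed in the last 15 minutes despite the job being expected to run
increase(settlement_records_processed_total[15m]) == 0
```

Combined with the heartbeat alert from section 6.3, this catches both "the job is not running" and "the job is running but doing nothing."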
13) Status pages
Separate components (API, payments, backends, CDN).
Automatic updates from alerts, manual comments via Comms role.
Incident history, post-mortem links, planned work.
14) Integration with incident process
Alert severity (SEV) derived from quorum rules + duration.
Auto-creation of an incident card, war-room, IC assignment.
Communication templates (internal/external), Legal Hold if necessary.
Post-verification: synthetics green for ≥ X minutes before "Resolved."
15) Performance and cost
Probe frequency: critical - every 30-60 s; secondary - every 1-5 min.
Storage: downsampling/recording rules for long windows.
External provider budget: restrict expensive browser scripts to a schedule.
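For long windows, a recording rule can pre-aggregate the SLI so the 28-day queries from section 7 stay cheap; a sketch (rule name is a convention, not prescribed):

```yaml
groups:
  - name: uptime-sli
    rules:
      - record: job:probe_success:avg5m
        expr: avg by (job, instance) (avg_over_time(probe_success[5m]))
```

The rolling SLI then becomes `avg_over_time(job:probe_success:avg5m[28d])` over far fewer samples.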
16) Quality checklist
- /healthz, /readyz, /startupz exist with clear semantics.
- Probes from ≥3 regions/ASNs, IPv4/IPv6.
- TLS/DNS checks with alerts at T-30/T-7/T-1 days.
- Heartbeats on all critical jobs (plus business "silence" alerts).
- Multi-window + quorum, no flapping.
- Drill-down: links to logs/traces/dashboards.
- Status page and communication templates.
- Documentation of SLO/metrics and owners.
17) Implementation plan (3 iterations)
1. Week 1: HTTP/TLS/DNS blackbox probes for critical domains, status page, basic alerts.
2. Week 2: multi-region probes, quorum rules, heartbeats for the top jobs, Watchdog.
3. Week 3: headless browser scripts (login/deposit), SLO reporting, integration with the incident process.
18) Mini-FAQ
Why are external probes better than internal ones?
External probes see the real user path (DNS/CDN/WAF); internal ones see the origin's state. You need both.
Do I need to check payment PSPs?
Yes: synthetics against the PSP sandbox plus status-page monitoring; on degradation - automatic smart routing.
How to reduce noise?
Quorum, multi-window, for-delays, maintenance suppression, clear SLO thresholds and ownership.
Summary
Uptime monitoring is more than ping. It is a system: multi-region synthetics + well-designed health endpoints + job heartbeats + SLOs/alerts + status pages. Standardize checks, reduce noise, protect the probes, and tie everything into the incident process - this reduces MTTR and preserves the error budget.