GH GambleHub

Uptime and heartbeat monitoring

1) Why you need it

Early detection of outages at the perimeter and inside (edge ↔ core).
Confirmation of availability for real users (not just "are the pods alive").
Contractual SLA/SLO reporting and legal obligations.
Monitoring of background processes (cron, ETL, payment reconciliations) via heartbeat.

Methodologies: Golden Signals (latency/traffic/errors/saturation), RED, tied to SLOs and the error budget.
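As a sketch, the RED signals map directly onto PromQL, assuming the conventional 'http_requests_total' / 'http_request_duration_seconds' metric names (adjust to your own instrumentation):

```promql
# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: 5xx responses per second
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration: p95 latency (assumes a histogram metric)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```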

2) Types of checks (synthetics)

ICMP: basic networking/IP availability.
TCP: port is alive/handshake (e.g. 443/5432).
TLS: validity/term/chain of certificates.
HTTP(S): response code, latency, headers, key substrings in the body.
DNS: resolution, TTL, NXDOMAIN/SERVFAIL.
Headless browser (user path): login → action → logout.
Custom probes: payment authorization in a PSP sandbox, internal business synthetics (deposit simulation).

Tip: probe both the edge and private endpoints (from inside the VPC/K8s); they are different risk domains.

3) Uptime monitoring architecture

Probe agents by region (minimum 3 geo-points).
Blackbox exporter for HTTP/TCP/TLS/DNS.
Path synthetics (sequential steps) kept separately; scripts under version control.
Prometheus/Mimir/Thanos: metric collection, SLO/alert rules.
Alertmanager/pager: P1/P2 routing, escalation.
Status page: transparent updates for business/customers.
Logs/traces: drilldown by 'trace_id'/correlation.

4) Health-endpoints: design

/healthz (liveness) - "is the process alive."

/readyz (readiness) - "ready to receive traffic" (dependency checks with thresholds).

/startupz - "initialization finished."

/check - extended business health (lightweight database/cache checks with timeouts and a circuit breaker).
Semantic health: return 200 only when critical dependencies are functional; on degradation → 503.

Rules: timeout ≤ 2-3 s, a limited number of sub-checks, no PII in responses, cache the heavy parts.
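In Kubernetes, the three endpoints map directly onto the probe types; a minimal sketch (container name and port are placeholders):

```yaml
# Pod spec fragment: /startupz, /healthz, /readyz wired to K8s probes
containers:
  - name: api                # placeholder container name
    ports:
      - containerPort: 8080
    startupProbe:
      httpGet: { path: /startupz, port: 8080 }
      failureThreshold: 30   # allow up to 30 * 5s for slow initialization
      periodSeconds: 5
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
      timeoutSeconds: 2      # matches the "timeout ≤ 2-3 s" rule above
      periodSeconds: 10
    readinessProbe:
      httpGet: { path: /readyz, port: 8080 }
      timeoutSeconds: 3
      periodSeconds: 10
```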

5) Heartbeat for job and workers

Dead Man's Switch model: if a tick does not arrive on time, alert.
Usage: cron/ETL/invoice jobs, off-chain payment checks, background workers.

Methods:
  • Push heartbeat over HTTP: the job issues 'POST /heartbeat/<job>' on completion.
  • Metrics pull: expose 'last_success_timestamp' and alert when it is "older than N minutes."
  • Watchdog: a constant signal from the agent; if it goes missing, alert on "monitoring break."

6) Configuration examples

6.1 Blackbox exporter (HTTP + TLS + DNS)

```yaml
modules:
  http_2xx:
    prober: http
    http:
      method: GET
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false
      fail_if_not_ssl: true
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      headers:
        User-Agent: "uptime-probe"
      body: ""
      tls_config:
        insecure_skip_verify: false

  tls_cert:
    prober: tcp
    tcp:
      query_response: []
      tls: true
      tls_config:
        insecure_skip_verify: false

  dns:
    prober: dns
    dns:
      query_name: "api.example.com"
      valid_rcodes: ["NOERROR"]
      preferred_ip_protocol: "ip4"
```

6.2 Prometheus: targets and jobs

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/healthz
          - https://pay.example.com/readyz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
```

6.3 Heartbeat job metrics (Prometheus exporter)

Expose the metric:

```
job_last_success_timestamp_seconds{job="settlement"} 1.730000e+09
```

Alert expression:

```promql
(time() - job_last_success_timestamp_seconds{job="settlement"}) > 900
```
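Wrapped into a Prometheus alerting rule (the alert name and severity label are illustrative):

```yaml
groups:
  - name: heartbeats
    rules:
      - alert: SettlementJobStale
        expr: (time() - job_last_success_timestamp_seconds{job="settlement"}) > 900
        for: 5m                      # debounce against scrape gaps
        labels:
          severity: page
        annotations:
          summary: "settlement job has not succeeded for >15 minutes"
```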

6.4 Watchdog (Dead Man's Switch)

In Alertmanager, route the always-firing 'Watchdog' alert to a dedicated receiver; if the heartbeat stops arriving at that receiver, the monitoring pipeline itself is broken.
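A sketch of such a route, assuming a dead-man's-switch service reachable by webhook (the URL is a placeholder):

```yaml
route:
  routes:
    - matchers: ['alertname="Watchdog"']
      receiver: deadmansswitch
      repeat_interval: 1m            # keep pinging so silence is detectable
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: https://dms.example.com/ping   # placeholder endpoint
```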

7) PromQL examples for uptime

HTTP availability (0/1):
```promql
probe_success{job="blackbox-http"} == 1
```
p95 probe latency (blackbox exposes gauges, not histograms, so use quantile_over_time):
```promql
quantile_over_time(0.95, probe_duration_seconds{job="blackbox-http"}[5m])
```
TLS certificate expires in < 7 days:
```promql
(min_over_time(probe_ssl_earliest_cert_expiry[5m]) - time()) < 7 * 24 * 3600
```
DNS failures (an invalid rcode fails the probe; assumes a 'blackbox-dns' scrape job):
```promql
probe_success{job="blackbox-dns"} == 0
```
Uptime SLI (rolling 28 days, as the share of successful probes):
```promql
avg_over_time(probe_success{job="blackbox-http"}[28d])
```

8) Alerting: thresholds and anti-noise

Multi-region quorum: fire only if ≥2 regions see the drop.
Multi-window: 1-5 min (fast channel) + 30-60 min (steady trend).
Sensitivity: debounce via 'for:' of 2-5 minutes against flapping.
Correlation: link uptime alerts to adjacent metrics (edge, DNS, WAF, origin).
Maintenance windows: suppress alerts by 'maintenance=true' tags.

Example rule (fires when ≥2 regions fail simultaneously; assumes one scrape job per region and a 'target' label):
```promql
count by (target) (max_over_time(probe_success[3m]) == 0) >= 2
```
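Maintenance suppression can be sketched as an Alertmanager route that discards alerts carrying the maintenance tag (the receiver name is illustrative):

```yaml
route:
  routes:
    - matchers: ['maintenance="true"']
      receiver: blackhole    # drop notifications during planned work
receivers:
  - name: blackhole          # receiver with no notification configs
```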

9) Multi-region and multi-vendor checks

Minimum 3 geographies (EU/NA/APAC) and different ASNs.
Redundancy: your own probes plus an external uptime provider.
IPv4/IPv6, HTTP/2/3, different CDN POPs and WAF profiles.

10) Security checks

Allowlist probe IP ranges on the WAF/LB.
Rate limits and captcha bypass for health endpoints/probes.
Header signature (HMAC) for private health endpoints.
Separate domains: public probes vs. private (/internal/health).
Do not return internal versions/configs in /healthz; statuses only.
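A blackbox module probing a private health endpoint with a signed header might look like this (the header name and the mechanism for injecting the secret are assumptions; blackbox exporter itself only sends static headers):

```yaml
modules:
  http_internal_health:
    prober: http
    http:
      method: GET
      headers:
        X-Probe-Signature: "REPLACE_WITH_HMAC"  # placeholder; inject via config management
      fail_if_not_ssl: true
      valid_status_codes: [200]
```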

11) SLO and uptime reporting

Availability SLI: share of successful HTTP probes (2xx/3xx).
SLO example: ≥ 99.95% over 28 days in a majority of regions.
Error budget: '1 − SLO' → governs releases.
Burn-rate alerts: fast/slow channels on the probe failure rate.
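A multi-window burn-rate alert for a 99.95% SLO (error budget 0.0005) can be sketched in PromQL; the 14.4× factor is the conventional fast-burn threshold from SRE practice:

```promql
# fast burn: both the 1h and the 5m windows exceed 14.4x budget burn
(
  (1 - avg_over_time(probe_success{job="blackbox-http"}[1h])) > 14.4 * 0.0005
and
  (1 - avg_over_time(probe_success{job="blackbox-http"}[5m])) > 14.4 * 0.0005
)
```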

12) Heartbeat for payment and critical jobs

Jobs "around money" (transfers, registries) need double control: heartbeat + business counters (how many records were processed).
Alert on "silence" (no new events for > N minutes) and on lag (falling behind real time).
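A "silence" alert can be sketched against a hypothetical business counter (the metric name is an assumption):

```promql
# no settlement records processed for 15 minutes
increase(settlement_records_processed_total[15m]) == 0
```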

13) Status pages

Separate components (APIs, payments, backends, CDNs).
Automatic updates from alerts, manual comments via Comms role.
Incident history, post-mortem links, planned work.

14) Integration with incident process

Alert severity (SEV) assigned by quorum rules + duration.
Auto-creation of an incident card, war-room, IC assignment.
Communication templates (internal/external), Legal Hold if necessary.

Post-verification: synthetics stay green for ≥ X minutes before marking "Resolved."

15) Performance and cost

Probe frequency: critical - every 30-60 s; secondary - every 1-5 min.
Storage: downsampling/recording rules for long windows.
External provider budget: run expensive browser scripts on a limited schedule.
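The frequency split maps onto separate scrape jobs (job names and target groupings are illustrative):

```yaml
scrape_configs:
  - job_name: 'blackbox-critical'    # e.g. payment and login endpoints
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [http_2xx]
  - job_name: 'blackbox-secondary'   # e.g. marketing pages, docs
    scrape_interval: 5m
    metrics_path: /probe
    params:
      module: [http_2xx]
```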

16) Quality checklist

  • /healthz, /readyz, /startupz exist with clear semantics.
  • Probes from ≥3 regions/ASNs, IPv4/IPv6.
  • TLS/DNS checks with alerts at T-30/T-7/T-1 days.
  • Heartbeats on all critical jobs (plus business "silence" alerts).
  • Multi-window + quorum, no flapping.
  • Drilldown: links to logs/traces/dashboards.
  • Status page and communication templates.
  • Documented SLOs/metrics with owners.

17) Implementation plan (3 iterations)

1. Week 1: HTTP/TLS/DNS blackbox probes for critical domains, status page, basic alerts.
2. Week 2: multi-region probes, quorum rules, heartbeats for top jobs, Watchdog.
3. Week 3: headless scripts (login/deposit), SLO reporting, integration with the incident process.

18) Mini-FAQ

Why are external probes better than internal ones?
External probes see the real user path (DNS/CDN/WAF); internal ones see the origin state. You need both.

Do I need to monitor payment providers (PSPs)?
Yes: synthetics against the PSP sandbox and status-page monitoring; on degradation, automatic smart routing.

How to reduce noise?
Quorum, multi-window, 'for:' delays, maintenance suppression, clear SLO thresholds and ownership.

Summary

Uptime monitoring is not just ping. It is a system: multi-region synthetics + well-designed health endpoints + job heartbeats + SLOs/alerts + status pages. Standardize the checks, reduce noise, protect the probes and tie everything into the incident process; this is how you cut MTTR and preserve the error budget.

Contact

Get in touch with any questions or support needs.

Telegram: @Gamble_GC