
DNS routing and failover

1) The role of DNS in fault tolerance

DNS is the user's first "router." The following depend on its design:
  • Availability (fast/reliable failover);
  • Performance (geo/latency-routing);
  • Cost (minimizing interregional egress and 3rd-party calls);
  • Security (DNSSEC, anti-hijack, CAA/DMARC/SPF control).

Key: short TTLs where records change often, and a stable zone architecture (public + private, split-horizon).

2) Types of records and practices

A/AAAA - main addresses; publish AAAA (IPv6) wherever possible.
CNAME vs ALIAS/ANAME: a CNAME cannot sit at the zone apex, so at the root of the domain use ALIAS/ANAME (or the provider's apex flattening).
TXT - SPF/DMARC/DKIM, domain verification; CAA - restricts which CAs may issue certificates.
SRV/NS - service discovery and delegation.
SVCB/HTTPS - a modern alternative mechanism with priorities and parameters (ALPN, ports).

Recommendation: set standard TTLs per record class (edge/API/static).

3) Routing policies

Weighted - controlled shares of traffic (canaries/blue-green).
Latency-based - selects the pool with the lowest latency to the client.
Geo-routing - by country/continent/region; important for data residency.
Failover (primary/secondary) - active monitoring and switching.
Multi-value - several A/AAAA records; the client picks one itself (does not replace health-checks).
Proximity/ASN routing - available with some providers: routing by the client's network.

Combine: geo → latency → weight → health.
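The combined policy can be sketched as a small selection function. This is an illustrative model, not any provider's algorithm: the `Pool` fields, the `latency_slack_ms` parameter, and the fallback rules are assumptions. Health is applied first as a filter so that an unhealthy pool can never win an earlier stage.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str
    region: str        # geo region the pool serves, e.g. "eu" (hypothetical labels)
    latency_ms: float  # measured latency towards the client's resolver
    weight: int        # relative traffic share; 0 means "standby"
    healthy: bool


def pick_pool(pools, client_region, rng=random, latency_slack_ms=20.0):
    """Apply geo -> latency -> weight, considering only healthy pools."""
    healthy = [p for p in pools if p.healthy]
    if not healthy:
        return None
    # Geo: prefer pools in the client's region, else fall back to any healthy pool.
    candidates = [p for p in healthy if p.region == client_region] or healthy
    # Latency: keep pools within a slack of the fastest candidate.
    best = min(p.latency_ms for p in candidates)
    candidates = [p for p in candidates if p.latency_ms <= best + latency_slack_ms]
    # Weight: weighted random choice; if all weights are 0, choose uniformly.
    weights = [p.weight for p in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With one healthy EU pool, one unhealthy EU pool, and one healthy US pool, an EU client always lands on the healthy EU pool, and a client from an unlisted region gets whichever healthy pool is fastest.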

4) TTL, caching and propagation

API/dynamic records: TTL 30-120 s (a balance between failover speed and query load).
Static/CDN: 1-24 h.

Negative TTL (SOA `MINIMUM`): keep it within 60-300 s, otherwise NXDOMAIN answers become "sticky."

Remember: resolvers are not required to drop cached answers immediately; account for the "dirty tail" of clients that keep using old records after the TTL expires.
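To see how TTL drives failover speed, a rough estimate helps. The sketch below assumes a simple model (detection time plus one full TTL plus an optional tail); the function name, the profile values, and the check interval are illustrative, not measurements.

```python
# Hypothetical TTL profiles per record class, in seconds.
TTL_PROFILES_S = {"api": 60, "static": 3600}


def worst_case_failover_s(check_interval_s, failure_threshold, ttl_s, stale_tail_s=0):
    """Rough upper estimate of how long clients may still reach a failed pool.

    Detection takes `failure_threshold` consecutive failed checks; after the
    DNS answer changes, cached answers can live for up to one TTL, plus any
    extra "tail" from resolvers that hold records longer.
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + ttl_s + stale_tail_s


api_window = worst_case_failover_s(10, 3, TTL_PROFILES_S["api"])        # 90
static_window = worst_case_failover_s(10, 3, TTL_PROFILES_S["static"])  # 3630
```

Under these assumptions a 60 s TTL keeps the window at about a minute and a half, while a one-hour TTL stretches it past an hour regardless of how fast the health-check reacts.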

5) Health and checking endpoints

Health-checks from multiple regions: TCP/443 + HTTP 2xx/3xx plus business-criteria checks (e.g. a successful `/health?deep=true` that also verifies dependencies).
Synthetics and RUM: API probes along the main routes, TLS/OCSP checks, DNSSEC checks.
Expose `/ready` (deep) and `/live` (shallow); bind the DNS pool to `/ready`.
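The difference between the shallow and deep endpoints, and the multi-region view, might look like the sketch below. The function names, the dependency names, and the majority quorum are assumptions made for illustration; real providers apply their own thresholds.

```python
from typing import Callable, Dict, Tuple


def live() -> int:
    """Shallow check: the process is up and can answer."""
    return 200


def ready(dependency_checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Deep check: every dependency must pass, and a raising check counts as failed."""
    results = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


def pool_healthy(region_results: Dict[str, bool], quorum: float = 0.5) -> bool:
    """Treat the pool as healthy if more than `quorum` of checker regions see it ready."""
    if not region_results:
        return False
    ok = sum(1 for passed in region_results.values() if passed)
    return ok / len(region_results) > quorum
```

Requiring agreement from several checker regions keeps a single broken probe location from pulling a healthy pool out of DNS.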

6) Public vs private DNS (split-horizon)

Public zone - client access.
Private zone - internal resolution to private endpoints (VPC/VNet, on-prem).
Conditional forwarding between on-prem ↔ cloud, region ↔ region.
Naming: `api.<brand>.<region>.internal.corp` and `api.<brand>.com`.
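Split-horizon boils down to answering the same question differently depending on where the query comes from. A minimal sketch, assuming RFC 1918 ranges mark internal clients and using made-up names and documentation-range addresses:

```python
import ipaddress

# Assumption: clients from RFC 1918 ranges get the private view.
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Hypothetical zone data: the same name resolves differently per view.
ZONES = {
    "private": {
        "api.example.com": "10.20.0.15",
        "db.example.internal": "10.20.0.40",
    },
    "public": {
        "api.example.com": "203.0.113.10",
    },
}


def view_for(client_ip: str) -> str:
    """Pick the zone view from the client's source address."""
    addr = ipaddress.ip_address(client_ip)
    return "private" if any(addr in net for net in PRIVATE_NETS) else "public"


def resolve(name: str, client_ip: str):
    """Return the answer for `name` in the client's view, or None if absent."""
    return ZONES[view_for(client_ip)].get(name)
```

An internal client asking for `api.example.com` gets the private address, an external one gets the public address, and internal-only names simply do not exist in the public view.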

7) Security: DNSSEC and domain policy

DNSSEC: enable zone signing (KSK/ZSK), monitor key rotation and the chain of trust.
CAA: list the permitted CAs; include `iodef` to receive violation reports.
SPF/DMARC/DKIM: mail reputation and protection against phishing.
Registrar lock and MFA for DNS provider accounts; change log in a WORM store.
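For reference, CAA records in zone-file presentation format can be generated as below. The helper name, the issuer, and the report address are example values chosen for illustration, not a recommendation of a specific CA.

```python
def caa_records(domain, issuers, iodef=None, ttl=3600):
    """Render CAA records: one `issue` line per permitted CA, plus optional `iodef`."""
    lines = [f'{domain}. {ttl} IN CAA 0 issue "{ca}"' for ca in issuers]
    if iodef:
        lines.append(f'{domain}. {ttl} IN CAA 0 iodef "{iodef}"')
    return lines


for line in caa_records("example.com", ["letsencrypt.org"], "mailto:security@example.com"):
    print(line)
```

Each `issue` line names one CA allowed to issue for the domain, and the `iodef` line tells CAs where to report requests that violate the policy.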

8) Designing failover

8.1 Models

Active-Active: two or more healthy pools; balance by latency/weight, health-checks exclude unhealthy pools.
Active-Passive: main pool + standby (0% weight until an incident).
Regional ring: traffic moves to the "neighboring" region in a local disaster.
Degraded mode: switch to a "lightweight" site/landing page if the backend is unavailable.

8.2 Step-by-step scenario

1. Monitoring detects degradation of `/ready`.
2. DNS changes its responses (removes the pool or changes weights).
3. Traffic moves to a healthy region; the TTL determines how fast.
4. After stabilization, wait out a grace period (15-30 min) and only then restore the weights.
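The grace period in step 4 is what prevents flapping. The steps can be sketched as a tiny controller; the class name, the two pool labels, and the 15-minute default are assumptions for illustration, with time passed in explicitly so the logic is easy to test.

```python
class FailoverController:
    """Active-passive switch with a grace period before failing back."""

    def __init__(self, grace_s=15 * 60):
        self.grace_s = grace_s
        self.active = "primary"
        self._healthy_since = None  # when the primary last became ready again

    def observe(self, primary_ready, now):
        """Feed one /ready observation (now is in seconds); return the pool to serve."""
        if not primary_ready:
            # Steps 1-2: degradation detected, switch answers to the standby.
            self.active = "secondary"
            self._healthy_since = None
        elif self.active == "secondary":
            # Step 4: primary is back, but wait out the grace period first.
            if self._healthy_since is None:
                self._healthy_since = now
            if now - self._healthy_since >= self.grace_s:
                self.active = "primary"
                self._healthy_since = None
        return self.active
```

Any failed check during the grace period resets the timer, so a primary that recovers and immediately degrades again never receives traffic back prematurely.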

9) Configuration examples

9.1 AWS Route 53 - latency + health + weighted

```hcl
# Two latency alias records for different regions.
# Alias records take their TTL from the target, so `ttl` is omitted
# (in the Terraform AWS provider it conflicts with `alias`).
resource "aws_route53_record" "api_latency_eu" {
  zone_id         = var.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "eu1"
  health_check_id = aws_route53_health_check.api_eu.id
  latency_routing_policy { region = "eu-central-1" }
  alias {
    name                   = aws_lb.api_eu.dns_name
    zone_id                = aws_lb.api_eu.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_latency_us" {
  zone_id         = var.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "us1"
  health_check_id = aws_route53_health_check.api_us.id
  latency_routing_policy { region = "us-east-1" }
  alias {
    name                   = aws_lb.api_us.dns_name
    zone_id                = aws_lb.api_us.zone_id
    evaluate_target_health = true
  }
}

# Canary in EU: weight 10.
# Route 53 requires all records with the same name and type to share one routing
# policy, so the weighted set uses its own name (an assumed "eu.api.example.com").
# A stable record (e.g. weight 90) would share this name to make the canary 10%.
resource "aws_route53_record" "api_weighted_canary" {
  zone_id        = var.zone_id
  name           = "eu.api.example.com"
  type           = "A"
  set_identifier = "eu1-canary"
  weighted_routing_policy { weight = 10 }
  alias {
    name                   = aws_lb.api_eu_canary.dns_name
    zone_id                = aws_lb.api_eu_canary.zone_id
    evaluate_target_health = true
  }
}
```

9.2 Cloudflare - geo/ASN and failover pool (idea)

Load Balancer pools with health-checks (HTTP/TCP); a Load Balancer with Geo Steering (continents/countries) and session affinity.
Fallback: Page Rule/Transform Rule to a simplified backend during 5xx spikes.

9.3 Azure/GCP

Azure Traffic Manager: Priority/Weighted/Performance/Geographic.
Google Cloud Load Balancing + Cloud DNS policy: geo-policy + health-checks via External HTTP(S) LB.

10) Observability and DNS SLO

SLI: resolution success rate, 95th percentile of resolution time, share of fresh (non-stale) responses within the TTL.
SLO: for example, `99.95%` of responses succeed within 100 ms.
Metrics: NXDOMAIN rate, SERVFAIL rate, pool health state, traffic share by region, canary share.
Exemplars: link SLIs to HTTP traces via `trace_id` in synthetics.
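These SLIs can be computed from raw probe results along the following lines. The sample format, the function name, and the nearest-rank percentile method are assumptions of this sketch; a monitoring system would normally do this over a time window.

```python
import math


def dns_sli(samples, latency_slo_ms=100.0):
    """Compute DNS SLIs from probe samples given as (ok, latency_ms) tuples."""
    total = len(samples)
    if total == 0:
        raise ValueError("no samples")
    ok_latencies = sorted(lat for ok, lat in samples if ok)
    # "Good" events for the SLO: successful AND within the latency target.
    good = sum(1 for lat in ok_latencies if lat <= latency_slo_ms)
    p95 = None
    if ok_latencies:
        # Nearest-rank percentile over successful resolutions.
        rank = math.ceil(0.95 * len(ok_latencies))
        p95 = ok_latencies[rank - 1]
    return {
        "success_rate": len(ok_latencies) / total,
        "good_ratio": good / total,
        "p95_ms": p95,
    }
```

Comparing `good_ratio` with the SLO target (0.9995 in the example above) shows how much error budget remains, while `p95_ms` flags latency regressions before they breach the target.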

11) Testing and operation

Synthetics from different ASN/regions (RIPE Atlas, Catchpoint, k6-DNS).
dnsviz/`delv` to check DNSSEC; `dig +trace` for anomalies.
Staging zone (`stg.example.com`) for failover rehearsals; the rehearsal script changes weights/priorities and then restores them.

Runbook: who raises/lowers weights manually and how, how to disable a pool, how to perform a "freeze."
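A rehearsal script should always restore the original state, even if the drill fails midway. One way to sketch that guarantee is a context manager; here the weights live in a plain dictionary standing in for the DNS provider's API, which is an assumption of this example.

```python
from contextlib import contextmanager


@contextmanager
def rehearsal(weights, overrides):
    """Temporarily apply `overrides` to `weights`, then restore the snapshot."""
    snapshot = dict(weights)
    weights.update(overrides)
    try:
        yield weights
    finally:
        # Runs on normal exit and on exceptions, so the drill cannot leak.
        weights.clear()
        weights.update(snapshot)


staging = {"eu": 100, "us": 0}
with rehearsal(staging, {"eu": 0, "us": 100}) as during:
    shifted = dict(during)  # here the drill would run its probes
```

After the `with` block the staging weights are back to their starting values, which is the property the runbook relies on.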

12) Antipatterns

TTL of 3000+ s on critical A/AAAA → slow/chaotic failover.
No health-checks, or TCP-only port checks without business invariants.
Long CNAME chains → slow resolution, cache chaos.
A single DNS provider without a secondary/AXFR backup.
Unsigned zone when DNSSEC is required; outdated CAA records.
Records pointing to the public IPs of private backends/databases.
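Some of these antipatterns can be caught automatically before a zone change ships. The linter below is a minimal sketch covering only the TTL and CNAME-chain items; the record format, thresholds, and function name are assumptions.

```python
def lint_zone(records, critical_names, max_ttl_s=300, max_cname_depth=2):
    """Flag high TTLs on critical A/AAAA records and long CNAME chains.

    `records` is a list of dicts with the keys: name, type, ttl, value.
    """
    findings = []
    for r in records:
        is_critical = r["name"] in critical_names and r["type"] in ("A", "AAAA")
        if is_critical and r["ttl"] > max_ttl_s:
            findings.append(f"{r['name']}: TTL {r['ttl']}s exceeds {max_ttl_s}s")

    cname = {r["name"]: r["value"] for r in records if r["type"] == "CNAME"}
    for start in cname:
        depth, current, seen = 0, start, set()
        # Follow the chain; `seen` guards against CNAME loops.
        while current in cname and current not in seen:
            seen.add(current)
            current = cname[current]
            depth += 1
        if depth > max_cname_depth:
            findings.append(f"{start}: CNAME chain of length {depth}")
    return findings
```

Running such a check in CI against the zone definition turns the list above from advice into a gate.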

13) Specifics of iGaming/Finance

Jurisdictions: geo/country routing for compliance (redirection to a local domain/front).
PSP/KYC: dedicated subdomains with their own TTL and failover policies; fast switchover to a standby PSP.
Responsible gaming: subdomains with legal pages must always be available (backup static/CDN).
Audit: log zone changes to a WORM store, sign changes, and review them regularly.
Block lists: DNS compliance rules by region (edge filtering + DNS routing).

14) Prod Readiness Checklist

  • TTL profiles by class; negative TTL ≤ 300 s.
  • Two independent DNS providers (primary/secondary), MFA/registrar lock.
  • Policies: geo/latency/weight + health-checks from multiple regions.
  • DNSSEC enabled, CAA/DMARC/DKIM/SPF up to date.
  • Split-horizon (public/private), private zones for internal traffic.
  • Failover/failback runbook, rehearsal script, canary domains.
  • SLI/SLO monitoring, alerts on NXDOMAIN/SERVFAIL/RTT growth.
  • Staging zone and regular failover "drills."
  • For iGaming: routing by jurisdiction, separate domains for PSP/KYC, immutable audit log.

15) TL;DR

Build a combined policy: geo/latency + health-checks + weights, with a TTL of 30-120 s on dynamic records. Separate public and private zones (split-horizon), enable DNSSEC and CAA, and keep a secondary DNS. Rehearse failover and track DNS SLIs/SLOs. For iGaming, account for jurisdictions and keep standby PSP/KYC domains with separate rules, logging changes to a WORM store.
