DNS routing and failover

1) The role of DNS in fault tolerance

DNS is the user's first "router." The following depend on its design:

Availability (fast/reliable failover);
Performance (geo/latency-routing);
Cost (minimizing interregional egress and 3rd-party calls);
Security (DNSSEC, anti-hijack, CAA/DMARC/SPF control).

Key: short TTLs where records change often, and a stable zone architecture (public + private, split-horizon).

2) Types of records and practices

A/AAAA - main addresses; always publish IPv6 wherever possible.
CNAME vs ALIAS/ANAME: a CNAME cannot coexist with other records at the zone apex, so at the root of the domain use ALIAS/ANAME (or the provider's apex flattening).
TXT - SPF/DMARC/DKIM, domain verification; CAA - restricts which certificate authorities may issue for the domain.
SRV/NS - service discovery and delegation.
SVCB/HTTPS - a modern alternative mechanism with priorities and parameters (ALPN, ports).

Recommendation: define standard TTLs per class (edge/API/static).

3) Routing policies

Weighted - controlled shares of traffic (canaries/blue-green).
Latency-based - selects the pool with the lowest latency to the client.
Geo-routing - by country/continent/region; important for data residency.
Failover (primary/secondary) - active monitoring and switching.
Multi-value - several A/AAAA; the client chooses itself (does not replace health-checks).
Proximity/ASN routing - offered by some providers: routes by the client's network.

Combine: geo → latency → weight → health.
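
The combined chain can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Pool` fields, the `latency_slack_ms` parameter and the fallback to other regions are hypothetical, and health is applied as an up-front filter so that the later stages only ever see healthy pools.

```python
import random
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    region: str        # geo region the pool serves (assumed label)
    latency_ms: float  # latency measured towards the client
    weight: int        # relative traffic share
    healthy: bool      # result of the health-checks


def choose_pool(pools, client_region, latency_slack_ms=20.0, rng=random):
    """Pick a pool via geo -> latency -> weight, skipping unhealthy pools."""
    candidates = [p for p in pools if p.healthy and p.weight > 0]
    if not candidates:
        return None
    # Geo: prefer pools in the client's region. Falling back to other regions
    # is an assumption; a data-residency setup would return None instead.
    in_region = [p for p in candidates if p.region == client_region]
    candidates = in_region or candidates
    # Latency: keep pools within a small slack of the fastest one.
    best = min(p.latency_ms for p in candidates)
    candidates = [p for p in candidates if p.latency_ms <= best + latency_slack_ms]
    # Weight: weighted random choice among what is left.
    return rng.choices(candidates, weights=[p.weight for p in candidates], k=1)[0]
```

Passing a seeded `random.Random` as `rng` makes the weighted step reproducible in tests.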

4) TTL, caching and propagation

API/dynamic endpoints: TTL 30-120 s (a balance between failover speed and resolver load).
Static/CDN: 1-24 h.

Negative TTL (SOA `Minimum`): keep it at 60-300 s at most, otherwise NXDOMAIN responses will be "sticky."

Remember: resolvers are not required to drop cached entries immediately; plan for a long tail of stale answers.
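
To make the TTL trade-off concrete, a rough upper bound on the failover window can be estimated as detection time plus TTL plus an allowance for the stale tail. The function below is a back-of-the-envelope sketch; the parameter names and the additive model are assumptions, not a provider formula.

```python
def worst_case_failover_seconds(check_interval_s, failure_threshold, ttl_s, stale_tail_s=0):
    """Rough upper bound on how long a client may keep hitting a dead endpoint."""
    # Detection: the health-check must fail `failure_threshold` times in a row.
    detection_s = check_interval_s * failure_threshold
    # A resolver that cached the old answer just before the switch keeps it
    # for a full TTL; stale_tail_s models resolvers that hold it even longer.
    return detection_s + ttl_s + stale_tail_s
```

With a 10 s check interval, 3 consecutive failures and a 60 s TTL the bound is 90 s; the same checks with a 1 h TTL give 3630 s, which is why long TTLs do not suit records that take part in failover.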

5) Health and checking endpoints

Health-checks from multiple regions: TCP/443 + HTTP 2xx/3xx and Lambda-based business-criteria checks (e.g. a successful `/health?deep=true` with dependency checking).
Synthetic monitoring (RUM/active): API probes along the main user routes, TLS/OCSP checks, DNSSEC checks.
Expose `/ready` (deep) and `/live` (shallow); bind the DNS pool to `/ready`.
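
The difference between the two endpoints can be shown with a minimal, framework-free sketch. The handler names, the `(status, body)` return shape and the dependency probes are assumptions for illustration only.

```python
from typing import Callable, Dict, Tuple


def live() -> Tuple[int, str]:
    """Shallow check: the process is up and can answer at all."""
    return 200, "ok"


def ready(dependencies: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Deep check: every dependency probe must succeed."""
    failed = []
    for name, probe in dependencies.items():
        try:
            ok = probe()
        except Exception:
            # A probe that raises counts as a failed dependency.
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, "failing: " + ", ".join(sorted(failed))
    return 200, "ok"
```

Because the DNS pool is bound to the deep check, a failed database or PSP probe takes the pool out of rotation even though the process itself is still alive.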

6) Public vs private DNS (split-horizon)

Public zone - client access.
Private zone - internal resolution to private endpoints (VPC/VNet, on-prem).
Conditional forwarding between on-prem ↔ cloud and region ↔ region.
Naming: `api.<brand>.<region>.internal.corp` and `api.<brand>.com`.
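
Split-horizon boils down to choosing a view by the client's source network. The sketch below illustrates the idea with an in-memory table; the network range, the names and the addresses are placeholders, not a real configuration.

```python
import ipaddress

# Hypothetical internal range; a real setup would list its VPC/VNet and on-prem CIDRs.
PRIVATE_NETS = [ipaddress.ip_network("10.0.0.0/8")]

# One name, two answers: a private endpoint inside, a public address outside.
ZONES = {
    "private": {"api.example.com": "10.20.0.15"},
    "public": {"api.example.com": "203.0.113.10"},
}


def resolve(name, client_ip):
    """Return (view, address) depending on where the query comes from."""
    addr = ipaddress.ip_address(client_ip)
    view = "private" if any(addr in net for net in PRIVATE_NETS) else "public"
    return view, ZONES[view].get(name)
```

A query from `10.1.2.3` is answered from the private view, one from `198.51.100.7` from the public view.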

7) Security: DNSSEC and domain policy

DNSSEC: enable zone signing (KSK/ZSK); monitor key rotation and the chain of trust.
CAA: list the permitted CAs; include `iodef` for violation reports.
SPF/DMARC/DKIM: mail reputation and protection against phishing.
Registrar lock and MFA for DNS provider accounts; change log (WORM store).

8) Designing failover

8.1 Models

Active-Active: two or more healthy pools; balance through latency/weight, health-checks exclude unhealthy pools.
Active-Passive: main pool + reserve (0% weight until an incident).
Regional ring: traffic moves to the "neighboring" region in a local disaster.
Degraded mode: route to a lightweight site/landing page if the backend is not available.

8.2 Step-by-step scenario

1. Monitoring detects degradation of `/ready`.
2. DNS changes its responses (removes the pool or changes weights).
3. Traffic moves to a healthy region; the TTL determines how fast.
4. After stabilization, wait out a grace period (15-30 min) and only then restore the weights.
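
The scenario, including the grace period before failback, can be modeled as a small state machine. This is a sketch under assumptions: the class name, the two-pool setup and the caller-supplied clock are illustrative, and a real controller would call the DNS provider's API instead of flipping a field.

```python
class FailoverController:
    """Switch to the secondary on failure; return only after a grace period."""

    def __init__(self, grace_period_s=900):  # 15 min, the lower bound from step 4
        self.grace_period_s = grace_period_s
        self.active = "primary"
        self._healthy_since = None

    def observe(self, primary_ready, now):
        """Feed one /ready observation taken at time `now` (seconds)."""
        if not primary_ready:
            # Steps 1-2: degradation detected, answers switch to the secondary.
            self.active = "secondary"
            self._healthy_since = None
        elif self.active == "secondary":
            # Step 4: the primary must stay healthy for the whole grace period.
            if self._healthy_since is None:
                self._healthy_since = now
            elif now - self._healthy_since >= self.grace_period_s:
                self.active = "primary"
                self._healthy_since = None
        return self.active
```

Any failed observation during the grace period resets the timer, which prevents flapping between pools while the primary is still unstable.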

9) Configuration examples

9.1 AWS Route 53 - latency + health + weighted

```hcl
# Two latency-based alias records for different regions.
# Alias records take their TTL from the target, so ttl is not set on them.
resource "aws_route53_record" "api_latency_eu" {
  zone_id         = var.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "eu1"
  health_check_id = aws_route53_health_check.api_eu.id
  latency_routing_policy { region = "eu-central-1" }
  alias {
    name                   = aws_lb.api_eu.dns_name
    zone_id                = aws_lb.api_eu.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_latency_us" {
  zone_id         = var.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "us1"
  health_check_id = aws_route53_health_check.api_us.id
  latency_routing_policy { region = "us-east-1" }
  alias {
    name                   = aws_lb.api_us.dns_name
    zone_id                = aws_lb.api_us.zone_id
    evaluate_target_health = true
  }
}

# Canary in EU: weight 10 (10% when paired with a weight-90 record for the main EU LB).
# Route 53 does not allow mixing routing policies under one name and type, so the
# weighted set uses its own name; "eu.api.example.com" is an assumed placeholder.
resource "aws_route53_record" "api_weighted_canary" {
  zone_id        = var.zone_id
  name           = "eu.api.example.com"
  type           = "A"
  set_identifier = "eu1-canary"
  weighted_routing_policy { weight = 10 }
  alias {
    name                   = aws_lb.api_eu_canary.dns_name
    zone_id                = aws_lb.api_eu_canary.zone_id
    evaluate_target_health = true
  }
}
```

9.2 Cloudflare - geo/ASN and failover pool (idea)

Load Balancer pools with health-checks (HTTP/TCP); a Load Balancer with Geo Steering (continents/countries) and session affinity.
Fallback: Page Rule/Transform Rule to a simplified backend at 5xx peaks.

9.3 Azure/GCP

Azure Traffic Manager: Priority/Weighted/Performance/Geographic.
Google Cloud Load Balancing + Cloud DNS policy: geo-policy + health-checks via External HTTP(S) LB.

10) Observability and DNS SLO

SLI: resolution success rate, 95th percentile of resolution time, share of fresh (non-stale) responses within TTL.
SLO: for example, `99.95%` of successful responses ≤ 100 ms.
Metrics: NXDOMAIN rate, SERVFAIL rate, pool health state, traffic share by region, canary share.
Exemplars: link SLIs to HTTP traces via `trace_id` in synthetics.
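
The first two SLIs and the example SLO can be computed from raw probe samples roughly as follows. The sample format and the nearest-rank percentile are assumptions, and the SLO is read here as "at least 99.95% of all probes succeed within 100 ms", which is one possible interpretation of the wording above.

```python
import math


def dns_sli(samples):
    """samples: list of (succeeded, resolution_ms). Returns (success_rate, p95_ms)."""
    total = len(samples)
    ok_times = sorted(ms for ok, ms in samples if ok)
    success_rate = len(ok_times) / total if total else 0.0
    # Nearest-rank 95th percentile over successful resolutions only.
    p95 = ok_times[math.ceil(0.95 * len(ok_times)) - 1] if ok_times else None
    return success_rate, p95


def meets_slo(samples, target=0.9995, threshold_ms=100.0):
    """True if the share of probes that succeed within threshold_ms reaches target."""
    if not samples:
        return False
    good = sum(1 for ok, ms in samples if ok and ms <= threshold_ms)
    return good / len(samples) >= target
```

Feeding it per-region sample windows makes it easy to alert when one vantage point degrades while the global figure still looks fine.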

11) Testing and operation

Synthetics from different ASN/regions (RIPE Atlas, Catchpoint, k6-DNS).
dnsviz/`delv` to check DNSSEC; `dig +trace` for anomalies.
Staging zone (`stg.example.com`) for failover rehearsals; the rehearsal script changes weights/priorities and then restores them.
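
The "change and restore" part of a rehearsal script maps naturally onto a context manager. In this sketch a plain dict stands in for the provider's weight settings; the function name and the dict are hypothetical, and a real script would call the DNS provider's API in both places.

```python
from contextlib import contextmanager


@contextmanager
def failover_rehearsal(weights, drained_pool):
    """Temporarily drain one pool (weight 0), then restore the original weights."""
    original = dict(weights)
    weights[drained_pool] = 0
    try:
        yield weights
    finally:
        # Restore even if the checks inside the rehearsal raise.
        weights.clear()
        weights.update(original)
```

Typical use is `with failover_rehearsal(weights, "eu"):` followed by the synthetic checks; the original weights come back whether the checks pass or fail.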

Runbook: who raises/lowers weights manually and how, how to turn off a pool, how to perform a "freeze."

12) Antipatterns

TTL of 3000+ s on critical A/AAAA → slow/chaotic failover.
No health-checks, or TCP-only port checks without business invariants.
Long CNAME chains → slow resolution, cache chaos.
A single DNS provider with no secondary/AXFR backup.
Unsigned zone where DNSSEC is required; outdated CAA records.
Records that expose public IPs of private backends/databases.
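
Two of these antipatterns are easy to catch automatically before a zone change is applied. The linter below is a minimal sketch: the record dict format, the `critical` flag and both thresholds are assumptions chosen for illustration.

```python
def lint_records(records, max_ttl_s=300, max_cname_depth=2):
    """records: dicts with 'name', 'type', 'value', 'ttl' and optional 'critical'."""
    findings = []
    cnames = {r["name"]: r["value"] for r in records if r["type"] == "CNAME"}
    # Long TTLs on critical address records slow failover down.
    for r in records:
        if r["type"] in ("A", "AAAA") and r.get("critical") and r["ttl"] > max_ttl_s:
            findings.append(f"{r['name']}: TTL {r['ttl']} s too long for a critical record")
    # Follow each CNAME chain; `seen` guards against loops.
    for start in cnames:
        depth, seen, cur = 0, set(), start
        while cur in cnames and cur not in seen:
            seen.add(cur)
            cur = cnames[cur]
            depth += 1
        if depth > max_cname_depth:
            findings.append(f"{start}: CNAME chain of depth {depth}")
    return findings
```

Running such a check in CI against the zone definition keeps slow-failover TTLs and deep chains from reaching production unnoticed.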

13) Specifics of iGaming/Finance

Jurisdictions: geo/country-routing for compliance (redirection to local domain/front).
PSP/KYC: dedicated subdomains with individual TTL and failover policies; fast switchover to a standby PSP.
Responsible gaming: subdomains with legal pages must always be available (backup static/CDN).
Audit: log zone changes to a WORM store, sign changes, and review them regularly.
Block lists: DNS compliance rules by region (edge filtering + DNS routing).

14) Prod Readiness Checklist

TTL profiles by class; negative TTL ≤ 300 s.
Two independent DNS networks (primary/secondary), MFA/registrar lock.
Policies: geo/latency/weight + health-checks from multiple regions.
DNSSEC enabled, CAA/DMARC/DKIM/SPF up to date.
Split-horizon (public/private), private zones for internal traffic.
Failover/failback runbook, rehearsal script, canary domains.
SLI/SLO monitoring, alerts on NXDOMAIN/SERVFAIL/RTT growth.
Staging zone and regular failover "drills."
For iGaming: routing by jurisdiction, separate domains for PSP/KYC, immutable audit log.

15) TL;DR

Build a combined policy: geo/latency + health-checks + weights, with a TTL of 30-120 s on dynamic endpoints. Separate public and private zones (split-horizon), enable DNSSEC and CAA, and keep a secondary DNS. Rehearse failover and track DNS SLIs/SLOs. For iGaming, account for jurisdictions and keep standby PSP/KYC domains with separate rules and change logging in a WORM store.
