DNS routing and failover
1) The role of DNS in fault tolerance
DNS is the user's first "router." The following depend on its design:
- Availability (fast, reliable failover);
- Performance (geo/latency routing);
- Cost (minimizing inter-region egress and third-party calls);
- Security (DNSSEC, anti-hijack, CAA/DMARC/SPF control).
Key: short TTLs where records change often, and a stable zone architecture (public + private, split-horizon).
2) Types of records and practices
A/AAAA - main addresses; always publish IPv6 wherever possible.
CNAME vs ALIAS/ANAME: at the root of the domain, use ALIAS/ANAME (or the provider's apex flattening), because a CNAME cannot coexist with the other records at the zone apex.
TXT - SPF/DMARC/DKIM, domain verification; CAA - restricts which CAs may issue certificates.
SRV/NS - service discovery and delegation.
SVCB/HTTPS - a modern alternative mechanism with priorities and parameters (ALPN, ports).
Recommendation: define standard TTLs per record class (edge/API/static).
3) Routing policies
Weighted - controlled shares of traffic (canaries/blue-green).
Latency-based - selects the pool with the lowest latency to the client.
Geo-routing - by country/continent/region; important for data residency.
Failover (primary/secondary) - active monitoring and switching.
Multi-value - several A/AAAA records; the client picks one itself (does not replace health-checks).
Proximity/ASN routing - available with some providers: routing by the client's network.
Combine: geo → latency → weight → health.
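The combined policy can be sketched as a small selection function. This is an illustration only, not any provider's algorithm: the `Pool` fields, the 10 ms tie tolerance, and the choice to apply health as a veto before the latency comparison (so that an unhealthy nearest pool never wins) are all assumptions of this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    countries: set      # countries this pool is allowed to serve (geo / residency)
    latency_ms: float   # latency measured from the client's vantage point
    weight: int         # traffic share among equally near pools
    healthy: bool


def choose_pool(pools, country, tolerance_ms=10.0, rng=random):
    """Apply geo, health, latency and weight in turn; None if nothing qualifies."""
    # Geo filter plus health veto; zero-weight pools receive no traffic.
    candidates = [
        p for p in pools
        if country in p.countries and p.healthy and p.weight > 0
    ]
    if not candidates:
        return None
    # Latency: keep pools within the tolerance of the fastest one.
    best = min(p.latency_ms for p in candidates)
    nearest = [p for p in candidates if p.latency_ms - best <= tolerance_ms]
    # Weight: random pick proportional to the configured shares.
    return rng.choices(nearest, weights=[p.weight for p in nearest], k=1)[0]
```

With one EU pool at 20 ms and one US pool at 90 ms, a German client is sent to the EU pool; if the EU pool turns unhealthy, the same call returns the US pool.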
4) TTL, caching and propagation
TTL for APIs/dynamic records: 30-120 s (balance between failover speed and resolver load).
Static/CDN: 1-24 h.
Negative TTL (SOA 'Minimum') - keep within 60-300 s, otherwise NXDOMAIN answers become "sticky."
Remember: resolvers are not required to drop cached entries immediately; plan for a "dirty tail" of clients that keep using stale answers.
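To see how the TTL feeds into failover speed, here is a rough budget calculation. It assumes a simple model where the health check needs a fixed number of consecutive failures and every resolver honours the TTL; the function and parameter names are illustrative.

```python
def worst_case_failover_s(check_interval_s, failure_threshold, ttl_s, propagation_s=0):
    """Upper bound on the time until clients stop receiving the failed pool.

    detection   = health-check interval * consecutive failures required
    propagation = time for the DNS provider to publish the new answer
    ttl         = longest time a well-behaved resolver keeps the old answer
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + propagation_s + ttl_s


# 10 s checks, 3 failures to trip, TTL 60 s
print(worst_case_failover_s(10, 3, 60))     # 90
# The same checks with a one-hour TTL
print(worst_case_failover_s(10, 3, 3600))   # 3630
```

Resolvers that hold records past their TTL extend this bound, which is the "dirty tail" mentioned above.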
5) Health and checking endpoints
Health-checks from multiple regions: TCP/443 + HTTP 2xx/3xx, plus business-criteria checks (e.g. a successful '/health?deep=true' with dependency checking).
Synthetic monitoring (RUM/active): API probes along the main user paths, TLS/OCSP checks, DNSSEC checks.
Expose '/ready' (deep) and '/live' (shallow); bind the DNS pool to '/ready'.
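A minimal sketch of combining probe results from several regions into one pool state. The majority quorum and the input shape (region name mapped to an HTTP status, or `None` for a timeout or TCP failure) are assumptions made for illustration.

```python
def probe_ok(status_code):
    """A probe passes on HTTP 2xx/3xx; None means no response at all."""
    return status_code is not None and 200 <= status_code < 400


def pool_healthy(results_by_region, quorum=0.5):
    """Healthy only if more than `quorum` of the probing regions passed."""
    if not results_by_region:
        return False  # no data is treated as unhealthy
    passed = sum(1 for status in results_by_region.values() if probe_ok(status))
    return passed / len(results_by_region) > quorum


print(pool_healthy({"eu": 200, "us": 503, "ap": 301}))  # True (2 of 3 passed)
print(pool_healthy({"eu": 200, "us": None}))            # False (1 of 2 is not a majority)
```

Requiring agreement from several regions keeps one broken probe location from pulling a healthy pool out of DNS.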
6) Public vs private DNS (split-horizon)
Public zone - client access.
Private zone - internal resolution to private endpoints (VPC/VNet, on-prem).
Conditional forwarding between on-prem ↔ cloud and region ↔ region.
Naming: 'api.<brand>.<region>.internal.corp' and 'api.<brand>.com'.
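Split-horizon behaviour can be illustrated with a toy resolver that picks a view by client address. The zone contents, names and addresses below are made up, and real deployments implement this in the DNS server or cloud resolver rather than in application code.

```python
import ipaddress

# RFC 1918 ranges are treated as "internal" clients (an assumption of this sketch).
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Hypothetical zone data: the same name resolves differently per view.
ZONES = {
    "private": {"api.example.com": "10.20.0.15"},
    "public": {"api.example.com": "203.0.113.10"},
}


def resolve(name, client_ip):
    """Return the address from the view that matches the client's network."""
    addr = ipaddress.ip_address(client_ip)
    view = "private" if any(addr in net for net in PRIVATE_NETS) else "public"
    return ZONES[view].get(name)


print(resolve("api.example.com", "10.1.2.3"))      # 10.20.0.15
print(resolve("api.example.com", "198.51.100.7"))  # 203.0.113.10
```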
7) Security: DNSSEC and domain policy
DNSSEC: enable zone signing (KSK/ZSK), monitor key rotation and the chain of trust.
CAA: list the permitted CAs; include 'iodef' for alerts.
SPF/DMARC/DKIM: mail reputation and protection against phishing.
Registrar lock and MFA for DNS provider accounts; keep the change log in a WORM store.
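As an illustration of the CAA recommendation, the helper below renders zone-file lines for a list of permitted issuers plus an 'iodef' contact. The domain, the CA identifier and the mailbox are placeholders, not values taken from any real zone.

```python
def caa_records(domain, issuers, iodef):
    """Render CAA records in zone-file syntax: one 'issue' per CA, then 'iodef'."""
    lines = [f'{domain}. IN CAA 0 issue "{ca}"' for ca in issuers]
    lines.append(f'{domain}. IN CAA 0 iodef "{iodef}"')
    return lines


for line in caa_records(
    "example.com",
    ["letsencrypt.org"],
    "mailto:security@example.com",
):
    print(line)
# example.com. IN CAA 0 issue "letsencrypt.org"
# example.com. IN CAA 0 iodef "mailto:security@example.com"
```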
8) Designing failover
8.1 Models
Active-Active: two or more healthy pools; balance through latency/weight, health-checks exclude unhealthy pools.
Active-Passive: main pool + reserve (0% weight until an incident).
Regional ring: traffic moves to the "neighboring" region in a local disaster.
Degraded mode: switch to a lightweight site/landing page if the backend is not available.
8.2 Step-by-step scenario
1. Monitoring detects degradation of '/ready'.
2. DNS changes its responses (removes the pool or changes weights).
3. Traffic goes to a healthy region; the TTL determines the speed.
4. After stabilization - a grace period (15-30 min), and only then are the weights restored.
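The scenario above, including the grace period before failback, can be sketched as a small state machine. The class name, the two-pool model and the 900 s default are assumptions of this sketch; real providers expose this through their own health-check and failover settings.

```python
class FailoverController:
    """Active-passive switch with a grace period before returning to primary."""

    def __init__(self, grace_s=900):
        self.grace_s = grace_s          # 15 min, the lower end of the 15-30 min range
        self.active = "primary"
        self._healthy_since = None      # when the primary was first seen healthy again

    def observe(self, primary_ready, now):
        """Feed one '/ready' observation (timestamp in seconds); returns the active pool."""
        if not primary_ready:
            # Steps 1-2: degradation detected, answers switch to the reserve pool.
            self.active = "secondary"
            self._healthy_since = None
        elif self.active == "secondary":
            # Step 4: primary must stay healthy for the whole grace period.
            if self._healthy_since is None:
                self._healthy_since = now
            elif now - self._healthy_since >= self.grace_s:
                self.active = "primary"
                self._healthy_since = None
        return self.active
```

Any failed observation during the grace period resets the timer, which guards against flapping between pools.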
9) Configuration examples
9.1 AWS Route 53 - latency + health + weighted
```hcl
# Two latency alias records, one per region. Route 53 uses the target's TTL
# for alias records, so no ttl is set (the alias block conflicts with ttl).
resource "aws_route53_record" "api_latency_eu" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "eu1"
  latency_routing_policy { region = "eu-central-1" }
  alias {
    name                   = aws_lb.api_eu.dns_name
    zone_id                = aws_lb.api_eu.zone_id
    evaluate_target_health = true
  }
  health_check_id = aws_route53_health_check.api_eu.id
}

resource "aws_route53_record" "api_latency_us" {
  zone_id        = var.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us1"
  latency_routing_policy { region = "us-east-1" }
  alias {
    name                   = aws_lb.api_us.dns_name
    zone_id                = aws_lb.api_us.zone_id
    evaluate_target_health = true
  }
  health_check_id = aws_route53_health_check.api_us.id
}

# Canary in EU: 10% weight. Records sharing a name and type must use one
# routing policy, so the weighted split lives on a regional name (assumed
# here). A stable sibling (e.g. weight 90) and an alias from the EU latency
# record to this name complete the split.
resource "aws_route53_record" "api_weighted_canary" {
  zone_id        = var.zone_id
  name           = "eu.api.example.com"
  type           = "A"
  set_identifier = "eu1-canary"
  weighted_routing_policy { weight = 10 }
  alias {
    name                   = aws_lb.api_eu_canary.dns_name
    zone_id                = aws_lb.api_eu_canary.zone_id
    evaluate_target_health = true
  }
}
```
9.2 Cloudflare - geo/ASN and failover pool (idea)
Load Balancer Pools with health-checks (HTTP/TCP), Load Balancer with Geo Steering (continents/countries) and Session affinity.
Fallback: Page Rule/Transform Rule to a simplified backend at 5xx peaks.
9.3 Azure/GCP
Azure Traffic Manager: Priority/Weighted/Performance/Geographic.
Google Cloud Load Balancing + Cloud DNS policy: geo-policy + health-checks via External HTTP(S) LB.
10) Observability and DNS SLO
SLI: resolution success rate, 95th percentile of resolution time, share of fresh (non-stale) responses within TTL.
SLO: for example, 99.95% of successful responses within 100 ms.
Metrics: NXDOMAIN rate, SERVFAIL rate, pool health state, traffic share by region, canary share.
Exemplars: associate SLIs with HTTP traces via 'trace_id' in synthetics.
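A rough sketch of computing these SLIs from synthetic probe samples. The sample shape (a success flag plus latency in milliseconds) and the nearest-rank percentile method are choices made for this illustration.

```python
def dns_sli(samples, threshold_ms=100.0):
    """samples: list of (ok, latency_ms) tuples from resolution probes."""
    total = len(samples)
    if total == 0:
        raise ValueError("no samples")
    ok_latencies = sorted(lat for ok, lat in samples if ok)
    n = len(ok_latencies)
    # Nearest-rank p95: the ceil(0.95 * n)-th smallest value (integer arithmetic).
    p95 = ok_latencies[(95 * n + 99) // 100 - 1] if n else None
    return {
        "success_rate": n / total,
        "p95_ms": p95,
        # Share of all probes that succeeded within the latency threshold.
        "good_ratio": sum(1 for lat in ok_latencies if lat <= threshold_ms) / total,
    }


def slo_met(sli, target=0.9995):
    """Example SLO from the text: 99.95% of responses succeed within the threshold."""
    return sli["good_ratio"] >= target
```

For nine successful probes at 10, 20, ... 90 ms and one failure, this gives a success rate of 0.9, a p95 of 90 ms and a good ratio of 0.9, which misses the 99.95% target.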
11) Testing and operation
Synthetics from different ASN/regions (RIPE Atlas, Catchpoint, k6-DNS).
dnsviz/'delv' to check DNSSEC; 'dig +trace' for anomalies.
Staging zone ('stg.example.com') for failover rehearsals; the rehearsal script changes weights/priorities and restores them.
Runbook: who raises/lowers weights manually and how, how to turn off a pool, how to declare a "freeze."
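The "change weights, then restore" part of a rehearsal script might look like the sketch below. `get_weights` and `set_weights` are hypothetical callables standing in for whatever API the DNS provider offers; nothing here talks to a real provider.

```python
from contextlib import contextmanager


@contextmanager
def weight_drill(get_weights, set_weights, drill_weights):
    """Apply drill weights for the duration of the block, then always restore."""
    original = dict(get_weights())   # snapshot before touching anything
    set_weights(drill_weights)
    try:
        yield original
    finally:
        # Runs even if the drill body raises, so the zone is never left shifted.
        set_weights(original)


# Usage with an in-memory stand-in for the staging zone.
staging = {"eu": 100, "us": 0}

def read_staging():
    return staging

def write_staging(weights):
    staging.clear()
    staging.update(weights)

with weight_drill(read_staging, write_staging, {"eu": 0, "us": 100}):
    print(staging)   # {'eu': 0, 'us': 100}
print(staging)       # {'eu': 100, 'us': 0}
```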
12) Antipatterns
TTL of 3000+ s on critical A/AAAA records → slow/chaotic failover.
No health-checks, or TCP-only port checks without business invariants.
Long CNAME chains → slow resolution, cache chaos.
A single DNS provider with no secondary/AXFR backup.
Unsigned zone when DNSSEC is required; outdated CAA records.
Records exposing public IPs of private backends/databases.
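Two of these antipatterns (long TTLs on critical records and deep CNAME chains) are easy to check mechanically. The record layout, the 300 s TTL limit and the chain limit of two hops are assumptions chosen for this sketch, not fixed standards.

```python
def cname_depth(records, name):
    """Count CNAME hops starting at `name`; stops on loops or unknown targets."""
    depth, seen = 0, {name}
    rec = records.get(name)
    while rec is not None and rec["type"] == "CNAME":
        depth += 1
        target = rec["value"]
        if target in seen:
            break  # CNAME loop
        seen.add(target)
        rec = records.get(target)
    return depth


def lint_records(records, max_ttl_s=300, max_cname_depth=2):
    """records: name -> {"type", "ttl", "value", optional "critical"}."""
    findings = []
    for name, rec in records.items():
        if rec.get("critical") and rec["type"] in ("A", "AAAA") and rec["ttl"] > max_ttl_s:
            findings.append(f"{name}: TTL {rec['ttl']} s is too long for a critical record")
        depth = cname_depth(records, name)
        if depth > max_cname_depth:
            findings.append(f"{name}: CNAME chain of {depth} hops")
    return findings
```

Running such a check in CI against an exported zone catches regressions before they reach production.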
13) Specifics of iGaming/Finance
Jurisdictions: geo/country-routing for compliance (redirection to local domain/front).
PSP/KYC: dedicated subdomains with individual TTL and failover policies; fast switch to a standby PSP.
Responsible gaming: subdomains with legal pages are always available (backup static/CDN).
Audit: log zone changes to a WORM store, sign changes, and review them regularly.
Block lists: DNS compliance rules by region (edge filtering + DNS routing).
14) Prod Readiness Checklist
- TTL profiles by class; negative TTL ≤ 300 s.
- Two independent DNS providers (primary/secondary), MFA/registrar lock.
- Policies: geo/latency/weight + health-checks from multiple regions.
- DNSSEC enabled, CAA/DMARC/DKIM/SPF up to date.
- Split-horizon (public/private), private zones for internal traffic.
- Failover/failback runbook, rehearsal script, canary domains.
- SLI/SLO monitoring, alerts on NXDOMAIN/SERVFAIL/RTT growth.
- Staging zone and regular failover "drills."
- For iGaming: routing by jurisdiction, separate domains for PSP/KYC, immutable audit log.
15) TL;DR
Build a combined policy: geo/latency + health-checks + weights, with TTLs of 30-120 s on dynamic records. Separate public and private zones (split-horizon), enable DNSSEC and CAA, and keep a secondary DNS provider. Rehearse failover and track DNS SLIs/SLOs. For iGaming, account for jurisdictions and keep standby PSP/KYC domains with separate rules and change logging to a WORM store.