DNS Management and Routing
Brief summary
DNS is a "name-level router." How quickly and predictably users reach the right frontends/gateways depends on well-chosen TTLs, zones, and policies. Minimum set: an Anycast provider, sensible TTLs, health checks with automatic failover, DNSSEC + CAA, IaC management, and observability (SLOs for response and resolution time).
Basic architecture
Authoritative servers (zones) - responsible for the company's domains.
Recursive resolvers (client/ISP/own) - query the root → TLD → authoritative servers.
Anycast - the same IP address announced from many PoPs: the nearest PoP answers faster, and the service survives the failure of a single PoP.
Zones and delegation
The domain's apex zone → 'NS' records pointing to the providers of the authoritative servers.
Subdomains (e.g. 'api.example.com') can be delegated to separate 'NS' sets/providers for independence.
Record types (minimum)
'A'/'AAAA' - IPv4/IPv6 addresses.
'CNAME' - alias for a name; do not use it at the zone apex (use the provider's ALIAS/ANAME instead).
'TXT' - verification, SPF, custom labels.
'MX' - mail (if used).
'SRV' - services (SIP, LDAP, etc.).
'CAA' - which CAs may issue certificates for the domain.
'NS'/'SOA' - delegation/zone parameters.
'DS' - digest of the DNSSEC key (KSK), published in the parent zone.
Sample zone (fragment)
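$TTL 300
; SOA fields: serial refresh retry expire minimum (negative-caching TTL)
@     IN SOA   ns1.dns.example. noc.example. (2025110501 3600 600 604800 300)
@     IN NS    ns1.dns.example.
@     IN NS    ns2.dns.example.
@     IN A     203.0.113.10
@     IN AAAA  2001:db8::10
api   IN CNAME api-prod.global.example.
www   IN CNAME cdn.example.net.
; CAA is looked up at the domain name itself, so it lives at the apex
@     IN CAA   0 issue "letsencrypt.org"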
TTL and caching
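Short TTL (30-300 s) - for dynamic records (API frontends, failover).
Medium TTL (300-3600 s) - for CDN/static content.
Long TTL (≥ 1 day) - for records that rarely change (MX/NS/DS).
When planning a migration, lower the TTL 24-72 hours in advance so that caches holding the old, longer TTL expire before the change.
Account for the negative-caching TTL (NXDOMAIN): it is the lesser of the SOA record's own TTL and its 'SOA MINIMUM' field (RFC 2308).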
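The lead time before a migration follows from the old TTL: a resolver that cached the record just before the TTL was lowered may keep it for one full old TTL. The sketch below only illustrates that arithmetic; the helper names `latest_ttl_drop` and `negative_ttl`, the one-hour safety margin, and the dates are assumptions made for the example.

```python
from datetime import datetime, timedelta


def latest_ttl_drop(cutover: datetime, old_ttl_s: int, safety_s: int = 3600) -> datetime:
    """Latest moment to publish the lowered TTL so that every cache still
    holding the record with the old TTL has expired it before the cutover."""
    return cutover - timedelta(seconds=old_ttl_s + safety_s)


def negative_ttl(soa_ttl_s: int, soa_minimum_s: int) -> int:
    """Negative-caching TTL: the lesser of the SOA record's TTL and its
    MINIMUM field (RFC 2308)."""
    return min(soa_ttl_s, soa_minimum_s)


cutover = datetime(2025, 11, 8, 2, 0)
print(latest_ttl_drop(cutover, old_ttl_s=86400))        # 2025-11-07 01:00:00
print(negative_ttl(soa_ttl_s=3600, soa_minimum_s=300))  # 300
```

With a one-day TTL the drop has to be published at least 25 hours before the cutover, which is why a 24-72 hour window is a comfortable rule of thumb.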
Routing Policies (GSLB layer)
Failover (active/passive) - return the primary IP until its health check fails, then the standby.
Weighted (traffic-split) - distribute traffic by weight (for example, canary 5/95).
Latency-based - the closest PoP/region by network latency.
Geo-routing - by country/continent; useful for local regulations and PCI/PII requirements.
Multivalue - several 'A/AAAA' records, each with its own health check.
Tips
For critical APIs, combine latency-based routing + health checks + short TTL.
For smooth releases - weighted routing with a gradually growing share.
For regional restrictions - geo-routing and lists of allowed providers.
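How weighted routing and health checks combine can be shown with a small sketch. It is not any provider's GSLB implementation: the `Endpoint` type, the `pick_endpoint` helper, and the 95/5 weights are hypothetical, and a real service applies the choice per DNS response rather than per request.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    ip: str
    weight: int
    healthy: bool = True


def pick_endpoint(endpoints, rng=random):
    """Weighted random choice among healthy endpoints.
    Returns None when no healthy endpoint with a positive weight exists."""
    candidates = [e for e in endpoints if e.healthy and e.weight > 0]
    if not candidates:
        return None
    weights = [e.weight for e in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Canary split 95/5 (addresses are documentation-range examples).
pool = [Endpoint("203.0.113.10", 95), Endpoint("203.0.113.20", 5)]
rng = random.Random(42)  # seeded only to make the demo repeatable
hits = sum(pick_endpoint(pool, rng).ip == "203.0.113.20" for _ in range(10_000))
print(f"canary share: {hits / 10_000:.1%}")
```

Filtering on health before weighting means an unhealthy primary automatically sends all traffic to the remaining endpoints, which is the failover behavior described above.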
Health and automatic switching
Health checks: HTTP(S) (200 OK, body/header match), TCP (port), ICMP.
Reputation/fingerprint: check not only that the port is open but also that the backend is the correct one (version, build ID).
Sensitivity threshold: 'N' consecutive successful/failed checks before changing state, to avoid flapping.
Metrics to collect: share of healthy endpoints, response time, number of switchovers.
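The consecutive-check threshold can be sketched as a tiny state machine. The class name `HealthState` and the `fall`/`rise` parameter names are assumptions for illustration (similar knobs exist under different names in most health checkers).

```python
class HealthState:
    """Changes state only after N consecutive checks that contradict
    the current state: `fall` failures to go down, `rise` successes to come back."""

    def __init__(self, fall: int = 3, rise: int = 2, healthy: bool = True):
        self.fall = fall
        self.rise = rise
        self.healthy = healthy
        self.streak = 0    # consecutive results that contradict the current state
        self.switches = 0  # number of state changes, useful as a flapping metric

    def observe(self, ok: bool) -> bool:
        if ok == self.healthy:
            self.streak = 0  # result agrees with the current state
            return self.healthy
        self.streak += 1
        threshold = self.fall if self.healthy else self.rise
        if self.streak >= threshold:
            self.healthy = ok
            self.streak = 0
            self.switches += 1
        return self.healthy


state = HealthState(fall=3, rise=2)
results = [True, False, True, False, False, False, True, True]
print([state.observe(r) for r in results])
# [True, True, True, True, True, False, False, True]
```

The single failure at position two does not take the endpoint out of rotation; only the run of three failures does.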
Private zones and split-horizon
Private DNS: internal zones in VPC/VNet/on-prem (e.g. 'svc.local.example').
Split-horizon: different responses for internal and external clients (internal IP vs public IP).
Leak protection: do not use "internal" names externally; verify that private zones do not resolve through public providers.
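Split-horizon boils down to choosing a view by the client's source address. The sketch below assumes hypothetical internal ranges, view contents, and helper names (`select_view`, `resolve`, `leaked_names`); real servers express the same idea in their own configuration (views, private zones).

```python
import ipaddress

# Assumed internal ranges; adjust to the actual network plan.
INTERNAL_NETS = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "192.168.0.0/16")]

# Hypothetical view contents: the same name answers differently per view.
VIEWS = {
    "internal": {"api.example.com": "10.0.1.10"},
    "external": {"api.example.com": "203.0.113.10"},
}


def select_view(client_ip: str) -> str:
    """Pick the view from the client's source address."""
    addr = ipaddress.ip_address(client_ip)
    return "internal" if any(addr in net for net in INTERNAL_NETS) else "external"


def resolve(name: str, client_ip: str):
    """Answer from the view that matches the client; None if the name is absent."""
    return VIEWS[select_view(client_ip)].get(name)


def leaked_names(view: dict) -> list:
    """Names in a view whose address falls inside an internal range."""
    return sorted(name for name, ip in view.items() if select_view(ip) == "internal")


print(resolve("api.example.com", "10.2.3.4"))      # 10.0.1.10
print(resolve("api.example.com", "198.51.100.7"))  # 203.0.113.10
print(leaked_names(VIEWS["external"]))             # []
```

Running `leaked_names` against the external view in CI is one cheap way to catch internal addresses published by mistake.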
DNS security
DNSSEC: zone signing (ZSK/KSK), publishing 'DS' in the parent zone, key rollover.
CAA: restrict TLS certificate issuance to trusted CAs.
DoT/DoH for recursive resolvers - encrypts client queries.
ACL/rate limiting on authoritative servers: protection against reflection DDoS and ANY queries.
Subdomain takeover: regularly scan for dangling CNAME/ALIAS records that point to deleted external services (the resource is gone, the CNAME remains).
NS/glue records: keep them consistent between the registrar and the DNS provider.
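A scan for dangling CNAMEs can start from something as simple as the sketch below. It is a deliberately narrow check: it flags only targets that no longer resolve, while many takeover cases involve targets that still resolve to a provider's generic endpoint, so treat it as a first filter. The function names and the sample records are hypothetical.

```python
import socket
from typing import Callable, Dict, List


def target_resolves(name: str) -> bool:
    """True if the name currently resolves to at least one address."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def dangling_cnames(
    cnames: Dict[str, str],
    resolves: Callable[[str], bool] = target_resolves,
) -> List[str]:
    """Owner names whose CNAME target no longer resolves."""
    return sorted(owner for owner, target in cnames.items() if not resolves(target))


# Demo with a stubbed resolver so the example does not touch the network.
records = {
    "promo.example.com": "old-bucket.storage.example.net",
    "www.example.com": "cdn.example.net",
}
live_targets = {"cdn.example.net"}
print(dangling_cnames(records, resolves=lambda t: t in live_targets))
# ['promo.example.com']
```

Passing the resolver as a parameter keeps the scan testable and lets the same function run against an export of the zone from the IaC repository.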
SLO and observability
SLO (examples)
Availability of authoritative answers: ≥ 99.99% over 30 days.
Recursive resolution time (p95): ≤ 50 ms local / ≤ 150 ms global.
Health-check success rate: ≥ 99.9%; false positives: ≤ 0.1%.
Propagation time: ≤ 5 min at TTL 60 s.
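To make these targets concrete, the sketch below converts an availability SLO into an error budget and computes a nearest-rank p95 from latency samples. The helper names and the sample values are assumptions for illustration; monitoring systems typically compute percentiles from histograms instead.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def p95(samples_ms):
    """Nearest-rank 95th percentile (integer arithmetic avoids float rounding)."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


print(round(error_budget_minutes(0.9999), 2))  # 4.32
print(p95(list(range(1, 21))))                 # 19
```

A 99.99% target over 30 days leaves roughly 4.3 minutes of budget, which is less than the 5-minute propagation target above: a single failover that relies on TTL expiry alone can consume the whole month.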
Metrics
RCODE (NOERROR/NXDOMAIN/SERVFAIL), QPS, p50/p95 response time.
IPv6/IPv4 share, EDNS size, truncated (TC) responses.
Number of health-check switchovers, flapping, DNSSEC signature errors.
Share of DoH/DoT queries (if you operate the recursive resolvers).
Logs
Queries (qname, qtype, rcode, client ASN/geo), anomalies (ANY storms, frequent NXDOMAIN from a single prefix).
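Spotting a prefix that generates unusual amounts of NXDOMAIN is a simple aggregation over the query log. The sketch assumes a hypothetical log format (dicts with `client` and `rcode` keys), IPv4 clients, /24 grouping, and an arbitrary threshold.

```python
import ipaddress
from collections import Counter


def nxdomain_by_prefix(log, prefix_len: int = 24) -> Counter:
    """Count NXDOMAIN responses per client prefix."""
    counts = Counter()
    for entry in log:
        if entry["rcode"] == "NXDOMAIN":
            net = ipaddress.ip_network(f'{entry["client"]}/{prefix_len}', strict=False)
            counts[str(net)] += 1
    return counts


def noisy_prefixes(log, threshold: int, prefix_len: int = 24):
    """Prefixes whose NXDOMAIN count reaches the threshold, noisiest first."""
    ranked = nxdomain_by_prefix(log, prefix_len).most_common()
    return [prefix for prefix, count in ranked if count >= threshold]


log = [
    {"client": "198.51.100.7", "rcode": "NXDOMAIN"},
    {"client": "198.51.100.9", "rcode": "NXDOMAIN"},
    {"client": "198.51.100.9", "rcode": "NXDOMAIN"},
    {"client": "192.0.2.15", "rcode": "NOERROR"},
    {"client": "192.0.2.15", "rcode": "NXDOMAIN"},
]
print(noisy_prefixes(log, threshold=3))  # ['198.51.100.0/24']
```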
IaC and Automation
Terraform/DNS providers: keep zones in the repository, PR review, plan/apply.
ExternalDNS (K8s): automatic creation/deletion of records from Ingress/Service.
Intermediate environments: 'dev.'/'stg.' prefixes and separate DNS provider accounts.
Terraform (simplified example)
# Simplified example: resource types and attributes vary by Terraform DNS provider.
resource "dns_a_record_set" "api" {
  zone      = "example.com."
  name      = "api"
  addresses = ["203.0.113.10", "203.0.113.20"]
  ttl       = 60
}

resource "dns_caa_record" "caa" {
  zone = "example.com."
  name = "@"
  ttl  = 3600
  record {
    flags = 0
    tag   = "issue"
    value = "letsencrypt.org"
  }
}
Resolvers, Cache, and Performance
Run Unbound/Knot/BIND close to the applications → lower p95.
Enable prefetch for hot records and serve-stale for when the authoritative servers are unavailable.
EDNS(0) with a correct buffer size, DNS Cookies, minimal-responses.
Separate resolution traffic from application traffic (QoS).
Account for the negative TTL: a flood of NXDOMAIN from a broken client can clog the cache.
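The serve-stale idea can be sketched as a cache that answers from an expired entry when the upstream lookup fails. This is an illustration of the concept only, not how Unbound, Knot, or BIND implement it; the class name, the `fetch` callback signature, and the stale window are assumptions.

```python
import time


class ServeStaleCache:
    """TTL cache that falls back to an expired entry if the upstream fails."""

    def __init__(self, fetch, stale_window_s: float = 3600.0, clock=time.monotonic):
        self.fetch = fetch                    # callable: name -> (answer, ttl_s); raises on failure
        self.stale_window_s = stale_window_s  # how long past expiry an entry may still be served
        self.clock = clock
        self.store = {}                       # name -> (answer, expires_at)

    def resolve(self, name: str):
        now = self.clock()
        entry = self.store.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]                   # fresh hit
        try:
            answer, ttl_s = self.fetch(name)
        except Exception:
            if entry is not None and now < entry[1] + self.stale_window_s:
                return entry[0]               # upstream is down: serve stale
            raise
        self.store[name] = (answer, now + ttl_s)
        return answer
```

The `clock` parameter exists so the behavior can be exercised with a fake clock; in production code the stale window should be bounded, since serving outdated addresses indefinitely defeats failover.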
DDoS and resilience
Anycast provider with global PoPs and bot-traffic aggregation.
Response Rate Limiting (RRL) on authoritative servers, protection against amplification.
Block 'ANY', limit the EDNS buffer size, filter "heavy" record types.
Zone segmentation: critical zones go to the provider with the best DDoS protection; less critical ones are hosted separately.
Backup provider (secondaries) with 'AXFR/IXFR' and automatic NS failover at the registrar level.
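The core of rate limiting on an authoritative server can be approximated by a token bucket per client prefix. Real RRL implementations are more involved (they also key on the response and can "slip" truncated answers), so the sketch below is a simplification; the class name, /24 grouping, IPv4-only handling, and limits are assumptions.

```python
import ipaddress


class PrefixRateLimiter:
    """Token bucket per client /24: `rate_per_s` refill, `burst` capacity."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.burst = burst
        self.buckets = {}  # prefix -> (tokens, last_timestamp)

    def allow(self, client_ip: str, now: float) -> bool:
        prefix = str(ipaddress.ip_network(f"{client_ip}/24", strict=False))
        tokens, last = self.buckets.get(prefix, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.buckets[prefix] = (tokens, now)
        return allowed


limiter = PrefixRateLimiter(rate_per_s=1.0, burst=2.0)
print([limiter.allow("198.51.100.7", now=0.0) for _ in range(3)])  # [True, True, False]
print(limiter.allow("198.51.100.7", now=1.0))                       # True
```

Grouping by prefix rather than by single address matters for reflection attacks, where the spoofed sources are the victim's addresses and often share a network.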
Operations and Processes
Changes: PR review, canary records, cache warm-up (lower TTL → deploy → restore TTL).
DNSSEC rollover: documented procedure, change windows, signature validity monitoring (see RFC 8901 for multi-signer KSK/ZSK models).
Runbooks: PoP outage, incorrect NS delegation, failed health check, mass SERVFAIL.
DR plan: alternate DNS provider, ready-made zone templates, registrar access, an SLA for replacing NS records.
Implementation checklist
- Two independent authoritative providers/PoPs (Anycast), correct 'NS' at the registrar.
- TTL strategy: short for dynamic records, long for stable ones; negative TTL under control.
- Health checks and policies: failover/weighted/latency/geo by service profile.
- DNSSEC (KSK/ZSK/DS); 'CAA' restricts certificate issuance.
- IaC for zones, ExternalDNS for K8s, separate environments/accounts.
- Monitoring: rcode/QPS/latency/propagation, alerts on SERVFAIL/signature errors.
- DDoS: Anycast, RRL, EDNS restrictions, blocklists/ACLs.
- A documented procedure for domain migrations, with TTLs lowered 48-72 hours in advance.
- Regular audit of dangling CNAME/ALIAS records and of MX/SPF/DKIM/DMARC (if mail is used).
Common mistakes
TTL too long on critical 'A/AAAA' records - slow migrations/failovers.
A single DNS provider or a single PoP is a SPOF.
No DNSSEC/CAA - risk of spoofing and uncontrolled certificate issuance.
Inconsistent split-horizon → internal names leak out.
No health checks in GSLB - manual switching and delays.
Forgotten CNAMEs to external services → takeover risk.
No IaC → "snowflake" configs and errors during manual edits.
Specifics for iGaming/fintech
Regional versions and PSPs: geo/latency routing, partner IP/ASN allowlists, fast gateway failover.
Peaks (matches/tournaments): short TTL, CDN warm-up, separate names for events ('event-N.example.com') with a managed policy.
Legal compliance: record the time and zone version for critical changes (audit log).
Antifraud/bot protection: separate names for tiebreakers/captcha/check endpoints; fast diversion to a sinkhole ("black hole") during attacks.
Mini playbooks
Canary release of a frontend (weighted): 1) 'api-canary.example.com' → 5% of traffic; 2) monitor p95/p99/errors; 3) increase to 25/50/100%; 4) roll back on degradation.
Emergency failover: 1) TTL 60 s; 2) the health check marks the region down → GSLB removes it from responses; 3) verify from external resolvers; 4) communicate status.
DNS provider migration: 1) import the zone into the new provider; 2) enable a synchronized secondary alongside the old provider; 3) change the 'NS' records at the registrar during a quiet window; 4) watch for SERVFAIL/validation errors.
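The canary playbook can be expressed as a small decision function that either advances the weight or rolls back. The step list mirrors the 5/25/50/100 progression above; the function name and the p95 and error-rate limits are hypothetical placeholders, not recommended values.

```python
STEPS = [5, 25, 50, 100]  # canary share of traffic, in percent


def next_canary_weight(
    current: int,
    p95_ms: float,
    error_rate: float,
    p95_limit_ms: float = 150.0,  # assumed limit for the example
    error_limit: float = 0.01,    # assumed limit for the example
) -> int:
    """Return the next canary weight: 0 means roll back,
    otherwise the next step (or the current weight once at 100%)."""
    if p95_ms > p95_limit_ms or error_rate > error_limit:
        return 0
    later = [step for step in STEPS if step > current]
    return later[0] if later else current


print(next_canary_weight(5, p95_ms=80.0, error_rate=0.001))    # 25
print(next_canary_weight(25, p95_ms=310.0, error_rate=0.001))  # 0
print(next_canary_weight(100, p95_ms=80.0, error_rate=0.0))    # 100
```

Keeping the decision in one pure function makes it easy to run the same rule from a pipeline step or from a manual runbook.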
Result
A reliable DNS setup is Anycast authoritative servers + reasonable TTLs + health/latency routing + DNSSEC/CAA + IaC and observability. Document the migration and rollover processes, keep a backup provider, and regularly check the zone for dangling records, and your users will reliably reach the right frontends even at peak load.