Service Discovery and DNS
1) Why do you need it
In distributed systems, nodes appear and disappear, and clients must find healthy instances of a service quickly and reliably. DNS is the universal naming layer; service discovery is the strategy for mapping a service name to real endpoints, taking health, weights, and routing policy into account.
Key objectives:
- stable names instead of ephemeral addresses,
- accurate but not noisy updates (a balance between freshness and TTL),
- graceful degradation instead of total failure (failover/health checks),
- minimal guesswork on the client: timeouts, retries, cache policies.
2) Service discovery models
2.1 Client-side
The client itself resolves the name to a set of endpoints and balances across them (round-robin, EWMA, hashing by key). Sources: DNS (A/AAAA/SRV), a service registry (Consul/Eureka), or a static list.
Pros: fewer central SPOFs, flexible algorithms.
Cons: client heterogeneity; harder to roll out logic changes.
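A minimal sketch of client-side balancing in Go (the endpoint addresses are hypothetical; a real client would refresh the list from DNS answers or a registry watch rather than hold it statically):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Picker does client-side round-robin over a resolved endpoint set.
// Endpoints would normally come from A/AAAA/SRV answers or a registry.
type Picker struct {
	endpoints []string
	next      uint64
}

// Pick returns the next endpoint; atomic counter makes it goroutine-safe.
func (p *Picker) Pick() string {
	n := atomic.AddUint64(&p.next, 1) - 1
	return p.endpoints[n%uint64(len(p.endpoints))]
}

func main() {
	p := &Picker{endpoints: []string{"10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"}}
	for i := 0; i < 4; i++ {
		fmt.Println(p.Pick()) // cycles 10.0.0.1, .2, .3, then wraps to .1
	}
}
```

Swapping the `Pick` body for EWMA or key-hashing changes the algorithm without touching the resolution side, which is the main appeal of the client-side model.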
2.2 Server-side
The client talks to a front/LB (L4/L7 gateway/ingress). Balancing and health checking happen on the proxy/balancer side.
Pros: a single place for policy, observability.
Cons: requires a highly available perimeter (N+1, multi-AZ).
2.3 Hybrid
DNS gives a set of entry points (regional LBs), then L7/mesh balancing.
3) DNS: basics, records and TTL
3.1 Basic record types
A/AAAA - IPv4/IPv6 addresses.
CNAME - alias to another name (not allowed at the zone apex).
SRV - `_service._proto.name` → host/port/weight/priority (for gRPC/LDAP/SIP, etc.).
TXT/SVCB/HTTPS - metadata/pointers (including for HTTP(S) service discovery).
NS/SOA - zone delegation and attributes.
3.2 TTL and caching
Caches exist at every hop: the OS resolver, a local stub resolver, node caches (NodeLocal DNS/CoreDNS), the provider, intermediate recursive resolvers, and the client library. Effective freshness = min(record TTL, client policy). Negative answers (NXDOMAIN) are also cached, governed by `SOA MINIMUM`/TTL.
Recommendations:
- Prod: TTL 30-120s for dynamic records, 300-600s for stable ones.
- For switchovers (failover), lower the TTL in advance, not "during the fire."
- Account for sticky caches in runtimes (Java/Go/Node); configure the resolver TTL inside the runtime if needed.
4) DNS balancing and fault tolerance policies
Weighted RR - weights on A/AAAA/SRV records.
Failover - primary/secondary sets (health checks performed externally).
Geo/Latency - answer with the "nearest" POP/region.
Anycast - one IP announced from multiple POPs (BGP); resilient to regional outages.
Split-horizon - different answers inside the VPC/on-prem and on the Internet.
GSLB - a global balancer with health checks and policies (latency, geo, capacity).
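Weighted selection can also live on the client. A sketch of smooth weighted round-robin (the scheme nginx popularized for weighted upstreams; addresses and the 5:1 weights are made up):

```go
package main

import "fmt"

// backend mirrors what an A/AAAA/SRV weight or GSLB policy would encode.
type backend struct {
	addr    string
	weight  int
	current int // running score used by the smooth WRR algorithm
}

// pick adds each backend's weight to its running score, takes the max,
// then subtracts the total weight from the winner. Over time each backend
// is chosen proportionally to its weight, without bursts.
func pick(bs []*backend) *backend {
	total := 0
	var best *backend
	for _, b := range bs {
		b.current += b.weight
		total += b.weight
		if best == nil || b.current > best.current {
			best = b
		}
	}
	best.current -= total
	return best
}

func main() {
	bs := []*backend{
		{addr: "a", weight: 5},
		{addr: "b", weight: 1},
	}
	for i := 0; i < 6; i++ {
		fmt.Print(pick(bs).addr, " ") // a a a b a a : 5:1 as configured
	}
	fmt.Println()
}
```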
5) Health-checks and freshness
DNS itself is "dumb": it does not know the health of backends. Therefore:- Or an external health-checker manages records/weights (GSLB, Route53/Traffic-policy, external-dns + samples).
- Or the/mesh client makes an active outlier-ejection and retry from many endpoints.
6) Kubernetes: discovery out of the box
Service names: `<service>.<namespace>.svc.cluster.local`.
ClusterIP: a stable virtual IP + kube-proxy/eBPF.
Headless Service (`clusterIP: None`): returns A records for the pods (or their subdomains), plus SRV for named ports.
EndpointSlice: a scalable list of endpoints (replacing Endpoints).
CoreDNS: the cluster DNS resolver (exposed via the `kube-dns` Service); plugins: rewrite/template/forward/cache.
NodeLocal DNSCache: a local cache on each node → lower latency and insulation from upstream resolver problems.
Example: Headless + SRV
```yaml
apiVersion: v1
kind: Service
metadata: { name: payments, namespace: prod }
spec:
  clusterIP: None
  selector: { app: payments }
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
```
The client can resolve `_grpc._tcp.payments.prod.svc.cluster.local` (SRV) and get host/port/weights.
CoreDNS (ConfigMap fragment)
```yaml
apiVersion: v1
kind: ConfigMap
metadata: { name: coredns, namespace: kube-system }
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        cache 30
        loop
        forward . /etc/resolv.conf
        prometheus :9153
        reload
    }
```
NodeLocal DNS (the idea):
- a DaemonSet runs a local resolver on `169.254.20.10`; kubelet points pods at it.
- Reduces p99 name resolution latency and absorbs upstream DNS flaps.
7) Service discovery outside K8s
Consul: agents, health checks, a service catalog, a DNS interface (`.consul`), KV for configs.
Eureka/ZooKeeper/etcd: registries for JVM/legacy; often in conjunction with a sidecar/gateway.
Envoy/Istio: EDS/xDS (Endpoint Discovery) and SDS (secrets); services are declared via the control-plane.
8) DNS security
DNSSEC: protects record integrity (zone signing). Critical for public domains.
DoT/DoH: encrypt the channel to the recursive resolver (mind internal policies and compatibility).
ACLs and split-horizon: private zones answered only from the VPC/VPN.
Cache-poisoning protection: port/ID randomization, short TTLs for dynamic records.
Egress policies: allow DNS only to trusted resolvers, and log it.
9) Client behavior and retries
Respect TTLs: don't cache forever, but don't hammer the recursive resolver with constant re-resolution either.
Happy Eyeballs (IPv4/IPv6): parallel connections to multiple A/AAAA answers cut tail latency.
Retries only for idempotent requests; add jitter and cap the retry budget.
Runtime knobs:
- Java: `networkaddress.cache.ttl`, `networkaddress.cache.negative.ttl`.
- Go: `GODEBUG=netdns=go`/`cgo`, `Resolver.PreferGo`, `DialTimeout`.
- Node: `dns.setDefaultResultOrder('ipv4first')`, `lookup` with `all: true`.
10) GSLB/DNS switching: practice
Lower the TTL (e.g., 300→60) 24-48 hours before a planned switchover.
Keep a low-weight canary set of endpoints for validation.
Use weighted records + health checks instead of mass manual updates of A records.
For static/edge content - Anycast; for APIs - Geo/Latency plus fast L7 failover.
11) Observability and SLOs for name resolution
Metrics:
- rate/latency of DNS queries, cache hit ratio, errors by type (SERVFAIL/NXDOMAIN),
- percentage of requests served stale answers (if stale-cache is enabled),
- success of user operations after record changes (a business SLI),
- p95/p99 resolve time in applications.
Stratify the path: client → local cache → node cache → cluster resolver → provider's recursive resolver.
Track spikes of NXDOMAIN (name/typo errors) and SERVFAIL (recursion issues/resource limits).
12) Configuration examples
CoreDNS: rewrite and stub zone
```
.:53 {
    log
    errors
    cache 60
    rewrite name suffix .svc.cluster.local .svc.cluster.local
    forward . 10.0.0.2 10.0.0.3
}
example.internal:53 {
    file /zones/example.internal.signed
    dnssec
}
```
systemd-resolved
```ini
[Resolve]
DNS=169.254.20.10
FallbackDNS=1.1.1.1 8.8.8.8
Domains=~cluster.local ~internal
DNSSEC=yes
```
Envoy: dynamic DNS refresh
```yaml
dns_refresh_rate: 5s
dns_failure_refresh_rate:
  base_interval: 2s
  max_interval: 30s
respect_dns_ttl: true
```
external-dns (managing public zones)
```yaml
args:
  - --source=service
  - --source=ingress
  - --domain-filter=example.com
  - --policy=upsert-only
  - --txt-owner-id=cluster-prod
```
13) Implementation checklist (0-30 days)
0-7 days
Catalog service names, choose a model (client-side/server-side/hybrid).
Set baseline TTLs, enable NodeLocal DNSCache, build DNS metric dashboards.
Ban hard-coded IPs in configs/code.
8-20 days
Headless Services + SRV for gRPC; EndpointSlice enabled.
GSLB/weighted records for external traffic; health checks and canaries.
Client timeouts/retries and a retry budget configured.
21-30 days
Split-horizon and private zones; DoT/DoH per policy.
Switchover test (via TTL) and failover drill; post-analysis.
Mesh/EDS enabled, outlier-ejection policies in place.
14) Anti-patterns
TTL = 0 in prod → a storm toward recursive resolvers, unpredictable delays.
Hard-coded IPs/ports, no CNAME/aliases between layers.
Changing records "by hand" without health checks and canaries.
A single global resolver with no node-level cache (bottleneck).
Ignoring the negative cache (NXDOMAIN spikes).
Trying to "heal" a database failure via DNS instead of at the data/failover layer.
15) Maturity metrics
100% of services use names; zero hard-coded IPs.
CoreDNS/NodeLocal in prod; cache hit ratio > 90% on nodes.
GSLB with health checks, documented TTLs and switchover runbooks.
SRV/EndpointSlice for stateful/gRPC; p99 resolve time in applications ≤ 20-30 ms.
Alerts on SERVFAIL/NXDOMAIN and cache hit-ratio degradation.
CI checks: ban `:latest` and hard-coded IPs in charts/configs.
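The CI check can start as a plain text lint (a toy sketch; a real check would parse the YAML and honor an allowlist, e.g. for NodeLocal's 169.254.20.10):

```go
package main

import (
	"fmt"
	"regexp"
)

// Toy CI lint: flag hard-coded IPv4 literals and ":latest" image tags
// in chart/config text.
var (
	ipv4Re   = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)
	latestRe = regexp.MustCompile(`:latest\b`)
)

// lint returns every violating token found in the config text.
func lint(config string) []string {
	var problems []string
	problems = append(problems, ipv4Re.FindAllString(config, -1)...)
	problems = append(problems, latestRe.FindAllString(config, -1)...)
	return problems
}

func main() {
	cfg := "image: registry/app:latest\nupstream: 10.0.0.7:8080\n"
	fmt.Println(lint(cfg)) // prints: [10.0.0.7 :latest]
}
```

Wired into CI, a non-empty result fails the build, enforcing the "names, not addresses" rule mechanically.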
16) Conclusion
Service discovery is a contract of stable names plus cache discipline. Build a hybrid model: DNS provides a fast, simple entry point; L7/mesh adds health awareness and smart policies. Maintain sensible TTLs, node-level caches, headless Services and SRV where needed; use GSLB/Anycast at regional boundaries; watch NXDOMAIN/SERVFAIL and p99 resolve time. Then names become as reliable an asset as the services themselves.