Service Discovery and DNS

1) Why you need it

In distributed systems, nodes appear and disappear, and clients must find healthy instances of a service quickly and reliably. DNS is the universal naming layer; service discovery is the strategy for mapping a service name to real endpoints, taking health, weights, and routing policy into account.

Key objectives:
  • stable names instead of ephemeral addresses,
  • timely but not noisy updates (a balance between freshness and TTL),
  • graceful degradation instead of a total outage (failover/health checks),
  • minimal guesswork on the client: timeouts, retries, cache policies.

2) Service discovery models

2.1 Client-side

The client itself resolves the name to a set of endpoints and balances across them (round-robin, EWMA, hashing by key). Sources: DNS (A/AAAA/SRV), a service registry (Consul/Eureka), or a static list.

Pros: fewer central SPOFs, flexible algorithms.
Cons: heterogeneous clients; harder to roll out changes to the balancing logic.
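
A minimal client-side sketch in Go, assuming a hypothetical service name and port: resolve the name to addresses and rotate across them round-robin (periodic re-resolution to honor TTL is omitted for brevity):

go
// Client-side discovery: resolve a service name to A/AAAA records and
// balance round-robin across them. Name and port are hypothetical.
package main

import (
	"context"
	"fmt"
	"net"
	"sync/atomic"
	"time"
)

type rrPicker struct {
	addrs []string
	next  uint64
}

// pick returns the next address in round-robin order.
func (p *rrPicker) pick() string {
	n := atomic.AddUint64(&p.next, 1)
	return p.addrs[(n-1)%uint64(len(p.addrs))]
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Resolve once; a real client re-resolves periodically to honor TTL.
	addrs, err := net.DefaultResolver.LookupHost(ctx, "payments.prod.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	p := &rrPicker{addrs: addrs}
	for i := 0; i < 4; i++ {
		fmt.Println("dial:", net.JoinHostPort(p.pick(), "50051"))
	}
}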

2.2 Server-side

The client talks to a front/LB (an L4/L7 gateway/ingress); balancing and health checking live on the proxy/balancer side.

Pros: a single place for policy, observability.
Cons: needs a highly available perimeter (N+1, multi-AZ).

2.3 Hybrid

DNS hands out a set of entry points (regional LBs); L7/mesh then balances within them.

3) DNS: basics, records and TTL

3.1 Basic record types

A/AAAA - IPv4/IPv6 addresses.
CNAME - alias for another name (not usable at the zone apex).
SRV - `_service._proto.name` → host/port/weight/priority (for gRPC/LDAP/SIP, etc.).
TXT, SVCB/HTTPS - metadata and pointers (including for HTTP-based discovery).
NS/SOA - zone delegation and attributes.
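
For reference, a sketch of how these record types are queried with Go's standard resolver (the domain names are placeholders):

go
// Look up the record types listed above. example.com is a placeholder.
package main

import (
	"fmt"
	"net"
)

func main() {
	addrs, _ := net.LookupHost("example.com")                 // A/AAAA
	cname, _ := net.LookupCNAME("www.example.com")            // CNAME target
	_, srvs, _ := net.LookupSRV("grpc", "tcp", "example.com") // SRV: _grpc._tcp.example.com
	txts, _ := net.LookupTXT("example.com")                   // TXT
	nss, _ := net.LookupNS("example.com")                     // NS
	fmt.Println(addrs, cname, len(srvs), txts, len(nss))
}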

3.2 TTL and caching

Caching happens at many layers: the OS resolver, a local stub resolver, node-level caches (NodeLocal DNS/CoreDNS), the provider, intermediate recursive resolvers, and the client library. Actual freshness = min(TTL, client policy). Negative answers (NXDOMAIN) are cached too, governed by `SOA MINIMUM`/TTL.

Recommendations:
  • Prod: TTL 30-120s for dynamic records, 300-600s for stable ones.
  • For switchovers (failover), lower the TTL in advance, not "during the fire."
  • Mind the sticky caches of language runtimes (Java/Go/Node); if necessary, configure the resolver TTL inside the runtime (see the cache sketch after this list).
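
Go's standard resolver does not cache lookups itself, so a client that wants TTL discipline needs its own cache. A minimal sketch, assuming a fixed TTL (a production version should take freshness from the records themselves):

go
// Client-side resolver cache with a fixed TTL. Go's net package does
// not cache lookups, so freshness is enforced here. The 60s TTL is an
// assumption for illustration.
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

type entry struct {
	addrs   []string
	expires time.Time
}

type cachedResolver struct {
	ttl time.Duration
	mu  sync.Mutex
	m   map[string]entry
}

func newCachedResolver(ttl time.Duration) *cachedResolver {
	return &cachedResolver{ttl: ttl, m: map[string]entry{}}
}

func (c *cachedResolver) lookup(ctx context.Context, host string) ([]string, error) {
	c.mu.Lock()
	e, ok := c.m[host]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.addrs, nil // still fresh: serve from cache
	}
	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return nil, err // serving stale here is a possible degradation policy
	}
	c.mu.Lock()
	c.m[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}

func main() {
	c := newCachedResolver(60 * time.Second)
	addrs, err := c.lookup(context.Background(), "example.com")
	fmt.Println(addrs, err)
}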

4) DNS balancing and fault tolerance policies

Weighted RR - weights on A/AAAA/SRV (weighted selection is sketched after this list).
Failover - primary/secondary sets (health-check outside).
Geo/Latency - answers steer the client to the "nearest" POP/region.
Anycast - one IP in different POP (BGP); resilient to regional disruptions.
Split-horizon - different answers inside VPC/on-prem and on the Internet.
GSLB - a global load balancer with health checks and policies (latency, geo, capacity).
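
A minimal sketch of weighted selection, the core of weighted RR and of honoring SRV weights; the addresses and weights are illustrative:

go
// Pick an endpoint with probability proportional to its weight.
// Assumes all weights are positive.
package main

import (
	"fmt"
	"math/rand"
)

type endpoint struct {
	addr   string
	weight int
}

func pickWeighted(eps []endpoint) endpoint {
	total := 0
	for _, e := range eps {
		total += e.weight
	}
	n := rand.Intn(total) // uniform in [0, total)
	for _, e := range eps {
		if n < e.weight {
			return e
		}
		n -= e.weight
	}
	return eps[len(eps)-1] // not reached with positive weights
}

func main() {
	eps := []endpoint{{"10.0.0.1:50051", 80}, {"10.0.0.2:50051", 20}}
	// ~80% of picks land on the first endpoint, ~20% on the second.
	fmt.Println(pickWeighted(eps).addr)
}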

5) Health-checks and freshness

DNS itself is "dumb": it knows nothing about backend health. Therefore:
  • either an external health-checker manages records/weights (GSLB, Route 53 traffic policies, external-dns + probes),
  • or the client/mesh performs active outlier ejection and retries across multiple endpoints (sketched below).
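
A minimal outlier-ejection sketch for the second option; the thresholds (3 consecutive failures, a 30s cool-down) are illustrative, not tuned values:

go
// Track consecutive failures per endpoint; after a threshold, eject
// the endpoint for a cool-down window, then let it back in.
package main

import (
	"fmt"
	"sync"
	"time"
)

type outlierSet struct {
	mu        sync.Mutex
	failures  map[string]int
	ejectedTo map[string]time.Time
}

func newOutlierSet() *outlierSet {
	return &outlierSet{failures: map[string]int{}, ejectedTo: map[string]time.Time{}}
}

// usable reports whether the endpoint may currently receive traffic.
func (o *outlierSet) usable(addr string) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	return time.Now().After(o.ejectedTo[addr])
}

// report records the outcome of a request to addr.
func (o *outlierSet) report(addr string, ok bool) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if ok {
		o.failures[addr] = 0
		return
	}
	o.failures[addr]++
	if o.failures[addr] >= 3 { // illustrative threshold
		o.ejectedTo[addr] = time.Now().Add(30 * time.Second)
		o.failures[addr] = 0
	}
}

func main() {
	o := newOutlierSet()
	for i := 0; i < 3; i++ {
		o.report("10.0.0.1:50051", false)
	}
	fmt.Println(o.usable("10.0.0.1:50051")) // false: ejected for 30s
}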

6) Kubernetes: discovery out of the box

Service names: `svc.namespace.svc.cluster.local`.
ClusterIP: a stable virtual IP + kube-proxy/eBPF.
Headless Service (`clusterIP: None`): returns A records for the pods (or their subdomains) and SRV for named ports.
EndpointSlice: scalable endpoint lists (replacing Endpoints).
CoreDNS: the cluster DNS resolver; plugins rewrite/template/forward/cache; deployed behind the `kube-dns` Service.
NodeLocal DNSCache: a local cache on each node → lower latency and insulation from upstream resolver problems.

Example: Headless + SRV

yaml
apiVersion: v1
kind: Service
metadata: { name: payments, namespace: prod }
spec:
  clusterIP: None
  selector: { app: payments }
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051

The client can resolve `_grpc._tcp.payments.prod.svc.cluster.local` (SRV) and get hosts/ports/weights.
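
A sketch of that lookup in Go; the names match the Service manifest above:

go
// Resolve the SRV record published for the headless Service:
// _grpc._tcp.payments.prod.svc.cluster.local
package main

import (
	"fmt"
	"net"
)

func main() {
	_, srvs, err := net.LookupSRV("grpc", "tcp", "payments.prod.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	for _, s := range srvs {
		fmt.Printf("target=%s port=%d weight=%d priority=%d\n",
			s.Target, s.Port, s.Weight, s.Priority)
	}
}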

CoreDNS (ConfigMap fragment)

yaml
apiVersion: v1
kind: ConfigMap
metadata: { name: coredns, namespace: kube-system }
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        cache 30
        loop
        forward . /etc/resolv.conf
        prometheus :9153
        reload
    }
NodeLocal DNS (the idea):
  • A DaemonSet with a local resolver on `169.254.20.10`; kubelet points pods at this address.
  • Reduces p99 name-resolution latency and absorbs upstream DNS flaps.

7) Service discovery outside K8s

Consul: agent, health checks, service catalog, DNS interface (`.consul`), KV for configs.
Eureka/ZooKeeper/etcd: registries for JVM/legacy; often in conjunction with a sidecar/gateway.
Envoy/Istio: EDS/xDS (Endpoint Discovery) and SDS (secrets); services are declared via the control-plane.
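
Returning to Consul: as a sketch, its DNS interface (served by the local agent, port 8600 by default) can be queried with a custom resolver; the service name payments is an assumption:

go
// Query Consul's DNS interface for a service's SRV records by pointing
// a custom resolver at the local agent (default DNS port 8600).
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{
		PreferGo: true, // use the pure-Go resolver so Dial below is honored
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, "127.0.0.1:8600")
		},
	}
	// Empty service/proto means: look up the name as-is.
	_, srvs, err := r.LookupSRV(context.Background(), "", "", "payments.service.consul")
	if err != nil {
		panic(err)
	}
	for _, s := range srvs {
		fmt.Printf("%s:%d\n", s.Target, s.Port)
	}
}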

8) DNS security

DNSSEC: protects record integrity (zone signing). Critical for public domains.
DoT/DoH: encrypts the channel to the recursive resolver (subject to internal policies and compatibility).
ACLs and split-horizon: private zones answered only from the VPC/VPN.
Protection against cache poisoning: port/ID randomization, short TTLs for dynamic records.
Egress policies: allow DNS only to trusted resolvers, and log it.

9) Client behavior and retries

Respect TTL: do not cache endlessly, and do not hammer recursive resolvers with overly frequent lookups.
Happy Eyeballs (IPv4/IPv6) and parallel connections to multiple A/AAAA records reduce tail latency.
Retries only for idempotent requests; use jitter and a bounded retry budget (sketched below).
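
A hedged sketch of that retry discipline: exponential backoff with full jitter plus a simple retry budget (the limits are illustrative, and the shared budget here is not concurrency-safe):

go
// Retry an idempotent operation with full-jitter exponential backoff,
// bounded by both an attempt count and a shared retry budget.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

var errBudgetExhausted = errors.New("retry budget exhausted")

func retry(attempts int, base time.Duration, budget *int, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no sleep after the final attempt
		}
		if *budget <= 0 {
			return errBudgetExhausted
		}
		*budget--
		// Full jitter: sleep a random duration in [0, base*2^i).
		time.Sleep(time.Duration(rand.Int63n(int64(base << uint(i)))))
	}
	return err
}

func main() {
	budget := 10 // shared across calls in a real client
	err := retry(3, 100*time.Millisecond, &budget, func() error {
		return errors.New("upstream unavailable") // placeholder operation
	})
	fmt.Println(err, "budget left:", budget)
}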

Fine-tuning the runtime resolver:
  • Java: `networkaddress.cache.ttl`, `networkaddress.cache.negative.ttl`.
  • Go: `GODEBUG=netdns=go`/`cgo`, `Resolver.PreferGo`, dial timeouts (example after this list).
  • Node: `dns.setDefaultResultOrder('ipv4first')`, `lookup` with `all: true`.
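
For the Go knobs, a minimal sketch: force the pure-Go resolver and bound each lookup with a context timeout (the values are illustrative):

go
// Pure-Go resolver (equivalent in effect to GODEBUG=netdns=go) with a
// per-lookup timeout enforced via context.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{PreferGo: true}
	ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond)
	defer cancel()
	addrs, err := r.LookupHost(ctx, "example.com")
	fmt.Println(addrs, err)
}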

10) GSLB/DNS switching: practice

Lower the TTL (e.g., 300 → 60) 24-48 hours before a scheduled switchover.
Keep a low-weight canary set of endpoints for validation.
Use weighted + health-check instead of a manual mass update of A-records.
For statics/edge - Anycast; for APIs - Geo/Latency + fast L7 failover.

11) Observability and SLOs for naming

Metrics:
  • Rate/latency of DNS queries, cache hit-ratio, errors by type (SERVFAIL/NXDOMAIN).
  • The percentage of requests with stale responses (if using stale-cache).
  • Success of user operations on record changes (business SLI).
  • p95/p99 resolve-time in applications.
Diagnostics:
  • Trace the path: client → local cache → node cache → cluster resolver → provider recursion.
  • Watch for spikes of NXDOMAIN (name/typo errors) and SERVFAIL (recursion issues, resource limits).

12) Configuration examples

CoreDNS: rewrite and stub zone

Corefile
.:53 {
    log
    errors
    cache 60
    rewrite name suffix .svc.cluster.local .svc.cluster.local
    forward . 10.0.0.2 10.0.0.3
}

example.internal:53 {
    file /zones/example.internal.signed
    dnssec
}

systemd-resolved

ini
[Resolve]
DNS=169.254.20.10
FallbackDNS=1.1.1.1 8.8.8.8
Domains=~cluster.local ~internal
DNSSEC=yes

Envoy: dynamic DNS-refresh

yaml
dns_refresh_rate: 5s
dns_failure_refresh_rate:
  base_interval: 2s
  max_interval: 30s
respect_dns_ttl: true

external-dns (public zone support)

yaml
args:
  - --source=service
  - --source=ingress
  - --domain-filter=example.com
  - --policy=upsert-only
  - --txt-owner-id=cluster-prod

13) Implementation checklist (0-30 days)

0-7 days

A catalog of service names; model selection (client-side/server-side/hybrid).
Baseline TTLs; NodeLocal DNSCache enabled; DNS metrics dashboards.
Ban hard-coded IPs in configs/code.

8-20 days

Headless services + SRV for gRPC; EndpointSlice is enabled.
GSLB/weighted for external; health-checks and canary.
Client timeouts/retries and a retry budget are configured.

21-30 days

Split-horizon and private zones; DoT/DoH per policy.
Switchover test (honoring TTL) and failover drill; post-analysis.
Mesh/EDS, outlier-ejection policies are enabled.

14) Anti-patterns

TTL = 0 in prod → a query storm at the recursive resolvers, unpredictable delays.
Hard-coded IPs/ports; no CNAMEs/aliases between layers.
Changing records "manually" without health-checks and canaries.
One global resolver with no node cache (bottleneck).
Ignoring negative cache (NXDOMAIN spikes).
Attempts to "heal" a database failure via DNS instead of at the data/failover layer.

15) Maturity metrics

100% of services use names; zero hard-IP cases.
CoreDNS/NodeLocal DNSCache in prod, cache hit ratio > 90% on nodes.
GSLB with health checks, documented TTLs, and a switchover runbook.
SRV/EndpointSlice for stateful/gRPC, p99 resolve-time in applications ≤ 20-30 ms.
Alerts for SERVFAIL/NXDOMAIN and cache hit-ratio degradation.
Checks in CI: ban `:latest` tags and hard-coded IPs in charts/configs.

16) Conclusion

Service discovery is a contract of stable names plus cache discipline. Build a hybrid model: DNS gives a fast and simple entry point, while L7/mesh adds health and smart policies. Maintain sane TTLs, node-local caches, and headless services with SRV where needed; use GSLB/Anycast at regional boundaries; and watch NXDOMAIN/SERVFAIL and p99 resolve time. Then a service's name becomes as reliable an asset as the service itself.
