Service Discovery and DNS

1) Why you need it

In distributed systems, nodes appear and disappear, and clients must find healthy instances of a service quickly and reliably. DNS is the universal naming layer; service discovery is the strategy for mapping a service name to real endpoints, taking health, weights, and routing policy into account.

Key objectives:
  • stable names instead of ephemeral addresses,
  • timely but not noisy updates (a balance between freshness and TTL),
  • graceful degradation instead of a total outage (failover/health checks),
  • minimal guesswork on the client: timeouts, retries, cache policies.

2) Service discovery models

2.1 Client-side

The client itself resolves the name to a set of endpoints and balances across them (round-robin, EWMA, hashing by key). Sources: DNS (A/AAAA/SRV), a service registry (Consul/Eureka), or a static list.

Pros: fewer central SPOFs, flexible algorithms.
Cons: heterogeneous clients; harder to roll out changes to the balancing logic.
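
A minimal client-side sketch in Go, assuming a hypothetical service name and port: resolve the name to addresses and rotate across them round-robin (periodic re-resolution to honor TTL is omitted for brevity):

go
// Client-side discovery: resolve a service name to A/AAAA records and
// balance round-robin across them. Name and port are hypothetical.
package main

import (
	"context"
	"fmt"
	"net"
	"sync/atomic"
	"time"
)

type rrPicker struct {
	addrs []string
	next  uint64
}

// pick returns the next address in round-robin order.
func (p *rrPicker) pick() string {
	n := atomic.AddUint64(&p.next, 1)
	return p.addrs[(n-1)%uint64(len(p.addrs))]
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Resolve once; a real client re-resolves periodically to honor TTL.
	addrs, err := net.DefaultResolver.LookupHost(ctx, "payments.prod.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	p := &rrPicker{addrs: addrs}
	for i := 0; i < 4; i++ {
		fmt.Println("dial:", net.JoinHostPort(p.pick(), "50051"))
	}
}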

2.2 Server-side

The client talks to a front/LB (an L4/L7 gateway/ingress); balancing and health checking live on the proxy/balancer side.

Pros: a single place for policy, observability.
Cons: needs a highly available perimeter (N+1, multi-AZ).

2.3 Hybrid

DNS hands out a set of entry points (regional LBs); L7/mesh then balances within them.

3) DNS: basics, records and TTL

3.1 Basic record types

A/AAAA - IPv4/IPv6 addresses.
CNAME - alias for another name (not usable at the zone apex).
SRV - `_service._proto.name` → host/port/weight/priority (for gRPC/LDAP/SIP, etc.).
TXT, SVCB/HTTPS - metadata and pointers (including for HTTP-based discovery).
NS/SOA - zone delegation and attributes.
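
For reference, a sketch of how these record types are queried with Go's standard resolver (the domain names are placeholders):

go
// Look up the record types listed above. example.com is a placeholder.
package main

import (
	"fmt"
	"net"
)

func main() {
	addrs, _ := net.LookupHost("example.com")                 // A/AAAA
	cname, _ := net.LookupCNAME("www.example.com")            // CNAME target
	_, srvs, _ := net.LookupSRV("grpc", "tcp", "example.com") // SRV: _grpc._tcp.example.com
	txts, _ := net.LookupTXT("example.com")                   // TXT
	nss, _ := net.LookupNS("example.com")                     // NS
	fmt.Println(addrs, cname, len(srvs), txts, len(nss))
}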

3.2 TTL and caching

Caching happens at many layers: the OS resolver, a local stub resolver, node-level caches (NodeLocal DNS/CoreDNS), the provider, intermediate recursive resolvers, and the client library. Actual freshness = min(TTL, client policy). Negative answers (NXDOMAIN) are cached too, governed by `SOA MINIMUM`/TTL.

Recommendations:
  • Prod: TTL 30-120s for dynamic records, 300-600s for stable ones.
  • For switchovers (failover), lower the TTL in advance, not "during the fire."
  • Mind the sticky caches of language runtimes (Java/Go/Node); if necessary, configure the resolver TTL inside the runtime (see the cache sketch after this list).
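
Go's standard resolver does not cache lookups itself, so a client that wants TTL discipline needs its own cache. A minimal sketch, assuming a fixed TTL (a production version should take freshness from the records themselves):

go
// Client-side resolver cache with a fixed TTL. Go's net package does
// not cache lookups, so freshness is enforced here. The 60s TTL is an
// assumption for illustration.
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

type entry struct {
	addrs   []string
	expires time.Time
}

type cachedResolver struct {
	ttl time.Duration
	mu  sync.Mutex
	m   map[string]entry
}

func newCachedResolver(ttl time.Duration) *cachedResolver {
	return &cachedResolver{ttl: ttl, m: map[string]entry{}}
}

func (c *cachedResolver) lookup(ctx context.Context, host string) ([]string, error) {
	c.mu.Lock()
	e, ok := c.m[host]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.addrs, nil // still fresh: serve from cache
	}
	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return nil, err // serving stale here is a possible degradation policy
	}
	c.mu.Lock()
	c.m[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}

func main() {
	c := newCachedResolver(60 * time.Second)
	addrs, err := c.lookup(context.Background(), "example.com")
	fmt.Println(addrs, err)
}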

4) DNS balancing and fault tolerance policies

Weighted RR - weights on A/AAAA/SRV (weighted selection is sketched after this list).
Failover - primary/secondary sets (health-check outside).
Geo/Latency - answers steer the client to the "nearest" POP/region.
Anycast - one IP in different POP (BGP); resilient to regional disruptions.
Split-horizon - different answers inside VPC/on-prem and on the Internet.
GSLB - a global load balancer with health checks and policies (latency, geo, capacity).
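
A minimal sketch of weighted selection, the core of weighted RR and of honoring SRV weights; the addresses and weights are illustrative:

go
// Pick an endpoint with probability proportional to its weight.
// Assumes all weights are positive.
package main

import (
	"fmt"
	"math/rand"
)

type endpoint struct {
	addr   string
	weight int
}

func pickWeighted(eps []endpoint) endpoint {
	total := 0
	for _, e := range eps {
		total += e.weight
	}
	n := rand.Intn(total) // uniform in [0, total)
	for _, e := range eps {
		if n < e.weight {
			return e
		}
		n -= e.weight
	}
	return eps[len(eps)-1] // not reached with positive weights
}

func main() {
	eps := []endpoint{{"10.0.0.1:50051", 80}, {"10.0.0.2:50051", 20}}
	// ~80% of picks land on the first endpoint, ~20% on the second.
	fmt.Println(pickWeighted(eps).addr)
}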

5) Health-checks and freshness

DNS itself is "dumb": it knows nothing about backend health. Therefore:
  • either an external health-checker manages records/weights (GSLB, Route 53 traffic policies, external-dns + probes),
  • or the client/mesh performs active outlier ejection and retries across multiple endpoints (sketched below).
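
A minimal outlier-ejection sketch for the second option; the thresholds (3 consecutive failures, a 30s cool-down) are illustrative, not tuned values:

go
// Track consecutive failures per endpoint; after a threshold, eject
// the endpoint for a cool-down window, then let it back in.
package main

import (
	"fmt"
	"sync"
	"time"
)

type outlierSet struct {
	mu        sync.Mutex
	failures  map[string]int
	ejectedTo map[string]time.Time
}

func newOutlierSet() *outlierSet {
	return &outlierSet{failures: map[string]int{}, ejectedTo: map[string]time.Time{}}
}

// usable reports whether the endpoint may currently receive traffic.
func (o *outlierSet) usable(addr string) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	return time.Now().After(o.ejectedTo[addr])
}

// report records the outcome of a request to addr.
func (o *outlierSet) report(addr string, ok bool) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if ok {
		o.failures[addr] = 0
		return
	}
	o.failures[addr]++
	if o.failures[addr] >= 3 { // illustrative threshold
		o.ejectedTo[addr] = time.Now().Add(30 * time.Second)
		o.failures[addr] = 0
	}
}

func main() {
	o := newOutlierSet()
	for i := 0; i < 3; i++ {
		o.report("10.0.0.1:50051", false)
	}
	fmt.Println(o.usable("10.0.0.1:50051")) // false: ejected for 30s
}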

6) Kubernetes: discovery out of the box

Service names: `svc.namespace.svc.cluster.local`.
ClusterIP: a stable virtual IP + kube-proxy/eBPF.
Headless Service (`clusterIP: None`): returns A records for the pods (or their subdomains) and SRV for named ports.
EndpointSlice: scalable endpoint lists (replacing Endpoints).
CoreDNS: the cluster DNS resolver; plugins rewrite/template/forward/cache; deployed behind the `kube-dns` Service.
NodeLocal DNSCache: a local cache on each node → lower latency and insulation from upstream resolver problems.

Example: Headless + SRV

yaml
apiVersion: v1
kind: Service
metadata: { name: payments, namespace: prod }
spec:
  clusterIP: None
  selector: { app: payments }
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051

The client can resolve `_grpc._tcp.payments.prod.svc.cluster.local` (SRV) and get hosts/ports/weights.
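
A sketch of that lookup in Go; the names match the Service manifest above:

go
// Resolve the SRV record published for the headless Service:
// _grpc._tcp.payments.prod.svc.cluster.local
package main

import (
	"fmt"
	"net"
)

func main() {
	_, srvs, err := net.LookupSRV("grpc", "tcp", "payments.prod.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	for _, s := range srvs {
		fmt.Printf("target=%s port=%d weight=%d priority=%d\n",
			s.Target, s.Port, s.Weight, s.Priority)
	}
}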

CoreDNS (ConfigMap fragment)

yaml
apiVersion: v1
kind: ConfigMap
metadata: { name: coredns, namespace: kube-system }
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        cache 30
        loop
        forward . /etc/resolv.conf
        prometheus :9153
        reload
    }
NodeLocal DNS (the idea):
  • A DaemonSet with a local resolver on `169.254.20.10`; kubelet points pods at this address.
  • Reduces p99 name-resolution latency and absorbs upstream DNS flaps.

7) Service discovery outside K8s

Consul: agent, health checks, service catalog, DNS interface (`.consul`), KV for configs.
Eureka/ZooKeeper/etcd: registries for JVM/legacy; often in conjunction with a sidecar/gateway.
Envoy/Istio: EDS/xDS (Endpoint Discovery) and SDS (secrets); services are declared via the control-plane.
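
Returning to Consul: as a sketch, its DNS interface (served by the local agent, port 8600 by default) can be queried with a custom resolver; the service name payments is an assumption:

go
// Query Consul's DNS interface for a service's SRV records by pointing
// a custom resolver at the local agent (default DNS port 8600).
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{
		PreferGo: true, // use the pure-Go resolver so Dial below is honored
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, "127.0.0.1:8600")
		},
	}
	// Empty service/proto means: look up the name as-is.
	_, srvs, err := r.LookupSRV(context.Background(), "", "", "payments.service.consul")
	if err != nil {
		panic(err)
	}
	for _, s := range srvs {
		fmt.Printf("%s:%d\n", s.Target, s.Port)
	}
}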

8) DNS security

DNSSEC: protects record integrity (zone signing). Critical for public domains.
DoT/DoH: encrypts the channel to the recursive resolver (subject to internal policies and compatibility).
ACLs and split-horizon: private zones answered only from the VPC/VPN.
Protection against cache poisoning: port/ID randomization, short TTLs for dynamic records.
Egress policies: allow DNS only to trusted resolvers, and log it.

9) Client behavior and retries

Respect TTL: do not cache endlessly, and do not hammer recursive resolvers with overly frequent lookups.
Happy Eyeballs (IPv4/IPv6) and parallel connections to multiple A/AAAA records reduce tail latency.
Retries only for idempotent requests; use jitter and a bounded retry budget (sketched below).
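
A hedged sketch of that retry discipline: exponential backoff with full jitter plus a simple retry budget (the limits are illustrative, and the shared budget here is not concurrency-safe):

go
// Retry an idempotent operation with full-jitter exponential backoff,
// bounded by both an attempt count and a shared retry budget.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

var errBudgetExhausted = errors.New("retry budget exhausted")

func retry(attempts int, base time.Duration, budget *int, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no sleep after the final attempt
		}
		if *budget <= 0 {
			return errBudgetExhausted
		}
		*budget--
		// Full jitter: sleep a random duration in [0, base*2^i).
		time.Sleep(time.Duration(rand.Int63n(int64(base << uint(i)))))
	}
	return err
}

func main() {
	budget := 10 // shared across calls in a real client
	err := retry(3, 100*time.Millisecond, &budget, func() error {
		return errors.New("upstream unavailable") // placeholder operation
	})
	fmt.Println(err, "budget left:", budget)
}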

Fine-tuning the runtime resolver:
  • Java: `networkaddress.cache.ttl`, `networkaddress.cache.negative.ttl`.
  • Go: `GODEBUG=netdns=go`/`cgo`, `Resolver.PreferGo`, dial timeouts (example after this list).
  • Node: `dns.setDefaultResultOrder('ipv4first')`, `lookup` with `all: true`.
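
For the Go knobs, a minimal sketch: force the pure-Go resolver and bound each lookup with a context timeout (the values are illustrative):

go
// Pure-Go resolver (equivalent in effect to GODEBUG=netdns=go) with a
// per-lookup timeout enforced via context.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	r := &net.Resolver{PreferGo: true}
	ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond)
	defer cancel()
	addrs, err := r.LookupHost(ctx, "example.com")
	fmt.Println(addrs, err)
}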

10) GSLB/DNS switching: practice

Lower the TTL (e.g., 300 → 60) 24-48 hours before a scheduled switchover.
Keep a low-weight canary set of endpoints for validation.
Use weighted + health-check instead of a manual mass update of A-records.
For statics/edge - Anycast; for APIs - Geo/Latency + fast L7 failover.

11) Observability and SLOs for naming

Metrics:
  • Rate/latency of DNS queries, cache hit-ratio, errors by type (SERVFAIL/NXDOMAIN).
  • The percentage of requests with stale responses (if using stale-cache).
  • Success of user operations on record changes (business SLI).
  • p95/p99 resolve-time in applications.
Diagnostics:
  • Trace the path: client → local cache → node cache → cluster resolver → provider recursion.
  • Watch for spikes of NXDOMAIN (name/typo errors) and SERVFAIL (recursion issues, resource limits).

12) Configuration examples

CoreDNS: rewrite and stub zone

Corefile
.:53 {
    log
    errors
    cache 60
    rewrite name suffix .svc.cluster.local .svc.cluster.local
    forward . 10.0.0.2 10.0.0.3
}

example.internal:53 {
    file /zones/example.internal.signed
    dnssec
}

systemd-resolved

ini
[Resolve]
DNS=169.254.20.10
FallbackDNS=1.1.1.1 8.8.8.8
Domains=~cluster.local ~internal
DNSSEC=yes

Envoy: dynamic DNS-refresh

yaml
dns_refresh_rate: 5s
dns_failure_refresh_rate:
  base_interval: 2s
  max_interval: 30s
respect_dns_ttl: true

external-dns (public zone support)

yaml
args:
  - --source=service
  - --source=ingress
  - --domain-filter=example.com
  - --policy=upsert-only
  - --txt-owner-id=cluster-prod

13) Implementation checklist (0-30 days)

0-7 days

A catalog of service names; model selection (client-side/server-side/hybrid).
Baseline TTLs; NodeLocal DNSCache enabled; DNS metrics dashboards.
Ban hard-coded IPs in configs/code.

8-20 days

Headless services + SRV for gRPC; EndpointSlice is enabled.
GSLB/weighted for external; health-checks and canary.
Client timeouts/retries and a retry budget are configured.

21-30 days

Split-horizon and private zones; DoT/DoH per policy.
Switchover test (honoring TTL) and failover drill; post-analysis.
Mesh/EDS, outlier-ejection policies are enabled.

14) Anti-patterns

TTL = 0 in prod → a query storm at the recursive resolvers, unpredictable delays.
Hard-coded IPs/ports; no CNAMEs/aliases between layers.
Changing records "manually" without health-checks and canaries.
One global resolver with no node cache (bottleneck).
Ignoring negative cache (NXDOMAIN spikes).
Attempts to "heal" a database failure via DNS instead of at the data/failover layer.

15) Maturity metrics

100% of services use names; zero hard-IP cases.
CoreDNS/NodeLocal DNSCache in prod, cache hit ratio > 90% on nodes.
GSLB with health checks, documented TTLs, and a switchover runbook.
SRV/EndpointSlice for stateful/gRPC, p99 resolve-time in applications ≤ 20-30 ms.
Alerts for SERVFAIL/NXDOMAIN and cache hit-ratio degradation.
Checks in CI: ban `:latest` tags and hard-coded IPs in charts/configs.

16) Conclusion

Service discovery is a contract of stable names plus cache discipline. Build a hybrid model: DNS gives a fast and simple entry point, while L7/mesh adds health and smart policies. Maintain sane TTLs, node-local caches, and headless services with SRV where needed; use GSLB/Anycast at regional boundaries; and watch NXDOMAIN/SERVFAIL and p99 resolve time. Then a service's name becomes as reliable an asset as the service itself.
