Load balancing and failover
1) Objectives and terms
Load balancing distributes traffic across instances/zones/regions for performance and resilience.
Failover - controlled switching of traffic to a healthy instance, zone, or region when the primary one fails.
RTO/RPO - Recovery Time Objective and Recovery Point Objective (the acceptable amount of data loss).
SLO: target level of availability/latency; serves as a "gate" for automatic failover and rollback.
2) Balancing layers
2.1 L4 (TCP/UDP)
Pros: performance, simplicity, TLS passthrough. Cons: no awareness of routes/cookies.
Examples: NLB/GLB, HAProxy/Envoy L4, IPVS.
2.2 L7 (HTTP/gRPC)
Pros: routing by path/headers, canary weights, sticky sessions. Cons: higher CPU/latency cost.
Examples: NGINX/HAProxy/Envoy/Cloud ALB/API Gateway.
2.3 Global
DNS/GSLB: health-checks + geo/weighted response.
Anycast/BGP: one IP worldwide, nearest announcement point.
CDN/Edge: caching and failover at the perimeter.
3) Distribution algorithms
Round-robin/weighted - basic.
Least connections/latency - for "heavy" requests.
Consistent hashing - user/tenant stickiness without a central session store (see the Envoy sketch after this list).
Hash-based locality - for caches and stateful services.
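For illustration, a minimal Envoy cluster sketch showing where the algorithm is chosen; the api cluster name and the app-1/app-2 hosts are placeholders, and exact field support may vary by Envoy version:
clusters:
  - name: api
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST            # least-loaded backend via power-of-two-choices
    least_request_lb_config:
      choice_count: 2
    load_assignment:
      cluster_name: api
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: app-1, port_value: 8080 }
            - endpoint:
                address:
                  socket_address: { address: app-2, port_value: 8080 }
    # for consistent hashing, swap the policy:
    # lb_policy: RING_HASH
    # ring_hash_lb_config: { minimum_ring_size: 1024 }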
4) Sessions and sticky
Cookie-sticky: L7 LB sets a cookie to return to the instance.
Source-IP sticky: at L4; degrades behind NAT/CGNAT.
Consistent hashing: better for horizontally scaled caches/chat services (a route-level Envoy sketch follows below).
Goal: make the service stateless where possible; otherwise externalize state (sessions in Redis/DB) to simplify failover.
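A hedged sketch of cookie-based stickiness via consistent hashing in Envoy; the cookie name session_id is an assumption, and the hash policy only takes effect when the target cluster uses RING_HASH or MAGLEV:
route:
  cluster: api                  # cluster configured with lb_policy: RING_HASH
  hash_policy:
    - cookie:
        name: session_id        # Envoy issues the cookie itself when it is absent and ttl is set
        ttl: 3600s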
5) Reliability: health-checks and removal from rotation
Active checks: HTTP 200 probes and deep business-path probes (e.g. /healthz/withdraw that verifies dependencies).
Passive (outlier detection): ejecting a backend on 5xx/timeouts.
Warm-up: smooth ramp-up of new instances (slow-start).
Graceful drain - remove from the pool → wait for in-flight requests to complete (see the Kubernetes preStop sketch after the Envoy fragment).
NGINX (fragment):
upstream api {
    zone api 64k;
    least_conn;
    server app-1:8080 max_fails=2 fail_timeout=10s;
    server app-2:8080 max_fails=2 fail_timeout=10s;
    keepalive 512;
}
# applied in the server/location block:
proxy_next_upstream error timeout http_502 http_503 http_504;
proxy_next_upstream_tries 2;
Envoy outlier detection (fragment):
outlier_detection:
  consecutive_5xx: 5
  interval: 5s
  base_ejection_time: 30s
  max_ejection_percent: 50
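In Kubernetes, graceful drain is commonly implemented with a preStop hook plus a termination grace period, so endpoints are removed before the process stops; a minimal sketch, where the container name, image and sleep duration are assumptions to be tuned against your LB's drain interval:
spec:
  terminationGracePeriodSeconds: 30        # must cover the preStop sleep plus in-flight requests
  containers:
    - name: app
      image: app:latest                    # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]   # give the LB time to stop sending new requests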
6) Fault management: timeout/retry/circuit-breaking
Timeouts: shorter than client timeout; specify per-route.
Retries: 1-2 attempts with jitter, only for idempotent operations; do not retry POST without idempotency keys.
Circuit breaker: limit concurrent requests/errors; recover via a "half-open" state.
Budgets: retry budgets and burst coalescing so retries do not turn into a self-inflicted DDoS (see the Envoy fragment below).
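As one possible implementation, an Envoy cluster fragment combining connection/request limits with a retry budget; the thresholds below are illustrative, not recommendations:
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 1024
      max_pending_requests: 256
      max_requests: 1024
      max_retries: 3                        # concurrent retries allowed towards this cluster
      retry_budget:
        budget_percent: { value: 20.0 }     # retries may add at most ~20% extra load
        min_retry_concurrency: 3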
7) Kubernetes-patterns
ClusterIP/NodePort/LoadBalancer/Ingress - basic primitives.
Readiness/Liveness: traffic goes only to ready pods.
PodDisruptionBudget: prevents evicting too many replicas at the same time (manifest sketch after the probe example below).
HPA/VPA: scaling on CPU/RED metrics, coordinated with the LB.
ServiceTopology/Topology Aware Hints: locality by zone.
Service type = LoadBalancer (zonal): at least 2 replicas in each AZ.
readinessProbe:
  httpGet: { path: /healthz/dependencies, port: 8080 }
  periodSeconds: 5
  failureThreshold: 2
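A minimal sketch of the PodDisruptionBudget and a topology-aware Service mentioned above; the app: api labels, ports and minAvailable value are assumptions, and the topology annotation differs across Kubernetes versions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2                  # never voluntarily evict below 2 ready replicas
  selector:
    matchLabels:
      app: api
---
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    service.kubernetes.io/topology-mode: Auto   # zone-local routing hints (K8s >= 1.27)
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080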
8) Cross-zone and cross-regional traffic
Multi-AZ (within region): distribute evenly (zonal LB), storage - synchronous replicas.
Multi-region:
- Active-Active: both regions serve traffic; more complex - requires data replication, consistency, and geographic routing.
- Active-Passive: the main region serves traffic, the reserve is "hot/warm/cold"; simpler and faster to switch over, but with a higher RPO.
Traffic steering between regions:
- Geo-DNS (nearest region).
- Weighted DNS (canaries/redistribution).
- Latency-based (RTT measurements).
- Failover by health/availability signals (probes from multiple vantage points).
9) Data and failover
Cache/state: if possible - regionally local; for Active-Active - CRDT/consistent hashes.
DB:
- Synchronous replication = low RPO, higher latency.
- Asynchronous replication = lower latency, but RPO > 0.
Queues: mirrored/multi-cluster topics; event deduplication.
Design operations for idempotency and replay.
10) Perimeter: DNS/Anycast/BGP/CDN
DNS: short TTL (30-60 s) + health checks from outside your own network.
Anycast: several POPs share one IP - the nearest announcement point receives the traffic; failover happens at the routing level.
CDN/Edge: cache and protective "gateway"; static/media continue to be served when the origin is down; origin-shield + per-POP health checks.
11) Sample configurations
HAProxy L7:
defaults
    timeout connect 2s
    timeout client 15s
    timeout server 15s
    retries 2
    option redispatch

backend api
    balance leastconn
    option httpchk GET /healthz/dependencies
    http-check expect status 200
    server app1 app-1:8080 check inter 5s fall 2 rise 2 slowstart 3000
    server app2 app-2:8080 check inter 5s fall 2 rise 2 slowstart 3000
NGINX sticky by cookie:
upstream api {
    hash $cookie_session_id consistent;
    server app-1:8080;
    server app-2:8080;
}
Envoy retry/timeout (route):
route:
  timeout: 2s
  retry_policy:
    retry_on: 5xx,connect-failure,reset
    num_retries: 1
    per_try_timeout: 500ms
12) Automatic failover: signals and gates
Tech-SLI: 5xx-rate, p95/p99, saturation, TLS handshakes, TCP resets.
Business SLI: success of deposits/disbursements, no payment errors at PSP.
Gates: when thresholds are exceeded, take the zone/instance out of rotation, raise the weights of the stable pool, switch GSLB (an alert-rule sketch follows below).
Runbook: step-by-step switchover/rollback instructions.
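A hedged sketch of such a gate as a Prometheus alerting rule; the http_requests_total metric and the zone label are assumptions about your instrumentation, and the 5% threshold is purely illustrative:
groups:
  - name: failover-gates
    rules:
      - alert: ZoneHighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5..", zone="a"}[5m]))
            / sum(rate(http_requests_total{zone="a"}[5m])) > 0.05
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "5xx rate in zone a exceeds 5% - candidate for draining the zone via GSLB"
Such an alert can trigger the runbook or an automated weight shift; the final switch should still respect the SLO gates above.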
13) Tests and inspections (chaos & game-days)
Chaos tests: disabling AZ/regions, DB/cache degradation, packet-loss simulation.
Game-days: failover drills involving the on-call teams.
Diagnostics: tracing from the perimeter to the backends, correlating release annotations with metrics.
14) Safety and compliance
mTLS between LB ↔ services, WAF/rate limits at the perimeter.
Failure/segmentation zones: blast-radius isolation.
Policies: prohibit single points of failure (SPOF), require a minimum of N replicas per AZ.
15) Anti-patterns
One LB/one zone for all traffic (SPOF).
No deep /healthz check (it stays green while the DB/queue is unavailable).
Retries without idempotency → duplicate transactions/payments.
Sticky by source IP behind large-scale NAT → imbalance.
DNS failover with a high TTL (hours until the switch takes effect).
No graceful drain when taking instances out of rotation → broken in-flight requests.
16) Implementation checklist (0-45 days)
0-10 days
Spread instances across ≥2 AZs; enable readiness/liveness probes and health checks.
Configure L7 timeouts/retries (1 attempt) and outlier detection.
Enable graceful drain and slow-start.
11-25 days
Introduce GSLB (geo/weighted) or Anycast at the perimeter.
Canary weights/route policies; sticky via cookie/consistent hash.
SLO gates for auto-failover (p95/5xx + business SLIs).
26-45 days
Regional DR: Active-Active or Active-Passive with a switchover test.
Chaos days with AZs/regions switched off; RTO/RPO reports.
Automated runbooks (pause/shift/rollback scripts).
17) Maturity metrics
Multi-AZ coverage ≥ 99% of critical paths.
DNS/GSLB/Anycast are implemented for public endpoints.
MTTR for the loss of a single AZ < 5 minutes (p95).
RPO for critical data ≤ target (for example, ≤ 30 seconds).
Quarterly game-days and successful scheduled failovers.
18) Conclusion
Reliable load balancing and failover is a layered architecture: local L7 policies (timeouts/retries/circuit breakers, health checks), correct stickiness and hashing, cross-zone stability, and GSLB/DNS/Anycast at the perimeter. Add SLO gates, idempotency, graceful drain and regular chaos tests, and the loss of a node, a zone or even a region becomes a manageable event with a predictable RTO/RPO.