Load balancing and failover
1) Objectives and terms
Load balancing distributes traffic across instances/zones/regions for performance and resilience.
Failover - controlled switching of traffic to a healthy instance, zone, or region when the primary one fails.
RTO/RPO - Recovery Time Objective and Recovery Point Objective (the acceptable amount of data loss).
SLO: target level of availability/latency; serves as a "gate" for automatic failover and rollback.
2) Balancing layers
2.1 L4 (TCP/UDP)
Pros: performance, simplicity, TLS passthrough. Cons: no awareness of routes/cookies.
Examples: NLB/GLB, HAProxy/Envoy L4, IPVS.
2.2 L7 (HTTP/gRPC)
Pros: routing by path/headers, canary weights, sticky sessions. Cons: higher CPU/latency cost.
Examples: NGINX/HAProxy/Envoy/Cloud ALB/API Gateway.
2.3 Global
DNS/GSLB: health-checks + geo/weighted response.
Anycast/BGP: one IP worldwide, nearest announcement point.
CDN/Edge: caching and failover at the perimeter.
3) Distribution algorithms
Round-robin/weighted - basic.
Least connections/latency - for "heavy" requests.
Consistent hashing - user/tenant stickiness without a central session store (see the Envoy sketch after this list).
Hash-based locality - for caches and stateful services.
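For illustration, a minimal Envoy cluster sketch showing where the algorithm is chosen; the api cluster name and the app-1/app-2 hosts are placeholders, and exact field support may vary by Envoy version:
clusters:
  - name: api
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST            # least-loaded backend via power-of-two-choices
    least_request_lb_config:
      choice_count: 2
    load_assignment:
      cluster_name: api
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: app-1, port_value: 8080 }
            - endpoint:
                address:
                  socket_address: { address: app-2, port_value: 8080 }
    # for consistent hashing, swap the policy:
    # lb_policy: RING_HASH
    # ring_hash_lb_config: { minimum_ring_size: 1024 }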
4) Sessions and sticky
Cookie-sticky: L7 LB sets a cookie to return to the instance.
Source-IP sticky: at L4; degrades behind NAT/CGNAT.
Consistent hashing: better for horizontally scaled caches/chat services (a route-level Envoy sketch follows below).
Goal: make the service stateless where possible; otherwise externalize state (sessions in Redis/DB) to simplify failover.
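A hedged sketch of cookie-based stickiness via consistent hashing in Envoy; the cookie name session_id is an assumption, and the hash policy only takes effect when the target cluster uses RING_HASH or MAGLEV:
route:
  cluster: api                  # cluster configured with lb_policy: RING_HASH
  hash_policy:
    - cookie:
        name: session_id        # Envoy issues the cookie itself when it is absent and ttl is set
        ttl: 3600s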
5) Reliability: health-checks and removal from rotation
Active checks: HTTP 200 probes and deep business-path probes (e.g. /healthz/withdraw that verifies dependencies).
Passive (outlier detection): ejecting a backend on 5xx/timeouts.
Warm-up: smooth ramp-up of new instances (slow-start).
Graceful drain - remove from the pool → wait for in-flight requests to complete (see the Kubernetes preStop sketch after the Envoy fragment).
NGINX (fragment):
upstream api {
    zone api 64k;
    least_conn;
    server app-1:8080 max_fails=2 fail_timeout=10s;
    server app-2:8080 max_fails=2 fail_timeout=10s;
    keepalive 512;
}
# applied in the server/location block:
proxy_next_upstream error timeout http_502 http_503 http_504;
proxy_next_upstream_tries 2;
Envoy outlier detection (fragment):
outlier_detection:
  consecutive_5xx: 5
  interval: 5s
  base_ejection_time: 30s
  max_ejection_percent: 50
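In Kubernetes, graceful drain is commonly implemented with a preStop hook plus a termination grace period, so endpoints are removed before the process stops; a minimal sketch, where the container name, image and sleep duration are assumptions to be tuned against your LB's drain interval:
spec:
  terminationGracePeriodSeconds: 30        # must cover the preStop sleep plus in-flight requests
  containers:
    - name: app
      image: app:latest                    # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]   # give the LB time to stop sending new requests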
6) Fault management: timeout/retry/circuit-breaking
Timeouts: shorter than client timeout; specify per-route.
Retries: 1-2 attempts with jitter, only for idempotent operations; do not retry POST without idempotency keys.
Circuit breaker: limit concurrent requests/errors; recover via a "half-open" state.
Budgets: retry budgets and burst coalescing so retries do not turn into a self-inflicted DDoS (see the Envoy fragment below).
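As one possible implementation, an Envoy cluster fragment combining connection/request limits with a retry budget; the thresholds below are illustrative, not recommendations:
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 1024
      max_pending_requests: 256
      max_requests: 1024
      max_retries: 3                        # concurrent retries allowed towards this cluster
      retry_budget:
        budget_percent: { value: 20.0 }     # retries may add at most ~20% extra load
        min_retry_concurrency: 3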
7) Kubernetes-patterns
ClusterIP/NodePort/LoadBalancer/Ingress - basic primitives.
Readiness/Liveness: traffic goes only to ready pods.
PodDisruptionBudget: prevents evicting too many replicas at the same time (manifest sketch after the probe example below).
HPA/VPA: scaling on CPU/RED metrics, coordinated with the LB.
ServiceTopology/Topology Aware Hints: locality by zone.
Service type = LoadBalancer (zonal): at least 2 replicas in each AZ.
readinessProbe:
  httpGet: { path: /healthz/dependencies, port: 8080 }
  periodSeconds: 5
  failureThreshold: 2
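A minimal sketch of the PodDisruptionBudget and a topology-aware Service mentioned above; the app: api labels, ports and minAvailable value are assumptions, and the topology annotation differs across Kubernetes versions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2                  # never voluntarily evict below 2 ready replicas
  selector:
    matchLabels:
      app: api
---
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    service.kubernetes.io/topology-mode: Auto   # zone-local routing hints (K8s >= 1.27)
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080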
8) Cross-zone and cross-regional traffic
Multi-AZ (within region): distribute evenly (zonal LB), storage - synchronous replicas.
Multi-region:
- Active-Active: both regions serve traffic; more complex - requires data replication, consistency, and geographic routing.
- Active-Passive: the main region serves traffic, the reserve is "hot/warm/cold"; simpler and faster to switch over, but with a higher RPO.
Traffic steering between regions:
- Geo-DNS (nearest region).
- Weighted DNS (canaries/redistribution).
- Latency-based (RTT measurements).
- Failover by health/availability signals (probes from multiple vantage points).
9) Data and failover
Cache/state: if possible - regionally local; for Active-Active - CRDT/consistent hashes.
DB:
- Synchronous replication = low RPO, higher latency.
- Asynchronous replication = lower latency, but RPO > 0.
Queues: mirrored/multi-cluster topics; event deduplication.
Design operations for idempotency and replay.
10) Perimeter: DNS/Anycast/BGP/CDN
DNS: short TTL (30-60 s) + health checks from outside your own network.
Anycast: several POPs share one IP - the nearest announcement point receives the traffic; failover happens at the routing level.
CDN/Edge: cache and protective "gateway"; static/media continue to be served when the origin is down; origin-shield + per-POP health checks.
11) Sample configurations
HAProxy L7:
defaults
    timeout connect 2s
    timeout client 15s
    timeout server 15s
    retries 2
    option redispatch

backend api
    balance leastconn
    option httpchk GET /healthz/dependencies
    http-check expect status 200
    server app1 app-1:8080 check inter 5s fall 2 rise 2 slowstart 3000
    server app2 app-2:8080 check inter 5s fall 2 rise 2 slowstart 3000
NGINX sticky by cookie:
upstream api {
    hash $cookie_session_id consistent;
    server app-1:8080;
    server app-2:8080;
}
Envoy retry/timeout (route):
route:
  timeout: 2s
  retry_policy:
    retry_on: 5xx,connect-failure,reset
    num_retries: 1
    per_try_timeout: 500ms
12) Automatic failover: signals and gates
Tech-SLI: 5xx-rate, p95/p99, saturation, TLS handshakes, TCP resets.
Business SLI: success of deposits/disbursements, no payment errors at PSP.
Gates: when thresholds are exceeded, take the zone/instance out of rotation, raise the weights of the stable pool, switch GSLB (an alert-rule sketch follows below).
Runbook: step-by-step switchover/rollback instructions.
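A hedged sketch of such a gate as a Prometheus alerting rule; the http_requests_total metric and the zone label are assumptions about your instrumentation, and the 5% threshold is purely illustrative:
groups:
  - name: failover-gates
    rules:
      - alert: ZoneHighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5..", zone="a"}[5m]))
            / sum(rate(http_requests_total{zone="a"}[5m])) > 0.05
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "5xx rate in zone a exceeds 5% - candidate for draining the zone via GSLB"
Such an alert can trigger the runbook or an automated weight shift; the final switch should still respect the SLO gates above.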
13) Tests and inspections (chaos & game-days)
Chaos tests: disabling AZ/regions, DB/cache degradation, packet-loss simulation.
Game-days: failover drills involving the on-call teams.
Diagnostics: tracing from the perimeter to the backends, correlating release annotations with metrics.
14) Safety and compliance
mTLS between LB ↔ services, WAF/rate limits at the perimeter.
Failure/segmentation zones: blast-radius isolation.
Policies: prohibit single points of failure (SPOF), require a minimum of N replicas per AZ.
15) Anti-patterns
One LB/one zone for all traffic (SPOF).
No deep /healthz check (it stays green while the DB/queue is unavailable).
Retries without idempotency → duplicate transactions/payments.
Sticky by source IP behind large-scale NAT → imbalance.
DNS failover with a high TTL (hours until the switch takes effect).
No graceful drain when taking instances out of rotation → broken in-flight requests.
16) Implementation checklist (0-45 days)
0-10 days
Spread instances across ≥2 AZs; enable readiness/liveness probes and health checks.
Configure L7 timeouts/retries (1 attempt) and outlier detection.
Enable graceful drain and slow-start.
11-25 days
Introduce GSLB (geo/weighted) or Anycast at the perimeter.
Canary weights/route policies; sticky via cookie/consistent hash.
SLO gates for auto-failover (p95/5xx + business SLIs).
26-45 days
Regional DR: Active-Active or Active-Passive with a switchover test.
Chaos days with AZs/regions switched off; RTO/RPO reports.
Automated runbooks (pause/shift/rollback scripts).
17) Maturity metrics
Multi-AZ coverage ≥ 99% of critical paths.
DNS/GSLB/Anycast are implemented for public endpoints.
MTTR for the loss of a single AZ < 5 minutes (p95).
RPO for critical data ≤ target (for example, ≤ 30 seconds).
Quarterly game-days and successful scheduled failovers.
18) Conclusion
Reliable load balancing and failover is a layered architecture: local L7 policies (timeouts/retries/circuit breakers, health checks), correct stickiness and hashing, cross-zone stability, and GSLB/DNS/Anycast at the perimeter. Add SLO gates, idempotency, graceful drain and regular chaos tests, and the loss of a node, a zone or even a region becomes a manageable event with a predictable RTO/RPO.