Load balancing
1) Why it exists and where it sits in the architecture
The balancer is a "turnstile" between clients and the backend fleet. Its objectives: availability (no single point of failure), latency (lower p95), horizontal scale, security (TLS/WAF), and release manageability (canary/blue-green). Typical placement in the stack:
- Edge/Global: Anycast, GSLB/GeoDNS, CDN/Edge-LB, DDoS protection.
- L4 (TCP/UDP): NLB, Maglev, proxying without TLS termination.
- L7 (HTTP/2, gRPC, WebSocket, QUIC): routing by path/headers/cookies, caching/compression/retries.
- Data tier: DB proxies (PgBouncer/ProxySQL), Redis Cluster/consistent hash, Kafka partitioning.
2) Balancing models and algorithms
Round-Robin (RR): simple, uniform distribution.
Least Connections (LC): good for long-lived connections (WS, gRPC).
Least Request / Power-of-Two-Choices (P2C): compares two random candidates; a good speed/quality trade-off (sketch below).
Weighted RR/LC: weights for canary/hot nodes.
Consistent Hashing (CH): key affinity without a session table (cart, Redis).
Maglev/Flow-hash: fast L3/L4 distribution, resistant to endpoint flapping.
Latency-aware: selection based on sliding-window p50/p95.
EWMA: exponentially weighted moving average of observed latencies.
Recommendation: default to P2C (least-request) on L7; consistent hash for stateful services/caches; least-connections for WS/gRPC.
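A minimal Python sketch of P2C combined with an EWMA latency signal; the `Backend` class, its fields, and the scoring formula are illustrative assumptions, not the behavior of any specific proxy:

```python
import random

class Backend:
    """Illustrative upstream: tracks in-flight requests and an EWMA of latency."""
    def __init__(self, name: str, alpha: float = 0.3):
        self.name = name
        self.inflight = 0          # active requests (least-request signal)
        self.ewma_ms = 50.0        # smoothed latency estimate
        self.alpha = alpha

    def observe(self, latency_ms: float) -> None:
        # EWMA: recent samples weigh more, history decays geometrically
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

def pick_p2c(backends: list[Backend]) -> Backend:
    """Power-of-two-choices: sample two random backends, take the less loaded one."""
    a, b = random.sample(backends, 2)
    score = lambda x: (x.inflight + 1) * x.ewma_ms   # lower is better
    return a if score(a) <= score(b) else b

pool = [Backend("api-1"), Backend("api-2"), Backend("api-3")]
chosen = pick_p2c(pool)
chosen.inflight += 1             # ... send the request, then:
chosen.observe(latency_ms=42.0)  # feed the measured latency back
chosen.inflight -= 1
```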
3) Upstream health: checks and ejections
Health checks: TCP, HTTP 200 / body match, gRPC status; intervals/timeouts/failure thresholds.
Outlier ejection: automatic exclusion of "noisy" instances (consecutive-5xx, success-rate ejection); sketch below.
Slow-start & warmup: gentle ramp-up of new instances (gradual weight increase).
Connection draining: on shutdown/redeploy, let active connections finish without interruption.
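A hedged sketch of how consecutive-5xx ejection and slow-start weighting interact; the thresholds, the linear warm-up curve, and the `UpstreamHealth` class are illustrative assumptions (real proxies such as Envoy expose these as configuration):

```python
import time

class UpstreamHealth:
    """Tracks consecutive 5xx for ejection and ramps weight after (re)joining."""
    def __init__(self, eject_after: int = 5, ejection_s: float = 30.0, warmup_s: float = 10.0):
        self.eject_after = eject_after
        self.ejection_s = ejection_s
        self.warmup_s = warmup_s
        self.consecutive_5xx = 0
        self.ejected_until = 0.0
        self.joined_at = time.monotonic()

    def record(self, status: int) -> None:
        self.consecutive_5xx = self.consecutive_5xx + 1 if status >= 500 else 0
        if self.consecutive_5xx >= self.eject_after:
            # outlier ejection: take the instance out of rotation for a while
            self.ejected_until = time.monotonic() + self.ejection_s
            self.joined_at = self.ejected_until   # warm up again after re-joining
            self.consecutive_5xx = 0

    def effective_weight(self, base_weight: float = 1.0) -> float:
        now = time.monotonic()
        if now < self.ejected_until:
            return 0.0   # excluded from rotation
        # slow-start: weight grows linearly during the warm-up window
        age = now - self.joined_at
        return base_weight * min(1.0, age / self.warmup_s)
```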
4) Sessions and stickiness
Cookie stickiness (L7): `Set-Cookie: lb=<id>; SameSite=Lax; Secure`.
CH by key: `hash(userId | sessionId | cartId)` (hash-ring sketch below).
IP hash: only in closed networks (NAT breaks it).
TTL on stickiness + a fallback when a node is ejected.
Important: minimize the need for stickiness → store the state outside the instance (Redis/DB/JWT).
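A compact consistent-hash ring illustrating CH stickiness; the virtual-node count and the use of MD5 here are illustrative choices:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing with virtual nodes: key -> node, stable under node churn."""
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self.ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        # first virtual node clockwise from the key's position on the ring
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("cart:user-42"))   # same key -> same node until the ring changes
```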
5) Global Balancing (GTM/GSLB)
Anycast + health probes: one IP, traffic goes to the nearest PoP; automatic failover.
GeoDNS/Latency-DNS: answers based on geography/latency.
Regional clusters: "resident" data stays in its region (GDPR); cross-region failover with replication.
Policies: geo-blocking, "sticky region" per account/token (region-selection sketch below).
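A sketch of the GSLB decision itself: pick the nearest healthy region and fail over to the next one; the latency table and health map are assumed inputs that a real GSLB would derive from its probes:

```python
def pick_region(client_latency_ms: dict[str, float], healthy: dict[str, bool]) -> str:
    """Return the healthy region with the lowest probe latency; raise if none left."""
    candidates = [(lat, region) for region, lat in client_latency_ms.items() if healthy.get(region)]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates)[1]

# Example: eu-central is closest but unhealthy -> traffic fails over to eu-west
print(pick_region({"eu-central": 18.0, "eu-west": 29.0, "us-east": 95.0},
                  {"eu-central": False, "eu-west": True, "us-east": True}))
```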
6) Protocols and features
HTTP/2: multiplexing, prioritization; requires a well-tuned upstream connection pool.
gRPC: long-lived streams → least-connections, aggressive health checks.
WebSocket/SSE: stickiness per connection, long idle timeouts, TCP keep-alive.
QUIC/HTTP/3: fast handshake, loss resilience; watch MTU/path-MTU discovery.
TLS termination/mTLS: terminate at the edge/L7 LB; use mTLS/workload identity (SPIFFE) for internal traffic.
7) Overload control
Rate limit: per-IP, per-key, per-route; burst + sustained.
Adaptive concurrency (Envoy): dynamic limit on concurrent requests.
Queue/surge buffer: bounded queue size with clean 503 rejection.
Hedging/parallel racing: duplicate slow requests (idempotent ones only).
Timeout budget: separate connect/read/write timeouts.
Backpressure: `503 + Retry-After`; clients retry with exponential backoff and jitter.
Slow-loris protection: read/write timeouts, minimum transfer rate. (Token-bucket and jittered-backoff sketches follow below.)
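Two of the building blocks above as minimal sketches: a token bucket (burst + sustained rate) and client-side exponential backoff with full jitter; all parameters are illustrative:

```python
import random
import time

class TokenBucket:
    """Rate limit with burst: `capacity` tokens, refilled at `rate` tokens per second."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # caller responds 429/503 + Retry-After

def backoff_with_jitter(attempt: int, base_s: float = 0.1, cap_s: float = 5.0) -> float:
    """Full jitter: sleep a random time in [0, min(cap, base * 2^attempt))."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))

bucket = TokenBucket(rate=50, capacity=100)   # ~50 rps sustained, bursts up to 100
print(bucket.allow(), backoff_with_jitter(attempt=3))
```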
8) Releases and Traffic Management
Canary (weighted): 1–5–10–25–50–100% with guardrails (p95, 5xx, timeouts).
Blue-green: instant switchover; rollback via DNS/LB.
Shadow/mirror: copy of requests without affecting responses; PII masking.
Header/claim routing: `X-Canary: 1` or JWT claims (region/role).
9) Autoscaling and draining
HPA/ASG driven by CPU + RPS + p95 + queue depth.
PreStop hook: Wait for connections to complete.
Warm pool/instance reuse: shortening cold starts.
Capacity planning: a target utilization of 60–70% at p95 is a healthy norm (sizing sketch below).
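The 60–70% target as a back-of-the-envelope sizing step (the RPS numbers are illustrative):

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float, target_util: float = 0.65) -> int:
    """Size the fleet so peak load lands at ~target utilization, not at 100%."""
    return math.ceil(peak_rps / (per_instance_rps * target_util))

# e.g. 12_000 RPS at peak, one instance sustains ~800 RPS at acceptable p95
print(instances_needed(12_000, 800))   # -> 24 instances at ~65% utilization
```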
10) Observability and SLO
LB metrics: RPS, p50/p95/p99, 4xx/5xx, open connections, queue length, ejections, retries, cache hit-ratio.
Tracing: `traceparent`/`x-request-id` propagated through LB → services → databases.
Logs: structured, with PII/PAN masking, correlated with upstream.
Per-route SLOs: e.g. `latency p95 ≤ 300 ms`, `availability ≥ 99.9%`, `5xx ≤ 0.5%`.
Alerts on deviations (SLO burn rate, ejection surge, growth in 5xx/timeouts); burn-rate sketch below.
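A hedged sketch of a burn-rate check for an availability SLO; the 14.4x threshold follows the common multi-window pattern, but treat the windows and numbers as assumptions:

```python
def burn_rate(error_ratio: float, slo_target: float = 0.999) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    error_budget = 1.0 - slo_target          # e.g. 0.1% of requests may fail
    return error_ratio / error_budget

# Page only if both a short and a long window burn faster than 14.4x,
# i.e. the spike is real and still ongoing.
burn_5m, burn_1h = burn_rate(error_ratio=0.02), burn_rate(error_ratio=0.018)
if burn_5m > 14.4 and burn_1h > 14.4:
    print("page: error budget burning too fast")
```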
11) Balancing data and caches
PostgreSQL/MySQL: read/write split (ProxySQL/pgpool) + read replicas; sticky transactions.
- Failover: a synchronous replica for RPO = 0 (more expensive).
- Redis: Cluster + hash slots (slot sketch below); CH for sessions; timeouts and retryable errors.
- Kafka: balance via partitioning and consumer groups; not to be confused with HTTP LB.
- Object storage (S3/MinIO): multi-region failover via GSLB/replication.
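How Redis Cluster maps a key to one of 16384 hash slots (CRC16, XMODEM variant, with `{hash-tag}` support); a sketch for reasoning about slot placement, not a substitute for a client library:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 (poly 0x1021, init 0) as used by Redis Cluster key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """If the key contains a non-empty {hash-tag}, only the tag is hashed."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# Keys sharing a tag land in the same slot -> multi-key ops stay on one shard
print(hash_slot("{user:42}:cart"), hash_slot("{user:42}:session"))
```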
12) K8s and cloud LBs
Service (ClusterIP/NodePort/LoadBalancer): basic L4.
Ingress/Gateway API - L7 routing, canary weights, TLS.
AWS: NLB (L4, high bandwidth), ALB (L7, WAF, sticky, header-routing).
GCP: Global LB (L7/HTTP(S) with Anycast), TCP/UDP proxy LB.
Azure: Front Door (global), Application Gateway (L7), Load Balancer (L4).
13) Configuration examples
13.1 NGINX (L7, least_conn, sticky, canary)
```nginx
upstream api_pool {
    least_conn;
    server api-1:8080 max_fails=3 fail_timeout=10s;
    server api-2:8080 max_fails=3 fail_timeout=10s;
    # sticky cookie requires NGINX Plus
    sticky cookie lb_id expires=30m path=/ secure httponly;
}

upstream canary_pool {
    least_conn;
    server api-canary:8080 weight=1;
}

# route canary traffic by header
map $http_x_canary $dst {
    default api_pool;
    1       canary_pool;
}

server {
    listen 443 ssl http2;
    # ssl_certificate / ssl_certificate_key omitted

    location /api/ {
        proxy_connect_timeout 1s;
        proxy_read_timeout    5s;
        proxy_set_header X-Request-Id $request_id;
        proxy_pass http://$dst;
    }
}
```
13.2 HAProxy (leastconn, health checks, slow-start, stick-table rate limit)
```haproxy
backend api
    balance leastconn
    option httpchk GET /health
    default-server inter 3s fall 3 rise 2 slowstart 10s
    server s1 10.0.0.11:8080 check
    server s2 10.0.0.12:8080 check
    # per-IP request-rate tracking and limiting
    stick-table type ip size 100k expire 30m store http_req_rate(10s)
    http-request track-sc0 src
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 50 }
```
13.3 Envoy (P2C, outlier detection, retries, adaptive concurrency)
```yaml
# Cluster: P2C (least request over two choices) + outlier detection
load_assignment: { ... }   # endpoints omitted
lb_policy: LEAST_REQUEST
least_request_lb_config: { choice_count: 2 }
outlier_detection:
  consecutive_5xx: 5
  interval: 5s
  base_ejection_time: 30s
# Route-level retry policy
retry_policy:
  retry_on: "5xx,reset,connect-failure"
  num_retries: 2
  per_try_timeout: 1s
# Adaptive concurrency is configured as an HTTP filter
# (envoy.extensions.filters.http.adaptive_concurrency.v3.AdaptiveConcurrency)
gradient_controller_config:
  sample_aggregate_percentile: { value: 50 }   # sample at p50
```
13.4 Kubernetes (Gateway API, weighted canary)
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
spec:   # metadata/parentRefs omitted
  rules:
    - matches: [{ path: { type: PathPrefix, value: /api } }]
      backendRefs:
        - { name: api-v1, weight: 90, port: 8080 }
        - { name: api-v2-canary, weight: 10, port: 8080 }
```
14) Checklists
Before LB/route release
- Algorithm selected (P2C/LC/CH) for traffic type.
- Health-checks and ejection thresholds are configured.
- Slow-start, warmup, connection-drain enabled.
- TLS/mTLS, HSTS, secure ciphers; HTTP/2/3 if necessary.
- Sticky/CH only where required; TTL and fallback.
- Rate-limit/burst, timeouts, retry-budget, adaptive concurrency.
- Logs/traces: `trace-id` propagated; PII masking.
- SLOs/alerts on p95/5xx/ejections/queue length.
- Canary weights + rollback plan; shadow traffic for large changes.
For payment/compliance routes
- Idempotency-Key.
- Failover between PSPs; same-method checks.
- Error codes are normalized; ETA/reasons per customer.
For DB/Caches
- RW split/replicas; timeouts, retries on network errors.
- CH/slot-hash for Redis; protection against "hot keys."
- Monitoring of latency and replication lag.
15) Quality metrics (minimum)
Latency p50/p95/p99 by route/method.
Error rate 4xx/5xx, timeout/overflow.
Open/active connections, queue depth, retry count.
Outlier ejections and causes.
Sticky hit-ratio / cache hit-ratio.
GSLB: regional traffic distribution, failovers, PoP availability.
16) Anti-patterns
A single monolithic, unprotected LB (a single point of failure).
Sticky sessions "for everything" instead of externalizing state.
Global unbounded queues (they hide the problem and inflate p99).
Retries without jitter/budget: a request "storm."
Trusting `X-Forwarded-For` without a trusted-proxy list.
No connection draining during scale-in → WS/gRPC connections break.
Ignoring long-lived connections when autoscaling.
17) iGaming specifics
Peaks and tournaments: micro-cache on catalogs/listings (1–5 s), autoscaling on queue depth.
Live games/streams: LC for long-lived connections, prefer the nearest PoP.
Payments: routing by geo/currency/amount/provider; strict timeouts and idempotency.
Responsible gaming and compliance: prioritize requests for limits/blocks even under degradation (fail-open/fail-closed per policy).
18) Implementation process (4 sprints)
1. Traffic map: protocols, p95/p99 loads, critical routes.
2. LB configuration: algorithms, health/outlier, TLS, limits/timeouts, observability.
3. GSLB/Edge: Anycast/GeoDNS, PoP support, regional data policies.
4. Release strategy: canary/shadow, SLO alerts, autoscale + drain, post-incident analysis.
Final cheat sheet
Choose an algorithm for the type of traffic (P2C/LC/CH) and the duration of the connection.
Keep your upstream "healthy": health-checks + outlier + slow-start + drain.
Manage peak load: rate limits, adaptive concurrency, bounded queues with rejection.
Use GSLB/Anycast for global availability and compliance by region.
Observability and SLO are mandatory; releases - via canary/shadow with rollback plan.
Where possible, move session state off instances and stickiness off the LB.