Low latency architecture
Why Low Latency Architecture
Low latency is not just a fast average but stable tails (p95/p99) under real load. The path there runs through a latency budget, queue and retry discipline, data and cache locality, the right protocols and connection handling, and strict operational practice (limits, observability, degradation).
Delay goals and budget
1. Define the SLO: "p95 ≤ 120 ms, p99 ≤ 250 ms, error rate ≤ 0.3%".
2. Assemble the budget: client → edge → region → services → storage → response.
- Client-edge: 15 ms
- Edge-region: 15 ms
- Gateway/L7: 10 ms
- Business service: 40 ms
- Storage/cache: 25 ms
- Slack/jitter: 15 ms
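The budget above can be sanity-checked mechanically: the per-hop allocations (including the slack line that absorbs jitter) must sum to no more than the p95 SLO. A minimal sketch, using the example numbers from this section:

```python
# Verify that per-hop latency budgets fit inside the p95 SLO.
# Hop names mirror the example budget above; values are from the text.
SLO_P95_MS = 120

budget_ms = {
    "client-edge": 15,
    "edge-region": 15,
    "gateway-l7": 10,
    "business-service": 40,
    "storage-cache": 25,
    "slack-jitter": 15,
}

total = sum(budget_ms.values())
assert total <= SLO_P95_MS, f"budget {total} ms exceeds SLO {SLO_P95_MS} ms"
print(total)  # 120
```

Running a check like this in CI whenever a hop's budget changes keeps the budget and the SLO from silently drifting apart.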
Metrics and tails
Measure p50/p90/p95/p99, both end-to-end and on every hop.
Break down by labels: region, method, client version, network type (mobile/broadband), payload size.
Distinguish queue time from service time (see Little's Law: L = λ·W).
Tail-sensitive techniques: hedged requests (used sparingly and with safeguards), a strict ban on cascading retries.
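Little's Law, mentioned above, gives a quick sanity check on queueing: the average number of requests in the system equals the arrival rate times the average time each request spends in it. A tiny worked example (the numbers are illustrative):

```python
# Little's Law: L = lambda * W
# (requests in flight = arrival rate * average time in system)
arrival_rate = 500        # requests per second (illustrative)
time_in_system = 0.040    # seconds: queue wait + service time

in_flight = arrival_rate * time_in_system
print(in_flight)  # 20.0 requests in flight on average
```

If the measured number of in-flight requests is much higher than this estimate, requests are piling up in queues rather than being served, and p99 will suffer long before p50 does.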
Network and Protocols
QUIC/HTTP/3: fewer losses on mobile/roaming, multiplexing without head-of-line blocking.
TLS 1.3 and 0-RTT (only for idempotent requests that are safe to replay).
DNS: short TTLs for dynamic routes, Anycast for POPs.
TCP: `TCP_NODELAY` (used judiciously), disabling Nagle/delayed ACK where justified; keep-alive and fast connection re-establishment.
gRPC/HTTP/2: multiplexing, flow-control and window tuning; avoid heavy compression on small payloads.
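As a concrete illustration of the `TCP_NODELAY` point above, a minimal sketch of disabling Nagle's algorithm on an outgoing socket so small writes are flushed immediately instead of being coalesced:

```python
import socket

# Disable Nagle's algorithm on a TCP socket before connecting, so small
# request payloads are sent immediately rather than buffered.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
print(nodelay != 0)  # True
```

Apply this only where justified: for large streaming writes, Nagle's coalescing can actually reduce packet overhead.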
Connections and Pools
Separate pools per domain/destination (so that slow neighbors do not steal slots).
Warm-up/Keep-alive: Maintain a steady number of warm connections.
Connection coalescing (HTTP/2/3) and reuse.
Timeouts: `connect`, `TLS handshake`, `request`, `idle`; use different values on different hops.
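The per-phase timeout idea can be sketched with the standard library; the values and the `fetch` helper here are illustrative assumptions, not recommendations:

```python
import socket

# Phase-specific timeouts: a short connect timeout fails fast on an
# unreachable peer, while the request itself gets a separate deadline.
CONNECT_TIMEOUT_S = 0.1   # fail fast if the peer is down (illustrative)
REQUEST_TIMEOUT_S = 0.5   # allow more time for the actual exchange

def fetch(host: str, port: int, payload: bytes) -> bytes:
    # Phase 1: connect, with its own short timeout.
    sock = socket.create_connection((host, port), timeout=CONNECT_TIMEOUT_S)
    try:
        # Phase 2: switch to the request deadline for send/receive.
        sock.settimeout(REQUEST_TIMEOUT_S)
        sock.sendall(payload)
        return sock.recv(4096)
    finally:
        sock.close()
```

The point is the asymmetry: a connect attempt that needs more than ~100 ms is almost certainly going to a dead or overloaded peer, while a legitimate request may genuinely need several times longer.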
Data and computation locality
Edge/region: bring reads and light computation closer to the user (see Edge nodes and regional logic).
Read-local/Write-global: read from replicas, write to the global source of truth.
Cache hierarchy: CDN/edge cache → regional KV/Redis → service cache → local in-proc.
Warming: preload hot keys during release/scaling.
Stale-while-revalidate for low-risk data.
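A minimal in-process stale-while-revalidate sketch (the class, parameters, and time windows are hypothetical): fresh entries are served directly, stale-but-usable entries are served immediately while a background refresh runs, and fully expired entries are reloaded inline.

```python
import threading
import time

class SWRCache:
    """Serve stale values while revalidating in the background."""

    def __init__(self, fresh_s: float, stale_s: float, loader):
        self.fresh_s = fresh_s   # age below this: serve as-is, no I/O
        self.stale_s = stale_s   # extra window: serve stale, refresh in background
        self.loader = loader     # key -> value, e.g. a database read
        self.store = {}          # key -> (value, stored_at)

    def get(self, key):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age < self.fresh_s:
                return value  # fresh hit
            if age < self.fresh_s + self.stale_s:
                # Serve the stale value immediately; refresh off the hot path.
                threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
                return value
        return self._refresh(key)  # miss or fully expired: load inline

    def _refresh(self, key):
        value = self.loader(key)
        self.store[key] = (value, time.monotonic())
        return value
```

In production you would additionally deduplicate concurrent refreshes of the same key; this sketch omits that to stay short.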
Repositories and Indexes
Prefer O(1)/O(log N) access paths; keep narrow indexes for frequent queries.
Hot keys: shard by `hash(id)` or add a salt for even distribution.
Batch outgoing calls to the database/cache (up to a reasonable size) instead of dozens of single calls.
For OLTP, keep transactions as short as possible; use read-committed/snapshot isolation instead of serializable locks.
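The batching point above can be sketched as follows; `fetch_many`, `backend_mget`, and the batch size are hypothetical names for illustration:

```python
# Collapse many single-key lookups into a few batched round trips,
# while preserving the caller-visible per-key result mapping.
def fetch_many(keys, backend_mget, batch_size=100):
    """Fetch values for `keys` in chunks instead of one call per key."""
    result = {}
    keys = list(keys)
    for i in range(0, len(keys), batch_size):
        chunk = keys[i:i + batch_size]
        result.update(backend_mget(chunk))  # one round trip per chunk
    return result

# Usage with a fake backend that records round-trip sizes:
trips = []
def fake_mget(chunk):
    trips.append(len(chunk))
    return {k: k.upper() for k in chunk}

values = fetch_many(["a", "b", "c"], fake_mget, batch_size=2)
print(values, trips)  # {'a': 'A', 'b': 'B', 'c': 'C'} [2, 1]
```

Three keys cost two round trips instead of three here; at realistic fan-outs the saving is dozens of round trips per request, which is exactly where tail latency accumulates.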
Concurrency and non-blocking I/O
First, eliminate waiting in queues, then optimize the CPU.
Async I/O and non-blocking drivers; lock-free structures where appropriate.
Avoid global mutexes; granular locks, CAS/versioning.
Thread pools: Fix sizes so you don't run into context switches.
NUMA awareness: binding threads to sockets, local allocators.
JVM/GC and runtime tuning (if applicable)
Allocation discipline: fewer allocations and side effects → fewer GC pauses.
Modern collectors (G1/ZGC/Shenandoah) with pause targets; escape analysis and buffer pooling.
Class/data sharing, JIT warm-up, AOT/native-image for startup-sensitive paths.
Include GC pause histograms in the total delay budget.
Queues, backpressure, overload protection
Keep queues small: long queues produce a pretty p50 and kill p99.
Explicit backpressure: it is better to tell clients to slow down than to buffer silently.
Adaptive concurrency: reduce parallelism as errors/latency grow (Vegas/gradient algorithms, AIMD).
Circuit breaker: fail fast during upstream degradation; bulkheads (compartment isolation) for pools and resources.
Rate limiting: sliding window/token bucket, prioritization (user tier/critical path).
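The AIMD idea mentioned above (additive increase, multiplicative decrease, the same scheme TCP congestion control uses) can be sketched as a tiny concurrency limiter; the class name and parameter values are illustrative:

```python
# AIMD concurrency limit: grow by 1 on success, halve on overload signals
# (timeouts, errors, or p95 exceeding its threshold).
class AIMDLimiter:
    def __init__(self, initial=10, minimum=1, maximum=200):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum

    def on_success(self):
        # Additive increase: probe for more capacity slowly.
        self.limit = min(self.maximum, self.limit + 1)

    def on_overload(self):
        # Multiplicative decrease: back off quickly under pressure.
        self.limit = max(self.minimum, self.limit // 2)

limiter = AIMDLimiter(initial=16)
limiter.on_success()    # limit: 17
limiter.on_overload()   # limit: 8
print(limiter.limit)  # 8
```

The asymmetry is deliberate: capacity is probed cautiously but surrendered aggressively, which keeps the queue in front of the service short when upstreams degrade.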
Retries, hedging, and idempotency
Retry only transient errors, with jitter and a cap on attempts.
Idempotent operations and an `Idempotency-Key` are required for repeats.
Hedged requests: send a duplicate after a threshold (for example, p95 + 10 ms) and always cancel the extra one.
Never retry inside every layer without coordination; you will get a retry storm.
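The retry rules above can be sketched as a bounded retry helper with full jitter; the function, the error set, and the millisecond values are illustrative assumptions:

```python
import random
import time

# Retry only transient errors, cap the attempts, and add full jitter so
# that synchronized clients do not retry in lockstep. The caller supplies
# an idempotency key so a replay cannot double-apply the operation.
TRANSIENT = (TimeoutError, ConnectionError)

def call_with_retries(op, idempotency_key, max_attempts=3, base_ms=10, cap_ms=60):
    for attempt in range(max_attempts):
        try:
            return op(idempotency_key)
        except TRANSIENT:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the error
            # Full jitter: sleep a random amount up to the exponential bound.
            bound_ms = min(cap_ms, base_ms * (2 ** attempt))
            time.sleep(random.uniform(0, bound_ms) / 1000.0)

# Usage: an operation that fails once, then succeeds on the retry.
attempts = []
def flaky(key):
    attempts.append(key)
    if len(attempts) < 2:
        raise TimeoutError("transient")
    return "ok"

print(call_with_retries(flaky, "key-123"))  # ok
```

Crucially, this policy lives in one layer only; if every layer in the call chain retried independently, a single failure would fan out multiplicatively.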
Caching and warming up
The hot path should involve no network calls under typical load (in-proc/LRU).
Negative cache for 10-60 s so that missing keys do not hammer the backend.
Mass warming up during release/scaling: hot key lists, read-ahead, background refresh.
Degradation and fallbacks
Graceful Degradation: Cut back on minor features when latency rises (less detailed response, no enrichment).
Soft timeouts: Return the base response/cache instead of 5xx.
Fail-open/fail-closed: document the choice explicitly for each call.
Observability and profiling
Distributed tracing: spans on each hop, tail-based sampling.
RED/USE metrics: Rate, Errors, Duration / Utilization, Saturation, Errors.
Top-N "slow" routes daily.
Profilers (alloc/cpu/lock) in production at low overhead (eBPF/async-profiler/Flight Recorder).
Synthetics from different ASN/networks and mobile channels.
Performance Testing
Latency-SLO tests (p95/p99) with real payload and variability.
Chaos scenarios: DNS degradation, increased packet loss, TLS delays, slow storage.
Cold-start/scale-up: Measure the first minutes after release when the caches are empty.
Separate load pools per scenario (do not mix read and write tests).
Mini Templates
Timeout/retry policy (pseudo)
```yaml
timeouts:
  connect: 100ms
  tls_handshake: 150ms
  request_p95_budget: 80ms
retries:
  max_attempts: 2
  backoff: exp_jitter(10ms..60ms)
  retry_on: [CONNECT_ERROR, TIMEOUT, 502, 503, 504]
hedging:
  enabled: true
  threshold: p95 + 10ms
  cancel_extra_on_first_success: true
circuit_breaker:
  error_rate_threshold: 5%
  p95_threshold_increase: 30%
  half_open_after: 10s
```
Pools and bulkheads
```yaml
pools:
  checkout:
    max_conns: 256
    per_host: 64
    queue: 8   # small
  analytics:
    max_conns: 64
    queue: 4
```
Response with degradation
```json
{
  "status": "ok",
  "profile": { "id": "u123", "name": "…" },
  "recommendations": "degraded",
  "served_from": "edge-cache",
  "trace_id": "…"
}
```
Here `recommendations` is returned as "degraded": the heavy enrichment part is disabled.
Application cases
iGaming/finance: payment authorization < 200 ms p95; limits/balances read from regional projections; writes are idempotent and versioned.
Marketing/recommendations: responses < 100 ms p95, feature-flag cache on edge, models as pre-computed scoring plus fast rules on the hot path.
Mobile clients: HTTP/3, aggressive connection reuse, compact payloads (Protobuf), safety timeouts, and an offline cache.
Anti-patterns
Long queues in front of workers: a pretty average and a dead p99.
Cascading retries on every layer without coordination.
A global "mega-cache" without invalidation and warm-up.
Vague timeouts (defaults everywhere): uncontrolled tails.
One shared connection pool for all traffic: head-of-line blocking.
Heavy logic on edge with stateful effects.
Disabled tail telemetry: you cannot see p99.
Production checklist
- There is a per-hop latency budget with matching timeouts.
- HTTP/2/3, TLS 1.3, connection pools, and warm-up are enabled.
- Cache hierarchy, hot key list, and warm-up strategies.
- Read-local/Write-global and hot key sharding.
- Explicit backpressure, small queues, circuit-breakers and bulkheads.
- Retries with jitter, idempotency, limited hedging.
- Tracing with region/version/client labels; monitoring p95/p99.
- Synthetic perf tests from different ASNs/mobile networks; cold-start and chaos scenarios.
- Degradation procedures and fallbacks are documented.
- p95/p99 correspond to SLOs on real load.
FAQ
Why is p99 more important than the average?
Because users hit the tails, not the average; p99 shows how much it actually hurts.
Should you include hedging everywhere?
No. It is useful against rare tails on critical paths, and only under strict limits and idempotency.
How to reduce a cold start?
Warm up caches/connections, pre-compile/JIT warm up, minimize lazy initializations, warm pools.
Can you "beat the network"?
Partially: HTTP/3, edge-POP, Anycast, compact payload, connection reuse and reasonable timeouts.
Summary
Low latency architecture is a system of agreements and disciplines: a latency budget, data locality, small queues, predictable retries, cache hierarchies, the right protocols, and relentless observability of the tails. Following these principles keeps p95/p99 within SLO without sacrificing stability or budget.