Edge nodes and points of presence
Brief Summary
Edge nodes (points of presence, PoPs) reduce network latency, offload the origin, and provide the first line of security. The basic toolkit: Anycast/DNS routing, local caching, L7 policies (WAF, rate limiting, bot filters), observability, automatic failover and SLO discipline. Start from a map of traffic and per-country/region SLAs, then select providers/locations, build CI/CD and IaC, and rehearse failure scenarios.
Why edge and where you need it
Reduce p95 TTFB and jitter for users far from the main data center.
Shift load "left": cache static assets, images, configs and API responses at the edge.
Security: WAF, mTLS termination, anti-bot logic, DDoS absorption at the edge.
Geo-alignment: compliance with data-localization requirements/geo-policies, A/B testing at the PoP level.
PoP Architectural Models
1. Fully managed CDN
Edge as a service: CDN + WAF + edge functions (Workers / Compute@Edge). Fast start, minimal opex.
2. Reverse-proxy PoP (self-hosted/hybrid)
Bare metal/VMs with Nginx/Envoy/HAProxy + local cache + bot filter + mTLS to origin. Flexible, but requires operations effort.
3. Service edge / micro data center
A small cluster (k3s/Nomad/MicroK8s) for near-edge compute: personalization, feature flags, lightweight ML inference, preview renders.
The control plane (management, policies, deployment) is kept separate from the data plane (client traffic). Configs are managed via GitOps/IaC.
Traffic Routing and Mapping
Anycast: one IP announced from many PoPs → clients reach the "closest" one over BGP. Survives a PoP failure quickly (withdraw the /32).
Geo-DNS/latency routing: different IPs/names per region; TTL 30–300 s, health checks (see the failover sketch below).
Fallback paths: secondary PoP in the region, then the global origin.
Anti-pattern: rigid binding to one PoP without a health→routing feedback loop (black holes during degradation).
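A minimal sketch of the health→routing loop for the Geo-DNS path. The dns_api client, PoP addresses, record names, threshold and the /healthz endpoint are all illustrative assumptions, not any particular provider's API:

```python
# Sketch: health-check driven Geo-DNS failover. dns_api is a hypothetical
# client for your DNS provider; addresses, names and thresholds are assumptions.
import urllib.request

POPS = {
    "eu": {"primary": "203.0.113.10", "secondary": "203.0.113.20"},
    "sa": {"primary": "198.51.100.10", "secondary": "198.51.100.20"},
}
FAIL_THRESHOLD = 3  # consecutive failed probes before repointing the record

def probe(ip: str, timeout: float = 2.0) -> bool:
    try:
        # Plain-HTTP probe of a dedicated health endpoint on the PoP.
        with urllib.request.urlopen(f"http://{ip}/healthz", timeout=timeout) as r:
            return r.status == 200
    except Exception:
        return False

def reconcile(dns_api, fails: dict) -> None:
    """Run once per probe interval; keep DNS TTLs short (30-300 s) in advance."""
    for region, pop in POPS.items():
        if probe(pop["primary"]):
            fails[region] = 0
            dns_api.set_record(f"{region}.app.example.com", pop["primary"], ttl=60)
        else:
            fails[region] = fails.get(region, 0) + 1
            if fails[region] >= FAIL_THRESHOLD:
                dns_api.set_record(f"{region}.app.example.com", pop["secondary"], ttl=60)
```

The short TTL is what bounds the failover window; setting it only after a failure is too late, which is why it is configured up front.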
Edge caching
Layers: static assets → aggressive TTLs; semi-dynamic content (catalogs, configs) → TTL + stale-while-revalidate; GET APIs → short TTLs / invalidation keys.
Cache key: method + URI + variant headers (Accept-Encoding, Locale, Device-Class) + auth context where allowed.
Invalidation: by tags/prefixes, event-driven (webhook from CI/CD; see the sketch after the nginx example), time-based + versioning (asset hashing).
Cache-poisoning protection: URL normalization, a limited `Vary` set, header limits, strict rules for `Cache-Control`.
```nginx
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=EDGE:512m max_size=200g inactive=7d;

map $http_accept $vary_key {
    default       "";
    "~image/avif" "avif";
    "~image/webp" "webp";
}

server {
    location /static/ {
        proxy_cache EDGE;
        proxy_cache_key "$scheme$request_method$host$uri?$args $vary_key";
        proxy_ignore_headers Set-Cookie;
        add_header Cache-Control "public, max-age=86400, stale-while-revalidate=600" always;
        proxy_pass https://origin_static;
    }
}
```
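To make the event-driven path concrete, here is a minimal sketch of a purge-by-tag endpoint a PoP could expose to CI/CD; the endpoint path, payload shape and in-memory tag index are assumptions, not any particular product's API:

```python
# Sketch: event-driven cache invalidation endpoint on a PoP, assuming the
# cache keeps a tag -> keys index (filled when objects are stored).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

TAG_INDEX: dict[str, set[str]] = {}  # tag -> set of cache keys
CACHE: dict[str, bytes] = {}         # cache key -> cached body

class PurgeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/internal/purge":
            self.send_error(404)
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        tags = json.loads(body).get("tags", [])
        purged = 0
        for tag in tags:
            for key in TAG_INDEX.pop(tag, set()):
                CACHE.pop(key, None)
                purged += 1
        self.send_response(200)
        self.end_headers()
        self.wfile.write(json.dumps({"purged": purged}).encode())

if __name__ == "__main__":
    # In practice this endpoint sits behind mTLS/auth; omitted in the sketch.
    HTTPServer(("127.0.0.1", 8081), PurgeHandler).serve_forever()
```

In production the purge fans out to every PoP, which is what keeps the ≤ 60 s tag-invalidation target (see the SLO section) realistic.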
Compute at the edge (lightweight)
WAF and bot management: signature and behavioral checks, device fingerprinting, click-rate analysis.
Rate limiting/graylisting: token bucket or sliding window, captcha/challenge, diversion of dubious traffic to a degraded route (see the sketch after this list).
Low-state personalization: geo/language banners without PII; edge KV stores for fast feature flags.
Event-driven functions: generating previews, resizing images, signing links, canary redirects.
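A minimal sketch of the sliding-window variant with per-client counters; the window, limit and in-process store are illustrative (a real PoP would typically keep counters in a shared store keyed by client/ASN):

```python
# Sketch: sliding-window rate limiter of the kind used at the edge.
import time
from collections import defaultdict, deque

WINDOW_S = 60
LIMIT = 600  # requests per window per client key (assumption)

_hits: dict[str, deque] = defaultdict(deque)

def allow(client_key: str, now: float | None = None) -> bool:
    """True -> pass the request; False -> answer 429 or send a challenge."""
    now = now if now is not None else time.monotonic()
    q = _hits[client_key]
    while q and now - q[0] > WINDOW_S:  # drop hits that left the window
        q.popleft()
    if len(q) >= LIMIT:
        return False
    q.append(now)
    return True
```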
Security on PoP
mTLS to origin and end-to-end TLS (TLS 1.3) on every hop (see the sketch after this list).
Segmentation: mgmt plane (WireGuard/IPsec), prod traffic, and logs/metrics in separate VRFs/VLANs.
Secrets: read-only keys/certs only; write operations against critical systems are forbidden at the edge.
WAF/ACL: ASN/botnet blocklists, header/body restrictions, protection against slowloris and oversized payloads.
Supply chain: signed artifacts (SBOM), verification on deploy.
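A minimal sketch of the PoP side of mTLS to origin using Python's standard library; the paths and origin hostname are assumptions (in practice nginx's proxy_ssl_certificate or Envoy's UpstreamTlsContext does the same job):

```python
# Sketch: PoP presents a client certificate to the origin (mTLS), TLS 1.3 only.
import ssl
import urllib.request

ctx = ssl.create_default_context(cafile="/etc/edge/ca/origin-ca.pem")
ctx.minimum_version = ssl.TLSVersion.TLSv1_3           # TLS 1.3 on every hop
ctx.load_cert_chain("/etc/edge/certs/pop-client.pem",  # read-only, short-lived
                    "/etc/edge/certs/pop-client.key")

req = urllib.request.Request("https://origin-api.internal/healthz")
with urllib.request.urlopen(req, context=ctx, timeout=5) as resp:
    print(resp.status)
```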
Observability and telemetry
Metrics:
- L3/L4: CPS/RPS, established connections, SYN backlog, drops, retransmits.
- L7: p50/p95/p99 TTFB, upstream time, cache hit ratio, WAF triggers, 4xx/5xx/429.
- TLS: version/cipher, handshake p95, resumption rate, OCSP stapling state.
- Logs: access logs (with PII scrubbed), WAF log, rate-limit and bot-rule events.
- Traces: sampled edge→origin, correlated via `traceparent` or `x-request-id`.
- Log delivery: buffer to a local queue/file → asynchronous shipping to the central log hub (Loki/ELK) with retries (see the sketch below).
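A minimal sketch of the buffer-then-ship pattern, so a log-hub outage never blocks the data plane; the ingest URL, batch size, queue depth and backoff are illustrative assumptions:

```python
# Sketch: local buffering + asynchronous log shipping with retries.
import json, queue, threading, time, urllib.request

LOG_HUB = "https://loghub.internal/ingest"        # assumed central endpoint
BUF: queue.Queue = queue.Queue(maxsize=100_000)   # local buffer

def emit(event: dict) -> None:
    try:
        BUF.put_nowait(event)
    except queue.Full:
        BUF.get_nowait()       # shed the oldest event rather than block serving
        BUF.put_nowait(event)

def shipper() -> None:
    while True:
        batch = [BUF.get()]                       # block until something arrives
        while len(batch) < 500 and not BUF.empty():
            batch.append(BUF.get_nowait())
        data = json.dumps(batch).encode()
        for attempt in range(5):                  # exponential backoff on errors
            try:
                req = urllib.request.Request(
                    LOG_HUB, data=data,
                    headers={"Content-Type": "application/json"})
                urllib.request.urlopen(req, timeout=5)
                break
            except Exception:
                time.sleep(2 ** attempt)
        # after 5 failed attempts the batch is dropped (the sketch keeps it simple)

threading.Thread(target=shipper, daemon=True).start()
```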
SLO for Edge/PoP (examples)
PoP availability: ≥ 99.95% over 30 days (see the error-budget arithmetic below).
p95 TTFB (static): ≤ 100-150 ms regionally.
p95 TTFB (cached GET API): ≤ 200-250 ms; non-cached ≤ 300-400 ms.
Cache hit ratio: static ≥ 90%, semi-dynamic ≥ 60%.
WAF false-positive rate: ≤ 0.1% of legitimate requests.
Tag invalidation time: ≤ 60 s.
Alerts: hit-ratio drop, 5xx/525 growth, handshake failures, 429 growth, health-check flapping, Anycast degradation (withdraws more often than N per hour).
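The availability target translates into a concrete error budget; a one-liner makes the arithmetic explicit:

```python
# Error budget for an availability SLO over a rolling window.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.9995))  # 21.6 minutes of allowed downtime per 30 days
```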
Deploy and CI/CD
GitOps: PoP configs (WAF/rate-limit/routes/cache rules) live in a repository; PR review, canary rollout to one PoP first (see the gate sketch below).
Versioning: prefix policies for testing (`/canary/`), quick rollback.
Secrets: distribution via Vault agents/KMS, short-TTL tokens.
Updates: staging PoP, then a validated pool, then mass rollout.
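A minimal sketch of a canary gate for the rollout: the config is promoted past the canary PoP only if its error rate and p95 TTFB hold for the soak window. The metrics client, metric names and thresholds are assumptions:

```python
# Sketch: canary gate between "1 PoP" and "validated pool" rollout stages.
import time

MAX_5XX_RATIO = 0.005    # illustrative thresholds
MAX_P95_TTFB_MS = 250
SOAK_S = 600

def canary_healthy(metrics) -> bool:
    # metrics is a hypothetical client over your monitoring stack
    return (metrics.ratio("edge_5xx", "edge_requests", pop="canary") <= MAX_5XX_RATIO
            and metrics.quantile("edge_ttfb_ms", 0.95, pop="canary") <= MAX_P95_TTFB_MS)

def gate(metrics, interval_s: int = 30) -> bool:
    """True -> promote the config; False -> roll back the canary PoP."""
    deadline = time.monotonic() + SOAK_S
    while time.monotonic() < deadline:
        if not canary_healthy(metrics):
            return False
        time.sleep(interval_s)
    return True
```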
PoP Topology and Infrastructure
Hardware/network: 10/25/40G uplinks, two independent providers, separate routers for Anycast/BGP, redundant hardware.
Storage: ephemeral volumes + local SSD cache only; no long-lived PII.
Edge compute clusters: k3s/containerd, node taints for network functions, PodDisruptionBudgets.
Out-of-band access: a separate mgmt channel (LTE/second provider) to recover the PoP during an incident.
FinOps and Economics
Traffic profile: shares by region/ASN/CDN offload; peak dynamics (matches/events).
$/GB of egress and $/ms of p95 as target metrics; compare Managed Edge vs Self-PoP TCO (see the sketch below).
Cache economics: a higher hit ratio cuts origin egress and cloud-function costs.
Local channels: volume discounts from providers, IX peering, cache peering with mobile network operators.
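A back-of-envelope sketch of the Managed Edge vs Self-PoP comparison; every price below is an illustrative assumption, not a quote:

```python
# Sketch: TCO crossover between managed edge (pure $/GB) and a self-run PoP
# (fixed monthly cost + cheaper egress). All prices are assumptions.
def managed_cost(tb_month: float, price_per_gb: float = 0.05) -> float:
    return tb_month * 1024 * price_per_gb

def self_pop_cost(tb_month: float, fixed_month: float = 9000.0,
                  price_per_gb: float = 0.01) -> float:
    return fixed_month + tb_month * 1024 * price_per_gb

for tb in (50, 200, 800):
    print(tb, managed_cost(tb), self_pop_cost(tb))
# Self-PoP wins only past a traffic volume; below it, managed is cheaper.
```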
iGaming/fintech specifics
Peaks at key match minutes: canary/graylist levers, registration/deposit rate limits, prioritization of PSP routes.
Antifraud: TLS termination at the edge + device fingerprinting, scoring and soft challenges; a "dark API" that serves bots a different response.
Content/rules localization: countries with special gambling restrictions get dedicated geo-routes and ASN blocklists.
Regulation: accurate timestamps/time synchronization, no PII at the edge, end-to-end encryption and strict PSP SLAs.
Implementation checklist
- Traffic/region map, p95/availability targets by country.
- Model selection (Managed CDN / Self-PoP / Hybrid), location and uplink plan.
- Anycast/BGP + Geo-DNS with health checks and automatic withdraw.
- Cache policies: keys, TTLs, invalidation, poisoning protection.
- Edge security: WAF, rate limiting, mTLS to origin, short-TTL secrets.
- Observability: metrics/L7 logs/traces, delivery to central stacks.
- CI/CD/GitOps, canary PoP, fast rollback.
- DR scenarios: loss of a PoP/uplink, Anycast degradation, CDN outage.
- FinOps: egress/PoP hosting budgets, IX/peering plan.
Common errors
One provider/one uplink per PoP → SPOF.
Default caching without `Vary` control → cache poisoning and data leakage.
No health→routing feedback (DNS/GSLB/BGP) → delays and black holes.
Broadly scoped secrets at the edge → large blast radius.
PII in logs without redaction → compliance problems.
Manual PoP configs → desynchronization and drift.
Mini playbooks
1) Emergency removal of a problem PoP (Anycast/BGP)
1. Health falls below threshold → 2. the controller withdraws the /32 announcement → 3. external probes confirm traffic has moved → 4. RCA and return via a manual flag (see the sketch after the playbooks).
2) Hot cache invalidation by tag
1. CI/CD sends a webhook to the PoP → 2. invalidation via `cache-tag` within ≤ 60 s → 3. check hit ratio and p95.
3) Deflecting a bot burst
1. Activate the "gray" route (captcha/challenge) for suspicious ASNs → 2. raise the cost of the path to origin → 3. remove the rules after the wave subsides.
4) Loss of one uplink
1. ECMP switches to the live provider; 2. egress policy deprioritizes bulk traffic classes; 3. SLA report and ticket to the provider.
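A minimal sketch of the automated withdraw step from playbook 1, written as an ExaBGP API process (ExaBGP reads announce/withdraw commands from the script's stdout); the prefix and health endpoint are assumptions:

```python
# Sketch: withdraw the Anycast /32 when local health fails (ExaBGP API process).
import sys
import time
import urllib.request

PREFIX = "203.0.113.1/32"                 # assumed Anycast service prefix
HEALTH = "http://127.0.0.1:8080/healthz"  # assumed local health endpoint
announced = False

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH, timeout=2) as r:
            return r.status == 200
    except Exception:
        return False

while True:
    ok = healthy()
    if ok and not announced:
        sys.stdout.write(f"announce route {PREFIX} next-hop self\n")
        announced = True
    elif not ok and announced:
        sys.stdout.write(f"withdraw route {PREFIX} next-hop self\n")
        announced = False
    sys.stdout.flush()
    time.sleep(1)
```

The return path stays manual on purpose: the flag in step 4 prevents a flapping PoP from re-announcing itself before the RCA is done.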
Example Envoy config skeleton on a PoP (L7 + cache + WAF hooks)
```yaml
static_resources:
  listeners:
  - name: https
    address: { socket_address: { address: 0.0.0.0, port_value: 443 } }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: edge
          http_filters:
          - name: envoy.filters.http.waf        # external or custom WAF filter
          - name: envoy.filters.http.ratelimit
          - name: envoy.filters.http.router     # router must be last
          route_config:
            virtual_hosts:
            - name: app
              domains: ["app.example.com"]
              routes:
              - match: { prefix: "/static/" }
                route: { cluster: origin_static }
                response_headers_to_add:
                - header: { key: "Cache-Control", value: "public, max-age=86400, stale-while-revalidate=600" }
              - match: { prefix: "/" }
                route: { cluster: origin_api, timeout: 5s }
  clusters:
  - name: origin_static
    connect_timeout: 2s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: origin_static
      endpoints: [{ lb_endpoints: [{ endpoint: { address: { socket_address: { address: "origin-static", port_value: 443 }}}}]}]
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
  - name: origin_api
    connect_timeout: 2s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: origin_api
      # endpoint assumed by analogy with origin_static
      endpoints: [{ lb_endpoints: [{ endpoint: { address: { socket_address: { address: "origin-api", port_value: 443 }}}}]}]
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
```
Summary
A strong edge layer is the right PoP geography + Anycast/Geo-DNS, smart caching and compute at the edge, tight security, observability and automation. Set measurable SLOs, wire health → routing, keep canary levers at hand, and rehearse DR scenarios. Then your platform stays fast and stable everywhere, from Santiago to Seoul, even at the peak of decisive matches and sales.