Scaling network nodes
(Section: Ecosystem and Network)
1) Node roles and traffic paths
Validator/producer (consensus, block production, rollup sequencer): the critical path for finalization.
Reader/indexer (read-only/API/archive): serves application and analytics requests.
Relay/bridge (cross-domain): transfers messages/assets between domains.
Gateway/edge (ingress/gRPC/WebSocket/QUIC): accepts client requests, applies rate limits, caches responses.
Telemetry/observability: collects metrics/logs/traces and runs synthetic probes.
Each role has its own SLO, error budget and scaling policy.
2) Scaling models
2.1 Scale-up
Increase CPU/RAM/SSD/NIC. Fast for peaks, but limited by hardware and can raise the cost per unit of traffic.
2.2 Scale-out
Add replicas behind balancers/queues. Requires idempotency, sticky policies, quorum, and consistent caches (or disabling them).
2.3 Functional separation
Separation of duties: consensus nodes are isolated; RPC/API, indexer/archive, and bridge/relayer each run separately.
2.4 Geo-scale
Regional clusters (EU/US/AP) + anycast/GeoDNS/latency-aware LB; replication that accounts for finalization/latency, plus local caches.
2.5 Sharding/partitioning
Partitioning by key (chainId, shard, topic) for queues/indexers and columnar stores.
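A minimal sketch of key-based partitioning, assuming the three key fields are joined as `chainId|shard|topic` and hashed with SHA-256. The function name `partition_for` is hypothetical, and this is not the built-in partitioner of any particular broker; it only illustrates that the mapping must be stable across processes and hosts.

```python
import hashlib


def partition_for(chain_id: str, shard: str, topic: str, num_partitions: int) -> int:
    """Map a chainId|shard|topic key to a stable partition index."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    key = f"{chain_id}|{shard}|{topic}".encode("utf-8")
    # SHA-256 yields the same value in every process and on every host,
    # unlike Python's built-in hash(), which is salted per process for str.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    # The same key always lands on the same partition.
    print(partition_for("1", "0", "logs", 48))
```

Note that changing `num_partitions` remaps most keys, so the partition count is usually fixed up front or changed together with a planned rebalance.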
3) Request path: balancing, caching, QoS
L4/L7 balancing: health checks, stickiness by token/trace-id, circuit breaker, outlier ejection.
Caches:
- at the edge (short TTL for frequently read RPCs);
- inside the processor (read-through, write-around for indexes);
- negative caches (not found).
QoS classes: P0 (finalization/bridge/payments), P1 (product), P2 (bulk/archive).
Backpressure: tokens/credits, limits on concurrent requests, queues with DLQ.
Admission: pre-filter (auth, limits, geo/sanctions), early rejection of "expensive" requests.
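The token/credit backpressure and per-class QoS above can be sketched as one token bucket per class. This is an illustrative sketch only: the `TokenBucket` class, the `admit` helper, and the idea of charging a higher `cost` to expensive requests are assumptions, not part of any specific gateway.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TokenBucket:
    rate: float       # tokens added per second
    capacity: float   # maximum burst size
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def try_acquire(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Refill by elapsed time, then spend `cost` tokens if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def admit(buckets: Dict[str, TokenBucket], qos_class: str, cost: float = 1.0,
          now: Optional[float] = None) -> bool:
    """Reject unknown classes outright; otherwise charge the class bucket."""
    bucket = buckets.get(qos_class)
    return bucket is not None and bucket.try_acquire(cost, now)
```

A rejected request would typically be answered early (for example with HTTP 429) rather than queued, so overload never reaches the readers.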
4) State management: snapshots, pruning, archive
Full/pruned: pruned nodes serve RPC; archive nodes serve retrospective queries from a separate pool.
Snapshots/fast-sync: regular snapshots, fast bootstrap of new replicas.
Hot/warm/cold storage: hot state on NVMe; historical blocks in S3/object storage with indexes.
Garbage collection/compaction: in scheduled windows, not during peaks.
DA/batch buffers (for L2/bridges): delivery guarantees and a cleanup period with proof receipts.
5) Queues and streaming
Ingress: Kafka/Pulsar/NATS with partition-key = `chainId|shard|topic`.
Consumer groups: scaling by partitions, idempotent handler (outbox/inbox).
DLQ and retries: exponential backoff, poison-message quarantine.
Ordering guarantee: within a partition, for determinism.
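The consumer-side rules above (idempotent handler, exponential backoff, poison-message quarantine) could look roughly like the sketch below. It is broker-agnostic and keeps the inbox and DLQ in memory; `IdempotentConsumer`, `backoff_ms`, and the default delays are hypothetical names and values chosen for illustration. A production inbox would be a durable store updated in the same transaction as the handler's side effects.

```python
import time
from typing import Callable, List, Set, Tuple


def backoff_ms(attempt: int, base_ms: int = 100, cap_ms: int = 5000) -> int:
    """Exponential backoff: 100, 200, 400, ... ms, capped at cap_ms."""
    return min(cap_ms, base_ms * 2 ** (attempt - 1))


class IdempotentConsumer:
    def __init__(self, handler: Callable[[dict], None], max_attempts: int = 3) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be >= 1")
        self.handler = handler
        self.max_attempts = max_attempts
        self.seen: Set[str] = set()                  # inbox: processed message ids
        self.dlq: List[Tuple[str, dict, str]] = []   # quarantined poison messages

    def consume(self, msg_id: str, payload: dict,
                sleep: Callable[[int], None] = lambda ms: time.sleep(ms / 1000)) -> str:
        if msg_id in self.seen:
            return "duplicate"  # redelivery: skip without side effects
        for attempt in range(1, self.max_attempts + 1):
            try:
                self.handler(payload)
            except Exception as exc:
                if attempt == self.max_attempts:
                    self.dlq.append((msg_id, payload, repr(exc)))
                    return "dead-lettered"
                sleep(backoff_ms(attempt))
            else:
                self.seen.add(msg_id)
                return "processed"
        raise AssertionError("unreachable")
```

Because retries happen inline, a slow message holds up its partition; that preserves in-partition order at the cost of throughput.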
6) Transport and network optimizations
QUIC/HTTP/2: multiplexing, reduced head-of-line blocking.
TCP tuning: BBR/CUBIC, larger buffers, `SO_REUSEPORT`.
Kernel/eBPF: accelerated network stack, consistent hashing for balancing.
NIC offload and IRQ pinning to NUMA nodes.
gRPC: keepalive/ping parameters, max-inflight constraints.
WebSocket: connection pools, ping/pong, limit subscriptions per client.
7) Reliability: Quorums, degradation, chaos tests
Read/write quorum (if applicable), leader fencing.
Degradation modes: read-only, "finalized-only", turning off heavy methods.
Chaos engineering: delays/losses, restarts, disk/network failure, rapid-reorg scenarios.
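One way to make the degradation modes enforceable is a small gate evaluated per request. The sketch below is an assumption-heavy illustration: the `Degradation` flags, the `allow` function, and the contents of `WRITE_METHODS` and `HEAVY_METHODS` are hypothetical and would be defined per deployment.

```python
from enum import Flag, auto


class Degradation(Flag):
    NONE = 0
    READ_ONLY = auto()        # reject state-changing calls
    FINALIZED_ONLY = auto()   # serve reads only at the finalized block
    NO_HEAVY = auto()         # switch off expensive methods


# Illustrative method sets; a real deployment would maintain its own lists.
WRITE_METHODS = {"eth_sendRawTransaction"}
HEAVY_METHODS = {"eth_getLogs"}


def allow(active: Degradation, method: str, block_tag: str = "latest") -> bool:
    """Return True if the request may be served under the active degradations."""
    if Degradation.READ_ONLY in active and method in WRITE_METHODS:
        return False
    if Degradation.NO_HEAVY in active and method in HEAVY_METHODS:
        return False
    if (Degradation.FINALIZED_ONLY in active
            and method not in WRITE_METHODS
            and block_tag != "finalized"):
        return False
    return True
```

Modes combine with `|`, so an operator can enable `READ_ONLY | NO_HEAVY` during an incident and lift each one independently.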
8) SLI/SLO and targets
SLI (example):
- p95 RPC latency by method class;
- success rate;
- queue-lag p95;
- time-to-finality p95 (for rails/bridges);
- snapshot bootstrap time;
- state growth per day; CPU/IO saturation.
SLO targets (example):
- P0 RPC p95 ≤ 400 ms; availability ≥ 99.95%;
- finality relay p95 ≤ 3 min;
- queue-lag P0 p95 ≤ 2 s;
- bootstrap of a new reader ≤ 30 min (fast-sync + snapshot);
- error-budget burn rate over a 2-hour window ≤ 2×.
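For the burn-rate target, a worked sketch may help. It uses the common definition: burn rate is the observed error ratio in a window divided by the error budget (1 minus the SLO). The function name and the request counts in the usage line are made up for illustration.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Error-budget burn rate for one window.

    1.0 means the budget is being spent exactly at the sustainable pace;
    2.0 means it would be exhausted in half the SLO period.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic in the window, nothing burned
    budget = 1.0 - slo
    return (errors / total) / budget


if __name__ == "__main__":
    # Hypothetical 2-hour window: 1,000,000 requests, 800 failures, SLO 99.95%.
    rate = burn_rate(errors=800, total=1_000_000, slo=0.9995)
    print(f"burn rate {rate:.2f}x, within target: {rate <= 2.0}")
```

With these numbers the error ratio is 0.08% against a budget of 0.05%, so the budget burns at about 1.6×, inside the ≤ 2× target.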
9) Observability and alerting
Metrics: latency (histogram), RPS, errors (by class), queue-lag, GC/heap, disk-io, p2p peers, gossip-rate.
Traces: end-to-end `trace_id` across edge → RPC → indexer → storage → bridge.
Logs: structured, correlated by `request_id`.
Alerts: burn-rate P0, queue-lag, peer-count below threshold, reorg-spikes, snapshot-drift.
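Structured logs with correlation IDs can be produced with the standard `logging` module alone. The sketch below is one possible shape; the `JsonFormatter` class and the exact field set are assumptions, and the IDs are passed through the standard `extra=` argument.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying correlation IDs."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": record.created,
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Fields supplied via `extra=`; None when the caller omitted them.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("rpc-gateway")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("served %s", "eth_call", extra={"request_id": "req-1", "trace_id": "t-9"})
```

Emitting one JSON object per line keeps the logs easy to join with traces on `trace_id` in whatever backend is used.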
10) Autoscaling patterns
HPA/VPA (K8s): scale on CPU/latency/RPS/queue-lag; KEDA on topic/queue length.
Step scaling: daily peak profiles; predictive scaling based on ML/seasonality.
Warm spares: pre-warmed replicas that take no traffic until gracefully promoted.
Safe rollout: canary + outlier ejection + SLO gates.
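To reason about what an HPA will do, it helps to compute the replica count by hand. The sketch below follows the documented HPA rule, desired = ceil(current × currentMetric / targetMetric), with the default 10% tolerance band and clamping to min/max; the function name and the sample numbers are illustrative.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Replica count an HPA-style controller would aim for."""
    if target <= 0:
        raise ValueError("target must be positive")
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: do not scale
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # p95 latency of 700 ms against a 350 ms target doubles the pool.
    print(desired_replicas(6, 700, 350, min_replicas=6, max_replicas=36))
```

Latency is a weak scaling signal when the bottleneck is downstream (disk, peers), which is why the text pairs it with RPS and queue-lag.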
11) Safety and isolation
mTLS/key pinning; RBAC/ABAC per methods; QoS limits per org/tenant.
Rate-limit and DoS-shield: tokens, captchas for public RPCs, anomaly-detection.
Secret management: short-lived tokens, rotation.
Sandboxes: separate pools for archive/public clients.
12) Reference configurations
12.1 K8s: RPC gateway (scale-out)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: rpc-gateway }
spec:
  replicas: 6
  strategy: { type: RollingUpdate, rollingUpdate: { maxSurge: 2, maxUnavailable: 0 } }
  selector: { matchLabels: { app: rpc-gateway } }
  template:
    metadata: { labels: { app: rpc-gateway, qos: P0 } }
    spec:
      containers:
        - name: gateway
          image: org/rpc-gateway:2.4.1
          ports: [{ containerPort: 443 }]
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits: { cpu: "4", memory: "6Gi" }
          env:
            - { name: MAX_CONCURRENCY, value: "400" }
            - { name: CACHE_TTL_MS, value: "200" }
          readinessProbe: { httpGet: { path: /healthz, port: 443 }, initialDelaySeconds: 5, periodSeconds: 5 }
          livenessProbe: { httpGet: { path: /livez, port: 443 }, initialDelaySeconds: 10, periodSeconds: 10 }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: rpc-gateway-hpa }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: rpc-gateway }
  minReplicas: 6
  maxReplicas: 36
  metrics:
    - type: Pods
      pods:
        metric:
          name: request_latency_p95_ms
        target:
          type: AverageValue
          averageValue: "350" # 350 ms; the "m" suffix would mean milli-units (0.35)
```
12.2 Envoy: prioritization and outlier ejection
```yaml
clusters:
  - name: readers
    type: EDS
    lb_policy: LEAST_REQUEST
    outlier_detection: { consecutive_5xx: 5, interval: 2s, base_ejection_time: 30s }
    circuit_breakers:
      thresholds:
        - { priority: DEFAULT, max_connections: 20000, max_pending_requests: 5000, max_requests: 20000 }
    health_checks:
      - { timeout: 1s, interval: 3s, http_health_check: { path: /healthz } }
route_config:
  request_headers_to_add:
    - header: { key: x-trace-id, value: "%REQ(X-TRACE-ID)%" }
  weighted_clusters:
    clusters:
      - { name: readers, weight: 100 }
```
12.3 Kafka: partitioning by domain
```yaml
topic: "rpc.events"
partitions: 48
replicationFactor: 3
config:
  retention.ms: 604800000 # 7 days
  max.message.bytes: 1048576
  min.insync.replicas: 2
  cleanup.policy: delete
```
12.4 QoS and limits policy
```yaml
qos:
  P0:
    rps_limit_per_org: 1500
    queue_lag_p95_ms: 2000
    retry: { attempts: 3, backoff_ms: [100, 400, 800] }
  P1:
    rps_limit_per_org: 800
  P2:
    rps_limit_per_org: 200
admissions:
  denylist_methods: ["eth_getLogs(>10k blocks)"]
  heavy_query_guard: { max_range_blocks: 5000, require_token: true }
```
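How a gateway might evaluate `heavy_query_guard` is sketched below. The policy file only names the two parameters, so the semantics here are assumptions: the range is counted inclusively, and `require_token` is read as "any range query passing the guard needs an access token". The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HeavyQueryGuard:
    max_range_blocks: int = 5000
    require_token: bool = True

    def check(self, from_block: int, to_block: int, has_token: bool) -> Tuple[bool, str]:
        """Return (allowed, reason) for a block-range query."""
        if to_block < from_block:
            return False, "invalid range"
        span = to_block - from_block + 1  # inclusive block count (assumption)
        if span > self.max_range_blocks:
            return False, "range too large"
        if self.require_token and not has_token:
            return False, "token required"
        return True, "ok"


if __name__ == "__main__":
    guard = HeavyQueryGuard()
    print(guard.check(100, 5099, has_token=True))
    print(guard.check(100, 5100, has_token=True))
```

Running the check at the edge, before the request reaches a reader, is what makes the rejection "early" in the sense of section 3.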
13) Data schemas and sample queries
13.1 Node metrics (reference table)
```sql
CREATE TABLE node_metrics (
  ts TIMESTAMPTZ,
  node_id TEXT, role TEXT, region TEXT,
  rps INT, latency_p95_ms INT, errors_5xx INT,
  queue_lag_ms INT, cpu NUMERIC, mem NUMERIC, io_wait NUMERIC
);
```
13.2 SLO control and burn rate
```sql
SELECT date_trunc('hour', ts) AS h, role,
       AVG(latency_p95_ms) AS p95,
       100.0 * SUM(CASE WHEN latency_p95_ms <= 400 THEN 1 ELSE 0 END) / COUNT(*) AS slo_hit_pct
FROM node_metrics
WHERE ts >= now() - INTERVAL '24 hours'
GROUP BY 1, 2;
```
13.3 Load planning
```sql
SELECT region, role,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY rps) AS rps_p95,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY queue_lag_ms) AS lag_p95
FROM node_metrics
WHERE ts >= now() - INTERVAL '7 days'
GROUP BY region, role;
```
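The `rps_p95` figure from this query can be turned into a replica count with simple arithmetic. The sketch below assumes a measured per-replica capacity and a headroom factor, neither of which comes from the query; both parameters and the function name are hypothetical.

```python
import math


def replicas_needed(rps_p95: float, per_replica_rps: float,
                    headroom: float = 0.3, min_replicas: int = 1) -> int:
    """Replicas required to serve the p95 load with spare headroom."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    required = rps_p95 * (1.0 + headroom) / per_replica_rps
    return max(min_replicas, math.ceil(required))


if __name__ == "__main__":
    # Hypothetical region: 9,000 RPS at p95, 400 RPS per replica, 30% headroom.
    print(replicas_needed(9000, 400))
```

In this example 9,000 × 1.3 / 400 = 29.25, which rounds up to 30 replicas.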
14) Operating regulations
Daily: SLO report, capacity delta, snapshot status, peer health.
Weekly: review of limits/QoS, DR test (bootstrap from snapshot), check of pruning and compaction.
Before release: canary rollout, SLO gates and observed metrics, rollback plan.
Cost accounting: CTS per 1k requests, TPS_per_$ (efficiency per dollar).
15) Incident playbooks
A. RPC p95 latency explosion
1) Enable the P2 throttle and lower sampling; 2) increase gateway/reader replicas; 3) serve part of the traffic from cache only; 4) analyze hot methods and, if necessary, add deny rules.
B. Queue lag on the bus > SLO
1) Autoscale consumers (KEDA); 2) rebalance partitions; 3) temporarily stop bulk jobs.
C. Peer-count drop at validator/relay
1) Restart p2p modules; 2) change seed nodes; 3) check network ACL/NAT; 4) switch protection.
D. Slow bootstrap of a new replica
1) Switch to a fresh snapshot; 2) raise IO bandwidth; 3) temporarily remove archive indexes.
E. Reorg spike / bridge delays
1) Increase the K-confirmations count/window; 2) enable "finalized-only" mode; 3) inform consumers.
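For playbook E, the effect of raising K can be sketched as follows. The counting convention (a block at the head has one confirmation) and the `adjusted_k` policy with a safety margin are assumptions made for illustration, not a rule taken from any particular chain or bridge.

```python
def confirmations(block_height: int, head_height: int) -> int:
    """Blocks built on top of and including the given block; 0 if ahead of head."""
    return max(0, head_height - block_height + 1)


def is_accepted(block_height: int, head_height: int, k: int) -> bool:
    """Treat a block as safe to relay once it has at least k confirmations."""
    return confirmations(block_height, head_height) >= k


def adjusted_k(base_k: int, max_reorg_depth_seen: int, margin: int = 2) -> int:
    """Hypothetical policy: keep K above the deepest recent reorg plus a margin."""
    return max(base_k, max_reorg_depth_seen + margin)


if __name__ == "__main__":
    k = adjusted_k(base_k=12, max_reorg_depth_seen=15)
    print(k, is_accepted(block_height=100, head_height=110, k=k))
```

Raising K trades latency for safety: every extra confirmation adds roughly one block time to the relay's time-to-finality.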
16) Implementation checklist
1. Define node roles and their SLOs/error budgets.
2. Separate functions: consensus/RPC/indexer/archive/bridge/edge.
3. Enable balancing, QoS, backpressure, and queues with DLQ.
4. Set up snapshots/fast-sync, pruning and tiering.
5. Connect metrics/traces/logs, dashboards and burn-rate alerts.
6. Set up autoscaling (HPA/KEDA) and canary releases.
7. Conduct chaos tests and regular DR exercises.
8. Introduce operating regulations and cost control.
17) Glossary
Backpressure - mechanisms for controlling input flow during overload.
DLQ - dead-letter queue for problem messages.
Pruning - deletion of historical state outside the current window.
Fast-sync/Snapshot - an accelerated way to synchronize a new replica.
Outlier-ejection - exclusion of degraded instances from the pool.
Burn-rate - error budget consumption rate relative to SLO.
Bottom line: scaling network nodes is not just "add replicas" but a system discipline spanning architecture, QoS, state management, and operational rigor. By following this framework (role separation, queues, caches, autoscaling, observability, and clear SLOs), the ecosystem gains predictable performance, resilience at peak, and a controllable cost per unit of traffic.