Scaling network nodes

(Section: Ecosystem and Network)

1) Node roles and traffic paths

Validator/producer (consensus/block production/rollup sequencer): the critical path to finality.
Reader/indexer (read-only/API/archive): serves application and analytics requests.
Relay/bridge (cross-domain): transfers messages/assets between domains.
Gateway/edge (ingress/gRPC/WebSocket/QUIC): accepts client requests, applies rate limits, caches.
Telemetry/observability: collects metrics/logs/traces and runs synthetic probes.

Each role has its own SLO, error budget and scaling policy.

2) Scaling models

2.1 Scale-up

Add CPU/RAM/SSD/NIC capacity to a node. Fast for absorbing peaks, but capped by the hardware and can raise the cost per unit of traffic.

2.2 Scale-out

Add replicas behind load balancers/queues. Requires idempotent handlers, sticky-session policies, quorum rules, and consistent caches (or disabling them).

2.3 Functional separation

Separation of duties: consensus nodes are isolated; RPC/API, indexer/archive, and bridge/relayer each run in their own pool.

2.4 Geo-scale

Regional clusters (EU/US/AP) + anycast/GeoDNS/latency-aware LB; replication that accounts for finality/latency, plus local caches.

2.5 Sharding/partitioning

Partition by key (chainId, shard, topic) for queues/indexers and columnar stores.
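A minimal sketch of key-based partitioning, assuming a fixed partition count and a `chainId|shard|topic` key layout. The function name `partition_for` and the use of SHA-256 are illustrative assumptions; the sketch does not reproduce any broker's built-in partitioner.

```python
import hashlib


def partition_for(chain_id: int, shard: int, topic: str, num_partitions: int) -> int:
    """Map a (chainId, shard, topic) key to a stable partition index."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    key = f"{chain_id}|{shard}|{topic}".encode("utf-8")
    # A digest is used instead of the built-in hash(): hash() of a str is
    # salted per process, so replicas would disagree on the partition.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Every replica computes the same index for the same key, so all events of one chain/shard/topic land in one partition. With simple modulo placement, changing `num_partitions` remaps most keys, so the partition count is best chosen with headroom.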

3) Request path: balancing, caching, QoS

L4/L7 balancing: health checks, stickiness by token/trace-id, circuit breaker, outlier ejection.
Caches: at the edge (short TTL for frequently read RPCs); in the request handler (read-through, write-around for indexes); negative caches (for "not found" results).
QoS classes: P0 (finalization/bridge/payments), P1 (product), P2 (bulk/archive).
Backpressure: tokens/credits, a cap on concurrent requests, queues with a DLQ.
Admission control: pre-filter (auth, limits, geo/sanctions), early rejection of "expensive" requests.
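The token-based backpressure and per-class admission described above can be sketched as a token bucket per QoS class. This is a single-process illustration: the names `TokenBucket` and `admit`, and the idea of charging a higher `cost` for expensive requests, are assumptions rather than part of any specific gateway.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Token bucket that refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def try_acquire(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Take `cost` tokens if available; return False without blocking otherwise."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def admit(buckets: Dict[str, TokenBucket], qos_class: str, cost: float = 1.0,
          now: Optional[float] = None) -> bool:
    """Reject unknown QoS classes outright; otherwise charge that class's bucket."""
    bucket = buckets.get(qos_class)
    return bucket is not None and bucket.try_acquire(cost, now)
```

Giving each class its own bucket keeps a P2 bulk burst from draining P0 capacity, and a larger `cost` lets heavy requests be rejected early while cheap ones still pass.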

4) State management: snapshots, pruning, archive

Full/Pruned: pruned nodes serve RPC; archive nodes serve historical queries from a separate pool.
Snapshots/fast-sync: regular snapshots, fast bootstrap of new replicas.
Hot/Warm/Cold storage: hot state on NVMe; historical blocks in S3/object storage with indexes.
Garbage collection/compaction: in scheduled windows, not during peaks.
DA/batch buffers (for L2/bridges): delivery guarantees and a cleanup period backed by proof receipts.

5) Queues and streaming

Ingress: Kafka/Pulsar/NATS with partition key = `chainId|shard|topic`.
Consumer groups: scale by partition; idempotent handlers (outbox/inbox).
DLQ and retries: exponential backoff, poison-message quarantine.
Ordering: guaranteed within a partition, for determinism.
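A minimal sketch of an idempotent consumer with exponential backoff and a DLQ, under stated assumptions: messages carry a unique `id`, the inbox of processed ids is an in-memory set (a real inbox would be persisted together with the handler's side effects), and `InboxConsumer` and `backoff_delays` are hypothetical names.

```python
from typing import Callable, List, Set


def backoff_delays(attempts: int, base_ms: int = 100, factor: int = 2,
                   cap_ms: int = 10_000) -> List[int]:
    """Exponential backoff schedule in milliseconds, capped at cap_ms."""
    return [min(cap_ms, base_ms * factor ** i) for i in range(attempts)]


class InboxConsumer:
    """Handles each message id at most once; exhausted retries go to the DLQ."""

    def __init__(self, handler: Callable[[dict], None], max_attempts: int = 3,
                 sleep: Callable[[int], None] = lambda ms: None) -> None:
        self.handler = handler
        self.max_attempts = max_attempts
        self.sleep = sleep            # injected so tests do not really wait
        self.seen: Set[str] = set()   # inbox: ids already processed
        self.dlq: List[dict] = []     # quarantined poison messages

    def consume(self, message: dict) -> str:
        msg_id = message["id"]
        if msg_id in self.seen:
            return "duplicate"        # redelivery is a no-op
        delays = backoff_delays(self.max_attempts - 1)
        for attempt in range(self.max_attempts):
            try:
                self.handler(message)
            except Exception:
                if attempt < len(delays):
                    self.sleep(delays[attempt])  # back off before the next try
                continue
            self.seen.add(msg_id)
            return "processed"
        self.dlq.append(message)
        return "dead-lettered"
```

Because the id is recorded only after the handler succeeds, a crash mid-handler leads to a retry rather than a lost message; the handler itself must therefore tolerate being re-run.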

6) Transport and network optimizations

QUIC/HTTP/2: multiplexing, less head-of-line blocking.
TCP tuning: BBR/CUBIC, larger buffers, `SO_REUSEPORT`.
Kernel/eBPF: accelerated network stack, consistent hashing for load balancing.
NIC offload and IRQ pinning to NUMA nodes.
gRPC: keepalive/ping parameters, max-inflight constraints.
WebSocket: connection pools, ping/pong, limit subscriptions per client.

7) Reliability: Quorums, degradation, chaos tests

Read/write quorum (if applicable), leader fencing.
Degradation modes: read-only, "finalized-only", disabling heavy methods.
Chaos engineering: injected delay/packet loss, restarts, disk/network failures, rapid-reorg scenarios.

8) SLI/SLO and targets

SLI (example):

p95 RPC latency by method class;
Success-rate; Queue-lag p95;
Time-to-finality p95 (for rails/bridges);
Snapshot bootstrap time;
State growth/day; CPU/IO saturation.

SLO (targets):

P0 RPC p95 ≤ 400 ms; Availability ≥ 99.95%;
Finality relay p95 ≤ 3 min;
Queue-lag P0 p95 ≤ 2 s;
Bootstrap of a new reader ≤ 30 min (fast-sync + snapshot);
Error budget burn rate over a 2-hour window ≤ 2×.
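The burn-rate target can be made concrete with a small calculation. The sketch below assumes the common definition (observed error ratio divided by the error budget, `1 - SLO`); the function names and the sample counts are illustrative.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Error-budget burn rate: observed error ratio / allowed error ratio."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_budget = 1.0 - slo
    return (failed / total) / error_budget


def breaches_budget(failed: int, total: int, slo: float, max_burn: float = 2.0) -> bool:
    """True when the window burns budget faster than the allowed multiple."""
    return burn_rate(failed, total, slo) > max_burn
```

For example, with a 99.95% availability SLO the budget is 0.05% of requests. A 2-hour window with 1,000,000 requests and 1,500 failures has an error ratio of 0.15%, a burn rate of about 3×, and so exceeds the 2× limit.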

9) Observability and alerting

Metrics: latency (histogram), RPS, errors (by class), queue-lag, GC/heap, disk-io, p2p peers, gossip-rate.
Traces: end-to-end `trace_id` through edge→RPC→indexer→storage→bridge.
Logs: structured, correlated by `request_id`.
Alerts: burn-rate P0, queue-lag, peer-count below threshold, reorg-spikes, snapshot-drift.

10) Autoscaling patterns

HPA/VPA (K8s): by CPU/latency/RPS/queue-lag; KEDA by topic/queue length.
Step-scaling: daily peak profiles; predictive scaling based on ML/seasonality.
Warm spares: pre-warmed replicas that take no traffic until gracefully promoted.
Safe rollout: canary + outlier ejection + SLO gates.
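To show how a metric-driven autoscaler reaches a replica count, here is a sketch of the proportional rule described in the Kubernetes HPA documentation (desired = ceil(current × currentMetric / targetMetric), with a tolerance band around the target). It is a simplification: stabilization windows, scaling behaviors, and pod readiness handling are omitted, and the function name is hypothetical.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Proportional scaling decision, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current                      # close enough to target: hold steady
    else:
        desired = math.ceil(current * ratio)   # scale in proportion to the overshoot
    return max(min_replicas, min(max_replicas, desired))
```

With 6 replicas, a p95 of 700 ms against a 350 ms target doubles the pool to 12; a p95 of 360 ms falls inside the tolerance band and changes nothing.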

11) Security and isolation

mTLS/key pinning; RBAC/ABAC per method; QoS limits per org/tenant.
Rate limiting and DoS shield: tokens, CAPTCHAs for public RPCs, anomaly detection.
Secret management: short-lived tokens, rotation.
Sandboxes: separate pools for archive/public clients.

12) Reference configurations

12.1 K8s: RPC Gateway (Scale Out)

yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: rpc-gateway }
spec:
  replicas: 6
  strategy: { type: RollingUpdate, rollingUpdate: { maxSurge: 2, maxUnavailable: 0 } }
  selector: { matchLabels: { app: rpc-gateway } }
  template:
    metadata: { labels: { app: rpc-gateway, qos: P0 } }
    spec:
      containers:
        - name: gateway
          image: org/rpc-gateway:2.4.1
          ports: [{ containerPort: 443 }]
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "4", memory: "6Gi" }
          env:
            - { name: MAX_CONCURRENCY, value: "400" }
            - { name: CACHE_TTL_MS, value: "200" }
          readinessProbe: { httpGet: { path: /healthz, port: 443 }, initialDelaySeconds: 5, periodSeconds: 5 }
          livenessProbe: { httpGet: { path: /livez, port: 443 }, initialDelaySeconds: 10, periodSeconds: 10 }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: rpc-gateway-hpa }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: rpc-gateway }
  minReplicas: 6
  maxReplicas: 36
  metrics:
    - type: Pods
      pods:
        metric:
          name: request_latency_p95_ms
        target:
          type: AverageValue
          averageValue: "350"  # 350 ms; a trailing "m" would mean milli-units, i.e. 0.35

12.2 Envoy: Prioritization and outlier-ejection

yaml
clusters:
  - name: readers
    type: EDS
    lb_policy: LEAST_REQUEST
    outlier_detection: { consecutive_5xx: 5, interval: 2s, base_ejection_time: 30s }
    circuit_breakers:
      thresholds:
        - { priority: DEFAULT, max_connections: 20000, max_pending_requests: 5000, max_requests: 20000 }
    health_checks:
      - { timeout: 1s, interval: 3s, http_health_check: { path: /healthz } }
route_config:
  request_headers_to_add:
    - header: { key: x-trace-id, value: "%REQ(X-TRACE-ID)%" }
  weighted_clusters:
    clusters:
      - { name: readers, weight: 100 }

12.3 Kafka: partitioning by domain

yaml
topic: "rpc.events"
partitions: 48
replicationFactor: 3
config:
  retention.ms: 604800000  # 7 days
  max.message.bytes: 1048576
  min.insync.replicas: 2
  cleanup.policy: delete

12.4 QoS and Limits Policy

yaml
qos:
  P0:
    rps_limit_per_org: 1500
    queue_lag_p95_ms: 2000
    retry: { attempts: 3, backoff_ms: [100, 400, 800] }
  P1:
    rps_limit_per_org: 800
  P2:
    rps_limit_per_org: 200
admissions:
  denylist_methods: ["eth_getLogs(>10k blocks)"]
  heavy_query_guard: { max_range_blocks: 5000, require_token: true }
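One possible reading of the `heavy_query_guard` policy, sketched as an admission check for block-range queries. The interpretation is an assumption: here every range query needs a token and none may span more than `max_range_blocks`. The function name and the inclusive range convention are also hypothetical.

```python
from typing import Optional, Tuple

# Values copied from the policy above; the dict layout is illustrative.
HEAVY_QUERY_GUARD = {"max_range_blocks": 5000, "require_token": True}


def check_range_query(from_block: int, to_block: int, token: Optional[str],
                      policy: dict = HEAVY_QUERY_GUARD) -> Tuple[bool, str]:
    """Return (allowed, reason) for a query over blocks [from_block, to_block]."""
    if to_block < from_block:
        return (False, "invalid range")
    span = to_block - from_block + 1  # inclusive on both ends
    if span > policy["max_range_blocks"]:
        return (False, "range too large")
    if policy["require_token"] and not token:
        return (False, "token required")
    return (True, "ok")
```

Running the check at the gateway, before the request reaches a reader, keeps oversized scans from consuming reader capacity at all.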

13) Data schemas and sample queries

13.1 Node metrics (reference table)

sql
CREATE TABLE node_metrics (
  ts TIMESTAMPTZ,
  node_id TEXT, role TEXT, region TEXT,
  rps INT, latency_p95_ms INT, errors_5xx INT,
  queue_lag_ms INT, cpu NUMERIC, mem NUMERIC, io_wait NUMERIC
);

13.2 SLO compliance and burn rate

sql
SELECT date_trunc('hour', ts) AS h, role,
       AVG(latency_p95_ms) AS p95,
       100.0 * SUM(CASE WHEN latency_p95_ms <= 400 THEN 1 ELSE 0 END) / COUNT(*) AS slo_hit_pct
FROM node_metrics
WHERE ts >= now() - INTERVAL '24 hours'
GROUP BY 1, 2;

13.3 Load planning

sql
SELECT region, role,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY rps) AS rps_p95,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY queue_lag_ms) AS lag_p95
FROM node_metrics
WHERE ts >= now() - INTERVAL '7 days'
GROUP BY region, role;

14) Operating procedures

Daily: SLO report, capacity delta, snapshot status, peer health.
Weekly: review of limits/QoS, DR test (bootstrap from snapshot), check of pruning and compaction.
Before release: canary rollout, SLO gates and observed metrics, rollback plan.
Cost accounting: cost-to-serve (CTS) per 1k requests, TPS_per_$ (throughput per dollar).

15) Incident playbooks

A. RPC p95 latency spike

1) Enable the P2 throttle and lower sampling, 2) add gateway/reader replicas, 3) serve part of the traffic from cache only, 4) profile hot methods and, if necessary, add deny rules.

B. Queue lag on the bus > SLO

1) Autoscale consumers (KEDA), 2) rebalance partitions, 3) temporarily stop bulk jobs.

C. Peer-count drop at validator/relay

1) Restart p2p modules, 2) change seed nodes, 3) check network ACL/NAT, 4) trigger protection switching (failover).

D. Slow bootstrap of a new replica

1) Switch to a fresh snapshot, 2) raise IO bandwidth, 3) temporarily disable archive indexes.

E. Reorg spike/bridge delays

1) Increase the K-confirmations threshold/window, 2) enable "finalized-only" mode, 3) inform consumers.
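The K-confirmations and "finalized-only" levers in playbook E can be sketched as a rule for the highest block height that readers serve. The definitions are assumptions: a block has K confirmations once K blocks have been built on top of it, and `finalized_height` comes from the chain's own finality signal. Both function names are hypothetical.

```python
def is_confirmed(block_height: int, head_height: int, k: int) -> bool:
    """True once at least k blocks have been built on top of the block."""
    return head_height - block_height >= k


def serveable_height(head_height: int, finalized_height: int, k: int,
                     finalized_only: bool) -> int:
    """Highest block height that readers/bridges should expose."""
    if finalized_only:
        return finalized_height            # degradation mode: ignore unfinalized blocks
    # Otherwise expose blocks that are finalized or at least k blocks deep.
    return max(finalized_height, head_height - k)
```

Raising `k` during a reorg spike trades freshness for safety: with the head at 1000 and finality at 960, `k = 12` serves up to block 988, while `k = 64` falls back to the finalized block 960.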

16) Implementation checklist

1. Define node roles and their SLOs/error budgets.
2. Separate functions: consensus/RPC/indexer/archive/bridge/edge.
3. Enable balancing, QoS, backpressure, and queues with a DLQ.
4. Set up snapshots/fast-sync, pruning and tiering.
5. Connect metrics/traces/logs, dashboards and burn-rate alerts.
6. Set up autoscaling (HPA/KEDA) and canary releases.
7. Conduct chaos tests and regular DR exercises.
8. Introduce operating procedures and cost control.

17) Glossary

Backpressure - mechanisms that throttle incoming flow under overload.
DLQ - dead-letter queue for messages that cannot be processed.
Pruning - deletion of historical state outside the retention window.
Fast-sync/Snapshot - an accelerated way to synchronize a new replica.
Outlier-ejection - exclusion of degraded instances from the pool.
Burn-rate - the rate at which the error budget is consumed relative to the SLO.

Bottom line: scaling network nodes is not just "add replicas" but a systematic discipline of architecture, QoS, state management and operational rigor. By following this framework (role separation, queues, caches, autoscaling, observability, and clear SLOs), the ecosystem gains predictable performance, resilience under peak load, and a controllable cost per unit of traffic.
