Ecosystem updates without downtime
(Section: Ecosystem and Network)
1) Zero-downtime purpose and principles
Zero-downtime updates keep the network and products running continuously while code, configurations, data schemas and protocols change. Basic principles:
- Backward/forward compatibility at contract boundaries.
- Progressive delivery instead of "big switch."
- Observability and reversibility: metrics, traces, fast rollback.
- Idempotency and safe retries for network and payment flows.
- Fault isolation: cell architecture, circuit-breakers, fan-out limits.
2) Downtime-free release strategies
Blue-Green - two identical stacks (Blue = prod, Green = new). Traffic is switched atomically at the balancer level with the possibility of instant rollback.
Canary - phased share of traffic (1%→5%→20%→50%→100%) with SLO gates.
Rolling - node-by-node pool update with readiness check and connection drainage.
Shadow/Traffic Mirroring - mirror requests for a new version without affecting responses.
Feature Flags - toggling business features on top of an unchanged API (gradual rollout).
Dark Launch - enabling hidden logic branches for telemetry and profiling.
Recommendation: for critical services - a combination of canary + rolling + feature flags; for gateways and APIs - blue-green with short switching.
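A canary split is usually sticky: the same client should land on the same version at every step of the rollout. A minimal sketch of one way to do that, assuming clients carry a stable identifier; the function names and the salt value are hypothetical and not tied to any particular balancer:

```python
import hashlib


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in 0..9999 (basis points)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def routes_to_canary(user_id: str, canary_percent: float, salt: str = "release-42") -> bool:
    """True if this user belongs to the canary share of traffic.

    The salt is an illustrative per-release value; changing it reshuffles users.
    """
    return bucket(user_id, salt) < canary_percent * 100
```

Because the bucket depends only on the identifier and the salt, raising the share from 1% to 5% keeps every earlier canary user on the new version and only adds new ones.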
3) Contractual compatibility (API/events/protocol)
API: versioning by URI/headers; adding fields is allowed, deleting/renaming only through the "deprecate window."
Events (event-bus): fields are add-only; keys are immutable; new types are introduced as new topics/versions.
Schemas (Avro/JSON-Schema/Protobuf): schema registry with 'BACKWARD'/'FULL' compatibility modes.
Network protocol/P2P: version handshake and capability negotiation (nodes declare supported versions/features).
Gateways: adapters between vN and vN+1 (transcoding/field mapping) for the migration period.
Deprecate policy (example): announcement → ≥90 days of warning → "deprecated" flag → deletion of field/endpoint.
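The "add-only" rule for fields can be checked mechanically before a schema is published. The sketch below is a deliberately simplified illustration over flat field-to-type maps; it is not how any specific schema registry works, and `breaking_changes` is a hypothetical helper name:

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare two flat field->type maps and list changes that break old readers."""
    problems = []
    for name, typ in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name] != typ:
            problems.append(f"type changed: {name} ({typ} -> {new[name]})")
    # Fields that exist only in `new` are additions and are allowed.
    return problems
```

For example, adding a `currency` field returns an empty list, while dropping `amount` is reported as a breaking change that has to go through the deprecate window.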
4) Expand → Migrate → Contract
1. Expand - add new structures/indexes/columns (nullable/default); dual-write to the old and new formats.
2. Migrate - background migrations, backfill, consistency validators; read through an adapter that supports both schemas.
3. Contract - disable reading/writing to the old schema and remove the technical debt once the "deprecate window" has closed.
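The Migrate step relies on two pieces: a read adapter that tolerates both schemas and a backfill that can be rerun safely. A minimal in-memory sketch, assuming rows are plain dicts and the new value is derived as `ref_<id>`; the helper names and the batch size are illustrative assumptions:

```python
def read_payout_ref(row: dict) -> str:
    """Read adapter: use the new column when present, else derive it from old data."""
    ref = row.get("payout_ref")
    return ref if ref is not None else f"ref_{row['id']}"


def backfill(rows: list[dict], batch_size: int = 2) -> int:
    """Fill payout_ref for one batch of unmigrated rows; return how many were updated.

    Idempotent: rows that already have a value are never touched again.
    """
    pending = [r for r in rows if r.get("payout_ref") is None][:batch_size]
    for r in pending:
        r["payout_ref"] = f"ref_{r['id']}"
    return len(pending)
```

A background job would call `backfill` repeatedly until it returns 0; readers get the same answer from `read_payout_ref` before, during and after the backfill.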
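SQL (simplified):
```sql
-- Expand: nullable column; CONCURRENTLY avoids blocking writes (PostgreSQL)
ALTER TABLE payouts ADD COLUMN payout_ref TEXT NULL;
CREATE INDEX CONCURRENTLY ix_payouts_ref ON payouts(payout_ref);
-- Migrate (batched + idempotent): rerun until 0 rows are updated;
-- the batch size of 1000 is illustrative
UPDATE payouts SET payout_ref = concat('ref_', id)
WHERE id IN (SELECT id FROM payouts WHERE payout_ref IS NULL LIMIT 1000);
-- Contract (after the compatibility window)
ALTER TABLE payouts ALTER COLUMN payout_ref SET NOT NULL;
```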
Event transactionality: Use Outbox (transaction with event record) + CDC for guaranteed delivery.
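The Outbox idea can be shown with the standard-library `sqlite3` module: the business row and its event are written in one transaction, and a separate relay publishes them later. This is a sketch under stated assumptions; the table layout, the `payout.created` topic and the polling `relay` (standing in for real CDC) are all hypothetical:

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a business table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE payouts (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL);
        CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                             topic TEXT, payload TEXT, sent INTEGER DEFAULT 0);
    """)
    return conn


def create_payout(conn: sqlite3.Connection, payout_id: int, amount: int) -> None:
    """Write the payout and its event atomically: both rows commit or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO payouts (id, amount) VALUES (?, ?)", (payout_id, amount))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("payout.created", json.dumps({"id": payout_id, "amount": amount})),
        )


def relay(conn: sqlite3.Connection, publish) -> int:
    """Publish unsent outbox rows in order, marking each as sent afterwards."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Because a row is marked as sent only after `publish` returns, a crash in between leads to a repeated delivery rather than a lost event, so consumers are expected to deduplicate.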
5) Long-lived connections and drainage
Graceful shutdown: SIGTERM → stop accepting new requests → set 'readiness = fail' → wait for WebSocket/HTTP2/QUIC streams to drain → close.
Connection draining on the balancer: 'deregister_delay' of 30-120 s; sticky sessions via tokens, not IP.
Back-pressure: limit new connections when upstream p99_latency rises.
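The graceful shutdown sequence can be sketched as a small in-flight counter: once shutdown starts, readiness fails, new work is refused, and the process waits for active requests up to a deadline. `DrainableServer` is a hypothetical class for illustration; a real service would call `shutdown` from its SIGTERM handler and expose `ready` through its readiness probe:

```python
import threading


class DrainableServer:
    """Tracks in-flight requests so the process can drain before exiting."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self.ready = True  # value the readiness probe would report

    def try_begin(self) -> bool:
        """Admit a request unless shutdown has started."""
        with self._cond:
            if not self.ready:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one request as finished and wake up a waiting shutdown."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, timeout: float) -> bool:
        """Fail readiness, then wait for in-flight work; True if fully drained."""
        with self._cond:
            self.ready = False
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)
```

If `shutdown` returns False the deadline passed with streams still open, and the caller decides whether to close them forcibly.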
6) SDK and Client Versioning
SemVer for SDK; LTS branch with extended support window (e.g. 12 months).
Policy: "at least two active minor versions"; per-client telemetry by version; automatic upgrade alerts.
Critical changes (security): a forced flag that disables old versions at the gateway after the deadline.
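Disabling old SDK versions at the gateway comes down to a version comparison. A minimal sketch, assuming clients report a plain `MAJOR.MINOR.PATCH` string (pre-release and build suffixes are not handled); both function names are hypothetical:

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_supported(client_version: str, min_version: str) -> bool:
    """True if the client is at or above the minimum version the gateway accepts."""
    return parse_semver(client_version) >= parse_semver(min_version)
```

Comparing integer tuples rather than strings matters: `2.10.0` is newer than `2.9.3`, although it sorts lower as text.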
7) Updates of protocols and network nodes
Soft-fork: extending the rules without breaking old nodes (capabilities).
Hard-fork: pre-announced window, double validation, "canary validators," protection against "reorg/rollback" conflicts, time-lock for activation.
Cross-chain updates: governance bridges transmit activation signals; in case of misalignment - local circuit-breaker.
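Version handshake and capability negotiation between nodes reduce to set intersections. The sketch below assumes each node advertises a set of integer protocol versions and a set of capability names; `negotiate` and the capability names in the test are hypothetical:

```python
from typing import Optional


def negotiate(
    local_versions: set[int],
    local_caps: set[str],
    remote_versions: set[int],
    remote_caps: set[str],
) -> Optional[tuple[int, set[str]]]:
    """Pick the highest protocol version both peers support and their shared capabilities.

    Returns None when there is no common version, in which case the
    connection should be refused rather than guessed at.
    """
    common = local_versions & remote_versions
    if not common:
        return None
    return max(common), local_caps & remote_caps
```

An old node that does not advertise a new capability simply never sees the feature used, which is what keeps a soft-fork style rollout safe for it.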
8) Configurations and secrets as data
Centralized config-service with versioning, digital signatures and rollback.
Secrets rotation without downtime: dual keys (old/new) enabled in turn; zero downtime for KMS/PKI.
Feature flags are stored separately, with an audit of every on/off change.
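Dual-key rotation works because verifiers accept both keys for a while, even though signers have already moved to the new one. A minimal sketch with HMAC-SHA256 from the standard library; the key handling is illustrative only, and in practice keys would come from the KMS rather than from code:

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Sign a message with HMAC-SHA256 and return the hex digest."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, signature: str, active_keys: list[bytes]) -> bool:
    """Accept a signature produced with any currently active key."""
    return any(
        hmac.compare_digest(sign(key, message), signature) for key in active_keys
    )
```

During rotation `active_keys` holds the new and the old key; once telemetry shows no traffic signed with the old key, it is dropped from the list.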
9) Release pipeline and automatic "gates"
Stages: build → unit → security scan → e2e/stage → shadow → canary → 100%.
Stop gates:
- Error-budget burn-rate, p95/p99 latency, error-rate, drop in the success-rate of events/payments, growth of dead-letter queues.
- Auto-rollback in case of SLO violation in any of the stages.
```yaml
release:
  strategy: canary
  steps:
    - name: shadow
      traffic_mirror: 5%
      gates: [no_data_loss, no_pii_leak]
    - name: canary_1
      traffic: 1%
      gates: ["error_rate<0.2%", "p99<400ms"]
    - name: canary_2
      traffic: 10%
      gates: [slo_ok_1h, zero_deadletters]
    - name: rollout
      traffic: 100%
      gates: [stability_6h]
    - name: bake
      duration: 24h
      action: finalize_or_rollback
```
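How a gate turns metrics into a promote-or-rollback decision can be sketched in a few lines. This is an illustration only, assuming every gate is an upper limit on a numeric metric; `evaluate_step` and the metric names are hypothetical and do not correspond to a specific deployment tool:

```python
def evaluate_step(metrics: dict[str, float], limits: dict[str, float]) -> str:
    """Return 'promote' if every metric is within its limit, else 'rollback'."""
    for name, limit in limits.items():
        value = metrics.get(name)
        # Missing data is treated as a failed gate: no evidence, no promotion.
        if value is None or value > limit:
            return "rollback"
    return "promote"
```

With limits of `{"error_rate": 0.002, "p99_ms": 400}`, a canary reporting a p99 of 450 ms is rolled back even if its error rate is fine.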
10) Observability and SLO for releases
Key SLIs:
- p95/p99 latency by endpoint; error-rate (5xx + fatal 4xx); success-rate of events; share of retries; queue lag; "relay" share in P2P; share of clients by version.
SLO targets (example):
- p99 API ≤ 400 ms; error-rate ≤ 0.2%; success-rate of events ≥ 99.5%; queue lag ≤ 2 s; rollback MTTR ≤ 15 min.
Release dashboards:
- Before/after comparison, canary graphs, dependency map (service map), burn-rate alerts 1h/6h.
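Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target): a value of 1 spends the budget exactly over the SLO period, higher values spend it faster. A minimal sketch of the calculation and a two-window alert; the function names are hypothetical and the threshold is left as a parameter because the text does not fix one:

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - slo_target)."""
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    return (errors / total) / (1.0 - slo_target)


def should_alert(rate_1h: float, rate_6h: float, threshold: float) -> bool:
    """Alert only when both the short and the long window exceed the threshold."""
    return rate_1h >= threshold and rate_6h >= threshold
```

For a 99.5% target the budget is 0.5%, so 10 failures in 1000 requests (1%) correspond to a burn rate of about 2. Requiring both windows to agree filters out short spikes that have already passed.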
11) Rollbacks and kill-switch
Auto-rollback: keep the latest "good" artifacts and configs; "1-button" rollback on the balancer (Blue←Green).
Partial rollback: a feature flag turns off the new logic while the deployed binary stays in place.
Data rollback: only for "read-paths"; for write-paths - protected migrations (never delete old columns until the end of the window).
Kill-switch: centralized flag to disable the unstable subsystem.
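The difference between a feature flag and a kill-switch is mostly precedence: the kill-switch wins regardless of any rollout state. A minimal sketch, assuming flags arrive as a plain dict from the config service; the flag names and `choose_pricing_path` are hypothetical:

```python
def choose_pricing_path(flags: dict[str, bool]) -> str:
    """Select the code path for a request based on centrally managed flags."""
    if flags.get("kill_switch.pricing_v2", False):
        return "v1"  # kill-switch overrides everything else
    if flags.get("feature.pricing_v2", False):
        return "v2"  # new logic, enabled through gradual rollout
    return "v1"      # default: the old, known-good path
```

Unknown or missing flags fall back to the old path, so an outage of the flag store degrades to known-good behaviour rather than to the new logic.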
12) Testing without downtime
Contract tests against client stubs (consumer-driven).
Schema-compat tests.
Chaos tests in staging: failure of a share of nodes/regions, degradation of DHT/TURN/KMS/DNS, "retry storm."
Load/remarket tests: canary regions and hot routes.
13) Communications and Compliance Procedures
Release notes: what changes, impact, deprecation windows/deadlines, actions for partners.
Incident response SLA: MTTA ≤ 5 min, first status update ≤ 15 min, post-mortem ≤ 72 h.
Audit trail: all config changes and releases are linked to change requests/tickets; artifacts are signed.
14) Special cases
Payment/financial flows: strict idempotency, dedup by idempotency key, outbox + CDC, "non-destructive" migrations only.
WebSocket/streams: protocol version in the handshake, reconnect with resumption (resume tokens).
Cache/edge: 'stale-while-revalidate', dual cache versions, TTL hygiene during release period.
Mobile clients: phased rollout by segment, forced update on security releases.
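For the payment flows above, dedup by idempotency key means a retried request returns the stored result instead of charging again. A minimal in-memory sketch; `PaymentProcessor` is hypothetical, and a real implementation would keep the key-to-result mapping in durable storage, written in the same transaction as the charge:

```python
class PaymentProcessor:
    """Executes each idempotency key at most once and replays the stored result."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}
        self.charges = 0  # number of charges actually executed

    def charge(self, idempotency_key: str, amount: int) -> dict:
        """Charge once per key; a retry with the same key gets the original result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1
        result = {"charge_no": self.charges, "amount": amount}
        self._results[idempotency_key] = result
        return result
```

This is what makes retries safe during a rollout: a client that timed out while a node was draining can resend the same request to another node without risk of a double charge.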
15) Zero-downtime checklist
1. Contract compatibility and the schema registry are configured.
2. Expand→Migrate→Contract is described and automated.
3. Balancer/Ingress supports blue-green and connection draining.
4. Canary-pipeline with SLO-gates and auto-rollback.
5. Feature-flags and kill-switch are available 24/7.
6. Outbox + CDC and idempotency are enabled for all write paths.
7. Release-health dashboards and burn-rate alerts are active.
8. Communications and deprecate policy announced to partners in advance.
9. Weekly rollback rehearsal; quarterly chaos-day.
16) Glossary
Progressive delivery - phased release of features with risk control.
Schema registry - a repository of schema versions with compatibility policies.
Outbox/CDC - a pattern for guaranteed publication of events from transactions.
Blue-Green - parallel stacks with atomic traffic switching.
Canary - gradually increasing the share of traffic on the new version.
Graceful shutdown/drainage - correct termination of active connections.
Bottom line: zero downtime is not a single trick but a system: contracts, schema compatibility, phased release strategies, observability, safe migrations and guaranteed rollbacks. With this framework in place, the ecosystem is updated quickly, predictably and painlessly for users and partners.