GambleHub

Zero-Downtime deployments

(Section: Architecture and Protocols)

1) What is Zero-Downtime and why it is needed

Zero-Downtime (ZDT) is a way to release new versions of an application without the service being unavailable to users and without losing requests. Objectives:
  • Zero downtime for customers and integrations.
  • Predictable releases, fast rollbacks, and manageable risk.
  • Keeping SLOs/SLIs (latency, errors, availability) within the agreed limits.

The key to ZDT is not a single "magic" technique but a combination of delivery patterns, data compatibility, and careful traffic routing.

2) Basic Zero-Downtime principles

1. Version compatibility: new and old versions must handle traffic and data correctly at the same time.
2. Idempotency of operations: reprocessing must not corrupt state (a minimal sketch follows after this list).
3. Graceful shutdown and connection draining.
4. Staged health checks: readiness/liveness probes, health endpoints.
5. Rollback as a first-class citizen: rolling back must be easier and faster than hotfixing.
6. Observability by design: release markers, unified dashboards, SLO alerts.
7. Automation: release and rollback procedures are code, not manual instructions.
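
To make principle 2 concrete, here is a minimal sketch of an idempotent operation keyed by a client-supplied idempotency key; the tables, amounts, and function name are illustrative assumptions, not part of any real GambleHub schema:

```python
import sqlite3

# Minimal sketch of an idempotent handler: the idempotency key is recorded in
# the same transaction as the state change, so a retry or redelivered request
# cannot apply the effect twice.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
db.execute("INSERT INTO balances VALUES ('acc-1', 100)")

def credit(account: str, amount: int, idempotency_key: str) -> str:
    try:
        with db:  # one transaction: key insert + balance update succeed or fail together
            db.execute("INSERT INTO processed VALUES (?)", (idempotency_key,))
            db.execute("UPDATE balances SET amount = amount + ? WHERE account = ?",
                       (amount, account))
        return "applied"
    except sqlite3.IntegrityError:
        # Duplicate key: this request was already processed, so do nothing.
        return "duplicate, already applied"

print(credit("acc-1", 50, "req-42"))   # applied
print(credit("acc-1", 50, "req-42"))   # duplicate, already applied
```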

3) Downtime-free delivery patterns

3.1 Rolling Update

Gradually take part of the old-version instances out of traffic, update them to the new version, and return them to the pool.

Pros: economical on infrastructure, simple in k8s/ASG.
Cons: for some time the cluster runs two versions at once (version skew).

3.2 Blue-Green

Two full production environments: active (Blue) and candidate (Green). Traffic is switched with an atomic flip.

Pros: instant rollback, clean isolation.
Cons: higher infrastructure costs, harder with stateful services.

3.3 Canary / Progressive rollout

Route a small share of traffic (1-5-10-25-50-100%) to the new version, with metric-based gates between steps.

Pros: minimal blast radius, data-driven decisions.
Cons: requires mature observability and fine-grained routing.
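
As a toy illustration of how a canary step maps to traffic weights (in real deployments the ingress or service mesh applies the weight per request), assuming nothing beyond the percentages listed above:

```python
import random

# Toy weighted router: a progressive-rollout step is just the weight given to
# the canary version; a mesh/ingress does the same decision per request.
def pick_version(canary_weight_percent: int) -> str:
    return "canary" if random.random() * 100 < canary_weight_percent else "stable"

# Simulate a 5% canary step over 10,000 requests.
hits = sum(pick_version(5) == "canary" for _ in range(10_000))
print(f"canary received ~{hits / 100:.1f}% of traffic")
```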

3.4 Shadow traffic / Dark launch

Mirror real requests to the new version (without returning its responses to users), or launch it hidden to collect metrics.

Pros: early identification of problems.
Cons: double load on dependencies; side effects must be controlled.

4) Traffic and connection management

4.1 Readiness/Liveness

A failing liveness probe tells the orchestrator "restart me."

A failing readiness probe says "do not send me traffic, I am not ready yet."

Do not release without correct readiness logic and timeouts.
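
A minimal sketch of separate liveness and readiness endpoints using only the Python standard library; the paths `/livez` and `/readyz`, the port, and the warm-up flag are illustrative assumptions:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sketch: liveness answers "the process is alive", readiness answers "I can
# take traffic". Readiness should turn true only after warm-up and false
# again during shutdown, so the balancer stops routing before the exit.
warmed_up = False  # set to True once caches/connections are initialized

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/livez":
            self.send_response(200)                        # alive: do not restart me
        elif self.path == "/readyz":
            self.send_response(200 if warmed_up else 503)  # ready / not ready
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    warmed_up = True  # pretend warm-up has finished
    HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```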

4.2 Connection draining

Before removing an instance from the pool:
  • stop accepting new connections,
  • wait for active connections to complete,
  • terminate hung connections once the timeout expires.
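
A minimal asyncio sketch of this draining sequence (stop accepting, wait for in-flight requests, then force the stragglers); the port, grace period, and Unix-only SIGTERM handling are illustrative assumptions:

```python
import asyncio, signal

# Sketch of the draining sequence from the list above. The grace period
# should be aligned with the balancer's deregistration delay.
GRACE_PERIOD = 30  # seconds (illustrative)
inflight: set[asyncio.Task] = set()

async def handle(reader, writer):
    task = asyncio.current_task()
    inflight.add(task)
    try:
        await reader.read(1024)  # "work" on the request
        writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        await writer.drain()
    finally:
        writer.close()
        inflight.discard(task)

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 8080)
    stop = asyncio.Event()
    # Unix-only: translate SIGTERM from the orchestrator into "start draining".
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)
    await stop.wait()

    server.close()                              # 1) stop accepting new connections
    await server.wait_closed()
    if inflight:                                # 2) wait for active requests
        done, pending = await asyncio.wait(inflight, timeout=GRACE_PERIOD)
        for task in pending:                    # 3) cancel the ones that hung
            task.cancel()

asyncio.run(main())
```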

4.3 Sticky sessions and L7 routing

Sticky sessions are useful for stateful scenarios but complicate load balancing.
L7 rules (path, header, cookie, API version) are convenient for canary/ring rollouts.

4.4 Long-lived connections

WebSocket/gRPC streaming: enable drain mode and send a GOAWAY signal before updating.
Plan windows for rebalancing streams and for client reconnect backoff.

5) Data compatibility and database migration

5.1 Expand-Migrate-Contract

1. Expand: add new columns/indexes/tables without breaking the old version.
2. Migrate: transfer data in the background, idempotently (batches, checkpoints); see the sketch below.
3. Contract: remove the old structures only after stabilization.
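
A minimal sketch of the Migrate step: backfilling a new column in small batches with a checkpoint, so the job can be interrupted, resumed, and safely re-run; the schema and column names are illustrative assumptions:

```python
import sqlite3

# Backfill a new column in small batches keyed by primary key. Re-running is
# idempotent because already filled rows are skipped, and each batch is a
# short transaction, so no long-held locks.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, "
           "last_name TEXT, display_name TEXT)")  # display_name added in Expand
db.executemany("INSERT INTO users (first_name, last_name) VALUES (?, ?)",
               [("Ada", "Lovelace"), ("Alan", "Turing")])
BATCH = 1000

last_id = 0
while True:
    with db:  # each batch is its own transaction
        rows = db.execute(
            "SELECT id, first_name, last_name FROM users "
            "WHERE id > ? AND display_name IS NULL ORDER BY id LIMIT ?",
            (last_id, BATCH)).fetchall()
        if not rows:
            break
        for row_id, first, last in rows:
            db.execute("UPDATE users SET display_name = ? WHERE id = ?",
                       (f"{first} {last}", row_id))
        last_id = rows[-1][0]  # checkpoint: resume after this id

print(db.execute("SELECT display_name FROM users").fetchall())
```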

5.2 Practices

Avoid exclusive DDL locks in the release window.
Version API and event contracts (schema registry, CDC).
For heavy migrations: online tools, replicas, phased cutover.
Dual-write only with deduplication and idempotent consumers.
Outbox/Inbox for reliable integration through queues.
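
A minimal sketch of the Outbox idea mentioned above: the state change and the event describing it are committed in one local transaction, and a relay later publishes unpublished rows to the queue; the tables, event shape, and `publish` hook are illustrative assumptions:

```python
import json, sqlite3, uuid

# Transactional outbox: the business row and the outgoing event are written
# atomically, so neither can be lost alone. Delivery to the broker is
# at-least-once, so consumers must deduplicate by event id.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, amount INTEGER)")
db.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)")

def record_payment(amount: int):
    payment_id = str(uuid.uuid4())
    event = {"type": "payment.recorded", "payment_id": payment_id, "amount": amount}
    with db:  # one transaction: both rows or neither
        db.execute("INSERT INTO payments VALUES (?, ?)", (payment_id, amount))
        db.execute("INSERT INTO outbox (id, payload) VALUES (?, ?)",
                   (str(uuid.uuid4()), json.dumps(event)))

def relay_once(publish):
    # Called periodically by a relay process; `publish` sends to the queue.
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))

record_payment(250)
relay_once(print)  # stand-in for a real broker client
```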

6) Caches, sessions and background jobs

Keep sessions and cache external (Redis/Memcached) so that instances of different versions are interchangeable.
Warm up caches, JITs, and temporary indexes before adding an instance to the pool.
Split background queues by version, or use leader election to avoid races.
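
A minimal sketch of keeping sessions outside the instance, assuming a reachable Redis server and the third-party `redis` client; the key prefix and TTL are illustrative:

```python
import json
import uuid

import redis  # third-party client: pip install redis

# Any replica of any version can read a session created by another one,
# because nothing is kept in instance memory. Host, key prefix and TTL are
# illustrative assumptions.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL = 30 * 60  # seconds

def create_session(user_id: str) -> str:
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```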

7) Observability and SLO gates

Golden signals: p95/p99 latency, error rate, RPS, saturation, queue lag.
Business SLAs: authorizations, conversions, successful payments, drop-offs by funnel step.
Gates: the rollout is promoted only if canary metrics stay within baseline plus the allowed degradation thresholds, and the error budget is not burning down.
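
A minimal sketch of such a gate: the canary is promoted only when its error rate and p95 latency stay within baseline plus the allowed degradation; the thresholds and metric names are illustrative, not real SLOs:

```python
# SLO gate for a canary step: promote only when the canary's error rate and
# p95 latency stay within baseline plus an allowed degradation.
ALLOWED_ERROR_RATE_DELTA = 0.005   # +0.5 percentage points of errors (illustrative)
ALLOWED_P95_LATENCY_RATIO = 1.10   # canary p95 may be at most 10% slower (illustrative)

def gate(baseline: dict, canary: dict) -> bool:
    if canary["error_rate"] > baseline["error_rate"] + ALLOWED_ERROR_RATE_DELTA:
        return False
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * ALLOWED_P95_LATENCY_RATIO:
        return False
    return True

baseline = {"error_rate": 0.002, "p95_latency_ms": 180}
canary = {"error_rate": 0.003, "p95_latency_ms": 190}
print("promote" if gate(baseline, canary) else "hold / roll back")
```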

8) Safe completion and rollback

Rollback is the same pipeline run in the opposite direction: scripted commands, not manual improvisation.

For blue-green, flip back; for canary, drop the weight to 0% or return to the previous stable step.
For data: compensating transactions, reprocessing, event deduplication.

9) Zero-Downtime checklists

Before release

  • A single immutable, signed artifact is built, with SBOM and dependency checks.
  • Readiness/liveness implemented and tested.
  • Migration plan in expand mode, reversibility confirmed.
  • Dashboards and alerts for the new version are ready, release markers are emitted.
  • Rollback rehearsed on staging/pre-prod.

At the time of release

  • Connection draining is enabled, timeouts are adequate.
  • Traffic is shifted via canary/ring or an atomic flip (blue-green).
  • Metrics are compared to baseline, gate thresholds are met.

After release

  • Post-release monitoring for N hours, no incidents.
  • Contract migrations completed, temporary flags/routes removed.
  • Retrospective, playbook update.

10) Anti-patterns

Recreate deployments without draining and readiness checks ⇒ dropped requests.
Unprepared DDL ⇒ locks and timeouts at peak time.
Mixing incompatible schemas between service versions.
Lack of idempotency in handlers and workers.
"Rolling out by feel" without gates or comparison against the baseline.
Long DNS TTLs with blue-green, so the flip drags on for hours.
Local sessions/cache in instance memory with rolling/canary.

11) Implementation scenarios

11.1 Kubernetes (rolling + canary)

Deployment with `maxUnavailable=0`, `maxSurge=25%`.
Readiness waits for warm-up (cache initialization, minor migrations).
Service mesh/Ingress with weighted routing (1-5-10-25-50-100%).
Alerts: p95, 5xx, queue lag, business funnel.

11.2 Blue-Green in the cloud

Two stacks behind the load balancer: `blue.example.com` and `green.example.com`.
Warm up Green, run smoke/regression tests, then swap the listener/route (or switch DNS with a low TTL).
If problems appear, flip back instantly.

11.3 Stateful services

Data replicas + online migrations; dual reads with validation.
Background jobs are separated by version via leader election or split queues.
Sessions/cache live outside the instance; sticky sessions are enabled only temporarily.

12) Feature flags and client applications

New features are activated by flags (segments: employees → beta → all).
For mobile/desktop clients, consider protocol compatibility boundaries and a degradation policy for legacy clients (server-side fallback).

13) Performance and cost

Rolling is cheaper, but requires careful compatibility.
Blue-Green is more expensive at the time of release, but the rollback is instant.
Canary balances risk and cost, but requires strong observability.
Save costs with ephemeral preview environments and automatic cleanup of test environments.

14) Minimum reference pipeline ZDT

1. Build: single artifact, signature, SBOM.
2. Test: unit/integration/contract + security.
3. Staging: smoke, load, expand migrations, rollback check.
4. Prod: shadow → canary (gates) or blue-green flip.
5. Post-deploy: monitoring, contract cleanup, retro.

15) Brief summary

Zero-Downtime is a discipline: compatible versions + correct routing + managed migrations + observability and fast rollback. Choose a pattern to fit the context (rolling, blue-green, canary), automate SLO gates, keep operations idempotent - and releases stop being an event and become a reliable, routine process.
