High Availability and SLA

1) Terms and connection with business

SLI (Service Level Indicator): a measured service indicator (for example, the proportion of requests that return 2xx/3xx within T ms).
SLO (Service Level Objective): the target value of an SLI (e.g. "99.95% of requests complete within 300 ms").
SLA (Service Level Agreement): a contractual obligation to the client (penalties/credits in case of violation).
HA (High Availability): the architectural and operational measures that allow you to meet the SLO/SLA.

Principle: SLA relies on SLO and SLO relies on observed SLIs. You can't promise in SLA what you don't measure.

2) "Nines" and availability math

Availability per period = `uptime / total_time`. Reference values (per year):

Availability	Max. downtime/year
99.0%	≈ 3 d 15 h
99.5%	≈ 1 d 20 h
99.9%	≈ 8 h 45 m
99.95%	≈ 4 h 23 m
99.99%	≈ 52 m 34 s
99.999%	≈ 5 m 15 s
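
The table values follow directly from the availability formula. A minimal sketch, assuming a 365-day year; the function name `max_downtime` is illustrative and not part of the original text:

```python
from datetime import timedelta


def max_downtime(availability_pct: float,
                 period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime an availability target allows over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    # Allowed downtime is the unavailable fraction of the whole period.
    return period * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {max_downtime(target)}")
```

Passing `period=timedelta(days=30)` gives the monthly figure, which is usually the window that matters for an SLA.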

Composition of availability

Serial chain ("red path" dependencies): `A_total = Π A_i` (each component lowers the total).
Parallel active nodes: `A_total = 1 − Π (1 − A_i)` (redundancy raises the total).
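
The two composition rules can be sketched as follows. Both formulas assume component failures are independent, which real systems only approximate; the function names `serial` and `parallel` are illustrative:

```python
from math import prod
from typing import Iterable


def serial(availabilities: Iterable[float]) -> float:
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def parallel(availabilities: Iterable[float]) -> float:
    """Availability of redundant components where any one being up suffices."""
    return 1 - prod(1 - a for a in availabilities)


# Three dependencies at 99.9% each in series give less than 99.9% overall.
print(f"serial:   {serial([0.999, 0.999, 0.999]):.6f}")
# Two redundant replicas at 99% each give more than either alone.
print(f"parallel: {parallel([0.99, 0.99]):.6f}")
```

This is why a long chain of "three nines" services cannot itself promise three nines without redundancy somewhere on the path.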

3) What exactly to measure (correct SLI)

User view: successful completion of key operations (login, deposit, check-out) and their p99 latency.
Time window: aggregate over sliding windows (5/30/60 min) and per region.
Exclusions: scheduled maintenance windows are accounted for in SLOs; in SLAs, only if the contract says so.

SLI types:

Availability: share of successful requests completed within T.
Quality: p95/p99 latency.
Composite: "share of successful deposits completed within 5 s."
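
An availability SLI of this kind can be computed from raw request records. A minimal sketch, assuming each record carries an HTTP status and a latency; the `Request` type, the 500 ms default threshold, and the function names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def is_good(req: Request, threshold_ms: float) -> bool:
    """A request is good if it returned 2xx/3xx within the threshold."""
    return 200 <= req.status < 400 and req.latency_ms <= threshold_ms


def availability_sli(requests: Iterable[Request],
                     threshold_ms: float = 500.0) -> Optional[float]:
    """Fraction of good requests, or None when there was no traffic."""
    requests = list(requests)
    if not requests:
        return None  # the SLI is undefined without traffic
    good = sum(is_good(r, threshold_ms) for r in requests)
    return good / len(requests)


sample = [Request(200, 120), Request(302, 80), Request(200, 900), Request(500, 50)]
print(availability_sli(sample))
```

Note that the slow 200 response counts as bad: a success that misses the latency threshold does not count toward the SLI.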

4) Error Budget and burn rate

Error Budget = `1 − SLO`. For a 99.95% SLO, a monthly window allows 0.05% errors/downtime (about 21.6 minutes in a 30-day month).
Burn rate: how fast the budget is consumed (e.g. 4× means a day's budget is used up in 6 hours).
Policy: on fast burn, stop releases, focus on stabilization, and declare a feature freeze.
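
The budget arithmetic above can be sketched as follows. The SLO is passed as a fraction (0.9995 for 99.95%), and all function names are illustrative assumptions rather than an established API:

```python
def error_budget(slo: float) -> float:
    """Fraction of events (or time) allowed to fail, e.g. ~0.0005 for 99.95%."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime in the window."""
    return error_budget(slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / error_budget(slo)


def hours_until_exhausted(rate: float, window_hours: float) -> float:
    """Time until a window's budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


print(f"{budget_minutes(0.9995):.1f} min of budget per 30 days")
print(f"burn rate at 0.2% errors: {burn_rate(0.002, 0.9995):.1f}x")
print(f"daily budget lasts {hours_until_exhausted(4, 24):.0f} h at 4x")
```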

5) HA Architecture: from Node to Region

5.1 Node/Service

N + 1: at least one redundant replica (Deployment ≥ 2, PDB, anti-affinity).
Resource isolation: CPU/RAM/IO limits, priorities (PriorityClass).
Graceful shutdown/drain: no dropped requests on restart.

5.2 Zone/Region

Multi-AZ: replicas in different zones, cross-zone balancing, independent power/network.
Multi-region: active-active (harder: data/consistency) or active-passive (simpler, but higher RPO).
Data: CP for money/orders (quorum/Raft), EC/AP for caches/storefronts.

5.3 Network layer and perimeter

L7 LB with health checks, retry/timeout/circuit-breaking.
GSLB/DNS/Anycast for global traffic, short TTL.
Egress control and fault-tolerant channels to external PSPs/providers.

6) Degradation instead of failure

Feature kill-switch (feature flags): turn off non-critical features, preserve the "red path."
Switching to simplified paths: synchronous → asynchronous/queue, "accepted for processing."
Rate limits/quotas: it is better to limit traffic than to fail for everyone.
Stale modes: serve cached/static data when the origin is unavailable.
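
The stale mode from the list above can be sketched as a wrapper that remembers the last good response. This is an assumption-laden illustration: `StaleCache`, its `fetch` callable, and the `max_stale_s` limit are hypothetical names, and a production version would also need eviction and thread safety:

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleCache:
    """Serve the last good value when the origin fails (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], Any], max_stale_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # calls the origin; raises on failure
        self._max_stale_s = max_stale_s  # how old a fallback value may be
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stored_at)

    def get(self, key: str) -> Tuple[Any, bool]:
        """Return (value, is_stale); re-raise if no usable fallback exists."""
        try:
            value = self._fetch(key)
        except Exception:
            entry = self._entries.get(key)
            if entry is None or self._clock() - entry[1] > self._max_stale_s:
                raise  # nothing fresh enough to fall back to
            return entry[0], True
        self._entries[key] = (value, self._clock())
        return value, False
```

Returning the `is_stale` flag lets the caller mark the response (for example in a header or in the UI) instead of silently presenting old data as current.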

7) Dependency management

Service map: direct/transitive dependencies, criticality, and the SLO of each.
Vulnerable links: an external provider without an SLA gets wrapped in a cache/queue/backup provider.
Bulkhead isolation: separate connection pools/quotas for slow routes.
Timeouts > retries: short timeouts, at most 1 retry, and only for idempotent operations.
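
The timeout and retry rule can be sketched as a small wrapper. The `send` callable is a hypothetical stand-in for a real client call that accepts a timeout and raises `TimeoutError` or `ConnectionError` on failure; the default values are assumptions, not recommendations from the original:

```python
import time
from typing import Any, Callable

# HTTP methods that are safe to repeat without side effects piling up.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}


def call_with_retry(send: Callable[[float], Any], method: str,
                    timeout_s: float = 0.5, max_retries: int = 1,
                    backoff_s: float = 0.05,
                    sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call send(timeout_s); retry only idempotent methods, at most max_retries times."""
    retries = max_retries if method.upper() in IDEMPOTENT_METHODS else 0
    attempts = 1 + retries
    for attempt in range(attempts):
        try:
            return send(timeout_s)
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_s)  # brief pause so retries do not arrive in a burst
```

Keeping the retry count at one bounds the extra load a struggling dependency receives, which is what prevents retries from amplifying an outage.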

8) Operations and changes

Change management: releases via canaries/blue-green, SLO gates, automatic rollback.
Scheduled windows: standardize length, frequency, and communications.
Incidents: roles (IC/Comms/Tech/DB), runbooks, and post-mortems with corrective actions.
Security events: if compromised, switch to "panic mode" (read-only/token rotation/blocking).

9) Observability and alerting

RED model (Rate, Errors, Duration) for each route.
SLI dashboards: availability/latency by region and by customer segment.
Burn-rate alerts: fast (1 h, 14.4×) and slow (6 h, 2×), to signal before the SLO is breached.
Exemplars: links from metrics to traces by trace_id.
Synthetics: probes from external vantage points (perimeter, payment flow).

10) Fault tolerance tests

Game-days: scenarios for disabling AZ/regions, database/cache degradation, failure of external providers.
Chaos tools: network faults (latency/loss), pod kills, CPU/IO overload.
DR drills: rehearsing RTO/RPO for Tier-0 systems (see "Backups and DR").

11) SLA Design

Definition of "availability": what counts as a failure (5xx, time > T, domain errors).
Calculation window: month/quarter; inclusion/exclusion of planned activities.
Credits/penalties: scale (e.g. 99.9–99.99%: X%, lower: Y%).
Client responsibilities: correct integration, retries within reasonable limits, respecting rate limits.
Notifications and claims procedure: deadlines, format, evidence (logs/metrics).
Force majeure: legal wording and boundaries.

Example (sketch):

"API availability, measured by the SLI "successful response within 500 ms", is at least 99.95% per calendar month. Scheduled windows (up to 60 min/month, announced 48 hours in advance) are excluded. At 99.90–99.95% the credit is 5%; at 99.80–99.90%, 10%; below 99.80%, 25%."
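
The credit scale in the example clause maps directly onto a lookup. A minimal sketch; the clause leaves the band edges ambiguous, so this code assumes each band includes its lower bound, and the function name `sla_credit_pct` is illustrative:

```python
# (lower bound of the band in percent, credit in percent), highest band first.
CREDIT_TIERS = ((99.95, 0), (99.90, 5), (99.80, 10))
FLOOR_CREDIT = 25  # applies below the lowest band


def sla_credit_pct(availability_pct: float) -> int:
    """Return the credit owed for a measured monthly availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    for lower_bound, credit in CREDIT_TIERS:
        if availability_pct >= lower_bound:  # assumption: lower bound inclusive
            return credit
    return FLOOR_CREDIT


for measured in (99.97, 99.93, 99.85, 99.50):
    print(f"{measured}% -> credit {sla_credit_pct(measured)}%")
```

A real contract should state the boundary handling explicitly, since a measured 99.90% falls on the edge of two bands.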

12) The economics of nines

Each additional "nine" raises costs non-linearly (a second region, quorums, duplicate providers, 24×7 coverage). Use tiered SLOs:

Tier-0 (money/orders): 99.95–99.99%, multi-AZ, DR ready.
Tier-1 (basic features): 99.9–99.95%, multi-AZ.
Tier-2 (non-critical): 99.5–99.9%, degradation/shutdown is allowed during incidents.

13) HA patterns by layer

Perimeter: CDN/edge, multi-CDN or GSLB, WAF, rate-limit.
Balancing: L7 with outlier ejection, timeouts/retries, sticky/consistent-hash.
Applications: horizontal scale, readiness/liveness, PDB, topology spread.
Data: leader + replicas, quorum for CP, L2 cache, idempotency, PITR.
Queues: mirroring/multicluster, dedup, DLQ.
Secrets/configs: GitOps, atomic snapshots, rollback.

14) Anti-patterns

SLA without measurement tooling and external synthetics.
Single zone/cluster as SPOF.
Uncontrolled retries → "self-DDoS."
Long transactions/mutexes on the hot path.
"Heavy" migrations/releases without canaries and a rollback plan.
Missing runbooks and no stakeholder communication during an incident.

15) Implementation checklist (0-60 days)

0-15 days

Define critical user SLIs, set SLOs by Tier-0/1/2 levels.
Enable burn-rate alerts, SLO dashboards, synthetic perimeter checks.
Remove SPOFs: ≥ 2 replicas, PDB, multi-AZ for frontends and critical databases.

16-40 days

Introduce canary releases with SLO gating and auto-rollback.
Build a dependency map plus quotas/pools/timeouts/PB for each "red path."
Define a policy for scheduled windows and communications; prepare incident message templates.

41-60 days

Game-day: AZ outage, failure of an external provider, traffic burst.
Recalculate SLAs and actual credits; publish reports to customers.
Review the "cost ↔ nines" trade-off and reassign services across tiers.

16) Maturity metrics

≥ 95% of critical routes have SLI/SLO and burn-rate alerts.
SLO violations trigger an automatic release freeze (policy).
Multi-AZ coverage of Tier-0 = 100%; successful DR drills ≥ 1 per quarter.
"Detection → mitigation" time: p50 < 5 min, p95 < 15 min.
"Release ↔ incidents" correlation is tracked and decreasing (rollback rate ↓).
Public Incident/Credit Report - within N business days.
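
The "detection → mitigation" target can be checked against incident records. A minimal sketch using the nearest-rank percentile definition (other definitions interpolate and give slightly different values); the function names and the sample durations are made up for illustration:

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value covering p% of the samples."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_mitigation_target(minutes: Sequence[float],
                            p50_limit: float = 5.0,
                            p95_limit: float = 15.0) -> bool:
    """True if both percentiles are strictly under their limits."""
    return percentile(minutes, 50) < p50_limit and percentile(minutes, 95) < p95_limit


# Hypothetical detection-to-mitigation durations (minutes) for ten incidents.
durations = [2, 3, 4, 4, 5, 6, 7, 9, 12, 20]
print(percentile(durations, 50), percentile(durations, 95))
print(meets_mitigation_target(durations))
```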

17) Examples and snippets

Burn-rate alerts (rule idea):

Fast: "SLO 99.95%, window 1 h, burn ≥ 14.4× → page on-call."
Slow: "window 6 h, burn ≥ 2× → ticket & monitoring."
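
The rule idea can be sketched as an evaluation function. For context, a sustained 14.4× burn over 1 hour consumes 2% of a 30-day budget (14.4 / 720). The `BurnRule` type and the `evaluate` function are illustrative assumptions; in practice such rules normally live in the monitoring system rather than in application code:

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class BurnRule:
    name: str
    window_hours: int
    threshold: float  # burn-rate multiple that triggers the rule
    action: str


RULES = (
    BurnRule("fast", 1, 14.4, "page on-call"),
    BurnRule("slow", 6, 2.0, "ticket & monitoring"),
)


def evaluate(error_ratio_by_window: Dict[int, float], slo: float = 0.9995,
             rules: Sequence[BurnRule] = RULES) -> List[Tuple[str, str]]:
    """Return (rule name, action) for every rule whose burn threshold is met.

    error_ratio_by_window maps a window length in hours to the observed
    fraction of bad requests in that window.
    """
    budget = 1.0 - slo
    fired = []
    for rule in rules:
        ratio = error_ratio_by_window.get(rule.window_hours)
        if ratio is not None and ratio / budget >= rule.threshold:
            fired.append((rule.name, rule.action))
    return fired


print(evaluate({1: 0.01, 6: 0.002}))
```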

Envoy, circuit breaking and outlier detection (cluster-level settings):

```yaml
circuit_breakers:
  thresholds:
  - max_connections: 200
    max_pending_requests: 100
    max_requests: 1000
    max_retries: 1
outlier_detection:
  consecutive_5xx: 5
  interval: 5s
  base_ejection_time: 30s
  max_ejection_percent: 50
```

Canary with SLO analysis (Argo Rollouts, idea):

```yaml
# Schematic only: a real AnalysisTemplate also needs the Prometheus address and query.
analysis:
  templates:
  - name: slo-burn
    metrics:
    - name: error-rate
      successCondition: result < 0.005
      provider: prometheus
```

SLI formulation example:


SLI: fraction_of_good_requests = good(HTTP 2xx/3xx ≤ 500 ms) / all(requests)
SLO: ≥ 99.95% per calendar month, per region

18) Conclusion

High Availability is not only clusters and replicas, but a consistent set of architecture, processes and metrics: clear SLIs/SLOs, a realistic SLA, the economics of nines, degradation instead of failure, timeout/quota discipline, canary releases, regular exercises and transparent communication. Make availability measurable and manageable, and it becomes a competitive advantage, not a lottery.
