High Availability and SLA
1) Terms and their link to the business
SLI (Service Level Indicator) - a measured indicator of service behavior (for example, the share of requests that return 2xx/3xx within T ms).
SLO (Service Level Objective) - the target value for an SLI (e.g. "99.95% of requests complete within 300 ms").
SLA (Service Level Agreement) - contractual obligation to the client (fines/credits in case of violation).
HA (High Availability) - architectural and operational measures that make it possible to meet the SLO/SLA.
Principle: SLA relies on SLO and SLO relies on observed SLIs. You can't promise in SLA what you don't measure.
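2) "Nines" and availability math
Availability over a period = `uptime / total_time`. Benchmarks (allowed downtime per 365-day year):
- 99%: about 3.65 days
- 99.9%: about 8.76 hours
- 99.95%: about 4.38 hours
- 99.99%: about 52.6 minutes
- 99.999%: about 5.26 minutes
Composition of availability:
Serial chain (dependencies on the critical, or "red", path): `A_total = Π A_i` (every component lowers the total).
Parallel active nodes: `A_total = 1 − Π (1 − A_i)` (redundancy raises the total).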
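The two composition formulas can be checked numerically. The sketch below is illustrative only: the function names are made up for this example, and it assumes component failures are independent, which real systems often violate (shared power, shared deploys).

```python
from math import prod


def serial_availability(parts):
    """Availability of a chain in which every component must be up."""
    return prod(parts)


def parallel_availability(parts):
    """Availability of redundant nodes where any one being up is enough."""
    return 1 - prod(1 - a for a in parts)


def downtime_hours_per_year(availability, hours_per_year=365 * 24):
    """Allowed downtime per non-leap year for a given availability."""
    return (1 - availability) * hours_per_year


# Three 99.9% services in series are worse than any single one of them.
chain = serial_availability([0.999, 0.999, 0.999])

# Two 99% replicas in parallel behave like a single 99.99% node.
pair = parallel_availability([0.99, 0.99])
```

Three "three nines" dependencies in a row already drop the path to roughly 99.7%, which is why shortening the critical path often pays off more than hardening a single component.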
3) What exactly to measure (choosing the right SLIs)
User view: successful completion of key operations (login, deposit, check-out) and their p99 latency.
Time windows: aggregate over sliding windows (5/30/60 min) and by region.
Exclusions: scheduled maintenance windows count toward the SLO; they count toward the SLA only if the contract says so.
- Availability: share of successful responses delivered within T.
- Quality: p95/p99 latency.
- Composite: "share of successful deposits completed within 5 s."
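As a rough illustration of the availability and quality indicators above, the following sketch computes both from a batch of request records. The `Request` type, the function names, and the sample data are assumptions for this example; a production system would normally derive these values from metrics rather than raw records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to respond


def availability_sli(requests, threshold_ms):
    """Share of requests that returned 2xx/3xx within threshold_ms."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 100%
    good = sum(
        1 for r in requests
        if 200 <= r.status < 400 and r.latency_ms <= threshold_ms
    )
    return good / len(requests)


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


sample = [
    Request(200, 120), Request(200, 480), Request(302, 90),
    Request(500, 30), Request(200, 900),
]
sli = availability_sli(sample, threshold_ms=500)  # 3 good requests out of 5
p99 = percentile([r.latency_ms for r in sample], 99)
```

Note that the slow 200 response counts as "bad": a request that succeeds too late is a failure from the user's point of view.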
4) Error Budget and burn rate
Error budget = `1 − SLO`. With a 99.95% SLO, a monthly window allows 0.05% errors or downtime (about 21.6 minutes in a 30-day month).
Burn rate: how fast the budget is being consumed relative to the allowed pace (e.g. 4× means a full day's budget is gone in 6 hours).
Policy: when the budget burns fast, stop releases and focus on stabilization (feature freeze).
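The arithmetic behind the budget and the burn rate can be sketched as follows. The helper names are illustrative, and the budget is expressed as minutes of full outage, which is only one of several ways to account for it.

```python
def error_budget(slo):
    """Allowed fraction of bad events for an SLO such as 0.9995."""
    return 1 - slo


def budget_minutes(slo, window_days=30):
    """Error budget expressed as minutes of full outage in the window."""
    return error_budget(slo) * window_days * 24 * 60


def burn_rate(observed_error_rate, slo):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_rate / error_budget(slo)


def hours_to_exhaust(rate, window_hours):
    """Time until a budget sized for window_hours is spent at this burn rate."""
    return window_hours / rate


minutes = budget_minutes(0.9995)                # about 21.6 minutes per 30 days
rate = burn_rate(0.002, 0.9995)                 # 0.2% errors is about a 4x burn
daily_budget_gone_in = hours_to_exhaust(4, 24)  # 6.0 hours
```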
5) HA architecture: from node to region
5.1 Node/Service
N+1: at least one redundant replica (Deployment with ≥ 2 replicas, PDB, anti-affinity).
Resource isolation: CPU/RAM/IO limits, priorities (PriorityClass).
Graceful shutdown/drain: no dropped requests on restart.
5.2 Zone/Region
Multi-AZ: replicas in different zones, cross-zone balancing, independent power/network.
Multi-region: active-active (harder: data replication and consistency) or active-passive (simpler, but with a higher RPO).
Data: CP (quorum/Raft) for money and orders; eventually consistent/AP for caches and storefronts.
5.3 Network layer and perimeter
L7 load balancer with health checks, retries/timeouts, and circuit breaking.
GSLB/DNS/Anycast for global traffic, short TTL.
Egress control and redundant links to external PSPs/providers.
6) Degradation instead of failure
Feature kill switch (feature flags): turn off non-critical features to protect the critical ("red") path.
Fall back to simplified paths: synchronous → asynchronous via a queue, replying "accepted for processing."
Rate limits/quotas: throttling some traffic is better than failing for everyone.
Stale mode: serve cached or static data when the origin is unavailable.
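The stale mode described above can be sketched as a small read-through cache. Everything here (`StaleCache`, `fetch`, `ttl_s`, `max_stale_s`) is a hypothetical name chosen for illustration; it is a single-threaded sketch, not a drop-in component.

```python
import time


class StaleCache:
    """Serve fresh data when possible and stale data when the origin fails."""

    def __init__(self, fetch, ttl_s=30.0, max_stale_s=600.0, clock=time.monotonic):
        self._fetch = fetch              # callable(key) -> value, may raise
        self._ttl_s = ttl_s              # how long an entry counts as fresh
        self._max_stale_s = max_stale_s  # oldest entry we are willing to serve
        self._clock = clock
        self._store = {}                 # key -> (value, stored_at)

    def get(self, key):
        now = self._clock()
        cached = self._store.get(key)
        if cached and now - cached[1] <= self._ttl_s:
            return cached[0], "fresh"
        try:
            value = self._fetch(key)
        except Exception:
            # Origin is down: degrade to stale data if it is not too old.
            if cached and now - cached[1] <= self._max_stale_s:
                return cached[0], "stale"
            raise
        self._store[key] = (value, now)
        return value, "origin"
```

Returning the freshness label alongside the value lets the caller decide whether to show a "data may be outdated" hint, which keeps the degradation visible instead of silent.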
7) Dependency management
Service map: direct and transitive dependencies, their criticality, and the SLO of each.
Weak links: an external provider without an SLA gets wrapped in a cache, a queue, or a backup provider.
Bulkhead isolation: separate connection pools/quotas for slow routes.
Timeouts over retries: short timeouts, at most 1 retry, and only for idempotent operations.
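The timeout and retry rule above might look like this in code. `send` is a hypothetical callable that raises `TimeoutError` when its deadline passes, and the set of idempotent methods is an assumption about the API in question (a `PUT` is only idempotent if the service implements it that way).

```python
import time

# Assumed to be safe to repeat; adjust to the actual API contract.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}


def call_with_retry(send, method, timeout_s=0.5, max_retries=1, backoff_s=0.05):
    """Call send(timeout_s); retry on timeout only for idempotent methods."""
    attempts = 1 + (max_retries if method in IDEMPOTENT_METHODS else 0)
    for attempt in range(attempts):
        try:
            return send(timeout_s)
        except TimeoutError:
            if attempt == attempts - 1:
                raise              # out of attempts: fail fast
            time.sleep(backoff_s)  # brief pause before the single retry
```

Keeping the retry count at one bounds the worst-case amplification at 2× the original traffic, which is what prevents retries from turning into a self-inflicted overload.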
8) Operations and changes
Change management: releases via canaries/blue-green, SLO gates, automatic rollback.
Scheduled maintenance windows: standardize duration, frequency, and communications.
Incidents: roles (IC/Comms/Tech/DB), runbooks, and post-mortems with corrective actions.
Security events: on compromise, switch to "panic mode" (read-only mode, token rotation, blocking).
9) Observability and alerting
RED model (Rate, Errors, Duration) for each route.
SLI dashboards: availability/latency by region and by customer segment.
Burn-rate alerts: fast (1 h, 14.4×) and slow (6 h, 2×), so they fire before the SLO is breached.
Exemplars: jump from a metric to the matching traces by trace_id.
Synthetics: probes from external vantage points (perimeter, payment flow).
10) Fault tolerance tests
Game-days: scenarios for disabling AZ/regions, database/cache degradation, failure of external providers.
Chaos tools: network faults (latency/loss), pod kills, CPU/IO overload.
DR drills: rehearse and validate RTO/RPO for Tier-0 systems (see "Backups and DR").
11) SLA Design
Definition of "availability": what counts as a failure (5xx, latency > T, domain errors).
Calculation window: month/quarter; inclusion or exclusion of planned maintenance.
Credits/penalties: a tiered scale (e.g. 99.9–99.99%: X%; below that: Y%).
Client responsibilities: correct integration, retries within reasonable limits, respecting rate limits.
Notifications and the claims procedure: deadlines, format, evidence (logs/metrics).
Force majeure: legal wording and boundaries.
- Example wording: "API availability, measured by the SLI 'successful response within 500 ms', is at least 99.95% per calendar month. Scheduled windows (up to 60 min/month, announced 48 hours in advance) are excluded. At 99.90–99.95% the credit is 5%; at 99.80–99.90%, 10%; below 99.80%, 25%."
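The credit scale in the sample clause can be expressed as a small lookup. This is a sketch under one assumption the clause leaves open: each tier's lower bound is treated as inclusive. The function names are illustrative, and the counts are assumed to already exclude traffic during announced maintenance windows.

```python
def sla_credit_percent(availability_pct):
    """Service credit for a month, using the tiers from the sample clause."""
    if availability_pct >= 99.95:
        return 0   # SLA met, no credit
    if availability_pct >= 99.90:
        return 5
    if availability_pct >= 99.80:
        return 10
    return 25


def measured_availability_pct(good, total):
    """Availability in percent from good/total request counts."""
    return 100.0 * good / total


month = measured_availability_pct(good=999_200, total=1_000_000)  # 99.92%
credit = sla_credit_percent(month)
```

Whichever boundary convention is chosen, it should be written into the contract itself so that a month landing exactly on 99.90% is not open to dispute.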
12) The economics of nines
Each additional "nine" raises costs non-linearly (duplicate regions, quorums, backup providers, 24×7 coverage). Use tiered SLOs:
- Tier-0 (money/orders): 99.95–99.99%, multi-AZ, DR-ready.
- Tier-1 (core features): 99.9–99.95%, multi-AZ.
- Tier-2 (non-critical): 99.5–99.9%, degradation or shutdown is acceptable during incidents.
13) HA patterns by layer
Perimeter: CDN/edge, multi-CDN or GSLB, WAF, rate-limit.
Balancing: L7 with outlier ejection, timeouts/retries, sticky/consistent-hash routing.
Applications: horizontal scale, readiness/liveness, PDB, topology spread.
Data: leader + replicas, quorum for CP, L2 cache, idempotency, PITR.
Queues: mirroring/multicluster, dedup, DLQ.
Secrets/configs: GitOps, atomic snapshots, rollback.
14) Anti-patterns
SLA without measuring instruments and external synthetics.
Single zone/cluster as SPOF.
Uncontrolled retries → "self-DDoS."
Long transactions/mutexes on the hot path.
"Heavy" migrations/releases without canaries and rollback plan.
Lack of runbook and communication with stakeholders in an incident.
15) Implementation checklist (0-60 days)
0-15 days
Define critical user SLIs, set SLOs by Tier-0/1/2 levels.
Enable burn-rate alerts, SLO dashboards, and synthetic perimeter checks.
Remove SPOFs: ≥ 2 replicas, PDB, multi-AZ for frontends and critical databases.
16-40 days
Introduce canary releases with SLO-gating and auto-rollback.
Dependency map plus quotas/pools/timeouts/PB for each critical ("red") path.
Policy for scheduled windows and communications; incident message templates.
41-60 days
Game day: AZ outage, failure of an external provider, a traffic burst.
Recalculate SLA attainment and actual credits; publish reports to customers.
Review the cost ↔ nines trade-off and reassign services to tiers.
16) Maturity metrics
≥ 95% of critical routes have SLI/SLO and burn-rate alerts.
SLO violations trigger an automatic release freeze (policy).
Multi-AZ coverage for Tier-0 = 100%; successful DR drills ≥ 1 per quarter.
"Detection → mitigation" time: p50 < 5 min, p95 < 15 min.
"Release ↔ incident" correlation is tracked and decreasing (rollback rate ↓).
Public Incident/Credit Report - within N business days.
17) Examples and snippets
Burn-rate alerts (rule idea):
- Fast: "SLO 99.95%, window 1 h, burn ≥ 14.4× → page on-call."
- Slow: "window 6 h, burn ≥ 2× → ticket & monitoring."
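The two rules above can be evaluated with a few lines of code. This is a sketch of the idea only: the `RULES` table, the function names, and the `(bad, total)` input shape are assumptions, and a real setup would typically express the same thresholds as alerting rules in the monitoring system.

```python
def burn_rate(bad, total, slo):
    """Observed error rate divided by the error budget (1 - SLO)."""
    if total == 0:
        return 0.0
    return (bad / total) / (1 - slo)


# (window label, burn-rate threshold, action), taken from the two rules above.
RULES = [("1h", 14.4, "page"), ("6h", 2.0, "ticket")]


def evaluate(windows, slo=0.9995):
    """windows maps a window label to (bad, total) request counts."""
    actions = []
    for label, threshold, action in RULES:
        bad, total = windows[label]
        if burn_rate(bad, total, slo) >= threshold:
            actions.append(action)
    return actions


# 0.8% errors over the last hour is a 16x burn; 0.15% over six hours is 3x.
alerts = evaluate({"1h": (80, 10_000), "6h": (90, 60_000)})
```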
```yaml
# Circuit breaking and outlier ejection; field names match Envoy's cluster configuration.
circuit_breakers:
  thresholds:
    - max_connections: 200
      max_pending_requests: 100
      max_requests: 1000
      max_retries: 1
outlier_detection:
  consecutive_5xx: 5
  interval: 5s
  base_ejection_time: 30s
  max_ejection_percent: 50
```
Canary with SLO analysis (Argo Rollouts, idea):
```yaml
# Simplified outline, not a complete AnalysisTemplate manifest.
analysis:
  templates:
    - name: slo-burn
      metrics:
        - name: error-rate
          successCondition: result < 0.005
          provider: prometheus
```
SLI formulation example:
SLI: fraction_of_good_requests = good(HTTP 2xx/3xx ≤ 500ms) / all(requests)
SLO: ≥ 99.95% per calendar month, per region
18) Conclusion
High availability is not just clusters and replicas but a coherent combination of architecture, processes, and metrics: clear SLIs/SLOs, a realistic SLA, the economics of nines, degradation instead of failure, timeout and quota discipline, canary releases, regular drills, and transparent communication. Make availability measurable and manageable, and it becomes a competitive advantage rather than a lottery.