SRE culture and engineering principles
1) What is SRE culture
SRE culture is a set of values and practices that make reliability manageable: SLO targets → error budget → deliberate change risk → fast stabilization → learning from incidents.
The key idea: speed is not the enemy of reliability. Fast releases are possible when risk is measured and safeguards are automated.
- User-centric: define reliability as the user sees it (SLI/SLO).
- Automation-first: any repeatable action becomes a script, policy, or controller.
- Blameless: failures are systemic; we investigate causes, not people.
- Data-driven: decisions are based on metrics and error budgets.
- Simplicity: simple, testable mechanisms over "magic" solutions.
2) SRE Engineering Philosophy
1. SLO/SLI and error budget are the basis of priorities and alerting.
2. Incident → stabilization → RCA: mitigate symptoms first, then find causes.
3. Reducing toil (manual labor): the target is ≤ 50% of SRE time, decreasing over time.
4. Production readiness: a readiness review is required before external traffic.
5. Simplicity and isolation: fewer dependencies, tighter blast-radius limits.
6. Observability by default: metrics/logs/traces, SLO widgets, synthetics.
7. Managed change: progressive delivery, canary rollouts, auto-rollback.
8. Security by design: secrets, access, audit, least privilege.
9. Learning loops: drills, chaos games, post-mortems, retrospectives.
10. FinOps awareness: the "price of nines," cost-to-serve, cost-effective SLOs.
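To make the first principle concrete, the arithmetic behind an error budget can be sketched as follows. This is a minimal illustration, assuming an availability-style SLO expressed as a fraction and a 30-day window; the function names are hypothetical and not part of any specific tool.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates within the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    if total_events == 0:
        return 0.0
    allowed_bad = (1.0 - slo) * total_events
    return bad_events / allowed_bad


# A 99.9% SLO over 30 days tolerates roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)
# 500 failed requests out of 1,000,000 spend about half of a 99.9% budget.
spent = budget_consumed(500, 1_000_000, 0.999)
```

The same fraction can drive both prioritization (how much risk is left to spend on releases) and alerting (how fast the budget is burning).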
3) Rituals and processes
3.1 Production Readiness Review (PRR)
Before enabling traffic, the service must have:
- SLI/SLO, dashboard, and alerts (fast/slow burn).
- Health endpoints: '/healthz', '/readyz', '/startupz'.
- Incident runbook/playbook, owner/on-call, escalation chain.
- Backups/DR plan, resource limits, budget calculations.
- Fault tolerance tests (feature flags, rollback scripts).
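As an illustration of the three health endpoints, here is a small sketch using only the Python standard library. The state flags, the exact semantics of each probe, and the host/port are assumptions made for the example; a real service would define readiness from its own dependencies.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical process state; a real service would update these flags itself.
STATE = {"started": False, "ready": False}


def health_status(path: str, state: dict) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":   # liveness: the process is up and answering
        return 200
    if path == "/startupz":  # startup: initialization has finished
        return 200 if state["started"] else 503
    if path == "/readyz":    # readiness: safe to receive traffic
        return 200 if state["started"] and state["ready"] else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code = health_status(self.path, STATE)
        self.send_response(code)
        self.end_headers()
        self.wfile.write(b"ok\n" if code == 200 else b"unavailable\n")


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Blocking helper for local experiments (host and port are placeholders)."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Keeping the decision in a pure function (`health_status`) makes the probe logic testable without starting a server.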
3.2 Weekly SLO Briefing
Status of each service's error budget.
Incidents of the week, CAPA progress.
Release risk: where releases are allowed or restricted by the remaining budget.
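The release-risk item of the briefing can be reduced to a simple rule over budget consumption. The sketch below is one possible policy, not a prescribed one: the 0.8 freeze threshold mirrors the charter fragment later in this document, while the 0.5 "restricted" threshold and the label names are assumptions.

```python
def release_policy(budget_consumed: float,
                   restrict_threshold: float = 0.5,   # assumed value
                   freeze_threshold: float = 0.8) -> str:
    """Return 'allowed', 'restricted', or 'frozen' for a service.

    budget_consumed is the fraction of the error budget already spent
    in the current SLO window (0.0 = untouched, 1.0 = exhausted).
    """
    if budget_consumed < 0:
        raise ValueError("budget_consumed cannot be negative")
    if budget_consumed >= freeze_threshold:
        return "frozen"       # only reliability fixes ship
    if budget_consumed >= restrict_threshold:
        return "restricted"   # e.g. low-risk changes only
    return "allowed"


# Hypothetical weekly snapshot: service name -> budget consumed.
snapshot = {"checkout": 0.35, "login": 0.62, "payments": 0.91}
report = {name: release_policy(spent) for name, spent in snapshot.items()}
```

A table like `report` is a convenient artifact to bring to the weekly briefing, since it turns the budget status into an explicit release decision per service.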
3.3 Blameless postmortem
Facts and timeline, user impact, what helped and what hindered.
Systemic causes (processes/tools), not a "culprit."
Specific CAPAs with owners and deadlines, published company-wide.
3.4 Chaos games and drills
Planned failure injection (network, database, cache, nodes) against a target SLO.
"Game day": measure stabilization time and MTTR, then adjust the playbooks.
4) Alerting and noise
Principles:
- Alert only on symptoms: a violated SLO or a broken user path.
- Multi-window, multi-burn: fast and slow channels.
- Quorum/anti-flapping: 'for' delays, suppression during maintenance.
- No "CPU > 80%" pages: such signals belong on dashboards, not on the pager.
Targets:
- Share of actionable alerts ≥ 80%.
- Median time-to-ack ≤ 5 minutes (P1).
- Pager fatigue reduction: ≤ 1 night page per week per engineer.
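The multi-window, multi-burn idea can be sketched as a small decision function. Burn rate here is the observed error ratio divided by the ratio the SLO allows. The window names and the thresholds 14.4 and 6 are assumptions for this example: they are values commonly cited for a 30-day SLO window, and each team should derive its own.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def burn_alert(error_ratios: dict, slo: float) -> str:
    """Classify the current state as 'fast-burn', 'slow-burn', or 'ok'.

    error_ratios maps a window name to the observed error ratio in that
    window; the keys '5m', '30m', '1h', '6h' are assumed for this sketch.
    An alert fires only if BOTH the long and the short window exceed the
    threshold, which suppresses brief spikes and already-recovered incidents.
    """
    rate = {window: burn_rate(ratio, slo) for window, ratio in error_ratios.items()}
    if rate["1h"] >= 14.4 and rate["5m"] >= 14.4:
        return "fast-burn"   # page immediately
    if rate["6h"] >= 6.0 and rate["30m"] >= 6.0:
        return "slow-burn"   # lower-urgency channel
    return "ok"
```

For example, with a 99.9% SLO, a 2% error ratio sustained across all windows is a burn rate of about 20 and lands in the fast channel, while a five-minute spike that the one-hour window does not confirm stays silent.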
5) Change Management
Progressive delivery: canary → 10% → 25% → 50% → 100%.
Auto-rollback on SLO signals (errors/latency).
Feature-flags and kill-switch instead of global rollback.
Change policy by risk: fast lane for low-risk changes; CAB for high-risk changes only.
Example (yaml):
steps:
  - setWeight: 10
  - analysis: { template: "slo-check" }  # fail ⇒ rollback
  - setWeight: 25
  - analysis: { template: "slo-check" }
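The control flow behind progressive delivery with auto-rollback can be sketched independently of any rollout tool. In this illustration `slo_check` is a hypothetical callback standing in for the analysis step: it receives the current traffic weight and returns whether the SLO signals (errors/latency) are healthy.

```python
from typing import Callable, Sequence, Tuple


def run_canary(steps: Sequence[int],
               slo_check: Callable[[int], bool]) -> Tuple[str, int]:
    """Walk through traffic weights; roll back on the first failed check.

    Returns (outcome, final_weight): ('promoted', last weight reached)
    or ('rolled_back', 0).
    """
    current = 0
    for weight in steps:
        # A real rollout would shift traffic here and wait for metrics to settle.
        current = weight
        if not slo_check(current):
            return ("rolled_back", 0)
    return ("promoted", current)


# Weights follow the progression described above.
healthy = run_canary([10, 25, 50, 100], lambda weight: True)
# A regression that only shows up at 50% traffic triggers a rollback.
degraded = run_canary([10, 25, 50, 100], lambda weight: weight < 50)
```

The point of the small steps is that a failed check at 10% or 25% limits the blast radius to that share of users.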
6) Reduction of toil (routine manual labor)
Examples of toil sources: manual deploys, restarts, "give me access" tickets, queue cleanup.
Approach:
- Inventory of repeatable tasks → automation/self-service.
- KPIs: % of time spent on toil, "automated steps per incident," "minutes to self-service."
- Platform service catalog (namespaces, DB, queues, dashboards, alerts).
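The "% of time spent on toil" KPI is a simple ratio once work is categorized. The sketch below assumes hours are tracked per category and that the team has agreed which categories count as toil; the category names are made up for the example.

```python
def toil_share(hours_by_category: dict, toil_categories: set) -> float:
    """Fraction of tracked time spent on categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


# Hypothetical one-week timesheet for an engineer (hours).
week = {"manual_deploys": 6, "access_tickets": 4, "project_work": 30}
share = toil_share(week, {"manual_deploys", "access_tickets"})  # 10 of 40 hours
```

Tracking this number per sprint shows whether the automation backlog is actually reducing manual work.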
7) Observability and SLO-first design
Golden Signals (latency, traffic, errors, saturation).
SLO cards in each team: goal, window, budget, burn alerts.
Drilldown: from metrics to logs/traces; 'trace_id' in logs by default.
Synthetics: blackbox + headless scripts (login/deposit/checkout).
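One way to get a trace identifier into every log line by default is a logging filter that reads it from a context variable. This is a sketch with the standard `logging` and `contextvars` modules; the variable name, log format, and `build_logger` helper are assumptions, and in practice the value would come from the tracing library in use.

```python
import contextvars
import logging

# Holds the trace id of the request currently being handled ("-" if none).
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace id to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger whose lines always carry trace_id=<value>."""
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A request handler would call `trace_id_var.set(...)` once at the start of the request; every later `logger.info(...)` in that context then includes the id, which makes the jump from a metric to the relevant logs and traces a plain search.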
8) Capacity management and resilience
Capacity planning: target RPS/concurrency, headroom per AZ/region.
Bulkhead/shedding: isolated pools, shedding secondary functions first.
Backpressure and queues: lag control, DLQ, adaptive concurrency.
Failover and DR: RPO/RTO, regular DR drills.
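Per-AZ headroom follows from a short calculation: the zones that survive a failure must carry the full target load plus a safety margin. The sketch below assumes evenly balanced traffic and treats the 20% margin as an illustrative default rather than a recommendation.

```python
def per_az_capacity(target_rps: float,
                    az_count: int,
                    az_failures_tolerated: int = 1,
                    headroom: float = 0.2) -> float:
    """RPS each AZ must be able to serve so the survivors absorb a failure.

    headroom is an extra safety margin on top of the target (0.2 = 20%).
    """
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return target_rps * (1.0 + headroom) / surviving


# 9,000 RPS across 3 AZs, tolerating the loss of one AZ with 20% headroom:
# each AZ needs capacity for about 5,400 RPS.
needed = per_az_capacity(9000, az_count=3)
```

Comparing this figure with load-test results per zone shows whether a failover drill is likely to stay within the SLO.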
9) Security as part of reliability
Secrets: secret manager, JIT access, audit.
WAF/DDoS protection at the perimeter, per-client/per-tenant limits.
PII minimization, DSAR/Legal Hold during incidents.
Supply chain security: artifact signing, base image policy.
10) On-call health
No single-person rotations; clear rest windows.
Night-time pages are limited to SLO-based P1/P2 alerts.
Mental health: sleep deprivation is recorded as an operational risk.
Metrics: pages/week, night pages/engineer, recovery time.
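The night-page metric can be computed directly from the paging log. In this sketch the log is assumed to be a list of (engineer, local timestamp) pairs covering one week, and "night" is assumed to mean 22:00 to 07:00; the limit of one night page per engineer per week matches the alerting target stated earlier.

```python
from collections import Counter
from datetime import datetime


def is_night(ts: datetime, start_hour: int = 22, end_hour: int = 7) -> bool:
    """True if the timestamp falls in the assumed night window (22:00-07:00)."""
    return ts.hour >= start_hour or ts.hour < end_hour


def night_pages_per_engineer(pages) -> dict:
    """Count night pages per engineer from (engineer, timestamp) pairs."""
    return dict(Counter(engineer for engineer, ts in pages if is_night(ts)))


def over_night_limit(pages, limit: int = 1) -> list:
    """Engineers who received more night pages than the weekly limit."""
    counts = night_pages_per_engineer(pages)
    return sorted(name for name, count in counts.items() if count > limit)


# Hypothetical one-week paging log in each engineer's local time.
week_pages = [
    ("ann", datetime(2024, 1, 1, 23, 30)),
    ("ann", datetime(2024, 1, 3, 2, 0)),
    ("bob", datetime(2024, 1, 2, 14, 0)),
    ("bob", datetime(2024, 1, 4, 6, 59)),
]
flagged = over_night_limit(week_pages)
```

A non-empty `flagged` list is a signal to review alert quality or rebalance the rotation before fatigue turns into an operational risk.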
11) SRE Maturity Metrics
SLO coverage: the proportion of critical paths with SLO/alerts ≥ 90%.
Error-budget governance: freeze rules exist and are enforced.
Toil: ≤ 30-40% of the time, downward trend.
MTTD/MTTR: medians tracked quarter over quarter.
Auto-mitigation rate: % of incidents with an automatic mitigating action.
PRR pass rate: percentage of releases that have passed production readiness.
Postmortem SLA: for SEV-1, postmortem within ≤ 48 hours.
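Median MTTD and MTTR can be derived from three timestamps per incident. The record layout below is hypothetical, and the sketch assumes both durations are measured from the start of user impact; some teams measure recovery from detection instead, so the definition should be agreed on first.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Incident:
    started: datetime    # when user impact began
    detected: datetime   # when an alert fired or someone noticed
    resolved: datetime   # when user impact ended


def median_minutes(durations) -> float:
    """Median of an iterable of timedeltas, in minutes."""
    return median(d.total_seconds() / 60 for d in durations)


def mttd_mttr(incidents) -> tuple:
    """Return (median time to detect, median time to restore) in minutes."""
    mttd = median_minutes(i.detected - i.started for i in incidents)
    mttr = median_minutes(i.resolved - i.started for i in incidents)
    return mttd, mttr
```

Using the median rather than the mean keeps one unusually long incident from hiding the trend across the quarter.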
12) Documentation and knowledge
Minimum set:
- Runbooks/playbooks (top scenarios: 5xx spike, DB lag, Kafka lag, NodeNotReady, TLS).
- SLO cards and dashboards.
- PRR checklists and release templates.
- Platform service catalog and OLAs/SLAs.
- Training materials: SRE 101, Chaos 101, On-call 101.
13) Anti-patterns
Hero culture: "rescuers" instead of systemic fixes.
Noisy alerting: CPU/disk alerts on the pager, hundreds of unnecessary signals.
"DevOps is a person": diffuse responsibility, no owners.
No SLOs: "keep everything green" → priority chaos.
Delayed post-mortems and "witch hunts."
Global rollbacks without canaries.
Secrets in config/repo; no activity audit.
Observability as "beautiful graphs" without actionable signals.
14) Artifact templates
14.1 SRE Charter (fragment)
Example (yaml):
mission: "Make reliability manageable and economical"
tenets:
  - "The user is at the center of SLI/SLO"
  - "Automation-first, minimizing toil"
  - "Blameless & learning"
governance:
  error_budget:
    freeze_threshold: 0.8  # 80% of the budget burned ⇒ release freeze
    review_cadence: "weekly"
oncall:
  paging_policy: "SLO-only, P1/P2 at night"
  health_metrics: ["pages_per_week", "night_pages_per_engineer"]
14.2 Mini-PRR checklist
- SLI/SLO and burn alerts are configured
- Health-endpoints and synthetics
- Runbook/playbook + owner/on-call
- Rollback/feature flags/canary
- Dashboards for latency/errors/traffic/saturation
- Limits/quotas/security guardrails
- DR plan and backups tested
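A checklist like this is easy to enforce mechanically, for example as a gate in the release pipeline. The sketch below assumes each service publishes a dictionary of booleans; the key names are invented for the example and map one-to-one onto the checklist items.

```python
# Hypothetical keys, one per checklist item above.
PRR_ITEMS = (
    "slo_and_burn_alerts",
    "health_endpoints_and_synthetics",
    "runbook_and_owner",
    "rollback_flags_canary",
    "golden_signal_dashboards",
    "limits_quotas_guardrails",
    "dr_plan_and_backups_tested",
)


def prr_gaps(service: dict) -> list:
    """Checklist items that are missing or not yet satisfied."""
    return [item for item in PRR_ITEMS if not service.get(item, False)]


def prr_passed(service: dict) -> bool:
    """True only when every checklist item is satisfied."""
    return not prr_gaps(service)


# A service that has everything except a tested DR plan.
candidate = {item: True for item in PRR_ITEMS}
candidate["dr_plan_and_backups_tested"] = False
missing = prr_gaps(candidate)
```

Returning the list of gaps, not just a pass/fail flag, gives the owning team a concrete to-do list before traffic is enabled.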
15) Implementation by stage (4 sprints)
Sprint 1 - Foundation
Define critical user paths and SLIs.
Define SLOs and enable burn alerts.
Introduce PRR and a minimal set of playbooks.
Sprint 2 - Change Management
Canary rollouts, auto-rollback on SLO signals.
Self-service operations, service catalog.
Toil inventory and automation plan.
Sprint 3 - Learning Loops
Post-mortem ritual, chaos games calendar.
SLO and incident dashboards, error-budget reporting.
Sprint 4 - Optimization and Scale
SLO portfolio, FinOps "cost per 9."
DR discipline, security audit.
On-call KPIs, burnout prevention.
16) Mini-FAQ
Is SRE just "fix everything"?
No. SRE manages the reliability system: SLOs, alerting, processes, automation, and learning.
How do you convince the business to invest in reliability?
Show ROI: lower MTTR, higher conversion, fewer SLA credits, lower cost-to-serve, stable releases.
Do I need separate SRE teams?
A hybrid model works: strategic SRE in the platform team plus embedded SREs in critical products.
Summary
SRE culture is not a job title but a way of working with risk: SLO → error budget → managed change → automation → learning. Fix the principles, start the rituals (PRR, post-mortems, chaos games), eliminate toil, build observability "by default," and take care of on-call. This gives you sustainable development speed, predictable releases, and a reliable, economical platform.