SRE culture and engineering principles

1) What is SRE culture

SRE culture is a set of values and practices that make reliability manageable: SLO targets → error budget → deliberate change risk → fast stabilization → learning from incidents.
The key idea: speed is not the enemy of reliability. Fast releases are possible when risk is measured and its control is automated.

Core values:

User-centric: define reliability as the user sees it (SLI/SLO).
Automation-first: any repeatable action becomes a script, policy, or controller.
Blameless: errors are systemic; we investigate causes, not people.
Data-driven: decisions are based on metrics and error budgets.
Simplicity: simple, testable mechanisms beat "magic" solutions.

2) SRE Engineering Philosophy

1. SLO/SLI and error budget are the basis for priorities and alerting.
2. Incident → stabilization → RCA: treat symptoms first, then find causes.
3. Reducing manual labor (toil): the target is ≤ 50% of SRE time, decreasing over time.
4. Production readiness: a readiness review is required before a service takes external traffic.
5. Simplicity and isolation: fewer dependencies, tighter blast-radius limits.
6. Observability by default: metrics/logs/traces, SLO widgets, synthetics.
7. Changes are managed: progressive delivery, canary rollouts, auto-rollback.
8. Security by design: secrets, access, audit, least privilege.
9. Learning cycles: drills, chaos games, postmortems, retrospectives.
10. FinOps awareness: "price of nines," cost-to-serve, cost-effective SLOs.

3) Rituals and processes

3.1 Production Readiness Review (PRR)

Before enabling traffic, the service must have:

SLI/SLO, dashboard and alerts (fast/slow burn).
Health endpoints '/healthz', '/readyz', '/startupz'.
Incident runbook/playbook, owner/on-call, escalation chain.
Backups/DR plan, resource limits, budget calculations.
Fault tolerance tests (feature flags, rollback scripts).

3.2 Weekly SLO Briefing

Error-budget status for each service.
Incidents of the week, CAPA progress.
Release risk: where releases are allowed or restricted by the remaining budget.
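The budget status and the release decision from the briefing can be derived from plain request counts. The following is a minimal sketch, assuming a request-based availability SLI; `SloWindow`, `error_budget_report`, and `release_policy` are hypothetical names, and the default threshold of 0.8 mirrors the `freeze_threshold` in the charter fragment in section 14.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (illustrative structure)."""
    slo_target: float       # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int


def error_budget_report(window: SloWindow) -> dict:
    """Return allowed failures, the share of budget burned, and the remainder."""
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures <= 0:
        # No traffic or a 100% target: any failure counts as a fully burned budget.
        burned = 1.0 if window.failed_requests else 0.0
    else:
        burned = window.failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_burned": burned,
        "budget_remaining": max(0.0, 1.0 - burned),
    }


def release_policy(budget_burned: float, freeze_threshold: float = 0.8) -> str:
    """Assumed rule: freeze releases once the burned share reaches the threshold."""
    return "freeze" if budget_burned >= freeze_threshold else "release allowed"


if __name__ == "__main__":
    report = error_budget_report(SloWindow(0.999, 1_000_000, 400))
    print(report, release_policy(report["budget_burned"]))
```

With a 99.9% target and one million requests, roughly 1,000 failures are allowed, so 400 failures burn about 40% of the budget and releases stay open.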

3.3 Blameless postmortem

Facts and timeline, user impact, what helped and what hindered.
Systemic causes (processes/tools), not a "culprit."
Specific CAPAs with owners and deadlines, published company-wide.

3.4 Chaos games and drills

Planned fault injection (network, database, cache, nodes) measured against a target SLO.
"Game day": measure stabilization time and MTTR, then adjust the playbooks.

4) Alerting and noise

Principles:

Alert only on symptoms: a violated SLO or a broken user path.
Multi-window, multi-burn-rate: fast and slow channels.
Quorum/anti-flapping: 'for' delays, suppression during maintenance.
No "CPU > 80%" pages: such signals belong on dashboards, not on the pager.
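The fast and slow channels can be illustrated with a burn-rate check over paired windows. This is a sketch under stated assumptions: the window pairs (1h/5m and 6h/30m) and the thresholds (14.4 and 6) are commonly cited starting values, not something this document prescribes, and `classify_burn` is a hypothetical helper.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def both_windows_burning(long_ratio: float, short_ratio: float,
                         slo_target: float, threshold: float) -> bool:
    """Fire only when the long AND the short window exceed the threshold.

    The long window shows the problem is significant; the short window shows
    it is still happening, which suppresses alerts for already-recovered spikes.
    """
    return (burn_rate(long_ratio, slo_target) >= threshold
            and burn_rate(short_ratio, slo_target) >= threshold)


def classify_burn(error_ratios: dict, slo_target: float = 0.999) -> str:
    """error_ratios maps a window name ('5m', '30m', '1h', '6h') to an error ratio."""
    if both_windows_burning(error_ratios["1h"], error_ratios["5m"], slo_target, 14.4):
        return "fast-burn"   # assumed policy: page immediately
    if both_windows_burning(error_ratios["6h"], error_ratios["30m"], slo_target, 6.0):
        return "slow-burn"   # assumed policy: lower-urgency channel
    return "ok"


if __name__ == "__main__":
    print(classify_burn({"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.01}))
```

A short spike that shows up only in the 5-minute window does not fire either channel, which is the anti-flapping behavior the principles ask for.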

Alert Quality KPIs:

Share of actionable alerts ≥ 80%.
Median time-to-ack ≤ 5 minutes (P1).
Pager fatigue reduction: ≤ 1 night page per week per engineer.
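The first two KPIs can be computed from a page log. A minimal sketch, assuming each page record carries an actionable flag, a priority, and a time-to-ack in seconds; the `Page` structure and `alert_quality` function are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Page:
    actionable: bool     # did the page require a human action?
    ack_seconds: float   # time from page to acknowledgement
    priority: str        # e.g. "P1", "P2"


def alert_quality(pages: List[Page]) -> dict:
    """Compute the actionable share and the median P1 time-to-ack."""
    if not pages:
        raise ValueError("no pages in the period")
    actionable_share = sum(p.actionable for p in pages) / len(pages)
    p1_acks = [p.ack_seconds for p in pages if p.priority == "P1"]
    median_ack_p1 = median(p1_acks) if p1_acks else None
    return {
        "actionable_share": actionable_share,
        "median_ack_p1_seconds": median_ack_p1,
        # Targets from the KPI list: >= 80% actionable, P1 median ack <= 5 minutes.
        "meets_targets": actionable_share >= 0.8
        and (median_ack_p1 is None or median_ack_p1 <= 300),
    }


if __name__ == "__main__":
    log = [Page(True, 120, "P1"), Page(True, 240, "P1"), Page(False, 60, "P2")]
    print(alert_quality(log))
```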

5) Change Management

Progressive delivery: canary → 10% → 25% → 50% → 100%.
Auto-rollback on SLO signals (errors/latency).
Feature flags and a kill switch instead of a global rollback.
Change policy by risk: fast lane for low-risk changes; CAB for high-risk changes only.

Canary step pattern (conceptual):

steps:
  - setWeight: 10
  - analysis: { template: "slo-check" }  # fail ⇒ rollback
  - setWeight: 25
  - analysis: { template: "slo-check" }
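The control flow behind these steps can be sketched in a few lines. This is an illustration only, not how any particular delivery controller works: `run_canary` is a hypothetical function, and `slo_check` and `set_weight` are callbacks the caller would supply (for example, a metrics query and a traffic-routing call).

```python
from typing import Callable, Sequence


def run_canary(weights: Sequence[int],
               slo_check: Callable[[int], bool],
               set_weight: Callable[[int], None]) -> str:
    """Shift traffic step by step; roll back on the first failed SLO check."""
    for weight in weights:
        set_weight(weight)            # route `weight` percent to the new version
        if not slo_check(weight):     # analysis step: errors/latency within SLO?
            set_weight(0)             # rollback: all traffic back to stable
            return f"rolled back at {weight}%"
    return "promoted"


if __name__ == "__main__":
    applied = []
    # Simulated analysis that starts failing once half of the traffic is shifted.
    outcome = run_canary([10, 25, 50, 100],
                         slo_check=lambda w: w < 50,
                         set_weight=applied.append)
    print(outcome, applied)
```

Keeping the check and the traffic switch as injected callbacks makes the rollout logic testable without a cluster.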

6) Reduction of toil (routine manual labor)

Examples of toil sources: manual deploys, restarts, "give access" tickets, queue cleaning.

Approach:

Repeatable task inventory → automation/self-service.
KPIs: % of time spent on toil, "automated steps per incident," "minutes to self-service."
Platform service catalog (namespaces, DB, queues, dashboards, alerts).
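The inventory and the "% of time on toil" KPI can be derived from a simple time log. A minimal sketch, assuming entries of the form (task, hours, is_toil); `toil_summary` is a hypothetical helper, and the sample tasks are illustrative.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def toil_summary(entries: Iterable[Tuple[str, float, bool]]) -> Tuple[float, List[str]]:
    """Return (share of time spent on toil, toil tasks ranked by hours spent)."""
    total_hours = 0.0
    toil_by_task = defaultdict(float)
    for task, hours, is_toil in entries:
        total_hours += hours
        if is_toil:
            toil_by_task[task] += hours
    toil_share = sum(toil_by_task.values()) / total_hours if total_hours else 0.0
    # The biggest time sinks are the first candidates for automation/self-service.
    candidates = sorted(toil_by_task, key=toil_by_task.get, reverse=True)
    return toil_share, candidates


if __name__ == "__main__":
    week = [
        ("manual deploy", 6, True),
        ("access tickets", 3, True),
        ("manual deploy", 4, True),
        ("design review", 27, False),
    ]
    share, ranked = toil_summary(week)
    print(f"toil share: {share:.1%}, automate first: {ranked}")
```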

7) Observability and SLO-first design

Golden Signals (latency, traffic, errors, saturation).
SLO cards in each team: goal, window, budget, burn alerts.
Drilldown: from metrics to logs/traces; 'trace_id' in logs by default.
Synthetics: blackbox + headless scenarios (login/deposit/checkout).
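Putting a trace identifier into every log line is what makes the drilldown from metrics to logs and traces possible. One way to sketch it with the standard `logging` module, assuming the trace ID is kept in a context variable that request-handling code sets; the logger name `sre_demo` and the log format are illustrative choices.

```python
import contextvars
import io
import logging

# Holds the current request's trace ID; "-" when no request is in flight.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop records, only enrich them


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("sre_demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    trace_id_var.set("abc123")      # normally set by request middleware
    log.info("checkout failed")
    print(buffer.getvalue(), end="")
```

Because the filter is attached to the handler, application code keeps calling `log.info(...)` as usual and does not need to pass the ID explicitly.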

8) Capacity management and resilience

Capacity planning: target RPS/concurrency, headroom per AZ/region.
Bulkhead/shedding: isolated pools, shedding secondary functions first.
Backpressure and queues: lag control, DLQ, adaptive concurrency.
Failover and DR: RPO/RTO, regular DR drills.
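Headroom per AZ can be made concrete with a small sizing calculation. A sketch under stated assumptions: the service must carry peak load after losing one full AZ, instances are run at a target utilization below 100%, and all numbers (peak RPS, per-instance RPS, the 0.75 utilization default) are illustrative rather than taken from this document.

```python
import math


def instances_per_az(peak_rps: float, rps_per_instance: float,
                     az_count: int, target_utilization: float = 0.75) -> int:
    """Instances needed in EACH AZ so that the remaining AZs survive the loss of one."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    surviving_azs = az_count - 1
    usable_rps = rps_per_instance * target_utilization
    return math.ceil(peak_rps / (usable_rps * surviving_azs))


if __name__ == "__main__":
    per_az = instances_per_az(peak_rps=9000, rps_per_instance=500, az_count=3)
    print(f"{per_az} instances per AZ, {per_az * 3} in total")
```

For 9,000 RPS at peak, 500 RPS per instance, three AZs, and 75% target utilization this gives 12 instances per AZ: the two surviving AZs then hold 24 instances with 9,000 RPS of usable capacity.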

9) Security as part of reliability

Secrets: secret manager, JIT access, audit.
WAF/DDoS protection at the perimeter, per-client/per-tenant limits.
PII minimization, DSAR/Legal Hold in incidents.
Supply chain security: artifact signing, base image policy.

10) On-call health

No single-person rotations; clear rest windows.
Night pages only for SLO-based P1/P2 alerts.
Mental health: sleep deprivation is recorded as an operational risk.
Metrics: pages/week, night pages/engineer, recovery time.

11) SRE Maturity Metrics

SLO coverage: the proportion of critical paths with SLO/alerts ≥ 90%.
Error-budget governance: freeze rules exist and are enforced.
Toil: ≤ 30-40% of time, with a downward trend.
MTTD/MTTR: medians tracked quarter over quarter.
Auto-mitigation rate: % of incidents with automatic action.
PRR pass-rate: percentage of releases that have passed production readiness.
Postmortem SLA: for SEV-1, a postmortem within 48 hours.
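The MTTD/MTTR medians can be computed from incident timestamps. A minimal sketch, assuming each incident record has `started`, `detected`, and `resolved` times and that both metrics are measured from the start of the incident (definitions vary between teams); `median_minutes` is a hypothetical helper.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, Mapping


def median_minutes(incidents: Iterable[Mapping[str, datetime]],
                   start_key: str, end_key: str) -> float:
    """Median duration in minutes between two timestamps across incidents."""
    return median(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


if __name__ == "__main__":
    def at(hour: int, minute: int) -> datetime:
        return datetime(2024, 1, 15, hour, minute)  # illustrative date

    quarter = [
        {"started": at(10, 0), "detected": at(10, 4), "resolved": at(10, 40)},
        {"started": at(12, 0), "detected": at(12, 10), "resolved": at(13, 30)},
        {"started": at(15, 0), "detected": at(15, 6), "resolved": at(15, 30)},
    ]
    print("MTTD (median, min):", median_minutes(quarter, "started", "detected"))
    print("MTTR (median, min):", median_minutes(quarter, "started", "resolved"))
```

Medians are used instead of means so that a single long outage does not hide the typical detection and recovery time.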

12) Documentation and knowledge

Minimum set:

Runbooks/playbooks (top scenarios: 5xx spike, DB lag, Kafka lag, NodeNotReady, TLS).
SLO cards and dashboards.
PRR checklists and release templates.
Platform service catalog and OLAs/SLAs.
Training materials: SRE 101, Chaos 101, On-call 101.

13) Anti-patterns

Hero culture: "rescuers" instead of systemic fixes.
Noisy alerting: CPU/disk alerts on the pager, hundreds of unnecessary signals.
"DevOps is a person": diffused responsibility, no owners.
No SLOs: "keep everything green" → chaotic priorities.
Delayed postmortems and "witch hunts."
Global rollbacks without canaries.
Secrets in config/repo; no activity audit.
Observability as "beautiful graphs" without actionable signals.

14) Artifact templates

14.1 SRE charter (fragment)

mission: "Make reliability manageable and economical"
tenets:
  - "The user is at the center of SLIs/SLOs"
  - "Automation-first, minimizing toil"
  - "Blameless & learning"
governance:
  error_budget:
    freeze_threshold: 0.8  # 80% of the budget burned ⇒ release freeze
    review_cadence: "weekly"
  oncall:
    paging_policy: "SLO-only, P1/P2 at night"
    health_metrics: ["pages_per_week", "night_pages_per_engineer"]

14.2 Mini-PRR checklist

SLI/SLO and burn alerts are configured
Health-endpoints and synthetics
Runbook/playbook + owner/on-call
Rollback/feature flags/canary
latency/errors/traffic/saturation dashboards
Limits/quotas/security guardrails
DR plan and backups tested
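The checklist can also serve as an automated gate before traffic is enabled. A minimal sketch, assuming each item is tracked as a boolean in some service metadata; the key names and the functions `prr_gaps` and `ready_for_traffic` are hypothetical.

```python
from typing import Dict, List

# One key per checklist item above (names are illustrative).
PRR_CHECKLIST = (
    "slo_and_burn_alerts",
    "health_endpoints_and_synthetics",
    "runbook_and_owner",
    "rollback_flags_canary",
    "golden_signal_dashboards",
    "limits_quotas_guardrails",
    "dr_plan_and_backups_tested",
)


def prr_gaps(service_state: Dict[str, bool]) -> List[str]:
    """Return checklist items that are missing or not yet satisfied."""
    return [item for item in PRR_CHECKLIST if not service_state.get(item, False)]


def ready_for_traffic(service_state: Dict[str, bool]) -> bool:
    """A service passes the mini-PRR only when no gaps remain."""
    return not prr_gaps(service_state)


if __name__ == "__main__":
    state = {item: True for item in PRR_CHECKLIST}
    state["dr_plan_and_backups_tested"] = False
    print(ready_for_traffic(state), prr_gaps(state))
```

Treating a missing key as a failed item keeps the gate conservative: a new checklist entry blocks traffic until someone confirms it.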

15) Implementation by stage (4 sprints)

Sprint 1 - Foundation

Define critical user paths and SLIs.
Formulate SLOs and enable burn alerts.
Introduce PRR and a minimum set of playbooks.

Sprint 2 - Change Management

Canary rollouts, auto-rollback on SLO signals.
Self-service operations, service catalog.
Toil inventory and automation plan.

Sprint 3 - Learning Cycles

Postmortem ritual, chaos games calendar.
SLO + incident dashboards, error-budget reporting.

Sprint 4 - Optimization and Scale

SLO portfolio, FinOps "cost per nine."
DR discipline in place, security audit.
On-call KPIs, burnout prevention.

16) Mini-FAQ

SRE = "fix everything"?
No. SRE manages the reliability system: SLOs, alerting, processes, automation, and learning.

How do you convince the business to invest in reliability?
Show ROI: lower MTTR, higher conversion, fewer SLA credits, lower cost-to-serve, stable releases.

Do I need separate SRE teams?
Hybrid model: strategic SRE in platform + embedded-SRE in critical products.

Summary

SRE culture is not a job title but a way of working with risk: SLO → error budget → managed change → automation → learning. Establish the principles, start the rituals (PRR, postmortems, chaos games), cut toil, build observability "by default," and take care of on-call. This gives you sustainable development speed, predictable releases, and a reliable, cost-effective platform.
