Operating layer architecture
1) Purpose of the operating layer
The operating layer is a platform and a set of practices that make operations predictable: fast releases, low MTTR, compliance, and controlled cost. It provides guardrails for products and infrastructure: standards, automation, observability, change management, and secure access.
2) Logical model (planes and domains)
┌────────────────────────────────────────────────────────┐
│ Interface Plane (UX) │← ChatOps/Portals/API
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Control Plane: Policy, Orchestration, Identity, CMDB │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Data/Execution Plane: CI/CD, Jobs, IaC, Runtime Ops │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Telemetry Plane: Logs, Metrics, Traces, SLO Dashboards │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Security & Compliance Plane: Secrets, RBAC, Audit, IR │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Finance/Cost Plane: Usage, Quotas, Budgets, FinOps │
└────────────────────────────────────────────────────────┘
Key domains:
- Service catalog/CMDB: a single registry of services, owners, SLOs, and dependencies.
- Orchestration: pipelines, tasks, cron jobs, backups, DR.
- Policies (Policy-as-Code): alerts, access, retention, change-gates.
- Observability: metrics/traces/logs, SLI/SLO, alerts and status page.
- Access/secrets: JIT/JEA, tokens, crypto, KMS/Vault.
- Incidents/changes: ITSM/tickets, CAB/RFC, post-mortems, simulations.
- DataOps: data contracts, freshness, lineage, quality.
- FinOps: cost accounting, limits, quotas, optimizations.
3) Reference flows
3.1 Release (CI/CD → GitOps)
1. PR with code/manifests → tests/scans → artifact signing.
2. Progressive deploy (canary/blue-green) with SLO guardrails.
3. Auto-rollback on degradation; release annotations in telemetry.
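The progressive deploy with auto-rollback can be sketched as a loop over traffic steps. This is a minimal sketch, not a specific tool's API: `progressive_rollout`, `set_traffic`, and `slo_healthy` are hypothetical names, and the step percentages are illustrative.

```python
from typing import Callable, Sequence


def progressive_rollout(
    steps: Sequence[int],
    set_traffic: Callable[[int], None],
    slo_healthy: Callable[[], bool],
) -> bool:
    """Shift traffic to the new version step by step; roll back on SLO breach.

    Returns True if the rollout reached the last step, False if rolled back.
    """
    for percent in steps:
        set_traffic(percent)
        if not slo_healthy():
            set_traffic(0)  # auto-rollback: all traffic back to the old version
            return False
    return True


# Usage with in-memory stand-ins for real deploy tooling (hypothetical).
history: list[int] = []
checks = iter([True, True, False])  # the third guardrail check fails
completed = progressive_rollout(
    steps=[1, 5, 25, 100],
    set_traffic=history.append,
    slo_healthy=lambda: next(checks),
)
print(completed, history)
```

In this run the guardrail fails at the 25% step, so traffic is set back to 0 and the rollout reports failure.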
3.2 Detect → Respond → Recover
1. Burn-rate/symptoms + quorum → Page + war-room.
2. Diagnosis via traces/logs; playbooks.
3. Rollback/Fallback/Limits → AAR/RCA → CAPA.
3.3 Change (RFC/CAB)
1. Risk analysis + maintenance window + backout plan.
2. Non-critical alerts are suppressed; SLO signals stay active.
3. Evidence and report, policy review.
4) Service catalog and CMDB
Attributes: owner, SLI/SLO, dependencies (internal/external), dashboards, alerts, runbooks, data classes (PII/finance), zones (prod/stage/dev).
Auto-population: from CI/CD, telemetry, and repositories.
Usage: alert routing, escalation, blast radius calculation, maturity reporting.
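Blast radius calculation over catalog dependencies could look like the sketch below. It assumes the catalog can be exported as a mapping from each service to the services it depends on; `DEPENDS_ON`, `blast_radius`, and the `orders-ui` service are hypothetical.

```python
from collections import deque

# Hypothetical catalog extract: service -> services it depends on.
DEPENDS_ON = {
    "checkout-api": ["payments-api", "auth", "redis"],
    "payments-api": ["auth", "psp-a"],
    "orders-ui": ["checkout-api"],
    "auth": [],
    "redis": [],
    "psp-a": [],
}


def blast_radius(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the edges: dependency -> services that rely on it.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted graph.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(blast_radius("auth", DEPENDS_ON)))
```

The same traversal can feed alert routing: the owners of the affected services are the candidates for escalation.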
5) Policy-as-Code
Categories: access (RBAC/ABAC), security (SAST/SCA/DAST), alerts/SLO, retention, change-gates, resources/quotas.
Mechanics: declarative rules (YAML/Rego/CEL), validation in CI, enforcement in Control Plane.
Example gate: "Deploy is allowed if all SLOs are green, no SEV-1 incident is active, tests have passed, and signatures are valid."
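The gate above can be expressed as a small evaluator. This is a sketch in plain Python rather than Rego or CEL; `DeployContext`, `evaluate_gate`, and the violation names are hypothetical and only mirror the four conditions in the example.

```python
from dataclasses import dataclass, field


@dataclass
class DeployContext:
    slo_status: dict[str, str]  # SLO name -> "green" or "red"
    active_incidents: list[str] = field(default_factory=list)  # e.g. ["SEV-2"]
    tests_passed: bool = False
    signatures_valid: bool = False


def evaluate_gate(ctx: DeployContext) -> list[str]:
    """Return the violated rules; an empty list means the deploy is allowed."""
    violations = []
    if any(status != "green" for status in ctx.slo_status.values()):
        violations.append("slo_not_green")
    if "SEV-1" in ctx.active_incidents:
        violations.append("active_sev1")
    if not ctx.tests_passed:
        violations.append("tests_failed")
    if not ctx.signatures_valid:
        violations.append("invalid_signature")
    return violations


ctx = DeployContext(
    slo_status={"availability": "green", "latency": "red"},
    active_incidents=["SEV-2"],
    tests_passed=True,
    signatures_valid=True,
)
violations = evaluate_gate(ctx)
print(violations)
```

Returning the list of violations rather than a bare boolean lets the pipeline report why a deploy was blocked.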
6) Orchestration and execution
CI/CD: build → scan → sign → promote.
Jobs/CronJobs/DAG: backups/rotations/backfills; deadlines and concurrency policy (Forbid/Replace).
Idempotency and rollbacks: check-then-act, step markers, circuit-breaker.
Execution permissions: JIT accounts, limited scope, audit.
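Check-then-act with step markers might look like the following sketch. It assumes an in-memory set stands in for a persistent marker store; `run_steps` and the step names are hypothetical.

```python
from typing import Callable


def run_steps(
    steps: list[tuple[str, Callable[[], None]]],
    done: set[str],
) -> list[str]:
    """Run steps in order, skipping those already marked as done.

    `done` plays the role of the step-marker store; a real job would persist it.
    Returns the names of the steps executed in this call.
    """
    executed = []
    for name, action in steps:
        if name in done:  # check ...
            continue
        action()  # ... then act
        done.add(name)  # record the marker only after the step succeeded
        executed.append(name)
    return executed


# Usage: the first run fails midway, the retry resumes from the failed step.
log: list[str] = []
attempts = {"upload": 0}


def upload() -> None:
    attempts["upload"] += 1
    if attempts["upload"] == 1:
        raise RuntimeError("transient failure")
    log.append("upload")


steps = [("dump", lambda: log.append("dump")), ("upload", upload)]
markers: set[str] = set()

try:
    run_steps(steps, markers)
except RuntimeError:
    pass
second_run = run_steps(steps, markers)  # "dump" is skipped, "upload" is retried
print(second_run, log)
```

A crash between the action and the marker write would repeat that step on retry, so each action should itself be safe to run twice.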
7) Observability and signal quality
SLI/SLO by domain: availability/latency/success of business operations, data freshness.
Alerts: burn-rate in two windows, quorum, rate-limit, runbook and owner.
Logs/metrics/traces are linked by trace_id; links lead from graphs to logs.
Status page: templates, update frequencies, audit of published updates.
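Two-window burn-rate alerting with quorum can be sketched as follows. The sketch assumes burn rate means the observed error ratio divided by the error budget (1 minus the SLO target); the function names, the threshold of 14.4, and the sample counts are illustrative and not tied to a particular monitoring product.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Error-budget burn rate: observed error ratio divided by the allowed ratio."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (errors / total) / budget


def should_page(
    short_window: tuple[int, int],
    long_window: tuple[int, int],
    slo_target: float,
    threshold: float,
    source_votes: list[bool],
    quorum: int,
) -> bool:
    """Page only if both windows burn too fast and enough sources confirm."""
    fast = burn_rate(*short_window, slo_target) >= threshold
    sustained = burn_rate(*long_window, slo_target) >= threshold
    confirmed = sum(source_votes) >= quorum
    return fast and sustained and confirmed


# Windows are (errors, total requests); votes come from independent probes.
page = should_page(
    short_window=(200, 10_000),
    long_window=(900, 60_000),
    slo_target=0.999,
    threshold=14.4,
    source_votes=[True, True, False],
    quorum=2,
)
print(page)
```

Requiring both windows filters out short spikes, and the quorum guards against a single faulty probe paging the on-call.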
8) Access, secrets, crypto
Secret stores (KMS/Vault), rotation, no secrets in repositories.
JIT/JEA: access is issued for the duration of an operation or shift.
mTLS/OIDC between services; image signing/SBOM.
Audit: immutable logs, WORM for critical actions.
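One way to make an audit log tamper-evident is a hash chain, sketched below. This is an illustrative technique and not a substitute for WORM storage; `append_entry` and `verify_chain` are hypothetical helpers, and the entry fields are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(actor: str, action: str, prev: str) -> str:
    body = {"actor": actor, "action": action, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: list[dict], actor: str, action: str) -> dict:
    """Append an audit entry whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "actor": actor,
        "action": action,
        "prev": prev_hash,
        "hash": _digest(actor, action, prev_hash),
    }
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; an edited entry or a gap in the chain returns False."""
    prev_hash = GENESIS
    for entry in log:
        expected = _digest(entry["actor"], entry["action"], entry["prev"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_entry(audit_log, "alice", "deploy checkout-api")
append_entry(audit_log, "bob", "rotate secret")
intact = verify_chain(audit_log)

audit_log[0]["action"] = "edited after the fact"
tampered_ok = verify_chain(audit_log)
print(intact, tampered_ok)
```

The chain does not detect truncation of the newest entries on its own, so the latest hash would need to be anchored somewhere the writer cannot modify.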
9) Incidents, changes, maintenance windows
Incidents: SEV matrix, IC/TL/Comms/Scribe, update templates, AAR→RCA→CAPA.
Changes: RFC/CAB, risk assessment, canaries, backout.
Maintenance windows: timing, communication, alert-rule suppression, evidence.
10) DataOps in the operating layer
Data contracts (schemas, freshness/completeness SLAs).
DQ tests on each layer (Bronze/Silver/Gold).
Lineage and catalogs; quarantine for defective data.
Data SLO and freshness/drift alerts.
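A freshness check against a data contract could be sketched like this. The dataset names, the `max_age` values, and `freshness_breaches` are hypothetical; a real check would read the last-update timestamps from the data catalog.

```python
from datetime import datetime, timedelta, timezone


def freshness_breaches(
    last_updated: dict[str, datetime],
    max_age: dict[str, timedelta],
    now: datetime,
) -> list[str]:
    """Return datasets whose latest update is older than the contract allows."""
    return sorted(
        name
        for name, updated_at in last_updated.items()
        if now - updated_at > max_age[name]
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "orders_gold": now - timedelta(minutes=30),
    "payments_silver": now - timedelta(hours=3),
}
max_age = {
    "orders_gold": timedelta(hours=1),
    "payments_silver": timedelta(hours=2),
}
stale = freshness_breaches(last_updated, max_age, now)
print(stale)
```

Passing `now` explicitly keeps the check deterministic in tests and lets the scheduler decide the evaluation time.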
11) FinOps and cost
Unit economics: $/1k requests, $/successful transaction, $/GiB logs, $/SLO point.
Quotas/limits: egress, log volumes, task duration.
Optimization: partitioning / caching / materialization / archiving (hot-warm-cold).
Reports: top "expensive" services/requests, alerts for overspending.
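The unit metrics are simple ratios, as the sketch below shows. It assumes service cost and logging cost are tracked separately for the same period; `unit_costs`, the key names, and all figures are hypothetical.

```python
def unit_costs(
    service_cost: float,
    logging_cost: float,
    requests: int,
    successful_tx: int,
    log_gib: float,
) -> dict[str, float]:
    """Compute per-unit costs for one service over one billing period."""
    return {
        "usd_per_1k_requests": service_cost / (requests / 1000),
        "usd_per_successful_tx": service_cost / successful_tx,
        "usd_per_gib_logs": logging_cost / log_gib,
    }


costs = unit_costs(
    service_cost=1200.0,
    logging_cost=90.0,
    requests=4_000_000,
    successful_tx=150_000,
    log_gib=300.0,
)
for name, value in costs.items():
    print(f"{name}: {value:.4f}")
```

Tracking these ratios per service over time is what makes an overspending alert meaningful: the alert fires on the trend of $/unit rather than on the absolute bill.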
12) Interfaces: ChatOps/Portals/API
Platform portal: service catalog, deploy/rollback buttons, SLO status, maintenance-window slots, policies.
ChatOps: `/deploy`, `/handover start`, `/mw create`, `/status update`, with audit and evidence.
API: for integration with ITSM/HR/billing/providers.
13) Responsibility Model (RACI)
Platform/SRE: control plane, policies, observability, rotations.
Product/Dev: service SLOs, releases, playbooks.
Security: secrets, vulnerabilities, IR.
Data/Analytics: DataOps, freshness/quality SLAs.
Compliance/Legal: regulatory requirements, evidence storage.
Support/Comms: status page, client messages.
14) Operating layer maturity metrics
SLO coverage: % of services with defined SLI/SLO and burn-rate alerts.
Alert hygiene: actionable ≥80%, FP ≤5%, alerts/on-call-hour (p95).
DORA: deployment frequency, lead time, MTTR, change-failure-rate.
Change governance: % of changes via RFC, % of on-time windows, rollbacks.
Security: average time to rotate secrets/certificates and to close vulnerabilities.
FinOps: $/unit and % QoQ savings.
Docs: runbook/SOP coverage, freshness (≤90 days).
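The alert hygiene targets above can be checked with a few lines of arithmetic. In this sketch each alert record carries two manually assigned flags, `actionable` and `false_positive`; the record shape and `alert_hygiene` are assumptions.

```python
def alert_hygiene(alerts: list[dict]) -> dict[str, float]:
    """Return the share of actionable and false-positive alerts, in percent."""
    total = len(alerts)
    if total == 0:
        return {"actionable_pct": 0.0, "false_positive_pct": 0.0}
    actionable = sum(1 for a in alerts if a["actionable"])
    false_positive = sum(1 for a in alerts if a["false_positive"])
    return {
        "actionable_pct": 100 * actionable / total,
        "false_positive_pct": 100 * false_positive / total,
    }


def meets_targets(stats: dict[str, float]) -> bool:
    """Targets from the metric list: actionable >= 80%, false positives <= 5%."""
    return stats["actionable_pct"] >= 80 and stats["false_positive_pct"] <= 5


# 20 reviewed alerts: 17 actionable, 1 false positive, 2 neither.
alerts = (
    [{"actionable": True, "false_positive": False}] * 17
    + [{"actionable": False, "false_positive": True}] * 1
    + [{"actionable": False, "false_positive": False}] * 2
)
stats = alert_hygiene(alerts)
print(stats, meets_targets(stats))
```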
15) Minimum viable operating layer (MVP) checklist
- Service catalog/CMDB with owners, SLOs, dependencies, and dashboards.
- CI/CD + GitOps, artifact signing, progressive releases, auto-rollback.
- Unified telemetry (logs/metrics/traces) with trace_id and SLO alerts (two windows, quorum).
- Policy-as-Code: access, alerts, retention, change-gates.
- Secret store, JIT/JEA, mTLS/SSO, immutable audit.
- ITSM/Incidents: SEV matrix, playbooks, status page, update templates.
- Maintenance windows: calendar, RFC templates, backout plans, evidence.
- FinOps: cost visibility, quotas/limits, reports.
- Docs-as-Code, SOP/runbook templates, production-readiness checklist.
16) Anti-patterns
"Platform = a set of scripts" without a control plane and policies.
Monitoring "everything" → an avalanche of alerts and alert fatigue.
Manual production changes without GitOps/audit.
Secrets in environment variables without storage and rotation.
No SLOs: arguments rest on feelings, not on quality targets.
Scattered catalogs/owner tables → lost escalations.
High-risk changes without a backout plan.
Logs without structure/correlation → long investigations.
17) Mini templates
17.1 Service card (catalog)
Service: checkout-api
Owner: @team-checkout
SLO: availability 99.9% (28d), p95 latency ≤ 250 ms
Dependencies: payments-api, auth, redis, psp-a
Dashboards: SLO, errors, latency, capacity
Runbooks: rb://checkout/5xx, rb://checkout/rollout
Data: PII masked; retention 30d logs, 365d audit
Change gates: canary 1/5/25%, auto-rollback on burn-rate breach
17.2 Alert policy (idea)
```yaml
id: checkout-latency-burn
type: burn_rate
sli: http_latency_p99
windows:
  short: {duration: 1h, threshold: 5%}
  long: {duration: 6h, threshold: 2%}
quorum: [ "synthetic:eu,us", "rum:checkout" ]
owner: team-checkout
runbook: rb://checkout/latency
routing: page:oncall-checkout
controls: {dedup_key: "svc=checkout,region={{region}}", rate_limit: "1/15m"}
```
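The `dedup_key` and `rate_limit` controls in the policy could be enforced roughly as sketched below. The sketch assumes `1/15m` means at most one notification per dedup key every 15 minutes; `RateLimitedNotifier` is a hypothetical class, and timestamps are plain seconds to keep the example deterministic.

```python
class RateLimitedNotifier:
    """Allow at most one notification per dedup key within `min_interval_s`."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self._last_sent: dict[str, float] = {}

    def should_send(self, dedup_key: str, now_s: float) -> bool:
        last = self._last_sent.get(dedup_key)
        if last is not None and now_s - last < self.min_interval_s:
            return False  # same key fired too recently: suppress
        self._last_sent[dedup_key] = now_s
        return True


notifier = RateLimitedNotifier(min_interval_s=15 * 60)
eu = "svc=checkout,region=eu"
us = "svc=checkout,region=us"

decisions = [
    notifier.should_send(eu, now_s=0),    # first alert for eu
    notifier.should_send(eu, now_s=600),  # 10 minutes later, same key
    notifier.should_send(us, now_s=600),  # different region, different key
    notifier.should_send(eu, now_s=900),  # 15 minutes after the first
]
print(decisions)
```

Including the region in the key keeps an incident in one region from silencing alerts for another.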
17.3 Deploy gate (pseudo)
```yaml
allow_deploy_when:
  tests: passed
  signatures: valid
  active_sev: none_of [SEV-0, SEV-1]
  slo_guardrails: green_last_30m
  rollback_plan: present
```
18) Implementation Roadmap (8-12 weeks)
1. Weeks 1-2: service inventory → catalog/CMDB; basic SLI/SLO and dashboards.
2. Weeks 3-4: GitOps + progressive releases; Policy-as-Code.
3. Weeks 5-6: unified telemetry and status page; burn-rate with quorum; runbook coverage.
4. Weeks 7-8: secrets/JIT, immutable audit; RFC/maintenance windows.
5. Weeks 9-10: FinOps reporting, quotas/limits; optimization of logs and storage.
6. Weeks 11-12: incident/DR simulations; maturity metrics; continuous improvement plan.
19) The bottom line
The operating layer architecture is a control plane plus standardized practices that turn operations into a repeatable, measurable, and safe process. A service catalog, GitOps, telemetry, policies, secure access, and managed changes deliver sustainable releases, rapid recovery, and transparent cost: in other words, operational predictability for the business.