Operating layer architecture

1) Purpose of the operating layer

The operating layer is a platform and a set of practices that make operations predictable: fast releases, low MTTR, compliance, and managed cost. It creates guardrails for products and infrastructure: standards, automation, observability, change management, and secure access.

2) Logical model (planes and domains)


┌────────────────────────────────────────────────────────┐
│ Interface Plane (UX)                                   │ ← ChatOps/Portals/API
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Control Plane: Policy, Orchestration, Identity, CMDB   │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Data/Execution Plane: CI/CD, Jobs, IaC, Runtime Ops    │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Telemetry Plane: Logs, Metrics, Traces, SLO Dashboards │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Security & Compliance Plane: Secrets, RBAC, Audit, IR  │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Finance/Cost Plane: Usage, Quotas, Budgets, FinOps     │
└────────────────────────────────────────────────────────┘

Key domains:

Service catalog/CMDB: a single registry of services, owners, SLOs, and dependencies.
Orchestration: pipelines, jobs, cron jobs, backups, DR.
Policies (Policy-as-Code): alerts, access, retention, change gates.
Observability: metrics/traces/logs, SLI/SLO, alerts, and status page.
Access/secrets: JIT/JEA, tokens, crypto, KMS/Vault.
Incidents/changes: ITSM/tickets, CAB/RFC, post-mortems, incident simulations.
DataOps: data contracts, freshness, lineage, quality.
FinOps: cost accounting, limits, quotas, optimizations.

3) Reference flows

3.1 Release (CI/CD → GitOps)

1. PR with code/manifests → tests/scans → artifact signing.
2. Progressive deploy (canary/blue-green) with SLO guardrails.
3. Auto-rollback on degradation; release annotations in telemetry.
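
The release flow above can be sketched as a small rollout loop. This is a minimal illustration under stated assumptions, not a definitive implementation: `progressive_rollout`, `set_traffic`, and `guardrails_green` are hypothetical names, and a real guardrail check would query SLO telemetry after a soak period at each step.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    steps: Sequence[int],
    set_traffic: Callable[[int], None],
    guardrails_green: Callable[[], bool],
) -> bool:
    """Shift traffic to the new version step by step.

    Rolls back to 0% on the first guardrail breach and returns False;
    returns True if every step stayed green.
    """
    for percent in steps:
        set_traffic(percent)
        if not guardrails_green():
            set_traffic(0)  # auto-rollback: send all traffic back to the old version
            return False
    return True


# Usage with stand-ins: record traffic changes, and simulate a breach at 25%.
history: List[int] = []
simulated_checks = iter([True, True, False])
succeeded = progressive_rollout(
    steps=[1, 5, 25, 100],
    set_traffic=history.append,
    guardrails_green=lambda: next(simulated_checks),
)
print(succeeded, history)
```

Keeping the traffic switch and the guardrail check as injected callables lets the same loop run against a service mesh, a load balancer, or a test double.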

3.2 Detect → Respond → Recover

1. Burn-rate/symptoms + quorum → Page + war-room.
2. Diagnostics by traces/logs; playbooks.
3. Rollback/fallback/limits → AAR/RCA → CAPA.

3.3 Change (RFC/CAB)

1. Risk analysis + maintenance window + backout plan.
2. Suppress non-critical alerts; SLO signals stay active.
3. Evidence and report, policy review.
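
The alert handling during a change window can be illustrated with a short filter. This is a sketch under assumptions: the `Alert` and `MaintenanceWindow` types, their fields, and the alert names are hypothetical, and the rule shown (suppress only alerts that are neither critical nor SLO signals) is one reading of the step above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Alert:
    name: str
    critical: bool
    slo_signal: bool  # True for burn-rate / SLO alerts, which must never be muted


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime

    def is_active(self, now: datetime) -> bool:
        return self.start <= now < self.end


def should_notify(alert: Alert, window: MaintenanceWindow, now: datetime) -> bool:
    """Outside the window everything notifies; inside, only critical or SLO alerts."""
    if not window.is_active(now):
        return True
    return alert.critical or alert.slo_signal


window = MaintenanceWindow(
    start=datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
)
during = datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)
after = datetime(2024, 1, 10, 5, 0, tzinfo=timezone.utc)
disk_warning = Alert("disk-usage-warning", critical=False, slo_signal=False)
slo_burn = Alert("checkout-latency-burn", critical=False, slo_signal=True)
```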

4) Service catalog and CMDB

Attributes: owner, SLI/SLO, dependencies (internal/external), dashboards, alerts, runbooks, data classes (PII/finance), zones (prod/stage/dev).
Auto-population: from CI/CD, telemetry, and repositories.
Usage: alert routing, escalation, blast radius calculation, maturity reporting.
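
As an illustration of the blast radius calculation, the catalog's dependency data can be walked in reverse to find every service that transitively depends on a failed component. This is a minimal sketch: the `blast_radius` function and the `web` service are hypothetical, and the catalog is assumed to be available as a plain mapping from service to its direct dependencies.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def blast_radius(dependencies: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return all services that directly or transitively depend on `failed`."""
    # Invert the graph: dependency -> services that call it.
    dependents: Dict[str, Set[str]] = defaultdict(set)
    for service, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(service)

    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:  # breadth-first walk over the inverted graph
        current = queue.popleft()
        for service in dependents[current]:
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    impacted.discard(failed)  # only relevant when the graph contains a cycle
    return impacted


catalog = {
    "web": ["checkout-api"],  # hypothetical front end, added for the example
    "checkout-api": ["payments-api", "auth", "redis"],
    "payments-api": ["psp-a", "auth"],
}
print(sorted(blast_radius(catalog, "psp-a")))
```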

5) Policies-as-Code

Categories: access (RBAC/ABAC), security (SAST/SCA/DAST), alerts/SLO, grants, change-gates, resources/quotas.
Mechanics: declarative rules (YAML/Rego/CEL), validation in CI, enforcement in Control Plane.

An example of a gate: "Deploy is allowed if all SLOs are green, there are no active SEV-1, tests have passed, signatures are valid."
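
The gate above can be expressed as a small evaluation function that also reports why a deploy was blocked. This is a sketch, not the enforcement mechanism of any particular policy engine: `DeployContext`, its fields, and `evaluate_gate` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DeployContext:
    slos_green: bool
    tests_passed: bool
    signatures_valid: bool
    active_incidents: List[str] = field(default_factory=list)  # e.g. ["SEV-1"]


def evaluate_gate(ctx: DeployContext) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons); the deploy is allowed only if no rule fails."""
    reasons: List[str] = []
    if not ctx.slos_green:
        reasons.append("SLOs are not green")
    if "SEV-1" in ctx.active_incidents:
        reasons.append("active SEV-1 incident")
    if not ctx.tests_passed:
        reasons.append("tests have not passed")
    if not ctx.signatures_valid:
        reasons.append("artifact signatures are invalid")
    return (not reasons, reasons)


allowed, reasons = evaluate_gate(
    DeployContext(
        slos_green=True,
        tests_passed=True,
        signatures_valid=False,
        active_incidents=["SEV-1"],
    )
)
print(allowed, reasons)
```

Returning the list of failed rules alongside the decision gives the pipeline something concrete to show the author and to store as audit evidence.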

6) Orchestration and execution

CI/CD: build → scan → sign → promote.
Jobs/CronJobs/DAG: backups/rotations/backfills; deadlines and concurrency policy (Forbid/Replace).
Idempotence and rollbacks: check-then-act, step markers, circuit-breaker.
Launch rights: JIT accounts, limited scope; audit.
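
Check-then-act with step markers can be sketched as follows. This is an illustrative example only: `run_job` is a hypothetical helper, the markers live in an in-memory set (a real job would persist them), and the backup steps are stand-ins.

```python
from typing import Callable, List, Set, Tuple

Step = Tuple[str, Callable[[], None]]


def run_job(steps: List[Step], markers: Set[str]) -> List[str]:
    """Run steps in order, skipping those already marked as done.

    A marker is written only after its step succeeds, so a rerun after a
    failure resumes at the failed step instead of repeating earlier work.
    """
    executed: List[str] = []
    for name, action in steps:
        if name in markers:  # check
            continue
        action()             # act (may raise)
        markers.add(name)    # record completion
        executed.append(name)
    return executed


calls: List[str] = []
upload_attempts = {"count": 0}


def snapshot() -> None:
    calls.append("snapshot")


def upload() -> None:
    upload_attempts["count"] += 1
    if upload_attempts["count"] == 1:
        raise RuntimeError("transient failure")  # simulated first-attempt error
    calls.append("upload")


def prune() -> None:
    calls.append("prune")


steps = [("snapshot", snapshot), ("upload", upload), ("prune", prune)]
markers: Set[str] = set()
try:
    run_job(steps, markers)        # first run fails at "upload"
except RuntimeError:
    pass
resumed = run_job(steps, markers)  # second run skips "snapshot"
```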

7) Observability and signal quality

SLI/SLO by domain: availability/latency/success of business operations, data freshness.
Alerts: burn-rate in two windows, quorum, rate-limit, runbook and owner.
Logs/metrics/traces are linked by trace_id; drill-down links from graphs to logs.
Status page: templates, update frequencies, audit of publications.
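
The two-window burn-rate rule with quorum can be sketched in a few lines. The sketch assumes the common definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target); the function names, the threshold value, and the signal sources in the example are assumptions for illustration.

```python
from typing import Dict


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(
    short_error_ratio: float,
    long_error_ratio: float,
    slo_target: float,
    threshold: float,
    votes: Dict[str, bool],
    quorum: int,
) -> bool:
    """Page only if both windows exceed the threshold and enough sources agree."""
    both_windows_burning = (
        burn_rate(short_error_ratio, slo_target) >= threshold
        and burn_rate(long_error_ratio, slo_target) >= threshold
    )
    confirming_sources = sum(1 for confirmed in votes.values() if confirmed)
    return both_windows_burning and confirming_sources >= quorum


# Example: 99.9% SLO, 2% errors in the short window, 1.6% in the long window.
page = should_page(
    short_error_ratio=0.02,
    long_error_ratio=0.016,
    slo_target=0.999,
    threshold=14.4,  # assumed threshold; tune per SLO period and window sizes
    votes={"synthetic:eu": True, "synthetic:us": False, "rum:checkout": True},
    quorum=2,
)
print(page)
```

Requiring both windows keeps short spikes from paging, and the quorum keeps a single faulty probe from doing so.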

8) Access, secrets, crypto

Secret repositories (KMS/Vault), rotations, prohibition of secrets in the repo.
JIT/JEA: access issued for the duration of an operation or shift.
mTLS/OIDC between services; image signing/SBOM.
Audit: immutable logs, WORM for critical actions.
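
Time-boxed JIT access can be illustrated with a grant that carries its own expiry. This is a minimal sketch: `Grant`, `issue_grant`, and `is_allowed` are hypothetical names, the scope string is made up, and a real system would also record each issuance and check in the audit log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    user: str
    scope: str  # narrow scope, e.g. one service in one environment
    expires_at: datetime


def issue_grant(user: str, scope: str, ttl: timedelta, now: datetime) -> Grant:
    """Issue access that expires on its own; no standing permissions."""
    return Grant(user=user, scope=scope, expires_at=now + ttl)


def is_allowed(grant: Grant, user: str, scope: str, now: datetime) -> bool:
    """Access is valid only for the same user and scope, and only before expiry."""
    return grant.user == user and grant.scope == scope and now < grant.expires_at


issued_at = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
grant = issue_grant("alice", "prod:checkout-api:restart", timedelta(hours=1), issued_at)
```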

9) Incidents, changes, maintenance windows

Incidents: SEV matrix, IC/TL/Comms/Scribe, update templates, AAR→RCA→CAPA.
Changes: RFC/CAB, risk assessment, canaries, backout.
Maintenance windows: timing, communication, alert-rule suppression, evidence.

10) DataOps in the operating layer

Data contracts (schemas, freshness/completeness SLAs).
DQ tests on each layer (Bronze/Silver/Gold).
Lineage and catalogs; quarantine for bad records.
Data SLO and freshness/drift alerts.
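
A data contract check with quarantine might look like the sketch below. It is illustrative only: `is_fresh` and `split_by_contract` are hypothetical helpers, the contract is reduced to a list of required fields plus a maximum age, and the sample rows are invented.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def is_fresh(last_loaded_at: datetime, max_age: timedelta, now: datetime) -> bool:
    """Freshness SLA: the latest load must be no older than `max_age`."""
    return now - last_loaded_at <= max_age


def split_by_contract(
    rows: Sequence[Row], required_fields: Sequence[str]
) -> Tuple[List[Row], List[Row]]:
    """Separate rows that satisfy the contract from rows sent to quarantine."""
    valid: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        if all(row.get(name) is not None for name in required_fields):
            valid.append(row)
        else:
            quarantined.append(row)  # kept for inspection, not silently dropped
    return valid, quarantined


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},  # incomplete: amount is null
    {"amount": 5.0},                  # incomplete: order_id is missing
]
valid, quarantined = split_by_contract(rows, required_fields=["order_id", "amount"])
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
fresh = is_fresh(now - timedelta(minutes=20), max_age=timedelta(hours=1), now=now)
```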

11) FinOps and cost

Unit economics: $/1k requests, $/successful transaction, $/GiB logs, $/SLO point.
Quotas/limits: egress, log volumes, task duration.
Optimization: partitioning / caching / materializations / archives (hot-warm-cold).
Reports: top "expensive" services/requests, alerts on overspending.
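
The unit-cost and overspend calculations reduce to simple arithmetic. The sketch below assumes a linear month-end forecast from spend to date, which is one simple choice among many; the function names and figures are hypothetical.

```python
def cost_per_1k_requests(total_cost: float, requests: int) -> float:
    """Unit cost in dollars per 1,000 requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return total_cost / requests * 1000


def forecast_month_spend(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Linear extrapolation of month-end spend from the daily average so far."""
    return spend_to_date / day_of_month * days_in_month


def is_overspending(
    spend_to_date: float, day_of_month: int, days_in_month: int, budget: float
) -> bool:
    """True if the forecast month-end spend exceeds the budget."""
    return forecast_month_spend(spend_to_date, day_of_month, days_in_month) > budget


unit_cost = cost_per_1k_requests(total_cost=1200.0, requests=4_000_000)
forecast = forecast_month_spend(spend_to_date=600.0, day_of_month=10, days_in_month=30)
alert = is_overspending(600.0, 10, 30, budget=1500.0)
```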

12) Interfaces: ChatOps/Portals/API

Platform portal: service catalog, rollout/rollback buttons, SLO status, maintenance-window slots, policies.
ChatOps: `/deploy`, `/handover start`, `/mw create`, `/status update`, all with audit and evidence.
API: for integration with ITSM/HR/billing/providers.

13) Responsibility Model (RACI)

Platform/SRE: control plane, policies, observability, rotations.
Product/Dev: service SLOs, releases, playbooks.
Security: secrets, vulnerabilities, IR.
Data/Analytics: DataOps, freshness/quality SLAs.
Compliance/Legal: regulatory requirements, evidence storage.
Support/Comms: status page, client messages.

14) Operating layer maturity metrics

SLO coverage: % of services with defined SLI/SLO and burn-rate alerts.
Alert hygiene: actionable ≥80%, FP ≤5%, alerts/on-call-hour (p95).
DORA: deployment frequency, lead time, MTTR, change-failure-rate.
Change governance: % of changes via RFC, % of on-time windows, rollbacks.
Security: average time to rotate secrets/certificates and to close vulnerabilities.
FinOps: $/unit and % QoQ savings.
Docs: runbook/SOP coverage, freshness (≤90 days).
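
As an example of how the DORA figures can be derived, the sketch below summarizes a list of deployment records. The `Deployment` record and `dora_summary` function are hypothetical, and the record is assumed to already carry lead time and restore time; real pipelines would extract these from CI/CD and incident data.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Deployment:
    lead_time_hours: float                   # commit to production
    failed: bool = False                     # caused a degradation or rollback
    restore_minutes: Optional[float] = None  # time to restore, if it failed


def dora_summary(deployments: List[Deployment], period_days: int) -> Dict[str, Optional[float]]:
    """Compute the four DORA metrics for one reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failures = [d for d in deployments if d.failed]
    restores = [d.restore_minutes for d in failures if d.restore_minutes is not None]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(d.lead_time_hours for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_minutes": mean(restores) if restores else None,
    }


summary = dora_summary(
    [
        Deployment(lead_time_hours=2.0),
        Deployment(lead_time_hours=4.0),
        Deployment(lead_time_hours=6.0, failed=True, restore_minutes=30.0),
        Deployment(lead_time_hours=8.0),
    ],
    period_days=28,
)
```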

15) Minimum viable operating layer (MVP) checklist

Service catalog/CMDB with owners, SLOs, dependencies, and dashboards.
CI/CD + GitOps, artifact signing, progressive releases, auto-rollback.
Unified telemetry (logs/metrics/traces) with trace_id and SLO alerts (two windows, quorum).
Policy-as-Code: access, alerts, retention, change gates.
Secret store, JIT/JEA, mTLS/SSO, immutable audit.
ITSM/Incidents: SEV matrix, playbooks, status page, update templates.
Maintenance windows: calendar, RFC templates, backout plans, evidence.
FinOps: cost visibility, quotas/limits, reports.
Docs-as-Code, SOP/runbook templates, production readiness checklist.

16) Anti-patterns

"Platform = a set of scripts" without a control plane and policies.
Monitoring "everything" → an avalanche of alerts and alert fatigue.
Manual production changes without GitOps/audit.
Secrets in environment variables without storage and rotation.
No SLOs: arguments about feelings instead of quality targets.
Scattered catalogs/owner spreadsheets → lost escalations.
High-risk changes do not have a backout plan.
Logs without structure/correlation → long investigations.

17) Mini templates

17.1 Service card (catalog)


Service: checkout-api
Owner: @team-checkout
SLO: availability 99.9% (28d), p95 latency ≤ 250 ms
Dependencies: payments-api, auth, redis, psp-a
Dashboards: SLO, errors, latency, capacity
Runbooks: rb://checkout/5xx, rb://checkout/rollout
Data: PII masked; retention 30d logs, 365d audit
Change gates: canary 1/5/25%, auto-rollback on burn-rate breach

17.2 Alert policy (idea)

id: checkout-latency-burn
type: burn_rate
sli: http_latency_p99
windows:
  short: {duration: 1h, threshold: 5%}
  long: {duration: 6h, threshold: 2%}
quorum: ["synthetic:eu,us", "rum:checkout"]
owner: team-checkout
runbook: rb://checkout/latency
routing: page:oncall-checkout
controls: {dedup_key: "svc=checkout,region={{region}}", rate_limit: "1/15m"}

17.3 Deploy gate (pseudo)

allow_deploy_when:
  tests: passed
  signatures: valid
  active_sev: none_of [SEV-0, SEV-1]
  slo_guardrails: green_last_30m
  rollback_plan: present

18) Implementation Roadmap (8-12 weeks)

1. Weeks 1-2: service inventory → catalog/CMDB; basic SLI/SLO and dashboards.
2. Weeks 3-4: GitOps + progressive releases; Policy-as-Code.
3. Weeks 5-6: unified telemetry and status page; burn-rate with quorum; runbook coverage.
4. Weeks 7-8: secrets/JIT, immutable audit; RFC/maintenance windows.
5. Weeks 9-10: FinOps reporting, quotas/limits; optimization of logs and storage.
6. Weeks 11-12: incident/DR simulations; maturity metrics; continuous improvement plan.

19) The bottom line

The operating layer architecture is a control plane plus standardized practices that turn operations into a repeatable, measurable, and safe process. A service catalog, GitOps, telemetry, policies, secure access, and managed changes deliver stable releases, rapid recovery, and transparent cost: in other words, operational predictability for the business.
