Central control dashboard
1) Purpose and principles
The central control dashboard (hereinafter CDU) is a single pane for operational decision-making. It aggregates signals from telemetry, ITSM, CI/CD, the service catalog, the work calendar and providers, and turns them into actionable widgets.
Principles:
- SLO-first: target SLO and burn-rate for Tier-0/1 at the top.
- One-click to action: from the widget straight to the playbook/runbook or ticket.
- Unified dictionary: the same SEV, statuses, colors and thresholds.
- Event annotations: releases/configs/windows on all graphs.
- Roles and permissions: personal views (on-call, IC, management).
- Low noise: source quorum, deduplication, and windowing.
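As one possible reading of the low-noise principle, deduplication with windowing can be sketched as follows. The `Alert` shape, the `fingerprint` key and the 300-second window are assumptions made for illustration; they are not part of the CDU description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    fingerprint: str  # hypothetical dedup key, e.g. "service:check"
    ts: float         # epoch seconds


def dedup_windowed(alerts, window_s=300.0):
    """Keep an alert only if no alert with the same fingerprint
    was kept within the previous `window_s` seconds."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.ts):
        prev = last_kept.get(alert.fingerprint)
        if prev is None or alert.ts - prev >= window_s:
            kept.append(alert)
            last_kept[alert.fingerprint] = alert.ts
    return kept
```

Repeats of the same signal inside the window collapse into one entry, while a recurrence after the window is surfaced again.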
2) Roles and key scenarios
On-call (P1/P2): quickly understand "what is on fire" and open the playbook (≤1 click).
IC: declare SEV, start war-room mode, control the cadence of comms updates.
Release Manager: see gates, canary progress, rollback readiness.
Service Owner/Product: business SLI (success of payments/registrations), impact of features.
SRE/Platform: capacity, autoscale, anomalies, DR-readiness.
FinOps: $/unit, overspending, budget alerts.
Security/Legal: posture, keys/certificates, rotation windows, WORM audit links.
3) CDU Information Architecture
Top shelf (hero panel):
- SLO for Tier-0/1 (availability/latency/success) with two-window burn-rate.
- SEV status: active incidents and their timeline.
- Release status: canary/blue-green, active gates.
- Provider traffic lights (PSP/KYC/CDN).
- Maintenance windows (now/24h), suppression map.
- Capacity: CPU/RAM/IO/queue-depth/p95 latency with forecast.
- FinOps: $/1k txn, daily spend vs budget, log volume anomalies.
- DataOps: data mart freshness, pipeline SLAs, DQ errors.
- Security: certificate expiry, secret rotation, critical vulnerabilities (age/SLA).
- Correlations "release ↔ SLO," "provider ↔ failure/latency."
- Quick links: logs, traces, tickets, playbooks, SOP, escalation matrix.
4) Widgets (reference set)
1. SLO & Burn-rate
Shows the current SLI, target, and error budget consumption (1h/6h).
Action: open the service degradation playbook.
2. Incidents (SEV panel)
Active/Recent, Declare/Comms Timers, IC/Comms Roles.
Action: open war-room, update template, IC checklist.
3. Releases/Configs
Canary 1→5→25%, flags, rollback (button/SOP link).
Annotations: version, commits, author.
4. Maintenance windows
Current/upcoming, impacted-services/regions; suppression mask.
Action: Coordinate notifications, enable SLO guards.
5. Capacity/Autoscale
Consumption forecast (Naive/AR), hotspot map, warm-pool.
Action: request quotas/scale rules (PR to repo policy).
6. FinOps
$/unit, top "expensive" queries/logs, daily burn vs budget.
Action: open the report and recommendations (log sampling, archiving).
7. Providers
SLA/PSP/KYC/CDN status, route weights, fallback readiness.
Action: shift route weights, send the communication template to partners.
8. Security
Certificates (≤30d), delays in rotations, vulnerabilities (age), suspicious events.
Action: open IR playbook/ticket.
9. DataOps
Data mart freshness, percentage of gaps, pipeline failures, DLQ.
Action: Backfill/quarantine/rollback transformation.
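For the SLO & Burn-rate widget, the error budget consumption over the 1h/6h windows can be illustrated with a small calculation. This is a sketch under assumptions: burn-rate is taken as the observed error rate divided by the error budget (1 minus the target), and the rule that both windows must exceed the threshold before alerting is a common convention rather than something this document prescribes. The function names are hypothetical.

```python
def burn_rate(good, total, target_pct):
    """Observed error rate divided by the error budget (1 - target)."""
    budget = 1.0 - target_pct / 100.0
    if budget <= 0:
        raise ValueError("target_pct must be below 100")
    if total == 0:
        return 0.0  # no traffic in the window: nothing was burned
    error_rate = 1.0 - good / total
    return error_rate / budget


def should_alert(short_burn, long_burn, threshold):
    """Two-window rule (assumed): both the short and the long window must exceed the threshold."""
    return short_burn >= threshold and long_burn >= threshold


# Example: 990 good events out of 1000 against a 99.5% target
# gives an error rate of 1% on a 0.5% budget.
short = burn_rate(990, 1000, 99.5)
long_ = burn_rate(5950, 6000, 99.5)
print(round(short, 2), round(long_, 2), should_alert(short, long_, 2.0))
```

Requiring both windows keeps a short spike from paging while still reacting quickly once the burn is sustained.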
5) States/colors/thresholds (reference)
Green: SLI within target, burn-rate < 1×.
Amber: SLI degrading, burn-rate 1-2×, p95 growing, but a workaround exists.
Red: SLO breach or predicted error-budget exhaustion in < 1h; open SEV-1/0.
Grey: suppression, no telemetry (source error).
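The reference states above can be expressed as a small classifier. This is a minimal sketch: the function name and parameters are hypothetical, and treating any burn-rate above 2× as red is an assumption, since the text defines red only as a breach or a predicted exhaustion within one hour.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"
    GREY = "grey"


def classify(burn: Optional[float],
             suppressed: bool = False,
             hours_to_exhaustion: Optional[float] = None) -> State:
    """Map a burn-rate reading to a dashboard color."""
    # Grey: suppression is active or the source returned no telemetry.
    if suppressed or burn is None:
        return State.GREY
    # Red: predicted budget exhaustion within 1h, or burn above 2x (assumed cut-off).
    if burn > 2.0 or (hours_to_exhaustion is not None and hours_to_exhaustion < 1.0):
        return State.RED
    # Amber: burn-rate in the 1-2x band.
    if burn >= 1.0:
        return State.AMBER
    return State.GREEN
```

Keeping this mapping in one place is what makes the "unified dictionary" enforceable: every widget calls the same function instead of hard-coding its own colors.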
6) Annotations and correlations
Release/config/window/provider statuses are displayed on SLO graphs.
Clicking a marker opens the diff, author, gates, and a Rollback/Fallback/SOP button.
In the incident, the timeline is built from ChatOps annotations and actions.
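One simple way to use annotations for correlation is to list the events that landed shortly before an SLO degradation. The sketch below assumes a hypothetical `Annotation` record and a 30-minute lookback; real correlation logic may weigh events very differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    kind: str  # "release", "config", "window" or "provider"
    ref: str   # hypothetical reference, e.g. a release id
    ts: float  # epoch seconds


def candidates_before(annotations, degraded_at, lookback_s=1800.0):
    """Return annotations that occurred within `lookback_s` seconds
    before the degradation, most recent first."""
    hits = [a for a in annotations if 0 <= degraded_at - a.ts <= lookback_s]
    return sorted(hits, key=lambda a: degraded_at - a.ts)
```

The most recent change appears first, which is usually the first suspect an on-call engineer wants to inspect.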
7) Data sources and verification
Telemetry: metrics/traces/logs with trace_id.
ITSM: Incidents/Issues/Changes (Statuses/SLAs).
CI/CD: releases, signatures, artifacts, tests.
Service catalog/CMDB: owners, SLO, dependencies.
Calendar: maintenance windows.
Providers: status API + manual confirmations (landed in a separate data mart).
FinOps: billing/resource tags, log volumes, egress.
Quality control: quorum, duplicate probes, freshness SLAs, alerts on silent sources.
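Quorum and freshness checks can be combined in one small function. The sketch below is illustrative only: the probe tuple layout, the 300-second freshness limit and the quorum of 2 are assumptions.

```python
def quorum_status(probes, now, max_age_s=300.0, quorum=2):
    """probes: iterable of (source, is_failing, ts_epoch_seconds).

    Stale probes are ignored. Returns "failing" or "ok" only when at least
    `quorum` fresh probes agree; otherwise "unknown".
    """
    fresh = [is_failing for _, is_failing, ts in probes if now - ts <= max_age_s]
    failing = sum(1 for f in fresh if f)
    healthy = len(fresh) - failing
    if failing >= quorum:
        return "failing"
    if healthy >= quorum:
        return "ok"
    return "unknown"
```

An "unknown" result maps naturally to the grey state and can itself raise an alert about a silent source.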
8) Display modes
War-room: fixed layout SLO/Incidents/Releases/Comms-timer.
Executive (28 days): trends in MTTR/MTTD/SEV mix, $/unit, SLO adherence.
On-call: compact "night" panel (dark mode, large numbers).
Multi-tenant/region: service/region/tenant filters; presets.
9) Navigation and actions (one-click)
Buttons: '/declare sev1', '/freeze', '/rollback', '/status update', 'open playbook'.
Drill-down: SLO → graph → logs/traces with prefilled filters (trace_id, release_id).
Sharing: snapshot of panels in a ticket/status page.
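A drill-down link with prefilled filters can be built with the standard library. The base URL and the `service` parameter below are placeholders; only `trace_id` and `release_id` come from the text.

```python
from urllib.parse import urlencode


def drilldown_url(base, service, trace_id=None, release_id=None):
    """Build a log/trace search link with prefilled filters."""
    params = {"service": service}
    if trace_id:
        params["trace_id"] = trace_id
    if release_id:
        params["release_id"] = release_id
    return f"{base}?{urlencode(params)}"


# Hypothetical log search endpoint, used only for illustration.
print(drilldown_url("https://logs.example.internal/search", "payments", release_id="rel-42"))
```

`urlencode` escapes the values, so identifiers containing spaces or special characters still produce a valid link.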
10) Security, access, audit
SSO/OIDC + RBAC/ABAC: roles and scopes (view/action).
JIT/JEA: "dangerous" actions are available only with temporary privilege elevation.
Immutable audit: who clicked what, and which requests/commands were sent.
Secrets: not displayed, only links to the secret manager.
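The interplay of action scopes, temporary elevation and auditing might look like the following sketch. Everything named here is hypothetical: the `DANGEROUS` set, the `action:<name>` scope format and the in-memory list standing in for an audit log. A real deployment would write to an append-only (WORM) store rather than a Python list.

```python
from dataclasses import dataclass

# Assumed set of actions that require just-in-time elevation.
DANGEROUS = {"rollback", "freeze", "switch_weight"}


@dataclass
class User:
    name: str
    scopes: set                  # e.g. {"view", "action:rollback"}
    elevated_until: float = 0.0  # epoch seconds; 0 means no active elevation


def authorize(user, action, now, audit_log):
    """Allow an action only if the user holds its scope and, for dangerous
    actions, has an unexpired elevation. Every attempt is recorded."""
    scope_ok = f"action:{action}" in user.scopes
    jit_ok = action not in DANGEROUS or now < user.elevated_until
    allowed = scope_ok and jit_ok
    audit_log.append({"user": user.name, "action": action, "ts": now, "allowed": allowed})
    return allowed
```

Denied attempts are logged as well, so the audit trail shows who tried a dangerous action without elevation.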
11) CDU Maturity Metrics
Actionability ≥ 90%: Clicks lead to actions, not just graphs.
Time-to-First-Action ≤ 2 min from the CDU during SEV-1/0.
The proportion of incidents where the CDU was a "source of truth" ≥ 95%.
Freshness of widgets: % of widgets whose data is no older than 5 minutes.
Coverage: % of critical services with SLO cards and release annotations.
Zero-blind-spots: silent sources for the week = 0.
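Two of these metrics, widget freshness and coverage, reduce to simple ratios. The sketch below assumes a mapping of widget id to last-update time and plain sets of service names; both input shapes are illustrative.

```python
def freshness_pct(last_updated, now, max_age_s=300.0):
    """Share of widgets whose data is at most `max_age_s` seconds old.
    `last_updated` maps widget id -> epoch seconds of the last refresh."""
    if not last_updated:
        return 0.0
    fresh = sum(1 for ts in last_updated.values() if now - ts <= max_age_s)
    return 100.0 * fresh / len(last_updated)


def coverage_pct(critical_services, services_with_slo_card):
    """Share of critical services that have an SLO card."""
    if not critical_services:
        return 0.0
    covered = len(set(critical_services) & set(services_with_slo_card))
    return 100.0 * covered / len(critical_services)
```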
12) Checklists
Design
- Roles and scenarios are described (P1/P2/IC/Exec/FinOps/Security/DataOps).
- The color/SEV/threshold dictionary is consistent.
- DataSources with quorum and freshness SLAs.
- War-room/On-call/Executive layouts.
- ChatOps/ITSM/CI/CD/CMDB Integration Plan.
Operation
- Widgets pass linter (required fields, owner, thresholds).
- Once a week: Escalation/Alert Review with CDU improvements.
- Incident snapshots are attached to the AAR/RCA.
- Dark Mode/Mobile Duty Preset.
- Tests for silent sources and correctness of annotations.
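The widget linter mentioned in the operation checklist could start as a plain function over a parsed widget definition. This is a sketch: the list of required fields and the rule that only `slo_burnrate` widgets need amber/red thresholds are assumptions.

```python
REQUIRED_FIELDS = ("id", "title", "owner", "type")  # assumed minimum set


def lint_widget(widget):
    """Return a list of human-readable problems; an empty list means the widget passes."""
    errors = [f"missing required field: {k}" for k in REQUIRED_FIELDS if not widget.get(k)]
    if widget.get("type") == "slo_burnrate":
        thresholds = widget.get("thresholds") or {}
        for level in ("amber", "red"):
            if level not in thresholds:
                errors.append(f"missing threshold: {level}")
        amber = (thresholds.get("amber") or {}).get("burn_rate")
        red = (thresholds.get("red") or {}).get("burn_rate")
        if amber is not None and red is not None and amber >= red:
            errors.append("amber burn_rate must be below red burn_rate")
    return errors
```

Running such a check in CI for the repository that stores widget definitions keeps ownerless or threshold-free widgets off the dashboard.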
13) Templates (ideas)
13.1 Widget Definition (YAML)
yaml
id: slo-payments
title: "SLO: Success of payments (EU)"
owner: team-payments
type: slo_burnrate
sli:
  metric: "biz.payment_success_ratio"
  target_pct: 99.5
burn_rate:
  short_window: "1h"
  long_window: "6h"
thresholds:
  amber: { burn_rate: 1.2 }
  red: { burn_rate: 2.0 }
actions:
  - label: "Open playbook"
    link: "rb://payments/slo-degrade"
  - label: "Release rollback"
    link: "sop://REL-ROLLBACK-01"
annotations:
  release: true
  change: true
filters:
  region: "eu"
  tier: "0"
13.2 Incident Card (JSON)
json
{
  "id": "incidents-active",
  "type": "incident_board",
  "sev": ["SEV-0", "SEV-1", "SEV-2"],
  "fields": ["id", "sev", "service", "since", "ic", "next_comms_at"],
  "actions": [{"label": "War-room", "cmd": "/declare sev1"}]
}
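To show how such a card might be rendered, the sketch below filters incidents by the card's `sev` list, orders them by severity and start time, and flags rows whose `next_comms_at` has passed. The ordering rule and the `comms_overdue` flag are assumptions added for illustration; they do not appear in the card definition.

```python
SEV_ORDER = {"SEV-0": 0, "SEV-1": 1, "SEV-2": 2}


def board_rows(incidents, allowed_sev, now):
    """Return incident rows for the board, most severe and oldest first.
    Timestamps (`since`, `next_comms_at`) are epoch seconds."""
    rows = [
        dict(inc, comms_overdue=inc["next_comms_at"] <= now)
        for inc in incidents
        if inc["sev"] in allowed_sev
    ]
    return sorted(rows, key=lambda r: (SEV_ORDER[r["sev"]], r["since"]))
```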
13.3 Link to the release
yaml
id: release-canary
type: release_progress
source: cicd://checkout
gates: ["tests","signatures","slo_guardrails"]
canary_steps: [1,5,25]
rollback: "sop://REL-ROLLBACK-01"
annotations: { on_charts: ["slo-latency","slo-success"] }
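The canary steps and gates in this template suggest a simple progression rule, sketched below. The function and its return convention (next percentage, the string "rollback", or `None` when the last step is reached) are hypothetical and only illustrate how gate results could drive the widget.

```python
def next_canary_step(current_pct, steps, gates):
    """Decide what the release widget should propose next.

    gates: mapping of gate name -> bool (True means passed).
    Returns the next traffic percentage, "rollback" if any gate failed,
    or None when the last canary step has already been reached.
    """
    if not all(gates.values()):
        return "rollback"
    later = [s for s in sorted(steps) if s > current_pct]
    return later[0] if later else None


gates = {"tests": True, "signatures": True, "slo_guardrails": True}
print(next_canary_step(1, [1, 5, 25], gates))
```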
13.4 FinOps widget
yaml
id: finops-burn
type: cost_unit
metrics:
  - id: "cost_per_1k_txn"
  - id: "logs_daily_gib"
alerts:
  - when: "cost_per_1k_txn > target * 1.2"
    action: "open://finops/reco-logs-sampling"
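The alert rule in this template can be evaluated with two small helpers. The sketch reads the condition as "cost per 1k transactions exceeds the target by more than 20%", which is an interpretation of the expression in the template; the function names are hypothetical.

```python
def cost_per_1k_txn(spend, txn_count):
    """Unit cost: spend per 1000 transactions."""
    if txn_count == 0:
        return 0.0
    return spend / txn_count * 1000.0


def cost_alert(unit_cost, target, tolerance=1.2):
    """True when the unit cost exceeds the target by more than the tolerance factor."""
    return unit_cost > target * tolerance


# Example: 50 currency units spent on 20,000 transactions against a target of 2.0 per 1k.
unit = cost_per_1k_txn(50.0, 20_000)
print(unit, cost_alert(unit, target=2.0))
```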
14) Anti-patterns
"Wall of graphs" without actions and playbooks.
Different colors/thresholds across teams → confusion about SEV.
No release/window annotations: correlating causes becomes difficult.
Duplicate sources without quorum: false pages and noise.
Secrets/keys on the panel: risk of leakage.
Slow render (requests/aggregations are not cached): panels fail to open in the heat of an incident.
15) Implementation Roadmap (4-8 weeks)
1. Week 1: gather requirements by role, agree on the dictionary of statuses/colors, lay out the three modes.
2. Week 2: connect SLO/Incidents/Releases/Windows, annotations, ChatOps actions.
3. Week 3: add FinOps/Capacity/Providers/DataOps/Security, source quorum.
4. Week 4: War-room mode, snapshots in ITSM, pilot on Tier-0.
5. Weeks 5-6: performance optimization, mobile/on-call preset, widget linter.
6. Weeks 7-8: maturity metrics, weekly review, automatic recommendations (log sampling, quotas, fallback).
16) The bottom line
The CDU is not a set of "pretty graphs" but a decision panel: SLO and burn-rate at the top, incidents/releases/windows in one context, instant actions via ChatOps and SOP, verified sources and annotations. Such a dashboard reduces MTTA/MTTR, simplifies communications, supports FinOps and makes operations transparent and predictable.