Central control dashboard

1) Purpose and principles

The central control dashboard (hereinafter referred to as CDU) is a single window for operational decision-making. It aggregates signals from telemetry, ITSM, CI/CD, the service catalog, the work calendar and providers, and turns them into actionable widgets.

Principles:

SLO-first: target SLO and burn rate for Tier-0/1 services sit at the top.
One click to action: from a widget straight to the playbook/runbook or ticket.
Unified dictionary: the same SEV levels, statuses, colors and thresholds everywhere.
Event annotations: releases/configs/maintenance windows on all graphs.
Roles and permissions: personal views (on-call, IC, management).
Low noise: source quorum, deduplication, and windowing.

2) Roles and key scenarios

On-call (P1/P2): quickly understand "what is on fire" and open the playbook (≤1 click).
IC: declare a SEV, start war-room mode, control the cadence of comms updates.
Release Manager: see gates, canary progress, rollback readiness.
Service Owner/Product: business SLI (success of payments/registrations), impact of features.
SRE/Platform: capacity, autoscale, anomalies, DR-readiness.
FinOps: $/unit, overspending, budget alerts.
Security/Legal: posture, keys/certificates, rotation windows, links to the WORM audit.

3) CDU Information Architecture

Top shelf (hero panel):

SLO for Tier-0/1 (availability/latency/success) with two-window burn rate.
SEV status: active incidents and their timeline.
Release status: canary/blue-green, active gates.
Provider traffic lights (PSP/KYC/CDN).

Middle shelf (operating):

Maintenance windows (now/24h), suppression map.
Capacity: CPU/RAM/IO/queue-depth/p95 latency with forecast.
FinOps: $/1k txn, daily spend vs budget, log volume anomalies.
DataOps: data mart freshness, pipeline SLAs, DQ errors.
Security: certificate expiry, secret rotation, critical vulnerabilities (age/SLA).

Lower shelf (diagnostics/drill-down):

Correlations "release ↔ SLO," "provider ↔ failure/latency."
Quick links: logs, traces, tickets, playbooks, SOPs, escalation matrix.

4) Widgets (reference set)

1. SLO & Burn-rate

Shows the current SLI, target, and error budget consumption (1h/6h).
Action: open the service degradation playbook.
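
The error budget consumption shown by this widget can be sketched as follows. This is a minimal illustration, assuming a ratio-based SLI (good events / total events) and a target below 100%; the function names and the sample counts are hypothetical, not part of any specific monitoring tool.

```python
def burn_rate(good: int, total: int, target: float) -> float:
    """Error-budget burn rate: observed error ratio divided by the allowed one."""
    if total == 0:
        return 0.0  # assumption: no traffic means no budget is burned
    observed_errors = 1.0 - good / total
    allowed_errors = 1.0 - target  # assumption: target < 1.0
    return observed_errors / allowed_errors


def both_windows_burning(short: float, long: float, threshold: float) -> bool:
    """Flag only when the short and the long window both exceed the threshold."""
    return short >= threshold and long >= threshold


# Hypothetical counts for a 99.5% success SLO.
short_br = burn_rate(good=9_900, total=10_000, target=0.995)   # 1h window
long_br = burn_rate(good=59_820, total=60_000, target=0.995)   # 6h window
print(round(short_br, 2), round(long_br, 2),
      both_windows_burning(short_br, long_br, threshold=1.0))
```

Requiring both windows to exceed the threshold is one common way to keep a short spike from paging while the long window is still healthy.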

2. Incidents (SEV panel)

Active/Recent, Declare/Comms Timers, IC/Comms Roles.
Action: open war-room, update template, IC checklist.

3. Releases/Configs

Canary 1→5→25%, flags, rollback (button/SOP link).
Annotations: version, commits, author.

4. Maintenance windows

Current/upcoming, impacted-services/regions; suppression mask.
Action: Coordinate notifications, enable SLO guards.
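
A suppression mask can be thought of as a lookup from (service, region, time) to "inside a maintenance window or not". The sketch below assumes a simple in-memory window list; the `MaintenanceWindow` type and the service/region names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime
    services: frozenset
    regions: frozenset


def is_suppressed(service: str, region: str, at: datetime, windows: list) -> bool:
    """True if an alert for (service, region) falls inside any active window."""
    return any(
        w.start <= at < w.end and service in w.services and region in w.regions
        for w in windows
    )


window = MaintenanceWindow(
    start=datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
    services=frozenset({"payments-api"}),
    regions=frozenset({"eu"}),
)
now = datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)
print(is_suppressed("payments-api", "eu", now, [window]))
print(is_suppressed("payments-api", "us", now, [window]))
```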

5. Capacity/Autoscale

Consumption forecast (Naive/AR), hotspot map, warm pool.
Action: request quotas/scale rules (PR to the policy repo).
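
To make the "Naive" option concrete, here is a minimal sketch of the naive forecast (repeat the last observation) and its drift variant (extend the line between the first and last observation). An AR model is not shown. The sample CPU series is hypothetical.

```python
def naive_forecast(history: list, horizon: int) -> list:
    """Naive method: every future point equals the last observed value."""
    if not history:
        raise ValueError("history must not be empty")
    return [history[-1]] * horizon


def drift_forecast(history: list, horizon: int) -> list:
    """Drift method: continue the average change seen over the history."""
    if len(history) < 2:
        return naive_forecast(history, horizon)
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * step for step in range(1, horizon + 1)]


cpu_pct = [50.0, 55.0, 60.0, 65.0]      # hypothetical hourly CPU utilisation
print(naive_forecast(cpu_pct, 2))
print(drift_forecast(cpu_pct, 2))
```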

6. FinOps

$/unit, top "expensive" queries/logs, daily burn vs budget.
Action: open the report and recommendations (log sampling, archiving).

7. Providers

SLA and status of PSP/KYC/CDN providers, route weights, fallback readiness.
Action: shift route weights, send the communication template to partners.
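
One possible shape of the weight-switch action is sketched below: the degraded provider is cut to a residual share and the freed traffic is spread over the remaining providers in proportion to their current weights. The proportional rule, the function name and the provider names are assumptions for illustration only.

```python
def shift_weight(weights: dict, degraded: str, keep: float = 0.0) -> dict:
    """Cut `degraded` down to `keep` and redistribute the freed share
    over the other providers in proportion to their current weights."""
    others = {name: w for name, w in weights.items() if name != degraded}
    others_total = sum(others.values())
    if others_total <= 0:
        raise ValueError("no fallback provider has weight; cannot shift traffic")
    freed = weights[degraded] - keep
    shifted = {name: w + freed * w / others_total for name, w in others.items()}
    shifted[degraded] = keep
    return shifted


routes = {"psp-a": 0.7, "psp-b": 0.2, "psp-c": 0.1}   # hypothetical route weights
print(shift_weight(routes, "psp-a", keep=0.1))
```

The explicit error when no other provider carries weight doubles as a crude fallback-readiness check.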

8. Security

Certificates (expiring in ≤30d), overdue rotations, vulnerabilities (age), suspicious events.
Action: open IR playbook/ticket.

9. DataOps

Data mart freshness, percentage of gaps, pipeline failures, DLQ.
Action: backfill, quarantine, or roll back the transformation.

5) States/colors/thresholds (reference)

Green: SLI within target, burn rate < 1×.
Amber: SLI degrading, burn rate 1-2×, p95 growing, but a workaround exists.
Red: breach or projected budget exhaustion in < 1h; open SEV-1/0.
Grey: suppression or no telemetry (source error).
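
The color rules above can be expressed as a small classifier. This is a sketch under stated assumptions: the function signature is hypothetical, and treating a burn rate above 2× as red is an inference from the upper bound of the amber band rather than something the reference states explicitly.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"
    GREY = "grey"


def classify(burn_rate: Optional[float],
             breached: bool = False,
             hours_to_exhaustion: Optional[float] = None,
             suppressed: bool = False) -> State:
    """Map widget inputs to a state; burn_rate=None means no telemetry."""
    if suppressed or burn_rate is None:
        return State.GREY
    if breached or (hours_to_exhaustion is not None and hours_to_exhaustion < 1.0):
        return State.RED
    if burn_rate > 2.0:
        return State.RED      # assumption: above the amber band counts as red
    if burn_rate >= 1.0:
        return State.AMBER
    return State.GREEN


print(classify(0.4), classify(1.5), classify(1.5, hours_to_exhaustion=0.5), classify(None))
```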

6) Annotations and correlations

Release/config/window/provider statuses are displayed on SLO graphs.
Click on a marker → diff, author, gates, Rollback/Fallback/SOP button.
For an incident, the timeline is built from annotations and ChatOps actions.
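
As an illustration of how annotations support correlation, the sketch below lists the change events that happened shortly before an SLO started to degrade. The 30-minute lookback, the annotation fields and the sample events are assumptions.

```python
from datetime import datetime, timedelta


def candidate_causes(degraded_at: datetime, annotations: list,
                     lookback: timedelta = timedelta(minutes=30)) -> list:
    """Annotations within `lookback` before the degradation, most recent first."""
    recent = [a for a in annotations
              if degraded_at - lookback <= a["at"] <= degraded_at]
    return sorted(recent, key=lambda a: a["at"], reverse=True)


annotations = [   # hypothetical events
    {"kind": "release", "ref": "checkout v1.42", "at": datetime(2024, 1, 10, 11, 50)},
    {"kind": "config", "ref": "psp-routing", "at": datetime(2024, 1, 10, 9, 0)},
    {"kind": "window", "ref": "db-maintenance", "at": datetime(2024, 1, 10, 11, 40)},
]
degraded_at = datetime(2024, 1, 10, 12, 0)
for event in candidate_causes(degraded_at, annotations):
    print(event["kind"], event["ref"])
```

Temporal proximity only narrows the list of suspects; it does not prove that a release or config change caused the degradation.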

7) Data sources and verification

Telemetry: metrics/traces/logs with trace_id.
ITSM: Incidents/Issues/Changes (Statuses/SLAs).
CI/CD: releases, signatures, artifacts, tests.
Service catalog/CMDB: owners, SLO, dependencies.
Calendar: maintenance windows.
Providers: status API + manual confirmations (landed in a separate data mart).
FinOps: billing/resource tags, log volumes, egress.

Quality control: quorum, duplicate probes, freshness SLAs, alerts on "silent" sources.
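
A minimal sketch of the quorum and freshness checks follows. It assumes each source reports a boolean failure flag with a Unix timestamp; the 300-second freshness limit, the quorum of two and the source names are illustrative values.

```python
def quorum_state(probes: dict, now: float,
                 max_age_s: float = 300.0, quorum: int = 2) -> tuple:
    """probes maps source name -> (reports_failure, unix_timestamp).

    Returns (state, silent_sources); state is "failing", "ok" or "unknown".
    """
    fresh = {src: failing for src, (failing, ts) in probes.items()
             if now - ts <= max_age_s}
    silent = sorted(set(probes) - set(fresh))   # stale sources deserve their own alert
    if len(fresh) < quorum:
        state = "unknown"       # too few live sources to decide
    elif sum(fresh.values()) >= quorum:
        state = "failing"       # enough independent sources agree
    else:
        state = "ok"
    return state, silent


probes = {   # hypothetical sources
    "synthetic-eu": (True, 990.0),
    "synthetic-us": (True, 980.0),
    "provider-status-api": (False, 100.0),   # last seen 900 s ago
}
print(quorum_state(probes, now=1000.0))
```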

8) Display modes

War-room: fixed layout SLO/Incidents/Releases/Comms-timer.
Executive (28 days): trends MTTR/MTTD/SEV mix, $/unit, SLO-adherence.
On-call: compact "night" panel (dark mode, large numbers).
Multi-tenant/region: service/region/tenant filters; presets.

9) Navigation and actions (one-click)

Buttons: '/declare sev1', '/freeze', '/rollback', '/status update', 'open playbook'.
Drill-down: SLO → graph → logs/traces with prefilled filters (trace_id, release_id).
Sharing: snapshot of panels in a ticket/status page.
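
Prefilled drill-down filters usually come down to building a link with query parameters. The sketch below uses the standard library's `urllib.parse.urlencode`; the base URL and the filter values are placeholders, not a real log viewer API.

```python
from urllib.parse import urlencode


def drilldown_url(base: str, **filters) -> str:
    """Build a link to a log/trace view with filters prefilled.

    Filters set to None are dropped; keys are sorted so links are stable.
    """
    params = {key: value for key, value in sorted(filters.items()) if value is not None}
    query = urlencode(params)
    return f"{base}?{query}" if query else base


# Hypothetical internal log search endpoint.
print(drilldown_url("https://logs.example.internal/search",
                    trace_id="abc123", release_id="checkout-42"))
```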

10) Security, access, audit

SSO/OIDC + RBAC/ABAC: roles and scopes (view/action).
JIT/JEA: "dangerous" actions are available only with temporary privilege elevation.
Immutable audit: who clicked what, and which requests/commands were sent.
Secrets: not displayed, only links to the secret manager.
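
The interplay of scopes, temporary elevation and audit can be sketched as below. The scope naming (`action:<name>`), the list of dangerous actions and the in-memory list standing in for an immutable audit store are all assumptions for illustration.

```python
from dataclasses import dataclass, field

DANGEROUS_ACTIONS = {"rollback", "freeze", "switch_weight"}   # hypothetical list


@dataclass
class Session:
    user: str
    scopes: set = field(default_factory=set)   # e.g. {"view", "action:rollback"}
    elevated_until: float = 0.0                # unix time; 0 means not elevated


def authorize(session: Session, action: str, now: float, audit_log: list) -> bool:
    """Allow an action if the scope is granted and, for dangerous actions,
    a temporary elevation is still valid. Every attempt is audited."""
    allowed = f"action:{action}" in session.scopes
    if allowed and action in DANGEROUS_ACTIONS:
        allowed = now < session.elevated_until
    audit_log.append({"user": session.user, "action": action,
                      "allowed": allowed, "at": now})
    return allowed


audit = []   # stand-in for an append-only (WORM) store
ic = Session(user="alice", scopes={"view", "action:rollback"})
print(authorize(ic, "rollback", now=1000.0, audit_log=audit))   # no elevation yet
ic.elevated_until = 1000.0 + 900                                # 15-minute grant
print(authorize(ic, "rollback", now=1060.0, audit_log=audit))
```

Denied attempts are logged as well, so the audit trail shows who tried what, not only what succeeded.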

11) CDU Maturity Metrics

Actionability ≥ 90%: Clicks lead to actions, not just graphs.
Time-to-First-Action ≤ 2 min from the CDU during SEV-1/0.
Share of incidents where the CDU was the "source of truth" ≥ 95%.
Widget freshness: % of widgets with data no older than 5 minutes.
Coverage: % of critical services with SLO cards and release annotations.
Zero-blind-spots: silent sources for the week = 0.
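
Two of these metrics are simple ratios and can be computed as in the sketch below. The input shapes (a map of widget id to last-update timestamp, and sets of service names) are assumptions about how such data might be collected.

```python
def freshness_pct(last_updated: dict, now: float, max_age_s: float = 300.0) -> float:
    """Share of widgets whose data is no older than max_age_s (5 minutes by default)."""
    if not last_updated:
        return 0.0
    fresh = sum(1 for ts in last_updated.values() if now - ts <= max_age_s)
    return 100.0 * fresh / len(last_updated)


def coverage_pct(critical: set, with_slo_card: set, with_annotations: set) -> float:
    """Share of critical services that have both an SLO card and release annotations."""
    if not critical:
        return 0.0
    covered = critical & with_slo_card & with_annotations
    return 100.0 * len(covered) / len(critical)


# Hypothetical inventory.
widgets = {"slo": 1000.0, "sev": 900.0, "finops": 600.0, "dataops": 100.0}
print(freshness_pct(widgets, now=1000.0))
print(coverage_pct({"payments", "auth", "checkout", "kyc"},
                   {"payments", "auth", "checkout"},
                   {"payments", "auth"}))
```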

12) Checklists

Design

Roles and scenarios are described (P1/P2/IC/Exec/FinOps/Security/DataOps).
The color/SEV/threshold dictionary is consistent.
Data sources have quorum and freshness SLAs.
War-room/On-call/Executive layouts.
ChatOps/ITSM/CI/CD/CMDB Integration Plan.

Operation

Widgets pass linter (required fields, owner, thresholds).
Once a week: escalation/alert review with CDU improvements.
Incident snapshots are attached to the AAR/RCA.
Dark mode/mobile duty preset.
Tests for "silent" sources and annotation correctness.
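
The widget linter mentioned in this checklist could look roughly like the sketch below. The exact set of required fields and the amber-below-red rule are assumptions; the sample widget reuses field names from the widget template in section 13.

```python
REQUIRED_FIELDS = ("id", "title", "owner", "type", "thresholds")   # assumed set


def lint_widget(widget: dict) -> list:
    """Return a list of human-readable problems; an empty list means the widget passes."""
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS if not widget.get(name)]
    thresholds = widget.get("thresholds") or {}
    amber = (thresholds.get("amber") or {}).get("burn_rate")
    red = (thresholds.get("red") or {}).get("burn_rate")
    if amber is not None and red is not None and amber >= red:
        errors.append("amber burn_rate must be lower than red burn_rate")
    return errors


good = {
    "id": "slo-payments", "title": "SLO: Success of payments (EU)",
    "owner": "team-payments", "type": "slo_burnrate",
    "thresholds": {"amber": {"burn_rate": 1.2}, "red": {"burn_rate": 2.0}},
}
bad = {"id": "finops-burn", "type": "cost_unit",
       "thresholds": {"amber": {"burn_rate": 3.0}, "red": {"burn_rate": 2.0}}}
print(lint_widget(good))
print(lint_widget(bad))
```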

13) Templates (ideas)

13.1 Widget Definition (YAML)

yaml
id: slo-payments
title: "SLO: Success of payments (EU)"
owner: team-payments
type: slo_burnrate
sli:
  metric: "biz.payment_success_ratio"
  target_pct: 99.5
burn_rate:
  short_window: "1h"
  long_window: "6h"
thresholds:
  amber: { burn_rate: 1.2 }
  red:   { burn_rate: 2.0 }
actions:
  - label: "Open playbook"
    link: "rb://payments/slo-degrade"
  - label: "Release rollback"
    link: "sop://REL-ROLLBACK-01"
annotations:
  release: true
  change: true
filters:
  region: "eu"
  tier: "0"

13.2 Incident Card (JSON)

json
{
"id": "incidents-active",
"type": "incident_board",
"sev": ["SEV-0", "SEV-1", "SEV-2"],
"fields": ["id","sev","service","since","ic","next_comms_at"],
"actions": [{"label":"War-room","cmd":"/declare sev1"}]
}

13.3 Connection with the release

yaml
id: release-canary
type: release_progress
source: cicd://checkout
gates: ["tests","signatures","slo_guardrails"]
canary_steps: [1,5,25]
rollback: "sop://REL-ROLLBACK-01"
annotations: { on_charts: ["slo-latency","slo-success"] }

13.4 FinOps widget

yaml
id: finops-burn
type: cost_unit
metrics:
  - id: "cost_per_1k_txn"
  - id: "logs_daily_gib"
alerts:
  - when: "cost_per_1k_txn > target * 1.2"
    action: "open://finops/reco-logs-sampling"
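
The alert condition in this widget can be checked as in the sketch below. It assumes the rule means "unit cost exceeds the target by more than 20%"; the function names and the sample figures are hypothetical.

```python
def cost_per_1k_txn(total_cost: float, txn_count: int) -> float:
    """Unit cost: spend per one thousand transactions."""
    if txn_count <= 0:
        raise ValueError("txn_count must be positive")
    return total_cost / txn_count * 1000


def over_budget(unit_cost: float, target: float, tolerance: float = 1.2) -> bool:
    """True when the unit cost exceeds the target multiplied by the tolerance."""
    return unit_cost > target * tolerance


# Hypothetical day: $1,250 spent on 5 million transactions, target $0.20 per 1k.
unit = cost_per_1k_txn(total_cost=1250.0, txn_count=5_000_000)
print(round(unit, 4), over_budget(unit, target=0.20))
```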

14) Anti-patterns

"Wall of graphs" without actions and playbooks.
Different colors/thresholds across teams → confusion about SEV levels.
No release/window annotations: cause correlation becomes difficult.
Duplicate sources without quorum: false pages and noise.
Secrets/keys on the panel: risk of leakage.
Slow rendering (queries/aggregations are not cached): panels do not open when they are needed most.

15) Implementation Roadmap (4-8 weeks)

1. Week 1: collect requirements by role, define the dictionary of statuses/colors, lay out the three modes.
2. Week 2: connect SLO/Incidents/Releases/Windows, annotations, ChatOps actions.
3. Week 3: add FinOps/Capacity/Providers/DataOps/Security, source quorum.
4. Week 4: war-room mode, snapshots in ITSM, pilot on Tier-0.
5. Weeks 5-6: performance optimization, mobile/on-call preset, widget linter.
6. Weeks 7-8: maturity metrics, weekly review, automatic recommendations (log sampling, quotas, fallback).

16) The bottom line

The CDU is not "beautiful graphs" but a decision panel: SLO and burn rate on top, incidents/releases/windows in one context, instant actions via ChatOps and SOPs, verified sources and annotations. Such a dashboard reduces MTTA/MTTR, simplifies communications, supports FinOps, and makes operations transparent and predictable.
