Operational dashboard

(Section: Operations and Management)

1) Purpose and principles

An operational dashboard is a "single window" for monitoring platform health and taking action. It aggregates metrics, events, alerts, and business KPIs, presented in the context of the user's role (SRE, Product, Finance, Compliance, Support, Partners).

Principles:

Actionable by design: each widget has an action button (rollback, pause, re-run, re-route).
Role-aware: permissions and level of detail depend on the role/tenant/region.
Source-of-truth: figures reconcile with billing, logs, and invoices.
Near-real-time + history: seconds/minutes for incidents, months/years for trends.
Explainability: any aggregate expands down to a raw event with its `trace_id`.

2) Roles and scenarios (who comes and why)

SRE/Platform: availability, p50/p95/p99 latency, error/retry rates, capacity, cost per 1k events.
Product/Operations: E2E Success Rate, conversion, partner onboarding time, feature flags.
Finance/FinOps: revenue/COGS/CM per unit, egress/ingress, budgets and caps, deviations.
Compliance/Security: receipts/signatures, PII requests, SoD violations, recertification status.
Support/CS: ticket queue, MTTA/MTTR, SLA by partner and region.
Partners/Tenants: own SLO metrics, webhook statuses, usage and quotas.

3) North Star and key SLI/SLO

North Star: E2E Success Rate on critical routes at target p95 in each region.

SLI (example):

Availability per-channel/region.
p50/p95/p99 latency.
Error rate and share of retries.
Webhook delivery success rate (% with receipts).
Cost per 1k events and egress/ingress per unit.
Summary of incidents: MTTA, MTTR, error-budget burn.

SLO (example):

Availability ≥ 99.95% per region/channel.
p95 ≤ 120 ms (showcase), ≤ 250 ms (checkout/quote).
Webhook delivery success ≥ 99.5% in a 5-minute window.
Δ between quote and checkout = 0 (± 1 minor unit according to allocation rules).
Response time for P1 ≤ 10 min, MTTR ≤ 60 min.
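
To make the "error-budget burn" SLI concrete, here is a minimal sketch of how a burn rate could be derived from an availability SLO. The function name and parameters are illustrative assumptions, not part of any existing API.

```python
def error_budget_burn_rate(good: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is consumed within one window.

    1.0 means failures arrive exactly at the rate the SLO allows;
    values above 1.0 mean the budget runs out before the SLO period ends.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    observed_error_rate = (total - good) / total
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


# 100 failed requests out of 100,000 against a 99.95% availability target
rate = error_budget_burn_rate(good=99_900, total=100_000, slo_target=0.9995)
print(f"burn rate: {rate:.2f}")
```

In this example the observed error rate (0.1%) is about twice the allowed one (0.05%), so the budget burns roughly twice as fast as planned.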

4) Dashboard data architecture

Event bus: telemetry (traces/metrics/logs), business events, billing, compliance.
Streaming/aggregation: T+5s / T+1m windows for near-real-time; CDC/outbox for guaranteed delivery.
Storage: time-series (operational data), OLAP (long history), WORM logs (audit).
Semantic layer: dictionary of metrics, units, normalization by region and tenant.
Link to raw data: drill-down to `trace_id`/`event_id` and signatures (`receipt_hash`).
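
The T+5s aggregation step can be sketched as a tumbling-window rollup. This is an illustration under stated assumptions: the `(ts, latency_ms, ok)` point shape, the function names, and the nearest-rank percentile method are all choices made for the example, not a description of a specific streaming engine.

```python
from collections import defaultdict
from math import ceil


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate_windows(points, window_s=5):
    """Group (ts_seconds, latency_ms, ok) points into tumbling windows."""
    buckets = defaultdict(list)
    for ts, latency_ms, ok in points:
        window_start = int(ts // window_s) * window_s
        buckets[window_start].append((latency_ms, ok))

    result = {}
    for start, items in sorted(buckets.items()):
        latencies = [latency for latency, _ in items]
        failures = sum(1 for _, ok in items if not ok)
        result[start] = {
            "count": len(items),
            "error_rate": failures / len(items),
            "p95_ms": percentile(latencies, 95),
        }
    return result


points = [(0.0, 10, True), (1.0, 20, True), (4.9, 30, False), (5.0, 40, True)]
for start, stats in aggregate_windows(points).items():
    print(start, stats)
```

A production pipeline would also have to handle late and out-of-order events, which this sketch deliberately ignores.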

5) Interface and widget design

Global header: filters (time, region, tenant, product, environment), status indicators.
Tiles (KPIs): E2E Success, availability, p95, error-rate, cost/1k, egress.
Charts: sparkline trends, heat-map by region, percentile charts.
Tables: top errors, degraded partners, quota overruns, open incidents.

Action sections: "Pause promo," "Rollback feature," "Raise quota," "Restart delivery."

Contextual help: hints on metrics and calculation methods, and how each relates to its SLO.

6) Dashboard modules (recommended set)

1. Platform health: availability/latency/errors, error-budget burn-down.
2. Partner integrations: webhook status, receipts, idempotent retries, queue lag.
3. Checkout & Prices: showcase↔checkout consistency, `fx_version`, `tax_rule_version`, failure cases.
4. Content/Directories: publish time, cache/invalidator errors, freshness.
5. RTP & Limits (if applicable): theoretical vs. observed RTP, limit triggers, exposure.
6. FinOps: COGS/unit, egress/ingress, compute/storage, budgets/cap-alerts.
7. Security/Compliance: SoD, JIT, MFA, signed operations, PII requests, and logs.
8. Support: queues, MTTA/MTTR, reasons, auto-runbooks.
9. Release/Feature Flags: release statuses, canary regions, automatic linking of regressions to incidents.
10. Experiments: A/B guardrails, impact of features on SLI/ROI.

7) Alerts, runbooks, and escalations

P1-P3 alert levels with noise suppression and deduplication by `trace_id`.
Auto-runbooks: on trigger, start checks/fixes (clear the cache, switch routing, pause a promo).
Escalation: 24×7 matrix, response SLO, channels (chat/voice/SMS), "red button."
Post-incident: root-cause report templates and action items.
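
Deduplication by `trace_id` with noise suppression can be sketched as follows. The class name, the `(slo_ref, trace_id)` key, and the 300-second default window are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AlertDeduplicator:
    """Suppress repeat notifications for the same (slo_ref, trace_id) pair."""

    suppress_s: float = 300.0  # assumed default suppression window
    _last_notified: dict = field(default_factory=dict)

    def should_notify(self, slo_ref: str, trace_id: str, ts: float) -> bool:
        key = (slo_ref, trace_id)
        last = self._last_notified.get(key)
        if last is not None and ts - last < self.suppress_s:
            return False  # duplicate inside the suppression window
        self._last_notified[key] = ts
        return True


dedup = AlertDeduplicator(suppress_s=60)
print(dedup.should_notify("availability-eu", "t1", ts=0))   # first occurrence
print(dedup.should_notify("availability-eu", "t1", ts=30))  # suppressed repeat
print(dedup.should_notify("availability-eu", "t1", ts=60))  # window elapsed
```

The timestamp is refreshed only when a notification is actually sent, so a condition that keeps firing produces a reminder once per window rather than staying silent indefinitely.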

8) Multi-region and multi-tenancy

Slices: region/tenant/channel/provider, independent SLOs and budgets.
Trust zones: PII and financial data are visible only within their respective zones; everything else is shown as aggregates.
Cost-aware: comparing routes by price at the same p95; optimization recommendations.
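
A cost-aware comparison "at the same p95" can be reduced to: filter routes that meet the latency budget, then pick the cheapest. The route dictionaries and field names below are hypothetical.

```python
def cheapest_route(routes, p95_budget_ms):
    """Return the lowest-cost route whose p95 meets the budget, or None."""
    eligible = [r for r in routes if r["p95_ms"] <= p95_budget_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r["cost_per_1k"])


# Hypothetical routes; names and prices are illustrative only.
routes = [
    {"name": "provider-a/eu", "p95_ms": 110, "cost_per_1k": 0.42},
    {"name": "provider-b/eu", "p95_ms": 118, "cost_per_1k": 0.31},
    {"name": "provider-c/eu", "p95_ms": 190, "cost_per_1k": 0.12},
]
best = cheapest_route(routes, p95_budget_ms=120)
print(best["name"] if best else "no route meets the latency budget")
```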

9) Security and privacy

RBAC/ABAC: visibility and actions by role; ReBAC for product/tenant ownership.
Signatures and receipts: for financial/critical events - hashes and DSSE receipts.
PII hygiene: tokenization, masking, access only through approved jobs.
Audit: WORM logs for config/role/limit changes, reproducibility.
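
As an illustration of tokenization and masking, the sketch below uses a keyed HMAC to produce stable tokens and a simple email mask for display. The key handling, the 16-character token length, and the mask format are assumptions; a real deployment would take the key from a secrets manager.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes) -> str:
    """Deterministic keyed token: the same input and key give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for display; length is an assumption


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"
    return f"{local[0]}***@{domain}"


key = b"demo-key-not-for-production"  # placeholder key for the example
print(mask_email("alice@example.com"))
print(tokenize("alice@example.com", key))
```

Because the token is deterministic for a given key, rows can still be joined and counted per user without exposing the raw value.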

10) Metrics data model (example)

`metric` `{name, unit, type: counter/gauge/hist, owner, sla_ref}`

`dim` `{region, tenant, product, provider, version, environment}`

`point` `{metric, value, ts, dims{}, trace_id, signature?}`

`event` `{type, severity, subject_id, payload_hash, receipt_hash, ts}`

`slo` `{name, target, window, burn_rate, owners[], runbook_url}`

`alert` `{slo_ref, condition, status, ack_by, acknowledged_at, runbook_step}`
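
The model above maps naturally onto typed records. The following sketch mirrors the listed fields with Python dataclasses; the concrete types, the defaults, and the `"firing"`/`"acknowledged"` status values are assumptions, since the outline gives only field names.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass(frozen=True)
class Metric:
    name: str
    unit: str
    type: Literal["counter", "gauge", "hist"]
    owner: str
    sla_ref: str


@dataclass(frozen=True)
class Point:
    metric: str                       # Metric.name
    value: float
    ts: float                         # epoch seconds (assumed unit)
    trace_id: str
    dims: dict = field(default_factory=dict)  # region, tenant, product, ...
    signature: Optional[str] = None   # optional, per `signature?`


@dataclass
class Alert:
    slo_ref: str
    condition: str
    status: str = "firing"            # status values are assumptions
    ack_by: Optional[str] = None
    acknowledged_at: Optional[float] = None
    runbook_step: Optional[str] = None


def acknowledge(alert: Alert, user: str, ts: float) -> None:
    """Record who acknowledged the alert and when."""
    alert.status = "acknowledged"
    alert.ack_by = user
    alert.acknowledged_at = ts


point = Point(metric="availability", value=0.9996, ts=0.0,
              trace_id="abc123", dims={"region": "eu"})
alert = Alert(slo_ref="availability-eu", condition="burn_rate > 2")
acknowledge(alert, user="oncall", ts=100.0)
print(point.dims["region"], alert.status, alert.ack_by)
```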

11) Dashboard API/webhooks

`POST /ingest/metrics` - receiving metrics (schema, limits, authentication).
`POST /ingest/events` - business events (versions/signatures).
`GET /kpis?filters...` - aggregates for widgets.
`GET /traces/{trace_id}` - drill-down to a raw trace.
Webhooks: `IncidentRaised`, `QuotaCapReached`, `PriceMismatch`, `WebhookDeliveryLag`, `SecuritySoDViolation`.
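
A client call to the metrics ingest endpoint might be assembled as shown below. Only the path comes from the list above; the host name, the JSON payload shape, and the bearer-token scheme are assumptions, because the outline does not specify them. The request is built but not sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://dashboard.example.com"  # hypothetical host


def build_ingest_request(points, token):
    """Build (but do not send) a POST to the metrics ingest endpoint."""
    body = json.dumps({"points": points}).encode("utf-8")  # assumed payload shape
    return Request(
        url=f"{BASE_URL}/ingest/metrics",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


req = build_ingest_request(
    [{"metric": "availability", "value": 0.9996, "ts": 0, "trace_id": "abc123"}],
    token="example-token",
)
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call against a real deployment.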

12) Data quality and tests

Data contracts: schemas and validation on ingest, versioning (`expand → migrate → contract`).
Anomalies: monitoring for gaps/spikes, "flatline"/"noise" thresholds.
Sampling: for high-QPS metrics, sliding-window sampling that preserves representativeness.
Backfill: safe, version-tagged backfills.
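
A flatline/spike check can be sketched with basic statistics: compare the latest value with the mean and spread of the preceding values. The thresholds and the function name are assumptions, and gap detection (missing points) is not covered here.

```python
from statistics import mean, pstdev


def classify_series(values, flatline_eps=1e-9, spike_sigmas=3.0):
    """Classify the most recent value as 'flatline', 'spike', or 'ok'."""
    if len(values) < 3:
        return "ok"  # not enough history to judge
    history, latest = values[:-1], values[-1]
    baseline = mean(history)
    spread = pstdev(history)
    deviation = abs(latest - baseline)
    if spread <= flatline_eps:
        # History is constant: either still flat, or a jump off a flat line.
        return "flatline" if deviation <= flatline_eps else "spike"
    return "spike" if deviation > spike_sigmas * spread else "ok"


print(classify_series([5, 5, 5, 5]))
print(classify_series([10, 11, 9, 10, 50]))
print(classify_series([10, 11, 9, 10, 10.5]))
```

A flatline is not always an error (a counter can legitimately stay at zero), so such a check is usually enabled per metric rather than globally.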

13) Metrics of the dashboard itself (meta-metrics)

UI/API availability ≥ 99.9%.
Latency: p95 of API requests ≤ 300 ms.
Completeness: share of sources that delivered data within the window ≥ 99.5%.
Freshness: incremental update lag ≤ 30 s.
Correctness: discrepancy with reference reports ≤ 0.1%.
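
Completeness and freshness are simple to compute once the expected source list is known. The sketch below assumes sources are identified by string names and timestamps are in epoch seconds; the function names are illustrative.

```python
def completeness(expected_sources, reported_sources):
    """Share of expected sources that delivered data within the window."""
    expected = set(expected_sources)
    if not expected:
        raise ValueError("expected_sources must not be empty")
    return len(expected & set(reported_sources)) / len(expected)


def freshness_lag_s(now_ts, last_update_ts):
    """Seconds since the last incremental update (never negative)."""
    return max(0.0, now_ts - last_update_ts)


def meets_targets(completeness_ratio, lag_s,
                  min_completeness=0.995, max_lag_s=30.0):
    """Check both values against the targets listed above."""
    return completeness_ratio >= min_completeness and lag_s <= max_lag_s


ratio = completeness(["billing", "webhooks", "traces", "logs"],
                     ["billing", "webhooks", "traces"])
lag = freshness_lag_s(now_ts=1_000.0, last_update_ts=988.0)
print(ratio, lag, meets_targets(ratio, lag))
```

Sources that report but are not in the expected list are ignored, so an unexpected extra source cannot inflate the completeness figure.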

14) Economy and FinOps in the dashboard

Cost per 1k events decomposed by provider/region.
Egress/Ingress heatmaps, caching/routing recommendations.
Budgets/cap-alerts: 80/90/100%, auto-throttling and prioritization.
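
The 80/90/100% cap-alert ladder can be expressed as a threshold lookup. The mapping of thresholds to actions (`warn`, `throttle`, `cap`) is a hypothetical example; the outline only names the percentages.

```python
# (threshold percent, action), highest first. Actions are illustrative.
CAP_LADDER = ((100, "cap"), (90, "throttle"), (80, "warn"))


def budget_action(spend, budget):
    """Return the action for the highest threshold the spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    for percent, action in CAP_LADDER:
        # Cross-multiplied comparison avoids floating-point division.
        if spend * 100 >= budget * percent:
            return action
    return "ok"


for spend in (79, 80, 95, 120):
    print(spend, budget_action(spend, budget=100))
```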

15) Accessibility and UX

Dark theme, short captions, status icons.
Keyboard navigation and a11y: contrast, alt text, ARIA attributes.

Saved presets: "SRE duty," "finance," "partner."

Snapshots and sharing: capture state with filters and link/export.

16) Risks and anti-patterns

Dashboard sprawl: 20 different dashboards without a shared dictionary of metrics.
Vanity metrics: attractive charts with no connection to SLOs or actions.
Inconsistent figures: reports ≠ billing/audit.
Noisy alerts: alert fatigue and missed P1s.
No drill-down: no way to reach the raw data and root causes.

17) Implementation checklist

Define roles and scenarios; agree on the North Star and SLI/SLO.
Create a dictionary of metrics and units; formalize data contracts.
Configure ingest (metrics/events/traces), OLAP, and WORM auditing.
Implement key modules (health, partners, checkout, FinOps, Security).
Enable alerts with runbooks and escalations; add the "red button."
Add rollback/pause/re-route/raise-limit actions.
Build heat-map by region/tenant; filters and presets.
Reconcile output figures with billing/invoices.
Run GameDays: provider outage, retry storm, price desynchronization.
Hold weekly SLO reviews and check post-mortem quality.

18) RACI

Area	R	A	C	I
Metrics Dictionary/SLI/SLO	Platform Analytics	CTO	Product, SRE, Finance	All
Source integrations	Data Eng	Head of Data	SRE, Security	Product
Alerts and runbooks	SRE	CTO	Product, FinOps	Support
Security/Privacy	Security/Privacy	CISO/DPO	Legal, Compliance	All
Financial metrics	FinOps	CFO	Product, Data	Audit

19) FAQ

Can all reports be replaced with a dashboard?
No. The dashboard is for day-to-day operations and actions; formal reporting and auditing remain separate artifacts.

How much "real time" do you need?
For incidents - seconds/minutes; for economics - minutes/hours. Consistency matters more than being strictly "online."

How to deal with the noise of alerts?
SLO-oriented conditions, aggregation, deduplication by `trace_id`, prioritization, and auto-runbooks.

How to check the correctness of metrics?
Regular reconciliations with reference reports, test feeds, control samples and WORM logs.

Summary: An operational dashboard is not a "pretty board" but a management tool: unified SLI/SLO, actions from the interface, tracing down to raw data, and strict consistency with billing and audit. Build it on an event-driven architecture, provide context by role, add runbooks and escalations, and you get predictable operations, quick decisions, and sustainable growth.
