Operational dashboard
(Section: Operations and Management)
1) Purpose and principles
An operational dashboard is a "single pane of glass" for monitoring platform health and taking action. It aggregates metrics, events, alerts, and business KPIs in the context of the user's role (SRE, Product, Finance, Compliance, Support, Partners).
Principles:
- Actionable by design: each widget has an action button (rollback, pause, re-run, re-route).
- Role-aware: permissions and level of detail depend on the role/tenant/region.
- Source-of-truth: numbers reconcile with billing, logs, and invoices.
- Near-real-time + history: seconds/minutes for incidents, months/years for trends.
- Explainability: any aggregate expands down to a raw event with its `trace_id`.
2) Roles and scenarios (who comes and why)
SRE/Platform: availability, p50/p95/p99 latency, error/retry rates, capacity, cost per 1k events.
Product/Operations: E2E Success Rate, conversion, partner onboarding time, feature flags.
Finance/FinOps: revenue/COGS/CM per unit, egress/ingress, budgets and caps, deviations.
Compliance/Security: receipts/signatures, PII requests, SoD violations, recertification status.
Support/CS: ticket queue, MTTA/MTTR, SLA by partner and region.
Partners/Tenants: own SLO metrics, webhook statuses, usage and quotas.
3) North Star and key SLI/SLO
North Star: E2E Success Rate on critical routes at target p95 in each region.
SLI (example):
- Availability per channel/region.
- p50/p95/p99 latency.
- Error rate and share of retries.
- Webhook delivery success rate (% with receipts).
- Cost per 1k events and egress/ingress per unit.
- Incident summary: MTTA, MTTR, error-budget burn.
SLO (example):
- Availability ≥ 99.95% per region/channel.
- p95 ≤ 120 ms (storefront), ≤ 250 ms (checkout/quote).
- Webhook delivery success ≥ 99.5% in a 5-minute window.
- Δ between quote and checkout = 0 (± 1 minor unit according to allocation rules).
- Response time for P1 ≤ 10 min, MTTR ≤ 60 min.
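The "error-budget burn" SLI can be made concrete with a small calculation. The sketch below is illustrative only: the function name `burn_rate` and the sample counts are assumptions, and it uses the common definition of burn rate as observed error rate divided by the error budget (1 minus the SLO target).

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed in a window.

    1.0 means the budget is being spent exactly at the sustainable pace;
    values above 1.0 mean it will run out before the SLO period ends.
    """
    if total_events == 0:
        return 0.0  # no traffic in the window, nothing burned
    error_budget = 1.0 - slo_target  # e.g. 0.005 for a 99.5% target
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


# Hypothetical 5-minute window for the webhook delivery target of 99.5%.
rate = burn_rate(bad_events=12, total_events=1_000, slo_target=0.995)
print(f"burn rate: {rate:.1f}x")
```

A tile could show this number next to the raw success rate, so that a 98.8% success rate reads as "budget burning more than twice as fast as allowed" rather than as an abstract percentage.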
4) Dashboard data architecture
Event bus: telemetry (traces/metrics/logs), business events, billing, compliance.
Streaming/aggregation: T+5s/T+1m windows for near-real-time; CDC/outbox for guaranteed delivery.
Storage: time-series (operational data), OLAP (long history), WORM logs (audit).
Semantic layer: dictionary of metrics, units, normalization by region and tenant.
Link to raw data: drill-down to `trace_id`/`event_id` and signatures (`receipt_hash`).
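To illustrate the windowed aggregation step, here is a minimal in-memory sketch of tumbling 5-second windows that produce a count and a p95 per window. It is not a streaming engine: the names `aggregate` and `nearest_rank_percentile`, the tuple layout of the points, and the nearest-rank percentile method are all assumptions made for the example.

```python
import math
from collections import defaultdict

WINDOW_SECONDS = 5  # assumed window size, matching the 5-second tier


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (0 < pct <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(points):
    """Group (ts_seconds, latency_ms) points into tumbling windows."""
    windows = defaultdict(list)
    for ts, latency_ms in points:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start].append(latency_ms)
    return {
        start: {"count": len(vals), "p95_ms": nearest_rank_percentile(vals, 95)}
        for start, vals in sorted(windows.items())
    }


# Hypothetical latency samples: (timestamp in seconds, latency in ms).
points = [(0.4, 80), (1.2, 95), (4.9, 110), (5.1, 300), (9.0, 120)]
result = aggregate(points)
print(result)
```

In a real pipeline each aggregate would also keep a reference to the contributing `trace_id` values so that the drill-down described above stays possible.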
5) Interface and widget design
Global header: filters (time, region, tenant, product, environment), status indicators.
Tiles (KPIs): E2E Success, availability, p95, error-rate, cost/1k, egress.
Charts: sparkline trends, heat-map by region, percentile charts.
Tables: top errors, degraded partners, quota overruns, open incidents.
Action sections: "Pause promo," "Rollback feature," "Raise quota," "Restart delivery."
Contextual help: hints about metrics and methodology and how they relate to SLOs.
6) Dashboard modules (recommended set)
1. Platform health: availability/latency/errors, error-budget burn-down.
2. Partner integrations: webhook status, receipts, idempotent retries, queue lag.
3. Checkout & Prices: storefront↔checkout consistency, `fx_version`, `tax_rule_version`, failure cases.
4. Content/Directories: publish time, cache/invalidation errors, freshness.
5. RTP & Limits (if applicable): theoretical vs. observed RTP, limit triggers, exposure.
6. FinOps: COGS/unit, egress/ingress, compute/storage, budgets/cap-alerts.
7. Security/Compliance: SoD, JIT, MFA, signed operations, PII requests, and logs.
8. Support: queues, MTTA/MTTR, reasons, auto-runbooks.
9. Release/Feature Flags: release statuses, canary regions, automatic linking of regressions to incidents.
10. Experiments: A/B guardrails, impact of features on SLI/ROI.
7) Alerts, runbooks, and escalations
P1-P3 alerts with noise suppression and deduplication by `trace_id`.
Auto-runbooks: on trigger, run checks/fixes (clear the cache, switch routing, pause a promo).
Escalation: 24×7 on-call matrix, response SLO, channels (chat/voice/SMS), "red button."
Post-incident: root-cause report templates and action items.
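Deduplication by `trace_id` can be sketched as follows. This is one possible approach rather than a prescribed design: the `Alert` fields, the `(slo_ref, trace_id)` key, and the 300-second suppression window are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    slo_ref: str
    severity: str  # "P1", "P2" or "P3"
    trace_id: str
    ts: float      # seconds since epoch


def deduplicate(alerts, suppress_seconds=300.0):
    """Keep one alert per (slo_ref, trace_id) within the suppression window.

    The window is measured from the last alert that was actually kept,
    so a long-running problem re-fires once per window, not on every event.
    """
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.ts):
        key = (alert.slo_ref, alert.trace_id)
        previous = last_kept.get(key)
        if previous is None or alert.ts - previous >= suppress_seconds:
            kept.append(alert)
            last_kept[key] = alert.ts
    return kept


# Hypothetical alert stream: two repeats of the same trace, one distinct trace.
raw = [
    Alert("webhook-success", "P2", "trace-a", 0.0),
    Alert("webhook-success", "P2", "trace-a", 10.0),
    Alert("webhook-success", "P2", "trace-b", 20.0),
    Alert("webhook-success", "P2", "trace-a", 400.0),
]
unique = deduplicate(raw)
print(len(unique))
```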
8) Multi-region and multi-tenancy
Slices: region/tenant/channel/provider, independent SLOs and budgets.
Trust zones: PII and financial data are visible only within their respective zones; everywhere else only aggregates are shown.
Cost-aware: comparing routes by price at the same p95; optimization recommendations.
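The cost-aware comparison amounts to filtering routes by the latency target and then choosing the cheapest. A minimal sketch, assuming each route is a dictionary with hypothetical `name`, `p95_ms`, and `cost_per_1k` fields:

```python
def cheapest_route_within_slo(routes, p95_target_ms):
    """Return the cheapest route whose p95 meets the target, or None."""
    eligible = [r for r in routes if r["p95_ms"] <= p95_target_ms]
    if not eligible:
        return None  # nothing meets the SLO; cost is not the deciding factor
    return min(eligible, key=lambda r: r["cost_per_1k"])


# Hypothetical routes for one region.
routes = [
    {"name": "provider-a", "p95_ms": 110, "cost_per_1k": 0.42},
    {"name": "provider-b", "p95_ms": 240, "cost_per_1k": 0.18},
    {"name": "provider-c", "p95_ms": 115, "cost_per_1k": 0.31},
]
best = cheapest_route_within_slo(routes, p95_target_ms=120)
print(best["name"])
```

Here `provider-b` is cheapest overall but is excluded because it misses the latency target, which is the kind of trade-off an optimization recommendation would surface.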
9) Security and privacy
RBAC/ABAC: visibility and actions by role; ReBAC for product/tenant ownership.
Signatures and receipts: for financial/critical events - hashes and DSSE receipts.
PII hygiene: tokenization, masking, access only through approved jobs.
Audit: WORM logs for config/role/limit changes, reproducibility.
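As an illustration of tokenization and masking, the sketch below derives a stable token with a keyed hash and masks an email address for display. The function names, the 16-character token length, and the choice of HMAC-SHA256 are assumptions for this example; a production system would typically delegate this to a dedicated tokenization service with managed keys.

```python
import hashlib
import hmac


def tokenize(value: str, secret: bytes) -> str:
    """Derive a stable, non-reversible token for a PII value (keyed hash)."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability in tables (assumption)


def mask_email(email: str) -> str:
    """Show only the first character of the local part and the domain."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"  # not an email address; hide it completely
    return f"{local[:1]}***@{domain}"


secret = b"demo-only-secret"  # hypothetical key; never hard-code real keys
token = tokenize("alice@example.com", secret)
masked = mask_email("alice@example.com")
print(token, masked)
```

The token lets widgets join and count by user without exposing the value, while the masked form is what a support operator would see in a table.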
10) Metrics data model (example)
`metric` `{name, unit, type: counter/gauge/hist, owner, sla_ref}`
`dim` `{region, tenant, product, provider, version, environment}`
`point` `{metric, value, ts, dims{}, trace_id, signature?}`
`event` `{type, severity, subject_id, payload_hash, receipt_hash, ts}`
`slo` `{name, target, window, burn_rate, owners[], runbook_url}`
`alert` `{slo_ref, condition, status, ack_by, acknowledged_at, runbook_step}`
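The `metric` and `point` records above could be expressed as typed structures with basic validation. This is a sketch under assumptions: the field types, the validation rules, and the sample values are not specified in the model and are chosen here only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Dimension names and metric types taken from the model above.
ALLOWED_DIMS = {"region", "tenant", "product", "provider", "version", "environment"}
METRIC_TYPES = {"counter", "gauge", "hist"}


@dataclass(frozen=True)
class Metric:
    name: str
    unit: str
    type: str
    owner: str
    sla_ref: str

    def __post_init__(self):
        if self.type not in METRIC_TYPES:
            raise ValueError(f"unknown metric type: {self.type}")


@dataclass(frozen=True)
class Point:
    metric: str
    value: float
    ts: float
    trace_id: str
    dims: dict = field(default_factory=dict)
    signature: Optional[str] = None  # optional, as in `signature?`

    def __post_init__(self):
        unknown = set(self.dims) - ALLOWED_DIMS
        if unknown:
            raise ValueError(f"unknown dimensions: {sorted(unknown)}")


# Hypothetical metric and data point.
latency = Metric("checkout_latency", "ms", "hist", "sre-team", "slo/checkout-p95")
point = Point(
    metric=latency.name,
    value=182.0,
    ts=1_700_000_000.0,
    trace_id="trace-123",
    dims={"region": "eu-west", "tenant": "acme"},
)
print(point)
```

Rejecting unknown dimensions at construction time keeps every point aligned with the shared dictionary of metrics, which is what makes slices comparable across modules.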
11) Dashboard API/webhooks
`POST /ingest/metrics` - receive metrics (schema, limits, authentication).
`POST /ingest/events` - business events (versions/signatures).
`GET /kpis?filters...` - aggregates for widgets.
`GET /traces/{trace_id}` - drill-down.
Webhooks: `IncidentRaised`, `QuotaCapReached`, `PriceMismatch`, `WebhookDeliveryLag`, `SecuritySoDViolation`.
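A webhook message with a payload hash and a signature might be assembled and verified as sketched below. The envelope layout, the canonical JSON encoding, and the use of HMAC-SHA256 with a shared secret are assumptions for illustration; the text does not specify the actual wire format or signing scheme.

```python
import hashlib
import hmac
import json


def canonical_json(payload: dict) -> bytes:
    """Serialize with sorted keys so equal payloads hash identically."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_webhook(event_type: str, payload: dict, secret: bytes, ts: float) -> dict:
    """Build a hypothetical webhook envelope with hash and signature."""
    body = canonical_json(payload)
    return {
        "type": event_type,
        "ts": ts,
        "payload": payload,
        "payload_hash": hashlib.sha256(body).hexdigest(),
        "signature": hmac.new(secret, body, hashlib.sha256).hexdigest(),
    }


def verify_webhook(message: dict, secret: bytes) -> bool:
    """Recompute the signature on the receiving side and compare safely."""
    body = canonical_json(message["payload"])
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["signature"])


secret = b"demo-only-secret"  # hypothetical shared secret
message = build_webhook(
    "QuotaCapReached", {"tenant": "acme", "usage_pct": 100}, secret, ts=1_700_000_000.0
)
print(verify_webhook(message, secret))
```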
12) Data quality and tests
Data contracts: schemas and validation on ingest, versioning (`expand → migrate → contract`).
Anomalies: monitoring for gaps and spikes, "flatline" and "noise" thresholds.
Sampling: for high-QPS metrics, sliding-window sampling that preserves representativeness.
Backfill: safe, version-tagged backfills.
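The "flatline" and spike checks can be approximated with very simple statistics. The sketch below is one possible heuristic, not a recommended detector: the function names, the minimum of five points, and the z-score threshold of 3.0 are assumptions.

```python
from statistics import mean, pstdev


def is_flatline(values, min_points=5):
    """True if enough points arrived and they are all identical."""
    return len(values) >= min_points and len(set(values)) == 1


def is_spike(history, latest, z_threshold=3.0):
    """True if `latest` deviates from the history by more than z_threshold sigmas."""
    if len(history) < 2:
        return False  # not enough data to judge
    sigma = pstdev(history)
    if sigma == 0:
        return latest != history[0]  # any change from a constant series
    return abs(latest - mean(history)) / sigma > z_threshold


# Hypothetical requests-per-second series.
history = [100, 102, 98, 101, 99]
print(is_spike(history, 180), is_spike(history, 101), is_flatline([5] * 6))
```

A flatline on a metric that normally fluctuates often signals a stuck exporter rather than a healthy system, which is why it is worth alerting on separately from spikes.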
13) Metrics of the dashboard itself (meta-metrics)
UI/API availability ≥ 99.9%.
API request latency p95 ≤ 300 ms.
Completeness: share of sources that delivered data within the window ≥ 99.5%.
Freshness: incremental update lag ≤ 30 s.
Correctness: discrepancy with reference reports ≤ 0.1%.
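Completeness and freshness are straightforward to compute once the set of expected sources is known. A minimal sketch, assuming hypothetical source identifiers and using the thresholds listed above (99.5% completeness, 30 s lag):

```python
def completeness(expected_sources, reporting_sources):
    """Share of expected sources that delivered data in the window."""
    expected = set(expected_sources)
    if not expected:
        return 1.0  # nothing expected, nothing missing
    return len(expected & set(reporting_sources)) / len(expected)


def freshness_lag_seconds(now_ts, last_update_ts):
    """Seconds since the last incremental update (never negative)."""
    return max(0.0, now_ts - last_update_ts)


def dashboard_health(expected, reporting, now_ts, last_update_ts):
    """Evaluate both meta-metrics against the targets from the list above."""
    share = completeness(expected, reporting)
    lag = freshness_lag_seconds(now_ts, last_update_ts)
    return {
        "completeness": share,
        "completeness_ok": share >= 0.995,
        "lag_s": lag,
        "freshness_ok": lag <= 30.0,
    }


# Hypothetical window: one of four sources is silent, last update 12 s ago.
status = dashboard_health(
    expected=["billing", "gateway", "webhooks", "catalog"],
    reporting=["billing", "gateway", "webhooks"],
    now_ts=1_000.0,
    last_update_ts=988.0,
)
print(status)
```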
14) Economics and FinOps in the dashboard
Cost per 1k events decomposed by provider/region.
Egress/Ingress heatmaps, caching/routing recommendations.
Budgets/cap-alerts: 80/90/100%, auto-throttling and prioritization.
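The cost-per-1k figure and the 80/90/100% cap alerts reduce to a few lines of arithmetic. The sketch below is illustrative; the function names and sample amounts are assumptions, and what happens at each level (notification, throttling, hard stop) is left to the platform's own policy.

```python
THRESHOLDS = (0.80, 0.90, 1.00)  # cap-alert levels from the text


def cost_per_1k(total_cost: float, events: int) -> float:
    """Cost of 1,000 events for a given provider/region slice."""
    if events <= 0:
        raise ValueError("events must be positive")
    return total_cost / events * 1000


def budget_alert_level(spend: float, budget: float):
    """Return the highest threshold crossed, or None if below all of them."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    usage = spend / budget
    crossed = [t for t in THRESHOLDS if usage >= t]
    return max(crossed) if crossed else None


# Hypothetical monthly figures for one provider.
unit_cost = cost_per_1k(total_cost=12.5, events=250_000)
level = budget_alert_level(spend=950.0, budget=1_000.0)
print(f"{unit_cost:.3f} per 1k events, alert level: {level}")
```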
15) Accessibility and UX
Dark theme, short captions, status icons.
Keyboard navigation and a11y: contrast, alt text, ARIA attributes.
Saved presets: "SRE on-call," "finance," "partner."
Snapshots and sharing: capture state with filters and link/export.
16) Risks and anti-patterns
Dashboard sprawl: 20 different dashboards without a single dictionary of metrics.
Vanity metrics: beautiful graphs with no connection to SLOs or actions.
Inconsistent figures: reports ≠ billing/audit.
Noisy alerts: alert fatigue and missed P1s.
No drill-down: there is no way to reach the raw data and root causes.
17) Implementation checklist
- Define roles and scenarios; agree on the North Star and SLI/SLO.
- Create a dictionary of metrics and units; formalize data contracts.
- Configure ingest (metrics/events/traces), OLAP, and WORM auditing.
- Implement key modules (health, partners, checkout, FinOps, Security).
- Enable alerts with runbooks and escalations, plus a "red button."
- Add rollback/pause/re-route/raise-limit actions.
- Build heat-map by region/tenant; filters and presets.
- Reconcile reported figures with billing/invoices.
- Run a GameDay: provider outage, retry storm, price desynchronization.
- Hold weekly SLO reviews and check post-mortem quality.
18) RACI
19) FAQ
Can all reports be replaced with a dashboard?
No. The dashboard is for operational work and actions; formal reporting and auditing remain separate artifacts.
How much "real time" do you need?
For incidents, seconds/minutes; for economics, minutes/hours. Consistency matters more than absolute "online."
How to deal with the noise of alerts?
SLO-oriented conditions, aggregation, deduplication by `trace_id`, prioritization and auto-runbooks.
How to check the correctness of metrics?
Regular reconciliations with reference reports, test feeds, control samples and WORM logs.
Summary: An operational dashboard is not a "pretty board" but a management tool: unified SLIs/SLOs, actions from the interface, drill-down to raw data, and strict consistency with billing and audit. Build it on an event-driven architecture, give context by role, add runbooks and escalations - and you get predictable operations, quick decisions, and sustainable growth.