Infrastructure Dashboards
1) Why you need them
A single picture of system state: from clusters and networks to databases and queues.
Fast RCA and post-mortems: linked metrics ↔ logs ↔ traces.
SLOs per service and per platform: control over availability and latency.
FinOps transparency: usage/cost by service, tenant, and environment.
Compliance/security: patch/vulnerability status, access, anomalies.
Methodologies: Golden Signals (latency, traffic, errors, saturation), RED (Rate, Errors, Duration) for requests, USE (Utilization, Saturation, Errors) for resources.
2) Principles of good dashboard
Actionable: each panel answers "what to do next."
Hierarchy: overview → domains → deep dive → raw.
Templates/variables: 'cluster', 'namespace', 'service', 'tenant', 'env'.
Uniform units: ms for latency, %, RPS, ops/sec, bytes.
Consistent timepicker: default 1-6 hours, fast presets 5m/15m/24h.
Drilldown: from a panel to logs (Loki/ELK) and traces (Tempo/Jaeger).
Ownership: each dashboard lists its owner, SLO, runbook link, and on-call contact.
3) Folder structure and roles
00_Overview - high-level overview of the platform.
10_Kubernetes - clusters, nodes, workloads, HPA/VPA, containers.
20_Network_Edge — Ingress/Envoy/Nginx, LB, DNS, CDN, WAF.
30_Storage_DB - PostgreSQL/MySQL, Redis, Kafka/RabbitMQ, object storage.
40_CICD_Runner - pipelines, agents, artifacts, registry.
50_Security_Compliance - vulnerabilities, patches, RBAC, audit events.
60_FinOps_Cost - cost per service/tenant/cluster, utilization.
99_Runbooks - links to instructions and SLO cards.
Roles: Platform-SRE (full access), Service-Owner (own spaces), Security/Compliance, Finance/FinOps, View-only.
4) Platform overview dashboard (Landing)
Goal: understand within ≤30 seconds whether everything is in order.
Recommended panels:
- Platform SLO (edge API availability): target, actual, error budget, burn rate.
- p50/p95/p99 latency by major entry points.
- 4xx/5xx errors and top endpoints with regressions.
- Resource saturation (CPU, RAM, network, disk) - p95 by cluster.
- Incidents/alerts (active) and recent releases.
- Cost/hour (approximate) and trend by week.
Variable templates: 'env', 'region', 'cluster', 'tenant'.
5) Kubernetes: clusters and workloads
Key groups:
1. Cluster/Nodes
CPU/Memory utilization, pressure (memory/cpu), disk I/O, inodes.
Subsystems: kube-api, etcd, controllers; kubelet health.
2. Workloads
RPS/RPM, latency p95, error rate, restarts, throttling, OOMKills.
HPA targets vs actual metrics.
3. Network path within cluster
eBPF/Netflow: top talkers, drops, retransmits.
4. Events K8s
Rate of Warning/FailedScheduling/BackOff events.
PromQL examples:

```promql
# API errors (5xx) by service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))

# Latency p95
histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

# Container CPU throttling
sum by (namespace, pod) (rate(container_cpu_cfs_throttled_seconds_total[5m]))
```
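Two more queries that are often useful on the workload panels (restarts/OOMKills and "HPA targets vs. actual"); the metric names below assume kube-state-metrics is installed and may differ in your setup:

```promql
# Containers whose last termination was an OOMKill (kube-state-metrics)
sum by (namespace, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"})

# HPA: gap between desired and current replicas (kube-state-metrics)
kube_horizontalpodautoscaler_status_desired_replicas
  - kube_horizontalpodautoscaler_status_current_replicas
```

A persistent non-zero gap on the second query usually means the HPA cannot scale (quota, node capacity, or pending pods) and is worth an alert of its own.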
6) Edge, network, and DNS
Panels:
- Ingress/Envoy/Nginx: RPS, p95, 4xx/5xx, upstream_errors, active_conns.
- LB/Anycast: traffic distribution by zone, failover events.
- DNS: resolution latency, NXDOMAIN/SERVFAIL rate, cache hit ratio.
- CDN/WAF: requests blocked by rules, anomalous traffic (bots/scrapers).

```promql
# Ingress requests by status code
sum by (status) (rate(nginx_http_requests_total[5m]))
```
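For the DNS panel, a resolution-latency sketch — this assumes CoreDNS with its Prometheus plugin enabled; other resolvers expose different metric names:

```promql
# DNS resolution latency p95 (CoreDNS metrics, assumed)
histogram_quantile(0.95, sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m])))

# SERVFAIL/NXDOMAIN share of all DNS responses
sum(rate(coredns_dns_responses_total{rcode=~"SERVFAIL|NXDOMAIN"}[5m]))
/
sum(rate(coredns_dns_responses_total[5m]))
```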
7) Databases and storages
PostgreSQL/MySQL: qps, latency, lock waits, replication lag, backups/failures.
Redis: hit ratio, evictions, memory, slow commands.
Kafka/RabbitMQ: lag by consumer groups, rebalances, unacked messages.
Object storage: requests, errors, egress, latency p95.
```promql
# Replication lag in seconds
max by (replica) (pg_replication_lag_seconds)

# Slow queries > 1s
rate(pg_stat_activity_longqueries_total[5m])
```

Kafka (example):

```promql
# Lag by consumer group
max by (topic, group) (kafka_consumergroup_lag)
```
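For the Redis panel, the hit ratio mentioned above can be computed directly — metric names here assume the common redis_exporter:

```promql
# Redis cache hit ratio (redis_exporter metric names, assumed)
rate(redis_keyspace_hits_total[5m])
/
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```

A sustained drop in this ratio usually precedes latency regressions in the services that sit in front of the cache.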
8) CI/CD and artifacts
Pipeline overview: success rate, duration, runner queue depth.
Deployment health: versions, canary/blue-green status, warm-up time.
Image registries: size, recent pushes, utilization.
```promql
# Pipeline success rate (metric names depend on your CI exporter)
rate(ci_pipeline_success_total[1h]) / rate(ci_pipeline_total[1h])
```
9) Security and compliance
Patches and vulnerabilities: proportion of nodes/images with critical CVEs, average "time to patch."
RBAC and secrets: unsuccessful access attempts, access to secrets.
Audit events: logins/changes in critical components, configuration drift.
WAF/DLP/PII audit: rule blocks, masking errors.
10) Logs and Traces: End-to-End View
Summary of errors from logs (Loki/ELK): top exceptions, new signatures.
Button "Go to logs with filters" (LogQL/ES query).
Traces: top slow spans, percentage of requests without trace context.
```logql
# Error lines containing "NullReference"
{app="api", level="error"} |= "NullReference"

# nginx 5xx count over 5m (parse JSON, filter on status)
sum(count_over_time({app="nginx"} | json | status=~"5.." [5m]))
```
11) FinOps: cost and utilization
Cost by services/tenants/clusters (according to billing/exporters).
Hot/cold nodes: idle resources, rightsizing recommendations (CPU/Mem).
Data egress, L7 requests and their cost.
Dynamics: week/month, forecast.
- cost_per_rps, cost_per_request, storage_cost_gb_day, idle_cost.
- efficiency factor: 'RPS/$' or 'SLO-minutes/$'.
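A sketch of `cost_per_rps` as a Prometheus expression — this assumes an OpenCost-style exporter exposing `node_total_hourly_cost` and is only an approximation (cluster-wide cost over cluster-wide traffic):

```promql
# Approximate cost per RPS: hourly node cost / current request rate
sum(node_total_hourly_cost)
/
sum(rate(http_requests_total[5m]))
```

Per-service or per-tenant versions require cost allocation from the billing exporter, not just node prices.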
12) SLO, error budgets, and burn rate
An SLO card on each domain dashboard: target, period, error budget.
Burn-rate alerts (two speeds: fast/slow).
```promql
# Error-budget burn: 5xx as a share of total traffic
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Burn rate (fast window, ~1h); SLO is your target, e.g. 0.999
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / (1 - SLO) > 14.4
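The slow companion to the fast rule is the same expression over a longer window; the 14.4/6 thresholds follow the common multiwindow burn-rate scheme and should be tuned to your SLO period:

```promql
# Burn rate (slow window, ~6h): catches gradual budget burn
(
  sum(rate(http_requests_total{status=~"5.."}[6h]))
  /
  sum(rate(http_requests_total[6h]))
) / (1 - SLO) > 6
```

In practice each speed is paired with a short confirmation window (e.g. 5m for fast, 30m for slow) to avoid flapping.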
13) Visualization standards
Panel types: time-series for series, stat for KPI, table for top-N, heatmap for latency.
Legends and units: required; shortened labels, SI format.
Color zones: green/yellow/red by SLO/threshold (uniform).
Panel description: what we measure, source, runbook link, owner.
14) Panel templates (quick start)
(A) API Overview
KPI: `RPS`, `p95`, `5xx%`, `error_budget_remaining`.
Top endpoints by error/latency.
Drilldown to logs via `trace_id=$trace`.
(B) Node Health
CPU/Memory/Disk/Network - p95 per node, list of "hot" nodes.
Pressure, throttling, packet drops.
(C) DB Health
TPS, latency p95, locks, replication lag, slow queries.
Backup status/latest success.
(D) Kafka Lag
Lag by group, consumption rate vs producing, rebalances.
(E) Cost & Util
Cost/hour by services, idle%, rightsizing hints, forecast.
15) Variables and tags (recommended set)
`env` (prod/stage/dev)
`region`/`az`
`cluster`
`namespace`/`service`/`workload`
`tenant`
`component` (edge/db/cache/queue)
`version` (release/git_sha)
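In Grafana these variables are typically fed from `label_values()` query variables against the Prometheus data source; the metric/label names below assume kube-state-metrics and are illustrative:

```promql
# Grafana query-variable definitions (Prometheus data source)
# cluster:   label_values(kube_node_info, cluster)
# namespace: label_values(kube_pod_info{cluster="$cluster"}, namespace)
# service:   label_values(http_requests_total{namespace="$namespace"}, service)
```

Chaining variables this way (`$cluster` → `$namespace` → `$service`) also keeps each dropdown small and each query scoped.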
16) Integration with alerting and incident management
Rules in Alertmanager/Grafana Alerting link to the relevant dashboard with variables pre-filled.
P1/P2 by SLO criteria, auto-assign to on-call.
Annotations of releases/incidents on graphs.
17) Quality of dashboards: checklist
- Owner and contact.
- SLO/thresholds are documented.
- Variables work and limit the size of queries.
- All panels with units and legend.
- Drilldown to logs/traces.
- Panels fit into 2-3 "screens" (no endless scrolling).
- Response time ≤2-3 s (caching, downsampling).
- No dead panels or degraded metrics.
18) Performance and cost of dashboards themselves
Downsampling/recording rules for heavy aggregations.
Caching (query-frontend/results cache) and range/step limits.
Test bench: load on TSDB/clusters from typical dashboard queries.
Label hygiene (low cardinality); avoid wildcard matchers.
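For heavy aggregations such as the p95 panels, a recording rule precomputes the expensive inner sum once; a sketch (the rule name is illustrative):

```promql
# Recording rule expression; register it in Prometheus e.g. as
#   record: service:http_request_duration_seconds_bucket:rate5m
sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))

# Dashboards then query the cheap precomputed series instead:
histogram_quantile(0.95, service:http_request_duration_seconds_bucket:rate5m)
```

This trades a little TSDB storage for dashboards that load in constant time regardless of the number of raw series.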
19) Implementation plan (iterations)
1. Week 1: Landing + K8s/Edge overviews, basic SLOs, owners.
2. Week 2: DB/Queues, log and trace integration (drilldown), burn-rate alerts.
3. Week 3: FinOps dashboards, rightsizing recommendations, cost report.
4. Week 4 +: Security/Compliance, SLO card autogeneration, dashboard regression tests.
20) Mini-FAQ
How many dashboards do you need?
At least 1 review + one per domain (K8s, Edge, DB, Queues, CI/CD, Security, Cost). The rest is by maturity.
What is more important - metrics or logs?
Metrics for symptoms and SLOs, logs for causes. Link them via `trace_id` and consistent labels.
How not to "drown" in the panels?
Hierarchy, explicit owners, metric hygiene, regular reviews and removal of "dead" panels.
Summary
Infrastructure dashboards are not "pretty graphs" but a management tool: SLO control, fast RCA, and informed FinOps. Standardize variables, visual patterns, and ownership; provide drilldown to logs/traces and automate burn-rate alerts. The result is predictability, faster reaction, and cost transparency across the whole platform.