Dashboards of infrastructure

1) Why do you need it

A single picture of the state: from cluster and networks to databases and queues.
Fast RCA and post-mortems: a bunch of metrics ↔ logs ↔ traces.
SLO by service and platform: control over availability and latency.
FinOps transparency: volume/cost by services, tenant and environments.
Compliance/security: status of patches/vulnerabilities, accesses, anomalies.

Methodologies: Golden Signals (latency, traffic, errors, saturation), RED (Rate, Errors, Duration) for requests, USE (Utilization, Saturation, Errors) for resources.

2) Principles of good dashboard

Actionable-Each panel responds to "what to do next."

Hierarchy: overview → domains → deep dive → raw.
Templates/variables: 'cluster', 'namespace', 'service', 'tenant', 'env'.
Uniform units: ms for latency,%, RPS, ops/sec, bytes.
Consistent timepicker: default 1-6 hours, fast presets 5m/15m/24h.
Drilldown: from the panel to the logs (Loki/ELK) and the track (Tempo/Jaeger).
Ownership: the owner is indicated on the dashboard, SLO, runbook, contact in on-call.

3) Folder structure and roles

00_Overview - high-level overview of the platform.
10_Kubernetes - clusters, nodes, workloads, HPA/VPA, containers.
20_Network_Edge — Ingress/Envoy/Nginx, LB, DNS, CDN, WAF.
30_Storage_DB - PostgreSQL/MySQL, Redis, Kafka/RabbitMQ, object storage.
40_CICD_Runner - pipelines, agents, artifacts, registry.
50_Security_Compliance - vulnerabilities, patches, RBAC, audit events.
60_FinOps_Cost - cost per service/tenant/cluster, disposal.
99_Runbooks - links to instructions and SLO cards.

Roles: Platform-SRE (full access), Service-Owner (own spaces), Security/Compliance, Finance/FinOps, View-only.

4) Platform overview dashboard (Landing)

Goal: in ≤30 seconds to understand if everything is in order.

Recommended panels:

SLO platform (API availability edge): target value, actual, era of errors, burn-rate.
p50/p95/p99 latency by major entry points.
4xx/5xx errors and top endpoints with regressions.
Resource saturation (CPU, RAM, network, disk) - p95 by cluster.
Incidents/alerts (active) and recent releases.
Cost/hour (approximate) and trend by week.

Variable templates: 'env', 'region', 'cluster', 'tenant'.

5) Kubernetes: clusters and workshops

Key groups:

1. Cluster/Nodes

CPU/Memory disposal, pressure (memory/cpu), IO disk, inode.
Subsystems: kube-api, etcd, controllers; kubelet health.

2. Vorkloads

RPS/RPM, latency p95, error rate, restarts, throttling, OOMKills.
HPA targets vs actual metrics.

3. Network path within cluster

eBPF/Netflow: top talkers, drops, retransmits.

4. Events K8s

Rate по Warning/FailedScheduling/BackOff.

Examples of PromQL:

promql
API (5xx) errors by sum by (service) (rate (http_requests_total{status=~"5"..}[5m]))

Latency p95 histogram_quantile (0. 95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

Throttling CPU контейнеров sum by (namespace, pod) (rate(container_cpu_cfs_throttled_seconds_total[5m]))

6) Edge, grid and DNS

Panels:

Ingress/Envoy/Nginx: RPS, p95, 4xx/5xx, upstream_errors, active_conns.
LB/Anycast: distribution of traffic by zones, failover events.
DNS: resolution latency, NXDOMAIN/SERVFAIL rate, hit-ratio cache.
CDN/WAF: blocked by rules, abnormal traffic (bots/scrappers).

Example (Nginx):

promql sum(rate(nginx_http_requests_total[5m])) by (status)

7) Databases and storages

PostgreSQL/MySQL: qps, latency, lock waits, replication lag, backups/failures.
Redis: hit ratio, evictions, memory, slow commands.
Kafka/RabbitMQ: lag by consumer groups, rebalances, unacked messages.
Object storage: queries, errors, egress, lat p95.

PostgreSQL (example):

promql
Replication lag in seconds max by (replica) (pg_replication_lag_seconds)

Slow Queries> 1s rate (pg_stat_activity_longqueries_total[5m])

Kafka (example):

promql
Lag by group max by (topic, group) (kafka_consumergroup_lag)

8) CI/CD and artifacts

Pipeline overview: success/runtime, runner queue.
Deployment health: versions, canary/blue-green status, warm-up time.
Image registers: size, last push 'and, disposal.

Example:

promql
Rate (ci_pipeline_success_total[1h] )/rate (ci_pipeline_total[1h]) success rate

9) Safety and compliance

Patches and vulnerabilities: proportion of nodes/images with critical CVEs, average "time to patch."

RBAC and secrets: unsuccessful access attempts, access to secrets.
Audit events: inputs/changes in critical components, drift.
WAF/DLP/PII revision: rule locks, masking errors.

10) Logs and Trails: End-to-End Review

Summary of errors from logs (Loki/ELK): top exceptions, new signatures.
Button "Go to logs with filters" (LogQL/ES query).
Traces: top slow spans, percentage of requests without trace context.

Examples of LogQL:


{app="api", level="error"}     = "NullReference"
{app="nginx"}      json      status="5.."      count_over_time([5m])

11) FinOps: cost and disposal

Cost by services/tenants/clusters (according to billing/exporters).
Hot/cold nodes: idle resources, rightsizing recommendations (CPU/Mem).
Data egress, L7 requests and their cost.
Dynamics: week/month, forecast.

Key metrics:

cost_per_rps, cost_per_request, storage_cost_gb_day, idle_cost.
efficiency factor: 'RPS/$' or 'SLO-minutes/$'.

12) SLO, bugs and burn-rate

SLO card on each domain dashboard: goal, period, errors (budget).
Burn-rate alerts (two speeds: fast/slow).

Examples of PromQL (error as "5xx or p95> threshold"):

promql
Bad budget: 5xx as a fraction of sum (rate (http_requests_total{status=~"5"..}[5m])) traffic
/
sum(rate(http_requests_total[5m]))

Burn-rate (fast channel ~ 1h)
(
sum(rate(http_requests_total{status=~"5.."}[1m])) /
sum(rate(http_requests_total[1m]))
) / (1 - SLO) > 14. 4

💡 Substitute your'SLO 'and multi-window, multi-burn coefficients.

13) Visualization standards

Panel types: time-series for series, stat for KPI, table for top-N, heatmap for latency.
Legends and units: required; shortened labels, SI format.
Color zones: green/yellow/red by SLO/threshold (uniform).
Panel description: what we measure, source, runbook link, owner.

14) Panel templates (quick start)

(A) API Overview

KPI: `RPS`, `p95`, `5xx%`, `error_budget_remaining`.
Top endpoints by error/latency.
Drilldown in the 'trace _ id = $ trace' logs.

(B) Node Health

CPU/Memory/Disk/Network - p95 by node, list of "hot."

Pressure, throttling, package drops.

(C) DB Health

TPS, latency p95, locks, replication lag, slow queries.
Backup status/latest success.

(D) Kafka Lag

Lag by group, consumption rate vs producing, rebalances.

(E) Cost & Util

Cost/hour by services, idle%, rightsizing hints, forecast.

15) Variables and tags (recommended set)

`env` (prod/stage/dev)

`region`/`az`

`cluster`

`namespace`/`service`/`workload`

`tenant`

`component` (edge/db/cache/queue)

`version` (release/git_sha)

16) Integration with alert and incident management

Rules in Alertmanager/Graphana alerts with links to the desired dashboard and already substituted variables.
P1/P2 by SLO criteria, auto-assign to on-call.
Annotations of releases/incidents on graphs.

17) Quality of dashboards: checklist

Owner and contact.
SLO/thresholds are documented.
Variables work and limit the size of queries.
All panels with units and legend.
Drilldown to logs/tracks.
The panels fit into 2-3 "screens" (without scrolling per kilometer).
Response time ≤2 -3 sec (cache, downsample).
No dead panels or degraded metrics.

18) Performance and cost of dashboards themselves

Downsampling/recording rules for heavy aggregations.
Caching (query-frontend/repeater) and range/step limits.
Test hangar: load on TSDB/clusters for typical dashboard requests.
Label sanitization (low cardinality), abandoning wildcards.

19) Implementation plan (iterations)

1. Week 1: Landing + K8s/Edge reviews, basic SLOs, owners.
2. Week 2: DB/Queues, log and trace integration (drilldown), burn-rate alerts.
3. Week 3: FinOps dashboards, rightsizing recommendations, cost report.
4. Week 4 +: Security/Compliance, SLO card autogeneration, dashboard regression tests.

20) Mini-FAQ

How many dashboards do you need?
At least 1 review + one per domain (K8s, Edge, DB, Queues, CI/CD, Security, Cost). The rest is by maturity.

What is more important - metrics or logs?
Metrics for symptoms and SLO, logs for causes. Bundle through 'trace _ id' and consistent labels.

How not to "drown" in the panels?
Hierarchy, explicit owners, metric hygiene, regular reviews and removal of "dead" panels.

Total

Infrastructure dashboards are not "beautiful graphs," but a management tool: SLO control, fast RCA and conscious FinOps. Standardize variables, visual patterns, and owners; provide drilldown to logs/tracks and automate burn-rate alerts. This will give predictability, reaction speed and cost transparency at the level of the entire platform.

Dashboards of infrastructure

(B) Node Health

(C) DB Health

(D) Kafka Lag

(E) Cost & Util

Total

Get in Touch

Quick Contact

The video will be updated soon

We are currently very busy with projects