Storage of time series
1) Why a separate architecture for time series
Time series are sequences of (timestamp, value) pairs with tags (labels), characterized by:
- High write (ingest) rate and frequency.
- Reads over time ranges (scans + aggregates/window functions).
- Explosive cardinality from tag combinations.
- The need for retention (limited storage lifetime) and downsampling (compaction over time).
Hence the need for a dedicated storage model, compression formats, and query protocols.
2) Data model and metrics contract
2.1 Naming and tags
metric_name: a stable snake_case noun phrase (`http_requests_total`, `cpu_usage_seconds_total`).
labels: attribute keys (`job`, `instance`, `dc`, `pod`, `status`, `method`).
Invariants: never change the semantics of an existing name; introduce a versioned name (`metric_v2`) for incompatible changes.
2.2 Series types
Gauge, Counter, Histogram/Summary, Event/Span.
For financial/density metrics, fix the units and the aggregability (summable vs averageable).
2.3 Retention and rollup policy
Hot detail (seconds to 1-10 min) → warm aggregates (5m/1h) → cold (1d/1w).
For counters, store rate/derivative aggregates.
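As a sketch of the hot → warm step, the following rolls raw 10-second gauge samples up into 5-minute aggregates; the `downsample` function is illustrative (real engines do this via continuous aggregates or compaction jobs):

```python
from collections import defaultdict

def downsample(samples, bucket_seconds=300):
    """Roll raw (ts, value) gauge samples up into fixed time buckets.

    Returns {bucket_start: (avg, min, max, count)} -- the aggregates a
    warm tier keeps after the raw points are dropped by retention.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: (sum(vals) / len(vals), min(vals), max(vals), len(vals))
        for start, vals in buckets.items()
    }

# ten minutes of 10-second samples -> two 5-minute buckets
raw = [(t, float(t)) for t in range(0, 600, 10)]
rollup = downsample(raw)
```

Counters need different treatment: as noted above, store rate/derivative aggregates rather than averaging raw counter values.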
3) Write path: ingestion, buffering, compaction
3.1 Ingest pipeline
Scrape (pull, Prometheus) or push (OTLP/StatsD/Graphite), often via a gateway/agent.
Buffer in a WAL (write-ahead log), then compact into segments/blocks (LSM-like architecture).
Batching and time-sorting improve compression and throughput.
3.2 Handling out-of-order samples and duplicates
Tolerance window (lateness window, e.g. 5-15 min) plus a policy: `drop | upsert | keep-last`.
Deduplication by `(series_id, timestamp)` with versioning, or "last record wins."
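A minimal sketch of this ingest-side policy; the `ingest` function and its in-memory buffer are illustrative, not any particular TSDB's API:

```python
def ingest(buffer, series_id, ts, value, now, lateness=900):
    """Apply an out-of-order policy before a sample reaches the WAL.

    Samples older than the lateness window (15 min here) are dropped;
    within the window, a duplicate (series_id, ts) resolves as
    "last record wins". Returns "drop" or "keep".
    """
    if ts < now - lateness:
        return "drop"                    # too late: outside the window
    buffer[(series_id, ts)] = value      # upsert: last record wins
    return "keep"

buf = {}
now = 10_000
assert ingest(buf, "cpu{host=a}", now - 60, 0.42, now) == "keep"
# duplicate timestamp: the newer value overwrites the older one
assert ingest(buf, "cpu{host=a}", now - 60, 0.55, now) == "keep"
# a sample older than the 15-minute window is rejected
assert ingest(buf, "cpu{host=a}", now - 2000, 0.10, now) == "drop"
```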
3.3 Compression
Delta-of-delta for timestamps, Gorilla/XOR for floats, RLE and varint for integers, dictionary encoding for tags.
The optimal block ("chunk") size of 1-8K points is a compromise between IOPS and CPU.
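The timestamp half of this scheme can be shown with a tiny encoder/decoder pair: a sketch of the idea behind Gorilla-style timestamp compression, before the bit-level packing that makes the zeros nearly free:

```python
def delta_of_delta(timestamps):
    """Encode timestamps as second-order deltas: with a steady scrape
    interval most entries are 0, which a bit-level codec stores in 1 bit."""
    out = [timestamps[0]]                 # first timestamp kept verbatim
    prev, prev_delta = timestamps[0], 0
    for ts in timestamps[1:]:
        delta = ts - prev
        out.append(delta - prev_delta)    # 0 for perfectly regular series
        prev, prev_delta = ts, delta
    return out

def decode(encoded):
    """Invert delta_of_delta, recovering the original timestamps."""
    ts, delta, out = encoded[0], 0, [encoded[0]]
    for dod in encoded[1:]:
        delta += dod
        ts += delta
        out.append(ts)
    return out

scrapes = [1000, 1015, 1030, 1045, 1061, 1076]   # 15 s interval, 1 s jitter
enc = delta_of_delta(scrapes)                    # [1000, 15, 0, 0, 1, -1]
```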
4) Storage engines: TSDB vs SQL/columnar
4.1 Dedicated TSDBs
Prometheus (local, short retention, PromQL, remote_write).
VictoriaMetrics/M3/InfluxDB: horizontal scaling, long retention, remote read.
Block formats are optimized for range scans plus time-window aggregations.
4.2 Relational/columnar engines
TimescaleDB (PostgreSQL): hypertables, chunks partitioned by time/space, continuous aggregates.
ClickHouse: MergeTree/TTL/materialized views, excellent compression and time aggregation.
Choose based on the query ecosystem (SQL vs PromQL), join/BI requirements, and the team's operational skills.
5) Schemas and examples
5.1 TimescaleDB: hypertable + continuous aggregate
```sql
CREATE TABLE metrics_cpu(
  ts timestamptz NOT NULL,
  host text NOT NULL,
  dc text NOT NULL,
  usage double precision NOT NULL,
  PRIMARY KEY (ts, host, dc)
);
SELECT create_hypertable('metrics_cpu', by_range('ts', INTERVAL '1 day'));
-- Continuous aggregate (5-minute buckets)
CREATE MATERIALIZED VIEW cpu_5m
WITH (timescaledb.continuous) AS
SELECT time_bucket('5 minutes', ts) AS ts5m, host, dc, avg(usage) AS avg_usage
FROM metrics_cpu GROUP BY 1, 2, 3;
-- Retention policies
SELECT add_retention_policy('metrics_cpu', INTERVAL '14 days');
SELECT add_retention_policy('cpu_5m', INTERVAL '180 days');
```
5.2 ClickHouse: aggregating storage
```sql
CREATE TABLE metrics_cpu (
  ts DateTime,
  host LowCardinality(String),
  dc LowCardinality(String),
  usage Float32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (host, dc, ts)
TTL ts + INTERVAL 14 DAY
SETTINGS index_granularity = 8192;
-- Hourly rollup. SummingMergeTree sums rows with equal keys on merge,
-- so store sum + count and derive the average at query time
CREATE MATERIALIZED VIEW cpu_1h
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (host, dc, ts)
POPULATE AS
SELECT toStartOfHour(ts) AS ts, host, dc,
       sum(usage) AS usage_sum, count() AS samples
FROM metrics_cpu GROUP BY ts, host, dc;
```
5.3 Prometheus/VictoriaMetrics: remote_write
```yaml
global:
  scrape_interval: 15s
remote_write:
  - url: http://vminsert:8480/insert/0/prometheus/api/v1/write
```
6) Cardinality: how not to "blow up" the storage
6.1 Rules
Limit label cardinality (the number of unique values). Never use `user_id`, `request_id`, `trace_id` as labels.
Normalize "multi-valued" tags (categories → codes).
Use LowCardinality types (in ClickHouse) and dictionaries/label indexes (in TSDBs).
6.2 Controls and alerts
Metrics: `series_count`, `label_values{label}`, top-N "expensive" series.
Write-rejection policies when the cardinality limit per tenant/job is exceeded.
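Such a write-rejection policy can be sketched as follows; `CardinalityGuard` and its limit are illustrative, not a real TSDB API:

```python
from collections import defaultdict

class CardinalityGuard:
    """Reject samples whose labels would push a label's unique-value
    count past a limit -- a per-tenant write-rejection policy sketch."""

    def __init__(self, max_values_per_label=1000):
        self.max_values = max_values_per_label
        self.seen = defaultdict(set)      # label name -> unique values

    def admit(self, labels):
        # check first, so a rejected sample leaves no partial state
        for name, value in labels.items():
            known = self.seen[name]
            if value not in known and len(known) >= self.max_values:
                return False              # would create an expensive new series
        for name, value in labels.items():
            self.seen[name].add(value)
        return True

guard = CardinalityGuard(max_values_per_label=2)
assert guard.admit({"dc": "eu-1", "status": "200"})
assert guard.admit({"dc": "eu-2", "status": "500"})
assert not guard.admit({"dc": "eu-3"})    # a third unique dc is rejected
```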
6.3 Histograms
For high-cardinality data, prefer storing aggregates (histogram buckets) and pre-rollups; compute quantiles online from the aggregates.
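Quantiles from stored buckets are then computed by locating the bucket containing the target rank and interpolating linearly inside it, in the spirit of PromQL's `histogram_quantile` (a simplified sketch assuming cumulative bucket counts):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from {upper_bound: cumulative_count} buckets
    by linear interpolation inside the bucket holding the target rank."""
    pairs = sorted(buckets.items())
    rank = q * pairs[-1][1]               # target position among all samples
    prev_bound, prev_count = 0.0, 0
    for bound, count in pairs:
        if count >= rank:
            if count == prev_count:       # empty bucket: no interpolation
                return bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return pairs[-1][0]

# latency buckets: 50 requests <= 0.1 s, 90 <= 0.5 s, 100 <= 1.0 s
p50 = histogram_quantile(0.50, {0.1: 50, 0.5: 90, 1.0: 100})   # 0.1
p95 = histogram_quantile(0.95, {0.1: 50, 0.5: 90, 1.0: 100})   # 0.75
```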
7) Retention, downsampling and tiered storage
7.1 Policies
Hot: 3-30 days at second/minute detail.
Warm: 90-365 days of 5m/1h aggregates.
Cold: years of daily aggregates, archived in object storage (S3/Glacier) as Parquet.
7.2 Techniques
Continuous aggregates (Timescale), materialized views (ClickHouse), retention + rollup jobs (Victoria/M3/Influx).
Tiered storage: "hot" blocks on local disk, "cold" blocks in object storage with a local cache.
8) Queries and languages
8.1 PromQL (example)
```promql
rate(http_requests_total{job="api",status=~"5.."}[5m])
```
Computes the rate of 5xx errors of the API over a 5-minute window.
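The core of that query — the per-second increase of a counter, tolerant of counter resets — can be sketched as follows (simplified: real PromQL `rate()` additionally extrapolates to the window boundaries):

```python
def counter_rate(samples, window):
    """Per-second rate of a monotonic counter over the trailing window.

    samples: time-ordered (ts, value) pairs. A value drop is treated as
    a counter reset (process restart), so counting resumes from zero.
    """
    cutoff = samples[-1][0] - window
    pts = [(ts, v) for ts, v in samples if ts >= cutoff]
    increase, prev = 0.0, pts[0][1]
    for _, v in pts[1:]:
        increase += v - prev if v >= prev else v   # reset: add raw value
        prev = v
    return increase / (pts[-1][0] - pts[0][0])

# counter resets between t=15 and t=30 (e.g. the process restarted)
series = [(0, 100.0), (15, 160.0), (30, 20.0), (45, 80.0)]
r = counter_rate(series, window=60)       # (60 + 20 + 60) / 45
```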
8.2 SQL window aggregates
```sql
SELECT time_bucket('1h', ts) AS hour,
       dc, avg(usage) AS avg_usage, max(usage) AS max_usage
FROM metrics_cpu
WHERE ts >= now() - interval '24 hours'
GROUP BY 1, 2 ORDER BY 1;
```
8.3 Anomalies (sketch)
Z-score/ESD over window statistics, STL decomposition for seasonality; store the results as a separate series (`anomaly = 1/0`).
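A minimal version of the z-score variant over a trailing window, emitting the 0/1 flag series mentioned above (the function and threshold are illustrative):

```python
from statistics import mean, stdev

def zscore_anomalies(values, window=10, threshold=3.0):
    """Flag points whose z-score against the trailing window exceeds
    the threshold; returns a parallel 0/1 anomaly series."""
    flags = [0] * len(values)
    for i in range(window, len(values)):
        ref = values[i - window:i]
        sd = stdev(ref)
        if sd > 0 and abs(values[i] - mean(ref)) / sd > threshold:
            flags[i] = 1
    return flags

series = [10.0, 10.2, 9.8, 10.1, 9.9] * 4   # noisy but stable gauge
series[15] = 100.0                           # injected spike
flags = zscore_anomalies(series, window=10)
```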
9) Integrations and protocols
OTLP (OpenTelemetry): metrics/traces/logs, exporters on agents (otel-collector) → TSDB/ClickHouse/object storage.
StatsD/Graphite: simple counters/timers; proxy at the edge, then convert to a single format.
Kafka/NATS: a buffer for ingest bursts and replay for backfill; consumers write in batches.
```text
kafka(topic=metrics) -> stream processor (normalize/tags) -> CH INSERT INTO metrics_cpu FORMAT RowBinary
```
10) Availability, HA and federation
HA pairs of TSDB replicas, or Prometheus federation (region → global level).
Remote read/write for long-term storage and centralized dashboards.
Shard by label/time: even distribution of ingest, locality by `dc`/`tenant`.
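Placement must be deterministic so every writer and reader agrees on where a series lives; hashing the series key is the usual approach. A sketch (`shard_for` is illustrative; hashing the full label set spreads ingest evenly, while hashing only `dc`/`tenant` preserves locality):

```python
import hashlib

def shard_for(series_key, num_shards):
    """Deterministically map a series key (name + sorted labels) to a shard."""
    digest = hashlib.sha256(series_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

key = 'http_requests_total{dc="eu-1",host="web-7"}'
assert shard_for(key, 4) == shard_for(key, 4)        # stable placement

# many distinct hosts spread roughly evenly across 4 shards
counts = [0] * 4
for i in range(1000):
    counts[shard_for(f'cpu{{host="h{i}"}}', 4)] += 1
assert all(c > 150 for c in counts)
```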
11) Observability of the storage itself
11.1 Metrics
Ingest: `samples/sec`, `append_latency`, `wal_fsync_ms`.
Storage: `blocks_count`, `compaction_queue_len`, `chunk_compression_ratio`.
Queries: `query_qps`, `scan_bytes`, `p95/p99_latency`, `alloc_bytes`.
Cardinality: `series_count`, top labels.
11.2 SLO
"p99 latency for a 1h range ≤ 200 ms at QPS ≤ 500."
"Ingest drops ≤ 0.01% under bursts up to X samples/sec."
"Compaction backlog < 10 min."
11.3 Alerts
`series_count` growth > Y%/hour.
Compaction/flush queue > threshold.
Out-of-order share > N%, dedup/late-drops.
12) Security and multi-tenancy
Isolation by `tenant` (a label in keys, separate tables/databases, quotas).
Label sanitization (no PII), size/value limits.
Encryption at rest and in transit; audit access to sensitive metrics.
13) Operational practices
Warm-up and cold start: pin "hot" blocks in cache, prefetch the last N hours.
Backfill: separate low-priority pipelines; do not mix with online ingest.
Schema versioning: migrations with parallel writes (dual-write) and a subsequent switchover.
Storage budget: track `cost_per_TB_month` plus a forecast of cardinality growth.
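The dual-write migration mentioned above can be sketched like this (the `DualWriter` class and dict-backed stores are hypothetical stand-ins for the old and new schemas):

```python
class DualWriter:
    """During a schema migration, write every sample to both the old
    and new stores; reads stay on the old store until the new one is
    backfilled and verified, then one flag flips the cutover."""

    def __init__(self, old_store, new_store):
        self.old, self.new = old_store, new_store
        self.read_from_new = False

    def write(self, key, ts, value):
        self.old[(key, ts)] = value
        self.new[(key, ts)] = value       # shadow write to the new schema

    def read(self, key, ts):
        store = self.new if self.read_from_new else self.old
        return store.get((key, ts))

old, new = {}, {}
w = DualWriter(old, new)
w.write("cpu{host=a}", 100, 0.5)
assert w.read("cpu{host=a}", 100) == 0.5  # served from the old store
w.read_from_new = True                    # cutover after backfill checks
assert w.read("cpu{host=a}", 100) == 0.5  # identical answer from the new one
```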
14) Anti-patterns
Tags with high cardinality (user_id, uuid) → series explosion.
"Eternal" series without retention → uncontrolled growth.
Unbatched/unsorted writes → poor compression and an IOPS storm.
Mixing OLTP and long scans on the same disk pool.
No out-of-order policy → duplicates and bloat.
Histograms with hundreds of buckets → 10× the cost without benefit.
15) Implementation checklist
- Define the metrics, their types and units; fix the name/label contract.
- Choose an engine (TSDB vs SQL/columnar) and a query language (PromQL/SQL).
- Design retention/rollup (hot/warm/cold) and ILM.
- Configure ingest: WAL/batching/sorting, out-of-order windows.
- Enable compression (delta-of-delta/XOR/RLE) with optimal chunk sizes.
- Control cardinality: quotas, alerts, rejection policies.
- Configure HA/federation and remote write/read.
- Build SLO dashboards and storage metrics (ingest/query/storage).
- Enforce security/tenant-isolation policies and keep PII out of labels.
- Run regular "game days": backfill, node loss, ingest surges.
16) FAQ
Q: What to choose for infrastructure monitoring: Prometheus or ClickHouse/Timescale?
A: For infrastructure monitoring with PromQL: Prometheus plus long-term storage (VictoriaMetrics/M3). For BI/warehouse scenarios and SQL: Timescale/ClickHouse.
Q: How to store quantiles without heavy summaries?
A: Use histograms with carefully chosen buckets and compute quantiles at query time; or use t-digest/CKMS sketches in aggregates.
Q: How to deal with out-of-order?
A: Introduce a tolerance window (5-15 min) and a deterministic dedup policy; for telemetry from mobile/edge devices, widen the window.
Q: When is a rollup needed?
A: Almost always beyond 30-90 days: aggregates shrink storage 10-100× and speed up analytics.
Q: Is it possible to mix logs and metrics?
A: Store them separately (the formats and queries differ). For correlation, use exemplars/TraceIDs and dashboards, but do not pour everything into one table.
17) Summary
Effective time series storage is a metrics contract plus tag discipline, a sound ingest path (WAL/compaction), compression, and retention/rollup policies. Add cardinality control, HA/federation, and observability of the store itself, and you get predictable p95 latency, reasonable cost over months to years, and resilience to surges.