Storage of time series
1) Why a separate architecture for time series
Time series are sequences of (timestamp, value) pairs with tags (labels), characterized by:
- High write (ingest) rate and frequency.
- Reads over time ranges (scans + aggregates/window functions).
- Explosive cardinality from tag combinations.
- The need for retention (limited storage lifetime) and downsampling (compaction over time).
Hence the need for a dedicated storage model, compression formats, and query protocols.
2) Data model and metrics contract
2.1 Naming and tags
metric_name: a stable snake_case noun phrase (`http_requests_total`, `cpu_usage_seconds_total`).
labels: attribute keys (`job`, `instance`, `dc`, `pod`, `status`, `method`).
Invariants: never change the semantics of an existing name; introduce a versioned name (`metric_v2`) for incompatible changes.
2.2 Series types
Gauge, Counter, Histogram/Summary, Event/Span.
For financial/density metrics, fix the units and the aggregability (summable vs averageable).
2.3 Retention and rollup policy
Hot detail (seconds to 1-10 min) → warm aggregates (5m/1h) → cold (1d/1w).
For counters, store rate/derivative aggregates.
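As a sketch of the hot → warm step, the following rolls raw 10-second gauge samples up into 5-minute aggregates; the `downsample` function is illustrative (real engines do this via continuous aggregates or compaction jobs):

```python
from collections import defaultdict

def downsample(samples, bucket_seconds=300):
    """Roll raw (ts, value) gauge samples up into fixed time buckets.

    Returns {bucket_start: (avg, min, max, count)} -- the aggregates a
    warm tier keeps after the raw points are dropped by retention.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: (sum(vals) / len(vals), min(vals), max(vals), len(vals))
        for start, vals in buckets.items()
    }

# ten minutes of 10-second samples -> two 5-minute buckets
raw = [(t, float(t)) for t in range(0, 600, 10)]
rollup = downsample(raw)
```

Counters need different treatment: as noted above, store rate/derivative aggregates rather than averaging raw counter values.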
3) Write path: ingestion, buffering, compaction
3.1 Ingest pipeline
Scrape (pull, Prometheus) or push (OTLP/StatsD/Graphite), often via a gateway/agent.
Buffer in a WAL (write-ahead log), then compact into segments/blocks (LSM-like architecture).
Batching and time-sorting improve compression and throughput.
3.2 Handling out-of-order samples and duplicates
Tolerance window (lateness window, e.g. 5-15 min) plus a policy: `drop | upsert | keep-last`.
Deduplication by `(series_id, timestamp)` with versioning, or "last record wins."
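A minimal sketch of this ingest-side policy; the `ingest` function and its in-memory buffer are illustrative, not any particular TSDB's API:

```python
def ingest(buffer, series_id, ts, value, now, lateness=900):
    """Apply an out-of-order policy before a sample reaches the WAL.

    Samples older than the lateness window (15 min here) are dropped;
    within the window, a duplicate (series_id, ts) resolves as
    "last record wins". Returns "drop" or "keep".
    """
    if ts < now - lateness:
        return "drop"                    # too late: outside the window
    buffer[(series_id, ts)] = value      # upsert: last record wins
    return "keep"

buf = {}
now = 10_000
assert ingest(buf, "cpu{host=a}", now - 60, 0.42, now) == "keep"
# duplicate timestamp: the newer value overwrites the older one
assert ingest(buf, "cpu{host=a}", now - 60, 0.55, now) == "keep"
# a sample older than the 15-minute window is rejected
assert ingest(buf, "cpu{host=a}", now - 2000, 0.10, now) == "drop"
```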
3.3 Compression
Delta-of-delta for timestamps, Gorilla/XOR for floats, RLE and varint for integers, dictionary encoding for tags.
The optimal block ("chunk") size of 1-8K points is a compromise between IOPS and CPU.
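The timestamp half of this scheme can be shown with a tiny encoder/decoder pair: a sketch of the idea behind Gorilla-style timestamp compression, before the bit-level packing that makes the zeros nearly free:

```python
def delta_of_delta(timestamps):
    """Encode timestamps as second-order deltas: with a steady scrape
    interval most entries are 0, which a bit-level codec stores in 1 bit."""
    out = [timestamps[0]]                 # first timestamp kept verbatim
    prev, prev_delta = timestamps[0], 0
    for ts in timestamps[1:]:
        delta = ts - prev
        out.append(delta - prev_delta)    # 0 for perfectly regular series
        prev, prev_delta = ts, delta
    return out

def decode(encoded):
    """Invert delta_of_delta, recovering the original timestamps."""
    ts, delta, out = encoded[0], 0, [encoded[0]]
    for dod in encoded[1:]:
        delta += dod
        ts += delta
        out.append(ts)
    return out

scrapes = [1000, 1015, 1030, 1045, 1061, 1076]   # 15 s interval, 1 s jitter
enc = delta_of_delta(scrapes)                    # [1000, 15, 0, 0, 1, -1]
```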
4) Storage engines: TSDB vs SQL/columnar
4.1 Dedicated TSDBs
Prometheus (local, short retention, PromQL, remote_write).
VictoriaMetrics/M3/InfluxDB: horizontal scaling, long retention, remote read.
Block formats are optimized for range scans plus time-window aggregations.
4.2 Relational/columnar engines
TimescaleDB (PostgreSQL): hypertables, chunks partitioned by time/space, continuous aggregates.
ClickHouse: MergeTree/TTL/materialized views, excellent compression and time aggregation.
Choose based on the query ecosystem (SQL vs PromQL), join/BI requirements, and the team's operational skills.
5) Schemas and examples
5.1 TimescaleDB: hypertable + continuous aggregate
```sql
CREATE TABLE metrics_cpu(
  ts timestamptz NOT NULL,
  host text NOT NULL,
  dc text NOT NULL,
  usage double precision NOT NULL,
  PRIMARY KEY (ts, host, dc)
);
SELECT create_hypertable('metrics_cpu', by_range('ts', INTERVAL '1 day'));
-- Continuous aggregate (5-minute buckets)
CREATE MATERIALIZED VIEW cpu_5m
WITH (timescaledb.continuous) AS
SELECT time_bucket('5 minutes', ts) AS ts5m, host, dc, avg(usage) AS avg_usage
FROM metrics_cpu GROUP BY 1, 2, 3;
-- Retention policies
SELECT add_retention_policy('metrics_cpu', INTERVAL '14 days');
SELECT add_retention_policy('cpu_5m', INTERVAL '180 days');
```
5.2 ClickHouse: aggregating storage
```sql
CREATE TABLE metrics_cpu (
  ts DateTime,
  host LowCardinality(String),
  dc LowCardinality(String),
  usage Float32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (host, dc, ts)
TTL ts + INTERVAL 14 DAY
SETTINGS index_granularity = 8192;
-- Hourly rollup. SummingMergeTree sums rows with equal keys on merge,
-- so store sum + count and derive the average at query time
CREATE MATERIALIZED VIEW cpu_1h
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (host, dc, ts)
POPULATE AS
SELECT toStartOfHour(ts) AS ts, host, dc,
       sum(usage) AS usage_sum, count() AS samples
FROM metrics_cpu GROUP BY ts, host, dc;
```
5.3 Prometheus/VictoriaMetrics: remote_write
```yaml
global:
  scrape_interval: 15s
remote_write:
  - url: http://vminsert:8480/insert/0/prometheus/api/v1/write
```
6) Cardinality: how not to "blow up" the storage
6.1 Rules
Limit label cardinality (the number of unique values). Never use `user_id`, `request_id`, `trace_id` as labels.
Normalize "multi-valued" tags (categories → codes).
Use LowCardinality types (in ClickHouse) and dictionaries/label indexes (in TSDBs).
6.2 Controls and alerts
Metrics: `series_count`, `label_values{label}`, top-N "expensive" series.
Write-rejection policies when the cardinality limit per tenant/job is exceeded.
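Such a write-rejection policy can be sketched as follows; `CardinalityGuard` and its limit are illustrative, not a real TSDB API:

```python
from collections import defaultdict

class CardinalityGuard:
    """Reject samples whose labels would push a label's unique-value
    count past a limit -- a per-tenant write-rejection policy sketch."""

    def __init__(self, max_values_per_label=1000):
        self.max_values = max_values_per_label
        self.seen = defaultdict(set)      # label name -> unique values

    def admit(self, labels):
        # check first, so a rejected sample leaves no partial state
        for name, value in labels.items():
            known = self.seen[name]
            if value not in known and len(known) >= self.max_values:
                return False              # would create an expensive new series
        for name, value in labels.items():
            self.seen[name].add(value)
        return True

guard = CardinalityGuard(max_values_per_label=2)
assert guard.admit({"dc": "eu-1", "status": "200"})
assert guard.admit({"dc": "eu-2", "status": "500"})
assert not guard.admit({"dc": "eu-3"})    # a third unique dc is rejected
```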
6.3 Histograms
For high-cardinality data, prefer storing aggregates (histogram buckets) and pre-rollups; compute quantiles online from the aggregates.
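Quantiles from stored buckets are then computed by locating the bucket containing the target rank and interpolating linearly inside it, in the spirit of PromQL's `histogram_quantile` (a simplified sketch assuming cumulative bucket counts):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from {upper_bound: cumulative_count} buckets
    by linear interpolation inside the bucket holding the target rank."""
    pairs = sorted(buckets.items())
    rank = q * pairs[-1][1]               # target position among all samples
    prev_bound, prev_count = 0.0, 0
    for bound, count in pairs:
        if count >= rank:
            if count == prev_count:       # empty bucket: no interpolation
                return bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return pairs[-1][0]

# latency buckets: 50 requests <= 0.1 s, 90 <= 0.5 s, 100 <= 1.0 s
p50 = histogram_quantile(0.50, {0.1: 50, 0.5: 90, 1.0: 100})   # 0.1
p95 = histogram_quantile(0.95, {0.1: 50, 0.5: 90, 1.0: 100})   # 0.75
```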
7) Retention, downsampling and tiered storage
7.1 Policies
Hot: 3-30 days at second/minute detail.
Warm: 90-365 days of 5m/1h aggregates.
Cold: years of daily aggregates, archived in object storage (S3/Glacier) as Parquet.
7.2 Techniques
Continuous aggregates (Timescale), materialized views (ClickHouse), retention + rollup jobs (Victoria/M3/Influx).
Tiered storage: "hot" blocks on local disk, "cold" blocks in object storage with a local cache.
8) Queries and languages
8.1 PromQL (example)
```promql
rate(http_requests_total{job="api",status=~"5.."}[5m])
```
Computes the rate of 5xx errors of the API over a 5-minute window.
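The core of that query — the per-second increase of a counter, tolerant of counter resets — can be sketched as follows (simplified: real PromQL `rate()` additionally extrapolates to the window boundaries):

```python
def counter_rate(samples, window):
    """Per-second rate of a monotonic counter over the trailing window.

    samples: time-ordered (ts, value) pairs. A value drop is treated as
    a counter reset (process restart), so counting resumes from zero.
    """
    cutoff = samples[-1][0] - window
    pts = [(ts, v) for ts, v in samples if ts >= cutoff]
    increase, prev = 0.0, pts[0][1]
    for _, v in pts[1:]:
        increase += v - prev if v >= prev else v   # reset: add raw value
        prev = v
    return increase / (pts[-1][0] - pts[0][0])

# counter resets between t=15 and t=30 (e.g. the process restarted)
series = [(0, 100.0), (15, 160.0), (30, 20.0), (45, 80.0)]
r = counter_rate(series, window=60)       # (60 + 20 + 60) / 45
```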
8.2 SQL window aggregates
```sql
SELECT time_bucket('1h', ts) AS hour,
       dc, avg(usage) AS avg_usage, max(usage) AS max_usage
FROM metrics_cpu
WHERE ts >= now() - interval '24 hours'
GROUP BY 1, 2 ORDER BY 1;
```
8.3 Anomalies (sketch)
Z-score/ESD over window statistics, STL decomposition for seasonality; store the results as a separate series (`anomaly = 1/0`).
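A minimal version of the z-score variant over a trailing window, emitting the 0/1 flag series mentioned above (the function and threshold are illustrative):

```python
from statistics import mean, stdev

def zscore_anomalies(values, window=10, threshold=3.0):
    """Flag points whose z-score against the trailing window exceeds
    the threshold; returns a parallel 0/1 anomaly series."""
    flags = [0] * len(values)
    for i in range(window, len(values)):
        ref = values[i - window:i]
        sd = stdev(ref)
        if sd > 0 and abs(values[i] - mean(ref)) / sd > threshold:
            flags[i] = 1
    return flags

series = [10.0, 10.2, 9.8, 10.1, 9.9] * 4   # noisy but stable gauge
series[15] = 100.0                           # injected spike
flags = zscore_anomalies(series, window=10)
```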
9) Integrations and protocols
OTLP (OpenTelemetry): metrics/traces/logs, exporters on agents (otel-collector) → TSDB/ClickHouse/object storage.
StatsD/Graphite: simple counters/timers; proxy at the edge, then convert to a single format.
Kafka/NATS: a buffer for ingest bursts and replay for backfill; consumers write in batches.
```text
kafka(topic=metrics) -> stream processor (normalize/tags) -> CH INSERT INTO metrics_cpu FORMAT RowBinary
```
10) Availability, HA and federation
HA pairs of TSDB replicas, or Prometheus federation (region → global level).
Remote read/write for long-term storage and centralized dashboards.
Shard by label/time: even distribution of ingest, locality by `dc`/`tenant`.
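Placement must be deterministic so every writer and reader agrees on where a series lives; hashing the series key is the usual approach. A sketch (`shard_for` is illustrative; hashing the full label set spreads ingest evenly, while hashing only `dc`/`tenant` preserves locality):

```python
import hashlib

def shard_for(series_key, num_shards):
    """Deterministically map a series key (name + sorted labels) to a shard."""
    digest = hashlib.sha256(series_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

key = 'http_requests_total{dc="eu-1",host="web-7"}'
assert shard_for(key, 4) == shard_for(key, 4)        # stable placement

# many distinct hosts spread roughly evenly across 4 shards
counts = [0] * 4
for i in range(1000):
    counts[shard_for(f'cpu{{host="h{i}"}}', 4)] += 1
assert all(c > 150 for c in counts)
```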
11) Observability of the storage itself
11.1 Metrics
Ingest: `samples/sec`, `append_latency`, `wal_fsync_ms`.
Storage: `blocks_count`, `compaction_queue_len`, `chunk_compression_ratio`.
Queries: `query_qps`, `scan_bytes`, `p95/p99_latency`, `alloc_bytes`.
Cardinality: `series_count`, top labels.
11.2 SLO
"p99 latency for a 1h range ≤ 200 ms at QPS ≤ 500."
"Ingest drops ≤ 0.01% under bursts up to X samples/sec."
"Compaction backlog < 10 min."
11.3 Alerts
`series_count` growth > Y%/hour.
Compaction/flush queue > threshold.
Out-of-order share > N%, dedup/late-drops.
12) Security and multi-tenancy
Isolation by `tenant` (a label in keys, separate tables/databases, quotas).
Label sanitization (no PII), size/value limits.
Encryption at rest and in transit; audit access to sensitive metrics.
13) Operational practices
Warm-up and cold start: pin "hot" blocks in cache, prefetch the last N hours.
Backfill: separate low-priority pipelines; do not mix with online ingest.
Schema versioning: migrations with parallel writes (dual-write) and a subsequent switchover.
Storage budget: track `cost_per_TB_month` plus a forecast of cardinality growth.
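The dual-write migration mentioned above can be sketched like this (the `DualWriter` class and dict-backed stores are hypothetical stand-ins for the old and new schemas):

```python
class DualWriter:
    """During a schema migration, write every sample to both the old
    and new stores; reads stay on the old store until the new one is
    backfilled and verified, then one flag flips the cutover."""

    def __init__(self, old_store, new_store):
        self.old, self.new = old_store, new_store
        self.read_from_new = False

    def write(self, key, ts, value):
        self.old[(key, ts)] = value
        self.new[(key, ts)] = value       # shadow write to the new schema

    def read(self, key, ts):
        store = self.new if self.read_from_new else self.old
        return store.get((key, ts))

old, new = {}, {}
w = DualWriter(old, new)
w.write("cpu{host=a}", 100, 0.5)
assert w.read("cpu{host=a}", 100) == 0.5  # served from the old store
w.read_from_new = True                    # cutover after backfill checks
assert w.read("cpu{host=a}", 100) == 0.5  # identical answer from the new one
```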
14) Anti-patterns
Tags with high cardinality (user_id, uuid) → series explosion.
"Eternal" series without retention → uncontrolled growth.
Unbatched/unsorted writes → poor compression and an IOPS storm.
Mixing OLTP and long scans on the same disk pool.
No out-of-order policy → duplicates and bloat.
Histograms with hundreds of buckets → 10× the cost without benefit.
15) Implementation checklist
- Define the metrics, their types and units; fix the name/label contract.
- Choose an engine (TSDB vs SQL/columnar) and a query language (PromQL/SQL).
- Design retention/rollup (hot/warm/cold) and ILM.
- Configure ingest: WAL/batching/sorting, out-of-order windows.
- Enable compression (delta-of-delta/XOR/RLE) with optimal chunk sizes.
- Control cardinality: quotas, alerts, rejection policies.
- Configure HA/federation and remote write/read.
- Build SLO dashboards and storage metrics (ingest/query/storage).
- Enforce security/tenant-isolation policies and keep PII out of labels.
- Run regular "game days": backfill, node loss, ingest surges.
16) FAQ
Q: What to choose for infrastructure monitoring: Prometheus or ClickHouse/Timescale?
A: For infrastructure monitoring with PromQL: Prometheus plus long-term storage (VictoriaMetrics/M3). For BI/warehouse scenarios and SQL: Timescale/ClickHouse.
Q: How to store quantiles without heavy summaries?
A: Use histograms with carefully chosen buckets and compute quantiles at query time; or use t-digest/CKMS sketches in aggregates.
Q: How to deal with out-of-order?
A: Introduce a tolerance window (5-15 min) and a deterministic dedup policy; for telemetry from mobile/edge devices, widen the window.
Q: When is a rollup needed?
A: Almost always beyond 30-90 days: aggregates shrink storage 10-100× and speed up analytics.
Q: Is it possible to mix logs and metrics?
A: Store them separately (the formats and queries differ). For correlation, use exemplars/TraceIDs and dashboards, but do not pour everything into one table.
17) Summary
Effective time series storage is a metrics contract plus tag discipline, a sound ingest path (WAL/compaction), compression, and retention/rollup policies. Add cardinality control, HA/federation, and observability of the store itself, and you get predictable p95 latency, reasonable cost over months to years, and resilience to surges.