Compressing analytical data
1) Why compress analytical data
Compression reduces storage and traffic and speeds up scans through less IO and better caching. The price is CPU and, sometimes, more complex updates. The goal is the optimal "IO↔CPU↔accuracy↔cost" trade-off for your SLOs.
Base metrics (worked example below):
- Compression Ratio (CR) = `raw_size / compressed_size`.
- Scan Cost ≈ `bytes_scanned / storage_throughput + cpu_decode_time`.
- Total Cost = `storage_cost + compute_cost + egress_cost`.
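A tiny worked example of these formulas in Python; every input (sizes, throughput, prices) is a purely illustrative assumption, not a benchmark:

```python
# Back-of-the-envelope numbers only; all inputs are assumptions.
raw_size_gb = 1000.0          # uncompressed column data
compressed_size_gb = 250.0    # after Parquet + ZSTD (assumed)

compression_ratio = raw_size_gb / compressed_size_gb            # CR = 4.0

storage_throughput_gbps = 1.0   # sequential read throughput, GB/s (assumed)
cpu_decode_time_s = 40.0        # time to decode the scanned bytes (assumed)
scan_cost_s = compressed_size_gb / storage_throughput_gbps + cpu_decode_time_s

storage_cost = compressed_size_gb / 1000 * 23.0   # $/TB-month (assumed object-store rate)
compute_cost = 5.0                                # $ per full scan (assumed)
egress_cost = 0.0                                 # intra-region reads (assumed)
total_cost = storage_cost + compute_cost + egress_cost

print(f"CR={compression_ratio:.1f}, scan≈{scan_cost_s:.0f} s, monthly total≈${total_cost:.2f}")
```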
2) Layers where compression lives
1. At the format level: Parquet/ORC/Avro (pages/stripes/columns).
2. At the column-encoding level: Dictionary, RLE, Delta, FoR/Bit-packing, Gorilla/XOR.
3. At the codec level: ZSTD, Snappy, LZ4, Gzip.
4. At the query/engine level: vectorization, page skipping (min/max), bloom/zone-map.
5. At the storage level: tiered storage (hot/warm/cold), compression, page cache.
3) Column formats and their advantages
Parquet: columnar pages; dictionary encoding, RLE/bit-packing, min/max and null-count statistics.
ORC: stripes with stream-level indexes and bloom filters; effective for long scans.
Avro (row-oriented): convenient for streams/logs, worse for analytical scans.
Practice: use Parquet/ORC by default for analytics; enable column statistics and dictionary encoding where cardinality is low/medium (see the sketch below).
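A minimal sketch of such a default in Python with pyarrow; the column names, values, and output path are illustrative assumptions:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny illustrative table; tenant_id is low-cardinality, so dictionary encoding pays off.
table = pa.table({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "tenant_id": ["a", "a", "b"],
    "amount": [10.5, 3.2, 7.7],
})

pq.write_table(
    table,
    "sales.parquet",
    compression="zstd",       # per-column-chunk codec
    use_dictionary=True,      # dictionary encoding for low/medium-cardinality columns
    write_statistics=True,    # min/max/null-count for page and row-group skipping
    row_group_size=128_000,   # rows per row group; tune to your scan patterns
)
```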
4) Column encodings (lossless)
Dictionary: replaces values with indexes (ideal for low cardinality).
RLE (Run-Length Encoding): collapses runs of duplicates into (value, run) pairs. Good for sorted/clustered columns.
Delta / Delta-of-Delta: stores differences between neighboring values (numbers/timestamps).
FoR (Frame-of-Reference) + Bit-packing: value = base + offset; the offset is packed into N bits.
Gorilla/XOR (time series): stores the XOR of neighboring values with variable length; good for metrics.
Null bitmasks: a separate null stream improves CR.
Tip: pre-clustering and sorting by filter keys dramatically improve RLE, zone maps, and CR (see the sketch below).
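A pure-Python sketch of two of these encodings (RLE and delta), purely for intuition; real formats implement them at the page level with bit-packing on top:

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encoding: (value, run_length) pairs; shines on sorted/clustered columns."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def delta_encode(values):
    """Delta encoding: first value plus successive differences; good for monotonic ints/timestamps."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

status = ["ok", "ok", "ok", "ok", "error", "ok", "ok"]
ts = [1700000000, 1700000010, 1700000020, 1700000030]

print(rle_encode(status))   # [('ok', 4), ('error', 1), ('ok', 2)]
print(delta_encode(ts))     # [1700000000, 10, 10, 10]
```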
5) General purpose codecs
ZSTD: best CR at a moderate CPU cost; supports levels 1-22. The universal choice.
Snappy: fast, low CR; suitable for hot data with a high read frequency.
LZ4: even faster than Snappy, similar CR; often used for streams/logs/caches.
Gzip/Deflate: high CR, high CPU cost; rarely justified in interactive analytics.
Rule of thumb: hot tier - Snappy/LZ4; warm/cold - ZSTD (levels 3-7). See the benchmark sketch below.
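A rough benchmark sketch for comparing these codecs on your own column data in Python; it assumes the third-party packages zstandard, lz4, and python-snappy are installed, and the sample bytes are a synthetic stand-in for real data:

```python
import gzip
import time

import lz4.frame      # pip install lz4 (assumed available)
import snappy         # pip install python-snappy (assumed available)
import zstandard      # pip install zstandard (assumed available)

def measure(name, compress, data):
    """Compress once and report the compression ratio and encode time."""
    t0 = time.perf_counter()
    out = compress(data)
    dt_ms = (time.perf_counter() - t0) * 1000
    print(f"{name:10s} CR={len(data) / len(out):5.2f}  encode={dt_ms:7.1f} ms")

# Synthetic, repetitive stand-in for real column bytes; use a real sample in practice.
data = b"2024-01-01,tenant-a,status=ok,latency_ms=123\n" * 200_000

measure("snappy", snappy.compress, data)
measure("lz4", lz4.frame.compress, data)
measure("zstd-3", zstandard.ZstdCompressor(level=3).compress, data)
measure("zstd-7", zstandard.ZstdCompressor(level=7).compress, data)
measure("gzip-6", lambda d: gzip.compress(d, compresslevel=6), data)
```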
6) Time series and logs
TSDBs/columnar databases: Gorilla/XOR, Delta+RLE+bitmap combinations, sparse runs for rare signals.
Logs: JSON → Parquet + ZSTD; normalize keys and types (do not store numbers as strings).
Downsampling and roll-ups (lossy): store aggregates per window (1m/5m/1h) in the hot tier and raw data in cold (see the sketch below).
Sketch structures: HLL (cardinality), TDigest/KLL (quantiles), Count-Min Sketch (frequencies) - compact but approximate.
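A minimal downsampling sketch in Python with pandas; the timestamps, values, and window sizes are illustrative assumptions mirroring the 1m/1h tiers above:

```python
import pandas as pd

# Raw metric samples with a DatetimeIndex (stand-in data).
raw = pd.DataFrame(
    {"value": [1.0, 2.0, 4.0, 8.0]},
    index=pd.to_datetime([
        "2024-01-01 00:00:05", "2024-01-01 00:00:35",
        "2024-01-01 00:01:10", "2024-01-01 00:05:20",
    ]),
)

# Lossy roll-ups: keep these aggregates in the hot tier, move raw samples to cold storage.
rollup_1m = raw.resample("1min")["value"].agg(["mean", "min", "max", "count"])
rollup_1h = raw.resample("1h")["value"].agg(["mean", "min", "max", "count"])

print(rollup_1m)
print(rollup_1h)
```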
7) Lossless vs Lossy (when you can lose accuracy)
Lossless - reporting, finance, audit.
Lossy - monitoring, A/B analytics over large windows, telemetry (with explicit labeling!).
Quality control: set a tolerance (e.g. P99 ± 0.5 pp) and check it in CI (example check below).
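A minimal sketch of such a CI check in Python: compare a quantile from the lossless golden set against the same quantile from the lossy/rollup path and flag drift beyond the budget. The data, metric, and tolerance here are illustrative assumptions:

```python
import numpy as np

TOLERANCE_PCT = 0.5   # allowed relative drift of P99, in percent (assumed budget)

rng = np.random.default_rng(42)
golden_latencies_ms = rng.lognormal(mean=4.0, sigma=0.5, size=200_000)  # stand-in golden set
lossy_latencies_ms = golden_latencies_ms[::10]                          # stand-in lossy path

p99_golden = np.percentile(golden_latencies_ms, 99)
p99_lossy = np.percentile(lossy_latencies_ms, 99)
drift_pct = abs(p99_lossy - p99_golden) / p99_golden * 100

status = "OK" if drift_pct <= TOLERANCE_PCT else "FAIL"
print(f"P99 drift {drift_pct:.3f}% (budget {TOLERANCE_PCT}%) -> {status}")
```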
8) Partitioning, pages and compaction
Partitions: by date/region/tenant → fewer bytes scanned, better CR.
Page/stripe size: 64-256 KB per page, 64-512 MB per file - a balance between seeks and CPU.
Compaction: merge small files (the small-files problem) → higher CR and faster scans (sketch below).
Zone maps/Bloom filters: speed up page skipping; effective when data is sorted by the filter keys.
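A minimal compaction sketch in Python with pyarrow; the paths, partition values, and target sizes are hypothetical, and it assumes a single partition fits in memory:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Read all the small files of one partition (hypothetical lake layout).
small_files = ds.dataset("lake/sales/date=2024-01-01/", format="parquet")
table = small_files.to_table()   # fine while one partition fits in memory

# Rewrite as a single larger file with bigger row groups.
pq.write_table(
    table,
    "lake/sales_compacted/date=2024-01-01/part-0.parquet",
    compression="zstd",
    row_group_size=1_000_000,    # larger row groups -> better CR and fewer seeks
    write_statistics=True,       # keep min/max for zone-map-style skipping
)
```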
9) Compression and encryption/privacy
Order of operations: compress first, then encrypt; otherwise CR ≈ 1 (demonstrated below).
TDE/at-rest encryption does not hurt CR (the already-compressed block is encrypted).
In-transit encryption (TLS) does not affect the storage format.
PII masking/tokenization prior to compression keeps entropy manageable.
Be careful with OPE/deterministic encryption: it may degrade CR and/or leak information.
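A small Python demonstration of why the order matters, assuming the zstandard and cryptography packages and synthetic, highly repetitive data:

```python
import zstandard
from cryptography.fernet import Fernet

data = b"2024-01-01,tenant-a,ok,10.5\n" * 50_000   # repetitive, highly compressible stand-in
f = Fernet(Fernet.generate_key())
cctx = zstandard.ZstdCompressor(level=3)

compress_then_encrypt = f.encrypt(cctx.compress(data))   # correct order
encrypt_then_compress = cctx.compress(f.encrypt(data))   # ciphertext is incompressible

print(f"compress->encrypt: CR={len(data) / len(compress_then_encrypt):.1f}")   # high CR
print(f"encrypt->compress: CR={len(data) / len(encrypt_then_compress):.1f}")   # ≈ 1
```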
10) Cost and SLO (economics)
Storage: fewer bytes → lower $/TB-month.
Compute: less IO → faster scans, but decompression costs CPU.
Egress: fewer bytes → less traffic and shorter copy times.
SLO trade-off: pick the codec/level so that `p95_latency` stays within the target window.
```yaml
hot:
  format: parquet
  codec: snappy
  target_p95_ms: 1000
  max_scan_mb: 2048
warm:
  format: parquet
  codec: zstd:4
  target_p95_ms: 2500
  compaction: daily
cold:
  format: parquet
  codec: zstd:7
  glacier: true
  retention: 365d
```
11) Practices for engines (ClickHouse/Snowflake/BigQuery/Redshift/Presto)
ClickHouse: per-column CODECs (LZ4/ZSTD/DoubleDelta), ORDER BY for RLE-friendly sorting and fast scans, TTL/compression policies (sketch below).
Snowflake/BigQuery: storage format and compression are managed automatically; help the engine by clustering on (date, tenant, filter keys).
Redshift/Presto/Trino: Parquet/ORC with ZSTD, settings such as `hive.exec.compress.output`, statistics, and sensible file splitting.
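A hedged sketch of the ClickHouse practice in Python via the clickhouse-driver package; the table name, columns, codecs, and TTL are illustrative assumptions to adapt to your data:

```python
from clickhouse_driver import Client   # pip install clickhouse-driver (assumed available)

client = Client(host="localhost")      # connection details are assumptions

# Per-column CODECs plus ORDER BY so that sorted runs compress well and scans can skip.
client.execute("""
CREATE TABLE IF NOT EXISTS metrics
(
    ts        DateTime        CODEC(DoubleDelta, ZSTD(3)),
    tenant_id LowCardinality(String),
    value     Float64         CODEC(Gorilla, ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (tenant_id, ts)
TTL ts + INTERVAL 365 DAY DELETE
""")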
12) Pipelines: where to include compression
Ingest: write compressed batches (ZSTD/LZ4) into the lake.
Transform/dbt: build columnar targets with the desired codec and sort order.
Serve/OLAP: materialized views with a suitable codec; pre-aggregates for hot dashboards.
Export: for CSV/JSON use gzip/zstd; better yet, hand out Parquet directly.
13) Testing and validation
A/B profiling: run a fixed query set → compare p50/p95, bytes scanned, CPU time, CR (sketch below).
Golden sets: verify correctness after re-encoding/recompression.
Regression perf tests: alert if p95 grows by more than X% after a codec/level change.
DQ rules: types/ranges/NULL rate must not change after reloading.
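A minimal sketch of the comparison step in Python; the timings and bytes-scanned figures are illustrative stand-ins for numbers pulled from your engine's query logs:

```python
import numpy as np

# Per-variant results collected from the same query set (stand-in numbers).
results = {
    "parquet+snappy": {"timings_ms": [420, 450, 510, 800, 1210], "bytes_scanned_gb": 38.0},
    "parquet+zstd3":  {"timings_ms": [460, 480, 540, 820, 1190], "bytes_scanned_gb": 24.0},
}

for variant, r in results.items():
    t = np.array(r["timings_ms"])
    print(f"{variant:16s} p50={np.percentile(t, 50):6.0f} ms  "
          f"p95={np.percentile(t, 95):6.0f} ms  scanned={r['bytes_scanned_gb']:.0f} GB")
```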
14) Retention and TTL policies
Tiered: hot (7-14 days), warm (30-90 days), cold (≥180 days).
Downsampling: as data cools down, store aggregates/sketches instead of raw records.
Retention/Legal hold: never delete in conflict with regulations; keep catalogs and versions.
15) Antipatterns
"Everywhere Gzip level 9 ": expensive CPU, no benefit.
No sorting/clustering: bad RLE/zone-maps → expensive scans.
JSON as a storage format: convenient for ingest, bad for analytics.
Too small files: inflate metadata/seek; CR falls.
Pre-compression encryption: Near-zero CR.
Lossy unmarked: breach of trust and accountability.
16) Implementation Roadmap
1. Discovery: query/data profiles, SLOs, and budgets.
2. MVP: Parquet + ZSTD/Snappy, basic sorting/clustering, compaction.
3. Tuning: ZSTD levels, page sizes, cluster by, bloom/zone-maps.
4. Warm/Cold: tiered storage, downsampling/sketches, egress policies.
5. Hardening: regression perf tests, DQ, transcoding runbooks.
17) Pre-release checklist
- Format: Parquet/ORC; statistics/dictionaries enabled.
- Clustering by filter keys; partitions by date/tenant.
- Codecs: hot = Snappy/LZ4, warm/cold = ZSTD (3-7); p95 within the target window.
- Compaction is set up; no small-files problem; target file/page sizes met.
- DQ and golden sets are green; types/ranges preserved.
- Encryption applied after compression; PII masked; retention/legal hold honored.
- Perf regressions are monitored; alerts on p95/bytes scanned/CR.
- Documentation for the storage policy and transcoding runbooks is ready.
18) Mini templates
dbt (Parquet table with ZSTD and clustering):
```sql
create table if not exists analytics.sales_daily
cluster by (event_date, tenant_id)
as select * from {{ ref('sales_daily_view') }};
-- in the model config: materialized=table, file_format=parquet, compression=zstd
```
Compaction policy (pseudo):
```yaml
compaction:
  target_file_mb: 256
  small_file_threshold_mb: 32
  schedule: "hourly"
```
Downsampling config (pseudo):
```yaml
timeseries:
  raw:
    keep: 14d
  rollup_1m:
    keep: 90d
  rollup_1h:
    keep: 365d
  rollup_1d:
    keep: 1825d
```
Bottom line: compressing analytical data is not just "turning on a codec" but a holistic strategy: the right format, column encodings, sorting and partitioning, codec choice and storage tiers, and respect for encryption and SLOs. Smart design delivers faster scans, lower bills, and predictable performance without compromising trust in the data.