Testing data pipelines

1) Why test data pipelines

Data pipelines (ingest → transform → serve) - a critical infrastructure for reporting, ML and operational solutions. Errors turn into incorrect metrics, fraud signals and monetary losses. Testing provides:

Correctness and resilience.
Predictable changes (schema/logic evolution).
Compliance with SLO in terms of freshness, completeness, latency.
Fast release (release speed) due to automated verification.

2) Data testing pyramid

Bottom up: a lot of quick local tests → less integration → a little end-to-end.

1. Unit tests of transformations (functions, UDF, SQL-views, dbt-models).
2. Data quality tests (freshness/completeness/uniqueness/range rules).
3. Contracts and schemes (schema/contract tests, evolution).
4. Pipeline integration tests (DAG: ingest ↔ storage ↔ transform ↔ marts).
5. E2E tests (source to storefront/API) including rights (RLS/CLS) and export.
6. Load/capacity (volume, speed, cost-to-serve).
7. Data chaos tests (delays, duplicates, out-of-order, unavailability).

3) Types of tests: what exactly we check

3. 1 Unit Logic Tests

Net transformation functions; property-based (invariants: idempotency, monotony).
SQL/DBT: comparison of the result with the standard (golden set), 'SELECT' prohibition, checking the filter by time.

3. 2 Data quality tests (DQ)

Freshness: window delay ≤ target threshold.
Completeness: expected quantity/percentage of occupancy.
Uniqueness: keys without duplicates.
Domain rules: ranges, referential integrity, business invariants.
Anomalies: outliers, bursts of duplicates, time gaps.

3. 3 Contracts and schemes

Change compatibility (SemVer: MAJOR/MINOR/PATCH).
Availability of mandatory columns, types, restrictions.
Fixed KPI semantics: formulas and aggregation windows.

3. 4 Integration and E2E

DAG integrity: triggers, dependencies, idempotent repeat.
Full path: source → raw → curated → marts → BI/API; RLS/CLS.

3. 5 Performance and costs

p95/p99 job latency, throughput (rows/s), volume/value.
Performance regression tests and scan limits.

3. 6 Security and privacy

PII/PCI masking (deterministic tokenization).
RLS/CLS Check - Users see only their own.
Export/snapshots: no "raw" personal fields.

4) The specifics of streaming (Kafka/Flink/Spark Structured Streaming)

Watermarks and lateness: tests of windows with late events (T + Δ), correct recalculations.
Exactly-once in meaning: dedup by 'event _ id', idempotent entry (upsert/merge).
Out-of-order: invariants for aggregates by 'event _ time'; fix'ingested _ at '.
Loss/repetition: simulate a drop/game of parties, check the correctness of the showcases.

5) Idempotence and determinism (what and how to test)

Restarting a step gives the same result (with the same window parameters).
Recording - via staging and atomic swap.
Merge logic with SCD1/SCD2 is covered by last-write-wins, source priority.
UDF/aggregate determinacy: same inputs → same outputs.

6) Test data management

Golden datasets: small standards with manual validation.
Synthetics + data factories: covering domain edges (nulls, extreme values, Unicode, TZ).
De-identified prod samples: privacy match.
Layered fictions: raw events, intermediate layers, final showcases.

7) Data Contracts - Example and Rules

YAML contract (simplified):

yaml dataset: orders schema:
- name: order_id; type: string; unique: true
- name: user_id; type: string; not_null: true
- name: amount; type: decimal(18,2); min: 0
- name: event_time; type: timestamp; tz: UTC freshness_sla: 10m dq_rules:
- "pct_null(user_id) < 0. 1%"
- "duplicates(order_id) = 0"
- "sum(amount) >= 0"
evolution:
allowed_minor_additions: true breaking_changes_require: approval: 'data-governance'
actions_on_violation:
- quarantine_partition
- replay_last_60m

8) Observability and SLO tests

Export metrics: Freshness, Completeness, Uniqueness, Latency to Grafana/Prometheus.
SLO alerts as "red" unit tests in prod (Synthetics).

Regression reports: "after the release of X p95 ↑ by 40%."

9) CI/CD and media

CI: unit + DQ + PR contracts; schema-diff; SQL static analysis (linter).
Sandbox/staging: run integration and e2e, chaos tests with secure data.
Feature-flags: canary jabs/models/formulas.
Cataloging: version of schemes, KPI formulas, lineage; automatic update of documentation.

10) Chaos data testing (Chaos-Data)

Injection of duplicates/omissions, delays, out-of-order.
Broker/party drop, "broken" files, schema drift.
We validate: auto-repair (replay/backfill), quarantine and alerts, MTTR-data.

11) Load and cost

Traffic generators with p95 profile/peaks.
Limits on scan/step (bytes scanned, time caps).
A/B value profiler: "old" vs "new" model/query.

12) Tools (sample classes)

DQ/Contracts: dbt tests, Great Expectations, Dequ, Soda, Custom linters.
Orchestration: Airflow/Dagster/Argo/Prefect (operators for tests at each step).
Platforms: BigQuery/Snowflake/Redshift/ClickHouse/Delta/Iceberg/Hudi.
Streaming: Kafka, Flink, Spark Streaming; TestContainers for local environments.
Observability: Prometheus/Grafana/Otel; DataHub/Amundsen/Collibra directories.

13) Antipatterns

"There is nothing to test - it's just SQL": there are no units and DQ → metrics break.
Only E2E: slow, unstable, the causes of breakdowns are not clear.
SELECT: breaks under MINOR evolution.
Live reading of OLTP in tests: instability and flakes.
Lack of golden sets: nothing to compare the results with.
No idempotency tests: rerun spoils data.
Forgotten streaming: not tested lateness/out-of-order/redelivery.

14) Implementation Roadmap

1. Basis: unit tests of transformations, golden sets, SQL linter, top-10 showcase DQ rules.
2. Contracts: schema-diff in CI, SemVer, automatic compatibility checks.
3. Integrations: DAG tests, idempotency, e2e for critical streams.
4. Streaming: watermarks/lateness, dedup/idempotent sinks tests.
5. SLO and chaos: quality metrics in sales, alerts, chaos scenarios, MTTR goals.
6. Optimization: perf regressions, budget guards, canary releases.

15) Pre-release checklist

Unit tests cover key transformations and UDFs.
DQ rules for freshness/completeness/uniqueness/ranges pass.
Contracts and schema-diff are green; there are no breaking changes without appruv.
Idempotency tested; atomic sink/merge works.
Streaming: watermarks/late data/out-of-order covered; dedup in place.
SLO metrics are exposed; alerts are configured; runbooks are.
Test data is secure; PII masked; RLS/CLS checked.
No perf regressions; scan/time limits met.
Chaos tests of basic scenarios passed; MTTR-target achievable.

16) Mini template examples

16. 1 SQL unit test (pseudo-dbt):

sql
-- tests/assert_positive_amount. sql select count() as c from {{ ref('fct_payments') }}
where amount < 0 having c = 0

16. 2 Great Expectations-style:

yaml expect_table_row_count_to_be_between:
min_value: 1000 mostly: 0. 99 expect_column_values_to_not_be_null:
column: user_id expect_column_values_to_be_unique:
column: txn_id

16. 3 Checking lateness in stream (pseudo-code):

python emit(events_out_of_window <= threshold)
emit(reprocessed_events == late_events_detected)

16. 4 Contract-test (schema-diff CI):

bash schema-diff --current models/orders. yml --target prod_schema. yml --semver

17) The bottom line

Testing data pipelines is a systems discipline, not a collection of piecemeal checks. Combine a pyramid of tests, contracts, and observability with practices of idempotency, circuit evolution, and streaming invariants. Then releases will become fast, incidents will become rare and short, and trust in data will become sustainable.

Testing data pipelines

Get in Touch

Quick Contact

The video will be updated soon

We are currently very busy with projects