Metrics Architecture

Metrics architecture is the system of rules, artifacts, and services that provides unambiguous definitions, reproducible calculation, transparent access, and reliable operation of indicators across the organization. The goal is for "MAU," "Retention D30," or "ARPPU" to mean the same thing in every dashboard, experiment, and report.

1) Principles

1. Single source of truth for formulas and reference data.
2. Separation of semantics from implementation: the business definition lives in a semantic layer, not in every SQL query or notebook.
3. Versioning of metrics, schemas, and formulas (v1→v2) with managed migration of history.
4. Reproducibility and testability: calculations are deterministic and covered by tests.
5. Observability: freshness, completeness, consistency, and drift, with SLOs and alerts.
6. Security and privacy: PII minimization, RLS/CLS, auditing.
7. Everything as code: definitions, transformations, and policies live in a repository with CI/CD.

2) Architecture layers

Source data: events/transactions, reference tables, model and infrastructure logs.
Integration and cleaning: CDC/incremental loading, deduplication, time zone unification.
Data model (DWH): star/snowflake schemas, slowly changing dimensions (SCD), surrogate keys.
Semantic layer of metrics: uniform definitions, aggregations, filters, time grain, rollup logic.
Computation layer: batch/microbatch/stream; windows, watermarks, keys.
Catalog and dictionary: metric passports, lineage, owners, access rights.
Access and consumption: BI/dashboards, metrics API, exports, experiments/A/B tests.

3) Data and Metrics Contracts

Source Contract (Events/Tables)

Schema: fields, types, nullability, primary key.
SLA: freshness (for example, "≤10 minutes lag"), frequency, maximum late arrival.
Quality: key uniqueness, valid value domains, timezone, idempotency.
Changes: schema evolution policy (backward/forward compatibility), deprecation plan.
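To make the contract above concrete, here is a minimal sketch of validating an event batch against a source contract: schema and types, primary-key uniqueness, and a freshness SLA. The field names (`event_id`, `event_ts`) and thresholds are illustrative assumptions, not part of the source.

```python
# Hypothetical source-contract check; assumes every row carries event_ts.
from datetime import datetime, timedelta, timezone

CONTRACT = {
    "required_fields": {"event_id": str, "user_id": str, "event_ts": datetime},
    "primary_key": "event_id",
    "max_lag": timedelta(minutes=10),  # freshness SLA: lag <= 10 minutes
}

def validate_batch(rows, now):
    errors = []
    seen = set()
    for r in rows:
        # Schema: required fields present with the contracted types
        for field, ftype in CONTRACT["required_fields"].items():
            if field not in r or not isinstance(r[field], ftype):
                errors.append(f"schema violation: {field}")
        # Idempotency: the primary key must be unique within the batch
        pk = r.get(CONTRACT["primary_key"])
        if pk in seen:
            errors.append(f"duplicate key: {pk}")
        seen.add(pk)
    # Freshness: the newest event must be within the SLA lag
    newest = max(r["event_ts"] for r in rows)
    if now - newest > CONTRACT["max_lag"]:
        errors.append("freshness SLA breached")
    return errors
```

In practice such checks run in CI and on every load, so contract breaches block the pipeline instead of silently corrupting metrics downstream.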

Metric Contract

Name/ID: `RET_D30_v2`

Domain/Owner: Product Analytics

Definition (in human language)

Formula: SQL/pseudocode + input data marts/semantic objects

Granularity/temporal logic: day/week; point-in-time rules, timezone

Default Segments/Filters

Units and currencies (conversion rate/date)

SLO: freshness ≤ X, accuracy ≥ Y, availability ≥ Z

Version/Change History/Effective Date

Guardrails: valid ranges, winsorization rules (p1/p99)
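A metric contract like the one above can itself be kept as code. This hypothetical sketch models the passport fields as a frozen dataclass; every concrete value is an illustrative assumption, not a canonical definition.

```python
# Illustrative metric contract ("metric passport") as code.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricContract:
    metric_id: str        # e.g. "RET_D30_v2"
    owner: str
    definition: str       # human-readable definition
    formula: str          # SQL / pseudocode
    grain: str            # "day", "week", ...
    timezone: str
    default_filters: dict
    slo: dict             # freshness / accuracy / availability targets
    version: int
    valid_range: tuple    # guardrail: plausible value bounds
    winsorize: tuple = (0.01, 0.99)  # p1/p99 clipping

ret_d30_v2 = MetricContract(
    metric_id="RET_D30_v2",
    owner="Product Analytics",
    definition="Share of a signup cohort active exactly 30 days later",
    formula="returned_users / cohort_size",
    grain="day",
    timezone="UTC",
    default_filters={"exclude_test_users": True},
    slo={"freshness_minutes": 60, "availability": 0.999},
    version=2,
    valid_range=(0.0, 1.0),
)
```

Keeping the contract in the repository lets CI diff it on every change, which enforces the versioning rule from section 7.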

4) Semantic layer of metrics

The task of the layer is to centrally store definitions and aggregation rules:
  • Elements: dimensions (date, country, platform), facts (events, revenue), metrics (ARPU, Retention D30), calculated fields, calendar (workdays/weekends, holidays).
  • Time behavior: calendar tables, lags, cohorts, rolling windows (7/30/90 days).
  • Rollup and consistency: daily values must roll up to the monthly value without double counting (distinct users).
  • Mix-adjustment: normalization to a constant mix of channels/countries for fair YoY comparisons.
  • Multicurrency/timezones: convert to the base currency at the transaction-date rate; keep both local and canonical UTC slices.
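The rollup-consistency rule is easy to show with toy data: summing daily distinct counts double-counts users who are active on several days, so higher grains must re-aggregate over distinct IDs.

```python
# Toy example: why "sum of DAU" is not MAU.
daily_users = {
    "2025-09-01": {"u1", "u2"},
    "2025-09-02": {"u2", "u3"},
    "2025-09-03": {"u1"},
}

# Naive rollup: sum of daily counts double-counts u1 and u2
naive_sum = sum(len(users) for users in daily_users.values())

# Correct rollup: distinct users over the whole month
mau = len(set().union(*daily_users.values()))
```

This is exactly the kind of rule the semantic layer encodes once, so no dashboard can sum DAU into a fake MAU.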

5) Calculation: batch, microbatch, stream

Batch: nightly/hourly jobs, full/incremental recalculation, idempotency control.
Microbatch: 1-15 minute windows for operational dashboards.
Stream: events via a message bus; windows (tumbling/sliding/session), watermarks (handling late data), exactly-once semantics (dedup + offset store).

Window patterns:
  • `HOP 5m, WINDOW 1h` for operational KPIs;
  • `TUMBLE 1d` for daily metrics;
  • `SESSION 30m` for sessions.
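As an illustration of the window semantics themselves (not any specific streaming engine's API), here is a small sketch of tumbling-window assignment and session stitching with a 30-minute inactivity gap; timestamps are epoch seconds.

```python
# Window semantics sketch: tumbling assignment and session stitching.

def tumble(ts, size):
    """Start of the tumbling window of `size` seconds containing `ts`."""
    return ts - ts % size

def sessionize(timestamps, gap=1800):
    """Split sorted timestamps into sessions separated by more than `gap` seconds."""
    sessions, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] > gap:
            sessions.append(current)  # inactivity gap exceeded: close session
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions
```

A real engine adds watermarks on top of this: a window is only finalized once the watermark passes its end, bounding how long late events are accepted.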

6) Quality and verifiability

Data tests: schema, value domains (ranges), referential integrity.
Metric tests: invariants (DAU ≤ MAU), non-empty segments, monotonicity expectations (for cumulative metrics).
Reconciliation: between semantic layer and reference reports/accounting.
Data health: freshness, completeness, duplicates, NULL fraction, abnormal jumps.
Drift metrics: PSI/KL/JS on key features, especially for ML metrics.
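For the drift checks, a minimal Population Stability Index (PSI) over pre-binned distribution shares might look like the sketch below; the 0.2 alert threshold mentioned in the usage note is a common convention, not something the source specifies.

```python
# Population Stability Index over two binned distributions.
import math

def psi(expected_shares, actual_shares, eps=1e-6):
    """PSI = sum((a - e) * ln(a / e)) over matching distribution bins."""
    total = 0.0
    for e, a in zip(expected_shares, actual_shares):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total
```

A frequent rule of thumb is PSI < 0.1 for stable, 0.1-0.2 for moderate shift, and > 0.2 as an alert-worthy drift on a key feature.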

7) Versioning and migrations

The formula version is encoded as `METRIC_NAME_vN`. It is forbidden to "quietly" change a definition without bumping the version.

Migration strategies:
  • Side-by-side: v1 and v2 are computed in parallel; run reconciliation and train users on the new definition.
  • Cut-over: switch consumers to v2 during a low-load window; archive v1.
  • History recalculation: backfill over historical data; publish a diff report.
  • Communications: changelog, effective date, who is affected, instructions.
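The side-by-side strategy implies a diff report between the v1 and v2 series. A hypothetical sketch, where the 2% tolerance is an assumption to be set per metric:

```python
# Reconciliation of metric v1 vs v2: flag dates whose relative
# difference exceeds a tolerance.

def diff_report(v1, v2, tol=0.02):
    """Return {date: relative_diff} for shared dates exceeding `tol`."""
    report = {}
    for date in sorted(set(v1) & set(v2)):
        if v1[date] == 0:
            continue  # avoid division by zero; review such dates manually
        rel = abs(v2[date] - v1[date]) / abs(v1[date])
        if rel > tol:
            report[date] = round(rel, 4)
    return report
```

The resulting report becomes part of the migration communications: consumers see exactly which dates and segments move, and by how much, before cut-over.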

8) Data model for metrics

Facts: grain (event_id, transaction_id, user_day), event time, sum/values.
Dimensions: user, device, geography, channel, product, calendar; SCD type for historicity.
Keys: surrogate IDs, stable business keys, mapping tables.
Anti-duplicates: identity-resolution rules (user merges), session-stitching windows.
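The identity-resolution rule can be sketched as a merge map that collapses duplicate user IDs onto a canonical ID before distinct counting; the IDs and the map below are made up for illustration.

```python
# Identity resolution: follow merge links to the canonical user ID
# (a simplified union-find-style lookup).

def resolve(user_id, merge_map):
    """Follow merge links until reaching the canonical user ID."""
    while user_id in merge_map:
        user_id = merge_map[user_id]
    return user_id

merge_map = {"u2": "u1", "u3": "u1"}   # u2 and u3 were merged into u1
events = ["u1", "u2", "u3", "u4"]

# Distinct users after identity resolution: u1 and u4
distinct_users = {resolve(u, merge_map) for u in events}
```

Without this step, a merged user would be counted several times and every distinct-user metric (DAU, retention cohorts) would be inflated.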

9) Units, currencies, seasonality

Units/format: explicit units, rounding, scales (log/linear).
Multicurrency: conversion at the exchange rate on the transaction date; store both "raw" and normalized amount.
Seasonality: YoY and seasonal indices; separate "holiday" effects.
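A minimal sketch of transaction-date FX normalization that keeps both the raw and the converted amount, as the multicurrency bullet prescribes; the rate table and its shape are illustrative assumptions.

```python
# FX normalization at the transaction-date rate; rates are made up.
FX = {("EUR", "2025-09-01"): 1.10, ("EUR", "2025-09-02"): 1.12}  # to base USD

def normalize(tx, base="USD"):
    """Attach amount_base converted at the rate of the transaction date."""
    rate = 1.0 if tx["ccy"] == base else FX[(tx["ccy"], tx["date"])]
    return {**tx, "amount_base": round(tx["amount"] * rate, 2)}

tx = normalize({"date": "2025-09-02", "ccy": "EUR", "amount": 50.0})
```

Storing both fields means historical reports never shift when today's exchange rate moves, while the raw amount stays available for local reporting.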

10) Security and access

Row-Level Security (RLS): access to metrics by country/brand/partner.
Column-Level Security (CLS): masking of PII/financial fields.
Audit: who requested the metric, which filters, which exported data.

API differentiation: "aggregates by role" vs "detailed uploads."

11) Observability and SLO

SLO freshness: for example, "operational KPI - lag ≤ 15 min, daily - until 06:00 local time."

Availability SLO: ≥ 99.9% for the API/semantic layer.
Alerts: SLO breaches, metric jumps, NULL/duplicate growth, v1 vs v2 variance > X%.
Runbooks: what to do during degradation: RCA steps, fallback (for example, switching to the last valid metric snapshot).
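The runbook fallback can be expressed as a small function: if the freshness SLO is breached, serve the last valid snapshot and flag the response as degraded. The 15-minute SLO comes from the example above; the function and argument names are assumptions.

```python
# Freshness-SLO check with snapshot fallback (illustrative sketch).
from datetime import datetime, timedelta, timezone

def serve_metric(latest_value, latest_ts, snapshot, now,
                 slo=timedelta(minutes=15)):
    """Return (value, degraded): fall back to the snapshot on SLO breach."""
    if now - latest_ts > slo:
        return snapshot, True   # degraded: fire alert, serve last valid snapshot
    return latest_value, False
```

Surfacing the degraded flag in the API lets dashboards label stale numbers explicitly instead of presenting them as fresh.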

12) Experiments and metrics

Guardrail metrics: latency, resiliency, FPR/FNR for scoring.
Uniform definitions for A/B tests: conversions, retention, NSM, all through the same semantic layer.
Minimum detectable effect (MDE), power analysis: store the parameters in the metric card.
Causal attribution: mix-adjustment policies and control groups.
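For the MDE parameters stored in the metric card, a standard two-proportion sample-size estimate (normal approximation, two-sided α = 0.05, power 0.80) can be sketched as follows; this is textbook A/B arithmetic, not a formula prescribed by the source.

```python
# Per-arm sample size to detect an absolute lift in a conversion rate.
import math

Z_ALPHA = 1.96   # two-sided alpha = 0.05
Z_BETA = 0.84    # power = 0.80

def sample_size_per_arm(p_base, mde_abs):
    """Users per variant to detect an absolute lift of `mde_abs`."""
    p2 = p_base + mde_abs
    p_bar = (p_base + p2) / 2
    num = (Z_ALPHA * math.sqrt(2 * p_bar * (1 - p_bar))
           + Z_BETA * math.sqrt(p_base * (1 - p_base) + p2 * (1 - p2))) ** 2
    return math.ceil(num / mde_abs ** 2)
```

Recording `p_base`, the MDE, and the resulting sample size in the metric card keeps experiment sizing reproducible across teams.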

13) API metrics and consumption

Requests: `GET /metrics/{name}?from=2025-09-01&to=2025-10-01&dims=country,platform&filters=channel:paid`.

Policies: limits, cache, pagination, idempotent "exports."

Versions: `X-Metric-Version: v2` header, deprecation warnings.
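A hypothetical client call against the API shape shown above: the endpoint path and version header come from the text, while the host is a made-up placeholder. The request object is only constructed, never sent.

```python
# Building a versioned metrics-API request (stdlib only, not sent).
from urllib.request import Request
from urllib.parse import urlencode

params = {
    "from": "2025-09-01",
    "to": "2025-10-01",
    "dims": "country,platform",
    "filters": "channel:paid",
}
# example.internal is a placeholder host, not a real endpoint
url = "https://example.internal/metrics/RET_D30?" + urlencode(params)
req = Request(url, headers={"X-Metric-Version": "v2"})
```

Pinning the version in a header (rather than the URL) lets the server emit deprecation warnings for v1 consumers without breaking their queries.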

14) Patterns and artifacts

Metric Passport (example)

Code/Version: `ARPPU_v3`

Definition: average revenue per paying user for the period

Formula: `sum(revenue_net) / count_distinct(user_id where paying_flag=1)`

Granularity: day; rollup: week/month computed as sum of numerators divided by sum of denominators

Sources: `fact_payments_v2`, `dim_users_scd`

Units: currency `base_ccy`; conversion at the exchange rate as of the transaction date

Default filters: active markets, exclude test transactions

SLO: freshness ≤ 1 hour; API availability ≥ 99.9%

Guardrails: ARPPU ∈ [0; 10,000]; winsorization at p1/p99

Owners: Monetization Analytics; revision date: 2025-10-01
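The p1/p99 winsorization guardrail from the passport can be sketched with a simple nearest-rank percentile clamp. The percentile convention here is an assumption for illustration; production code would use a library implementation.

```python
# Winsorization sketch: clamp outliers to the p1/p99 percentiles
# before aggregating ARPPU inputs.

def percentile(sorted_vals, q):
    """Nearest-rank percentile on a pre-sorted list (simple convention)."""
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]

def winsorize(values, lo_q=0.01, hi_q=0.99):
    """Clamp every value into the [p_lo, p_hi] percentile band."""
    s = sorted(values)
    lo, hi = percentile(s, lo_q), percentile(s, hi_q)
    return [min(max(v, lo), hi) for v in values]
```

Clamping rather than dropping outliers keeps the denominator (number of paying users) intact, so the ratio metric stays consistent with its definition.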

Metric release checklist

  • Definition and formula agreed, covered with tests
  • Semantic object created; lineage documented
  • Backfill and reconciliations completed
  • SLO/alerts are configured; runbook ready
  • Rights and RLS configured; PII hidden
  • Old versions replaced in dashboards/experiments
  • Changelog/communication sent

Point-in-time SQL pseudocode (Retention D30 example)

```sql
WITH cohort AS (
  SELECT user_id, MIN(event_date) AS signup_date
  FROM fact_events
  WHERE event_type = 'signup'
  GROUP BY 1
),
activity AS (
  SELECT user_id, event_date
  FROM fact_events
  WHERE event_type = 'app_open'
),
ret AS (
  SELECT c.signup_date,
         COUNT(DISTINCT CASE WHEN a.event_date = c.signup_date + INTERVAL '30 day'
                             THEN a.user_id END) AS returned,
         COUNT(DISTINCT c.user_id) AS cohort_size
  FROM cohort c
  LEFT JOIN activity a
    ON a.user_id = c.user_id
   AND a.event_date BETWEEN c.signup_date AND c.signup_date + INTERVAL '30 day'
  GROUP BY 1
)
SELECT signup_date,
       1.0 * returned / cohort_size AS retention_d30  -- multiply by 1.0 to avoid integer division
FROM ret;
```

15) Frequent mistakes and how to avoid them

Quiet formula edits: always go through a version bump and a changelog.
"Different in every notebook" metrics: force consumption through the semantic layer/API.
Inconsistent timezones/currencies: a centralized calendar and FX table.
Double-counting users: rollup rules and unique keys.
Opaque freshness: clearly show lag and update time.
Dependence on a single engineer: everything as code, with reviews and an on-call rotation.

Summary

Metrics architecture is a catalog plus a semantic layer plus robust calculation plus governance and SLOs. By following the principles described here (contracts, tests, versioning, observability, security), you turn metrics from a source of "number disputes" into a sustainable mechanism for managing the product and the business.
