Merge data from different sources
Merge data from different sources
Data merging is the process of combining heterogeneous flows (product databases, CRM, payment providers, event logs, third-party registries) into holistic entities and consistent storefronts. The goal is to get a Golden Record and consistent cuts for analytics, ML and operational cases.
1) Typical scenarios and objectives
360 ° in essence: client/player, device, payment instrument, merchant.
Transaction consolidation: multiple PSPs/cash registers → a single log with mandatory idempotency.
Event normalization: web/mobile/backend logs → a single event dictionary.
Enrichment: external directories (geo, FX, AML/sanctions, marketing sources).
Unified metrics: coordination of currencies/timezones, schemes and encodings.
2) Source contracts and schemes
Before starting - data contract for each source:- Schema: fields, types, nullability, key (s), value domains.
- Semantics: what each field (dictionary) means.
- SLA: freshness/frequency, maximum latency and out-of-order.
- Evolution: scheme change policy (backward/forward), depression.
- Quality: uniqueness of keys, acceptable ranges, referential integrity.
3) Identification: keys and mapping (record linkage)
3. 1. Hard IDs
Natural keys: 'user _ id', 'transaction _ id', 'device _ id', 'iban'.
Proxy keys: e-mail/phone (normalized: case, spaces, country codes).
Surrogates: 'surrogate _ id' in hub tables in the absence of a universal key.
3. 2. Soft matching rules
Deterministic: exact match of normalized e-mail + DR; "home "/" mobile "phone → E.164.
Probabilistic (fuzzy): Jaro-Winkler/Levenshtein for name/address, TF-IDF/embeddings for strings, "blocking" (blocking) by coarse hashes/prefixes for acceleration.
Graph approaches: entities as nodes, coincidences as edges; clustering connectivity components.
Step-up strategy: from strict to soft rules with manual "at the border" review.
3. 3. Consolidation rules (survival)
The priority of the source is "KYC registry> CRM> logs" when there is a conflict of values.
Freshness: The newer timestamp wins (adjusted for validity).
Fullness: prefer non-NULL; merging addresses/tags by combining sets.
Audit: Keep the "solution trail" - what was overwritten and why.
4) Deduplication and MDM
MDM layer (Master Data Management): master entity tables + istochnik→master relationships.
Golden Record: aggregated record with 'confidence' field/source of truth.
History: SCD type 2 for time-dependent attributes (address, KYC status).
Identities: Merge map tables with "merge "/" spill "dates.
5) Change flows: CDC, latecomers and duplicates
CDC (Change Data Capture): события `insert/update/delete` + `source_lsn`/offset.
Late events: watermarks and grace period, storing late updates for adjustments.
Out-of-order: sorting by key and time, compensating updates.
Duplicates: idempotent keys ('event _ id', 'idempotency _ key'), dedup in the window.
Exactly-once: transactional singles/store, 'MERGE' with deterministic logic.
6) Timezones, currencies and calendar
Time: store in UTC + localized slices; explicitly store'ingested _ at'and'event _ time'.
Currencies: Store "raw currency" and normalized 'base _ ccy' with rate on transaction date.
Calendars: Holiday/workday tables by region for fair comparisons.
7) Pseudo-SQL for merge (upsert/merge)
7. 1. Transactions (idempotent journal)
sql
MERGE INTO fact_transactions t
USING staging_transactions s
ON t. txn_id = s. txn_id
WHEN MATCHED AND s. updated_at > t. updated_at THEN
UPDATE SET amount = s. amount,
currency = s. currency,
status = s. status,
updated_at = s. updated_at
WHEN NOT MATCHED THEN
INSERT (txn_id, user_ext_id, amount, currency, status, event_time, updated_at)
VALUES (s. txn_id, s. user_ext_id, s. amount, s. currency, s. status, s. event_time, s. updated_at);
7. 2. User "golden record" (source priority + freshness)
sql
WITH ranked AS (
SELECT s. ext_user_id,
s. norm_email,
s. phone_e164,
s. addr_struct,
s. source,
s. updated_at,
ROW_NUMBER() OVER (
PARTITION BY s. ext_user_id
ORDER BY
CASE s. source
WHEN 'KYC' THEN 1 WHEN 'CRM' THEN 2 ELSE 3 END,
s. updated_at DESC
) AS rn
FROM staging_users s
)
MERGE INTO dim_user_golden g
USING ranked r
ON g. ext_user_id = r. ext_user_id
WHEN MATCHED AND r. rn = 1 THEN
UPDATE SET email = COALESCE(r. norm_email, g. email),
phone = COALESCE(r. phone_e164, g. phone),
address = COALESCE(r. addr_struct, g. address),
source_of_truth = r. source,
updated_at = r. updated_at
WHEN NOT MATCHED AND r. rn = 1 THEN
INSERT (ext_user_id, email, phone, address, source_of_truth, updated_at)
VALUES (r. ext_user_id, r. norm_email, r. phone_e164, r. addr_struct, r. source, r. updated_at);
8) Quality and testing
Test schema: required fields, types, domains.
Logic tests: uniqueness of the key, absence of duplicates, no "back in time."
Reconciliations: amounts by source vs final showcase; discrepancies → tickets.
Profiling: distributions, fraction NULL, "long tails."
Merge metrics: precision/recall mappings,% of records with confidence ≥ threshold.
9) Observability and SLO
SLO freshness: window lag ≤ N minutes/hours; delay monitoring and backlogging.
Alerts: an increase in duplicates, a surge in conflicts, a drop in coverage keys.
Lineage logs: from which source the field was taken, when and by whom it was overwritten.
Runybooks: Incident scenarios (late batches, CDC storms, incorrect FX).
10) Security, privacy, compliance
PII: aliasing, ID hashing, masking in BI.
RLS/CLS: access by roles and rows; export - with tokens and expiration date.
Data lifetime: storage schedules; right to remove (DSAR) and "legal hold."
Re-identification: rules for minimizing the joins of sensitive tables.
11) Model and data organization
Layers: 'raw' (as is) → 'staging' (cleaning/normalization) → 'core' (master entities, fact/measurements) → 'marts' (showcases for analytics/ML).
SCD: type 2 for attributes, type 1 for error correction; explicit'valid _ from/valid _ to '.
Feature Store: transformation functions are identical online/offline; point-in-time correctness.
12) Implementation patterns
ELT with semantic layer: merge logic is described declaratively (rules, priorities, keys).
Stream + microbatch: for near-real-time displays - microbatches 1-15 min with watermarks.
Graph-linkage: a separate graph hub for complex identification (devices, maps, addresses).
Step-up validation: include new linkage rules in shadow mode, collect accuracy metrics.
13) Pre-Merge Loop Release Checklist
- Source contracts signed; schemas and field dictionaries are consistent
- linkage keys/rules defined; has a deduplication strategy
- Survival rules and source priorities are set; audit-log enabled
- CDC/idempotency/late data processing implemented
- Currencies/Timezones/Calendar Normalized
- Quality tests and reconciliations are set up; dashboards of observability are available
- SLO freshness and availability are fixed; alerts and runibooks are ready
- PII/accesses/storage compliant
- Documentation: Entity Passport, lineage schema, sample requests
14) Passport of the "golden record" (template)
Entity: 'USER _ GOLDEN'
Key: 'user _ master _ id' (surrogate), mappings' source _ user _ id [] '
Fields and rules:- 'email ': normalization + priority' KYC> CRM> LOGS'
- 'phone ': E.164 normalization, verification split
- `name`: Jaro-Winkler ≥ 0. 92, fallback - KYC source
- 'address': compound object; union + freshness priority
- History: SCD2 ('valid _ from/valid _ to')
- Lineage: donor field reference list
- Quality: coverage≥98%, dublikaty≤0. 3%
- SLO: freshness ≤ 1 hour, availability ≥ 99. 9%
- Owners: Data Platform, KYC/AML
- Risks: name collisions, "family" phones, shared-devices
15) Summary and recommendations
Merger is not only a "JOIN by key," but an outline: source contracts → identification and dedup → priorities and a "golden record" of CDC → and late → quality and observability → safety and change history.
Build rules transparently, keep an audit of each solution, support SCD and exactly-once. This is how data from dozens of sources turns into reliable storefronts and sustainable metrics for the product, analytics and ML.