Merge data from different sources

Data merging is the process of combining heterogeneous flows (product databases, CRM, payment providers, event logs, third-party registries) into holistic entities and consistent storefronts. The goal is to get a Golden Record and consistent cuts for analytics, ML and operational cases.

1) Typical scenarios and objectives

360 ° in essence: client/player, device, payment instrument, merchant.
Transaction consolidation: multiple PSPs/cash registers → a single log with mandatory idempotency.
Event normalization: web/mobile/backend logs → a single event dictionary.
Enrichment: external directories (geo, FX, AML/sanctions, marketing sources).
Unified metrics: coordination of currencies/timezones, schemes and encodings.

2) Source contracts and schemes

Before starting - data contract for each source:

Schema: fields, types, nullability, key (s), value domains.
Semantics: what each field (dictionary) means.
SLA: freshness/frequency, maximum latency and out-of-order.
Evolution: scheme change policy (backward/forward), depression.
Quality: uniqueness of keys, acceptable ranges, referential integrity.

3) Identification: keys and mapping (record linkage)

3. 1. Hard IDs

Natural keys: 'user _ id', 'transaction _ id', 'device _ id', 'iban'.
Proxy keys: e-mail/phone (normalized: case, spaces, country codes).
Surrogates: 'surrogate _ id' in hub tables in the absence of a universal key.

3. 2. Soft matching rules

Deterministic: exact match of normalized e-mail + DR; "home "/" mobile "phone → E.164.
Probabilistic (fuzzy): Jaro-Winkler/Levenshtein for name/address, TF-IDF/embeddings for strings, "blocking" (blocking) by coarse hashes/prefixes for acceleration.
Graph approaches: entities as nodes, coincidences as edges; clustering connectivity components.
Step-up strategy: from strict to soft rules with manual "at the border" review.

3. 3. Consolidation rules (survival)

The priority of the source is "KYC registry> CRM> logs" when there is a conflict of values.
Freshness: The newer timestamp wins (adjusted for validity).
Fullness: prefer non-NULL; merging addresses/tags by combining sets.
Audit: Keep the "solution trail" - what was overwritten and why.

4) Deduplication and MDM

MDM layer (Master Data Management): master entity tables + istochnik→master relationships.
Golden Record: aggregated record with 'confidence' field/source of truth.
History: SCD type 2 for time-dependent attributes (address, KYC status).
Identities: Merge map tables with "merge "/" spill "dates.

5) Change flows: CDC, latecomers and duplicates

CDC (Change Data Capture): события `insert/update/delete` + `source_lsn`/offset.
Late events: watermarks and grace period, storing late updates for adjustments.
Out-of-order: sorting by key and time, compensating updates.
Duplicates: idempotent keys ('event _ id', 'idempotency _ key'), dedup in the window.
Exactly-once: transactional singles/store, 'MERGE' with deterministic logic.

6) Timezones, currencies and calendar

Time: store in UTC + localized slices; explicitly store'ingested _ at'and'event _ time'.
Currencies: Store "raw currency" and normalized 'base _ ccy' with rate on transaction date.
Calendars: Holiday/workday tables by region for fair comparisons.

7) Pseudo-SQL for merge (upsert/merge)

7. 1. Transactions (idempotent journal)

sql
MERGE INTO fact_transactions t
USING staging_transactions s
ON t. txn_id = s. txn_id
WHEN MATCHED AND s. updated_at > t. updated_at THEN
UPDATE SET amount = s. amount,
currency = s. currency,
status = s. status,
updated_at = s. updated_at
WHEN NOT MATCHED THEN
INSERT (txn_id, user_ext_id, amount, currency, status, event_time, updated_at)
VALUES (s. txn_id, s. user_ext_id, s. amount, s. currency, s. status, s. event_time, s. updated_at);

7. 2. User "golden record" (source priority + freshness)

sql
WITH ranked AS (
SELECT s. ext_user_id,
s. norm_email,
s. phone_e164,
s. addr_struct,
s. source,
s. updated_at,
ROW_NUMBER() OVER (
PARTITION BY s. ext_user_id
ORDER BY
CASE s. source
WHEN 'KYC' THEN 1 WHEN 'CRM' THEN 2 ELSE 3 END,
s. updated_at DESC
) AS rn
FROM staging_users s
)
MERGE INTO dim_user_golden g
USING ranked r
ON g. ext_user_id = r. ext_user_id
WHEN MATCHED AND r. rn = 1 THEN
UPDATE SET email = COALESCE(r. norm_email, g. email),
phone = COALESCE(r. phone_e164, g. phone),
address = COALESCE(r. addr_struct, g. address),
source_of_truth = r. source,
updated_at = r. updated_at
WHEN NOT MATCHED AND r. rn = 1 THEN
INSERT (ext_user_id, email, phone, address, source_of_truth, updated_at)
VALUES (r. ext_user_id, r. norm_email, r. phone_e164, r. addr_struct, r. source, r. updated_at);

8) Quality and testing

Test schema: required fields, types, domains.

Logic tests: uniqueness of the key, absence of duplicates, no "back in time."

Reconciliations: amounts by source vs final showcase; discrepancies → tickets.

Profiling: distributions, fraction NULL, "long tails."

Merge metrics: precision/recall mappings,% of records with confidence ≥ threshold.

9) Observability and SLO

SLO freshness: window lag ≤ N minutes/hours; delay monitoring and backlogging.
Alerts: an increase in duplicates, a surge in conflicts, a drop in coverage keys.
Lineage logs: from which source the field was taken, when and by whom it was overwritten.
Runybooks: Incident scenarios (late batches, CDC storms, incorrect FX).

10) Security, privacy, compliance

PII: aliasing, ID hashing, masking in BI.
RLS/CLS: access by roles and rows; export - with tokens and expiration date.

Data lifetime: storage schedules; right to remove (DSAR) and "legal hold."

Re-identification: rules for minimizing the joins of sensitive tables.

11) Model and data organization

Layers: 'raw' (as is) → 'staging' (cleaning/normalization) → 'core' (master entities, fact/measurements) → 'marts' (showcases for analytics/ML).
SCD: type 2 for attributes, type 1 for error correction; explicit'valid _ from/valid _ to '.
Feature Store: transformation functions are identical online/offline; point-in-time correctness.

12) Implementation patterns

ELT with semantic layer: merge logic is described declaratively (rules, priorities, keys).
Stream + microbatch: for near-real-time displays - microbatches 1-15 min with watermarks.
Graph-linkage: a separate graph hub for complex identification (devices, maps, addresses).
Step-up validation: include new linkage rules in shadow mode, collect accuracy metrics.

13) Pre-Merge Loop Release Checklist

Source contracts signed; schemas and field dictionaries are consistent
linkage keys/rules defined; has a deduplication strategy
Survival rules and source priorities are set; audit-log enabled
CDC/idempotency/late data processing implemented
Currencies/Timezones/Calendar Normalized
Quality tests and reconciliations are set up; dashboards of observability are available
SLO freshness and availability are fixed; alerts and runibooks are ready
PII/accesses/storage compliant
Documentation: Entity Passport, lineage schema, sample requests

14) Passport of the "golden record" (template)

Entity: 'USER _ GOLDEN'

Key: 'user _ master _ id' (surrogate), mappings' source _ user _ id [] '

Fields and rules:

'email ': normalization + priority' KYC> CRM> LOGS'
'phone ': E.164 normalization, verification split
`name`: Jaro-Winkler ≥ 0. 92, fallback - KYC source
'address': compound object; union + freshness priority
History: SCD2 ('valid _ from/valid _ to')
Lineage: donor field reference list
Quality: coverage≥98%, dublikaty≤0. 3%
SLO: freshness ≤ 1 hour, availability ≥ 99. 9%
Owners: Data Platform, KYC/AML
Risks: name collisions, "family" phones, shared-devices

15) Summary and recommendations

Merger is not only a "JOIN by key," but an outline: source contracts → identification and dedup → priorities and a "golden record" of CDC → and late → quality and observability → safety and change history.
Build rules transparently, keep an audit of each solution, support SCD and exactly-once. This is how data from dozens of sources turns into reliable storefronts and sustainable metrics for the product, analytics and ML.

Merge data from different sources

Merge data from different sources

Get in Touch

Quick Contact

The video will be updated soon

We are currently very busy with projects