Registry parsing and automation
TL;DR
Reliable registry-verification automation rests on three pillars: deterministic ingestion (secure delivery, idempotence, integrity control), a normalized data layer (a single schema, matching keys, standardized units for time/currencies/signs), and strict quality discipline (validations, tolerances, a DLQ, alerts, auto-corrections). The goal is to turn heterogeneous files/webhooks into stable tables for reconciliation, reporting, and BI, with SLAs on availability.
1) Source and format landscape
1.1 Sources
PSP/acquirers/aggregators: transactions, settlements, commissions, disputes.
Banks: MT940 statements; ISO 20022 camt.052/053/054 statements; pain.001/002 payments.
APMs/wallets/payout rails (OCT/RTP/SEPA): payout registers, returns.
Crypto custodians/exchanges: on-chain transactions, conversion/commission reports.
Tax/government portals: CSV/XLSX/PDF, sometimes via a scripted browser.
1.2 Formats
CSV/TSV (varying delimiters, locales, encodings).
XLSX (multi-sheet, merged cells).
XML (ISO 20022 CAMT/PAIN, custom XSD schemas).
SWIFT MT940/942 (positional fields).
JSON-API/NDJSON (incremental uploads, cursors).
PDF (tabular → table parser; scans → OCR).
ZIP/TAR.GZ (multi-file batches).
2) Ingestion pipeline architecture
Layers:
1. Landing: secure file acceptance (SFTP/FTPS/WebDAV/API/webhooks) → compute the checksum immediately, keep the raw file unchanged.
2. Raw: laid out by date/provider/batch, stored with versions.
3. Normalize: parsing → unification of types/units → normalized tables.
4. Validated: post-validation (quality rules) → flags, DLQ.
5. Matched: mapping to internal events/bank records.
6. Serving/BI: data marts for reconciliation/finance/transactions.
Key requirements:
- Idempotent ingestion: `(provider, file_name, file_size, checksum, statement_date)` → a unique key.
- Reruns/retries: re-running a file does not create duplicates.
- DLQ (dead-letter queue): everything unrecognized or rule-violating goes to an isolated queue.
- Versioning: a new file for the same day → a new version with a link to the previous one.
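A minimal sketch of deriving such a deterministic ingestion key (Python; `make_file_id` and the `ing_` prefix are illustrative, not a fixed convention):

```python
import hashlib

def make_file_id(provider: str, file_name: str, file_size: int,
                 checksum: str, statement_date: str) -> str:
    """Derive a deterministic file_id from the ingestion key tuple.

    Re-delivering the exact same file yields the exact same file_id,
    so a UNIQUE constraint on file_id makes reruns idempotent.
    """
    key = f"{provider}|{file_name}|{file_size}|{checksum}|{statement_date}"
    return "ing_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

The point of hashing the whole tuple rather than using a sequence is that retries and replays need no coordination: the same input always maps to the same key.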
3) Security of delivery and secrets
Channels: SFTP with restricted keys; FTPS only with strict TLS; API with OAuth2/short-TTL tokens.
Content encryption: PGP/GPG for delivered files; S/MIME for e-mail inboxes (if used).
Integrity check: SHA-256/512 checksums compared against the hash in the manifest.
Secrets: stored in Vault/KMS, rotated; never in config files/logs.
Access: RBAC plus the principle of least privilege; separate service accounts.
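A small sketch of the integrity check, assuming the manifest supplies a hex SHA-256 (the helper name is hypothetical):

```python
import hashlib

def verify_checksum(path: str, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file in chunks and compare its SHA-256 with the manifest hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()
```

Streaming in chunks keeps memory flat for multi-gigabyte statement files; a mismatch should halt the pipeline before the file ever reaches the Raw layer.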
4) Normalization and data schema
4.1 Universal rules
Time: always UTC in ISO-8601; settlement dates as `DATE` without TZ.
Amounts: `DECIMAL(p, s)` in minor units plus an explicit `scale`; sign (credit/debit) strictly per dictionary.
Currencies: ISO-4217; a fixed rate table with `fx_src`.
Locales: no autodetection; delimiters and decimal/thousands separators are configured explicitly.
Encodings: UTF-8 on input; everything else converted with a log entry.
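A sketch of amount normalization under these rules, with the separators passed in explicitly from the per-source spec (the helper name is illustrative):

```python
from decimal import Decimal, ROUND_HALF_UP

def to_minor_units(raw: str, thousand: str, decimal_sep: str, scale: int = 2) -> int:
    """Convert a locale-formatted amount string into integer minor units.

    Separators come from the source spec; nothing is autodetected.
    """
    s = raw.strip().replace(thousand, "").replace(decimal_sep, ".")
    q = Decimal(s).quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    return int(q.scaleb(scale))
```

Using `Decimal` end to end avoids the binary-float rounding drift that silently breaks checksum/total validations downstream.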
4.2 Normalized "flat" layer (example)
```json
{
  "provider": "Acquirer_A",
  "source_kind": "PSP_TX|PSP_SETTLEMENT|BANK|WALLET|CRYPTO",
  "kind": "AUTH|CAPTURE|REFUND|PAYOUT|FEE|SETTLEMENT|CHARGEBACK",
  "payment_id": "pay_123",           // internal
  "provider_txid": "psp_abc_789",    // external
  "merchant_ref": "mr_456",
  "sequence": 0,                     // partial/refund line index
  "amount_minor": 100000,            // 1000.00
  "currency": "EUR",
  "fee_minor": 120,                  // 1.20
  "fx_rate": 1.0000,
  "fx_src": "PSP|ECB|BANK",
  "event_ts": "2025-11-03T12:00:00Z",
  "value_date": "2025-11-05",
  "account": "PSP_MERCHANT_CARD_A",
  "bin": "425000",
  "last4": "1234",
  "status": "APPROVED|CAPTURED|SUCCESS|FAILED|SETTLED",
  "file_id": "ing_20251103_001",
  "row_hash": "sha256(raw_row)"
}
```
5) Parsers by format: tricks and pitfalls
5.1 CSV/TSV
Explicitly specify `delimiter`, `quotechar`, `escapechar`, `encoding`.
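For illustration, a minimal reader with every dialect parameter pinned (Python stdlib `csv`; the wrapper name is hypothetical):

```python
import csv
import io

def read_csv_rows(data: bytes, delimiter: str, quotechar: str,
                  encoding: str, escapechar=None):
    """Read CSV rows with every dialect parameter set explicitly."""
    text = data.decode(encoding)  # fail loudly if the declared encoding is wrong
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter,
                            quotechar=quotechar, escapechar=escapechar)
    return list(reader)
```

Decoding with the declared encoding (instead of `errors="replace"`) is deliberate: a wrong encoding should land the file in the DLQ, not produce silently mangled rows.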
5.2 XLSX
Read only whitelisted sheets; disable automatic merge handling and flatten merged cells explicitly.
Convert formulas to values; convert Excel serial dates to UTC with an explicit TZ.
5.3 XML (ISO 20022 CAMT/PAIN)
Validate against the XSD; map fields via XPath (`<Ntry>`, `<TxDtls>`, `<Amt>`, `<CdtDbtInd>`).
Normalize credit/debit into a sign; support multiple `<Chrgs>`, `<RmtInf>`.
5.4 MT940
Parse tags `:61:`, `:86:`; support national extensions; handle positional fields via slicing rules.
Consolidate multiple `:61:` entries into one batch.
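A hedged sketch of parsing a simplified `:61:` statement line into signed minor units; real MT940 variants differ per bank and national extension, so the regex below is illustrative only:

```python
import re
from decimal import Decimal

# Assumed simplified layout: YYMMDD [MMDD] C/D amount,dd Nxxx reference
TAG61 = re.compile(
    r"^(?P<value_date>\d{6})"    # YYMMDD value date
    r"(?P<entry_date>\d{4})?"    # optional MMDD booking date
    r"(?P<dc>C|D|RC|RD)"         # credit/debit (or reversal) mark
    r"(?P<amount>\d+,\d{0,2})"   # amount, comma as decimal mark
    r"(?P<type>N[A-Z]{3})"       # SWIFT transaction type code
    r"(?P<ref>.{0,16})"          # customer reference
)

def parse_61(line: str) -> dict:
    """Parse a simplified MT940 :61: line; raise on anything unrecognized."""
    m = TAG61.match(line)
    if not m:
        raise ValueError(f"unparseable :61: line: {line!r}")
    sign = -1 if m.group("dc") in ("D", "RC") else 1
    minor = int(Decimal(m.group("amount").replace(",", ".")) * 100) * sign
    return {"value_date": m.group("value_date"), "amount_minor": minor,
            "type": m.group("type"), "ref": m.group("ref")}
```

Raising on a non-match (rather than skipping) matters here: an unparseable line is a DLQ candidate, not a line to drop.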
5.5 JSON/NDJSON/API
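The cursor-driven incremental pattern mentioned in 1.2 might be sketched like this; `fetch_page` is a hypothetical transport callable, not a real client API:

```python
import json

def ingest_ndjson_pages(fetch_page, cursor=None):
    """Pull NDJSON pages from a cursor-based API until the cursor is exhausted.

    fetch_page(cursor) is assumed to return (ndjson_text, next_cursor),
    where next_cursor=None signals the final page.
    """
    rows = []
    while True:
        body, cursor = fetch_page(cursor)
        rows.extend(json.loads(line) for line in body.splitlines() if line.strip())
        if cursor is None:
            return rows
```

Persisting the last successful cursor between runs (omitted here) is what makes the pull resumable after a failure.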
5.6 PDF/OCR
Attempt table parsing first (table detector); fall back to OCR (Tesseract) with a character whitelist.
Post-validation: amounts, control totals, reconciliation of line counts.
5.7 Archives/Batches
Unpack preserving the directory structure; each file gets its own `file_id`; use the manifest to verify all parts arrived.
6) Data quality validations and rules
Mandatory checks:
- Schema: all required fields are present.
- Types: amounts are numeric, dates parse.
- Checksums/totals: the sum of the rows equals the file total (if present).
- Ranges: date within a reasonable window; amount > 0 (or negative per an allow-list dictionary).
- Uniqueness: `(provider, provider_txid, sequence)` is not duplicated in the normalized layer.
- Tolerances: allowed `amount/fx/time` discrepancies.
Result: `VALID`, `VALID_WITH_WARNINGS`, or `INVALID` → DLQ with a `reason_code`.
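One way these checks could be wired together; the required-field set, reason codes, and tolerance keys below are assumptions for illustration:

```python
def validate_row(row: dict, tolerances: dict):
    """Apply mandatory checks to a normalized row; return (status, reason_codes)."""
    reasons = []
    required = ("provider", "provider_txid", "amount_minor", "currency", "event_ts")
    for field in required:
        if row.get(field) in (None, ""):
            reasons.append(f"MISSING_{field.upper()}")
    if not isinstance(row.get("amount_minor"), int):
        reasons.append("BAD_TYPE_AMOUNT")
    elif row["amount_minor"] <= 0 and row.get("kind") not in ("REFUND", "CHARGEBACK", "FEE"):
        reasons.append("NEGATIVE_AMOUNT")  # negative only for allow-listed kinds
    if reasons:
        return "INVALID", reasons
    if abs(row.get("fee_minor", 0)) > tolerances.get("max_fee_minor", 10**9):
        return "VALID_WITH_WARNINGS", ["FEE_OUT_OF_RANGE"]
    return "VALID", []
```

Hard failures go to `INVALID`/DLQ with explicit reason codes; soft tolerance breaches only downgrade to `VALID_WITH_WARNINGS`, so the row still flows to matching.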
7) Identity and deduplication
Ingestion key: `(provider, file_name, file_size, checksum, statement_date)` → a single `file_id`.
Row-level idempotence: `row_hash = sha256(normalized_row_compact)`; reloading does not create new records.
Webhooks/API: the provider's `idempotency_key` plus your own tags (`exec_id`), stored with a TTL.
Provider duplicates: dedup by `provider_txid` + `sequence`; on mismatch, route to DLQ_DUPLICATE.
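A sketch of row-level idempotence via a canonical hash; which fields count as volatile (and are therefore excluded) is an assumption about what varies between re-deliveries:

```python
import hashlib
import json

def row_hash(norm_row: dict) -> str:
    """Hash a canonical, key-sorted, compact JSON form of a normalized row.

    Volatile bookkeeping fields (file_id, row_id) are excluded so the same
    provider row re-delivered in a new file hashes identically.
    """
    stable = {k: v for k, v in norm_row.items() if k not in ("file_id", "row_id")}
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

With a UNIQUE constraint on `row_hash` (as in the `psp_norm` table below in section 12), a reload of the same file becomes a no-op at the row level.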
8) Orchestration and schedules
Orchestrator: Airflow/Dagster (DAG: `fetch → decrypt → parse → normalize → validate → publish → match`).
SLA/SLO: Time-to-Availability (TtA) from file arrival to `normalized = READY`.
Retries: exponential backoff plus jitter; attempt limits; explicit statuses.
Concurrency and isolation: heavy OCR/XLSX parsing runs in a separate pool/worker with CPU/RAM limits.
DLQ replay: periodically reprocess when rules/mappings are updated.
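The retry policy above can be sketched as full-jitter exponential backoff (the generator shape is illustrative; orchestrators like Airflow also have built-in retry settings):

```python
import random

def backoff_delays(max_attempts: int = 5, base: float = 1.0,
                   cap: float = 60.0, rng=random.random):
    """Yield full-jitter exponential backoff delays, capped and attempt-limited."""
    for attempt in range(max_attempts):
        # delay in [0, min(cap, base * 2^attempt)); jitter avoids retry storms
        yield rng() * min(cap, base * (2 ** attempt))
```

Injecting `rng` keeps the policy deterministic in tests while remaining jittered in production.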
9) Observability and alerts
Metrics:
- Ingestion Success %, Parse Success % per source.
- TtA p50/p95, throughput (rows/min).
- DLQ Rate and DLQ aging p50/p95.
- Schema Drift Incidents.
- Duplicate Rate by `provider_txid`.
Alerts:
- `TtA p95 > SLA` → P1.
- `DLQ Rate > 2%` per hour for a provider → P1.
- Schema drift detected → P0 (stop auto-matching for the source).
- Duplicate spike → P2 (check the provider/webhooks).
Dashboard: a funnel `files → rows_raw → rows_norm → rows_valid → rows_matched`, a DLQ breakdown by reason, TtA quantiles.
10) Auto-corrections and mappings
Header aliases: a versioned dictionary (e.g., `Amount`→`amt`, `AMOUNT`→`amt`).
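A possible shape for such a versioned alias dictionary and its application; all names and aliases below are made up for illustration:

```python
# Hypothetical alias dictionary, version 2; kept in config, not code, in practice.
ALIASES_V2 = {
    "Amount": "amt", "AMOUNT": "amt", "Betrag": "amt",
    "TxnID": "provider_txid", "Transaction Id": "provider_txid",
}

def apply_aliases(headers, aliases):
    """Map raw provider headers to canonical names; collect unknowns for drift review."""
    mapped, unknown = [], []
    for h in headers:
        canonical = aliases.get(h.strip())
        if canonical is None:
            unknown.append(h)  # feeds the schema-drift alert, not silently dropped
        mapped.append(canonical or h.strip())
    return mapped, unknown
```

Returning the unknown headers explicitly is what lets a new column trigger a schema-drift incident instead of disappearing.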
11) Link to "Reconciliation of PSP Payments and Reports"
The finished normalized layer is the input for matching (provider_txid/merchant_ref/fuzzy), computing the diff taxonomy, auto-postings, and settlement↔bank stitching. Key fields: `provider_txid`, `sequence`, `kind`, `amount_minor`, `value_date`, `account`.
12) Storage model and tables
Table of landed files:

```sql
CREATE TABLE landed_files (
  file_id TEXT PRIMARY KEY,
  provider TEXT,
  source_kind TEXT,
  file_name TEXT,
  file_size BIGINT,
  checksum TEXT,
  statement_date DATE,
  received_at TIMESTAMP WITH TIME ZONE,
  version INT,
  status TEXT, -- RECEIVED|PARSED|FAILED
  error TEXT
);
```
Normalized rows:
```sql
CREATE TABLE psp_norm (
  row_id BIGSERIAL PRIMARY KEY,
  file_id TEXT REFERENCES landed_files(file_id),
  provider TEXT,
  source_kind TEXT,
  kind TEXT,
  payment_id TEXT,
  provider_txid TEXT,
  merchant_ref TEXT,
  sequence INT,
  amount_minor BIGINT,
  currency CHAR(3),
  fee_minor BIGINT,
  fx_rate NUMERIC(18,8),
  fx_src TEXT,
  event_ts TIMESTAMPTZ,
  value_date DATE,
  account TEXT,
  status TEXT,
  row_hash TEXT UNIQUE,
  repair_flags TEXT[]
);
CREATE INDEX idx_psp_norm_txid ON psp_norm(provider, provider_txid, sequence);
```
13) Parser pseudocode
CSV/XLSX:

```python
def parse_table(file, spec):
    df = load_csv_or_xlsx(file, delimiter=spec.delim, encoding=spec.enc, sheet=spec.sheet)
    df = rename_headers(df, spec.header_aliases)
    df = clean_amounts(df, thousand=spec.thousand, decimal=spec.decimal, sign_policy=spec.sign)
    rows = []
    for r in df.itertuples():
        rows.append(normalize_row(r, spec))
    return rows
```
XML CAMT:

```python
def parse_camt(xml):
    root = parse_xml(xml, xsd="camt053.xsd")
    for ntry in root.findall('.//Ntry'):
        sign = 1 if ntry.findtext('CdtDbtInd') == 'CRDT' else -1
        amt = Decimal(ntry.findtext('Amt')) * sign
        # ... map to normalized fields
```
OCR PDF (fallback):

```python
def parse_pdf_ocr(pdf):
    text = tesseract(pdf, lang="eng", psm=6, whitelist="0123456789.,-;:/A-Za-z")
    table = detect_table(text)
    return normalize_table(table)
```
14) GDPR/PII and log redaction
Masking/hashing: PAN/email/phone → `sha256 + salt`; logs never contain raw values.
Retention policy: per source type (AML/accounting requirements).
PII access by role only; audit all reads/exports.
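A hedged sketch of PAN masking consistent with the `bin`/`last4` fields of the normalized layer; salt handling (Vault/KMS, rotation) is out of scope here:

```python
import hashlib

def mask_pan(pan: str, salt: bytes) -> dict:
    """Replace a PAN with a salted hash plus the non-identifying bin/last4."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    return {
        "pan_hash": hashlib.sha256(salt + digits.encode("ascii")).hexdigest(),
        "bin": digits[:6],
        "last4": digits[-4:],
    }
```

The salted hash still allows joining the same card across files without ever storing or logging the raw PAN.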
15) KPIs and targets (for parsing/ingestion)
Ingestion Success % ≥ 99.5%/day per source.
Parse Success % ≥ 99%, DLQ ≤ 1%.
TtA p95 (file → normalized) ≤ 15 minutes (CSV/XML), ≤ 60 minutes (PDF/OCR).
Schema Drift Incidents: 0/month without an alert/fix.
Duplicate Rate by `provider_txid` ≤ 0.05%.
16) Incident playbooks
Schema drift: stop auto-matching, switch on the "soft" parser with ML-based column detection, prepare an alias patch, run a DLQ replay.
DLQ surge: triage the most recent files; check encoding/locale/sign; temporarily relax tolerances (behind a flag).
SFTP delays: switch to API polling/webhooks, increase retries, contact the provider.
Duplicate spikes: enable additional `row_hash` verification; block repeats until the cause is clear.
17) Test Case Package (UAT/Prod-Ready)
1. Idempotency: repeat the same load → 1 `file_id`, 0 new rows.
2. Locales: files with `,`/`.`/space separators → correct amounts.
3. Partial/refund: multiple `sequence` values for one `provider_txid`.
4. XML XSD: invalid CAMT → `INVALID` + DLQ.
5. MT940 variations: national extensions → correct parsing.
6. PDF→OCR: a noisy scan → successful extraction that passes the basic rules.
7. Schema drift: a new header → alias patch and reprocessing of historical files.
8. Throughput: load test of N files/hour → TtA SLA met.
9. PII redaction: logs without PAN/e-mail, hashes only.
18) Implementation checklist
- Source registry: protocol, schedule, SLA, format, contacts.
- Secure channels (SFTP/PGP/API), Vault for secrets.
- Idempotent ingestion + checksum + versions.
- Parsers by format, alias dictionary, sign/locale policies.
- Normalized layer and key indices.
- Validation rules, tolerances, DLQ and replay.
- Orchestrator (DAG), retries/backoff, resource pools.
- Observability: metrics, dashboards, alerts.
- GDPR/PII masking, access audits.
- Test cases and regular schema-drift drills.
Summary
Parsing automation is not about "writing a parser" but about building an industrial circuit: reliable delivery and encryption, idempotent pipelines, strict normalization, quality rules, and transparent alerting. Such a setup turns any registries into predictable tables with a guaranteed SLA on data availability: the foundation for reconciliation, treasury, and management reporting.