Registry parsing and automation
TL;DR
Reliable registry-verification automation rests on three pillars: deterministic ingestion (secure delivery, idempotence, integrity control), a normalized data layer (a single schema, matching keys, standardized units for time/currencies/signs), and strict quality discipline (validations, tolerances, a DLQ, alerts, auto-corrections). The goal is to turn heterogeneous files/webhooks into stable tables for reconciliation, reporting, and BI, with SLAs on availability.
1) Source and format landscape
1.1 Sources
PSP/acquirers/aggregators: transactions, settlements, commissions, disputes.
Banks: MT940 statements; ISO 20022 camt.052/053/054 statements; pain.001/002 payments.
APMs/wallets/payout rails (OCT/RTP/SEPA): payout registers, returns.
Crypto custodians/exchanges: on-chain transactions, conversion/commission reports.
Tax/government portals: CSV/XLSX/PDF, sometimes via a scripted browser.
1.2 Formats
CSV/TSV (varying delimiters, locales, encodings).
XLSX (multi-sheet, merged cells).
XML (ISO 20022 CAMT/PAIN, custom XSD schemas).
SWIFT MT940/942 (positional fields).
JSON-API/NDJSON (incremental uploads, cursors).
PDF (tabular → table parser; scans → OCR).
ZIP/TAR.GZ (multi-file batches).
2) Ingestion pipeline architecture
Layers:
1. Landing: secure file acceptance (SFTP/FTPS/WebDAV/API/webhooks) → compute the checksum immediately, keep the raw file unchanged.
2. Raw: laid out by date/provider/batch, stored with versions.
3. Normalize: parsing → unification of types/units → normalized tables.
4. Validated: post-validation (quality rules) → flags, DLQ.
5. Matched: mapping to internal events/bank records.
6. Serving/BI: data marts for reconciliation/finance/transactions.
Key requirements:
- Idempotent ingestion: `(provider, file_name, file_size, checksum, statement_date)` → a unique key.
- Reruns/retries: re-running a file does not create duplicates.
- DLQ (dead-letter queue): everything unrecognized or rule-violating goes to an isolated queue.
- Versioning: a new file for the same day → a new version with a link to the previous one.
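A minimal sketch of deriving such a deterministic ingestion key (Python; `make_file_id` and the `ing_` prefix are illustrative, not a fixed convention):

```python
import hashlib

def make_file_id(provider: str, file_name: str, file_size: int,
                 checksum: str, statement_date: str) -> str:
    """Derive a deterministic file_id from the ingestion key tuple.

    Re-delivering the exact same file yields the exact same file_id,
    so a UNIQUE constraint on file_id makes reruns idempotent.
    """
    key = f"{provider}|{file_name}|{file_size}|{checksum}|{statement_date}"
    return "ing_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

The point of hashing the whole tuple rather than using a sequence is that retries and replays need no coordination: the same input always maps to the same key.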
3) Security of delivery and secrets
Channels: SFTP with restricted keys; FTPS only with strict TLS; API with OAuth2/short-TTL tokens.
Content encryption: PGP/GPG for delivered files; S/MIME for e-mail inboxes (if used).
Integrity check: SHA-256/512 checksums compared against the hash in the manifest.
Secrets: stored in Vault/KMS, rotated; never in config files/logs.
Access: RBAC plus the principle of least privilege; separate service accounts.
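A small sketch of the integrity check, assuming the manifest supplies a hex SHA-256 (the helper name is hypothetical):

```python
import hashlib

def verify_checksum(path: str, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file in chunks and compare its SHA-256 with the manifest hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()
```

Streaming in chunks keeps memory flat for multi-gigabyte statement files; a mismatch should halt the pipeline before the file ever reaches the Raw layer.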
4) Normalization and data schema
4.1 Universal rules
Time: always UTC in ISO-8601; settlement dates as `DATE` without TZ.
Amounts: `DECIMAL(p, s)` in minor units plus an explicit `scale`; sign (credit/debit) strictly per dictionary.
Currencies: ISO-4217; a fixed rate table with `fx_src`.
Locales: no autodetection; delimiters and decimal/thousands separators are configured explicitly.
Encodings: UTF-8 on input; everything else converted with a log entry.
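A sketch of amount normalization under these rules, with the separators passed in explicitly from the per-source spec (the helper name is illustrative):

```python
from decimal import Decimal, ROUND_HALF_UP

def to_minor_units(raw: str, thousand: str, decimal_sep: str, scale: int = 2) -> int:
    """Convert a locale-formatted amount string into integer minor units.

    Separators come from the source spec; nothing is autodetected.
    """
    s = raw.strip().replace(thousand, "").replace(decimal_sep, ".")
    q = Decimal(s).quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    return int(q.scaleb(scale))
```

Using `Decimal` end to end avoids the binary-float rounding drift that silently breaks checksum/total validations downstream.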
4.2 Normalized "flat" layer (example)
```json
{
  "provider": "Acquirer_A",
  "source_kind": "PSP_TX|PSP_SETTLEMENT|BANK|WALLET|CRYPTO",
  "kind": "AUTH|CAPTURE|REFUND|PAYOUT|FEE|SETTLEMENT|CHARGEBACK",
  "payment_id": "pay_123",           // internal
  "provider_txid": "psp_abc_789",    // external
  "merchant_ref": "mr_456",
  "sequence": 0,                     // partial/refund line index
  "amount_minor": 100000,            // 1000.00
  "currency": "EUR",
  "fee_minor": 120,                  // 1.20
  "fx_rate": 1.0000,
  "fx_src": "PSP|ECB|BANK",
  "event_ts": "2025-11-03T12:00:00Z",
  "value_date": "2025-11-05",
  "account": "PSP_MERCHANT_CARD_A",
  "bin": "425000",
  "last4": "1234",
  "status": "APPROVED|CAPTURED|SUCCESS|FAILED|SETTLED",
  "file_id": "ing_20251103_001",
  "row_hash": "sha256(raw_row)"
}
```
5) Parsers by format: tricks and pitfalls
5.1 CSV/TSV
Explicitly specify `delimiter`, `quotechar`, `escapechar`, `encoding`.
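For illustration, a minimal reader with every dialect parameter pinned (Python stdlib `csv`; the wrapper name is hypothetical):

```python
import csv
import io

def read_csv_rows(data: bytes, delimiter: str, quotechar: str,
                  encoding: str, escapechar=None):
    """Read CSV rows with every dialect parameter set explicitly."""
    text = data.decode(encoding)  # fail loudly if the declared encoding is wrong
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter,
                            quotechar=quotechar, escapechar=escapechar)
    return list(reader)
```

Decoding with the declared encoding (instead of `errors="replace"`) is deliberate: a wrong encoding should land the file in the DLQ, not produce silently mangled rows.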
5.2 XLSX
Read only whitelisted sheets; disable automatic merge handling and flatten merged cells explicitly.
Convert formulas to values; convert Excel serial dates to UTC with an explicit TZ.
5.3 XML (ISO 20022 CAMT/PAIN)
Validate against the XSD; map fields via XPath (`<Ntry>`, `<TxDtls>`, `<Amt>`, `<CdtDbtInd>`).
Normalize credit/debit into a sign; support multiple `<Chrgs>`, `<RmtInf>`.
5.4 MT940
Parse tags `:61:`, `:86:`; support national extensions; handle positional fields via slicing rules.
Consolidate multiple `:61:` entries into one batch.
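A hedged sketch of parsing a simplified `:61:` statement line into signed minor units; real MT940 variants differ per bank and national extension, so the regex below is illustrative only:

```python
import re
from decimal import Decimal

# Assumed simplified layout: YYMMDD [MMDD] C/D amount,dd Nxxx reference
TAG61 = re.compile(
    r"^(?P<value_date>\d{6})"    # YYMMDD value date
    r"(?P<entry_date>\d{4})?"    # optional MMDD booking date
    r"(?P<dc>C|D|RC|RD)"         # credit/debit (or reversal) mark
    r"(?P<amount>\d+,\d{0,2})"   # amount, comma as decimal mark
    r"(?P<type>N[A-Z]{3})"       # SWIFT transaction type code
    r"(?P<ref>.{0,16})"          # customer reference
)

def parse_61(line: str) -> dict:
    """Parse a simplified MT940 :61: line; raise on anything unrecognized."""
    m = TAG61.match(line)
    if not m:
        raise ValueError(f"unparseable :61: line: {line!r}")
    sign = -1 if m.group("dc") in ("D", "RC") else 1
    minor = int(Decimal(m.group("amount").replace(",", ".")) * 100) * sign
    return {"value_date": m.group("value_date"), "amount_minor": minor,
            "type": m.group("type"), "ref": m.group("ref")}
```

Raising on a non-match (rather than skipping) matters here: an unparseable line is a DLQ candidate, not a line to drop.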
5.5 JSON/NDJSON/API
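The cursor-driven incremental pattern mentioned in 1.2 might be sketched like this; `fetch_page` is a hypothetical transport callable, not a real client API:

```python
import json

def ingest_ndjson_pages(fetch_page, cursor=None):
    """Pull NDJSON pages from a cursor-based API until the cursor is exhausted.

    fetch_page(cursor) is assumed to return (ndjson_text, next_cursor),
    where next_cursor=None signals the final page.
    """
    rows = []
    while True:
        body, cursor = fetch_page(cursor)
        rows.extend(json.loads(line) for line in body.splitlines() if line.strip())
        if cursor is None:
            return rows
```

Persisting the last successful cursor between runs (omitted here) is what makes the pull resumable after a failure.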
5.6 PDF/OCR
Attempt table parsing first (table detector); fall back to OCR (Tesseract) with a character whitelist.
Post-validation: amounts, control totals, reconciliation of line counts.
5.7 Archives/Batches
Unpack preserving the directory structure; each file gets its own `file_id`; use the manifest to verify all parts arrived.
6) Data quality validations and rules
Mandatory checks:
- Schema: all required fields are present.
- Types: amounts are numeric, dates parse.
- Checksums/totals: the sum of the rows equals the file total (if present).
- Ranges: date within a reasonable window; amount > 0 (or negative per an allow-list dictionary).
- Uniqueness: `(provider, provider_txid, sequence)` is not duplicated in the normalized layer.
- Tolerances: allowed `amount/fx/time` discrepancies.
Result: `VALID`, `VALID_WITH_WARNINGS`, or `INVALID` → DLQ with a `reason_code`.
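One way these checks could be wired together; the required-field set, reason codes, and tolerance keys below are assumptions for illustration:

```python
def validate_row(row: dict, tolerances: dict):
    """Apply mandatory checks to a normalized row; return (status, reason_codes)."""
    reasons = []
    required = ("provider", "provider_txid", "amount_minor", "currency", "event_ts")
    for field in required:
        if row.get(field) in (None, ""):
            reasons.append(f"MISSING_{field.upper()}")
    if not isinstance(row.get("amount_minor"), int):
        reasons.append("BAD_TYPE_AMOUNT")
    elif row["amount_minor"] <= 0 and row.get("kind") not in ("REFUND", "CHARGEBACK", "FEE"):
        reasons.append("NEGATIVE_AMOUNT")  # negative only for allow-listed kinds
    if reasons:
        return "INVALID", reasons
    if abs(row.get("fee_minor", 0)) > tolerances.get("max_fee_minor", 10**9):
        return "VALID_WITH_WARNINGS", ["FEE_OUT_OF_RANGE"]
    return "VALID", []
```

Hard failures go to `INVALID`/DLQ with explicit reason codes; soft tolerance breaches only downgrade to `VALID_WITH_WARNINGS`, so the row still flows to matching.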
7) Identity and deduplication
Ingestion key: `(provider, file_name, file_size, checksum, statement_date)` → a single `file_id`.
Row-level idempotence: `row_hash = sha256(normalized_row_compact)`; reloading does not create new records.
Webhooks/API: the provider's `idempotency_key` plus your own tags (`exec_id`), stored with a TTL.
Provider duplicates: dedup by `provider_txid` + `sequence`; on mismatch, route to DLQ_DUPLICATE.
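A sketch of row-level idempotence via a canonical hash; which fields count as volatile (and are therefore excluded) is an assumption about what varies between re-deliveries:

```python
import hashlib
import json

def row_hash(norm_row: dict) -> str:
    """Hash a canonical, key-sorted, compact JSON form of a normalized row.

    Volatile bookkeeping fields (file_id, row_id) are excluded so the same
    provider row re-delivered in a new file hashes identically.
    """
    stable = {k: v for k, v in norm_row.items() if k not in ("file_id", "row_id")}
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

With a UNIQUE constraint on `row_hash` (as in the `psp_norm` table below in section 12), a reload of the same file becomes a no-op at the row level.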
8) Orchestration and schedules
Orchestrator: Airflow/Dagster (DAG: `fetch → decrypt → parse → normalize → validate → publish → match`).
SLA/SLO: Time-to-Availability (TtA) from file arrival to `normalized = READY`.
Retries: exponential backoff plus jitter; attempt limits; explicit statuses.
Concurrency and isolation: heavy OCR/XLSX parsing runs in a separate pool/worker with CPU/RAM limits.
DLQ replay: periodically reprocess when rules/mappings are updated.
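The retry policy above can be sketched as full-jitter exponential backoff (the generator shape is illustrative; orchestrators like Airflow also have built-in retry settings):

```python
import random

def backoff_delays(max_attempts: int = 5, base: float = 1.0,
                   cap: float = 60.0, rng=random.random):
    """Yield full-jitter exponential backoff delays, capped and attempt-limited."""
    for attempt in range(max_attempts):
        # delay in [0, min(cap, base * 2^attempt)); jitter avoids retry storms
        yield rng() * min(cap, base * (2 ** attempt))
```

Injecting `rng` keeps the policy deterministic in tests while remaining jittered in production.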
9) Observability and alerts
Metrics:
- Ingestion Success %, Parse Success % per source.
- TtA p50/p95, throughput (rows/min).
- DLQ Rate and DLQ aging p50/p95.
- Schema Drift Incidents.
- Duplicate Rate by `provider_txid`.
Alerts:
- `TtA p95 > SLA` → P1.
- `DLQ Rate > 2%` per hour for a provider → P1.
- Schema drift detected → P0 (stop auto-matching for the source).
- Duplicate spike → P2 (check the provider/webhooks).
Dashboard: a funnel `files → rows_raw → rows_norm → rows_valid → rows_matched`, a DLQ breakdown by reason, TtA quantiles.
10) Auto-corrections and mappings
Header aliases: a versioned dictionary (e.g., `Amount`→`amt`, `AMOUNT`→`amt`).
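A possible shape for such a versioned alias dictionary and its application; all names and aliases below are made up for illustration:

```python
# Hypothetical alias dictionary, version 2; kept in config, not code, in practice.
ALIASES_V2 = {
    "Amount": "amt", "AMOUNT": "amt", "Betrag": "amt",
    "TxnID": "provider_txid", "Transaction Id": "provider_txid",
}

def apply_aliases(headers, aliases):
    """Map raw provider headers to canonical names; collect unknowns for drift review."""
    mapped, unknown = [], []
    for h in headers:
        canonical = aliases.get(h.strip())
        if canonical is None:
            unknown.append(h)  # feeds the schema-drift alert, not silently dropped
        mapped.append(canonical or h.strip())
    return mapped, unknown
```

Returning the unknown headers explicitly is what lets a new column trigger a schema-drift incident instead of disappearing.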
11) Link to "Reconciliation of PSP Payments and Reports"
The finished normalized layer is the input for matching (provider_txid/merchant_ref/fuzzy), computing the diff taxonomy, auto-postings, and settlement↔bank stitching. Key fields: `provider_txid`, `sequence`, `kind`, `amount_minor`, `value_date`, `account`.
12) Storage model and tables
Table of landed files:

```sql
CREATE TABLE landed_files (
  file_id TEXT PRIMARY KEY,
  provider TEXT,
  source_kind TEXT,
  file_name TEXT,
  file_size BIGINT,
  checksum TEXT,
  statement_date DATE,
  received_at TIMESTAMP WITH TIME ZONE,
  version INT,
  status TEXT, -- RECEIVED|PARSED|FAILED
  error TEXT
);
```
Normalized rows:
```sql
CREATE TABLE psp_norm (
  row_id BIGSERIAL PRIMARY KEY,
  file_id TEXT REFERENCES landed_files(file_id),
  provider TEXT,
  source_kind TEXT,
  kind TEXT,
  payment_id TEXT,
  provider_txid TEXT,
  merchant_ref TEXT,
  sequence INT,
  amount_minor BIGINT,
  currency CHAR(3),
  fee_minor BIGINT,
  fx_rate NUMERIC(18,8),
  fx_src TEXT,
  event_ts TIMESTAMPTZ,
  value_date DATE,
  account TEXT,
  status TEXT,
  row_hash TEXT UNIQUE,
  repair_flags TEXT[]
);
CREATE INDEX idx_psp_norm_txid ON psp_norm(provider, provider_txid, sequence);
```
13) Parser pseudocode
CSV/XLSX:

```python
def parse_table(file, spec):
    df = load_csv_or_xlsx(file, delimiter=spec.delim, encoding=spec.enc, sheet=spec.sheet)
    df = rename_headers(df, spec.header_aliases)
    df = clean_amounts(df, thousand=spec.thousand, decimal=spec.decimal, sign_policy=spec.sign)
    rows = []
    for r in df.itertuples():
        rows.append(normalize_row(r, spec))
    return rows
```
XML CAMT:

```python
def parse_camt(xml):
    root = parse_xml(xml, xsd="camt053.xsd")
    for ntry in root.findall('.//Ntry'):
        sign = 1 if ntry.findtext('CdtDbtInd') == 'CRDT' else -1
        amt = Decimal(ntry.findtext('Amt')) * sign
        # ... map to normalized fields
```
OCR PDF (fallback):

```python
def parse_pdf_ocr(pdf):
    text = tesseract(pdf, lang="eng", psm=6, whitelist="0123456789.,-;:/A-Za-z")
    table = detect_table(text)
    return normalize_table(table)
```
14) GDPR/PII and log redaction
Masking/hashing: PAN/email/phone → `sha256 + salt`; logs never contain raw values.
Retention policy: per source type (AML/accounting requirements).
PII access by role only; audit all reads/exports.
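A hedged sketch of PAN masking consistent with the `bin`/`last4` fields of the normalized layer; salt handling (Vault/KMS, rotation) is out of scope here:

```python
import hashlib

def mask_pan(pan: str, salt: bytes) -> dict:
    """Replace a PAN with a salted hash plus the non-identifying bin/last4."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    return {
        "pan_hash": hashlib.sha256(salt + digits.encode("ascii")).hexdigest(),
        "bin": digits[:6],
        "last4": digits[-4:],
    }
```

The salted hash still allows joining the same card across files without ever storing or logging the raw PAN.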
15) KPIs and targets (for parsing/ingestion)
Ingestion Success % ≥ 99.5%/day per source.
Parse Success % ≥ 99%, DLQ ≤ 1%.
TtA p95 (file → normalized) ≤ 15 minutes (CSV/XML), ≤ 60 minutes (PDF/OCR).
Schema Drift Incidents: 0/month without an alert/fix.
Duplicate Rate by `provider_txid` ≤ 0.05%.
16) Incident playbooks
Schema drift: stop auto-matching, switch on the "soft" parser with ML-based column detection, prepare an alias patch, run a DLQ replay.
DLQ surge: triage the most recent files; check encoding/locale/sign; temporarily relax tolerances (behind a flag).
SFTP delays: switch to API polling/webhooks, increase retries, contact the provider.
Duplicate spikes: enable additional `row_hash` verification; block repeats until the cause is clear.
17) Test Case Package (UAT/Prod-Ready)
1. Idempotency: repeat the same load → 1 `file_id`, 0 new rows.
2. Locales: files with `,`/`.`/space separators → correct amounts.
3. Partial/refund: multiple `sequence` values for one `provider_txid`.
4. XML XSD: invalid CAMT → `INVALID` + DLQ.
5. MT940 variations: national extensions → correct parsing.
6. PDF→OCR: a noisy scan → successful extraction that passes the basic rules.
7. Schema drift: a new header → alias patch and reprocessing of historical files.
8. Throughput: load test of N files/hour → TtA SLA met.
9. PII redaction: logs without PAN/e-mail, hashes only.
18) Implementation checklist
- Source registry: protocol, schedule, SLA, format, contacts.
- Secure channels (SFTP/PGP/API), Vault for secrets.
- Idempotent ingestion + checksum + versions.
- Parsers by format, alias dictionary, sign/locale policies.
- Normalized layer and key indices.
- Validation rules, tolerances, DLQ and replay.
- Orchestrator (DAG), retries/backoff, resource pools.
- Observability: metrics, dashboards, alerts.
- GDPR/PII masking, access audits.
- Test cases and regular schema-drift drills.
Summary
Parsing automation is not about "writing a parser" but about building an industrial circuit: reliable delivery and encryption, idempotent pipelines, strict normalization, quality rules, and transparent alerting. Such a setup turns any registries into predictable tables with a guaranteed SLA on data availability: the foundation for reconciliation, treasury, and management reporting.