GambleHub

Registry parsing and automation

TL;DR

Reliable registry-parsing automation rests on three pillars: deterministic ingestion (secure delivery, idempotency, integrity control), a normalized data layer (a single schema, matching keys, standardized units for time/currencies/sign conventions), and strict quality discipline (validations, tolerances, a DLQ, alerts, auto-corrections). The goal is to turn a motley mix of files and webhooks into stable tables for reconciliation, reporting, and BI, with SLAs on data availability.


1) Source and format landscape

1.1 Sources

PSP/acquirers/aggregators: transactions, settlements, commissions, disputes.
Banks: MT940 statements, ISO 20022 CAMT.052/053/054, PAIN.001/002 payment files.
APMs/wallets/payouts (OCT/RTP/SEPA): payout registers, returns.
Crypto custodians/exchanges: on-chain transactions, conversion/commission reports.
Tax/government portals: CSV/XLSX/PDF, sometimes via a scripted browser.

1.2 Formats

CSV/TSV (variable separators, locales, encodings).
XLSX (multi-sheet, merged cells).
XML (ISO 20022 CAMT/PAIN, custom XSD schemas).
SWIFT MT940/942 (positional fields).
JSON API/NDJSON (incremental uploads, cursors).
PDF (tabular → parser; scans → OCR).
ZIP/TAR.GZ (multi-file batches).


2) Ingestion pipeline architecture

Stages:

1. Landing: secure file acceptance (SFTP/FTPS/WebDAV/API/webhooks) → compute the checksum immediately; keep the raw file immutable.

2. Raw: layout by date/provider/batch; versioned storage.

3. Normalize: parsing → unification of types/units → normalized tables.

4. Validated: post-validation (quality rules) → flags, DLQ.

5. Matched: mapping to internal events/bank records.

6. Serving/BI: data marts for reconciliation/finance/transactions.

Key requirements:
  • Idempotent ingestion: `(provider, file_name, file_size, checksum, statement_date)` → a unique key.
  • Reruns/retries: re-running a file does not create duplicates.
  • DLQ (dead-letter queue): everything unrecognized or rule-violating goes to an isolated queue.
  • Versioning: a new file for the same day → a new version with a link to the previous one.
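The ingestion key above can be collapsed into a deterministic `file_id`. A minimal sketch (the `ing_` prefix and the 16-character truncation are illustrative choices, not part of the document's spec):

```python
import hashlib

def make_file_id(provider: str, file_name: str, file_size: int,
                 checksum: str, statement_date: str) -> str:
    """Derive a deterministic file_id from the ingestion key, so the same
    file always maps to the same id no matter when (or how often) it arrives."""
    key = "|".join([provider, file_name, str(file_size), checksum, statement_date])
    return "ing_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Re-running the same file yields the same `file_id`, so a primary-key constraint on it makes the reload a no-op.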

3) Security of delivery and secrets

Channels: SFTP with restricted keys; FTPS only with strict TLS; API with OAuth2/short-TTL tokens.
Content encryption: PGP/GPG for file transfers; S/MIME for e-mail inboxes (if used).
Integrity check: SHA-256/512 checksums compared against the hash in the manifest.
Secrets: stored in Vault/KMS with rotation; never in config files or logs.
Access: RBAC plus the principle of least privilege; separate service accounts.
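The manifest check can be sketched as follows; hashing is streamed so large batch files never load fully into memory (the `verify_manifest` helper and its tolerance for case/whitespace in the manifest hash are assumptions):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks and return the hex SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(path: str, manifest_hash: str) -> bool:
    """Compare against the manifest value, ignoring case and stray whitespace."""
    return sha256_of(path) == manifest_hash.lower().strip()
```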


4) Normalization and data scheme

4.1 Universal rules

Time: always UTC in ISO 8601; settlement dates as `DATE` without a TZ.
Amounts: `DECIMAL(p, s)` in minor units with an explicit `scale`; sign (credit/debit) strictly per dictionary.
Currencies: ISO 4217; a fixed rate table with `fx_src`.
Locales: no autodetection - explicit configuration of separators and the decimal/thousands marks.
Encodings: UTF-8 on input; everything else is converted, with a log entry.
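Converting amounts into minor units with an explicit per-currency scale might look like this (the `SCALE` table is an illustrative subset of ISO 4217 minor-unit exponents, not a complete mapping):

```python
from decimal import Decimal, ROUND_HALF_UP

# ISO 4217 minor-unit exponents (illustrative subset)
SCALE = {"EUR": 2, "USD": 2, "JPY": 0, "BHD": 3}

def to_minor_units(amount: str, currency: str) -> int:
    """Quantize to the currency's scale, then shift into integer minor units."""
    scale = SCALE[currency]
    q = Decimal(amount).quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    return int(q.scaleb(scale))
```

Keeping `Decimal` end-to-end avoids the float rounding errors that plague amount columns.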

4.2 Normalized "flat" layer (example)

json
{
  "provider": "Acquirer_A",
  "source_kind": "PSP_TX | PSP_SETTLEMENT | BANK | WALLET | CRYPTO",
  "kind": "AUTH | CAPTURE | REFUND | PAYOUT | FEE | SETTLEMENT | CHARGEBACK",
  "payment_id": "pay_123",           // internal id
  "provider_txid": "psp_abc_789",    // external id
  "merchant_ref": "mr_456",
  "sequence": 0,                     // partial/refund line index
  "amount_minor": 100000,            // 1000.00
  "currency": "EUR",
  "fee_minor": 120,                  // 1.20
  "fx_rate": 1.0000,
  "fx_src": "PSP | ECB | BANK",
  "event_ts": "2025-11-03T12:00:00Z",
  "value_date": "2025-11-05",
  "account": "PSP_MERCHANT_CARD_A",
  "bin": "425000",
  "last4": "1234",
  "status": "APPROVED | CAPTURED | SUCCESS | FAILED | SETTLED",
  "file_id": "ing_20251103_001",
  "row_hash": "sha256(raw_row)"
}

5) Parsers by format: tricks and pitfalls

5.1 CSV/TSV

Explicitly specify `delimiter`, `quotechar`, `escapechar`, `encoding`.

Detect empty lines and header shifts; support header aliases (e.g. `Amount`/`AMT`/`TxnAmount`).
Convert the sign (minus/brackets); normalize thousands separators (`,`/`.`/spaces).
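Sign and separator normalization can be sketched as below; `clean_amount` is a hypothetical helper, and the bracket and trailing-minus conventions are examples of negative-amount formats seen in bank files, not an exhaustive list:

```python
from decimal import Decimal

def clean_amount(raw: str, thousand: str = ",", decimal: str = ".") -> Decimal:
    """Normalize a locale-formatted amount string: strip thousands separators,
    unify the decimal point, treat (1,234.56) and a trailing minus as negative."""
    s = raw.strip()
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    if s.endswith("-"):          # some bank formats place the sign last
        negative, s = True, s[:-1]
    s = s.replace(" ", "").replace(thousand, "")
    s = s.replace(decimal, ".")
    value = Decimal(s)
    return -value if negative else value
```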

5.2 XLSX

Read only a whitelist of sheets; disable auto-merge and flatten merged cells.
Convert formulas to values; Excel serial dates → UTC with an explicit TZ.

5.3 XML (ISO 20022 CAMT/PAIN)

Validate against the XSD; map identities via XPath (`<Ntry>`, `<TxDtls>`, `<Amt>`, `<CdtDbtInd>`).
Normalize credit/debit → sign; support multiple `<Chrgs>` and `<RmtInf>` elements.

5.4 MT940

Parse tags `:61:` and `:86:`; support national extensions; positional fields → slicing rules.
Consolidate several `:61:` entries into one batch.
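A simplified slicing rule for the `:61:` statement line might look like this; the regex covers only the leading fixed-position fields, and mapping reversal marks (`RC`/`RD`) like their base mark is a simplifying assumption rather than the full SWIFT field 61 specification:

```python
import re
from decimal import Decimal

# Leading fields of a :61: line: value date (YYMMDD), optional entry date
# (MMDD), debit/credit mark, optional funds code, amount with comma decimal.
LINE_61 = re.compile(
    r"^(?P<value_date>\d{6})(?P<entry_date>\d{4})?"
    r"(?P<dc>R?[CD])(?P<funds>[A-Z])?(?P<amount>\d{1,15},\d{0,2})"
)

def parse_61(line: str) -> dict:
    m = LINE_61.match(line)
    if m is None:
        raise ValueError(f"unparseable :61: line: {line!r}")
    # Simplifying assumption: RC/RD keep the sign of their base mark.
    sign = 1 if m.group("dc").endswith("C") else -1
    amount = Decimal(m.group("amount").replace(",", "."))
    return {"value_date": m.group("value_date"), "sign": sign, "amount": amount}
```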

5.5 JSON/NDJSON/API

Cursors: `since_id`/`created_at`/`page_token`; rate-limit-aware retries (429 → backoff + jitter).
Partial semantics (multiple refund rows per `provider_txid`).
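Cursor walking with 429-aware backoff can be sketched as follows; `client.get_page(cursor)` is a hypothetical API assumed to return `(status, rows, next_cursor)`:

```python
import random
import time

def fetch_all(client, max_retries: int = 5):
    """Walk a cursor-paginated endpoint; on HTTP 429, back off
    exponentially with jitter, capped at 60 seconds per wait."""
    cursor, rows = None, []
    while True:
        for attempt in range(max_retries):
            status, page, next_cursor = client.get_page(cursor)
            if status != 429:
                break
            time.sleep(min(60, 2 ** attempt) + random.random())
        else:  # every attempt was rate-limited
            raise RuntimeError("rate limit: retries exhausted")
        rows.extend(page)
        if next_cursor is None:
            return rows
        cursor = next_cursor
```

The jitter term spreads retries out so parallel workers do not hammer the API in lockstep.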

5.6 PDF/OCR

First attempt table parsing (table detector); only then OCR (Tesseract) with a character whitelist.
Post-validation: amounts, control totals, reconciliation of the line count.

5.7 Archives/Batches

Unpack preserving the structure; each file gets its own `file_id`; verify all parts against the manifest.


6) Data quality validations and rules

Mandatory checks:
  • Schema: all required fields are present.
  • Types: amounts are numeric, dates parse.
  • Checksums/totals: sum of rows = total in the file (if present).
  • Ranges: dates within a reasonable window; amount > 0 (or per the dictionary of allowed negatives).
  • Uniqueness: `(provider, provider_txid, sequence)` is not duplicated in the normalized layer.
  • Tolerances: allowed `amount`/`fx`/`time` discrepancies.

Result: `VALID`, `VALID_WITH_WARNINGS`, or `INVALID` → DLQ with a `reason_code`.
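The verdict logic could be wired up roughly like this (an illustrative subset of the checks; field names follow the normalized layer above, and the reason codes are examples):

```python
def validate_row(row: dict, seen_keys: set) -> tuple[str, list[str]]:
    """Apply schema, type, and uniqueness checks; return (verdict, reason_codes)."""
    reasons = []
    required = ("provider", "provider_txid", "amount_minor", "currency", "event_ts")
    for field in required:
        if row.get(field) in (None, ""):
            reasons.append(f"MISSING_{field.upper()}")
    if not isinstance(row.get("amount_minor"), int):
        reasons.append("BAD_AMOUNT_TYPE")
    key = (row.get("provider"), row.get("provider_txid"), row.get("sequence", 0))
    if key in seen_keys:
        reasons.append("DUPLICATE_KEY")
    seen_keys.add(key)
    if reasons:
        return "INVALID", reasons   # routed to the DLQ with reason codes
    return "VALID", []
```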


7) Identity and deduplication

Ingestion key: `(provider, file_name, file_size, checksum, statement_date)` → a single `file_id`.
Row-level idempotency: `row_hash = sha256(normalized_row_compact)`; reloading does not create new records.
Webhooks/API: the provider's `idempotency_key` plus your own tags (`exec_id`), stored with a TTL.
Provider duplicates: dedup by `provider_txid` + `sequence`; on mismatch → `DLQ_DUPLICATE`.
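A `row_hash` over the compact normalized row might be computed like this; using key-sorted JSON as the canonical form is an assumption - any stable serialization works, as long as it is deterministic:

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    """Hash the compact, key-sorted normalized row so that reloading the
    same file produces identical hashes (and the UNIQUE index rejects dupes)."""
    compact = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(compact.encode("utf-8")).hexdigest()
```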


8) Orchestration and schedules

Orchestrator: Airflow/Dagster (DAG: `fetch → decrypt → parse → normalize → validate → publish → match`).
SLA/SLO: Time-to-Availability (TtA) from file arrival to `normalized = READY`.
Retries: exponential backoff + jitter; attempt limits; clear statuses.
Concurrency and isolation: heavy OCR and XLSX parsing run in a separate pool/worker with CPU/RAM limits.
DLQ replay: periodic reprocessing when rules/mappings are updated.


9) Observability and alerts

Metrics:
  • Ingestion Success % and Parse Success % by source.
  • TtA p50/p95, throughput (lines/min).
  • DLQ Rate and DLQ aging p50/p95.
  • Schema drift incidents.
  • Duplicate rate by `provider_txid`.
Alerts (example):
  • `TtA p95 > SLA` → P1.
  • `DLQ Rate > 2%` per hour for a provider → P1.
  • `Schema drift detected` → P0 (stop auto-matching for the source).
  • `Duplicate spike` → P2 (check the provider/webhooks).

Dashboard: a funnel `files → rows_raw → rows_norm → rows_valid → rows_matched`, a DLQ card by reason, and TtA quantiles.


10) Auto corrections and mappings

Header aliases: a versioned dictionary (e.g., `Amount`→`amt`, `AMOUNT`→`amt`).

Code maps: provider statuses → your own directory (`APPROVED`/`CAPTURED`/`SETTLED`).
Sign policy: `CR/DR`, `C/D`, brackets - all into a single sign model.
Amount repair: removal of thousands separators, normalization of the minus sign.
Timezone repair: the provider's local time → UTC, including DST.
💡 Every auto-correction is logged and marked in `repair_flags`.
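Applying the code maps and sign policy while recording repair flags could look like this; the `STATUS_MAP`/`SIGN_MAP` contents and the flag names are illustrative, not any provider's real vocabulary:

```python
STATUS_MAP = {  # provider status → internal directory (illustrative)
    "APPROVED": "AUTH", "CAPTURED": "CAPTURE", "SETTLED": "SETTLEMENT",
}
SIGN_MAP = {"CR": 1, "C": 1, "DR": -1, "D": -1}

def apply_maps(row: dict) -> dict:
    """Map provider codes to internal ones; log any repair in repair_flags."""
    repair_flags = []
    raw_status = row["status"].strip().upper()
    if raw_status != row["status"]:
        repair_flags.append("STATUS_CASE_FIXED")
    row["kind"] = STATUS_MAP[raw_status]        # KeyError → route to DLQ
    row["sign"] = SIGN_MAP[row["dc_mark"].upper()]
    row["repair_flags"] = repair_flags
    return row
```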

11) Link to "Reconciliation of PSP Payments and Reports"

The ready normalized layer is the input for matching (`provider_txid`/`merchant_ref`/fuzzy), computing the diff taxonomy, auto-postings, and settlement↔bank stitching. Key fields: `provider_txid`, `sequence`, `kind`, `amount_minor`, `value_date`, `account`.


12) Storage model and tables

Table of landed files:
sql
CREATE TABLE landed_files (
file_id TEXT PRIMARY KEY,
provider TEXT,
source_kind TEXT,
file_name TEXT,
file_size BIGINT,
checksum TEXT,
statement_date DATE,
received_at TIMESTAMP WITH TIME ZONE,
version INT,
status TEXT, -- RECEIVED | PARSED | FAILED
error TEXT
);
Normalized rows:
sql
CREATE TABLE psp_norm (
row_id BIGSERIAL PRIMARY KEY,
file_id TEXT REFERENCES landed_files(file_id),
provider TEXT,
source_kind TEXT,
kind TEXT,
payment_id TEXT,
provider_txid TEXT,
merchant_ref TEXT,
sequence INT,
amount_minor BIGINT,
currency CHAR(3),
fee_minor BIGINT,
fx_rate NUMERIC(18,8),
fx_src TEXT,
event_ts TIMESTAMPTZ,
value_date DATE,
account TEXT,
status TEXT,
row_hash TEXT UNIQUE,
repair_flags TEXT[]
);
CREATE INDEX idx_psp_norm_txid ON psp_norm(provider, provider_txid, sequence);

13) Parser pseudocode

CSV/XLSX:
python
def parse_table(file, spec):
    df = load_csv_or_xlsx(file, delimiter=spec.delim, encoding=spec.enc, sheet=spec.sheet)
    df = rename_headers(df, spec.header_aliases)
    df = clean_amounts(df, thousand=spec.thousand, decimal=spec.decimal, sign_policy=spec.sign)
    rows = []
    for r in df.itertuples():
        rows.append(normalize_row(r, spec))
    return rows
XML CAMT:
python
def parse_camt(xml):
    root = parse_xml(xml, xsd="camt053.xsd")
    for ntry in root.findall('.//Ntry'):
        sign = 1 if ntry.findtext('CdtDbtInd') == 'CRDT' else -1
        amt = Decimal(ntry.findtext('Amt')) * sign
        # ... map to normalized fields
OCR PDF (fallback):
python
def parse_pdf_ocr(pdf):
    text = tesseract(pdf, lang="eng", psm=6, whitelist="0123456789.,-;:/A-Za-z")
    table = detect_table(text)
    return normalize_table(table)

14) GDPR/PII and log redaction

Masking/hashing: PAN/email/phone → `sha256 + salt`; logs carry no raw values.
Retention policy: `retention` per source type (AML/accounting requirements).
PII access by role only; audit of reads and exports.
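Masking and salted hashing might be sketched as below; using HMAC-SHA256 so the salt acts as a secret key (rather than a public prefix) is a design choice, not something the text mandates:

```python
import hashlib
import hmac

def mask_pan(pan: str) -> str:
    """Keep only BIN + last4 for display; never log the full PAN."""
    return pan[:6] + "*" * (len(pan) - 10) + pan[-4:]

def hash_pii(value: str, salt: bytes) -> str:
    """Salted hash for joinable-but-unreadable PII columns."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
```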


15) KPIs and targets (for parsing/ingestion)

Ingestion Success % ≥ 99.5%/day per source.
Parse Success % ≥ 99%, DLQ ≤ 1%.
TtA p95 (file → normalized) ≤ 15 minutes (CSV/XML), ≤ 60 minutes (PDF/OCR).
Schema drift incidents: 0/month without an alert/fix.
Duplicate rate by `provider_txid` ≤ 0.05%.


16) Incident playbooks

Schema drift: stop auto-matching, switch to the "soft" parser with ML-based column detection, prepare an alias patch, run a DLQ replay.
DLQ surge: debug the latest files, check encoding/locale/sign, temporarily loosen tolerances (with a flag).
SFTP delays: switch to API polling/webhooks, increase retries, contact the provider.
Duplicate spikes: enable additional `row_hash` verification, block repeats until clarified.


17) Test Case Package (UAT/Prod-Ready)

1. Idempotency: repeat the same load → 1 `file_id`, 0 new rows.
2. Locales: files with `,`/`.`/spaces → correct sums.
3. Partial/refund: multiple `sequence` values per `provider_txid`.
4. XML XSD: invalid CAMT → `INVALID` + DLQ.
5. MT940 variations: national extensions → correct parsing.
6. PDF→OCR: a noisy scan → extraction passes the basic rules.
7. Schema drift: a new header → alias patch and reprocessing of historical files.
8. Throughput: load test of N files/hour → compliance with the TtA SLA.
9. PII redaction: logs without PAN/e-mail, hashes only.


18) Implementation checklist

  • Source registry: protocol, schedule, SLA, format, contact.
  • Secure channels (SFTP/PGP/API), Vault for secrets.
  • Idempotent ingestion + checksums + versions.
  • Parsers per format, alias dictionary, sign/locale policies.
  • Normalized layer and key indices.
  • Validation rules, tolerances, DLQ and replay.
  • Orchestrator (DAG), retries/backoff, resource pools.
  • Observability: metrics, dashboards, alerts.
  • GDPR/PII masking, access audits.
  • Test cases and regular schema-drift drills.

Summary

Parsing automation is not about "writing a parser" but about building an industrial circuit: reliable delivery and encryption, idempotent pipelines, strict normalization, quality rules, and transparent alerts. Such a setup turns any registers into predictable tables with a guaranteed SLA on data availability - the foundation for reconciliation, treasury, and management reporting.
