Payment routing and failover
1) Why you need it
Conversion: selecting the right channel/PSP by BIN, bank, geo, and risk raises Auth Rate by 5-15 p.p.
Cost: dynamic selection by "success × fee" reduces the effective rate by 10-30 bps.
Resilience: isolation from PSP/3DS/bank outages; deposits and payouts continue during partial failures.
Compliance/RG: flexible implementation of limits, geo-restrictions, self-exclusions, and sanctions rules directly in routing.
2) Target architecture (layers)
1. Checkout Layer - localization of currencies/methods, APM discovery, 3DS UX.
2. Payment Orchestrator (Rule Engine) - routing, smart retry, idempotency, circuit-breaker.
3. Risk/KYT Engine - device/behavior, sanctions/PEP, velocity, RG limits, 3DS policy.
4. Compliance Hub - KYC, sanctions providers, affordability/limits, audits.
5. Wallet & Ledgers - money and game ledgers, bonus liabilities, multicurrency.
6. Reconciliation & Reporting - T+0/T+1 reconciliations, reason codes, tax registers.
7. Observability & Security - metrics/logs/traces, alerts, RBAC, PCI segmentation.
8. Data/ML - risk scoring, conversion prediction by banks/methods.
3) Data model and idempotency
Payment Intent (PI): a single object for deposits and payouts with the fields: amount, currency, method, geo, BIN, risk_score, rg_limits, route_history, idempotency_key, status.
Idempotency: every hop (PSP-A → PSP-B) is performed with the same idempotency_key; retrying a call does not change the state of the ledger.
Route Journal: a log of routes and responses (PSP id, reason code, latency, 3DS flow, fee), needed for A/B tests and model training.
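A minimal Python sketch of the idempotency idea, assuming an in-memory store. The field names come from the Payment Intent description above; the types, the `Ledger` class, and its `credit` method are hypothetical and exist only for illustration. It shows that replaying the same idempotency_key after a PSP-A → PSP-B hop does not move the balance twice.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class PaymentIntent:
    """Fields follow the Payment Intent description; the types are assumptions."""
    idempotency_key: str
    amount: Decimal
    currency: str
    method: str
    geo: str
    bin: str
    risk_score: float = 0.0
    rg_limits: dict = field(default_factory=dict)
    route_history: list = field(default_factory=list)
    status: str = "new"


class Ledger:
    """Toy in-memory ledger (hypothetical): each key is applied at most once."""

    def __init__(self) -> None:
        self.balance = Decimal("0")
        self._applied = {}  # idempotency_key -> credited amount

    def credit(self, pi: PaymentIntent) -> bool:
        """Return True if the balance changed, False for a replayed key."""
        if pi.idempotency_key in self._applied:
            return False
        self._applied[pi.idempotency_key] = pi.amount
        self.balance += pi.amount
        return True


pi = PaymentIntent("dep-001", Decimal("50.00"), "EUR", "card", "DE", "411111")
ledger = Ledger()
first = ledger.credit(pi)            # first confirmation credits the balance
pi.route_history += ["PSP-A", "PSP-B"]  # the intent hopped between PSPs
second = ledger.credit(pi)           # duplicate confirmation, same key: no-op
```

A production system would keep the applied keys in durable storage with a uniqueness constraint rather than in process memory.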
4) Routing algorithm (reference)
4.1 Acquiring
1. Pre-score: GEO, BIN/IIN, issuing bank, device, ticket size, risk score, RG status.
2. Compliance filters: sanctions/PEP, geo-blocks, age/self-exclusion.
3. Cost/success rules: score = w1·AuthRate + w2·(−Fee) + w3·Health − penalties.
4. 3DS policy: TRA/whitelisting/step-up by risk; choice of challenge vs frictionless.
5. Route selection: PSP-A → (on failures/errors) → PSP-B → alternative method (APM/open banking).
6. Smart Retry: change of 3DS mode, MID, or MCC/fallback, with backoff chosen by reason code (issuer declines 05/51/62 are handled differently from technical errors 91/96).
7. Post-processing: Route Journal entry, balance update.
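The cost/success rule in step 3 can be sketched in Python as follows. The `Route` class, the weight values, and the sample numbers are assumptions for illustration; only the shape of the formula (w1·AuthRate + w2·(−Fee) + w3·Health − penalties) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    auth_rate: float      # historical approval share, 0..1
    fee: float            # fee as a fraction of the amount, e.g. 0.02 = 2%
    health: float         # 0..1, taken from monitoring
    penalty: float = 0.0  # e.g. recent incidents (hypothetical input)


def score(r: Route, w1: float = 1.0, w2: float = 5.0, w3: float = 0.5) -> float:
    """score = w1*AuthRate + w2*(-Fee) + w3*Health - penalties."""
    return w1 * r.auth_rate + w2 * (-r.fee) + w3 * r.health - r.penalty


def rank(routes: list[Route]) -> list[Route]:
    """Best route first; later routes are failover candidates."""
    return sorted(routes, key=score, reverse=True)


routes = [
    Route("PSP-A", auth_rate=0.90, fee=0.025, health=1.0),
    Route("PSP-B", auth_rate=0.86, fee=0.015, health=1.0),
    Route("PSP-C", auth_rate=0.92, fee=0.020, health=0.2, penalty=0.1),
]
order = [r.name for r in rank(routes)]
# PSP-B wins on fee despite a lower auth rate; the unhealthy PSP-C ranks last.
```

With these illustrative weights a 1 p.p. fee difference outweighs a 4 p.p. auth-rate difference; in practice the weights would be tuned per GEO from A/B data.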
4.2 Payouts
1. Prioritization: speed (instant/near-instant) ↔ cost ↔ availability.
2. KYT/AML/RG: velocity, fraud patterns, limits, source of funds, exclusion lists.
3. Routing: card-to-card OCT/RTP/Faster Payments/SEPA Instant/Pix/UPI.
4. Failover: queue payouts when the bank/PSP is unavailable and drain the queues periodically.
5. Confirmation: signed webhooks; compensating transactions for discrepancies.
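For the signed webhooks in step 5, a minimal verification sketch in Python. The scheme shown here (HMAC-SHA256 over the raw request body, hex-encoded) is an assumption; each PSP defines its own signature scheme and header names, so the function names `sign` and `verify_webhook` are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme, varies by PSP)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


secret = b"demo-secret"
body = b'{"payout_id":"p1","status":"paid"}'
signature = sign(secret, body)

ok = verify_webhook(secret, body, signature)
tampered = verify_webhook(secret, body.replace(b"paid", b"failed"), signature)
```

The comparison uses `hmac.compare_digest` rather than `==` to avoid leaking timing information about the expected signature.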
5) Failover patterns
5.1 Circuit-breaker
Local (per PSP): triggered by error_rate↑, latency↑, or a spike in declines (issuer-specific).
Global (per method): for industry-wide failures (e.g. ACS/3DS at a large bank).
States: Closed → Open → Half-open; timeouts and thresholds are defined by GEO/BIN segments.
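The Closed → Open → Half-open cycle can be sketched in Python as below. This is a simplified illustration: it trips on consecutive failures only, whereas the text also lists error rate and latency as triggers. The class name, thresholds, and the injectable `clock` are assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probing = False

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.open_seconds:
            return "half-open"
        return "open"

    def allow(self) -> bool:
        """Closed: allow. Open: block. Half-open: allow a single probe."""
        current = self.state
        if current == "closed":
            return True
        if current == "half-open" and not self._probing:
            self._probing = True
            return True
        return False

    def record(self, success: bool) -> None:
        if success:  # a successful call (or probe) closes the breaker
            self._failures = 0
            self._opened_at = None
            self._probing = False
            return
        self._failures += 1
        if self._probing or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe
            self._probing = False


now = [0.0]  # fake clock so the example runs instantly
cb = CircuitBreaker(failure_threshold=3, open_seconds=30, clock=lambda: now[0])
for _ in range(3):
    cb.record(success=False)
state_after_failures = cb.state   # "open"
now[0] = 31.0
state_after_timeout = cb.state    # "half-open"
probe_allowed = cb.allow()        # first probe passes
second_allowed = cb.allow()       # further traffic is still blocked
cb.record(success=True)
state_after_probe = cb.state      # "closed"
```

Per-segment thresholds, as the States line suggests, could be modeled by keeping one breaker instance per (PSP, GEO/BIN segment) key.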
5.2 Active-Active vs Active-Passive
Active-Active: PSPs/methods run in parallel; traffic is balanced by rules/cost; best RTO/RPO.
Active-Passive: savings on fees/support but longer RTOs; suitable for secondary GEO/methods.
5.3 Degradation modes
Disabling high-risk methods, transferring part of the traffic to open banking/APM.
Forced 3DS challenge-all for "burning" BINs/banks.
Temporary limit on amounts/frequency (RG + risk).
6) Working with 3DS/SCA dynamically
Frictionless by default for low risk/small amounts; challenge for high risk.
PSD2 exemptions: LVA, MOTO, MIT - applied in the orchestrator, not in the application.
Fallback: if ACS degrades, raise the challenge rate or temporarily shift traffic to an alternative method (open banking).
KPI: challenge rate, frictionless share, post-3DS approvals.
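A rough Python sketch of a dynamic 3DS policy and two of the KPIs named above. All thresholds (`low_risk`, `high_risk`, `small_amount`) and the treatment of medium-risk, larger-amount payments as challenges are assumptions for illustration, not values from the text.

```python
from collections import Counter


def select_3ds_mode(risk_score: float, amount: float, *,
                    low_risk: float = 0.3, high_risk: float = 0.7,
                    small_amount: float = 30.0,
                    acs_degraded: bool = False) -> str:
    """Return 'alternative_method', 'challenge' or 'frictionless'."""
    if acs_degraded:
        return "alternative_method"   # shift traffic away from a failing ACS
    if risk_score >= high_risk:
        return "challenge"            # step-up for high risk
    if risk_score < low_risk or amount <= small_amount:
        return "frictionless"         # low risk or small amount
    return "challenge"                # medium risk, larger amount (assumption)


payments = [(0.1, 200.0), (0.5, 20.0), (0.5, 200.0), (0.9, 10.0)]
decisions = [select_3ds_mode(risk, amount) for risk, amount in payments]
counts = Counter(decisions)

challenge_rate = counts["challenge"] / len(decisions)
frictionless_share = counts["frictionless"] / len(decisions)
degraded = select_3ds_mode(0.1, 10.0, acs_degraded=True)
```

Keeping this decision in one function inside the orchestrator makes it possible to change thresholds during an incident without touching the checkout application.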
7) Integration with anti-fraud/KYT/RG
Before routing - scoring (device, behavior, proxy/VPN, BIN risk, history).
During routing - 3DS/channel/PSP selection by risk_score.
Before payout - KYT/velocity/anti-arb (fast win→withdraw, multiple cards, related devices).
RG limits and self-exclusion are "hard" stop rules at the orchestrator level.
8) Observability and data
Real-time metrics: auth_rate, decline_reason mix, p95 latency, PSP health, 3DS success, payout time, queue depth.
Alerts: thresholds by bank/method, correlated with external status pages.
A/B & Learning: update routing weights based on conversion/cost; keep control groups without retries for calibration.
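Two of the real-time metrics, auth_rate and p95 latency per PSP, could be derived from Route Journal entries roughly as follows. The entry layout (`psp`, `approved`, `latency_ms`) and the nearest-rank percentile method are assumptions for this Python sketch.

```python
from collections import defaultdict


def p95(values: list[int]) -> int:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def psp_metrics(journal: list[dict]) -> dict[str, dict]:
    """Group journal entries by PSP and compute auth_rate and p95 latency."""
    by_psp = defaultdict(list)
    for entry in journal:
        by_psp[entry["psp"]].append(entry)
    result = {}
    for psp, rows in by_psp.items():
        approved = sum(1 for row in rows if row["approved"])
        result[psp] = {
            "auth_rate": approved / len(rows),
            "p95_latency_ms": p95([row["latency_ms"] for row in rows]),
        }
    return result


# Synthetic journal: PSP-A has 20 calls, every fifth one declined.
journal = [
    {"psp": "PSP-A", "approved": i % 5 != 0, "latency_ms": 100 * (i + 1)}
    for i in range(20)
] + [
    {"psp": "PSP-B", "approved": True, "latency_ms": 250},
    {"psp": "PSP-B", "approved": False, "latency_ms": 900},
]
metrics = psp_metrics(journal)
```

In a live system these aggregates would be computed over sliding time windows and segmented by bank/method so that the alert thresholds can be applied per segment.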
9) KPIs and targets
Auth Rate (cards): EU 85-92%, US 80-88%, LATAM 70-85% (without orchestration, expect the lower bound).
p95 latency of the auth API: < 3 s; webhooks: < 60 s.
Share of instant payouts: ≥ 70% of "light" (small-amount) payouts.
Routing Efficiency (conversion ÷ cost): +5-10% over baseline in the quarter after tuning.
Circuit-breaker RTO: < 2 min to switch; RPO: 0 (thanks to idempotency).
Chargeback rate: < 0.5% by count (depends on product/GEO).
10) Incident playbooks (cheat sheet)
10.1 Mass declines by issuing bank
1. Confirm spike by BIN/issuer.
2. Open local circuit-breaker → redistribute to alt-PSP/method.
3. Increase the challenge rate for affected BINs, enable smart retry.
4. Communication to status channels; RCA with reason codes.
10.2 3DS/ACS outage
1. Detect it by growth in timeouts and "soft declines".
2. Transfer part of the traffic to open banking/APM; enable "challenge-all" where ACS is alive.
3. Reduce risk exposure (limits on amounts/velocity); strengthen monitoring.
10.3 PSP instability
1. Health alert fires → breaker opens.
2. Switch to the standby PSP/MID; disable "heavy" methods with high latency.
3. Recover via half-open with canaries (1-5% of traffic), then ramp up gradually.
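The canary step in the recovery above can be sketched with deterministic hash bucketing, so the same payment always lands on the same side of the split. The function names and the choice of the idempotency key as the hashing input are assumptions in this Python sketch.

```python
import hashlib


def canary_bucket(key: str) -> int:
    """Map a key to a stable bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route_to_recovering_psp(key: str, canary_percent: int) -> bool:
    """True if this payment belongs to the canary share of traffic."""
    return canary_bucket(key) < canary_percent


keys = [f"pi-{i}" for i in range(10_000)]
share_at_5 = sum(route_to_recovering_psp(k, 5) for k in keys) / len(keys)

# Ramping from 5% to 25% only adds payments; none leave the canary group.
stays_in = all(
    route_to_recovering_psp(k, 25)
    for k in keys if route_to_recovering_psp(k, 5)
)
```

Because the bucket depends only on the key, a retry of the same Payment Intent is routed consistently while the percentage is being ramped.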
10.4 Payout delays
1. Switch to queued payouts with priorities (VIP, amount limit).
2. Move part of the volume to alternative rails (RTP/FPS/SEPA Instant/Pix).
3. Transparent notifications to players; manual escalation after > X hours.
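Queued payouts with priorities might look like the following Python sketch. The ordering rule (VIP first, then amounts within the limit, then FIFO) is one possible reading of "VIP, amount limit"; the `PayoutQueue` class and the `rail_available` callback are hypothetical.

```python
import heapq
import itertools
from typing import Callable


class PayoutQueue:
    def __init__(self, amount_limit: float) -> None:
        self.amount_limit = amount_limit
        self._heap: list[tuple] = []
        self._seq = itertools.count()  # FIFO tie-breaker within a priority

    def enqueue(self, payout_id: str, amount: float, vip: bool = False) -> None:
        # Lower tuples are served first: VIP, then amounts within the limit.
        priority = (0 if vip else 1, 0 if amount <= self.amount_limit else 1)
        heapq.heappush(self._heap, (priority, next(self._seq), payout_id))

    def drain(self, rail_available: Callable[[], bool], batch_size: int) -> list[str]:
        """Pop up to batch_size payouts while the rail reports healthy."""
        sent = []
        while self._heap and len(sent) < batch_size and rail_available():
            _, _, payout_id = heapq.heappop(self._heap)
            sent.append(payout_id)
        return sent

    def __len__(self) -> int:
        return len(self._heap)


queue = PayoutQueue(amount_limit=500)
queue.enqueue("p1", 100)
queue.enqueue("p2", 2000)
queue.enqueue("p3", 5000, vip=True)
queue.enqueue("p4", 50)

while_down = queue.drain(lambda: False, batch_size=10)  # rail unavailable
first_batch = queue.drain(lambda: True, batch_size=3)   # rail recovered
remaining = len(queue)
```

A periodic job would call `drain` with a health check backed by the circuit breaker for the payout rail.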
11) SLAs and contract anchors (what to require from PSP)
Availability: ≥ 99.95% for payment acceptance; p95 latency < 3 s; webhooks < 60 s.
Incidents: TTA ≤ 15 min, fallback MID/APM, RCA ≤ 5 days.
Data: raw reason codes, bank details, token return within ≤ 10 days upon exit.
Finance: limited reserves/holdback, transparent fees (incl. 3DS/network tokens), cap on FX premiums.
Security: PCI AOC, webhook signatures, key rotation, SOC 2/ISO 27001 (preferred).
12) Regional patterns
EU/UK: PSD2/SCA; cards + open banking (SEPA Instant/FPS). Strong 3DS orchestration, TRA and whitelisting.
USA: cards + ACH; priority on instant payments (push-to-card, RTP). Chargeback handling processes are mandatory.
LATAM: Pix (BR), SPEI (MX), PSE (CO); APM-heavy; focus on device risk and KYC documents.
Turkey/Central Asia: local transfers/wallets; enhanced sanctions/AML controls, limits on amounts/velocity.
Asia/India: UPI/e-wallets; strict velocity rules; routing by issuing bank.
13) Implementation checklists
Architecture/Data
- Payment Intent + idempotency on all hops.
- Route Journal, raw reason codes, signed webhooks.
- Separation of money/game ledgers; offsetting transactions.
Routing/Rules
- Rule-engine by GEO/BIN/issuer/risk/cost.
- Smart retry with backoff and 3DS/MID change.
- Circuit-breakers, local and global; canary recovery.
Risk/compliance
- Risk/KYT/RG integration before and after routing.
- Sanctions/PEP, age/self-exclusion - as "hard" filters.
- Velocity/amount limits; decision log.
Observability/SLA
- Dashboards by Auth Rate, latency, decline mix, payout time.
- Alerts by thresholds; runbooks for incidents.
- SLAs in the contract, QBRs and penalties for violations.
14) Strategy pseudocode (for the team)
on PaymentRequest(PI):
    ensureIdempotency(PI.idempotency_key)
    risk = RiskEngine.score(PI)
    if not ComplianceHub.pass(PI, risk): return reject(PI)
    candidates = RouteCatalog.filter(PI.geo, PI.method, PI.bin, risk)
    for route in rankBy(Score(AuthRate, Fee, Health, Risk), candidates):
        res = PSP.call(route, PI, policy=ThreeDS.select(risk))
        log(RouteJournal, route, res)
        if res.approved: return approve(PI)
        if not isRetryable(res.reason): break    # hard decline: stop trying routes
        applySmartRetryAdjustments(PI, res)      # e.g. change 3DS mode/MID, back off
    return decline(PI)
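The isRetryable check in the pseudocode above could be backed by a reason-code table like the one in this Python sketch. The codes are the ISO 8583 response codes mentioned in the Smart Retry step (05 do not honor, 51 insufficient funds, 62 restricted card, 91 issuer unavailable, 96 system malfunction); the actions and backoff values assigned to them are assumptions, and each PSP's raw codes would need their own mapping.

```python
from typing import NamedTuple


class RetryPlan(NamedTuple):
    retryable: bool
    action: str
    backoff_seconds: float


TECHNICAL_ERRORS = {"91", "96"}       # issuer/switch unavailable, malfunction
ISSUER_DECLINES = {"05", "51", "62"}  # decisions made by the issuer


def retry_plan(reason_code: str) -> RetryPlan:
    """Illustrative mapping from reason code to a smart-retry decision."""
    if reason_code in TECHNICAL_ERRORS:
        # Likely infrastructure trouble: another route may succeed shortly.
        return RetryPlan(True, "switch_psp", 1.0)
    if reason_code == "05":
        # Generic decline: a different 3DS mode or MID sometimes helps.
        return RetryPlan(True, "change_3ds_or_mid", 0.0)
    if reason_code in ISSUER_DECLINES:
        # 51/62: retrying the same card right away is unlikely to help.
        return RetryPlan(False, "offer_alternative_method", 0.0)
    return RetryPlan(False, "decline", 0.0)


def is_retryable(reason_code: str) -> bool:
    return retry_plan(reason_code).retryable


plans = {code: retry_plan(code) for code in ["05", "51", "62", "91", "96", "14"]}
```

Unknown codes default to a non-retryable decline, which keeps the loop from cycling through every route on errors it does not understand.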
15) Economics and A/B optimization
Calculate effective rate = (Fees + 3DS + FX + chargeback cost − interchange rebates) / Approved Volume.
A/B: minimum 10k transactions per branch, 2-4 weeks; segment results by bank/method.
Optimize AuthRate vs Fee weights by GEO/seasonality; watch for "skew" toward expensive but high-converting rails.
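A worked example of the effective rate formula in Python. All monetary figures are made up for illustration; `Decimal` is used so the arithmetic stays exact.

```python
from decimal import Decimal


def effective_rate(fees: Decimal, three_ds: Decimal, fx: Decimal,
                   chargeback_cost: Decimal, interchange_rebates: Decimal,
                   approved_volume: Decimal) -> Decimal:
    """(Fees + 3DS + FX + chargeback cost - rebates) / Approved Volume."""
    if approved_volume <= 0:
        raise ValueError("approved_volume must be positive")
    total_cost = fees + three_ds + fx + chargeback_cost - interchange_rebates
    return total_cost / approved_volume


rate = effective_rate(
    fees=Decimal("18000"),
    three_ds=Decimal("1000"),
    fx=Decimal("2500"),
    chargeback_cost=Decimal("3000"),
    interchange_rebates=Decimal("1500"),
    approved_volume=Decimal("1000000"),
)
rate_bps = rate * 10_000  # 23,000 / 1,000,000 = 0.023, i.e. 230 bps
```

Computing this per route and per GEO gives the Fee input for the routing score, so the weights can be compared against real cost rather than list prices.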
16) What is important to remember
Circuit-breakers and canary recovery provide quick stabilization without oscillation ("swings").
Orchestrator + rules + data are the heart of payment stability and conversion.
Payment Intent eliminates double charges and simplifies failover.
PSP contractual SLAs and transparent data are not optional; they are a requirement.
Regional rails (open banking, RTP, Pix/UPI) often beat cards on speed/cost - take this into account in routing.