Retries and backoff in payments
1) Why retries are needed
Conversion: soft failures (timeouts, 3DS errors, network failures) are often recovered on a repeat attempt: +2-7 pp to Auth Rate.
Robustness: local PSP/ACS/bank failures are smoothed out by retries over alternative routes.
Player experience: correctly built retries hide infrastructure "noise" without double charges.
2) Basic principles
1. Idempotency at the "payment intent" (PI) level: one operation = one `idempotency_key`; a repeated call never changes the monetary state.
2. Error separation:
- Hard decline (e.g. `Do not honor` with a strict issuer policy, `Insufficient funds`) → usually not retried right away.
- Soft decline/technical error (timeout, `Issuer unavailable`, `Try again`) → retry allowed.
3. Backoff + attempt limit: increase the delay exponentially, add jitter, and do not exceed the limit (usually 2-3 attempts).
4. Tie-in with routing: a retry is not only "the same PSP again" but also a change of PSP/MID/3DS mode/method.
5. Observability: each hop is recorded in the Route Journal (PSP, reason, latency, 3DS mode, fee, result).
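The error separation in principle 2 can be expressed as a small lookup table. The sketch below is only an illustration: the reason-code strings are hypothetical (each PSP has its own codes), and treating unknown codes as non-retryable is an assumption chosen for safety.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"        # soft decline / technical error
    NO_RETRY = "no_retry"  # hard decline


# Hypothetical reason codes for illustration; map your PSP's real codes here.
REASON_MATRIX = {
    "timeout": Decision.RETRY,
    "issuer_unavailable": Decision.RETRY,
    "try_again": Decision.RETRY,
    "do_not_honor": Decision.NO_RETRY,
    "insufficient_funds": Decision.NO_RETRY,
}


def classify(reason_code: str) -> Decision:
    """Return the retry decision; unknown codes are conservatively not retried."""
    return REASON_MATRIX.get(reason_code.lower(), Decision.NO_RETRY)
```

Keeping the matrix as data rather than as branching code makes it easy to review and to change per PSP without touching the retry loop.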
3) Error classification for the retry decision
4) Backoff strategies (practice)
4.1 Exponential backoff with jitter (recommended)
Base: `delay_n = min(base * 2^n, max_delay)`
Jitter: `delay = rand(0, delay_n)`; this reduces stampedes when many requests are retried simultaneously.
Typical parameters: `base = 200-500 ms`, `max_delay = 5-10 s`, `n ≤ 2-3`.
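The two formulas above combine into a single helper. This is a minimal sketch of the "full jitter" variant (delay drawn uniformly from zero up to the capped exponential value); the function name and keyword arguments follow the call used in the pseudocode of section 10, and all values are in seconds.

```python
import random


def backoff_with_jitter(base: float, attempt: int, cap: float) -> float:
    """Full-jitter delay in seconds: uniform in [0, min(base * 2**attempt, cap)]."""
    ceiling = min(base * (2 ** attempt), cap)
    return random.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Upper bounds grow 0.3 s, 0.6 s, 1.2 s; the actual delays are random below them.
    for n in range(3):
        print(n, round(backoff_with_jitter(base=0.3, attempt=n, cap=8.0), 3))
```

Because every client draws its own random delay, retries that failed at the same moment do not hit the PSP again at the same moment.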
4.2 Linear backoff
Simple, but it copes worse with network "storms". Inferior to exponential backoff with jitter.
4.3 Timeout policy
Client timeout (yours) ≤ PSP SLA (for example, 3-5 s); otherwise the risk of duplicates/hung requests increases.
Set the waiting time for the webhook/confirmation separately: if the confirmation does not arrive → compensating reconciliation (ledger/PSP).
5) Idempotency and protection against duplicates
The Payment Intent (PI) stores the status, amount, method, `idempotency_key`, and route history.
Every hop and every retry uses the same key.
Compensating transactions: on desynchronization (approved at the PSP, but a timeout on your side) → "reconcile-pull" + ledger adjustment.
Rule out repeated authorization when a webhook is redelivered: check `transaction_id`/`PSP reference` for uniqueness.
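The webhook uniqueness check can be sketched as follows. This is an assumption-laden illustration: the `PaymentIntent` fields are a reduced, hypothetical version of the record described above, and the in-memory set stands in for what would normally be a unique constraint in a database.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentIntent:
    idempotency_key: str
    amount_minor: int  # amount in minor units, e.g. cents
    status: str = "pending"
    route_history: list = field(default_factory=list)


class WebhookDeduplicator:
    """Remembers PSP references already applied (in-memory, for illustration only)."""

    def __init__(self):
        self._seen = set()

    def apply(self, pi: PaymentIntent, psp_reference: str, new_status: str) -> bool:
        """Apply a webhook once; return False when it is a redelivery."""
        if psp_reference in self._seen:
            return False  # duplicate delivery: the ledger must not be touched again
        self._seen.add(psp_reference)
        pi.status = new_status
        return True
```

A redelivered webhook is acknowledged but ignored, so the monetary state changes at most once per PSP reference.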
6) 3DS/SCA and repetitions
Soft decline after frictionless → retry with challenge.
ACS timeout/unavailable → exponential backoff, then an alternative channel (open banking/APM) or another PSP.
On mass ACS degradation: circuit breaker, a higher `challenge rate`, temporary limits on amounts.
7) Retries for APM/open banking
Open banking/instant rails (SEPA Instant/FPS/Pix/UPI):
- Retries are limited: check idempotency on the provider side and statuses in delayed webhooks.
- With an indeterminate status: polling with backoff and strict reconciliation.
Vouchers/cash: retries in the "online transaction" sense do not apply, but due-date control and "status refresh" do.
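For the indeterminate-status case on instant rails, polling with backoff might look like the sketch below. The status strings (`succeeded`, `failed`, `unknown`) and the `fetch_status` callable are hypothetical placeholders for a provider's status API, not a real interface.

```python
import random
import time


def poll_status(fetch_status, max_polls=5, base=0.5, cap=8.0, sleep=time.sleep):
    """Poll until a final status is seen; return "unknown" when polls run out."""
    final = {"succeeded", "failed"}  # hypothetical final statuses
    for n in range(max_polls):
        status = fetch_status()
        if status in final:
            return status
        # Exponential backoff with full jitter between polls.
        sleep(random.uniform(0.0, min(base * 2 ** n, cap)))
    return "unknown"  # hand the payment over to reconciliation
```

The `sleep` parameter is injectable so the loop can be tested without waiting; an `unknown` result is the signal to fall back to reconciliation rather than to charge again.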
8) Payouts: retries and queues
Bank/PSP technical failure → payouts are queued and the queue is drained with backoff.
KYT/velocity fail → no retry; send to manual review.
Queue prioritization: VIP/small amounts/request age; SLAs and auto-escalation deadlines.
Alternative rails (RTP/FPS/SEPA Instant/Pix) as the second retry step.
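The queue prioritization can be sketched with a heap. The ordering used here (VIP first, then smaller amounts, then older requests) is one possible reading of the criteria above and is an assumption; the class and field names are hypothetical.

```python
import heapq


def priority_key(is_vip: bool, amount_minor: int, created_at: float) -> tuple:
    """Lower tuples drain first: VIP, then smaller amounts, then older requests."""
    return (0 if is_vip else 1, amount_minor, created_at)


class PayoutQueue:
    """Minimal in-memory priority queue of payout ids."""

    def __init__(self):
        self._heap = []

    def push(self, payout_id: str, is_vip: bool, amount_minor: int, created_at: float):
        key = priority_key(is_vip, amount_minor, created_at)
        heapq.heappush(self._heap, (key, payout_id))

    def pop(self) -> str:
        """Return the id of the payout that should be retried next."""
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)
```

A worker would pop one payout at a time and wait for a backoff delay between failed attempts, so a recovering bank or PSP is not flooded.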
9) Circuit breaker and retries
Local (per PSP/MID/BIN): on an error spike → stop retries on this route and switch to an alternative one.
Global (per method/region): on systemic degradation → disable the method and offer APM/open banking.
Half-open: return a share of traffic (1-5%) to check recovery before a full return.
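A local breaker with a half-open canary can be sketched as below. The thresholds (5 consecutive failures, a 600 s cooldown, a 5% canary share) are illustrative assumptions, and counting consecutive failures is a simplification of "error spike" detection.

```python
import random
import time


class CircuitBreaker:
    """Per-route breaker: closed -> open on failures -> half-open canary -> closed."""

    def __init__(self, failure_threshold=5, cooldown_s=600.0, canary_share=0.05,
                 clock=time.monotonic, rng=random.random):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.canary_share = canary_share
        self._clock = clock
        self._rng = rng
        self._failures = 0       # consecutive failures
        self._opened_at = None   # None means the breaker is closed

    def allow(self) -> bool:
        """Decide whether the next request may use this route."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at < self.cooldown_s:
            return False  # open: skip the route entirely
        return self._rng() < self.canary_share  # half-open: small traffic share

    def record(self, success: bool) -> None:
        """Feed back the outcome of a call made on this route."""
        if success:
            self._failures = 0
            self._opened_at = None  # a success closes the breaker
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed canary
```

The retry loop would call `allow()` before each attempt on a route and `record()` after it; a failed canary restarts the cooldown.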
10) Pseudocode of the retry strategy
```python
from time import sleep

# Helpers (ensure_idempotency, compliance_pass, rank_candidates, select_3ds,
# call_psp, log_attempt, backoff_with_jitter, ...) and the constants
# MAX_ATTEMPTS, APPROVED, DECLINED, REJECT are assumed to be defined elsewhere.


def pay_with_retries(pi):
    ensure_idempotency(pi.key)
    if not compliance_pass(pi):
        return REJECT

    routes = rank_candidates(pi)  # ranked by approval probability, fee, health
    attempts = 0  # retry budget shared across all routes
    for route in routes:
        policy3ds = select_3ds(pi, route)
        res = call_psp(route, pi, policy3ds, pi.key, timeout=3.0)
        log_attempt(pi, route, res)
        if res.approved:
            return APPROVED
        if is_soft_decline(res) or is_transient_error(res):
            while attempts < MAX_ATTEMPTS and not breaker_open(route):
                delay = backoff_with_jitter(base=0.3, attempt=attempts, cap=8.0)
                sleep(delay)
                policy3ds = maybe_toggle_3ds(policy3ds, res)
                res = call_psp(route, pi, policy3ds, pi.key, timeout=3.0)
                log_attempt(pi, route, res)
                attempts += 1
                if res.approved:
                    return APPROVED
                if is_hard_decline(res):
                    break
        # move on to the next route (PSP-B/APM/open banking)
    return DECLINED
```
11) KPIs and targets
Incremental Approvals from Retries: +2-7 pp over base conversion.
Avg Retry Attempts per Approved Tx: 1.2–1.5 (keep below 1.7).
Retry Success Rate (soft/tech): ≥ 25–40%.
Duplicate Rate: 0 with correct idempotency.
P95 Latency (including retries): < 7 s until the final response.
Payout SLA (instant share): ≥ 70% for payouts with light checks; overdue share < target threshold.
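Two of these KPIs can be computed directly from per-transaction attempt counts. The sketch below assumes a hypothetical record shape (`attempts`, `approved`) and defines Retry Success Rate as the share of retried transactions that ended approved; the exact definition in your reporting may differ.

```python
def retry_kpis(transactions):
    """transactions: list of dicts with 'attempts' (int >= 1) and 'approved' (bool)."""
    approved = [t for t in transactions if t["approved"]]
    retried = [t for t in transactions if t["attempts"] > 1]
    retried_ok = [t for t in retried if t["approved"]]
    return {
        # Average number of attempts spent on transactions that were approved.
        "avg_attempts_per_approved": (
            sum(t["attempts"] for t in approved) / len(approved) if approved else 0.0
        ),
        # Share of retried transactions that ended approved.
        "retry_success_rate": len(retried_ok) / len(retried) if retried else 0.0,
    }
```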
12) Incident playbooks
A. Mass timeouts on PSP-A
1. Open local breaker for PSP-A.
2. Reallocate retrays to PSP-B/APM.
3. Exponential backoff with jitter, limit 2-3 attempts.
4. Canary half-open after 10-15 minutes.
B. Degradation of ACS/3DS
1. Detect by growth in soft declines and timeouts.
2. Increase the challenge rate; send part of the traffic → open banking.
3. Defer heavy checks; turn on velocity limits.
C. Payout delays
1. Move payouts to the queue; prioritize VIP/small amounts.
2. Reroute to alternative rails (RTP/FPS/SEPA Instant/Pix).
3. Communication to players + auto-escalation.
13) Observability and data
Route Journal: PSP/MID, BIN/issuer, reason, latency, 3DS mode, retry chain, result, fee.
Dashboards: Auth Rate (by bank), Retry Success, Avg Attempts, Decline Mix, p95 latency, Payout Queue Depth.
Alerts: spikes by reason code, growth in attempts/latency, payout queue overflow.
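A Route Journal entry can be modeled as one structured record per hop. The field names below are hypothetical and simply mirror the attributes listed above; serializing to JSON lines is an assumption about the log format, not a requirement.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RouteJournalEntry:
    payment_intent_id: str
    hop: int             # position in the retry chain, starting at 1
    psp: str
    mid: str
    card_bin: str
    reason: str          # reason code returned by the PSP/issuer
    latency_ms: int
    three_ds_mode: str   # e.g. "frictionless" or "challenge"
    result: str
    fee_minor: int       # fee in minor units


def to_log_line(entry: RouteJournalEntry) -> str:
    """Serialize one hop as a JSON line suitable for dashboards and alerts."""
    return json.dumps(asdict(entry), sort_keys=True)
```

With one line per hop, the retry chain of a payment is just all lines sharing the same `payment_intent_id`, ordered by `hop`.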
14) Implementation checklists
Architecture/Data
- Payment Intent + `idempotency_key` on all hops.
- Reason code config matrix: retryable vs non-retryable.
- Signed webhooks, deduplication by PSP reference.
Backoff/rules
- Exponential backoff with jitter; limits on the number of attempts and on the retry time window.
- Smart retry: 3DS/MID/PSP/method change; separate rules for cards vs APM/open banking.
- Circuit breakers (local/global), half-open canaries.
Ledger/reconciliations
- Compensating transactions for "hanging" (unresolved) statuses.
- T+0/T+1 reconciliation: PSP ↔ bank ↔ money ledger.
- Timeout and SLA policy on confirm/webhook.
Operations/Compliance
- RG/sanctions/PEP/age checks run before retries.
- KYT/velocity on payouts; manual review rules.
- Runbooks and RACI for incidents/escalations.
15) Economics and risk
Calculate the effective rate taking into account 3DS fees, FX, chargeback cost, and retry overhead.
Strictly limit retries for high-risk segments so as not to inflate chargeback exposure and reserves.
16) The bottom line
Retries work when they are controllable: idempotency, a clear matrix of reason codes, exponential backoff with jitter, an attempt limit, and a tie-in with routing (PSP/3DS/method change). Add circuit breakers, payout queues, and strong reconciliation, and you raise conversion consistently without creating duplicates or cash gaps.