Payments incident playbook
TL;DR
A payment incident is a controlled operation: classify quickly → stabilize UX (failover/degradation) → protect money (idempotency/block rules) → communicate transparently → restore service → complete the RCA. Main SLOs: MTTA, MTTR, TtW/TtR, AR, webhook p95, zero tolerance for double charges/refunds.
1) Severity & Impact Matrix
Triggers: SLA/Treasury/reconciliation alerts, support-ticket spikes, AR/latency/webhook monitoring.
2) Roles and communication channel
Incident Commander (IC) - owns the timeline and decisions.
Payments Tech Lead - routing, idempotency, feature flags.
Treasury Lead - liquidity, prefunding, stress reserves.
Risk/AML - sanctions, block rules, SoF/SoW.
Comms Manager - templates for support/partners, status updates.
Recon/Finance - reconciliation, reversal/journals, loss estimates.
War room: #payments-incident-warroom (chat), Zoom bridge + live timeline document (UTC).
3) Universal loop (for any incident)
1. Detect & Triage → confirm metrics/coverage, assign Sev.
2. Stabilize UX → routing failover, feature degradation, freezing of dangerous auto-actions.
3. Money Safety → enable idempotency/blocks (refund/payout), preserve logs.
4. Communicate → internal update (15/30/60 min), external messages (status/ETA/workarounds).
5. Recover → incremental rollback/open, verify SLO.
6. Reconcile → compare ledger/PSP/bank, calculate financial impact.
7. RCA (≤5 business days) → root cause, actions, preventive measures, tasks.
4) Typical scenarios and runbooks
4.1 Auth Drop/Latency Spike (Cards/A2A)
Symptoms: AR↓, soft declines↑, auth p95 > 1-2 s.
Actions:
- Smart routing: PSP_A→PSP_B; raise the 3DS challenge rate on vulnerable BINs.
- Limit retries (backoff + jitter); protect idempotency via `auth_key`.
- Segment toggle: move high-risk traffic to the "strict" scenario; lower high-ticket limits.
- Communications: "degradation note"; recommend an alternative method.
- Recovery: phased return of traffic share, monitoring AR by BIN × GEO.
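The retry rule in 4.1 (backoff + jitter with a protected auth key) could be sketched roughly as follows. This is a minimal illustration, not a definitive client: `send`, the result strings, and the default `base`/`cap` values are hypothetical, and the "full jitter" variant is one of several reasonable choices.

```python
import random
import time
import uuid


def backoff_delays(max_retries, base=0.2, cap=5.0, rng=random.random):
    """Yield 'full jitter' delays: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def authorize_with_retries(send, payload, max_retries=3, sleep=time.sleep):
    """Retry soft failures while reusing one auth_key for every attempt.

    `send(payload, auth_key)` is a hypothetical PSP client call returning
    "approved", "declined" or "soft_decline"; it may raise TimeoutError.
    Returns (final_result, number_of_attempts).
    """
    auth_key = str(uuid.uuid4())  # generated once, so the PSP can deduplicate
    attempts = 0
    delays = backoff_delays(max_retries)
    while True:
        attempts += 1
        try:
            result = send(payload, auth_key)
        except TimeoutError:
            result = "soft_decline"  # assumption: a timeout is retryable
        if result != "soft_decline":
            return result, attempts
        delay = next(delays, None)
        if delay is None:  # retry budget exhausted
            return result, attempts
        sleep(delay)
```

Capping the number of retries keeps a degraded PSP from being flooded, while the shared key means a retry after a timeout cannot create a second authorization if the first one actually succeeded.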
4.2 Webhooks Delay / Duplicate
Symptoms: p95 > 3-5 s, gaps in capture/refund/payout, duplicates.
Actions:
- Switch to polling; extend the idempotency-key TTL.
- Freeze auto-refunds and risky auto-payouts.
- Anti-double: store-once by `idempotency_key`/`provider_txid`.
- Run catch-up processing; reconcile against PSP reports.
- Recovery: re-enable webhooks, verify consistency against reports.
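The "store-once" rule from 4.2 can be illustrated with a uniqueness constraint, so that a redelivered webhook or a polling result for the same event is applied only once. The sketch below uses SQLite from the Python standard library purely for illustration; the table name, the composite key of both identifiers, and the `apply` callback are assumptions rather than part of the playbook.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) table that remembers processed events."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS processed_events ("
        " provider_txid TEXT NOT NULL,"
        " idempotency_key TEXT NOT NULL,"
        " PRIMARY KEY (provider_txid, idempotency_key))"
    )
    return db


def handle_once(db, provider_txid, idempotency_key, apply):
    """Run `apply()` only the first time this event is seen.

    Returns True if the event was applied, False if it was a duplicate.
    """
    with db:  # commits on success, rolls back the marker if apply() raises
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events VALUES (?, ?)",
            (provider_txid, idempotency_key),
        )
        if cur.rowcount == 0:  # row already existed: duplicate delivery
            return False
        apply()
        return True
```

The marker and the side effect are only truly atomic when `apply()` writes to the same database inside the same transaction; with an external side effect, a crash between the two still needs reconciliation.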
4.3 Payout Fail / TtW Degradation
Symptoms: Success%↓, TtW p95↑, returns/timeouts.
Actions:
- Failover to a standby rail (RTP/SEPA/another PSP).
- Treasury: prefund top-up of the payout pool, StressRes activation.
- Payout-lock for high-risk, VIP prioritization.
- Communications: ETA and alternatives, transparent statuses in the user account.
4.4 Refund Errors / Double Refund Risk
Symptoms: refund error rate↑, disputed/duplicate refunds.
Actions:
- Global refund-freeze on the automatic route; manual refunds only by authorized staff.
- Hard idempotency on `payment_id + amount + reason`; row-lock on the balance.
- Reconcile against the PSP report; reverse duplicates in the ledger, send cases to the DLQ.
- Communications: refund-timing templates: cards T+1 to T+5 business days, instant methods up to 60 s.
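As a rough illustration of the hard idempotency rule in 4.4, the sketch below derives a key from `payment_id + amount + reason` and checks the refundable remainder under a lock. It is an in-memory stand-in only: the class name, the result strings, and the use of `threading.Lock` in place of a database row-lock are assumptions.

```python
import hashlib
import threading
from decimal import Decimal

CENT = Decimal("0.01")


class RefundGuard:
    """In-memory stand-in for an idempotent, balance-checked refund table."""

    def __init__(self, captured):
        self._remaining = dict(captured)  # payment_id -> refundable Decimal
        self._seen = set()                # idempotency keys already accepted
        self._lock = threading.Lock()     # plays the role of the row-lock

    @staticmethod
    def key(payment_id, amount, reason):
        raw = f"{payment_id}|{amount}|{reason}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def remaining(self, payment_id):
        return self._remaining.get(payment_id, Decimal("0"))

    def refund(self, payment_id, amount, reason):
        # Normalize so "40.0" and "40.00" produce the same key.
        amount = Decimal(str(amount)).quantize(CENT)
        k = self.key(payment_id, amount, reason)
        with self._lock:
            if k in self._seen:
                return "duplicate"
            if amount <= 0 or amount > self.remaining(payment_id):
                return "rejected"  # would exceed the remaining balance
            self._remaining[payment_id] -= amount
            self._seen.add(k)
            return "accepted"
```

One consequence of this key design is that two legitimate partial refunds with the same amount and reason are treated as duplicates; a distinct reason or an explicit request ID would be needed to tell them apart.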
4.5 Settlement Delay / PSP Batch Mismatch
Symptoms: D+N settlement not credited, differences in amounts/fees.
Actions:
- Treasury: turn on StressRes, limit instant payouts.
- Recon: mark the batch "SUSPENSE", raise a PSP ticket, request a statement.
- FX/Fees: accept a temporary "truth" (per policy) or wait for the correction.
- Communications: Q&A for support (safety of funds, settlement timing).
4.6 Crypto On/Off-Ramp Degradation
Symptoms: TtH↑, slippage↑, insufficient venue liquidity.
Actions:
- SOR → alternative CEX/OTC; reduce lot size (TWAP).
- Convert incoming funds to stablecoin/fiat; limit depeg exposure.
- Kill-switch if oracle divergence > bps limit.
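The kill-switch condition in 4.6 comes down to a basis-point comparison between price sources. A minimal sketch, assuming two oracle prices and divergence measured relative to their midpoint (both the function names and the midpoint convention are assumptions; a production setup may compare against a median of several sources):

```python
def divergence_bps(price_a, price_b):
    """Absolute divergence between two prices, in basis points of their midpoint."""
    mid = (price_a + price_b) / 2
    return abs(price_a - price_b) * 10_000 / mid


def should_kill(price_a, price_b, limit_bps):
    """Return True when the divergence exceeds the configured bps limit."""
    return divergence_bps(price_a, price_b) > limit_bps
```

For example, prices of 100.5 and 99.5 diverge by about 100 bps, so a 50 bps limit would trip the switch, while 100.01 versus 100.0 (about 1 bps) would not.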
4.7 Voucher/Wallet Anomalies
Symptoms: invalid PIN spike, velocity, geo clustering.
Actions:
- Limits/cooldown, bind redemption to the device, payout-lock + turnover requirement.
- Request receipts/SoF; update block lists (email/device/ASN/retailer).
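The limits/cooldown action in 4.7 might be implemented as a sliding-window counter of invalid PIN attempts per device. The sketch below is illustrative only: the class name, the per-device key, and the thresholds passed in are hypothetical, and timestamps are supplied by the caller in seconds.

```python
from collections import defaultdict, deque


class VelocityLimiter:
    """Block a key (e.g. a device ID) after too many failures in a window."""

    def __init__(self, max_failures, window_s, cooldown_s):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._failures = defaultdict(deque)  # key -> timestamps of failures
        self._blocked_until = {}             # key -> end of cooldown

    def allow(self, key, now):
        """Return True if the key is not in cooldown at time `now`."""
        return now >= self._blocked_until.get(key, 0.0)

    def record_failure(self, key, now):
        """Register an invalid-PIN attempt; start a cooldown at the limit."""
        window = self._failures[key]
        window.append(now)
        # Drop failures that fell out of the sliding window.
        while window and window[0] <= now - self.window_s:
            window.popleft()
        if len(window) >= self.max_failures:
            self._blocked_until[key] = now + self.cooldown_s
            window.clear()
```

With `VelocityLimiter(3, 60, 300)`, three invalid PINs within 60 seconds block the device for five minutes, while the same three failures spread over several minutes do not.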
5) Action checklists
5.1 First five minutes (P0/P1)
- Assign IC, open war-room.
- Record Sev, coverage, start of timeline (UTC).
- Enable safety feature flags (idempotency, freeze of the relevant auto-processes).
- Start failover/feature degradation.
- First internal update (context, measures, next ETA).
5.2 Before closing the incident
- SLO restored (AR/latency/webhooks/TtW/TtR).
- Reconciliation complete (internal↔PSP↔bank), no unexplained gaps.
- Financial impact assessed, reversals/journal entries posted.
- External update/status channel post published.
- RCA owner and prevention tasks assigned.
6) Monitoring, alerts and dashboards
Key alerts:
- `AR_gross↓ > 3 pp (vs. 7-day median)` → P1/P0 depending on coverage.
- `Auth p95>1.5 s / Webhook p95>5 s / Capture Success<98%` → P1.
- `Payout TtW p95> SLO` or `Success%<99%` → P1.
- `Refund Error>0.3%` or `Double Refund>0` → P0.
- `Settlement on-time<99%` / `Report Delivery SLA breach` → P1.
Dashboards:
1. Funnel Attempt→Auth→Capture (compared to baseline).
2. AR heatmap by BIN×GEO×PSP.
3. Webhook p50/p95, duplicates, bounces.
4. Payout/Refund Health (Success%, TtW/TtR).
5. Treasury: L0 balance, prefund, StressRes.
6. Recon: Mismatch Rate, DLQ aging.
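Some of the key alert thresholds in this section can be expressed as a small rule function. The sketch below covers only the AR, auth/webhook/capture, and refund rules; the metric names in the dictionary are hypothetical, AR is assumed to be in percent, and the AR baseline is assumed to be the median of the last seven daily values.

```python
from statistics import median


def evaluate_alerts(metrics, ar_history_7d):
    """Return a list of (alert_name, severity) for the breached thresholds.

    `metrics` uses hypothetical keys; percentages are in percent units
    (e.g. 98.5), latencies in seconds.
    """
    alerts = []
    # AR drop of more than 3 percentage points against the 7-day median.
    if median(ar_history_7d) - metrics["ar_gross_pct"] > 3.0:
        alerts.append(("ar_drop", "P1"))  # escalate to P0 depending on coverage
    if (
        metrics["auth_p95_s"] > 1.5
        or metrics["webhook_p95_s"] > 5.0
        or metrics["capture_success_pct"] < 98.0
    ):
        alerts.append(("auth_webhook_capture", "P1"))
    if metrics["refund_error_pct"] > 0.3 or metrics["double_refunds"] > 0:
        alerts.append(("refund", "P0"))
    return alerts
```

Keeping the thresholds in one place like this makes it easier to review them against provider SLAs and to reuse them in drills.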
7) Communications (templates)
Internal (15 min):
8) Reconciliation and money (after stabilization)
Run auto-reconciliation, matching on provider_txid/idem_key/amount/time-bucket.
Triage the DLQ: orphans/duplicates/amount mismatches/fee drift.
Post reversals/corrections in the ledger; recalculate Cost/GGR and Fraud Loss.
Treasury: lift temporary measures (StressRes, payout-lock), rebalance pools.
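The auto-reconciliation and DLQ classification steps above could look roughly like the sketch below. It matches only on `provider_txid` and amount; the idempotency key, time buckets, and fee drift mentioned in the text are left out for brevity, and the record shape and category names are assumptions.

```python
from collections import Counter
from decimal import Decimal


def reconcile(ledger, psp):
    """Compare ledger entries with a PSP report and return DLQ candidates.

    Both inputs are lists of (provider_txid, amount) tuples. The PSP report
    is assumed to contain each txid at most once.
    """
    dlq = []
    psp_amounts = dict(psp)
    ledger_counts = Counter(txid for txid, _ in ledger)
    ledger_amounts = {}
    for txid, amount in ledger:
        ledger_amounts.setdefault(txid, amount)  # keep the first occurrence

    for txid, count in ledger_counts.items():
        if count > 1:
            dlq.append((txid, "duplicate"))
    for txid, amount in ledger_amounts.items():
        if txid not in psp_amounts:
            dlq.append((txid, "orphan_ledger"))    # we booked it, PSP did not
        elif psp_amounts[txid] != amount:
            dlq.append((txid, "amount_mismatch"))
    for txid in psp_amounts:
        if txid not in ledger_amounts:
            dlq.append((txid, "orphan_psp"))       # PSP has it, ledger does not
    return sorted(dlq)
```

Using `Decimal` for amounts avoids false mismatches caused by floating-point rounding.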
9) RCA (Root Cause Analysis) template
Context: Date/Time (UTC), Sev, Coverage, Metrics.
Symptoms: what you saw (graphs/screenshots).
Cause: root cause (technology/process/counterparty).
What worked/did not work: failover, feature flags, communications.
Financial effect: write-offs/non-payment/fees/SLA credits.
Actions:
- Tech: limits, idempotency, retries, tests.
- Processes: update playbook, QBR with PSP, SLA changes.
- Deadlines and task owners.
10) Automation and integration
Feature-flag platform: instant routing/degradation by country/BIN/method.
Runbook-bot: commands `/failover PSP_A→B`, `/freeze refunds`, `/enable polling`.
Anomaly detector: seasonality-aware detection of statistical deviations in AR/latency.
Post-incident macros: automatic opening of the RCA template, collection of logs/graphs, reconciliation checklist.
11) Drill calendar and UAT
Monthly: "Auth drop" drill (15 min from detection to failover).
Quarterly: "Webhook outage" + "Refund double-strike" (idempotency).
Semi-annual: "Settlement delay + Treasury stress" (StressRes).
UAT package: test cases for idempotency, failover, reconciliation, communications.
12) Playbook Success Metrics (Operational KPIs)
MTTA/MTTR: median/p95 by P0/P1.
Share of incidents with auto-failover within 10 min.
Incidents in which double charge/refund was prevented (= 100%).
Post-incident recon complete ≤ D+1.
Service credits recovered / month (per SLA).
User impact minutes.
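The MTTA/MTTR figures (median/p95 per severity) can be derived from incident timestamps. The sketch below assumes each incident is a dictionary with hypothetical keys `sev`, `detected`, `acknowledged`, and `resolved` (minutes on a common clock), measures both durations from detection, and uses a nearest-rank percentile.

```python
from statistics import median


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def kpis_by_sev(incidents):
    """Return {sev: {metric: value}} with median and p95 of TTA and TTR."""
    out = {}
    for sev in sorted({i["sev"] for i in incidents}):
        rows = [i for i in incidents if i["sev"] == sev]
        tta = [i["acknowledged"] - i["detected"] for i in rows]
        ttr = [i["resolved"] - i["detected"] for i in rows]
        out[sev] = {
            "mtta_median": median(tta),
            "mtta_p95": p95(tta),
            "mttr_median": median(ttr),
            "mttr_p95": p95(ttr),
        }
    return out
```

With only a handful of incidents per quarter the p95 is effectively the worst case, so it is worth reading it alongside the median rather than on its own.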
13) Frequent mistakes and how to avoid them
Late failover activation (no automatic thresholds).
No "freeze" on auto-refunds when webhooks fail.
No row-lock/versioning → partial refund exceeds the remaining balance.
Communication without facts/ETA → escalations to support.
No coordination with Treasury → TtP/TtW breach SLO.
Skipping reconciliation → "black holes" in revenue.
14) Appendices (reference blocks inside your wiki)
SLAs with payment providers - alert thresholds and credits.
Reconciliation of PSP payments and reports - recon/DLQ procedures.
Treasury: Liquidity and Reserves - StressRes/Prefunding.
Payment loop KPI - AR/TtW/TtR/Refund Health formulas.
Partial and full refunds - idempotency and policy.
Summary
A working playbook is scenario runbooks + automation + post-mortem discipline. It reduces MTTR, protects money (idempotency/reconciliation/treasury), minimizes user damage, and systematically improves relationships with PSPs on SLAs. Result - higher AR, TtW/TtR within target ranges, zero double charges/refunds, predictable cash flow.