A/B tests of payment scenarios
1) Why test payment scenarios
Increase approvals (AR) and reduce failures (DR).
Reduce cost: take-rate (interchange/scheme/markup/fixed) and cost-per-approval.
Reduce risk: fewer chargebacks and less fraud at the same approval level.
Sustainability: choose the provider, 3DS strategy, and routing per GEO/BIN/payment method.
2) Experiment design
2.1. Randomization unit
User-level (recommended): all attempts by one user land in the same arm → no "mixing" of 3DS/tokens.
BIN-level: when the test is about issuer routing; risk of cross-user confounding.
Order/attempt-level: acceptable for small UI experiments (e.g., error message copy); undesirable for routing/3DS.
2.2. Stratification (before randomization)
Stratify by: player GEO, issuer country/BIN6, payment method, channel (web/app), amount segment, risk score. This reduces variance and the risk of SRM.
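As a sketch, stratified assignment can be implemented by grouping users on a composite stratum key and splitting each stratum evenly across arms (the field names `geo`, `bin6`, `method` are illustrative, not part of any fixed schema):

```python
import random
from collections import defaultdict

def stratum_key(user):
    # Composite stratum: player GEO, issuer BIN6, payment method.
    return (user["geo"], user["bin6"], user["method"])

def stratified_assign(users, arms=("control", "treatment"), seed=42):
    """Shuffle users inside each stratum, then deal them round-robin
    across arms so every stratum is split in equal proportions."""
    by_stratum = defaultdict(list)
    for u in users:
        by_stratum[stratum_key(u)].append(u)
    rng = random.Random(seed)
    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)
        for i, u in enumerate(members):
            assignment[u["user_id"]] = arms[i % len(arms)]
    return assignment
```

Because arms are dealt round-robin inside each stratum, no stratum can end up accidentally skewed toward one arm.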
2.3. What we test
Routing/cascade: PSP_A vs PSP_B, sticky BIN, limit-aware.
3DS policy: frictionless→challenge, enforced 3DS for BIN/GEO.
UX flow: step sequence, error/retry copy.
Parameters: retry windows and soft-decline code lists.
Pricing: provider with IC++ vs blended, and the impact on all-in cost.
3) Metrics: targeted, secondary, guardrails
3.1. Primary metrics
AR (Approval Rate) = approved/attempted.
Cost-per-Approval = (auth+decline fees)/approved.
Take-rate% (all-in) = fees/volume (in reporting currency).
3DS pass-rate; liability shift %.
Payment-flow latency p95/p99.
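The primary metrics can be computed directly from raw attempts; a minimal sketch, assuming each attempt record carries its approval status, amount, and total fees in the reporting currency:

```python
def payment_kpis(attempts):
    """AR, cost-per-approval and all-in take-rate from a list of
    attempt dicts with 'approved' (bool), 'amount' and 'fees'
    (both already converted to the reporting currency)."""
    n = len(attempts)
    approved = sum(1 for a in attempts if a["approved"])
    fees = sum(a["fees"] for a in attempts)           # auth + decline fees
    volume = sum(a["amount"] for a in attempts if a["approved"])
    return {
        "ar": approved / n if n else 0.0,
        "cost_per_approval": fees / approved if approved else float("inf"),
        "take_rate_pct": 100.0 * fees / volume if volume else 0.0,
    }
```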
3.2. Risk metrics
Chargeback ratio (CBR), refund rate, fraud alerts/1000 trx.
FX slippage (bps) = effective vs reference FX.
3.3. Guardrails (stop conditions)
AR drop > Y bps, or CBR/refunds rising above threshold.
SRM (Sample Ratio Mismatch): traffic split deviating from the expected shares.
Spikes: latency, soft-decline surges, 3DS anomalies.
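The SRM guardrail is just a chi-square goodness-of-fit test on observed arm counts against the planned traffic shares; a stdlib-only sketch (the default critical value 3.841 corresponds to df = 1, p = 0.05 for a two-arm test; adjust for more arms):

```python
def srm_check(observed, expected_shares, chi2_crit=3.841):
    """Chi-square test for Sample Ratio Mismatch.
    observed: dict arm -> count; expected_shares: dict arm -> planned share.
    chi2_crit: critical value for df = (arms - 1) at the chosen alpha."""
    total = sum(observed.values())
    chi2 = 0.0
    for arm, share in expected_shares.items():
        expected = total * share
        chi2 += (observed[arm] - expected) ** 2 / expected
    return {"chi2": chi2, "srm": chi2 > chi2_crit}
```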
4) Stats and power
4.1. Sample size (approximation for proportions)
n_per_group ≈ 2 · (Z_{1−α/2} + Z_{1−β})² · p(1 − p) / δ²
where p is the baseline AR, δ is the expected AR uplift, α is the significance level, and β is the type II error rate.
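The formula translates directly into code; a sketch using Python's `statistics.NormalDist` for the normal quantiles:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p, delta, alpha=0.05, power=0.8):
    """Two-sided sample-size approximation for proportions:
    n ≈ 2 * (z_{1-alpha/2} + z_{1-beta})^2 * p(1-p) / delta^2."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_b = z.inv_cdf(power)           # e.g. 0.84 for power = 0.8
    return ceil(2 * (z_a + z_b) ** 2 * p * (1 - p) / delta ** 2)
```

For a baseline AR of 85% and a hoped-for uplift of 100 bps (δ = 0.01), this lands around twenty thousand users per arm, which is why small-uplift payment tests need substantial traffic.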
4.2. Sequential analysis (early stopping)
Alpha-spending (O'Brien-Fleming/Pocock): fix the look schedule in advance and spend α in stages.
SPRT/Bayesian methods suit operational decisions, but fix the protocol in advance.
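A conservative sketch of O'Brien-Fleming-style boundaries uses the common approximation z_k = z_{1−α/2}·√(K/k); it demands much stronger evidence at early looks and converges to the fixed-sample critical value at the final look. For exact alpha-spending schedules, use a dedicated group-sequential package rather than this approximation:

```python
from math import sqrt
from statistics import NormalDist

def obf_boundaries(n_looks, alpha=0.05):
    """Approximate O'Brien-Fleming z boundaries for n_looks
    equally spaced interim analyses: z_k = z_final * sqrt(K / k).
    Conservative sketch, not an exact alpha-spending function."""
    z_final = NormalDist().inv_cdf(1 - alpha / 2)
    return [z_final * sqrt(n_looks / k) for k in range(1, n_looks + 1)]
```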
4.3. Variance reduction
CUPED: Y' = Y − θ·(X − μ_X), where X is a pre-experiment covariate (AR/DR/risk score) and θ = Cov(Y, X)/Var(X) is the regression coefficient of Y on X.
Stratified estimates, cluster-robust standard errors (user/BIN clusters).
Bootstrap for take-rate/cost metrics (heavy tails).
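A minimal CUPED sketch: estimate θ = Cov(Y, X)/Var(X) on the pooled data, then subtract θ·(X − μ_X) from each observation. With a covariate correlated with the outcome, the adjusted values have lower variance at the same mean:

```python
from statistics import mean

def cuped_adjust(y, x):
    """CUPED adjustment Y' = Y - theta * (X - mu_X), with
    theta = Cov(Y, X) / Var(X). x is the pre-experiment covariate
    (e.g. each user's pre-period AR), aligned index-wise with y."""
    mu_x, mu_y, n = mean(x), mean(y), len(x)
    cov = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / (n - 1)
    var = sum((xi - mu_x) ** 2 for xi in x) / (n - 1)
    theta = cov / var
    return [yi - theta * (xi - mu_x) for xi, yi in zip(x, y)]
```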
4.4. Multi-variant tests and bandits
MAB (UCB/Thompson sampling): when it matters to learn on the fly while preserving conversion.
For compliance-critical metrics (CBR, liability), prefer classic A/B with guardrails.
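For the MAB case, Beta-Bernoulli Thompson sampling over approval outcomes takes only a few lines; a sketch where each arm tracks its (approved, declined) counts and traffic is routed to the arm with the highest sampled AR:

```python
import random

def thompson_pick(arms, rng=random):
    """Beta-Bernoulli Thompson sampling.
    arms: dict name -> (approved, declined). Samples an AR from
    Beta(approved + 1, declined + 1) per arm and routes the next
    attempt to the arm with the highest sample."""
    samples = {name: rng.betavariate(s + 1, f + 1)
               for name, (s, f) in arms.items()}
    return max(samples, key=samples.get)
```

As evidence accumulates, the posterior for the weaker arm rarely wins a draw, so traffic naturally concentrates on the better route while still occasionally exploring.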
5) Experimental platform architecture
1. Assignment service: deterministic hash of (user_id, experiment_id, salt) → bucket.
2. Feature flags / rules engine: activates the route/3DS/retry policy for each arm.
3. Events: attempts/outcomes (authorize/capture/refund/chargeback) → bus (Kafka/PubSub).
4. Idempotency: a single idempotency_key per cascade.
5. DWH/data marts: normalized statuses, fees, FX, risk flags.
6. Monitoring: online SLIs (AR/3DS/latency), alerts, SRM check.
7. Protocols: pre-registered hypothesis, final criteria, data freeze.
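Step 1 can be sketched as a deterministic hash of (user_id, experiment_id, salt) mapped onto 100 buckets, with contiguous bucket ranges carved up by traffic share (function names here are illustrative):

```python
import hashlib

def assign_bucket(user_id, experiment_id, salt, n_buckets=100):
    """Deterministic bucketing: the same user always lands in the
    same bucket for a given experiment; changing the salt reshuffles."""
    key = f"{user_id}:{experiment_id}:{salt}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_buckets

def assign_arm(user_id, experiment_id, salt, traffic_shares):
    """traffic_shares: list of (arm, share) tuples summing to 1.0.
    Buckets 0..n-1 are split into contiguous ranges per share."""
    bucket = assign_bucket(user_id, experiment_id, salt)
    edge = 0.0
    for arm, share in traffic_shares:
        edge += share * 100
        if bucket < edge:
            return arm
    return traffic_shares[-1][0]  # guard against float rounding
```

Because the hash includes experiment_id and salt, assignments across experiments are independent, and reusing a user's bucket between runs requires deliberately reusing the salt.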
6) Data model (minimum)
sql
ref.experiments (
  exp_id PK, name, hypothesis, owner, start_at, end_at,
  unit, -- USER | BIN | ORDER
  target_metric, guardrails JSONB, design JSONB, alpha NUMERIC, power NUMERIC, meta JSONB
);
ref.experiment_arms (
  exp_id FK, arm_id, name, traffic_share NUMERIC, params JSONB, enabled BOOLEAN
);
assignments.buckets (
  exp_id, user_id, assigned_arm, assigned_at, salt, hash_key, PRIMARY KEY (exp_id, user_id)
);
events.payments (
  attempt_id PK, user_id, exp_id, arm_id,
  provider, method, bin, iso2, risk_score,
  status, decline_code, three_ds_used BOOLEAN, liability_shift BOOLEAN,
  amount_minor BIGINT, currency, latency_ms INT,
  authorized_at, captured_at, settled_at, meta JSONB
);
finance.fees (
  attempt_id FK, interchange_amt NUMERIC, scheme_amt NUMERIC, markup_amt NUMERIC,
  auth_amt NUMERIC, refund_amt NUMERIC, cb_amt NUMERIC, gateway_amt NUMERIC,
  fx_slippage_amt NUMERIC, reporting_currency TEXT
);
risk.outcomes (
  attempt_id FK, is_refund BOOLEAN, is_chargeback BOOLEAN, fraud_alert BOOLEAN
);
7) SQL templates
7.1. SRM check (traffic share per arm)
sql
SELECT arm_id,
       COUNT(*) AS n,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS share_pct
FROM assignments.buckets
WHERE exp_id = :exp
GROUP BY 1;
7.2. Key metrics per arm
sql
WITH base AS (
  SELECT e.arm_id,
         COUNT(*) AS attempts,
         COUNT(*) FILTER (WHERE status = 'APPROVED') AS approvals,
         AVG(latency_ms) AS latency_avg_ms,
         AVG((three_ds_used)::int) AS three_ds_share
  FROM events.payments e
  WHERE e.exp_id = :exp AND e.authorized_at BETWEEN :from AND :to
  GROUP BY 1
),
cost AS (
  SELECT e.arm_id,
         SUM(f.interchange_amt + f.scheme_amt + f.markup_amt +
             f.auth_amt + f.refund_amt + f.cb_amt + f.gateway_amt + f.fx_slippage_amt) AS fees_rep,
         SUM(e.amount_minor) / 100.0 AS volume_rep
  FROM events.payments e
  JOIN finance.fees f USING (attempt_id)
  WHERE e.exp_id = :exp AND e.settled_at BETWEEN :from AND :to
  GROUP BY 1
)
SELECT b.arm_id,
       approvals::numeric / NULLIF(attempts, 0) AS ar,
       fees_rep / NULLIF(volume_rep, 0) AS take_rate,
       (SELECT COUNT(*) FROM risk.outcomes r
        JOIN events.payments e2 USING (attempt_id)
        WHERE e2.exp_id = :exp AND e2.arm_id = b.arm_id AND r.is_chargeback) = 0
         AS cb_zero_flag,
       latency_avg_ms, three_ds_share
FROM base b LEFT JOIN cost c ON c.arm_id = b.arm_id;
7.3. CUPED for AR (example)
sql
WITH pre AS (
  SELECT user_id, AVG((status = 'APPROVED')::int) AS ar_pre
  FROM events.payments
  WHERE authorized_at < :pre_from_end
  GROUP BY 1
),
cur AS (
  SELECT e.user_id, e.arm_id, (e.status = 'APPROVED')::int AS ar_flag
  FROM events.payments e
  WHERE e.exp_id = :exp AND e.authorized_at BETWEEN :from AND :to
),
params AS (
  SELECT AVG(p.ar_pre) AS mu_pre,
         COVAR_SAMP(c.ar_flag, p.ar_pre) / VAR_SAMP(p.ar_pre) AS theta
  FROM cur c JOIN pre p USING (user_id)
)
SELECT c.arm_id,
       AVG(c.ar_flag - params.theta * (COALESCE(p.ar_pre, params.mu_pre) - params.mu_pre)) AS ar_cuped
FROM cur c
LEFT JOIN pre p USING (user_id)
CROSS JOIN params
GROUP BY c.arm_id;
7.4. Guardrail check (example)
sql
SELECT arm_id,
       100.0 * SUM(is_chargeback::int)::numeric / NULLIF(COUNT(*), 0) AS cbr_pct,
       100.0 * SUM(is_refund::int)::numeric / NULLIF(COUNT(*), 0) AS refund_pct
FROM risk.outcomes r
JOIN events.payments e USING (attempt_id)
WHERE e.exp_id = :exp AND e.settled_at BETWEEN :from AND :to
GROUP BY 1
HAVING 100.0 * SUM(is_chargeback::int)::numeric / NULLIF(COUNT(*), 0) > :cbr_threshold
    OR 100.0 * SUM(is_refund::int)::numeric / NULLIF(COUNT(*), 0) > :refund_threshold;
8) Test process (end-to-end)
1. Pre-registration: hypothesis, metrics, design, sample sizes, stop rules.
2. AA test / SRM check on a null effect (a couple of days).
3. Launch: assignment freeze, logic in the rules engine / feature flags.
4. Online monitoring: AR/3DS/latency/health + guardrails.
5. Interim alpha-spending checks (if planned).
6. Finish and data freeze: only after funding/reserves/late CBs/refunds are accounted for.
7. Analysis: CUPED/stratification, sensitivity checks, heterogeneity by GEO/BIN/method/channel.
8. Decision: roll-out, roll-back, or follow-up test; update rules/routing.
9. Documentation and retrospective: lessons learned, updated thresholds/weights.
9) Anti-patterns and traps
Peeking/repeated looks without a protocol → false wins.
Order-level randomization in routing tests → leakage between arms.
Multiplicity (many metrics/slices) without α correction.
Incomplete cost accounting (forgotten FX/reserve/refund fees) → wrong take-rate.
Missing SRM check → skewed splits go unnoticed.
Non-idempotent retries → double authorizations and distorted AR.
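The last trap is worth a concrete illustration: a gateway that deduplicates by idempotency key replays the stored result instead of authorizing twice, so a network retry of the same cascade step cannot inflate AR. This is a toy sketch, not any real PSP's API:

```python
class PaymentGateway:
    """Toy gateway that deduplicates authorize calls by
    idempotency key: a retried request replays the stored
    result rather than creating a second authorization."""

    def __init__(self):
        self._seen = {}          # idempotency_key -> stored result
        self.authorizations = 0  # how many real authorizations ran

    def authorize(self, idempotency_key, amount):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # replay, no side effect
        self.authorizations += 1
        result = {"status": "APPROVED", "amount": amount}
        self._seen[idempotency_key] = result
        return result
```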
10) Safety, compliance and ethics
Same-method/return-to-source should not be broken by the test.
Sanctions/licensing/GEO policies are out of scope for experimentation.
RG/responsible gaming: do not degrade protection mechanisms for the sake of AR.
PCI/GDPR: tokens instead of PAN, minimizing personal data, DPA/SOC2.
11) Experiment dashboard KPI
AR/DR, uplift and confidence intervals by arm and key strata (GEO/BIN/method).
Cost-per-Approval, take-rate %, FX slippage (bps).
3DS pass/liability shift, soft-decline share.
Latency p95/p99, errors/timeouts.
CB/Refunds (lag-aware), SRM, traffic coverage, duration.
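Uplift confidence intervals for AR can be reported with a simple Wald interval on the difference of two proportions, expressed in basis points; a sketch for a dashboard tile (for small samples or heavy stratification, prefer the variance-reduced analytics from section 4):

```python
from math import sqrt
from statistics import NormalDist

def ar_uplift_ci_bps(appr_t, n_t, appr_c, n_c, alpha=0.05):
    """Wald CI for AR(treatment) - AR(control), in basis points.
    appr_*: approved counts; n_*: attempt counts per arm."""
    p_t, p_c = appr_t / n_t, appr_c / n_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_t - p_c
    return (10000 * (diff - z * se), 10000 * (diff + z * se))
```

An interval whose lower bound stays above zero (and above the cost-justified minimum uplift) is what should turn a dashboard tile green.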
12) Best practices (short)
1. Randomize at the user level and stratify.
2. Use guardrails and SRM check; fix the protocol.
3. Consider full cost (fees + FX + reserve) and cost-per-approval.
4. Use CUPED, cluster-robust errors, and bootstrap for cost metrics.
5. For critical risks use classic A/B; reserve bandits mainly for pricing/AR tasks.
6. Account for funding/reserves/late CBs before drawing final conclusions.
7. Document and version the rules; do post-mortem.
13) Start-up checklist
- Hypothesis, metrics, expected effect, design, sample size, duration.
- Randomization unit and strata, assignment service, feature flags.
- Guardrails/thresholds, SRM/AA-precheck, alerts.
- Logs/events, idempotency, status normalization.
- Data marts for fees/FX/reserves; reporting currency.
- Alpha-spending plan and data freeze.
- Playbooks roll-out/roll-back; documentation of results.
Summary
A/B tests of payment scenarios are an engineering and statistical discipline: correct randomization and stratification, full cost and risk metrics, guardrails and SRM, careful analytics (CUPED, cluster-robust errors, sequential analysis), and battle-ready infrastructure (idempotency, telemetry, reconciliation). Following this methodology raises AR and lowers all-in take-rate without paying for "false victories" with higher chargebacks and regulatory risk.