A/B interface testing
Introduction
A/B testing is a controlled experiment in which two (or more) versions of an interface are shown to real users to determine which version leads to better product metrics. The goal is to reduce uncertainty in decision making and improve UX through verifiable changes rather than opinions.
When A/B testing is appropriate
There is a measurable goal (conversion, time to action, retention, NPS, task speed).
The expected effect is not obvious or may differ by segment.
The risk of change is high enough to justify the experiment.
Traffic allows you to quickly collect a statistically significant sample.
When it is better not to test: microcopy on rarely used screens, features with strong network/social dependence (effect spillover between groups), changes that require users to relearn behavior over a long period.
Hypothesis formulation
Template: If we change [X in the interface] for [segment Y / all users], then [metric Z] will change in [direction/by magnitude] because [behavioral reason].
Example: If we move the main CTA above the fold and reduce the form from 6 to 3 fields, then the CR of the primary action will increase by 3-5% due to reduced friction.
Metrics: target and guardrail
Primary: a single key metric, for example completion of the target flow / conversion.
Secondary: scrolling depth, CTR, time to action, errors, page speed.
Guardrails (protective metrics): performance stability (TTFB, LCP), refunds/bounces, complaints/rollbacks, compliance with notification limits, availability.
Fix the MDE (minimum detectable effect), the observation window, and the success criteria in advance.
Experiment design
Randomization and unit of analysis
Randomization unit: the user (user_id); sometimes a session or an organization (cluster). A deterministic hashing sketch follows this list.
Stratification/blocking: by device/channel if there are strong differences.
Spillover: avoid designs where the behavior of one group affects the other (for example, shared lists/feeds). In such cases, use cluster-level tests.
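A minimal sketch of deterministic bucketing by the randomization unit (the function name and salt format are illustrative assumptions): hashing a stable user_id together with the experiment name keeps every user in the same variant across sessions and devices.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to a variant: the same user_id always
    gets the same variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Example: the assignment is stable across repeated calls
assert assign_variant("user-42", "checkout_form_v2") == assign_variant("user-42", "checkout_form_v2")
```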
Sample size and MDE (simplified)
Rule of thumb: the lower the baseline conversion and the smaller the effect, the larger the required sample.
For a baseline CR of ~10% and a relative MDE of ~5%, tens of thousands of observations per variant are often required (see the sketch below).
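A rough per-variant sample size can be computed with the standard normal-approximation formula for comparing two proportions; the function below is an illustrative sketch, not a substitute for a full power analysis.

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(p_base: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size for a two-proportion test."""
    p_alt = p_base * (1 + mde_rel)          # conversion under the hypothesized lift
    z_a = norm.ppf(1 - alpha / 2)           # two-sided significance threshold
    z_b = norm.ppf(power)                   # power requirement
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil((z_a + z_b) ** 2 * variance / (p_base - p_alt) ** 2)

print(sample_size_per_variant(0.10, 0.05))  # roughly 58,000 users per variant
```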
Duration
Cover a full weekly behavior cycle plus a margin (usually 2-4 weeks), or run until the planned sample size/power is reached. Do not stop the test prematurely.
Ramp-up (gradual rollout)
1-5% of traffic (canary) → 10-25% → 50% → 100%, monitoring guardrail metrics at each step.
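A sketch of stable percentage gating (the function name and salt are illustrative assumptions): hash-based bucketing means raising the percentage only adds users, so the canary group stays a subset of later stages.

```python
import hashlib

def in_rollout(user_id: str, experiment: str, percent: int) -> bool:
    """True for a stable `percent` of users; already-exposed users are never
    removed as `percent` grows from canary to full rollout."""
    digest = hashlib.sha256(f"rollout:{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```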
Data quality and validity
SRM (Sample Ratio Mismatch)
Verify that the actual traffic split between A and B matches the plan (for example, 50/50). A significant deviation signals a problem with assignment, feature flags, or bot filtering.
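A quick SRM check is a chi-square goodness-of-fit test against the planned split; the counts below are illustrative, and p < 0.001 is a commonly used, deliberately strict alert threshold.

```python
from scipy.stats import chisquare

observed = [50310, 49050]                         # users actually bucketed into A and B
expected = [sum(observed) * r for r in (0.5, 0.5)]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print("SRM detected: check assignment logic, feature flags and bot filtering")
```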
Identity and cross-device
Use a stable user_id; account for cross-device usage, cookie churn, and users who authenticate later in the funnel.
Bots and anomalies
Filter unnatural patterns (super-fast clicks, missing user agents, invalid referrers).
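A minimal filtering sketch, assuming a hypothetical event log with user_id, ts and user_agent columns; the schema, file name and the 0.2-second threshold are assumptions, not a standard.

```python
import pandas as pd

# Hypothetical event log; column names and the threshold below are assumptions
events = pd.read_parquet("events.parquet").sort_values(["user_id", "ts"])

# Seconds between consecutive actions of the same user (ts assumed to be datetime)
gap = events.groupby("user_id")["ts"].diff().dt.total_seconds()

suspicious = events["user_agent"].isna() | (gap < 0.2)

# Exclude every user who produced at least one suspicious event
bot_like = events.loc[suspicious, "user_id"].unique()
clean = events[~events["user_id"].isin(bot_like)]
```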
Seasonality and events
Do not run tests during strongly atypical periods (holidays, sales) unless studying them is the purpose of the test.
Statistical analysis
Frequentist approach (classic)
Fix alpha (usually 0.05) and power (usually 80%).
Do not "peep" every hour without adjustments - the risk of false positives.
For multiple metrics/variants, apply adjustments (Bonferroni/Holm/Hochberg) or a hierarchy of metrics.
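A sketch of the classical analysis with statsmodels (all counts and p-values are illustrative): a two-proportion z-test for the primary metric, plus a Holm correction when several secondary metrics are tested.

```python
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

# Primary metric: conversions and users in A and B (illustrative counts)
z_stat, p_primary = proportions_ztest(count=[1210, 1302], nobs=[12000, 12050])
print(f"primary metric: z={z_stat:.2f}, p={p_primary:.4f}")

# Secondary metrics: raw p-values corrected with the Holm procedure
raw_pvalues = [0.012, 0.044, 0.300]
reject, adjusted, _, _ = multipletests(raw_pvalues, alpha=0.05, method="holm")
print(list(zip(adjusted.round(3), reject)))
```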
Bayesian approach
Estimates the probability distribution of the effect and the probability that one variant is better than the other.
Convenient for real-time monitoring and "good enough" decision making.
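A minimal Bayesian sketch with Beta(1, 1) priors and Monte Carlo sampling (counts are illustrative): it yields the probability that B beats A and the expected relative lift.

```python
import numpy as np

rng = np.random.default_rng(7)
conv_a, n_a = 1210, 12000      # illustrative conversions / users for variant A
conv_b, n_b = 1302, 12050      # and for variant B

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each conversion rate
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

p_b_better = (post_b > post_a).mean()
expected_lift = ((post_b - post_a) / post_a).mean()
print(f"P(B > A) = {p_b_better:.3f}, expected relative lift = {expected_lift:.2%}")
```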
CUPED/covariates
Variance reduction using pre-experiment covariates (e.g., the previous week's activity) → the planned power is reached faster.
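A sketch of the CUPED adjustment on synthetic data (the pre-period activity array is a hypothetical covariate): subtracting theta times the centered covariate keeps the mean of the metric but shrinks its variance.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: y_adj = y - theta * (x_pre - mean(x_pre)), theta = cov(y, x) / var(x)."""
    theta = np.cov(y, x_pre, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Synthetic illustration: in-experiment metric correlated with pre-period activity
rng = np.random.default_rng(0)
x_pre = rng.gamma(2.0, 3.0, size=10_000)           # last week's activity (covariate)
y = 0.8 * x_pre + rng.normal(0, 2, size=10_000)    # metric observed during the test
y_adj = cuped_adjust(y, x_pre)
print(y.var() / y_adj.var())                       # variance reduction factor > 1
```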