Feature flags and A/B tests
1) Why you need it
Experimentation is a controlled way to improve conversion and reliability without the risk of breaking production. In iGaming this touches registration, deposits/withdrawals, bets/settlement, KYC/AML funnels, lobby/UX, bonuses, and anti-fraud. Feature flags give fast, reversible changes; A/B tests give evidence of effect before scaling.
2) Platform principles
1. Safety-by-design: flags with TTL, rollbacks and coverage limits; no enabling flags while SLOs are red.
2. Compliance-aware: SoD/4-eyes for sensitive flags (payments, RG, PII); geo-residency of data.
3. Single Source of Truth: all flags/experiments live as data (Git/policy repository).
3) Taxonomy of flags
Release flags: control version rollout (canary/rollout/kill-switch).
Experiment flags: A/B/n, multi-armed bandits, interleaving for ranking.
Ops flags: temporary feature degradation, provider switching (PSP/KYC).
Config flags: parameters without release (limits, texts, coefficients).
Safety-flags: emergency switches (export PII off, bonus caps).
Each flag has: `owner`, `risk_class`, `scope` (tenant/region), `rollout_strategy`, `ttl`, `slo_gates`, `audit`.
4) Platform architecture
Flag Service (CDN cache): returns decisions in ≤10–20 ms; subscribes to the GitOps reconciler.
Assignment Engine: stable hash + stratification (GEO/brand/device) → buckets (see the decision-path sketch after this list).
Experiment Service: test catalog, MDE/power calculation, SRM/guardrails, statistics.
Exposure Logger: idempotent log of flag/variant exposures + event key.
Metrics API: SLI/KPI/KRI and experiment aggregates (CUPED/adjustments).
Policy Engine: SoD/4-eyes, freeze windows, geo-constraints, SLO gates.
Dashboards & Bot: reports, guardrail alerts, short chatbot commands.
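A minimal sketch of the decision path these components imply, in Python; the field and function names (`kill_switch_engaged`, `salt`, `allocation`) follow the simplified data model of section 5 and are illustrative, not a real SDK:

```python
import hashlib
from typing import Optional

def decide(flag: dict, user_id: str, slo_green: bool) -> Optional[str]:
    """Return a variant for user_id, or None if the flag must stay off."""
    # Safety first: an engaged kill switch and red SLOs override everything.
    if flag.get("kill_switch_engaged"):
        return None
    if flag.get("slo_gates") and not slo_green:
        return None  # never enable while SLOs are red

    # Deterministic bucketing: stable hash of secret salt + user id (section 7).
    digest = hashlib.sha256(f"{flag['salt']}:{user_id}".encode()).hexdigest()
    point = (int(digest, 16) % 10_000) / 10_000  # uniform in [0, 1)

    # Walk the allocation, e.g. {"A": 0.5, "B": 0.5}; shares summing below 1.0
    # model partial rollout coverage.
    cumulative = 0.0
    for variant, share in flag["allocation"].items():
        cumulative += share
        if point < cumulative:
            return variant
    return None  # user is outside the rollout coverage

# Illustrative call; the resulting exposure would then go to the Exposure Logger.
print(decide({"salt": "s1", "slo_gates": ["auth_success"],
              "allocation": {"A": 0.5, "B": 0.5}}, "user-42", slo_green=True))
```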
5) Data model (simplified)
Flag: `id`, `type`, `variants`, `allocation{A:0.5, B:0.5}`, `strata{geo,tenant,device}`, `constraints`, `ttl`, `kill_switch`, `slo_gates`, `risk_class`, `audit`.
Experiment: `id`, `hypothesis`, `metrics{primary,secondary,guardrails}`, `audience`, `power`, `mde`, `duration_rule`, `sequential?`, `cuped?`, `privacy_scope`.
6) Idea-to-inference process
1. Hypothesis: target metric, risk/compliance assessment, MDE (minimum detectable effect).
2. Design: choice of audience and stratification (GEO/tenant/device), power and duration calculation (see the sample-size sketch after this list).
3. Randomization and start: enabling via Policy-Engine (SLO green, SoD passed).
4. Monitoring: SRM checks (randomization distortion), guardrails (errors/latency/revenue).
5. Analysis: frequentist (t-test, U-test) or Bayesian; CUPED for variance reduction.
6. Decision: promote/rollback/iterate; record the result in the knowledge catalog.
7. Archiving: retire the flag at TTL, remove it from configuration/code, clean up telemetry.
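As a companion to the design step (power and duration, step 2 above), here is a minimal sample-size sketch for a conversion metric; the 20% baseline is an assumed illustrative figure, SciPy is used only for the normal quantiles, and duration then follows from dividing the result by daily eligible traffic per variant:

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided test of two proportions
    (normal approximation). baseline and mde are absolute rates,
    e.g. 0.20 and 0.01 for a +1 pp MDE."""
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
                 z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / mde ** 2)

# Example: 20% baseline deposit conversion, +1 pp MDE, alpha = 5%, power = 80%
print(sample_size_per_arm(0.20, 0.01))  # roughly 25-26 thousand users per variant
```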
7) Assignment and bucketing
Deterministic: `bucket = hash(secret_salt + user_id) mod N`.
Stratification: separately by `geo, tenant, device, new_vs_returning` → uniformity within each layer (see the sketch below).
One salt per period: changes are controlled to avoid collisions/carry-over.
Exposures: logged before the first target-metric event (to avoid selective logging).
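A small sketch of this deterministic bucketing and of the uniformity-in-layers check; the salt value and strata are illustrative:

```python
import hashlib
from collections import Counter, defaultdict
from random import choice, seed

def bucket(user_id: str, salt: str, variants=("A", "B")) -> str:
    """bucket = hash(secret_salt + user_id) mod N; the salt stays server-side
    and changes only between experiment periods."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Uniformity in layers: within every stratum the split should be close to 50/50.
seed(7)  # synthetic strata, for the check only
per_stratum = defaultdict(Counter)
for i in range(100_000):
    stratum = (choice(["TR", "EU"]), choice(["ios", "android", "web"]))
    per_stratum[stratum][bucket(f"user-{i}", salt="period-2024Q3")] += 1
for stratum, counts in sorted(per_stratum.items()):
    print(stratum, dict(counts))
```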
8) Metrics and guardrails
Primary: registration/deposit conversion, ARPPU, D1/D7 retention, KYC speed, lobby CTR.
Secondary: LCP/JS errors, p95 bet→settle, PSP auth-success rate.
Guardrails: error_rate, p99 latency, SLO burn rate, complaints/tickets, RG thresholds (responsible gaming).
Long-term: churn, LTV proxies, chargebacks, RG flags.
9) Statistics and decision-making
MDE & power: predefined (e.g. MDE = +1.0 pp, power = 80%, α = 5%).
SRM (Sample Ratio Mismatch): χ²-test every N minutes; on SRM, pause the test and investigate (see the sketch after this list).
CUPED: covariate = pre-experiment behavior/baseline conversion (reduces variance).
Multiplicity corrections: Bonferroni/Holm or FDR control.
Sequential: group sequential/always-valid p-values (SPRT, mSPRT) for safe early stops.
Bayesian: posterior probability of improvement and expected loss; useful when error costs are asymmetric.
Interference/peeking: prohibit "look and decide" outside of sequential procedures; log all interim looks.
Non-parametric: Mann-Whitney for heavy tails; bootstrap for stability.
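A minimal sketch of the SRM check and the CUPED adjustment referenced above, using NumPy/SciPy; the counts and the 0.001 alarm threshold are illustrative:

```python
import numpy as np
from scipy.stats import chisquare

def srm_detected(observed: dict, planned_split: dict, alpha: float = 0.001) -> bool:
    """Chi-square test of observed assignment counts against the planned split;
    a very small p-value means randomization is broken and the test should pause."""
    variants = sorted(observed)
    total = sum(observed.values())
    obs = [observed[v] for v in variants]
    exp = [planned_split[v] * total for v in variants]
    _, p_value = chisquare(obs, f_exp=exp)
    return p_value < alpha

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: subtract theta * (pre-experiment covariate minus its mean);
    the adjusted metric keeps its expectation but has lower variance."""
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Illustrative: 50,420 vs 49,580 under a planned 50/50 split.
print(srm_detected({"A": 50_420, "B": 49_580}, {"A": 0.5, "B": 0.5}))
# prints False here: p ~ 0.008, above the 0.001 alarm threshold
```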
10) Privacy and compliance
No PII in labels or exposures: tokenization, geo-scoped storage.
SoD/4-eyes: experiments affecting payouts/limits/PII/responsible play.
Holdout by RG/Compliance: part of the traffic is always in control (to see regulatory/ethical effects).
Data minimization: store only the necessary aggregates and keys.
WORM audit: who started/changed/stopped, parameters, versions.
11) Integrations (operational)
CI/CD & GitOps: flags as data; PR review, schema validation (see the sketch after this list).
Alerting: guardrail breach → auto-pause of the flag, notification of the IC/owner.
Incident bot: commands `/flag on/off`, `/exp pause/resume`, `/exp report`.
Release gates: block releases while active experiments run in sensitive areas without the owner online.
Metrics API: reports, SLO-gates, exemplars (trace_id for degradation).
Status page: does not publish experiment details; mention them only when availability is affected.
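A minimal sketch of the schema-validation gate for "flags as data" in CI, using PyYAML and the jsonschema library; the schema is an illustrative subset mirroring the manifest shape of section 12.1, not a published specification:

```python
import sys
import yaml  # PyYAML
from jsonschema import validate, ValidationError

# Illustrative subset of a FeatureFlag schema (see sections 5 and 12.1).
FLAG_SCHEMA = {
    "type": "object",
    "required": ["apiVersion", "kind", "metadata", "spec"],
    "properties": {
        "kind": {"const": "FeatureFlag"},
        "metadata": {
            "type": "object",
            "required": ["id", "owner", "risk_class"],
            "properties": {"risk_class": {"enum": ["low", "medium", "high"]}},
        },
        "spec": {
            "type": "object",
            "required": ["type", "ttl", "kill_switch"],
            "properties": {"ttl": {"type": "string", "pattern": r"^\d+[dhm]$"}},
        },
    },
}

def lint_flag_file(path: str) -> int:
    """CI gate: reject a PR whose flag manifest does not match the schema."""
    with open(path) as fh:
        doc = yaml.safe_load(fh)
    try:
        validate(instance=doc, schema=FLAG_SCHEMA)
    except ValidationError as err:
        print(f"{path}: {err.message}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(max((lint_flag_file(p) for p in sys.argv[1:]), default=0))
```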
12) Configurations (examples)
12.1 Canary rollout flag
```yaml
apiVersion: flag.platform/v1
kind: FeatureFlag
metadata:
  id: "lobby.newLayout"
  owner: "Games UX"
  risk_class: "medium"
spec:
  type: release
  scope: { tenants: ["brandA"], regions: ["EU"] }
  allocation:
    steps:
      - { coverage: "5%", duration: "30m" }
      - { coverage: "25%", duration: "1h" }
      - { coverage: "100%" }
  slo_gates: ["slo-green:auth_success", "slo-green:bet_settle_p99"]
  ttl: "30d"
  kill_switch: true
```
12.2 A/B experiment with guardrails and CUPED
```yaml
apiVersion: exp.platform/v1
kind: Experiment
metadata:
  id: "payments.depositCTA.v3"
  hypothesis: "The new button increases deposit conversion by +1 pp"
  owner: "Payments Growth"
spec:
  audience:
    strata: ["geo", "tenant", "device"]
    filters: { geo: ["TR", "EU"] }
    split: { A: 0.5, B: 0.5 }
  metrics:
    primary: ["deposit_conversion"]
    secondary: ["signup_to_kyc", "auth_success_rate"]
    guardrails: ["api_error_rate<1.5%", "latency_p99<2s", "slo_burnrate<1x"]
  stats:
    alpha: 0.05
    power: 0.8
    mde: "1pp"
    cuped: true
    sequential: true
  operations:
    srm_check: "5m"
    pause_on_guardrail_breach: true
    ttl: "21d"
```
13) Dashboards and reporting
Exec: lift by key metrics, percentage of successful experiments, economic effect.
Ops/SRE: guardrail-alerts, SRM, SLO degradation, impact on lags/queues.
Domain: funnels (registration→deposit→bet), segments by GEO/PSP/device.
Catalog: knowledge base on completed experiments (what tried, what worked/didn't, effects on RG/compliance).
14) KPI/KRI of the experimentation function
Time-to-Test: idea→start (days).
Test Velocity: experiments/month per team/domain.
Success Rate: proportion of tests with a positive, statistically significant effect.
Guardrail Breach Rate: share of tests with SLO/error-rate breaches.
SRM Incidence: proportion of tests with impaired randomization.
Documentation Lag: time from completion to the knowledge-catalog entry.
Cost per Test: spend on telemetry/computation/maintenance.
Long-term Impact: LTV/churn/chargebacks change on winning variant cohorts.
15) Implementation Roadmap (6-10 weeks)
Weeks 1–2:
- Flag/experiment repository, schemas (JSON Schema), basic Flag Service with cache.
- Policy-Engine (SoD/4-eyes, SLO-gates), integration with GitOps.
- Assignment Engine (hash + strata), Exposure Logger, SRM check, guardrails alerts.
- First set of flags: release + ops (kill-switch), 1–2 safe A/B tests.
- Statistics module: CUPED, frequentist and Bayesian reports, sequential control.
- Dashboards (Exec/Ops/Domain), incident-bot commands `/flag`, `/exp`.
- Autopause by guardrails, integration with Release-gates, knowledge catalog.
- Process documentation, team training (Growth/Payments/Games).
- Multi-region and geo-residency, FinOps cardinality limits, chaos drills (SRM disruption).
- Certification of experiment owners, WORM audit.
16) Antipatterns
Enabling flags "all at once" without canaries and SLO gates.
Mixing release flags and experiment flags into one entity without explicit goals.
Client-side randomization without salt/determinism → SRM/manipulation.
Peeking without sequential control; choosing the winning metric after the fact.
No guardrails and no owner on duty → more incidents.
Storing PII in exposures/labels; ignoring geo-residency.
Not retiring flags at TTL → "frozen" branches and behavior.
17) Best Practices (Brief)
Small, clear hypotheses; one primary metric per test.
Start with 5-10% traffic and strict guardrails.
CUPED almost always; Bayesian when decision speed matters and error costs are asymmetric.
Always check SRM and invariant metrics.
Write a post-analysis and add to the knowledge catalog.
Respect Responsible Gaming (RG): don't incentivize harmful behavior with short-term revenue metrics.
Summary
Feature flags and A/B tests form the production loop for change: flags as data, safe randomization and rigorous statistics, SLO/compliance guardrails, observability and audit. This approach lets you learn quickly in production, raising conversion and quality without raising risk, with effects that can be demonstrated to the business and to regulators.