Feature Flags and feature release
A feature flag (FF) is a managed condition that enables or disables system behavior without releasing new code. Flags let you roll out features safely, target groups of users, markets, or tenants, quickly disable problematic components, run experiments, and configure parameters at runtime.
Key objectives:
- Reduce the blast radius of releases.
- Separate deployment from activation.
- Enable transparent change management with auditing, SLOs, and one-click rollback.
1) Types of flags and when to apply them
Release flags - phased enablement of a new feature (dark → canary → ramp-up → 100%).
Ops/kill-switch - instantly disable a dependency (provider, subsystem, heavy computation).
Experiment (A/B, multi-variant) - split traffic across variants (weights, sticky bucketing).
Permission/Entitlement - access to features by role, plan, or jurisdiction.
Remote Config - behavior parameters (threshold, timeout, formula) delivered through the flag or config.
Migration flags - switch schemas or data paths (moving to a new index, DB, or endpoint).
Anti-pattern: a single flag that covers everything. Split it into a feature flag, an ops switch, and parameters.
2) Flag data model (minimum)
yaml
flag:
  key: "catalog.new_ranker"
  type: "release"  # release | ops | kill | experiment | permission | config | migration
  description: "New catalog ranking"
  owner: "search-team@company"
  created_at: "2025-10-01T10:00:00Z"
  ttl: "2026-01-31"  # deletion deadline after 100% enablement
  rules:
    - when:
        tenant_id: ["brand_eu", "brand_latam"]
        region: ["EE", "BR"]
        user_pct: 10  # progressive percentage
      then: "on"
    - when:
        kyc_tier: ["unverified"]
      then: "off"
  variants:  # for experiments
    - { name: "control", weight: 50 }
    - { name: "v1", weight: 30 }
    - { name: "v2", weight: 20 }
  payload:
    v1:
      boost_freshness: 0.3
      boost_jackpot: 0.2
    v2:
      boost_freshness: 0.2
      boost_jackpot: 0.4
  prerequisites:  # dependent flags/schema versions
    - key: "catalog.index_v2_ready"
      must_be: "on"
  audit:
    require_ticket: true
    change_window: "09:00-19:00 Europe/Kyiv"
  safeguards:
    max_rollout_pct: 50  # stop threshold
    auto_rollback_on:
      p95_ms: ">200"
      error_rate: ">2%"
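A model like this can be mirrored in code and checked in CI before merge. The following is a minimal TypeScript sketch, not a definitive schema: `FlagDefinition` and `validateFlag` are hypothetical names, and the specific checks (a valid `ttl`, `user_pct` within 0..100, variant weights summing to 100, payloads only for declared variants) are assumptions about what such a validator might enforce.

```typescript
type FlagType =
  | "release" | "ops" | "kill" | "experiment"
  | "permission" | "config" | "migration";

interface FlagRule {
  when: Record<string, string[] | number>; // e.g. { region: ["EE"], user_pct: 10 }
  then: "on" | "off";
}

interface FlagDefinition {
  key: string;
  type: FlagType;
  owner: string;
  ttl?: string; // ISO date: deletion deadline
  rules: FlagRule[];
  variants?: { name: string; weight: number }[];
  payload?: Record<string, Record<string, number>>;
}

// Returns human-readable problems; an empty list means the flag passed.
function validateFlag(flag: FlagDefinition): string[] {
  const problems: string[] = [];
  if (!flag.owner) problems.push("owner is required");
  if (!flag.ttl || Number.isNaN(Date.parse(flag.ttl))) {
    problems.push("ttl must be a valid date");
  }
  for (const rule of flag.rules) {
    const pct = rule.when["user_pct"];
    if (typeof pct === "number" && (pct < 0 || pct > 100)) {
      problems.push(`user_pct out of range: ${pct}`);
    }
  }
  const variants = flag.variants ?? [];
  if (variants.length > 0) {
    const total = variants.reduce((sum, v) => sum + v.weight, 0);
    if (total !== 100) {
      problems.push(`variant weights sum to ${total}, expected 100`);
    }
  }
  const declared = new Set(variants.map((v) => v.name));
  for (const variantName of Object.keys(flag.payload ?? {})) {
    if (!declared.has(variantName)) {
      problems.push(`payload for unknown variant: ${variantName}`);
    }
  }
  return problems;
}

const newRanker: FlagDefinition = {
  key: "catalog.new_ranker",
  type: "release",
  owner: "search-team@company",
  ttl: "2026-01-31",
  rules: [{ when: { region: ["EE", "BR"], user_pct: 10 }, then: "on" }],
  variants: [
    { name: "control", weight: 50 },
    { name: "v1", weight: 30 },
    { name: "v2", weight: 20 },
  ],
  payload: { v1: { boost_freshness: 0.3 }, v2: { boost_freshness: 0.2 } },
};

console.log(validateFlag(newRanker));
```

Returning a list of problems instead of throwing on the first one lets a CI job report everything wrong with a flag in a single run.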
3) Evaluation and targeting
Targeting keys: `tenant_id, region/licence, currency, channel, locale, role, plan, device, user_id, cohort, kyc_tier, experiment_bucket`.
Evaluation order: prerequisites → deny rules → allow rules → default.
Sticky bucketing: for experiments, hash a stable identifier (for example, `hash(user_id, flag_key)`) so that the user always gets the same variant.
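ts
// evaluate(flag, ctx) is a pure function: same inputs, same decision.
function evaluate(flag, ctx) {
  if (!prereqs_ok(flag, ctx)) return OFF   // 1. prerequisites
  if (deny_match(flag, ctx)) return OFF    // 2. deny rules win over allow rules
  if (allow_match(flag, ctx)) return resolve_variant_or_on(flag, ctx)  // 3. allow rules
  return flag.default                      // 4. default
}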
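Sticky bucketing and percentage rollout can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the 32-bit FNV-1a hash is just one convenient stable hash, and `bucketOf`, `inRollout`, and `pickVariant` are hypothetical helper names. The variant pick uses a separate salt so that rollout membership and variant assignment are not correlated.

```typescript
// 32-bit FNV-1a over UTF-16 code units; small and deterministic.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Maps (userId, flagKey, salt) to a stable bucket in 0..99.
function bucketOf(userId: string, flagKey: string, salt = ""): number {
  return fnv1a(`${flagKey}:${salt}:${userId}`) % 100;
}

// True when the user falls inside the first `pct` percent of buckets.
function inRollout(userId: string, flagKey: string, pct: number): boolean {
  return bucketOf(userId, flagKey) < pct;
}

interface WeightedVariant {
  name: string;
  weight: number; // percent; weights are assumed to sum to 100
}

// Walks cumulative weights until the user's bucket falls inside one.
function pickVariant(
  userId: string,
  flagKey: string,
  variants: WeightedVariant[],
): string {
  if (variants.length === 0) throw new Error("no variants defined");
  const bucket = bucketOf(userId, flagKey, "variant");
  let upper = 0;
  for (const v of variants) {
    upper += v.weight;
    if (bucket < upper) return v.name;
  }
  return variants[variants.length - 1].name; // weights summed to < 100
}

const experimentVariants: WeightedVariant[] = [
  { name: "control", weight: 50 },
  { name: "v1", weight: 30 },
  { name: "v2", weight: 20 },
];

console.log(inRollout("u42", "catalog.new_ranker", 10));
console.log(pickVariant("u42", "catalog.new_ranker", experimentVariants));
```

Because membership is `bucket < pct`, raising the percentage from 10 to 25 only adds users: everyone who was already in the rollout stays in it.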
4) FF distribution and architecture
Options:
- Server-side SDK (recommended): the source of truth and the cache live in the backend, and evaluation logic is unified.
- Edge/CDN evaluation: fast targeting at the perimeter (only where no PII or secrets are involved).
- Client-side SDK: for UI personalization, but only with minimal context and no sensitive rules.
- Config-as-Code: flags stored in the repository, validated in CI, rolled out via CD.
- Startup bootstrap + streaming updates (SSE/gRPC) + fallback to the last snapshot.
- Flag freshness SLA: p95 ≤ 5 s.
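The "bootstrap + streaming updates + fallback to the last snapshot" pattern can be sketched as an in-memory store. This is a simplified illustration: `FlagStore`, the `version` field, and the boolean-only flag map are assumptions made for brevity, and the transport (SSE/gRPC) is deliberately left out.

```typescript
interface FlagSnapshot {
  version: number;                // monotonically increasing config version
  flags: Record<string, boolean>; // simplified: flag key -> on/off
  receivedAtMs: number;           // local time when this snapshot arrived
}

class FlagStore {
  private current: FlagSnapshot;

  // The store always starts from a bootstrap snapshot, never empty.
  constructor(bootstrap: FlagSnapshot) {
    this.current = bootstrap;
  }

  // Applies a streamed update; stale or duplicate versions are ignored.
  apply(update: FlagSnapshot): boolean {
    if (update.version <= this.current.version) return false;
    this.current = update;
    return true;
  }

  // Unknown keys are fail-safe OFF. If the stream drops, the last
  // applied snapshot simply keeps serving.
  isOn(key: string): boolean {
    return this.current.flags[key] ?? false;
  }

  // Time since the last update arrived. On its own this cannot tell a
  // quiet stream from a broken one; a heartbeat would be needed for that.
  ageMs(nowMs: number): number {
    return nowMs - this.current.receivedAtMs;
  }
}

const flagStore = new FlagStore({
  version: 1,
  flags: { "catalog.new_ranker": true },
  receivedAtMs: Date.now(),
});
console.log(flagStore.isOn("catalog.new_ranker"));
```

Ignoring updates whose version is not newer keeps the store correct when stream messages arrive out of order or are replayed after a reconnect.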
5) Release strategies
5.1 Dark Launch
The feature is enabled but invisible to the user; collect metrics and errors.
5.2 Canary
Enable the feature for 1-5% of traffic in one jurisdiction or tenant; monitor p95/p99, errors, and conversion.
Stop conditions - metric thresholds that trigger auto-cutoff.
5.3 Progressive Rollout
10% → 25% → 50% → 100% on a schedule, with manual or automatic verification at each step.
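The canary stop conditions and the rollout ladder can be combined into one decision function. The sketch below is illustrative: `nextStep`, the `LADDER` values (a 5% canary followed by the steps above), and the threshold field names are assumptions, and the error rate is expressed as a fraction (0.02 = 2%).

```typescript
interface RolloutMetrics {
  p95Ms: number;
  errorRate: number; // fraction, e.g. 0.02 means 2%
}

interface RolloutSafeguards {
  maxP95Ms: number;
  maxErrorRate: number;
  maxRolloutPct: number; // automatic ramp-up stops here
}

interface RolloutDecision {
  action: "advance" | "hold" | "rollback";
  pct: number;
  reason: string;
}

const LADDER: number[] = [5, 10, 25, 50, 100];

// Decides the next rollout percentage from the current one and fresh metrics.
function nextStep(
  currentPct: number,
  m: RolloutMetrics,
  s: RolloutSafeguards,
): RolloutDecision {
  // Guardrails are checked first: any breach cuts the flag to 0%.
  if (m.p95Ms > s.maxP95Ms) {
    return { action: "rollback", pct: 0, reason: `p95 ${m.p95Ms} ms > ${s.maxP95Ms} ms` };
  }
  if (m.errorRate > s.maxErrorRate) {
    return { action: "rollback", pct: 0, reason: `error rate ${m.errorRate} > ${s.maxErrorRate}` };
  }
  const next = LADDER.find((step) => step > currentPct);
  if (next === undefined) {
    return { action: "hold", pct: currentPct, reason: "ladder complete" };
  }
  if (next > s.maxRolloutPct) {
    return { action: "hold", pct: currentPct, reason: "max rollout reached, manual approval needed" };
  }
  return { action: "advance", pct: next, reason: "metrics within thresholds" };
}

const guards: RolloutSafeguards = { maxP95Ms: 200, maxErrorRate: 0.02, maxRolloutPct: 50 };
console.log(nextStep(10, { p95Ms: 150, errorRate: 0.01 }, guards));
```

Holding at `maxRolloutPct` rather than advancing reflects the "manual/auto verification" split: automation climbs to the stop threshold, and a person approves the rest.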
5.4 Shadow / Mirroring
Duplicate requests to the new path (with no user-visible effect) and compare results and latency.
5.5 Blue/Green + FF
Deploy two versions; the flag steers traffic and switches dependencies by segment.
6) Dependencies and cross-service consistency
Use prerequisites and readiness "health flags": the index is built, the migration is complete.
Coordinate through events: `FlagChanged(flag_key, scope, new_state)`.
Enable in stages: 1) enable the read path → 2) check metrics → 3) enable writes/side effects.
Service contracts: the default must be fail-safe OFF.
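Prerequisite chains are only safe when they contain no loops, so a CI check on the dependency graph is a natural fit. The following is a minimal sketch using depth-first search; `PrereqGraph` and `findCycle` are hypothetical names, and the graph is assumed to map each flag key to the keys it requires.

```typescript
type PrereqGraph = Record<string, string[]>; // flag key -> keys it depends on

// Depth-first search with visiting/done marking.
// Returns one cycle as a path (first key repeated at the end), or null.
function findCycle(graph: PrereqGraph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const path: string[] = [];

  const visit = (key: string): string[] | null => {
    if (state.get(key) === "done") return null;
    if (state.get(key) === "visiting") {
      // We came back to a key that is still on the current path.
      return [...path.slice(path.indexOf(key)), key];
    }
    state.set(key, "visiting");
    path.push(key);
    for (const dep of graph[key] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    path.pop();
    state.set(key, "done");
    return null;
  };

  for (const key of Object.keys(graph)) {
    const cycle = visit(key);
    if (cycle) return cycle;
  }
  return null;
}

const prereqs: PrereqGraph = {
  "catalog.new_ranker": ["catalog.index_v2_ready"],
  "catalog.index_v2_ready": [],
};
console.log(findCycle(prereqs)); // no loop in this graph
```

Returning the offending path, rather than a boolean, tells the flag owner exactly which prerequisite to remove.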
7) Observability and SLO
Metrics per flag/variant/segment:
- `flag_eval_p95_ms`, `errors_rate`, `config_freshness_ms`.
- Business metrics: `ctr`, `conversion`, `ARPU`, `retention`, plus guardrails (e.g. RG incidents).
- Automatic SLO thresholds for auto-cutoff.
Logs/tracing: add `flag_key`, `variant`, `decision_source` (server/edge/client), and `context_hash`.
Dashboards: a rollout "ladder" with thresholds, an error heatmap by segment.
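A decision log record with the fields listed above might look like this. It is a sketch under assumptions: `decisionLog` and `contextHash` are hypothetical helpers, and the FNV-1a hash is only an illustrative way to correlate equal contexts without logging them. It is not a cryptographic hash and is not a privacy guarantee by itself.

```typescript
type DecisionSource = "server" | "edge" | "client";

interface DecisionLog {
  flag_key: string;
  variant: string | null;
  decision_source: DecisionSource;
  context_hash: string;
}

// Hashes the context with sorted keys, so equal contexts produce the same
// hash regardless of property order. Output is 8 hex characters.
function contextHash(ctx: Record<string, string>): string {
  const canonical = Object.keys(ctx)
    .sort()
    .map((k) => `${k}=${ctx[k]}`)
    .join("&");
  let hash = 0x811c9dc5; // 32-bit FNV-1a offset basis
  for (let i = 0; i < canonical.length; i++) {
    hash ^= canonical.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Builds the structured record; the raw context never enters the log.
function decisionLog(
  flagKey: string,
  variant: string | null,
  source: DecisionSource,
  ctx: Record<string, string>,
): DecisionLog {
  return {
    flag_key: flagKey,
    variant,
    decision_source: source,
    context_hash: contextHash(ctx),
  };
}

console.log(
  JSON.stringify(decisionLog("catalog.new_ranker", "v1", "server", { user_id: "u42", region: "EE" })),
);
```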
8) Safety and compliance
PII minimization in the evaluation context.
RLS/ACL: who can change which flags (by domain/market).
Change windows and "double confirmation" for sensitive flags.
Immutable audit: who/when/what/why (ticket/incident link).
Jurisdictions: flags must not circumvent regulatory bans (for example, enabling a game in a country where it is banned).
9) Managing "long-lived" flags
Each flag has a TTL/deletion date.
After 100% rollout, create a task to delete the dead code branches; otherwise flag debt will grow.
Mark flags as `migration`/`one-time` and keep them separate from permanent `permission`/`config` flags.
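A periodic job can turn these rules into a cleanup report. The sketch below is an assumption-laden illustration: `findFlagDebt` and the `FlagMeta` shape are hypothetical, and treating only `permission` and `config` flags as long-lived follows the distinction drawn above.

```typescript
interface FlagMeta {
  key: string;
  type: string;
  ttl?: string;       // ISO date, e.g. "2026-01-31"
  rolloutPct: number; // current rollout percentage
}

interface FlagDebt {
  key: string;
  reason: string;
}

// Flag types that are expected to live permanently and are skipped.
const LONG_LIVED = new Set(["permission", "config"]);

// Lists temporary flags that need attention as of `today`.
function findFlagDebt(flags: FlagMeta[], today: Date): FlagDebt[] {
  const debt: FlagDebt[] = [];
  for (const f of flags) {
    if (LONG_LIVED.has(f.type)) continue;
    if (!f.ttl) {
      debt.push({ key: f.key, reason: "no TTL" });
    } else if (new Date(f.ttl).getTime() < today.getTime()) {
      debt.push({ key: f.key, reason: "TTL expired" });
    } else if (f.rolloutPct === 100) {
      debt.push({ key: f.key, reason: "fully rolled out, schedule cleanup" });
    }
  }
  return debt;
}

const inventory: FlagMeta[] = [
  { key: "catalog.new_ranker", type: "release", ttl: "2026-01-31", rolloutPct: 100 },
  { key: "plan.premium_access", type: "permission", rolloutPct: 100 },
  { key: "search.use_index_v2", type: "migration", rolloutPct: 25 },
];

console.log(findFlagDebt(inventory, new Date("2026-02-15T00:00:00Z")));
```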
10) Sample API/SDK contract
Evaluation API (server-side)
http
POST /v1/flags/evaluate
Headers: X-Tenant: brand_eu
Body: { "keys": ["catalog.new_ranker", "rgs.killswitch"], "context": { "user_id": "u42", "region": "EE" } }
→ 200
{
  "catalog.new_ranker": { "on": true, "variant": "v1", "as_of": "2025-10-31T12:10:02Z" },
  "rgs.killswitch": { "on": false, "variant": null, "as_of": "2025-10-31T12:10:02Z" }
}
Client SDK (cache, fallback)
ts
const ff = await sdk.getSnapshot() // bootstrap from the cached snapshot
const on = ff.isOn("catalog.new_ranker", ctx)
const payload = ff.payload("catalog.new_ranker", "v1")
11) Interaction with other mechanisms
Rate limits/quotas: flags can lower RPS limits or enable throttling for the duration of an incident.
Circuit breaker/degradation: kill switches disable heavy paths and turn on degraded mode.
Catalog/Personalization: flags change weights and ranking rules (via Remote Config).
Database migrations: flags gradually move reads/writes to a new schema (read-replica → dual-write → write-primary).
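The migration sequence can be modeled as a small phase table that a migration flag selects from. This sketch rests on assumed semantics that the text does not spell out: in `read-replica` the new store is fed by replication and serves reads, in `dual-write` the application writes to both stores, and in `write-primary` only the new store is written. `routeFor` and `previousPhase` are hypothetical names.

```typescript
type Store = "old" | "new";
type Phase = "old-only" | "read-replica" | "dual-write" | "write-primary";

interface Route {
  readFrom: Store;
  writeTo: Store[];
}

const PHASES: Phase[] = ["old-only", "read-replica", "dual-write", "write-primary"];

// Maps the migration phase (the flag's value) to read/write routing.
function routeFor(phase: Phase): Route {
  switch (phase) {
    case "old-only":
      return { readFrom: "old", writeTo: ["old"] };
    case "read-replica":
      return { readFrom: "new", writeTo: ["old"] }; // new store fed by replication
    case "dual-write":
      return { readFrom: "new", writeTo: ["old", "new"] };
    case "write-primary":
      return { readFrom: "new", writeTo: ["new"] };
  }
}

// One step back along the sequence; "old-only" is the floor.
function previousPhase(phase: Phase): Phase {
  const i = PHASES.indexOf(phase);
  return PHASES[Math.max(0, i - 1)];
}

console.log(routeFor("dual-write"));
console.log(previousPhase("dual-write"));
```

Stepping back from `write-primary` is only safe if the old store was kept in sync in the meantime; otherwise the rollback needs a backfill first.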
12) Playbooks (runbooks)
1. Incident after 25% rollout
Auto-cutoff triggered → flag OFF for all users or the affected segment, ticket to on-call, stats collection, RCA.
Temporarily enable degraded mode or the old branch through the migration flag.
2. Catalog p95 growth
Threshold `p95_ms > 200` triggers auto-cutoff; capture a snapshot of logs with `flag_key = catalog.new_ranker`.
Enable the payload config.
3. Jurisdiction mismatch
The permission flag mistakenly opened the game in `NL`: turn it OFF, run a post-incident audit, and add a "region deny" guard rule.
4. Variance in A/B
Stop the experiment, perform CUPED/stratified analysis, and re-run with updated weights.
13) Testing
Unit: deterministic evaluation of rules, priorities, and prerequisites.
Contract: flag schema (JSON/YAML), validators, CI check before merge.
Property-based: "deny > allow", "most specific wins", stable bucketing.
Replay: replays real contexts against the new configuration.
E2E: canary scenarios (step-up/step-down), auto-cutoff check, and audit events.
Chaos: stream disconnect, stale snapshot, mass flag update.
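The "deny > allow" property can be checked exhaustively over a small context space instead of with hand-picked cases. The sketch below is self-contained and illustrative: `evaluateRules` is a toy evaluator written for this example, not the evaluation function of any particular SDK, and the regions and tiers are sample values.

```typescript
type Ctx = Record<string, string>;

interface TargetRule {
  when: Record<string, string[]>;
  then: "on" | "off";
}

// A rule matches when every key in `when` is present with an allowed value.
function ruleMatches(rule: TargetRule, ctx: Ctx): boolean {
  return Object.entries(rule.when).every(([k, allowed]) => {
    const value = ctx[k];
    return value !== undefined && allowed.includes(value);
  });
}

// Deny rules are checked before allow rules; the default is fail-safe OFF.
function evaluateRules(rules: TargetRule[], ctx: Ctx): "on" | "off" {
  if (rules.some((r) => r.then === "off" && ruleMatches(r, ctx))) return "off";
  if (rules.some((r) => r.then === "on" && ruleMatches(r, ctx))) return "on";
  return "off";
}

const sampleRules: TargetRule[] = [
  { when: { region: ["EE", "BR"] }, then: "on" },
  { when: { kyc_tier: ["unverified"] }, then: "off" },
];

// Enumerates every context and returns a description of each violation.
function checkProperties(rules: TargetRule[]): string[] {
  const failures: string[] = [];
  const reversed = [...rules].reverse();
  for (const region of ["EE", "BR", "NL"]) {
    for (const kyc_tier of ["verified", "unverified"]) {
      const ctx: Ctx = { region, kyc_tier };
      const result = evaluateRules(rules, ctx);
      // Property 1: a matching deny rule always wins.
      if (kyc_tier === "unverified" && result !== "off") {
        failures.push(`deny did not win for ${region}/${kyc_tier}`);
      }
      // Property 2: the decision does not depend on rule order.
      if (result !== evaluateRules(reversed, ctx)) {
        failures.push(`order-dependent result for ${region}/${kyc_tier}`);
      }
    }
  }
  return failures;
}

console.log(checkProperties(sampleRules));
```

For larger context spaces, the same properties could be driven by a property-based testing library with random generation instead of full enumeration.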
14) Typical errors
Secret logic in client flags (leaks/spoofing).
No TTL → a "graveyard" of flags in the code.
"Universal" flags without segmentation → the problem cannot be localized.
No guardrails/auto-cutoff → incidents handled manually.
Incompatible dependencies between flags → loops/desynchronization.
Evaluating flags on every request without a cache → latency spikes.
No audit/change window → compliance risks.
15) Pre-production checklist
- Flag created with type, owner, description, TTL, and ticket requirement.
- Targeting rules defined; `deny` on unwanted regions/roles.
- Sticky bucketing is deterministic; the ID is stable.
- Prerequisites and health flags are ready; the default is safe.
- Dashboards and alerts on p95/p99, error_rate, and business guardrails.
- Auto-cutoff configured: rollout stop threshold and rollback conditions.
- Canary plan: percentages, milestones, change window, owners.
- Configs are validated in CI; the snapshot is distributed across clusters/regions.
- Support/product documentation; incident playbooks.
- Plan to remove the code branches and the flag itself after 100%.
16) Example of "migration" flag (DB/index)
yaml
flag:
  key: "search.use_index_v2"
  type: "migration"
  description: "Switching reads to index v2"
  prerequisites:
    - key: "search.index_v2_built"
      must_be: "on"
  rules:
    - when: { tenant_id: ["brand_eu"], user_pct: 5 }
      then: "on"
    - when: { tenant_id: ["brand_eu"], user_pct: 25 }
      then: "on"
  safeguards:
    auto_rollback_on:
      search_p95_ms: ">180"
      error_rate: ">1%"
  ttl: "2026-02-01"
Conclusion
Feature flags are not just "on/off" switches but a discipline of change-risk management. Clear flag types, deterministic targeting, progressive rollouts with guardrails, auto-cutoff, auditing, and a deletion plan make releases predictable and incidents short and controlled. Build flags into the architecture as first-class citizens, and you can deliver value more often, more safely, and more meaningfully.