Configuration Version Control
1) Why version configurations
Configuration is executable policy: it defines routing, limits, feature flags, access rights and data schemas. Version control makes changes repeatable, observable and reversible: it reduces MTTR and the change failure rate, eliminates "magic in prod," and provides an audit trail for security and compliance.
2) Configuration taxonomy
Infrastructure (IaC): clusters, networks, LB, DB, queues.
Service: application parameters, resources, limits, timeouts, retries.
Product/business logic: tariffs, AB experiments, content rules.
Data/DataOps: contract schemas, freshness SLAs, transformations.
Security: access policies, roles, keys/certificates (the secrets themselves are outside the repo).
Observability: SLI/SLO, alerts, dashboards.
Rule: everything that affects the behavior of the system is configuration and must live under versioning.
3) Versioning principles
1. GitOps: the only source of truth is the repository; changes via PR and automatic pipelines.
2. Declarative: description of the target state, not step scripts.
3. Immutable artifacts: each config version materializes into exactly one unambiguous snapshot.
4. Schemas and validation: JSON/YAML schemas, strict typing, required fields.
5. Environments as code: 'env' means folders/overlays (dev/stage/prod); the differences are minimal and obvious.
6. Idempotency and rollbacks: any configuration release can be re-applied safely and reverted.
7. Audit and traceability: author, reason, ticket/RFC, change signatures.
4) Versioning strategies
SemVer for config packages ('MAJOR.MINOR.PATCH'):
- MAJOR - incompatible schema/policy changes.
- MINOR - new fields/rules, backward compatible.
- PATCH - value fixes without schema changes.
Tagged releases and release notes: what has changed, how to roll back, checkpoints.
Pinning/lock files: pin dependency versions (modules, charts).
Compatibility matrix: application artifact X is compatible with config Y (the matrix lives in the service catalog).
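As a rough illustration of the SemVer rules above, the sketch below compares two config snapshots and suggests a bump level. The heuristics are assumptions made for the example, not a standard: a removed key or a changed value type is treated as incompatible (MAJOR), a new key as MINOR, and a changed value as PATCH. The names 'flatten', 'classify_bump' and 'bump' are hypothetical.

```python
from typing import Any


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys: {'a': {'b': 1}} -> {'a.b': 1}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def classify_bump(old: dict[str, Any], new: dict[str, Any]) -> str:
    """Suggest MAJOR / MINOR / PATCH / NONE for the change from old to new."""
    old_flat, new_flat = flatten(old), flatten(new)
    removed = old_flat.keys() - new_flat.keys()
    added = new_flat.keys() - old_flat.keys()
    retyped = {
        key
        for key in old_flat.keys() & new_flat.keys()
        if type(old_flat[key]) is not type(new_flat[key])
    }
    if removed or retyped:
        return "MAJOR"  # consumers of the old shape may break
    if added:
        return "MINOR"  # new fields only, old ones untouched
    if old_flat != new_flat:
        return "PATCH"  # same shape, different values
    return "NONE"


def bump(version: str, kind: str) -> str:
    """Apply a bump level to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "MAJOR":
        return f"{major + 1}.0.0"
    if kind == "MINOR":
        return f"{major}.{minor + 1}.0"
    if kind == "PATCH":
        return f"{major}.{minor}.{patch + 1}"
    return version
```

A CI job could run such a check on every PR and fail when the proposed tag is lower than the suggested bump; a real implementation would more likely diff the schemas themselves rather than config values.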
5) Repository organization
config-repo/
  policies/          # shared policies (RBAC, SLO, alerts)
  services/
    checkout/
      schema/        # JSON/YAML config schemas
      base/          # default values
      overlays/
        dev/
        stage/
        prod/
  data-contracts/    # data schemas, freshness SLAs
  releases/          # tags, changelog, validation artifacts
  tools/             # linters, generators, tests
Branching: trunk-based (main) plus short-lived feature branches. Merges go through PRs only, with mandatory CI.
6) Validation and testing
Schema: Each change passes schema validation (required, enum, ranges).
Static linters: format, keys, duplicates, prohibited fields.
Compatibility tests: the config and the matching service/chart version are brought up together in a sandbox.
Dry runs: dry-run apply and a "what-if" diff against the target state.
Policy as code: admission rules (Rego/CEL) that define who can change what.
7) Rolling out and rolling back configurations
Progressive delivery: canary 1%→5%→25% with SLO guardrails.
Deploy gate: no active SEV-1, alerts are green, signatures are valid, rollback is ready.
Rollback: 'revert tag vX.Y.Z' or a switch to the previous snapshot; rollback commands are documented in the runbook.
Release annotations: The config version is published in metrics/logs to quickly correlate with incidents.
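The canary progression with guardrails can be sketched as a small loop. This is a simplified illustration under stated assumptions: 'set_traffic' and 'guardrails_ok' are hypothetical callbacks standing in for the real traffic controller and SLO checks, and the observation window between steps is left out.

```python
from typing import Callable, Sequence


def progressive_rollout(
    steps: Sequence[int],
    set_traffic: Callable[[int], None],
    guardrails_ok: Callable[[int], bool],
) -> str:
    """Walk the canary steps; roll back as soon as a guardrail fails."""
    for percent in steps:
        set_traffic(percent)
        # A real rollout would wait for the observation window here
        # before evaluating the guardrails.
        if not guardrails_ok(percent):
            set_traffic(0)  # send all traffic back to the previous snapshot
            return f"rolled back at {percent}%"
    return "canary passed"


if __name__ == "__main__":
    history: list[int] = []
    # Pretend the guardrails start failing once 5% of traffic is shifted.
    result = progressive_rollout((1, 5, 25), history.append, lambda p: p < 5)
    print(result, history)
```

Keeping the guardrail check as an injected function makes the same loop reusable for different SLO signals (for example success ratio or p95 latency).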
8) Dynamic and remote configuration
Remote config/feature flags: change parameters without restarts; all flags are also under GitOps.
Boundaries: which parameters may be changed dynamically (an explicit allowlist).
Cache and consistency: TTL, versions, atomic set replacement (two-phase publishing).
Guardrails: limits and ranges for runtime changes, auto-rollback on SLO violation.
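One possible in-process reading of the allowlist, range limits and atomic set replacement is sketched below. The class 'RuntimeConfig', the keys in 'ALLOWED' and their ranges are hypothetical; "two-phase" here simply means that the whole change set is validated first and only then swapped in as a new versioned snapshot.

```python
import threading
from typing import Any

# Hypothetical allowlist: dotted key -> (min, max) permitted at runtime.
ALLOWED: dict[str, tuple[int, int]] = {
    "limits.rps": (100, 5000),
    "features.psp_b_weight": (0, 100),
}


class RuntimeConfig:
    """Holds a versioned snapshot that is replaced as a whole."""

    def __init__(self, initial: dict[str, Any]) -> None:
        self._lock = threading.Lock()
        self._version = 1
        self._values = dict(initial)

    def snapshot(self) -> tuple[int, dict[str, Any]]:
        """Return the current version together with a copy of its values."""
        with self._lock:
            return self._version, dict(self._values)

    def publish(self, changes: dict[str, int]) -> int:
        """Validate the full change set, then swap it in atomically."""
        # Phase 1: reject the whole set if any key or value is out of bounds.
        for key, value in changes.items():
            if key not in ALLOWED:
                raise ValueError(f"{key} may not be changed at runtime")
            low, high = ALLOWED[key]
            if not low <= value <= high:
                raise ValueError(f"{key}={value} is outside [{low}, {high}]")
        # Phase 2: replace the snapshot in one step and bump the version.
        with self._lock:
            self._values = {**self._values, **changes}
            self._version += 1
            return self._version
```

Because validation happens before anything is written, a partially invalid change set leaves the previous snapshot untouched.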
9) Secrets and sensitive data
Never keep secrets in a repo. Configurations contain only references/placeholders.
Encrypt configuration files where necessary, integrating with a secret/key manager.
Rotation and JIT: access is granted only for the duration of an operation; the action trail is immutable.
Field masking: Validation prohibits PII/secrets from entering the config.
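A validation step of this kind might look like the sketch below. It is only an illustration: the list of sensitive key names and the '${secret:...}' placeholder syntax are assumptions made for the example, not a convention defined in this document, and real scanners also inspect values (entropy, known token formats).

```python
import re
from typing import Any

# Assumed naming heuristics for keys that must never hold a literal value.
SENSITIVE_KEY = re.compile(
    r"(password|secret|token|api[_-]?key|private[_-]?key)", re.IGNORECASE
)
# Assumed placeholder syntax pointing into a secret manager.
PLACEHOLDER = re.compile(r"^\$\{secret:[A-Za-z0-9_/.-]+\}$")


def find_inline_secrets(config: dict[str, Any], path: str = "") -> list[str]:
    """Return dotted paths of sensitive keys whose value is not a placeholder."""
    findings: list[str] = []
    for key, value in config.items():
        full_path = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            findings.extend(find_inline_secrets(value, full_path))
        elif SENSITIVE_KEY.search(key):
            is_placeholder = isinstance(value, str) and PLACEHOLDER.match(value)
            if not is_placeholder:
                findings.append(full_path)
    return findings


if __name__ == "__main__":
    sample = {
        "db": {"host": "db.internal", "password": "hunter2"},
        "psp": {"api_key": "${secret:psp/checkout/api-key}"},
    }
    print(find_inline_secrets(sample))
```

Run as a CI check, a non-empty result would block the merge.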
10) Environment management
Base + overlays: the differences between dev/stage/prod are minimal and transparent.
Artifact promotion: the same snapshot that passed stage is promoted to prod.
Time windows: config changes are not rolled out during on-call shift handover; high-risk changes require an RFC and a maintenance window.
11) Drift detection and elimination
The controller compares the target state to the actual state and reports the diff.
Drift alerts: page only for critical discrepancies; everything else becomes a ticket.
Auto-remediation: where permitted, automatically return to the target state.
Audit of manual edits: any "kubectl edit"/ssh change is handled as a process incident with CAPA.
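The comparison and routing described in this section can be sketched as follows. This is a minimal illustration: the 'CRITICAL_PREFIXES' list is a hypothetical way to decide which discrepancies page someone, and the function names are made up for the example.

```python
from typing import Any

MISSING = "<missing>"
# Hypothetical rule: drift under these prefixes pages, the rest is ticketed.
CRITICAL_PREFIXES = ("security.", "limits.")


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys: {'a': {'b': 1}} -> {'a.b': 1}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def detect_drift(desired: dict, actual: dict) -> dict[str, tuple[Any, Any]]:
    """Map each drifted path to a (desired, actual) pair."""
    want, have = flatten(desired), flatten(actual)
    return {
        path: (want.get(path, MISSING), have.get(path, MISSING))
        for path in sorted(want.keys() | have.keys())
        if want.get(path, MISSING) != have.get(path, MISSING)
    }


def route_drift(drift: dict[str, tuple[Any, Any]]) -> dict[str, list[str]]:
    """Split drifted paths into those that page and those that get a ticket."""
    routed: dict[str, list[str]] = {"page": [], "ticket": []}
    for path in drift:
        target = "page" if path.startswith(CRITICAL_PREFIXES) else "ticket"
        routed[target].append(path)
    return routed
```

Auto-remediation would then re-apply the desired value for each drifted path where policy allows it.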
12) Configuration catalog and ownership
Service catalog: owner, SLO, related policies, schemas, versions, compatibility.
RACI: who proposes, who reviews, who approves; CAB for high-risk changes.
Transparency: Each entry has a version history and links to PR/tickets/AAR.
13) Maturity metrics
Coverage: % of services/policies under GitOps (target ≥ 95%).
Lead time for config changes: median from PR to prod.
Change failure rate: the proportion of config releases that end in a rollback or incident.
Drift rate: number of discrepancies per week and time to remediate.
Rollback time: median recovery to the previous version.
Audit completeness: proportion of changes with full evidence (validators, dry-run, reviews).
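Several of these metrics reduce to simple arithmetic over release records. The sketch below assumes a hypothetical 'ConfigRelease' record with lead time and rollback duration in minutes; the field names are illustrative.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class ConfigRelease:
    lead_time_min: float                  # PR opened -> applied in prod
    failed: bool                          # ended in rollback or incident
    rollback_min: Optional[float] = None  # time to restore previous version


def maturity_metrics(releases: list[ConfigRelease]) -> dict[str, Optional[float]]:
    """Compute median lead time, change failure rate and median rollback time."""
    if not releases:
        raise ValueError("at least one release is required")
    rollbacks = [r.rollback_min for r in releases if r.rollback_min is not None]
    return {
        "lead_time_median_min": median(r.lead_time_min for r in releases),
        "change_failure_rate": sum(r.failed for r in releases) / len(releases),
        "rollback_median_min": median(rollbacks) if rollbacks else None,
    }
```

For four releases with lead times of 30, 45, 120 and 60 minutes, two of which failed and were rolled back in 8 and 12 minutes, this gives a median lead time of 52.5 minutes, a change failure rate of 0.5 and a median rollback time of 10 minutes.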
14) Checklists
Before changing the configuration
- There is a ticket/RFC and a change owner.
- Schema validation and linters have passed.
- There is a rollback plan and commands in the runbook.
- Gate: tests green, signatures valid, no active SEV-1.
- For high-risk, a maintenance window is assigned.
During rollout
- Canary and SLO guardrails are active.
- Release annotations are published.
- Status updates are posted to the channel; alert noise is suppressed by maintenance-window rules.
After
- Observation window passed, SLO green.
- Results and evidence (before/after charts, dry-run reports) are attached to the ticket.
- Schemas/documentation are updated as needed.
15) Mini templates
15.1 Configuration schema (YAML schema, fragment)
yaml
type: object
required: [service, timeouts, retries]
properties:
  service: { type: string, pattern: "^[a-z0-9-]+$" }
  timeouts:
    type: object
    properties:
      connect_ms: { type: integer, minimum: 50, maximum: 5000 }
      request_ms: { type: integer, minimum: 100, maximum: 20000 }
  retries:
    type: object
    properties:
      attempts: { type: integer, minimum: 0, maximum: 10 }
      backoff_ms: { type: integer, minimum: 0, maximum: 5000 }
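To show how a schema fragment like this one could be enforced in CI, here is a hand-rolled sketch that supports only the keywords used above (type, required, properties, minimum, maximum, pattern). It is not a JSON Schema implementation; in practice a dedicated validation library would be the more likely choice. 'SCHEMA' mirrors the fragment as a Python dict, and 'validate' is a hypothetical helper.

```python
import re
from typing import Any

SCHEMA = {
    "type": "object",
    "required": ["service", "timeouts", "retries"],
    "properties": {
        "service": {"type": "string", "pattern": "^[a-z0-9-]+$"},
        "timeouts": {"type": "object", "properties": {
            "connect_ms": {"type": "integer", "minimum": 50, "maximum": 5000},
            "request_ms": {"type": "integer", "minimum": 100, "maximum": 20000},
        }},
        "retries": {"type": "object", "properties": {
            "attempts": {"type": "integer", "minimum": 0, "maximum": 10},
            "backoff_ms": {"type": "integer", "minimum": 0, "maximum": 5000},
        }},
    },
}


def validate(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of human-readable violations (empty means valid)."""
    errors: list[str] = []
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: required field is missing")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], sub_schema, f"{path}.{name}"))
    elif kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return [f"{path}: expected integer"]
        if "minimum" in schema and value < schema["minimum"]:
            errors.append(f"{path}: {value} is below minimum {schema['minimum']}")
        if "maximum" in schema and value > schema["maximum"]:
            errors.append(f"{path}: {value} is above maximum {schema['maximum']}")
    elif kind == "string":
        if not isinstance(value, str):
            return [f"{path}: expected string"]
        if "pattern" in schema and not re.search(schema["pattern"], value):
            errors.append(f"{path}: does not match {schema['pattern']}")
    return errors
```

Returning every violation at once, rather than failing on the first, tends to give PR authors a more useful CI report.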
15.2 Base config + prod overlay
yaml
# services/checkout/base/config.yaml
service: checkout
timeouts: { connect_ms: 200, request_ms: 1500 }
retries: { attempts: 2, backoff_ms: 200 }
limits: { rps: 500 }
features:
  degrade_search: false
  psp_a_weight: 80
  psp_b_weight: 20
yaml
# services/checkout/overlays/prod/config.yaml
limits: { rps: 1200 }
features:
  psp_a_weight: 70
  psp_b_weight: 30
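How the prod overlay combines with the base can be illustrated with a recursive merge. This is a sketch of one common semantics (nested mappings merge, scalars from the overlay win); actual overlay tools define their own merge rules, so 'deep_merge' here is a hypothetical stand-in. The two dicts repeat the values from the files above.

```python
from typing import Any


def deep_merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict: nested dicts are merged, other overlay values win."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


BASE = {
    "service": "checkout",
    "timeouts": {"connect_ms": 200, "request_ms": 1500},
    "retries": {"attempts": 2, "backoff_ms": 200},
    "limits": {"rps": 500},
    "features": {"degrade_search": False, "psp_a_weight": 80, "psp_b_weight": 20},
}
PROD_OVERLAY = {
    "limits": {"rps": 1200},
    "features": {"psp_a_weight": 70, "psp_b_weight": 30},
}

if __name__ == "__main__":
    effective = deep_merge(BASE, PROD_OVERLAY)
    print(effective["limits"], effective["features"])
```

The effective prod config keeps 'degrade_search: false' and the base timeouts, while 'rps' and the PSP weights come from the overlay; the base itself is left unmodified.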
15.3 Admission policy (idea)
yaml
allow_change_when:
  tests: passed
  schema_validation: passed
  active_incidents: { none_of: [SEV-0, SEV-1] }
  rollback_plan: present
  signed_by: ["owner:team-checkout", "platform-sre"]
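The same idea can be evaluated in a few lines of code. The sketch below is only one interpretation of the policy: it assumes that every listed signer must sign, that 'rollback_plan' is a truthy flag on the change, and that a change is described by a plain dict. A production setup would more likely express this in a policy engine such as the Rego/CEL rules mentioned earlier.

```python
from typing import Any

# Python mirror of the admission idea above; the structure is an assumption.
POLICY: dict[str, Any] = {
    "tests": "passed",
    "schema_validation": "passed",
    "blocking_incidents": {"SEV-0", "SEV-1"},
    "required_signers": {"owner:team-checkout", "platform-sre"},
}


def admission_denials(change: dict[str, Any], policy: dict[str, Any] = POLICY) -> list[str]:
    """Return the reasons a change is rejected (empty list means admitted)."""
    denials: list[str] = []
    for check in ("tests", "schema_validation"):
        if change.get(check) != policy[check]:
            denials.append(f"{check} is {change.get(check)!r}, expected {policy[check]!r}")
    blocking = policy["blocking_incidents"] & set(change.get("active_incidents", []))
    if blocking:
        denials.append(f"blocking incidents active: {sorted(blocking)}")
    if not change.get("rollback_plan"):
        denials.append("rollback plan is missing")
    missing = policy["required_signers"] - set(change.get("signed_by", []))
    if missing:
        denials.append(f"missing signatures: {sorted(missing)}")
    return denials
```

Returning reasons rather than a bare boolean gives the change author something actionable in the PR check output.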
15.4 Config release card
Release: checkout-config v2.3.1
Scope: prod EU
Changes: psp_b_weight 20→30, request_ms 1500→1300
Risk: Medium (payment routing)
Canary: 1%→5%→25% (30/30/30 min), guardrails: success_ratio, p95
Rollback: tag v2.3.0
16) Anti-patterns
Edits in prod that bypass GitOps ("just a quick tweak").
Secrets/PII in the config repository.
Lack of schemas and static checks.
Strong divergence of environments (base≠prod).
"Live" feature flags without versions and history.
Ignoring drift and manual edits on servers.
Tags without release notes and rollback plan.
17) Implementation Roadmap (4-6 weeks)
1. Week 1: inventory configs; set up separate directories and schemas for the top 10 services.
2. Week 2: add linters/validation and dry-run to CI; block merges without green checks.
3. Week 3: GitOps rollout plus canaries; version annotations in telemetry.
4. Week 4: introduce policy-as-code and rollback templates; add drift alerts.
5. Weeks 5-6: cover 90% of services; reduce env differences to overlays; add maturity metrics and a weekly review of config changes.
18) The bottom line
Configuration version control is a system, not just Git. Schemas and validation, GitOps and access policies, canaries and rollbacks, drift detection and full auditing turn config into a managed artifact. The result is fast and safe changes, predictable SLOs and team confidence in each release.