Audit configurations
1) Purpose and value
Configuration auditing provides provable accountability and repeatability for every change: who changed what and when, why the change was justified, how it was tested, and how to roll it back. This reduces the risk of incidents, secret leaks, compliance gaps, and "hidden" edits in prod.
Key results:
- A single source of truth (SoT) for configs.
- Full change tracing (end-to-end).
- Predictable releases and quick rollback.
- Compliance and security policies.
2) Scope
Infrastructure: Terraform/Helm/Ansible/K8s manifests, network ACL/WAF/CDN.
Application configs: 'yaml/json/properties' files, feature flags, limits/quotas.
Secrets and keys: vault/kms, certificates, tokens, passwords.
Data pipelines: schemas, transformations, ETL/stream schedules.
Integrations: PSP/KYC/providers, webhooks, retry/timeout policies.
Observability: Alert rules, dashboards, SLO/SLA.
3) Principles
Config as Data: declarative, versioned, testable artifacts.
Immutability and idempotency: environments are reproducible from code.
Schemas and contracts: strict validation (JSON-Schema/Protobuf), backward/forward compatibility.
Minimizing manual edits: changes only through MR/PR.
Separation of duties (SoD) and 4-eyes: author ≠ deployer; mandatory review.
Attribution and signatures: signed commits/releases, artifact attestations.
4) Audit architecture
1. SCM (Git) as SoT: all configs live in the repository; the `main` branch is protected.
2. Registries:
   - Config Registry (catalog of configs, owners, SLAs, environments),
   - Schema Registry (config/event schema versions),
   - Policy Engine (OPA/Conftest): the set of checks.
3. CI/CD gates: format/schema → static checks → policy checks → secret scan → dry-run → change plan.
4. Delivery: GitOps (e.g. ArgoCD/Flux) with drift detection and audit logs of applied changes.
5. Evidence Store: a repository of audit artifacts (plans, logs, signatures, builds, SBOM).
6. Action log: an immutable (append-only) log of `CREATE/APPROVE/APPLY/ROLLBACK/ACCESS` events.
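One common way to make an append-only action log tamper-evident is to chain entries by hash. The sketch below is illustrative only: the `ActionLog` class, its field names, and the in-memory list are assumptions, not part of any specific product. A real deployment would persist entries to write-once storage.

```python
import hashlib
import json
from dataclasses import dataclass, field

ALLOWED_ACTIONS = {"CREATE", "APPROVE", "APPLY", "ROLLBACK", "ACCESS"}
GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with the canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


@dataclass
class ActionLog:
    entries: list = field(default_factory=list)

    def append(self, actor: str, action: str, target: str, ts: str) -> dict:
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {action}")
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        payload = {"actor": actor, "action": action, "target": target, "ts": ts}
        entry = {**payload, "prev_hash": prev_hash,
                 "hash": _entry_hash(prev_hash, payload)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain. Editing any entry, or removing any entry
        except the last one, is detected."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            payload = {k: entry[k] for k in ("actor", "action", "target", "ts")}
            if (entry["prev_hash"] != prev_hash
                    or entry["hash"] != _entry_hash(prev_hash, payload)):
                return False
            prev_hash = entry["hash"]
        return True
```

Truncating the tail of the log is not caught by the chain alone; publishing the latest hash to a separate system is one way to cover that case.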
5) Audit data model (minimum)
Entities: `ConfigItem(id, env, service, owner, schema_version, sensitivity)`
Events: `change_id, actor, action, ts, diff_hash, reason, approvals[]`
Artifacts: `plan_url, test_report_url, policy_report, signature, release_tag`
Links: RFC/ticket ↔ PR ↔ deploy (sha) ↔ release record ↔ SLO monitoring.
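The minimum model above could be expressed as plain data classes. This is a sketch: the field names come from the list above, while the class names `Artifacts` and `ChangeEvent`, the field types, and the `missing_evidence` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ConfigItem:
    id: str
    env: str
    service: str
    owner: str
    schema_version: str
    sensitivity: str  # the set of allowed values is not specified in the text


@dataclass(frozen=True)
class ChangeEvent:
    change_id: str
    actor: str
    action: str
    ts: str  # timestamp kept as a string here; the format is an assumption
    diff_hash: str
    reason: str
    approvals: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Artifacts:
    plan_url: str
    test_report_url: str
    policy_report: str
    signature: str
    release_tag: str


def missing_evidence(artifacts: Artifacts) -> List[str]:
    """Return the names of artifact fields that are empty."""
    return [name for name, value in vars(artifacts).items() if not value]
```

A deploy gate could call `missing_evidence` and refuse to proceed while the returned list is non-empty.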
6) Change process (end-to-end)
1. RFC/ticket → goal, risk, backout plan.
2. PR/MR → linting, schema validation, policy checks, secret scan.
3. Plan/preview → dry-run/plan, resource diff, cost/impact estimate.
4. Approval (4-eyes/SoD; CAB label for high-risk changes).
5. Deploy (within the release window/calendar) → GitOps applies the change; drift alerts are enabled.
6. Verification → smoke tests/SLO guardrails, confirmation of the result.
7. Archiving evidence → evidence store; updating the config catalog.
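The process above is essentially an ordered series of gates where a failure stops the change. A minimal sketch of that control flow follows; the gate functions and the keys of the `change` dictionary (`rfc`, `plan_url`, `backout`) are hypothetical stand-ins for real CI checks.

```python
from typing import Callable, Dict, List, Tuple

# A gate takes the change under review and returns (passed, detail).
Gate = Callable[[Dict], Tuple[bool, str]]


def has_ticket(change: Dict) -> Tuple[bool, str]:
    return (bool(change.get("rfc")), "RFC/ticket linked")


def has_plan(change: Dict) -> Tuple[bool, str]:
    return (bool(change.get("plan_url")), "plan/diff attached")


def has_backout(change: Dict) -> Tuple[bool, str]:
    return (bool(change.get("backout")), "backout described")


def run_gates(change: Dict,
              gates: List[Tuple[str, Gate]]) -> List[Tuple[str, bool, str]]:
    """Run gates in order and stop at the first failure."""
    results = []
    for name, gate in gates:
        passed, detail = gate(change)
        results.append((name, passed, detail))
        if not passed:
            break
    return results


GATES = [("ticket", has_ticket), ("plan", has_plan), ("backout", has_backout)]
```

Stopping at the first failure keeps the report short; collecting all failures instead is an equally valid choice when authors prefer to fix everything in one pass.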
7) Policies and rules (examples)
SoD: the PR author does not merge to prod.
Time windows: no prod deploys during a "freeze."
Scope: changing sensitive keys requires 2 approvals from Security/Compliance.
Secrets: must not be stored in the repo; only references to a vault path + access role.
Networks: ingress from `0.0.0.0/0` is not allowed without a temporary exception with a TTL.
Alerts: lowering the severity of a P1 alert without CAB approval is forbidden.
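Three of these rules (SoD, the freeze rule, and approvals for sensitive keys) can be sketched as a single check. This is a sketch under stated assumptions: the shape of the `change` dictionary, the team names, and the freeze dates are hypothetical, and the freeze rule is read here as "no prod deploys while a freeze is in effect."

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical freeze window; real windows would come from a release calendar.
FREEZE_WINDOWS: List[Tuple[datetime, datetime]] = [
    (datetime(2024, 12, 20), datetime(2025, 1, 3)),
]
SENSITIVE_APPROVER_TEAMS = {"security", "compliance"}


def check_change(change: Dict, now: datetime) -> List[str]:
    """Return policy violations; an empty list means the change may proceed."""
    violations = []
    if change["env"] == "prod":
        if change["author"] == change["merger"]:
            violations.append("SoD: PR author must not merge to prod")
        if any(start <= now < end for start, end in FREEZE_WINDOWS):
            violations.append("Freeze: no prod deploys during a freeze window")
    if change.get("touches_sensitive_keys"):
        approvers = {a["user"] for a in change.get("approvals", [])
                     if a["team"] in SENSITIVE_APPROVER_TEAMS}
        if len(approvers) < 2:  # count distinct people, not approval records
            violations.append(
                "Scope: sensitive keys need 2 Security/Compliance approvals")
    return violations
```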
8) Secret control
Vault/KMS storage, short TTLs, automatic rotation.
Secret scanning in CI (key patterns, high-entropy).
Isolation of secrets by environments/roles; minimum necessary privileges.
Encryption "in transit" and "at rest"; restricted audit logs of access to secrets.
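To illustrate what "key patterns, high-entropy" scanning means, here is a minimal sketch. The two patterns, the token regex, and the 4.0 bits-per-character threshold are illustrative assumptions; dedicated scanners use much larger rule sets and tuned thresholds.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Illustrative patterns only.
KEY_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")  # long unbroken tokens
ENTROPY_THRESHOLD = 4.0  # bits per character; an assumed, tunable cut-off


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def scan_line(line: str) -> List[Tuple[str, str]]:
    """Return (rule, match) pairs for known key patterns and high-entropy tokens."""
    findings = [(name, m.group(0))
                for name, rx in KEY_PATTERNS.items()
                for m in rx.finditer(line)]
    for m in TOKEN_RE.finditer(line):
        if shannon_entropy(m.group(0)) >= ENTROPY_THRESHOLD:
            findings.append(("high_entropy", m.group(0)))
    return findings
```

Entropy checks produce false positives on hashes and encoded identifiers, so findings are usually reviewed or allow-listed rather than blocked blindly.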
9) Tools (options)
Lint/Schema: `yamllint`, `jsonschema`, `ajv`, `cue`.
Policy: OPA/Conftest, Checkov/tfsec/kube-policies.
GitOps: ArgoCD/Flux (drift detection, audit, RBAC).
Secrets: HashiCorp Vault, cloud KMS, cert managers.
Scanners: trufflehog, gitleaks (secrets); OPA/Regula (rules).
Reporting: export logs to DWH/BI, linked to the incident and change management systems.
10) Examples of rules and artifacts
JSON-Schema for Limit Configuration
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "limits",
  "type": "object",
  "required": ["service", "region", "rate_limit_qps"],
  "properties": {
    "service": {"type": "string", "pattern": "^[a-z0-9-]+$"},
    "region": {"type": "string", "enum": ["eu", "us", "latam", "apac"]},
    "rate_limit_qps": {"type": "integer", "minimum": 1, "maximum": 5000},
    "timeouts_ms": {"type": "integer", "minimum": 50, "maximum": 10000}
  },
  "additionalProperties": false
}
```
Conftest/OPA (Rego) - deny `0.0.0.0/0` in ingress
```rego
package policy.network

deny[msg] {
  input.kind == "IngressRule"
  input.cidr == "0.0.0.0/0"
  msg := "Ingress 0.0.0.0/0 is not allowed. List specific CIDRs or request an exception with a TTL"
}
```
Conftest/OPA - SoD
```rego
package policy.sod

deny[msg] {
  input.env == "prod"
  input.pr.author == input.pr.merger
  msg := "SoD: the PR author cannot merge to prod."
}
```
SQL (DWH) - who lowered alert severity this month
```sql
SELECT actor, COUNT(*) AS cnt
FROM audit_events
WHERE action = 'ALERT_SEVERITY_CHANGED'
  AND old_value = 'P1' AND new_value IN ('P2', 'P3')
  AND ts >= date_trunc('month', now())
GROUP BY 1
ORDER BY cnt DESC;
```
Git commit message example (required fields)
feat(config/payments): raise PSP_B timeout to 800ms in EU
RFC: OPS-3421
Risk: Medium (PSP_B only, EU region)
Backout: revert PR + restore timeout=500ms
Tests: schema ok, conftest ok, e2e ok
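A commit-message check for the required fields shown above could look roughly like this. The subject regex and the function names are assumptions for illustration; the field names (`RFC`, `Risk`, `Backout`, `Tests`) are taken from the example.

```python
import re
from typing import Dict, List

REQUIRED_FIELDS = ("RFC", "Risk", "Backout", "Tests")
SUBJECT_RE = re.compile(r"^[a-z]+\([^)]+\): .+")  # type(scope): summary


def parse_commit_message(message: str) -> Dict[str, str]:
    """Split a commit message into its subject and 'Key: value' fields."""
    lines = [ln.strip() for ln in message.strip().splitlines() if ln.strip()]
    fields = {"subject": lines[0] if lines else ""}
    for line in lines[1:]:
        key, sep, value = line.partition(":")
        if sep and key in REQUIRED_FIELDS:
            fields[key] = value.strip()
    return fields


def validate_commit_message(message: str) -> List[str]:
    """Return a list of problems; an empty list means the message is acceptable."""
    fields = parse_commit_message(message)
    errors = []
    if not SUBJECT_RE.match(fields["subject"]):
        errors.append("subject must look like 'type(scope): summary'")
    errors += [f"missing field: {name}" for name in REQUIRED_FIELDS
               if not fields.get(name)]
    return errors
```

Such a function could run in a `commit-msg` hook or as a CI step on the PR title and description.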
11) Monitoring and alerting
Drift detection: config in the cluster ≠ Git → P1/P2 alert + auto-remediation (reconcile).
High-risk change: changes to networks/secrets/policies → notification in #security-ops.
Missing evidence: a deploy without a plan/signature/reports → block or alert.
Expiring assets: certificate/key validity periods → proactive alerts.
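Drift detection boils down to comparing the state declared in Git with the live state. GitOps tools do this natively; the sketch below only shows the idea on plain dictionaries, and the function names and finding labels are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(config: Dict) -> str:
    """Stable hash of a config: key order and whitespace do not affect it."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_drift(desired: Dict[str, Dict], live: Dict[str, Dict]) -> List[Dict]:
    """Compare Git-declared configs with live ones, keyed by resource name."""
    findings = []
    for name in sorted(set(desired) | set(live)):
        if name not in live:
            findings.append({"resource": name, "kind": "missing_in_cluster"})
        elif name not in desired:
            findings.append({"resource": name, "kind": "unmanaged_in_cluster"})
        elif fingerprint(desired[name]) != fingerprint(live[name]):
            # List only the top-level keys whose values differ.
            changed = sorted(k for k in set(desired[name]) | set(live[name])
                             if desired[name].get(k) != live[name].get(k))
            findings.append({"resource": name, "kind": "modified",
                             "keys": changed})
    return findings
```

Each finding can then be routed to an alert or to an automatic reconcile, depending on the resource's risk level.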
12) Metrics and KPIs
Audit Coverage % - the share of configs covered by schemas/policies/scanners.
Drift MTTR - the mean time to resolve drift.
Policy Compliance % - the share of PRs that pass policy checks.
Secrets Leak MTTR - the time from a leak to revocation/rotation.
Backout Rate - the share of config changes that are rolled back.
Mean Change Size - the average diff in lines/resources (smaller is better).
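Most of these KPIs are simple ratios or averages over audit records. A sketch of the arithmetic follows; the record shapes (`under_policy`, `policies_passed`, `rolled_back`, `diff_lines`, `detected_at`, `resolved_at`) are assumed for illustration and would map to whatever the audit store actually holds.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 when there is no data."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def drift_mttr_minutes(drifts: List[Dict]) -> float:
    """Mean minutes between detection and resolution, resolved drifts only."""
    durations = [(d["resolved_at"] - d["detected_at"]).total_seconds() / 60
                 for d in drifts if d.get("resolved_at")]
    return mean(durations) if durations else 0.0


def kpis(configs: List[Dict], prs: List[Dict],
         changes: List[Dict], drifts: List[Dict]) -> Dict:
    return {
        "audit_coverage_pct": percent(
            sum(c["under_policy"] for c in configs), len(configs)),
        "policy_compliance_pct": percent(
            sum(p["policies_passed"] for p in prs), len(prs)),
        "backout_rate_pct": percent(
            sum(c["rolled_back"] for c in changes), len(changes)),
        "mean_change_size": (mean(c["diff_lines"] for c in changes)
                             if changes else 0.0),
        "drift_mttr_min": drift_mttr_minutes(drifts),
    }
```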
13) Reporting and Compliance
Audit trails: retained for ≥ 1-3 years (according to requirements) in immutable storage.
Regulatory: ISO 27001/27701, SOX-like SoD, GDPR (PII), industry requirements (iGaming: accounting for changes in GGR/NGR calculations, limits, bonus rules).
Monthly reports: top changes, policy violations, drift, expiring certificates, rotation status.
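The "expiring certificates" part of a monthly report can be produced from an inventory of validity dates. This is a minimal sketch: the asset dictionary shape, the `not_after` key, and the 30-day warning horizon are assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List


def expiring_assets(assets: List[Dict], today: date,
                    warn_days: int = 30) -> List[Dict]:
    """Return assets that are expired or expire within warn_days, soonest first."""
    horizon = today + timedelta(days=warn_days)
    due = [{**a, "days_left": (a["not_after"] - today).days}
           for a in assets if a["not_after"] <= horizon]
    return sorted(due, key=lambda a: a["days_left"])
```

A negative `days_left` marks an asset that has already expired and should be escalated rather than just reported.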
14) Playbooks
A. Drift detected in prod
1. Block auto-deploy for the affected service.
2. Take a snapshot of the current state.
3. Compare it with Git; initiate `reconcile` or a rollback.
4. Open a P2 incident and record the drift source (manual kubectl/console).
5. Enable protection against direct changes (PSP/ABAC) and notify the owners.
B. PSP certificate expired
1. Switch to the backup path/PSP; lower the timeouts/retries.
2. Issue a new certificate through the PKI process; update the config through Git.
3. Run a smoke test, restore traffic, close the incident, and hold a post-mortem.
C. Secret leaked into a PR
1. Revoke the key/token and rotate it.
2. Rewrite history/remove the artifact from caches; write up an RCA.
3. Add a rule to the secret scanner; train the team.
15) Anti-patterns
Manual edits in prod with no trace and no rollback path.
Configs without schemas or validation.
Secrets in Git/CI variables without KMS/Vault.
Monorepos where everyone effectively has global "super-user" rights.
"Silent" GitOps without drift alerts or apply logs.
Huge "all at once" PRs: unclear attribution and high risk.
16) Checklists
Before merge
- Schema validation and linters passed
- OPA/Conftest policies are green
- Secret scan is clean
- Plan/diff attached, risk assessed, backout ready
- 2 approvals (prod) and SoD met
Before deploy
- Release window and calendar checked
- Drift monitoring is active
- SLO guardrails configured, smoke tests ready
Monthly
- Rotation of keys/certificates on schedule
- Inventory of owners and rights
- Review of OPA rules and exceptions (TTL)
- Fire-drill test
17) Design tips
Split changes into small diffs; one PR, one goal.
Make PR/commit templates with RFC/risk/rollback fields mandatory.
For dynamic configs, use "config centers" with audit and rollback.
Version your schemas; prohibit breaking changes without migrations.
Visualize the "config map": what is managed, where, and by whom.
18) Integration with Change and Incident Management
PR ↔ RFC ↔ release calendar ↔ incidents/post-mortems.
Automatically link metrics (SLO/business) to config releases.
Auto-create tasks to delete old flags/exceptions (TTL).
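Auto-creating cleanup tasks can start from a list of flags and exceptions with expiry dates. The sketch below only builds the task payloads; the item fields (`kind`, `name`, `owner`, `expires_on`) and the task shape are hypothetical, and pushing the tasks into a tracker is left out.

```python
from datetime import date
from typing import Dict, List


def cleanup_tasks(items: List[Dict], today: date) -> List[Dict]:
    """Build one cleanup task per flag or exception whose TTL has passed."""
    tasks = []
    for item in items:
        if item["expires_on"] < today:
            tasks.append({
                "title": f"Remove expired {item['kind']}: {item['name']}",
                "assignee": item["owner"],
                "overdue_days": (today - item["expires_on"]).days,
            })
    # Most overdue first, so the oldest leftovers get attention.
    return sorted(tasks, key=lambda t: t["overdue_days"], reverse=True)
```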
19) The bottom line
Configuration auditing is not "paper reporting" but an operational reliability mechanism: configs are data, changes are controlled and verifiable, secrets are kept under lock and key, and the full change history is transparent. This is how a stable, compliant, and predictable platform is built.