Operational Documentation as Code
1) The essence of the approach
Documentation as Code is a practice in which operational knowledge, instructions, and processes are stored, edited, and validated the same way as code: through Git, pull requests, review, and CI validation.
In operations, this forms the basis for reliability, transparency, and alignment between teams.
The goal is to create a living, reproducible, versioned knowledge system in which each instruction is an infrastructure artifact rather than an outdated PDF.
2) Why you need it
Transparency: you can see who changed a procedure, when, and why.
Consistency: all teams work from the current versions.
Integration with CI/CD: instructions are validated automatically.
Reproducibility: infrastructure and documentation stay synchronized.
Security: access control and auditing via Git.
Faster onboarding: new operators see exact procedures tied to the code.
3) Main facilities
4) Repository architecture
ops-docs/
├── README.md # structure overview
├── standards/
│ ├── sop-deploy.md
│ ├── sop-oncall.md
│ └── sop-release.md
├── runbooks/
│ ├── payments-latency.md
│ ├── games-cache.md
│ └── kyc-verification.md
├── playbooks/
│ ├── dr-failover.yaml
│ ├── psp-switch.yaml
│ └── safe-mode.yaml
├── postmortems/
│ └── 2025-03-17-bets-lag.md
├── policies/
│ ├── alerting.yaml
│ ├── communication.yaml
│ └── security.yaml
└── templates/
  ├── postmortem-template.md
  ├── sop-template.md
  └── playbook-template.yaml
Tip: each folder can be its own Git repository or submodule so that different teams can manage their content independently.
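The layout above can be checked mechanically, for example as a first CI step. The following is a minimal sketch, assuming the top-level folder names from the tree; the function name `missing_layout_entries` is hypothetical and not part of any existing tool.

```python
from pathlib import Path

# Top-level folders taken from the tree above.
EXPECTED_DIRS = ["standards", "runbooks", "playbooks", "postmortems", "policies", "templates"]


def missing_layout_entries(repo_root: Path) -> list[str]:
    """Return the expected top-level entries that are absent from repo_root."""
    missing = [name for name in EXPECTED_DIRS if not (repo_root / name).is_dir()]
    if not (repo_root / "README.md").is_file():
        missing.append("README.md")
    return missing
```

An empty result means the repository matches the expected structure; anything else can be printed as a CI failure message.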
5) Format and standards
Metadata (YAML front matter):
id: sop-deploy
owner: platform-team
version: 3.2
last_review: 2025-10-15
tags: [deployment, ci-cd, rollback]
sla: review-180d
Markdown structure:
Purpose
Context
Sequence of steps
Result verification
Risks and rollback
Contacts and channels
YAML playbook (example):
name: failover-psp
triggers:
  - alert: PSP downtime
steps:
  - action: check quota PSP-X
  - action: switch PSP-Y
  - action: verify payments latency < 200ms
rollback:
  - action: revert PSP-X
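To illustrate how the `steps` and `rollback` keys of such a playbook might be interpreted, here is a minimal sketch of a runner. It assumes the playbook has already been parsed into a dictionary, and the `execute` callback is hypothetical: the source does not describe how actions are actually carried out.

```python
from typing import Callable

# The playbook as it might look after YAML parsing (structure copied from the example).
PLAYBOOK = {
    "name": "failover-psp",
    "triggers": [{"alert": "PSP downtime"}],
    "steps": [
        {"action": "check quota PSP-X"},
        {"action": "switch PSP-Y"},
        {"action": "verify payments latency < 200ms"},
    ],
    "rollback": [{"action": "revert PSP-X"}],
}


def run_playbook(playbook: dict, execute: Callable[[str], bool]) -> list[str]:
    """Run steps in order; on the first failure, run the rollback actions and stop.

    `execute` is a hypothetical callback that performs one action and
    returns True on success. The returned list is a log of what happened.
    """
    log = []
    for step in playbook["steps"]:
        action = step["action"]
        if execute(action):
            log.append(f"ok: {action}")
            continue
        log.append(f"failed: {action}")
        for rb in playbook.get("rollback", []):
            # A failed rollback action is logged but does not stop the remaining ones.
            status = "rolled back" if execute(rb["action"]) else "rollback failed"
            log.append(f"{status}: {rb['action']}")
        break
    return log
```

Passing a stub such as `lambda action: True` gives a dry run that only produces the log, which is one way to test a playbook in CI before anyone relies on it during an incident.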
6) GitOps and change processes
Pull request = RFC for a documentation change.
Review: the domain owner and the Head of Ops must approve.
CI validation: structure check, mandatory fields, Markdown/YAML linter.
Automatic publishing: after merge, HTML/wiki/dashboards are generated.
Change log: automatic history of changes with dates and authors.
Review reminders: each document is re-reviewed every N days, according to its SLA.
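The review reminder can be derived directly from the front-matter fields `last_review` and `sla`. A minimal sketch, assuming the SLA always follows the `review-<N>d` pattern seen in the metadata example (the function names are hypothetical):

```python
import re
from datetime import date, timedelta


def review_due_date(last_review: date, sla: str) -> date:
    """Parse an SLA such as 'review-180d' and return the next review deadline."""
    match = re.fullmatch(r"review-(\d+)d", sla)
    if match is None:
        raise ValueError(f"unsupported SLA format: {sla!r}")
    return last_review + timedelta(days=int(match.group(1)))


def is_overdue(last_review: date, sla: str, today: date) -> bool:
    """True once the review deadline has passed."""
    return today > review_due_date(last_review, sla)
```

With `last_review: 2025-10-15` and `sla: review-180d`, the deadline works out to 2026-04-13, so a reminder bot would start flagging the document the day after.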
7) CI/CD integration
Lint checks: Markdown syntax, YAML validity, presence of owner/version fields.
Link check: validation of URLs and internal links.
Docs build: conversion to HTML/Confluence/portal.
Diff analysis: what has changed since the last documentation release.
Auto-sync: updating links in Grafana dashboards, the Ops UI, and Slack.
Review bots: hints about outdated sections or missing owners.
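The mandatory-field check from the lint step could look roughly like this. It is a sketch under stated assumptions: the front matter is delimited by `---` lines (the usual convention, not shown in the source), the required-field list is an illustrative policy, and the parser only handles flat `key: value` pairs. A real pipeline would more likely use a YAML library.

```python
REQUIRED_FIELDS = ("id", "owner", "version", "last_review")  # assumed policy


def parse_front_matter(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines between leading '---' markers.

    Deliberately minimal: no nesting and no multi-line values.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def lint_document(text: str) -> list[str]:
    """Return one message per required field that is missing or empty."""
    fields = parse_front_matter(text)
    return [f"missing or empty field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
```

A CI job would run `lint_document` over every changed file and fail the pull request if any message is returned.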
8) Integration with operational tools
Grafana/Kibana: annotations and links to the corresponding runbook directly from the panel.
Incident Manager: an "Open Runbook" button when a ticket is created.
On-call portal: serves the current SOPs and playbooks by incident category.
AI assistants: repository search, TL;DR generation, and suggested actions.
BCP panels: automatically load DR instructions when a scenario is activated.
9) Document Lifecycle Management
10) Automation and synchronization
Docs bot: checks which documents are out of date.
Version badge: `![last review: 2025-05]` right in the document header.
Runbook finder: given an alert, opens the matching document by tag.
Template generator: creates new SOPs from a template (`make new-sop "Deployment"`).
Audit sync: associates the SOP version with the system release and commit ID.
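The runbook finder can be thought of as a tag intersection between an alert and the `tags` field in each document's front matter. A minimal sketch, in which the `Doc` type, the ranking rule, and the example tags are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    path: str
    tags: frozenset[str]


def find_runbooks(alert_tags: set[str], docs: list[Doc]) -> list[str]:
    """Return paths of docs sharing at least one tag with the alert, best match first."""
    scored = [(len(alert_tags & d.tags), d.path) for d in docs]
    # Sort by number of shared tags (descending), then by path for a stable order.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [path for score, path in ranked if score > 0]


# Hypothetical index; in practice it would be built from front matter at docs-build time.
INDEX = [
    Doc("runbooks/payments-latency.md", frozenset({"payments", "latency"})),
    Doc("runbooks/games-cache.md", frozenset({"games", "cache"})),
    Doc("playbooks/psp-switch.yaml", frozenset({"payments", "psp"})),
]
```

For an alert tagged `payments` and `latency`, this ranks `runbooks/payments-latency.md` ahead of `playbooks/psp-switch.yaml` and omits the unrelated cache runbook.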
11) Security and privacy
RBAC per repository: only domain owners can edit.
Secrets and PII: must not be stored in open documents; use only links to protected vaults.
Audit: a log of all changes, reviews, and publications.
Update policy: SOPs are reviewed every 6 months.
Backups: regular snapshots of the repository and portal caches in the DR zone.
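The rule about secrets can be backed by a simple CI scan of changed documents. The sketch below is illustrative only: the three patterns are assumptions chosen as common examples, they will produce false positives and miss many real secrets, and a dedicated secret scanner would be the more realistic choice.

```python
import re

# Illustrative patterns only; a real scanner needs a much broader rule set.
SECRET_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "inline-credential": re.compile(
        r"(?i)\b(password|passwd|secret|token|api[_-]?key)\s*[:=]\s*\S+"
    ),
    "private-key-block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_for_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings
```

A reference such as "see vault path secret/payments/db" does not match, because the patterns look for an assignment (`:` or `=`) rather than the bare word.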
12) Maturity metrics
13) Anti-patterns
Documentation is stored in Google Docs without versions or owners.
Runbooks are not updated after releases.
SOPs refer to legacy commands or tools.
No CI validation: Markdown with errors and broken links.
The same instructions are duplicated in different locations.
No owners and no review process.
14) Implementation checklist
- Identify domain owners and document owners.
- Create the Git repository `ops-docs/` and SOP/runbook/playbook templates.
- Configure CI checks and linters (Markdown/YAML).
- Configure automatic publishing to the portal or wiki.
- Integrate with Grafana/Incident Manager.
- Add an Ops bot for reminders and SLA-based reviews.
- Train teams on the docs-as-code workflow.
15) 30/60/90 - implementation plan
30 days:
- Create the repository structure, templates, CI linter, and PR review process.
- Migrate key SOPs and 5-10 critical runbooks.
- Set up auto-build to the portal.
60 days:
- Implement integrations with Incident Manager and Grafana.
- Connect the Ops bot for audits and reporting.
- Update the postmortem template and link it to the incident dashboard.
90 days:
- Reach full SOP/runbook coverage (≥90%).
- Introduce KPIs: Coverage, Review SLA, Usage.
- Hold a retrospective on the convenience and quality of the docs-as-code process.
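The Coverage KPI and its ≥90% target can be computed from two sets. The source does not define the metric, so this sketch assumes one plausible definition: the share of known services that have at least one SOP or runbook.

```python
def coverage(services: set[str], documented: set[str]) -> float:
    """Share of known services that have at least one SOP or runbook (assumed definition)."""
    if not services:
        return 0.0
    return len(services & documented) / len(services)


def meets_target(value: float, target: float = 0.90) -> bool:
    """Compare a coverage value against the target from the 90-day plan."""
    return value >= target
```

For example, three documented services out of four give a coverage of 0.75, which is below the target.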
16) Example of SOP template (Markdown)
SOP: Deployment via ArgoCD
id: sop-deploy
owner: platform-team
last_review: 2025-10-15
tags: [deployment, rollback, argo]
Purpose
Ensure safe, controlled deployment of services via ArgoCD.
Context
Used for all microservices with a Helm v2+ template.
Requires an active GitOps setup and enabled health checks.
Sequence of steps
1. Check the status with `argocd app list`
2. Run `argocd app sync payments-api`
3. Confirm `status: Healthy`
4. If problems occur, run `argocd app rollback payments-api <history-id>`
Result verification
API availability SLO ≥ 99.95%, no active alerts.
Risks and rollback
- Sync error: roll back.
- Repeated errors: escalate to the Head of Ops.
Contacts
@platform-team / #ops-deploy
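A template generator of the kind invoked by `make new-sop "Deployment"` could be as small as the sketch below. Everything here is an assumption for illustration: the template text uses English renderings of the section names from section 5, and the `sop-<slug>` id scheme is inferred from the `sop-deploy` example rather than stated in the source.

```python
import re
from datetime import date
from string import Template

# Skeleton SOP; section names follow the Markdown structure listed in section 5.
SOP_TEMPLATE = Template("""\
SOP: $title
id: $doc_id
owner: $owner
last_review: $today
tags: []

Purpose

Context

Sequence of steps

Result verification

Risks and rollback

Contacts and channels
""")


def new_sop(title: str, owner: str, today: date) -> str:
    """Render a new SOP skeleton with an id derived from the title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return SOP_TEMPLATE.substitute(
        title=title, doc_id=f"sop-{slug}", owner=owner, today=today.isoformat()
    )
```

Calling `new_sop("Deployment", "platform-team", date(2025, 10, 15))` returns the text of a new file whose metadata already satisfies the mandatory-field checks, leaving only the sections to fill in.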
17) Integration with other processes
Operational analytics: reports on Coverage and review SLA compliance.
Operator training: training based on real runbooks.
Postmortems: automatic insertion of links to the relevant SOP and playbook.
Governance and ethics: transparency of changes and authorship.
AI assistants: contextual search and TL;DR summaries from the repository.
18) FAQ
Q: Why Git if there is Confluence?
A: Git provides versioning, review, automation, and reproducibility. Confluence can be the final presentation layer, but not the source of truth.
Q: How do you avoid outdated instructions?
A: A review SLA (180 days), Ops reminder bots, and an automatic badge showing the last review date.
Q: Can CI be connected to the documentation?
A: Yes. Syntax, required fields, and broken links are checked as a standard pipeline, similar to code tests.
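As an illustration of the broken-link check mentioned in the last answer, the following sketch resolves relative Markdown links against a set of known repository paths. The function name and the choice to skip external URLs and in-page anchors are assumptions; checking external URLs would require network access and is left out.

```python
import posixpath
import re

# Matches inline Markdown links and images: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_internal_links(doc_path: str, text: str, known_paths: set[str]) -> list[str]:
    """Return relative link targets in `text` that do not resolve to a known repo path."""
    base_dir = posixpath.dirname(doc_path)
    broken = []
    for target in LINK_RE.findall(text):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or in-page anchor: out of scope here
        path = target.split("#", 1)[0]
        resolved = posixpath.normpath(posixpath.join(base_dir, path))
        if resolved not in known_paths:
            broken.append(target)
    return broken
```

In CI, `known_paths` would be the list of tracked files, and any non-empty result would fail the build in the same way a failing unit test does.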