Ops automation and scripts
1) Why automate operations
Reduces MTTR and human error; speeds up releases and incident response.
Makes actions repeatable and auditable (compliance).
Frees up engineers' time for improvement instead of routine.
2) Basic principles
1. Idempotency: rerun → same result.
2. Guardrails: dry-run, confirmations, limits, auto-rollbacks.
3. Observability: logs/metrics/traces are built into each script/pipeline.
4. Configuration over constants in code: everything is passed through parameters/manifests.
5. GitOps/Docs-as-Code: operations code is versioned, reviewed, tested.
6. Small steps: canary rollouts, batches, retries with budgets.
7. No secrets in the repo: only through secret stores.
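Principles 2 and 6 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `apply_with_guardrails`, `LimitExceeded`, `max_targets`, `batch_size` and `healthy` are all hypothetical, and the health check stands in for whatever signal (SLO, error rate) a real rollout would watch.

```python
from typing import Callable, Iterable, List


class LimitExceeded(RuntimeError):
    """Raised when a run would touch more targets than the configured limit."""


def apply_with_guardrails(
    targets: Iterable[str],
    action: Callable[[str], None],
    *,
    dry_run: bool = True,                        # safe by default
    max_targets: int = 10,                       # hypothetical blast-radius limit
    batch_size: int = 3,                         # small steps
    healthy: Callable[[], bool] = lambda: True,  # hypothetical health probe
) -> List[str]:
    """Apply `action` in small batches; return the targets actually changed."""
    items = list(targets)
    if len(items) > max_targets:
        # Refuse up front rather than stopping halfway through.
        raise LimitExceeded(f"{len(items)} targets exceeds limit {max_targets}")

    changed: List[str] = []
    for start in range(0, len(items), batch_size):
        for target in items[start:start + batch_size]:
            if dry_run:
                print(f"[DRY] would change {target}")
            else:
                action(target)
                changed.append(target)
        # Stop between batches if the system looks unhealthy.
        if not dry_run and not healthy():
            break
    return changed
```

With `dry_run=True` nothing is changed and the function only reports what it would do; a caller has to opt in to a real run explicitly.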
3) Automation task classes
Remediation and incidents: rollbacks, provider switches, feature flags for graceful degradation.
Planned work: rotation of certificates/keys, database migrations (expand→migrate→contract).
Infrastructure management: IaC (Terraform), configurations (Ansible), K8s manifests.
Data and DataOps: backfills, ETL, quality validation.
Chaos/DR exercises: failure simulation with safety gates.
4) How to choose a tool
Bash - short glue scripts, CLI orchestration.
Python - logic/SDKs, retries, APIs, working with JSON/YAML.
Ansible - idempotent configuration, no agents needed.
Terraform - declarative infrastructure.
Kubernetes Jobs/CronJobs - batch tasks/scheduling.
Argo/Airflow - dependent DAGs and orchestration.
ChatOps - safe launch from chat with audit.
5) Automation architecture (reference)
CLI/ChatOps → Controller (GitOps/orchestrator) → Executors (Ansible/Terraform/K8s Job) → Monitoring (logs/metrics/traces) → Audit/ticketing → Attached artifacts (evidence).
6) Idempotency and state management
"Check, then change": detect-then-act (if already OK - do nothing).
Keep a state record and a lock for long-running procedures.
Split procedures into atomic steps that can be safely re-run.
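The detect-then-act and atomic-step ideas above can be sketched as a small step runner. It is an illustration under assumptions: `run_steps` and the JSON state file layout are hypothetical, and locking (needed when two runs may overlap) is deliberately left out.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def run_steps(
    state_file: Path,
    steps: List[Tuple[str, Callable[[], None]]],
) -> List[str]:
    """Run named steps in order, skipping those already recorded as done."""
    done: Dict[str, bool] = (
        json.loads(state_file.read_text()) if state_file.exists() else {}
    )
    executed: List[str] = []
    for name, step in steps:
        if done.get(name):
            continue  # detect-then-act: already OK, do nothing
        step()
        done[name] = True
        # Persist after every step so a crashed run resumes at the next one.
        state_file.write_text(json.dumps(done))
        executed.append(name)
    return executed
```

A second call with the same state file returns an empty list, which is the idempotency property from principle 1: a rerun leaves the result unchanged.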
7) Errors, retries, and rollbacks
Retries with exponential backoff and jitter.
Time budget per operation (total SLA per task).
A rollback path and a circuit breaker are always provided.
Explicit return codes and structured errors.
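A retry loop that combines backoff, jitter and a total time budget might look like the sketch below. The names `retry_within_budget` and `BudgetExceeded` and the default values are hypothetical; the "full jitter" variant (a random delay between zero and the exponential cap) is one common choice rather than the only one, and a real script would catch only errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExceeded(TimeoutError):
    """Raised when the next retry would not fit into the time budget."""


def retry_within_budget(
    fn: Callable[[], T],
    *,
    budget_s: float = 30.0,   # total time allowed for the whole operation
    base_s: float = 0.2,      # first backoff step
    cap_s: float = 5.0,       # upper bound for a single delay
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    deadline = clock() + budget_s
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:  # assumption: narrow this in real code
            # Full jitter: random delay in [0, min(cap, base * 2^attempt)].
            delay = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            if clock() + delay >= deadline:
                raise BudgetExceeded(
                    f"gave up after {attempt + 1} attempt(s)"
                ) from exc
            sleep(delay)
            attempt += 1
```

`sleep` and `clock` are injectable so the function can be tested without real waiting.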
8) Security and secrets
RBAC/ABAC, least privilege, temporary tokens (JIT/JEA).
Secrets come from Vault/KMS/Cloud Secret Manager; keys are rotated.
"Separation of duties": whoever writes a change is not the one who approves it and runs it in prod.
Audit log: who/when/what/with what result.
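As an illustration of the last two points, an audit entry and a separation-of-duties check could be as small as this. The field names (`who`, `when`, `what`, `result`) mirror the list above; the functions `audit_record` and `enforce_separation_of_duties` are hypothetical, and where the record is shipped (file, SIEM, ticket) is left open.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def enforce_separation_of_duties(author: str, approver: str) -> None:
    """Reject a change that was approved by its own author."""
    if author == approver:
        raise PermissionError("author and approver must be different people")


def audit_record(
    actor: str,
    action: str,
    target: str,
    result: str,
    *,
    now: Optional[datetime] = None,  # injectable for tests
) -> str:
    """Build one JSON audit line: who / when / what / with what result."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "who": actor,
            "when": ts,
            "what": {"action": action, "target": target},
            "result": result,
        },
        sort_keys=True,
    )
```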
9) GitOps and ChatOps
PR → tests → review → merge → auto-promotion to environments.
Commands in chat (for example, '/ops deploy checkout --canary 5%') trigger pipelines; bots attach evidence and links to dashboards.
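On the bot side, a chat command like the one above has to be parsed and validated strictly before it reaches a pipeline. The sketch below assumes the command grammar is exactly `/ops deploy <service> [--canary N%]`; `DeployCommand`, `parse_ops_command` and the default of 5% are hypothetical and not part of any real ChatOps framework.

```python
import shlex
from dataclasses import dataclass

USAGE = "expected: /ops deploy <service> [--canary N%]"


@dataclass(frozen=True)
class DeployCommand:
    service: str
    canary_percent: int


def parse_ops_command(text: str, default_canary: int = 5) -> DeployCommand:
    """Parse a chat message into a validated deploy command or raise ValueError."""
    parts = shlex.split(text)
    if len(parts) < 3 or parts[0] != "/ops" or parts[1] != "deploy":
        raise ValueError(USAGE)

    service = parts[2]
    # Allow only letters, digits and dashes: chat input is untrusted.
    if not service.replace("-", "").isalnum():
        raise ValueError(f"invalid service name: {service!r}")

    canary = default_canary
    rest = parts[3:]
    if rest:
        if len(rest) != 2 or rest[0] != "--canary":
            raise ValueError(USAGE)
        canary = int(rest[1].rstrip("%"))  # int() raises ValueError on junk
    if not 1 <= canary <= 100:
        raise ValueError("canary must be between 1 and 100 percent")
    return DeployCommand(service=service, canary_percent=canary)
```

Rejecting anything outside the expected grammar keeps the chat channel from becoming a generic remote shell.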
10) Planning and orchestration
CronJobs/DAG with dependencies and deadlines.
Concurrency: 'Forbid', 'Replace', 'Allow' (K8s) depending on the task.
Resource policies/quotas so that automation does not "eat" prod capacity.
11) Observability of automation
Metrics: success/error, duration, retries, affected objects.
Logs: structured, with a correlation ID and a clearly marked line on error.
Traces: the steps of long operations are visible in distributed traces.
Alerts: by symptoms (SLO) and by technical metrics (deadline, % of errors).
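Structured logs with a correlation ID need nothing beyond the standard library. The sketch below is one possible shape: `JsonFormatter` and `make_logger` are hypothetical helper names, and the set of JSON fields is an assumption rather than a required schema.

```python
import json
import logging
import uuid
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "ts": self.formatTime(record),
                "level": record.levelname,
                "msg": record.getMessage(),
                # Set by the LoggerAdapter below via `extra`.
                "correlation_id": getattr(record, "correlation_id", None),
            }
        )


def make_logger(
    name: str,
    correlation_id: Optional[str] = None,
    stream: Optional[TextIO] = None,  # defaults to stderr
) -> logging.LoggerAdapter:
    """Return a logger that stamps every line with the same correlation ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logging.LoggerAdapter(
        logger, {"correlation_id": correlation_id or uuid.uuid4().hex}
    )
```

Passing the same correlation ID to every step of a run makes it possible to join the log lines of one operation across scripts and pipelines.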
12) Testing and simulations
Unit tests of logic and artifact parsers.
Integration tests in sandbox and canary.
"Simulators" (dry-run + dummy providers), replay of real scenarios.
Exercises: clear goals, safety gates, AAR→RCA→CAPA.
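A "simulator" in the sense above can be as simple as an in-memory stand-in for a real provider. In this sketch everything is hypothetical: the `DnsProvider` interface, the `DummyDns` fake and the `switch_traffic` operation only illustrate how dry-run and a dummy provider make an operation testable without touching production.

```python
from typing import Dict, List, Protocol, Tuple


class DnsProvider(Protocol):
    """Hypothetical interface that both real and dummy providers implement."""

    def set_record(self, name: str, value: str) -> None: ...


class DummyDns:
    """In-memory provider: records calls instead of changing real DNS."""

    def __init__(self) -> None:
        self.records: Dict[str, str] = {}
        self.calls: List[Tuple[str, str]] = []

    def set_record(self, name: str, value: str) -> None:
        self.calls.append((name, value))
        self.records[name] = value


def switch_traffic(
    provider: DnsProvider, name: str, new_target: str, *, dry_run: bool = True
) -> str:
    """Point `name` at `new_target`, or only describe the change in dry-run."""
    if dry_run:
        return f"[DRY] would point {name} to {new_target}"
    provider.set_record(name, new_target)
    return f"pointed {name} to {new_target}"
```

The same `switch_traffic` code runs against the dummy in unit tests and against a real provider in the sandbox, so the logic under test is the logic that ships.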
13) Code templates
Bash (skeleton with guardrails)
bash
#!/usr/bin/env bash
set -Eeuo pipefail
trap 'echo "[ERR] line $LINENO" >&2; exit 1' ERR

log(){ printf '%s %s\n' "$(date -Iseconds)" "$*"; }

# Dry-run is the default; set DRY_RUN=false to apply for real.
DRY=${DRY_RUN:-true}

# Fail early if a required CLI is missing.
ensure_dep(){ command -v "$1" >/dev/null || { echo "need $1" >&2; exit 2; }; }

apply_change(){
  local target="$1"
  if [[ "$DRY" == "true" ]]; then
    log "[DRY] would update $target"
  else
    kubectl apply -f "$target"
  fi
}

main(){
  ensure_dep kubectl
  shopt -s nullglob   # no matches → loop body is skipped
  for f in manifests/*.yaml; do
    apply_change "$f"
  done
  log "done"
}
main "$@"
Python (retries + idempotency)
python
import argparse
import json
import sys
import time
from pathlib import Path

import requests


def with_retries(fn, attempts=5, base=0.2):
    """Call fn(), retrying on failure with exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # attempts exhausted: surface the last error
            time.sleep(base * (2 ** i))


def already_done(marker):
    return Path(marker).exists()


def mark_done(marker):
    Path(marker).write_text("ok")


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--endpoint", required=True)
    ap.add_argument("--marker", default="/tmp/op.marker")
    args = ap.parse_args()

    # Idempotency: a marker file records that the operation already ran.
    if already_done(args.marker):
        print("idempotent: nothing to do")
        return 0

    def call():
        # The timeout keeps a hung endpoint from blocking the retry loop.
        r = requests.post(args.endpoint, json={"action": "rotate"}, timeout=10)
        r.raise_for_status()
        return r.json()

    resp = with_retries(call)
    print(json.dumps(resp))
    mark_done(args.marker)
    return 0


if __name__ == "__main__":
    sys.exit(main())
Ansible (idempotent task)
yaml
- hosts: web
  become: true
  tasks:
    - name: Ensure nginx present and enabled
      ansible.builtin.package:
        name: nginx
        state: present
    - name: Deploy config
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: restart nginx
  handlers:
    - name: restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
Kubernetes CronJob (planned rotation)
yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cert-rotate
spec:
  schedule: "0 3 * * *"   # every day at 03:00
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ops-automation
          restartPolicy: OnFailure
          containers:
            - name: rotator
              image: registry/ops/rotator:1.2.3
              args: ["--rotate", "--budget-ms=60000"]
              envFrom:
                - secretRef: { name: rotator-secrets }
GitHub Actions (ChatOps trigger)
yaml
name: ops-deploy
on:
  workflow_dispatch:
    inputs:
      service: {required: true}
      canary: {required: false, default: "5"}
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Inputs are passed via env so they are not interpolated into the shell line.
      - run: ./scripts/deploy.sh "$SERVICE" --canary "$CANARY"
        env:
          SERVICE: ${{ inputs.service }}
          CANARY: ${{ inputs.canary }}
14) Implementation checklist
- A tool is selected for each operation and a runbook is described.
- Dry-run, confirmations and limits (guardrails) are in place.
- Logs are structured; metrics and alerts are connected.
- Secrets come from a secret store; access is minimal and temporary.
- Tests (unit/integration/canary) and simulations have been performed.
- GitOps/PR reviews are required, and there is an audit trail.
- Rollback plan and success criteria documented.
- Automation is tied to SLO/error budgets.
15) Anti-patterns
Scripts without idempotency and rollbacks.
Secrets in the code, superadmin accounts for everything.
Manual edits in prod without audit.
A sprawling zoo of Bash scripts instead of declarative IaC.
Parameters hardcoded in the code - no reuse.
No dry-run/canaries → large-scale failures.
Logs "for people" without structure and correlation.
16) Ops automation maturity metrics
Coverage: % of operations covered by automation and runbooks.
Success rate/retry rate of automated tasks.
Mean time to execute and on-time completion.
Change failure rate before/after automation.
Audit completeness: % of operations with full evidence.
Security: key/certificate rotation time, share of JIT accesses.
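Several of these metrics are simple ratios over run records. The sketch below assumes a hypothetical `Run` record with three fields and defines retry rate as the share of runs that needed at least one retry; other definitions (for example, retries per run) are equally reasonable.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one automated task execution."""
    succeeded: bool
    retries: int
    has_evidence: bool


def maturity_metrics(runs: Iterable[Run]) -> Dict[str, float]:
    """Compute success rate, retry rate and audit completeness as fractions."""
    items = list(runs)
    if not items:
        # No data: report zeros instead of dividing by zero.
        return {"success_rate": 0.0, "retry_rate": 0.0, "audit_completeness": 0.0}
    n = len(items)
    return {
        "success_rate": sum(r.succeeded for r in items) / n,
        # Assumption: a run counts toward retry rate if it retried at least once.
        "retry_rate": sum(r.retries > 0 for r in items) / n,
        "audit_completeness": sum(r.has_evidence for r in items) / n,
    }
```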
17) The bottom line
Ops automation is not a set of disparate scripts but a system: idempotent actions, guardrails, observability, secrets and access under control, GitOps/ChatOps, tests and exercises. In such a system, operations become fast, predictable and auditable, and the business gets stable releases and a low risk of incidents.