Ops automation and scripts

1) Why automate operations

Reduces MTTR and human error, speeds up releases and incident response.
Makes actions repeatable and auditable (compliance).
Frees up engineers' time for improvement rather than routine.

2) Basic principles

1. Idempotency: rerun → same result.
2. Guardrails: dry-run, confirmations, limits, auto-rollbacks.
3. Observability: logs/metrics/traces are built into each script/pipeline.
4. Configuration over constants in code: everything goes through parameters/manifests.
5. GitOps/Docs-as-Code: operations code is versioned, reviewed, tested.
6. Small steps: canary slices, batches, retries with budgets.
7. No secrets in the repo: only through secret stores.
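
As an illustration of principles 2 and 6, the sketch below applies a change in small batches, with dry-run as the default, a limit on the number of affected objects, and a health check between batches. The names `rollout`, `apply`, `healthy` and the default values are assumptions made for this example, not part of any particular tool.

```python
from typing import Callable, Iterator, List


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rollout(targets: List[str],
            apply: Callable[[str], object],
            healthy: Callable[[], bool],
            batch_size: int = 2,
            max_targets: int = 10,
            dry_run: bool = True) -> List[str]:
    """Apply a change batch by batch; return the targets actually changed."""
    if len(targets) > max_targets:
        # Limit: refuse to touch more objects than the caller allowed.
        raise ValueError(f"{len(targets)} targets exceed limit {max_targets}")
    changed: List[str] = []
    for batch in batches(targets, batch_size):
        for target in batch:
            if dry_run:
                print(f"[DRY] would update {target}")
            else:
                apply(target)
                changed.append(target)
        # Check health after every batch and stop before the damage spreads.
        if not dry_run and not healthy():
            print("[STOP] health check failed, halting rollout")
            break
    return changed
```

With `dry_run=True` nothing is changed and the function only prints what it would do; a real run has to be requested explicitly.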

3) Automation task classes

Remediation and incidents: rollbacks, provider switches, degradation feature flags.
Planned work: rotation of certificates/keys, database migrations (expand→migrate→contract).
Infrastructure management: IaC (Terraform), configurations (Ansible), K8s manifests.
Data and DataOps: backfills, ETL, quality validation.
Chaos/DR exercises: failure simulation with safety gates.

4) How to choose a tool

Bash - short glue scripts, CLI orchestration.
Python - logic/SDKs, retries, APIs, working with JSON/YAML.
Ansible - idempotent configuration, no agents needed.
Terraform - declarative infrastructure.
Kubernetes Jobs/CronJobs - batch tasks/scheduling.
Argo/Airflow - dependent DAGs and orchestration.
ChatOps - safe launch from chat with audit.

5) Automation architecture (reference)

CLI/ChatOps → Controller (GitOps/orchestrator) → Executors (Ansible/Terraform/K8s Job) → Monitoring (logs/metrics/traces) → Auditing/ticketing → Attached artifacts (evidence).

6) Idempotency and state management

"Check, then change": detect-then-act (if already OK - do nothing).
Store "state/lock" for long procedures.
Divide the procedures into atomic steps with the possibility of repeated run.
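
The three points above can be sketched as follows. This is a minimal example that assumes a local file system is enough for state and locking; `ensure_line`, `StepState` and `file_lock` are hypothetical helpers invented for the illustration.

```python
import json
import os
from contextlib import contextmanager
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Detect-then-act: append `line` only if it is missing. True if changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already OK, do nothing
    with path.open("a") as fh:
        fh.write(line + "\n")
    return True


class StepState:
    """Remember completed steps so a long procedure can be rerun safely."""

    def __init__(self, path: Path):
        self.path = path
        self.done = set(json.loads(path.read_text())) if path.exists() else set()

    def run(self, name: str, step) -> bool:
        if name in self.done:
            return False  # finished in an earlier run, skip
        step()
        self.done.add(name)
        # Write to a temp file and rename so the state is never half-written.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.done)))
        os.replace(tmp, self.path)
        return True


@contextmanager
def file_lock(path: Path):
    """Exclusive lock via atomic file creation; raises FileExistsError if held."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(path)
```

A rerun after a crash then skips the steps already recorded in the state file and continues from the first unfinished one.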

7) Errors, retries and rollbacks

Retries with exponential backoff and jitter.
Operation time budget (total SLA per task).
A rollback path and a circuit breaker are always provided.
Explicit return codes and structured errors.
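
A minimal sketch of retries with exponential backoff, jitter and a total time budget is shown below. The "full jitter" variant (a random delay between zero and the exponential bound) is one common choice, not the only one; `retry`, `BudgetExceeded` and all default values are assumptions for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExceeded(Exception):
    """Raised when the total time budget for the operation is spent."""


def retry(fn: Callable[[], T],
          attempts: int = 5,
          base: float = 0.2,
          cap: float = 5.0,
          budget: float = 30.0,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic) -> T:
    """Call fn until it succeeds, attempts run out, or the budget is spent."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    deadline = clock() + budget
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: random delay in [0, min(cap, base * 2^attempt)].
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            if clock() + delay > deadline:
                raise BudgetExceeded(f"budget of {budget}s spent") from exc
            sleep(delay)
    raise RuntimeError("unreachable")
```

The `sleep` and `clock` parameters exist only so the function can be tested without real waiting.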

8) Security and secrets

RBAC/ABAC, minimum privileges, temporary tokens (JIT/JEA).
Secrets from Vault/KMS/Cloud Secret Manager; the keys are rotated.
"Separation of duties": the author of a change is not the one who approves it and runs it in prod.
Audit log: who/when/what/with what result.
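
As a rough illustration of the last two points, an audit entry can be emitted as one JSON line per operation, with separation of duties checked before anything is recorded. The field names and the `audit_record` helper are assumptions for this sketch, not a prescribed schema.

```python
import json
from datetime import datetime, timezone


def audit_record(actor: str, action: str, target: str, result: str,
                 approved_by: str) -> str:
    """Build one JSON audit line: who / when / what / with what result."""
    if actor == approved_by:
        # Separation of duties: the author may not approve their own change.
        raise PermissionError("actor and approver must differ")
    return json.dumps({
        "who": actor,
        "approved_by": approved_by,
        "when": datetime.now(timezone.utc).isoformat(),
        "what": {"action": action, "target": target},
        "result": result,
    }, sort_keys=True)
```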

9) GitOps and ChatOps

PR → tests → review → merge → auto-promotion to environments.
Commands in the chat (for example, '/ops deploy checkout --canary 5%') trigger pipelines; bots attach evidence and links to dashboards.
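
On the bot side, a chat command of this shape has to be validated before it triggers anything. The sketch below assumes a single allowed action and a default canary of 5%; `parse_ops_command` and `ALLOWED_ACTIONS` are hypothetical names, and a real bot would also check the caller's permissions.

```python
import shlex

ALLOWED_ACTIONS = {"deploy"}  # assumption: only deploys are exposed in chat


def parse_ops_command(text: str) -> dict:
    """Parse a message such as '/ops deploy checkout --canary 5%'."""
    words = shlex.split(text)
    if len(words) < 3 or words[0] != "/ops":
        raise ValueError("expected: /ops <action> <service> [--canary N%]")
    action, service, rest = words[1], words[2], words[3:]
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action}")
    canary = 5  # default canary share in percent
    if rest:
        if len(rest) != 2 or rest[0] != "--canary":
            raise ValueError(f"unknown options: {rest}")
        canary = int(rest[1].rstrip("%"))  # int() raises ValueError on junk
    if not 0 < canary <= 100:
        raise ValueError("canary must be between 1% and 100%")
    return {"action": action, "service": service, "canary": canary}
```

Rejecting anything that is not explicitly allowed keeps the chat interface as narrow as the pipeline it triggers.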

10) Planning and orchestration

CronJobs/DAG with dependencies and deadlines.
Concurrency: 'Forbid', 'Replace', 'Allow' (K8s) depending on the task.
Resource policies/quotas so that jobs do not starve prod.

11) Observability of automation

Metrics: success/error, duration, retries, affected objects.
Logs: structured, with a correlation ID, errors clearly flagged.
Traces: the steps of long operations are visible in distributed traces.
Alerts: by symptoms (SLO) and by technical metrics (deadline, % of errors).
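
A minimal sketch of structured logs with a correlation ID plus basic counters, using only the standard library. In a real setup the counters would go to a metrics backend; the in-memory `Counter`, the `run_task` wrapper and the metric names are assumptions for this example.

```python
import json
import logging
import time
import uuid
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


metrics: Counter = Counter()  # stand-in for a real metrics client


def run_task(name: str, fn, logger: logging.Logger) -> bool:
    """Run fn, counting success/error and duration under one correlation ID."""
    extra = {"correlation_id": str(uuid.uuid4())}
    start = time.monotonic()
    try:
        fn()
        metrics[f"{name}.success"] += 1
        logger.info("task finished", extra=extra)
        return True
    except Exception as exc:
        metrics[f"{name}.error"] += 1
        logger.error("task failed: %s", exc, extra=extra)
        return False
    finally:
        metrics[f"{name}.duration_ms"] += int((time.monotonic() - start) * 1000)
```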

12) Testing and simulations

Unit tests of logic and artifact parsers.
Integration tests in sandbox and canary.
"Simulators" (dry-run + dummy providers), replay of real scenarios.
Exercises: clear goals, safety gates, AAR→RCA→CAPA.
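
The "simulator" idea can be sketched as a dummy provider that records calls instead of making them, combined with a dry-run flag. The certificate-rotation scenario, the `DummyProvider` class and the 14-day threshold are assumptions chosen for illustration.

```python
from typing import Dict, List


class DummyProvider:
    """In-memory stand-in for a real provider; records calls, changes nothing."""

    def __init__(self, days_left: Dict[str, int]):
        self.days_left = dict(days_left)
        self.rotated: List[str] = []

    def get_cert_days_left(self, host: str) -> int:
        return self.days_left[host]

    def rotate_cert(self, host: str) -> None:
        self.rotated.append(host)
        self.days_left[host] = 90  # pretend a fresh certificate was issued


def rotate_expiring(provider, hosts: List[str],
                    threshold_days: int = 14,
                    dry_run: bool = True) -> List[str]:
    """Return the hosts whose certificates are (or would be) rotated."""
    due = [h for h in hosts if provider.get_cert_days_left(h) < threshold_days]
    if not dry_run:
        for host in due:
            provider.rotate_cert(host)
    return due
```

The same `rotate_expiring` function runs against the dummy in tests and against a real provider in prod, so the logic under test is the logic that ships.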

13) Code templates

Bash (skeleton with guardrails)

bash
#!/usr/bin/env bash
set -Eeuo pipefail
trap 'echo "[ERR] line $LINENO" >&2; exit 1' ERR

# Timestamped log line
log(){ printf '%s %s\n' "$(date -Iseconds)" "$*"; }
# Dry-run is the default; set DRY_RUN=false to apply for real
DRY=${DRY_RUN:-true}

# Abort if a required command is missing
ensure_dep(){ command -v "$1" >/dev/null || { echo "need $1" >&2; exit 2; }; }

apply_change(){
  local target="$1"
  if [[ "$DRY" == "true" ]]; then
    log "[DRY] would update $target"
  else
    kubectl apply -f "$target"
  fi
}

main(){
  ensure_dep kubectl
  for f in manifests/*.yaml; do
    apply_change "$f"
  done
  log "done"
}
main "$@"

Python (Retries + Idempotency)

python
import argparse
import json
import sys
import time
from pathlib import Path

import requests

def with_retries(fn, attempts=5, base=0.2):
    # Retry fn with exponential delay; re-raise the last error when attempts run out
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base * (2 ** i))

def already_done(marker):
    return Path(marker).exists()

def mark_done(marker):
    Path(marker).write_text("ok")

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--endpoint", required=True)
    ap.add_argument("--marker", default="/tmp/op.marker")
    args = ap.parse_args()

    # Idempotency: the marker file records that the operation already succeeded
    if already_done(args.marker):
        print("idempotent: nothing to do")
        return

    def call():
        # The timeout keeps a hung endpoint from blocking the retry loop forever
        r = requests.post(args.endpoint, json={"action": "rotate"}, timeout=10)
        r.raise_for_status()
        return r.json()

    resp = with_retries(call)
    print(json.dumps(resp))
    mark_done(args.marker)

if __name__ == "__main__":
    sys.exit(main())

Ansible (idempotent task)

yaml
- hosts: web
  become: true
  tasks:
    - name: Ensure nginx is present
      ansible.builtin.package:
        name: nginx
        state: present
    - name: Deploy config
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: restart nginx
  handlers:
    - name: restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

Kubernetes CronJob (planned rotation)

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cert-rotate
spec:
  schedule: "0 3 * * *"   # daily at 03:00
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ops-automation
          restartPolicy: OnFailure
          containers:
            - name: rotator
              image: registry/ops/rotator:1.2.3
              args: ["--rotate", "--budget-ms=60000"]
              envFrom:
                - secretRef: { name: rotator-secrets }

GitHub Actions (ChatOps trigger)

yaml
name: ops-deploy
on:
  workflow_dispatch:
    inputs:
      service: {required: true}
      canary: {required: false, default: "5"}
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Inputs are passed through env so they are not expanded into the shell command
      - run: ./scripts/deploy.sh "$SERVICE" --canary "$CANARY"
        env:
          SERVICE: ${{ inputs.service }}
          CANARY: ${{ inputs.canary }}

14) Implementation checklist

A tool is selected for each operation and a runbook is described.
Dry-run, confirmations and limits (guardrails) are in place.
Logs are structured, metrics and alerts are connected.
Secrets from storage, minimal and temporary access.
Tests (unit/integration/canary) and simulations performed.
GitOps/PR reviews are required, there is an audit.
Rollback plan and success criteria documented.
Automation is tied to SLO/error budgets.

15) Anti-patterns

Scripts without idempotency and rollbacks.
"Secrets in the code," superadmin accounts for everything.
Manual edits in prod without audit.
A sprawling zoo of Bash scripts instead of declarative IaC.
Parameters hard-coded in the code - no reuse.
No dry-run/canaries → big explosions.
Logs "for people" without structure and correlation.

16) Ops automation maturity metrics

Coverage: % of operations that are automated and covered by runbooks.
Success rate/Retry rate of automatic tasks.
Mean execution time and on-time completion rate.
Change failure rate before/after automation.
Audit completeness: % of operations with full evidence.
Security: key/certificate rotation time, share of JIT accesses.
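
Several of these metrics can be derived from a simple log of operation runs. The sketch below assumes each run is recorded with four flags; the `Run` record and the exact definitions (for example, counting success only over automated runs) are one possible reading, not a standard.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    automated: bool     # executed by automation rather than by hand
    succeeded: bool
    retries: int
    has_evidence: bool  # full audit evidence is attached


def maturity(runs: List[Run]) -> Dict[str, float]:
    """Compute coverage, success/retry rate and audit completeness in percent."""
    auto = [r for r in runs if r.automated]
    if not runs or not auto:
        raise ValueError("need at least one run and one automated run")
    return {
        "coverage_pct": 100 * len(auto) / len(runs),
        "success_rate_pct": 100 * sum(r.succeeded for r in auto) / len(auto),
        "retry_rate_pct": 100 * sum(r.retries > 0 for r in auto) / len(auto),
        "audit_completeness_pct": 100 * sum(r.has_evidence for r in runs) / len(runs),
    }
```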

17) The bottom line

Ops automation is not a set of disparate scripts but a system: idempotent actions, safety guardrails, observability, secrets and access under control, GitOps/ChatOps, tests and exercises. In such a system, operations become fast, predictable and auditable, and the business gets stable releases and a lower risk of incidents.
