Execution Policies and Runtime Restrictions
1) Purpose
Runtime policies make service behavior predictable, safe, and economical: they limit "noisy neighbors," prevent resource leaks and overload, ensure compliance, and preserve SLOs as load grows.
Key objectives: isolation, fair resource allocation, controlled degradation, reproducibility, auditability.
2) Scope
Compute and memory: CPU, RAM, GC pauses, thread limits.
Disk/storage: IOPS/throughput, quotas, filesystem policies (read-only).
Network: egress/ingress, bandwidth shaping, network policies.
Processes/system calls: seccomp, capabilities, ulimit.
Orchestration: Kubernetes QoS, requests/limits, priorities, taints/affinity.
API/gateways: rate limits, quotas, timeouts/retries, circuit breakers.
Data/ETL/streams: batch/stream concurrency, consumer lag budgets.
Security: AppArmor/SELinux, rootless, secrets/configs.
Policy-as-Code: OPA/Gatekeeper, Kyverno, Conftest.
3) Basic principles
Fail-safe by default: better to shed excess requests than to crash the service.
Budget-driven: timeouts and retries must fit within the request time budget and the SLO error budget.
Small blast radius: namespace/pool/host/shard isolation.
Declarative and auditable: all restrictions live in code (repository) with a change log.
Multi-tenant fairness: no single tenant or team can consume the entire cluster.
4) Compute and memory
4.1 Kubernetes and cgroup v2
requests/limits: requests guarantee a share of CPU/memory; limits trigger throttling (CPU) or the OOM killer (memory).
QoS classes: Guaranteed/Burstable/BestEffort; keep critical workloads in Guaranteed/Burstable.
CPU: `cpu.weight` (the cgroup v2 successor to `cpu.shares`), `cpu.max` (throttling), `cpuset` for pinning.
Memory: `memory.max`, `memory.swap.max` (swap usually disabled), `oom_score_adj` for kill priority.
4.2 Patterns
Keep 20-30% headroom per node; use anti-affinity to spread replicas.
GC limits: JVM `-Xmx` below the k8s memory limit; Go: `GOMEMLIMIT`; Node.js: `--max-old-space-size` (see the fragment below).
ulimit: `nofile`, `nproc`, `fsize` set per service profile.
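A minimal sketch of how these limits land in a container spec: requests equal to limits puts the pod in the Guaranteed QoS class, and `GOMEMLIMIT` keeps the Go heap under the cgroup ceiling. The pod name, image, and values are illustrative assumptions, not prescriptions.

```yaml
# Sketch only: Guaranteed QoS (requests == limits) plus a runtime memory ceiling.
apiVersion: v1
kind: Pod
metadata:
  name: payments-service            # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/payments:1.4.2   # hypothetical image
      resources:
        requests:                   # requests == limits => Guaranteed QoS
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"
      env:
        - name: GOMEMLIMIT          # keep the Go runtime below memory.max
          value: "450MiB"
```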
5) Disk and storage
IOPS/throughput quotas on PVCs and cluster storage; separate volumes for logs and data.
Read-only root filesystem, tmpfs for temporary files, size limit on `/tmp` (see the fragment below).
Filesystem watchdog: alerts on volume fill level and inode growth.
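A minimal sketch of the read-only-root pattern, assuming an in-memory `emptyDir` is acceptable for `/tmp`; names, image, and the 64Mi cap are illustrative.

```yaml
# Sketch only: read-only root filesystem with a size-capped tmpfs for /tmp.
apiVersion: v1
kind: Pod
metadata:
  name: readonly-example            # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # hypothetical image
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory              # tmpfs-backed
        sizeLimit: 64Mi             # cap temporary files
```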
6) Network and traffic
NetworkPolicy (ingress/egress): zero-trust for east-west traffic (default deny, as in the sketch below).
Bandwidth limits: tc/egress policies, QoS/DSCP for critical flows.
Egress controller: allowlist of domains/subnets, DNS auditing.
mTLS and TLS policies: encryption and an enforced minimum protocol version.
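A sketch of the default-deny baseline plus a narrow DNS exception; the namespace and labels are assumptions, and a real cluster will need further allow rules for legitimate traffic.

```yaml
# Sketch only: deny all ingress/egress in a namespace, then allow DNS egress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments               # illustrative namespace
spec:
  podSelector: {}                   # every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```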
7) Process safety
Seccomp (syscall allowlist), AppArmor/SELinux profiles.
Drop Linux capabilities (keep the minimum), `runAsNonRoot`, `readOnlyRootFilesystem`.
Rootless containers, signed images and attestations.
Secrets only via Vault/KMS, temporary tokens with short TTL (see the sketch below).
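A sketch of a hardened security context, complementing the fragment in section 17 with the `RuntimeDefault` seccomp profile and a non-root UID; the name, image, and UID are illustrative.

```yaml
# Sketch only: default seccomp profile, non-root user, dropped capabilities.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example            # illustrative name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault          # the runtime's default syscall allowlist
    runAsNonRoot: true
    runAsUser: 10001                # illustrative non-root UID
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```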
8) Time policies: timeouts, retries, budgets
Timeout budget: the sum of all hop timeouts must not exceed the end-to-end SLA.
Retries with backoff and jitter, with a maximum number of attempts per error class.
Circuit breaker: opens when the error rate or p95 timeout exceeds a threshold, then fails fast.
Bulkheads: separate connection pools/queues for critical paths.
Backpressure: throttle producers when consumers fall behind (see the sketch below).
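Assuming an Istio-style mesh handles the hop (an assumption; any proxy or client library with equivalent knobs works), timeout and retry budgets can be declared per route so the retries stay inside the total budget. Host names and values are illustrative.

```yaml
# Sketch only (assumes Istio): bounded retries that fit inside the route timeout.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-timeouts           # illustrative name
spec:
  hosts: ["payments.svc.cluster.local"]
  http:
    - route:
        - destination:
            host: payments.svc.cluster.local
      timeout: 2s                   # total budget for this hop
      retries:
        attempts: 2                 # bounded retry count
        perTryTimeout: 800ms        # 2 x 800ms stays under the 2s budget
        retryOn: 5xx,reset,connect-failure
```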
9) Rate limits, quotas and priorities
Algorithms: token/leaky bucket, GCRA; local plus distributed (Redis/Envoy/global).
Granularity: API key/user/organization/region/endpoint.
Priority tiers: payment/authorization flows are gold, analytics is bronze.
Quotas per day/month, "burst" and "sustained" limits; respond with 429 + Retry-After (see the descriptor sketch below).
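If the distributed limiter is Envoy's ratelimit service (one of the options the list mentions; the choice here is an assumption), per-key and per-endpoint limits are expressed as descriptors. The domain, endpoint, and numbers are illustrative.

```yaml
# Sketch only (assumes envoyproxy/ratelimit): 50 rps per API key,
# with a stricter limit layered on a heavy endpoint.
domain: public-api                  # illustrative domain
descriptors:
  - key: api_key
    rate_limit:
      unit: second
      requests_per_unit: 50
  - key: api_key
    descriptors:
      - key: endpoint
        value: /v1/reports          # hypothetical heavy endpoint
        rate_limit:
          unit: second
          requests_per_unit: 5
```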
10) Orchestration and scheduler
PriorityClass: protects P1 pods from preemption.
PodDisruptionBudget: bounds voluntary disruptions during updates.
Taints/tolerations and (anti-)affinity: workload isolation.
RuntimeClass: gVisor/Firecracker/Wasm for sandboxes.
Horizontal/vertical autoscaling with guard thresholds and max replicas (see the sketch below).
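A sketch of the PriorityClass/PDB pair; the class value, names, and label selector are illustrative assumptions.

```yaml
# Sketch only: a high priority class for P1 workloads plus a disruption budget.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: p1-critical                 # illustrative name
value: 1000000                      # higher value = less likely to be preempted
globalDefault: false
description: "P1 workloads protected from preemption by lower-priority pods"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
spec:
  minAvailable: 2                   # keep at least two replicas through voluntary disruptions
  selector:
    matchLabels:
      app: payments                 # illustrative label
```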
11) Data/ETL/Stream Policies
Concurrency per job/topic, maximum batch size, checkpoint interval.
Consumer lag budgets: warning/critical thresholds; DLQ and retry limit.
Freshness SLA for data marts; pause heavy jobs during production traffic peaks (see the sketch below).
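There is no single standard schema for these knobs, so the fragment below is a purely hypothetical pipeline config, every field name included, showing how the limits above can be captured declaratively and reviewed in a repository.

```yaml
# Purely hypothetical pipeline config: all field names are illustrative.
pipeline: orders-enrichment
concurrency:
  max_parallel_tasks: 8
  max_batch_size: 5000
checkpoint_interval: 60s
consumer_lag_budget:
  warning: 10000                    # messages behind
  critical: 50000
retries:
  max_attempts: 3
  dead_letter_topic: orders-enrichment-dlq
freshness_sla: 15m                  # data mart must be no staler than this
blackout_windows:                   # pause heavy jobs at prod traffic peaks
  - "Mon-Fri 09:00-11:00"
```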
12) Policy-as-Code and admission control
OPA Gatekeeper/Kyverno: reject pods without requests/limits or `readOnlyRootFilesystem`, and pods with `hostNetwork` or `:latest` images.
Conftest for pre-commit checks of Helm/K8s/Terraform.
Mutation policies: auto-inject sidecars (mTLS), annotations, `seccompProfile`.
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resources
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        resources:
          kinds: ["Pod"]
      validate:
        message: "resources.requests/limits for CPU and memory are required"
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
```
Example of OPA (Rego): timeouts ≤ 800 ms:

```rego
package policy.timeout

deny[msg] {
  input.kind == "ServiceConfig"
  input.timeout_ms > 800
  msg := sprintf("timeout %dms exceeds the 800ms budget", [input.timeout_ms])
}
```
13) Observability and compliance metrics
Compliance %: percentage of pods with correct requests/limits/labels.
Security posture: share of pods with seccomp/AppArmor/rootless.
Rate-limit hit %, shed %, throttle %, share of 429s.
p95 timeouts/retries, circuit-open duration.
OOM kills/evictions, CPU throttle seconds (an example alert is sketched below).
Network egress denied events, egress allowlist misses.
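Assuming Prometheus Operator and cAdvisor metrics are available (both assumptions), the CPU-throttling metric can be wired to an alert rule; names and thresholds are illustrative.

```yaml
# Sketch only (assumes Prometheus Operator + cAdvisor metrics):
# alert when a pod spends more than 25% of CPU periods throttled.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: runtime-policy-alerts       # illustrative name
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: HighCPUThrottling
          expr: |
            sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod)
              /
            sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod)
              > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} CPU-throttled > 25% for 15m"
```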
14) Checklists
Before deploying a service
- Requests/limits are set; QoS ≥ Burstable
- Timeouts and retries fit within the end-to-end SLA
- Circuit breaker/bulkhead enabled for external dependencies
- NetworkPolicy (ingress/egress) and mTLS
- Seccomp/AppArmor, drop capabilities, non-root, read-only FS
- Rate-limits and quotas on API gateway/service
- PDB/priority/affinity specified; autoscaling is configured
Monthly
- Audit policy exceptions (TTL)
- Review time and error budgets
- Fire-drill test: shed/backpressure/circuit-breaker
- Rotate secrets/certificates
15) Anti-patterns
No requests/limits: bursts eat up neighbors → cascading failures.
Global retries without jitter: a retry storm across dependencies.
Infinite timeouts: hanging connections and pool exhaustion.
`:latest` and mutable tags: unpredictable builds at runtime.
Open egress: data leaks and unmanaged dependencies.
No PDB: updates knock out the entire pool.
16) Mini playbooks
A. CPU throttle % growth on payments-service
1. Check limits/requests and profile the hot paths.
2. Temporarily raise requests, enable autoscaling on p95 latency.
3. Add caching for limit/rate lookups, reduce query complexity.
4. Post-fix: denormalization/indexes, revise limits.
B. 429 growth and API complaints
1. Break the report down by key/organization → identify who hit the quota.
2. Introduce hierarchical quotas (per-org → per-key), raise burst for the gold tier.
3. Communicate backoff guidance to clients; enable adaptive limiting.
C. Mass OOM kills
1. Reduce concurrency, enable the heap limit and profiling.
2. Recalculate `-Xmx`/`GOMEMLIMIT` from real peak usage.
3. Retune GC/pools, keep swap off, and add soft-limit alerts.
17) Configuration examples
K8s container with secure settings (fragment):

```yaml
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
```
Envoy rate-limit (conceptual fragment):

```yaml
rate_limit_policy:
  actions:
    - request_headers:
        header_name: "x-api-key"
        descriptor_key: "api_key"
```
Nginx ingress: timeouts and rate limits (annotations):

```yaml
nginx.ingress.kubernetes.io/proxy-connect-timeout: "2s"
nginx.ingress.kubernetes.io/proxy-read-timeout: "1s"
nginx.ingress.kubernetes.io/limit-rps: "50"
```
18) Integration with change and incident management
Any policy relaxation goes through RFC/CAB as a temporary exception with a TTL.
Policy violation incidents → post-mortem and rule updates.
Compliance dashboards are connected to the release calendar.
19) The bottom line
Execution policies are guardrails for the platform: they do not stop you from driving fast, but they keep you from going over the edge. Declarative constraints, automatic enforcement, good metrics, and exception discipline turn chaotic operations into a manageable, predictable system with controlled cost and sustainable SLOs.