Hardening the production environment and auditing
1) Objectives and area of responsibility
Production is not only the "most stable environment," but also the most attacked. Our tasks:
- minimize the attack surface and blast radius;
- protect channels, accounts, secrets and delivery artifacts;
- detect and respond to incidents within MTTR targets;
- confirm compliance (GDPR/PCI DSS/local rules);
- preserve auditability of all critical actions.
Key principles: Zero Trust, Least Privilege, Segmentation, Everything-as-Code, Security-by-Default.
2) Network perimeter and segmentation
Segments: Edge (WAF, bot management, DDoS), DMZ (gateway), App (microservices), Data (DB/caches), Backoffice/Ops (CI/CD, observability).
L4/L7 policies: deny-by-default, explicit allow by services/namespaces/ports.
mTLS within the cluster; TLS 1.2+ on the perimeter, with HSTS and secure cipher suites.
Ingress filtering: WAF (OWASP Top-10), anti-bot, rate limits, geo/ASN blocks, CAPTCHA on risky paths.
DDoS protection: always-on + auto-mitigation, separate profiles for API/static content.
Egress control: allow only the external hosts required by providers (PSP/KYC/games).
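The deny-by-default idea behind the egress rule can be sketched in a few lines of Python. This is only the decision logic, assuming a hypothetical allowlist of (host, port) pairs; the host names are placeholders, and real enforcement would live in NetworkPolicies or an egress proxy rather than in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: (host, port) pairs approved for provider traffic.
EGRESS_ALLOWLIST = frozenset({
    ("api.psp.example", 443),
    ("kyc.example", 443),
})


def egress_allowed(url: str, allowlist=EGRESS_ALLOWLIST) -> bool:
    """Deny by default: permit only exact (host, port) matches over HTTPS."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        # .port raises ValueError for malformed or out-of-range ports.
        port = parts.port if parts.port is not None else 443
    except ValueError:
        return False
    # urlsplit already lowercases the hostname.
    return (parts.hostname, port) in allowlist
```

Anything not explicitly listed, including the same host on a different port, is rejected.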
3) Identities, access and privileges (IAM/PAM)
SSO (OIDC/SAML) + MFA for humans; OIDC tokens/Workload Identity for services.
RBAC/ABAC: roles with minimum required permissions; "break-glass" access under audit and TTL.
PAM: on-demand check-out of privileged sessions, with full recording and logging.
CIEM (clouds): detect excessive permissions and unused roles, with auto-remediation.
Access to production data: only through approved jump/proxy, with PII masking.
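As an illustration of "break-glass" access under audit and TTL, here is a minimal Python sketch. The class and function names, the 30-minute default TTL and the in-memory audit list are all assumptions for the example; a real setup would issue the grant through the PAM/IAM system and write to an append-only audit sink.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BreakGlassGrant:
    user: str
    role: str
    reason: str
    issued_at: datetime
    ttl: timedelta

    def is_active(self, now: datetime) -> bool:
        # The grant is valid from issue time up to (not including) expiry.
        return self.issued_at <= now < self.issued_at + self.ttl


# Stand-in for an append-only audit sink (hypothetical).
audit_log: list = []


def grant_break_glass(user, role, reason, now, ttl=timedelta(minutes=30)):
    """Issue a time-limited grant and record it; a justification is mandatory."""
    if not reason.strip():
        raise ValueError("break-glass access requires a justification")
    grant = BreakGlassGrant(user, role, reason, now, ttl)
    audit_log.append({
        "event": "break_glass_granted",
        "user": user,
        "role": role,
        "reason": reason,
        "ts": now.isoformat(),
        "expires": (now + ttl).isoformat(),
    })
    return grant
```

The expiry is part of the grant itself, so access lapses without anyone having to remember to revoke it.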
4) Secrets and cryptography
KMS/HSM: key storage, envelope encryption, rotation with notifications.
Secret manager: short-lived credentials; keep secrets out of Git and logs.
Signatures: artifacts (cosign), webhooks (HMAC), service tokens.
PAN/PII fields: tokenization/encryption at-rest; masking in logs and previews.
Rotation policies: keys/certificates/passwords, both scheduled and forced.
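HMAC signing of webhooks, mentioned above, can be sketched with the Python standard library. The message layout (timestamp, a dot, then the body) and the 300-second skew tolerance are assumptions for illustration; each provider defines its own scheme, so check the provider's documentation before relying on any specific format.

```python
import hashlib
import hmac

MAX_SKEW_SECONDS = 300  # assumed tolerance for clock drift and delayed delivery


def sign_webhook(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sign "<timestamp>.<body>" with HMAC-SHA256 (layout is an assumption)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: int, body: bytes,
                   signature: str, now: int) -> bool:
    """Reject stale timestamps first, then compare signatures in constant time."""
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign_webhook(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Binding the timestamp into the signed message limits how long a captured request can be replayed; full replay protection would also need tracking of delivery IDs already seen.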
5) Containers and Kubernetes (CWPP/KSPM)
Base images: minimal, with vulnerability scanning in CI; rootless wherever possible.
Admission policies (OPA/Gatekeeper/Kyverno): prohibit ':latest' tags, 'privileged' containers and hostPath volumes; require image signatures.
NetworkPolicies: service-to-service communication only when needed.
PodSecurity: limited capabilities, read-only FS, seccomp, AppArmor.
Secrets: from Secret Store CSI (KMS); no plaintext secrets in manifests.
Runtime protection: behavioral rules (eBPF), alerts on anomalies.
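```rego
package k8sadmission

# Reject Pods whose images do not come from the trusted (signed) registry path.
# Note: this checks the image prefix only; signature verification itself
# is done by a separate verifier at admission.
deny[msg] {
  input.request.kind.kind == "Pod"
  some c
  image := input.request.object.spec.containers[c].image
  not startswith(image, "registry.company.com/signed/")
  msg := sprintf("Image must be signed and come from trusted registry: %v", [image])
}
```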
6) Supply chain: trust but verify
SBOM per build; stored and linked to the release.
Image/manifest signatures, validated by the admission controller.
SLSA attestations: verifiable provenance of artifacts.
Policy-as-Code: Conftest/OPA checks on Terraform/Helm/K8s manifests before merge.
No "last-minute patching" in production: all changes go through the pipeline only.
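One way to picture "linking artifacts to a release" is a digest check against a release manifest. The sketch below assumes a hypothetical manifest that maps artifact names to SHA-256 hex digests; it complements, and does not replace, cryptographic signature verification with a tool such as cosign.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_release(artifacts: dict, manifest: dict) -> list:
    """Compare artifact bytes against the manifest's expected digests.

    artifacts: {name: content bytes}; manifest: {name: sha256 hex digest}.
    Returns a list of problems; an empty list means everything matches.
    """
    problems = []
    for name in sorted(manifest):
        if name not in artifacts:
            problems.append(f"missing artifact: {name}")
        elif sha256_hex(artifacts[name]) != manifest[name]:
            problems.append(f"digest mismatch: {name}")
    # Anything shipped but not declared in the manifest is also a finding.
    for name in sorted(set(artifacts) - set(manifest)):
        problems.append(f"unlisted artifact: {name}")
    return problems
```

A pipeline gate could refuse to promote a release whenever the returned list is non-empty.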
7) Vulnerability and patch management
SCA/SAST/DAST in CI; blocking thresholds for critical/high findings.
Weekly update batches (images, OS packages, libraries), plus unscheduled emergency updates.
Completed fixes are tracked as tickets/releases linked to the CVE/SBOM.
EASM: external view of the attack surface (subdomains, open ports, certificates).
Regular pen tests: at least once a year, plus targeted tests of critical flows (payments/KYC).
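A blocking threshold for critical/high findings could look roughly like the following Python sketch. The finding format (a dict with "id" and "severity") and the exception register keyed by finding ID with an expiry date are assumptions; real scanners each have their own report schema.

```python
from datetime import date

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def blocking_findings(findings, exceptions, today):
    """Return IDs of findings that should fail the pipeline.

    findings: list of {"id": str, "severity": str} (assumed format).
    exceptions: {finding_id: expiry date} for approved, time-boxed exceptions.
    """
    blocked = []
    for finding in findings:
        if finding["severity"].upper() not in BLOCKING_SEVERITIES:
            continue
        expiry = exceptions.get(finding["id"])
        if expiry is not None and today <= expiry:
            continue  # covered by an exception that has not expired yet
        blocked.append(finding["id"])
    return blocked
```

Because exceptions carry an expiry date, a finding that was waived last month starts blocking again automatically once the waiver lapses.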
8) Logs, metrics, traces and storage of audit artifacts
Standardized logs (JSON) with 'trace_id', 'request_id', user/tenant/geo (pseudonymized), no PII/PAN.
Metrics: p50/p95/p99, error rate, saturation, DLQ, retries, business KPIs (Time-to-Wallet).
OTel: end-to-end tracing for critical routes (deposit/KYC/withdrawal).
SIEM: event correlation (authentication, role changes, admin actions, WAF/bot rules).
SOAR: automated responses (isolating the affected workload, token revocation, IP/ASN block, release freeze).
Retention: operational logs are kept 30-90 days in hot storage; audit artifacts are kept longer, according to policy.
```json
{
  "ts": "2025-11-05T15:00:00Z",
  "sev": "WARN",
  "svc": "payments-api",
  "route": "POST /v1/payments",
  "trace_id": "2f9f...e1",
  "user": "anon",
  "tenant": "eu-casino-12",
  "geo": "EU",
  "event": "circuit_breaker_open",
  "provider": "psp-1"
}
```
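A log record like the one above could be produced by a small helper that masks PAN-like and email-like strings before serialization. This is a sketch under assumptions: the two regular expressions are deliberately crude, the placeholder tokens are invented for the example, and a production masker would need a reviewed pattern set.

```python
import json
import re

# Crude illustrative patterns; real deployments need a vetted pattern set.
PAN_RE = re.compile(r"\b\d{13,19}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask(value):
    """Replace PAN-like and email-like substrings in string values."""
    if not isinstance(value, str):
        return value
    value = PAN_RE.sub("[PAN]", value)
    return EMAIL_RE.sub("[EMAIL]", value)


def log_line(ts, sev, svc, event, **fields) -> str:
    """Build one JSON log line; all extra fields pass through the masker."""
    record = {"ts": ts, "sev": sev, "svc": svc, "event": event}
    record.update({key: mask(val) for key, val in fields.items()})
    return json.dumps(record, separators=(",", ":"))
```

Masking at the point where the record is built keeps sensitive values out of every downstream sink, rather than relying on each sink to filter them.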
9) Anti-bot, anti-fraud and defensive scenarios
Bot management: signatures/behavior, device fingerprinting, dynamic challenges.
Rate limits/quotas: per-user/tenant/IP; adaptive under anomalies.
RASP sensors on critical endpoints (webhook signature bypass attempts, clock drift, replayed deliveries).
Fraud signals: correlation across channels (logins, payments, KYC), auto-escalation.
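Per-user/tenant/IP rate limits are often built on a token bucket. The sketch below is one such implementation under assumptions: class names and parameters are illustrative, time is passed in explicitly to keep the logic testable, and state is held in memory (a shared store would be needed across replicas).

```python
class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """One bucket per key (user, tenant or IP)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, key: str, now: float) -> bool:
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate, self.capacity, now)
        return self.buckets[key].allow(now)
```

An adaptive variant could lower `rate` or `capacity` for a key while an anomaly signal is active; that part is not shown here.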
10) Backups, DR and BCP
RTO/RPO targets are defined and tested (for example, RTO ≤ 1 hour, RPO ≤ 5 minutes for payment databases).
Backups: encrypted, periodically copied to offline storage; regular restore tests.
Geo-redundancy: active-passive/active-active across regions; DNS failover with TTL control.
Directory of critical dependencies (PSP/KYC/game aggregators) and switching plans.
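Testing RTO/RPO targets comes down to comparing timestamps from a drill against the targets. A minimal sketch, assuming the example targets from this section (RTO of 1 hour, RPO of 5 minutes) and a hypothetical function name:

```python
from datetime import datetime, timedelta, timezone


def evaluate_drill(last_replicated_at: datetime, failure_at: datetime,
                   restored_at: datetime,
                   rpo: timedelta = timedelta(minutes=5),
                   rto: timedelta = timedelta(hours=1)) -> dict:
    """Compare a DR drill's measured data loss and downtime with the targets.

    Data loss window = failure time minus last replicated/backup point.
    Downtime = service restoration time minus failure time.
    """
    data_loss = failure_at - last_replicated_at
    downtime = restored_at - failure_at
    return {"rpo_met": data_loss <= rpo, "rto_met": downtime <= rto}
```

Storing the resulting record with each drill gives auditors a dated artifact showing that the targets were actually tested.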
11) Incidents and response
Runbooks: for provider drop, latency growth, token compromise, DDoS.
On-call: 24/7, rotations and blast pages; joint "war-room" practice.
Communications: message templates for customers/partners and regulators.
Post-mortem (blameless): actions to prevent repetition, updating policies/playbooks.
12) Compliance and privacy
GDPR: data minimization, consent registers, right to erasure and data portability; DPIA for new providers.
PCI DSS: PAN tokenization/isolation zones, network segments, strict access logs.
Local requirements (market jurisdictions): data storage in the region, reporting, update windows.
Data Lineage: where and how PII/PAN flow; schemes and DPIA in DevPortal.
13) Audit: Types, Artifacts and Cycle
Audit types:
- Internal (quarterly): compliance with policies, change control, accesses, secrets, logs, pipelines.
- External (annually/as required): PCI/GDPR/local regulators, pen tests, SOC reports of providers.
Artifacts:
- Security policies, IAM role matrix, exception list with expiration dates.
- Infrastructure change logs (IaC), CI/CD reports (SBOM, signatures, tests).
- Register of providers (PSP/KYC/games), DPIA/vendor-risk assessments, contracts and SLAs.
- Production access logs, secret rotation results, SIEM/SOAR reports.
- DR/BCP plans and records of recent restore tests.
Principles:
- "Evidence-first": each practice is backed by a verifiable artifact.
- "No humans in prod": as much as possible goes through pipelines and approved requests; all sessions are logged.
- "Trace everything": map changes to incidents/metrics.
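Mapping changes to incidents can start as a simple time-window join between the change log and incident start times. The sketch below assumes a hypothetical change record shape (a dict with "id" and "deployed_at") and a one-hour look-back window; it yields candidates for review, not a causal verdict.

```python
from datetime import datetime, timedelta


def changes_before_incident(changes, incident_start: datetime,
                            window: timedelta = timedelta(hours=1)) -> list:
    """Return IDs of changes deployed within `window` before the incident.

    changes: list of {"id": str, "deployed_at": datetime} (assumed format).
    The most recent change comes first, since it is usually checked first.
    """
    candidates = [
        change for change in changes
        if incident_start - window <= change["deployed_at"] <= incident_start
    ]
    candidates.sort(key=lambda change: change["deployed_at"], reverse=True)
    return [change["id"] for change in candidates]
```

Attaching the resulting list to the incident record gives the post-mortem a ready-made starting point.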
14) Guardrails-as-Code: Examples
Conftest for Terraform (public database ban):
```rego
package terraform.deny

# Fail the check if an RDS instance is declared publicly accessible.
deny[msg] {
  input.resource.type == "aws_db_instance"
  input.resource.publicly_accessible == true
  msg := "RDS must not be public"
}
```
AdmissionPolicy (K8s): require security labels and resource limits
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: enforce-security-labels-and-limits
spec:
  rules:
    - name: require-labels
      match: {resources: {kinds: ["Deployment", "StatefulSet"]}}
      validate:
        message: "security labels required"
        pattern:
          metadata:
            labels:
              # "?*" matches any non-empty value
              security.tier: "?*"
              data.classification: "?*"
    - name: require-limits
      match: {resources: {kinds: ["Deployment", "StatefulSet"]}}
      validate:
        message: "resources limits/requests required"
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        cpu: "?*"
                        memory: "?*"
                      requests:
                        cpu: "?*"
                        memory: "?*"
```
15) Daily hygiene checklist
- WAF/bot policies active, signatures updated; anti-DDoS in always-on mode.
- Admission controllers in the cluster in the enforce state, not audit.
- All production images are signed; SBOM is available and tied to the release.
- Critical/high vulnerabilities: none outstanding, or covered by time-boxed exceptions.
- Rotation of secrets/certificates: on schedule, no delays.
- SIEM correlates IAM, login, release and change events; SOAR playbooks are being tested.
- Backups completed, restore tests on schedule; DR plan is up to date.
- Access to prod: only via SSO + MFA/PAM; all sessions are recorded.
- "No PII in logs" - validated by scanners; masking is enabled.
- Release gates and observability updated "as-code."
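The "No PII in logs" item is usually backed by an automated scanner. As a rough sketch, the following Python code flags log lines containing digit runs that pass the Luhn checksum, which is how card numbers are commonly distinguished from other long numbers. The function names and the 13-19 digit range are assumptions; a real scanner would cover more PII types and formats (for example, numbers with spaces or dashes).

```python
import re

CANDIDATE_RE = re.compile(r"\b\d{13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_pan_lines(lines) -> list:
    """Return 1-based numbers of lines containing a Luhn-valid digit run."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(luhn_valid(match) for match in CANDIDATE_RE.findall(line)):
            hits.append(number)
    return hits
```

Run against a sample of recent logs, a non-empty result would fail the daily check and point at the offending lines.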
16) Maturity model (brief)
1. Basic - manual changes, single perimeter, partial monitoring.
2. Advanced - segmentation, IAM/RBAC, signed artifacts, WAF/DDoS, SIEM, regular patches.
3. Expert - Zero Trust, guardrails-as-code, SLSA-attestation, runtime-protection, SOAR-automation, "no humans in prod," continuous audit.
17) Implementation Roadmap
M0-M1 (MVP): network segmentation, WAF/DDoS, SSO + MFA, KMS, basic admission policies, standardized logs/metrics/traces, SIEM.
M2-M3: image signatures and admission verification, SBOM, Conftest/OPA on IaC, PAM, rotation plan, regular patches, first DR tests.
M4-M6: SOAR playbooks, eBPF/runtime detection, EASM, compliance package (PCI/GDPR), full set of audit artifacts, ring-DR by region.
M6+: Zero-Trust network (mTLS everywhere), CIEM, automated audit trail reports, continuous "purple-team" testing.
Summary
A hardened production environment is not a set of rigid rules but a system: segmentation, strict identities and secrets, secure delivery, managed containers, observability, and automated response. Add verifiability (audit artifacts, SBOM/signatures, logs), and the production environment becomes predictable, manageable, and ready for external audits, without compromising release speed or business SLOs.