Audit and logging tools
1) Why you need it
Objectives:
- Traceability of actions (who/what/when/where/why).
- Rapid incident investigations and forensics.
- Regulatory and customer compliance.
- Risk management and MTTR reduction in incidents.
- Support for risk, anti-fraud, compliance models (KYC/AML/RTBF/Legal Hold).
Principles:
- Completeness of source coverage.
- Record immutability and integrity.
- Standardized event schemas.
- Search availability and correlation.
- Minimization of personal data and privacy control.
2) Tool landscape
2.1 Log management and indexing
Collection/agents: Fluent Bit/Fluentd, Vector, Logstash, Filebeat/Winlogbeat, OpenTelemetry Collector.
Storage and search: Elasticsearch/OpenSearch, Loki, ClickHouse, Splunk, Datadog Logs.
Streaming/bus: Kafka/Redpanda, NATS, Pulsar, used for buffering and fan-out.
Parsing and normalization: Grok/regex, OTel processors, Logstash pipelines.
2.2 SIEM/Detect & Respond
SIEM: Splunk Enterprise Security, Microsoft Sentinel, Elastic Security, QRadar.
UEBA/behavioral analysis: modules built into the SIEM, ML detectors.
SOAR/orchestration: Cortex XSOAR, Tines, Shuffle, used for playbook automation.
2.3 Audit and immutability
Subsystem audit: Linux auditd/ausearch, Windows Event Logs, DB audit (pgAudit, MySQL audit), Kubernetes Audit Logs, CloudTrail/CloudWatch/Azure Monitor/GCP Cloud Logging.
Immutable storage: WORM buckets (Object Lock), S3 Glacier Vault Lock, write-once volumes, logs protected by a cryptographic signature/hash chain.
TSA/timestamps: clock synchronization via NTP/PTP, periodic anchoring of hashes to an external trusted time source.
2.4 Observability and traces
Metrics/traces: Prometheus + Tempo/Jaeger/OTel, correlation of logs ↔ traces by trace_id/span_id.
Dashboards and alerts: Grafana/Kibana/Datadog.
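Log-to-trace correlation only works if every log line carries the trace identifier. The sketch below shows one way to do that with Python's standard `logging` module. The `current_trace_id` context variable, the logger name, and the output field layout are assumptions for illustration; in a real service the tracing library would supply the ID.

```python
import contextvars
import json
import logging

# Hypothetical holder for the active trace ID (an assumption, not a library API).
current_trace_id = contextvars.ContextVar("current_trace_id", default="")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render a record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "severity": record.levelname.lower(),
            "message": record.getMessage(),
            "trace": {"id": record.trace_id},
        })


def build_logger(stream):
    """Return a logger that writes JSON lines with a trace ID to `stream`."""
    logger = logging.getLogger("audit-example")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

With the ID present in both logs and spans, a search by `trace.id` in the log store can be joined to the same trace in Tempo or Jaeger.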
3) Event sources (coverage scope)
Infrastructure: OS (syslog, auditd), containers (Docker), orchestration (Kubernetes Events + Audit), network devices, WAF/CDN, VPN, IAM.
Applications and APIs: API gateway, service mesh, web servers, backends, queues, schedulers, webhooks.
DB and vaults: queries, DDL/DML, access to secrets/keys, access to object storage.
Payment integrations: PSP/acquiring, chargeback events, 3DS.
Operations and processes: console and CI/CD logins, admin panels, configuration/feature flag changes, releases.
Security: IDS/IPS, EDR/AV, vulnerability scanners, DLP.
User events: authentication, login attempts, KYC status changes, deposits/withdrawals, bets/games (with anonymization if necessary).
4) Data schemas and standards
Unified event model: `timestamp`, `event.category`, `event.action`, `user.id`, `subject.id`, `source.ip`, `http.request_id`, `trace.id`, `service.name`, `environment`, `severity`, `outcome`, `labels.*`.
Schema standards: ECS (Elastic Common Schema), OCSF (Open Cybersecurity Schema Framework), OpenTelemetry Logs.
Correlation keys: `trace_id`, `session_id`, `request_id`, `device_id`, `k8s.pod_uid`.
Quality: required fields, validation, deduplication, sampling for "noisy" sources.
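The quality controls above (required fields, validation, deduplication) can be sketched as a small ingest-side check. This is a minimal illustration, not a reference implementation: the `REQUIRED_FIELDS` set and the function names are assumptions, and deduplication here uses an in-memory set, which would not scale to a real stream.

```python
import hashlib
import json

# Hypothetical minimum set of required fields (dotted paths from the unified model).
REQUIRED_FIELDS = ("timestamp", "event.category", "event.action", "service.name", "severity")


def get_path(event: dict, dotted: str):
    """Return the value at a dotted path such as 'event.action', or None if absent."""
    node = event
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_fields(event: dict) -> list:
    """List required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if get_path(event, f) in (None, "")]


def fingerprint(event: dict) -> str:
    """Stable hash of the event content, used as a deduplication key."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def accept(events):
    """Yield each valid event once; invalid or duplicate events are skipped."""
    seen = set()
    for event in events:
        if missing_fields(event):
            continue
        key = fingerprint(event)
        if key in seen:
            continue
        seen.add(key)
        yield event
```

In practice rejected events would be routed to a dead-letter stream and counted rather than silently dropped, so that the parsing-error share stays measurable.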
5) Reference architecture
1. Collection on nodes/agents →
2. Pre-processing (parsing, PII redaction, normalization) →
3. Bus (Kafka) with retention of ≥ 3-7 days →
4. Stream fan-out:
- Online storage (search/correlation, hot storage for 7-30 days).
- Immutable archive (WORM/Glacier, 1-7 years for audit).
- SIEM (detection and incidents).
5. Dashboards/search (operations, security, compliance).
6. SOAR for response automation.
Storage tiers:
- Hot: SSD/indexing, fast search (rapid response).
- Warm: compression, less frequent access.
- Cold/Archive (WORM): cheap long-term storage, immutable.
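As a rough illustration of the hot/warm/archive split, the function below picks a tier from an event's age. The 30-day hot boundary is the upper end of the range given above; the 90-day warm boundary is purely an assumption, since the text gives no figure for it.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 30   # upper end of the 7-30 day hot window
WARM_DAYS = 90  # assumption: no warm-tier boundary is stated in the text


def storage_tier(event_time: datetime, now: datetime) -> str:
    """Return 'hot', 'warm' or 'archive' depending on how old the event is."""
    age = now - event_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=WARM_DAYS):
        return "warm"
    return "archive"
```

Note that the immutable archive is normally written in parallel at ingest time, not only when data ages out of the warmer tiers; the function only decides where a query should look first.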
6) Immutability, integrity, trust
WORM/Object Lock: blocks deletion and modification for the duration of the policy.
Cryptographic signature and hash chain: applied per batch/chunk of logs.
Hash anchoring: periodic publication of hashes to an external registry or trusted timestamping service.
Time synchronization: NTP/PTP, drift monitoring; recording `clock.source`.
Change control: four-eyes/dual control for changes to retention/Legal Hold policies.
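A hash chain over log batches can be sketched as follows: each batch hash covers the previous batch hash, so altering any earlier batch invalidates everything after it. This is an illustrative sketch under stated assumptions: the link layout and function names are hypothetical, and HMAC-SHA256 stands in for a real asymmetric signature backed by a KMS or HSM.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first batch


def batch_digest(prev_hash: str, records: list) -> str:
    """Hash the previous link's hash together with the canonical batch content."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def sign(digest: str, key: bytes) -> str:
    """HMAC used here as a stand-in for a real signature (assumption)."""
    return hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()


def seal(batches, key: bytes) -> list:
    """Build the chain: one link per batch, each pointing at the previous hash."""
    chain, prev = [], GENESIS
    for records in batches:
        digest = batch_digest(prev, records)
        chain.append({"prev": prev, "hash": digest, "sig": sign(digest, key), "records": records})
        prev = digest
    return chain


def verify(chain, key: bytes) -> bool:
    """Recompute every link; any edit, reorder or removal breaks the chain."""
    prev = GENESIS
    for link in chain:
        digest = batch_digest(prev, link["records"])
        if link["prev"] != prev or link["hash"] != digest:
            return False
        if not hmac.compare_digest(link["sig"], sign(digest, key)):
            return False
        prev = digest
    return True
```

Anchoring then amounts to periodically publishing the latest `hash` value somewhere the log operators cannot rewrite.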
7) Privacy and compliance
PII minimization: store only the necessary fields, redact/mask at ingest.
Pseudonymization: `user.pseudo_id`; the mapping is stored separately with restricted access.
GDPR/DSAR/RTBF: source classification, controlled logical deletion/hiding in replicas, exceptions for statutory retention obligations.
Legal Hold: "freeze" tags, suspension of deletion in archives; a log of actions taken on the hold.
Standards mapping: ISO 27001 A.8/12/15, SOC 2 CC7, PCI DSS Req. 10, local market regulation.
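Masking and pseudonymization at ingest can be sketched as below. This is a minimal example under assumptions: the keyed-hash approach (HMAC-SHA256 truncated to 16 hex characters), the `message` field, and the email pattern are all illustrative choices, not something the schema prescribes.

```python
import hashlib
import hmac
import re

# Illustrative email pattern; real PII detection needs a broader rule set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudo_id(user_id: str, key: bytes) -> str:
    """Derive a stable pseudonym from the user ID with a secret key."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "p_" + digest[:16]


def mask_email(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL_RE.sub("[redacted-email]", text)


def minimize(event: dict, key: bytes) -> dict:
    """Return a copy with user.id swapped for user.pseudo_id and emails masked."""
    out = dict(event)
    if "user" in out:
        user = dict(out["user"])
        if "id" in user:
            user["pseudo_id"] = pseudo_id(user.pop("id"), key)
        out["user"] = user
    if "message" in out:
        out["message"] = mask_email(out["message"])
    return out
```

Because the pseudonym depends on the key, keeping the key (or an explicit mapping table) in a separate, access-restricted store is what keeps re-identification under control.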
8) Operations and processes
8.1 Playbooks/Runbooks
Source loss: how to detect it (heartbeats), how to recover (replay from the bus), how to compensate for gaps.
Growing lag: check queues, sharding, indexes, backpressure.
Investigation of event X: KQL/ES query template + link to the trace context.
Legal Hold: who places it, how to lift it, how to document it.
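Heartbeat-based detection of a lost source comes down to comparing each source's last-seen time with an allowed silence window. The sketch below assumes a hypothetical `last_seen` mapping of source name to timestamp, maintained elsewhere by the pipeline.

```python
from datetime import datetime, timedelta, timezone


def silent_sources(last_seen: dict, now: datetime, max_silence: timedelta) -> list:
    """Return the names of sources whose last event is older than max_silence."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_silence)
```

For example, with a five-minute window, a WAF that last reported twelve minutes ago would be flagged, and the runbook would then trigger a replay from the bus once the source is back.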
8.2 RACI (in brief)
R (Responsible): Observability team for collection/delivery; SecOps for detection rules.
A (Accountable): CISO/Head of Ops for policies and budget.
C (Consulted): DPO/Legal for privacy; Architecture for schemas.
I (Informed): Support/Product/Risk Management.
9) Quality Metrics (SLO/KPI)
Coverage: % of critical sources connected (target ≥ 99%).
Ingest lag: p95 delivery delay (< 30 s).
Indexing success: share of events with no parsing errors (> 99.9%).
Search latency: p95 < 2 s for typical queries over a 24-hour window.
Drop rate: loss of events < 0.01%.
Alert fidelity: Precision/Recall by rules, share of false positives.
Cost per GB: Storage/index cost per period.
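Two of these indicators, p95 ingest lag and drop rate, can be computed as sketched below. The nearest-rank percentile method and the function names are assumptions; the thresholds (30 s and 0.01%) are the targets listed above.

```python
import math


def percentile(values, pct: int):
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def drop_rate(sent: int, received: int) -> float:
    """Fraction of events lost between producers and the store."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return max(sent - received, 0) / sent


def slo_report(lags_seconds, sent: int, received: int) -> dict:
    """Compare measured values with the targets: p95 lag < 30 s, drop rate < 0.01%."""
    p95 = percentile(lags_seconds, 95)
    rate = drop_rate(sent, received)
    return {
        "ingest_lag_p95_s": p95,
        "ingest_lag_ok": p95 < 30,
        "drop_rate": rate,
        "drop_rate_ok": rate < 0.0001,
    }
```

Lag samples would typically be the difference between the event's own `timestamp` and the time it became searchable.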
10) Retention policies (example)
Policies are specified by Legal/DPO and local regulations.
11) Detection and alerts (skeleton)
Rules (rule-as-code):
- Suspicious authentication (impossible travel, Tor, frequent failures).
- Escalation of privileges/roles.
- Configuration/secret changes outside the release schedule.
- Abnormal transaction patterns (AML/anti-fraud signals).
- Mass data exports (DLP triggers).
- Reliability: bursts of 5xx errors, latency degradation, repeated pod restarts.
- Enrichment with geo/IP reputation, linking to releases/feature flags, linking to traces.
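As an example of a rule expressed as code, the sketch below flags frequent authentication failures from one source IP within a sliding window. The class name, default threshold, and window are assumptions, and the rule expects events in time order with ISO 8601 UTC timestamps in the event shape used in this article.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class FailedLoginRule:
    """Fire when one IP produces `threshold` failed logins within `window`."""

    def __init__(self, threshold: int = 5, window: timedelta = timedelta(minutes=5)):
        self.threshold = threshold
        self.window = window
        self._failures = defaultdict(deque)  # source IP -> recent failure times

    def observe(self, event: dict) -> bool:
        """Feed one event; return True if the rule fires on it."""
        ev = event.get("event", {})
        if ev.get("category") != "authentication" or ev.get("outcome") != "failure":
            return False
        ip = event.get("source", {}).get("ip")
        if ip is None:
            return False
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        times = self._failures[ip]
        times.append(ts)
        # Drop failures that have fallen out of the sliding window.
        while times and ts - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold
```

Keeping rules in this form lets them be versioned, reviewed, and unit-tested like any other code before they reach the SIEM.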
12) Log access security
RBAC and segregation of duties: separate roles for readers/analysts/admins.
Just-in-time access: temporary tokens, audit of all reads of "sensitive" indexes.
Encryption: in-transit (TLS), at-rest (KMS/CMK), key isolation.
Secrets and keys: rotation, limiting the export of events with PII.
13) Implementation Roadmap
MVP (4-6 weeks):
1. Source catalog + minimum schema (ECS/OCSF).
2. Agent on nodes + OTel Collector; centralized parsing.
3. Hot storage (OpenSearch/Elasticsearch/Loki) + dashboards.
4. Basic alerts (authentication, 5xx, config changes).
5. Archive in object storage with Object Lock (WORM).
Phase 2:
- Kafka as a bus, replay, retry queue.
- SIEM + first correlation rules, SOAR playbooks.
- Cryptographic signing of batches, hash anchoring.
- Legal Hold policies, DSAR/RTBF procedures.
- UEBA/ML detection.
- Data Catalog, lineage.
- Cost optimization: sampling "noisy" logs, tiering.
14) Frequent mistakes and how to avoid them
Log noise without a schema → introduce mandatory fields and sampling.
No tracing → propagate trace_id through core services and proxies.
A single "monolith" of logs → split by domain and criticality level.
No immutability → enable WORM/Object Lock and signing.
Secrets in the logs → filters/redactors, token scanners, reviews.
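A redaction filter for secrets in log lines can be as simple as a few regular expressions applied before the line leaves the host. The two patterns below are illustrative assumptions only; a real scanner would carry a maintained rule set and be tested against the services' actual log formats.

```python
import re

# Illustrative patterns (assumptions): key=value style secrets and bearer tokens.
KEY_VALUE_RE = re.compile(
    r"(?i)\b(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+"
)
BEARER_RE = re.compile(r"(?i)\b(Bearer\s+)[A-Za-z0-9._~+/-]+=*")


def redact_secrets(line: str) -> str:
    """Replace secret values with a marker while keeping the key name visible."""
    line = KEY_VALUE_RE.sub(r"\1\2[REDACTED]", line)
    line = BEARER_RE.sub(r"\1[REDACTED]", line)
    return line
```

Keeping the key name in place preserves the diagnostic value of the line while removing the sensitive part.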
15) Launch checklist
- Source register prioritized by criticality.
- Unified schema and validators (CI for parsers).
- Agent strategy (DaemonSet in k8s, Beats/OTel).
- Bus and retention.
- Hot/Cold/Archive + WORM.
- RBAC, encryption, access log.
- Basic alerts and SOAR playbooks.
- Dashboards for Ops/Sec/Compliance.
- DSAR/RTBF/Legal Hold policies.
- KPI/SLO + storage budget.
16) Examples of events (simplified)
```json
{
  "timestamp": "2025-10-31T19:20:11.432Z",
  "event": {"category": "authentication", "action": "login", "outcome": "failure"},
  "user": {"id": "u_12345", "pseudo_id": "p_abcd"},
  "source": {"ip": "203.0.113.42"},
  "http": {"request_id": "req-7f91"},
  "trace": {"id": "2fe1…"},
  "service": {"name": "auth-api", "environment": "prod"},
  "labels": {"geo": "EE", "risk_score": 72},
  "severity": "warning"
}
```
17) Glossary (brief)
Audit trail - a sequence of immutable records documenting a subject's actions.
WORM - write-once, read-many storage mode.
SOAR - automation of incident response through playbooks.
UEBA - user and entity behavior analytics.
OCSF/ECS/OTel - standards for log schemas and telemetry.
18) The bottom line
An audit and logging system is not a "log stack" but a managed program with a clear data schema, an immutable archive, correlation, and response playbooks. Following the principles in this article improves observability, speeds up investigations, and covers key Operations and Compliance requirements.