Observability and condition control
1) Goals and principles
Goal: understand "what is happening" and "why" in real time, so that incidents are prevented or recovered from quickly without violating SLOs or inflating OPEX.
Principles: SLO-first, "golden signals" (latency, traffic, errors, saturation), a single telemetry standard (OpenTelemetry), minimally sufficient detail, explainability, cost-aware observability.
2) Observability layers
1. Metrics: aggregates for SLI/SLO, capacity and trends (RED/USE models).
2. Traces: causal chains of requests, payment and game transactions.
3. Logs/events: detailed context and audit of operator/service actions.
4. Synthetics (black-box): external API/web path checks, PSP/KYC health pings.
5. RUM (real user monitoring): front-end metrics (TTFB, LCP, JS errors), geo/device slices.
6. Low-level telemetry: eBPF, CPU/IO/allocation profiling, network latency percentiles.
3) SLI set and golden signals
Latency: p50/p95/p99 by critical paths (login, deposit, bet, withdrawal).
Errors: share of 5xx/timeout/decline (normalized by providers/banks).
Traffic/Throughput: RPS/TPS, active sessions, events/sec.
Saturation: CPU/RAM/IO load, queue depth, pool-usage, replication lag.
Business SLI: share of successful deposits/bets per window, KYC/PSP conversion deviations, chargeback share.
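As an illustration of the latency and error SLIs listed above, here is a minimal sketch that computes p50/p95/p99 and the error share for one critical path from raw request records. The `Request` record, the nearest-rank percentile method, and the convention that status 0 means a timeout are assumptions made for this example; a production setup would normally derive the same values from histograms in the TSDB.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    path: str          # critical path name, e.g. "deposit" (hypothetical field)
    latency_ms: float
    status: int        # HTTP status; 0 stands for a timeout in this sketch


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def sli_for_path(requests, path):
    """Return latency percentiles and error share for a single critical path."""
    sample = [r for r in requests if r.path == path]
    latencies = [r.latency_ms for r in sample]
    errors = sum(1 for r in sample if r.status >= 500 or r.status == 0)
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "error_share": errors / len(sample),
    }
```

Computing the SLI per path rather than per service keeps a slow withdrawal flow from hiding behind a fast login flow.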
4) Telemetry architecture
Standardized ingestion: OpenTelemetry SDK/collector → normalization, sampling, privacy filters → storage (TSDB, traces, logs).
Correlation: trace-id/span-id in logs and metrics (exemplars); single correlation-id for payments/gaming events.
Topology: service graph, dependent external providers with live SLIs.
Cost management: retention levels, aggregations, dynamic sampling, "hot"/"cold" storage classes.
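The correlation item above (trace-id/span-id in logs) can be sketched with the standard library alone. This is an illustration, not the OpenTelemetry API: the two `ContextVar` objects are hypothetical stand-ins for the active trace context that an OpenTelemetry SDK would normally provide, and the logger name and ID values are example data.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-ins for the active trace context.
current_trace_id = contextvars.ContextVar("trace_id", default="")
current_span_id = contextvars.ContextVar("span_id", default="")


class TraceContextFilter(logging.Filter):
    """Attach the current trace and span IDs to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


def build_logger(stream):
    logger = logging.getLogger("payments-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # example value
current_span_id.set("00f067aa0ba902b7")                   # example value
log.info("deposit authorized")
```

With the IDs present in every log line, a log search result can be opened directly as the corresponding trace.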
5) Metrics: design and cardinality
Rules: a small number of labels and a ban on high-cardinality labels (userId, sessionId) in time series; such details belong only in traces/logs.
RED/USE: Requests-Errors-Duration for APIs; Utilization-Saturation-Errors for infrastructure.
Exemplars: binding high percentiles to specific trace examples.
Business metrics: $/RPS, PSP conversion by bank/GEO, provider resilience.
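The cardinality rule above can be enforced at instrumentation time. The sketch below is one possible guard, assuming a hypothetical allowlist of label names; it also shows why userId is banned, since the worst-case number of time series for a metric is the product of the distinct values of its labels.

```python
# Hypothetical allowlist; a real one would come from the telemetry policy.
ALLOWED_LABELS = {"service", "route", "method", "status_class", "region"}
FORBIDDEN_LABELS = {"userId", "sessionId"}


def validate_labels(labels):
    """Raise ValueError if a label set would put high-cardinality data into the TSDB."""
    forbidden = FORBIDDEN_LABELS & set(labels)
    if forbidden:
        raise ValueError(
            f"high-cardinality labels belong in traces/logs: {sorted(forbidden)}"
        )
    unknown = set(labels) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"labels not in the allowlist: {sorted(unknown)}")


def estimated_series(label_cardinalities):
    """Upper bound on series count: product of distinct values per label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total
```

For example, 20 services × 15 routes × 5 status classes is at most 1,500 series, while adding a userId label with a million distinct values raises the bound to 1.5 billion.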
6) Tracing: depth and sampling
Context: propagate the trace context end to end: front end → API → brokers → processors → databases/PSP.
Sampling: baseline 1-10%; on anomalies, a rule-driven dynamic increase (tail-based).
Focus: payment flow (init → auth → capture/settle), game transactions (bet → settle), KYC (init → verify).
Annotations: PSP response code, bank BIN/issuer category, region, risk score.
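A tail-based sampling decision of the kind described above can be sketched as follows: the decision is made after the trace has completed, so errors and slow traces are always kept, and the rest are kept at a baseline rate. The function name, the 2000 ms threshold, and the 5% baseline are assumptions for illustration; in practice this logic usually lives in the collector rather than in application code.

```python
import hashlib


def keep_trace(trace_id, has_error, duration_ms,
               slow_threshold_ms=2000.0, base_rate=0.05):
    """Decide, once a trace is complete, whether to store it."""
    # Rule 1: anomalous traces (errors, slow requests) are always kept.
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # Rule 2: keep a deterministic baseline share of normal traces.
    # Hashing the trace ID gives every collector instance the same answer.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < base_rate
```

Raising `base_rate` for a specific path or provider during an incident is one way to implement the dynamic increase without redeploying services.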
7) Logs and audits
Structured logs: JSON, level by environment (INFO in production, DEBUG when debugging).
Privacy filters: PII masking, prohibition of raw KYC documents in logs.
Audit events: who/what/where/when/why, ticket ID, pre/post values for high-risk transactions (bonuses, limits, PSP routing).
Immutability: WORM/immutable storage, signatures, retention by policy.
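The privacy filter for structured logs mentioned above might look like the sketch below. The set of sensitive keys and the card-number pattern are assumptions for illustration; a real filter would take its rules from the telemetry policy and would typically run in the collector pipeline as well.

```python
import re

# Hypothetical field names treated as PII in this sketch.
SENSITIVE_KEYS = {"email", "phone", "card_number", "document_number"}
# Runs of 13-19 digits are treated as possible card numbers.
CARD_RE = re.compile(r"\b\d{13,19}\b")


def mask_card(match):
    """Keep only the last four digits of a card-like number."""
    digits = match.group(0)
    return "*" * (len(digits) - 4) + digits[-4:]


def scrub(event):
    """Return a copy of a structured log event with PII masked."""
    clean = {}
    for key, value in event.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = scrub(value)          # recurse into nested context
        elif isinstance(value, str):
            clean[key] = CARD_RE.sub(mask_card, value)
        else:
            clean[key] = value
    return clean
```

Masking by key name handles known fields, while the pattern pass catches card numbers that leak into free-text messages.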
8) Condition control (health)
Liveness/Readiness/Startup: correct probes (do not check external dependencies in liveness).
Degraded-mode: explicit service degradation flags so that alerts and the status page are consistent.
Budget health: error-budget burn rate (fast/slow windows), headroom for resources and queues.
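The probe rule above can be illustrated with a small sketch. The `HealthState` structure and its field names are hypothetical; the point is that liveness looks only at the process itself, while readiness is where external dependencies such as a PSP are checked, so a provider outage removes the instance from rotation instead of restarting it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HealthState:
    started: bool = False            # set once initialization has finished
    internal_ok: bool = True         # e.g. main loop is not deadlocked
    # Named checks for external dependencies (PSP, KYC, database, ...).
    dependency_checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)


def startup_probe(state):
    """200 once initialization is complete, 503 before that."""
    return 200 if state.started else 503


def liveness_probe(state):
    """Process-internal health only; external dependencies are NOT checked here."""
    return 200 if state.internal_ok else 503


def readiness_probe(state):
    """Ready to take traffic only if started and all dependencies respond."""
    if not state.started:
        return 503
    all_ok = all(check() for check in state.dependency_checks.values())
    return 200 if all_ok else 503
```

If the PSP check fails, readiness returns 503 while liveness stays at 200, which avoids restart loops during a provider outage.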
9) Alerting and early warning
SLO alerts: based on error-budget burn (4-hour and 1-hour windows) instead of the "raw" p95.
Anomalies: STL/IQR/online detectors for 5xx bursts and for drops in PSP authorizations in a particular GEO/bank.
Root-cause hints: link alerts to the latest releases/feature flags/planned work.
Runbooks: each alert has links to a playbook, graphs, "quick checks."
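A burn-rate alert over two windows, as described above, can be sketched like this. Burn rate is taken here as the observed error ratio divided by the error budget (1 minus the SLO target); the function names and the threshold of 6.0 are assumptions for illustration, not recommended values.

```python
def burn_rate(errors, total, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target, threshold):
    """Page only if both the long and the short window exceed the threshold.

    Each window is an (errors, total) tuple. The long window shows the burn
    is significant; the short window shows it is still happening.
    """
    long_burn = burn_rate(long_window[0], long_window[1], slo_target)
    short_burn = burn_rate(short_window[0], short_window[1], slo_target)
    return long_burn >= threshold and short_burn >= threshold


# Example: 99.9% SLO, 4-hour window with 800 errors out of 100,000 requests,
# 1-hour window with 300 errors out of 25,000 requests.
page = should_page((800, 100_000), (300, 25_000), slo_target=0.999, threshold=6.0)
```

Requiring both windows suppresses pages for bursts that have already ended, which a raw p95 or error-count alert would still fire on.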
10) Dashboards (who sees what)
Exec: uptime/SLO, burn-rate, successful deposits/bets, provider status, capacity forecast and $/RPS.
SRE/platform: RED/USE by service, queues/lag, pool-usage, replication lag, CDN/WAF, eBPF profiles.
Payments/Risk: authorization success by PSP/bank/GEO, soft/hard declines, KYC time, early chargeback signals.
Support/CS: incident status panel, response SLAs, FAQ macros.
11) FinOps-Observability
Retention: 7-14 days for "raw" traces, longer for aggregates; selectively for hot services.
Sampling/aggregation: dynamic sampling by anomaly, downsampling of old series.
Ingest policies: cut off noise (health pings, redundant logs), quotas for high-cardinality metrics.
Cost KPIs: $/GB ingest, $/trace, $/SLI dashboard; periodic reviews of top consumers.
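Downsampling of old series, mentioned above, can be sketched as a roll-up of raw points into fixed time buckets. The function name and the choice to keep both the average and the maximum per bucket are assumptions for illustration; keeping the maximum preserves spikes that an average alone would smooth away.

```python
def downsample(points, bucket_seconds):
    """Roll up (unix_ts, value) points into (bucket_start, avg, max) tuples.

    Buckets are aligned to multiples of bucket_seconds and returned in
    chronological order.
    """
    buckets = {}
    for ts, value in points:
        start = ts - ts % bucket_seconds
        buckets.setdefault(start, []).append(value)
    return [
        (start, sum(values) / len(values), max(values))
        for start, values in sorted(buckets.items())
    ]


# Example: three raw points rolled up into one-minute buckets.
raw = [(0, 1.0), (10, 3.0), (60, 5.0)]
rolled = downsample(raw, 60)
```

Applied to series older than the "hot" retention window, a roll-up like this trades resolution for a predictable storage cost.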
12) Privacy and compliance
PII/Finance: masking, tokenization, data minimization in telemetry.
Data localization: storage and processing by jurisdiction; log export only through an approved workflow with encryption and TTL.
Audit of access to telemetry: RBAC/ABAC, SoD for exports, query log.
13) Integration with incident management and releases
Status page: automatic update feed from the incident card.
Release gate: SLI canary analysis, automatic release halt when burn rate > threshold.
Post-mortem: timeline from traces/logs, actual SLIs and violation windows.
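The release gate above can be sketched as a function that compares canary SLIs with the baseline and returns a decision. The `SliWindow` structure, the decision names, and all thresholds (maximum burn rate of 2.0, 20% latency regression, 500 requests minimum) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    errors: int
    total: int
    p95_ms: float


def gate_decision(baseline, canary, slo_target,
                  max_burn_rate=2.0, max_latency_ratio=1.2, min_requests=500):
    """Return "wait", "rollback" or "promote" for a canary release."""
    # Not enough canary traffic yet to judge.
    if canary.total < min_requests:
        return "wait"
    # Stop the release if the canary burns the error budget too fast.
    burn = (canary.errors / canary.total) / (1.0 - slo_target)
    if burn > max_burn_rate:
        return "rollback"
    # Stop the release if canary latency regresses against the baseline.
    if canary.p95_ms > baseline.p95_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Returning an explicit "wait" state keeps the gate from promoting or rolling back on a handful of requests.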
14) Implementation practice (8-12 weeks)
Weeks 1-2: inventory of critical paths and SLIs; stack selection (OTel, TSDB, logs, traces); dependency map.
Weeks 3-4: OTel implementation in 3-5 key services (login/deposit/bet), basic RED/USE, trace context in logs.
Weeks 5-6: SLO and burn-rate alerts; synthetics for PSP/KYC; the first runbooks; RUM for web/mobile.
Weeks 7-8: dynamic sampling, exemplars, service map; Exec/SRE/Payments dashboards.
Weeks 9-10: eBPF/hot bottleneck profiling; privacy filters; quotas/retention.
Weeks 11-12: release gates and auto-rollback by SLI; integration with the status page; tabletop exercises.
15) Artifact patterns
SLO-card of the service: SLI, goals, windows, error budget, alerts, owners.
Alert Spec: metric/condition, thresholds, dedup/silence, recipients, runbook.
Dashboard Spec: audience, questions, 6-8 widgets, data source, refresh rate.
Telemetry Policy: what fields are allowed/prohibited, retention, masking, export.
Cost Review Pack: top series/log streams, proposed sampling/TTL changes, expected savings.
16) KPIs of the observability function
MTTA/MTTR (improvement after SLO-alerting implementation).
% of incidents detected by synthetics/SLI prior to user complaints.
The proportion of releases that passed the gate via SLI without manual intervention.
Decrease in telemetry cost ($/RPS) while maintaining diagnostic capability.
Trace coverage of critical paths (> 90%).
Accuracy of correlation "status update ↔ actual SLIs."
17) Antipatterns
"Log everything" → an explosion of cost and noise.
Alerts on "raw" metrics instead of SLO/burn-rate → pager-fatigue.
High cardinality of metrics (userId) → TSDB storms.
Traces without business context (PSP/bank/GEO) → no insight.
Observability not tied to releases/incidents → telemetry lives in isolation.
Summary
Observability and condition control are not a set of tools but a managed system: correct SLIs/SLOs → standardized telemetry and correlation → SLO alerts and runbooks → integration with releases and status communication → cost-aware operation and privacy. Such a loop gives early signals, fast RCA and business resilience even during extreme traffic peaks.