Observability and health monitoring

1) Goals and principles

Goal: understand what is happening and why, in real time, so that incidents are prevented or resolved quickly without violating SLOs or inflating OPEX.
Principles: SLO-first; the "golden signals" (latency, traffic, errors, saturation); a single telemetry standard (OpenTelemetry); minimally sufficient detail; explainability; cost-aware observability.

2) Observability layers

1. Metrics: aggregates for SLI/SLO, capacity and trends (RED/USE models).
2. Traces: causal chains of requests, payment and game transactions.
3. Logs/events: detailed context and audit of operator/service actions.
4. Synthetics (black-box): external API/web path checks, PSP/KYC health pings.
5. RUM (real user monitoring): front-end metrics (TTFB, LCP, JS errors), sliced by geo/device.
6. Low-level telemetry: eBPF, CPU/IO/allocation profiling, network latency percentiles.

3) SLI set and golden signals

Latency: p50/p95/p99 on critical paths (login, deposit, bet, withdrawal).
Errors: share of 5xx/timeouts/declines (normalized by provider/bank).
Traffic/Throughput: RPS/TPS, active sessions, events/sec.
Saturation: CPU/RAM/IO load, queue depth, pool usage, replication lag.
Business SLIs: % of successful deposits/bets per window, deviations in KYC/PSP conversion, chargeback share.
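
As an illustration of how the latency and error SLIs above could be derived from raw request samples, here is a minimal sketch. The `RequestSample` record, the `sli_for_path` helper, and the nearest-rank percentile method are assumptions made for the example; a production setup would normally compute these from histograms in the TSDB rather than from individual samples.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    path: str              # critical path name, e.g. "deposit" (hypothetical)
    latency_ms: float
    status: int            # HTTP status code
    timed_out: bool = False


def percentile(values, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def sli_for_path(samples, path):
    """Latency percentiles and error share for one critical path."""
    rows = [s for s in samples if s.path == path]
    if not rows:
        raise ValueError(f"no samples for path {path!r}")
    latencies = [s.latency_ms for s in rows]
    failed = sum(1 for s in rows if s.status >= 500 or s.timed_out)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_share": failed / len(rows),
    }
```

Computing the SLIs per path rather than per service keeps a slow but rarely used endpoint from hiding behind a fast, busy one.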

4) Telemetry architecture

Standardized ingestion: OpenTelemetry SDK/Collector → normalization, sampling, privacy filters → storage (TSDB, traces, logs).
Correlation: trace-id/span-id in logs and metrics (exemplars); a single correlation-id for payment and gaming events.
Topology: service graph, including dependent external providers with live SLIs.
Cost management: retention tiers, aggregation, dynamic sampling, "hot"/"cold" storage classes.
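
The correlation idea can be sketched with the Python standard library alone: every log record carries the current trace-id and correlation-id, so a log line can be joined to its trace. The names `trace_id_var`, `correlation_id_var`, and `build_logger` are hypothetical; in a real deployment the trace-id would come from the OpenTelemetry context rather than from a hand-set variable.

```python
import contextvars
import json
import logging

# Hypothetical context variables; a tracing SDK would normally own these values.
trace_id_var = contextvars.ContextVar("trace_id", default=None)
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)


class CorrelationFilter(logging.Filter):
    """Copies the current correlation identifiers onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        record.correlation_id = correlation_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Renders a record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
        })


def build_logger(stream):
    logger = logging.getLogger("correlation-demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    return logger
```

Because the identifiers live in context variables, application code logs as usual and does not have to pass them to every call.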

5) Metrics: Design and cardinality

Rules: keep the number of labels small and ban high-cardinality labels (userId, sessionId) in time series; such details belong only in traces/logs.
RED/USE: Requests-Errors-Duration for APIs; Utilization-Saturation-Errors for infrastructure.
Exemplars: link high percentiles to specific example traces.
Business metrics: $/RPS, conversion by PSP/bank/GEO, provider resiliency.
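
One way to enforce the label rules is a small guard in front of metric registration. The sketch below is an assumption about how such a guard might look: the allowlist contents, the `CardinalityGuard` class, and the per-metric series quota are all illustrative and not part of any particular metrics library.

```python
# Illustrative allowlist and denylist; real values come from the telemetry policy.
ALLOWED_LABELS = frozenset({"service", "route", "method", "status_class", "psp", "geo"})
FORBIDDEN_LABELS = frozenset({"userId", "sessionId"})


class CardinalityGuard:
    """Rejects forbidden labels and caps the number of distinct series per metric."""

    def __init__(self, max_series_per_metric):
        self.max_series = max_series_per_metric
        self._series = {}  # metric name -> set of label tuples already admitted

    def admit(self, metric, labels):
        forbidden = labels.keys() & FORBIDDEN_LABELS
        if forbidden:
            raise ValueError(f"high-cardinality labels not allowed: {sorted(forbidden)}")
        unknown = labels.keys() - ALLOWED_LABELS
        if unknown:
            raise ValueError(f"labels outside the allowlist: {sorted(unknown)}")
        key = tuple(sorted(labels.items()))
        seen = self._series.setdefault(metric, set())
        if key in seen:
            return True
        if len(seen) >= self.max_series:
            return False  # quota exhausted: the caller should drop or aggregate
        seen.add(key)
        return True
```

Raising on forbidden labels surfaces the mistake during development, while the quota returns `False` so that a production process keeps running when a label value set grows unexpectedly.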

6) Tracing: Depth and Sampling

Context: propagate the trace context end to end: front end → API → brokers → processors → databases/PSP.
Sampling: a 1-10% baseline, raised dynamically by rule when anomalies occur (tail-based).
Focus: payment flow (init → auth → capture/settle), game transactions (bet → settle), KYC (init → verify).
Annotations: PSP response code, bank BIN/issuer category, region, risk score.
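
A tail-based sampling decision can be sketched as a function that runs once a trace has completed: errors and slow traces are always kept, and the rest are kept at the baseline rate. The `TraceSummary` type, the 2000 ms slow threshold, and the 5% default are assumptions for illustration; the hash-based bucket simply makes the decision deterministic per trace-id.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace, base_rate=0.05, slow_ms=2000.0):
    """Tail-based decision, made after the whole trace is known."""
    # Anomalous traces are always retained.
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # Everything else is sampled at the baseline rate, deterministically by trace-id.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(base_rate * 2**64)
```

Hashing the trace-id instead of calling a random generator means every collector instance reaches the same verdict for the same trace.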

7) Logs and audits

Structured logs: JSON, with the log level set per profile (INFO in production, DEBUG when debugging).
Privacy filters: PII masking; raw KYC documents are prohibited in logs.
Audit events: who/what/where/when/why, ticket ID, before/after values for high-risk operations (bonuses, limits, PSP routing).
Immutability: WORM/immutable storage, signatures, retention by policy.
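
To make the signature requirement for audit events concrete, the sketch below chains each event to the previous one with an HMAC, so that editing any earlier entry invalidates everything after it. This is one possible technique, not necessarily the one intended here; the function names and the event fields are hypothetical, and key management and WORM storage are out of scope.

```python
import hashlib
import hmac
import json


def _sign(prev_signature, event, key):
    payload = json.dumps(event, sort_keys=True)
    return hmac.new(key, (prev_signature + payload).encode("utf-8"), hashlib.sha256).hexdigest()


def append_audit_event(chain, event, key):
    """Appends an event whose signature also covers the previous entry's signature."""
    prev = chain[-1]["signature"] if chain else ""
    chain.append({"event": event, "prev": prev, "signature": _sign(prev, event, key)})


def verify_chain(chain, key):
    """Returns True only if no entry was altered, removed, or reordered."""
    prev = ""
    for entry in chain:
        expected = _sign(prev, entry["event"], key)
        if entry["prev"] != prev or not hmac.compare_digest(expected, entry["signature"]):
            return False
        prev = entry["signature"]
    return True
```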

8) Health monitoring

Liveness/Readiness/Startup: correctly scoped probes (do not check external dependencies in liveness).
Degraded mode: explicit service degradation flags, so that alerts and the status page stay consistent.
Budget health: error-budget burn rate (fast/slow windows), headroom for resources and queues.
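
The probe split can be sketched as follows: liveness only reports that the process can respond, readiness depends on required dependencies, and a failing optional dependency raises a degraded flag instead of taking the instance out of rotation. The `HealthChecks` class and the required/optional distinction are assumptions for the example, not a specific framework's API.

```python
class HealthChecks:
    """Each check is a zero-argument callable returning True when healthy."""

    def __init__(self, required, optional=None):
        self.required = required
        self.optional = optional or {}
        self.started = False  # set to True once initialization has finished

    def startup(self):
        return self.started

    def liveness(self):
        # Deliberately ignores external dependencies: a PSP outage must not
        # cause the orchestrator to restart an otherwise healthy process.
        return True

    @staticmethod
    def _failed(checks):
        failed = []
        for name, check in checks.items():
            try:
                ok = bool(check())
            except Exception:
                ok = False  # a check that raises counts as unhealthy
            if not ok:
                failed.append(name)
        return failed

    def readiness(self):
        required_failed = self._failed(self.required)
        optional_failed = self._failed(self.optional)
        return {
            "ready": self.started and not required_failed,
            "degraded": bool(optional_failed),
            "failed": required_failed + optional_failed,
        }
```

Exposing `degraded` as an explicit field gives alerts and the status page one shared source for the same state.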

9) Alerting and early warning

SLO alerts: based on error-budget burn (4-hour and 1-hour windows) instead of "raw" p95.
Anomalies: STL/IQR/online detectors for 5xx bursts and for drops in PSP authorizations in a particular GEO/bank.
Root-cause hints: link alerts to the latest releases, feature flags, and planned work.
Runbooks: each alert links to a playbook, graphs, and "quick checks."
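
A burn-rate alert over two windows can be sketched in a few lines. Burn rate is the observed error share divided by the error share the SLO allows, so a value of 1 means the budget is being spent exactly on schedule. The function names and the threshold passed by the caller are assumptions; each window is given as an (errors, total) pair.

```python
def burn_rate(errors, total, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target  # allowed error share, e.g. 0.001 for a 99.9% SLO
    return (errors / total) / budget


def should_alert(long_window, short_window, slo_target, threshold):
    """Fires only when both windows burn faster than the threshold."""
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

Requiring both windows means the longer one filters out brief spikes, while the shorter one confirms the problem is still ongoing and lets the alert clear soon after recovery.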

10) Dashboards (who sees what)

Exec: uptime/SLO, burn rate, successful deposits/bets, provider status, capacity forecast and $/RPS.
SRE/platform: RED/USE by service, queues/lag, pool usage, replication lag, CDN/WAF, eBPF profiles.
Payments/Risk: authorization success by PSP/bank/GEO, soft/hard declines, KYC time, early chargeback signals.
Support/CS: incident status panel, response SLAs, FAQ macros.

11) FinOps-Observability

Retention: 7-14 days for "raw" traces, longer for aggregates; selective handling for hot services.
Sampling/aggregation: dynamic sampling driven by anomalies, downsampling of old series.
Ingest policies: cut off noise (health pings, redundant logs), quotas for high-cardinality metrics.
Cost KPIs: $/GB ingested, $/trace, $/SLI dashboard; periodic reviews of the top consumers.
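
Downsampling of old series can be illustrated with a simple mean per fixed-size bucket. The `downsample` function and the (timestamp, value) tuple layout are assumptions for the sketch; averaging suits gauges and rates, but percentile series should be rebuilt from histograms rather than averaged.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Collapses (timestamp_seconds, value) points into one mean per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)  # key is the bucket start time
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

For example, four points at 30-second spacing reduced to 60-second buckets become two points, halving the storage for that series.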

12) Privacy and compliance

PII/financial data: masking, tokenization, data minimization in telemetry.
Data localization: storage and processing by jurisdiction; log export only through an approved workflow with encryption and TTL.
Access auditing for telemetry: RBAC/ABAC, SoD for exports, query log.
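
Masking and minimization can be sketched as a scrubbing step applied to telemetry attributes before they leave the service. The field names in `PROHIBITED_FIELDS`, the `[email]` placeholder, and both regular expressions are illustrative assumptions; the patterns are intentionally simple and would need tuning against real data to limit false positives and misses.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Hypothetical field names that must never reach telemetry.
PROHIBITED_FIELDS = frozenset({"kyc_document", "password", "cvv"})


def _mask_card(match):
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]  # keep only the last four digits


def scrub(attributes):
    """Returns a copy with prohibited fields dropped and PII in strings masked."""
    clean = {}
    for key, value in attributes.items():
        if key in PROHIBITED_FIELDS:
            continue
        if isinstance(value, str):
            value = EMAIL_RE.sub("[email]", value)
            value = CARD_RE.sub(_mask_card, value)
        clean[key] = value
    return clean
```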

13) Integration with incident management and releases

Status page: automatic update feed from the incident card.
Release gate: canary analysis on SLIs; the release stops automatically when the burn rate exceeds the threshold.
Post-mortem: timeline from traces/logs, actual SLIs and violation windows.
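
The release gate can be reduced to a small decision function evaluated against the canary's traffic. The sketch below is an assumption about its shape: `canary_gate`, the three return values, and the `min_requests` guard (which avoids judging a canary on too little traffic) are illustrative.

```python
def canary_gate(canary_errors, canary_total, slo_target, max_burn_rate, min_requests=500):
    """Returns "wait", "stop", or "promote" for the current canary window."""
    if canary_total < min_requests:
        return "wait"  # not enough traffic for a meaningful error share
    burn = (canary_errors / canary_total) / (1.0 - slo_target)
    return "stop" if burn > max_burn_rate else "promote"
```

A fuller gate would typically also compare the canary against the stable baseline, so that a platform-wide problem is not blamed on the release.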

14) Implementation practice (8-12 weeks)

Weeks 1-2: inventory of critical paths and SLIs; stack selection (OTel, TSDB, logs, traces); dependency map.
Weeks 3-4: OTel rollout in 3-5 key services (login/deposit/bet), basic RED/USE, trace context in logs.
Weeks 5-6: SLOs and burn-rate alerts; synthetics for PSP/KYC; the first runbooks; RUM for web/mobile.
Weeks 7-8: dynamic sampling, exemplars, service map; Exec/SRE/Payments dashboards.
Weeks 9-10: eBPF profiling of hot bottlenecks; privacy filters; quotas/retention.
Weeks 11-12: release gates and auto-rollback by SLI; integration with the status page; tabletop exercises.

15) Artifact templates

Service SLO card: SLIs, targets, windows, error budget, alerts, owners.
Alert Spec: metric/condition, thresholds, dedup/silence, recipients, runbook.
Dashboard Spec: audience, questions, 6-8 widgets, data source, refresh rate.
Telemetry Policy: which fields are allowed/prohibited, retention, masking, export.
Cost Review Pack: top series/log streams, proposed sampling/TTL, expected savings.

16) KPIs of the observability function

MTTA/MTTR (improvement after introducing SLO alerting).
% of incidents detected by synthetics/SLIs before user complaints.
Share of releases that passed the SLI gate without manual intervention.
Reduction in telemetry cost per RPS ($/RPS) without loss of diagnostic capability.
Trace coverage of critical paths (> 90%).
Accuracy of the correlation between status updates and actual SLIs.

17) Antipatterns

"Log everything" → an explosion of cost and noise.
Alerts on "raw" metrics instead of SLO/burn-rate → pager-fatigue.
High-cardinality metrics (userId) → TSDB storms.
Traces without business context (PSP/bank/GEO) → no insight.
No link between observability and releases/incidents → telemetry lives in isolation.

Summary

Observability and health monitoring are not a set of tools but a managed system: correct SLIs/SLOs → standardized telemetry and correlation → SLO alerts and runbooks → integration with releases and status communication → cost-aware operation and privacy. This loop provides early signals, fast RCA, and business resilience even during extreme traffic peaks.
