Distributed Tracing: OpenTelemetry
1) Why OTel and what it provides
OpenTelemetry (OTel) is an open standard and a set of SDKs, agents, and collectors for telemetry (traces, metrics, logs), with OTLP as the single wire protocol. Goals:
- End-to-end visibility of request paths (gateway → services → DB/cache/queues).
- Fast RCA of degradations and debugging of releases (canary/blue-green).
- A link to SLOs and automatic rollbacks (data-driven operational decisions).
- Vendor neutrality: export to any backend without being tied to a single APM.
Core principles: standardize, sample smart, secure by default, correlate everything.
2) Basics: context, spans, attributes
Trace - a tree/graph of calls; Span - a single operation (RPC, SQL query, queue call).
Span Kind: `SERVER`, `CLIENT`, `PRODUCER`, `CONSUMER`, `INTERNAL`.
W3C Trace Context: the `traceparent` and `tracestate` headers; context is propagated from service to service.
Attributes - key-value pairs (keep cardinality low!), Events - timestamped marks, Status - code plus error description.
Links - relationships between spans (important for async/fan-out/fan-in).
Span naming:
- HTTP: `HTTP {METHOD}` (with the route as an attribute, e.g. `GET /withdraw`)
- DB: `DB SELECT` / `DB INSERT`
- Queue: `QUEUE publish topic=X` / `QUEUE consume topic=X`
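To make the role of the `traceparent` header more concrete, here is a minimal standard-library sketch that parses a version `00` header and derives the header for an outgoing call. `SpanContext`, `parse_traceparent`, and `child_traceparent` are hypothetical helper names, not part of any OTel SDK; in a real service the SDK propagators do this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00": 2 hex - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    """Hypothetical container for the fields carried by traceparent."""
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Simplification: this sketch only accepts version 00.
    if version != "00":
        return None
    # All-zero identifiers are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Build the header for a downstream call: same trace, new span ID."""
    new_span_id = secrets.token_hex(8)
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

The trace ID stays constant across every hop while each hop contributes a fresh span ID, which is what lets a backend reassemble the call tree.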
3) Semantic conventions (semconv)
Use stable attribute schemas:
- HTTP/gRPC: `http.method`, `http.route`, `http.status_code`, `url.full`.
- DB: `db.system=postgresql`, `db.statement` (sanitized/truncated only!), `db.name`.
- Messaging: `messaging.system=kafka`, `messaging.operation=receive`, `messaging.destination`.
- Cloud/K8s/Host: `cloud.region`, `k8s.pod.name`, `container.id`.
- Resource attributes (required): `service.name`, `service.version`, `deployment.environment`.
Pin the schema version via `schemaUrl` in the SDK/Collector resources.
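A minimal sketch of a pre-deployment check for the required resource attributes, assuming resources are available as plain dictionaries. The function name `check_resource` and the key-shape regular expression are illustrative assumptions, not an official semconv validator.

```python
import re
from typing import Dict, List

REQUIRED_RESOURCE_KEYS = ("service.name", "service.version", "deployment.environment")

# Assumption: attribute keys are lowercase, dot-separated namespaces
# such as "k8s.pod.name"; anything else is reported for review.
_KEY_SHAPE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z0-9_]+)*$")


def check_resource(resource: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_RESOURCE_KEYS:
        if not str(resource.get(key, "")).strip():
            problems.append(f"missing required attribute: {key}")
    for key in resource:
        if not _KEY_SHAPE.match(key):
            problems.append(f"non-conventional key: {key}")
    return problems
```

A check like this could run in CI against each service's configuration so that missing or inconsistently named attributes are caught before they fragment dashboards.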
4) Sampling: head, tail, adaptive
Head-based (in the SDK): decides up front and cheaply; good for high QPS, but can miss the "interesting" traces.
Tail-based (in the Collector): decides after the trace has completed; allows rules based on status, latency, and attributes.
Adaptive/dynamic: raises the sampling rate when errors or p95 latency grow.
Production-grade recipe: head sampling at 1-5% globally plus tail sampling that keeps what matters: `status = ERROR`, `latency > p95`, "money routes", PSP/KYC errors.
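The difference between the two decision points can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the SDK or Collector implementation: `head_sample`, `FinishedTrace`, `tail_decision`, the 800 ms threshold, and the route set are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head decision: derived from the trace ID only, so every service
    that sees the same trace ID reaches the same verdict."""
    value = int(trace_id_hex[-16:], 16)      # lower 64 bits as an integer
    return value < int(ratio * (1 << 64))    # keep roughly `ratio` of IDs


@dataclass
class FinishedTrace:
    """Hypothetical summary of a trace once all of its spans have arrived."""
    trace_id: str
    has_error: bool
    duration_ms: float
    routes: List[str]


IMPORTANT_ROUTES = {"/withdraw", "/deposit"}  # assumed "money routes"


def tail_decision(trace: FinishedTrace,
                  slow_ms: float = 800.0,
                  baseline_ratio: float = 0.05) -> Optional[str]:
    """Tail decision: return the name of the first matching policy,
    or None if the trace should be dropped."""
    if trace.has_error:
        return "errors"
    if trace.duration_ms >= slow_ms:
        return "slow_traces"
    if IMPORTANT_ROUTES.intersection(trace.routes):
        return "important_routes"
    if head_sample(trace.trace_id, baseline_ratio):
        return "baseline_prob"
    return None
```

The head decision needs nothing but the trace ID, which is why it is cheap; the tail decision needs the finished trace, which is why it has to wait and buffer.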
5) Correlation: metrics, logs, traces
Exemplars: `trace_id` references attached to metric histogram samples (a quick jump from a metric to a trace).
Logs: add `trace_id` / `span_id` so that you can jump from a log line to its trace.
SpanMetrics (processor): aggregates RED metrics (`requests, errors, duration`) from traces for SLOs and alerts.
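One way to get `trace_id` / `span_id` into every log line is a logging filter that reads the active context. The sketch below uses only the standard library and a hypothetical `current_trace` context variable; with the OTel SDK the IDs would instead come from the active span.

```python
import contextvars
import json
import logging

# Hypothetical holder for the active (trace_id, span_id) pair.
current_trace = contextvars.ContextVar("current_trace", default=("", ""))


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_trace.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log backend can index the IDs."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger (name is illustrative) that writes JSON lines to `stream`."""
    logger = logging.getLogger("payments")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

With the IDs present as structured fields, a log viewer can turn each line into a direct link to the corresponding trace.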
6) Deployment architecture
Agent (DaemonSet): runs on every node, collects telemetry from applications (OTLP), and forwards it onward.
Gateway (Cluster/Region): a central Collector (multiple replicas) with routing/sampling/enrichment pipelines.
OTLP: gRPC `4317`, HTTP `4318`; enable TLS/mTLS.
Benefits of "agent + gateway": isolation, buffering, local backpressure, and a simpler network topology.
7) OpenTelemetry Collector - base template (gateway)
```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  memory_limiter: { check_interval: 5s, limit_percentage: 75 }
  batch: { timeout: 2s, send_batch_size: 8192 }
  attributes:
    actions:
      - { key: deployment.environment, action: upsert, value: prod }
  resource:
    attributes:
      - { key: service.namespace, action: upsert, value: core }
  # tail_sampling ships with the Collector contrib distribution
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_traces
        type: latency
        latency: { threshold_ms: 800 }
      - name: important_routes
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/withdraw", "/deposit"]
      - name: baseline_prob
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp/apm:
    endpoint: apm-backend:4317
    tls: { insecure: true }   # plaintext for a demo only; enable TLS/mTLS in production
  prometheus:
    endpoint: 0.0.0.0:9464

extensions:
  health_check: {}
  pprof: { endpoint: 0.0.0.0:1777 }
  zpages: { endpoint: 0.0.0.0:55679 }

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    # memory_limiter goes first; batch goes after the sampling decision
    traces: { receivers: [otlp], processors: [memory_limiter, attributes, resource, tail_sampling, batch], exporters: [otlp/apm] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
    # A pipeline needs at least one exporter; enable this once a log backend is configured.
    # logs: { receivers: [otlp], processors: [batch], exporters: [] }
```
8) SpanMetrics and RED for SLOs
Add the processor:
```yaml
# Newer Collector releases provide spanmetrics as a connector; adapt this fragment to the version you run.
processors:
  spanmetrics:
    metrics_exporter: prometheus
    histogram:
      explicit:
        buckets: [50ms, 100ms, 200ms, 400ms, 800ms, 1600ms, 3200ms]
service:
  pipelines:
    # spanmetrics runs before sampling so that the metrics cover all spans
    traces: { receivers: [otlp], processors: [spanmetrics, tail_sampling, batch], exporters: [otlp/apm] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
```
`traces_spanmetrics_calls{service, route, code}` and `duration_bucket` are now available for SLOs and alerts.
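To show what the RED aggregation amounts to, the following standard-library sketch folds finished spans into per-route call counts, error counts, and a duration histogram. It is an illustration only: `SpanRecord`, `Red`, and `aggregate` are hypothetical names, and the bucket bounds are assumed to be the millisecond values from the configuration above.

```python
import bisect
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

BUCKETS_MS = [50, 100, 200, 400, 800, 1600, 3200]  # upper bounds ("le")


@dataclass
class SpanRecord:
    """Hypothetical minimal view of a finished SERVER span."""
    service: str
    route: str
    duration_ms: float
    is_error: bool


@dataclass
class Red:
    calls: int = 0
    errors: int = 0
    # One slot per bucket plus a final +Inf slot; counts are per bucket,
    # not cumulative as in the Prometheus exposition format.
    bucket_counts: List[int] = field(
        default_factory=lambda: [0] * (len(BUCKETS_MS) + 1))


def aggregate(spans: Iterable[SpanRecord]) -> Dict[Tuple[str, str], Red]:
    """Fold spans into RED metrics keyed by (service, route)."""
    out: Dict[Tuple[str, str], Red] = defaultdict(Red)
    for span in spans:
        red = out[(span.service, span.route)]
        red.calls += 1
        red.errors += int(span.is_error)
        # bisect_left finds the first bound >= duration, i.e. "le" semantics.
        red.bucket_counts[bisect.bisect_left(BUCKETS_MS, span.duration_ms)] += 1
    return dict(out)
```

Keeping the key to low-cardinality dimensions such as service and route is what makes the resulting series safe to store in a metrics backend.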
9) K8s: Collector (DaemonSet + Deployment)
Agent (DaemonSet) fragment:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: otel-agent, namespace: observability }
spec:
  # selector, pod labels, and the config volume are omitted in this fragment
  template:
    spec:
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector:latest
          args: ["--config=/conf/agent.yaml"]
          ports:
            - { containerPort: 4317, name: otlp-grpc }
            - { containerPort: 4318, name: otlp-http }
```
Gateway (Deployment): several replicas, a ClusterIP Service or Ingress, and an HPA driven by CPU/QPS.
10) Security and privacy
TLS/mTLS between SDK → Agent → Gateway → Backend.
Authentication at the Gateway ingress (Basic/OAuth/headers); restrict the allowed origins.
PII redaction: filter/mask attributes (`user.email`, keys under `card.`) in a Collector processor.
Limits: cap event size and attribute count in the SDK (protection against cardinality blowups).
RBAC in the backend plus separate namespaces for projects/tenants.
```yaml
processors:
  attributes/redact:
    actions:
      - key: user.email
        action: delete
      - key: payment.card
        action: delete
```
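The same redaction idea can be sketched in plain Python, which is handy for unit-testing a policy before encoding it in Collector configuration. This is an assumption-laden illustration: `redact`, the prefix list, and the e-mail pattern are hypothetical and deliberately simple.

```python
import re
from typing import Any, Dict

DELETE_KEYS = {"user.email", "payment.card"}
DELETE_PREFIXES = ("card.",)  # assumed: drop every key under this prefix

# Deliberately loose pattern; a real scanner would be stricter.
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `attrs` with PII keys removed and e-mail
    addresses inside remaining string values masked."""
    clean: Dict[str, Any] = {}
    for key, value in attrs.items():
        if key in DELETE_KEYS or key.startswith(DELETE_PREFIXES):
            continue
        if isinstance(value, str):
            value = _EMAIL_RE.sub("[redacted-email]", value)
        clean[key] = value
    return clean
```

Deleting by key covers the known fields, while the value scan is a second line of defence for PII that leaks into free-text attributes.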
11) Instrumentation: quick starts
Node.js
```js
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes as R } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: "http://otel-agent.observability:4317" }),
  resource: new Resource({
    [R.SERVICE_NAME]: "payments-api",
    [R.SERVICE_VERSION]: "1.14.2",
    [R.DEPLOYMENT_ENVIRONMENT]: "prod",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
Java (Spring)
```yaml
# Gradle dependency: io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter
# application.yml
otel:
  service:
    name: orders-api
  exporter:
    otlp:
      endpoint: http://otel-agent.observability:4317
  traces:
    sampler: parentbased_traceidratio
    sampler-arg: 0.05
```
Python (FastAPI)
```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "fraud-scoring", "deployment.environment": "prod"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-agent.observability:4317", insecure=True)))
trace.set_tracer_provider(provider)
```
Go
```go
exp, _ := otlptracegrpc.New(ctx, otlptracegrpc.WithEndpoint("otel-agent.observability:4317"), otlptracegrpc.WithInsecure())
res := resource.NewWithAttributes(semconv.SchemaURL, semconv.ServiceNameKey.String("gateway"), semconv.DeploymentEnvironmentKey.String("prod"))
tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp), sdktrace.WithResource(res), sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05))))
otel.SetTracerProvider(tp)
```
12) Async: queues, buses, cron
PRODUCER and CONSUMER spans are connected via `links`.
Propagate context in message headers (`traceparent` / `baggage`).
For batch consumption, create a span per message, or a single aggregated span with the `messaging.batch.size` attribute.
For cron/jobs: start a new trace for each run, plus links to the triggering events (if any).
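The producer/consumer relationship can be sketched without any SDK: the producer writes its context into the message headers, and a batch consumer starts its own trace and records one link per message. Everything here (`Span`, `publish`, `consume_batch`, the message shape) is a hypothetical model for illustration, not an OTel API.

```python
import secrets
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Span:
    """Hypothetical in-memory span; links hold traceparent strings."""
    name: str
    kind: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    links: List[str] = field(default_factory=list)


def publish(topic: str, payload: Any, parent: Span) -> Dict[str, Any]:
    """PRODUCER side: stay in the caller's trace and inject the context
    into the message headers."""
    span = Span(f"QUEUE publish topic={topic}", "PRODUCER", parent.trace_id)
    headers = {"traceparent": f"00-{span.trace_id}-{span.span_id}-01"}
    return {"topic": topic, "payload": payload, "headers": headers}


def consume_batch(topic: str, messages: List[Dict[str, Any]]) -> Span:
    """CONSUMER side: a batch may mix messages from many traces, so the
    batch span starts a new trace and links back to each producer."""
    span = Span(f"QUEUE consume topic={topic}", "CONSUMER",
                trace_id=secrets.token_hex(16))
    for message in messages:
        traceparent = message.get("headers", {}).get("traceparent")
        if traceparent:
            span.links.append(traceparent)
    return span
```

A parent/child relationship cannot express "this batch was caused by requests from several traces", which is the case links are meant for.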
13) Baggage and targeting
Keep a minimal set of stable keys (`tenant_id`, `region`, `vip_tier`) in baggage; PII is forbidden.
Forward them through the gateway / gateway logger so that metrics can later be aggregated per segment.
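A small allowlist around the `baggage` header is one way to enforce the "minimal stable keys, no PII" rule at the edge. The sketch below parses and rebuilds the header with the standard library only; `parse_baggage`, `build_baggage`, and the allowlist are illustrative assumptions, and member properties (the part after `;`) are simply dropped.

```python
from typing import Dict
from urllib.parse import quote, unquote

ALLOWED_KEYS = ("tenant_id", "region", "vip_tier")


def parse_baggage(header: str) -> Dict[str, str]:
    """Parse a W3C baggage header, keeping only allowlisted keys."""
    items: Dict[str, str] = {}
    for member in header.split(","):
        key, sep, value = member.partition("=")
        if not sep:
            continue                          # not a key=value member
        value = value.split(";", 1)[0]        # drop optional properties
        key = key.strip()
        if key in ALLOWED_KEYS:
            items[key] = unquote(value.strip())
    return items


def build_baggage(items: Dict[str, str]) -> str:
    """Serialize allowlisted keys back into a header, percent-encoding values."""
    return ",".join(
        f"{key}={quote(str(items[key]), safe='')}"
        for key in ALLOWED_KEYS if key in items
    )
```

Running incoming baggage through `parse_baggage` and forwarding only the output of `build_baggage` keeps unexpected keys from travelling further into the system.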
14) Integration with releases and SLO gating
Canary steps → check the metrics with the `traces_spanmetrics_` prefix per route/user segment.
On degradation (5xx/p95): auto-stop and rollback (Argo Rollouts AnalysisTemplate + PromQL).
Exemplars from the metrics lead straight to the "bad" traces within the release window.
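The gating logic behind a canary step can be reduced to a small pure function. The sketch below is a hedged illustration: `WindowStats`, `gate`, and every threshold (1% error budget, 20% p95 regression, 100-call minimum) are assumptions to be replaced by your own SLOs, and in practice the inputs would come from PromQL queries.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical RED summary for one deployment over one window."""
    calls: int
    errors: int
    p95_ms: float


def gate(baseline: WindowStats,
         canary: WindowStats,
         max_error_rate: float = 0.01,
         max_p95_regression: float = 1.2,
         min_calls: int = 100) -> str:
    """Return "wait", "rollback", or "promote" for the current canary step."""
    if canary.calls < min_calls:
        return "wait"                 # not enough traffic to judge
    if canary.errors / canary.calls > max_error_rate:
        return "rollback"             # error budget exceeded
    if baseline.p95_ms > 0 and canary.p95_ms > baseline.p95_ms * max_p95_regression:
        return "rollback"             # latency regressed vs. baseline
    return "promote"
```

Comparing the canary against a live baseline rather than a fixed number keeps the gate meaningful when overall traffic patterns shift during the rollout.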
15) Limits and performance
Set limits: `OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT`, `OTEL_SPAN_EVENT_COUNT_LIMIT`, `OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT`.
Sample exception stack traces by probability/frequency.
Use the batch processor in both the SDK and the Collector; keep a queue so that traces are not lost during bursts.
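What the attribute limits do can be illustrated with a short standard-library function that reads two of the environment variables named above. This is a sketch of the behaviour, not the SDK code: `apply_limits` is a hypothetical name, and the fallback of 128 attributes follows the OpenTelemetry specification default.

```python
import os
from typing import Any, Dict, Mapping, Tuple


def apply_limits(attrs: Dict[str, Any],
                 env: Mapping[str, str] = os.environ) -> Tuple[Dict[str, Any], int]:
    """Return (limited_attributes, dropped_count).

    Attributes beyond the count limit are dropped; string values longer
    than the length limit are truncated.
    """
    max_count = int(env.get("OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT", "128"))
    raw_len = env.get("OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT")
    max_len = int(raw_len) if raw_len else None   # unset means no truncation

    limited: Dict[str, Any] = {}
    dropped = 0
    for key, value in attrs.items():
        if len(limited) >= max_count:
            dropped += 1
            continue
        if max_len is not None and isinstance(value, str):
            value = value[:max_len]
        limited[key] = value
    return limited, dropped
```

Tracking the dropped count is useful in its own right: a steadily non-zero value usually points at instrumentation that attaches too much data per span.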
16) Compatibility and migration
Propagators: use W3C; during migration, also support reading B3/X-Ray (dual propagation).
Export: OTLP → APM (Jaeger/Tempo/Elastic/X-Ray, etc.).
Stable semconv versions: pin `schemaUrl` and plan upgrades.
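Dual propagation on the read side boils down to "prefer W3C, fall back to B3". The following standard-library sketch shows that order of precedence; `Extracted` and `extract_context` are hypothetical names, X-Ray is left out for brevity, and validation is simplified (for example, all-zero IDs are not rejected).

```python
from typing import Mapping, NamedTuple, Optional

_HEX = set("0123456789abcdef")


class Extracted(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool
    source: str  # "w3c" or "b3"


def _is_hex(value: str, lengths) -> bool:
    return len(value) in lengths and all(c in _HEX for c in value)


def extract_context(headers: Mapping[str, str]) -> Optional[Extracted]:
    """Read W3C traceparent first; fall back to B3 single or multi headers."""
    h = {k.lower(): v.strip() for k, v in headers.items()}

    parts = h.get("traceparent", "").split("-")
    if (len(parts) == 4 and parts[0] == "00" and _is_hex(parts[1], (32,))
            and _is_hex(parts[2], (16,)) and _is_hex(parts[3], (2,))):
        return Extracted(parts[1], parts[2], bool(int(parts[3], 16) & 1), "w3c")

    b3 = h.get("b3", "").split("-")
    if len(b3) >= 2:                      # single header: trace-span[-state[-parent]]
        trace_id, span_id = b3[0], b3[1]
        state = b3[2] if len(b3) > 2 else ""
    else:                                 # multi-header variant
        trace_id = h.get("x-b3-traceid", "")
        span_id = h.get("x-b3-spanid", "")
        state = "d" if h.get("x-b3-flags") == "1" else h.get("x-b3-sampled", "")

    if _is_hex(trace_id, (16, 32)) and _is_hex(span_id, (16,)):
        # 64-bit B3 trace IDs are left-padded to the 128-bit W3C width.
        return Extracted(trace_id.rjust(32, "0"), span_id, state in ("1", "d"), "b3")
    return None
```

Once every upstream caller emits `traceparent`, the B3 branch stops being hit and can be removed, which is a simple signal that the migration is complete.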
17) Anti-patterns
High-cardinality attributes (`user_id` in a label, dynamic keys).
Logs that carry no `trace_id`, so there is no way to jump from logs to traces.
Exporting directly from applications to an internet-hosted APM (no gateway, no TLS/mTLS).
Collecting 100% of traces in production: expensive and pointless.
Dumps of SQL queries containing user data in `db.statement`.
Inconsistent service names/versions: metrics become fragmented.
18) Rollout checklist (0-45 days)
Days 0-10
Enable the SDK/auto-instrumentation on 2-3 critical services.
Deploy the Agent (DaemonSet) + Gateway (Deployment); set up OTLP 4317/4318 with TLS.
Add `service.name`, `service.version`, `deployment.environment` everywhere.
Days 11-25
Tail sampling on errors/latency/"money" routes.
SpanMetrics → Prometheus; enable Exemplars and RED/SLO dashboards.
Propagate W3C context through the API gateway/NGINX/mesh; correlate logs.
Days 26-45
Cover queues/DB/cache; links for async.
PII redaction policies in the Collector; attribute limits in the SDK.
Integrate SLO gating of releases and automatic rollback.
19) Maturity metrics
Tracing coverage of incoming requests ≥ 95% (taking head/tail sampling into account).
Share of metrics with Exemplars ≥ 80%.
Time for RCA from metric to trace ≤ 2 minutes (p50).
0 PII leaks in attributes/events (scanner).
All services have `service.name/version/environment` and agreed semantics.
20) Appendix: useful snippets
NGINX propagation:
```nginx
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
proxy_set_header baggage $http_baggage;
```
Prometheus with Exemplars (Grafana):
```promql
histogram_quantile(0.95, sum(rate(traces_spanmetrics_duration_bucket{route="/withdraw"}[5m])) by (le))
```
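For intuition about what the `histogram_quantile` query above computes, here is a standard-library sketch of quantile estimation from cumulative buckets using linear interpolation inside the target bucket. It approximates the idea rather than reproducing Prometheus exactly; the function signature and the sample numbers in the usage note are assumptions.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    The lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or an empty bucket): report the
                # lower edge instead of interpolating.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With cumulative counts of 10, 60, 90, 100, and 100 for bounds 50, 100, 200, 400, and +Inf, the 95th observation falls halfway through the 200-400 bucket, so the estimate is 300. This also shows why bucket boundaries matter: the result is only as precise as the bucket that contains the quantile.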
Policy: forbid PII attributes (pseudo-linter)
```yaml
forbid_attributes:
  - user.email
  - payment.card
  - personal.
```
21) Summary
OpenTelemetry turns observability into a standardized, managed loop: unified semantics, secure propagation, smart sampling, and strong correlation with metrics and logs. Set up agent + gateway, add tail sampling, spanmetrics, and Exemplars, keep PII and cardinality under control, and tracing becomes a tool not only for debugging but also for automated SRE/release decisions, reducing MTTR and the risk of every release.