Observability stack
1) Why do you need an observability stack?
Fast RCA and lower MTTR: from symptom to cause in minutes.
SLO management: measuring errors/latency, alerting on the error budget.
Release control: canary deployments, automatic rollback based on metrics.
Security and audit: access trails, anomalies, Legal Hold.
FinOps transparency: cost of storage/queries, cost-per-SLO.
Methodologies: Golden Signals (latency/traffic/errors/saturation), RED, USE.
2) Baseline architecture of the stack
Components by layer
Collection/agents: Exporters, Promtail/Fluent Bit, OTel SDK/auto-instrumentation, Blackbox probes.
Bus/ingest: Prometheus remote_write → Mimir/Thanos, Loki distributors/ingesters, Tempo/Jaeger ingesters.
Storage: object storage S3/GCS/MinIO (long-term, cold), SSD (hot series).
Queries/visualization: Grafana (panels, SLO widgets), Kibana (if ELK is used).
Management: Alertmanager/Grafana alerts, service catalog, RBAC, secret manager.
Deployment patterns
Managed (Grafana Cloud/cloud services): quick to start, but more expensive at volume.
Self-hosted (K8s): full control, but it requires operations effort and FinOps discipline.
3) Data standards: a single "observability schema"
3.1 Metrics (Prometheus/OpenMetrics)
Required labels: `env`, `region`, `cluster`, `namespace`, `service`, `version`, `tenant` (if multi-tenant), `endpoint`.
Naming: `snake_case`, with the suffixes `_total`, `_seconds`, `_bytes`.
Histograms: fixed `buckets` (SLO-oriented).
Cardinality: `user_id` and `request_id` must not be used as labels.
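The metric rules above lend themselves to a mechanical check. The following is a minimal Python sketch, assuming each metric is described by a name, a type, and a set of label names; `lint_metric`, `REQUIRED_LABELS`, and `FORBIDDEN_LABELS` are illustrative names, not part of any Prometheus tooling.

```python
import re

# Label sets taken from the schema above; `tenant` is only required for
# multi-tenant setups, so it is left out of the hard requirement here.
REQUIRED_LABELS = {"env", "region", "cluster", "namespace", "service", "version", "endpoint"}
FORBIDDEN_LABELS = {"user_id", "request_id"}  # unbounded cardinality
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def lint_metric(name: str, metric_type: str, labels: set[str]) -> list[str]:
    """Return a list of human-readable schema violations (empty if clean)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append(f"{name}: name is not snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append(f"{name}: counters should end with _total")
    missing = REQUIRED_LABELS - labels
    if missing:
        problems.append(f"{name}: missing labels {sorted(missing)}")
    banned = FORBIDDEN_LABELS & labels
    if banned:
        problems.append(f"{name}: high-cardinality labels {sorted(banned)}")
    return problems
```

A check like this could run against a list of declared metrics before a service ships; the exact rule set would depend on the team's own schema.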
3.2 Logs
Format: JSON; required fields `ts`, `level`, `service`, `env`, `trace_id`, `span_id`, `msg`.
PII: masking at the agent (PAN, tokens, e-mail, etc.).
Loki labels: low cardinality only (`app`, `namespace`, `level`, `tenant`).
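A log line that satisfies this format might be produced as sketched below. This is a minimal Python example using the standard `logging` module; the `JsonFormatter` class and the two regular expressions are assumptions for illustration. Masking is shown in-process only for brevity (the text places it at the agent), and the patterns are far from a complete PII rule set.

```python
import json
import logging
import re
from datetime import datetime, timezone

# Illustrative patterns only; real PII detection needs a vetted rule set.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PAN = re.compile(r"\b\d{13,19}\b")  # bare card-number-like digit runs


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and PAN-like numbers with placeholders."""
    text = EMAIL.sub("<email>", text)
    return PAN.sub("<pan>", text)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def __init__(self, service: str, env: str):
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "env": self.env,
            # trace_id/span_id are expected to arrive via `extra=`.
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
            "msg": mask_pii(record.getMessage()),
        })
```

Attaching the formatter to a handler and logging with `extra={"trace_id": ..., "span_id": ...}` would yield one JSON object per line, which is the shape a `json` parser stage in the log pipeline expects.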
3.3 Traces
OTel semantics: `service.name`, `deployment.environment`, `db.system`, `http.*`.
Sampling: `always_on`/tail-sampling for target paths (p99), `parent/ratio` for the rest.
ID propagation: put `trace_id`/`span_id` into logs and metrics (labels/fields).
4) M-L-T correlation (Metrics/Logs/Traces)
From the alert graph (metric) → logs filtered by `trace_id` → the exact trace.
From a trace (slow span) → a metric query for that specific backend over the span's time range.
Drilldown buttons on panels: variables (`$env`, `$service`, `$trace_id`).
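The metric → logs step of this drilldown amounts to substituting the panel variables into a log query. A minimal Python sketch follows, assuming `env` and `service` are Loki stream labels and `trace_id` is a JSON field in the log line; the helper name `logql_for_trace` is hypothetical.

```python
import re

# W3C Trace Context trace IDs are 32 lowercase hex characters.
TRACE_ID = re.compile(r"^[0-9a-f]{32}$")


def logql_for_trace(env: str, service: str, trace_id: str) -> str:
    """Build a LogQL query that narrows logs to a single trace."""
    if not TRACE_ID.match(trace_id):
        raise ValueError(f"not a valid trace id: {trace_id!r}")
    # The stream selector uses only low-cardinality labels; trace_id is
    # matched as a parsed JSON field, not as a Loki label.
    return f'{{env="{env}", service="{service}"}} | json | trace_id="{trace_id}"'
```

Keeping `trace_id` out of the stream selector is deliberate: it keeps Loki label cardinality low while still allowing an exact match after the `json` stage.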
5) OpenTelemetry Collector: reference pipeline
```yaml
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-nodes
          static_configs: [{ targets: ['kubelet:9100'] }]

processors:
  batch: {}
  memory_limiter: { check_interval: 1s, limit_mib: 512 }
  attributes:
    actions:
      - key: deployment.environment
        value: ${ENV}
        action: insert
  # Keep error traces, key routes, and 10% of everything else.
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: important-routes
        type: string_attribute
        string_attribute: { key: http.target, values: ["/payments", "/login"] }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  # otlphttp appends /v1/<signal> to `endpoint`; verify these paths against your backends.
  otlphttp/mimir: { endpoint: "https://mimir/api/v1/push" }
  otlphttp/tempo: { endpoint: "https://tempo/api/traces" }
  loki:
    endpoint: https://loki/loki/api/v1/push
    labels:
      attributes:
        env: "deployment.environment"
        service: "service.name"

service:
  pipelines:
    metrics: { receivers: [prometheus, otlp], processors: [memory_limiter, batch], exporters: [otlphttp/mimir] }
    logs: { receivers: [otlp], processors: [batch], exporters: [loki] }
    traces: { receivers: [otlp], processors: [memory_limiter, attributes, tail_sampling, batch], exporters: [otlphttp/tempo] }
```
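As a rough mental model of the three `tail_sampling` policies in this pipeline, the decision for a completed trace can be sketched in Python. This is a simplification under stated assumptions: the `Span` class and `keep_trace` function are illustrative, a trace is kept when any policy matches, and the probabilistic policy is modelled with a plain random draw rather than the processor's own mechanism.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Span:
    status: str = "OK"  # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)


IMPORTANT_ROUTES = {"/payments", "/login"}


def keep_trace(spans: list[Span], percentage: float = 10.0, rng=random.random) -> bool:
    """Decide whether to keep a completed trace; any matching policy keeps it."""
    # Policy "errors": at least one span finished with ERROR status.
    if any(s.status == "ERROR" for s in spans):
        return True
    # Policy "important-routes": any span targets a listed route.
    if any(s.attributes.get("http.target") in IMPORTANT_ROUTES for s in spans):
        return True
    # Policy "probabilistic": keep roughly `percentage` percent of the rest.
    return rng() * 100 < percentage
```

The point of the ordering is cost: errors and key routes are always retained for RCA, while routine successful traffic is thinned to about a tenth.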
6) Alerting: SLO and multi-burn
The idea: alert not on "CPU > 80%" but on Error Budget consumption.
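PromQL examples:
```promql
# err_ratio_<window> and SLO are placeholders: recording rules and the
# SLO target (e.g. 0.999), not literal PromQL assignments.

# 5-minute error rate
err_ratio_5m =
  sum(rate(http_requests_total{status=~"5.."}[5m])) /
  sum(rate(http_requests_total[5m]))

# Quick burn (1m window)
(err_ratio_1m / (1 - SLO)) > 14.4

# Slow burn (30m)
(err_ratio_30m / (1 - SLO)) > 2
```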
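The burn-rate expressions divide the observed error ratio by the allowed error ratio `1 - SLO`. A small Python sketch of that arithmetic follows; the function names and the 30-day budget period are assumptions. The 14.4 threshold is commonly derived as "2% of a 30-day budget spent in one hour"; whether the thresholds here were chosen that way is not stated in the text.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the whole-period error budget spent if `rate` lasts `window_hours`."""
    return rate * window_hours / period_hours


if __name__ == "__main__":
    # SLO 99.9% and 1.44% of requests failing: burn rate of about 14.4.
    rate = burn_rate(0.0144, 0.999)
    # Sustained for one hour, that spends about 2% of a 30-day budget.
    print(round(rate, 2), round(budget_consumed(rate, 1), 4))
```

A burn rate of 1 means the budget runs out exactly at the end of the period; anything above 1 exhausts it early, which is why the fast and slow alerts use different multipliers.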
Latency (histograms):
```promql
latency_p95 =
  histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
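To see why bucket boundaries matter for this query, it helps to reproduce the estimate by hand. The sketch below is a simplified Python rendering of the linear interpolation that `histogram_quantile` performs on classic cumulative buckets. It assumes the lowest bucket starts at 0 and ignores the edge cases the real function handles; the bucket values in the usage note are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (le, count) buckets.

    `buckets` must be sorted by `le` and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_le, prev_count = 0.0, 0.0
    for i, (le, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(le):
                # Quantile falls into +Inf: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation inside the matching bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan
```

With hypothetical buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (inf, 100)]`, the 95th percentile lands in the 0.5 to 1.0 bucket and is interpolated within it, so the result is only as precise as that bucket is narrow. This is the practical reason to place bucket boundaries around the SLO threshold.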
7) Dashboards: folder structure
00_Overview - platform: SLO, p95, 5xx%, capacity, active incidents.
10_Services - per service: RPS, p95/p99, errors, releases (annotations).
20_Infra - K8s/nodes/storage/network, etcd, controllers.
30_DB/Queues - PostgreSQL/Redis/Kafka/RabbitMQ.
40_Edge/DNS/CDN/WAF - ingress, LB, WAF rules.
50_Synthetic - uptime and headless scenarios.
60_Cost/FinOps - storage, queries, hot/cold, forecast.
Every panel has: description, units, owner, runbook link, drilldown.
8) Logs: LogQL practice
```logql
# API errors
{app="api", level="error"} |= "Exception"

# Nginx 5xx in 5 minutes
count_over_time({app="nginx"} | json | status=~"5.." [5m])

# Extract fields: average duration (the 5m range is an assumed window)
avg_over_time({app="payments"} | json | code!="" | unwrap duration [5m])
```
9) Traces: TraceQL and tricks
Finding the slowest spans:
{ resource.service.name = "api" && duration > 500ms }
The "slow SQL inside a slow request" sandwich:
{ name = "HTTP GET /order" } > { name = "SELECT" && duration > 50ms }
10) Synthetics and uptime
Blackbox exporter: HTTP/TCP/TLS/DNS probes from ≥ 3 regions/ASNs.
Headless: login/deposit scenarios.
Quorum alerts: fire only if ≥ 2 regions see the failure.
Status page: automatic updates + manual comments.
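The quorum rule is simple enough to state as code. A minimal Python sketch, assuming probe results arrive as a mapping from region name to success flag; `quorum_alert` and the region names in the usage note are illustrative.

```python
def quorum_alert(probe_ok_by_region: dict[str, bool], quorum: int = 2) -> bool:
    """Fire only when at least `quorum` regions report a failed probe."""
    failing = [region for region, ok in probe_ok_by_region.items() if not ok]
    return len(failing) >= quorum
```

For example, `quorum_alert({"eu": False, "us": True, "ap": True})` stays silent because a single failing region is more likely a local network problem than an outage, while two failing regions trigger the alert.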
11) Storage and retention
Metrics: hot for 7-30 days (fast series), downsampling/recording rules, cold in object storage (months).
Logs: hot for 3-7 days, then S3/GCS with an index (Loki chunk store/ELK ILM).
Traces: 3-7 days for `always_on` + longer retention for the sampled subset (tail-sampled/failures).
- Rollover by size and time; query budgets (quotas/limits).
- Separate policies for prod/stage and for security data.
12) Multi-tenancy and access
Separate by `tenant`/`namespace`/Spaces, index patterns, and permissions.
Tag sources for billing: `tenant`, `service`, `team`.
Import dashboards/alerts into clearly defined team spaces.
13) Security and compliance
TLS/mTLS from agents to the backend, HMAC for private health endpoints.
RBAC for read/write access; audit of all queries and alerts.
PII redaction at the edge; no secrets in logs; DSAR/Legal Hold.
Isolation: separate clusters/namespaces for sensitive domains.
14) FinOps: the cost of observability
Trim labels and apply processing logic at ingest, not at query time.
Trace sampling + targeted always-on for critical paths.
Downsampling/recording rules for heavy aggregations.
Archive rarely accessed data to cold object storage.
Metrics: `storage_cost_gb_day`, `query_cost_hour`, `cost_per_rps`, `cost_per_9`.
15) CI/CD and observability tests
Linting of metrics/logs in CI: forbid "exploding" cardinality, check histograms/units.
Contract tests for observability: required metrics/log fields, `trace_id` middleware.
Canary: release annotations, SLO-based auto-rollback.
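One way a CI job might catch cardinality explosions is to count distinct label sets per metric in a sample of scraped series and fail when a budget is exceeded. A minimal Python sketch follows, assuming samples are available as `(metric_name, labels_dict)` pairs; the function names and the budget value are hypothetical.

```python
from collections import defaultdict


def series_per_metric(samples):
    """Count distinct label sets (time series) per metric name."""
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}


def over_budget(samples, max_series: int):
    """Return the metrics whose series count exceeds the allowed budget."""
    counts = series_per_metric(samples)
    return sorted(name for name, count in counts.items() if count > max_series)
```

A non-empty result from `over_budget` would fail the build, which turns the "no `user_id` in labels" rule into something enforced rather than documented.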
16) Examples: quick queries
Top endpoints by errors:
```promql
topk(10, sum by (route) (rate(http_requests_total{status=~"5.."}[5m])))
```
CPU throttling:
```promql
sum by (namespace, pod) (rate(container_cpu_cfs_throttled_seconds_total[5m])) > 0
```
Kafka lag:
```promql
# Label names depend on the exporter (kafka_exporter uses `consumergroup`).
max by (topic, group) (kafka_consumergroup_lag)
```
From logs to traces (Loki → Tempo): pass `trace_id` as a link to the Tempo UI/dashboard.
17) Stack quality: checklist
- Agreed schemas for metrics/logs/traces and units of measurement.
- `trace_id` in logs and metrics, drilldown in panels.
- Multi-burn SLO alerts without flapping (quorum/multi-window).
- Downsampling, query quotas, step/range limits.
- Retention and storage classes are documented and enforced.
- RBAC/audit/PII redaction are enabled.
- Dashboards: owner, runbooks, ≤ 2-3 screens, fast response.
- FinOps dashboard (volumes, costs, top talkers).
18) Rollout plan (3 iterations)
1. MVP (2 weeks): Prometheus → Mimir, Loki, Tempo; OTel Collector; baseline dashboards and SLO alerts; blackbox probes.
2. Scale (3-4 weeks): tail-sampling, downsampling, multi-region ingest, RBAC/Spaces, FinOps dashboards.
3. Pro (4+ weeks): SLO-based auto-rollback, headless synthetics for key paths, Legal Hold, SLO portfolio and reporting.
19) Anti-patterns
"Pretty graphs without SLOs": no action follows, so there is no value.
High-cardinality labels (`user_id`, `request_id`): memory and cost explosion.
Logs without JSON and without `trace_id`: no correlation is possible.
Alerts on resources instead of symptoms: noise and on-call burnout.
No retention policy: uncontrolled cost growth.
20) Mini-FAQ
What to choose: Loki or ELK?
ELK for complex search/facets; Loki is cheaper and faster for grep-like scenarios. A hybrid is often used.
Is tracing needed for everything?
Yes, at least tail-sampling on key paths (login, checkout, payments); it speeds up RCA.
How to start from scratch?
OTel Collector → Mimir/Loki/Tempo → baseline SLOs and blackbox probes → then dashboards and burn alerts.
Summary
An observability stack is not a set of scattered tools but a coherent system: unified data standards → M-L-T correlation → SLO alerting and synthetics → security and FinOps. Lock down the schemas, labels, and retention, connect OTel, add drilldown and auto-rollback, and you get manageable reliability at a predictable cost.