ETL/ELT jarayonlari
1) Vazifasi va konteksti
ETL/ELT-konveyerlar hisobot (GGR/NGR, regulyatorlar), analitiklar/ML va operativ panellar uchun ma’lumotlarni oldindan aytib bo’ladigan yuklash, transformatsiya qilish va e’lon qilishni ta’minlaydi.
ETL: DWH/Lakehouse (kamdan-kam hollarda zamonaviy steklarda).
ELT: Avval Lakehouse (Bronze/Silver) ga yuklaymiz, so’ngra SQL/dvigatellarga aylantiramiz (tavsiya etilgan).
2) Etalon arxitekturasi
1. Ingest/Edge: HTTP/gRPC/Batch, OLTP dan CDC, provayder S3/FTP tushirish.
2. Bronze (raw, append-only): o’zgarmas payload’lar, sana/bozor/tenant bo’yicha partiyalar.
3. Silver (clean/conform): normallashtirish, dedup, maʼlumotnomalar, SCD, FX/taymzonlar.
4. Gold (serve): BI/regulyator/model ostidagi denormallashtirilgan vitrinalar.
5. Orkestr: Airflow/Dagster/Prefect (DAG’i, SLA, retray, siljish).
6. DQ/Contracts: Schema Registry + DQ-как-код, consumer-driven tests.
7. Kuzatilishi: payplaynlar metrikasi, lineage, loglar, cost-dashbordlar.
3) ETL vs ELT tanlash
Amaliyot: iGaming - ELT + CDC: tezda yuklaymiz, keyin standartlashtiramiz va hisoblaymiz.
4) Inkrementlar va CDC
Deltalarga yondashuvlar:- CDC (Debezium/log replikatsiyasi): OLTP → Bronze → MERGE ning Silverdagi oʻzgarishlari.
- Watermark vaqti boʻyicha:’updated _ at> max_loaded_ts'.
- Hash-diff: o’zgarishlarni aniqlash uchun’md5 (row)’taqqoslash.
- Upsert/MERGE: yuklashning idempotentligi.
sql
MERGE INTO silver. payments s
USING stage. payments_delta d
ON s. transaction_id = d. transaction_id
WHEN MATCHED THEN UPDATE SET
WHEN NOT MATCHED THEN INSERT;
5) Kontraktlar va sxemalar
Schema-first: JSON/Euro/Protobuf in Registry;’schema _ version’.
Evolyutsiya: back-compatible (nullable qoʻshimchalar); breaking - ’/v2’+ qoʻshaloq yozuv.
Majburiy maydonlar:’event _ time (UTC)’,’event _ id’,’trace _ id’,’user _ pseudo _ id’,’market’.
6) DQ-kod sifatida (minimal to’plam)
yaml table: silver. payments owner: data-payments slo:
freshness_minutes: 15 completeness_percent: 99. 5 rules:
- name: unique_tx # uniqueness of transactions type: unique columns: [transaction_id]
severity: critical
- name: currency_whitelist type: in_set column: currency set: [EUR,USD,GBP,TRY,BRL]
severity: major
- name: amount_positive type: range column: amount_base min: 0. 01 severity: critical
- name: fk_user type: foreign_key column: user_pseudo_id ref_table: dim. users_scd severity: critical
7) Orkestratsiya: DAG’i, qaramlik, SLA
DAG-dizayn: manbalardan vitrinalarga; vazifalar o’rtasidagi aniq bog’liqliklar.
Retray va idempotentlik: backoff, «sof» takrorlash, checkpoint’lar.
Siljishlar (catchup): o’tkazib yuborilgan davrlarning ehtiyotkorlik bilan o’rnatilishi.
SLA: masalan, Gold. daily mahalliy vaqt bilan soat 06:00 gacha tayyor; qoidabuzarliklar to’g "risida xabardor qilish.
Parametrlash: bozorlar/tenantlar/sanalar vars orqali; yagona job namunasi.
8) Idempotentlik va exactly-once
ingest: dublikatlar mumkin → dedup po’(event_id, source)’.
Ishlov berishda: upsert/merge; transformatsiyalarning «sof» funksiyalari.
V sink: tranzaksion kommitalar yoki idempotent writes; «ikki marta hisobga olish» nazorati.
Outbox/Inbox: OLTP domen voqealarini tranzaksion nashr etish.
9) Backfill и reprocessing
Backfill: birlamchi toʻldirish/tarixiy diapazonlar.
Reprocessing: mantiq oʻzgarganda/tuzatilganda qayta hisoblash.
Guardrails: diapazon chegaralari, kvotalar, vaqt oynalari, metrlarni taqqoslash bilan dry-run.
Markalash:’logic _ version’,’reprocessed _ at’,’recalc _ reason’.
10) Silver/Gold modellashtirish
Silver (3NF/BCNF): faktlar’fact _ bets/payments/payouts’, o’lchovlar’dim _ users/games/providers/markets (SCD II)’, valyutalar/taymzonlarni standartlashtirish.
Gold: BI/regulyator/model ostidagi denormallashtirilgan vitrinalar; o’zgarmas eksport paketlari (WORM) + imzo.
Gold misoli: GGR Daily
sql
CREATE OR REPLACE VIEW gold. ggr_daily AS
SELECT
DATE(b. event_time) AS event_date,
b. market,
g. provider_id,
SUM(b. stake_base) AS stakes_eur,
SUM(p. amount_base) AS payouts_eur,
SUM(b. stake_base) - SUM(p. amount_base) AS ggr_eur
FROM silver. fact_bets b
LEFT JOIN silver. fact_payouts p
ON p. user_pseudo_id = b. user_pseudo_id
AND p. game_id = b. game_id
AND DATE(p. event_time) = DATE(b. event_time)
JOIN dim. games g ON g. game_id = b. game_id
GROUP BY 1,2,3;
11) Xususiy va rezidentlik
PII-minimallashtirish: tokenlashtirish; izolyatsiya qilingan konturda real ID mappinglari.
RLS/CLS: rollar/yurisdiksiyalar bo’yicha foydalanish siyosati, yashirish.
Residency: EEA/UK/BR uchun alohida kataloglar/kalitlar; asossiz kross-mintaqaviy join’onlarni taqiqlash.
DSAR/RTBF & Legal Hold: selektiv tahrirlash, hisobot uchun WORM arxivlari, eksport auditi.
12) Kuzatuv va SLO
SLI/SLO belgilari:- Freshness Silver p95 ≤ 15 min; Gold daily soat 06:00 gacha tayyor. vaqt.
- Completeness ≥ 99. 5%, Validity (sxema) ≥ 99. 9%.
- Job’larning muvaffaqiyati ≥ 99. 0%, MTTR hodisalar ≤ 24-48 soat
Dashbordlar: Freshness heatmap, DQ-yo’qotish hunisi, cost/query & cost/GB, lineage-graf.
13) Unumdorlik va qiymat
Partiyalashtirish: sana/bozor/tenant; klaster/Z-order filtrlar bo’yicha.
Formatlar: Parquet + ACID (Delta/Iceberg/Hudi), siqish va statistika.
Kompaksiya: small files (OPTIMIZE/VACUUM) ga qarshi kurash.
Materiallashtirish: barqaror agregatlar; ulkan on-the-fly join’olardan qochish.
Chargeback: budjetlar, replay/backfill uchun kvotalar; past yuklamali derazalarda rejalashtirish.
14) DAGning namunaviy vazifalari misollari (Airflow psevdokodi)
python with DAG("elt_payments_daily", schedule="@daily", start_date=..., catchup=True) as dag:
extract = BashOperator(task_id="extract_cdc", bash_command="run_cdc_to_bronze. sh {{ ds }}")
load = BashOperator(task_id="load_to_silver", bash_command="sql/run_merge_silver. sql {{ ds }}")
dq = BashOperator(task_id="dq_checks", bash_command="dq/run_checks. sh silver. payments {{ ds }}")
gold = BashOperator(task_id="build_gold_ggr", bash_command="sql/build_gold_ggr. sql {{ ds }}")
export = BashOperator(task_id="export_regulator", bash_command="export/run_worm_pack. sh {{ ds }}")
extract >> load >> dq >> gold >> export
15) Jarayonlar va RACI
R (Responsible): Data Engineering (DAG’i, Silver/Gold modellari), Data Platform (infra, Registry, DQ).
A (Accountable): Head of Data/CDO.
C (Consulted): Compliance/Legal/DPO (PII/residency/Legal Hold), Finance (FX/GGR), Risk (RG/AML), SRE (SLO/стоимость).
I (Informed): BI/Mahsulot/Marketing/Operatsiyalar.
16) Joriy etish yo’l xaritasi
MVP (3-5 hafta):1. Lakehouse Bronze/Silver (ACID) + CDC/Payments/Gameplay uchun inkrementlar.
2. DQ-kod sifatida (10-15 qoidalar) va Freshness/Completeness bazaviy dashbordlari.
3. Birinchi Gold-vitrin (GGR Daily) SLA «06:00 gacha», WORM-eksport imzosi bilan.
4. DAG orkestratsiyasi va SLA/DQ alertasi.
2-faza (5-10 hafta):- Domenlarni kengaytirish, users/games/providers uchun SCD II.
- Metriklarning semantik qatlami; lineage/impact-tahlil; backfill/reprocessing tartib-taomillari.
- Hududlashtirish (EEA/UK), RLS/CLS, qiymatni nazorat qilish (kvotalar/chargeback).
- Repley-simulyator (what-if), vitrin/metrik hujjatlarining avtogeneratsiyasi.
- Cost-optimallashtirish (klaster, materiallashtirish, TTL, kompaksiya).
- DR-mashqlar va time-travel tiklash.
17) Sotishdan oldingi chek-varaq
- Registridagi kontraktlar/sxemalar, muvofiqlik testlari yashil rangda.
- CDC/inkrementlar va MERGE idempotentdir; dedup ingest.
- DQ qoidalari faol (critical → fail + DLQ), SLA dashbordlari sozlangan.
- Gold-vitrinalar hujjatlashtirilgan, semantik qatlamdagi metrik formulalar.
- RBAC/ABAC, shifrlash, rezidentlik, DSAR/RTBF/Legal Hold tekshirildi.
- Jadval bo’yicha kompaksiya/OPTIMIZE/VACUUM; backfill/replay limitlari.
- Runbook’va hodisalar va reprocessing, eksport auditi (WORM + hash).
18) Anti-patternlar va xavflar
Full reload: CDC/inkrementlardan foydalaning.
Xom va hisobot ma’lumotlarini aralashtirish: Bronze/Silver/Gold ni alohida saqlang.
DQ va lineage yo’qligi: isbotlanishi va takrorlanishi yo’q.
PII analitik qatlamlarda: mappinglarni izolyatsiya qiling, CLS/RLS qo’llang.
Monolit «tungi» joblar: partiyalar bo’yicha parallel bo’ling.
Qiymat ignori: small files-ni kuzatib boring, agregatlarni materiallashtiring, kvotalarni kiriting.
19) Lugʻat (qisqacha)
ETL/ELT - chiqarish/transformatsiya/yuklash (yuklashdan oldin/keyin).
CDC - oʻzgarishlarni ushlash.
SCD - o’lchovlarni tarixlashtirish (I/II/III).
WORM - hisobot paketlarining o’zgarmas saqlanishi.
Time-travel - jadvallarning tarixiy versiyalarini o’qish.
20) Jami
Zamonaviy ETL/ELT - bu skriptlar emas, balki boshqariladigan platforma: kontraktlar va DQ, idempotent inkrementlar/CDC, Bronze/Silver/Gold qatlamlari intizomi, kuzatuv va SLO, maxfiylik va tejamkorlik. Ushbu yoʻl-yoʻriqqa amal qilib, siz hisobotni barqaror taʼminlaydigan, mahsulot va modellarni kutilmagan hodisalarsiz takrorlanadigan va tinglanadigan konveyerlarga ega boʻlasiz.