Thread prioritization
1) Why prioritization is needed
As load grows, "everything is important" turns into "nothing gets done on time." Thread prioritization is a systematic way to allocate limited resources (CPU, I/O, network, budget) across threads/jobs/tenants so that critical SLOs are met while cost stays under control. The result is predictable mart freshness, reliable alerts, and stable recomputation windows.
2) Flow taxonomy and importance criteria
Classification axes:
- Time: real/near-real-time (seconds-minutes), interactive (minutes), offline/batch (hours).
- Criticality: financial/regulatory, incident, product, research.
- Dependencies: sources feeding other marts (upstream) vs consumers (downstream).
- Cost of downtime: damage per minute/hour of delay (SLO breach cost).
- Tenancy: internal team, partner, external client.
Practice: assign each class a Business Priority (BP) and a Technical Priority (TP); the composite priority is `P = w1*BP + w2*TP + w3*CostRisk`.
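As a sketch, the composite score could be computed like this (the weights and the 0-100 input scale are illustrative assumptions, not values from the text):

```python
# Sketch of the composite priority P = w1*BP + w2*TP + w3*CostRisk.
# The weights and the 0-100 input scale are illustrative assumptions.
def composite_priority(bp: float, tp: float, cost_risk: float,
                       w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    """Combine business priority, technical priority and cost risk."""
    return w1 * bp + w2 * tp + w3 * cost_risk

# A financial flow: high business priority, moderate tech priority, high cost risk.
print(composite_priority(bp=95, tp=70, cost_risk=90))  # 86.5
```

In practice the weights are tuned per organization and the resulting `P` feeds the scheduler described below.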
3) SLA/SLO/SI model for flows
SLA: contractual guarantee (e.g. "financial mart ready by T+15 min, 99.9%").
SLO: engineering targets (p95 freshness ≤ 10 min; p99 delay ≤ 60 seconds).
SI (Saturation Index): ratio of current load to limits; used by the scheduler.
Guardrails: guardrail metrics (e.g. validation errors, missing data) can temporarily raise the priority of repair flows.
4) Classes of Service (QoS) and Policies
Gold (business-critical): payments, anti-fraud, regulatory reports, incident alerts.
Silver (product-critical): marts for management dashboards, campaigns, risk scoring.
Bronze (best-effort): research batches, long re-build and backfill wide windows.
- Strict Priority (SP): Gold always goes first; risk of starving lower classes.
- Weighted Fair Queuing (WFQ): weights per traffic class/job; fairness control.
- Deficit Round-Robin (DRR): per-class quotas served in portions; good for network/streaming hosts.
- Deadline-aware: tasks close to their deadline get a boost.
- Cost-aware: recomputation is deferred during "expensive hours" when the SLO allows.
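A minimal Deficit Round-Robin sketch over the three QoS classes; the quanta and job costs are illustrative units, not values from the text (a real scheduler would track CPU-seconds or bytes):

```python
from collections import deque

# Minimal Deficit Round-Robin sketch over the three QoS classes.
# Quanta and job costs are illustrative abstract units.
class DRRScheduler:
    def __init__(self, quanta: dict):
        self.quanta = quanta                        # class -> quantum added per round
        self.queues = {c: deque() for c in quanta}  # class -> FIFO of (job, cost)
        self.deficit = {c: 0 for c in quanta}       # class -> accumulated credit

    def submit(self, cls: str, job: str, cost: int) -> None:
        self.queues[cls].append((job, cost))

    def round(self) -> list:
        """Run one DRR round; return the jobs dispatched, in order."""
        dispatched = []
        for cls, q in self.queues.items():
            if not q:
                self.deficit[cls] = 0               # idle class forfeits its credit
                continue
            self.deficit[cls] += self.quanta[cls]
            while q and q[0][1] <= self.deficit[cls]:
                job, cost = q.popleft()
                self.deficit[cls] -= cost
                dispatched.append(job)
        return dispatched

sched = DRRScheduler({"gold": 100, "silver": 50, "bronze": 10})
sched.submit("gold", "fin_mart", 80)
sched.submit("silver", "dash_mart", 60)
sched.submit("bronze", "research_batch", 40)
print(sched.round())  # ['fin_mart']: only Gold's quantum covers its job in round 1
```

Silver and Bronze are not starved: their deficit accumulates each round, so the Silver job runs on the second round and the Bronze job a couple of rounds later.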
5) Schedulers and queues (by level)
Ingest level (event bus):
- Topics/queues are split into QoS classes; producer limits; backpressure via quotas.
- Rate-limit policies with burst tokens for spikes (token bucket).
Compute level:
- Resource pools/clusters per class: separate executors for Gold.
- Preemption: reclaiming resources from lower classes under deficit (with a frequency cap).
- Admission control: an input filter on budget and SLO; "expensive" jobs without an available window are rejected.
Storage/serving level:
- Concurrent I/O and prioritized request queues.
- Materialized views: Gold is incremental, Silver periodic, Bronze in scheduled/night windows.
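The token-bucket rate limit mentioned for the ingest level can be sketched as follows (the rate and burst numbers are illustrative):

```python
import time

# Token-bucket sketch for ingest rate limits with burst tokens.
# Rate/burst numbers are illustrative.
class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate                 # tokens refilled per second
        self.capacity = burst            # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Admit an event if enough tokens are available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=100.0, burst=10.0)
# A burst of 25 events: roughly the first 10 pass, the rest are throttled
# until tokens refill at the steady rate.
print(sum(bucket.allow() for _ in range(25)))
```

The same primitive works per producer, per topic, or per tenant; only the rate and burst parameters differ by QoS class.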
6) Backpressure, limits and system protection
Backpressure signals: from consumer to producer (lag/latency/queue depth).
Request/job limits: bytes scanned, rows returned, wall-time caps.
Circuit breakers: under overload, degrade to simplified aggregates or "warm" snapshots.
Shed-load: dropping/trimming best-effort flows to save critical ones.
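A toy admission/shed-load rule keyed on the Saturation Index from section 3 (the SI thresholds are illustrative assumptions):

```python
# Toy admission / shed-load rule keyed on the Saturation Index (SI).
# Class names follow the QoS taxonomy; SI thresholds are illustrative.
SHED_ORDER = ["bronze", "silver", "gold"]   # shed the lowest classes first

def admit(job_class: str, saturation_index: float) -> bool:
    """Decide whether to admit a job at the current SI (load / limits)."""
    if saturation_index < 0.8:
        return True                          # healthy: admit everything
    if saturation_index < 0.95:
        return job_class != "bronze"         # overloaded: shed best-effort
    return job_class == "gold"               # critical: only Gold survives

print([admit(c, 0.9) for c in SHED_ORDER])  # [False, True, True]
```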
7) Multi-tenancy and fairness
Quotas for tenants: CPU/IO/cost per unit of time.
Weights for query classes: analytics, reports, ML features - different limits.
Budget envelopes: weekly/monthly ceilings; once exhausted, priority drops or work shifts to off-peak.
8) Cost and "prioritization economics"
Cost-to-Freshness: what one extra minute of freshness costs.
Cost-aware planning: Bronze shifts to off-peak; backfills run in "cheap hours."
Spot/Preemptible: low-priority work runs on preemptible resources.
Query profiling: blacklists of "expensive" query patterns; automatic rewriting.
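The off-peak window used for cost-aware scheduling can be checked with a small helper; the 22:00-06:00 window matches the bronze_backfill example later in the text:

```python
from datetime import time

# Off-peak window check for cost-aware scheduling.
# The 22:00-06:00 window wraps past midnight, so the test is an OR.
OFFPEAK_START, OFFPEAK_END = time(22, 0), time(6, 0)

def in_offpeak(t: time) -> bool:
    """True inside a window that wraps past midnight."""
    return t >= OFFPEAK_START or t < OFFPEAK_END

print(in_offpeak(time(23, 30)), in_offpeak(time(12, 0)))  # True False
```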
9) Batch prioritization
Window calendar: reserve windows for Gold before Silver/Bronze.
Dependency-aware DAG: upstream Gold models get an early slot to unblock the cascade.
Incremental first: incremental batches run first, then "cold" full rebuilds.
Checkpointing: so that preemption does not lose progress.
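The deadline-aware boost from section 4 applies naturally to batch jobs; a sketch, assuming a linear ramp over the last hour before the deadline (the window size and boost magnitude are illustrative):

```python
from datetime import datetime, timedelta

# Deadline-aware boost sketch: jobs nearing their SLO deadline get a
# linearly growing bump. The one-hour window and boost size are illustrative.
def effective_priority(base: float, deadline: datetime,
                       now: datetime, boost: float = 30.0) -> float:
    remaining = (deadline - now).total_seconds()
    if remaining <= 0:
        return base + boost                  # already late: full boost
    window = 3600.0                          # boost ramps up over the last hour
    return base + boost * max(0.0, 1 - remaining / window)

now = datetime(2024, 1, 1, 12, 0)
# A Bronze job (base 20) 15 minutes from its deadline gets 75% of the boost.
print(effective_priority(20, now + timedelta(minutes=15), now))  # 42.5
```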
10) Prioritization for streaming
Priority partitions: more consumer instances on Gold topics.
Watermarks by class: for Gold - narrow lateness windows; for Bronze - wider (higher tolerance for late events).
Dedup and idempotent sinks: for Gold - strict; for Bronze - heuristic.
Alerts: Gold alerts go through a separate channel with increased QoS.
11) Signals and automatic priority change
Event triggers: traffic spike, incident, promotional campaign → temporary Gold/Silver boost.
SLA threat: a forecast freshness breach → auto-boost of the specific mart.
Data quality: mass duplicates/losses → raising the priority of repair streams.
Financial risk: chargeback growth → scoring/alert priority.
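These signal-driven boosts can be expressed as a small rule table; the signal names and boost values here are illustrative assumptions:

```python
# Signal-driven auto-boost sketch: map monitoring signals to temporary
# priority bumps. Signal names and boost values are illustrative.
BOOST_RULES = {
    "sla_threat": 25,     # freshness forecast predicts an SLO breach
    "dq_violation": 20,   # mass duplicates / missing rows detected
    "incident": 40,       # active incident touching this flow
    "campaign": 10,       # planned promotional traffic spike
}

def boosted_priority(base: float, active_signals: set) -> float:
    """Apply the strongest active boost, capping the result at 100."""
    boost = max((BOOST_RULES[s] for s in active_signals if s in BOOST_RULES),
                default=0)
    return min(100.0, base + boost)

print(boosted_priority(60.0, {"dq_violation", "campaign"}))  # 80.0
```

Taking the strongest boost rather than the sum keeps a noisy incident from pushing every flow to the ceiling at once; summing is an equally valid policy choice.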
12) Observability: what to monitor
Queues/lag: length, waiting time, p95/p99 delays by class.
SLO board: freshness/latency/errors per layer (ingest→curated→marts).
Cost: cost per class/tenant; deviations from the budget.
Preemption: frequency, loss of progress, data MTTR.
Priority arithmetic: current `P`, reasons for boosts, history of scheduler decisions.
13) Policy Management
Policies in config code (policy-as-code), versioning and review.
Dry-run before application: how the schedule/cost will change.
Canary rollout: part of the clusters switches to the new weights/rules first.
Runbooks: what to do when overloaded, how to temporarily lower the class, how to return.
14) Antipatterns
"Everything is Gold." Prioritization loses its meaning; wars for resources begin.
Strict SP without starvation protection. Silver/Bronze never complete.
No admission control. "Expensive" requests enter the system and drop everyone.
No cost-awareness. Heavy backfills run during "expensive hours."
OLTP/OLAP mix. Critical transactions suffer from analytics.
Hybrid data without RLS/CLS. Repair/priority accidentally exposes sensitive fields.
15) Implementation Roadmap
1. Discovery: inventory of threads, dependencies and owners; assessing SLO and downtime costs.
2. QoS classes: define Gold/Silver/Bronze, weights and base limits; create a policy-as-code.
3. Scheduler and pools: split clusters/resource pools, enable admission control.
4. Monitoring: SLO boards/lag/cost; alerts to the threat of SLO and budget-breach.
5. Auto-boost: integration of signals (incidents, campaigns, DQ) into priority change.
6. Cost-aware: off-peak schedules, spot resources, profiling "expensive" requests.
7. Hardening: preemption-safe checkpoints, runbooks, canary policies, chaos tests.
16) Pre-release checklist
- QoS class, owner, SLO, and downtime cost are defined for all flows.
- Configured pools/clusters and admission control, CPU/IO/scan limits.
- Backpressure and rate limits on ingest/consumers are enabled.
- Prioritization policies are designed as code; there is a dry-run and a review.
- Lags, freshness, cost, preemption/errors are monitored; alerts in on-call.
- Configured auto-boost on signals (SLA threat, DQ, incident, campaign).
- Documented degradation runbooks; checked chaos scenarios.
- Bronze streams are migrated to off-peak/spot without the risk of cascading delays.
17) Sample policies (pseudo-YAML)
17.1 Gold class with deadline and budget

```yaml
policy: gold_finance_stream
priority_base: 90
deadline_slo: freshness<=10m
boost_on:
  - dq_violation: duplicates_in_txn_id>0
  - incident: "chargeback_spike"
limits:
  max_scan_mb: 20480
  max_concurrency: 32
budget:
  max_hourly_cost: 200
preemption:
  can_preempt_classes: [silver, bronze]
```
17.2 Cost-aware backfill for Bronze

```yaml
policy: bronze_backfill
priority_base: 20
schedule: offpeak(22:00-06:00)
limits:
  max_concurrency: 4
  iops_cap: low
fallback:
  pause_if_cluster_si > 0.8
```
18) The bottom line
Thread prioritization is a manageable combination of business priorities, technical SLOs, and economic constraints implemented through queues, schedulers, limits, and system feedback. When QoS classes, auto-boost signals, and cost-aware policies work together, data remains fresh and reliable, critical insights arrive on time, and infrastructure billing is predictable.