Thread prioritization
1) Why prioritization is needed
As load grows, "everything is important" turns into "nothing gets done on time." Thread prioritization is a systematic way to allocate limited resources (CPU, I/O, network, budget) across threads/jobs/tenants so that critical SLOs are met while cost stays under control. The result is predictable mart freshness, reliable alerts, and stable recomputation windows.
2) Flow taxonomy and importance criteria
Classification axes:
- Time: real/near-real-time (seconds-minutes), interactive (minutes), offline/batch (hours).
- Criticality: financial/regulatory, incident, product, research.
- Dependencies: sources feeding other marts (upstream) vs consumers (downstream).
- Cost of downtime: damage per minute/hour of delay (SLO breach cost).
- Tenancy: internal team, partner, external client.
Practice: assign each class a Business Priority (BP) and a Technical Priority (TP); the composite priority is `P = w1*BP + w2*TP + w3*CostRisk`.
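As a sketch, the composite score could be computed like this (the weights and the 0-100 input scale are illustrative assumptions, not values from the text):

```python
# Sketch of the composite priority P = w1*BP + w2*TP + w3*CostRisk.
# The weights and the 0-100 input scale are illustrative assumptions.
def composite_priority(bp: float, tp: float, cost_risk: float,
                       w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    """Combine business priority, technical priority and cost risk."""
    return w1 * bp + w2 * tp + w3 * cost_risk

# A financial flow: high business priority, moderate tech priority, high cost risk.
print(composite_priority(bp=95, tp=70, cost_risk=90))  # 86.5
```

In practice the weights are tuned per organization and the resulting `P` feeds the scheduler described below.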
3) SLA/SLO/SI model for flows
SLA: contractual guarantee (e.g. "financial mart ready by T+15 min, 99.9%").
SLO: engineering targets (p95 freshness ≤ 10 min; p99 delay ≤ 60 seconds).
SI (Saturation Index): ratio of current load to limits; used by the scheduler.
Guardrails: guardrail metrics (e.g. validation errors, missing data) can temporarily raise the priority of repair flows.
4) Classes of Service (QoS) and Policies
Gold (business-critical): payments, anti-fraud, regulatory reports, incident alerts.
Silver (product-critical): marts for management dashboards, campaigns, risk scoring.
Bronze (best-effort): research batches, long re-build and backfill wide windows.
- Strict Priority (SP): Gold always goes first; risk of starving lower classes.
- Weighted Fair Queuing (WFQ): weights per traffic class/job; fairness control.
- Deficit Round-Robin (DRR): per-class quotas served in portions; good for network/streaming hosts.
- Deadline-aware: tasks close to their deadline get a boost.
- Cost-aware: recomputation is deferred during "expensive hours" when the SLO allows.
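A minimal Deficit Round-Robin sketch over the three QoS classes; the quanta and job costs are illustrative units, not values from the text (a real scheduler would track CPU-seconds or bytes):

```python
from collections import deque

# Minimal Deficit Round-Robin sketch over the three QoS classes.
# Quanta and job costs are illustrative abstract units.
class DRRScheduler:
    def __init__(self, quanta: dict):
        self.quanta = quanta                        # class -> quantum added per round
        self.queues = {c: deque() for c in quanta}  # class -> FIFO of (job, cost)
        self.deficit = {c: 0 for c in quanta}       # class -> accumulated credit

    def submit(self, cls: str, job: str, cost: int) -> None:
        self.queues[cls].append((job, cost))

    def round(self) -> list:
        """Run one DRR round; return the jobs dispatched, in order."""
        dispatched = []
        for cls, q in self.queues.items():
            if not q:
                self.deficit[cls] = 0               # idle class forfeits its credit
                continue
            self.deficit[cls] += self.quanta[cls]
            while q and q[0][1] <= self.deficit[cls]:
                job, cost = q.popleft()
                self.deficit[cls] -= cost
                dispatched.append(job)
        return dispatched

sched = DRRScheduler({"gold": 100, "silver": 50, "bronze": 10})
sched.submit("gold", "fin_mart", 80)
sched.submit("silver", "dash_mart", 60)
sched.submit("bronze", "research_batch", 40)
print(sched.round())  # ['fin_mart']: only Gold's quantum covers its job in round 1
```

Silver and Bronze are not starved: their deficit accumulates each round, so the Silver job runs on the second round and the Bronze job a couple of rounds later.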
5) Schedulers and queues (by level)
Ingest level (event bus):
- Topics/queues are split into QoS classes; producer limits; backpressure via quotas.
- Rate-limit policies with burst tokens for spikes (token bucket).
Compute level:
- Resource pools/clusters per class: separate executors for Gold.
- Preemption: reclaiming resources from lower classes under deficit (with a frequency cap).
- Admission control: an input filter on budget and SLO; "expensive" jobs without an available window are rejected.
Storage/serving level:
- Concurrent I/O and prioritized request queues.
- Materialized views: Gold is incremental, Silver periodic, Bronze in scheduled/night windows.
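The token-bucket rate limit mentioned for the ingest level can be sketched as follows (the rate and burst numbers are illustrative):

```python
import time

# Token-bucket sketch for ingest rate limits with burst tokens.
# Rate/burst numbers are illustrative.
class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate                 # tokens refilled per second
        self.capacity = burst            # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Admit an event if enough tokens are available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=100.0, burst=10.0)
# A burst of 25 events: roughly the first 10 pass, the rest are throttled
# until tokens refill at the steady rate.
print(sum(bucket.allow() for _ in range(25)))
```

The same primitive works per producer, per topic, or per tenant; only the rate and burst parameters differ by QoS class.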
6) Backpressure, limits and system protection
Backpressure signals: from consumer to producer (lag/latency/queue depth).
Request/job limits: bytes scanned, rows returned, wall-time caps.
Circuit breakers: under overload, degrade to simplified aggregates or "warm" snapshots.
Shed-load: dropping/trimming best-effort flows to save critical ones.
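A toy admission/shed-load rule keyed on the Saturation Index from section 3 (the SI thresholds are illustrative assumptions):

```python
# Toy admission / shed-load rule keyed on the Saturation Index (SI).
# Class names follow the QoS taxonomy; SI thresholds are illustrative.
SHED_ORDER = ["bronze", "silver", "gold"]   # shed the lowest classes first

def admit(job_class: str, saturation_index: float) -> bool:
    """Decide whether to admit a job at the current SI (load / limits)."""
    if saturation_index < 0.8:
        return True                          # healthy: admit everything
    if saturation_index < 0.95:
        return job_class != "bronze"         # overloaded: shed best-effort
    return job_class == "gold"               # critical: only Gold survives

print([admit(c, 0.9) for c in SHED_ORDER])  # [False, True, True]
```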
7) Multi-tenancy and fairness
Quotas for tenants: CPU/IO/cost per unit of time.
Weights for query classes: analytics, reports, ML features - different limits.
Budget envelopes: weekly/monthly ceilings; once exhausted, priority drops or work shifts to off-peak.
8) Cost and "prioritization economics"
Cost-to-Freshness: what one extra minute of freshness costs.
Cost-aware planning: Bronze shifts to off-peak; backfills run in "cheap hours."
Spot/Preemptible: low-priority work runs on preemptible resources.
Query profiling: blacklists of "expensive" query patterns; automatic rewriting.
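The off-peak window used for cost-aware scheduling can be checked with a small helper; the 22:00-06:00 window matches the bronze_backfill example later in the text:

```python
from datetime import time

# Off-peak window check for cost-aware scheduling.
# The 22:00-06:00 window wraps past midnight, so the test is an OR.
OFFPEAK_START, OFFPEAK_END = time(22, 0), time(6, 0)

def in_offpeak(t: time) -> bool:
    """True inside a window that wraps past midnight."""
    return t >= OFFPEAK_START or t < OFFPEAK_END

print(in_offpeak(time(23, 30)), in_offpeak(time(12, 0)))  # True False
```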
9) Batch prioritization
Window calendar: reserve windows for Gold before Silver/Bronze.
Dependency-aware DAG: upstream Gold models get an early slot to unblock the cascade.
Incremental first: incremental batches run first, then "cold" full rebuilds.
Checkpointing: so that preemption does not lose progress.
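The deadline-aware boost from section 4 applies naturally to batch jobs; a sketch, assuming a linear ramp over the last hour before the deadline (the window size and boost magnitude are illustrative):

```python
from datetime import datetime, timedelta

# Deadline-aware boost sketch: jobs nearing their SLO deadline get a
# linearly growing bump. The one-hour window and boost size are illustrative.
def effective_priority(base: float, deadline: datetime,
                       now: datetime, boost: float = 30.0) -> float:
    remaining = (deadline - now).total_seconds()
    if remaining <= 0:
        return base + boost                  # already late: full boost
    window = 3600.0                          # boost ramps up over the last hour
    return base + boost * max(0.0, 1 - remaining / window)

now = datetime(2024, 1, 1, 12, 0)
# A Bronze job (base 20) 15 minutes from its deadline gets 75% of the boost.
print(effective_priority(20, now + timedelta(minutes=15), now))  # 42.5
```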
10) Prioritization for streaming
Priority partitions: more consumer instances on Gold topics.
Watermarks by class: for Gold - narrow lateness windows; for Bronze - wider (higher tolerance for late events).
Dedup and idempotent sinks: for Gold - strict; for Bronze - heuristic.
Alerts: Gold alerts go through a separate channel with increased QoS.
11) Signals and automatic priority change
Event triggers: traffic spike, incident, promotional campaign → temporary Gold/Silver boost.
SLA threat: a forecast freshness breach → auto-boost of the specific mart.
Data quality: mass duplicates/losses → raising the priority of repair streams.
Financial risk: chargeback growth → scoring/alert priority.
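These signal-driven boosts can be expressed as a small rule table; the signal names and boost values here are illustrative assumptions:

```python
# Signal-driven auto-boost sketch: map monitoring signals to temporary
# priority bumps. Signal names and boost values are illustrative.
BOOST_RULES = {
    "sla_threat": 25,     # freshness forecast predicts an SLO breach
    "dq_violation": 20,   # mass duplicates / missing rows detected
    "incident": 40,       # active incident touching this flow
    "campaign": 10,       # planned promotional traffic spike
}

def boosted_priority(base: float, active_signals: set) -> float:
    """Apply the strongest active boost, capping the result at 100."""
    boost = max((BOOST_RULES[s] for s in active_signals if s in BOOST_RULES),
                default=0)
    return min(100.0, base + boost)

print(boosted_priority(60.0, {"dq_violation", "campaign"}))  # 80.0
```

Taking the strongest boost rather than the sum keeps a noisy incident from pushing every flow to the ceiling at once; summing is an equally valid policy choice.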
12) Observability: what to monitor
Queues/lag: length, waiting time, p95/p99 delays by class.
SLO board: freshness/latency/errors per layer (ingest→curated→marts).
Cost: cost per class/tenant; deviations from the budget.
Preemption: frequency, loss of progress, data MTTR.
Priority arithmetic: current `P`, reasons for boosts, history of scheduler decisions.
13) Policy Management
Policies in config code (policy-as-code), versioning and review.
Dry-run before application: how the schedule/cost will change.
Canary rollout: part of the clusters switches to the new weights/rules first.
Runbooks: what to do when overloaded, how to temporarily lower the class, how to return.
14) Antipatterns
"Everything is Gold." Prioritization loses its meaning; wars for resources begin.
Strict SP without starvation protection. Silver/Bronze never complete.
No admission control. "Expensive" requests enter the system and drop everyone.
No cost-awareness. Heavy backfills run during "expensive hours."
OLTP/OLAP mix. Critical transactions suffer from analytics.
Hybrid data without RLS/CLS. Repair/priority accidentally exposes sensitive fields.
15) Implementation Roadmap
1. Discovery: inventory of threads, dependencies and owners; assessing SLO and downtime costs.
2. QoS classes: define Gold/Silver/Bronze, weights and base limits; create a policy-as-code.
3. Scheduler and pools: split clusters/resource pools, enable admission control.
4. Monitoring: SLO boards/lag/cost; alerts to the threat of SLO and budget-breach.
5. Auto-boost: integration of signals (incidents, campaigns, DQ) into priority change.
6. Cost-aware: off-peak schedules, spot resources, profiling "expensive" requests.
7. Hardening: preemption-safe checkpoints, runbooks, canary policies, chaos tests.
16) Pre-release checklist
- QoS class, owner, SLO, and downtime cost are defined for all flows.
- Configured pools/clusters and admission control, CPU/IO/scan limits.
- Backpressure and rate limits on ingest/consumers are enabled.
- Prioritization policies are designed as code; there is a dry-run and a review.
- Lags, freshness, cost, preemption/errors are monitored; alerts in on-call.
- Configured auto-boost on signals (SLA threat, DQ, incident, campaign).
- Documented degradation runbooks; checked chaos scenarios.
- Bronze streams are migrated to off-peak/spot without the risk of cascading delays.
17) Sample policies (pseudo-YAML)
17.1 Gold class with deadline and budget

```yaml
policy: gold_finance_stream
priority_base: 90
deadline_slo: freshness<=10m
boost_on:
  - dq_violation: duplicates_in_txn_id>0
  - incident: "chargeback_spike"
limits:
  max_scan_mb: 20480
  max_concurrency: 32
budget:
  max_hourly_cost: 200
preemption:
  can_preempt_classes: [silver, bronze]
```
17.2 Cost-aware backfill for Bronze

```yaml
policy: bronze_backfill
priority_base: 20
schedule: offpeak(22:00-06:00)
limits:
  max_concurrency: 4
  iops_cap: low
fallback:
  pause_if_cluster_si > 0.8
```
18) The bottom line
Thread prioritization is a manageable combination of business priorities, technical SLOs, and economic constraints implemented through queues, schedulers, limits, and system feedback. When QoS classes, auto-boost signals, and cost-aware policies work together, data remains fresh and reliable, critical insights arrive on time, and infrastructure billing is predictable.