Task Queues and Balancing
1) Why task queues
A job queue (work queue) decouples producers from workers in both time and speed:
- Smooths peaks: acts as a buffer between the front end and heavy subsystems.
- Stabilizes SLAs: priorities and isolation of load classes.
- Simplifies fault tolerance: retries, DLQ, re-enqueueing.
- Scales horizontally: add workers without changing the API.
Typical domains: payment processing, notifications, report/media generation, ETL/ML postprocessing, integration with external APIs.
2) Model and basic concepts
Producer: publishes the task (payload + metadata: idempotency key, priority, deadline).
Queue/topic: buffer/log of tasks.
Worker: takes a task, processes it, then confirms (ack) or returns it with an error.
Visibility Timeout/Lease: the task is "leased" to a worker for the duration of processing; if the lease expires without an ack, the task is redelivered automatically.
DLQ (Dead Letter Queue): where tasks are "buried" after exceeding the attempt limit or hitting a fatal error.
Rate Limit/Concurrency: per-worker/per-queue/per-tenant consumption limits.
- Pull: the worker requests tasks itself, which lets it meter its own load.
- Push: the broker pushes tasks to workers; weak workers need protection against being flooded.
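The lease mechanics above (visibility timeout, ack, redelivery, DLQ after a delivery limit) can be sketched with a small in-memory model. The class `LeaseQueue` and its method names are illustrative assumptions, not any broker's API; the current time is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque


class LeaseQueue:
    """Toy pull queue with a visibility timeout, ack, and a dead-letter list."""

    def __init__(self, visibility_timeout, max_deliveries):
        self.visibility_timeout = visibility_timeout
        self.max_deliveries = max_deliveries
        self.ready = deque()   # task ids waiting for a worker
        self.leased = {}       # task id -> lease expiry time
        self.deliveries = {}   # task id -> deliveries so far
        self.dlq = []          # tasks that exhausted their deliveries

    def publish(self, task_id):
        self.deliveries[task_id] = 0
        self.ready.append(task_id)

    def pull(self, now):
        """Pull model: the worker asks for the next task when it is ready."""
        self._expire_leases(now)
        if not self.ready:
            return None
        task_id = self.ready.popleft()
        self.deliveries[task_id] += 1
        self.leased[task_id] = now + self.visibility_timeout
        return task_id

    def ack(self, task_id):
        """Successful processing: the lease is dropped and the task is gone."""
        self.leased.pop(task_id, None)

    def _expire_leases(self, now):
        """Redeliver un-acked tasks, or bury them once the limit is reached."""
        for task_id, expiry in list(self.leased.items()):
            if expiry <= now:
                del self.leased[task_id]
                if self.deliveries[task_id] >= self.max_deliveries:
                    self.dlq.append(task_id)
                else:
                    self.ready.append(task_id)


q = LeaseQueue(visibility_timeout=30, max_deliveries=2)
q.publish("job-1")
first = q.pull(now=0)    # leased until t=30, never acked
second = q.pull(now=31)  # lease expired, so the task is redelivered
third = q.pull(now=62)   # delivery limit reached, task moves to the DLQ
```

In this sketch a worker that crashes simply never calls `ack`, so redelivery needs no separate failure signal.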
3) Delivery and confirmation semantics
At-most-once: no retries; faster, but tasks can be lost.
At-least-once (default for most queues): duplicates are possible → handler idempotency is required.
Effectively exactly-once: achieved at the application level (idempotency, dedup, transactions/outbox). A broker can help, but it is not a "magic bullet."
Confirmations:
- Ack/Nack: an explicit result.
- Requeue/Retry: with backoff + jitter.
- Poison message: send to the DLQ.
4) Balancing and planning
4.1 Ordering and algorithms
FIFO: simple and predictable.
Priority Queue: priority classes (P0...P3).
WRR/WSR (Weighted Round-Robin/Random): CPU/throughput shares between classes.
WFQ/DRR (analogous to "fair" queues in networks): shares per tenant/client.
Deadline/EDF: for tasks with deadlines.
Fair Share: limiting "noisy neighbors" (per-tenant quotas).
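As one possible illustration of weighted scheduling between classes, here is a sketch of the "smooth" weighted round-robin variant, which interleaves classes instead of serving each weight in a burst. The class name `WeightedRoundRobin` and the class labels are assumptions made for the example.

```python
from collections import deque


class WeightedRoundRobin:
    """Smooth weighted round-robin over per-class queues."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}
        self.queues = {name: deque() for name in self.weights}

    def put(self, cls, task):
        self.queues[cls].append(task)

    def get(self):
        # Only classes that currently have work compete for the slot.
        active = [name for name in self.weights if self.queues[name]]
        if not active:
            return None
        for name in active:
            self.current[name] += self.weights[name]
        chosen = max(active, key=lambda name: self.current[name])
        self.current[chosen] -= sum(self.weights[name] for name in active)
        return self.queues[chosen].popleft()


sched = WeightedRoundRobin({"critical": 3, "bulk": 1})
for i in range(1, 5):
    sched.put("critical", f"c{i}")
    sched.put("bulk", f"b{i}")

order = [sched.get() for _ in range(4)]
# order == ["c1", "c2", "b1", "c3"]: three critical tasks per bulk task
```

Unlike strict priorities, the low-weight class still makes progress under sustained load, which is the usual reason to prefer weights over a pure priority queue.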
4.2 Processing flows
Single-flight/Coalescing: merge tasks that share the same key.
Concurrency caps: strict limits on parallelism by task type/integration (external APIs).
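A minimal sketch of coalescing, assuming that the latest payload should win when two tasks share a key (other merge policies are equally valid). `CoalescingQueue` is a hypothetical name used only for this example.

```python
class CoalescingQueue:
    """Keeps at most one pending task per key; duplicates are merged."""

    def __init__(self):
        # dict preserves insertion order, so the oldest key is served first.
        self.pending = {}

    def submit(self, key, payload):
        """Return True if the key is new, False if it was coalesced."""
        is_new = key not in self.pending
        self.pending[key] = payload  # assumption: the latest payload wins
        return is_new

    def take(self):
        """Return the oldest pending (key, payload) pair, or None if empty."""
        if not self.pending:
            return None
        key = next(iter(self.pending))
        return key, self.pending.pop(key)


cq = CoalescingQueue()
a = cq.submit("report:42", {"version": 1})
b = cq.submit("report:42", {"version": 2})  # merged into the pending task
c = cq.submit("report:7", {"version": 1})
taken = cq.take()
```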
4.3 Geo and sharding
Shard by key (tenant/id) → data locality and a stable order within each shard.
Sticky caches/resources: hash routing to workers that hold the "attached" state.
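Key-based sharding needs a hash that is stable across processes and restarts. The sketch below uses SHA-256 for that reason, because Python's built-in `hash()` for strings is randomized per process by default. The function name `shard_for` is an assumption for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# All tasks for one tenant land on the same shard, preserving their order.
shard_a = shard_for("tenant-a", 8)
shard_a_again = shard_for("tenant-a", 8)
```

Note that plain modulo routing remaps most keys when the shard count changes; if shards are resized often, a consistent-hashing scheme may fit better.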
5) Retries, backoff and DLQ
Exponential backoff + jitter: 'base * 2^attempt ± random'.
Maximum attempts and a total deadline (time-to-die) per task.
Classification of errors: 'retryable' (network/limit), 'non-retryable' (validation/business prohibition).
Parking/Delay Queue: deferred tasks (for example, retry after 15 minutes).
DLQ policy: state explicitly where "poison" messages go and under what conditions; provide a reprocessor.
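The retry rules above can be combined into one small decision function. This is a sketch under stated assumptions: the exception classes, the 20% jitter fraction, the 300-second cap, and the zero-based `attempt` counter are all illustrative choices, not values from any particular broker.

```python
import random


class RetryableError(Exception):
    """Transient failure (network, rate limit): worth retrying."""


class NonRetryableError(Exception):
    """Permanent failure (validation, business rule): retrying cannot help."""


def backoff_delay(attempt, base=1.0, cap=300.0, jitter=0.2, rng=random):
    """Return base * 2^attempt, capped, then shifted by up to +/- jitter."""
    delay = min(cap, base * 2 ** attempt)
    return delay * rng.uniform(1.0 - jitter, 1.0 + jitter)


def next_action(error, attempt, max_attempts=6):
    """Decide the fate of a failed task: ('retry', delay) or ('dlq', None)."""
    if isinstance(error, NonRetryableError):
        return ("dlq", None)
    if attempt + 1 >= max_attempts:  # attempt is zero-based
        return ("dlq", None)
    return ("retry", backoff_delay(attempt))


delay = backoff_delay(3)  # around 8 seconds, give or take 20%
fatal = next_action(NonRetryableError("invalid order"), attempt=0)
exhausted = next_action(RetryableError("timeout"), attempt=5)
```

The jitter matters because many tasks that failed together would otherwise retry together and hit the dependency in synchronized waves.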
6) Idempotency and deduplication
Idempotency-Key in the task; a store (Redis/DB) with a TTL holding the last N keys:
- seen → skip/merge/result-cache.
- Natural keys: use 'order_id'/'payment_id' instead of random UUIDs.
- Outbox: record the task and its status in the same database transaction as the business operation.
- Exactly-once at the sink: 'UPSERT' by key, versioning, at-least-once in the queue + idempotency in the database.
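A sketch of idempotency at the sink combined with an outbox, using SQLite from the standard library. The table names, the `payment_applied` event name, and the function `apply_payment` are assumptions for the example; the insert-if-absent here uses `INSERT OR IGNORE`, a simpler relative of a full UPSERT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (
        payment_id TEXT PRIMARY KEY,   -- natural idempotency key
        amount     INTEGER NOT NULL
    );
    CREATE TABLE outbox (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        payment_id TEXT NOT NULL,
        event      TEXT NOT NULL
    );
""")


def apply_payment(conn, payment_id, amount):
    """Apply a payment at most once; return False for a duplicate delivery."""
    with conn:  # one transaction: commit on success, rollback on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO payments (payment_id, amount) VALUES (?, ?)",
            (payment_id, amount),
        )
        if cur.rowcount == 0:
            return False  # key already seen: skip the side effects
        # The outbox row commits atomically with the business change.
        conn.execute(
            "INSERT INTO outbox (payment_id, event) VALUES (?, ?)",
            (payment_id, "payment_applied"),
        )
        return True


first_delivery = apply_payment(conn, "pay-1", 100)
redelivery = apply_payment(conn, "pay-1", 100)  # duplicate from the queue
```

A separate relay process would then read the `outbox` table and publish the events, so the queue never sees an event for a change that was rolled back.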
7) Multi-tenancy and SLA classes
Separate queues/streams by class: 'critical', 'standard', 'bulk'.
Quotas and priorities per tenant (Gold/Silver/Bronze).
Isolation: dedicate worker pools to P0; run background work on a separate cluster/nodes.
Admission control: do not accept more than you can process within the deadlines.
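Admission control can be approximated with a back-of-the-envelope estimate of when a new task would finish. The sketch below assumes a single FIFO queue, identical workers, and a known average service time; the function name `admit` and its parameters are hypothetical.

```python
def admit(queue_depth, workers, avg_service_time_s, deadline_s):
    """Accept a task only if its estimated completion fits the deadline."""
    # Tasks already queued are drained roughly `workers` at a time.
    estimated_wait_s = (queue_depth / workers) * avg_service_time_s
    return estimated_wait_s + avg_service_time_s <= deadline_s


# 100 queued tasks, 10 workers, 2 s each: about 22 s to finish a new task.
accepted = admit(queue_depth=100, workers=10, avg_service_time_s=2.0, deadline_s=30.0)
# 200 queued tasks push the estimate to about 42 s, past the 30 s deadline.
rejected = admit(queue_depth=200, workers=10, avg_service_time_s=2.0, deadline_s=30.0)
```

Rejecting at the door lets the producer fail fast or degrade, instead of discovering the missed deadline after the task has already consumed queue capacity.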
8) Autoscaling workers
Metrics for scaling: queue depth, arrival rate, processing time, SLA deadlines.
KEDA/Horizontal Pod Autoscaler: triggers on SQS/RabbitMQ queue depth or Kafka lag.
Limiting factors: rate limits of external APIs and database capacity (do not overwhelm the back end).
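One way to turn these metrics into a replica count is to size the pool for the arrival rate plus the rate needed to drain the backlog within a target time, then clamp the result. This is a simplified sketch, not how KEDA or the HPA compute replicas internally; all names and the drain-target idea are assumptions.

```python
import math


def desired_workers(queue_depth, arrival_rate, avg_service_time_s,
                    drain_target_s, min_workers, max_workers):
    """Workers needed to keep up with arrivals and drain the backlog in time."""
    per_worker_rate = 1.0 / avg_service_time_s              # tasks/s per worker
    required_rate = arrival_rate + queue_depth / drain_target_s
    needed = math.ceil(required_rate / per_worker_rate)
    # max_workers encodes downstream limits (external API quotas, database).
    return max(min_workers, min(max_workers, needed))


# 10 tasks/s arriving, 600 queued, 0.5 s per task, drain within 60 s.
normal = desired_workers(600, 10, 0.5, 60, min_workers=1, max_workers=50)
# A much deeper backlog is capped at 20 workers to protect the back end.
capped = desired_workers(6000, 10, 0.5, 60, min_workers=1, max_workers=20)
```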
9) Technology options and patterns
9.1 RabbitMQ/AMQP
Prefetch (QoS) controls how many unacknowledged tasks a worker holds at once.
Exchanges: direct/topic/fanout; queues with ack/TTL/DLQ (dead-letter exchange).
DLX example:
```ini
x-dead-letter-exchange=dlx
x-dead-letter-routing-key=jobs.failed
x-message-ttl=60000
```
9.2 SQS (and analogues)
Visibility Timeout, DelaySeconds, RedrivePolicy (DLQ).
Idempotency: handled by the application (dedup table).
Limits: batches of 1-10 messages; rely on idempotent sinks.
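Because a batch carries at most 10 messages, producers usually split larger sets before sending. A minimal helper for that, with the hypothetical name `batches`, might look like this; the actual send call is left out on purpose.

```python
def batches(items, size=10):
    """Yield consecutive slices of at most `size` items (1-10 per batch)."""
    if not 1 <= size <= 10:
        raise ValueError("batch size must be between 1 and 10")
    for start in range(0, len(items), size):
        yield items[start:start + size]


messages = [f"msg-{i}" for i in range(23)]
sizes = [len(batch) for batch in batches(messages)]
```

A batch send can succeed for some entries and fail for others, so the per-entry results still need to be checked and only the failed entries retried.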
9.3 Kafka/NATS JetStream
For large-scale pipelines: high throughput, retention/replay.
Task queue on top of a log: one task = one message; "one worker per key" is enforced through partitioning (Kafka) or subject partitioning (NATS).
Retries: separate topics/subject suffixes with backoff.
9.4 Redis queues (Sidekiq/Resque/Bull/Celery-Redis)
Very low latency; watch durability (RDB/AOF), retry keys, and lock keys for single-flight.
Suitable for "light" tasks, not for long-term retention.
9.5 Frameworks
Celery, RQ, Huey (Python), Sidekiq, Resque (Ruby), BullMQ (Node): ready-made retries, schedules, middleware, metrics.
10) Routing and balancing schemes
Round-Robin: even distribution, but ignores how heavy each task is.
Weighted RR: distribution by worker capacity/pool.
Fair/Backpressure-aware: The worker only picks up a new task when ready.
Priority lanes: separate queues per class; workers read in order [P0 → ... → Pn], taking the first lane that has tasks.
Hash-routing: 'hash(key) % shards' - for stateful/cached processing.
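The priority-lanes scheme reduces to a short polling loop on the worker side. The sketch below uses in-memory deques as stand-ins for real queues; `next_task` is a hypothetical name.

```python
from collections import deque


def next_task(lanes):
    """Return the next task from the highest-priority non-empty lane.

    `lanes` is ordered from the highest priority (P0) to the lowest.
    """
    for lane in lanes:
        if lane:
            return lane.popleft()
    return None


p0, p1, p2 = deque(), deque(["resize-image"]), deque(["nightly-report"])
lanes = [p0, p1, p2]

picked_first = next_task(lanes)   # P0 is empty, so P1 is served
p0.append("charge-card")
picked_second = next_task(lanes)  # new P0 work jumps ahead of P2
picked_third = next_task(lanes)
```

Strict ordering like this can starve the lowest lanes under sustained P0 load, which is why it is often paired with weights or a reserved pool for bulk work.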
11) Timeouts, deadlines and SLAs
Per-task timeout: the internal processing limit (in the worker code) must be ≤ the broker's Visibility Timeout.
Global deadline: once the task no longer makes sense after time T, NACK → DLQ.
Budget-aware: reduce work (brownout) when the deadline approaches (partial results).
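A budget-aware worker can decide up front how much work the remaining time allows. This sketch assumes the worker knows rough costs for a full and a degraded (brownout) run; the function name `plan` and the returned labels are illustrative.

```python
def plan(now, deadline, full_cost_s, degraded_cost_s):
    """Choose a processing mode based on the time left before the deadline."""
    remaining = deadline - now
    if remaining >= full_cost_s:
        return "full"
    if remaining >= degraded_cost_s:
        return "degraded"     # brownout: partial result, cheaper path
    return "dead_letter"      # the result would arrive too late to matter


on_time = plan(now=0, deadline=60, full_cost_s=20, degraded_cost_s=5)
tight = plan(now=50, deadline=60, full_cost_s=20, degraded_cost_s=5)
expired = plan(now=61, deadline=60, full_cost_s=20, degraded_cost_s=5)
```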
12) Observability and management
12.1 Metrics
`queue_depth`, `arrival_rate`, `service_rate`, `lag` (Kafka), `invisible_messages` (SQS).
`success/failed/retried_total`, `retry_attempts`, `dlq_in_total`, `processing_time_ms{p50,p95,p99}`.
`idempotency_hit_rate`, `dedup_drops_total`, `poison_total`.
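For reference, the p50/p95/p99 values behind `processing_time_ms` can be computed from raw samples with the nearest-rank method, sketched below. Production metric systems typically use histograms or sketches instead of sorting raw samples; the function name `percentile` is an assumption.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


processing_time_ms = list(range(1, 101))  # 1..100 ms, one sample each
p50 = percentile(processing_time_ms, 50)
p95 = percentile(processing_time_ms, 95)
p99 = percentile(processing_time_ms, 99)
```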
12.2 Logs/Tracing
Correlation: 'job_id', 'correlation_id', deduplication key.
Record 'retry/backoff/dlq' as events; link them to the span of the initial request.
12.3 Dashboards/alerts
Triggers: depth > X, p99 > SLO, DLQ growth, stuck tasks (visibility expired > N), hot keys.
13) Security and compliance
Encryption in transit and/or at rest.
Tenant isolation: individual queues/key spaces, ACLs, quotas.
PII minimization in the payload; use a hash/ID instead of raw PII.
Secrets: do not put tokens in the task body, use vault/refs.
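PII minimization in the payload can be as simple as replacing the raw value with a keyed hash before publishing. The sketch below uses HMAC-SHA256 from the standard library; the function names and payload fields are assumptions, and in practice the secret would come from a vault rather than from code.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Return a keyed hash so the raw value never enters the queue."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def build_payload(user_id: int, email: str, secret: bytes) -> dict:
    """Build a task payload that carries an ID and a hash instead of the email."""
    return {
        "user_id": user_id,
        "email_hash": pseudonymize(email.lower(), secret),
    }


secret = b"example-only-secret"  # placeholder: load from a secret store
payload = build_payload(42, "Alice@Example.com", secret)
```

A keyed hash is used instead of a bare SHA-256 because unkeyed hashes of low-entropy values such as emails can be reversed by enumeration.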
14) Anti-patterns
Retries without idempotency → duplicate operations, money charged twice.
One giant queue "for everything" → no isolation, unpredictable delays.
Endless retries without a DLQ → "poison" tasks that never die.
Visibility Timeout < processing time → cascading duplicates.
Large payloads in the queue → network/memory pressure; store them in an object store and pass a link instead.
Push model without backpressure → workers choke.
Mixing critical and bulk tasks in one pool of workers.
15) Implementation checklist
- Classify tasks by SLA (P0/P1/P2) and volume.
- Select a broker/framework with the desired semantics and retention.
- Design keys, priorities, and routing (hash/shards/priority lanes).
- Enable retries with backoff + jitter and a DLQ policy.
- Implement idempotency (keys, upsert, dedup store with TTL).
- Set per-task, visibility, and overall deadline timeouts.
- Limit concurrency and rate by integrations/tenants.
- Autoscale on depth/lag, with safeguards.
- Metrics/tracing/alerts; runbooks for retry "storms" and DLQ overflow.
- Failure tests: worker crash, "poison" message, overload, long-running tasks.
16) Sample Configurations and Code
16.1 Celery (Redis/Rabbit) - base flow
```python
from celery import Celery

app = Celery("jobs", broker="amqp://...", backend="redis://...")
app.conf.task_acks_late = True  # ack only after the task has finished
# visibility_timeout is honored by the Redis/SQS transports
app.conf.broker_transport_options = {"visibility_timeout": 3600}
app.conf.task_default_retry_delay = 5
app.conf.task_time_limit = 300  # hard timeout, in seconds

# seen(), do_work() and mark_seen() are application-provided helpers.
@app.task(bind=True, autoretry_for=(Exception,), retry_backoff=True, retry_jitter=True, max_retries=6)
def process_order(self, order_id):
    if seen(order_id):
        return "ok"  # idempotency: this order was already processed
    do_work(order_id)
    mark_seen(order_id)
    return "ok"
```
16.2 RabbitMQ - DLQ/TTL
```ini
x-dead-letter-exchange=dlx
x-dead-letter-routing-key=jobs.dlq
x-message-ttl=600000  # 10 minutes
x-max-priority=10
```
16.3 Kafka - tiered retries
orders -> orders.retry.5s -> orders.retry.1m -> orders.dlq
(Messages move between tiers with delayed delivery via a scheduler or cron-style consumer.)
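The tier chain can be expressed as a lookup that tells a failing consumer where to republish a message. This is a sketch of the routing decision only; producing to Kafka and the delay itself are out of scope, and `next_topic` is a hypothetical helper.

```python
RETRY_CHAIN = ["orders", "orders.retry.5s", "orders.retry.1m", "orders.dlq"]


def next_topic(current_topic):
    """Return the topic a failed message should be republished to."""
    index = RETRY_CHAIN.index(current_topic)
    # The DLQ is terminal: a message that is already there stays there.
    return RETRY_CHAIN[min(index + 1, len(RETRY_CHAIN) - 1)]


after_first_failure = next_topic("orders")
after_last_retry = next_topic("orders.retry.1m")
```

Keeping each delay in its own topic means a slow retry tier never blocks fresh messages in the main topic, at the cost of losing strict ordering for retried keys.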
16.4 NATS JetStream - consumer with backoff
```bash
nats consumer add JOBS WORKERS --filter "jobs.email" \
  --deliver pull --ack explicit --max-deliver 6 \
  --backoff "1s,5s,30s,2m,5m"
```
17) FAQ
Q: When to choose push versus pull?
A: Pull gives natural backpressure and fair balancing; push is simpler at low rates and when minimal TTFB is needed, but requires limiters.
Q: How to avoid a hot key?
A: Shard by a composite key ('order_id % N'), buffer and batch-process, and enforce per-key limits.
Q: Is "exactly-once" possible?
A: In practice, yes: through idempotency and a transactional outbox. Strict, "mathematical" exactly-once across the whole path is rarely achievable and is expensive.
Q: Where to store large task attachments?
A: In object storage (S3/GCS), with only a link/ID in the task; this reduces pressure on the broker and network.
Q: How to choose TTL/visibility?
A: Visibility ≥ p99 processing time × a safety margin of 2-3×. Task TTL should be shorter than the business deadline.
18) Summary
A strong queuing system is a balance between delivery semantics, priorities, and constraints. Design keys and routing, ensure idempotency, retry with backoff and a DLQ, allocate resources to SLA classes, and monitor metrics. Then your background processes will be predictable, stable, and scalable, with no surprises under peak load.