GH GambleHub

Task Queues and Balancing

1) Why task queues

The job queue/work queue disconnects manufacturers and performers by time and speed:
  • Smoothes peaks: buffer between the front and heavy subsystems.
  • Stabilizes SLA: priorities and isolation of load classes.
  • Simplifies fault tolerance: retrays, DLQ, re-staging.
  • Scales horizontally: Add workers without changing the API.

Typical domains: payment processing, notifications, report/media generation, ETL/ML postprocessing, integration with external APIs.


2) Model and basic concepts

Producer: publishes the task (payload + metadata: idempotency key, priority, deadline).
Queue/topic: buffer/log of tasks.
Worker: takes a task, processes, confirms (ack) or returns with an error.
Visibility Timeout/Lease: "rent" tasks for the duration of processing, after - auto-redelivery.
DLQ (Dead Letter Queue): "burying" tasks after the limit of attempts/fatal errors.
Rate Limit/Concurrency: per-worker/per-queue/per-tenant consumption limits.

Dispensing models:
  • Pull: the worker himself requests the task (doses the load).
  • Push: Broker fluffs; need protection against "filling" weak workers.

3) Delivery and confirmation semantics

At-most-once: no retrays; faster, but possible loss.
At-least-once (default for most queues): duplicates are possible → handler idempotency is required.

Effectively exactly-once: achieved at the application level (idempotency, dedup, transactions/outbox). A broker can help, but not a "magic bullet."

Confirmations:
  • Ack/Nack: clear result.
  • Requeue/Retry: с backoff + jitter.
  • Poison message - send to DLQ.

4) Balancing and planning

4. 1 Sequence and algorithms

FIFO: Simple and predictable.
Priority Queue: priority classes (P0... P3).
WRR/WSR (Weighted Round-Robin/Random): CPU shares/transput between classes.
WFQ/DRR (analogous to "fair" queues in networks): shares per tenant/client.
Deadline/EDF: for tasks with deadlines.
Fair Share: limiting "noisy neighbors" (per-tenant quotas).

4. 2 Processing flows

Single-flight/Coalescing: Combine duplicate key tasks.
Concurrency caps: strict limits on parallelism by task type/integration (external APIs).

4. 3 Geo and Shardening

Shards by key (tenant/id) → locality of data, stable order within shards.
Sticky caches/resources: hash routing to workers with an "attached" state.


5) Retrai, backoff and DLQ

Exponential backoff + jitter: 'base 2 ^ attempt ± random'.
Maximum attempts and total deadline (time-to-die) per task.
Classification of errors: 'retryable' (network/limit), 'non-retryable' (validation/business prohibition).
Parking/Delay Queue: deferred tasks (for example, repeat after 15 minutes).
DLQ policy: be sure to indicate where and under what conditions the "poisonous" message gets; provide a reprocessor.


6) Idempotency and deduplication

Idempotency-Key in the task; store (Redis/DB) with TTL for the last N keys:
  • seen → skip/merge/result-cache.
  • Natural keys: Use'order _ id/ payment_id' instead of random UUIDs.
  • Outbox - record the fact of the task and its status in one database transaction with a business transaction.
  • Exactly-once in blue: 'UPSERT' by key, versioning, "at-least-once" in queue + idempotency in database.

7) Multi-tenancy and SLA classes

Separate queues/streams by class: 'critical', 'standard', 'bulk'.
Quotas and priorities per tenant (Gold/Silver/Bronze).
Isolation: dedicate pools of workers under P0; background - in a separate cluster/nodes.
Admission control: do not accept more than you can process in deadlines.


8) Autoscaling workers

Metrics for scaling: queue depth, arrival rate, processing time, SLA deadlines.
KEDA/Horizontal Pod Autoscaler: SQS/Rabbit/Kafka lag depth triggers.
Restraining factors: external rate limits APIs, database (do not destroy the back end).


9) Technology options and patterns

9. 1 RabbitMQ/AMQP

Prefetch (QoS) regulates "how many tasks are on the worker."

Exchanges: direct/topic/fanout; Queues с ack/ttl/DLQ (dead-letter exchange).

DLX Example:
ini x-dead-letter-exchange=dlx x-dead-letter-routing-key=jobs.failed x-message-ttl=60000

9. 2 SQS (and analogues)

Visibility Timeout, DelaySeconds, RedrivePolicy (DLQ).
Idempotence - on the application (dedup table).
Limits: butches 1-10 posts; focus on idempotent bruises.

9. 3 Kafka/NATS JetStream

For large-scale pipelines: high throughput, retention/replay.
Task queue over logs: one task = one message; one worker per key control through/subject partitioning.
Retrai: individual topics/subject-suffixes with backoff.

9. 4 Redis queues (Sidekiq/Resque/Bull/Celery-Redis)

Very low latency; watch for stability (RDB/AOF), retry keys and lock-keys for single-flight.
Suitable for "light" tasks, not for long-term retension.

9. 5 Frameworks

Celery (Python), Sidekiq (Ruby), RQ/BullMQ (Node), Huey/Resque - ready-made retrays, schedules, middleware, metrics.


10) Routing and balancing schemes

Round-Robin: Evenly but does not take into account the "severity" of tasks.
Weighted RR: distribution by worker capacity/pool.
Fair/Backpressure-aware: The worker only picks up a new task when ready.
Priority lanes: separate queues per class; workers read in order [P0→... →Pn] if available.
Hash-routing: 'hash (key)% shards' - for stateful/cached processing.


11) Timeouts, deadlines and SLAs

Per-task timeout: internal "lease" of work (in the worker code) ≤ the broker's Visibility Timeout.
Global deadline: the task does not make sense after T time - NACK→DLQ.
Budget-aware: reduce work (brownout) when the deadline approaches (partial results).


12) Observability and management

12. 1 Metrics

`queue_depth`, `arrival_rate`, `service_rate`, `lag` (Kafka), `invisible_messages` (SQS).
`success/failed/retired_total`, `retry_attempts`, `dlq_in_total`, `processing_time_ms{p50,p95,p99}`.
`idempotency_hit_rate`, `dedup_drops_total`, `poison_total`.

12. 2 Logs/Tracing

Correlation: 'job _ id', 'correlation _ id', deduplication key.
Mark 'retry/backoff/dlq' as events; linking from span initial request.

12. 3 Dashboards/alerts

Triggers: depth> X, p99> SLO, DLQ growth, stuck tasks (visibility expired> N), hot keys.


13) Safety and compliance

Encryption on transport and/or "at rest."

Tenant isolation: individual queues/key spaces, ACLs, quotas.
PII minimization in payload; hash/ID instead of crude PII.
Secrets: do not put tokens in the task body, use vault/refs.


14) Anti-patterns

Retrai without idempotency → duplicate operations/money "twice."

One giant queue "for everything →" no isolation, unpredictable delays.
Endless retrai without DLQ → eternal "poisonous" tasks.
Visibility Timeout <processing time → cascaded duplicates.
Large payload in queue → network/memory pressure; it is better to store in an object stor and transfer a link.
Push model without backpressure → workers choke.
Mixing critical and bulk tasks in one pool of workers.


15) Implementation checklist

  • Classify tasks by SLA (P0/P1/P2) and volume.
  • Select a broker/framework with the desired semantics and retention.
  • Design keys, priorities, and routing (hash/shards/priority lanes).
  • Enable backoff + jitter retrays and DLQ policy.
  • Implement idempotency (keys, upsert, deadstore with TTL).
  • Set the per-task, visibility, and general deadline timeouts.
  • Limit concurrency and rate by integrations/tenants.
  • Depth/lag auto-scaling with fuses.
  • Metrics/tracing/alerts; runbooks on "storm" and DLQ overflow.
  • Tests for fails: the fall of the worker, the "poisonous" message, overload, long tasks.

16) Sample Configurations and Code

16. 1 Celery (Redis/Rabbit) - base flow

python app = Celery("jobs", broker="amqp://...", backend="redis://...")
app.conf.task_acks_late = True        # ack после выполнения app.conf.broker_transport_options = {"visibility_timeout": 3600}
app.conf.task_default_retry_delay = 5 app.conf.task_time_limit = 300        # hard timeout

@app.task(bind=True, autoretry_for=(Exception,), retry_backoff=True, retry_jitter=True, max_retries=6)
def process_order(self, order_id):
if seen(order_id): return "ok"      # идемпотентность do_work(order_id)
mark_seen(order_id)
return "ok"

16. 2 RabbitMQ — DLQ/TTL

ini x-dead-letter-exchange=dlx x-dead-letter-routing-key=jobs.dlq x-message-ttl=600000   # 10 минут x-max-priority=10

16. 3 Kafka - Retrays by level


orders -> orders.retry.5s -> orders.retry.1m -> orders.dlq

(Transfer with delayed delivery via scheduler/cron-consumer.)

16. 4 NATS JetStream — consumer с backoff

bash nats consumer add JOBS WORKERS --filter "jobs.email" \
--deliver pull --ack explicit --max-deliver 6 \
--backoff "1s,5s,30s,2m,5m"

17) FAQ

Q: When to choose push versus pull?
A: Pull gives natural backpressure and "honest" balancing; push is easier at low speeds and when minimal TTFB is needed, but requires limiters.

Q: How to avoid a hot key?
A: Shard by composite key ('order _ id% N'), buffer and batch-process, enter limits per-key.

Q: Is it possible to "exactly-once"?
A: Practically - through idempotence and transactional outbox. Fully "mathematical" exactly-once is rarely achievable and expensive all the way.

Q: Where to store large task attachments?
A: In object storage (S3/GCS), and in the task - link/ID; reduces pressure on the broker and network.

Q: How to choose TTL/visibility?
A: Visibility ≥ p99 processing time × stock 2-3 ×. TTL tasks - less business deadline.


18) Totals

A strong queuing system is a balance between delivery semantics, priorities, and constraints. Design keys and routing, ensure idempotency, retray with backoff and DLQ, allocate resources to SLA classes and monitor metrics. Then your background processes will be predictable, stable and scalable - no surprises under the peaks.

Contact

Get in Touch

Reach out with any questions or support needs.We are always ready to help!

Start Integration

Email is required. Telegram or WhatsApp — optional.

Your Name optional
Email optional
Subject optional
Message optional
Telegram optional
@
If you include Telegram — we will reply there as well, in addition to Email.
WhatsApp optional
Format: +country code and number (e.g., +380XXXXXXXXX).

By clicking this button, you agree to data processing.