
Scheduler and Background Tasks

(Section: Operations and Management)

1) Purpose

The scheduler and background tasks run the platform's non-interactive work: periodic calculations, artifact publication, clearing, and queue replays. The objectives are determinism, fault tolerance, and auditability.


2) Task taxonomy

Time-based: scheduled by cron or calendar (clearing, closing RTP windows, exports, archiving).
Event-driven: triggered from the bus (PaymentsSettled, PriceListUpdated).
One-off/ad-hoc: one-off jobs with a TTL.
Long-running: backfills/sagas, streaming compactions.
Maintenance: key rotation, repacking, index rebuilds, cache warm-up.


3) Architecture (reference)

Components:

1. Scheduler (control-plane): stores schedules, CAL/cron, maintenance windows, timezones, limiters.

2. Dispatcher: turns the plan into queued jobs (per priority/tenant/region), assigns deadlines and idempotency keys.

3. Workers: static or autoscaled pools per task type; heartbeats, leases.

4. Queue/Bus: FIFO/priority, DLQ, deferred messages.

5. Locker/Coordination: distributed locks (leases), leader-election (Raft/ZK/Consul).

6. Vault/KMS: JIT secrets, short TTL.

7. Observability: traces/metrics/logs, dashboards, alerts.

8. Audit/WORM: immutable receipts of execution, Merkle-slices.

Patterns: outbox/CDC, idempotency, compensation (sagas), backpressure, circuit-breakers.


4) Schedules: cron and calendars

Cron v3: second/minute/hour/day/month/day-of-week; supports steps ("*/5"), ranges, and lists.
Calendars/exceptions: business calendar, silence windows, holidays/DST.
Timezones: store 'tz' on the task; start times follow the tenant's local time.
Multi-region: per-region copies of schedules, or "lead region + followers" with drain and re-election.
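
The step, range, and list syntax can be illustrated with a small field expander. This is a minimal sketch, not the scheduler's actual parser; `expand_field` and its `lo`/`hi` bounds are hypothetical names chosen for illustration, and calendar exceptions and timezones are out of scope here.

```python
from __future__ import annotations


def expand_field(expr: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field ("*/5", "1-5", "0,15,30") into sorted values.

    lo and hi are the inclusive bounds of the field (0-59 for minutes, etc.).
    """
    values: set[int] = set()
    for part in expr.split(","):
        # Optional "/step" suffix; the default step is 1.
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"step must be >= 1 in {part!r}")
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-", 1)
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        if not (lo <= start <= end <= hi):
            raise ValueError(f"{part!r} is outside {lo}-{hi}")
        values.update(range(start, end + 1, step))
    return sorted(values)


print(expand_field("*/5", 0, 59)[:4])       # [0, 5, 10, 15]
print(expand_field("1-5", 0, 6))            # [1, 2, 3, 4, 5]
print(expand_field("0,15,30-32", 0, 59))    # [0, 15, 30, 31, 32]
```

A full matcher would expand each of the six fields this way and test a tenant-local timestamp against all of them.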


5) Queues, priorities, SLAs

Priority classes: P0 (critical), P1, P2, P3; individual worker pools.
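SLA/deadlines: 'must_start_by', 'must_finish_by'; a missed deadline triggers escalation or a retry.
Quotas and fairness: caps on tasks per minute per tenant, tokens for bursts, noisy-neighbor isolation.
Delay/visibility timeout: deferred delivery, and hiding an in-flight message until it is acknowledged or the timeout expires.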
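
One common way to implement per-tenant caps with burst tokens is a token bucket. The sketch below is an illustration under that assumption, not the platform's limiter; `TokenBucket` and its fields are hypothetical, and the clock is passed in so the behavior is deterministic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Per-tenant limiter: refills `rate` tokens per second, holds up to `capacity`."""

    rate: float
    capacity: float
    tokens: float = 0.0
    updated_at: float = 0.0

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should delay or reject the task


# One bucket per tenant keeps a noisy tenant from starving the others.
buckets = {"tenant-a": TokenBucket(rate=1.0, capacity=3.0, tokens=3.0)}
burst = [buckets["tenant-a"].try_acquire(now=0.0) for _ in range(4)]
print(burst)  # [True, True, True, False]
later = [buckets["tenant-a"].try_acquire(now=2.0) for _ in range(3)]
print(later)  # [True, True, False]
```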


6) Concurrency and locks

Leases: a worker leases a job and extends the lease by heartbeat; on timeout the lease is revoked.
Mutex/semaphores: per resource (for example, "only one worker writes price list X").
Sharding: by 'tenant/region/hash(key)'; sticky routing for cache and data locality.
Leader election: one leader publishes "system" jobs (for example, "close all RTP windows"); followers stay on hot standby.
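
The lease lifecycle (acquire, extend by heartbeat, revoke on timeout) might look like the following sketch. `LeaseTable` is a hypothetical in-memory stand-in for a real coordination store; a production version would need atomic compare-and-set operations in that store.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Lease:
    job_id: str
    worker_id: str
    acquired_at: float  # time of the last acquire or heartbeat, in seconds
    ttl: float

    def expired(self, now: float) -> bool:
        return now >= self.acquired_at + self.ttl


class LeaseTable:
    """In-memory lease registry (illustration only, not thread-safe)."""

    def __init__(self) -> None:
        self._leases: dict[str, Lease] = {}

    def acquire(self, job_id: str, worker_id: str, now: float, ttl: float) -> bool:
        current = self._leases.get(job_id)
        held_by_other = (
            current is not None
            and not current.expired(now)
            and current.worker_id != worker_id
        )
        if held_by_other:
            return False
        # Free, expired (revoked by timeout), or re-acquired by the same worker.
        self._leases[job_id] = Lease(job_id, worker_id, now, ttl)
        return True

    def heartbeat(self, job_id: str, worker_id: str, now: float) -> bool:
        current = self._leases.get(job_id)
        if current is None or current.worker_id != worker_id or current.expired(now):
            return False  # the worker lost the lease and must stop working
        current.acquired_at = now  # restart the TTL window
        return True


table = LeaseTable()
print(table.acquire("job-1", "w1", now=0, ttl=10))   # True
print(table.acquire("job-1", "w2", now=5, ttl=10))   # False
print(table.heartbeat("job-1", "w1", now=8))         # True
print(table.acquire("job-1", "w2", now=18, ttl=10))  # True
print(table.heartbeat("job-1", "w1", now=19))        # False
```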


7) Reliability: retries, idempotency, dedup

Idempotency key: '(task_type, business_id, window)'; a repeat returns the same receipt.
Retries: exponential back-off with jitter, an attempt limit, and an on-error strategy (retry/cancel/compensate).
Poison pill: fast transfer to the DLQ after N failures, with an alert to the owner.
Dedup: seen-cache (in-memory + KV) with TTL windows.
Exactly-once effects: side effects are confirmed via a transaction log and receipts.
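
A rough sketch of how an idempotency key, a receipt cache, and jittered back-off can fit together is shown below. All names (`idempotency_key`, `ReceiptStore`, `run_with_retries`) are hypothetical; the receipt store is an in-memory dict, whereas a real one would need an atomic check-and-write in durable storage. The jitter variant used here is "full jitter" (uniform between zero and the exponential bound), which is one of several reasonable choices.

```python
from __future__ import annotations

import hashlib
import random
import time
from typing import Callable


def idempotency_key(task_type: str, business_id: str, window: str) -> str:
    raw = "\x1f".join((task_type, business_id, window))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: random.Random | None = None) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def run_with_retries(effect: Callable[[], str], max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep,
                     rng: random.Random | None = None) -> str:
    for attempt in range(max_attempts):
        try:
            return effect()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted; the caller would route the job to the DLQ
            sleep(backoff_delay(attempt, rng=rng))
    raise ValueError("max_attempts must be >= 1")


class ReceiptStore:
    """Repeats with the same key return the stored receipt without re-running."""

    def __init__(self) -> None:
        self._receipts: dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        if key in self._receipts:
            return self._receipts[key]
        outcome = effect()
        receipt = hashlib.sha256(f"{key}:{outcome}".encode("utf-8")).hexdigest()
        self._receipts[key] = receipt
        return receipt


calls: list[int] = []
store = ReceiptStore()
key = idempotency_key("payout_clearing", "merchant-42", "2024-01-01")
first = store.run_once(key, lambda: calls.append(1) or "settled")
second = store.run_once(key, lambda: calls.append(1) or "settled")
print(first == second, len(calls))  # True 1

state = {"n": 0}


def flaky_effect() -> str:
    state["n"] += 1
    if state["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays: list[float] = []
result = run_with_retries(flaky_effect, sleep=delays.append, rng=random.Random(7))
print(result, len(delays))  # ok 2
```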


8) Managing long and heavy tasks

Chunking: split work into batches with checkpoints and continuation.
Time-boxing: limits on CPU, I/O, and network egress; interrupt with progress saved.
Sagas/compensations: "undo" semantics for cross-service steps.
Concurrency caps: limits on simultaneous tasks per type/tenant/region.
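
Chunking with checkpoints and an interruptible budget can be sketched as follows. The function name and the `checkpoint` dict are assumptions for illustration; the budget is counted in chunks rather than wall-clock time so the example stays deterministic, and a real checkpoint would be persisted outside the process.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence


def process_in_chunks(
    items: Sequence[int],
    handle: Callable[[int], None],
    checkpoint: dict,
    chunk_size: int = 100,
    max_chunks: Optional[int] = None,
) -> bool:
    """Resume from checkpoint["offset"]; return True once all items are done."""
    offset = checkpoint.get("offset", 0)
    done_chunks = 0
    while offset < len(items):
        if max_chunks is not None and done_chunks >= max_chunks:
            return False  # budget exhausted; progress is already in the checkpoint
        chunk = items[offset:offset + chunk_size]
        for item in chunk:
            handle(item)  # should be idempotent: a crash mid-chunk replays the chunk
        offset += len(chunk)
        checkpoint["offset"] = offset  # save progress after every chunk
        done_chunks += 1
    return True


seen: list[int] = []
checkpoint: dict = {}
items = list(range(10))

finished = process_in_chunks(items, seen.append, checkpoint, chunk_size=4, max_chunks=2)
print(finished, checkpoint)  # False {'offset': 8}
finished = process_in_chunks(items, seen.append, checkpoint, chunk_size=4, max_chunks=2)
print(finished, checkpoint)  # True {'offset': 10}
```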


9) Observability and metrics

Traces: 'trace_id', saga steps, external calls.

Metrics (SLI):
  • Start lag; queue length and age (p95).
  • Success rate, error rate, retry rate.
  • Latency p50/p95, time-to-complete.
  • Cost per 1k tasks, egress/ingress.
  • DLQ rate, poison-pill rate.
SLO (example):
  • P0 start ≤ 60 s, P1 ≤ 5 min; Success ≥ 99.5%; DLQ ≤ 0.1%; Freshness ≤ 30 s p95.
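
As an illustration of how such SLO targets could be evaluated from raw samples, here is a small sketch. It assumes the P0 start target is measured at p95 and uses the nearest-rank percentile definition; `percentile` and `check_slo` are hypothetical helpers, not part of any monitoring library.

```python
from __future__ import annotations


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # ceil(pct / 100 * n) computed with integers to avoid float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_slo(start_lags_s: list[float], succeeded: int,
              dead_lettered: int, total: int) -> dict[str, bool]:
    return {
        "p0_start_p95_le_60s": percentile(start_lags_s, 95) <= 60,
        "success_ge_99_5_pct": succeeded * 1000 >= 995 * total,
        "dlq_le_0_1_pct": dead_lettered * 1000 <= total,
    }


lags = [5.0] * 18 + [40.0, 90.0]
print(percentile(lags, 95))  # 40.0
report = check_slo(lags, succeeded=1995, dead_lettered=1, total=2000)
print(all(report.values()))  # True
```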

10) Audit and provability

Receipts: 'receipt_hash' for start/success/error; DSSE signatures for critical types (payments, price lists, RTP).
WORM: stores execution logs and task manifests.
Chain of custody: who created, approved, or changed the schedule; SoD checks.
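
To make the idea of tamper-evident receipts concrete, the sketch below links each receipt to the previous one by hash. This is a plain hash chain, used here as a simpler stand-in for Merkle slices; it does not implement DSSE signing or WORM storage, and the record layout (`event`, `prev_hash`, `receipt_hash`) is an assumption.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # placeholder hash before the first receipt


def receipt_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_receipt(log: list[dict], event: dict) -> str:
    prev = log[-1]["receipt_hash"] if log else GENESIS
    digest = receipt_hash(prev, event)
    log.append({"event": event, "prev_hash": prev, "receipt_hash": digest})
    return digest


def verify(log: list[dict]) -> bool:
    prev = GENESIS
    for entry in log:
        expected = receipt_hash(prev, entry["event"])
        if entry["prev_hash"] != prev or entry["receipt_hash"] != expected:
            return False
        prev = entry["receipt_hash"]
    return True


log: list[dict] = []
append_receipt(log, {"job_id": "job-1", "outcome": "start"})
append_receipt(log, {"job_id": "job-1", "outcome": "success"})
print(verify(log))  # True

log[0]["event"]["outcome"] = "error"  # simulate tampering with history
print(verify(log))  # False
```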


11) Security and access

RBAC/ABAC/ReBAC: who creates/approves/runs; SoD: "create payment" ≠ "approve."

JIT secrets: the worker requests short-TTL tokens scoped to the task.
Isolation: worker pools per tenant/region/network; sandboxed execution.
PII hygiene: masking/tokenization; no logging of raw data.


12) FinOps and cost

Budgets/cap-alerts on compute/storage/egress.
Autoscale workers by queue depth and SLO.
Storage classes: hot (7-30 days) → OLAP (6-24 months) → archive.
Cost-aware planning: launch windows in "cheap hours," egress limits.


13) Data model (simplified)

`schedule` `{id, tenant, region, tz, cron|calendar, window, enabled, owner, policy_version}`
`job` `{id, schedule_id?, type, payload_hash, idempotency_key, priority, must_start_by, attempts, status, receipt_hash}`
`lease` `{job_id, worker_id, acquired_at, ttl}`
`run_log` `{job_id, started_at, finished_at, outcome, trace_id, metrics{}, receipts[]}`
`dlq_item` `{job_id, reason, attempts, last_error, owner_notified}`
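
The records above map naturally onto typed structures. The following sketch shows `job` and `dlq_item` as Python dataclasses; the field names come from the model, while the field types, the default values, the `"queued"` status string, and the `missed_start` helper are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Job:
    id: str
    type: str
    payload_hash: str
    idempotency_key: str
    priority: str                      # "P0" .. "P3"
    must_start_by: datetime
    schedule_id: Optional[str] = None  # absent for ad-hoc jobs
    attempts: int = 0
    status: str = "queued"             # assumed status value
    receipt_hash: Optional[str] = None


@dataclass
class DlqItem:
    job_id: str
    reason: str
    attempts: int
    last_error: str
    owner_notified: bool = False


def missed_start(job: Job, now: datetime) -> bool:
    """True when a still-queued job has passed its must_start_by deadline."""
    return job.status == "queued" and now > job.must_start_by


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
job = Job(
    id="job-1",
    type="rtp_window_close",
    payload_hash="sha256:demo",
    idempotency_key="rtp_window_close:game-7:2024-01-01",
    priority="P0",
    must_start_by=now + timedelta(seconds=60),
)
print(missed_start(job, now))                          # False
print(missed_start(job, now + timedelta(seconds=61)))  # True
```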

14) API contracts (management/integration)

`POST /schedules` - create a schedule (cron/calendar, tz, windows).
`POST /jobs` - submit an ad-hoc job; returns `job_id`, `receipt_hash`.
`GET /jobs/{id}` - status/log/receipts.
`POST /jobs/{id}/cancel` - cancel with compensation.
`GET /queues/stats` - lengths, lags, p95.
Webhooks: `JobStarted`, `JobSucceeded`, `JobFailed`, `JobDroppedToDLQ`, `SLOViolated`.


15) Playbooks (typical scenarios)

Retry storm: enable global back-off, raise dependency timeouts, enable the circuit breaker, split batches.
DLQ avalanche: pause intake, prioritize DLQ triage, buffer new tasks.
Leader failure: re-election, check for double publications via idempotency keys, audit.
Hung provider (PSP/KYC): fail over to the backup, reduce polling/webhook frequency, move transactions to quarantine.
Leaked worker secrets: revoke keys, rotate, search for anomalous runs over the last 30 days, review access rights.
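
Several of these playbooks rely on a circuit breaker in front of a failing dependency. A minimal sketch of one follows, assuming a consecutive-failure threshold and a fixed cool-down before a trial call; the class and parameter names are hypothetical, and the clock is passed in to keep the example deterministic.

```python
from __future__ import annotations

from typing import Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: calls pass through
        # Open: block until the cool-down has passed, then let a trial call through.
        return now - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the breaker

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial


breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0)
for t in (0.0, 1.0, 2.0):
    breaker.record_failure(now=t)
print(breaker.allow(now=10.0))  # False
print(breaker.allow(now=32.0))  # True
breaker.record_success()
print(breaker.allow(now=33.0))  # True
```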


16) iGaming/fintech specifics

Payments/payouts: asynchronous jobs with receipts, quarantine of "gray" transactions, queue replays with deduplication.
RTP windows/limits: calendar-driven closing, observed vs. theoretical RTP, auto-pause of promos on drift.
Price lists/FX/Tax: scheduled publications, artifact versions, forced cache invalidation.
Affiliates: reconciliation of conversions, webhook dedup, statements/signatures, escrow disputes.


17) Quality metrics (sample set)

Schedule Adherence: share of tasks started within their window ≥ 99%.
Queue Lag p95: P0 ≤ 60 s, P1 ≤ 5 min.
Success/Retry/DLQ Rate: ≥ 99.5% / ≤ 0.4% / ≤ 0.1%.
Idempotency Errors: ≤ 0.01%.
Cost/1k jobs and Egress/job: within budget.
Audit Completeness: 100% of critical tasks have receipts.


18) RACI

| Area | R | A | C | I |
|---|---|---|---|---|
| Scheduler architecture | Platform/SRE | CTO | Data, Security | Product |
| Policies/SoD/Calendar | Compliance/IAM | CCO/CISO | Legal, Ops | All |
| Observability/SLO | SRE | Head of Eng | Data, FinOps | Support |
| Economy/quotas | FinOps | CFO/CTO | SRE, Product | BU Leads |
| Critical playbooks | IR Team | COO | Partners, Legal | Audit |

19) Implementation checklist

  • Identify task classes, priorities, and SLAs; define calendars and timezones.
  • Deploy Scheduler/Dispatcher/Queue/Workers with leader election and sharding.
  • Introduce idempotency, retries, DLQ, compensations (sagas).
  • Configure RBAC/ABAC/ReBAC, SoD, and JIT secrets for workers.
  • Enable traces/metrics/logs, dashboards, and alerts; SLO and error budget.
  • Signed receipts (DSSE) and WORM logs for critical types.
  • Autoscale and cap-alerts (compute/storage/egress).
  • Playbooks: retry storm, DLQ avalanche, leader failure, provider degradation.
  • Tests: GameDay per playbook, delay/error injection.
  • Regular reviews of schedules, queue blockages, and automation ROI.

20) FAQ

Why is cron not enough?
Without queues, idempotency, locks and auditing, cron breaks down on crashes and time zones.

Can time-based and event-driven be combined?
Yes: cron acts as a safety net for catch-up; events provide reactivity.

How do you achieve "exactly once"?
Dedup by key, a transactional log of effects, receipts, and idempotent side effects.

What do you do with "long" jobs?
Chunking, checkpoints, time-boxing, and the ability to interrupt and resume.

How do you avoid "eating" the budget?
Autoscale by queue depth and SLOs, run heavy jobs in cheap hours, and set hard caps on egress/compute.


Summary: The scheduler and background tasks are the platform's production pipeline. By combining schedules and queues, idempotency, locks, and observability with receipts and audit, tenant isolation, and FinOps control, you get predictable deadlines, fast recovery, and legally consistent operations in any region and under any load.
