Automated workflows
1) Why you need it
Automated workflows reduce manual operations, shorten the time from idea to revenue, and lower the risk of mistakes. In iGaming/fintech this is critical for deposits/withdrawals, KYC/AML, bonus/jackpot management, content updates, incident response, and back-office tasks.
Objectives:
- Robust, transparently observable processes from trigger to outcome.
- Minimal manual steps and predictable process SLOs.
- Error control: retries, compensating actions, clear escalations.
- Scaling with events and load, without retry storms or duplicates.
2) Basic terminology
Workflow (WF): a chain of steps (tasks) that achieves a business result.
Orchestration: a central coordinator manages the steps and their order.
Choreography: steps react to events; there is no "central brain."
Compensation: reverse actions after a partial failure (sagas).
HITL (Human-in-the-loop): controlled manual decisions within a WF.
Process SLO: target completion time or success rate of a specific WF (for example, "95% of deposits ≤ 3 seconds").
3) Where to apply (examples)
Payment flow: deposits, anti-fraud, ledger posting, notifications.
KYC/AML: collection of documents, checks by providers, escalation to compliance.
Content/limit management: publishing games, quotas, geo-rules.
Bonuses/jackpots: accruals, deductions, calculation of conditions, payments.
Incidents: automated diagnostics, short checklists, communications.
Data/ETL: report uploads, reconciliation, archiving.
4) Orchestration vs Choreography
Orchestration fits when there is complex branching logic, strict SLOs, explicit deadlines/timeouts, or the business needs a visual "process map."
Choreography fits when event volume is high, coupling is loose, and one event has many independent consumers.
Hybrid: long-lived sagas are controlled by an orchestrator, while local reactions are handled through events.
5) Architectural principles
Idempotency: every step must be safe to repeat (idempotency key, dedup by message ID).
Explicit timeouts and retries: backoff + jitter, attempt limits, retries only for errors that are safe to retry.
Compensations (sagas): chained rollbacks on partial failure.
Isolation of steps: bulkheads (separate pools/limits for each external downstream).
Contracts: OpenAPI/AsyncAPI for all external calls, consumer-driven contract (CDC) tests.
WF versioning: input/output schemas can change without mass failures of old instances.
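The idempotency principle above can be sketched as a wrapper that runs a step at most once per idempotency key. This is a minimal illustration, not a production design: `IdempotentExecutor` and the in-memory dictionary are assumptions, and a real engine would need a durable store with an atomic "insert if absent" operation.

```python
from typing import Callable, Dict, List


class IdempotentExecutor:
    """Runs a step at most once per idempotency key (in-memory, illustration only)."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def run(self, idempotency_key: str, step: Callable[[], object]) -> object:
        # A repeated call with the same key returns the stored result
        # instead of executing the side effect again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = step()
        self._results[idempotency_key] = result
        return result


calls: List[str] = []


def authorize_deposit() -> dict:
    """Hypothetical step with a side effect we must not repeat."""
    calls.append("authorize")
    return {"status": "authorized"}


executor = IdempotentExecutor()
first = executor.run("deposit-123", authorize_deposit)
second = executor.run("deposit-123", authorize_deposit)  # retry of the same request
```

Here the second call returns the stored result, so the side effect in `authorize_deposit` happens only once.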
6) Event and trigger model
Trigger types:
- domain event (`deposit.requested`),
- schedule (cron),
- manual start (operator/support),
- signal from an alert (incident auto-workflow).
Context: correlation via `trace_id` and `workflow_instance_id`, user/region, feature-flag version.
Cheap input filters: early validation and rejection of duplicates.
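A cheap input filter could look like the sketch below: validate the event and drop duplicates before any workflow instance is created. The event fields, the `InputFilter` class, and the reason strings are hypothetical; the currency list mirrors the filters of the deposit template in section 16, and the in-memory set stands in for a shared "seen events" store.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Set

ALLOWED_CURRENCIES = {"USD", "EUR", "TRY"}  # same values as the template filters


@dataclass(frozen=True)
class DepositRequested:
    """Hypothetical payload of the deposit.requested event, with correlation IDs."""
    message_id: str
    trace_id: str
    user_id: str
    amount: Decimal
    currency: str


class InputFilter:
    def __init__(self) -> None:
        self._seen: Set[str] = set()  # assumption: in-memory dedup store

    def reject_reason(self, event: DepositRequested) -> Optional[str]:
        """Return None if the event may start a workflow, otherwise the reason."""
        if event.amount <= 0:
            return "invalid_amount"
        if event.currency not in ALLOWED_CURRENCIES:
            return "unsupported_currency"
        if event.message_id in self._seen:
            return "duplicate"
        self._seen.add(event.message_id)
        return None
```

Validation runs before the dedup check so that invalid events never occupy space in the dedup store.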
7) Step design (tasks)
Each step is described by: input, output, SLO, timeout, attempts, retry conditions, compensation, permissions/secrets.
Pseudo step description:
task: call_psp
input: { user_id, amount, currency, idempotency_key }
timeout: 200ms
retries:
  max: 2
  on: [5xx, connect_error]
  backoff: exponential
  jitter: true
compensation: reverse_authorization
secrets: [PSP_TOKEN]
sla: p99 <= 300ms
8) Compensation and sagas
Local transaction + event: "save intent → publish event."
Compensation: cancelling an authorization, reversing a bonus, recalculating a balance, closing a ticket.
Compensation idempotency: a repeated cancellation must not break invariants.
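The compensation logic can be sketched as follows: steps run in order, and when one fails, the compensations of the steps already completed run in reverse order. `run_saga` and the step names are illustrative assumptions; a real engine would also persist progress and retry failed compensations.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for comp_name, _, compensate in reversed(done):
                compensate()
                log.append(f"compensated:{comp_name}")
            return log
        done.append(step)
        log.append(f"done:{name}")
    return log


def ok() -> None:
    pass


def psp_down() -> None:
    raise RuntimeError("PSP unavailable")


saga_log = run_saga([
    ("reserve_funds", ok, ok),
    ("call_psp", psp_down, ok),
    ("post_ledger", ok, ok),
])
```

In this run `call_psp` fails, so only `reserve_funds` is compensated and `post_ledger` never starts.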
9) Security and secrets
KMS/Secrets Manager: token storage, rotation, role access.
Least privilege: the WF engine is granted only the scopes it needs.
Webhook/callback signatures: HMAC/JWS, timestamp check.
Data policies: PII masking in logs/traces, encryption.
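A webhook signature check with a timestamp window might look like this. It uses only the Python standard library; the message layout (`timestamp.body`) and the 300-second tolerance are assumptions for illustration, since each provider defines its own scheme.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # assumed tolerance against replayed callbacks


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over '<timestamp>.<body>' (layout is an assumption)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Reject stale timestamps first, then compare signatures in constant time."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

`hmac.compare_digest` is used instead of `==` to avoid leaking information through comparison timing.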
10) Observability and SLO
Process metrics: `workflow_started/completed`, `success_rate`, `aborted`, mean/p95/p99 duration, hung instances, dead letters.
Step metrics: `task_latency`, `error_rate`, `retry_count`, `open_circuit`, `cost_per_1k_calls`.
Traces: a span for each step, with tags `workflow.name`, `step`, `attempt`.
SLO: for example, "95% of deposits ≤ 3 seconds, 99% ≤ 5 seconds; aborts ≤ 0.3%/day."
Dashboards: step heat map, bottlenecks, dependency maps.
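The example SLO can be checked directly as "share of instances within a limit", which avoids percentile interpolation. This is a sketch under stated assumptions: the function names are hypothetical and the durations are synthetic, not measurements.

```python
from typing import Dict, Iterable


def share_within(durations_s: Iterable[float], limit_s: float) -> float:
    """Fraction of workflow instances that finished within limit_s."""
    values = list(durations_s)
    if not values:
        return 1.0  # no traffic: treat the objective as met
    return sum(1 for d in values if d <= limit_s) / len(values)


def deposit_slo_report(durations_s: Iterable[float], aborted: int, total: int) -> Dict[str, bool]:
    """Evaluate '95% <= 3 s, 99% <= 5 s, aborts <= 0.3%' for one day of data."""
    values = list(durations_s)
    abort_rate = aborted / total if total else 0.0
    return {
        "p95_ok": share_within(values, 3.0) >= 0.95,
        "p99_ok": share_within(values, 5.0) >= 0.99,
        "abort_ok": abort_rate <= 0.003,
    }


healthy = deposit_slo_report([1.0] * 96 + [4.0] * 4, aborted=1, total=1000)
degraded = deposit_slo_report([1.0] * 90 + [4.0] * 10, aborted=5, total=1000)
```

In the synthetic data, `healthy` meets all three objectives, while `degraded` misses the 3-second target and the abort budget.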
11) Human-in-the-loop (HITL)
Criteria: disputed cases (risk/AML), manual confirmation of large payments.
Deadlines: timeout while waiting for a decision, reminders/escalation.
Audit: who decided what and when, the justification, a link to the ticket.
12) Change Management and Releases
Workflow versions: `v1` and `v2` run in parallel; running instances are not migrated. Old instances finish naturally on `v1`, and new traffic goes to `v2`.
Canary traffic: 1% → 10% → 100%, comparing the success/p95/abort metrics.
Feature flags: a quick rollback to the previous implementation of a step/branch.
CDC/contracts: a CI gate that keeps step changes from breaking consumers/providers.
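The canary comparison of success/p95/abort can be reduced to a small decision function. All tolerances below (`max_success_drop`, `max_p95_ratio`, `max_abort_increase`) are hypothetical defaults chosen for illustration, and the sketch ignores statistical significance, which matters at 1% traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WfMetrics:
    success_rate: float  # 0..1
    p95_ms: float
    abort_rate: float    # 0..1


def canary_verdict(baseline: WfMetrics, canary: WfMetrics,
                   max_success_drop: float = 0.005,
                   max_p95_ratio: float = 1.10,
                   max_abort_increase: float = 0.001) -> str:
    """Return 'promote' or 'rollback' by comparing v2 (canary) with v1 (baseline)."""
    if canary.success_rate < baseline.success_rate - max_success_drop:
        return "rollback"
    if canary.p95_ms > baseline.p95_ms * max_p95_ratio:
        return "rollback"
    if canary.abort_rate > baseline.abort_rate + max_abort_increase:
        return "rollback"
    return "promote"


v1 = WfMetrics(success_rate=0.990, p95_ms=2000, abort_rate=0.002)
good_v2 = WfMetrics(success_rate=0.991, p95_ms=2100, abort_rate=0.002)
slow_v2 = WfMetrics(success_rate=0.990, p95_ms=2600, abort_rate=0.002)
```

With these numbers `good_v2` would be promoted to the next traffic stage, while `slow_v2` would be rolled back because its p95 exceeds the baseline by more than 10%.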
13) Testing
Unit tests for steps: positive/negative cases + idempotency.
Contract tests: against a mock or staging provider.
WF simulations: happy path + timeouts, 4xx/5xx, "slow provider," lost events, partial failures.
Game-days: fault injection (PSP/KYC outage, queue lag, tripped breaker).
Replay: re-run historical events to validate migrations.
14) Incidents and auto-reactions
Incident auto-workflow: collecting metrics, checking downstreams, sending notifications, preparing a workaround (switching providers, degrading gracefully).
Runbook steps: how to unblock hung instances, and when manual abort/force-complete is allowed.
15) Cost management
Quotas and "soft-cap": limits on expensive steps/providers.
Cache/dedup: do not make repeated external calls unnecessarily.
Reports: `cost_per_1k_workflows`, "cost of success" by WF type.
16) Mini-template workflow (pseudo-YAML)
workflow: deposit_v1
trigger:
  event: deposit.requested
  filters: [amount > 0, currency in [USD,EUR,TRY]]
sla:
  p95_ms: 3000
  abort_rate_daily: 0.3%
steps:
  - name: reserve_funds
    timeout_ms: 150
    retries: {max: 2, on: [5xx, connect_error], backoff: exponential, jitter: true}
    compensation: release_reserve
  - name: call_psp
    timeout_ms: 200
    retries: {max: 2, on: [5xx, connect_error]}
    circuit_breaker: {error_rate: 0.05, window_s: 10, open_s: 30}
  - name: post_ledger
    type: async
    topic: ledger.post
  - name: notify_user
    channel: push
hitl:
  when: amount > 10000 or risk_score > 0.8
  timeout_m: 30
  escalate_to: "compliance@oncall"
observability:
  emit_metrics: true
  trace: true
security:
  secrets: [PSP_TOKEN, PUSH_API_KEY]
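The `circuit_breaker` settings of `call_psp` (`error_rate`, `window_s`, `open_s`) could be interpreted as in the sketch below. This is one possible reading, not the semantics of any particular engine: `min_calls` is an added assumption to avoid tripping on tiny samples, there is no half-open probing state, and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class CircuitBreaker:
    """Opens when the error share in a sliding window exceeds error_rate."""

    def __init__(self, error_rate: float = 0.05, window_s: float = 10.0,
                 open_s: float = 30.0, min_calls: int = 20) -> None:
        self.error_rate = error_rate
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumption: not part of the template
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, failed)
        self._opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """True if a call may go out at time `now`."""
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.open_s:
            # Cool-down elapsed: close again and start with a fresh window.
            self._opened_at = None
            self._events.clear()
            return True
        return False

    def record(self, now: float, failed: bool) -> None:
        """Record a call outcome and open the breaker if the window is too noisy."""
        self._events.append((now, failed))
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()
        total = len(self._events)
        failures = sum(1 for _, f in self._events if f)
        if total >= self.min_calls and failures / total > self.error_rate:
            self._opened_at = now
```

A caller would check `allow(now)` before each PSP call and report the outcome with `record(now, failed)`; while the breaker is open, the workflow can fail fast or switch to a fallback provider.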
17) Retry and timeout policies (recommendations)
Step timeout = 70-80% of the step's latency budget.
Retries ≤ 2-3, only for idempotent operations and network failures.
Jitter is mandatory; do not retry on timeouts from a bottleneck dependency unless there is a fallback.
Compensations are separate steps and must also be idempotent.
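These recommendations can be sketched as a retry helper with exponential backoff and full jitter. The names (`RetryableError`, `call_with_retries`, `backoff_delay`) and the base/cap delays are assumptions; only errors explicitly marked retryable are retried, and any other exception propagates at once.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Failure that is safe to retry (for example 5xx or a connection error)."""


def step_timeout_ms(latency_budget_ms: float, share: float = 0.8) -> float:
    """Timeout as a share of the step's latency budget (70-80% recommended)."""
    return latency_budget_ms * share


def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 1.0) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def call_with_retries(op: Callable[[], T], max_retries: int = 2,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op; retry up to max_retries times on RetryableError only."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except RetryableError:
            if attempt == max_retries:
                raise  # attempts exhausted: let the workflow compensate
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so tests can run without real delays; the operation passed as `op` is expected to be idempotent, as required above.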
18) Dashboards (minimum)
WF Overview: launches/successes/aborts, p95/p99 duration, hung instances/dead letters.
Step Drilldown: top slow/failing steps, retries, open breakers.
Provider Panel: outgoing p95/error rate/quotas/cost.
HITL Board: "pending decision" queue, timeline, compliance SLAs.
19) Implementation checklist
- Map of key WFs and their owners (on-call, chat, repo).
- Description of steps: in/out, SLO, timeouts, retries, compensations, secrets.
- OpenAPI/AsyncAPI + CDC contracts.
- Idempotency/dedup at the input and in each step.
- Dashboards, traces, alerts (process and step SLOs).
- Canary + feature flags for WF releases.
- Runbook: how to handle hung/partially executed WFs.
- Degradation plan: alternative providers, switching off "heavy" branches.
- Secret/access/audit policies.
- Game-days/chaos scenarios once per sprint.
20) Examples of alerts (ideas)
ALERT WorkflowSLOBreached
IF workflow_p95_duration_ms{name="deposit_v1"} > 3000 FOR 15m
LABELS {severity="critical", team="payments"}
ALERT WorkflowAbortRateHigh
IF rate(workflow_aborted_total{name="deposit_v1"}[30m]) > 0.005
LABELS {severity="warning", team="payments"}
ALERT StepRetryStorm
IF step_retry_count{name="call_psp"} > 2 * baseline_1w FOR 10m
LABELS {severity="warning", team="integrations"}
ALERT StuckInstances
IF workflow_in_progress_age_p95_m{name="kyc_v2"} > 60
LABELS {severity="warning", team="risk"}
21) Anti-patterns
A large monolithic WF with 100+ steps and tight coupling: its failures are noisy and hard to diagnose.
Retries of non-idempotent operations (double debits/credits).
Timeouts that outlive the user's request → hung instances and "zombies."
No compensations → manual fixes and long post-mortems.
No WF versioning → releases break old instances.
Secrets inside configs/variables without rotation and audit.
22) Workflow quality KPIs
Success rate and Abort rate by WF type.
p95/p99 duration of steps and process.
MTTD/MTTR on process incidents.
Retry storm count/month (target → 0).
Cost per 1k WF and "cost of success."
Automation share: % of cases completed without HITL.
23) Quick start (defaults)
Start with 3-5 critical WFs (deposit, withdrawal, KYC).
Orchestrate long-lived sagas; handle local reactions through events.
Step timeout ≤ 80% of the budget; retries ≤ 2 with backoff + jitter.
Compensations are documented and tested.
Turn on a canary for 5-10% of traffic with a comparison dashboard.
Each WF has an owner, a runbook, and SLO alerts.
24) FAQ
Q: What to choose: orchestrator or events?
A: If you need a visual map, deadlines, and long sagas, choose an orchestrator. If simple reactions to events and many consumers prevail, choose choreography. Often the best option is a hybrid.
Q: How do you avoid duplicates?
A: Use an idempotency key at the WF input, dedup by `message_id`, and keep a store of "seen events." Steps themselves must be idempotent.
Q: Is a human-in-the-loop needed?
A: Yes, for controversial/expensive cases. But measure and reduce HITL share through better automation and rules.