Saga pattern and distributed transactions

1) Why sagas are needed

Classic 2PC (two-phase commit) scales poorly, is complex to operate under failures, and blocks resources while it runs. A saga breaks the overall business process into a sequence of local transactions (steps), each of which commits independently. If a step fails, the remaining steps are not executed, and the steps that have already completed are compensated by reverse operations.
The result: managed eventual consistency without global blocking, high survivability and a clear recovery protocol.

2) Basic models

2.1 Orchestration

A dedicated saga coordinator manages the steps: it sends commands, waits for responses/events, and initiates compensations.
Pros: centralized control, simple observability, explicit deadlines. Cons: an additional component to build and operate.

2.2 Choreography

No coordinator; services react to each other's events ("OrderPlaced" → "PaymentCaptured" → "InventoryReserved" → ...).
Pros: loose coupling. Cons: harder to trace, risk of a "dance of death" without clear rules.

2.3 TCC (Try-Confirm/Cancel)

A variant that places a hold on resources:

1. Try - prepare/reserve,
2. Confirm - commit,
3. Cancel - roll back.

Guarantees are stronger, but the contracts and reservation timeouts are more complex.
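The three TCC phases can be sketched for a single participant. This is a minimal in-memory illustration, not a production design: the `Account` class, its method names, and the explicit `now`/`ttl` parameters are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """Hypothetical TCC participant: a balance plus outstanding holds."""
    balance: int
    holds: dict = field(default_factory=dict)     # hold_id -> (amount, expires_at)
    confirmed: set = field(default_factory=set)   # hold_ids already confirmed

    def available(self, now: float) -> int:
        # Expired holds no longer block funds.
        active = sum(amount for amount, exp in self.holds.values() if exp > now)
        return self.balance - active

    def try_hold(self, hold_id: str, amount: int, ttl: float, now: float) -> bool:
        """Try: reserve funds without moving them. A repeated call is a no-op."""
        if hold_id in self.holds or hold_id in self.confirmed:
            return True
        if self.available(now) < amount:
            return False
        self.holds[hold_id] = (amount, now + ttl)
        return True

    def confirm(self, hold_id: str, now: float) -> bool:
        """Confirm: turn a live hold into a real debit. Idempotent."""
        if hold_id in self.confirmed:
            return True
        hold = self.holds.get(hold_id)
        if hold is None or hold[1] <= now:
            return False  # unknown or expired hold; the caller should cancel
        del self.holds[hold_id]
        self.balance -= hold[0]
        self.confirmed.add(hold_id)
        return True

    def cancel(self, hold_id: str) -> None:
        """Cancel: release the hold. Safe to call repeatedly."""
        self.holds.pop(hold_id, None)
```

The reservation TTL is what makes the contract harder than a plain step: Confirm must cope with a hold that has already expired.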

3) Step and compensation contracts

Each step = local transaction + compensation (idempotent, allows repetition).
Compensation is not required to fully "return the world" - domain equivalence is enough (for example, "return payment" instead of "delete payment").
Define invariants: for money - the balance never goes negative; for orders - no order is left stuck in an intermediate status.
Introduce deadlines/TTLs for reservations and a "garbage collector" for expired attempts.
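A step contract can be expressed as an action paired with its compensation. The sketch below is one possible shape, assuming hypothetical names (`Step`, `run_saga`) and a shared `ctx` dictionary; it compensates completed steps in reverse order when a later step fails.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]       # local transaction
    compensate: Callable[[dict], None]   # idempotent reverse operation


def run_saga(steps: list[Step], ctx: dict) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: list[Step] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as err:
            ctx["failure"] = f"{step.name}: {err}"
            for finished in reversed(done):
                finished.compensate(ctx)
            return "COMPENSATED"
        done.append(step)
    return "COMPLETED"
```

The failed step itself is not compensated here, on the assumption that its local transaction rolled back; a step that can fail after a partial effect needs its own cleanup.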

4) Consistency and delivery semantics

Message delivery: at-least-once (default) → all operations must be idempotent.
Ordering: matters per correlation key (e.g. `order_id`, `player_id`).
Exactly-once is not the goal of a saga; we achieve effectively-once processing through idempotency keys, outbox/inbox and correct offset commits.

5) The state of the saga and its log

What to store:
  • `saga_id`, `correlation_id`, current status (Running/Completed/Compensating/Compensated/Failed),
  • current step and its variables (payment/reservation IDs),
  • history (log) of events/decisions, timestamps, deadlines, the number of retries.
Where to store:
  • A separate Saga Store (table/document) available to the coordinator.
  • For choreography - local "agents" of the saga, publishing status events in a common topic.
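The stored state can be modelled as a small record with guarded status transitions. The statuses come from the list above; the `ALLOWED` transition table, the `SagaRecord` class and its field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class SagaStatus(Enum):
    RUNNING = "Running"
    COMPLETED = "Completed"
    COMPENSATING = "Compensating"
    COMPENSATED = "Compensated"
    FAILED = "Failed"


# Assumed transition rules; terminal statuses allow no further moves.
ALLOWED = {
    SagaStatus.RUNNING: {SagaStatus.COMPLETED, SagaStatus.COMPENSATING, SagaStatus.FAILED},
    SagaStatus.COMPENSATING: {SagaStatus.COMPENSATED, SagaStatus.FAILED},
    SagaStatus.COMPLETED: set(),
    SagaStatus.COMPENSATED: set(),
    SagaStatus.FAILED: set(),
}


@dataclass
class SagaRecord:
    saga_id: str
    correlation_id: str
    status: SagaStatus = SagaStatus.RUNNING
    step: str = ""
    context: dict = field(default_factory=dict)  # payment/reservation IDs
    retries: int = 0
    log: list = field(default_factory=list)      # (time, from, to, reason)

    def transition(self, new_status: SagaStatus, reason: str, at: float) -> None:
        """Change status and append a log entry, rejecting illegal moves."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} not allowed")
        self.log.append((at, self.status.value, new_status.value, reason))
        self.status = new_status
```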

6) Reliable publishing patterns: outbox/inbox

Outbox: the step commits the change and writes the event/command to the outbox table in one transaction; a worker then publishes it to the message bus.
Inbox: the consumer maintains a table of processed `message_id` values → dedup + idempotency.
Commit the offset/ACK (Kafka/RabbitMQ) only after the side effect has succeeded - not earlier.
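Both patterns can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table layout is simplified, the function names (`place_order`, `relay_outbox`, `handle_once`) are hypothetical, and `publish` stands in for a real broker client.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders(order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox(id INTEGER PRIMARY KEY AUTOINCREMENT,
                            aggregate_id TEXT, event_type TEXT,
                            payload TEXT, sent_at TEXT);
        CREATE TABLE inbox(message_id TEXT PRIMARY KEY);
    """)


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    """Outbox: state change and event are written in one local transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        conn.execute(
            "INSERT INTO outbox(aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "orders.placed.v1", json.dumps({"order_id": order_id})),
        )


def relay_outbox(conn: sqlite3.Connection, publish) -> int:
    """Worker: publish unsent rows, marking each as sent only after publish."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE sent_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # if this raises, the row stays unsent
        with conn:
            conn.execute(
                "UPDATE outbox SET sent_at = datetime('now') WHERE id = ?", (row_id,)
            )
    return len(rows)


def handle_once(conn: sqlite3.Connection, message_id: str, side_effect) -> bool:
    """Inbox: record the message id and apply the effect in one transaction."""
    try:
        with conn:
            conn.execute("INSERT INTO inbox(message_id) VALUES (?)", (message_id,))
            side_effect()
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: already processed
    return True
```

A crash between `publish` and the `UPDATE` causes the event to be sent again, which is why the relay gives at-least-once delivery and the consumer needs the inbox.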

7) Designing saga steps

7.1 Example (iGaming/e-commerce purchase)

1. PlaceOrder → status `PENDING`.
2. AuthorizePayment (Try) → `payment_hold_id`.
3. ReserveInventory → `reservation_id`.
4. CapturePayment (Confirm).
5. FinalizeOrder → `COMPLETED`.

Compensations:
  • if (3) fails → `CancelPaymentHold`;
  • if (4) fails after (3) → `ReleaseInventory`;
  • if (5) fails → `RefundPayment` and `ReleaseInventory`.

7.2 Deadlines/Retries

At most N retries with exponential delay + jitter.
Once the limit is exceeded, move to `Compensating`.
Store `next_attempt_at` and `deadline_at` for each step.
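The retry schedule can be computed in a few lines. The sketch below uses the "full jitter" variant (a uniform delay between zero and the exponential ceiling); the function names, the default `base`, `cap` and `max_retries` values, and the injectable `rng` are assumptions for illustration.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def next_attempt(attempt: int, now: float, deadline_at: float,
                 max_retries: int = 5, rng=random):
    """Return the value for next_attempt_at, or None to move to Compensating."""
    if attempt >= max_retries:
        return None
    at = now + backoff_delay(attempt, rng=rng)
    return at if at <= deadline_at else None
```

Returning `None` both when retries are exhausted and when the next attempt would miss `deadline_at` keeps the caller's decision simple: no next attempt means compensate.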

8) Orchestrator vs platform

Options:
  • Lightweight home-grown orchestrator (microservice + Saga table).
  • Platforms: Temporal/Cadence, Camunda, Netflix Conductor, Zeebe - provide timers, retries, long-lived workflows, visibility and a web console.
  • For choreography, use an event catalog and a strict status/key convention.

9) Integration protocols

9.1 Asynchronous (Kafka/RabbitMQ)

Commands: `payments.authorize.v1`, `inventory.reserve.v1`.
Events: `payments.authorized.v1`, `inventory.reserved.v1`, `payments.captured.v1`, `payments.refunded.v1`.
Partition key = `order_id`/`player_id` to preserve ordering.
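Why a partition key preserves ordering can be shown with a toy partitioner: every message with the same key maps to the same partition, and a partition is consumed in order. The `partition_for` function and the CRC32 hash are illustrative assumptions; real clients use their own hash (the default Kafka Java partitioner, for instance, uses murmur2).

```python
import zlib


def partition_for(key: str, partitions: int) -> int:
    """Map a correlation key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions


# All events of one order land in the same partition, so their relative
# order is preserved; different orders may be spread across partitions.
events = [("order-42", "payments.authorized.v1"), ("order-42", "inventory.reserved.v1")]
assigned = {partition_for(key, 12) for key, _ in events}
```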

9.2 Synchronous (HTTP/gRPC) within a step

Acceptable for "short" steps, but always with timeouts/retries/idempotency and a fallback to asynchronous compensation.

10) Idempotence and keys

In command and compensation requests, pass an `idempotency_key`.

Side effects (database writes/debits) are performed conditionally: "perform only if this `idempotency_key` has not been seen yet."

Compensation is also idempotent: repeating `RefundPayment(id=X)` is safe.
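A handler that honours the key might look like the following. It is an in-memory sketch: `PaymentService`, its fields and the response shape are hypothetical, and a real service would store the key and apply the effect in one database transaction.

```python
class PaymentService:
    """Hypothetical service that deduplicates refunds by idempotency key."""

    def __init__(self) -> None:
        self.results = {}    # idempotency_key -> stored response
        self.refunded = {}   # payment_id -> total refunded amount

    def refund_payment(self, idempotency_key: str, payment_id: str, amount: int) -> dict:
        # Replay: return the stored response without repeating the side effect.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.refunded[payment_id] = self.refunded.get(payment_id, 0) + amount
        result = {"payment_id": payment_id, "status": "REFUNDED", "amount": amount}
        self.results[idempotency_key] = result
        return result
```

Returning the stored response on a replay, rather than an error, lets the caller treat a retried request exactly like the first one.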

11) Error handling

Classes:
  • Transient (network/timeouts) → retry/backoff.
  • Business (insufficient funds, limits) → immediate compensation/alternative path.
  • Irrecoverable → manual intervention, manual compensation.
Build a decision matrix: error type → action (retry/compensate/escalate).
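Such a matrix fits in a dictionary. In the sketch below the exception classes, the `classify` rules and the `max_retries` default are assumptions; the only behaviour taken from the text is the mapping itself and the switch to compensation once retries run out.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    BUSINESS = "business"
    IRRECOVERABLE = "irrecoverable"


class Action(Enum):
    RETRY = "retry"
    COMPENSATE = "compensate"
    ESCALATE = "escalate"


class TransientError(Exception):
    """Hypothetical marker for network problems and timeouts."""


class BusinessError(Exception):
    """Hypothetical marker for rule violations such as insufficient funds."""


DECISIONS = {
    ErrorClass.TRANSIENT: Action.RETRY,
    ErrorClass.BUSINESS: Action.COMPENSATE,
    ErrorClass.IRRECOVERABLE: Action.ESCALATE,
}


def classify(err: Exception) -> ErrorClass:
    if isinstance(err, (TransientError, TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    if isinstance(err, BusinessError):
        return ErrorClass.BUSINESS
    return ErrorClass.IRRECOVERABLE  # unknown errors go to a human


def decide(err: Exception, attempt: int, max_retries: int = 5) -> Action:
    action = DECISIONS[classify(err)]
    if action is Action.RETRY and attempt >= max_retries:
        return Action.COMPENSATE  # retries exhausted
    return action
```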

12) Observability and saga SLOs

SLI/SLO:
  • End-to-end latency of the saga (p50/p95/p99).
  • Success rate.
  • Mean time to compensate and compensation rate.
  • Orphaned sagas and time to GC.
  • Tracing: `trace_id`/`saga_id` as a span link between steps; burn-rate metrics for error budgets.

Logs: each saga status change = structured record with cause.

13) Examples (pseudocode)

13.1 Orchestrator (idea)

    def handle(e: OrderPlaced) -> None:
        # Each step is registered together with its compensation.
        saga = Saga.start(e.order_id)
        saga.run(step=authorize_payment, compensate=cancel_payment)
        saga.run(step=reserve_inventory, compensate=release_inventory)
        saga.run(step=capture_payment, compensate=refund_payment)
        saga.run(step=finalize_order, compensate=refund_and_release)
        saga.complete()

    def run(step, compensate):
        try:
            step()  # local transaction + outbox
        except Transient:
            schedule_retry()         # retry later with backoff
        except Business as err:
            start_compensation(err)  # compensate the completed steps

13.2 Outbox (table idea)


outbox(id PK, aggregate_id, event_type, payload, created_at, sent_at NULL)
inbox(message_id PK, processed_at, status)
saga(order_id PK, state, step, next_attempt_at, deadline_at, context JSONB)
saga_log(id PK, order_id, time, event, details)

13.3 Choreography (topic ideas)

`orders.placed` → consumers: `payments.authorize`, `inventory.reserve`

`payments.authorized` + `inventory.reserved` → `orders.try_finalize`

Any failure → `orders.compensate` → triggers `payments.cancel/refund`, `inventory.release`.
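The topic flow above can be simulated with an in-memory bus. This is only a sketch: the `Bus` class, the `wire` function, the `inventory.rejected` topic and the `stock` dictionary are assumptions, and the payment service always succeeds to keep the example short.

```python
from collections import defaultdict, deque


class Bus:
    """Toy in-memory broker: FIFO queue plus topic subscriptions."""

    def __init__(self) -> None:
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.history = []  # topics in delivery order

    def subscribe(self, topic, handler) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic, msg) -> None:
        self.queue.append((topic, msg))

    def drain(self) -> None:
        while self.queue:
            topic, msg = self.queue.popleft()
            self.history.append(topic)
            for handler in self.handlers[topic]:
                handler(msg)


def wire(bus: Bus, stock: dict) -> None:
    seen = defaultdict(set)  # order_id -> facts observed by the orders service

    def reserve(msg):
        ok = stock.get(msg["sku"], 0) > 0
        bus.publish("inventory.reserved" if ok else "inventory.rejected", msg)

    def join(fact):
        def handler(msg):
            seen[msg["order_id"]].add(fact)
            if seen[msg["order_id"]] == {"paid", "reserved"}:
                bus.publish("orders.try_finalize", msg)
        return handler

    # Payments and inventory react to the order independently.
    bus.subscribe("orders.placed", lambda m: bus.publish("payments.authorized", m))
    bus.subscribe("orders.placed", reserve)
    # Orders service finalizes only when both facts have arrived.
    bus.subscribe("payments.authorized", join("paid"))
    bus.subscribe("inventory.reserved", join("reserved"))
    # Any failure fans out into compensations.
    bus.subscribe("inventory.rejected", lambda m: bus.publish("orders.compensate", m))
    bus.subscribe("orders.compensate", lambda m: bus.publish("payments.cancel", m))
    bus.subscribe("orders.compensate", lambda m: bus.publish("inventory.release", m))
```

Note that no component holds the whole flow: the only way to see it end to end is the bus history, which is the tracing difficulty mentioned for choreography.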

14) Comparison with 2PC and ES

2PC: strong consistency, but blocking, bottlenecks and painful failure handling.
Saga: eventual consistency; requires discipline around compensation and telemetry.
Event Sourcing: stores events as a source of truth; sagas on it are natural, but add complexity to migrations/snapshots.

15) Safety and compliance

Transport security (TLS/mTLS), ACL per topic/queue.
In events - a minimum of PII, encryption of sensitive fields, tokenization.
Audit access to sagas and compensation logs.
SLAs with external providers (payments/delivery) determine the deadline and retry-limit parameters.

16) Implementation checklist (0-45 days)

0-10 days

Select the candidate processes (multi-service, compensable).
Select the model (orchestration/choreography/TCC) and correlation key.
Describe steps/compensations, invariants and deadlines. Create the tables `saga`, `outbox`, `inbox`.

11-25 days

Enable outbox/inbox, idempotency, and retries with backoff.
Deploy the first sagas; add SLI/SLO dashboards and tracing.
Write a runbook of compensations (including manual) and escalations.

26-45 days

Automate GC of "hanging" sagas and periodic restarts/continuations on deadline.
Run a game day: step failure, deadline exceeded, broker unavailability.
Standardize event contracts (versions, compatibility), set up a "saga catalog."

17) Anti-patterns

"Compensation = delete from database" instead of domain-correct reverse action.
No outbox/inbox → loss of events/double effects.
Retries without jitter → self-inflicted DDoS on dependencies.
Expecting strong consistency on reads without exposing a "processing in progress..." state.
One giant orchestrator for all → control monolith.
Total choreography without visibility and SLA → uncontrollable dance.
Ignoring deadlines → eternal reserves/holds.

18) Maturity metrics

≥ 90% of critical processes are covered by sagas/compensations and have the described invariants.
Outbox/inbox are integrated for all Tier-0/1 producers/consumers.
SLO: p95 end-to-end saga latency is within target, success rate is stable, orphaned sagas < target.
Transparent tracing and per-step dashboards, burn-rate alerts.
Quarterly game day and a check of the manual compensation runbook.

19) Conclusion

A saga is a practical consistency contract for distributed systems: clear steps and reverse actions, publishing discipline (outbox/inbox), deadlines and retries, observability, and compensation processes. Choose a model (orchestration/choreography/TCC), fix invariants and keys, make handlers idempotent - and your multi-service business processes will become predictable and stable without expensive 2PC.
