Saga pattern and distributed transactions
1) Why sagas are needed
Classic 2PC (two-phase commit) scales poorly, is complex under failures, and blocks resources. A saga breaks the overall business process into a sequence of local transactions (steps), each of which commits independently. If a step fails, the remaining steps are cancelled, and the steps that have already completed are compensated by reverse operations.
The result: managed eventual consistency without global locks, high resilience, and a clear recovery protocol.
2) Basic models
2.1 Orchestration
A dedicated saga coordinator manages the steps: it sends commands, waits for responses/events, and initiates compensations.
Pros: centralized control, simple observability, explicit deadlines. Cons: an additional component to build and operate.
2.2 Choreography
No coordinator; services react to each other's events ("OrderPlaced" → "PaymentCaptured" → "InventoryReserved" ...).
Pros: loose coupling. Cons: harder to trace, risk of a "dance of death" without clear rules.
2.3 TCC (Try-Confirm/Cancel)
A variant that "freezes" resources:
1. Try - prepare/reserve,
2. Confirm - commit,
3. Cancel - roll back.
Guarantees are stronger, but the contracts and reservation timeouts are more complex.
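The three TCC phases and the reservation timeout can be sketched with an in-memory participant. This is only an illustration under stated assumptions: the class `InventoryTCC`, its method names, and the injectable `clock` are hypothetical and not part of any library.

```python
import time
import uuid


class InventoryTCC:
    """In-memory Try-Confirm/Cancel participant (illustrative only)."""

    def __init__(self, stock, ttl_seconds=30.0, clock=time.monotonic):
        self.available = stock
        self.ttl = ttl_seconds
        self.clock = clock
        self.holds = {}  # reservation_id -> {"quantity", "expires_at", "state"}

    def try_reserve(self, quantity):
        """Try: freeze the resource without consuming it yet."""
        if quantity > self.available:
            raise ValueError("insufficient stock")
        self.available -= quantity
        reservation_id = str(uuid.uuid4())
        self.holds[reservation_id] = {
            "quantity": quantity,
            "expires_at": self.clock() + self.ttl,
            "state": "HELD",
        }
        return reservation_id

    def confirm(self, reservation_id):
        """Confirm: make the hold permanent; repeating it is a no-op."""
        hold = self.holds[reservation_id]
        if hold["state"] == "HELD" and self.clock() > hold["expires_at"]:
            self.cancel(reservation_id)  # the reserve timed out before Confirm
        if hold["state"] == "HELD":
            hold["state"] = "CONFIRMED"
        return hold["state"] == "CONFIRMED"

    def cancel(self, reservation_id):
        """Cancel: return the frozen quantity; repeating it is a no-op."""
        hold = self.holds[reservation_id]
        if hold["state"] == "HELD":
            hold["state"] = "CANCELLED"
            self.available += hold["quantity"]
        return hold["state"] == "CANCELLED"

    def expire_overdue(self):
        """Release holds whose TTL has elapsed."""
        now = self.clock()
        for reservation_id, hold in self.holds.items():
            if hold["state"] == "HELD" and now > hold["expires_at"]:
                self.cancel(reservation_id)
```

Both `confirm` and `cancel` return a boolean instead of raising on repeats, so a coordinator that redelivers a command gets the same answer each time.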
3) Step and compensation contracts
Each step = local transaction + compensation (idempotent, allows repetition).
Compensation does not have to fully "restore the world"; domain equivalence is enough (for example, "refund the payment" rather than "delete the payment").
Define invariants: for money - the balance never goes negative; for orders - no "stuck" statuses.
Introduce deadlines/TTLs for reserves and a "garbage collector" for expired attempts.
4) Consistency and delivery semantics
Message delivery: at-least-once (default) → all operations must be idempotent.
Ordering: matters per correlation key (e.g. `order_id`, `player_id`).
Exactly-once is not the goal of a saga; we achieve effectively-once processing through idempotency keys, outbox/inbox, and correct offset commits.
5) The state of the saga and its log
What to store:
- `saga_id`, `correlation_id`, current status (Running/Completed/Compensating/Compensated/Failed),
- the current step and its variables (payment/reserve IDs),
- history (log) of events/decisions, timestamps, deadlines, the number of retries.
Where to store:
- A separate Saga Store (table/document) available to the coordinator.
- For choreography - local saga "agents" that publish status events to a common topic.
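A saga record with these fields might look like the sketch below. The set of allowed transitions is an assumption inferred from the status list above, and `SagaRecord` with its field names is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed status transitions (an assumption, not a fixed standard).
TRANSITIONS = {
    "Running": {"Completed", "Compensating", "Failed"},
    "Compensating": {"Compensated", "Failed"},
    "Completed": set(),
    "Compensated": set(),
    "Failed": set(),
}


@dataclass
class SagaRecord:
    saga_id: str
    correlation_id: str
    status: str = "Running"
    step: str = ""
    context: dict = field(default_factory=dict)  # payment/reserve IDs, etc.
    retries: int = 0
    log: list = field(default_factory=list)  # (timestamp, change, reason)

    def transition(self, new_status, reason):
        """Change status and append a log entry; reject illegal moves."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        timestamp = datetime.now(timezone.utc).isoformat()
        self.log.append((timestamp, f"{self.status}->{new_status}", reason))
        self.status = new_status
```

Rejecting illegal transitions at this level keeps a late or duplicated message from moving a finished saga back to `Running`.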
6) Reliable publishing patterns: outbox/inbox
Outbox: the step commits the change and writes the event/command to the outbox table in one transaction; a worker then publishes it to the message bus.
Inbox: the consumer maintains a table of processed `message_id` values → dedup + idempotency.
Commit the offset/ACK (Kafka/RabbitMQ) only after the side effect has succeeded, not earlier.
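The outbox and inbox mechanics can be tried out with the standard-library `sqlite3` module. This is a minimal sketch: the table layout and the functions `place_order`, `relay`, and `consume` are assumptions made for illustration, and the `publish` callback stands in for a real broker client.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL,
    sent_at TEXT
);
CREATE TABLE inbox (message_id TEXT PRIMARY KEY);
""")


def place_order(order_id):
    """Step: the state change and the outbox row commit in one transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("orders.placed.v1", json.dumps({"order_id": order_id})),
        )


def relay(publish):
    """Worker: publish unsent rows, then mark them as sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE sent_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # a crash here means a re-send: at-least-once
        with conn:
            conn.execute(
                "UPDATE outbox SET sent_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
    return len(rows)


def consume(message_id, handler):
    """Consumer: record the message_id first; a duplicate is skipped."""
    try:
        with conn:
            conn.execute("INSERT INTO inbox VALUES (?)", (message_id,))
            handler()  # if this raises, the inbox row is rolled back too
    except sqlite3.IntegrityError:
        return False  # duplicate delivery, already processed
    return True
```

The inbox row and the side effect are atomic only when the handler writes through the same connection; a handler that calls an external system still needs its own idempotency key.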
7) Designing saga steps
7.1 Example (iGaming/e-commerce purchase)
1. PlaceOrder → status `PENDING`.
2. AuthorizePayment (Try) → `payment_hold_id`.
3. ReserveInventory → `reservation_id`.
4. CapturePayment (Confirm).
5. FinalizeOrder → `COMPLETED`.
Compensations:
- if (3) fails → `CancelPaymentHold`;
- if (4) fails after (3) → `ReleaseInventory`;
- if (5) fails → `RefundPayment` and `ReleaseInventory`.
7.2 Deadlines/Retries
At most N retries with exponential backoff + jitter.
Once the limit is exceeded, move to `Compensating`.
Store `next_attempt_at` and `deadline_at` for each step.
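The retry rule above can be sketched as follows. The "full jitter" variant (a uniform delay between zero and the exponential bound) is one common choice, not the only one; the function names and the default values are assumptions.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def plan_next_attempt(attempt, now, deadline_at, max_retries=5):
    """Return next_attempt_at, or None when the step should go to Compensating."""
    if attempt >= max_retries:
        return None  # retry budget exhausted
    next_attempt_at = now + backoff_delay(attempt)
    if next_attempt_at >= deadline_at:
        return None  # the retry would land past the step deadline
    return next_attempt_at
```

A coordinator would persist the returned `next_attempt_at` with the step and treat `None` as the signal to start compensation.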
8) Orchestrator vs platform
Options:
- Lightweight in-house orchestrator (microservice + saga table).
- Platforms: Temporal/Cadence, Camunda, Netflix Conductor, Zeebe - they provide timers, retries, long-lived workflows, visibility, and a web console.
- For choreography, use an event catalog and a strict status/key convention.
9) Integration protocols
9.1 Asynchronous (Kafka/RabbitMQ)
Commands: `payments.authorize.v1`, `inventory.reserve.v1`.
Events: `payments.authorized.v1`, `inventory.reserved.v1`, `payments.captured.v1`, `payments.refunded.v1`.
Partition key = `order_id`/`player_id` to preserve ordering.
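Why a partition key preserves ordering: every message with the same key maps to the same partition, and a partition is consumed in order. The sketch below uses CRC32 purely as a stand-in hash; real broker clients use their own hash functions, and `partition_for` is a hypothetical helper.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a correlation key to a stable partition number."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

All events for one `order_id` therefore share a partition, while different orders spread across partitions and can be processed in parallel.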
9.2 Synchronous (HTTP/gRPC) within a step
Acceptable for "short" steps, but always with timeouts/retries/idempotency and a fallback to asynchronous compensation.
10) Idempotence and keys
In command and compensation requests, pass an `idempotency_key`.
Side effects (database writes/debits) are performed conditionally: "execute only if this `idempotency_key` has not been seen yet."
Compensation is also idempotent: repeating `RefundPayment(id=X)` is safe.
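A minimal sketch of the "execute only if not yet seen" rule, using an in-memory dictionary in place of a database. `PaymentService` and its fields are hypothetical; in a real service the key check and the side effect would have to be atomic, for example through a unique constraint in the same transaction.

```python
class PaymentService:
    """In-memory stand-in for a payment service with idempotent refunds."""

    def __init__(self):
        self.results = {}  # idempotency_key -> stored result
        self.balance = {}  # account -> amount

    def refund(self, idempotency_key, account, amount):
        if idempotency_key in self.results:
            # Replay: return the stored result without a second side effect.
            return self.results[idempotency_key]
        self.balance[account] = self.balance.get(account, 0) + amount
        result = {"status": "REFUNDED", "account": account, "amount": amount}
        self.results[idempotency_key] = result
        return result
```

Returning the stored result, rather than an error, lets the caller retry blindly after a timeout and still get a definitive answer.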
11) Error handling
Classes:
- Transient (network/timeouts) → retry with backoff.
- Business (insufficient funds, limits) → immediate compensation/alternative path.
- Irrecoverable → manual intervention, manual compensation.
Build a decision matrix: error type → action (retry/compensate/escalate).
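The decision matrix can be expressed as a small lookup table. The mapping from Python exception types to error classes in `classify` is an assumption made for the example; a real service would classify its own error codes.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    BUSINESS = "business"
    IRRECOVERABLE = "irrecoverable"


class Action(Enum):
    RETRY = "retry"
    COMPENSATE = "compensate"
    ESCALATE = "escalate"


# The matrix itself: error class -> action.
DECISIONS = {
    ErrorClass.TRANSIENT: Action.RETRY,
    ErrorClass.BUSINESS: Action.COMPENSATE,
    ErrorClass.IRRECOVERABLE: Action.ESCALATE,
}


def classify(exc):
    """Hypothetical mapping from exceptions to error classes."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, ValueError):  # stand-in for domain errors
        return ErrorClass.BUSINESS
    return ErrorClass.IRRECOVERABLE


def decide(exc, attempt, max_retries=5):
    """Pick the action; exhausted retries turn into compensation."""
    action = DECISIONS[classify(exc)]
    if action is Action.RETRY and attempt >= max_retries:
        return Action.COMPENSATE
    return action
```

Keeping the matrix as data makes it easy to review with the domain owners and to extend per step.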
12) Observability and saga SLOs
SLI/SLO:
- End-to-end latency of the saga (p50/p95/p99).
- Success rate.
- Mean time to compensate and compensation rate.
- Orphaned sagas and time to GC.
Tracing: `trace_id`/`saga_id` as a span link between steps; burn-rate metrics for error budgets.
Logs: every saga status change = a structured record with its cause.
13) Examples (pseudocode)
13.1 Orchestrator (idea)
```python
# Idea only: Saga, the step functions, Transient/Business, schedule_retry
# and start_compensation are placeholders for your own implementation.
def handle(event):  # event: OrderPlaced
    saga = Saga.start(event.order_id)
    saga.run(step=authorize_payment, compensate=cancel_payment)
    saga.run(step=reserve_inventory, compensate=release_inventory)
    saga.run(step=capture_payment, compensate=refund_payment)
    saga.run(step=finalize_order, compensate=refund_and_release)
    saga.complete()


def run(step, compensate):
    try:
        step()  # local transaction + outbox
    except Transient:
        schedule_retry()
    except Business as err:
        start_compensation(err)
```
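The idea above can be fleshed out into a small runnable sketch. It assumes that each step is paired with its own undo action and that a failure replays the undo actions of completed steps in reverse order. The `Saga` class here is illustrative; it retries in a tight loop where a real coordinator would persist state and wait with backoff.

```python
class Transient(Exception):
    """Network errors, timeouts: worth retrying."""


class Business(Exception):
    """Domain rejections (insufficient funds, limits): compensate."""


class Saga:
    def __init__(self, order_id, max_retries=3):
        self.order_id = order_id
        self.max_retries = max_retries
        self.status = "Running"
        self.completed = []  # undo actions of completed steps, in order
        self.log = []

    def run(self, step, compensate):
        if self.status != "Running":
            return  # an earlier step failed; skip the rest
        for _attempt in range(self.max_retries + 1):
            try:
                step(self.order_id)
            except Transient:
                continue  # a real coordinator would back off here
            except Business as err:
                self._compensate(str(err))
                return
            self.completed.append(compensate)
            self.log.append(step.__name__)
            return
        self._compensate("retries exhausted")

    def _compensate(self, reason):
        self.status = "Compensating"
        self.log.append(f"compensating: {reason}")
        for undo in reversed(self.completed):
            undo(self.order_id)
            self.log.append(undo.__name__)
        self.status = "Compensated"

    def complete(self):
        if self.status == "Running":
            self.status = "Completed"
```

Usage: call `saga.run(step, undo)` for each step in order and `saga.complete()` at the end; after a business failure, `saga.status` ends as `Compensated` and `saga.log` shows which undo actions ran.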
13.2 Outbox (table idea)
outbox(id PK, aggregate_id, event_type, payload, created_at, sent_at NULL)
inbox(message_id PK, processed_at, status)
saga(order_id PK, state, step, next_attempt_at, deadline_at, context JSONB)
saga_log(id PK, order_id, time, event, details)
13.3 Choreography (topic ideas)
`orders.placed` → consumers: `payments.authorize`, `inventory.reserve`
`payments.authorized` + `inventory.reserved` → `orders.try_finalize`
Any failure → `orders.compensate` → triggers `payments.cancel/refund`, `inventory.release`.
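The topic flow can be simulated with a synchronous in-memory bus. This is a toy sketch: `Bus`, `wire`, and the event fields `order_id` and `sku` are hypothetical, payment authorization always succeeds, and a real deployment would use a broker with asynchronous delivery.

```python
from collections import defaultdict


class Bus:
    """Synchronous in-memory pub/sub that records published topics."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.trace = []

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        self.trace.append(topic)
        for handler in list(self.handlers[topic]):
            handler(event)


def wire(bus, stock):
    """Subscribe the payment, inventory and order reactions to the bus."""
    ready = defaultdict(set)  # order_id -> confirmations received so far

    def authorize(event):
        bus.publish("payments.authorized", event)

    def reserve(event):
        if stock.get(event["sku"], 0) > 0:
            stock[event["sku"]] -= 1
            bus.publish("inventory.reserved", event)
        else:
            bus.publish("orders.compensate", event)

    def collect(topic):
        def handler(event):
            ready[event["order_id"]].add(topic)
            if ready[event["order_id"]] == {"payments.authorized", "inventory.reserved"}:
                bus.publish("orders.try_finalize", event)
        return handler

    def cancel_payment(event):
        bus.publish("payments.cancel", event)

    bus.subscribe("orders.placed", authorize)
    bus.subscribe("orders.placed", reserve)
    for topic in ("payments.authorized", "inventory.reserved"):
        bus.subscribe(topic, collect(topic))
    bus.subscribe("orders.compensate", cancel_payment)
```

Publishing `orders.placed` and then reading `bus.trace` shows the path an order took, which is a small-scale version of the tracing that choreography needs in production.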
14) Comparison with 2PC and ES
2PC: strong consistency, but locks, bottlenecks, and painful failure handling.
Saga: eventual consistency; requires discipline around compensations and telemetry.
Event Sourcing: stores events as a source of truth; sagas on it are natural, but add complexity to migrations/snapshots.
15) Safety and compliance
Transport security (TLS/mTLS), ACL per topic/queue.
In events - a minimum of PII, encryption of sensitive fields, tokenization.
Audit access to sagas and compensation logs.
SLAs with external providers (payments/delivery) determine the deadline and retry-limit parameters.
16) Implementation checklist (0-45 days)
0-10 days
Select the candidate processes (multi-service, compensable).
Select the model (orchestration/choreography/TCC) and the correlation key.
Describe steps/compensations, invariants, and deadlines. Create the `saga`, `outbox`, and `inbox` tables.
11-25 days
Enable outbox/inbox, idempotency, and retries with backoff.
Deploy the first sagas; add SLI/SLO dashboards and tracing.
Write a runbook of compensations (including manual) and escalations.
26-45 days
Automate GC of "stuck" sagas and periodic restarts/continuations on deadline.
Run a game day: step failure, deadline overrun, broker unavailability.
Standardize event contracts (versions, compatibility), set up a "saga catalog."
17) Anti-patterns
"Compensation = delete from database" instead of domain-correct reverse action.
No outbox/inbox → loss of events/double effects.
Retries without jitter → a self-inflicted DDoS on dependencies.
Expecting strong consistency on reads without a "processing in progress..." state.
One giant orchestrator for everything → a control monolith.
Total choreography without visibility and SLA → uncontrollable dance.
Ignoring deadlines → eternal reserves/holds.
18) Maturity metrics
≥ 90% of critical processes are covered by sagas/compensations and have the described invariants.
Outbox/inbox are integrated for all Tier-0/1 producers/consumers.
SLO: p95 end-to-end saga latency is within target, success rate is stable, orphaned sagas < target.
Transparent tracing and per-step dashboards, burn-rate alerts.
Quarterly game days and checks of the manual compensation runbook.
19) Conclusion
A saga is a practical consistency contract for distributed systems: clear steps and reverse actions, publishing discipline (outbox/inbox), deadlines and retries, observability, and compensation processes. Choose a model (orchestration/choreography/TCC), fix invariants and keys, make handlers idempotent - and your multi-service business processes will become predictable and stable without expensive 2PC.