Sagas and distributed transactions
A saga is a long-running business transaction broken down into a sequence of local steps across different services and data stores. Each step has a compensating action that undoes the step's effect if the saga fails partway through. Unlike 2PC/3PC, sagas do not hold global locks, which makes them suitable for microservices, multi-region deployments, and high load, provided eventual consistency is acceptable.
1) When to choose sagas (and when not)
Fit:
- Long, multi-step business processes (order → payment → reserve → delivery).
- Different domains and data stores with no shared transaction.
- High availability and scale-out are required.
Not a fit:
- Strict ACID atomicity is critical (for example, transferring large amounts within the same ledger).
- Steps cannot be clearly compensated (the effect can be neither reserved nor cancelled).
- Legal or regulatory restrictions require strict isolation and an invariant that holds at every instant.
2) Saga models
1. Orchestration (Saga Orchestrator): a central coordinator manages steps and compensations.
Pros: explicit flow, error control, simpler telemetry.
Cons: a central point of dependency, risk of a "fat" coordinator.
2. Choreography: there is no center; steps are triggered by events ("service A did X → service B reacts").
Pros: loose coupling, simple scaling.
Cons: the flow is harder to trace and debug; risk of rules sprawling across services.
3. TCC (Try-Confirm/Cancel): each step runs "Try" first, then either Confirm or Cancel.
Pros: close to a two-phase protocol, resources are explicitly managed.
Cons: interfaces are more expensive to implement; "Try" holds need timeouts.
3) Step and compensation design
Invariants: state clearly what must hold before and after each step (for example, "balance ≥ 0").
Compensation ≠ reverse transaction: it is a logical action that cancels the business effect (refund, release, restore).
Idempotency: both the step and its compensator must be safe to repeat (keyed by 'operation_id').
Timeouts: each step has a deadline; exceeding it triggers compensation.
Irreversible effects: track them separately (notifications, e-mail) and handle them on a best-effort basis.
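As an illustration of an idempotent step paired with an idempotent compensator, here is a minimal in-memory sketch. The `InventoryService` class, its method names, and the outcome strings are hypothetical; a real service would persist the stock and the operation log in one local transaction.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryService:
    """Toy in-memory service (hypothetical); not a production design."""
    stock: int
    # operation_id -> recorded outcome, so a repeated call becomes a no-op
    applied: dict[str, str] = field(default_factory=dict)

    def reserve(self, operation_id: str, qty: int) -> str:
        if operation_id in self.applied:       # duplicate delivery or retry
            return self.applied[operation_id]
        if self.stock - qty < 0:               # invariant: stock >= 0
            outcome = "rejected"
        else:
            self.stock -= qty
            outcome = "reserved"
        self.applied[operation_id] = outcome
        return outcome

    def release(self, operation_id: str, reserve_op_id: str, qty: int) -> str:
        """Compensator: returns stock only if the original reserve took effect."""
        if operation_id in self.applied:
            return self.applied[operation_id]
        if self.applied.get(reserve_op_id) == "reserved":
            self.stock += qty
            outcome = "released"
        else:
            outcome = "nothing_to_release"
        self.applied[operation_id] = outcome
        return outcome
```

Calling `reserve("op-1", 3)` twice changes the stock once, and the same holds for `release`. The compensator is a separate logical action with its own 'operation_id', not a database rollback.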
4) Consistency and order
Eventual consistency: users may see temporary discrepancies; the UX should show waiting states, spinners, or statuses.
Ordering by key: partition saga steps by business key (order_id) so that events for the same key are processed in order.
Deduplication: keep a processing log ('operation_id' → status) with a TTL.
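The deduplication log can be sketched as a small map with expiry. The `DedupLog` class and `handle_once` helper are hypothetical names, and the injectable `clock` parameter is an assumption added to make the sketch testable; a real system would typically use a database table or a cache with native TTL support.

```python
import time


class DedupLog:
    """In-memory processing log: operation_id -> (status, expires_at)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def get(self, operation_id: str):
        entry = self._entries.get(operation_id)
        if entry is None:
            return None
        status, expires_at = entry
        if self.clock() >= expires_at:         # entry outlived its TTL
            del self._entries[operation_id]
            return None
        return status

    def record(self, operation_id: str, status: str) -> None:
        self._entries[operation_id] = (status, self.clock() + self.ttl)


def handle_once(log: DedupLog, operation_id: str, handler):
    """Run handler at most once per operation_id within the TTL window."""
    seen = log.get(operation_id)
    if seen is not None:
        return seen                            # duplicate: return stored status
    status = handler()
    log.record(operation_id, status)
    return status
```

The TTL should exceed the longest window in which a duplicate can still arrive (retry budget plus redrive delay), otherwise a late duplicate is processed again.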
5) Transport and reliability
Outbox pattern: write the event to a local outbox table within the same transaction as the state change, then publish it to the bus asynchronously.
Inbox/Idempotency store: on the consumer side, a log of messages already processed.
Effectively exactly-once: "outbox + idempotent consumer" gives exactly-once processing in practice, on top of at-least-once delivery.
DLQ: for poison messages, with rich metadata and safe redrive.
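A minimal sketch of the outbox pattern, using SQLite from the Python standard library as a stand-in for the service's local database. The table layout, the function names, and the `publish` callback are assumptions made for illustration.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload    TEXT NOT NULL,
            published  INTEGER NOT NULL DEFAULT 0
        );
    """)


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    # "with conn" commits both inserts together or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'CREATED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending outbox rows in insertion order; returns rows handled."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        # A crash between publish and this update causes a re-publish later,
        # which is why the consumer side must be idempotent.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The relay gives at-least-once delivery; combined with an inbox or deduplication log on the consumer, the overall effect is exactly-once processing.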
6) Error, retry, and backoff policies
Retry only idempotent steps; send write operations with an 'Idempotency-Key'.
Exponential backoff + jitter; cap the number of attempts and set an overall deadline for the saga.
Under systemic degradation: circuit breaker and graceful degradation (for example, skip a secondary feature of the saga).
Business conflicts ('409'): retry after reconciliation, or compensate and finish.
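The retry policy can be sketched as follows. The "full jitter" variant (a uniform random delay up to the exponential cap) is one common choice, not something the text prescribes; `RetryableError`, the default values, and the injectable `sleep`, `clock`, and `rng` parameters are assumptions for illustration.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures (timeouts, 5xx)."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: uniform in [0, min(cap, base * 2**(attempt - 1)))."""
    return rng() * min(cap, base * 2 ** (attempt - 1))


def call_with_retry(op, max_attempts=5, deadline_s=10.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Retry op on RetryableError, bounded by attempts and a total deadline."""
    give_up_at = clock() + deadline_s
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except RetryableError:
            delay = backoff_delay(attempt)
            # Stop if attempts are exhausted or the next wait would overrun
            # the overall deadline; the caller then compensates.
            if attempt == max_attempts or clock() + delay > give_up_at:
                raise
            sleep(delay)
```

Only idempotent operations should be passed as `op`, since a timed-out call may already have taken effect on the remote side.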
7) Orchestrator: Responsibilities and Structure
Functions:
- Tracking saga status: 'PENDING → RUNNING → COMPENSATING → DONE/FAILED'.
- Scheduling steps, deadlines, timeouts, and retries.
- Routing events and launching compensations.
- Idempotency of coordinator operations (command log).
- Observability: 'saga_id' correlation in logs, traces, and metrics.
Storage:
- Tables 'saga', 'saga_step', 'commands', 'outbox'.
- Indexes on 'saga_id', 'business_key', 'status', 'next_run_at'.
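A sketch of how the orchestrator's storage and status tracking might look, again using SQLite as a stand-in. Only the 'saga' and 'saga_step' tables are shown; the column set, the index names, and the `ALLOWED` transition table are assumptions (in particular, which transitions are legal is a design decision the text does not spell out).

```python
import sqlite3

SCHEMA = """
CREATE TABLE saga (
    saga_id      TEXT PRIMARY KEY,
    business_key TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'PENDING',
    next_run_at  REAL
);
CREATE TABLE saga_step (
    saga_id TEXT NOT NULL REFERENCES saga(saga_id),
    seq     INTEGER NOT NULL,
    name    TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'PENDING',
    PRIMARY KEY (saga_id, seq)
);
CREATE INDEX saga_business_key ON saga (business_key);
CREATE INDEX saga_due ON saga (status, next_run_at);
"""

# Assumed transition table: DONE and FAILED are terminal.
ALLOWED = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"DONE", "COMPENSATING"},
    "COMPENSATING": {"FAILED"},
}


def transition(conn: sqlite3.Connection, saga_id: str, old: str, new: str) -> bool:
    """Compare-and-set the saga status; False means another worker got there first."""
    if new not in ALLOWED.get(old, set()):
        raise ValueError(f"illegal transition {old} -> {new}")
    with conn:
        cur = conn.execute(
            "UPDATE saga SET status = ? WHERE saga_id = ? AND status = ?",
            (new, saga_id, old),
        )
    return cur.rowcount == 1
```

The conditional `UPDATE ... AND status = ?` makes the transition itself idempotent-friendly: a repeated or stale command changes nothing and reports `False`.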
8) Choreography: rules and protection against the "snowball" effect
Event contracts: schemas and versioning (Avro/Proto/JSON Schema).
Clear semantics: "event as fact" vs "command."
Stopping the chain: a service that detects a mismatch publishes a 'Failed' or 'Compensate' event.
Alerts on endless loops.
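One possible way to make these rules concrete is an event envelope that marks fact vs command and carries a hop counter. The hop counter, the `MAX_HOPS` limit, the `SagaLoopDetected` event name, and the handler below are all assumptions for illustration; the text itself only asks for alerts on endless loops.

```python
from dataclasses import dataclass, field

MAX_HOPS = 8  # assumed limit; tune to the longest legitimate chain


@dataclass(frozen=True)
class Event:
    name: str
    kind: str                # "fact" (something happened) or "command" (a request)
    saga_id: str
    hops: int = 0            # incremented on every follow-up event
    payload: dict = field(default_factory=dict)


def follow_up(cause: Event, name: str, kind: str, **payload) -> Event:
    """Build an event caused by another one, keeping saga_id and counting hops."""
    return Event(name, kind, cause.saga_id, cause.hops + 1, payload)


def on_payment_reserved(event: Event, stock: dict) -> Event:
    """Hypothetical inventory handler reacting to a 'PaymentReserved' fact."""
    if event.hops >= MAX_HOPS:
        # Stop reacting and emit something an alert can be attached to.
        return follow_up(event, "SagaLoopDetected", "fact", at=event.name)
    sku, qty = event.payload["sku"], event.payload["qty"]
    if stock.get(sku, 0) < qty:
        # Mismatch detected: stop the chain with a failure event.
        return follow_up(event, "InventoryFailed", "fact", sku=sku)
    stock[sku] -= qty
    return follow_up(event, "InventoryReserved", "fact", sku=sku)
```

Upstream services subscribe to 'InventoryFailed' and run their own compensations; no central coordinator is involved.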
9) TCC: practical details
Try: reserve the resource with a TTL.
Confirm: commit and release temporary locks.
Cancel: roll back the reservation (without side effects).
Garbage collection: cancel a Try automatically once its TTL expires (Cancel is idempotent).
Confirm and Cancel are idempotent: repeating them is safe.
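The TCC lifecycle can be sketched with an in-memory reservation store. `ReservationStore`, its method names, and the return conventions are hypothetical; a real participant would persist holds durably and run the garbage collector on a schedule.

```python
import time


class ReservationStore:
    """Toy TCC participant: holds with a TTL, idempotent Confirm and Cancel."""

    def __init__(self, available: int, ttl_s: float, clock=time.monotonic):
        self.available = available
        self.ttl_s = ttl_s
        self.clock = clock
        self.holds = {}   # hold_id -> (qty, expires_at)
        self.final = {}   # hold_id -> "confirmed" | "cancelled"

    def try_reserve(self, hold_id: str, qty: int) -> bool:
        if hold_id in self.final:
            return self.final[hold_id] == "confirmed"
        if hold_id in self.holds:
            return True                        # repeated Try
        if qty > self.available:
            return False
        self.available -= qty
        self.holds[hold_id] = (qty, self.clock() + self.ttl_s)
        return True

    def confirm(self, hold_id: str) -> bool:
        if self.final.get(hold_id) == "confirmed":
            return True                        # repeated Confirm
        hold = self.holds.get(hold_id)
        if hold is None:
            return False                       # cancelled, expired, or unknown
        if hold[1] <= self.clock():
            self.cancel(hold_id)               # too late: TTL already passed
            return False
        del self.holds[hold_id]                # resource stays consumed
        self.final[hold_id] = "confirmed"
        return True

    def cancel(self, hold_id: str) -> bool:
        hold = self.holds.pop(hold_id, None)
        if hold is not None:
            self.available += hold[0]          # give the reserved amount back
        if self.final.get(hold_id) != "confirmed":
            self.final[hold_id] = "cancelled"
        return self.final[hold_id] == "cancelled"

    def collect_expired(self) -> list:
        """Garbage collection: cancel every hold whose TTL has passed."""
        now = self.clock()
        expired = [h for h, (_, exp) in self.holds.items() if exp <= now]
        for hold_id in expired:
            self.cancel(hold_id)
        return expired
```

Note that a Cancel arriving after a successful Confirm does not undo the confirmation here; undoing a confirmed effect is the job of a separate compensating action.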
10) Example (in words): "Order with payment and delivery"
1. CreateOrder (local) → outbox: 'OrderCreated'.
2. PaymentService: 'Try' reserve (TCC); on success → 'PaymentReserved', on failure → 'PaymentFailed'.
3. InventoryService: reserve the product; if out of stock → 'InventoryFailed'.
4. ShippingService: create a delivery slot (cancelable).
5. If any step reports 'Failed' → the orchestrator runs compensations in reverse order: 'CancelShipping' → 'ReleaseInventory' → 'PaymentCancel'.
6. If all steps succeed → 'PaymentConfirm' → 'OrderConfirmed'.
11) Orchestrator pseudocode
    startSaga(saga_id, order_id):
        steps = [ReservePayment, ReserveInventory, BookShipment, ConfirmPayment]
        for step in steps:
            res = execWithRetry(step, order_id)
            if !res.ok:
                compensateInReverse(steps_done(order_id))
                return FAIL
        return OK

    execWithRetry(step, key):
        for attempt in 1..MAX:
            try:
                return step.run(key)  # idempotent
            catch RetryableError:
                sleep(backoff(attempt))
            catch NonRetryableError:
                return FAIL
        return FAIL

    compensateInReverse(done_steps):
        for step in reverse(done_steps):
            step.compensate()  # idempotent
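The pseudocode above could be turned into runnable Python roughly as follows. This is a sketch under assumptions: the `Step` dataclass, the exception classes, and the backoff constants are hypothetical, and the list of completed steps is kept in memory only for brevity, whereas a real orchestrator would persist it.

```python
import time
from dataclasses import dataclass
from typing import Callable


class RetryableError(Exception):
    """Transient failure: the step may be retried."""


class NonRetryableError(Exception):
    """Permanent failure: retrying will not help."""


@dataclass
class Step:
    name: str
    run: Callable[[str], None]          # must be idempotent
    compensate: Callable[[str], None]   # must be idempotent


def exec_with_retry(step: Step, key: str, max_attempts: int = 3,
                    sleep=time.sleep) -> bool:
    for attempt in range(1, max_attempts + 1):
        try:
            step.run(key)
            return True
        except RetryableError:
            if attempt < max_attempts:
                sleep(min(2.0, 0.1 * 2 ** attempt))  # capped exponential backoff
        except NonRetryableError:
            return False
    return False


def start_saga(steps: list, order_id: str, sleep=time.sleep) -> str:
    done = []  # in memory for brevity; persist this in a real orchestrator
    for step in steps:
        if not exec_with_retry(step, order_id, sleep=sleep):
            # Undo completed steps in reverse order. The failed step itself is
            # not compensated here, matching the pseudocode.
            for finished in reversed(done):
                finished.compensate(order_id)
            return "FAIL"
        done.append(step)
    return "OK"
```

If a failed step can leave a partial effect behind (for example, a timeout after the remote side applied the change), it may need to be compensated as well; that depends on the step's contract.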
12) Observability and operational SLOs
Tracing: a single 'saga_id', with annotations 'step', 'attempt', 'decision' (run/compensate/skip).
Metrics:
- Saga success/error rate (%), average duration, p95/p99.
- Share of compensated sagas, top reasons for compensation.
- Queue and outbox lag, retries per step.
Logs/audit: orchestrator decisions, resource identifiers, business keys.
13) Testing and chaos
High saga volume: check WFQ/DRR scheduling and queue caps, and the absence of head-of-line blocking.
Injecting errors into each step: timeouts, '5xx', business conflicts.
Out-of-order events, duplicates, drops.
Long latency tails: check deadlines and compensations.
Redrive from the DLQ, both per step and for a whole saga.
14) Multi-tenancy, regions, compliance
Tags 'tenant_id/plan/region' in events and saga stores.
Residency: data and events do not leave the region; cross-region sagas are designed as federations of local sagas plus aggregating events.
Prioritization: VIP sagas get a larger quota weight; workers are isolated per tenant.
15) Pre-production checklist
- Each step has a clear compensator, and both are idempotent.
- Pattern selected: orchestration/choreography/TCC; responsibility boundaries are described.
- Outbox/Inbox implemented, deduplication by 'operation_id'.
- Retry policies: backoff with jitter, attempt limits, and an overall saga deadline.
- Event contracts are versioned and schemas are validated.
- DLQ and safe redrive are configured.
- Telemetry: metrics, tracing, 'saga_id' correlation.
- Operational playbooks: manual cancel/force-confirm, unblocking stuck sagas.
- Chaos and load tests pass, SLO/error budget defined.
16) Typical mistakes
No compensator, or an "impure" one (it has side effects).
No idempotency/deduplication: duplicates and state flapping.
"Saga within a saga" without explicit boundaries: cycles and deadlocks.
No deadlines → "eternal" sagas and resource leaks.
The orchestrator keeps state in memory without a durable store.
Choreography without centralized telemetry → "invisible" failures.
Opaque UX: users do not see intermediate statuses.
17) Quick recipes
Classic SaaS: orchestration + outbox/inbox, exponential backoff, DLQ, saga statuses in the UI.
Strong resource invariants: TCC with reservation TTL and garbage-collected Cancel.
High volume/load: event choreography + strict idempotency and key metrics.
Multi-region: local sagas + final aggregates; avoid global locks.
Conclusion
Sagas are a way to get predictable consistency in distributed systems without global locks. Clear compensators, idempotency, reliable delivery (outbox/inbox), disciplined timeouts and retries, plus telemetry and playbooks keep complex business processes stable and understandable as load, the number of services, and the number of regions grow.