Sagas and distributed transactions

A saga is a long-running business transaction broken down into a sequence of local steps across different services/data stores. Each step has a compensating action that undoes the step's effect if the saga fails partway through. Unlike 2PC/3PC, sagas do not hold global locks and are suitable for microservices, multi-region deployments and high loads, where eventual consistency is acceptable.

1) When to choose sagas (and when not)

Fit:

Long/multi-step business processes (order → payment → reserve → delivery).
Different domains and data stores that share no common transaction.
Need high availability and scale-out.

Not suitable:

Strict ACID atomicity is critical (for example, transferring large amounts within a single ledger).
There is no clear way to compensate (the effect cannot be "reserved" or cancelled).
Legal/regulatory requirements demand strict isolation and invariants that hold at every instant.

2) Saga models

1. Orchestration: a central coordinator manages steps and compensations.

Pros: explicit flow, error control, simplified telemetry.
Cons: a point of centralization, risk of a "fat" coordinator.

2. Choreography: no central coordinator; steps are triggered by events ("service A did X → service B reacts").

Pros: loose coupling, simple scaling.
Cons: the flow is harder to trace and debug; risk of rule "sprawl".

3. TCC (Try-Confirm/Cancel): each step does a "Try", followed by Confirm or Cancel.

Pros: close to a two-phase protocol, resources are managed explicitly.
Cons: more interfaces to implement; "Try" holds need timeouts.

3) Step and compensation design

Invariants: clearly state what must hold before and after the step (for example, "balance ≥ 0").
Compensation ≠ reverse transaction: it is a logical action that cancels the business effect (refund, release, restore).
Idempotence: both the step and its compensator must be safe to repeat (keyed by 'operation_id').
Timeouts: each step has a deadline; missing it triggers compensation.

Irreversible effects: record them separately (notifications, e-mail) and treat them as "best effort."
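
The idempotence and compensation rules above can be illustrated with a small in-memory sketch. The class 'InventoryService', its fields, and the status strings are hypothetical names chosen for this example, not part of any real service; a production version would keep the operation log in durable storage.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryService:
    """Toy in-memory service; every name here is a hypothetical illustration."""

    stock: int
    ops: dict = field(default_factory=dict)  # operation_id -> status string

    def reserve(self, operation_id: str, qty: int) -> bool:
        # Idempotence: a repeat with a known operation_id replays the
        # recorded outcome instead of reserving a second time.
        if operation_id in self.ops:
            return self.ops[operation_id] == "reserved"
        ok = self.stock - qty >= 0  # invariant: stock never drops below zero
        if ok:
            self.stock -= qty
        self.ops[operation_id] = "reserved" if ok else "rejected"
        return ok

    def release(self, operation_id: str, qty: int) -> None:
        # Compensation: returns stock only if the matching reserve succeeded
        # and has not been released yet, so repeats are harmless.
        if self.ops.get(operation_id) == "reserved":
            self.stock += qty
            self.ops[operation_id] = "released"
```

Calling 'reserve' or 'release' twice with the same 'operation_id' leaves the stock unchanged after the first call, which is what makes retries of both the step and the compensator safe.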

4) Consistency and order

Eventual consistency: users may see temporary discrepancies; the UX should cover them with "please wait" states, spinners, and statuses.
Ordering by key: group saga steps by business key (order_id) so that events for the same key are processed in order.
Deduplication: store a processing log ('operation_id' → status) with a TTL.
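
As a rough sketch of the last two points, the snippet below maps a business key to a stable partition and keeps a processing log with a TTL. 'partition_for' and 'DedupLog' are illustrative names, and the in-memory dictionary stands in for whatever store a real system would use.

```python
import hashlib
import time


def partition_for(business_key: str, partitions: int) -> int:
    # A stable hash (unlike Python's per-process hash()) keeps every event
    # for one order_id on the same partition, preserving per-key order.
    digest = hashlib.sha256(business_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


class DedupLog:
    """Processing log: operation_id -> (status, expiry). In-memory assumption."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self.entries = {}

    def seen(self, operation_id: str):
        """Return the recorded status, or None if unknown or expired."""
        entry = self.entries.get(operation_id)
        if entry is None:
            return None
        status, expires_at = entry
        if self.clock() >= expires_at:
            del self.entries[operation_id]  # lazy expiry on read
            return None
        return status

    def record(self, operation_id: str, status: str) -> None:
        self.entries[operation_id] = (status, self.clock() + self.ttl)
```

The TTL should exceed the longest window in which a duplicate can still arrive (retries plus redelivery), otherwise an expired entry lets a late duplicate through.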

5) Transport and reliability

Outbox pattern: write the event to a local outbox table in the same transaction as the state change, then publish it to the bus asynchronously.
Inbox/idempotency store: on the consumer side, a log of messages already processed.

Effectively exactly-once: delivery remains at-least-once, but "outbox + idempotent consumer" gives a practical "exactly once" effect.

DLQ: for "poison" messages, with rich metadata and safe redrive.
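
A minimal outbox sketch using Python's built-in sqlite3 module is shown below. The table layout, the 'create_order' and 'relay_once' functions, and the 'publish' callback are assumptions made for illustration; a real relay would run in the background and publish to an actual message bus.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    # "with conn" commits both inserts together or rolls both back,
    # so the state change and its event can never diverge.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'CREATED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending outbox rows in insertion order; return how many."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried later.
        # A crash between publish and UPDATE causes a re-publish, which is
        # why the consumer side has to be idempotent.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The relay gives at-least-once delivery; combined with a consumer-side inbox keyed by message id, the observable effect is applied once.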

6) Error, retry, and backoff policies

Retry only idempotent steps; write operations carry an 'Idempotency-Key'.
Exponential backoff + jitter; a cap on attempts and an overall saga deadline.
Under systemic degradation: Circuit Breaker and graceful degradation (for example, drop a secondary, non-critical part of the saga).
Business conflicts ('409'): retry after reconciliation, or compensate and finish.
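
The retry policy above might look roughly like the following sketch, which uses the "full jitter" variant of exponential backoff. 'RetryableError', 'call_with_retry', and all default values are illustrative assumptions, and the function expects 'max_attempts' to be at least 1.

```python
import random
import time


class RetryableError(Exception):
    """Transient failure (timeout, 5xx); hypothetical marker type."""


def call_with_retry(fn, max_attempts=5, deadline_s=10.0, base=0.1, cap=5.0,
                    sleep=time.sleep, clock=time.monotonic, rng=random.random):
    """Call fn() until it succeeds, attempts run out, or the deadline would pass.

    Only RetryableError is retried; any other exception propagates at once.
    """
    started = clock()
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            # Full jitter: a random delay in [0, min(cap, base * 2**attempt)).
            delay = rng() * min(cap, base * 2 ** attempt)
            out_of_attempts = attempt == max_attempts - 1
            past_deadline = clock() - started + delay > deadline_s
            if out_of_attempts or past_deadline:
                raise  # give up: the caller compensates or fails the saga
            sleep(delay)
```

Jitter spreads retries from many sagas over time, so a recovering dependency is not hit by synchronized waves of requests.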

7) Orchestrator: Responsibilities and Structure

Functions:

Tracking the saga status: 'PENDING → RUNNING → COMPENSATING → DONE/FAILED'.
Scheduling steps, deadlines, timeouts, retries.
Event routing and compensation launch.
Idempotency of coordinator operations (command log).
Observability: 'saga_id' correlation in logs/traces/metrics.

Storage:

Tables: 'saga', 'saga_step', 'commands', 'outbox'.
Indexes on 'saga_id', 'business_key', 'status', 'next_run_at'.
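
One way to keep the status field honest is to validate every transition against an explicit table. The sketch below is one reading of the status chain listed above: it assumes a successful saga goes RUNNING → DONE and a compensated one goes RUNNING → COMPENSATING → FAILED. That interpretation, and the names 'SagaStatus', 'ALLOWED', and 'transition', are assumptions of this example.

```python
from enum import Enum


class SagaStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPENSATING = "COMPENSATING"
    DONE = "DONE"
    FAILED = "FAILED"


# Allowed transitions (assumed reading of the status chain).
ALLOWED = {
    SagaStatus.PENDING: {SagaStatus.RUNNING},
    SagaStatus.RUNNING: {SagaStatus.DONE, SagaStatus.COMPENSATING},
    SagaStatus.COMPENSATING: {SagaStatus.FAILED},
    SagaStatus.DONE: set(),    # terminal
    SagaStatus.FAILED: set(),  # terminal
}


def transition(current: SagaStatus, target: SagaStatus) -> SagaStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

In a persistent orchestrator the same check would typically be paired with a conditional update on the 'saga' row, so that two workers cannot both move the same saga forward.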

8) Choreography: rules and protection against the "snowball"

Event contracts: schemas and versioning (Avro/Proto/JSON Schema).

Clear semantics: "event fact" vs "command."

Stopping the chain: a service that detects a mismatch publishes a 'Failed'/'Compensate' event.

Alerts on "endless loops."
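
The following toy event bus sketches how a choreographed chain can be stopped and how a hop counter can catch endless loops. 'EventBus', its methods, and the convention that handlers return follow-up events are all assumptions of this example; a real bus would be a broker, and dropped events would raise an alert.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory bus; handlers return a list of (event_type, payload) to emit."""

    def __init__(self, max_hops: int = 10):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.max_hops = max_hops
        self.dropped = []  # events cut off by the loop guard

    def subscribe(self, event_type: str, handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict, hops: int = 0) -> None:
        self.queue.append((event_type, payload, hops))

    def drain(self) -> None:
        while self.queue:
            event_type, payload, hops = self.queue.popleft()
            if hops >= self.max_hops:
                # Loop guard: stop propagating and record it for alerting.
                self.dropped.append((event_type, payload))
                continue
            for handler in self.handlers[event_type]:
                for next_type, next_payload in handler(payload):
                    self.publish(next_type, next_payload, hops + 1)
```

A payment handler can return a 'PaymentFailed' event to stop the chain, and the order service reacts to that event by cancelling the order, with no coordinator involved.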

9) TCC: practical details

Try: reserve the resource with a TTL.
Confirm: commit and release temporary locks.
Cancel: roll back the reservation (without side effects).
Garbage collection: automatically cancel a Try after its TTL expires (idempotent Cancel).
Confirm and Cancel are idempotent: repeating them is safe.
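
A compact sketch of these TCC rules for a single account follows. 'TccAccount' and its method names are hypothetical, and state is kept in memory purely for illustration. The sketch also assumes that a Try arriving after a Cancel for the same transaction must be rejected, which the text above does not state explicitly.

```python
import time


class TccAccount:
    """Toy TCC participant: holds funds on Try, debits them on Confirm."""

    def __init__(self, balance: int, clock=time.monotonic):
        self.balance = balance
        self.clock = clock       # injectable for tests
        self.holds = {}          # tx_id -> (amount, expires_at)
        self.confirmed = set()
        self.cancelled = set()

    def available(self) -> int:
        return self.balance - sum(amount for amount, _ in self.holds.values())

    def try_reserve(self, tx_id: str, amount: int, ttl: float) -> bool:
        if tx_id in self.cancelled:
            return False         # late Try after Cancel must not re-reserve
        if tx_id in self.holds or tx_id in self.confirmed:
            return True          # repeated Try: already reserved
        if amount > self.available():
            return False
        self.holds[tx_id] = (amount, self.clock() + ttl)
        return True

    def confirm(self, tx_id: str) -> bool:
        if tx_id in self.confirmed:
            return True          # idempotent repeat
        hold = self.holds.get(tx_id)
        if hold is None or hold[1] <= self.clock():
            return False         # never reserved, cancelled, or expired
        del self.holds[tx_id]
        self.balance -= hold[0]
        self.confirmed.add(tx_id)
        return True

    def cancel(self, tx_id: str) -> None:
        self.holds.pop(tx_id, None)   # idempotent; no other side effects
        if tx_id not in self.confirmed:
            self.cancelled.add(tx_id)

    def collect_expired(self) -> list:
        """Garbage collection: cancel every hold whose TTL has passed."""
        now = self.clock()
        expired = [tx for tx, (_, exp) in self.holds.items() if exp <= now]
        for tx in expired:
            self.cancel(tx)
        return expired
```

Because Confirm refuses an expired hold, a coordinator that stalls past the TTL cannot commit against funds that garbage collection has already released, or is about to release.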

10) Example (described in words): "Order with payment and delivery"

1. CreateOrder (local) → outbox: 'OrderCreated'.
2. PaymentService: 'Try' reserve (TCC); on success → 'PaymentReserved', on failure → 'PaymentFailed'.
3. InventoryService: reserve the product; if out of stock → 'InventoryFailed'.
4. ShippingService: create a delivery slot (cancelable).
5. If any step reports 'Failed', the orchestrator runs compensations in reverse order: 'CancelShipping' → 'ReleaseInventory' → 'PaymentCancel'.
6. If everything succeeds → 'PaymentConfirm' → 'OrderConfirmed'.

11) Orchestrator pseudocode

startSaga(saga_id, order_id):
    steps = [ReservePayment, ReserveInventory, BookShipment, ConfirmPayment]
    for step in steps:
        res = execWithRetry(step, order_id)
        if !res.ok:
            compensateInReverse(steps_done(order_id))
            return FAIL
    return OK

execWithRetry(step, key):
    for attempt in 1..MAX:
        try:
            return step.run(key)  # idempotently
        catch RetryableError:
            sleep(backoff(attempt))
        catch NonRetryableError:
            return FAIL
    return FAIL

compensateInReverse(done_steps):
    for step in reverse(done_steps):
        step.compensate()  # idempotently
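
For readers who want something executable, the pseudocode above could be rendered in Python roughly as follows. The 'Step' dataclass, the two exception types, and the fixed backoff formula are assumptions of this sketch. It also tracks completed steps in a local list, whereas a real orchestrator would persist them so a restart can resume or compensate.

```python
import time
from dataclasses import dataclass
from typing import Callable


class RetryableError(Exception):
    """Transient failure worth retrying (hypothetical marker type)."""


class NonRetryableError(Exception):
    """Business failure; retrying will not help (hypothetical marker type)."""


@dataclass
class Step:
    name: str
    run: Callable[[str], None]          # must be idempotent
    compensate: Callable[[str], None]   # must be idempotent


def exec_with_retry(step: Step, key: str, max_attempts: int = 3,
                    sleep=time.sleep) -> bool:
    for attempt in range(1, max_attempts + 1):
        try:
            step.run(key)
            return True
        except RetryableError:
            if attempt < max_attempts:
                sleep(0.05 * 2 ** attempt)  # jitter omitted for brevity
        except NonRetryableError:
            return False
    return False  # retries exhausted


def run_saga(steps: list, order_id: str, sleep=time.sleep) -> str:
    done = []
    for step in steps:
        if not exec_with_retry(step, order_id, sleep=sleep):
            # Undo completed steps, newest first.
            for finished in reversed(done):
                finished.compensate(order_id)
            return "FAILED"
        done.append(step)
    return "DONE"
```

Only steps that actually completed are compensated; the failing step itself is expected to leave no effect behind, or to be cleaned up by its own timeout.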

12) Observability and operational SLOs

Tracing: a single 'saga_id', with annotations 'step', 'attempt', 'decision' (run/compensate/skip).

Metrics:

Saga success/error rate (%), average duration, p95/p99.
Share of compensated sagas, top reasons for compensation.
Queue/outbox lag, retries per step.

Logs/audits: orchestrator decisions, resource identifiers, business keys.
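
As a small illustration of the metrics listed above, the sketch below summarizes a batch of finished sagas. The record fields ('status', 'compensated', 'duration_s') and the function names are assumptions of this example, and the percentile uses the simple nearest-rank method rather than any particular monitoring system's definition.

```python
import math


def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def saga_summary(records: list) -> dict:
    """Aggregate finished-saga records into headline SLO numbers."""
    total = len(records)
    durations = [r["duration_s"] for r in records]
    return {
        "success_rate": sum(r["status"] == "DONE" for r in records) / total,
        "compensated_share": sum(bool(r["compensated"]) for r in records) / total,
        "avg_duration_s": sum(durations) / total,
        "p95_duration_s": percentile(durations, 95),
    }
```

In practice these numbers would come from a metrics backend; the point is only that success rate, compensated share, and tail latency are all derivable from per-saga records tagged with a 'saga_id'.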

13) Testing and chaos

Inject errors into each step: timeouts, '5xx', business conflicts.
Out-of-order events, duplicates, drops.
Long latency tails → check deadlines and compensations.

Bursts of sagas → check WFQ/DRR scheduling and queue caps, and the absence of "head-of-line blocking."

Redrive from the DLQ, step by step and for a whole saga.

14) Multi-tenancy, regions, compliance

Tags 'tenant_id/plan/region' in events and saga stores.
Residency: data/events do not leave the region; cross-regional sagas are designed as federations of local sagas + aggregating events.
Prioritization: VIP sagas get a larger quota weight; workers are isolated per tenant.

15) Pre-production checklist

Each step has a clear compensator, and both are idempotent.
Pattern chosen: orchestration/choreography/TCC; responsibility boundaries are documented.
Outbox/Inbox implemented, deduplication by 'operation_id'.
Retry policies: backoff with jitter, attempt limits, and an overall saga deadline.
Event contracts are versioned, and schemas are validated.
DLQ and safe redrive are configured.
Telemetry: metrics, tracing, 'saga_id' correlation.
Operational playbooks: manual cancel/force-confirm, unblocking "stuck" sagas.
Chaos and load tests pass, SLO/error budget defined.

16) Common mistakes

The compensator is missing or "unclean" (has side effects).
No idempotence/dedup: duplicates and state "flapping".
"Saga within a saga" without explicit boundaries: cycles and deadlocks.
No deadlines → "eternal" sagas and resource leaks.
The orchestrator keeps state "in memory" without a durable store.
Choreography without centralized telemetry → "invisible" failures.
Opaque UX: users do not see intermediate statuses.

17) Quick recipes

Classic SaaS: orchestration + outbox/inbox, exponential backoff, DLQ, saga statuses in the UI.
Strong resource invariants: TCC with reservation TTL and garbage-collected Cancel.
High volume/load: event choreography + strict idempotency and key metrics.
Multi-region: local sagas + final aggregates; avoid global locks.

Conclusion

Sagas are a way to get predictable consistency in distributed systems without global locks. Clear compensators, idempotence, reliable delivery (outbox/inbox), disciplined timeouts and retries, plus telemetry and playbooks are what keep complex business processes stable and understandable as load, the number of services, and the number of regions grow.
