Outbox pattern
The outbox is an architectural pattern in which a domain service writes a business change and the corresponding event to its own database in a single local transaction. Publishing the event to the external bus/queue is performed asynchronously by a separate, reliable process (the publisher) that reads the 'outbox' table and relays its records. This approach eliminates the "write to the database first, then to the bus" race and provides reliable delivery even when failures occur.
1) When to apply
Fit:
- Microservices and modular monoliths that exchange events between contexts.
- You must guarantee that "once the state is committed, the event cannot be lost."
- You need idempotency and controlled redelivery.
Not a fit:
- Strict global transactions across several resources are critical (TCC or sagas with explicit contracts are a better choice).
- There is no dedicated source of truth (state is not stored where the event is generated).
2) Objectives and properties
Atomic write: domain record + outbox record in one transaction.
At-least-once publication: duplicates are allowed, loss is not.
Consumer idempotence: protection against duplicates on the subscriber side.
Effectively exactly-once: achieved by combining outbox + idempotent consumer + dedup.
Clear telemetry: correlation of business transactions and events.
3) Data schema (example)
sql
-- Domain table (example: orders)
CREATE TABLE orders (
  id UUID PRIMARY KEY,
  tenant_id TEXT NOT NULL,
  status TEXT NOT NULL,
  total_amount NUMERIC(12,2) NOT NULL,
  updated_at TIMESTAMP NOT NULL DEFAULT now()
);
-- Outbox
CREATE TABLE outbox (
  id UUID PRIMARY KEY,               -- event_id
  aggregate_type TEXT NOT NULL,      -- 'order'
  aggregate_id UUID NOT NULL,        -- order_id
  tenant_id TEXT NOT NULL,
  type TEXT NOT NULL,                -- 'OrderCreated'
  payload JSONB NOT NULL,            -- serialized event
  headers JSONB NOT NULL DEFAULT '{}'::jsonb,
  occurred_at TIMESTAMP NOT NULL,    -- time in domain transaction
  available_at TIMESTAMP NOT NULL,   -- earliest publish time (backoff)
  published_at TIMESTAMP,            -- set on successful publication
  attempts INT NOT NULL DEFAULT 0,
  error TEXT
);
CREATE INDEX ON outbox (available_at) WHERE published_at IS NULL;
CREATE INDEX ON outbox (tenant_id, available_at) WHERE published_at IS NULL;
4) Application layer
pseudo
begin tx
  domainChange()              # INSERT/UPDATE in domain table
  insert into outbox (event)  # event with aggregate/tenant keys
commit tx
If the commit succeeds, the event is guaranteed to exist in the outbox. If the application crashes after the commit, the publisher will pick the event up later.
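The transaction above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes SQLite as a stand-in for the real database, a simplified schema with TEXT/REAL columns instead of UUID/JSONB/NUMERIC, and hypothetical helper names (`create_schema`, `create_order`, the optional `event_id` argument).

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def create_schema(conn):
    # Simplified SQLite stand-ins for the orders/outbox tables (assumption).
    conn.executescript(
        """
        CREATE TABLE orders (
            id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL,
            status TEXT NOT NULL, total_amount REAL NOT NULL);
        CREATE TABLE outbox (
            id TEXT PRIMARY KEY, aggregate_type TEXT NOT NULL,
            aggregate_id TEXT NOT NULL, tenant_id TEXT NOT NULL,
            type TEXT NOT NULL, payload TEXT NOT NULL,
            occurred_at TEXT NOT NULL, available_at TEXT NOT NULL,
            published_at TEXT, attempts INTEGER NOT NULL DEFAULT 0);
        """
    )


def create_order(conn, tenant_id, total_amount, event_id=None):
    """Insert the order and its OrderCreated event in one local transaction."""
    order_id = str(uuid.uuid4())
    event_id = event_id or str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat(timespec="microseconds")
    payload = json.dumps({"order_id": order_id, "total_amount": total_amount})
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO orders (id, tenant_id, status, total_amount)"
            " VALUES (?, ?, 'created', ?)",
            (order_id, tenant_id, total_amount),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, tenant_id,"
            " type, payload, occurred_at, available_at)"
            " VALUES (?, 'order', ?, ?, 'OrderCreated', ?, ?, ?)",
            (event_id, order_id, tenant_id, payload, now, now),
        )
    return order_id
```

If the second insert fails, the `with conn` block rolls back the order as well, so the domain row and the event are stored together or not at all.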
5) Publisher (reader → publisher)
Tasks:
- Periodically read unpublished events ('published_at IS NULL' and 'available_at <= now()') in batches.
- Try to publish to the bus/queue; on success, set 'published_at'.
- On error, increment 'attempts', move 'available_at' into the future (exponential backoff), and record 'error'.
- Respect per-tenant/per-key limits (fairness) and do not block production traffic.
pseudo
loop:
  events = select from outbox
           where published_at is null and available_at <= now()
           order by occurred_at limit BATCH_SIZE
           for update skip locked
  for e in events:
    try:
      broker.publish(topicFor(e), serialize(e.payload), headers(e))
      markPublished(e.id, now())
    except Retryable as err:
      backoff = computeBackoff(e.attempts)
      reschedule(e.id, now() + backoff, e.attempts + 1, err)
    except NonRetryable:
      moveToDLQ(e) or markError(e)  # by policy
  sleep(POLL_INTERVAL)
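One polling pass of the loop above might look like the following Python sketch. It makes several assumptions: SQLite stands in for the real database (it has no 'FOR UPDATE SKIP LOCKED', so a single publisher is assumed), timestamps are stored as epoch seconds, and `RetryableError`, `SCHEMA`, `publish_batch`, and the `publish` callable are hypothetical names rather than any broker's API. Non-retryable errors simply propagate here; DLQ handling is omitted.

```python
import sqlite3
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures (network, rate limits)."""


# Trimmed-down SQLite version of the outbox table; times are epoch seconds.
SCHEMA = """
CREATE TABLE outbox (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,
    payload TEXT NOT NULL,
    occurred_at REAL NOT NULL,
    available_at REAL NOT NULL,
    published_at REAL,
    attempts INTEGER NOT NULL DEFAULT 0,
    error TEXT
)
"""


def publish_batch(conn, publish, batch_size=200, base_backoff_s=0.25, now=None):
    """Run one polling pass and return how many events were published."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT id, type, payload, attempts FROM outbox"
        " WHERE published_at IS NULL AND available_at <= ?"
        " ORDER BY occurred_at LIMIT ?",
        (now, batch_size),
    ).fetchall()
    published = 0
    for event_id, event_type, payload, attempts in rows:
        try:
            publish(event_type, payload, {"Idempotency-Key": event_id})
        except RetryableError as exc:
            # Plain exponential backoff; jitter is left out to keep this short.
            retry_at = now + base_backoff_s * 2 ** attempts
            with conn:
                conn.execute(
                    "UPDATE outbox SET attempts = attempts + 1,"
                    " available_at = ?, error = ? WHERE id = ?",
                    (retry_at, str(exc), event_id),
                )
        else:
            with conn:
                conn.execute(
                    "UPDATE outbox SET published_at = ? WHERE id = ?",
                    (now, event_id),
                )
            published += 1
    return published
```

A scheduler would call `publish_batch` repeatedly with a sleep of the poll interval between passes. Marking an event as published only after the broker call returns is what makes delivery at-least-once: a crash between the two steps causes a re-publish, never a loss.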
6) Idempotency and deduplication
On the consumer side (Inbox/Idempotency store):
sql
CREATE TABLE inbox (
  consumer_name TEXT,
  event_id UUID,
  processed_at TIMESTAMP NOT NULL,
  PRIMARY KEY (consumer_name, event_id)
);
Algorithm: when an event arrives, first try to 'INSERT' it into 'inbox'; a key conflict means the event has already been handled, so the result is a no-op. Otherwise, run the business logic, preferably in the same transaction as the insert, so that a crash cannot leave the event marked as processed without its effect.
On the publisher side: put an 'Idempotency-Key' in the headers (for example, 'event_id') so that the bus/broker/proxy can filter duplicates.
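The consumer-side algorithm can be illustrated with a short Python sketch. It assumes SQLite in place of the real store and epoch-second timestamps; `INBOX_SCHEMA`, `handle_once`, and `apply_effect` are hypothetical names introduced for the example.

```python
import sqlite3
import time

# SQLite stand-in for the inbox table (assumption: TEXT ids, epoch seconds).
INBOX_SCHEMA = """
CREATE TABLE inbox (
    consumer_name TEXT NOT NULL,
    event_id TEXT NOT NULL,
    processed_at REAL NOT NULL,
    PRIMARY KEY (consumer_name, event_id)
)
"""


def handle_once(conn, consumer_name, event_id, apply_effect):
    """Apply the effect at most once per (consumer_name, event_id).

    Returns True if the effect ran, False if the event was a duplicate.
    """
    with conn:  # inbox row and business effect commit or roll back together
        try:
            conn.execute(
                "INSERT INTO inbox (consumer_name, event_id, processed_at)"
                " VALUES (?, ?, ?)",
                (consumer_name, event_id, time.time()),
            )
        except sqlite3.IntegrityError:
            return False  # primary-key conflict: already handled, no-op
        apply_effect(conn)
    return True
```

Because the inbox insert and the effect share one transaction, an exception inside `apply_effect` also removes the inbox row, and the redelivered event is processed again instead of being silently skipped.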
7) Order and causality
Local order per 'aggregate_id' is provided by sorting on 'occurred_at' and publishing "by key."
For partitioned log buses, partition by the 'aggregate_id'/'tenant_id' key so that the events of one aggregate land in the same partition.
If order is critical, avoid races between concurrent publisher workers on the same key.
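Key-based partitioning can be illustrated with a stable hash. This is only a sketch of the idea: real brokers ship their own partitioners, and `partition_for` is a hypothetical helper, not any client library's API.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key (e.g. an aggregate_id) to a stable partition number."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # A cryptographic hash is used only because it is stable across processes
    # and runs (unlike Python's built-in hash() for strings).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Every event of the same aggregate maps to the same partition, so a single consumer of that partition sees them in publication order. Changing the number of partitions remaps keys, which is one reason partition counts are usually chosen up front.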
8) CDC (Change Data Capture)
Instead of an active publisher, you can use CDC: an engine reads the database transaction log and relays the 'outbox' rows to the bus. Pros: minimal load on the database, exact ordering, no polling. Cons: more complex operations and coupling to the specifics of the DBMS. Both approaches are valid; choose based on team expertise and SLOs.
9) Errors, DLQ and Redrive
Retryable (network, limits): increment 'attempts' and postpone 'available_at' (exponential backoff + jitter).
Non-retryable (invalid schema/contract): move to a DLQ/dead-letter topic with rich metadata.
Safe redrive: batches, rate limiting, schema validation, priority below production traffic.
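A minimal sketch of exponential backoff with full jitter, together with the retry-or-DLQ decision, is shown below. The function names are hypothetical, and the defaults (250 ms initial delay, 10 s cap, 20 attempts) simply mirror the configuration example in this article.

```python
import random


def full_jitter_backoff_ms(attempts, initial_ms=250, max_ms=10_000, rng=random):
    """Draw a delay uniformly from [0, min(max_ms, initial_ms * 2**attempts)]."""
    ceiling = min(max_ms, initial_ms * 2 ** attempts)
    return rng.uniform(0, ceiling)


def next_action(attempts, retryable, max_attempts=20):
    """Decide what to do with a failed event: 'retry' or 'dlq'."""
    if not retryable or attempts >= max_attempts:
        return "dlq"
    return "retry"
```

Here `attempts` is the number of attempts already made. Randomizing over the whole interval spreads retries from many events over time, which helps avoid synchronized retry bursts against a recovering broker.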
10) Multi-tenancy and limits
Required tags: 'tenant_id', 'plan', 'region' in 'outbox.headers'.
Per-tenant fairness: the publisher allocates publication "windows" and attempt limits per tenant.
Residency: store the outbox in the same region as the domain data; publish only aggregates/summaries across regions.
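Per-tenant fairness can be approximated by picking events round-robin across tenants with a per-tenant cap. The sketch below is one possible approach rather than the only one; `fair_batch` and the dictionary shape of an event (`tenant_id`, `id`) are assumptions made for illustration.

```python
from collections import deque


def fair_batch(events, batch_size, per_tenant_limit):
    """Pick up to batch_size events, round-robin across tenants.

    `events` is assumed to be sorted by occurred_at, so the order of events
    within each tenant is preserved.
    """
    queues = {}
    for event in events:
        queues.setdefault(event["tenant_id"], deque()).append(event)
    taken = {tenant: 0 for tenant in queues}
    picked = []
    while len(picked) < batch_size:
        progressed = False
        for tenant, queue in queues.items():
            if len(picked) >= batch_size:
                break
            if queue and taken[tenant] < per_tenant_limit:
                picked.append(queue.popleft())
                taken[tenant] += 1
                progressed = True
        if not progressed:
            break  # every tenant is either drained or at its limit
    return picked
```

With this selection, a tenant that has thousands of pending events cannot take more than `per_tenant_limit` slots in a batch, so quieter tenants still get published in the same pass.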
11) Safety and compliance
PII redaction in payload/headers according to tenant/region policy.
Signing/encryption of the payload if the bus is external.
Audit all state transitions: created, published, error, redrive.
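As an illustration of payload redaction, the sketch below masks configured field names anywhere in a nested payload before it is published or logged. The `redact` helper and the mask string are hypothetical, and a real policy would typically be driven by tenant/region configuration and schema annotations rather than a flat set of key names.

```python
def redact(value, pii_fields, mask="[REDACTED]"):
    """Return a copy of `value` with every key listed in pii_fields masked."""
    if isinstance(value, dict):
        return {
            key: mask if key in pii_fields else redact(item, pii_fields, mask)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, pii_fields, mask) for item in value]
    return value  # scalars are returned unchanged
```

The function builds new containers instead of mutating its input, so the original payload stored in the outbox is left intact and only the outgoing copy is masked.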
12) Observability
Metrics:
- Publication lag ('now - occurred_at', p50/p95/p99).
- Success rate, error rate, distribution of error causes.
- Outbox size (number of unpublished events), retries/sec.
- Per-tenant graphs of throughput and lag.
Tracing:
- Correlation by 'event_id'/'aggregate_id'/'saga_id'; spans "db-tx," "publish," "retry."
- Annotations: 'attempt', 'backoff_ms', 'dlq=true'.
Logs:
- Short records on success; full details on error/redrive.
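The publication-lag metric can be computed from the 'occurred_at' values of events, for example with nearest-rank percentiles. This is a sketch with hypothetical helper names; in practice a metrics library or the monitoring backend would usually compute percentiles from a histogram.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def publication_lag_summary(now, occurred_at_values):
    """Summarize lag (now - occurred_at) as p50/p95/p99, in the input's units."""
    lags = [now - occurred_at for occurred_at in occurred_at_values]
    return {"p%d" % p: percentile(lags, p) for p in (50, 95, 99)}
```

Tracking p95/p99 rather than the average makes a stuck tenant or a slow partition visible even when most events are published quickly.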
13) Testing and chaos
Atomicity test: simulate a crash after the domain transaction commits but before publication; the event must still be published later.
Duplicate test: publish the same event several times; the consumer must apply exactly one effect (inbox).
Order test: a batch of events for one aggregate; check sequence and idempotence.
Chaos: broker failure, increased database latency, split-brain publishers, clock skew.
14) Configuration templates (example)
yaml
outbox:
  poll_interval_ms: 200
  batch_size: 200
  order_by: occurred_at
  backoff:
    strategy: exponential_full_jitter
    initial_ms: 250
    max_ms: 10000
    max_attempts: 20
  fairness:
    per_tenant_parallelism: 4
    per_key_serial: true
publisher:
  rate_limit_per_sec: 500
  headers:
    idempotency_key: event_id
    schema_version: v3
dlq:
  enabled: true
  topic: myapp.events.dlq
  include_metadata:
    - error
    - attempts
    - source_table
    - tenant_id
    - aggregate_id
15) Integration with sagas and retries
The outbox is a "reliable transport" for saga steps: a local transaction writes the effect and the command/event; publication is reliable and rate-limited.
Retry and backoff policies must be consistent with 'Retry-After' and the circuit breaker; avoid a "retry storm."
16) Typical errors
Writing the event after the domain state commit: the event can be lost if the process crashes in between.
No indexes or archiving on 'outbox': publishing latency grows.
Publisher without 'SKIP LOCKED' or without sharding: contention and blocking.
No idempotency on the consumer side: duplicates and side effects.
Unmasked PII ending up in DLQ/logs.
A single global publishing queue without fairness: a "noisy" tenant slows down everyone.
No lag monitoring: degradation goes unnoticed.
17) Quick strategy choice
Starting level: polling from the database, batches of 100-500, full-jitter backoff, inbox for consumers.
High load: CDC from the transaction log, sharding by 'tenant_id'/'aggregate_id', WFQ by tenant.
Strict order per aggregate: serial publication per key (mutex), topic partitioned by key.
Compliance/PII: payload encryption, DLQ redaction, regional outbox.
18) Pre-production checklist
- Domain changes and writes to 'outbox' occur in the same transaction.
- The publisher processes batches and uses 'SKIP LOCKED', backoff with jitter, and limits.
- Consumers are idempotent ('inbox' table/dedup log).
- DLQ and safe redrive are configured.
- Lag/error metrics and alerts on p95/p99 thresholds are in place.
- Per-key order is guaranteed (partitioning by key/serial publication per key).
- 'outbox' archiving/retention and cleanup of published records are configured.
- PII policies and state-transition auditing are in place.
- Crash tests between commit and publish, plus duplicate and order tests, pass.
- Event contracts are documented (schemas/versions/compatibility).
Conclusion
The outbox pattern turns the fragile "DB ↔ bus" link into a reliable pipeline: atomic state commits, guaranteed (albeit at-least-once) publication, idempotent subscribers, and controlled redrive. With proper telemetry, limits, and schema discipline, it delivers practical exactly-once behavior, reducing the complexity of distributed transactions and increasing the system's resilience to crashes and peak loads.