Event-Driven Architecture (EDA)
1) What is an event and why EDA
An event is an immutable fact that has already occurred in the domain ("PlayerVerified", "PaymentCaptured"). EDA builds integrations around publishing these facts and reacting to them:
- loose coupling between services,
- independent scaling of consumers,
- replay/rebuilding of projections,
- a transparent audit trail.
EDA does not replace synchronous APIs - it complements them by moving cross-service dependencies into the asynchronous layer.
2) Event types
Domain: significant business facts (OrderPlaced, BonusGranted).
Integration: snapshots/changes for external systems (UserUpdated, WalletBalanceChanged).
Technical: life cycle and telemetry (Heartbeat, PipelineFailed).
Commands (not events, but adjacent): "do X" instructions (CapturePayment).
Recommendation: domain events are primary; integration events are derived from them as projections for specific consumers.
3) Event contracts and schemas
Schema: Avro/Protobuf/JSON Schema + Schema Registry; compatibility strategy: `BACKWARD` for consumer evolution, `FULL` on critical topics.
CloudEvents (id, source, type, time, subject, datacontenttype) - uniform headers.
Required metadata: `event_id` (ULID/UUID), `occurred_at`, `producer`, `schema_version`, `correlation_id`/`causation_id`, `idempotency_key`.
Versioning: add-only fields, no renames or semantic breaks; new event types get new topics/types.
```json
{
  "type": "record", "name": "PaymentCaptured", "namespace": "events.v1",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "occurred_at", "type": {"type": "long", "logicalType": "timestamp-micros"}},
    {"name": "payment_id", "type": "string"},
    {"name": "amount", "type": {"type": "bytes", "logicalType": "decimal", "precision": 18, "scale": 2}},
    {"name": "currency", "type": "string"},
    {"name": "player_id", "type": "string"}
  ]
}
```
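A minimal sketch of assembling such an envelope in Python; the helper name and field defaults are illustrative, not a prescribed library:

```python
# Hypothetical helper: builds a CloudEvents-style envelope with the
# required metadata listed above; swap UUID for ULID if sortable ids matter.
import uuid
from datetime import datetime, timezone
from typing import Optional

def make_envelope(event_type: str, source: str, data: dict,
                  correlation_id: Optional[str] = None,
                  causation_id: Optional[str] = None) -> dict:
    event_id = str(uuid.uuid4())
    return {
        "id": event_id,                                  # CloudEvents: id
        "source": source,                                # CloudEvents: source
        "type": event_type,                              # e.g. "events.v1.PaymentCaptured"
        "time": datetime.now(timezone.utc).isoformat(),  # CloudEvents: time
        "datacontenttype": "application/json",
        "schema_version": "1",
        "correlation_id": correlation_id or event_id,    # new chain if absent
        "causation_id": causation_id,                    # id of the triggering event
        "idempotency_key": event_id,                     # or a business key
        "data": data,
    }
```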
4) Delivery, order and consistency
At-least-once is the default → handlers must be idempotent.
Order: guaranteed within a partition (Kafka) or a queue (RabbitMQ), but can be broken by retries; the event key must reflect the domain granule of ordering (for example, `player_id` - see the producer sketch below).
Consistency: for money/credits - only via ledgers/sagas/compensations; avoid last-write-wins (LWW).
Read models: projections and caches may be eventually consistent - show "update in progress..." and use read-your-writes strategies on strict paths.
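A producer sketch for the keyed publish, assuming the confluent-kafka client and a local broker; topic and settings mirror the recommendations in this document:

```python
# Sketch: publish with a domain-granular key so all events for one player
# land in the same partition and keep their relative order.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumption: local broker
    "acks": "all",
    "enable.idempotence": True,
})

def publish_payment_captured(event: dict) -> None:
    producer.produce(
        topic="payments.captured.v1",
        key=event["player_id"],             # the ordering granule
        value=json.dumps(event).encode("utf-8"),
    )
    producer.flush()                        # per-message flush: simple, not fast
```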
5) Outbox/Inbox and CDC
Outbox: the service writes a fact to its database and to the outbox table in one transaction → the worker publishes to the bus.
Inbox: the consumer stores the `event_id` together with the processing result for deduplication.
CDC (Change Data Capture): flow of changes from the database (binlog/WAL) to the bus to build integrations without application changes.
Idempotency: key processing on `idempotency_key`/`event_id`; do not change the outside world before the fact is committed.
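A sketch of the outbox write described above, using sqlite3 for brevity; table and column names are illustrative:

```python
# Sketch: the domain update and the outbox row commit in ONE local
# transaction, so the fact cannot be lost between the DB and the bus.
import json
import sqlite3
import uuid

def capture_payment(conn: sqlite3.Connection, payment_id: str, amount: str) -> None:
    event = {
        "event_id": str(uuid.uuid4()),
        "type": "PaymentCaptured",
        "payment_id": payment_id,
        "amount": amount,
    }
    with conn:  # sqlite3: commits both statements or rolls both back
        conn.execute(
            "UPDATE payments SET status = 'captured' WHERE id = ?",
            (payment_id,),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event["event_id"], json.dumps(event)),
        )
    # a separate relay worker reads `outbox` and publishes to the bus
```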
6) CQRS and Event Sourcing
CQRS: separate write model and read projections; projections are constructed from events and can lag behind.
Event Sourcing: aggregate state = a rollup (fold) of its events. Pros: full audit/replay; cons: complexity of migrations/schemas/snapshots.
Practice: ES - not everywhere, but where history and compensation are important; CQRS - almost always in EDA.
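A toy illustration of the Event Sourcing idea - state as a fold over the aggregate's events; event names and amounts are invented:

```python
# Sketch: rebuild aggregate state by folding its event history.
from functools import reduce

def apply(balance: int, event: dict) -> int:
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance  # ignore unknown events for forward compatibility

events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
]
state = reduce(apply, events, 0)  # 70; a snapshot is a cached fold prefix
```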
7) Sagas: Orchestration and Choreography
Orchestration: the coordinator sends commands and waits for response events; convenient for complex processes (KYC→Deposit→Bonus).
Choreography: services react to each other's events; easier but harder to trace.
Always define compensations and step deadlines.
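A minimal orchestration sketch: run the steps in order and, on failure, run the compensations of the completed steps in reverse. Step names are illustrative:

```python
# Sketch of an orchestrated saga with compensations.
def run_saga(steps):
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # compensations must themselves be idempotent
        raise

run_saga([
    (lambda: print("reserve funds"),   lambda: print("release funds")),
    (lambda: print("capture payment"), lambda: print("refund payment")),
    (lambda: print("grant bonus"),     lambda: print("revoke bonus")),
])
```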
8) Topology Design (Kafka/RabbitMQ)
Kafka
Topic per domain event: `payments.captured.v1`, `players.verified.v1`.
Partitioning key: `player_id`/`wallet_id` - where order is important.
`replication.factor=3`, `min.insync.replicas=2`, producer `acks=all`.
Retention: by time (e.g. 7-90 days) and/or compaction (last state by key).
Topics for retry and DLQ with backoff.
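A sketch of creating such a topic with confluent-kafka's AdminClient; the partition count and retention value are illustrative choices:

```python
# Sketch: create a topic with the replication/ISR settings above.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumption
topic = NewTopic(
    "payments.captured.v1",
    num_partitions=12,            # illustrative
    replication_factor=3,
    config={
        "min.insync.replicas": "2",
        "retention.ms": str(30 * 24 * 3600 * 1000),  # 30 days, illustrative
    },
)
for name, future in admin.create_topics([topic]).items():
    future.result()  # raises if creation failed
```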
RabbitMQ
Exchanges: `topic`/`direct`; routing key `payments.captured.v1`.
For wide fan-out - a `topic` exchange + several queues; for RPC/commands - dedicated queues.
Quorum queues for HA; TTL + dead-letter exchange for retries.
9) Observability and SLO EDA
SLI/SLO:
- End-to-end latency (occurred_at → processed): p50/p95/p99.
- Lag/age: consumer lag (Kafka) and backlog age (RabbitMQ).
- Throughput publishing/processing.
- DLQ-rate and proportion of repeats.
- Success of business transactions (e.g. "deposit confirmed ≤ 5 s").
- Correlation of events via `trace_id`/`correlation_id` (OTel).
- Exemplars linking metrics to traces.
- Dashboards "Producer→Broker→Consumer" with burn-rate alerts.
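A sketch of the end-to-end latency SLI in a consumer, assuming the prometheus-client package and `occurred_at` carried as Unix seconds:

```python
# Sketch: observe occurred_at → processed latency as a histogram,
# from which p50/p95/p99 are derived in the monitoring backend.
import time
from prometheus_client import Histogram

E2E_LATENCY = Histogram(
    "event_end_to_end_seconds",
    "Time from occurred_at to successful processing",
    buckets=(0.1, 0.5, 1, 2, 5, 10, 30, 60),
)

def on_processed(event: dict) -> None:
    occurred_at = event["occurred_at"]  # assumption: Unix seconds
    E2E_LATENCY.observe(time.time() - occurred_at)
```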
10) Replay, retention and backfill
Replay to rebuild projections/fix bugs: replay into a new projection/store, then switch reads over.
Retention: legal/business requirements (GDPR/PCI); sensitive fields - encrypt and/or tokenize.
Backfill: one-off topics/queues with explicit RPS limits so as not to overload production.
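A sketch of RPS-limited backfill pacing; `read_history` and `publish` are placeholder callables:

```python
# Sketch: a simple pacing loop that caps the backfill publish rate.
import time

def backfill(read_history, publish, max_rps: int = 200) -> None:
    interval = 1.0 / max_rps
    for event in read_history():             # placeholder: historical source
        started = time.monotonic()
        publish(event)                       # placeholder: one-off topic/queue
        elapsed = time.monotonic() - started
        if elapsed < interval:
            time.sleep(interval - elapsed)   # hold the max_rps budget
```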
11) Security and compliance
TLS in-transit, mTLS for internal clients.
Authorization: per-topic/per-exchange ACL; multitenancy via namespace/vhost.
PII: minimize fields in the event; envelope metadata separately, payloads encrypted if necessary.
Audit access to events, prohibit "all-powerful" keys.
Retention and right-to-delete (GDPR) policies: either store references to the data, or use tombstone events plus deletion in projections.
12) Testing in EDA
Contract tests: consumers validate their expectations against the schemas (consumer-driven).
Replay tests: run historical sampling through a new handler/schema version.
Chaos scenarios: broker delays/losses, node drops, consumer lag → SLOs stay within targets.
Smoke tests in CI: a short end-to-end pipeline on temporary topics.
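A sketch of a consumer-driven contract test using the jsonschema package; the schema is abridged and the sample payload invented:

```python
# Sketch: the consumer pins its expectations of the event shape
# and fails the build when the contract is violated.
import jsonschema

PAYMENT_CAPTURED_V1 = {
    "type": "object",
    "required": ["event_id", "payment_id", "amount", "currency"],
    "properties": {
        "event_id": {"type": "string"},
        "payment_id": {"type": "string"},
        "amount": {"type": "string"},
        "currency": {"type": "string", "minLength": 3, "maxLength": 3},
    },
}

def test_payment_captured_contract():
    sample = {"event_id": "e1", "payment_id": "p1",
              "amount": "10.00", "currency": "EUR"}
    jsonschema.validate(sample, PAYMENT_CAPTURED_V1)  # raises on violation
```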
13) Migration of "CRUD integrations → EDA"
1. Identify domain facts.
2. Embed outbox in source services.
3. Publish minimal domain events and connect 1-2 projections.
4. Gradually disable point synchronous integrations, replacing them with subscriptions.
5. Introduce a Schema Registry and a compatibility policy.
6. Evolve events add-only (new fields); breaking changes only via new event types.
14) Anti-patterns
Events as "API DTOs" (too fat, coupled to the internal model) - they break consumers.
Lack of Schema Registry and compatibility - "fragile" integrations.
Publishing from code and writing to the database are not atomic (no outbox) - you lose events.
"Exactly-once everywhere" - high price without benefit; better at-least-once + idempotency.
One "universal" partition key → a hot partition.
Replay straight into the production projection - breaks online SLOs.
15) Implementation checklist (0-45 days)
0-10 days
Identify domain events and their keys (granules of order).
Deploy Schema Registry and approve the compatibility strategy.
Add outbox/inbox to 1-2 services; minimal CloudEvents-envelope.
11-25 days
Enter retry/DLQ, backoff, idempotency of handlers.
Dashboards: lag/age/end-to-end; burn-rate alerts.
Event documentation (catalog), owners and schema review processes.
26-45 days
Replay/rebuild of the first projection; runbooks for replay and backfill.
Security policies (TLS, ACL, PII), retention, GDPR procedures.
Regular chaos and game days for the broker and consumers.
16) Maturity metrics
100% of domain events are described by schemas and registered.
Outbox/inbox covers all Tier-0/1 producers/consumers.
SLO: p95 end-to-end latency and consumer lag within targets ≥ 99%.
Replay/backfill are feasible without downtime; verified runbooks exist.
Versioning: new fields land without breaking changes; old consumers do not fall over.
Security: TLS + mTLS, ACL per topic, access logs, PII/retention policy.
17) Mini snippets
Kafka producer (reliable publishing, idea):
```properties
acks=all
enable.idempotence=true
max.in.flight.requests.per.connection=1
compression.type=zstd
linger.ms=5
```
Consumer handler (idempotency, pseudocode):
```python
if inbox.contains(event_id):
    return                    # dedup: already processed
process(event)                # side effects must be deterministic
inbox.commit(event_id)        # atomically with the side effect
commit_offset()
```
RabbitMQ Retry via DLX (idea):
- `queue: tasks` → on nack → DLX → `tasks.retry.1m` (TTL=60s) → back to `tasks`; then `5m/15m` tiers.
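A sketch of that DLX chain declared with pika; queue names follow the idea above, and only the first retry tier is shown:

```python
# Sketch: tasks dead-letter into a TTL'ed retry queue, which
# dead-letters expired messages back into the work queue.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

ch.queue_declare(queue="tasks", arguments={
    "x-dead-letter-exchange": "",             # default exchange
    "x-dead-letter-routing-key": "tasks.retry.1m",
})
ch.queue_declare(queue="tasks.retry.1m", arguments={
    "x-message-ttl": 60_000,                  # wait 60 s before retrying
    "x-dead-letter-exchange": "",
    "x-dead-letter-routing-key": "tasks",     # expired messages return to tasks
})
```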
18) Conclusion
EDA turns integrations into a stream of business facts with clear contracts and managed consistency. Build the foundation: schemas + a registry, outbox/inbox, ordering keys, idempotent handlers, SLOs and observability, safe retention and replay. Then events become your "source of truth" for scaling, analytics and new features - without fragile couplings and overnight migrations.