Replication and eventual consistency
1) Why eventual consistency
When the system is distributed across zones/regions, writing synchronously everywhere means high latency and low availability during network failures. Eventual consistency (EC) allows temporary divergence between replicas in exchange for:
- low write latency (the write is accepted locally),
- better availability during network partitions,
- horizontal scaling.
The key task is controlled relaxation of consistency: the user sees "fairly fresh" data, domain invariants are preserved, and conflicts are detected and resolved predictably.
2) Consistency models - what we promise the client
Strong: a read immediately sees the latest write.
Bounded staleness / read-not-older-than (RNOT): reads are no older than a given mark (LSN/version/time).
Causal: "cause and effect" ordering (A before B) is preserved.
Read-Your-Writes: a client sees its own recent writes.
Monotonic Reads: successive reads never "roll back" to older data.
Session: a bundle of such guarantees within one session.
Eventual: if no new writes arrive, all replicas converge.
Practice: combine Session + RNOT on critical paths and Eventual on read models/caches.
3) Replication: mechanics and anti-entropy
Synchronous (quorum/Raft): a write is considered successful after confirmation by N nodes; minimal RPO, but higher p99 latency.
Asynchronous: the leader commits locally and ships the log later; low latency, RPO > 0.
Physical (WAL/binlog): fast, but requires a homogeneous setup (same engine/version).
Logical/CDC: a stream of row/event-level changes; flexible routing and filters.
Anti-entropy: periodic reconciliation and repair (Merkle trees, hash comparison, background re-sync).
4) Version identifiers and causal ordering
Monotonic versions: increment/LSN/epoch; simple, but do not capture concurrency.
Lamport timestamps: a logical clock that orders events consistently with causality.
Vector clocks: capture concurrent branches and let you detect conflicting (concurrent) updates.
Hybrid clocks/TrueTime/Clock-SI: "not before T" logic for a global order.
Recommendation: for CRDTs/conflicting updates use vector clocks (a sketch follows below); for "not older than" guarantees use LSN/GTID.
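A minimal vector-clock sketch in Python (illustrative; the structures and names are my own, not from a specific library) showing how two writes from different regions are detected as concurrent rather than ordered:

# Minimal vector clock: one counter slot per node/region id.
def vc_increment(vc: dict, node: str) -> dict:
    out = dict(vc)
    out[node] = out.get(node, 0) + 1
    return out

def vc_merge(a: dict, b: dict) -> dict:
    # element-wise maximum: the merged clock dominates both inputs
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def vc_compare(a: dict, b: dict) -> str:
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened-before b: b can simply overwrite a
    if b_le_a:
        return "after"        # b happened-before a
    return "concurrent"       # neither dominates: a real conflict, needs a merge strategy

# Two regions update the same object independently -> the conflict is detected
v_eu = vc_increment({}, "eu")
v_us = vc_increment({}, "us")
assert vc_compare(v_eu, v_us) == "concurrent"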
5) Conflicts: detection and resolution
Typical situation: writes from two regions to the same object.
Strategies:
1. Last-Write-Wins (LWW) by timestamp/logical stamp - simple, but can silently lose updates.
2. Merge functions driven by domain logic:
- counter fields are added up (G-Counter/PN-Counter),
- sets are merged with "add-wins/remove-wins" semantics,
- amounts/balances go only through transaction journals, never through a simple LWW.
3. CRDTs (convergent types): G-Counter, OR-Set, LWW-Register, RGA for lists (an add-wins set sketch follows this list).
4. Operational transformation (rare for databases, more common in collaborative editors).
5. Manual resolution: the conflict lands in an "inbox" and the user picks the correct version.
Rule: domain invariants dictate the strategy. For money/balances, avoid LWW; use compensating transactions/events.
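To make the "add-wins" semantics concrete, here is a small OR-Set sketch (illustrative, not a production implementation): every add gets a unique tag, and a remove cancels only the tags it has observed, so a concurrent add survives the merge.

import uuid

class ORSet:
    """Add-wins observed-remove set: a remove only cancels the adds it has seen."""
    def __init__(self):
        self.adds = {}      # element -> set of unique add tags
        self.removes = {}   # element -> set of tags known to be removed

    def add(self, elem):
        self.adds.setdefault(elem, set()).add(uuid.uuid4().hex)

    def remove(self, elem):
        # cancels only locally observed adds; a concurrent add elsewhere survives
        self.removes.setdefault(elem, set()).update(self.adds.get(elem, set()))

    def contains(self, elem) -> bool:
        return bool(self.adds.get(elem, set()) - self.removes.get(elem, set()))

    def merge(self, other: "ORSet") -> None:
        for elem, tags in other.adds.items():
            self.adds.setdefault(elem, set()).update(tags)
        for elem, tags in other.removes.items():
            self.removes.setdefault(elem, set()).update(tags)

a, b = ORSet(), ORSet()
a.add("sku-1"); b.add("sku-1")
a.remove("sku-1")            # a observed only its own add
a.merge(b)
assert a.contains("sku-1")   # b's concurrent add wins the merge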
6) Write guarantees and idempotency
Idempotency keys on commands (payment, withdrawal, creation) → retries are safe.
Inbox and outbox deduplicate by idempotency key/sequence number.
Exactly-once is unattainable without strong assumptions; in practice, use at-least-once plus idempotency.
Outbox/Inbox pattern: writing to the database and publishing the event are made atomic via a local transaction; the recipient processes messages by idempotency key (see the sketch below).
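A minimal transactional-outbox sketch, assuming sqlite3 stands in for the real database and publish() is a hypothetical broker call: the business row and the outgoing event are committed in one local transaction, and a separate relay publishes later.

import json, sqlite3, uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, amount INTEGER)")
db.execute("CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)")

def create_order(order_id: str, amount: int) -> None:
    event_id = uuid.uuid4().hex   # doubles as the idempotency key for consumers
    payload = json.dumps({"type": "OrderCreated", "order_id": order_id,
                          "amount": amount, "event_id": event_id})
    with db:  # one local transaction: either both rows are committed, or neither
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        db.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)", (event_id, payload))

def relay_once(publish) -> None:
    # publish() is the hypothetical broker call; consumers deduplicate by event_id
    rows = db.execute("SELECT event_id, payload FROM outbox WHERE published = 0").fetchall()
    for event_id, payload in rows:
        publish(json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))

create_order("o-1", 500)
relay_once(print)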
7) Reads no older than X (RNOT)
Techniques:
- LSN/GTID gate: the client passes the minimum version (taken from the write response); the router/proxy sends the read to a replica that has caught up to LSN ≥ X, otherwise to the leader (a routing sketch follows this list).
- Time bound: "not older than 2 seconds" - a simple SLA without versions.
- Session pinning: for N seconds after a write, read only from the leader (Read-Your-Writes).
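A toy routing function for the LSN gate (illustrative; in a real system the replayed LSNs would come from monitoring or heartbeats, and LSNs may be opaque strings rather than integers):

def choose_endpoint(min_lsn: int, leader: str, replica_lsns: dict) -> str:
    """Return a replica that has replayed at least min_lsn, otherwise the leader."""
    for name, replayed in replica_lsns.items():
        if replayed >= min_lsn:
            return name
    return leader   # nobody has caught up yet: serve the read from the leader

# the client got lsn=1042 back from its write and passes it as x-min-lsn
print(choose_endpoint(1042, "leader", {"replica-a": 998, "replica-b": 1050}))   # replica-b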
8) Change streams and cache consistency
CDC → event bus (Kafka/Pulsar) → consumers (caches, indexes, read models).
Cache invalidation: topics 'invalidate:{ns}:{id}'; idempotent processing (see the consumer sketch below).
Rebuild/Backfill: if projections drift out of sync, rebuild them from the event log.
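A sketch of an idempotent invalidation consumer, assuming each message carries an event_id that can be used for deduplication under at-least-once delivery:

processed: set = set()     # processed event ids (persistent or TTL-bounded in reality)
cache: dict = {}

def on_invalidate(event: dict) -> None:
    if event["event_id"] in processed:                 # duplicate delivery: ignore safely
        return
    cache.pop(f"{event['ns']}:{event['id']}", None)    # deleting an absent key is a no-op
    processed.add(event["event_id"])

on_invalidate({"event_id": "e-1", "ns": "profile", "id": "42"})
on_invalidate({"event_id": "e-1", "ns": "profile", "id": "42"})   # redelivered, no effect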
9) Sagas and compensations (inter-service transactions)
In the EC world, long-lived operations are split into steps with compensating actions:
- Orchestration: a coordinator calls the steps and their compensations (see the sketch below).
- Choreography: steps react to events and publish the next events themselves.
Invariants (example): "balance ≥ 0" - checked at step boundaries, with compensation on violation.
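A minimal orchestration sketch: each step is paired with a compensation, and on failure the compensations of already-completed steps run in reverse order (the payment steps below are hypothetical):

def run_saga(steps) -> None:
    """steps: list of (action, compensation) pairs; compensations must be idempotent."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):   # undo completed steps in reverse order
            compensate()
        raise

run_saga([
    (lambda: print("reserve funds"), lambda: print("release funds")),
    (lambda: print("reserve stock"), lambda: print("release stock")),
    (lambda: print("confirm order"), lambda: print("cancel order")),
])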
10) Multi-region and network partitions
Local-write, async-replicate: write in the local region and deliver to the other regions asynchronously (EC).
Geo-fencing: data is pinned to its home region (low latency, fewer conflicts).
Quorum databases (Raft) for CP data; caches/read models are AP/EC.
Split-brain plan: if connectivity is lost, regions keep operating within domain limits (write fencing, quotas) and reconcile afterwards.
11) Observability and SLO
Metrics:
- Replica lag: time/LSN distance/offset (p50/p95/p99).
- Staleness: the percentage of responses staler than the threshold (for example, > 2 s or behind the target LSN).
- Conflict rate: rate of conflicts and of successful merges.
- Convergence time: how long replicas take to converge after a spike.
- Reconcile backlog: volume/age of batches awaiting reconciliation.
- RPO/RTO by data category (CP/AP).
Alerts: lag above target, growth in conflicts, prolonged staleness windows.
12) Designing data schemas for EC
Explicit version/vector in every record (columns 'version', 'vc').
Append-only logs for critical invariants (balances, accruals) - see the ledger sketch after this list.
Event identifiers (Snowflake/ULID) for ordering and deduplication.
Commutative fields (counters, sets) → CRDT candidates.
API design: PUT with If-Match/ETag, PATCH with preconditions.
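To illustrate the append-only-log point for balances, a toy in-memory ledger sketch (illustrative only): the balance is derived from signed entries carrying idempotency keys instead of being overwritten in place.

ledger: list = []        # append-only journal of signed entries
seen_keys: set = set()   # idempotency keys already applied

def append_entry(account: str, delta: int, idem_key: str) -> None:
    if idem_key in seen_keys:      # retried command: no second entry, no double charge
        return
    ledger.append({"account": account, "delta": delta, "idem_key": idem_key})
    seen_keys.add(idem_key)

def balance(account: str) -> int:
    return sum(e["delta"] for e in ledger if e["account"] == account)

append_entry("acc-1", +100, "k1")
append_entry("acc-1", -30, "k2")
append_entry("acc-1", -30, "k2")   # duplicate delivery, deduplicated
print(balance("acc-1"))            # 70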
13) Storage and read patterns
Read model/CQRS: writes go to the source of truth, reads go to projections (which may lag → show an "updated ..." hint).
Stale-OK routes (catalog/feed) vs strict routes (wallet/limits).
Sticky/bounded-stale flags in the request (header 'x-read-consistency') - a routing sketch follows.
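A sketch of routing by the 'x-read-consistency' flag (the header name comes from the text; the mode names and the lag source are assumptions):

def pick_target(headers: dict, leader: str, replica_lag_s: dict, max_lag_s: float = 2.0) -> str:
    mode = headers.get("x-read-consistency", "stale-ok")
    if mode == "strict":
        return leader                                   # wallet/limits: always the leader
    if mode == "bounded-stale":
        fresh = [n for n, lag in replica_lag_s.items() if lag <= max_lag_s]
        return fresh[0] if fresh else leader            # fall back if everyone lags
    return next(iter(replica_lag_s), leader)            # stale-ok: any replica will do

print(pick_target({"x-read-consistency": "bounded-stale"}, "leader",
                  {"replica-a": 5.0, "replica-b": 0.4}))   # replica-b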
14) Implementation checklist (0-45 days)
0-10 days
Categorize data: CP-critical (money, orders) vs EC/stale-OK (catalogs, search indexes).
Define staleness SLOs (e.g. "not older than 2 s") and target lags.
Enable object versioning and idempotency keys in the API.
11-25 days
Implement CDC and outbox/inbox, plus cache invalidation flows.
Add RNOT (LSN gate) and session pinning to write-critical paths.
Implement at least one merge strategy (LWW/CRDT/domain) and conflict log.
26-45 days
Automate anti-entropy (reconciliation/repair) and staleness reporting.
Run a game day: network partition, conflict surge, recovery.
Visualize on dashboards: lag, staleness, conflict rate, convergence.
15) Anti-patterns
Blind LWW for critical invariants (loss of money/points).
Missing idempotency → duplicate operations on retries.
Strong consistency everywhere → excessive p99 tails and fragility under failures.
No RNOT/Session guarantees → the UX "flickers," users do not see their own changes.
Hidden divergence between cache and source of truth (no CDC/invalidation).
No reconcile/anti-entropy tooling - data diverges "for ages."
16) Maturity metrics
Replica lag p95 ≤ target (for example, ≤ 500 ms within a region, ≤ 2 s between regions).
Staleness SLO is met for ≥ 99% of requests on "strict" routes.
Conflict resolution success ≥ 99.9%, average resolution time ≤ 1 min.
Convergence time after peaks - minutes, not hours.
100% of "money" transactions are covered by idempotency keys and outbox/inbox.
17) Recipes (snippets)
If-Match/ETag (HTTP)
PUT /profile/42
If-Match: "v17"
Body: { "email": "new@example.com" }
If the version has changed, the server returns '412 Precondition Failed' → the client resolves the conflict.
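A framework-agnostic sketch of the server side of this check (illustrative; the "vN" ETag format is an assumption):

def handle_put_profile(profile: dict, headers: dict, new_email: str):
    expected = headers.get("If-Match", "").strip('"')
    current = f"v{profile['version']}"
    if expected != current:
        return 412, {"error": "precondition failed", "current_etag": current}
    profile["email"] = new_email
    profile["version"] += 1
    return 200, {"etag": f"v{profile['version']}"}

profile = {"email": "old@example.com", "version": 17}
print(handle_put_profile(profile, {"If-Match": '"v17"'}, "new@example.com"))  # 200, etag v18
print(handle_put_profile(profile, {"If-Match": '"v17"'}, "x@example.com"))    # 412: stale version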
Query "not older than LSN" (pseudo)
x-min-lsn: 16/B373F8D8
The router chooses a replica with 'replay_lsn ≥ x-min-lsn', otherwise it routes to the leader.
CRDT G-Counter (idea)
Each region keeps its own component of the counter; the total value is the sum of all components; the merge operation is commutative, so replication order does not matter.
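The same idea in a few lines of Python (illustrative):

def g_increment(counter: dict, region: str, n: int = 1) -> dict:
    out = dict(counter)
    out[region] = out.get(region, 0) + n    # a region only ever touches its own slot
    return out

def g_merge(a: dict, b: dict) -> dict:
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}   # commutative merge

def g_value(counter: dict) -> int:
    return sum(counter.values())

eu = g_increment({}, "eu", 3)
us = g_increment({}, "us", 5)
assert g_value(g_merge(eu, us)) == g_value(g_merge(us, eu)) == 8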
18) Conclusion
Eventual consistency is not a compromise on quality but a conscious contract: in some places we trade freshness for speed and availability, while protecting critical invariants with domain strategies and tooling. Introduce versions, idempotency, RNOT/Session guarantees, CDC and anti-entropy, measure lag/staleness/conflicts - and your distributed system will be fast, stable and predictably convergent even under failures and peak loads.