Strong Consistency: When It Is Needed
Strong consistency is a model in which all operations appear to execute instantaneously, in a single global order consistent with real time. A user always reads the last acknowledged value, and two concurrent clients cannot logically overtake each other.
Strong consistency gives a simple mental model and protects hard invariants, but it requires coordination (quorums/a leader), which increases latency and sensitivity to network partitions.
1) When strong consistency is mandatory
Finance and Settlements
Balances and debits: double spending is unacceptable.
Transfers and settlements: the same amount must not be posted twice.
Inventory and Limits
Remaining stock/hotel rooms/tickets: the count cannot go negative.
Rate limits per unit of time (credit limits, API credits).
Uniqueness and integrity
Unique logins/IDs; deduplication rules.
Domain-level invariants: "≥ 1 doctor must be on duty in the department," "there cannot be > N active tasks in the queue."
Auditing and immutable logs
Events that serve as a legal source of truth: order and completeness are critical.
If violating an invariant carries unacceptable business risk (loss of money, sanctions, loss of trust), choose strong consistency.
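A minimal sketch of how such an invariant is typically enforced: an atomic conditional update ("debit only if the balance stays non-negative"). Here it is simulated in-memory with a lock; in a real system the same check runs inside a serializable transaction or a conditional write. The `Account` class is illustrative, not a real API.

```python
import threading

class Account:
    """In-memory account with an atomic conditional debit (illustrative sketch)."""
    def __init__(self, balance: int):
        self.balance = balance
        self._lock = threading.Lock()

    def debit(self, amount: int) -> bool:
        # The invariant check and the write happen atomically:
        # no interleaving can drive the balance below zero.
        with self._lock:
            if self.balance - amount < 0:
                return False  # reject instead of violating the invariant
            self.balance -= amount
            return True

acc = Account(100)
assert acc.debit(70) is True
assert acc.debit(50) is False   # would go negative -> rejected
assert acc.balance == 30
```

The key design point: the decision ("is there enough?") and the mutation happen under the same coordination scope, so "double spending" is impossible by construction.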
2) What exactly "strong" means
Linearizability (operation level): a read sees the most recent successful write; real-time order is respected.
Serializable (transaction level): the result is equivalent to executing the transactions in some sequential order (strong, but sometimes implemented without a hard real-time ordering).
An important distinction: Serializable protects against transaction-level anomalies (phantoms/write skew), while Linearizable guarantees the instantaneity and ordering of individual operations. Often you need both properties (for example, money in the database plus an event log).
3) The cost of strictness: PACELC and CAP
PACELC: during a network partition (P), you must choose between C (consistency) and A (availability). Strong → CP: it is better to refuse or block than to violate an invariant. Else (E), when there is no partition, you pay with L (latency): p95/p99 grows because of coordination/quorums.
In practice: strong consistency for the "core of invariants"; around it, fast eventually consistent projections/caches so UX does not suffer.
4) How Strong Consistency is achieved
Leaders and quorums
A single leader accepts writes; reads go to the leader or to a quorum of replicas.
Quorums `W` for writes and `R` for reads with `R + W > N` guarantee that every read quorum intersects the latest write quorum, so a read contacts at least one replica holding the last acknowledged write.
Consensus algorithms
Raft/Paxos: a replicated log, majority acknowledgements, terms/indexes.
Synchronous replication: a write is acknowledged only after it is persisted on a quorum.
Clocks and ordering
TrueTime/Hybrid Logical Clocks (HLC): bound clock skew for safe global serialization.
Fencing tokens/versioning: protection against stale leaders and split-brain.
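Fencing tokens are usually implemented as a monotonically increasing epoch/term: the storage layer remembers the highest token it has seen and rejects writes carrying an older one, which neutralizes a stale leader after re-election. An illustrative sketch (class and method names are assumptions):

```python
class FencedStore:
    """Storage that rejects writes from stale leaders via fencing tokens."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale leader: its term is older than one already seen
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(1, "x", "from leader, term 1") is True
assert store.write(2, "x", "from leader, term 2") is True
# An old leader wakes up (e.g., after a GC pause) and retries with term 1:
assert store.write(1, "x", "stale write") is False
assert store.data["x"] == "from leader, term 2"
```

The token must be issued by the coordination layer (the elected term in Raft, a lock service epoch), not by the leader's own clock.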
Transaction isolation
Serializable (SI + predicate-conflict detection/locks): protection against phantoms/write skew.
Strict serializable: serializability plus linearizability with respect to real time.
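Serializable engines resolve conflicts by aborting one of two competing transactions, so the client side must retry. A sketch of the standard retry loop with jittered backoff; `SerializationConflict` here is a hypothetical stand-in for a real driver's serialization-failure error (e.g., SQLSTATE 40001 in PostgreSQL-compatible databases):

```python
import random
import time

class SerializationConflict(Exception):
    """Stand-in for a driver's serialization-failure error (e.g. SQLSTATE 40001)."""

def run_serializable(txn, max_retries: int = 5):
    """Run `txn` under (assumed) SERIALIZABLE isolation, retrying on conflict."""
    for attempt in range(max_retries):
        try:
            return txn()
        except SerializationConflict:
            # Jittered exponential backoff before retrying the whole transaction.
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))
    raise RuntimeError("transaction kept conflicting; surface the error to the caller")

# Demo: a transaction that conflicts twice, then commits.
attempts = {"n": 0}
def txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationConflict()
    return "committed"

assert run_serializable(txn) == "committed"
assert attempts["n"] == 3
```

The important property: the whole transaction body is re-executed, not just the final write, because the conflict invalidated everything it read.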
5) Multi-region: options and trade-offs
Global Leader (CP)
Writes go through one leading region; reads via local caches/projections or through the leader.
Pros: simple model. Cons: p95 bounded by RTT to the leader; during a partition (P), writes block.
Regional leaders + synchronous quorum
A geographically distributed quorum across several regions; each write waits for acknowledgements from > 50% of replicas.
Pros: no single bottleneck, high resilience. Cons: intercontinental latency.
Geo-partitioning
A home region for each piece of data (tenant/jurisdiction); global operations go through sagas/aggregates.
Pros: low latency for local writes. Cons: data boundaries must be planned.
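Geo-partitioning usually starts with a deterministic "home region" mapping for each tenant: single-tenant writes stay strong inside one region, while operations spanning several homes are escalated to a saga/coordinator. A hypothetical routing sketch (tenant IDs and region names are made up for illustration):

```python
# Hypothetical tenant -> home-region mapping.
HOME_REGION = {"tenant-a": "eu-west", "tenant-b": "us-east"}

def route_write(tenant: str, involved_tenants: set) -> str:
    """Route a write to its home region, or flag it for a cross-region saga."""
    homes = {HOME_REGION[t] for t in involved_tenants | {tenant}}
    if len(homes) == 1:
        return "local:" + homes.pop()   # fast path: strong write inside one region
    return "saga-coordinator"           # slow path: cross-region invariant

assert route_write("tenant-a", set()) == "local:eu-west"
assert route_write("tenant-a", {"tenant-b"}) == "saga-coordinator"
```

The design choice: the fast path never crosses a region boundary, so its latency does not depend on intercontinental RTT.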
6) Tuning R/W and reads
Writes: `W = majority` is the standard for strong consistency.
Reads:
- "Freshest": `R = majority` or reading at the leader.
- To reduce latency: "stale-ok" reads from replicas for secondary screens (explicitly marked in the UX).
- Read repair / lease reads: optimizations that preserve strictness under short leader leases.
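A lease read lets the leader serve linearizable reads locally: while its quorum-granted lease has not expired, no other leader can commit, so a local read is safe without a round trip. Illustrative sketch with an explicit clock-skew margin (class and field names are assumptions):

```python
class Leader:
    """Leader that serves local reads only while its lease is provably valid."""
    def __init__(self, lease_duration_s: float, clock_skew_margin_s: float = 0.25):
        self.lease_expires_at = 0.0
        self.lease_duration_s = lease_duration_s
        self.margin = clock_skew_margin_s

    def renew_lease(self, granted_at: float) -> None:
        # Lease granted by a quorum; shrink it by the skew margin to stay safe
        # even if this node's clock runs slightly fast.
        self.lease_expires_at = granted_at + self.lease_duration_s - self.margin

    def can_serve_local_read(self, now: float) -> bool:
        return now < self.lease_expires_at

leader = Leader(lease_duration_s=5.0)
leader.renew_lease(granted_at=100.0)
assert leader.can_serve_local_read(now=104.0) is True    # within lease (minus margin)
assert leader.can_serve_local_read(now=104.9) is False   # margin hit: fall back to a quorum read
```

The skew margin is the price of relying on local clocks; TrueTime/HLC bounds make that margin explicit and small.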
7) Performance and UX
Latency: dominated by RTT between the client and the leader/quorum (hundreds of ms between regions).
The "write-strong, read-fast" pattern: strong writes plus cached/projected reads, with read-your-writes (RYW) for the author.
Batching: group writes, but watch tail latency.
Degradation paths: during an incident, go read-only, show honest statuses, forbid dangerous mutations.
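The "write-strong, read-fast" pattern with RYW can be sketched as follows: a strong write returns a version (commit index), the client carries it as a freshness token, and a replica serves the read only if it has applied at least that version; otherwise the read falls back to the leader. All names here are illustrative:

```python
class Replica:
    """A node holding versioned data; also used to model the leader."""
    def __init__(self):
        self.applied_version = 0
        self.data = {}

    def apply(self, version: int, key: str, value: str) -> None:
        self.applied_version = version
        self.data[key] = value

def read_your_writes(replica, leader, key, min_version):
    """Serve from the fast replica only if it is fresh enough for this client."""
    source = replica if replica.applied_version >= min_version else leader
    return source.data.get(key), ("replica" if source is replica else "leader")

leader, replica = Replica(), Replica()
leader.apply(7, "profile", "v7")    # strong write committed at version 7
replica.apply(6, "profile", "v6")   # replica lags one version behind
# The author holds token 7 -> the replica is stale, so the read goes to the leader:
assert read_your_writes(replica, leader, "profile", min_version=7) == ("v7", "leader")
# A reader with no freshness requirement can use the cheap replica:
assert read_your_writes(replica, leader, "profile", min_version=0) == ("v6", "replica")
```

Only the author pays the leader round trip, and only until the replica catches up; everyone else stays on the fast path.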
8) Observability of the strict path
Metrics
p50/p95/p99 latency: quorum writes, quorum reads, leader reads.
Quorum success rate, retries/rollbacks, leader changes.
Replication lag (expected to be small, but monitoring is mandatory).
Share of "stale" reads (if enabled).
Tracing
Spans: "leader accept," "replication," "quorum commit."
Tags: `term`, `leader_id`, `quorum_size`, `region`.
Alerts
Growth of p95/p99, frequent leader re-elections, quorum timeouts, split-brain indicators.
9) Tests and chaos
Jepsen-style: network partitions, delays, packet drops, clock skew.
Safety invariants: no double spending, no negative balances, no double booking.
Leadership: leader failure, re-election under load, fencing tokens.
Read consistency: a read immediately after a write must see the "new" value (RYW/linearizable read).
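A minimal safety-invariant test in the spirit of the list above: hammer an account with concurrent withdrawals and assert the balance never goes negative. A Jepsen-style harness adds partitions and clock skew on top of the same idea; this sketch tests only the coordination primitive itself:

```python
import threading

balance = 100
lock = threading.Lock()
rejected = 0

def withdraw(amount: int) -> None:
    global balance, rejected
    with lock:                  # the coordination under test
        if balance >= amount:
            balance -= amount
        else:
            rejected += 1       # refused rather than going negative

threads = [threading.Thread(target=withdraw, args=(30,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Safety invariant: never negative; exactly 3 withdrawals of 30 fit into 100.
assert balance == 10
assert rejected == 7
```

Note that the test asserts an exact outcome only because the amounts are fixed; the real invariant under randomized load is simply `balance >= 0` across every interleaving.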
10) Incident playbooks
Quorum loss: switch to read-only, notify clients, route writes to the "home" region if geo-partitioning is in place.
Interregional latency growth: temporarily reduce the volume of strict writes (migrate some flows to queues/projections), localize traffic.
Leader flapping: increase election timeouts; check the network, clock drift, and GC pauses.
Split-brain: enforce fencing tokens/lease checks, stop old leaders at the operator level.
11) Typical mistakes
Demanding strong consistency "everywhere": an explosion of latency and cost instead of focusing on invariants.
Trying to be CA under real partitions: at the moment of P, the system still makes a choice, often implicitly.
Dual writes to different regions without sagas/a coordinator: phantoms and lost invariants.
No RYW: the user does not see their just-written entity, and trust drops.
Ignoring clocks: without HLC/TrueTime bounds, it is easy to get "jumping" time and races.
No degradation plan: during a partition, chaotic partial failures begin.
12) Quick recipes
Payments/balances: leader + majority quorum; strict-serializable transactions, short timeouts, hard failure during a partition.
Booking (seats/slots): strong writes through the leader; reads via cache with RYW; TTL reservations + TCC (Try-Confirm/Cancel).
Global SaaS: geo-partitioning by `tenant/region`; strict operations in the home region, reports/search via projections.
Audit/log: append-only CP log; reads may be cached but verified against checkpoints.
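The TTL-reserve recipe for booking can be sketched as follows: a reservation holds a seat for a limited time, confirmation must arrive before expiry, and an expired reservation returns the seat to the pool. Illustrative in-memory version with the clock injected for testability (all names are assumptions):

```python
class SeatInventory:
    """Seats with TTL reservations: reserve, then confirm before expiry."""
    def __init__(self, seats: int, ttl_s: float):
        self.free = seats
        self.ttl_s = ttl_s
        self.reservations = {}  # reservation id -> expiry timestamp

    def reserve(self, res_id: str, now: float) -> bool:
        self._expire(now)
        if self.free == 0:
            return False
        self.free -= 1
        self.reservations[res_id] = now + self.ttl_s
        return True

    def confirm(self, res_id: str, now: float) -> bool:
        self._expire(now)
        # A live reservation converts to a sale; the seat stays taken.
        return self.reservations.pop(res_id, None) is not None

    def _expire(self, now: float) -> None:
        for rid, deadline in list(self.reservations.items()):
            if deadline <= now:
                del self.reservations[rid]
                self.free += 1  # expired reservation returns the seat

inv = SeatInventory(seats=1, ttl_s=60.0)
assert inv.reserve("r1", now=0.0) is True
assert inv.reserve("r2", now=10.0) is False   # sold out while r1 holds the seat
assert inv.confirm("r1", now=70.0) is False   # too late: TTL expired at t=60
assert inv.reserve("r2", now=70.0) is True    # the seat was released back
```

The invariant "seats never oversold" holds at every step, while the TTL bounds how long an abandoned checkout can block inventory.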
13) Pre-production checklist
- Invariants requiring strong consistency are written down; everything else goes to AP/projections.
- A topology is chosen: single leader / interregional quorum / geo-partitioning.
- `W = majority`, `R = leader` or `R = majority` configured for critical paths.
- RYW/monotonic reads provided for UX; "stale-ok" reads explicitly marked.
- Metrics enabled for quorums, lags, latencies; alerts on p95/p99 and re-elections.
- A degradation plan exists: read-only mode, disabling dangerous mutations, queues for "after the storm."
- Chaos tests: partitions, clock skew, leader failure; safety invariants verified.
- Contracts documented: what is strict, what "may lag," communication for product/support.
Conclusion
Strong consistency is a tool for protecting the truth where errors are unacceptable. Apply it pointwise, around hard invariants, consciously paying for coordination with latency and with availability during partitions. Combine: a CP core for the critical path, AP reads and projections for speed. With the right telemetry, degradation plans, and tests, you keep both correctness and user experience.