CAP and engineering trade-offs
CAP states: under a network partition (Partition, P), a distributed system cannot simultaneously guarantee strong consistency (Consistency, C) and availability (Availability, A). When a partition occurs, you must choose CP or AP. Without partitions the restriction does not apply, but other trade-offs surface, primarily latency and cost.
Practical engineering goes beyond CAP: what matters is PACELC (if P, choose C or A; otherwise, choose between Latency and Consistency), consistency models, SLAs/SLOs, use cases, and business risk.
1) Basic definitions (no philosophy)
Consistency (C): all clients see the same result, "as if" operations were executed sequentially (linearizability/strong consistency).
Availability (A): every request to a non-failed node receives a response within reasonable time, even during a partition.
Partition (P): loss or severe degradation of connectivity between nodes/regional clusters; essentially "unavoidable" at scale.
PACELC: if P, choose C or A; else (no partition), choose L (low latency) or C (strong consistency).
2) An intuitive picture of the choice
CP (consistency first): during a partition, reject or block some requests so that invariants are never violated. Suitable for money, transactions, balance accounting.
AP (availability first): always respond, but accept temporary inconsistency and reconcile conflicts afterwards (CRDTs/merge rules). Suitable for social feeds, like counters, cached profiles.
CA (C and A at once): possible only in the absence of P, that is, only while the network is healthy. In real operation, "CA" is a temporary state, not a design property.
3) PACELC: don't forget latency
When there is no partition, the choice is often between low latency (L) and strong consistency (C):
- Strong consistency across regions = intercontinental quorums ⇒ tens to hundreds of ms at p95.
- Local reads (low L) = weaker guarantees (read-your-writes, bounded staleness, eventual).
- PACELC explains why "fast and strict" at global scale is rare: light is not instantaneous, and quorum round trips grow with network distance.
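To make the trade concrete, here is a toy Python sketch (the RTT figures are invented): a strongly consistent read pays the round trip of the slowest replica in the quorum it must wait for, while a local read pays only the nearest replica and accepts weaker guarantees.

```python
# A toy model of the PACELC latency/consistency trade, not a real client:
# a read's latency equals the RTT of the slowest replica it waits for.
def read_latency_ms(replica_rtts_ms, acks_needed):
    """Wait for the k fastest replicas; latency is the k-th smallest RTT."""
    return sorted(replica_rtts_ms)[acks_needed - 1]

rtts = [2, 70, 140]              # local, cross-region, intercontinental
quorum = len(rtts) // 2 + 1      # majority quorum: 2 of 3

print(read_latency_ms(rtts, quorum))  # strong read: 70 ms (waits for 2 acks)
print(read_latency_ms(rtts, 1))       # local read:   2 ms (weaker guarantees)
```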
4) Consistency models (a quick spectrum)
Linearizable/Strong: as if there were a single sequential order of operations.
Serializable: equivalent to some sequential order of transactions (above the level of individual writes).
Read-your-writes/Monotonic reads: a client reads the new value after its own write (sketched after this list).
Bounded staleness: reads lag by at most N versions or Δt.
Eventual consistency: all replicas converge over time; conflicts must be resolved.
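As a minimal illustration of the session guarantees above (the Replica and Session classes are hypothetical, not a real client library): the client carries the highest version it has seen, and a replica answers only once it has caught up to that version.

```python
# A minimal sketch of read-your-writes / monotonic reads via a version
# token. Real systems hide this behind a session or consistency token.
class Replica:
    def __init__(self):
        self.version = 0
        self.data = {}

    def apply(self, key, value, version):    # replication delivers writes
        self.data[key] = value
        self.version = version

    def read(self, key, min_version):
        if self.version < min_version:
            raise TimeoutError("replica is stale; retry or redirect")
        return self.data.get(key)

class Session:
    def __init__(self):
        self.last_seen = 0                   # version token carried by client

    def read(self, replica, key):
        value = replica.read(key, self.last_seen)
        self.last_seen = max(self.last_seen, replica.version)
        return value

primary, follower = Replica(), Replica()
primary.apply("profile", "v2", version=1)    # write lands on the primary
s = Session(); s.last_seen = 1               # client already saw its own write
try:
    s.read(follower, "profile")              # follower has not replicated yet
except TimeoutError as e:
    print(e)
follower.apply("profile", "v2", version=1)   # replication catches up
print(s.read(follower, "profile"))           # now the read is allowed: "v2"
```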
5) CP and AP patterns in products and protocols (conceptually)
CP approaches: quorum logs/leader election (Raft/Paxos), strict transactions, a single global leader, synchronous replication. The cost: some requests fail during P, and latency grows.
AP approaches: multi-master/multi-leader, CRDTs, gossip dissemination, asynchronous replication, conflict resolution (LWW, vector clocks, domain merge functions); see the counter sketch below. The cost: temporary inconsistency and the complexity of domain rules.
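For the AP side, the grow-only counter (G-Counter) is the canonical CRDT. The sketch below is a simplified version: each node increments only its own slot, and merge takes per-node maxima, so replicas converge no matter the gossip order.

```python
# A minimal G-Counter CRDT: merge is commutative and idempotent, so
# replicas converge regardless of message order or duplication.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                     # node_id -> local increments

    def increment(self, by=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + by

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for node, n in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), n)

a, b = GCounter("eu"), GCounter("us")
a.increment(); a.increment()                 # 2 likes recorded in eu
b.increment()                                # 1 like recorded in us
a.merge(b); b.merge(a)                       # gossip in any order
assert a.value() == b.value() == 3
```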
6) Trade-offs in a multi-region setup
Global leader (CP): simple logic, but "distant" regions pay with latency; during P, writes block.
Local leaders + asynchronous replication (AP): write fast locally, replicate afterwards; conflicting changes require a merge.
Geo-partitioning: data "lives" closer to the user/jurisdiction; only aggregates cross regions.
No dual-writes without sagas/CRDTs: otherwise you get phantoms and double write-offs (see the saga sketch below).
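A minimal saga sketch, with invented step names: each step is a local transaction paired with a compensation, and a failure rolls back the already-completed steps in reverse order instead of relying on a dual-write.

```python
# A minimal saga runner: run local transactions in order; on failure,
# execute the compensations of the completed steps in reverse.
def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):    # undo in reverse order
            compensate()
        raise

log = []
run_saga([
    (lambda: log.append("debit A"),  lambda: log.append("refund A")),
    (lambda: log.append("credit B"), lambda: log.append("reverse B")),
])
print(log)   # ['debit A', 'credit B'], or compensations if a step had failed
```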
7) Engineering invariants and business decisions
Invariants first: decide what may never be violated (double spending, negative balance, key uniqueness) and what survives eventual consistency (view counters, recommendations).
Then the choice:
- Hard invariant → CP for the corresponding operations.
- Soft invariant → AP with subsequent reconciliation.
8) Trade-off mitigation techniques
Cache and CQRS: reads go through a nearby cache/projections (AP), writes go through a strict log (CP).
RPO/RTO as the language of compromise: how much data may be lost (RPO) and how fast you must recover (RTO).
Consistent IDs and clocks: monotonic timestamps (Hybrid Logical Clock/TrueTime approaches; a sketch follows this list), ULID/Snowflake.
Sagas/TCC: business-level compensation instead of global locks.
CRDTs and domain merges: for collections, counters, "last writer wins".
Bounded staleness: a balance between UX and accuracy.
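Below is a compact sketch of the Hybrid Logical Clock update rules mentioned above (a simplified rendering, not a production clock): timestamps are (wall, counter) pairs that stay monotonic even when physical clocks skew, which makes them safe for ordering and LWW-style merges.

```python
# A compact Hybrid Logical Clock: wall is the highest physical time seen
# (ms), counter breaks ties within the same millisecond.
class HLC:
    def __init__(self):
        self.wall = 0
        self.counter = 0

    def send(self, physical_ms):             # local event or outgoing message
        if physical_ms > self.wall:
            self.wall, self.counter = physical_ms, 0
        else:
            self.counter += 1
        return (self.wall, self.counter)

    def receive(self, physical_ms, remote):  # incoming message timestamp
        r_wall, r_counter = remote
        top = max(self.wall, r_wall, physical_ms)
        if top == self.wall == r_wall:
            self.counter = max(self.counter, r_counter) + 1
        elif top == self.wall:
            self.counter += 1
        elif top == r_wall:
            self.counter = r_counter + 1
        else:
            self.counter = 0
        self.wall = top
        return (self.wall, self.counter)

clock = HLC()
t1 = clock.send(physical_ms=1_000)
t2 = clock.receive(physical_ms=990, remote=(1_000, 7))  # skewed local clock
assert t2 > t1    # still monotonic: (1000, 8) > (1000, 0)
```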
9) Observability, SLO and incident management
Latency SLOs (p50/p95/p99), separately for reads/writes and per region.
Availability SLOs that account for regional failover.
Replication lag/conflicts: percentage of conflicts, mean time to resolution.
Alerts on partition signals: a spike in cross-region timeouts, a rise in quorum errors (a detector sketch follows this list).
Degradation plans: read-only mode, local serving with a later merge, disabling "expensive" features.
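A toy detector for the partition signal (the window size and threshold are invented): track cross-region request outcomes in a sliding window and flip to the degrade plan when the timeout rate crosses the threshold.

```python
# A sliding-window timeout-rate detector; real alerting would feed this
# from cross-region RPC metrics rather than in-process bookkeeping.
from collections import deque

class PartitionDetector:
    def __init__(self, window=100, timeout_rate_alert=0.2):
        self.outcomes = deque(maxlen=window)   # True = request timed out
        self.threshold = timeout_rate_alert

    def record(self, timed_out):
        self.outcomes.append(timed_out)

    def degraded(self):
        if not self.outcomes:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate >= self.threshold

d = PartitionDetector()
for _ in range(80): d.record(False)
for _ in range(20): d.record(True)             # burst of cross-region timeouts
print(d.degraded())                            # True -> switch to degrade plan
```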
10) Strategy selection checklist
1. Which invariants must never be violated? Which ones tolerate eventual consistency?
2. Do you need low-latency cross-region writes?
3. What are the target SLOs (latency/availability) and cost (egress/replication)?
4. Do you allow manual merges, or only automatic ones (CRDTs/rules)?
5. What is the network failure profile: frequency, duration, blast radius?
6. Are there legal data-residency requirements?
7. Which consistency model is acceptable for each data type/operation?
8. How will you observe the system: lags, conflicts, quorum state?
9. What does the system do during P: block, degrade, split traffic?
10. What is the plan for data recovery and repatriation after P?
11) Common mistakes
Chasing "CA forever." At the first partition you will have to choose; better to decide in advance.
Global multi-master without merge rules. Conflicts eat up data and trust.
Strong consistency "everywhere." Excess quorums hurt p95/p99 and the budget.
Dual-writes without transactions/sagas. Lost invariants and phantoms.
Ignoring PACELC. In calm weather latency suffers; in a storm, availability.
Zero telemetry on conflicts and lag. Problems become visible only to users.
12) Quick recipes
Payments/balances: CP storage with quorums; writes only through the leader; reads may be cached, but critical UX needs read-your-writes.
Content/feeds: AP replication + CRDT/merge rules; during P, serve locally, then reconcile.
Global SaaS: geo-partitioning by 'tenant/region'; strict operations in the "home" region (CP), reports/search via asynchronous projections (AP); see the routing sketch after this list.
Real-time signaling: anycast/edge + an AP bus; critical commands go through an acknowledged (CP) channel.
Audit/log: a single source of truth (append-only) with CP guarantees; caches and projections around it.
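An illustrative routing sketch for the SaaS recipe (the tenant map and region names are made up): strict operations go to the tenant's home region, everything else is served from local asynchronous projections.

```python
# Hypothetical geo-partitioning router: home region owns CP operations,
# other regions answer reads from async (AP) projections.
HOME_REGION = {"tenant-a": "eu-west", "tenant-b": "us-east"}

def route(tenant, operation, local_region):
    home = HOME_REGION[tenant]
    if operation == "write":
        return home                       # strict ops only in the home region
    if local_region == home:
        return home                       # strongly consistent local read
    return f"{local_region}:projection"   # async read replica elsewhere

print(route("tenant-a", "write", "us-east"))   # eu-west
print(route("tenant-a", "read", "us-east"))    # us-east:projection
```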
13) Mini reference architecture (in words)
Write core (CP): leader + quorum replication, strict invariants, sagas for cross-service effects.
Read plane (AP): materialized views, caches, search indexes, asynchronous updates.
Geo-routing: users land in their "home" region; during P, local mode plus subsequent replication.
Conflict engine: CRDTs/rules; a conflict log and tooling for manual resolution.
Observability: quorum tracing, lag metrics, a map of network incidents.
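A minimal in-process sketch of this architecture (all names are illustrative): writes append to a strict log standing in for the CP core, and a projector asynchronously folds the log into the AP read plane.

```python
# Toy write-core / read-plane split: the read view lags until the
# projector catches up, which is exactly the eventual-consistency window.
class WriteCore:
    def __init__(self):
        self.log = []                 # stand-in for a quorum-replicated log

    def append(self, event):
        # a real core would wait for quorum acknowledgment here
        self.log.append(event)

class ReadPlane:
    def __init__(self):
        self.view = {}
        self.applied = 0              # offset of the last projected event

    def catch_up(self, core):         # runs asynchronously in practice
        for key, value in core.log[self.applied:]:
            self.view[key] = value
            self.applied += 1

core, plane = WriteCore(), ReadPlane()
core.append(("balance:42", 100))
print(plane.view.get("balance:42"))   # None: projection lags (eventual)
plane.catch_up(core)
print(plane.view.get("balance:42"))   # 100 after the projector catches up
```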
14) Practical latency math (a rough estimate)
Fiber ≈ 5 ms one way per 1000 km (RTT is roughly double). Intercontinental quorums → p95 easily > 150-250 ms.
Any globally strong write is an expensive request. If UX requires < 100-150 ms, consider local writes in the home region plus asynchronous propagation.
15) Partition policies
CP path: block writes that cannot reach a quorum; switch to read-only; give users honest statuses.
AP path: serve locally; tag versions; on recovery, merge deterministically; escalate unresolved conflicts to a review queue.
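A toy sketch of both policies side by side (statuses and names are invented): CP rejects writes without a quorum, AP accepts them locally and queues them for deterministic merging after recovery.

```python
# CP mode: honest rejection during a partition. AP mode: accept locally
# and remember what must be reconciled once the partition heals.
def handle_write(mode, have_quorum, local_store, merge_queue, key, value):
    if mode == "CP":
        if not have_quorum:
            return "503: read-only during partition"   # honest status
        local_store[key] = value
        return "committed"
    if mode == "AP":
        local_store[key] = value
        if not have_quorum:
            merge_queue.append((key, value))            # reconcile later
        return "accepted locally"

store, queue = {}, []
print(handle_write("CP", False, store, queue, "balance", 10))  # rejected
print(handle_write("AP", False, store, queue, "likes", 1))     # accepted
print(queue)   # [('likes', 1)] awaits deterministic merge after recovery
```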
Conclusion
CAP is not dogma but a reminder: network partitions are inevitable, and a design must decide in advance what to sacrifice in the storm, availability or strong consistency. PACELC adds the key latency axis for clear weather. Combine strategies: keep a CP core where invariants are sacred and an AP plane where speed and resilience matter more. Build in telemetry, degradation plans, and merge processes, and the system will preserve both the data and user trust.