
Consensus algorithms

1) What consensus is and why it is needed

Consensus: getting multiple nodes to agree on the same value (or sequence of values) despite failures and delays. Traditionally it is used to solve state machine replication (SMR): there is a deterministic state machine, and the nodes agree on a single total order of operations to apply to it.

Two approaches should be distinguished:
  • Consensus (SMR): a single total order of commands → linearizable storage/registries, cluster metadata, transaction coordinators.
  • Quorum operations without a total order (Dynamo-style, CRDTs): allow replicas to diverge and merge later; they do not replace consensus where global serialization is needed.

2) Time and failure model

2.1 Time model

Asynchronous network: delays are unbounded; the FLP theorem ⇒ no deterministic algorithm can guarantee both safety and liveness without additional assumptions.
Partially synchronous (the common practical model): after some unknown time T the system "behaves" synchronously; most algorithms (Raft, Paxos) rely on this.
Synchronous: strict timing bounds on channels (rare in production, except for special-purpose or real-time networks).

2.2 Failure model

Crash-fault tolerant (CFT): nodes may crash or hang, but do not behave maliciously.
Byzantine-fault tolerant (BFT): nodes can lie or spoof messages. Requires 3f + 1 nodes to tolerate f Byzantine nodes.

3) Properties of consensus

Safety: consistency (two correct nodes will never decide different values).
Liveness: if the network is "healthy," a decision is eventually reached.
Linearizability: operations appear atomic and in a single order.
Durability: decisions, once made, are not rolled back (protected by the log and quorums).

4) Quorums, majorities and intersections

In the CFT world the classic rule is quorum > N/2. Writes and leader elections rely on quorum intersection, so two "committed" operations cannot contradict each other.
In the BFT world: quorums of 2f + 1 out of 3f + 1 nodes intersect in at least f + 1 nodes, at least one of which is honest.

Intersection rule: any two quorums must share at least one correct node (CFT); in BFT the overlap must be ≥ f + 1 nodes so that at least one of them is honest.
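
A minimal sketch (Go; illustrative only) of how these cluster sizes, quorum sizes, and overlaps work out for small f:

    package main

    import "fmt"

    // Sketch only: minimum cluster and quorum sizes for a given number of tolerated faults f.
    func main() {
        for f := 1; f <= 3; f++ {
            // CFT: N = 2f+1 nodes, quorum = f+1 (any majority).
            cftN, cftQ := 2*f+1, f+1
            cftOverlap := 2*cftQ - cftN // any two CFT quorums share at least this many nodes

            // BFT: N = 3f+1 nodes, quorum = 2f+1.
            bftN, bftQ := 3*f+1, 2*f+1
            bftOverlap := 2*bftQ - bftN // >= f+1 nodes, so at least one of them is honest

            fmt.Printf("f=%d  CFT: N=%d quorum=%d overlap>=%d | BFT: N=%d quorum=%d overlap>=%d\n",
                f, cftN, cftQ, cftOverlap, bftN, bftQ, bftOverlap)
        }
    }

For f = 1 this gives the familiar 3-node CFT cluster with quorum 2 and the 4-node BFT cluster with quorum 3.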

5) State machine replication (log + apply)

Commands are appended to the log with identifiers (index/epoch). The principle: "first agree on a log entry (commit), then apply it deterministically to the state"; a sketch of this loop follows the list below. To control log growth, use:
  • Snapshots (a cut of the state; earlier entries can then be deleted or compacted).
  • Log compaction and trimming.
  • Determinism checks (same code/config version).
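
A minimal sketch of the "commit first, then apply deterministically" loop with a size-based snapshot trigger (Go; the Entry and StateMachine types are illustrative assumptions, not a specific library's API):

    package main

    import "fmt"

    // Entry is a committed log entry (illustrative).
    type Entry struct {
        Index int
        Cmd   string
    }

    // StateMachine is a deterministic counter keyed by command (illustrative).
    type StateMachine struct {
        applied int            // last applied log index
        state   map[string]int // replicated state
    }

    // Apply applies committed entries strictly in log order; the operation is deterministic.
    func (sm *StateMachine) Apply(e Entry) {
        if e.Index != sm.applied+1 {
            panic("gap in the log: entries must be applied in order")
        }
        sm.state[e.Cmd]++
        sm.applied = e.Index
    }

    func main() {
        sm := &StateMachine{state: map[string]int{}}
        committed := []Entry{{1, "x"}, {2, "y"}, {3, "x"}}
        const snapshotEvery = 2 // illustrative snapshot threshold

        for _, e := range committed {
            sm.Apply(e)
            if e.Index%snapshotEvery == 0 {
                // After a snapshot, entries up to e.Index can be truncated from the log.
                fmt.Printf("snapshot at index %d: %v\n", e.Index, sm.state)
            }
        }
        fmt.Println("applied up to", sm.applied, "state:", sm.state)
    }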

6) Leader-based and leaderless schemes

Leader-based (Raft, Multi-Paxos, ZAB): one leader serializes writes → simpler to reason about and operate, better latency with a stable leader.
Leaderless/multi-leader (EPaxos, Caesar): more parallelism and no dependence on a single leader, but implementation and conflict resolution are harder; the benefit shows up when command conflicts are rare.

7) Classic CFT algorithms

7.1 Paxos / Multi-Paxos (and practical variants)

The idea: two phases (prepare/accept), acceptor promises, majority quorums. Multi-Paxos keeps a "stable leader" and makes operations single-round (for new entries) after the initial "warm-up." A sketch of the acceptor side appears after the feature list below.

Features:
  • A flexible but hard-to-implement model.
  • Reconfiguration via special log entries (joint consensus).
  • In practice, used via libraries/engines (the Chubby/Spanner generation).
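
A minimal sketch of the acceptor side of single-decree Paxos (Go; the struct and method names are assumptions): the acceptor promises not to honor lower ballots and reports any value it has already accepted, which is what forces a new proposer to reuse that value.

    package main

    import "fmt"

    // Acceptor keeps the classic Paxos acceptor state.
    type Acceptor struct {
        promised int    // highest ballot promised
        accepted int    // ballot of the accepted value (0 = none)
        value    string // accepted value, if any
    }

    // Prepare handles phase 1: promise to ignore ballots < b and
    // return any previously accepted (ballot, value).
    func (a *Acceptor) Prepare(b int) (ok bool, prevBallot int, prevValue string) {
        if b <= a.promised {
            return false, 0, ""
        }
        a.promised = b
        return true, a.accepted, a.value
    }

    // Accept handles phase 2: accept (b, v) unless a higher ballot was promised.
    func (a *Acceptor) Accept(b int, v string) bool {
        if b < a.promised {
            return false
        }
        a.promised, a.accepted, a.value = b, b, v
        return true
    }

    func main() {
        acc := &Acceptor{}
        ok, _, _ := acc.Prepare(1)
        fmt.Println("prepare(1):", ok)                   // true
        fmt.Println("accept(1, A):", acc.Accept(1, "A")) // true
        ok, pb, pv := acc.Prepare(2)
        fmt.Println("prepare(2):", ok, pb, pv)           // true 1 A: the new proposer must reuse A
        fmt.Println("accept(1, B):", acc.Accept(1, "B")) // false: ballot too low
    }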

7.2 Raft

Decomposed for understandability: leader election, log replication, reconfiguration.

Election: randomized timeouts, voting by term; at most one leader per term.
AppendEntries: the leader streams entries, checks that the follower's log prefix matches (index/term), and commits an entry once a majority acknowledges it (see the sketch below).
Read path: lease reads (with a strong leader) or read-index for linearizable reads.
Reconfig: "joint consensus"; quorums of both the old and the new configuration must approve entries before switching.

Pros: easier to develop and debug, strong ordering invariants. Cons: the leader is a pressure point.
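
A minimal sketch of the follower-side prefix check in AppendEntries (Go; simplified types, not a real Raft library): an entry batch is accepted only if the follower's log matches the leader's at (prevIndex, prevTerm), and any conflicting suffix is discarded.

    package main

    import "fmt"

    type Entry struct {
        Term int
        Cmd  string
    }

    type Follower struct {
        log []Entry // log[0] is a dummy entry at index 0
    }

    // AppendEntries is a simplified consistency check: the follower accepts the new
    // entries only if its entry at prevIndex has term prevTerm; any conflicting
    // suffix after prevIndex is discarded and replaced.
    func (f *Follower) AppendEntries(prevIndex, prevTerm int, entries []Entry) bool {
        if prevIndex >= len(f.log) || f.log[prevIndex].Term != prevTerm {
            return false // the leader will retry with a smaller prevIndex
        }
        f.log = append(f.log[:prevIndex+1], entries...)
        return true
    }

    func main() {
        f := &Follower{log: []Entry{{0, ""}, {1, "a"}, {1, "b"}}}

        // Matching prefix at (index=2, term=1): the append succeeds.
        fmt.Println(f.AppendEntries(2, 1, []Entry{{2, "c"}})) // true

        // Mismatched term at index 2: rejected, the leader must back off.
        fmt.Println(f.AppendEntries(2, 7, []Entry{{7, "x"}})) // false
    }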

7.3 ZAB (ZooKeeper Atomic Broadcast) / Viewstamped Replication (VR)

ZAB: the leader broadcasts transactions, a recovery phase, zxid (epoch + index).
VR: "views" with a primary replica, similar to Multi-Paxos in spirit.

8) Leaderless / multi-leader CFT algorithms

8.1 EPaxos

There is no permanent leader: any node can initiate a command.
Conflicting commands get a partial order; conflict-free commands commit in 1-2 RTT locally.
Wins in geo-distributed deployments with low conflict rates, but the dependency-graph bookkeeping is complex.

8.2 Caesar, Mencius

Variations that optimize latency, load balancing, and WAN operation.

9) BFT algorithms and PoS family

9.1 PBFT (Practical BFT)

Three phases (pre-prepare/prepare/commit), requires 3f + 1 nodes.
Low latency on local networks, but multi-step rounds and O(N²) message complexity.

9.2 Tendermint (BFT-PoS style)

Rounds with a proposal and two voting steps (prevote/precommit).
Deterministic proposer rotation among validators, timeouts, partial synchrony.
Good for permissioned/PoS networks with dozens to hundreds of validators.

9.3 HotStuff (and derivatives)

The three-phase scheme is unified around quorum certificates (QCs) and chained into a pipeline.
Linear communication complexity, support for batching and pipelining; convenient for blockchain implementations (for example, the Diem/Move ecosystem).
With threshold signatures, traffic is reduced (a QC-formation sketch follows below).
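
A minimal sketch of quorum-certificate formation (Go; the Vote structure is an illustrative assumption, signatures omitted): a QC for a block exists once 2f + 1 distinct validators have voted for it.

    package main

    import "fmt"

    // Vote is a validator's vote for a block (signature omitted in this sketch).
    type Vote struct {
        Validator string
        BlockID   string
    }

    // collectQC reports whether votes from 2f+1 distinct validators for the
    // same block have been gathered (i.e., a quorum certificate exists).
    func collectQC(votes []Vote, blockID string, f int) bool {
        signers := map[string]bool{}
        for _, v := range votes {
            if v.BlockID == blockID {
                signers[v.Validator] = true
            }
        }
        return len(signers) >= 2*f+1
    }

    func main() {
        f := 1 // tolerate one Byzantine validator out of 3f+1 = 4
        votes := []Vote{
            {"v1", "B7"}, {"v2", "B7"}, {"v3", "B7"},
            {"v4", "B6"}, {"v2", "B7"}, // duplicates and votes for other blocks are ignored
        }
        fmt.Println("QC for B7:", collectQC(votes, "B7", f)) // true: 3 = 2f+1 signers
        fmt.Println("QC for B6:", collectQC(votes, "B6", f)) // false
    }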

9.4 PoW / longest-chain (Nakamoto) consensus (briefly)

Not BFT in the strict sense, but probabilistic convergence (the chain with the most accumulated work wins). Advantages: simplicity and openness; disadvantages: energy cost and seconds-to-minutes until finality.

10) Reads: linearizable, sequential, and cached

Linearizable reads: the leader with an active lease, or via read-index (Raft) → confirmed through a quorum barrier (a lease-check sketch follows below).
Sequential: can be read from any node, but with no freshness guarantee.
Follower reads: acceptable under weak consistency requirements; fine for caches.
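
A minimal sketch of a lease check on the leader (Go; field names are assumptions): a linearizable read is served locally only while the lease, shortened by an assumed clock-drift bound, is still valid; otherwise fall back to read-index.

    package main

    import (
        "fmt"
        "time"
    )

    // Leader holds the lease granted by the quorum at the last heartbeat round.
    type Leader struct {
        leaseStart    time.Time
        leaseDuration time.Duration
        clockDrift    time.Duration // assumed bound on clock drift between nodes
    }

    // CanServeLocalRead reports whether a linearizable read may be answered
    // locally, without contacting the quorum (otherwise use read-index).
    func (l *Leader) CanServeLocalRead(now time.Time) bool {
        expiry := l.leaseStart.Add(l.leaseDuration - l.clockDrift)
        return now.Before(expiry)
    }

    func main() {
        l := &Leader{
            leaseStart:    time.Now(),
            leaseDuration: 500 * time.Millisecond,
            clockDrift:    50 * time.Millisecond,
        }
        fmt.Println("read now:", l.CanServeLocalRead(time.Now()))                    // true
        fmt.Println("read later:", l.CanServeLocalRead(time.Now().Add(time.Second))) // false
    }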

11) Reconfiguration (changing cluster membership)

Requires a period with two overlapping configurations (joint consensus): every commit must pass quorums of both configurations → no safety "holes" (see the sketch below).

Add/remove nodes one at a time and keep track of the quorum size.
Leadership transfer reduces pauses.
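
A minimal sketch of the commit rule during joint consensus (Go; helper names are assumptions): while both configurations are active, an entry counts as committed only when acknowledged by a majority of the old configuration and a majority of the new one.

    package main

    import "fmt"

    // hasMajority reports whether the acks cover a majority of the given configuration.
    func hasMajority(config []string, acks map[string]bool) bool {
        n := 0
        for _, node := range config {
            if acks[node] {
                n++
            }
        }
        return n > len(config)/2
    }

    // committedJoint is the joint-consensus rule: majorities in BOTH configurations.
    func committedJoint(oldConf, newConf []string, acks map[string]bool) bool {
        return hasMajority(oldConf, acks) && hasMajority(newConf, acks)
    }

    func main() {
        oldConf := []string{"a", "b", "c"}
        newConf := []string{"b", "c", "d", "e"}

        acks := map[string]bool{"a": true, "b": true, "c": true}
        // Majority of the old config (3/3) but not of the new one (2/4): not committed yet.
        fmt.Println(committedJoint(oldConf, newConf, acks)) // false

        acks["d"] = true // now 3/4 of the new configuration has acked
        fmt.Println(committedJoint(oldConf, newConf, acks)) // true
    }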

12) Performance and tuning

12.1 Latency

Leader-based algorithms: 1 RTT per write (with a stable leader) plus replication.
Geo-distribution: WAN RTT dominates → use local regions plus cross-regional replication, or EPaxos/HotStuff-style approaches.

12.2 Throughput

Batching (grouping commands), parallel AppendEntries, a page cache for the log, parallel apply (when operations commute); see the batching sketch below.
Disks: NVMe / a log on a separate device, O_DIRECT/AIO, a larger fsync interval within SLA limits (a durability/latency trade-off).
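
A minimal batching sketch (Go; channel and flush names are assumptions): commands are accumulated and flushed to the log either when the batch is full or when a timer fires, so a single fsync covers many commands.

    package main

    import (
        "fmt"
        "time"
    )

    // flush simulates writing a batch to the log with a single fsync.
    func flush(batch []string) {
        if len(batch) == 0 {
            return
        }
        fmt.Printf("fsync one batch of %d commands: %v\n", len(batch), batch)
    }

    // batcher groups incoming commands by size or by time, whichever comes first.
    // (Timer reset handling is simplified; an occasional early flush is harmless here.)
    func batcher(in <-chan string, maxBatch int, maxDelay time.Duration) {
        batch := make([]string, 0, maxBatch)
        timer := time.NewTimer(maxDelay)
        defer timer.Stop()
        for {
            select {
            case cmd, ok := <-in:
                if !ok {
                    flush(batch)
                    return
                }
                batch = append(batch, cmd)
                if len(batch) >= maxBatch {
                    flush(batch)
                    batch = batch[:0]
                    timer.Reset(maxDelay)
                }
            case <-timer.C:
                flush(batch)
                batch = batch[:0]
                timer.Reset(maxDelay)
            }
        }
    }

    func main() {
        in := make(chan string)
        done := make(chan struct{})
        go func() { batcher(in, 3, 20*time.Millisecond); close(done) }()

        for i := 1; i <= 7; i++ {
            in <- fmt.Sprintf("cmd-%d", i)
        }
        close(in)
        <-done
    }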

12.3 p99 tails

Avoid hot nodes (with a single leader there is always one): periodic leader rotation or offloading reads to followers.
Monitor GC pauses (JVM/Go), CPU pinning, NUMA effects.

13) Geography and Topologies

1 region, 3 zones: a classic CFT cluster (N = 3 or 5).
2 regions: avoid; there is no reliable quorum under a 1:1 split.
3+ regions: a leader in the "middle" region or leaderless algorithms; regional proxies/local front-ends with an asynchronous log are possible.

14) Practical issues of operation

14.1 Snapshots and recovery

Trigger snapshots by a threshold on log size / number of transactions.
Ship snapshots to new nodes; verify checksums.

14.2 Monitoring

Leadership: who the leader is, how often terms/epochs change.
Lags: `append_latency`, `commit_index - applied_index` (see the sketch below).
Quorum health: "is a majority (CFT) or 2f + 1 (BFT) alive?"
Log size / compaction rate, snapshot queue.
For BFT: vote share, dropped signers, QC latencies.
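
A minimal sketch of an apply-lag check over these counters (Go; the threshold is an illustrative assumption): alert when commit_index - applied_index exceeds a limit.

    package main

    import "fmt"

    // Metrics is a snapshot of replication counters scraped from a node.
    type Metrics struct {
        CommitIndex  int64
        AppliedIndex int64
    }

    // applyLag is the number of committed but not yet applied entries.
    func applyLag(m Metrics) int64 {
        return m.CommitIndex - m.AppliedIndex
    }

    func main() {
        const lagThreshold = 1000 // illustrative alert threshold

        samples := []Metrics{
            {CommitIndex: 52000, AppliedIndex: 51990},
            {CommitIndex: 54000, AppliedIndex: 52100},
        }
        for _, m := range samples {
            lag := applyLag(m)
            if lag > lagThreshold {
                fmt.Printf("ALERT: apply lag %d exceeds %d (slow apply or a stuck state machine)\n", lag, lagThreshold)
            } else {
                fmt.Printf("ok: apply lag %d\n", lag)
            }
        }
    }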

14.3 Code safety / compatibility

Protocol versioning on the wire, migrations with log compatibility.
Fencing tokens for external side effects (see "Distributed locks"): pass the leadership term into cron/background jobs.
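
A minimal fencing sketch (Go; the resource type is an illustrative assumption): an external side effect accepts a request only if it carries a leadership term no lower than the highest term it has already seen, so a deposed leader that wakes up late is rejected.

    package main

    import "fmt"

    // ExternalResource is a side-effect target (storage, a cron runner, a payment API).
    // It remembers the highest fencing token (leadership term) it has accepted.
    type ExternalResource struct {
        highestTerm int64
    }

    // Execute applies the request only if the caller's term is current.
    // Requests from a deposed leader (stale term) are rejected.
    func (r *ExternalResource) Execute(term int64, op string) error {
        if term < r.highestTerm {
            return fmt.Errorf("rejected %q: stale term %d < %d", op, term, r.highestTerm)
        }
        r.highestTerm = term
        fmt.Printf("executed %q with term %d\n", op, term)
        return nil
    }

    func main() {
        res := &ExternalResource{}

        // The leader of term 3 runs a job, then a new leader with term 4 takes over.
        _ = res.Execute(3, "rotate-keys")
        _ = res.Execute(4, "rotate-keys")

        // The old leader (term 3) wakes up after a GC pause and retries: fenced off.
        if err := res.Execute(3, "rotate-keys"); err != nil {
            fmt.Println(err)
        }
    }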

15) Anti-patterns

Deploying consensus "because it is fashionable" where a single DBMS or quorum reads without serialization are enough.
Mixing CP and AP without clear boundaries (CAP) → unpredictable UX.
Overrating PoW/PoS for enterprise tasks with dozens of nodes (complex/expensive).
Ignoring reconfiguration and quorum intersection when membership changes.
No read barrier (lease/read-index) → stale reads.
Running 2-node clusters (no majority).
Underestimating disks and fsync: "everything is fast in memory," until the first restart.

16) Selection checklist

1. Failure model: CFT (crashes) or BFT (malicious)?
2. Geography: one region/three zones or WAN? RTT decides.
3. Read semantics: are linearizable reads needed? Lease / read-index / follower reads.
4. Load: p50/p99 expectations, throughput; whether multi-leader is needed.
5. Operational complexity: a library/off-the-shelf engine (etcd/ZK/Consul/Raft libs) vs a custom implementation.
6. Reconfiguration: is it frequent? Joint consensus and migration tooling are needed.
7. Integrations: for external side effects, is there fencing by epoch/term?
8. Security: node authentication, channel encryption, protocol version control.
9. Test playbooks: partitions, GC stalls, leader kill, slow disks, clock skew.
10. Observability: leader, lag, and log metrics plus alerts are set up.

17) Mini-guide: when to take what

An etcd/Raft cluster for configuration, locks, and coordination within a DC.
ZooKeeper/ZAB for services already tied to ZK (legacy stacks, queues, leader election).
Multi-Paxos via a ready-made service/library in highly specialized systems.
EPaxos for geo-distribution with low command conflict rates.
Tendermint/HotStuff for permissioned networks / a PoS layer with tens to hundreds of validators and finality requirements.
Dynamo-like/CRDT when consensus is not needed, but availability and scale with later merging matter.

18) Example interfaces (pseudocode)

18.1 Committing an entry (Raft style)

    client -> leader: Propose(cmd)
    leader.appendLog(cmd)
    leader.replicateToQuorum()
    if quorum_acked:
        leader.commit(index)
        leader.apply(index)
        leader.reply(client, ok)

18.2 Read-index for linearizable reads (Raft)

    client -> any node: LinearizableRead
    node -> leader: ReadIndex?
    leader -> quorum: Heartbeat (barrier)
    leader -> node: ReadIndex = commit_index
    node.wait_until(applied_index >= ReadIndex)
    node.reply(client, state_at(ReadIndex))

18.3 Joint configuration

    old_conf + new_conf          # quorums of both configurations must intersect
    commit(entries under joint)
    switch_to(new_conf)

18.4 BFT (HotStuff, approximate)

    propose(block)
    collect votes -> QC
    lock on highest QC
    commit when consecutive QCs across phases are collected

19) FAQ

Q: Why not use two nodes and a tie-breaker?
A: Two nodes alone cannot form a quorum during a split; a tie-breaker is effectively a third voting member. You need ≥ 3 voters (CFT) or 3f + 1 (BFT).

Q: Why is Raft "simpler" than Paxos?
A: A clear decomposition, understandable invariants for the log and configuration; easier to implement and maintain.

Q: How do you read fast without loading the leader?
A: Follower reads (sequential consistency) for non-critical data, or lease reads / read-index for linearizable reads; plus caching.

Q: What kills p99?
A: WAN RTT, disk fsync, GC stalls, an overloaded leader, large snapshots at peak hours.

Q: Does the cache/catalog need consensus?
A: If eventual consistency is enough, no. If transactional invariants are required, yes.

20) Summary

Consensus is a tool for strict consistency and ordering. Choose an algorithm based on the failure model and geography; ensure quorum intersection, correct reconfiguration, linearizable reads where they are critical, and observability. Do not forget fencing for external effects, snapshots, and log discipline. Then state replication becomes predictable, and incidents become rare and understandable.
