Leader Election
1) Why you need a leader and when it is justified at all
A leader is a node with the exclusive right to perform critical actions: running cron/ETL jobs, coordinating shards, distributing keys, changing configuration. It simplifies invariants ("exactly one executor") but adds risks (SPOF, re-elections, lag).
Use leadership if:
- execution must be unique (for example, a billing aggregator that runs once a minute);
- changes must be serialized (configuration registry, distributed locks);
- the cluster protocol assumes leader-based replication (Raft).
Avoid leadership if:
- the problem is solved by idempotence and per-key ordering;
- the work can be parallelized via work-stealing/queues;
- the "leader" would become the single bottleneck (wide fan-in).
2) Base model: lease + quorum + epoch
Terms
Lease: the leader holds its right for T seconds and must keep renewing it.
Heartbeat: a periodic renewal/liveness signal.
Epoch/term: a monotonically increasing leadership number; it helps recognize "stale" leaders.
Fencing token: the same monotonic number, checked by the consumer of the resource (database/storage), which rejects operations from a stale leader.
Invariants
At any moment there is at most one current leader (safety).
After a failure, progress is possible: a new leader is elected within a reasonable time (liveness).
Every leader operation carries its epoch; sinks accept only newer epochs.
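To make the last invariant concrete, here is a minimal Go sketch (the type and names are illustrative, not from any particular library) of a sink that remembers the highest epoch it has accepted and fences off anything older:

```go
package fencing

import (
	"errors"
	"sync"
)

var ErrStaleEpoch = errors.New("rejected: operation from a stale leader epoch")

// FencedSink is a stand-in for a resource (database facade, storage client)
// that tracks the highest epoch it has seen and rejects older writers.
type FencedSink struct {
	mu        sync.Mutex
	lastEpoch int64
}

// Apply runs op only if epoch is at least as new as anything seen before;
// a leader from an older epoch that wakes up after a pause is rejected.
func (s *FencedSink) Apply(epoch int64, op func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if epoch < s.lastEpoch {
		return ErrStaleEpoch
	}
	s.lastEpoch = epoch
	return op()
}
```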
3) Overview of algorithms and protocols
3.1 Raft (leader-based replication)
States: Follower → Candidate → Leader.
Timers: a randomized election timeout (jitter) triggers RequestVote; the leader sends AppendEntries as its heartbeat (see the timeout sketch below).
Guarantees: quorum-based decisions, no split-brain under the standard assumptions, a replicated log with monotonic ordering (term/index).
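A minimal sketch of the randomized timeout idea (the range and base value are illustrative, not Raft-mandated constants):

```go
import (
	"math/rand"
	"time"
)

// electionTimeout returns a value in [base, 2*base). Followers that stop
// hearing AppendEntries wait this long before becoming candidates, so
// simultaneous candidacies (and split votes) become unlikely.
func electionTimeout(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}
```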
3.2 Paxos / Single-Decree / Multi-Paxos
The theoretical foundation of consensus; in practice, variations (e.g. Multi-Paxos) with a "distinguished coordinator" (a leader analogue) are used.
Harder to implement directly; ready-made implementations/libraries are used more often.
3.3 ZAB (ZooKeeper Atomic Broadcast)
ZooKeeper's mechanism: leader-based log replication with recovery phases; epochs (zxid) and sequential ephemeral nodes provide primitives such as leader election.
3.4 Bully / Chang-Roberts (ring and bully elections)
"Textbook" algorithms for static topologies without a quorum. They do not account for partial network failures/partitions; do not use them in production.
4) Practical platforms
4.1 ZooKeeper
EPHEMERAL_SEQUENTIAL pattern: each process creates `/leader/lock-XXXX`; the node with the smallest number is the leader.
Loss of a session ⇒ the node disappears ⇒ re-election is almost instantaneous.
Fairness comes from watching the "predecessor" node.
4.2 etcd (Raft)
Native leadership at the cluster level itself; for applications, use etcd concurrency: `Session` + `Mutex`/`Election`.
Lease ID with TTL and keepalive; an epoch can be stored in the key value.
4.3 Consul
`session` + KV `acquire`: whoever holds the key is the leader. TTL/heartbeat live in the session; a minimal sketch follows below.
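A sketch with the official Go client (`github.com/hashicorp/consul/api`); the key name, TTL, and `holderID` are illustrative assumptions:

```go
import (
	"log"

	"github.com/hashicorp/consul/api"
)

// acquireLeadership returns true if this process became the leader.
func acquireLeadership(holderID string) (bool, error) {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return false, err
	}

	// A session with a TTL: if we stop renewing it, the lock is released.
	sessionID, _, err := client.Session().Create(&api.SessionEntry{
		Name:     "rollup-leader",
		TTL:      "15s",
		Behavior: api.SessionBehaviorDelete, // drop the key when the session dies
	}, nil)
	if err != nil {
		return false, err
	}

	// Heartbeat: renew the session in the background until the process exits.
	go func() {
		if err := client.Session().RenewPeriodic("15s", sessionID, nil, nil); err != nil {
			log.Printf("session renewal stopped: %v", err)
		}
	}()

	// Whoever acquires the key with their session is the leader.
	acquired, _, err := client.KV().Acquire(&api.KVPair{
		Key:     "service/rollup/leader",
		Value:   []byte(holderID),
		Session: sessionID,
	}, nil)
	return acquired, err
}
```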
4.4 Kubernetes
Coordination API Leases (`coordination.k8s.io/v1`): a `Lease` resource with `holderIdentity`, `leaseDurationSeconds`, `renewTime`.
The client library `leaderelection` (client-go) implements acquisition and renewal; ideal for leader pods.
5) How to build a "safe" leader
5.1 Keep an epoch and use fencing
Every new leadership increases the epoch (e.g. the etcd revision, the ZooKeeper zxid, or a separate counter).
All leader side effects (database writes, task execution) must carry the `epoch` and be checked against it:

```sql
UPDATE cron_state
SET last_run = now(), last_epoch = :epoch
WHERE name = 'daily-rollup' AND :epoch > last_epoch;
```

A stale leader (e.g. after a split-brain) will be rejected.
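The same check from application code, sketched with `database/sql` (PostgreSQL-style placeholders; table and job name as in the query above; zero affected rows means we were fenced off):

```go
import (
	"database/sql"
	"fmt"
)

// runFenced performs the epoch-guarded update and tells the caller whether
// this leader is still the newest one the database has seen. The strict ">"
// also prevents the same epoch from re-running the rollup twice.
func runFenced(db *sql.DB, epoch int64) error {
	res, err := db.Exec(
		`UPDATE cron_state
		    SET last_run = now(), last_epoch = $1
		  WHERE name = 'daily-rollup' AND $1 > last_epoch`,
		epoch,
	)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		return fmt.Errorf("fenced off: epoch %d is stale", epoch)
	}
	return nil // safe to run the side effects guarded by this row
}
```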
5.2 Timings
`leaseDuration` ≥ 2-3 × `heartbeatInterval` + network latency + p99 GC pause.
Randomize the election timeout (jitter) so that candidates do not collide.
If renewal is lost, stop critical operations immediately.
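The budget as code, a sketch with illustrative numbers; the point is to derive the lease from the heartbeat plus the worst pauses you actually observe:

```go
import "time"

const (
	heartbeatInterval = 3 * time.Second
	networkBudget     = 500 * time.Millisecond // p99 round trip to the arbiter
	gcPauseBudget     = 1 * time.Second        // p99 stop-the-world pause
)

// leaseDuration follows the rule of thumb above: 2-3 heartbeats plus slack
// for the network and GC, so one missed renewal does not cost leadership.
var leaseDuration = 3*heartbeatInterval + networkBudget + gcPauseBudget // 10.5s
```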
5.3 Identity
`holderId = node#pid#startTime#rand`. On renewal/removal, verify that it is the same holder.
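For example (a sketch; the field order mirrors the pattern above):

```go
import (
	"fmt"
	"math/rand"
	"os"
	"time"
)

// newHolderID builds an identity that stays unique across restarts of the
// same node, so a rebooted process is never mistaken for its previous self.
func newHolderID() string {
	host, _ := os.Hostname()
	return fmt.Sprintf("%s#%d#%d#%04x",
		host, os.Getpid(), time.Now().Unix(), rand.Intn(0x10000))
}
```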
5.4 Watchers
All followers subscribe to `Lease`/`Election` changes and start or stop work according to the status (see the sketch below).
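With etcd's `concurrency.Election`, for example, followers can observe the current leader record (a sketch; `sess`, `ctx`, `myID`, and `stopLeader` are assumed from the surrounding application, as in section 6.2):

```go
e := concurrency.NewElection(sess, "/election/rollup")

// Observe delivers the current leader key every time it changes; followers
// use it to switch between standby and active roles.
for resp := range e.Observe(ctx) {
	if len(resp.Kvs) == 0 {
		continue // no leader at the moment
	}
	if leaderID := string(resp.Kvs[0].Value); leaderID != myID {
		stopLeader() // someone else holds leadership now
	}
}
```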
6) Implementations: fragments
6.1 Kubernetes (Go)

```go
import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/leaderelection"
	rl "k8s.io/client-go/tools/leaderelection/resourcelock"
)

lec := leaderelection.LeaderElectionConfig{
	Lock: &rl.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "jobs-leader", Namespace: "prod"},
		Client:     coordClient,
		LockConfig: rl.ResourceLockConfig{Identity: podName},
	},
	LeaseDuration: 15 * time.Second,
	RenewDeadline: 10 * time.Second,
	RetryPeriod:   2 * time.Second,
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: func(ctx context.Context) { runLeader(ctx) },
		OnStoppedLeading: func() { stopLeader() },
	},
}
leaderelection.RunOrDie(context.Background(), lec)
```
6.2 etcd (Go)

```go
cli, _ := clientv3.New(...)
sess, _ := concurrency.NewSession(cli, concurrency.WithTTL(10))
e := concurrency.NewElection(sess, "/election/rollup")
_ = e.Campaign(ctx, podID) // blocking call
epoch := sess.Lease()      // use the lease ID as part of the fencing token
defer e.Resign(ctx)
```
6.3 ZooKeeper (Java, Curator)

```java
LeaderSelector selector = new LeaderSelector(client, "/leaders/rollup", listener);
selector.autoRequeue();
selector.start(); // listener.takeLeadership() performs the leader work with try/finally
```
7) Re-elections and service degradation
Rapid leader flapping shows up as a sawtooth on the dashboards. Treat it by increasing leaseDuration/renewDeadline and eliminating GC/CPU spikes.
During re-election, enable a brownout: reduce the intensity of background tasks or freeze them entirely until leadership is confirmed.
For long jobs, use checkpoints plus an idempotent roll-forward (catch-up) after a leader change, as sketched below.
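A sketch of the checkpoint-and-resume loop; `loadCheckpoint`, `saveCheckpoint`, and `processBatch` are assumed helpers, and the batches themselves must be idempotent:

```go
import "context"

// runRollup resumes from the last persisted checkpoint, so a new leader
// re-does at most one batch after a change of leadership.
func runRollup(ctx context.Context, epoch int64) error {
	cursor, err := loadCheckpoint(ctx, "daily-rollup")
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done(): // leadership lost: stop immediately
			return ctx.Err()
		default:
		}
		next, done, err := processBatch(ctx, cursor)
		if err != nil {
			return err
		}
		// The checkpoint write carries the epoch and is fenced like any
		// other side effect (section 5.1).
		if err := saveCheckpoint(ctx, "daily-rollup", next, epoch); err != nil {
			return err
		}
		if done {
			return nil
		}
		cursor = next
	}
}
```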
8) Split-brain: how to stay out of it
Use CP stores (etcd/ZK/Consul) with a quorum; a leader must not be elected without a quorum.
Never build leadership on an AP cache without a quorum arbiter.
Even in the CP model, keep fencing at the resource level: it is insurance against rare abnormal scenarios (pauses, stuck drivers).
9) Observability and operations
Metrics (an export sketch follows the list):
`leadership_is_leader{app}` (gauge 0/1).
`election_total{result=won|lost|resign}`.
`lease_renew_latency_ms{p50,p95,p99}`, `lease_renew_fail_total`.
`epoch_value` (must grow monotonically across the cluster).
`flaps_total`: the number of leader changes per window.
For ZK/etcd: replication lag, quorum health.
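A sketch of exporting the core metrics with the Prometheus Go client (`client_golang`); the names follow the list above, and the callback wiring is an assumption about your application:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	isLeader = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "leadership_is_leader",
		Help: "1 while this instance holds leadership, 0 otherwise.",
	}, []string{"app"})

	elections = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "election_total",
		Help: "Election outcomes observed by this instance.",
	}, []string{"result"}) // won | lost | resign

	renewLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "lease_renew_latency_ms",
		Help:    "Latency of lease renewals in milliseconds.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	})
)

// Typical wiring in the callbacks from section 6.1:
//   OnStartedLeading: isLeader.WithLabelValues(app).Set(1); elections.WithLabelValues("won").Inc()
//   OnStoppedLeading: isLeader.WithLabelValues(app).Set(0); elections.WithLabelValues("lost").Inc()
```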
Alerts
Frequent leader changes (> N per hour).
Renew failures / high p99 renew latency.
Epoch divergence (two different epochs observed on different nodes).
No leader for longer than X seconds (if the business cannot tolerate it).
Logs/Traces
Tag election events with: `epoch`, `holderId`, `reason` (lost lease, session expired), `duration_ms`.
10) Test playbooks (Game Days)
Partition: break the network between two zones; leadership must only be possible on the quorum side.
GC stop: artificially pause the leader for 5-10 s; it must lose the lease and stop working.
Clock skew/drift: make sure correctness does not depend on wall-clock time (fencing/epoch still holds).
kill -9: sudden leader crash → a new leader within ≤ leaseDuration.
Slow storage: slow down the disks/Raft log; measure election time and tune the timings.
11) Anti-patterns
A "leader" via Redis `SET NX PX` with no fencing and no quorum.
`leaseDuration` shorter than the p99 duration of the critical operation.
Not stopping work after losing leadership ("I'll just finish this up").
No jitter in election timers → an election storm.
A single long job with no checkpoints: every flap means a replay from scratch.
Tight coupling of leadership and traffic routing (sticky) without a fallback: during a flap, clients get 5xx.
12) Implementation checklist
- A quorum arbiter is chosen: etcd/ZK/Consul/K8s Lease.
- The epoch/fencing token is stored and passed into all leader side effects.
- Timings (`leaseDuration`, `renewDeadline`, `retryPeriod`) are configured with a margin for network/GC.
- Watchers are in place and shutdown on loss of leadership is correct.
- Leader tasks are idempotent and checkpointed.
- Metrics/alerts and `epoch`/`holderId` logging are enabled.
- Game days have been held: partition, GC stop, kill, clock skew.
- Policies are documented: what the leader does, who can replace it, how epoch conflicts are resolved.
- Degradation plan: what the system does without a leader.
- Performance test: flaps under load do not break the SLO.
13) FAQ
Q: Can leadership be built without a quorum?
A: In production, no. You need a CP component (quorum) or a cloud service with equivalent guarantees.
Q: Why an epoch if there is a lease?
A: A lease provides liveness, but it does not protect against a "stale leader" after a partition or a pause. Epoch/fencing invalidates the stale leader's effects.
Q: What are typical timing defaults in K8s?
A: Commonly `LeaseDuration ≈ 15s`, `RenewDeadline ≈ 10s`, `RetryPeriod ≈ 2s`. Adjust for your p99 load and GC.
Q: How do you test leadership locally?
A: Run 3-5 instances, emulate the network (tc/netem), pauses (SIGSTOP), kill the leader (SIGKILL), and check metrics/logs/epochs.
Q: What to do with long tasks when the leader changes?
A: Checkpoints plus an idempotent roll-forward; on loss of leadership, stop immediately and release resources.
14) Summary
Reliable leader election is a quorum arbiter plus epoch discipline. Hold leadership as a lease with a heartbeat, guard every side effect with a fencing token, set timings with a margin, make the leader's tasks idempotent and observable, and rehearse failures regularly. Then "one and only one executor" will not be a slogan but a guarantee that withstands pauses, network whims, and human error.