Leader Election
1) Why you need a leader and when it is justified at all
A leader is a node with the exclusive right to perform critical actions: running cron/ETL jobs, coordinating shards, distributing keys, changing configuration. It simplifies invariants ("exactly one executor") but adds risks (SPOF, re-elections, lag).
Use leadership if:
- execution must be unique (for example, a billing aggregator that runs once a minute);
- changes must be serialized (configuration registry, distributed locks);
- the cluster protocol assumes leader-based replication (Raft).
Avoid leadership if:
- the problem is solved by idempotence and per-key ordering;
- the work can be parallelized via work-stealing/queues;
- the "leader" would become the single bottleneck (wide fan-in).
2) Base model: lease + quorum + epoch
Terms
Lease: the leader holds its right for T seconds and must keep renewing it.
Heartbeat: a periodic renewal/liveness signal.
Epoch/term: a monotonically increasing leadership number; it helps recognize "stale" leaders.
Fencing token: the same monotonic number, checked by the consumer of the resource (database/storage), which rejects operations from a stale leader.
Invariants
At any moment there is at most one current leader (safety).
After a failure, progress is possible: a new leader is elected within a reasonable time (liveness).
Every leader operation carries its epoch; sinks accept only newer epochs.
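To make the last invariant concrete, here is a minimal Go sketch (the type and names are illustrative, not from any particular library) of a sink that remembers the highest epoch it has accepted and fences off anything older:

```go
package fencing

import (
	"errors"
	"sync"
)

var ErrStaleEpoch = errors.New("rejected: operation from a stale leader epoch")

// FencedSink is a stand-in for a resource (database facade, storage client)
// that tracks the highest epoch it has seen and rejects older writers.
type FencedSink struct {
	mu        sync.Mutex
	lastEpoch int64
}

// Apply runs op only if epoch is at least as new as anything seen before;
// a leader from an older epoch that wakes up after a pause is rejected.
func (s *FencedSink) Apply(epoch int64, op func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if epoch < s.lastEpoch {
		return ErrStaleEpoch
	}
	s.lastEpoch = epoch
	return op()
}
```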
3) Overview of algorithms and protocols
3.1 Raft (leader-based replication)
States: Follower → Candidate → Leader.
Timers: a randomized election timeout (jitter) triggers RequestVote; the leader sends AppendEntries as its heartbeat (see the timeout sketch below).
Guarantees: quorum-based decisions, no split-brain under the standard assumptions, a replicated log with monotonic ordering (term/index).
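A minimal sketch of the randomized timeout idea (the range and base value are illustrative, not Raft-mandated constants):

```go
import (
	"math/rand"
	"time"
)

// electionTimeout returns a value in [base, 2*base). Followers that stop
// hearing AppendEntries wait this long before becoming candidates, so
// simultaneous candidacies (and split votes) become unlikely.
func electionTimeout(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}
```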
3.2 Paxos / Single-Decree / Multi-Paxos
The theoretical foundation of consensus; in practice, variations (e.g. Multi-Paxos) with a "distinguished coordinator" (a leader analogue) are used.
Harder to implement directly; ready-made implementations/libraries are used more often.
3.3 ZAB (ZooKeeper Atomic Broadcast)
ZooKeeper's mechanism: leader-based log replication with recovery phases; epochs (zxid) and sequential ephemeral nodes provide primitives such as leader election.
3.4 Bully / Chang-Roberts (ring and bully elections)
"Textbook" algorithms for static topologies without a quorum. They do not account for partial network failures/partitions; do not use them in production.
4) Practical platforms
4.1 ZooKeeper
EPHEMERAL_SEQUENTIAL pattern: each process creates `/leader/lock-XXXX`; the node with the smallest number is the leader.
Loss of a session ⇒ the node disappears ⇒ re-election is almost instantaneous.
Fairness comes from watching the "predecessor" node.
4.2 etcd (Raft)
Native leadership at the cluster level itself; for applications, use etcd concurrency: `Session` + `Mutex`/`Election`.
Lease ID with TTL and keepalive; an epoch can be stored in the key value.
4.3 Consul
`session` + KV `acquire`: whoever holds the key is the leader. TTL/heartbeat live in the session; a minimal sketch follows below.
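A sketch with the official Go client (`github.com/hashicorp/consul/api`); the key name, TTL, and `holderID` are illustrative assumptions:

```go
import (
	"log"

	"github.com/hashicorp/consul/api"
)

// acquireLeadership returns true if this process became the leader.
func acquireLeadership(holderID string) (bool, error) {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return false, err
	}

	// A session with a TTL: if we stop renewing it, the lock is released.
	sessionID, _, err := client.Session().Create(&api.SessionEntry{
		Name:     "rollup-leader",
		TTL:      "15s",
		Behavior: api.SessionBehaviorDelete, // drop the key when the session dies
	}, nil)
	if err != nil {
		return false, err
	}

	// Heartbeat: renew the session in the background until the process exits.
	go func() {
		if err := client.Session().RenewPeriodic("15s", sessionID, nil, nil); err != nil {
			log.Printf("session renewal stopped: %v", err)
		}
	}()

	// Whoever acquires the key with their session is the leader.
	acquired, _, err := client.KV().Acquire(&api.KVPair{
		Key:     "service/rollup/leader",
		Value:   []byte(holderID),
		Session: sessionID,
	}, nil)
	return acquired, err
}
```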
4.4 Kubernetes
Coordination API Leases (`coordination.k8s.io/v1`): a `Lease` resource with `holderIdentity`, `leaseDurationSeconds`, `renewTime`.
The client library `leaderelection` (client-go) implements acquisition and renewal; ideal for leader pods.
5) How to build a "safe" leader
5.1 Keep an epoch and use fencing
Every new leadership increases the epoch (e.g. the etcd revision, the ZooKeeper zxid, or a separate counter).
All leader side effects (database writes, task execution) must carry the `epoch` and be checked against it:

```sql
UPDATE cron_state
SET last_run = now(), last_epoch = :epoch
WHERE name = 'daily-rollup' AND :epoch > last_epoch;
```

A stale leader (e.g. after a split-brain) will be rejected.
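The same check from application code, sketched with `database/sql` (PostgreSQL-style placeholders; table and job name as in the query above; zero affected rows means we were fenced off):

```go
import (
	"database/sql"
	"fmt"
)

// runFenced performs the epoch-guarded update and tells the caller whether
// this leader is still the newest one the database has seen. The strict ">"
// also prevents the same epoch from re-running the rollup twice.
func runFenced(db *sql.DB, epoch int64) error {
	res, err := db.Exec(
		`UPDATE cron_state
		    SET last_run = now(), last_epoch = $1
		  WHERE name = 'daily-rollup' AND $1 > last_epoch`,
		epoch,
	)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		return fmt.Errorf("fenced off: epoch %d is stale", epoch)
	}
	return nil // safe to run the side effects guarded by this row
}
```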
5.2 Timings
`leaseDuration` ≥ 2-3 × `heartbeatInterval` + network latency + p99 GC pause.
Randomize the election timeout (jitter) so that candidates do not collide.
If renewal is lost, stop critical operations immediately.
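The budget as code, a sketch with illustrative numbers; the point is to derive the lease from the heartbeat plus the worst pauses you actually observe:

```go
import "time"

const (
	heartbeatInterval = 3 * time.Second
	networkBudget     = 500 * time.Millisecond // p99 round trip to the arbiter
	gcPauseBudget     = 1 * time.Second        // p99 stop-the-world pause
)

// leaseDuration follows the rule of thumb above: 2-3 heartbeats plus slack
// for the network and GC, so one missed renewal does not cost leadership.
var leaseDuration = 3*heartbeatInterval + networkBudget + gcPauseBudget // 10.5s
```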
5.3 Identity
`holderId = node#pid#startTime#rand`. On renewal/removal, verify that it is the same holder.
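For example (a sketch; the field order mirrors the pattern above):

```go
import (
	"fmt"
	"math/rand"
	"os"
	"time"
)

// newHolderID builds an identity that stays unique across restarts of the
// same node, so a rebooted process is never mistaken for its previous self.
func newHolderID() string {
	host, _ := os.Hostname()
	return fmt.Sprintf("%s#%d#%d#%04x",
		host, os.Getpid(), time.Now().Unix(), rand.Intn(0x10000))
}
```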
5.4 Watchers
All followers subscribe to `Lease`/`Election` changes and start or stop work according to the status (see the sketch below).
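With etcd's `concurrency.Election`, for example, followers can observe the current leader record (a sketch; `sess`, `ctx`, `myID`, and `stopLeader` are assumed from the surrounding application, as in section 6.2):

```go
e := concurrency.NewElection(sess, "/election/rollup")

// Observe delivers the current leader key every time it changes; followers
// use it to switch between standby and active roles.
for resp := range e.Observe(ctx) {
	if len(resp.Kvs) == 0 {
		continue // no leader at the moment
	}
	if leaderID := string(resp.Kvs[0].Value); leaderID != myID {
		stopLeader() // someone else holds leadership now
	}
}
```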
6) Implementations: fragments
6.1 Kubernetes (Go)

```go
import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/leaderelection"
	rl "k8s.io/client-go/tools/leaderelection/resourcelock"
)

lec := leaderelection.LeaderElectionConfig{
	Lock: &rl.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "jobs-leader", Namespace: "prod"},
		Client:     coordClient,
		LockConfig: rl.ResourceLockConfig{Identity: podName},
	},
	LeaseDuration: 15 * time.Second,
	RenewDeadline: 10 * time.Second,
	RetryPeriod:   2 * time.Second,
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: func(ctx context.Context) { runLeader(ctx) },
		OnStoppedLeading: func() { stopLeader() },
	},
}
leaderelection.RunOrDie(context.Background(), lec)
```
6.2 etcd (Go)

```go
cli, _ := clientv3.New(...)
sess, _ := concurrency.NewSession(cli, concurrency.WithTTL(10))
e := concurrency.NewElection(sess, "/election/rollup")
_ = e.Campaign(ctx, podID) // blocking call
epoch := sess.Lease()      // use the lease ID as part of the fencing token
defer e.Resign(ctx)
```
6.3 ZooKeeper (Java, Curator)

```java
LeaderSelector selector = new LeaderSelector(client, "/leaders/rollup", listener);
selector.autoRequeue();
selector.start(); // listener.takeLeadership() performs the leader work with try/finally
```
7) Re-elections and service degradation
Rapid leader flapping shows up as a sawtooth on the dashboards. Treat it by increasing leaseDuration/renewDeadline and eliminating GC/CPU spikes.
During re-election, enable a brownout: reduce the intensity of background tasks or freeze them entirely until leadership is confirmed.
For long jobs, use checkpoints plus an idempotent roll-forward (catch-up) after a leader change, as sketched below.
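A sketch of the checkpoint-and-resume loop; `loadCheckpoint`, `saveCheckpoint`, and `processBatch` are assumed helpers, and the batches themselves must be idempotent:

```go
import "context"

// runRollup resumes from the last persisted checkpoint, so a new leader
// re-does at most one batch after a change of leadership.
func runRollup(ctx context.Context, epoch int64) error {
	cursor, err := loadCheckpoint(ctx, "daily-rollup")
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done(): // leadership lost: stop immediately
			return ctx.Err()
		default:
		}
		next, done, err := processBatch(ctx, cursor)
		if err != nil {
			return err
		}
		// The checkpoint write carries the epoch and is fenced like any
		// other side effect (section 5.1).
		if err := saveCheckpoint(ctx, "daily-rollup", next, epoch); err != nil {
			return err
		}
		if done {
			return nil
		}
		cursor = next
	}
}
```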
8) Split-brain: how to stay out of it
Use CP stores (etcd/ZK/Consul) with a quorum; a leader must not be elected without a quorum.
Never build leadership on an AP cache without a quorum arbiter.
Even in the CP model, keep fencing at the resource level: it is insurance against rare abnormal scenarios (pauses, stuck drivers).
9) Observability and operations
Metrics (an export sketch follows the list):
`leadership_is_leader{app}` (gauge 0/1).
`election_total{result=won|lost|resign}`.
`lease_renew_latency_ms{p50,p95,p99}`, `lease_renew_fail_total`.
`epoch_value` (must grow monotonically across the cluster).
`flaps_total`: the number of leader changes per window.
For ZK/etcd: replication lag, quorum health.
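A sketch of exporting the core metrics with the Prometheus Go client (`client_golang`); the names follow the list above, and the callback wiring is an assumption about your application:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	isLeader = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "leadership_is_leader",
		Help: "1 while this instance holds leadership, 0 otherwise.",
	}, []string{"app"})

	elections = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "election_total",
		Help: "Election outcomes observed by this instance.",
	}, []string{"result"}) // won | lost | resign

	renewLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "lease_renew_latency_ms",
		Help:    "Latency of lease renewals in milliseconds.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	})
)

// Typical wiring in the callbacks from section 6.1:
//   OnStartedLeading: isLeader.WithLabelValues(app).Set(1); elections.WithLabelValues("won").Inc()
//   OnStoppedLeading: isLeader.WithLabelValues(app).Set(0); elections.WithLabelValues("lost").Inc()
```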
Alerts
Frequent leader changes (> N per hour).
Renew failures / high p99 renew latency.
Epoch divergence (two different epochs observed on different nodes).
No leader for longer than X seconds (if the business cannot tolerate it).
Logs/Traces
Tag election events with: `epoch`, `holderId`, `reason` (lost lease, session expired), `duration_ms`.
10) Test playbooks (Game Days)
Partition: break the network between two zones; leadership must only be possible on the quorum side.
GC stop: artificially pause the leader for 5-10 s; it must lose the lease and stop working.
Clock skew/drift: make sure correctness does not depend on wall-clock time (fencing/epoch still holds).
kill -9: sudden leader crash → a new leader within ≤ leaseDuration.
Slow storage: slow down the disks/Raft log; measure election time and tune the timings.
11) Anti-patterns
A "leader" via Redis `SET NX PX` with no fencing and no quorum.
`leaseDuration` shorter than the p99 duration of the critical operation.
Not stopping work after losing leadership ("I'll just finish this up").
No jitter in election timers → an election storm.
A single long job with no checkpoints: every flap means a replay from scratch.
Tight coupling of leadership and traffic routing (sticky) without a fallback: during a flap, clients get 5xx.
12) Implementation checklist
- A quorum arbiter is chosen: etcd/ZK/Consul/K8s Lease.
- The epoch/fencing token is stored and passed into all leader side effects.
- Timings (`leaseDuration`, `renewDeadline`, `retryPeriod`) are configured with a margin for network/GC.
- Watchers are in place and shutdown on loss of leadership is correct.
- Leader tasks are idempotent and checkpointed.
- Metrics/alerts and `epoch`/`holderId` logging are enabled.
- Game days have been held: partition, GC stop, kill, clock skew.
- Policies are documented: what the leader does, who can replace it, how epoch conflicts are resolved.
- Degradation plan: what the system does without a leader.
- Performance test: flaps under load do not break the SLO.
13) FAQ
Q: Can leadership be built without a quorum?
A: In production, no. You need a CP component (quorum) or a cloud service with equivalent guarantees.
Q: Why an epoch if there is a lease?
A: A lease provides liveness, but it does not protect against a "stale leader" after a partition or a pause. Epoch/fencing invalidates the stale leader's effects.
Q: What are typical timing defaults in K8s?
A: Commonly `LeaseDuration ≈ 15s`, `RenewDeadline ≈ 10s`, `RetryPeriod ≈ 2s`. Adjust for your p99 load and GC.
Q: How do you test leadership locally?
A: Run 3-5 instances, emulate the network (tc/netem), pauses (SIGSTOP), kill the leader (SIGKILL), and check metrics/logs/epochs.
Q: What to do with long tasks when the leader changes?
A: Checkpoints plus an idempotent roll-forward; on loss of leadership, stop immediately and release resources.
14) Summary
Reliable leader election is a quorum arbiter plus epoch discipline. Hold leadership as a lease with a heartbeat, guard every side effect with a fencing token, set timings with a margin, make the leader's tasks idempotent and observable, and rehearse failures regularly. Then "one and only one executor" will not be a slogan but a guarantee that withstands pauses, network whims, and human error.