Distributed locks
1) Why (and when) distributed locks are needed
Distributed locking is a mechanism that guarantees mutual exclusion for a critical section across several cluster nodes. Typical tasks:
- Leader election for a background task/scheduler.
- Ensuring a single executor for a shared resource (file moves, schema migrations, an exclusive payment step).
- Sequential processing of an aggregate (wallet/order) when idempotency/ordering cannot be achieved otherwise.
A distributed lock is usually unnecessary (see the sketch after this list):
- If you can use an idempotent upsert, CAS (compare-and-set), or per-key ordering.
- If the resource allows commutative operations (CRDT, counters).
- If the problem is solved by a transaction in a single store.
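Where such an alternative fits, a per-key optimistic CAS often replaces a distributed lock entirely. A minimal sketch, assuming a hypothetical `orders(id, status, version)` table and a Postgres-style driver (names are illustrative, not taken from the text above):

```go
package example

import (
	"context"
	"database/sql"
	"errors"
)

var ErrConflict = errors.New("concurrent update, re-read and retry")

// markPaid updates the order only if nobody changed it since we read expectedVersion.
func markPaid(ctx context.Context, db *sql.DB, orderID, expectedVersion int64) error {
	res, err := db.ExecContext(ctx,
		`UPDATE orders SET status = 'paid', version = version + 1
		  WHERE id = $1 AND version = $2`, orderID, expectedVersion)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return ErrConflict // lost the race: another writer bumped the version
	}
	return nil
}
```

If zero rows are affected, the caller re-reads the row and retries instead of waiting on a lock.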
2) Threat model and properties
Failures and complications:
- Network: delays, partitions, packet loss.
- Processes: GC pauses, stop-the-world, crashes right after acquiring the lock.
- Time: clock drift and skew break TTL-based approaches.
- Re-ownership: a "zombie" process returning after a network partition may believe it still owns the lock.
Required properties:
- Safety: no more than one valid owner at a time.
- Liveness: the lock is eventually released when the owner fails.
- Fairness: no starvation.
- Clock independence: correctness does not depend on wall-clock time (or is compensated by fencing tokens).
3) Main models
3.1 Lease (lease-based lock)
The lock is issued with a TTL. The owner must renew it before expiration (heartbeat/keepalive).
Pros: the lock releases itself when the owner crashes.
Risks: if the owner gets "stuck" (e.g. paused) yet keeps working after losing the lease, double ownership can arise.
3.2 Fencing token
Each successful acquisition issues a monotonically increasing number. Resource consumers (database, queue, file storage) check the token and reject operations carrying an older number.
This is critical for TTL/lease locks and network partitions - it protects against the "old" owner.
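To make the idea concrete, here is a minimal in-memory sketch (assumed names, not a specific library) of a sink that remembers the highest fencing token it has seen and rejects anything older; real sinks (DB/queue/storage) apply the same rule in their own write path:

```go
package example

import (
	"errors"
	"sync"
)

var ErrStaleToken = errors.New("stale fencing token: writer is no longer the owner")

// FencedStore rejects writes whose fencing token is lower than the highest seen so far.
type FencedStore struct {
	mu        sync.Mutex
	lastToken int64
	data      map[string]string
}

func NewFencedStore() *FencedStore {
	return &FencedStore{data: make(map[string]string)}
}

func (s *FencedStore) Write(token int64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.lastToken {
		return ErrStaleToken // an older owner is trying to write after losing the lock
	}
	s.lastToken = token
	s.data[key] = value
	return nil
}
```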
3.3 Quorum locks (CP systems)
Distributed consensus is used (Raft/Paxos; etcd/ZooKeeper/Consul); the lock record is tied to the consensus log, so there is no split brain as long as a majority of nodes is reachable.
Pros: strong safety guarantees.
Cons: sensitivity to quorum loss (when quorum is gone, liveness suffers).
3.4 AP locks (in-memory/cache + replication)
For example, a Redis cluster. High availability and speed, but no strong safety guarantees under network partitions. Requires fencing on the side of the resource being modified.
4) Platforms and patterns
4.1 etcd/ZooKeeper/Consul (recommended for strong locks)
Ephemeral nodes (ZK) or sessions/leases (etcd): the key exists while the session is alive.
The session is kept alive via keepalive; on quorum loss the session expires and the lock is released.
Sequence nodes (ZK `EPHEMERAL_SEQUENTIAL`) provide a wait queue → fairness.
```go
cli, _ := clientv3.New(...)
lease, _ := cli.Grant(ctx, 10) // 10s lease
sess, _ := concurrency.NewSession(cli, concurrency.WithLease(lease.ID))
m := concurrency.NewMutex(sess, "/locks/orders/42")
if err := m.Lock(ctx); err != nil { /* handle */ }
defer m.Unlock(ctx)
```
4.2 Redis (use with care)
The classic pattern is `SET key value NX PX ttl`.
Problems:
- Replication/failover may allow simultaneous owners.
- Redlock across multiple instances reduces the risk but does not eliminate it; it is controversial in environments with an unreliable network.
It is safer to use Redis as a fast coordination layer, but always complement it with a fencing token in the target resource.
Example (Lua unlock script):
```lua
-- release only if the stored value matches the caller's token
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end
```
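A hedged sketch of the acquire/release pair around this script, assuming the go-redis v9 client (any client exposing `SET NX PX` and `EVAL` would do); the key name and TTL are illustrative:

```go
package example

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

var ErrBusy = errors.New("lock is held by someone else")

const unlockScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end`

// Acquire sets the key only if it does not exist, storing a random owner value.
func Acquire(ctx context.Context, rdb *redis.Client, key string, ttl time.Duration) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	owner := hex.EncodeToString(buf) // random token identifying this owner
	ok, err := rdb.SetNX(ctx, key, owner, ttl).Result()
	if err != nil {
		return "", err
	}
	if !ok {
		return "", ErrBusy
	}
	return owner, nil
}

// Release deletes the key only if we still own it (compare-and-delete via Lua).
func Release(ctx context.Context, rdb *redis.Client, key, owner string) error {
	return rdb.Eval(ctx, unlockScript, []string{key}, owner).Err()
}
```

The random owner value is what the Lua script compares against, so a client whose lock has already expired cannot delete a newer owner's key.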
4.3 DB locks
PostgreSQL advisory locks: a lock held inside the Postgres cluster (session- or transaction-scoped).
Good when all critical sections already live in the same database.
```sql
SELECT pg_try_advisory_lock(42);  -- acquire (non-blocking)
SELECT pg_advisory_unlock(42);    -- release
```
4.4 File/cloud locks
S3/GCS: conditional writes on object metadata with `If-Match` (ETag) preconditions → essentially CAS.
Suitable for backups/migrations.
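As an illustration of the CAS idea, here is a generic HTTP conditional write with `If-Match`; the exact precondition mechanism differs per provider (ETag vs. object generation headers), so treat the URL and header usage as assumptions:

```go
package example

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// putIfUnchanged overwrites the object only if its ETag still matches etag.
func putIfUnchanged(ctx context.Context, url, etag string, body []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("If-Match", etag) // precondition: fail if someone changed the object
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusPreconditionFailed {
		return fmt.Errorf("lost the race: object changed since ETag %s", etag)
	}
	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}
```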
5) Designing a safe lock
5.1 Owner identity
Store an `owner_id` (e.g. node#process#pid#start_time) plus a random token for unlock verification.
A repeated unlock must never remove someone else's lock.
5.2 TTL and renewal
TTL < T_fail_detect (failure detection time) and TTL ≥ p99 critical-section duration × safety margin.
Renew periodically (for example, every `TTL/3`), with a deadline on each renewal.
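A sketch of such a renewal loop; `renew` stands in for whatever call extends the lease in your lock service (an assumption, not a specific API):

```go
package example

import (
	"context"
	"time"
)

// keepAlive renews the lease every TTL/3 and cancels the critical section
// as soon as a renewal fails or cannot be confirmed in time.
func keepAlive(ctx context.Context, cancel context.CancelFunc, ttl time.Duration,
	renew func(context.Context) error) {
	ticker := time.NewTicker(ttl / 3) // renew well before expiry
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Each renewal gets its own deadline shorter than the TTL.
			rctx, rcancel := context.WithTimeout(ctx, ttl/3)
			err := renew(rctx)
			rcancel()
			if err != nil {
				cancel() // lost (or could not confirm) ownership: abort the work
				return
			}
		}
	}
}
```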
5.3 Fencing token at the sink (resource side)
Any code that modifies the external resource must pass the `fencing_token`.
The sink (DB/cache/storage) stores `last_token` and rejects smaller ones:
```sql
UPDATE wallet
SET balance = balance + :delta, last_token = :token
WHERE id = :id AND :token > last_token;
```
5.4 Wait queue and fairness
In ZK: `EPHEMERAL_SEQUENTIAL` nodes plus watches - each client waits for its nearest predecessor to release.
In etcd: keys with revisions/versioning; order by `mod_revision`.
5.5 Split-brain behavior
CP approach: without a quorum you cannot take the lock - better to stall than to break safety.
AP approach: progress is allowed on both sides of the partition → fencing is mandatory.
6) Leader election
In etcd/ZK the "leader" is an exclusive ephemeral key; the other nodes watch it for changes.
The leader writes heartbeats; if they stop, re-election happens.
Accompany all leader operations with a fencing token (epoch/revision number).
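A sketch of this pattern on etcd's `concurrency` package; the prefix, TTL, and node name are illustrative:

```go
package example

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// runAsLeader blocks until this node wins the election, then runs work
// under a context that is cancelled as soon as leadership is lost.
func runAsLeader(ctx context.Context, cli *clientv3.Client, work func(context.Context) error) error {
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		return err
	}
	defer sess.Close()

	e := concurrency.NewElection(sess, "/leaders/cron")
	if err := e.Campaign(ctx, "node-1"); err != nil { // blocks until we are the leader
		return err
	}
	defer e.Resign(context.Background())

	// Step down as soon as the session (and therefore the ephemeral key) is lost.
	lctx, cancel := context.WithCancel(ctx)
	go func() {
		<-sess.Done()
		log.Println("lost etcd session, stepping down")
		cancel()
	}()
	return work(lctx) // leader writes should also carry an epoch/fencing token
}
```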
7) Failures and how to handle them
The client took the lock but crashed before doing the work → normal case, nothing is harmed; the TTL/session will release it.
The lock expired in the middle of the work:
- A watchdog is mandatory: if renewal fails, abort the critical section and roll back/compensate.
- No "finish it later": without the lock, the critical section must not continue.
A long pause (GC/stop-the-world) → renewal did not happen and someone else took the lock. The worker must detect the loss of ownership (keepalive channel) and abort.
8) Deadlocks, priorities, and inversion
Deadlocks are rare in a distributed setting (there is usually a single lock), but if several locks are involved, stick to a single acquisition order (lock ordering), as sketched below.
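A sketch of canonical lock ordering; `acquire`/`release` are placeholders for your lock client, not a specific API:

```go
package example

import (
	"context"
	"sort"
)

// withLocks takes all requested locks in one global order (sorted by name),
// runs fn, and releases them in reverse order.
func withLocks(ctx context.Context, names []string,
	acquire func(context.Context, string) error,
	release func(string),
	fn func(context.Context) error) error {

	sorted := append([]string(nil), names...)
	sort.Strings(sorted) // single global acquisition order prevents cyclic waits
	for i, name := range sorted {
		if err := acquire(ctx, name); err != nil {
			for j := i - 1; j >= 0; j-- { // roll back the locks already taken
				release(sorted[j])
			}
			return err
		}
	}
	defer func() {
		for i := len(sorted) - 1; i >= 0; i-- { // release in reverse order
			release(sorted[i])
		}
	}()
	return fn(ctx)
}
```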
Priority inversion: a low-priority owner holds the resource while high-priority workers wait. Solutions: TTL limits, preemption (if the business allows it), sharding of the resource.
Starvation: use ordered wait queues (ZK sequential nodes) for fairness.
9) Observability
Metrics:
- `lock_acquire_total{status=ok|timeout|error}`
- `lock_hold_seconds{p50,p95,p99}`
- `fencing_token_value` (monotonicity)
- `lease_renew_fail_total`
- `split_brain_prevented_total` (attempts rejected due to lack of quorum)
- `preemptions_total`, `wait_queue_len`
Log fields/labels:
- `lock_name`, `owner_id`, `token`, `ttl`, `attempt`, `wait_time_ms`, `path` (for ZK), `mod_revision` (etcd).
Tracing:
- "acquire → critical section → release" spans with the outcome.
Alerts:
- Growth of `lease_renew_fail_total`.
- `lock_hold_seconds{p99}` > SLO.
- "Orphan" locks (no heartbeat).
- Bloated wait queues.
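A minimal sketch of how the first few metrics above might be declared with Prometheus client_golang (an assumed choice of metrics library):

```go
package example

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	lockAcquireTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "lock_acquire_total",
		Help: "Lock acquisition attempts by outcome.",
	}, []string{"status"}) // ok | timeout | error

	lockHoldSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "lock_hold_seconds",
		Help:    "How long locks are held.",
		Buckets: prometheus.DefBuckets,
	})

	leaseRenewFailTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "lease_renew_fail_total",
		Help: "Failed lease renewals.",
	})
)

func init() {
	prometheus.MustRegister(lockAcquireTotal, lockHoldSeconds, leaseRenewFailTotal)
}
```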
10) Case studies
10.1 Safe Redis lock with fencing (pseudocode)
1. Keep the token counter in a reliable store (for example, Postgres/etcd).
2. If `SET NX PX` succeeds, read/increment the token and perform all changes to the resource with the token checked by the database/service.
```python
# acquire
token = db.next_token("locks/orders/42")  # monotonic counter
ok = redis.set("locks:orders:42", owner, nx=True, px=ttl_ms)
if not ok:
    raise Busy()

# critical op guarded by the token
db.exec("UPDATE orders SET ... WHERE id = :id AND :token > last_token", ...)

# release (compare owner before deleting)
```
10.2 etcd Mutex + watchdog (Go)
```go
ctx, cancel := context.WithCancel(context.Background())
sess, _ := concurrency.NewSession(cli, concurrency.WithTTL(10))
m := concurrency.NewMutex(sess, "/locks/job/cleanup")
if err := m.Lock(ctx); err != nil { /* ... */ }

// Watchdog: abort the work as soon as the session is lost
go func() {
	<-sess.Done() // session/quorum lost
	cancel()      // stop the work
}()

doCritical(ctx) // must respond to ctx.Done()
_ = m.Unlock(context.Background())
_ = sess.Close()
```
10.3 Leader election in ZK (Java, Curator)
```java
LeaderSelector selector = new LeaderSelector(client, "/leaders/cron", listener);
selector.autoRequeue();
selector.start(); // listener.takeLeadership() with try-finally and heartbeat
```
10.4 Postgres advisory lock with a deadline (SQL + app)
```sql
SELECT pg_try_advisory_lock(128765); -- non-blocking attempt
-- if false → retry with backoff + jitter
```
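A sketch of the application side, assuming `database/sql` with a registered Postgres driver; advisory locks are per session, hence the pinned connection:

```go
package example

import (
	"context"
	"database/sql"
	"math/rand"
	"time"
)

// withAdvisoryLock retries pg_try_advisory_lock with backoff + jitter until
// the context's deadline, runs fn, then unlocks on the same connection.
func withAdvisoryLock(ctx context.Context, db *sql.DB, key int64, fn func(context.Context) error) error {
	conn, err := db.Conn(ctx) // pin one connection: advisory locks are session-scoped
	if err != nil {
		return err
	}
	defer conn.Close()

	backoff := 50 * time.Millisecond
	for {
		var got bool
		if err := conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", key).Scan(&got); err != nil {
			return err
		}
		if got {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline reached while waiting for the lock
		case <-time.After(backoff + time.Duration(rand.Int63n(int64(backoff)))): // backoff + jitter
		}
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
	defer conn.ExecContext(context.Background(), "SELECT pg_advisory_unlock($1)", key)
	return fn(ctx)
}
```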
11) Test playbooks (Game Days)
Quorum loss: shut down 1-2 etcd nodes → an attempt to take the lock must fail.
GC pause/stop-the-world: artificially pause the owner's thread → check that the watchdog aborts the work.
Split-brain: emulate a network partition between the owner and the lock service → the new owner gets a higher fencing token, the old one is rejected by the sink.
Clock skew/drift: skew the owner's clock (relevant for Redis/leases) → make sure correctness is still ensured by tokens/checks.
Crash before release: kill the process → the lock is released via TTL/session expiry.
12) Anti-patterns
A plain TTL lock without fencing when accessing an external resource.
Relying on local time for correctness (no HLC/fencing).
Handing out locks through a single Redis master in a failover environment without replica acknowledgement.
An unbounded critical section (a TTL "for the ages").
Removing someone else's lock without checking `owner_id`/token.
No backoff + jitter → a storm of retry attempts.
A single global lock "for everything" - a contention hotspot; shard by key instead.
13) Implementation checklist
- The resource type is defined; checked whether CAS/queues/idempotency can replace the lock.
- Mechanism selected: etcd/ZK/Consul for CP; Redis/cache only with fencing.
- Implemented: `owner_id`, TTL + renewal, watchdog, correct unlock.
- The external resource checks the fencing token (monotonicity).
- There is a leader-election strategy and failover.
- Metrics, alerts, and logging of tokens and revisions are configured.
- Backoff + jitter and acquire timeouts are in place.
- Game days held: quorum loss, split-brain, GC pauses, clock skew.
- The procedure for taking multiple locks is documented (if required).
- A degradation (brownout) plan exists: what to do when the lock service is unavailable.
14) FAQ
Q: Is a Redis `SET NX PX` lock enough?
A: Only if the resource checks a fencing token. Otherwise, under a network partition, two owners are possible.
Q: What should I choose by default?
A: For strict guarantees - etcd/ZooKeeper/Consul (CP). For simple tasks inside one database - Postgres advisory locks. Redis - only with fencing.
Q: What TTL should I set?
A: `TTL ≥ p99 critical-section duration × 2`, yet short enough to clear "zombies" quickly. Renew every `TTL/3`.
Q: How do I avoid starvation?
A: An ordered wait queue (ZK sequential nodes) or a fairness algorithm; a retry limit and fair scheduling.
Q: Do I need time synchronization?
A: For correctness - no (use fencing). For operational predictability - yes (NTP/PTP), but do not rely on wall-clock time in the lock logic.
15) Summary
Reliable distributed locks are built on quorum systems (etcd/ZK/Consul) with lease + keepalive and must be complemented by a fencing token at the level of the resource being modified. Any TTL/Redis approach without fencing is a split-brain risk. Think first about causality and idempotence, use locks only where you truly cannot do without them, measure, and test failure modes - then your "critical sections" will stay critical only in meaning, not in the number of incidents.