In our previous deep dive, we looked at how Fintech systems defend against fraud in milliseconds. But even in a "legitimate" system, a different kind of chaos lurks: Race Conditions.
Imagine two microservices simultaneously trying to update a user's wallet balance, or two workers trying to process the exact same expensive report. In a single-threaded environment, a simple mutex solves this. In a distributed system with hundreds of nodes, a local lock is useless. You need a global "source of truth" for who owns a resource.
This is the domain of Distributed Locking. Today, we deconstruct one of the most popular (and controversial) implementations: Redlock.
1. The Single-Node Dilemma
Before we look at the distributed version, we must understand how a basic lock works in Redis. For years, the standard approach was the SETNX (Set if Not eXists) command followed by a separate EXPIRE — a pattern with a subtle flaw: if the client crashed between the two commands, the lock would never expire.
The Modern Atomic Command
The correct way to acquire a lock in a single Redis instance is:
```
SET resource_name my_unique_identifier NX PX 30000
```

- NX: Only set the key if it doesn't already exist.
- PX 30000: Set an expiry of 30,000 milliseconds (TTL).
- my_unique_identifier: This must be unique across all clients, so that only the client that acquired the lock can release it.
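The semantics of that single command can be sketched in pure Python. This is a minimal simulation, not real Redis: `MiniRedis` and `acquire_lock` are illustrative names standing in for a Redis client issuing `SET ... NX PX`.

```python
import time
import uuid

class MiniRedis:
    """Toy in-memory store illustrating SET ... NX PX semantics (not real Redis)."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_px(self, key, value, ttl_ms):
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # key exists and is unexpired: NX fails
        self._data[key] = (value, now + ttl_ms / 1000.0)
        return True

def acquire_lock(store, resource, ttl_ms=30_000):
    """Try to take the lock; returns the unique token on success, None otherwise."""
    token = str(uuid.uuid4())  # unique per client, needed for safe release later
    if store.set_nx_px(resource, token, ttl_ms):
        return token
    return None

store = MiniRedis()
first = acquire_lock(store, "report:42")   # succeeds, returns a token
second = acquire_lock(store, "report:42")  # fails while the lock is held
print(first is not None, second)           # True None
```

Note that the token, not the client's identity, is what proves ownership — this becomes important in the unlock step below.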
The Fatal Flaw: This works perfectly until the Redis node crashes. If the node fails before the lock expires and you fail over to a replica, the replica might not have received the lock key yet due to asynchronous replication. You now have two clients holding the "exclusive" lock.
2. The Redlock Algorithm
To solve the single-point-of-failure problem, Salvatore Sanfilippo (Antirez, the creator of Redis) proposed Redlock. The core idea is to use N independent Redis masters (usually 5) that do not share any data.
The Acquisition Process
To acquire the lock, a client performs the following steps:
1. Get Current Time: Record the start time in milliseconds.
2. Sequential Acquisition: Try to acquire the lock in all N instances using the same key and unique random value. The client uses a small timeout (e.g., 5-50ms) for each request to avoid getting stuck on a crashed node.
3. Calculate Elapsed Time: Subtract the start time from the current time.
4. Quorum Check: The client is considered to have acquired the lock only if:
   - It acquired the lock from a majority of nodes (at least 3 out of 5).
   - The total time elapsed is less than the lock validity time.
5. Adjust TTL: The effective lock time is the initial TTL minus the time spent acquiring it.

If the client fails to acquire the majority, it must immediately send an Unlock Script to all instances (even the ones it thinks it failed to lock).
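The steps above can be sketched as follows. This is an illustrative simulation under simplifying assumptions: each "node" is a plain dict standing in for an independent Redis master, and `try_lock_node` / `redlock_acquire` are hypothetical names, not a real client library.

```python
import time
import uuid

def try_lock_node(node, resource, token, ttl_ms, now):
    """Simulate SET resource token NX PX ttl_ms on one node."""
    entry = node.get(resource)
    if entry is not None and entry[1] > now:
        return False
    node[resource] = (token, now + ttl_ms / 1000.0)
    return True

def redlock_acquire(nodes, resource, ttl_ms=30_000):
    token = str(uuid.uuid4())            # same random value sent to every node
    start = time.monotonic()
    locked = sum(try_lock_node(n, resource, token, ttl_ms, time.monotonic())
                 for n in nodes)         # step 2: try all N instances
    elapsed_ms = (time.monotonic() - start) * 1000  # step 3
    validity_ms = ttl_ms - elapsed_ms    # step 5: adjust the effective TTL
    # step 4: majority AND acquisition finished within the validity window
    if locked >= len(nodes) // 2 + 1 and validity_ms > 0:
        return token, validity_ms
    for n in nodes:                      # failed: unlock on ALL instances
        if n.get(resource, (None,))[0] == token:
            del n[resource]
    return None, 0

nodes = [{} for _ in range(5)]
token, validity = redlock_acquire(nodes, "job:1")
print(token is not None)  # True: quorum of 5/5 acquired
```

A real implementation would also subtract a clock-drift allowance from the validity window and use per-request timeouts, as the algorithm prescribes.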
3. The "Unlock" Safety
Releasing a lock isn't as simple as DEL resource_name. If a client hangs for 31 seconds and then tries to delete the lock, it might be deleting a lock that was just acquired by someone else.
The release must be atomic, typically using a Lua script:
```
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
```

This ensures the client only deletes the lock if the unique_identifier matches what they originally set.
4. The Great Debate: Kleppmann vs. Antirez
Redlock is famous in the system design world because of a rigorous critique by distributed systems researcher Martin Kleppmann. He argued that Redlock is fundamentally unsafe for systems where correctness is paramount.
Problem A: Clock Drift
Redlock relies on the assumption that clocks across all nodes tick at the same rate. If one node's clock jumps forward significantly, it might expire a lock prematurely, allowing another client to grab it while the first still thinks they own it.
Problem B: The "Stop-The-World" GC Pause
Imagine this timeline:
Client 1 acquires the Redlock.
Client 1 enters a long Garbage Collection (GC) pause.
The lock expires on the Redis nodes.
Client 2 acquires the same Redlock.
Client 1 wakes up from GC and performs the "exclusive" write.
Result: Mutual exclusion is broken.
The Solution: Fencing Tokens
Kleppmann suggests that for a lock to be truly safe, the storage layer must support Fencing Tokens. Every time a lock is granted, it comes with an incrementing ID. The database (like Postgres or Cassandra) must check that the token is still valid and has not been superseded by a higher ID before committing a write.
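The fencing check lives in the storage layer, not the lock service. Here is a minimal sketch of the idea with hypothetical names (`FencedStore` is a stand-in for whatever database performs the write):

```python
class FencedStore:
    """Toy store that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self._data = {}
        self._highest_token = {}  # key -> largest token accepted so far

    def write(self, key, value, token):
        if token < self._highest_token.get(key, 0):
            return False  # stale client (e.g. one that woke up after a GC pause)
        self._highest_token[key] = token
        self._data[key] = value
        return True

store = FencedStore()
store.write("balance", 100, token=33)      # client 1 holds the lock, token 33
store.write("balance", 90, token=34)       # lock expired; client 2 got token 34
ok = store.write("balance", 70, token=33)  # client 1 wakes up from its GC pause
print(ok)  # False: the stale write is fenced off
```

The key property is that tokens are issued in monotonically increasing order by the lock service, so the store can detect supersession without knowing anything about locks or timeouts.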
5. When to use Redlock vs. Zookeeper/Etcd
If you need a distributed lock, you generally choose between two philosophies:
| Feature | Redis (Redlock) | Zookeeper / Etcd |
| --- | --- | --- |
| Philosophy | Performance & Availability | Strict Consistency (CP) |
| Mechanism | TTL-based expiration | Sessions / Ephemeral Nodes |
| Failure Detection | Wait for TTL to expire | Session timeout (Heartbeat) |
| Performance | Extremely High | Moderate |
| Best For | Task scheduling, rate limiting | Distributed Config, Leader Election |
The Rule of Thumb: Use Redlock for "Liveness" (preventing duplicate work). Use Zookeeper or Etcd for "Correctness" (preventing data corruption).
Summary: Designing for Concurrency
Distributed locking is a trade-off. Redlock provides a high-performance, fault-tolerant way to manage resources across a cluster, but it is not a silver bullet.
- Always use a unique ID to release locks.
- Keep lock duration short to minimize the impact of client failures.
- Assume the lock might fail: implement idempotency or fencing tokens at the database level for mission-critical operations.
References & Further Reading
Redis.io: Distributed Locks with Redis - The original documentation of the Redlock algorithm.
Martin Kleppmann: How to do Distributed Locking - The famous critique that every system designer should read.
Antirez: Is Redlock Safe? - Salvatore's detailed response to the Kleppmann critique.
Zookeeper Recipes: Distributed Locks - Understanding the alternative "Sequential Znode" approach.
The Jepsen Tests: Redis - Analytical tests on Redis's consistency and partition tolerance.