In our previous deep dive, we looked at how Fintech systems defend against fraud in milliseconds. But even in a "legitimate" system, a different kind of chaos lurks: Race Conditions.
Imagine two microservices simultaneously trying to update a user's wallet balance, or two workers trying to process the exact same expensive report. In a single-threaded environment, a simple mutex solves this. In a distributed system with hundreds of nodes, a local lock is useless. You need a global "source of truth" for who owns a resource.
This is the domain of Distributed Locking. Today, we deconstruct one of the most popular (and controversial) implementations: Redlock.
1. The Single-Node Dilemma
Before we look at the distributed version, we must understand how a basic lock works in Redis. For years, the standard approach was the SETNX (Set if Not eXists) command followed by a separate EXPIRE — a pattern with a subtle flaw: if the client crashed between the two commands, the lock would never expire.
The Modern Atomic Command
The correct way to acquire a lock in a single Redis instance is:
```
SET resource_name my_unique_identifier NX PX 30000
```

- NX: Only set the key if it doesn't already exist.
- PX 30000: Set an expiry of 30,000 milliseconds (TTL).
- my_unique_identifier: This must be unique across all clients, so that only the client that acquired the lock can release it.
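The semantics of that single command can be sketched in pure Python. This is a minimal simulation, not real Redis: `MiniRedis` and `acquire_lock` are illustrative names standing in for a Redis client issuing `SET ... NX PX`.

```python
import time
import uuid

class MiniRedis:
    """Toy in-memory store illustrating SET ... NX PX semantics (not real Redis)."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_px(self, key, value, ttl_ms):
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # key exists and is unexpired: NX fails
        self._data[key] = (value, now + ttl_ms / 1000.0)
        return True

def acquire_lock(store, resource, ttl_ms=30_000):
    """Try to take the lock; returns the unique token on success, None otherwise."""
    token = str(uuid.uuid4())  # unique per client, needed for safe release later
    if store.set_nx_px(resource, token, ttl_ms):
        return token
    return None

store = MiniRedis()
first = acquire_lock(store, "report:42")   # succeeds, returns a token
second = acquire_lock(store, "report:42")  # fails while the lock is held
print(first is not None, second)           # True None
```

Note that the token, not the client's identity, is what proves ownership — this becomes important in the unlock step below.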
The Fatal Flaw: This works perfectly until the Redis node crashes. If the node fails before the lock expires and you fail over to a replica, the replica might not have received the lock key yet due to asynchronous replication. You now have two clients holding the "exclusive" lock.
2. The Redlock Algorithm
To solve the single-point-of-failure problem, Salvatore Sanfilippo (Antirez, the creator of Redis) proposed Redlock. The core idea is to use N independent Redis masters (usually 5) that do not share any data.
The Acquisition Process
To acquire the lock, a client performs the following steps:
1. Get Current Time: Record the start time in milliseconds.
2. Sequential Acquisition: Try to acquire the lock in all N instances using the same key and unique random value. The client uses a small timeout (e.g., 5-50ms) for each request to avoid getting stuck on a crashed node.
3. Calculate Elapsed Time: Subtract the start time from the current time.
4. Quorum Check: The client is considered to have acquired the lock only if:
   - It acquired the lock from a majority of nodes (at least 3 out of 5).
   - The total time elapsed is less than the lock validity time.
5. Adjust TTL: The effective lock time is the initial TTL minus the time spent acquiring it.

If the client fails to acquire the majority, it must immediately send an Unlock Script to all instances (even the ones it thinks it failed to lock).
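The steps above can be sketched as follows. This is an illustrative simulation under simplifying assumptions: each "node" is a plain dict standing in for an independent Redis master, and `try_lock_node` / `redlock_acquire` are hypothetical names, not a real client library.

```python
import time
import uuid

def try_lock_node(node, resource, token, ttl_ms, now):
    """Simulate SET resource token NX PX ttl_ms on one node."""
    entry = node.get(resource)
    if entry is not None and entry[1] > now:
        return False
    node[resource] = (token, now + ttl_ms / 1000.0)
    return True

def redlock_acquire(nodes, resource, ttl_ms=30_000):
    token = str(uuid.uuid4())            # same random value sent to every node
    start = time.monotonic()
    locked = sum(try_lock_node(n, resource, token, ttl_ms, time.monotonic())
                 for n in nodes)         # step 2: try all N instances
    elapsed_ms = (time.monotonic() - start) * 1000  # step 3
    validity_ms = ttl_ms - elapsed_ms    # step 5: adjust the effective TTL
    # step 4: majority AND acquisition finished within the validity window
    if locked >= len(nodes) // 2 + 1 and validity_ms > 0:
        return token, validity_ms
    for n in nodes:                      # failed: unlock on ALL instances
        if n.get(resource, (None,))[0] == token:
            del n[resource]
    return None, 0

nodes = [{} for _ in range(5)]
token, validity = redlock_acquire(nodes, "job:1")
print(token is not None)  # True: quorum of 5/5 acquired
```

A real implementation would also subtract a clock-drift allowance from the validity window and use per-request timeouts, as the algorithm prescribes.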
3. The "Unlock" Safety
Releasing a lock isn't as simple as DEL resource_name. If a client hangs for 31 seconds and then tries to delete the lock, it might be deleting a lock that was just acquired by someone else.
The release must be atomic, typically using a Lua script:
```
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
```

This ensures the client only deletes the lock if the unique_identifier matches what they originally set.
4. The Great Debate: Kleppmann vs. Antirez
Redlock is famous in the system design world because of a rigorous critique by distributed systems researcher Martin Kleppmann. He argued that Redlock is fundamentally unsafe for systems where correctness is paramount.
Problem A: Clock Drift
Redlock relies on the assumption that clocks across all nodes tick at the same rate. If one node's clock jumps forward significantly, it might expire a lock prematurely, allowing another client to grab it while the first still thinks they own it.
Problem B: The "Stop-The-World" GC Pause
Imagine this timeline:
Client 1 acquires the Redlock.
Client 1 enters a long Garbage Collection (GC) pause.
The lock expires on the Redis nodes.
Client 2 acquires the same Redlock.
Client 1 wakes up from GC and performs the "exclusive" write.
Result: Mutual exclusion is broken.
The Solution: Fencing Tokens
Kleppmann suggests that for a lock to be truly safe, the storage layer must support Fencing Tokens. Every time a lock is granted, it comes with an incrementing ID. The database (like Postgres or Cassandra) must check that the token is still valid and has not been superseded by a higher ID before committing a write.
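The fencing check lives in the storage layer, not the lock service. Here is a minimal sketch of the idea with hypothetical names (`FencedStore` is a stand-in for whatever database performs the write):

```python
class FencedStore:
    """Toy store that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self._data = {}
        self._highest_token = {}  # key -> largest token accepted so far

    def write(self, key, value, token):
        if token < self._highest_token.get(key, 0):
            return False  # stale client (e.g. one that woke up after a GC pause)
        self._highest_token[key] = token
        self._data[key] = value
        return True

store = FencedStore()
store.write("balance", 100, token=33)      # client 1 holds the lock, token 33
store.write("balance", 90, token=34)       # lock expired; client 2 got token 34
ok = store.write("balance", 70, token=33)  # client 1 wakes up from its GC pause
print(ok)  # False: the stale write is fenced off
```

The key property is that tokens are issued in monotonically increasing order by the lock service, so the store can detect supersession without knowing anything about locks or timeouts.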
5. When to use Redlock vs. Zookeeper/Etcd
If you need a distributed lock, you generally choose between two philosophies:
| Feature | Redis (Redlock) | Zookeeper / Etcd |
| --- | --- | --- |
| Philosophy | Performance & Availability | Strict Consistency (CP) |
| Mechanism | TTL-based expiration | Sessions / Ephemeral Nodes |
| Failure Detection | Wait for TTL to expire | Session timeout (Heartbeat) |
| Performance | Extremely High | Moderate |
| Best For | Task scheduling, rate limiting | Distributed Config, Leader Election |
The Rule of Thumb: Use Redlock for "Liveness" (preventing duplicate work). Use Zookeeper or Etcd for "Correctness" (preventing data corruption).
Summary: Designing for Concurrency
Distributed locking is a trade-off. Redlock provides a high-performance, fault-tolerant way to manage resources across a cluster, but it is not a silver bullet.
- Always use a unique ID to release locks.
- Keep lock duration short to minimize the impact of client failures.
- Assume the lock might fail: implement idempotency or fencing tokens at the database level for mission-critical operations.
References & Further Reading
Redis.io: Distributed Locks with Redis - The original documentation of the Redlock algorithm.
Martin Kleppmann: How to do Distributed Locking - The famous critique that every system designer should read.
Antirez: Is Redlock Safe? - Salvatore's detailed response to the Kleppmann critique.
Zookeeper Recipes: Distributed Locks - Understanding the alternative "Sequential Znode" approach.
The Jepsen Tests: Redis - Analytical tests on Redis's consistency and partition tolerance.