Many distributed systems require coordination among processes. A replicated database may need to ensure that only one replica processes a particular update at a time. A distributed file system may need to elect a primary server that coordinates access to shared files. A cluster manager may need to designate a single scheduler that assigns work to nodes.
In a single machine, these problems are solved with local mechanisms provided by the operating system: locks, semaphores, and monitors for mutual exclusion; simple elections based on process IDs for choosing a coordinator.
These mechanisms rely on shared memory and reliable communication within a single address space. In a distributed system, where processes do not share memory and communicate over unreliable networks, we need different algorithms.
Mutual exclusion coordinates access to a resource. Leader election selects a single process to play a coordinating role.
We will examine two fundamental coordination problems: distributed mutual exclusion and leader election. We will study several algorithms for each problem, analyzing their message complexity, fault tolerance, and tradeoffs. These algorithms illustrate key techniques that appear throughout distributed systems: using logical timestamps to order events, using tokens to serialize access, and using unique identifiers to break symmetry.
Distributed Mutual Exclusion
When multiple processes share a resource, they need a way to coordinate access. In a single machine, this is solved with locks, semaphores, or monitors. In a distributed system, where processes do not share memory, we need distributed algorithms for mutual exclusion.
The Problem
Distributed mutual exclusion ensures that at most one process is in the critical section at any time. The algorithms must satisfy three properties:
Safety (mutual exclusion) requires that if one process is in the critical section, no other process is.
Liveness (progress) requires that if a process requests the critical section and no process holds it forever, the requester eventually enters.
Fairness (bounded waiting) requires that there exists a bound on the number of times other processes may enter the critical section before a waiting process is granted access. Many algorithms enforce FIFO ordering to achieve this.
Centralized Approach
The simplest approach designates one process as the coordinator.
- To enter the critical section, a process sends a request to the coordinator.
- The coordinator grants access if the critical section is free by responding with a grant message. Otherwise, it queues the request.
- When a process exits the critical section, it notifies the coordinator by sending it a release message, and the coordinator grants access to the next queued requester.
This approach requires only three messages per critical section entry (request, grant, release) and is easy to implement. The drawback is that the coordinator is a single point of failure: if it crashes, no one can enter the critical section. The algorithm itself does not tolerate coordinator failure, and recovery requires a separate leader election mechanism. The coordinator can also become a performance bottleneck under high load.
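The coordinator's logic fits in a few lines. The class below is a minimal, single-threaded sketch of the protocol (message passing, timeouts, and failure handling are deliberately omitted; the names are illustrative):

```python
from collections import deque

class Coordinator:
    """Sketch of a centralized mutual-exclusion coordinator."""
    def __init__(self):
        self.holder = None    # process currently in the critical section
        self.queue = deque()  # waiting requesters, served FIFO

    def request(self, pid):
        """Handle a REQUEST; return True if a GRANT is sent immediately."""
        if self.holder is None:
            self.holder = pid        # critical section free: grant now
            return True
        self.queue.append(pid)       # otherwise queue the request
        return False

    def release(self, pid):
        """Handle a RELEASE; return the pid granted next, if any."""
        assert self.holder == pid, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder
```

The FIFO queue is what provides bounded waiting: a requester is overtaken at most by those already queued ahead of it.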
Lamport’s Algorithm
Leslie Lamport proposed a fully distributed algorithm in his 1978 paper “Time, Clocks, and the Ordering of Events in a Distributed System” (the same one that introduced Lamport timestamps). The algorithm uses Lamport timestamps to order requests.
Each process maintains a request queue ordered by timestamp. Ties are broken using process IDs to ensure a total order.
- To enter the critical section, a process timestamps its request and sends it to all other processes.
- When a process receives a request, it adds it to its queue and sends an acknowledgment. A process may enter the critical section when its request is at the head of its queue and it has received acknowledgments from all other processes for its request.
- When exiting, a process sends a release message to all other processes, which remove that request from their queues.
The algorithm requires 3(N-1) messages per critical section entry: N-1 requests, N-1 acknowledgments, and N-1 releases. It guarantees fairness by imposing a total ordering on requests via logical timestamps.
However, although it has no single coordinator, it assumes all processes are correct and responsive. If one process crashes and does not send acknowledgments, others may block indefinitely.
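A small simulation makes the entry condition concrete. In the sketch below, message delivery is modeled as direct method calls, which assumes instant, reliable, in-order delivery; the class and method names are illustrative, not a production design:

```python
import heapq

class LamportProcess:
    """One participant in a simulated run of Lamport's mutex algorithm."""
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.queue = []    # request queue of (timestamp, pid), a min-heap
        self.acks = set()  # pids that acknowledged our current request
        self.peers = []    # all other processes

    def _tick(self, seen=0):
        self.clock = max(self.clock, seen) + 1

    def request(self):
        """Timestamp a request and broadcast it to all other processes."""
        self._tick()
        self.req = (self.clock, self.pid)  # pid breaks timestamp ties
        heapq.heappush(self.queue, self.req)
        self.acks = set()
        for p in self.peers:
            p.on_request(self.req, self)

    def on_request(self, req, sender):
        self._tick(req[0])
        heapq.heappush(self.queue, req)
        self._tick()
        sender.on_ack(self.clock, self.pid)  # acknowledge immediately

    def on_ack(self, ts, pid):
        self._tick(ts)
        self.acks.add(pid)

    def can_enter(self):
        """Enter when our request heads the queue and all peers acked."""
        return (bool(self.queue) and self.queue[0] == self.req
                and self.acks == {p.pid for p in self.peers})

    def release(self):
        """Exit the critical section and tell everyone to dequeue us."""
        self.queue.remove(self.req)
        heapq.heapify(self.queue)
        for p in self.peers:
            p.on_release(self.req)

    def on_release(self, req):
        self.queue.remove(req)
        heapq.heapify(self.queue)
```

Because every process sorts requests by (timestamp, pid), all queues agree on the same head, which is what makes the entry rule safe.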
Ricart-Agrawala Algorithm
Glenn Ricart and Ashok Agrawala observed in 1981 that Lamport’s release messages are redundant. Their optimization eliminates them by merging the release with the acknowledgment.
- When a process wants to enter the critical section, it timestamps its request and sends it to all others.
- When a process receives a request, it sends a reply immediately if it does not want the critical section itself, or if it wants it but the requester's timestamp is smaller (with ties broken by process ID). Otherwise, it defers the reply until it exits the critical section.
- A process enters when it has received replies from all others. Upon exiting, it sends deferred replies to all waiting processes.
This reduces the message count to 2(N-1) per entry: N-1 requests and N-1 replies. The algorithm is optimal among permission-based algorithms that require consent from every other process in a fully connected network.
Like Lamport’s algorithm, it has no central coordinator but assumes all processes respond. A crashed process that never replies can block progress.
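The reply-or-defer decision is the heart of the algorithm. The sketch below models messages as direct method calls (assuming reliable delivery and no failures); the deferred set carries exactly the information that Lamport's explicit release messages carried:

```python
class RAProcess:
    """One participant in a simulated Ricart-Agrawala run."""
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.peers = []        # all other processes
        self.requesting = False
        self.replies = set()   # pids that have replied to our request
        self.deferred = []     # requesters we will reply to on exit

    def request(self):
        """Timestamp a request and send it to all other processes."""
        self.clock += 1
        self.my_ts = self.clock
        self.requesting = True
        self.replies = set()
        for p in self.peers:
            p.on_request(self.my_ts, self)

    def on_request(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        if self.requesting and (self.my_ts, self.pid) < (ts, sender.pid):
            self.deferred.append(sender)  # our request has priority: defer
        else:
            sender.on_reply(self.pid)     # reply immediately

    def on_reply(self, pid):
        self.replies.add(pid)

    def can_enter(self):
        """Enter once every other process has replied."""
        return self.requesting and self.replies == {p.pid for p in self.peers}

    def release(self):
        """Exit the critical section and send all deferred replies."""
        self.requesting = False
        for p in self.deferred:
            p.on_reply(self.pid)
        self.deferred = []
```

A deferred reply does double duty: it is both the permission grant and the signal that the deferring process has left the critical section.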
Token Ring Algorithm
An entirely different approach uses a token that circulates among processes. Only the process holding the token may enter the critical section.
Processes are organized in a logical ring. The token circulates continuously around the ring. When a process receives the token and wants the critical section, it enters it. When done, or if it does not want the section, it passes the token to its neighbor.
If no one wants the critical section, the algorithm generates a constant stream of token-passing messages, which wastes bandwidth. If many processes want it, they must wait for the token to circulate.
As long as there is exactly one token and the ring remains intact, bounded waiting is guaranteed. However, if the token is lost due to a process crash, a recovery mechanism must regenerate it. Care must be taken to avoid creating duplicate tokens, which would violate mutual exclusion.
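The circulation logic is simple enough to simulate in a few lines. This sketch models the token as a single index rather than a networked message, so token loss and regeneration are out of scope; each step either lets the holder enter the critical section or passes the token one hop clockwise:

```python
class TokenRing:
    """Token-ring mutual exclusion over n simulated processes (0..n-1)."""
    def __init__(self, n):
        self.n = n
        self.holder = 0     # process 0 starts with the single token
        self.wants = set()  # processes currently wanting the critical section

    def step(self):
        """One simulation step: the holder enters the critical section if
        it wants to (returning its pid); otherwise the token moves on."""
        if self.holder in self.wants:
            pid = self.holder
            self.wants.discard(pid)  # enter, then exit, the critical section
            return pid
        self.holder = (self.holder + 1) % self.n  # pass token clockwise
        return None
```

Waiting processes are served in ring order from the token's current position, which is the bounded-waiting guarantee; if the token were lost, step() would circulate forever with no entries, which is why real deployments need loss detection and regeneration.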
Comparing the Algorithms
Each mutual exclusion algorithm involves tradeoffs. The following table summarizes their characteristics for a system with N processes.
| Algorithm | Messages per Entry | Synchronization Delay | Fault Tolerance | Bottleneck |
|---|---|---|---|---|
| Centralized | 3 | 2 message delays | Coordinator failure blocks progress unless re-elected | Coordinator handles all requests |
| Lamport | 3(N-1) | 1 message delay | No single coordinator, but requires all processes to respond | None |
| Ricart-Agrawala | 2(N-1) | 1 message delay | No single coordinator, but requires all processes to respond | None |
| Token Ring | 1 to N-1 | 0 to N-1 message delays | Token loss requires recovery protocol | None, but continuous overhead |
Note that for the Token Ring algorithm, entry takes zero additional delay only if the process already holds the token.
Centralized is the simplest and most efficient in terms of messages, but the coordinator is a single point of failure and potential bottleneck. If the coordinator crashes, no process can enter the critical section until a new coordinator is elected. The coordinator can also become a performance bottleneck under high load. This approach works well when simplicity is important and you have a separate mechanism to quickly elect a new coordinator when needed.
Lamport’s algorithm is fully distributed with no single point of failure. It guarantees fairness (requests are served in timestamp order). The downside is high message overhead: every request requires communication with every other process. It also requires all processes to be available; if one process crashes and stops responding, others will wait forever for its acknowledgment.
Ricart-Agrawala improves on Lamport by reducing message count from 3(N-1) to 2(N-1). The properties are otherwise similar to Lamport’s: fully distributed, fair, but requires all processes to be available. This is a better choice than Lamport’s for distributed mutual exclusion when you need fairness and can assume all processes remain available.
Token ring has variable performance depending on where the token is and how much contention there is for it. When contention is high or the ring is large, a process might wait for the token to travel the entire ring. The token ring wastes bandwidth when no one wants the critical section (the token keeps circulating) and requires a recovery mechanism if the token is lost due to a process crash.
Leader Election
Many distributed systems designate a single process as the leader or coordinator. The leader might sequence operations, make decisions, or coordinate activities. When the current leader fails, the remaining processes must elect a new one.
The Bully Algorithm
The bully algorithm, proposed by Hector Garcia-Molina in 1982, assumes every process has a unique identifier (such as a process ID or IP address) and that processes can communicate directly with any other process. It also assumes a synchronous model with timeouts used for failure detection.
The algorithm earns its name because the process with the highest ID “bullies” its way to leadership. The protocol uses three message types.
- An ELECTION message announces a new election.
- An OK message acknowledges receipt of an election message and tells the sender that a higher-ID process is alive.
- A COORDINATOR message announces the winner.
When a process P notices the coordinator has failed or recovers from its own failure, it initiates an election:
- P sends ELECTION messages to all processes with higher IDs.
- If P receives no OK responses within a timeout, it declares itself the coordinator and sends COORDINATOR messages to all other processes.
- If P receives an OK response, it knows a higher-ID process will take over and waits for a COORDINATOR message.
When a process receives an ELECTION message from a lower-ID process, it sends an OK response and starts its own election if it has not already. When a process receives a COORDINATOR message, it records the new coordinator.
The bully algorithm guarantees that the highest-ID surviving process becomes the coordinator. The worst-case message complexity is O(n²), occurring when the lowest-ID process initiates the election. The best case is O(n), when the second-highest-ID process starts the election.
If processes frequently crash and recover, repeated elections may occur.
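A simulated round shows why the highest surviving ID always wins. The sketch below collapses the timeout-driven message exchange into recursion: a live higher-ID process "answering OK" is modeled by immediately escalating the election to it (the class and method names are illustrative):

```python
class BullyCluster:
    """Simulated bully election over processes with unique integer IDs."""
    def __init__(self, ids):
        self.alive = {pid: True for pid in ids}
        self.coordinator = max(ids)  # highest ID starts as coordinator

    def crash(self, pid):
        self.alive[pid] = False

    def elect(self, initiator):
        """Initiator sends ELECTION to all higher IDs; if none answers
        within the (simulated) timeout, it declares itself coordinator."""
        higher = [p for p, up in self.alive.items() if up and p > initiator]
        if not higher:
            self.coordinator = initiator  # no OK: broadcast COORDINATOR
        else:
            # In the real protocol every live higher process answers OK and
            # runs its own election; recursing on any one of them reaches
            # the same winner, the highest live ID.
            self.elect(min(higher))
        return self.coordinator
```

Each recursion level stands in for one round of ELECTION/OK messages, which is where the O(n²) worst case comes from when the lowest ID starts.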
Ring-Based Election
An alternative approach, proposed by Chang and Roberts in 1979, organizes processes in a logical ring.
With a single initiator, the message complexity is O(n). If several processes detect the failure and start elections concurrently, the worst case rises to O(n²).
When a process notices the coordinator has failed:
- It creates an election message containing its own ID and sends it clockwise around the ring.
- When a process receives an election message, it compares the ID in the message with its own.
  - If the received ID is larger, the process forwards the message unchanged.
  - If the received ID is smaller and the process has not yet sent an election message, it replaces the ID with its own and forwards the message.
  - If the received ID is smaller but the process has already sent an election message with its own ID, it discards the received message.
- When a process receives an election message containing its own ID, it knows its ID is the largest in the ring and it is the winner. It then sends an ELECTED message around the ring announcing itself as the new coordinator.
For a single initiator, the worst case is 3N-1 messages, occurring when the process immediately following the highest-ID process initiates the election. The election message travels nearly all the way around the ring before reaching the eventual winner, and the ELECTED message must then traverse the entire ring.
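A single-initiator run can be simulated to check that count. The function below models the forwarding rule directly (concurrent initiators, and therefore the discard rule, are not modeled); ids are listed in clockwise ring order:

```python
def ring_election(ids, start):
    """Simulate one ring election with a single initiator. Returns
    (winner, messages), counting both the circulating election messages
    and the final ELECTED announcement around the whole ring."""
    n = len(ids)
    carried = ids[start]      # ID carried by the election message
    i = (start + 1) % n       # first receiver, clockwise
    msgs = 1                  # the initiator's first send
    while ids[i] != carried:  # stop when the message returns to its owner
        carried = max(carried, ids[i])  # larger ID replaces the carried one
        i = (i + 1) % n
        msgs += 1
    msgs += n                 # ELECTED message traverses the whole ring
    return carried, msgs
```

Placing the initiator immediately after the highest ID reproduces the 3N-1 worst case: 2N-1 election messages plus N for the announcement.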
Considerations for the Real World
Real systems rarely implement these classical algorithms exactly. Raft integrates leader election with consensus to ensure safety even during network partitions. ZooKeeper’s ZAB protocol also has its own leader election mechanism.
The classical algorithms illustrate fundamental ideas: using unique identifiers to break symmetry, propagating information through the group, and eventually converging on a single leader. These concepts appear in various forms in production systems.
Summary
Distributed mutual exclusion algorithms ensure that only one process accesses a critical section at a time without shared memory. The centralized approach is simple but introduces a single point of failure. Lamport’s and Ricart-Agrawala are fully distributed but assume all processes remain responsive. Token-based approaches provide bounded waiting but require careful token management.
Leader election selects a single coordinator from a group of processes. The bully algorithm uses process IDs to determine the winner, with the highest-ID surviving process becoming the coordinator. Ring-based algorithms organize processes in a logical ring and circulate election messages, completing a single-initiator election in O(n) messages.
These coordination primitives, combined with the group communication and failure detection mechanisms from the previous lecture, form the foundation for building reliable distributed systems. In the next lecture, we will see how consensus algorithms like Raft build on these ideas to achieve agreement even in the face of failures and network partitions.