Group Communication, Mutual Exclusion, and Leader Election

Study Guide

Paul Krzyzanowski – 2026-02-14

Sending, Receiving, and Delivering

These distinctions apply to any protocol handling reliability or ordering, not just multicast.

Sending is the act of transmitting a message from an application through the communication layer to the network.

Receiving is the act of a machine accepting a message from the network. The message has arrived at the machine but is not yet visible to the application.

Delivering is the act of passing a received message to the application. This is when the application actually sees and processes the message.

Between receiving and delivering, the communication layer may take one of three actions:

  1. Deliver the message to the application immediately.

  2. Buffer the message and deliver it later, once ordering or reliability constraints are satisfied.

  3. Discard the message (for example, a duplicate or a message that can no longer be delivered in a valid order).

All ordering and reliability guarantees refer to the delivery order, not the receipt order.

IP Multicast

IP Multicast provides network-layer one-to-many delivery. It uses two protocols: IGMP handles communication between hosts and their local routers, while PIM handles the routing of multicast traffic between routers.

IGMP (Internet Group Management Protocol) operates between hosts and their directly connected routers. When a host wants to receive multicast traffic for a group, it sends an IGMP membership report (join message) to its local router. The router then ensures that multicast traffic for that group flows to that network segment. Routers periodically send IGMP queries to discover which groups have active members, and hosts respond with membership reports.
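The host side of IGMP is visible through the standard sockets API: when an application joins a multicast group, the operating system sends the IGMP membership report on its behalf. A minimal Python sketch of a receiver (the group address and port are arbitrary examples):

    import socket
    import struct

    GROUP = "239.1.2.3"   # administratively scoped multicast address (example)
    PORT = 5000

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Joining the group causes the OS to send an IGMP membership report
    # to the local router on our behalf.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    data, addr = sock.recvfrom(1500)   # receive one multicast datagram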

PIM (Protocol Independent Multicast) routes multicast traffic across the network. It is called “protocol independent” because it uses the existing unicast routing table rather than building its own. PIM operates in two modes that reflect different assumptions about how receivers are distributed.

PIM Dense Mode uses a flood-and-prune approach. The protocol initially floods multicast traffic to all routers, and routers with no interested receivers send prune messages upstream to stop receiving traffic. This mode is appropriate when most subnets have receivers interested in the traffic.

PIM Sparse Mode requires receivers to explicitly request traffic. The protocol uses a Rendezvous Point (RP) that serves as a meeting point for sources and receivers. When a host joins a multicast group, routers send Join messages toward the RP, building a shared distribution tree. Sources send their traffic to the RP, which then distributes it down the tree to all receivers. This means a source only needs to send one stream to the RP regardless of how many receivers exist. Sparse mode is appropriate when receivers are sparsely distributed across the network.

IP Multicast works well in controlled environments such as cable TV networks and data center trading systems. However, most ISPs block it at network boundaries due to traffic engineering complexity, billing challenges, and security concerns.

Application-Level Multicast

Production distributed systems implement application-level multicast over reliable unicast. Reliability and ordering are two independent dimensions that can be combined as needed.

Reliability Levels

Unreliable multicast provides best-effort delivery with no guarantees. Messages may be lost, duplicated, or delivered to only some recipients.

Best-effort reliable multicast guarantees that if the sender completes without crashing, all live recipients receive the message. It also guarantees no duplication and no spurious messages.

The implementation uses timeouts and retransmission: the sender transmits to all recipients using reliable unicast (such as TCP), waits for acknowledgments, and retransmits if an acknowledgment does not arrive within the timeout period. However, this approach does not guarantee consistency if the sender crashes mid-transmission, and messages do not survive system restarts.
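A minimal sketch of that sender loop, assuming a hypothetical application-level protocol in which each recipient replies with the bytes b"ACK" over its TCP connection:

    import socket

    def best_effort_multicast(message, recipients, timeout=2.0):
        """Send `message` to every (host, port) recipient over TCP; return
        the recipients that did not acknowledge so the caller can retry."""
        failed = []
        for host, port in recipients:
            try:
                with socket.create_connection((host, port), timeout=timeout) as s:
                    s.sendall(message)
                    s.settimeout(timeout)
                    if s.recv(3) != b"ACK":    # hypothetical app-level ack
                        failed.append((host, port))
            except OSError:
                failed.append((host, port))
        # If the sender crashes mid-loop, some recipients have the message
        # and some do not: exactly the consistency gap noted above.
        return failed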

Reliable multicast provides strong consistency guarantees even when the sender crashes during transmission. It guarantees three properties:

  1. Validity: if a correct process multicasts a message, that message is eventually delivered.

  2. Agreement: if any correct process delivers a message, all correct processes eventually deliver it.

  3. Integrity: every correct process delivers a given message at most once, and only if it was actually multicast.

Reliable multicast does not guarantee persistence across system restarts.

Durable multicast adds persistence to reliable multicast. Messages are written to stable storage before being acknowledged, so they survive crashes and restarts. Apache Kafka is an example of a system that provides durable multicast.

Publish-subscribe (pub/sub) systems apply group communication at scale. Publishers send to named topics; subscribers register interest in topics. The topic acts as a named multicast group. We cover this in detail later in the course.

Ordering Levels

Unordered delivery provides no guarantees about message sequence. Messages may arrive in any order at different recipients.

Single source FIFO ordering (SSF) guarantees that messages from the same sender are delivered in the order they were sent.

Formally: if a process sends multicast(G, m) and later sends multicast(G, m′), then every correct process that delivers m′ will have already delivered m.

The implementation uses per-sender sequence numbers: each sender maintains a counter and attaches it to each message, and receivers buffer messages until they can be delivered in sequence.
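A sketch of the receiver-side hold-back logic (names are illustrative):

    class FifoChannel:
        """Hold-back buffering for single-source FIFO delivery (sketch)."""
        def __init__(self):
            self.next_seq = {}    # sender -> sequence number expected next
            self.holdback = {}    # sender -> {seq: message} held out of order

        def receive(self, sender, seq, msg, deliver):
            self.next_seq.setdefault(sender, 1)
            buf = self.holdback.setdefault(sender, {})
            buf[seq] = msg
            # Deliver in sequence for as long as the expected message is here.
            while self.next_seq[sender] in buf:
                deliver(sender, buf.pop(self.next_seq[sender]))
                self.next_seq[sender] += 1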

Causal ordering guarantees that if message m1 happened-before message m2, then m1 is delivered before m2 at all processes.

Formally: if multicast(G, m) → multicast(G, m′), where → denotes happened-before, then every correct process that delivers m′ will have already delivered m. Causal ordering implies single-source FIFO ordering because messages from the same sender are causally related.

The implementation uses vector clocks (the same data structure from the logical clocks lecture). Each message carries the sender’s vector, and the receiver buffers messages until all causally preceding messages have been delivered.
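A sketch of the standard delivery test, assuming vector clocks are represented as dictionaries mapping process IDs to counters:

    def causally_deliverable(msg_vc, local_vc, sender):
        """Can a message stamped msg_vc, sent by `sender`, be delivered
        given local_vc, the vector of messages we have delivered so far?"""
        if msg_vc[sender] != local_vc[sender] + 1:
            return False    # we have not delivered the sender's prior message
        # We must already have delivered everything the sender had seen
        # from other processes when it sent this message.
        return all(msg_vc[p] <= local_vc[p] for p in msg_vc if p != sender)

Buffered messages are re-tested after every delivery, since delivering one message can unblock others.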

Total ordering guarantees that all processes deliver all messages in the same order.

Formally: if a correct process delivers m before m′, then every correct process that delivers both messages will deliver m before m′.

Total ordering does not imply causal or single source FIFO ordering; it only requires that everyone agrees on the same order. One implementation uses a central sequencer; another uses distributed agreement where processes propose and converge on sequence numbers.
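A sketch of the sequencer approach: the sequencer stamps each message with a global number, and every receiver delivers strictly in stamp order.

    class Sequencer:
        """Central sequencer: assigns a global number to each message."""
        def __init__(self):
            self.next = 0

        def stamp(self, msg):
            seq, self.next = self.next, self.next + 1
            return seq, msg    # the (seq, msg) pair is multicast to the group

    class TotalOrderReceiver:
        """Receivers hold back messages and deliver in sequence order."""
        def __init__(self):
            self.expected = 0
            self.holdback = {}

        def receive(self, seq, msg, deliver):
            self.holdback[seq] = msg
            while self.expected in self.holdback:
                deliver(self.holdback.pop(self.expected))
                self.expected += 1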

Synchronous ordering uses a sync primitive, a special operation (implemented with its own messages) that acts as a barrier. When a process issues a sync, it blocks until all in-flight messages have been delivered to all recipients.

Formally: if a process sends messages M, then issues a sync, then sends messages M′, every correct process will deliver all messages in M before delivering any message in M′.

This creates logical message groups or epochs with clean boundaries between them. The sync primitive is used for view changes to ensure all old-view messages are delivered before transitioning to a new view.

Real-time ordering would deliver messages in actual physical time order (mirroring exactly when they were sent). This is impossible to implement perfectly because clocks cannot be perfectly synchronized.

Atomic multicast combines reliable multicast with total ordering. It guarantees that all correct processes deliver the same set of messages in the same order. Some systems additionally require the total order to respect causality.

Failure Detection

Failure detection determines which processes have crashed. In asynchronous systems, perfect failure detection is impossible because a slow process is indistinguishable from a crashed one. This observation underlies the FLP impossibility result, which proves that consensus cannot be guaranteed in asynchronous systems where even one process might crash.

Real systems work around FLP by adding timeouts (accepting occasional false suspicions), assuming partial synchrony (eventual timing bounds), or using randomness.

False positives occur when a failure detector incorrectly suspects a live process has crashed. False negatives occur when a failure detector fails to detect that a process has actually crashed.

Heartbeat-based detection uses periodic messages to indicate liveness. There are two approaches:

  1. Push: each monitored process periodically sends a heartbeat message to the monitor.

  2. Pull (ping/ack): the monitor periodically sends a probe message to the process and expects a response.

A monitor that does not receive heartbeats within a timeout period suspects the process has failed. The choice of timeout involves a tradeoff: shorter timeouts detect failures faster but generate more false positives.
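A minimal sketch of the monitor side of a push-style detector (the names and the 3-second timeout are arbitrary):

    import time

    class HeartbeatMonitor:
        """Suspect a process if no heartbeat arrives within `timeout` seconds."""
        def __init__(self, timeout=3.0):
            self.timeout = timeout
            self.last_seen = {}

        def heartbeat(self, process_id):
            self.last_seen[process_id] = time.monotonic()

        def suspected(self, process_id):
            last = self.last_seen.get(process_id)
            return last is None or time.monotonic() - last > self.timeout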

The Phi Accrual Failure Detector

The phi accrual failure detector outputs a continuous suspicion level (φ) rather than a binary alive/dead judgment.

The detector maintains a sliding window of time gaps between consecutive heartbeats and learns what “normal” timing looks like for this particular connection. When a heartbeat is late, the detector calculates how surprising this silence is given the learned distribution.

The φ value is on a logarithmic scale: φ = k means roughly a 10⁻ᵏ probability that this delay is a normal variation. For example, φ = 3 means about a 0.1% (one in a thousand) chance that the process is still alive and the heartbeat is just delayed; φ = 8 means about a 0.000001% chance.

Applications choose a threshold based on their needs. A lower threshold means faster detection but more false positives. Apache Cassandra (a distributed database) uses this detector with a configurable threshold that defaults to 8. The key advantage is that it separates monitoring from decision-making: the detector provides suspicion information, and applications decide when to act.
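A sketch of the idea, with one loud simplification: it assumes heartbeat inter-arrival times are exponentially distributed, whereas real implementations such as Cassandra's fit a distribution to a sliding window of observed intervals. Under that assumption, the probability that a silence of length t is normal is e^(−t/mean), and φ is −log₁₀ of that probability:

    import math
    import time

    class PhiAccrualDetector:
        """Phi accrual failure detector (simplified sketch)."""
        def __init__(self, window=100):
            self.intervals = []     # sliding window of heartbeat gaps
            self.window = window
            self.last = None

        def heartbeat(self):
            now = time.monotonic()
            if self.last is not None:
                self.intervals.append(now - self.last)
                self.intervals = self.intervals[-self.window:]
            self.last = now

        def phi(self):
            if not self.intervals or self.last is None:
                return 0.0
            mean = sum(self.intervals) / len(self.intervals)
            t = time.monotonic() - self.last
            # P(silence this long | alive) = e^(-t/mean), so
            # phi = -log10(P) = t / (mean * ln 10)
            return t / (mean * math.log(10))

An application then acts only when the suspicion crosses its chosen threshold, e.g. if detector.phi() > 8, treat the node as down.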

Virtual Synchrony and Group Membership

Group membership maintains a consistent view of which processes are in a group. The membership is tracked collectively by all members rather than by a central server. Each process runs a group membership service (GMS) layer that monitors other members, participates in view change protocols, and notifies the application when membership changes.

A view is a snapshot of group membership containing a unique identifier (typically a monotonically increasing number) and a list of member processes. All processes in a view agree on its membership.

Message Stability

When a process multicasts a message, the message is not delivered immediately. The protocol works as follows:

  1. The sender sends the message to all group members.

  2. Each member receives the message, buffers it, and sends an acknowledgment back to the sender.

  3. When the sender receives acknowledgments from all members of the current view, the message is stable.

  4. The sender sends a stability announcement to all members.

  5. Only after receiving this announcement can members deliver the message to the application.

A message is unstable if it has been received by some members but the sender has not yet confirmed stability (either because acknowledgments are still pending or because the sender crashed before completing the protocol). Unstable messages are held in a buffer and cannot be delivered to applications.

This protocol ensures that if a sender crashes mid-transmission, no member will have delivered a partially-sent message. Either the message becomes stable (all members have it and can deliver), or it remains unstable and is eventually discarded.
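A sketch of the sender-side bookkeeping for steps 2 through 4 (names are illustrative):

    class StabilityTracker:
        """Track which view members have acknowledged each message (sketch)."""
        def __init__(self, view_members):
            self.view = set(view_members)
            self.pending = {}    # msg_id -> members that have not yet acked

        def sent(self, msg_id):
            self.pending[msg_id] = set(self.view)

        def ack(self, msg_id, member):
            self.pending[msg_id].discard(member)
            if not self.pending[msg_id]:    # acked by the whole view: stable
                del self.pending[msg_id]
                return True    # caller now sends the stability announcement
            return False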

View Changes

A view change is a protocol that transitions all members from one view to another when membership changes due to a join, leave, or failure. The protocol ensures that all processes agree on which messages were delivered in the old view before transitioning.

The challenge is handling in-flight messages: messages that have been received and buffered by some members but are unstable because the sender crashed or has not yet confirmed stability. These messages are sitting in member buffers, waiting for a stability announcement that may never come.

The view change protocol works in three phases:

  1. Flush: Processes stop sending new application messages and exchange information about which messages they have buffered but not yet delivered. This propagates any in-flight messages to all surviving members.

  2. Collect: Each process waits to receive flush messages from all surviving members. After this phase, all survivors have the same set of buffered messages.

  3. Commit: Processes agree on the new view membership. Messages that all survivors have are marked stable and delivered. Messages that only some members received are discarded. All processes then atomically transition to the new view.

When a process fails, the failure detector notices the unresponsive process, and any detecting process can initiate a view change. The failed process is excluded from the new view. A recovered process must rejoin as a new member.

Recovery and State Transfer

If a failed process recovers, it cannot simply rejoin the group with its old state. While the process was dead, the group continued operating: messages were delivered, state changed, and views advanced. The recovered process has stale state and an outdated view of the world.

To rejoin, the recovered process must perform a state transfer:

  1. The recovering process contacts an existing group member and requests a state transfer.

  2. The existing member sends its current state (either a full snapshot or a recent checkpoint plus subsequent updates).

  3. The recovering process initializes its replica to this state.

The state transfer is treated as an atomic event: no other processing occurs at the recovering process until the transfer is complete. This ensures the process does not attempt to participate in the group with partially updated state.

Once the state transfer completes, the recovered process joins the group as a new member through the normal join protocol, triggering a view change that adds it to the membership.

Virtual synchrony coordinates membership changes with message delivery. The key guarantee is view synchrony: if a message is delivered in a view, it is delivered in that same view at all processes that deliver it. This ensures that all processes agree on what the group membership was when each message was delivered.

Distributed Mutual Exclusion

Distributed mutual exclusion ensures that at most one process is in a critical section at any time without relying on shared memory. The algorithms must satisfy three properties:

  1. Safety: at most one process may be in the critical section at any time.

  2. Liveness: every request to enter the critical section is eventually granted; the system does not deadlock.

  3. Fairness: requests are granted in the order they are made, so no process starves.

Centralized Mutual Exclusion Algorithm

The centralized approach designates one process as the coordinator. When a process wants to enter the critical section, it sends a request message to the coordinator. If the critical section is free, the coordinator sends a grant message back immediately. If the critical section is occupied, the coordinator queues the request. When the process in the critical section finishes, it sends a release message to the coordinator, which then sends a grant to the next process in the queue.

This approach requires only three messages per critical section entry (request, grant, release) and is simple to implement. The drawback is that the coordinator is a single point of failure and a potential performance bottleneck.
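A sketch of the coordinator's logic; grant here stands in for sending a grant message back to the requester:

    from collections import deque

    class Coordinator:
        """Centralized mutual exclusion coordinator (sketch)."""
        def __init__(self):
            self.holder = None
            self.waiting = deque()

        def request(self, pid, grant):
            if self.holder is None:
                self.holder = pid
                grant(pid)                 # critical section is free
            else:
                self.waiting.append((pid, grant))

        def release(self, pid):
            assert pid == self.holder
            if self.waiting:
                self.holder, grant = self.waiting.popleft()
                grant(self.holder)         # grant the next queued request
            else:
                self.holder = None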

Lamport’s Mutual Exclusion Algorithm

Lamport’s algorithm is fully distributed with no central coordinator. Each process maintains a request queue ordered by Lamport timestamp, with ties broken by process ID to ensure a total order.

When a process wants to enter the critical section, it timestamps its request and sends it to all other processes. When a process receives a request, it adds it to its local queue and sends an acknowledgment back to the requester. A process may enter the critical section when two conditions are met:

  1. Its own request is at the head of its queue (meaning it has the earliest timestamp), and

  2. It has received acknowledgments from all other processes.

When a process exits the critical section, it sends a release message to all other processes. Upon receiving a release, each process removes that request from its queue.

The algorithm requires 3(N−1) messages per critical section entry: N−1 requests, N−1 acknowledgments, and N−1 releases. It guarantees fairness because requests are ordered by timestamp. However, it assumes all processes are correct and responsive; if one process crashes and stops sending acknowledgments, other processes will block indefinitely.
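A sketch of the entry test; the queue holds (timestamp, process ID) pairs, so Python's tuple comparison breaks timestamp ties by process ID, matching the total order described above:

    def may_enter(queue, acks, my_pid, all_pids):
        """Lamport's entry condition (sketch). `queue` is the local request
        queue of (timestamp, pid) pairs; `acks` is the set of processes
        that have acknowledged our request."""
        head_is_mine = min(queue)[1] == my_pid
        everyone_acked = acks >= set(all_pids) - {my_pid}
        return head_is_mine and everyone_acked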

Ricart-Agrawala Algorithm

The Ricart-Agrawala algorithm optimizes Lamport’s algorithm by eliminating the release messages. Instead of always acknowledging immediately, a process defers its reply if it also wants the critical section and has a higher priority (earlier timestamp).

When a process wants to enter the critical section, it timestamps its request and sends it to all other processes. When a process receives a request, it compares the request’s timestamp to its own. If it does not want the critical section, or if its own request has a later timestamp, it sends a reply immediately. Otherwise, it defers the reply until after it exits the critical section.

A process enters the critical section when it has received replies from all other processes. Upon exiting, it sends all the deferred replies.

This reduces the message count to 2(N−1) per entry: N−1 requests and N−1 replies. Like Lamport’s algorithm, it requires all processes to respond.
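A sketch of the reply rule a process applies when a request arrives; the state names are illustrative:

    def should_defer(my_state, my_request, their_request):
        """Ricart-Agrawala reply rule (sketch). Requests are (timestamp, pid)
        tuples, so comparison breaks timestamp ties by process ID."""
        if my_state == "RELEASED":
            return False            # not interested: reply immediately
        if my_state == "HELD":
            return True             # in the critical section: defer the reply
        # WANTED: defer only if our own request has priority (is earlier)
        return my_request < their_request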

Token Ring Mutual Exclusion

The token ring algorithm organizes processes in a logical ring and uses a circulating token. Only the process holding the token may enter the critical section.

The token circulates continuously around the ring. When a process receives the token and wants the critical section, it enters. When it finishes (or if it does not want the critical section), it passes the token to its neighbor.

This approach provides bounded waiting and is simple to implement. The drawbacks are that the token generates continuous network traffic even when no process wants the critical section, and if the token is lost due to a crash, a recovery mechanism must regenerate it while avoiding duplicate tokens.
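A sketch of a participant's token handler; self.send, self.neighbor, and self.wants_cs are hypothetical names for the transport, ring topology, and local state:

    class RingNode:
        """Token ring participant (sketch)."""
        def on_token(self):
            if self.wants_cs:
                self.enter_critical_section()   # only the token holder enters
                self.exit_critical_section()
            self.send(self.neighbor, "TOKEN")   # pass the token along the ring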

Algorithm Comparison

Algorithm          Messages   Fault Tolerance                       When to Use
Centralized        3          Coordinator failure blocks progress   When simplicity is important and fast coordinator recovery is available
Lamport            3(N−1)     Requires all processes to respond     When fairness and full distribution are needed
Ricart-Agrawala    2(N−1)     Requires all processes to respond     Optimal choice for permission-based mutual exclusion
Token Ring         1 to N−1   Token loss requires recovery          When bounded waiting is important and requests are frequent

Leader Election

Leader election selects a single coordinator from a group of processes. The elected leader typically has the highest process ID among surviving processes.

Bully Algorithm

The bully algorithm assumes a synchronous model with timeouts for failure detection. It uses three message types: ELECTION (to announce a new election), OK (to acknowledge an election message and tell the sender a higher-ID process is alive), and COORDINATOR (to announce the winner).

When a process P detects that the coordinator has failed, it initiates an election by sending ELECTION messages to all processes with higher IDs. If P receives no OK responses within a timeout, it declares itself coordinator and sends COORDINATOR messages to all other processes. If P does receive an OK response, it knows a higher-ID process will take over and waits for a COORDINATOR message.

When a process receives an ELECTION message from a lower-ID process, it sends an OK response and starts its own election if it has not already. The highest-ID surviving process always wins.

The worst-case message complexity is O(N²), occurring when the lowest-ID process initiates the election. The best case is O(N).
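A sketch of the initiator's side; self.send, self.wait_for, and the 2-second timeout are hypothetical names:

    class BullyNode:
        """Bully election initiator (sketch)."""
        def start_election(self):
            higher = [p for p in self.peers if p > self.pid]
            for p in higher:
                self.send(p, "ELECTION")
            if not higher or not self.wait_for("OK", timeout=2.0):
                # No higher-ID process is alive: declare ourselves coordinator.
                for p in self.peers:
                    self.send(p, "COORDINATOR")
                self.coordinator = self.pid
            # Otherwise a higher-ID process takes over; await its COORDINATOR.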

Ring Election Algorithm

The ring election algorithm organizes processes in a logical ring. When a process notices the coordinator has failed, it creates an election message containing its own ID and sends it clockwise to its neighbor.

When a process receives an election message, it compares the ID in the message with its own:

  1. If the ID in the message is greater than its own, it forwards the message unchanged to its neighbor.

  2. If the ID in the message is less than its own, it replaces that ID with its own and forwards the message.

When a process receives an election message containing its own ID, it knows its message has traveled all the way around the ring and its ID is the largest. It then sends an ELECTED message around the ring to announce itself as the new coordinator.

The worst case is 3N−1 messages, occurring when the process immediately following the highest-ID process initiates the election.
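A sketch of the election-message handler; self.send and self.neighbor are hypothetical names:

    class ElectionNode:
        """Ring election participant (sketch)."""
        def on_election(self, msg_id):
            if msg_id == self.pid:
                # Our ID traveled the whole ring: we have the largest ID.
                self.send(self.neighbor, ("ELECTED", self.pid))
            elif msg_id > self.pid:
                self.send(self.neighbor, ("ELECTION", msg_id))    # forward larger ID
            else:
                self.send(self.neighbor, ("ELECTION", self.pid))  # substitute our ID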

What You Do Not Need to Study

You do not need to memorize exact pseudocode, PIM protocol details, or the exact phi calculation formula. Focus on understanding the concepts: why IGMP and PIM exist (host membership vs. router distribution), why sparse mode uses an RP (explicit joins vs. flooding), how phi adapts to observed network conditions, and why virtual synchrony coordinates membership with delivery.

