Group Communication, Membership, and Failure Detection
Distributed systems frequently need to coordinate the actions of multiple machines. A replicated database needs all replicas to process the same updates in the same order. A distributed lock service needs to ensure that only one client holds a lock at a time. A cluster of web servers needs to agree on which server handles requests for a particular user session.
These coordination problems share common building blocks. We need ways for groups of machines to communicate reliably. We need mechanisms to detect when machines fail. And we need protocols to manage group membership consistently so that all participants agree on who is in the group.
We will take a look at how groups of machines communicate with different reliability and ordering guarantees. We will explore how systems detect failures in their peers despite the impossibility of distinguishing slow machines from crashed ones. And we will examine how groups manage membership changes consistently through virtual synchrony.
These concepts form the foundation for reliable replication. In the next class, we will explore consensus algorithms such as Raft and Paxos, which enable distributed systems to agree on message sequencing despite failures. Those algorithms rely on reliable communication, accurate failure detection, and well-defined group membership. Understanding these primitives will make the more complex protocols far more intuitive.
Multicast Communication
In distributed systems, we frequently need to send the same message to multiple recipients. A stock trading system needs to broadcast price updates to all participants. A replicated database needs to propagate writes to all replicas. A distributed lock service needs to notify all interested parties when a lock becomes available.
Sending the same message to a specific group of processes is called multicast. Sending a message to a single process is unicast, while sending a message to everyone on the network is called broadcast.
Sending, Receiving, and Delivering
Before diving into multicast protocols, we need to clarify three terms essential to communication that are often confused: sending, receiving, and delivering. These distinctions apply not just to group communication but to any communication protocol that handles reliability or ordering.
Sending is the act of transmitting a message from an application through the communication layer to the network. When an application calls a send function, the message enters the communication subsystem and is transmitted toward its destination(s).
Receiving is the act of a machine accepting a message from the network. When a message arrives at a host, the communication layer receives it. However, receiving a message does not mean the application sees it yet. The message has arrived at the machine, but it may not be ready for the application to process.
Delivering is the act of passing a message received from the network to the application. This is when the application actually gets to see and process the message. Delivery is the event that matters from the application’s perspective.
The distinction between receiving and delivering is crucial because the communication layer may need to process messages before the application sees them. When a message is received, the communication layer might:
- Deliver it immediately if no special handling is required
- Discard it if it is a duplicate of a message already delivered
- Place it in a holdback queue if it arrived out of order and must wait for earlier messages
For example, consider FIFO (first-in, first-out) ordering. If process A sends message m1 followed by m2, but m2 arrives at process B before m1 due to network reordering, process B’s communication layer receives m2 first. But it cannot deliver m2 yet because m1 has not arrived. The layer places m2 in a holdback queue. When m1 arrives, the layer delivers m1, then immediately delivers m2 from the queue.
This separation of concerns allows applications to think in terms of clean delivery semantics (single source FIFO, causal, total order) while the communication layer handles the messy details of network behavior. Throughout this discussion, when we talk about ordering and reliability guarantees, we mean the order and reliability of message delivery, not message receipt.
The Challenge of Multicast
At its simplest, multicast means sending one message to multiple destinations. We could implement this by sending separate unicast messages to each recipient, but this approach has significant drawbacks. It increases network traffic, introduces timing skew between recipients, and makes it difficult to reason about the order in which messages arrive.
IP Multicast
IP Multicast provides a network-layer solution that allows a single packet to be delivered to multiple hosts. The sender transmits to a multicast group address (in the range 224.0.0.0 to 239.255.255.255 for IPv4), and network routers replicate packets as needed to reach all group members.
IGMP: Joining and Leaving Groups
The Internet Group Management Protocol (IGMP) allows hosts to join and leave multicast groups dynamically. When a host wants to receive traffic for a multicast group, it sends an IGMP membership report to its local router. The router then ensures that multicast traffic for that group flows to the network segment where the host resides.
IGMP operates between hosts and their immediately connected routers. Routers periodically send IGMP queries to discover which multicast groups have active members on each interface. Hosts respond with membership reports. When no hosts on a segment respond for a group, the router stops forwarding traffic for that group to that segment.
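To make the host side concrete, here is a minimal sketch (in Python, with an illustrative group address and port) of an application joining an IPv4 multicast group through standard socket options. The IP_ADD_MEMBERSHIP request is what causes the host's network stack to send an IGMP membership report to its local router.

```python
import socket
import struct

GROUP = "239.1.2.3"   # example address in the administratively scoped range
PORT = 5000           # example port

# Create a UDP socket and bind to the multicast port.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Ask the kernel to join the group; this triggers an IGMP membership report
# to the local router so that group traffic is forwarded to this segment.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# Receive multicast datagrams addressed to the group.
data, addr = sock.recvfrom(4096)
print(f"received {len(data)} bytes from {addr}")
```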
PIM: Routing Multicast Traffic
While IGMP handles the last hop between routers and hosts, Protocol Independent Multicast (PIM) handles routing multicast traffic across the network. PIM is called “protocol independent” because it does not build its own routing tables. Instead, it uses the existing unicast routing table to determine paths back to multicast sources. This Reverse Path Forwarding (RPF) approach ensures that multicast traffic follows efficient paths without creating loops.
PIM operates in two primary modes that reflect different assumptions about receiver density.
Dense Mode (PIM-DM) assumes that most subnets have receivers interested in the multicast traffic. The protocol uses a “flood and prune” approach: it initially floods multicast traffic to all routers, and routers with no interested receivers send prune messages upstream to stop receiving traffic. This flooding repeats periodically (typically every 3 minutes), making dense mode efficient only when receivers are truly dense. Dense mode is appropriate for scenarios like a campus network, where most locations want to receive a video feed.
Sparse Mode (PIM-SM) assumes that receivers are sparsely distributed, so flooding would waste bandwidth. Instead of flooding, sparse mode requires receivers to explicitly request traffic. The protocol uses a Rendezvous Point (RP), which serves as a meeting point for sources and receivers.
To receive traffic for a group, a host joins the group with IGMP, and its local router sends a PIM Join message toward the RP. Routers forward this join toward the RP using their normal unicast routing table. Each router along the path installs multicast forwarding state, so these joins stitch together a shared distribution tree rooted at the RP. When sources send to the group, traffic reaches the RP and is forwarded down that shared tree to every subnet that joined.
Once traffic is flowing, routers can switch to source-specific shortest-path trees if it is more efficient. Sparse mode is appropriate for the general case in which only a small fraction of network locations want a particular multicast stream.
The Internet Multicast Problem
While IP Multicast works well in controlled network environments, it has significant limitations on the public internet. Most internet service providers block multicast traffic at their network boundaries. There are several reasons for this:
- Traffic engineering complexity: Multicast traffic patterns are unpredictable and can amplify bandwidth usage in ways that complicate capacity planning.
- Billing challenges: Traditional billing models charge for bandwidth at network boundaries, but multicast traffic is replicated inside the network, making cost allocation difficult.
- Security concerns: Multicast-based denial-of-service attacks are difficult to mitigate because traffic replication happens automatically inside the network.
- State requirements: Routers must maintain per-group forwarding state, which does not scale well to millions of potential groups.
This blocking effectively prevents IP multicast from working across the global internet. Applications that require global multicast, such as video streaming services, typically use application-level solutions, such as content delivery networks, instead.
Where IP Multicast Works Well
Within controlled network environments, IP multicast remains valuable. Cable television providers use IP multicast extensively within their internal networks. When thousands of subscribers watch the same channel, the provider sends a single video stream that routers replicate at branch points in the network rather than sending individual streams to each subscriber. This dramatically reduces bandwidth requirements. Financial trading systems use IP multicast within data centers to distribute market data to hundreds of trading applications simultaneously. Video conferencing within enterprise networks and live event streaming within content delivery networks also benefit from IP multicast.
Application-Level Multicast
Because IP Multicast has limited reach, production distributed systems typically implement application-level multicast protocols built on reliable unicast communication (usually TCP). This gives us the control we need to provide the reliability and ordering guarantees that applications require.
Application-level multicast protocols can be characterized along two independent dimensions: reliability and ordering. These dimensions are orthogonal: you can have reliable delivery with no ordering guarantees, or unreliable delivery with strong ordering guarantees (for the messages that do arrive). In practice, systems choose appropriate levels of each based on application requirements.
Reliability Levels
Reliability determines what guarantees the multicast system provides about message delivery. Each level makes stronger promises but requires more coordination overhead.
Unreliable Multicast
Unreliable multicast provides best-effort delivery with no guarantees. The system attempts to deliver messages but makes no promises about success. This is the default behavior of IP multicast over UDP.
What it guarantees:
- Best-effort delivery attempt
What it does NOT guarantee:
- Messages may be lost
- Messages may be duplicated
- Messages may be delivered to only some recipients
Implementation: The sender transmits to all recipients without waiting for acknowledgments. Lost messages are not retransmitted.
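A minimal sketch of this behavior, assuming the multicast is emulated with one UDP datagram per recipient and an illustrative recipient list; nothing is acknowledged or retransmitted.

```python
import socket

def unreliable_multicast(message: bytes, recipients: list[tuple[str, int]]) -> None:
    """Best-effort send: transmit once to each recipient, never retransmit."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for addr in recipients:
        sock.sendto(message, addr)   # may be lost, duplicated, or reordered
    sock.close()

# Example recipient addresses are placeholders.
unreliable_multicast(b"price update", [("10.0.0.1", 9000), ("10.0.0.2", 9000)])
```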
Unreliable multicast is appropriate when occasional message loss is acceptable, such as in streaming video, where a dropped frame causes a brief glitch but the stream continues, or in real-time sensor data, where old readings become quickly irrelevant.
Best-Effort Reliable Multicast
Best-effort reliable multicast adds reliability guarantees when the sender completes successfully, but does not handle the case where the sender crashes during transmission. This is a practical middle ground for many applications where sender crashes are rare, and the complexity of full reliability is not warranted.
What it guarantees:
- If the sender completes the multicast without crashing, all live recipients receive the message
- No duplication (each message is delivered at most once)
- No spurious messages (only messages that were actually sent are delivered)
What it does NOT guarantee:
- If the sender crashes during the multicast, some recipients may receive the message, and others may not
- Messages do not survive system restarts
Implementation: The sender transmits to all recipients using reliable unicast (TCP) and waits for an acknowledgment from each. If a recipient does not acknowledge within a certain time, the sender retransmits. The sender considers the multicast complete either when all acknowledgments have been received or when a longer timeout expires, at which point it assumes any unresponsive recipients have failed. However, if the sender crashes partway through, there is no recovery mechanism to ensure consistency among recipients.
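A sketch of the sender-side loop just described, assuming a hypothetical send_with_ack helper that transmits over reliable unicast and reports whether an acknowledgment arrived in time; the retry and give-up intervals are illustrative.

```python
import time

RETRY_INTERVAL = 1.0      # seconds between retransmissions (example value)
GIVE_UP_AFTER = 10.0      # after this long, assume a silent recipient has failed

def best_effort_reliable_multicast(message, recipients, send_with_ack):
    """send_with_ack(addr, message) -> bool is a hypothetical helper that sends
    over a reliable unicast channel (e.g., TCP) and reports whether an
    acknowledgment arrived before its own timeout."""
    pending = set(recipients)
    deadline = time.time() + GIVE_UP_AFTER
    while pending and time.time() < deadline:
        for addr in list(pending):
            if send_with_ack(addr, message):
                pending.discard(addr)      # acknowledged: done with this recipient
        if pending:
            time.sleep(RETRY_INTERVAL)     # wait, then retransmit to stragglers
    # Any recipient still pending is presumed dead. Note the gap in the guarantee:
    # if *this* sender crashes mid-loop, some recipients may have the message and
    # others may not, and nothing repairs that inconsistency.
    return pending                         # set of presumed-failed recipients
```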
Reliable Multicast
Reliable multicast provides strong consistency guarantees even when the sender crashes during transmission. The key addition is the agreement property: either all correct processes receive the message, or none of them do. This “all or nothing” semantics is essential for applications like replicated state machines, where all replicas must see the same sequence of operations.
Formal definition: A multicast protocol is reliable if it satisfies three properties:
- Agreement: If any correct process delivers a message m, then all correct processes eventually deliver m.
- Integrity: Every correct process delivers each message m at most once, and only if m was previously multicast by some process.
- Validity: If a correct process multicasts a message m, then it eventually delivers m.
What it guarantees:
- Agreement: If any correct process delivers a message, all correct processes eventually deliver it
- Integrity: Every correct process delivers each message at most once, and only if it was actually sent
- Validity: If a correct process sends a message, it eventually delivers that message to itself
What it does NOT guarantee:
- Messages do not survive system restarts (no persistence)
- No specific ordering (unless combined with an ordering guarantee)
Implementation: A simple approach works as follows. When a process receives a message for the first time, it re-multicasts it to all group members before delivering it to the application. This ensures that even if the original sender crashes mid-transmission, any process that received the message will propagate it to the others. The re-multicast guarantees agreement: if anyone got it, everyone will get it. This approach is correct but expensive, generating O(n²) messages for a group of n processes. More sophisticated protocols use techniques such as hierarchical forwarding and negative acknowledgments to reduce message overhead.
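A sketch of the receive-side rule for this simple approach, assuming hypothetical basic_multicast and deliver helpers and messages that carry a unique id.

```python
delivered_ids = set()   # message IDs this process has already delivered

def on_receive(msg, group, basic_multicast, deliver):
    """Reliable multicast receive rule: re-multicast before first delivery.

    basic_multicast(group, msg) is a hypothetical best-effort send to every
    member; deliver(msg) hands the message to the application; msg.id is an
    assumed unique message identifier.
    """
    if msg.id in delivered_ids:
        return                      # duplicate: integrity requires at-most-once delivery
    # Forward to everyone else first, so that even if the original sender
    # crashed mid-multicast, anyone who got the message propagates it.
    basic_multicast(group, msg)
    delivered_ids.add(msg.id)
    deliver(msg)                    # agreement: if we deliver, everyone eventually will
```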
Durable Multicast
Durable multicast adds persistence to reliability. Messages are written to stable storage before being acknowledged, ensuring that they survive process crashes and system restarts. This is essential for systems that cannot afford to lose messages under any circumstances, such as distributed transaction logs, replicated databases, or event streaming platforms.
What it guarantees:
- All guarantees of reliable multicast (agreement, integrity, validity)
- Messages are written to persistent storage before being acknowledged
- Messages survive process crashes and system restarts
- No acknowledged message is ever lost
What it does NOT guarantee:
- No specific ordering (unless combined with an ordering guarantee)
Implementation: The sender writes the message to local persistent storage (such as a write-ahead log), then transmits it to all recipients. Each recipient writes the message to its own persistent storage before sending an acknowledgment. The sender considers the multicast complete only after receiving acknowledgments from a sufficient number of recipients (often a majority or all).
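A sketch of the receiver side of this step, with an illustrative log file path and a hypothetical send_ack callback; the key point is that the fsync completes before the acknowledgment is sent.

```python
import os

LOG_PATH = "multicast.log"   # illustrative location for the write-ahead log

def persist_then_ack(message: bytes, send_ack) -> None:
    """Append the message to stable storage, force it to disk, then acknowledge.
    send_ack() is a hypothetical callback that notifies the sender."""
    with open(LOG_PATH, "ab") as log:
        log.write(len(message).to_bytes(4, "big"))  # simple length-prefix framing
        log.write(message)
        log.flush()
        os.fsync(log.fileno())      # the message survives a crash from this point on
    send_ack()                      # only acknowledge after the fsync completes
```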
Systems like Apache Kafka (an event streaming platform) provide durable multicast by writing messages to disk and replicating them across multiple brokers before acknowledging to the sender. The replication factor determines how many node failures can be tolerated without data loss.
Publish-Subscribe as Group Communication
Pub/sub (publish-subscribe) systems apply these group communication concepts at scale. Publishers send messages to named topics rather than specific recipients. Subscribers register interest in topics and receive matching messages.
This decouples senders from receivers, since neither needs to know about the other. The topic acts like a named multicast group, with the messaging system handling delivery, durability, and (depending on configuration) ordering.
We’ll examine how systems like Kafka implement this later in the semester.
Ordering Levels
Ordering determines the sequence in which messages are delivered to applications. There are several levels, from weakest to strongest. Ordering is orthogonal to reliability: a protocol can provide single source FIFO ordering over unreliable transport, or unordered delivery over durable transport.
Preview: Ordering Meets Membership (Virtual Synchrony)
So far, ordering sounds like a pure messaging question: “In what order do we deliver messages?” In practice, the hard part is that membership changes are happening at the same time. A sender can crash mid-multicast, or the network can partition the group. At that moment, different processes can disagree about who is still “in the group,” and that disagreement can make reliable ordering meaningless.
Virtual synchrony is the idea that a group membership service gives processes a sequence of agreed-upon membership snapshots called views, and the communication layer ties message delivery to those views. Informally: every process agrees on when the group changed, and it can label each delivered message as belonging to a particular view. That lets applications reason about message delivery without having to guess who was present.
We will make this precise later in the Virtual Synchrony and Group Membership section. For now, keep in mind that the strongest ordering and reliability semantics usually assume some disciplined way to handle membership change, rather than treating it as an afterthought.
Unordered Delivery
Unordered delivery provides no guarantees about message sequence. Messages may arrive in different orders at different recipients, and concurrent messages from different senders may interleave arbitrarily.
What it guarantees:
- Nothing about the message sequence
What it does NOT guarantee:
- Messages may arrive in any order at different recipients
- No relationship between send order and delivery order
Implementation: Messages are delivered to the application as soon as they arrive, with no buffering or reordering.
Unordered delivery is sufficient when messages are independent and commutative (the order of application does not matter), such as independent updates to different keys in a distributed cache, or idempotent operations that produce the same result regardless of order.
Single Source FIFO Ordering (SSF)
Single source FIFO ordering guarantees that messages from the same sender are delivered in the order they were sent. This is a natural expectation: if I send you two messages, you should receive them in the order I sent them. However, single source FIFO ordering says nothing about the relative order of messages from different senders.
Formal definition: If a process sends multicast(G, m) and later sends multicast(G, m′), then every correct process that delivers m′ will have already delivered m.
What it guarantees:
- Messages from the same sender are delivered in the order they were sent
- If process P sends m1 before m2, then every process that delivers both will deliver m1 before m2
What it does NOT guarantee:
- Nothing about the relative order of messages from different senders
- Two recipients might see messages from different senders interleaved differently
Implementation: Each sender maintains a sequence number that it increments with each message. Each message carries its sequence number. Receivers maintain a counter for each sender, indicating the next expected sequence number. When a message arrives, the receiver checks if it is the next expected message from that sender. If so, it delivers immediately. If the message is from the future (with a higher sequence number than expected), the receiver buffers it until the gap is filled. Messages from different senders are delivered independently.
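A sketch of that receiver-side bookkeeping, with one expected-sequence counter and one holdback buffer per sender (the deliver callback stands in for handing the message to the application).

```python
from collections import defaultdict

class FifoReceiver:
    """Per-sender FIFO delivery using sequence numbers and a holdback buffer."""

    def __init__(self, deliver):
        self.deliver = deliver                       # callback into the application
        self.next_expected = defaultdict(lambda: 1)  # next seq number per sender
        self.holdback = defaultdict(dict)            # sender -> {seq: message}

    def on_receive(self, sender, seq, message):
        if seq == self.next_expected[sender]:
            self.deliver(sender, message)
            self.next_expected[sender] += 1
            # Drain any buffered messages from this sender that are now in order.
            buffered = self.holdback[sender]
            while self.next_expected[sender] in buffered:
                self.deliver(sender, buffered.pop(self.next_expected[sender]))
                self.next_expected[sender] += 1
        elif seq > self.next_expected[sender]:
            self.holdback[sender][seq] = message     # from the future: hold it back
        # else: seq < expected, a duplicate; silently drop it
```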
Causal Ordering
Causal ordering is stronger than single source FIFO ordering. It guarantees that if message m1 causally precedes message m2 (meaning m2 could have been influenced by m1), then m1 is delivered before m2 at all processes. The “happened-before” relation from Lamport’s work defines causal precedence.
Formal definition: If multicast(G, m) → multicast(G, m′), where → denotes the happened-before relation, then every correct process that delivers m′ will have already delivered m.
What it guarantees:
- If message m1 causally precedes message m2, then m1 is delivered before m2 at all processes
- Causal precedence means: m2 was sent after m1 was received, so m2 might have been influenced by m1
- Implies single source FIFO ordering (messages from the same sender are causally related)
What it does NOT guarantee:
- Nothing about the order of concurrent messages (messages that are not causally related)
- Two concurrent messages may be delivered in different orders at different recipients
Consider a discussion forum where Alice posts a question, Bob sees it and posts an answer, and Carol sees it and posts a different answer. Bob’s answer is causally dependent on Alice’s question because Bob saw the question before composing his answer. Causal ordering ensures that no one sees Bob’s answer before seeing Alice’s question. However, Carol might see Bob’s answer before or after her own answer, since those events are concurrent (neither caused the other).
Implementation using vector clocks: Causal ordering can be implemented using vector clocks, also called vector timestamps when attached to messages. The mechanism is the same as vector clocks from the logical clocks lecture; the difference is that we use the vector to decide when a received message is safe to deliver.
- Each process maintains a vector V of length n (for n processes), where V[i] represents the number of messages from process i that this process has delivered.
- When process P sends a message, it first increments V[P] (its own entry) and attaches the entire vector to the message.
- When process Q receives a message m from process P with attached vector Vm, process Q checks whether it has already delivered all the messages that P had delivered before sending m. Specifically, Q buffers the message until both conditions are met:
  - Vm[P] = V[P] + 1 (this is the next expected message from P, ensuring FIFO from P)
  - For all j ≠ P: Vm[j] ≤ V[j] (Q has already delivered everything that P had delivered from other processes)
- Once both conditions are met, Q delivers the message and merges the vector timestamp: for all k, set V[k] = max(V[k], Vm[k]). This ensures that when a message is delivered, all messages it causally depends on have already been delivered.
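A sketch of this delivery rule for one process in a group of n processes, where each incoming message carries its sender's id and vector timestamp (deliver is a placeholder for the application callback).

```python
class CausalReceiver:
    """Buffers messages until their vector timestamps show that every
    causally earlier message has already been delivered."""

    def __init__(self, n: int, deliver):
        self.V = [0] * n          # V[i] = messages from process i delivered so far
        self.deliver = deliver
        self.pending = []         # messages received but not yet deliverable

    def _deliverable(self, sender, Vm):
        return (Vm[sender] == self.V[sender] + 1 and
                all(Vm[j] <= self.V[j] for j in range(len(self.V)) if j != sender))

    def on_receive(self, sender, Vm, message):
        self.pending.append((sender, Vm, message))
        progress = True
        while progress:           # keep delivering until nothing else is eligible
            progress = False
            for entry in list(self.pending):
                s, vm, m = entry
                if self._deliverable(s, vm):
                    self.pending.remove(entry)
                    self.deliver(s, m)
                    # Merge timestamps: V[k] = max(V[k], Vm[k]) for all k.
                    self.V = [max(a, b) for a, b in zip(self.V, vm)]
                    progress = True
```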
Total Ordering
Total ordering guarantees that all processes deliver all messages in the same order. If any process delivers m1 before m2, then every process that delivers both will deliver m1 before m2. The specific order does not need to respect causality or physical time; it simply needs to be consistent across all processes.
Formal definition: If a correct process delivers message m before message m′, then every correct process that delivers both m and m′ will deliver m before m′.
What it guarantees:
- All processes deliver all messages in the same order
- If any process delivers m1 before m2, every process delivers m1 before m2
What it does NOT guarantee:
- The delivery order need not respect causality (Bob’s answer might be delivered before Alice’s question, as long as everyone agrees)
- The delivery order need not respect the physical sending time
- The order may be arbitrary as long as everyone agrees on it
Implementation using a sequencer: A simple way to implement total ordering is to designate one process as a sequencer. All multicasts are sent to the sequencer first. The sequencer assigns a global sequence number to each message and broadcasts it to all recipients. Receivers buffer messages and deliver them in sequence-number order. The drawback is that the sequencer is a single point of failure and a potential bottleneck.
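A sketch of both roles under this design, with the transport abstracted as callbacks (broadcast and deliver are placeholders).

```python
class Sequencer:
    """Central process that assigns a global delivery order."""

    def __init__(self, broadcast):
        self.next_seq = 1
        self.broadcast = broadcast        # callback: send (seq, msg) to all members

    def on_submit(self, message):
        self.broadcast(self.next_seq, message)   # everyone learns the same order
        self.next_seq += 1


class TotalOrderReceiver:
    """Delivers sequenced messages strictly in sequence-number order."""

    def __init__(self, deliver):
        self.deliver = deliver
        self.next_expected = 1
        self.holdback = {}

    def on_receive(self, seq, message):
        self.holdback[seq] = message
        while self.next_expected in self.holdback:
            self.deliver(self.holdback.pop(self.next_expected))
            self.next_expected += 1
```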
Implementation using distributed agreement (Isis-style, optional): One classic approach to total order multicast uses a two-round agreement protocol to assign each message a final sequence number. When a process multicasts a message, it sends the message to all group members. Each receiver proposes a sequence number (typically max(local counter, any previously proposed values) + 1). The sender collects proposals, selects the maximum as the final sequence number, and then announces that final number to all receivers. Receivers buffer messages until the final number is known and then deliver in final sequence-number order.
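A compressed sketch of the sender side of this two-round exchange, with the proposal and announcement messages abstracted as hypothetical callbacks; the receiver-side buffering until the final number is known is omitted.

```python
def isis_total_order_send(message, members, ask_for_proposal, announce_final):
    """Isis-style sequence-number agreement, sender side (sketch only).

    ask_for_proposal(member, message) -> int  asks a member for its proposed number
    announce_final(member, message_id, seq)   tells a member the agreed final number
    Both callbacks are hypothetical stand-ins for the real transport; msg.id is an
    assumed unique message identifier.
    """
    proposals = [ask_for_proposal(member, message) for member in members]
    final_seq = max(proposals)                 # the agreed (final) sequence number
    for member in members:
        announce_final(member, message.id, final_seq)
    return final_seq
```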
This ordering mechanism stands on its own. Systems like Isis (Ken Birman, Cornell, 1980s) paired total order multicast with a membership model (virtual synchrony) to define what happens when membership changes during communication. The ordering protocol and the membership semantics are separable: you can study total order multicast without virtual synchrony, and then add virtual synchrony when you care about consistent behavior under view changes.
It is important to note that total ordering does not imply causal ordering. A totally ordered delivery might deliver Bob’s answer before Alice’s question at all processes, as long as everyone agrees on this order. Systems often implement atomic multicast, which provides reliable total order delivery. Some systems additionally require the total order to respect causality.
Synchronous Ordering
Synchronous ordering uses a sync primitive that acts as a barrier: when a process issues a sync, it blocks until all messages that were in transit have been delivered to all recipients. This creates logical groups of messages, with the sync primitive marking the boundary between groups.
Formal definition: If a process sends a set of messages M, then issues a sync, then sends a set of messages M′, every correct process will deliver all messages in M before delivering any message in M′.
What it guarantees:
- When the sync primitive completes, all messages sent before the sync have been delivered to all recipients
- Creates well-defined epochs: all messages before the sync are in one group, all messages after are in another
- All processes see the same boundary between message groups
What it does NOT guarantee:
- Low latency: a sync call must wait for all in-flight messages to be delivered, so it can be slow
- No ordering guarantees within a group (unless combined with other ordering)
Implementation: The sync primitive works like a flush. When a process calls sync, it sends a sync marker to all group members. The sync completes only when the process has received acknowledgment that all recipients have:
- Received all messages sent before the sync, and
- Received the sync marker itself.

This ensures that everyone agrees on which messages came before versus after the sync.
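A sketch of the blocking sync call from the caller's perspective, assuming hypothetical send_marker and all_prior_acked helpers provided by the communication layer.

```python
import time

def sync(members, send_marker, all_prior_acked):
    """Block until every member has acknowledged all messages sent before the
    sync plus the sync marker itself. Both helpers are hypothetical:
      send_marker(member)      -- sends the sync marker to one member
      all_prior_acked(member)  -- True once that member has acknowledged
                                  everything sent so far, including the marker
    """
    for member in members:
        send_marker(member)
    while not all(all_prior_acked(member) for member in members):
        time.sleep(0.01)     # poll; a real implementation would block on events
```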
In the Isis system (which popularized virtual synchrony), a barrier primitive called GBCAST was used to coordinate group membership changes. Before installing a new view, a sync ensures that all messages from the old view have been delivered.
This is what makes view synchrony possible: the sync primitive creates a clean boundary between the old view and the new view, ensuring all processes transition at the same logical point.
Real-Time Ordering
Real-time ordering would guarantee that messages are delivered in the order they were actually sent according to real (physical) time. If message m1 was sent at 10:00:00.000 and m2 was sent at 10:00:00.001, then m1 would be delivered before m2 at all recipients.
Why it cannot be implemented perfectly:
- Clocks are never perfectly synchronized across machines
- Network delays vary unpredictably
- A message sent later might arrive before a message sent earlier
- Even with GPS-synchronized clocks accurate to microseconds, clock uncertainty remains
Systems can approximate real-time ordering using tightly synchronized clocks. As we will see in a future lecture, Google Spanner uses TrueTime, which provides a bounded uncertainty interval for the current time. Spanner delays operations to ensure that the uncertainty intervals do not overlap, effectively achieving real-time ordering at the cost of latency. However, perfect real-time ordering remains impossible because some clock uncertainty always exists.
Summary of Ordering Relationships
Causal ordering implies single source FIFO ordering (since messages from the same sender are causally related by definition). Total ordering does not imply single source FIFO or causal ordering. Synchronous ordering is orthogonal to the other orderings: it does not specify delivery order but rather provides a barrier that groups messages into epochs. A sync primitive ensures all messages sent before it are delivered before any messages sent after it.
For critical applications like replicated state machines, systems typically implement causal-total ordering (atomic multicast), which provides both causal and total ordering guarantees. This ensures that all replicas see the same sequence of operations in a causally consistent order.
Failure Detection
Distributed systems must cope with failures. Machines crash. Networks partition. Disks fill up. A critical capability for any distributed system is determining which components are functioning correctly and which have failed.
The Challenge of Failure Detection
In a synchronous system with bounded message delays and processing times, failure detection would be straightforward. If a process does not respond within the known maximum delay, it has failed. But real distributed systems are asynchronous: there is no bound on how long messages or computations might take. A slow process is indistinguishable from a failed one.
This observation leads to one of the most important impossibility results in distributed computing. Fischer, Lynch, and Paterson proved in 1985 that in an asynchronous system where even one process might crash, no deterministic algorithm can guarantee reaching consensus. The FLP impossibility result, as it is known, means we cannot build perfect failure detectors in asynchronous systems.
FLP takeaway: in a fully asynchronous model, there is no algorithm that always terminates with the correct answer in the presence of even one crash. Real systems make progress by changing the model: add timeouts and accept false suspicions, assume partial synchrony (eventual timing bounds), or use randomness.
Since perfect detection is impossible, practical failure detectors make mistakes. They sometimes suspect processes that are actually alive (false positives) or fail to suspect processes that have crashed (false negatives). The art of failure detection is minimizing these mistakes while providing timely information about failures.
Heartbeat-Based Detection
The most common approach to failure detection uses heartbeats. Each process periodically sends a heartbeat message to indicate it is alive. A monitor that does not receive heartbeats from a process within some timeout period suspects that the process has failed.
The choice of timeout is critical. A short timeout detects failures quickly but generates more false positives when processes are slow or networks are congested. A long timeout reduces false positives but delays failure detection.
Several variations exist. In push-based heartbeating, monitored processes send heartbeats to monitors. In pull-based approaches (sometimes called pinging), monitors query processes periodically and expect responses. Some systems use a combination, with periodic heartbeats supplemented by on-demand probes when problems are suspected.
Heartbeat-based detection requires selecting several parameters: the heartbeat interval, the timeout threshold, and the number of missed heartbeats required to trigger a failure. These choices involve fundamental tradeoffs between detection speed, accuracy, and network overhead.
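A sketch of a push-style monitor using a fixed interval and a missed-heartbeat threshold; both parameter values are illustrative.

```python
import time

HEARTBEAT_INTERVAL = 1.0    # seconds between expected heartbeats (example)
MISSED_THRESHOLD = 3        # suspect after this many missed intervals (example)

class HeartbeatMonitor:
    """Tracks the last heartbeat time per process and flags suspects."""

    def __init__(self):
        self.last_seen = {}

    def on_heartbeat(self, process_id):
        self.last_seen[process_id] = time.time()

    def suspects(self):
        """Return processes whose silence exceeds the timeout threshold."""
        cutoff = time.time() - HEARTBEAT_INTERVAL * MISSED_THRESHOLD
        return [p for p, t in self.last_seen.items() if t < cutoff]
```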
The Phi Accrual Failure Detector
Fixed-timeout failure detectors treat failure as binary: a process is either alive or dead. But in practice, our confidence that a process has failed changes over time. If we have not heard from a process in 100 milliseconds, we are somewhat suspicious. If we have not heard from it in 10 seconds, we are very suspicious. If we have not heard from it in a minute, we are nearly certain.
The phi accrual failure detector, introduced by Hayashibara and colleagues in 2004, captures this intuition. Instead of outputting a binary alive/dead judgment, it outputs a suspicion level φ (phi) on a continuous scale.
The key insight is that heartbeats do not arrive at perfectly regular intervals. Network congestion, garbage collection pauses, and varying system load cause heartbeat timing to fluctuate. The phi accrual failure detector learns the normal pattern of heartbeat arrivals and uses this to judge whether a missing heartbeat is suspicious.
Here is how it works:
The detector maintains a sliding window of recent heartbeat arrival times. From these, it computes the time gaps between consecutive heartbeats. If heartbeats typically arrive every 100ms, but sometimes take 150ms or 200ms due to network variation, the detector learns this distribution.
When a heartbeat is late, the detector asks: given the normal pattern of heartbeat timing I have observed, how likely is it that this heartbeat is simply delayed versus the process having crashed? If heartbeats normally arrive with low variance (say, 100ms ± 5ms), then a 500ms gap is extremely unlikely to be normal variation. But if heartbeats have high variance (100ms ± 100ms), a 500ms gap might be within the realm of normal behavior.
The phi value is best understood as “how surprising this silence is” given the learned heartbeat timing distribution, mapped onto a base-10 logarithmic scale. Roughly, φ = k corresponds to about a 10^(-k) chance that a delay this large is still consistent with the learned heartbeat arrival pattern. Each increment of phi means the observed silence is another order of magnitude less likely to be normal variation.
Applications choose a threshold φ based on their needs. A lower threshold (say, φ = 3) means faster detection but more false positives, since you’re acting when there’s still a 0.1% chance the process is alive. A higher threshold (say, φ = 8) means waiting until the probability of being alive is roughly one in 100 million, giving slower detection but fewer false positives. Apache Cassandra defaults to φ = 8.
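A simplified sketch of the phi computation, modeling inter-arrival gaps as normally distributed; the window size and numerical guards are illustrative, and real implementations differ in these details.

```python
import math
import statistics
import time
from collections import deque

class PhiAccrualDetector:
    """Outputs a suspicion level phi instead of a binary alive/dead verdict."""

    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival gaps
        self.last_arrival = None

    def on_heartbeat(self, now=None):
        now = now if now is not None else time.time()
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now=None):
        now = now if now is not None else time.time()
        if self.last_arrival is None or len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        std = max(statistics.stdev(self.intervals), 1e-6)  # avoid division by zero
        elapsed = now - self.last_arrival
        # Probability that a normally distributed gap is at least this long.
        z = (elapsed - mean) / std
        p_later = 0.5 * math.erfc(z / math.sqrt(2))
        return -math.log10(max(p_later, 1e-300))   # phi = how surprising the silence is
```

An application would then act only when the returned phi exceeds its chosen threshold, for example 8.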
The key advantage of accrual failure detectors is that they separate the monitoring layer from the decision layer. The detector continuously provides suspicion information; applications decide when to act on it. This flexibility lets different parts of a system use different thresholds for the same underlying data.
Other Failure Detection and Membership Approaches
There are many other protocols in this space, including randomized, gossip-based approaches such as SWIM. These are widely used in large clusters because they scale well, but they add another layer of mechanisms and parameters. For our purposes, heartbeat-based detectors and the phi accrual model are enough to understand the tradeoffs and to motivate why group membership is hard.
Virtual Synchrony and Group Membership
When group membership changes, whether because processes join, leave, or fail, the system faces a fundamental challenge. Different processes might disagree about who is currently in the group, leading to inconsistencies. A message multicast to “the group” might reach different sets of processes depending on when each recipient learns about membership changes.
A system is virtually synchronous if processes see the same sequence of membership views, and message delivery is consistent with those views: if two processes both deliver a message, they deliver it in the same view. The communication layer presents membership changes as clean boundaries for delivery, even though failures and delays occur asynchronously underneath.
Virtual Synchrony
Virtual synchrony, developed by Ken Birman and colleagues and popularized in the Isis system at Cornell, provides a powerful abstraction for programming with dynamic groups. While few systems today implement the full virtual synchrony model as originally specified, the core concepts permeate modern distributed systems. Understanding virtual synchrony helps explain why systems like ZooKeeper (a distributed coordination service), Raft (a consensus protocol), and Kafka (an event streaming platform) handle membership changes the way they do.
The core idea of virtual synchrony is to make group membership changes appear to happen synchronously with message delivery, even though the underlying system is asynchronous.
A virtually synchronous system guarantees view synchrony: if a message is delivered in some view, it is delivered in the same view at all processes that deliver it. This means that all processes that receive a message agree on what the group membership was when that message was sent and received.
Consider what happens when a process P fails while multicasting a message M. Some processes might receive M before learning of P’s failure; others might learn of the failure first. Virtual synchrony ensures consistency: either all surviving processes receive M in the view before P’s failure, or none of them do. The failure and the message delivery are ordered consistently at all processes.
The view change protocol described later is what makes virtual synchrony possible. By coordinating the transition between views, the protocol ensures that message delivery and membership changes are ordered consistently at all processes.
The Group Membership Problem
Consider a replicated database where updates are multicast to all replicas. If replica A thinks the group contains {A, B, C} and replica B thinks it contains {A, B, D}, they will send updates to different sets of processes. The result is inconsistency: some replicas receive updates that others do not.
Group membership services address this problem by providing a consistent view of group membership to all members. At any moment, each process has a current view of the group. The membership service guarantees that all processes in a group agree on the membership. When membership changes, all surviving processes transition to a new view together.
Who Tracks Group Membership
In virtual synchrony systems, group membership is tracked collectively by all members of the group rather than by a single centralized server. Each process maintains its own copy of the current view and participates in the view change protocol when membership changes.
There is typically a group membership service (GMS) layer within each process that handles the mechanics of view changes. This layer monitors other members using failure detection, participates in view change protocols, and notifies the application when the view changes. The GMS at each process communicates with the GMS at other processes to coordinate view changes.
Some systems designate one member as the view leader or coordinator for each view. The leader drives the view change protocol, but it is not a single point of failure: if the leader fails, a new leader is elected as part of the view change that removes the failed leader.
Views and View Changes
A view is a snapshot of group membership at a point in time. Each view has a unique identifier (typically a monotonically increasing number) and contains a list of the processes that are members of the group in that view. For example, view 5 might contain processes {A, B, C}, indicating that at this point in the group’s history, these three processes are the members.
Every process in a group maintains its own copy of the current view. A critical invariant is that all processes in the same view agree on the membership of that view. If process A believes it is in view 5 with members {A, B, C}, then processes B and C must also believe they are in view 5 with members {A, B, C}.
A view change is a transition from one view to another. View changes occur when membership changes: a process joins the group, a process voluntarily leaves, or a process is detected as failed. The view change is not instantaneous; it is a protocol that coordinates all surviving members to agree on the new membership before any of them begin operating in the new view.
Message Stability
Before a message can be delivered to the application, the system must ensure it is stable: all current group members have received it. This guarantees that if the sender crashes immediately after sending, the message will not be partially delivered to only some members.
The protocol works as follows:
- When a process receives a message, it holds the message in a buffer and sends an acknowledgment to the sender.
- When the sender receives acknowledgments from all members of the current view, the message is stable.
- The sender announces that the message is stable, and receivers can now deliver it to the application.
As an optimization, receivers can acknowledge batches of messages, and senders can confirm stability for batches rather than individual messages.
Message stability is essential for view changes. During a flush, processes exchange information about which messages they have received. Only stable messages are delivered before the view change; unstable messages are either stabilized (if all surviving members have them) or discarded (if not), ensuring all processes enter the new view having delivered exactly the same set of messages.
The View Change Protocol
When a membership change is detected (through the failure detector noticing a crashed process, or through a join/leave request), the surviving processes execute a view change protocol. The goal is to ensure that all processes agree on three things:
- Which messages were delivered in the old view
- What the membership of the new view is
- When the transition to the new view occurs
A simplified view change protocol works as follows:
Phase 1: Flush. When a process learns that a view change is needed, it stops sending new application messages and sends a flush message to all members of the current view. The flush message contains the set of messages that this process has received but not yet delivered, ensuring that any in-flight messages are propagated. In practice, systems often exchange message IDs or stability summaries rather than the full message data, and retransmit missing messages separately.
Phase 2: Collect. Each process waits to receive flush messages from all other surviving members (or determines that non-responding members have failed). At this point, each process has the same set of messages that were sent in the old view.
Phase 3: Commit. The processes agree on the new view membership and the new view identifier. All processes deliver any remaining messages from the old view, then atomically transition to the new view. The application is notified of the view change.
After the view change completes, all surviving processes are in the new view with identical state: they have delivered exactly the same set of messages from the old view and agree on the new membership.
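A highly simplified sketch of these phases at one surviving process, with the message exchange abstracted behind a hypothetical exchange_flush callback; a real protocol must also tolerate failures that happen during the view change itself.

```python
def view_change(old_view, survivors, undelivered, exchange_flush, deliver, install_view):
    """Simplified flush/collect/commit at one process (sketch only).

    exchange_flush(survivors, undelivered) -> set   hypothetical: sends our
        undelivered messages to all survivors and returns the union of
        everything any survivor received in the old view.
    deliver(msg)          hands a message to the application.
    install_view(view)    switches this process to the new view.
    """
    # Phases 1 and 2: flush and collect. Stop sending new application messages,
    # exchange undelivered messages, and wait for every survivor's flush.
    remaining = exchange_flush(survivors, undelivered)

    # Phase 3: commit. Deliver the agreed remainder of the old view, then
    # atomically switch to the new view. Sorting by message id is only a
    # placeholder for whatever delivery order the protocol has agreed on.
    for msg in sorted(remaining, key=lambda m: m.id):
        deliver(msg)
    new_view = {"id": old_view["id"] + 1, "members": survivors}
    install_view(new_view)
```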
Handling Process Failures
When a process fails, the system must detect the failure and execute a view change to remove the failed process from the group. Here is how this works:
Detection. The failure detector (using heartbeats or the phi accrual detector) notices that a process is unresponsive. Different processes may detect the failure at different times.
Initiation. When a process detects a failure, it initiates a view change. If multiple processes detect the failure simultaneously, they may all try to initiate a view change, but the protocol handles this by converging on a single new view.
Exclusion. The view change protocol excludes the failed process from the new view. The failed process does not participate in the protocol (it cannot, since it has crashed). Surviving processes proceed without it.
Message consistency. The key challenge is handling messages that the failed process may have been sending when it crashed. Some processes may have received these messages; others may not have. The view change protocol ensures consistency: either all surviving processes deliver a message in the old view, or none of them do. Messages that were only partially delivered are discarded.
Recovery. If a failed process later recovers, it cannot simply rejoin with its old state. While the process was dead, the group continued operating: messages were delivered, state changed, and views advanced. The recovered process has stale state and an outdated view of the world.
To rejoin, the recovered process must perform a state transfer:
- The recovering process contacts an existing group member and requests a state transfer.
- The existing member sends its current state (either a full snapshot or a recent checkpoint plus subsequent updates).
- The recovering process initializes its replica to this state.
The state transfer is treated as an atomic event: no other processing occurs at the recovering process until the transfer is complete. This ensures the process does not attempt to participate in the group with partially updated state. Once the state transfer completes, the recovered process joins the group as a new member through the normal join protocol, triggering a view change that adds it to the membership.
Why Virtual Synchrony Concepts Still Matter
The insight of virtual synchrony is that membership changes and message delivery must be coordinated. You cannot reason about who received what message without knowing what everyone believed about group membership at the time. This insight appears in modern systems even when they do not use the term “virtual synchrony.”
ZooKeeper’s membership views, Raft’s configuration changes, and Kafka’s consumer group rebalancing all grapple with the same fundamental problem. They all need to ensure that processes agree on who is in the group before they can meaningfully agree on anything else.
Virtual synchrony also demonstrates a key design principle: by providing stronger semantics at the communication layer, you simplify the logic at the application layer. An application built on virtually synchronous communication can treat group membership as a well-defined state rather than a fuzzy approximation. The complexity is pushed into the infrastructure, where it can be implemented once correctly.
The original Isis system that implemented virtual synchrony was used in production at the New York Stock Exchange, the Swiss Stock Exchange, the French air traffic control system, and the US Navy AEGIS warship system. These deployments demonstrated that the abstraction was practical for demanding, real-world applications.
Modern systems like Derecho, also from Cornell, continue to develop virtually synchronous communication. Derecho is designed for modern data center networks and provides Paxos-based state machine replication with performance approaching hardware limits.
Summary
Group communication provides reliable, ordered message delivery to groups of processes. Reliability ranges from best-effort to reliable to durable. Ordering ranges from unordered through single source FIFO, causal, total, and synchronous. Stronger guarantees require more coordination overhead.
Failure detection identifies crashed processes, with an inherent tradeoff between detection speed and false positive rate. The phi accrual failure detector learns normal heartbeat patterns and provides a continuous suspicion level based on how abnormal the current silence is. Many other protocols exist for large-scale membership and failure detection (for example, SWIM), but we will not cover them in detail here.
Virtual synchrony coordinates group membership changes with message delivery, ensuring that all processes agree on which messages were delivered in which view. Views provide a consistent snapshot of group membership, and view change protocols coordinate transitions between views. The concepts remain relevant in modern coordination systems even when the specific protocol is not used.
These primitives form the foundation for consensus, replication, and coordination services, which we will explore in the next lecture.