
Paul Krzyzanowski

November 2017

Consensus

Reaching agreement

Introduction

Consensus is the task of getting all processes in a group to agree on some specific value based on the votes proposed by one or more processes. All processes must agree upon the same value and it must be a value that was submitted by at least one of the processes (i.e., the consensus algorithm cannot just invent a value).

This is a simple-sounding problem, but it has a surprisingly large number of uses in distributed systems. Any algorithm that relies on multiple processes maintaining common state relies on solving the consensus problem.

Some examples of places where consensus is useful are:

  • synchronizing replicated state machines and making sure all replicas have the same (consistent) view of system state
  • electing a leader (e.g., for mutual exclusion)
  • distributed, fault-tolerant logging with globally consistent sequencing
  • managing group membership
  • deciding to commit or abort for distributed transactions

Consensus among processes is easy to achieve in a perfect world.
For example, when we examined distributed mutual exclusion algorithms earlier, we visited a form of consensus where everybody reaches the same decision on who can access a resource. The simplest implementation was to assign a system-wide coordinator who is in charge of determining the outcome.

This is easy because we assume that all processes are functioning and are able to communicate with each other. Faulty processes, computers, and networks make consensus challenging.

Paxos

Paxos is a popular and widely-used fault-tolerant distributed consensus algorithm. It allows a globally consistent (total) order to be assigned to client messages (actions).

Much of what is summarized here is from Lamport’s Paxos Made Simple but I tried to simplify it substantially. Please refer to that paper for more detail and definitive explanations.

The goal of a distributed consensus algorithm is to allow a set of computers to all agree on a single value that one of the nodes in the system proposed (as opposed to making up a random value). The challenge in doing this in a distributed system is that messages can be lost or machines can fail. Paxos guarantees that a set of machines will choose a single proposed value as long as a majority of the systems that participate in the algorithm are available.

The setting for the algorithm is that of a collection of processes that can propose values. The algorithm has to ensure that a single one of those proposed values is chosen and all processes should learn that value.

There are three classes of agents:

  1. Proposers
  2. Acceptors
  3. Learners

A machine can take on any or all of these roles; for example, learners are often integrated with acceptors.

Proposers put forth proposed values. Acceptors drive the algorithm’s goal of reaching agreement on a single value and inform the learners of the outcome. An acceptor either rejects a proposal or agrees to it and makes promises about which proposals it will accept in the future. This ensures that only the latest proposals can be accepted. A process can act as more than one agent in an implementation. Indeed, many implementations have collections of processes where each process takes on all three roles.

Agents communicate with each other asynchronously. They may also fail to communicate and may restart. Messages can take arbitrarily long to deliver. They may be duplicated or lost but are not corrupted. A corrupted message should be detectable as such and can be counted as a lost one (this is what UDP does, for example).

The absolutely simplest implementation contains a single acceptor. A proposer sends a proposal value to the acceptor. The acceptor processes one request at a time, chooses the first proposed value that it receives, and lets everyone (learners) know. Other proposers must agree to that value.

This works as long as the acceptor does not fail. Unfortunately, acceptors are subject to failure. To guard against the failure of an acceptor, we turn to replication and use multiple acceptor processes. A proposer now sends a proposal containing a value to a set of acceptors. The value is chosen when a majority of the acceptors accept that proposal (agree to it).

Different proposers, however, could independently initiate proposals at approximately the same time and those proposals could contain different values. They each will communicate with a different subset of acceptors. Now different acceptors will each have different values but none will have a majority. We need to allow an acceptor to be able to accept more than one proposal. We will keep track of proposals by assigning a unique proposal number to each proposal. Each proposal will contain a proposal number and a value. The value is the thing on which we need to agree; for example setting a name=value field in a replicated database.

Each proposal must have a unique proposal number. Our goal is to agree on one of those proposed values from the pool of proposals sent to different subsets of acceptors.

A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors. Multiple proposals can be chosen, but all of them must have the same value: if a proposal with a value v is chosen, then every higher-numbered proposal that is chosen must also have value v.

If a proposal with proposal number n and value v is issued, then there is a set S consisting of a majority of acceptors such that either:

  1. no acceptor in S has accepted any proposal numbered less than n, or
  2. v is the value of the highest-numbered proposal among all proposals numbered < n accepted by the acceptors in S.

A proposer that wants to issue a proposal numbered n must learn the highest numbered proposal with number less than n, if any, that has been or will be accepted by each acceptor in a majority of acceptors.

To do this, the proposer gets a promise from an acceptor that there will be no future acceptance of proposals numbered less than n.
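
The exchanges described in the next section can be summarized as four message types. The sketch below is just one illustrative way to represent them; the type names and the tuple encoding of proposal numbers are assumptions of the sketch, not part of Paxos itself:

    # Illustrative Paxos message types. A proposal number is modeled as a
    # (sequence_number, process_id) pair so that ordinary tuple comparison
    # gives a total order across proposers (see phase 1A below).
    from dataclasses import dataclass
    from typing import Any, Optional, Tuple

    Number = Tuple[int, int]          # (sequence_number, process_id)

    @dataclass
    class Prepare:                    # phase 1: proposer -> acceptors
        number: Number

    @dataclass
    class Promise:                    # phase 1: acceptor -> proposer
        number: Number                             # the number being promised
        accepted_number: Optional[Number] = None   # highest proposal accepted so far, if any
        accepted_value: Any = None

    @dataclass
    class Accept:                     # phase 2: proposer -> acceptors
        number: Number
        value: Any

    @dataclass
    class Accepted:                   # phase 2: acceptor -> learners
        number: Number
        value: Any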

The Paxos algorithm

With Paxos, a client sends a proposal (e.g., a name=value setting) to a proposer, which is then responsible for running the algorithm. The Paxos algorithm operates in two phases:

Phase 1: PREPARE: send a proposal request

1A. Proposer: prepare

  • A proposer chooses a proposal number N and sends a prepare request to a majority of acceptors. The number N is stored in the proposer’s stable storage so that the proposer can ensure that a higher number is used for the next proposal (even if the proposer process restarts). To ensure uniqueness among all proposers, the proposal number can be of the form sequence_number.process_id, where sequence_number is a monotonically-increasing local number and process_id is a unique identifier for the process, such as the machine’s IP or Ethernet MAC address concatenated with its process ID. A sketch of this numbering scheme appears after this list.

  • The proposer sends a prepare(N) message to the majority of acceptors.
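
As a concrete illustration of this numbering scheme, a proposer might generate its next proposal number as sketched below. The file-based stable storage and the function name are assumptions made for the sketch; any persistent store that survives a restart would do.

    import json
    import os

    def next_proposal_number(proposer_id, state_file="proposer_state.json"):
        """Return a (sequence_number, process_id) pair that is unique across
        proposers and higher than any number this proposer used before."""
        sequence = 0
        if os.path.exists(state_file):
            with open(state_file) as f:
                sequence = json.load(f)["sequence"]
        sequence += 1
        with open(state_file, "w") as f:          # persist before using the number
            json.dump({"sequence": sequence}, f)
            f.flush()
            os.fsync(f.fileno())                  # so the number survives a restart
        # Tuples compare element by element, so a higher sequence number always
        # wins and ties between proposers are broken by the unique process id.
        return (sequence, proposer_id)

    # e.g., next_proposal_number(proposer_id=12) -> (1, 12), then (2, 12), ...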

1B. Acceptor: promise – receive a prepare(N) message

  • The acceptor promises not to accept any proposals with numbers smaller than N. If it has already seen a proposal number greater than N, it rejects this prepare(N) request. To do this, it keeps track of the highest proposal number that it has seen.
if (N > max_received_proposal)
    max_received_proposal = N      # promise: no proposals numbered below N will be accepted
else
    reject()                       # a higher-numbered proposal has already been seen
  • It is possible that the acceptor may have already accepted a proposal (e.g., one that came concurrently from another proposer). In this case, it will convey that information to the proposer. To be able to do this, the acceptor keeps track of the highest previously accepted proposal number and its value.
if (have_accepted_proposal)
    promise(accepted_number, accepted_value)    # report the earlier acceptance and its value
else
    promise(N)

1C. Proposer: receive promise messages from a majority of acceptors

  • If a proposer receives promise messages from a majority of acceptors, it can now choose a value. If any of the acceptors returned an accepted proposal, the proposer chooses the value associated with the highest proposal number. The proposer must use this value instead of the one it originally proposed. Note that the proposer changes its value but does not change its proposal number in this case. If no acceptor returned an accepted value, then the proposer is free to use the value it originally proposed along with its proposal number.
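
A sketch of this decision, assuming the promise replies have been collected into a list of records that carry accepted_number and accepted_value fields (as in the Promise sketch earlier); the function name and layout are illustrative:

    def choose_value(promises, my_value, num_acceptors):
        """Decide which value to send in the accept phase, or return None
        if promises from a majority of acceptors have not arrived."""
        if len(promises) <= num_acceptors // 2:
            return None                            # no majority yet: cannot proceed
        # Promises from acceptors that already accepted an earlier proposal
        # carry that proposal's number and value.
        already_accepted = [(p.accepted_number, p.accepted_value)
                            for p in promises if p.accepted_number is not None]
        if already_accepted:
            # Adopt the value of the highest-numbered accepted proposal; the
            # proposer keeps its own proposal number N.
            _, value = max(already_accepted, key=lambda pair: pair[0])
            return value
        return my_value                            # free to use the original value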

Phase 2: ACCEPT: send a proposal (and then propagate it to learners after acceptance)

Proposer:

  • A proposer can now issue its proposal. Note that the value is either the proposer’s initial value or a value it received from an acceptor. It will send a message to all acceptors (reaching a majority): accept(N, value)

Acceptor:

  • When an acceptor receives an accept(N, value) message, it checks whether N is at least as high as the highest proposal number it has seen. If so, it accepts the proposal and stores information about the request so it can return it to other proposers if necessary. It also sends the proposal value to each learner node. Note that, in some implementations, the learner may be implemented at the proposer; in that case, the accepted value is returned to the proposer.

if (N ≥ max_received_proposal) {
    accepted_value = value
    accepted_number = N
    max_received_proposal = N
    have_accepted_proposal = true
}
return max_received_proposal (or send to learner)

In all cases, the acceptor returns the maximum received proposal number.

Learner:

  • The learner (or the proposer, if it implements the learner’s function) must receive responses from a majority of acceptors. If a response contains a proposal number greater than the one the proposer submitted, then the proposer knows that its request has been rejected.

The acceptor receives two types of requests from proposers: prepare and accept requests. Any request can be ignored. An acceptor only needs to remember the highest-numbered proposal that it has ever accepted and the number of the highest-numbered prepare request to which it has responded. The acceptor must store these values in stable storage so they can be preserved in case the acceptor fails and has to restart.
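
The sketch below shows one way an acceptor could follow these rules, keeping the promised and accepted state in a small file so that it survives a restart. The storage format, the method names, and the use of a plain comparable proposal number are assumptions made for the sketch.

    import json
    import os

    class Acceptor:
        def __init__(self, state_file="acceptor_state.json"):
            self.state_file = state_file
            # promised: number of the highest prepare request responded to
            # accepted_number / accepted_value: highest-numbered accepted proposal
            self.state = {"promised": None, "accepted_number": None, "accepted_value": None}
            if os.path.exists(state_file):          # recover state after a crash
                with open(state_file) as f:
                    self.state = json.load(f)

        def _persist(self):
            with open(self.state_file, "w") as f:   # stable storage before replying
                json.dump(self.state, f)
                f.flush()
                os.fsync(f.fileno())

        def on_prepare(self, n):
            """Phase 1: promise not to accept anything numbered below n, or reject.
            n is assumed to be a plain, comparable proposal number."""
            if self.state["promised"] is None or n > self.state["promised"]:
                self.state["promised"] = n
                self._persist()
                return ("promise", self.state["accepted_number"], self.state["accepted_value"])
            return ("reject", self.state["promised"])

        def on_accept(self, n, value):
            """Phase 2: accept unless a higher-numbered prepare has been promised."""
            if self.state["promised"] is None or n >= self.state["promised"]:
                self.state.update(promised=n, accepted_number=n, accepted_value=value)
                self._persist()
                return ("accepted", n, value)        # also forwarded to the learners
            return ("reject", self.state["promised"])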

A proposer can make multiple proposals as long as it follows the algorithm for each one.

Consensus

Now that the acceptors have a proposed value, we need a way to learn that a proposal has been accepted by a majority of acceptors. The learner is responsible for getting this information, although its role is often integrated into the proposer process. Each acceptor, upon accepting a proposal, forwards it to all the learners. The problem with doing this is the potentially large number of duplicate messages: (number of acceptors) * (number of learners). If desired, this could be optimized. One or more “distinguished learners” could be elected. Acceptors will communicate to them and they, in turn, will inform the other learners.
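
A sketch of a learner that declares a value chosen once it has received accepted notifications for the same proposal from a majority of acceptors; the method name and message layout are illustrative, and values are assumed to be hashable:

    from collections import defaultdict

    class Learner:
        def __init__(self, num_acceptors):
            self.num_acceptors = num_acceptors
            self.votes = defaultdict(set)    # (proposal_number, value) -> acceptor ids
            self.chosen = None

        def on_accepted(self, acceptor_id, number, value):
            """Called whenever an acceptor reports that it accepted a proposal."""
            self.votes[(number, value)].add(acceptor_id)
            if len(self.votes[(number, value)]) > self.num_acceptors // 2:
                self.chosen = value          # a majority accepted this proposal
            return self.chosen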

Ensuring progress

One problem with the algorithm is that it is possible for two proposers to keep issuing sequences of proposals with increasing numbers, none of which get chosen. An accept message from one proposer may be ignored by an acceptor because a higher-numbered prepare message has been processed from the other proposer. To ensure that the algorithm makes progress, a “distinguished proposer” is selected as the only one to try issuing proposals.

In operation, clients send commands to the leader, an elected “distinguished proposer”. This proposer sequences the commands (assigns a value) and runs the Paxos algorithm to ensure that an agreed-upon sequence number gets chosen. Since there might be conflicts due to failures or another server thinking it is the leader, using Paxos ensures that only one command (proposal) gets assigned that value.

Leasing versus Locking

Processes often rely on locks to ensure exclusive access to a resource. The difficulty with locks is that they are not fault-tolerant. If a process holding a lock dies or forgets to release it, the lock remains held unless additional software is in place to detect the problem and break the lock. For this reason, it is safer to add an expiration time to a lock. This turns a lock into a lease.
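
A minimal sketch of this idea: the lease below is just a lock with an expiration time, so a crashed holder stops blocking others once its lease lapses (the class and method names are illustrative):

    import time

    class Lease:
        def __init__(self, duration):
            self.duration = duration       # lease lifetime in seconds
            self.holder = None
            self.expires_at = 0.0

        def acquire(self, client_id, now=None):
            now = time.time() if now is None else now
            if self.holder is None or now >= self.expires_at:
                # Free, or the previous holder's lease expired (e.g., it crashed).
                self.holder = client_id
                self.expires_at = now + self.duration
                return True
            return self.holder == client_id    # the current holder may re-acquire

        def renew(self, client_id, now=None):
            """The holder must call this periodically to keep the lease."""
            now = time.time() if now is None else now
            if self.holder == client_id and now < self.expires_at:
                self.expires_at = now + self.duration
                return True
            return False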

We saw an example of this approach with the two-phase and three-phase commit protocols. A two-phase commit protocol uses locking while the three-phase commit uses leasing; if a lease expires, the transaction is aborted. We also saw this approach with maintaining references to remote objects. If the lease expires, the server considers the object unreferenced and suitable for deletion. The client is responsible for renewing the lease periodically as long as it needs the object.

The downside of a leasing approach is that the resource is unavailable to others until the lease expires. This leads to a trade-off: long leases with a possibly long wait after a failure, or short leases that must be renewed frequently.

Hierarchical leases versus consensus

In a fault-tolerant system with replicated components, leases for resources should be granted by running a consensus algorithm. Looking at Paxos, it is clear that, while there is not a huge amount of message passing taking place, there are a number of participants involved and hence there is a certain efficiency cost in using the algorithm. A compromise approach is to use the consensus algorithm as an election algorithm to elect a coordinator. This coordinator is granted a lease on a large set of resources or the state of the system. In turn, the coordinator is now responsible for handing out leases for all or a subset of the system state.
When the coordinator’s main lease expires, a consensus algorithm has to be run again to grant a new lease and possibly elect a new coordinator but it does not have to be run for every client’s lease request; that is simply handled by the coordinator.
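
A sketch of the hierarchy described above: the coordinator’s own master lease is granted through consensus (the election itself is outside the sketch), and the sub-leases it hands out must never outlive that master lease. All names here are illustrative.

    import time

    class Coordinator:
        """Elected by consensus; holds a master lease and grants shorter sub-leases."""
        def __init__(self, master_expires_at):
            self.master_expires_at = master_expires_at   # set when the election grants the lease
            self.sub_leases = {}                         # resource -> (client_id, expires_at)

        def grant(self, resource, client_id, duration, now=None):
            now = time.time() if now is None else now
            if now >= self.master_expires_at:
                return None          # our own lease lapsed; a new consensus round is needed
            current = self.sub_leases.get(resource)
            if current and current[1] > now and current[0] != client_id:
                return None          # another client still holds a valid sub-lease
            # A sub-lease must not outlive the coordinator's master lease.
            expires_at = min(now + duration, self.master_expires_at)
            self.sub_leases[resource] = (client_id, expires_at)
            return expires_at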

References

Leslie Lamport, Paxos Made Simple, November 2001. One of the clearest papers detailing the Paxos algorithm.

Angus MacDonald, Paxos By Example, June 27, 2012. A short and clear walkthrough of an example of using Paxos to achieve consensus.

Butler Lampson, How to Build a Highly Available System Using Consensus, Microsoft Research. An updated version of the paper that appeared in Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science 1151, Springer, 1996, pp. 1–17. A great coverage of leases, the Paxos algorithm, and the need for consensus in achieving highly available computing using replicated state machines.

Henry Robinson, Consensus Protocols: Paxos, Paper Trail blog, February 2009.

Yair Amir and Jonathan Kirsch, Paxos for System Builders: An Overview, Johns Hopkins University. Written from a system builder’s perspective and covers some of the details of implementation. A brief (5-page) overview.

Yair Amir and Jonathan Kirsch, Paxos for System Builders, Johns Hopkins University, Technical Report CNDS-2008-2, March 2008. The 35-page full version of the above paper.

Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, Volume 32, No. 2, April 1985, pp. 374–382. The seminal paper (known as FLP85) proving that consensus cannot be achieved with completely asynchronous faulty processes.

Gabriel Bracha and Sam Toueg, Asynchronous Consensus and Broadcast Protocols, Journal of the ACM, 32, 4 (October 1985), pp. 824–840. Describes a fail-stop consensus algorithm.

Danny Dolev, Cynthia Dwork, and Larry Stockmeyer, On the Minimal Synchronism Needed for Distributed Consensus, Journal of the ACM, 34, 1 (January 1987), pp. 77–97.

This is an updated version of a document that was originally published in October 2011.