Lookup Without a Coordinator
GFS and HDFS depend on a central metadata server. There is one master that knows where everything lives. That is what makes those systems manageable, but it also creates a fundamental scaling limit and a single point of administrative control.
What if you wanted a storage or lookup system that had no central authority at all? This question became pressing in the late 1990s with the rise of peer-to-peer file sharing. Napster, launched in 1999, was enormously popular but relied on a central index server to match users with the files they wanted. When Napster’s servers were shut down by court order in 2001, the service died. A truly decentralized system would not have that vulnerability.
The obvious response was to eliminate the central server entirely. Gnutella (2000) and Kazaa (2001) did this by having each node flood its query to its neighbors, which forwarded it to their neighbors, and so on. If someone had the file, they would eventually hear about the request and respond. This worked, but it did not scale. Every query generated traffic proportional to the size of the network, and with millions of nodes, the bandwidth consumed by query flooding was enormous. Gnutella in particular was known for generating so much background traffic that it strained the networks it ran on.
The academic community responded with a family of systems called Distributed Hash Tables, or DHTs. The central idea is to spread a hash table across many nodes so that any node can find the data it is looking for without asking a central authority.
DHTs give you a lookup primitive: given a key, find the value and the node responsible for storing it.
Consistent Hashing
The Problem with Ordinary Hash Tables
In a conventional hash table, you store a key-value pair by computing bucket = hash(key) mod N, where N is the number of buckets. This works fine when N is fixed. But in a distributed system, N changes constantly: nodes crash, new nodes are added, and the system must keep running throughout. If N changes by even one, almost every key maps to a different bucket. You would have to move nearly all data to its new location. That is prohibitively expensive.
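A quick illustration of the problem, hashing 10,000 synthetic keys into 10 buckets and then 11 (the key names are made up; SHA-256 stands in for any uniform hash):

```python
import hashlib

def bucket(key: str, n: int) -> int:
    # SHA-256 rather than Python's built-in hash(), which is salted per process.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % n

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if bucket(k, 10) != bucket(k, 11))
print(f"{moved / len(keys):.0%} of keys change buckets")  # around 90%
```

Going from 10 to 11 buckets reassigns roughly 10 out of every 11 keys, so nearly all data would have to move.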
What Hashing Actually Does
Before going further, it helps to be precise about what a hash function does.
A hash function takes an input of arbitrary size and produces a fixed-size output called a digest or hash value. The function is deterministic: the same input always produces the same output. A good hash function also distributes inputs uniformly across the output space and makes it computationally infeasible to find two inputs that produce the same output (collision resistance).
DHTs typically use a cryptographic hash function like SHA-256 (part of the SHA-2 family). SHA-256 produces a 256-bit output: a number between 0 and 2^256 - 1. Any string you give it (a filename, an IP address, a user ID) maps to a 256-bit number in a way that looks random but is completely reproducible. Two different strings map to very different outputs with no discernible pattern.
The key properties SHA-256 provides for DHTs are uniform distribution (keys spread evenly across the output space) and large output space (2^256 possible values means collisions are astronomically unlikely in practice).
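These properties are easy to observe directly with Python's standard `hashlib` (the IP-address-like inputs are illustrative):

```python
import hashlib

def sha256_int(s: str) -> int:
    """Map an arbitrary string to an integer in [0, 2**256 - 1]."""
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

a = sha256_int("10.0.0.1")
b = sha256_int("10.0.0.2")           # one character different
assert 0 <= a < 2**256               # output lives in the 256-bit space
assert a != b                        # outputs look unrelated
assert sha256_int("10.0.0.1") == a   # deterministic: same input, same digest
```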
Consistent Hashing
Consistent hashing, introduced by Karger et al. in 1997, solves the re-mapping problem. The core idea is to hash a key into the same large identifier space, and then assign each node responsibility for a contiguous range of hash values. A key belongs to whichever node owns the range that contains the key’s hash.
When a new node is added, it takes over a portion of an existing node’s range. Only the keys in that portion need to move; all other keys stay where they are. When a node is removed, its range is absorbed by one or more neighboring nodes. Again, only the keys in the departing node’s range are affected.
This is the “consistent” in consistent hashing: adding or removing one node only reassigns the keys that belonged to that node. In a system with N nodes and K keys, a membership change moves on average K/N keys, not K keys as ordinary modular hashing would require.
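A minimal sketch of this behavior, placing hypothetical nodes A through E and 10,000 synthetic keys on a ring via SHA-256 (all names are illustrative):

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class Ring:
    """Each node owns the range of hashes ending at its own position."""
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        i = bisect.bisect(self.points, (h(key),))
        return self.points[i % len(self.points)][1]  # wrap past the top

keys = [f"key-{i}" for i in range(10_000)]
ring = Ring(["A", "B", "C", "D"])
before = {k: ring.owner(k) for k in keys}

ring = Ring(["A", "B", "C", "D", "E"])   # one node joins
moved = [k for k in keys if ring.owner(k) != before[k]]
# Only keys falling in E's new range are reassigned, and all of them move to E.
assert all(ring.owner(k) == "E" for k in moved)
assert len(moved) < len(keys)
```

Contrast this with the modular scheme, where adding one bucket reassigns almost everything.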
Consistent Hashing vs. Conventional Hash Tables
The table below summarizes the key differences.
| Property | Conventional hash table | Consistent hash table |
|---|---|---|
| Key assignment | hash(key) mod N | Node owning that hash range |
| Effect of adding/removing a node | Almost all keys reassigned | Only the affected range reassigned |
| Node discovery | Direct index lookup | Routing table or ring traversal |
| Suitable for | Fixed number of buckets | Dynamic membership |
Different DHTs implement the range assignment differently. CAN uses geometric zones in a multi-dimensional space; Chord uses a one-dimensional ring with successor pointers. Both are forms of consistent hashing.
CAN: Content Addressable Network
CAN (Content-Addressable Network), proposed at UC Berkeley in 2001, uses a geometric approach to DHT routing.
Imagine a multi-dimensional Cartesian coordinate space: a rectangle in 2D, a cube in 3D, or higher. The 2D case is the easiest to imagine, so we’ll stick with that. Each node in the system owns a zone: a contiguous rectangular region of this space. A zone has minimum and maximum x values and minimum and maximum y values. Any point with x and y coordinates within that range belongs to the zone.
Together, all zones partition the entire space with no gaps and no overlaps.
Mapping a Key to a Point
A single hash function produces one value, so getting a point in a d-dimensional space requires d independent hash values. CAN applies d different hash functions to the key, one per dimension. In a 2D space, hash_x(key) gives the x-coordinate and hash_y(key) gives the y-coordinate. The pair (hash_x(key), hash_y(key)) is the point in the coordinate space where the key-value pair lives.
The pair is stored on the node whose zone contains that point. Lookups hash the key the same way to recover the target coordinates, then route toward them.
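One common way to derive the d independent hashes is to salt the key with the dimension index; the CAN paper only requires d uniform hash functions, so this construction and the coordinate bound below are illustrative:

```python
import hashlib

SPACE = 1000   # illustrative coordinate bound per dimension

def hash_dim(key: str, dim: int) -> int:
    """One independent hash per dimension, derived by salting the key
    with the dimension index."""
    digest = hashlib.sha256(f"{dim}:{key}".encode()).hexdigest()
    return int(digest, 16) % SPACE

def point(key: str):
    return (hash_dim(key, 0), hash_dim(key, 1))

p = point("song.mp3")
assert point("song.mp3") == p            # deterministic: lookups find it again
assert all(0 <= c < SPACE for c in p)    # the point lies inside the space
```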
Routing in Two Dimensions
Each node knows only its immediate neighbors: the nodes whose zones share a border with its own. Routing is a simple rule applied at every hop. Suppose a node receives a lookup for a key that hashes to coordinates (x, y), and the node’s zone covers the rectangle [x_min, x_max] x [y_min, y_max].
- If x < x_min, forward the request to the western neighbor.
- If x > x_max, forward the request to the eastern neighbor.
- If y < y_min, forward the request to the southern neighbor.
- If y > y_max, forward the request to the northern neighbor.
- If (x, y) falls inside the zone, this node is responsible; handle the request locally.
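The per-hop decision above can be sketched for one 2D zone (the zone bounds are made up):

```python
def route(x, y, zone):
    """zone holds x_min, x_max, y_min, y_max.
    Returns the direction to forward, or None if this zone is responsible."""
    if x < zone["x_min"]:
        return "west"
    if x > zone["x_max"]:
        return "east"
    if y < zone["y_min"]:
        return "south"
    if y > zone["y_max"]:
        return "north"
    return None  # (x, y) is inside this zone: handle locally

zone = {"x_min": 250, "x_max": 500, "y_min": 0, "y_max": 500}
assert route(100, 300, zone) == "west"    # target lies to the west
assert route(300, 700, zone) == "north"   # target lies to the north
assert route(300, 300, zone) is None      # target is inside this zone
```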
Each hop moves the request strictly closer to the target zone. In a d-dimensional space with n nodes, the average lookup takes O(d * n^(1/d)) hops. More dimensions mean fewer hops on average but more neighbors to maintain per node (roughly 2d neighbors for a d-dimensional space).
Node Joins and Departures
Node joins work by splitting an existing zone. A new node contacts any existing node, which routes to the zone that contains the new node’s hash position. The zone owner splits its zone in half and transfers one half to the new node along with the key-value pairs it contains. Departures work in reverse: a leaving node’s zone merges with a neighbor’s zone.
Chord
Chord, proposed at MIT in 2001, simplifies the geometry to a single dimension: a ring.
Nodes and keys are both assigned identifiers by hashing. Node identifiers come from hashing the node’s IP address; key identifiers come from hashing the key. Both identifiers live in the same SHA-1 space, a 160-bit ring of integers from 0 to 2^160 - 1.
A key is assigned to the successor of its identifier on the ring: the node with the smallest identifier greater than or equal to the key’s hash. If nodes have IDs 10, 20, 40, 70, and 90 on a ring, then a key hashing to 35 would be assigned to node 40.
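The successor rule can be written directly, using the node IDs from the example:

```python
import bisect

def successor(node_ids, key_id):
    """Node with the smallest ID >= key_id, wrapping around the ring."""
    nodes = sorted(node_ids)
    i = bisect.bisect_left(nodes, key_id)
    return nodes[i % len(nodes)]

ring = [10, 20, 40, 70, 90]
assert successor(ring, 35) == 40   # the example from the text
assert successor(ring, 40) == 40   # an exact match maps to that node
assert successor(ring, 95) == 10   # wraps around past the top of the ring
```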
With only successor pointers, lookup takes O(n) hops in the worst case (walk around the ring). Chord makes this efficient with a finger table at each node. A node at position p has an entry for the successor of positions p + 2^0, p + 2^1, p + 2^2, …, p + 2^(k-1), where k is the number of bits in the identifier. This gives O(log n) lookup: at each hop, the message covers at least half the remaining distance to the target.
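A sketch of finger-table construction on a toy 6-bit ring (k = 6, so identifiers run 0 to 63; the node IDs are made up rather than SHA-1 outputs):

```python
import bisect

K = 6            # identifier bits (Chord uses 160 with SHA-1; 6 keeps it small)
RING = 2 ** K

def successor(nodes, ident):
    i = bisect.bisect_left(nodes, ident % RING)
    return nodes[i % len(nodes)]

def finger_table(nodes, p):
    """finger[i] = successor of (p + 2**i) mod 2**K."""
    return [successor(nodes, p + 2**i) for i in range(K)]

nodes = sorted([8, 18, 30, 43, 55])
print(finger_table(nodes, 8))   # [18, 18, 18, 18, 30, 43]
```

The later entries jump far across the ring, which is what lets each hop halve the remaining distance.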
Node joins and departures must update finger tables and transfer keys. The Chord paper describes a stabilization protocol that keeps the ring consistent as nodes join and leave.
DHTs like Chord and CAN solve a real problem elegantly, but they come with trade-offs. The lookup guarantee is for exact key matching; they do not naturally support keyword search or range queries. And if one key is requested millions of times per second, the node responsible for it is overwhelmed. Real deployments built caching layers on top to address this.
Amazon Dynamo
In 2007, Amazon published the Dynamo paper describing the key-value store it uses internally. Dynamo powers services like shopping carts, user preferences, best-seller lists, and session management. These are all use cases where applications need only primary-key access to data, and the volume of requests is too high for a multi-table relational database.
A relational database would be overkill for these workloads and would limit both scalability and availability. A distributed file system is also unsuitable: there is no need for partial reads and writes, directory structure, or POSIX semantics. The right abstraction is a simple key-value store that is always available for reads and writes.
The Dynamo paper is worth reading because it documents a system built under real operational constraints, and it does not hide the compromises it makes.
The API
Dynamo exposes two operations:
- get(key) returns the value associated with the key, along with a context value.
- put(key, data, context) stores the value associated with the key.
The data is an arbitrary sequence of bytes, typically less than 1 MB, identified by a unique key. The context is a value returned by a prior get and sent back with a subsequent put. It is opaque to the calling application; treat it like a cookie. Internally, it carries version information in the form of a vector clock. This allows Dynamo to detect whether a put is based on an up-to-date read or on a stale one. The application does not need to understand the context; it just has to pass it back.
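A toy in-memory stand-in showing the calling pattern. Here the context is just an integer version counter; Dynamo's real context carries a vector clock, and `TinyStore` is purely illustrative:

```python
class TinyStore:
    """Illustrative stand-in for the two-call API; not Dynamo code."""
    def __init__(self):
        self.data = {}                   # key -> (value, version)

    def get(self, key):
        value, version = self.data.get(key, (None, 0))
        return value, version            # version plays the role of the context

    def put(self, key, value, context):
        # A real Dynamo compares the context's vector clock to detect
        # whether the write is based on a stale read.
        self.data[key] = (value, context + 1)

store = TinyStore()
_, ctx = store.get("cart:alice")         # read first, even for a new key
store.put("cart:alice", {"book-123"}, ctx)
value, ctx = store.get("cart:alice")
assert value == {"book-123"}
```

The point is the shape of the interaction: every put carries back the context from a prior get, unchanged.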
A Decentralized System
Dynamo is fully decentralized. There is no central coordinator that oversees operations, no master that tracks which node owns which data, and no leader that must be consulted before a request can proceed. Every node in the system has equivalent responsibilities. The system can grow dynamically by adding new nodes without any reconfiguration of existing ones.
Because of this, each node in a Dynamo cluster serves three distinct functions:
1. Handling get and put requests. A node may act as the coordinator for a request (the node that manages the quorum read or write for a particular key) or it may simply forward the request to the coordinator. Any node can receive a client request, and any node that knows the key’s location on the ring can coordinate it.
2. Membership and failure detection. Each node tracks which other nodes are in the system and which hash ranges they are responsible for. Nodes detect when peers go offline and update their local view of the ring membership accordingly, without asking any central authority.
3. Local persistent storage. Each node is the primary or replica store for the keys that hash into its assigned range. The actual storage engine can vary depending on application requirements. Dynamo supports the Berkeley Database Transactional Data Store, MySQL for large objects, and an in-memory buffer with persistent backing store for highest-performance use cases.
A Zero-Hop DHT
Like Chord, Dynamo uses consistent hashing to assign keys to nodes on a ring. The key difference in routing is that Dynamo nodes maintain a complete view of the ring: every node knows the position and address of every other node. This makes Dynamo a zero-hop DHT. When a node receives a request, it can compute the responsible node directly from its local ring state and send the request there in a single network hop.
Chord, by contrast, uses finger tables to limit the routing table size: each node stores O(log n) entries and forwards through O(log n) intermediate nodes. That is the right trade-off for a peer-to-peer network that might have millions of nodes with no administrative control. Dynamo operates in a data center with hundreds or at most a few thousand nodes. Storing the full ring state at each node is a small overhead, and eliminating intermediate hops has significant performance benefits.
Consistent Hashing with Virtual Nodes
Like Chord, Dynamo hashes keys to positions on a ring. Rather than assigning each physical server one position on the ring, Dynamo introduces virtual nodes (vnodes). Each physical server owns many positions on the ring, typically around 150. These positions are distributed randomly.
Virtual nodes serve two purposes. First, they provide balanced load distribution. With a small number of nodes and a single ring position per node, the hash function might assign disproportionately large key ranges to some nodes and small ranges to others. With 150 virtual nodes per server, the expected range size is much more uniform. Second, when a server fails, its virtual node positions are distributed across many surviving servers rather than all landing on one successor. When a new server joins, it takes a fraction of the load from many existing servers rather than one. This keeps load balanced through membership changes.
More powerful servers can be given more virtual nodes; less powerful ones receive fewer. This lets a heterogeneous cluster be balanced without any manual key range assignment.
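A sketch of the load-balancing effect, hashing four hypothetical servers to 150 ring positions each and counting which server owns each of 20,000 synthetic keys:

```python
import bisect
import hashlib
from collections import Counter

def h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

servers = ["s1", "s2", "s3", "s4"]
VNODES = 150   # ring positions per physical server, per the figure in the text
points = sorted((h(f"{srv}#{v}"), srv) for srv in servers for v in range(VNODES))

def owner(key: str) -> str:
    i = bisect.bisect(points, (h(key),))
    return points[i % len(points)][1]

load = Counter(owner(f"key-{i}") for i in range(20_000))
print(load)   # each server ends up with roughly 5,000 keys
```

With a single position per server instead, the same experiment typically shows one server owning several times the load of another.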
Replication
Each key-value pair is replicated on N consecutive distinct nodes on the ring (N is typically 3). The first replica goes to the coordinator for the key; the next N-1 replicas go to the next N-1 distinct physical nodes clockwise on the ring. This ordered list of N nodes is the preference list for that key, and it is known to every node in the cluster.
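Building a preference list means walking clockwise from the key's position and skipping further virtual nodes of servers already chosen, so the list holds N distinct physical machines. A sketch (server names and vnode counts are illustrative):

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

servers = ["s1", "s2", "s3", "s4", "s5"]
points = sorted((h(f"{srv}#{v}"), srv) for srv in servers for v in range(8))

def preference_list(key: str, n: int = 3):
    i = bisect.bisect(points, (h(key),))
    prefs = []
    while len(prefs) < n:
        srv = points[i % len(points)][1]
        if srv not in prefs:      # skip vnodes of servers already in the list
            prefs.append(srv)
        i += 1
    return prefs

plist = preference_list("cart:alice")
assert len(plist) == 3 and len(set(plist)) == 3   # three distinct servers
```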
Tunable Consistency: N, R, and W
Dynamo makes consistency and fault tolerance configurable at the application level through three parameters:
- N: the number of replicas to maintain for each key.
- W: the number of replicas that must acknowledge a write before it is considered successful.
- R: the number of replicas that must respond to a read before it is considered successful.
The condition R + W > N guarantees that any read overlaps with the most recent write: the set of R nodes that respond to a read must include at least one node that participated in the most recent write of W nodes. With N=3, R=2, W=2, the system tolerates one replica failure without blocking reads or writes.
Applications can adjust these values to suit their needs. A write-heavy application that can tolerate reading slightly stale data might use W=1, R=3. A read-heavy application that needs the freshest possible data on reads might use W=3, R=1. The trade-offs are explicit: lower W means faster writes but less durability; lower R means faster reads but possible staleness.
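The overlap guarantee is a pigeonhole argument, and for small N it can be checked exhaustively (a brute-force sketch, not Dynamo code):

```python
from itertools import combinations

N, R, W = 3, 2, 2
assert R + W > N   # the quorum condition

for write_q in combinations(range(N), W):
    for read_q in combinations(range(N), R):
        # R + W > N forces at least one node into both sets.
        assert set(read_q) & set(write_q)
print("every read quorum overlaps every write quorum")
```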
Optimistic Replication and Eventual Consistency
Dynamo uses optimistic replication: updates are propagated to replicas asynchronously rather than synchronously. A write is acknowledged as soon as W replicas confirm it; the remaining replicas may not yet have the new value. Dynamo accepts that replicas may temporarily diverge and relies on eventual reconciliation to bring them into agreement. This is what is meant by eventual consistency: given no further writes, all replicas will converge to the same value, but during the window between a write and full propagation, different replicas may return different values for the same key.
The trade-off is availability. Because Dynamo does not wait for all replicas to agree before returning to the client, it can continue to serve requests even when some replicas are unreachable. In a strongly consistent system, a write that cannot reach all replicas must either block or fail. Dynamo chooses to succeed and reconcile later.
Conflict Resolution with Vector Clocks
When two replicas have different values for the same key (either because of a network partition or because of concurrent writes), Dynamo detects the divergence using vector clocks. Recall that the context returned by get and sent back with put carries a vector clock. Each put increments the entry for the coordinating node in the vector clock. If two versions of a value have vector clocks that are not causally related (neither happened-before the other), they are concurrent and must be reconciled.
Dynamo presents both conflicting versions to the application and expects the application to merge them. For a shopping cart, merging means taking the union of both sets of items. This places a burden on the developer that a strongly consistent system would avoid, but it gives the application full control over how conflicts are resolved.
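A minimal sketch of the comparison, with clocks represented as dicts from coordinator IDs to counters (the node names are illustrative):

```python
def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen everything b has (a >= b component-wise)."""
    return all(a.get(k, 0) >= v for k, v in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither happened-before the other: the versions conflict."""
    return not dominates(a, b) and not dominates(b, a)

v1 = {"node_a": 2, "node_b": 1}
v2 = {"node_a": 2, "node_b": 2}   # v1 plus one more write at node_b
v3 = {"node_a": 3, "node_b": 1}   # v1 plus one more write at node_a

assert dominates(v2, v1)    # v2 descends from v1: keep v2, no conflict
assert concurrent(v2, v3)   # independent updates: the application must merge
```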
Hinted Handoff
When a preferred replica is unavailable, Dynamo does not fail the write. Instead, it sends the write to a different node (one not in the preference list) with a hint indicating which node the write was intended for. The substitute node stores the write temporarily. When the intended node recovers, the substitute delivers the write to it. This mechanism, called hinted handoff, keeps writes available even when preferred replicas are down. The trade-off is that a node may temporarily hold data outside its normal responsibility range.
Gossip Protocol
Dynamo has no central directory of node membership. Instead, membership information and failure observations propagate through a gossip protocol. Each node periodically picks a random peer from its known member list and exchanges membership state with it. Both nodes then merge their views: if one node has observed that a third node has failed, the other learns of it at the next gossip exchange.
This epidemic propagation is efficient. After O(log N) rounds of gossip, information about any change has reached all N nodes in the cluster with high probability. Each round takes one network round-trip, so convergence is fast even with large clusters. There is no central node that needs to be informed of every change; the information simply spreads on its own.
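A toy push-gossip simulation makes the logarithmic spread visible (the parameters are illustrative): each round, every informed node tells one random peer.

```python
import random

def rounds_to_converge(n: int, seed: int = 0) -> int:
    """Rounds until a rumor starting at node 0 reaches all n nodes."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        for _ in range(len(informed)):      # each informed node gossips once...
            informed.add(rng.randrange(n))  # ...to one random peer
        rounds += 1
    return rounds

print(rounds_to_converge(1000))   # on the order of log2(1000) ≈ 10, plus a tail
```

Since the informed set can at most double per round, 1,000 nodes need at least 10 rounds; the last few stragglers add a logarithmic tail on top of that.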
Gossip also handles node joins. A new node contacts any existing node, announces itself, and its presence propagates through subsequent gossip exchanges. Within a few rounds, all nodes know about the new member and the key ranges it is taking on.
Summary
DHTs solve the problem of decentralized key-value lookup. Consistent hashing is the foundation: both keys and nodes are hashed into the same identifier space, and each node is responsible for a range of values. Adding or removing a node moves only a small fraction of keys, making the approach practical for systems with dynamic membership.
CAN partitions a multi-dimensional coordinate space among nodes and routes greedily toward the target coordinates. Chord places everything on a one-dimensional ring and uses finger tables for O(log n) routing. Both are clean academic demonstrations of decentralized lookup.
Amazon Dynamo takes the consistent hashing foundation and builds a complete production storage system on it. Its key additions (virtual nodes for load balance, N/R/W quorums for tunable consistency, optimistic replication with eventual consistency, vector clocks for conflict detection, hinted handoff for availability, and gossip for membership) all serve the same operational goal: keep the system available and responsive under the failure conditions that are routine in a large data center.
References
- S. Ratnasamy et al., A Scalable Content-Addressable Network, ACM SIGCOMM Computer Communication Review, vol. 31, no. 4, pp. 161-172, 2001.
- I. Stoica et al., Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications, IEEE/ACM Transactions on Networking, vol. 11, no. 1, pp. 17-32, Feb. 2003.
- G. DeCandia, D. Hastorun, et al., Dynamo: Amazon’s Highly Available Key-Value Store, ACM Symposium on Operating Systems Principles, 2007.