pk.org: CS 417/Lecture Notes

Decentralized Storage

Study Guide

Paul Krzyzanowski – 2026-03-07

We cover three distinct approaches to distributing storage and lookup across many nodes: data-intensive distributed file systems (GFS and HDFS), distributed hash tables (CAN, Chord, and Dynamo), and DNS as a planet-scale distributed data store.

Distributed File Systems: GFS and HDFS

Goals and Workload

The Google File System (GFS) was designed for a very specific workload: enormous files (multi-GB), sequential reads and writes, a high rate of concurrent appends, and an environment where commodity hardware failures are routine. It prioritizes throughput over latency, and it treats fault tolerance as a design assumption rather than an exceptional case.

The key insight driving the design is that if you know your workload in advance, you can build a system optimized for it rather than a general-purpose system that handles everything adequately.

Separation of Data and Metadata

GFS separates the control plane (metadata) from the data plane (file content). A single master node holds all metadata in memory: the namespace, the mapping of files to chunks, and chunk replica locations. Clients contact the master only to find out where data lives. All actual data transfer happens directly between clients and chunkservers, bypassing the master entirely.

This design keeps the master out of the data path, allowing it to handle many concurrent clients without becoming a throughput bottleneck.
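A minimal sketch of this read path, using hypothetical in-memory tables to stand in for the master's metadata (the file names, chunk handles, and chunkserver names are all invented for illustration):

```python
# Sketch of the GFS read path: the client asks the master only for chunk
# locations, then transfers data directly with a chunkserver.
CHUNK_SIZE = 64 * 2**20  # 64 MB

# Master metadata (hypothetical): file name -> ordered chunk handles.
file_to_chunks = {"/logs/web.log": ["c001", "c002"]}
# Chunk handle -> replica locations (hypothetical chunkserver names).
chunk_locations = {"c001": ["cs1", "cs2", "cs3"], "c002": ["cs4", "cs5", "cs6"]}

def master_lookup(path, offset):
    """Control plane: map (file, byte offset) to (chunk handle, replicas)."""
    index = offset // CHUNK_SIZE           # which chunk covers this offset
    handle = file_to_chunks[path][index]
    return handle, chunk_locations[handle]

# The client computes the chunk index itself and contacts the master only
# for locations; the actual data transfer then bypasses the master.
handle, replicas = master_lookup("/logs/web.log", 70 * 2**20)
# 70 MB falls past the 64 MB boundary, so this resolves to the second chunk.
```

The master's answer can be cached by the client, further reducing master load for repeated accesses to the same chunk.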

Chunks and Replication

Files are divided into fixed-size chunks of 64 MB, each identified by a unique 64-bit handle. Each chunk is stored as an ordinary file on a chunkserver’s local disk and replicated on three chunkservers by default. Replication is what makes the system fault-tolerant: if one chunkserver fails, two others still hold the data.

Master State and Fault Tolerance

The master stores metadata in memory for fast access. Namespace and file-to-chunk mappings are persisted to an operation log that is replicated on remote machines. Chunk locations are not persisted; the master rebuilds them by polling chunkservers at startup.

The operation log is the authoritative record of the file system. The master periodically checkpoints its state so that log replay after a crash is fast.

The Two-Phase Write

Writes in GFS involve two distinct phases, and understanding why they are separated matters.

Phase 1 (data transfer): The client pushes data to all replicas in a pipelined chain, from one chunkserver to the next. No writes occur yet; the data is simply buffered at each replica.

Phase 2 (write request): Once all replicas confirm receipt, the client sends a write request to the primary replica, the one holding the current lease for that chunk. The primary assigns a serial number to the mutation and applies it locally. It then forwards the write request to all secondaries, which apply the mutation in the same order. The primary acknowledges success to the client after all secondaries respond.

Separating data flow from control flow improves network efficiency. The primary’s lease serializes concurrent mutations, ensuring all replicas apply changes in the same order.
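The two phases can be sketched with in-memory dictionaries standing in for chunkservers (the replica names and record contents are hypothetical; real GFS pipelines data over the network and runs the lease protocol this toy omits):

```python
# Phase 1 buffers data everywhere; phase 2 applies it in primary-chosen order.
replicas = {name: {"buffer": None, "log": []} for name in ["primary", "sec1", "sec2"]}
serial = 0

def push_data(data):
    """Phase 1: pipeline the data to every replica; nothing is written yet."""
    for r in replicas.values():
        r["buffer"] = data

def write_request():
    """Phase 2: the primary assigns a serial number, and every replica
    applies the buffered mutation in that order."""
    global serial
    serial += 1
    for r in replicas.values():
        r["log"].append((serial, r["buffer"]))
        r["buffer"] = None
    return serial

push_data(b"record-A")
write_request()
push_data(b"record-B")
write_request()
# Every replica holds the same mutations in the same order.
assert all(r["log"] == replicas["primary"]["log"] for r in replicas.values())
```

The serial number is the crux: even if two clients push data concurrently, the primary imposes a single order that all replicas follow.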

Atomic Record Append

GFS supports a special record append operation in which the client provides data and GFS chooses the offset. GFS guarantees that the data will appear atomically at least once in the file, even with concurrent appenders. This supports producer-consumer workflows common in batch processing.

Fault Detection

Chunkservers send heartbeats to the master. A chunkserver that stops sending heartbeats is marked as failed; its chunks are re-replicated from surviving copies. Each chunk stores checksums over its data; a chunkserver detects corruption by verifying checksums on read or during background scans.
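Checksum verification can be sketched as follows, assuming CRC32 over fixed-size blocks (GFS checksums 64 KB blocks; the sizes here are shrunk for illustration):

```python
import zlib

BLOCK = 16  # toy block size; GFS uses 64 KB

def store(data):
    """Split data into blocks and record a CRC32 checksum alongside each."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return [(b, zlib.crc32(b)) for b in blocks]

def verify(stored):
    """Return the indices of blocks whose contents no longer match their checksum."""
    return [i for i, (b, c) in enumerate(stored) if zlib.crc32(b) != c]

chunk = store(b"hello world, this is chunk data!")
assert verify(chunk) == []
chunk[1] = (b"corrupted bytes!", chunk[1][1])   # silently alter one block's data
assert verify(chunk) == [1]                     # verification catches it
```

Because checksums are verified on read and during background scans, corruption is caught before it can propagate; a bad block is simply re-fetched from another replica.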

HDFS: What Carried Over and What Changed

HDFS (Hadoop Distributed File System) is essentially an open-source re-implementation of GFS in Java. The one-to-one mapping is:

- GFS master → HDFS NameNode
- GFS chunkserver → HDFS DataNode
- GFS chunk → HDFS block
- GFS operation log → HDFS edit log

The architecture, replication strategy, and fault-detection mechanisms remain the same.

Some key differences are:

- HDFS blocks default to 128 MB in current versions, twice the 64 MB GFS chunk size.
- HDFS files are single-writer: originally write-once, with append support added later. There is no atomic record append and no support for concurrent writers.
- NameNode checkpointing is offloaded to a separate helper (the secondary or standby NameNode) rather than performed by the NameNode itself.

Beyond GFS

GFS served Google well for nearly a decade. As files became smaller and more numerous, the single master became a metadata bottleneck. Google replaced GFS with Colossus, which distributes the metadata function. HDFS followed the same direction with Federation. Both converged on the same lesson: at sufficient scale, metadata is itself a distributed systems problem.

Distributed Hash Tables

A Distributed Hash Table (DHT) is a decentralized system that provides a key-value lookup interface: given a key, find the value. There is no central server. Responsibility for keys is divided among participating nodes, and any node can route a query to the correct node in a bounded number of hops.

The motivation was peer-to-peer networks. Centralized index servers (like Napster) are vulnerable to shutdown. A DHT provides the same lookup functionality without any central authority.

Consistent Hashing

All DHTs rely on consistent hashing. Keys and nodes are both hashed to positions in the same identifier space (typically a ring of integers). A key is assigned to the nearest node by some rule (e.g., the next node clockwise). Adding or removing a single node only redistributes the keys that were assigned to it; all other keys stay put. This minimizes key movement under membership changes.
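A minimal consistent-hash ring illustrates the key property (the node and key names are hypothetical; real DHTs use larger identifier spaces and replication):

```python
import hashlib
from bisect import bisect_left

def h(s):
    """Hash a string to a 32-bit position on the ring."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)
    def owner(self, key):
        """A key is owned by the next node clockwise from its hash."""
        ids = [p for p, _ in self.points]
        i = bisect_left(ids, h(key)) % len(ids)   # wrap around the ring
        return self.points[i][1]

before = Ring(["nodeA", "nodeB", "nodeC"])
after = Ring(["nodeA", "nodeB", "nodeC", "nodeD"])
keys = ["key%d" % i for i in range(100)]
moved = [k for k in keys if before.owner(k) != after.owner(k)]
# Every key that moved went to the new node; nothing else was disturbed.
assert all(after.owner(k) == "nodeD" for k in moved)
```

Contrast this with `hash(key) % num_nodes`: changing the node count there remaps nearly every key, while the ring remaps only the arc the new node takes over.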

CAN

CAN (Content Addressable Network) maps keys to points in a d-dimensional Cartesian coordinate space. Each node owns a rectangular zone. A lookup for a key hashes to a point in the space and routes to that point: each hop moves closer. The expected lookup time is O(d · n^(1/d)) hops with O(d) neighbors per node (roughly 2d neighbors for a d-dimensional space). Node joins split an existing zone; departures merge zones with a neighbor.
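Greedy coordinate routing can be sketched on a toy 2-D grid, with unit grid cells standing in for zones (a simplification: real CAN zones have varying sizes and routing consults actual neighbor lists):

```python
# Each hop steps one cell along the axis with the larger remaining distance,
# always moving closer to the target point.
def route(src, dst):
    path, cur = [src], src
    while cur != dst:
        x, y = cur
        if abs(dst[0] - x) >= abs(dst[1] - y) and dst[0] != x:
            x += 1 if dst[0] > x else -1
        else:
            y += 1 if dst[1] > y else -1
        cur = (x, y)
        path.append(cur)
    return path

# With n zones in a sqrt(n) x sqrt(n) grid, a route crosses O(n^(1/2))
# zones, matching O(d * n^(1/d)) for d = 2.
p = route((0, 0), (3, 2))
assert len(p) - 1 == 5   # Manhattan distance: 3 + 2 hops
```

Raising d shortens routes (n^(1/d) shrinks) at the cost of more neighbors per node, which is exactly the trade-off the O(d · n^(1/d)) bound expresses.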

Chord

Chord organizes nodes and keys on a one-dimensional ring using SHA-1 hashes. A key belongs to its successor: the node with the smallest ID greater than or equal to the key’s hash. Pure successor-pointer routing takes O(n) hops. The finger table at each node stores pointers to nodes at exponentially increasing distances around the ring, enabling O(log n) lookup. Node joins and departures are managed by a stabilization protocol that keeps successor pointers and finger tables up to date.
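Finger-table construction can be sketched in a tiny identifier space (m = 6 bits and the node IDs below are hypothetical examples, not from any real deployment):

```python
m = 6                                             # 6-bit IDs: ring of size 64
nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(k):
    """First node whose ID is >= k, wrapping around the ring."""
    for n in nodes:
        if n >= k % 2**m:
            return n
    return nodes[0]

def fingers(n):
    """finger[i] of node n points to successor(n + 2^i)."""
    return [successor(n + 2**i) for i in range(m)]

# Node 8's fingers jump 1, 2, 4, ..., 32 positions ahead. A lookup
# forwards to the finger closest below the target, roughly halving the
# remaining ring distance each hop: O(log n) routing.
assert fingers(8) == [14, 14, 14, 21, 32, 42]
```

The finger table is small (m entries) yet lets any node reach any key in logarithmically many hops, which is why Chord scales where plain successor-chasing does not.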

Amazon Dynamo

Dynamo is Amazon’s internal key-value store, designed to power shopping cart and similar services with strict availability and latency requirements. It builds on consistent hashing but adds mechanisms that make it a complete production storage system rather than a lookup primitive.

Virtual nodes: Each physical server owns many positions (vnodes) on the ring. This balances the load more evenly and spreads the impact of failures and joins across many existing nodes rather than concentrating it on one.
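The effect of virtual nodes can be sketched by hashing many positions per server (server names and the vnode count are arbitrary illustrations):

```python
import hashlib

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32

def ring(servers, vnodes=64):
    """Each server claims `vnodes` ring positions by hashing 'server#i'."""
    return sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))

points = ring(["A", "B", "C"])
owners = [s for _, s in points]
# Consecutive ring positions usually belong to different physical servers,
# so one server's failure hands its many small arcs to all the survivors
# rather than dumping its entire range on a single neighbor.
interleaved = sum(1 for a, b in zip(owners, owners[1:]) if a != b)
assert interleaved > len(owners) // 2
```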

Replication: Each key is replicated on N consecutive nodes on the ring (N is typically 3). This preference list is known to all nodes.

Quorum reads and writes: A write completes after W replicas acknowledge it; a read completes after R replicas respond. With R + W > N, every read is guaranteed to overlap with the most recent write. With N=3, R=2, W=2, the system tolerates one replica failure without blocking either reads or writes.
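The overlap guarantee can be checked exhaustively for small configurations (a brute-force illustration, not how Dynamo itself reasons about quorums):

```python
from itertools import combinations

def quorums_overlap(n, r, w):
    """True if every possible read quorum intersects every write quorum."""
    replicas = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(replicas, r)
               for ws in combinations(replicas, w))

assert quorums_overlap(3, 2, 2)      # R + W = 4 > N = 3: reads see the latest write
assert not quorums_overlap(3, 1, 2)  # R + W = 3 is not > N: a read can miss it
```

This is just the pigeonhole principle: R + W > N forces at least one replica into both sets, and that replica holds the most recent acknowledged write.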

Conflict resolution: Dynamo uses eventual consistency. Concurrent writes may produce conflicting versions. Dynamo attaches vector clocks to values so that the system (and application) can detect divergence. Applications are responsible for merging conflicts; for a shopping cart, this means taking the union of both versions.
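Vector-clock comparison can be sketched with dictionaries mapping node names to counters (the node names sx, sy, sz are hypothetical coordinators):

```python
# A clock "descends from" another if it is >= component-wise; clocks that
# are incomparable mark concurrent versions that the application must merge.
def descends(a, b):
    """True if version a has seen everything version b has."""
    return all(a.get(k, 0) >= v for k, v in b.items())

def concurrent(a, b):
    return not descends(a, b) and not descends(b, a)

v1 = {"sx": 2, "sy": 1}
v2 = {"sx": 2, "sy": 1, "sz": 1}   # a later write that saw v1
v3 = {"sx": 3}                     # a write that diverged without seeing v1's sy

assert descends(v2, v1) and not descends(v1, v2)   # v2 supersedes v1
assert concurrent(v2, v3)                          # neither saw the other
```

When a read returns concurrent versions like v2 and v3, Dynamo hands both to the application, which merges them (for a shopping cart, by taking the union) and writes back a version whose clock descends from both.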

Hinted handoff: When a preferred replica is unavailable, a write goes to a different node with a “hint” indicating its intended destination. The hinted node delivers the write to the intended replica once it recovers.
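A toy sketch of hinted handoff, with dictionaries standing in for node storage (node names, key, and the parked-hints list are all invented for illustration):

```python
hints = []                               # (intended_replica, key, value) parked elsewhere
stores = {"A": {}, "B": {}, "C": {}}
down = {"B"}                             # B is currently unreachable

def put(key, value, preference=("A", "B", "C")):
    """Write to each preferred replica; park a hint for any that is down."""
    for node in preference:
        if node in down:
            hints.append((node, key, value))
        else:
            stores[node][key] = value

def recover(node):
    """On recovery, deliver every hinted write to its intended replica."""
    down.discard(node)
    for target, key, value in [x for x in hints if x[0] == node]:
        stores[node][key] = value
        hints.remove((target, key, value))

put("cart:42", ["book"])
assert "cart:42" not in stores["B"]      # the write succeeded without B
recover("B")
assert stores["B"]["cart:42"] == ["book"]  # B caught up afterward
```

The point is availability: the write completes at the moment it reaches enough live nodes, and the temporarily missed replica converges later.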

Gossip protocol: Membership and failure information propagate through the cluster via gossip – periodic exchanges with randomly selected peers rather than through a central coordinator.

Dynamo vs. Classic DHTs

Property          | Chord                                 | Dynamo
------------------|---------------------------------------|-----------------------------------
Purpose           | Decentralized routing                 | Production key-value store
Routing           | O(log n) hops through intermediaries  | One hop (full ring state known)
Consistency       | Not a data store; n/a                 | Eventual consistency
Replication       | Not included                          | N replicas, quorum (R, W)
Deployment        | Designed for open internet/churn      | Stable data center environment
Conflict handling | n/a                                   | Vector clocks + application merge

Domain Name System (DNS)

DNS maps human-readable domain names to IP addresses. It is a hierarchical, distributed, cacheable database that handles hundreds of billions of queries per day without a central server.

Hierarchy and Delegation

The DNS namespace is a tree. Responsibility is delegated at each level: the root delegates TLDs, TLD operators delegate second-level domains, organizations delegate subdomains. Each delegated zone is managed independently by its owner. No single server knows all mappings.

Resolution

A recursive resolver (typically provided by an ISP or public service) resolves names on behalf of clients. It navigates the hierarchy by querying root servers, then TLD servers, then authoritative servers, collecting referrals at each step. This is iterative resolution. The final answer is returned to the client.

Caching and TTL

Every DNS response carries a Time To Live (TTL). Resolvers cache responses for the TTL duration, answering subsequent queries without hitting authoritative servers. Caching is aggressive and layered: the recursive resolver, the OS, and the browser all cache. TTL is a tunable trade-off: longer TTL means lower load on authoritative servers but slower propagation of updates; shorter TTL means faster updates but higher query load.
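Resolver-style TTL caching can be sketched as follows, with a hypothetical lookup table standing in for an authoritative query (the name, address, and 2-second TTL are invented):

```python
import time

authoritative = {"example.edu": ("10.0.0.5", 2)}  # name -> (address, TTL seconds)
cache = {}                                        # name -> (address, expiry time)

def resolve(name, now=None):
    """Answer from cache while the TTL holds; otherwise query upstream."""
    now = time.time() if now is None else now
    if name in cache:
        addr, expires = cache[name]
        if now < expires:
            return addr, "cache"                  # no upstream traffic
    addr, ttl = authoritative[name]
    cache[name] = (addr, now + ttl)
    return addr, "authoritative"

assert resolve("example.edu", now=0) == ("10.0.0.5", "authoritative")
assert resolve("example.edu", now=1) == ("10.0.0.5", "cache")
assert resolve("example.edu", now=3) == ("10.0.0.5", "authoritative")  # TTL expired
```

The TTL trade-off is visible directly: a larger TTL keeps more queries in the "cache" branch, while a smaller one forces the "authoritative" branch, trading server load against update latency.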

Limitations

DNS is not strongly consistent: clients may hold stale cached answers until the TTL expires. It is read-mostly: updates propagate through TTL expiry, not through a distributed write protocol. It is not searchable: you can look up a name but not query across names. And it was not designed with security: DNS spoofing and cache poisoning are real attacks; DNSSEC adds cryptographic signatures to mitigate them but deployment is incomplete.

DNS as a Design Pattern

DNS illustrates what makes a distributed system scale globally: hierarchical organization, delegation, aggressive caching with bounded staleness, and a workload that is overwhelmingly read-heavy. These properties keep the query load from concentrating on any single server while allowing each zone owner to manage its own data independently.

