pk.org: CS 417/Lecture Notes

Decentralized Storage

Study Guide

Paul Krzyzanowski – 2026-03-07

We cover three distinct approaches to distributing storage and lookup across many nodes: data-intensive distributed file systems (GFS and HDFS), distributed hash tables (CAN, Chord, and Dynamo), and DNS as a planet-scale distributed data store.

Distributed File Systems: GFS and HDFS

Goals and Workload

The Google File System (GFS) was designed for a very specific workload: enormous files (multi-GB), sequential reads and writes, a high rate of concurrent appends, and an environment where commodity hardware failures are routine. It prioritizes throughput over latency, and it treats fault tolerance as a design assumption rather than an exceptional case.

The key insight driving the design is that if you know your workload in advance, you can build a system optimized for it rather than a general-purpose system that handles everything adequately.

Separation of Data and Metadata

GFS separates the control plane (metadata) from the data plane (file content). A single master node holds all metadata in memory: the namespace, the mapping of files to chunks, and chunk replica locations. Clients contact the master only to find out where data lives. All actual data transfer happens directly between clients and chunkservers, bypassing the master entirely.

This design keeps the master out of the data path, allowing it to handle many concurrent clients without becoming a throughput bottleneck.
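A minimal sketch of this read path, using hypothetical in-memory tables to stand in for the master's metadata (the file names, chunk handles, and chunkserver names are all invented for illustration):

```python
# Sketch of the GFS read path: the client asks the master only for chunk
# locations, then transfers data directly with a chunkserver.
CHUNK_SIZE = 64 * 2**20  # 64 MB

# Master metadata (hypothetical): file name -> ordered chunk handles.
file_to_chunks = {"/logs/web.log": ["c001", "c002"]}
# Chunk handle -> replica locations (hypothetical chunkserver names).
chunk_locations = {"c001": ["cs1", "cs2", "cs3"], "c002": ["cs4", "cs5", "cs6"]}

def master_lookup(path, offset):
    """Control plane: map (file, byte offset) to (chunk handle, replicas)."""
    index = offset // CHUNK_SIZE           # which chunk covers this offset
    handle = file_to_chunks[path][index]
    return handle, chunk_locations[handle]

# The client computes the chunk index itself and contacts the master only
# for locations; the actual data transfer then bypasses the master.
handle, replicas = master_lookup("/logs/web.log", 70 * 2**20)
# 70 MB falls past the 64 MB boundary, so this resolves to the second chunk.
```

The master's answer can be cached by the client, further reducing master load for repeated accesses to the same chunk.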

Chunks and Replication

Files are divided into fixed-size chunks of 64 MB, each identified by a unique 64-bit handle. Each chunk is stored as an ordinary file on a chunkserver’s local disk and replicated on three chunkservers by default. Replication is what makes the system fault-tolerant: if one chunkserver fails, two others still hold the data.

Master State and Fault Tolerance

The master stores metadata in memory for fast access. Namespace and file-to-chunk mappings are persisted to an operation log that is replicated on remote machines. Chunk locations are not persisted; the master rebuilds them by polling chunkservers at startup.

The operation log is the authoritative record of the file system. The master periodically checkpoints its state so that log replay after a crash is fast.

The Two-Phase Write

Writes in GFS involve two distinct phases, and understanding why they are separated matters.

Phase 1 (data transfer): The client pushes data to all replicas in a pipelined chain, from one chunkserver to the next. No writes occur yet; the data is simply buffered at each replica.

Phase 2 (write request): Once all replicas confirm receipt, the client sends a write request to the primary replica, the one holding the current lease for that chunk. The primary assigns a serial number to the mutation and applies it locally. It then forwards the write request to all secondaries, which apply the mutation in the same order. The primary acknowledges success to the client after all secondaries respond.

Separating data flow from control flow improves network efficiency. The primary’s lease serializes concurrent mutations, ensuring all replicas apply changes in the same order.
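The two phases can be sketched with in-memory dictionaries standing in for chunkservers (the replica names and record contents are hypothetical; real GFS pipelines data over the network and runs the lease protocol this toy omits):

```python
# Phase 1 buffers data everywhere; phase 2 applies it in primary-chosen order.
replicas = {name: {"buffer": None, "log": []} for name in ["primary", "sec1", "sec2"]}
serial = 0

def push_data(data):
    """Phase 1: pipeline the data to every replica; nothing is written yet."""
    for r in replicas.values():
        r["buffer"] = data

def write_request():
    """Phase 2: the primary assigns a serial number, and every replica
    applies the buffered mutation in that order."""
    global serial
    serial += 1
    for r in replicas.values():
        r["log"].append((serial, r["buffer"]))
        r["buffer"] = None
    return serial

push_data(b"record-A")
write_request()
push_data(b"record-B")
write_request()
# Every replica holds the same mutations in the same order.
assert all(r["log"] == replicas["primary"]["log"] for r in replicas.values())
```

The serial number is the crux: even if two clients push data concurrently, the primary imposes a single order that all replicas follow.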

Atomic Record Append

GFS supports a special record append operation in which the client provides data and GFS chooses the offset. GFS guarantees that the data will appear atomically at least once in the file, even with concurrent appenders. This supports producer-consumer workflows common in batch processing.

Fault Detection

Chunkservers send heartbeats to the master. A chunkserver that stops sending heartbeats is marked as failed; its chunks are re-replicated from surviving copies. Each chunk stores checksums over its data; a chunkserver detects corruption by verifying checksums on read or during background scans.
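Checksum verification can be sketched as follows, assuming CRC32 over fixed-size blocks (GFS checksums 64 KB blocks; the sizes here are shrunk for illustration):

```python
import zlib

BLOCK = 16  # toy block size; GFS uses 64 KB

def store(data):
    """Split data into blocks and record a CRC32 checksum alongside each."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return [(b, zlib.crc32(b)) for b in blocks]

def verify(stored):
    """Return the indices of blocks whose contents no longer match their checksum."""
    return [i for i, (b, c) in enumerate(stored) if zlib.crc32(b) != c]

chunk = store(b"hello world, this is chunk data!")
assert verify(chunk) == []
chunk[1] = (b"corrupted bytes!", chunk[1][1])   # silently alter one block's data
assert verify(chunk) == [1]                     # verification catches it
```

Because checksums are verified on read and during background scans, corruption is caught before it can propagate; a bad block is simply re-fetched from another replica.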

HDFS: What Carried Over and What Changed

HDFS (Hadoop Distributed File System) is essentially an open-source re-implementation of GFS in Java. The one-to-one mapping is:

- GFS master → HDFS NameNode
- GFS chunkserver → HDFS DataNode
- GFS chunk → HDFS block
- GFS operation log → HDFS edit log

The architecture, replication strategy, and fault-detection mechanisms remain the same.

Some key differences are:

- HDFS blocks default to 128 MB in current versions, twice the 64 MB GFS chunk size.
- HDFS files are single-writer: originally write-once, with append support added later. There is no atomic record append and no support for concurrent writers.
- NameNode checkpointing is offloaded to a separate helper (the secondary or standby NameNode) rather than performed by the NameNode itself.

Beyond GFS

GFS served Google well for nearly a decade. As files became smaller and more numerous, the single master became a metadata bottleneck. Google replaced GFS with Colossus, which distributes the metadata function. HDFS followed the same direction with Federation. Both converged on the same lesson: at sufficient scale, metadata is itself a distributed systems problem.

Distributed Hash Tables

A Distributed Hash Table (DHT) is a decentralized system that provides a key-value lookup interface: given a key, find the value. There is no central server. Responsibility for keys is divided among participating nodes, and any node can route a query to the correct node in a bounded number of hops.

The motivation was peer-to-peer networks. Centralized index servers (like Napster) are vulnerable to shutdown. A DHT provides the same lookup functionality without any central authority.

Consistent Hashing

All DHTs rely on consistent hashing. Keys and nodes are both hashed to positions in the same identifier space (typically a ring of integers). A key is assigned to the nearest node by some rule (e.g., the next node clockwise). Adding or removing a single node only redistributes the keys that were assigned to it; all other keys stay put. This minimizes key movement under membership changes.
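A minimal consistent-hash ring illustrates the key property (the node and key names are hypothetical; real DHTs use larger identifier spaces and replication):

```python
import hashlib
from bisect import bisect_left

def h(s):
    """Hash a string to a 32-bit position on the ring."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)
    def owner(self, key):
        """A key is owned by the next node clockwise from its hash."""
        ids = [p for p, _ in self.points]
        i = bisect_left(ids, h(key)) % len(ids)   # wrap around the ring
        return self.points[i][1]

before = Ring(["nodeA", "nodeB", "nodeC"])
after = Ring(["nodeA", "nodeB", "nodeC", "nodeD"])
keys = ["key%d" % i for i in range(100)]
moved = [k for k in keys if before.owner(k) != after.owner(k)]
# Every key that moved went to the new node; nothing else was disturbed.
assert all(after.owner(k) == "nodeD" for k in moved)
```

Contrast this with `hash(key) % num_nodes`: changing the node count there remaps nearly every key, while the ring remaps only the arc the new node takes over.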

CAN

CAN (Content Addressable Network) maps keys to points in a d-dimensional Cartesian coordinate space. Each node owns a rectangular zone. A lookup for a key hashes to a point in the space and routes to that point: each hop moves closer. The expected lookup time is O(d · n^(1/d)) hops with O(d) neighbors per node (roughly 2d neighbors for a d-dimensional space). Node joins split an existing zone; departures merge zones with a neighbor.
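Greedy coordinate routing can be sketched on a toy 2-D grid, with unit grid cells standing in for zones (a simplification: real CAN zones have varying sizes and routing consults actual neighbor lists):

```python
# Each hop steps one cell along the axis with the larger remaining distance,
# always moving closer to the target point.
def route(src, dst):
    path, cur = [src], src
    while cur != dst:
        x, y = cur
        if abs(dst[0] - x) >= abs(dst[1] - y) and dst[0] != x:
            x += 1 if dst[0] > x else -1
        else:
            y += 1 if dst[1] > y else -1
        cur = (x, y)
        path.append(cur)
    return path

# With n zones in a sqrt(n) x sqrt(n) grid, a route crosses O(n^(1/2))
# zones, matching O(d * n^(1/d)) for d = 2.
p = route((0, 0), (3, 2))
assert len(p) - 1 == 5   # Manhattan distance: 3 + 2 hops
```

Raising d shortens routes (n^(1/d) shrinks) at the cost of more neighbors per node, which is exactly the trade-off the O(d · n^(1/d)) bound expresses.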

Chord

Chord organizes nodes and keys on a one-dimensional ring using SHA-1 hashes. A key belongs to its successor: the node with the smallest ID greater than or equal to the key’s hash. Pure successor-pointer routing takes O(n) hops. The finger table at each node stores pointers to nodes at exponentially increasing distances around the ring, enabling O(log n) lookup. Node joins and departures are managed by a stabilization protocol that keeps successor pointers and finger tables up to date.
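Finger-table construction can be sketched in a tiny identifier space (m = 6 bits and the node IDs below are hypothetical examples, not from any real deployment):

```python
m = 6                                             # 6-bit IDs: ring of size 64
nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(k):
    """First node whose ID is >= k, wrapping around the ring."""
    for n in nodes:
        if n >= k % 2**m:
            return n
    return nodes[0]

def fingers(n):
    """finger[i] of node n points to successor(n + 2^i)."""
    return [successor(n + 2**i) for i in range(m)]

# Node 8's fingers jump 1, 2, 4, ..., 32 positions ahead. A lookup
# forwards to the finger closest below the target, roughly halving the
# remaining ring distance each hop: O(log n) routing.
assert fingers(8) == [14, 14, 14, 21, 32, 42]
```

The finger table is small (m entries) yet lets any node reach any key in logarithmically many hops, which is why Chord scales where plain successor-chasing does not.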

Amazon Dynamo

Dynamo is Amazon’s internal key-value store, designed to power shopping cart and similar services with strict availability and latency requirements. It builds on consistent hashing but adds mechanisms that make it a complete production storage system rather than a lookup primitive.

Virtual nodes: Each physical server owns many positions (vnodes) on the ring. This balances the load more evenly and spreads the impact of failures and joins across many existing nodes rather than concentrating it on one.
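The effect of virtual nodes can be sketched by hashing many positions per server (server names and the vnode count are arbitrary illustrations):

```python
import hashlib

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % 2**32

def ring(servers, vnodes=64):
    """Each server claims `vnodes` ring positions by hashing 'server#i'."""
    return sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))

points = ring(["A", "B", "C"])
owners = [s for _, s in points]
# Consecutive ring positions usually belong to different physical servers,
# so one server's failure hands its many small arcs to all the survivors
# rather than dumping its entire range on a single neighbor.
interleaved = sum(1 for a, b in zip(owners, owners[1:]) if a != b)
assert interleaved > len(owners) // 2
```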

Replication: Each key is replicated on N consecutive nodes on the ring (N is typically 3). This preference list is known to all nodes.

Quorum reads and writes: A write completes after W replicas acknowledge it; a read completes after R replicas respond. With R + W > N, every read is guaranteed to overlap with the most recent write. With N=3, R=2, W=2, the system tolerates one replica failure without blocking either reads or writes.
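The overlap guarantee can be checked exhaustively for small configurations (a brute-force illustration, not how Dynamo itself reasons about quorums):

```python
from itertools import combinations

def quorums_overlap(n, r, w):
    """True if every possible read quorum intersects every write quorum."""
    replicas = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(replicas, r)
               for ws in combinations(replicas, w))

assert quorums_overlap(3, 2, 2)      # R + W = 4 > N = 3: reads see the latest write
assert not quorums_overlap(3, 1, 2)  # R + W = 3 is not > N: a read can miss it
```

This is just the pigeonhole principle: R + W > N forces at least one replica into both sets, and that replica holds the most recent acknowledged write.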

Conflict resolution: Dynamo uses eventual consistency. Concurrent writes may produce conflicting versions. Dynamo attaches vector clocks to values so that the system (and application) can detect divergence. Applications are responsible for merging conflicts; for a shopping cart, this means taking the union of both versions.
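Vector-clock comparison can be sketched with dictionaries mapping node names to counters (the node names sx, sy, sz are hypothetical coordinators):

```python
# A clock "descends from" another if it is >= component-wise; clocks that
# are incomparable mark concurrent versions that the application must merge.
def descends(a, b):
    """True if version a has seen everything version b has."""
    return all(a.get(k, 0) >= v for k, v in b.items())

def concurrent(a, b):
    return not descends(a, b) and not descends(b, a)

v1 = {"sx": 2, "sy": 1}
v2 = {"sx": 2, "sy": 1, "sz": 1}   # a later write that saw v1
v3 = {"sx": 3}                     # a write that diverged without seeing v1's sy

assert descends(v2, v1) and not descends(v1, v2)   # v2 supersedes v1
assert concurrent(v2, v3)                          # neither saw the other
```

When a read returns concurrent versions like v2 and v3, Dynamo hands both to the application, which merges them (for a shopping cart, by taking the union) and writes back a version whose clock descends from both.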

Hinted handoff: When a preferred replica is unavailable, a write goes to a different node with a “hint” indicating its intended destination. The hinted node delivers the write to the intended replica once it recovers.
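A toy sketch of hinted handoff, with dictionaries standing in for node storage (node names, key, and the parked-hints list are all invented for illustration):

```python
hints = []                               # (intended_replica, key, value) parked elsewhere
stores = {"A": {}, "B": {}, "C": {}}
down = {"B"}                             # B is currently unreachable

def put(key, value, preference=("A", "B", "C")):
    """Write to each preferred replica; park a hint for any that is down."""
    for node in preference:
        if node in down:
            hints.append((node, key, value))
        else:
            stores[node][key] = value

def recover(node):
    """On recovery, deliver every hinted write to its intended replica."""
    down.discard(node)
    for target, key, value in [x for x in hints if x[0] == node]:
        stores[node][key] = value
        hints.remove((target, key, value))

put("cart:42", ["book"])
assert "cart:42" not in stores["B"]      # the write succeeded without B
recover("B")
assert stores["B"]["cart:42"] == ["book"]  # B caught up afterward
```

The point is availability: the write completes at the moment it reaches enough live nodes, and the temporarily missed replica converges later.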

Gossip protocol: Membership and failure information propagate through the cluster via gossip – periodic exchanges with randomly selected peers rather than through a central coordinator.

Dynamo vs. Classic DHTs

Property          | Chord                                 | Dynamo
------------------|---------------------------------------|-----------------------------------
Purpose           | Decentralized routing                 | Production key-value store
Routing           | O(log n) hops through intermediaries  | One hop (full ring state known)
Consistency       | Not a data store; n/a                 | Eventual consistency
Replication       | Not included                          | N replicas, quorum (R, W)
Deployment        | Designed for open internet/churn      | Stable data center environment
Conflict handling | n/a                                   | Vector clocks + application merge

Domain Name System (DNS)

DNS maps human-readable domain names to IP addresses. It is a hierarchical, distributed, cacheable database that handles hundreds of billions of queries per day without a central server.

Hierarchy and Delegation

The DNS namespace is a tree. Responsibility is delegated at each level: the root delegates TLDs, TLD operators delegate second-level domains, organizations delegate subdomains. Each delegated zone is managed independently by its owner. No single server knows all mappings.

Resolution

A recursive resolver (typically provided by an ISP or public service) resolves names on behalf of clients. It navigates the hierarchy by querying root servers, then TLD servers, then authoritative servers, collecting referrals at each step. This is iterative resolution. The final answer is returned to the client.

Caching and TTL

Every DNS response carries a Time To Live (TTL). Resolvers cache responses for the TTL duration, answering subsequent queries without hitting authoritative servers. Caching is aggressive and layered: the recursive resolver, the OS, and the browser all cache. TTL is a tunable trade-off: longer TTL means lower load on authoritative servers but slower propagation of updates; shorter TTL means faster updates but higher query load.
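Resolver-style TTL caching can be sketched as follows, with a hypothetical lookup table standing in for an authoritative query (the name, address, and 2-second TTL are invented):

```python
import time

authoritative = {"example.edu": ("10.0.0.5", 2)}  # name -> (address, TTL seconds)
cache = {}                                        # name -> (address, expiry time)

def resolve(name, now=None):
    """Answer from cache while the TTL holds; otherwise query upstream."""
    now = time.time() if now is None else now
    if name in cache:
        addr, expires = cache[name]
        if now < expires:
            return addr, "cache"                  # no upstream traffic
    addr, ttl = authoritative[name]
    cache[name] = (addr, now + ttl)
    return addr, "authoritative"

assert resolve("example.edu", now=0) == ("10.0.0.5", "authoritative")
assert resolve("example.edu", now=1) == ("10.0.0.5", "cache")
assert resolve("example.edu", now=3) == ("10.0.0.5", "authoritative")  # TTL expired
```

The TTL trade-off is visible directly: a larger TTL keeps more queries in the "cache" branch, while a smaller one forces the "authoritative" branch, trading server load against update latency.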

Limitations

DNS is not strongly consistent: clients may hold stale cached answers until the TTL expires. It is read-mostly: updates propagate through TTL expiry, not through a distributed write protocol. It is not searchable: you can look up a name but not query across names. And it was not designed with security: DNS spoofing and cache poisoning are real attacks; DNSSEC adds cryptographic signatures to mitigate them but deployment is incomplete.

DNS as a Design Pattern

DNS illustrates what makes a distributed system scale globally: hierarchical organization, delegation, aggressive caching with bounded staleness, and a workload that is overwhelmingly read-heavy. These properties keep the query load from concentrating on any single server while allowing each zone owner to manage its own data independently.

