When & Where
The final exam will be held in our regular classroom on Monday, May 11, 2026. It will start promptly at 4:00 pm and you will have until 7:00 pm to complete it.
Exam rules
Be sure to arrive on time. If you arrive after the exam starts, you will not be allowed to take it.
This will be a closed book, closed notes exam. Calculators, phones, augmented reality glasses, laptops, and tablets are neither needed nor permitted. If you have these devices, you must turn them off, put them out of sight, and not access them for the duration of the exam.
No other electronic devices are permitted except for hearing aids, pacemakers, electronic nerve stimulators, other implanted medical devices, or electronic watches that function only as timekeeping devices or chronographs.
Bring a couple of pens or pencils with you, an eraser, and your ID.
Plan to use a pen only if you are supremely confident in not changing your mind about your answers. . Check here for information about pencils, sharpeners, and the craft of pencil sharpening.
Past exams
The exam will be similar in structure to mid-semester exams but will cover material for the entire course.
You can use my past exams as a guide to what this exam may look like, but realize there are differences in topics and in the sequencing of the topics. Expect around 25 multiple-choice questions. I do not refer to old exams when I come up with a new one, so it is likely that many of the topics that I considered important in past exams will show up on future exams. Some material may have changed, however, so do not worry about questions that appear to relate to topics we have not covered.
Study guide
You are responsible for the course material up through, and including, the week 13 lecture on clusters.
The study guide is a concatenation of the study guides from the past lectures. It attempts to cover most of the material you should know. It is not a substitute for the lectures, lecture material, and other reading matter. All the material may not be in the guide. My goal is to put most of the information you need to know a concise with fewer elaborations.
You can also prepare your own guide, which would be a much better way to prepare for the exam!
Topics
Here is a list of topics you are expected to know for the exam. It should be viewed as a coverage guide rather than a catalog of detailed algorithms or definitions to memorize.
Use it as a checklist. Go through each item and ask yourself whether you understand the concept and could recognize or apply it in context. If not, review the relevant notes.
Because the exam is multiple choice, your focus should be on understanding and being able to reason about the material, not on memorizing or reciting formal definitions.
Topics that you be familiar with and may be on the exam include:
Introduction & Key Concepts
-
Distributed system - definition
-
Single system image
-
Moore’s law
-
Vertical vs. horizontal scaling
-
Transparency goals
-
Nines of Availability (not the calculations)
-
Failure models
-
Partial failure
-
Single point of failure
-
Redundancy
-
Availability vs. reliability
-
Series vs. parallel failures (understand the impact but you don’t need to memorize the formulas)
-
-
Types of failures: Fail-stop, fail-restart, omission, delays, partition, Byzantine failure
-
Latency: synchronous and asynchronous transmission modes
-
State, replicas, caching
-
Caching vs. replication
-
Stale data
-
Consistency
-
Global state
-
-
Transparency goals (just a high-level understanding)
- Location, Migration, Replication, Concurrency
-
Service models
-
Centralized
-
Client-server
-
Multi-tier (how does it compare with layered architectures?)
-
Peer-to-peer
-
Hybrid P2P
-
-
You don’t need to know any of the “-as-a-service” models
Networking
-
Understand the concept of protocols and layering
- I will not ask you to enumerate the OSI or IP protocol stack, but you should understand the functions of the data link, network, transport, and application layers.
-
Data Link Layer
-
Local Area Network (LAN)
-
Ethernet - service guarantees
-
-
Network Layer: purpose
-
Transport Layer: TCP & UDP
-
Transport endpoints (ports) vs. network endpoints (addresses)
-
Port numbers
-
TCP: connection-oriented service, full-duplex connections, reliable & in-order data delivery, flow control, and congestion control
-
UDP: connectionless service, no retransmits, drop bad packets
-
- What are advantages of using TCP or using UDP?
-
-
QUIC: why was it created?
-
Protocol encapsulation concept
-
Sockets
-
Concept
-
Operations at the system call level: create, bind, connect, listen, shutdown, communication (know what they do, not what the parameters are)
-
-
General principles of the design of the Internet
-
Connection of disparate networks via routers
-
Unreliable communication
-
End-to-end principle
-
Fate sharing
-
Best-effort packet delivery
-
Remote procedure calls
-
Concept of a remote procedure call (RPC)
-
Stub functions (proxies and skeletons)
-
Stub generation
-
Interface definition language (IDL) concept
-
Marshalling
-
Data representation, endianess
-
Serialization: canonical vs. receiver-makes-right
-
Pass by reference issue
-
Explicit versus implicit typing
-
-
Semantics of remote procedure calls
-
Failure of RPCs
-
at most once and at least once semantics
-
-
Idempotent vs. non-idempotent functions, retries
-
Idempotency keys
-
Purpose of an RPC compiler
-
Purpose of an RPC name service
-
Versioning
-
Timeouts vs. deadlines
-
Surrogate process (Microsoft)
-
Marshaling and serialization examples
-
RPC-specfic solutions: XDR, NDR
-
Google Protocol Buffers
-
JSON
-
XML
-
-
Garbage collection
- Reference counting, leases, pinging (heartbeats)
Web services
-
Motivation for web services
-
XML-RPC: general concept
-
SOAP: just that it extended XML-RPC; purpose of WSDL
-
REST
-
What problems was it trying to solve?
-
How is its approach different from RPC?
-
CRUD operations: POST, GET, PUT, DELETE
-
-
Stateful vs. stateless services
-
Advantages of HTTP/2 vs older HTTP
-
gRPC: why was it created? General principles
-
Need for observability; request IDs
Clock synchronization
-
What is UTC?
-
System clock, wall time
-
What’s an epoch?
-
Accuracy, resolution, precision
-
Clock drift, offset, and jitter
-
Compensation: slewing and stepping
-
Clock discipline
-
Cristian’s algorithm
-
Understand the goal and formula for Cristian’s algorithm
-
Error bounds. Remember that errors are cumulative (additive).
-
-
NTP
-
NTP synchronization subnet
-
NTP strata
-
What is jitter?
-
You don’t need to memorize the NTP(SNTP) formula but be sure to understand how it gives you the same result as Cristian’s
-
You don’t need to know how dispersion and jitter are calculated
-
You don’t need to know the NTP message structure or validation tests
-
-
PTP
-
Goal of PTP vs. NTP
-
Role of grandmaster
-
What is the best master clock? (Don’t remember the priority order)
-
Don’t memorize the formula
-
Logical clocks
-
Causal vs. concurrent events
-
Lamport’s algorithm
-
Goals
-
Happened-before relationship
-
Partial ordering vs. total ordering
-
Algorithm: know how it works
-
Deficiencies
-
Remedy for creating total order
-
-
Vector clocks
-
Goals
-
Algorithm
-
Know how to identify concurrent events by comparing timestamps
-
Understand how to represent vector clocks if the group membership is unknown (sets of node:clock values)
-
-
Hybrid logical clocks
-
Goals
-
Role of the two values
-
Understand when L and C change
-
Group communication
-
Unicast, broadcast, multicast
-
I won’t ask about anycast
-
Sending vs. receiving vs. delivering; holdback queue
-
Understand each of these failure modes:
-
Crash failure
-
Omission failure
-
Byzantine failure
-
Partition failure
-
-
Reliability of message delivery
-
Unreliable, best-effort delivery
-
Best-effort reliable
- feedback implosion
-
Reliable multicast
-
Durable multicast
-
Publish-subscribe model (concept)
-
-
Message Ordering
-
Unordered
-
Single Source FIFO (SSF): understand how to achieve this
-
Causal (partial) ordering: understand how this is achieved
- When is a message not delivered?
-
Total order: understand how to achieve this
-
Synchronous order: understand what this means
- What is a sync primitive?
-
-
Atomic multicast: What’s cmobined?
-
IP multicasting
-
IP multicast address – what is its purpose?
-
Internet Group Management Protocol (IGMP)
-
Purpose (host to edge router)
-
Understand the basic protocol: membership query, membership report, leave group
-
What is pruning?
-
-
Protocol Independent Multicast (PIM)
-
IGMP vs. PIM
-
PIM Dense Mode multicast: flooding
-
PIM Sparse Mode multicast
-
Rendezvous points
-
-
Failure detection
-
What does the FLP impossibility result tell us about reaching agreement?
-
Push-based heartbeating
-
Pull-based heartbeating/pinging
-
Phi accrual failure detector
-
Just understand what it tries to do
-
You don’t need to know how φ maps to probabilities
-
Virtual Synchrony
-
Group membership service: purpose
-
What is a view and a view change?
-
What is a stable vs. unstable message?
-
Steps in a view change: what does flushing do?
-
What is a state transfer and when is it needed?
Mutual exclusion
-
Understand properties: safety, liveness, fairness
-
Types: centralized, token-based, contention-based
-
Central-server algorithm: messages involved
-
Distributed mutual exclusion algorithms:
-
Token ring
-
Lamport
-
Ricart & Agrawala: optimizatin over Lamport’s
-
Understand pros and cons of each
-
Leader election algorithms
-
Bully algorithm: how does it work?
-
Ring algorithm
-
How does it work?
-
How does it avoid duplicate elections?
-
Consensus
-
Partitions, split-brain problem
-
What is quorum?
-
Purpose of replicated state machines
-
What are replicated logs?
-
Understand consensus requirements
-
Validity
-
Agreement
-
Termination (= progress)
-
-
FLP Impossibility Result: what does it say?
-
What is the purpose of Paxos (not the algorithm or phases)?
-
Raft
-
Goal: fault tolerant consensus algorithm
-
Raft quorum
-
States: follower, candidate, leader
-
Terms and term numbers
-
Leader election, timeouts, heartbeat, split votes
-
AppendEntries
-
Log matching, committing entries
-
You don’t need to know about membership changes
-
Coordination Services
-
Goals, replication
-
Chubby
-
Purpose: lock manager, name server, and simple file system
-
Architecture of a Chubby cell: master and replicas
-
Named resources (files, locks)
-
Event notification (& purpose)
-
Whole file reads and writes
-
Don’t try to memorize the API
-
Client cache consistency
-
Consensus for replication
-
-
Zookeeper
-
Persistent vs. Ephemeral nodes
-
You don’t have to know about sequential znodes
-
You don’t hae to know the consistency model
-
You don’t hae to know locks can be constructed
-
-
etcd
-
Similarities with Chubby and Zookeeper (replicas via consensus-based log replication)
-
Key difference: leases are different objects
-
-
What is the purpose of fencing tokens?
Network-Attached Storage
-
Transparency via VFS, mounting filesystems
-
Access models: remote access vs. upload/downloa
-
Semantics: sequential vs. session semantics
-
Stateful vs. stateless design
-
Caching options
-
Case studies
-
NFS (original design)
- Goals
-
You do not need to know the automounter (we did not discuss it in class)
-
UDP vs TCP transport
-
Directory and file access protocol (don’t memorize the list of RPC calls but understand why they exist, what lookup does, why there’s no open)
-
Caching, validation, possible inconsistencies
-
Know that a user-level lock manager was added to NFS but otherwise don’t bother with features of successive versions.
-
-
AFS (original design)
-
Goals
-
Caching model, semantics, callbacks
-
Volumes, uniform global namespace
-
-
Coda
-
Goals
-
Replicated volumes
-
Accessible Volume Storage Group
-
Client modification log
-
Resolution
-
Disconnected operation
-
Reintegration
-
-
SMB
-
Goals
-
Caching model: oplocks/leases (don’t memorize oplocks or leases but know what their purpose is)
-
Microsoft DFS: just know that it adds consistent naming, similar to AFS
-
-
-
Enhancements to NFS and SMB
-
You don’t have to know version numbers (e.g., NFSv4 vs SMB2)
-
You don’t have to knw anything about the SMB3 features
-
Just understand the point of the mechanisms
-
SMB2 – understand these concepts that improved performance
-
Pipelining
-
Compounding
-
-
NFSv4
-
the protocol is now stateful
-
supports better caching (similar to oplocks)
-
compound RPC (compounding)
-
-
Distributed Lookup (Distributed Hash Tables)
-
Purpose: distributed, highly available key-value store
-
What is consistent hashing?
-
Content-Addressable Network
- Multi-dimensional hashes, zones, routing, node insertion
-
Chord
-
Ring, successor nodes
-
How do finger tables make query routing more efficient?
-
-
Amazon Dynamo
-
Understand the goal: distributed, highly available key-value store
-
Functional interface: get, put
-
Consistency model for replication (eventually consistent, but tunable via N, W, R)
-
Goals: incremental scalability, symmetry, decentralization, heterogenous systems
-
Conflict resolution
-
Who does it?
-
Use of vector timestamps for conflict detection
-
-
Partitioning among multiple nodes
-
What is a virtual node and why are they a good idea?
-
Hinted handoff
-
Distributed Lookup: DNS
-
Stub resolver
-
Recursive resolver
-
Authoritative server
-
Caching/consistency
Distributed file systems
-
GFS
-
Design goals, data & metadata
-
chunkservers - what do they do?
-
master - what does it do?
-
Google cluster environment: colocate computation and data on same set of machines
-
operation log
-
why large chunks?
-
writing & replication: data flow vs. control flow
-
You don’t have to know record append and snapshot features
-
You don’t have to know how checksums and chunk version numbers are used
-
-
HDFS
-
HDFS is practically a clone of GFS. Focus on GFS terminology. If I use any HDFS terminology, I will identify the GFS equivalent
-
NameNode = Master
-
DataNode = Chunkserver
-
blocks = chunks
-
Transaction log = Operation log
-
You do not have to know any of the differences between GFS and HDFS (they’re minor)
-
-
Dropbox (from the homework video and notes)
-
Goal
-
How does server traffic load differ from most other sites that store data (facebook, twitter, …)?
-
What’s a metadata server (metaserver)?
-
Why was a notification server added? Polling vs. notifications
-
Distributed transactions
-
Transactions, write-ahead log, commit, abort
-
ACID properties
-
Concurrency control
-
Optimistic vs. pessimistic
-
Read locks vs. write locks
-
Two-phase locking
-
Strong Strict Two-Phase Locking (SS2PL)
-
Private workspace (for OCC)
-
Multi-version concurrency control (MVCC)
-
Snapshot isolation
-
-
Deadlock
-
Wait-for-graph
-
Centralized detection, phantom deadlocks
-
Chandy-Misra-Haas algorithm, probe messages
-
Timestamp-based prevention: wait-die and wound-wait
-
You don’t have to remember the difference between wait-die and wound-wait
-
-
Two-phase commit protocol
-
Understand the need for a log
-
What happens in each of the phases?
-
Problems if coordinator or participants die
-
Uncertain state
-
-
Three-phase commit protocol
-
Precommit phase
-
Supporting coordinator recovery
-
Main problem with 3PC
-
-
Consistency models
-
Linearizability
-
Sequential consistency
-
Causal consistency
-
Eventual consistency
-
-
Scaling via replication and Brewer’s CAP Theorem
-
Consistency, Availability, Partitions
-
PACELC: how does it extend CAP?
-
-
Eventual consistency: BASE approach vs. ACID
Distributed Databases
-
NoSQL models
-
Key-value stores
-
Column-family or wide-column stores
Bigtable
-
Row keys, column families, column qualifiers
-
Timestamps
-
Sparse storage
-
Tablets, splitting, versions
-
Tablet servers, master server
-
Memtable, SSTable
-
Consistency model
-
Single-row atomicity; no transactions across rows
Cassandra
-
Peer-to-peer design
-
Distributed hash table, hash ring, virtual nodes
-
Data model: Partition key, Clustering columns
-
Replication factor
-
Tunable consistency
-
Memtable, SSTable, compaction
Spanner
-
Splits and Paxos groups
-
Two-phase commit
-
Strict two-phase locking
-
Wound-wait
-
Multiversion concurrency control
-
TrueTime and commit wait
-
Read-only transactions
-
Snapshot reads
-
External consistency
Database Comparisons
-
Key-range partitioning vs. hash partitioning
-
Bigtable vs. Cassandra
-
Paxos vs. two-phase commit
-
Serializability vs. external consistency
Distributed Computation
MapReduce
-
Master and workers
-
Shards
-
Map tasks, reduce tasks
-
Intermediate key-value pairs
-
Partitioning by key
-
Shuffle and sort
-
Locality
-
Stragglers, speculative execution
-
Failure handling
BSP
-
Supersteps
-
Barrier synchronization
-
Message passing
-
Checkpointing
Pregel / Giraph
-
Vertex-centric computation
-
Supersteps
-
Message passing
-
Vote to halt
-
Active and inactive vertices
-
Global termination
Spark
-
Driver program, workers, executors
-
RDDs, immutability
-
Partitioning
-
Transformations and actions
-
Lazy evaluation
-
Lineage
-
Caching
-
Narrow vs. wide dependencies
-
Shuffle
Distributed Machine Learning
- I won’t ask about Distributed Machine Learning
Comparisons
-
MapReduce vs. Spark
-
BSP vs. Pregel
-
Batch vs. iterative workloads
Data in Motion
-
Event streaming
-
CDNs
-
Peer-to-peer distribution
Kafka and event streaming
-
Queuing vs. publish-subscribe
-
Brokers
-
Producers, consumers, consumer groups
-
Topics and partitioned logs
-
Offsets
-
Ordering within a partition
-
Replication: Leaders and followers
-
Scale and fault tolerance
-
Retention and replay
Stream prcessing
-
Stateless vs. stateful processing
-
Idempotent sinks
-
Backpressure
-
Event time vs. processing time
-
Tumbling, sliding, session windows
-
Watermarks
-
Spark Structured Streaming
-
Micro-batch model
-
Unbouned table
-
Checkpointing
-
-
Apache Flink
- Just know that it does record-at-a-time streaming
CDNs
-
Flash crowd problem
-
Browser caching, caching proxies, load balancing, mirroring
-
Origin servers and edge servers
-
Push CDN, pull CDN
-
Static content, dynamic content, streaming video
Akamai
-
Overlay network, routing
-
Caching hierarchy
-
Request mapping
-
Edge caching
BitTorrent
-
Peer-to-peer architecture
-
.torrent file
-
Swarm
-
Trackers, seeders, and leechers
-
Pieces, Rarest-first
-
Tit-for-tat incentive model
-
Use of DHT (not how it is implemented)
-
Scalability and fault tolerance
Edge computing
- I won’t ask you about edge computing
Comparisons
-
RabbitMQ vs. Kafka
-
Event streaming vs. traditional messaging
-
CDN delivery vs. direct origin delivery
-
DNS-based routing vs. anycast
-
CDN vs. BitTorrent distribution
Security in Distributed Systems
-
Security goals: Confidentiality, integrity, authentication, authorization, non-repudiation
-
Lateral movement
Cryptography
-
Symmetric and asymmetric cryptography
-
Hashes, MACs, digital signatures
-
Replay attacks, freshness mechanisms
TLS and certificates
-
TLS vs. mTLS: what they do, what they authenticate, not the protocol steps
-
Certificates, certificate authorities (CA), certificate chain concept: purpose of each
-
Expiration, rotation, revocation
Identity and access
-
Authentication vs. authorization
-
OAuth, OpenID Connect: what is each used for?
-
Access tokens, ID tokens, refresh tokens
-
JWTs and claims: just the concept
-
Workload identity, SPIFFE - just what it does
-
Principle of least privilege
Architecture and operations
-
Zero Trust concept
-
Micro-segmentation
-
API gateway: purpose
-
Service meshes: purpose
-
Don’t bother remembering north-south, east-west
-
You don’t have to know secret management
-
Short-lived credentials vs. long-lived keys
Design mistakes
-
Trusting the internal network
-
Authenticating users but not services
-
Long-lived shared secrets
-
Missing authorization checks
-
Over-privileged roles
-
Storing secrets
Clusters
-
Definition, single system image
-
Commodity hardware
-
Clustering types: high performance (HPC), high availability (HA), load balancing. storage (just the goals)
-
Top-of-rack (ToR) switch, Spine-leaf topology
-
Purpose of Remote DMA (RDMA)
-
Handling failure: heartbeats, fencing, split brain problem, quorum
-
Borg: purpose
-
Borgmaster role
-
Borglet role
-
Scoring and filtering
-
I won’t ask about cgroups, namespaces, jobs, tasks, and allocs
-
-
Kubernetes: purpose
-
Pod, scheduler, kubelets
-
I won’t ask about etcd, controller manager, or cloud controller manager.
-
I won’t ask about container runtime or Kube-proxies
-
I won’t ask about kubernetes controllers or Services
-
-
Load balancing
-
I won’t ask about load-balancing strategies
-
Advantage of stateless services
-
-
I won’t ask about health checks
-
Cluster interconnect: understand there’s an overhead to leaving the rack and to leaving the datacenter
-
What is a clustered file system? How does it differ from a network (or distributed) file system?
-
What is a heartbeat network?
-
Active/active vs. active/passive operation
-
Failover types: cold, warm, hot; multidirectional, cascading
Last update: Tue May 05 12:24:11 2026