Spring 2020 Reading List (superset)

Introduction

Jim Waldo, Geoff Wyant, Ann Wollrath, Sam Kendall, Sun Microsystems. A Note on Distributed Computing. IEEE Micro, 1994.
Barry M. Leiner, Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel, Larry G. Roberts, Stephen Wolff. Brief History of the Internet. October 15, 2012, Internet Society.

B. Clifford Neuman. University of Southern California. Scale in Distributed Systems, In Readings in Distributed Computing Systems. IEEE Computer Society Press, 1994.

This paper is somewhat long; read only pages 1-5.

David D. Clark, Massachusetts Institute of Technology. The Design Philosophy of the DARPA Internet Protocols. Proc. SIGCOMM ’88, Computer Communication Review Vol. 18, No. 4, August 1988, pp. 106–114.

This is old material, but it explains the reasoning behind the design of the Internet.

David Ott. Optimizing Applications for NUMA. Intel® Developer Zone Content Library. July 9, 2009.

The browser-friendly version is here.

Per Stenstrom, Truman Joe, and Anoop Gupta. Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures. Stanford University, © 1992 ACM.

Just read the first two pages, focusing on section 2.1.

J. H. Saltzer, D. P. Reed, D. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems (TOCS), Volume 2 Issue 4, Nov. 1984 Pages 277-288.

This paper introduced the end-to-end principle, summarized in this Wikipedia article.

Distributed Objects

Wollrath, Ann, Roger Riggs, and Jim Waldo. A Distributed Object Model for the Java System. Proceedings of the USENIX 1996 Conference on Object-Oriented Technologies 9 (June 1996): 219-232.

Clocks

How Does It Work? from the Network Time Foundation, ntp.org
Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558-565, 1978.

State machine replication

Ken Birman, A History of the Virtual Synchrony Replication Model. 2010.

The original paper is:
Ken Birman, Thomas Joseph, Cornell University. Exploring Virtual Synchrony in Distributed Systems. SOSP '87 Proceedings of the eleventh ACM Symposium on Operating systems principles. Pages 123-138.

Consensus

Leslie Lamport, SRI International. Using Time Instead of Timeout For Fault-Tolerant Distributed Systems. ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 6 Issue 2, April 1984, Pages 254-280.
Dale Skeen, A Quorum-Based Commit Protocol. Cornell University, TR 82-483, February 1982.

Describes the quorum technique for ensuring single-copy serializability. Also, read this Wikipedia article.

The Paper Trail, Consensus Protocols: Three-phase Commit. November 29, 2008.
Lampson, Butler, How to Build a Highly Available System Using Consensus, Microsoft Research, October 1, 1996

Leslie Lamport, Paxos Made Simple, Nov 1, 2001.

The classic paper explaining Paxos. Read these for additional clarification:

Paper Trail, Consensus Protocols: Paxos, February 3, 2009.
Butler Lampson, How to Build a Highly Available System Using Consensus.
Paxos Made Live - An Engineering Perspective.

Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), May 20, 2014
The Raft Consensus Algorithm: web page with an interactive visualization.

Distributed commit

Eric A. Brewer, Presentation: Towards Robust Distributed Systems, PODC Keynote, July 19, 2000.

Consistency

Werner Vogels, Eventually Consistent - Revisited. Communications of the ACM, 52(1):40-44, 2009.

Doug Terry, Microsoft Research Silicon Valley Lab, Mountain View, CA. Replicated data consistency explained through baseball. Communications of the ACM Volume 56 Issue 12, December 2013, Pages 82-89.

Seth Gilbert, Nancy Lynch. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News. Volume 33 Issue 2, June 2002 Pages 51-59.

File Systems

Robert Morris, Case study of the Network File System (NFS), course materials for 6.824 Distributed Computer Systems Engineering, Spring 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Draft Version of January 25, 2006.
Mike Burrows, The Chubby Lock Service for Loosely-Coupled Distributed Systems, Google Inc., OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006.

The definitive paper on Google's Chubby lock service.

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, 19th ACM Symposium on Operating Systems Principles, October, 2003.

Distributed Lookup

A Scalable Content-Addressable Network, Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan.
Amazon's Dynamo

All Things Distributed

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, Dynamo: Amazon’s Highly Available Key-value Store, SOSP’07, October 14–17, 2007, Stevenson, Washington, USA. Copyright 2007 ACM.

The definitive paper on Amazon Dynamo.

Fault tolerance

Bressoud, Thomas, and Fred Schneider. Hypervisor-based Fault Tolerance. ACM Transactions on Computer Systems 14, no. 1 (February 1996): 80-107.

Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endowment, 1(2):1277-1288, 2008.

Systems and frameworks

Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, 2010.

Jeffrey Dean and Sanjay Ghemawat, Google, Inc. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth USENIX Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.

See the slides for this paper here.
Also, read The Paper Trail article, The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google.

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford, Google, Inc. Spanner: Google’s Globally-Distributed Database. OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. Pages 251-264.

Google's Spanner distributed database

Mike Burrows, Google, Inc. The Chubby lock service for loosely-coupled distributed systems. OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation. Pages 335-350.

Google's Chubby

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Google, Inc. Bigtable: A Distributed Storage System for Structured Data, OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7.

Google's Bigtable

Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010 Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135-146.

Google's Pregel

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, University of California, Berkeley, NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, 2012.

The paper introducing Resilient Distributed Datasets (RDDs), the key idea behind Spark.