Distributed Systems

Introduction

Paul Krzyzanowski

September 12, 2022, September 11, 2023

The way we use computers has evolved over time from mostly running self-contained programs on a single computer (e.g., editing a document) to depending on services running on other computers. This ranges from editing remotely hosted documents to running a Google search, which uses approximately 1,000 computers to provide answers in around 200 ms, to playing online games such as Fortnite, which may have over 3 million concurrent users interacting with servers in 24 data centers, or Minecraft, with over 17 million players playing concurrently on 54,000 servers.

We define a distributed system as a collection of independent computers connected through a communication network that work together to accomplish some goal.

By independent computers, we refer to multiple stand-alone computers. Each runs as a separate system, with self-contained hardware booting its own operating system and running its own set of processes. None of these computers depends on another to function, although we may run additional software on these computers to coordinate activities.

Being independent, these computers have no shared memory. Since processes in a distributed system work together, they need to exchange data and coordinate their activity. Without shared memory, they must do this over a communication network.

Independent computers also have no shared clock. This can make it challenging to know the precise order in which events occur. We cannot expect two computers to report the same time of day. We also cannot rely on processing times to be the same across the systems: they may have different clock speeds, processors, operating system schedulers, and system loads.

Why do we want distributed systems?

Computer networks and distributed systems have existed in some form practically since the birth of computers. For example, SAGE was an air defense system deployed in 1957 that grew into a set of 24 connected IBM AN/FSQ-7 computers. For fault tolerance, each site contained two computers, with one ready to take over if the other failed. The network between them used telephone lines. In 1960, Sabre, the first online airline reservation system, was deployed on two IBM 7090 computers. Over the next several years, these computers were connected to 2,000 terminals via phone lines.

The deployment of widely-used distributed systems took another several decades and began to accelerate greatly with widespread access to the Internet in the early 1990s.

Why are distributed systems more interesting now than they may have been a few decades ago? Several advances in various areas of computing technology, as well as the availability of computing and networking to the public, had a profound effect on the value and design of distributed systems.

Since computer networking went mass market in the 1980s, local area network speeds increased by a factor of a thousand and wide-area (Internet) speeds by even more.

Connectivity within a local area network (LAN) moved from shared to switched networking, allowing the network to scale without increasing congestion. Internet access, which in 1989 was largely limited to universities and companies working on Department of Defense projects, has become available to the population at large, and the World Wide Web was launched in 1993.

Processor performance, system memory, and disk capacity also increased by more than a thousandfold over the past few decades.

Let’s explore a few reasons that make distributed systems appealing.

Scale

Scaling deals with being able to get better performance. It’s no surprise that computers are getting faster. This has been happening continuously since the 1940s when computers were first built.

In 1965, Gordon Moore, a co-founder of Intel, looked at the increasing number of transistors that could be built on an integrated circuit. He predicted that the number of transistors in an integrated circuit would approximately double every year for the next decade. He adjusted this in 1975 to forecast that the number would double every two years.

This observation has been used to predict that processor performance would also increase at a comparable rate.

Amazingly, Moore’s prediction, called Moore’s Law (not a real law; it’s just an observation he made), has continued to hold, at least approximately, to the present time.

Moore’s Law getting stressed

This growth in performance didn’t happen just because we could put more transistors on a chip, as Gordon Moore’s observation alone might suggest. It has been getting increasingly difficult for chipmakers to keep up with the predictions of Moore’s Law: we’re reaching the limits of physics, and it’s getting harder to make transistors smaller.

Intel created a process to stack components atop one another in vertical layers. Similarly, AMD uses 3D stacked memory on its processors. Chip die sizes have also been getting larger, allowing for more transistors to be placed on them.

With more transistors, processors have become increasingly sophisticated, with pipelined architectures and speculative execution (pre-executing future instructions and undoing their effects if conditional branches or other operations take a different path than expected). The expected growth in performance simply due to more transistors and higher operating frequencies began tapering off in the early 2000s.

We squeezed out more performance by adding more cores per chip, but this only benefits multithreaded programs. There are also limits on how much parallelism you can hope to extract from programs (see Amdahl’s Law, another fake law).
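
For reference, here is the usual statement of Amdahl’s Law (a standard formula, not given in the text above): if a fraction p of a program’s work can be parallelized across n processors, the best possible speedup is

    S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}

So a program that is 90% parallelizable (p = 0.9) can never run more than 10 times faster, no matter how many cores we add.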

Putting more transistors on a chip also increases the cost and the power that the chip consumes. For example, Intel’s Xeon W-3175X has 28 cores per chip on 8 billion transistors but sucks 255 watts and costs almost $3,000. AMD’s EPYC 7601 processor has 32 cores per chip on 19.2 billion transistors. It consumes 180 watts and costs over $4,000.

In the 2010s, engineers extracted more performance by turning to heterogeneous computing: building chips that, instead of containing a bunch of identical cores, also contain specialized processors. These include GPUs (graphics processing units), image processors, cryptography processors, and network processors. For example, Apple’s M2 Max processor contains 12 CPU cores, a 38-core GPU, a 16-core Neural Engine, two video encode engines, two ProRes engines, and a Secure Enclave on-board coprocessor.

The quest to squeeze out more performance continues. In 2023, the suggested approach is system technology co-optimization (STCO). This optimizes the transistor and interconnect designs individually for each functional component on a chip.

Scaling has limits

If we want more performance than we can get from a single processor, we need to use multiple processors. Multiprocessor systems have limits on how many processors you can put in one system. Moreover, the cost goes up tremendously as more processors are supported since more complex circuit boards and interconnects between memory and processors are required.

Vertical scaling refers to adding more power to an existing system, such as replacing it with a faster one, adding more processors, or adding more memory.

Horizontal scaling allows us to scale performance by adding more machines; that is, building distributed systems. With distributed systems, we have the potential to achieve incredible performance since we can put together a collection of hundreds of thousands of computers.

Computing needs exceed CPU advances

We assemble distributed systems because there are many applications where we need more computing power than we could ever get from a single computer. For example, to render the movie Toy Story back in 1995, animators used 117 computers running 24 hours a day. Each frame would take between 45 minutes and 30 hours to render. They had to render a total of 114,240 frames for the movie. Twenty-four years later, Toy Story 4 came out. The quality of rendering was much higher, but it took even more time to render each frame of video, even with computers that were twenty-four years newer and faster. Each frame required between 60 and 160 hours of rendering time.

Pixar used 150,000 compute cores and 7.3PB of storage to render the 2023 movie Elemental vs. 24,000 cores in 2020’s Soul. The cluster of computers needed direct access to 2PB of data vs. the 300–500TB of data that earlier movies required.

Google averages 63,000 searches per second and consults an index of over 130 trillion pages. They use hundreds of thousands of servers to make this happen. Every query travels, on average, 1,500 miles to a data center and back to return the answer to the user, uses 1,000 computers, and takes two-tenths of a second to complete.

Collaboration

We rely on distributed computers to work and play together. Many of the applications we use for social connectivity, commerce, and media consumption rely on distributed computing.

Metcalfe’s “Law” states that the value of a telecommunications network is proportional to the square of the number of connected users of the system. As with Moore’s Law, this isn’t a real law, of course. The “square” part comes from the fact that the number of edges in a fully-connected graph is proportional to the square of the number of vertices. A vertex represents a person and an edge represents the communication path from one person to another. Simply put, there’s a lot of value (for some definition of value) in being able to communicate with a lot of people. Without it, services such as TikTok, Twitter, eBay, Instagram, Facebook, or the web in general would not be nearly as useful or valuable.

Reduced Latency

Computers in a distributed system may be far away. Distance results in communication delays – extra latency.

We can deploy distributed systems to reduce this latency by setting up computers to cache data closer to the users that need the data. A cache is a temporary copy of frequently-accessed data that is located close to where it is needed.

Processors use caches to store frequently-used regions of memory to avoid the overhead of accessing main memory. In distributed systems, we use caching so we can stream videos or download OS updates for our phones.

Caching differs from replication. Both are copies of data but replication is used for fault tolerance and is a complete copy of data that shouldn’t be deleted under normal operation. A cache is just a temporary copy for optimization.
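
As a rough illustration of the idea (not any particular product’s design), here is a minimal client-side cache sketch in Python. The fetch_from_origin() call and the 60-second time-to-live are assumptions for the example:

    import time

    CACHE = {}     # key -> (value, expiry time): temporary copies only
    TTL = 60.0     # seconds an entry stays valid; an arbitrary assumption

    def fetch_from_origin(key):
        # Stand-in for the real (slow, remote) request.
        return "value of " + key

    def get(key):
        value, expires = CACHE.get(key, (None, 0.0))
        if time.time() < expires:                 # fresh copy: skip the network
            return value
        value = fetch_from_origin(key)            # slow path: go to the origin
        CACHE[key] = (value, time.time() + TTL)
        return value

Unlike a replica, nothing breaks if an entry expires or is evicted; the next request simply goes back to the origin.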

Some popular services that use caching include Akamai, Cloudflare, Amazon’s Cloudfront, Apache Ignite, and even Dropbox.

Mobility

Laptops, phones, and tablets depend on distributed systems software. Current (2023) estimates are that there are around 6.8 billion users of smartphones.

There are all sorts of other mobile or remote networked devices as well. These include cars, traffic cameras, toll collection systems, and shipping containers. There are also vending machines, home lighting, home assistants, vacuums, and other home appliances. We often don’t think about these Internet-connected devices but, since 2017, there have been more of them than humans.

High availability & fault tolerance

One of the most important aspects of designing distributed systems is finding ways to keep services running. We want high availability – so the service is always working. And we want fault tolerance – so we can handle situations when things break.

Failure is a fact of life in distributed systems. It’s simply a matter of statistics: if you have a collection of thousands of systems, it is very likely that on any given day something goes wrong: a computer or disk dies, a switch goes bad, a network cable is unplugged, or something loses power.

We will explore this in greater depth a bit later.

Incremental growth & cost

Distributed systems can make it economical to deploy services. You don’t need to start with the most powerful computers you can buy and you don’t need to start with a data center full of servers.

In 1996, Google ran on a handful of servers with 10 4GB drives – a total of 40GB. The simplest phones have more memory than that! Facebook started on one rented server that cost them $85/month. eBay’s deployment started off as Perl code on a hosted FreeBSD server.

With the right design, one can add computers for more computing power and more storage incrementally. The same software you implement on a $35 Raspberry Pi can be expanded to run on a massive network of systems spanning the globe.

Delegated operations

Another benefit of distributed systems is that they let us have someone else manage systems and software for us.

In the 20th century, it was common practice for companies to use their own computers and storage servers. They would have their computers on their premises and be responsible for doing their own maintenance and backups. With some exceptions, most companies now can simply use virtual machines from Amazon, Microsoft, Google, or other companies.

The responsibility of keeping those computers updated and running properly becomes someone else’s problem. You can be up and running within minutes because you don’t have to order or set up any hardware.

You can deploy as many systems as you want and isolate services on individual computers if you’d like. You have access to unlimited storage. If you run low, you can seamlessly get more.

It’s super convenient. For businesses, it sometimes helps with accounting since companies only incur operational expenses, avoiding the hassle of large up-front capital expenses and depreciation.

Taxonomy of computer architectures

One way of classifying system architectures is via Flynn’s taxonomy, proposed by Michael J. Flynn way back in 1966. He categorized computers based on the number of concurrent instruction streams and the number of data streams.

SISD
Single instruction stream, single data stream. Refers to conventional single-processor systems.
SIMD
Single instruction stream, multiple data streams. Refers to single-processor computers where each instruction may process a collection of data. Vector and array processors fall into this category. Examples include graphics processors, cell processors, Intel’s Streaming SIMD Extensions (SSE4) in the Core microarchitecture, and AMD’s K10 family of processors. Intel and AMD processors support Advanced Vector Extensions (AVX) and Streaming SIMD Extensions (SSE2), both of which add support for vector processing, and ARM® architectures support the NEON™ SIMD engine.
MISD
Multiple instruction streams, single data stream. This category does not really make a lot of sense since it implies that multiple processors all process the same data. The term has occasionally been used to refer to replicated fault-tolerant systems.
MIMD
Multiple instruction streams, multiple data streams. Refers to any computers with multiple processors, where each processor operates on its own stream of data. This category covers both parallel (multiprocessor) and distributed systems.

MIMD can be further categorized by identifying whether the system has shared memory or not. An architecture where multiple processors communicate through shared memory is called a multiprocessor system. Examples are conventional PCs with multiple processors on a single system bus or multi-core systems.

Multiprocessor systems are characterized by three features: (1) the processors all share the same memory, (2) they all share the same clock, and (3) they exhibit an all-or-nothing property to system failure. What this last item means is that if the system is dead, none of the processes are running. With a multicomputer system, it is certainly possible to have one computer functioning while another is not. This is called partial failure.

The most common architecture for a multiprocessor system is symmetric multiprocessing, or SMP. In this kind of system, all processors are connected to the same shared memory and run the same operating system. No one processor has faster or prioritized access to memory or system peripherals than any other.

Systems without shared memory are collections of separate computers, each with their own memory. They must rely on a network to communicate and are sometimes referred to as networked computers or multicomputers.

Interconnect

The communication between processors may be a bus or a switch – used either for access to memory or for communication networks among computers.

A bus is a shared communication line. It has the advantage that any system can look at what write operations are being sent by other processors and update any cached copies it might have. This examination of the bus is called snooping.

The downside of a bus is that everyone shares it so it can become a point of congestion. Most modern architectures use switched connections. These provide a dedicated link between a processor and a region of memory so the chances of two processors competing for the same connection are greatly reduced. Switches give us scalable bandwidth. In communication networks, switched connections allow a pair of computers to communicate without affecting the bandwidth or latency of messages flowing between other computers.

Delay and bandwidth

We can also consider delay when looking at the characteristics of MIMD architectures – the time it takes to send messages or read data – and bandwidth, which is the amount of data a processor can read or write per second.

Low delay and high bandwidth give us tightly coupled systems; higher delays and lower bandwidth give us loosely coupled ones. There are no firm boundaries here, so we usually rely on intuition to distinguish tightly coupled from loosely coupled systems.

A box with 64 processors connected to a switched backplane is clearly a tightly-coupled system. A collection of computers scattered around the world is clearly a loosely-coupled system. A set of PCs in a data center connected via a 10 gigabit per second local area network falls in between.

Distributed Systems Software

Let us consider distributed systems from the point of view of software. Distributed systems are a collection of services that are accessed via network interfaces. Processes send messages to other processes over the network. These messages result in some action being performed and, usually, some results (in the form of a message) being returned to the sender.
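
As a minimal sketch of this request/response pattern (the loopback address and port are arbitrary choices, and a real service would add framing, error handling, and concurrency), two processes can exchange messages over TCP sockets like this:

    import socket, threading, time

    def server():
        # A process that receives a request message, acts on it, and replies.
        with socket.create_server(("127.0.0.1", 5000)) as listener:
            conn, _ = listener.accept()
            with conn:
                request = conn.recv(1024).decode()
                conn.sendall(("result of " + request).encode())

    threading.Thread(target=server, daemon=True).start()
    time.sleep(0.2)                        # crude wait for the listener to start

    # The client process sends a request message and waits for the result.
    with socket.create_connection(("127.0.0.1", 5000)) as c:
        c.sendall(b"do-some-work")
        print(c.recv(1024).decode())       # "result of do-some-work"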

Being independent, the computers in a distributed system have no shared operating system; each runs its own. In some cases, however, an operating system or supporting libraries may provide services to start or coordinate processes on different computers, allow processes on different computers to communicate easily, or even migrate processes from one computer to another. Some processes may also require the availability of certain remote resources, such as file servers. However, the case where some processes expect services from remote computers does not take away from the fact that the computers are independent and run autonomously.

Transparency as a design goal

One design goal in distributed systems software is to create a single system image. A single system image is the set of software abstractions that make a collection of independent computers appear as one system to users.

By having software make them appear as a single system, the user is not aware of the distribution. For example, you might run a program and not know what actual computer ran the program. Or you might run a program that, in turn, interacts with multiple other programs via messages in order to do its work. Creating a single system image involves hiding the fact that the system or software is distributed. The software should “just work.”

A single system image creates various forms of transparency. Transparency is about hiding distribution and complexity.

At a high level, it is about hiding distribution from users of the system, so they don’t have to think about multiple systems. You do a Google search at google.com without thinking about which servers you connect to.

At a low level, transparency is about hiding distribution from software, so that programmers don’t have to think about distribution. An underlying framework can move things to the right places and collect results.

A few areas where we want to provide transparency include:

Location
The user should not be aware of where the software is actually running or where resources reside.
Migration
The user should not be aware of the fact that the location of resources may have moved from one place to another or that a process was restarted on another computer, possibly in another data center.
Replication
The user should not be aware that data might be replicated for fault tolerance or for proximity in order to provide faster access.
Concurrency
The user should not be aware that multiple processes might be accessing resources at approximately the same time. Results should appear as if all the processes ran one after another in some order. This means that some processes might be temporarily locked from being able to access a set of resources while another process is using them.
Parallelism
Different operations can take place in parallel without users thinking about them. A single program, for example, could dispatch tasks onto multiple computers that will run in parallel.
Failure
A user will not be aware that some components may have failed. Things just work. The software can retransmit undelivered network messages or send requests to a backup computer if a server isn’t working or cannot be reached (a minimal sketch of this idea follows this list).
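
Here is that sketch of failure transparency at the client-library level. The replica names and the fetch_from() helper are stand-ins invented for the example: the caller just calls fetch(), and which replica answered, or whether one was down, stays hidden.

    import random

    REPLICAS = ["server-a", "server-b", "server-c"]   # assumed replica names

    def fetch_from(server, key):
        # Stand-in for a real remote call; a real one could raise on failure.
        return key + " from " + server

    def fetch(key, attempts=3):
        last_error = None
        for _ in range(attempts):
            server = random.choice(REPLICAS)
            try:
                return fetch_from(server, key)        # hide which server answered
            except (TimeoutError, ConnectionError) as err:
                last_error = err                      # quietly try another replica
        raise last_error                              # fail only if every attempt failed

    print(fetch("user/42"))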

Why are distributed systems different and challenging?

There are three main difficulties in distributed systems design: concurrency, latency, and partial failure.

Concurrency

Concurrency relates to multiple things occurring at the same time. Because programs process multiple requests at more or less the same time, keeping all data consistent can be a problem.

You do not want a situation where one session is modifying some data that another session is still using. To deal with concurrency, we will need to understand critical sections and mutual exclusion. We may also need to handle different versions of data and understand event ordering. Mutual exclusion, which is essentially locking, can hurt performance, so we want to minimize its use while still ensuring that our results are valid.
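
As a minimal single-process sketch of a critical section (distributed mutual exclusion needs message-based coordination rather than a local lock, which we look at later):

    import threading

    balance = 0
    lock = threading.Lock()

    def deposit(amount):
        global balance
        with lock:                        # critical section: one thread at a time
            current = balance             # read
            balance = current + amount    # modify and write back without interleaving

    threads = [threading.Thread(target=deposit, args=(1,)) for _ in range(100)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(balance)   # 100; without the lock, interleaved read-modify-writes could lose updates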

Replication makes concurrency even more challenging. Any modifications to data now must be propagated to all replicas in exactly the same order. And we need to consider the case where some process might read an old version of data from a replica.

Latency

Latency is the delay caused by communications. Depending on the network and the distance between computers, some messages can take a long time to arrive. It helps if we know when to expect a message to arrive.

Networks fall into three categories. In a synchronous network model, there is some upper time bound between when a node sends a message and another node receives it. If we know the value of this maximum delay, then a program can distinguish between a node that has failed and a node that is taking a long time to respond. If a response doesn’t come within that upper time bound, then the node is dead.

In a partially synchronous network model, there’s still an upper bound for message communication but the programmer doesn’t know it – it must be discovered. Once discovered, programs that communicate will work correctly only if all messages are received within that time limit.

Finally, we have the asynchronous network model. Here, messages can take arbitrarily long to reach another node. We can make no guarantees on the worst-case time for message delivery. Unfortunately, the asynchronous model is what we get from the Internet and this is the model we will assume exists when we develop algorithms that rely on communications.

Because we cannot make assumptions on when, or even if, we will get a response, we cannot tell if a message is lost or delayed. This can cause a system to create duplicate messages when it thinks that a message has been lost when it is only delayed. It can also cause a system to assume that another system is dead or disconnected from the network. Because messages might arrive out of order, dealing with concepts of time and sequencing is more challenging.
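
One common way to cope with this, shown here as a sketch with stand-ins for the actual network send and the server’s work: the client retries after a timeout, and every request carries an ID so the server can recognize and discard duplicates created by retries.

    import uuid

    replies = {}                              # server side: request id -> cached reply

    def server_handle(request_id, payload):
        if request_id in replies:             # a duplicate created by a client retry
            return replies[request_id]
        result = "processed " + payload       # stand-in for the real work
        replies[request_id] = result
        return result

    def send(request_id, payload, timeout):
        # Stand-in for a real network call; a real one could raise TimeoutError.
        return server_handle(request_id, payload)

    def client_call(payload, retries=3, timeout=1.0):
        request_id = str(uuid.uuid4())        # the same ID is reused on every retry
        for _ in range(retries):
            try:
                return send(request_id, payload, timeout)
            except TimeoutError:
                continue                      # lost or merely delayed? we cannot tell
        raise TimeoutError("no reply: server may be dead, slow, or unreachable")

    print(client_call("update record 7"))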

Caching and latency

If we set aside the reliability of messages and just consider the delay, there is not much we can do about it. We are often limited by the speed at which electrons or photons move through their medium and the distance they must travel. In a communication network such as the Internet, we have the added overhead of moving data through multiple switches and routers.

The easiest way to reduce latency and make it quicker to get data from a network service is by caching the data. As we saw earlier, caching is a technique of making a temporary copy of frequently used data. For example, operating systems use a buffer cache to keep frequently accessed disk blocks in memory. When we look at remote file systems, we’ll see that some designs make a temporary copy of data onto the local disk.

The challenge with caching is making sure that the data is not out of date. If the main copy of the data is updated but the cached copy is not, then the cached data is said to be stale. Cache coherence is the technique of keeping the data correctly updated at all cached copies.
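
One simple way to avoid serving stale data, sketched below with an in-memory stand-in for the origin store (this is just the general idea, not a full coherence protocol): keep a version number with each cached value and validate it against the origin before use.

    CACHE = {}                                # key -> (value, version)
    ORIGIN = {"page": ("v1 contents", 1)}     # stand-in for the authoritative copy

    def origin_version(key):
        return ORIGIN[key][1]                 # cheap "is my copy still current?" check

    def fetch_from_origin(key):
        return ORIGIN[key]                    # full (value, version) transfer

    def get(key):
        cached = CACHE.get(key)
        if cached is not None and cached[1] == origin_version(key):
            return cached[0]                  # cached copy is still current
        value, version = fetch_from_origin(key)   # refresh a missing or stale entry
        CACHE[key] = (value, version)
        return value

    print(get("page"))

Validating on every access trades some latency for freshness; other designs instead push invalidations or updates from the origin to the caches.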

Partial failure

Our third area of design challenges is the phenomenon of partial failure.

Failure is one of the main recurring themes in distributed systems design. We cannot ignore it and pretend it doesn’t exist. In non-distributed systems, if a component fails, we usually experience total failure. Our program – or our entire computer – stops working because something broke.

A single computer, such as a multiprocessor system, exhibits an all-or-nothing failure: if something fails then the entire system is dead. In distributed systems, one system may fail while others continue to function.

In distributed systems, something outside of our computer might have failed while everything else continues to work. In fact, we might not be able to tell whether a remote system failed or a network link is broken. We simply don’t get a response and don’t know what went wrong. This is partial failure.

A distributed environment with asynchronous networks makes it impossible to know for sure if a system is dead. There is no central place that can be examined to tell us definitively whether something is working or not. Update messages cannot be guaranteed to arrive in a timely manner and network connections might be severed.

We need to live with this and work around it. If failures do occur, we need to design our software so that the state of the entire system can recover after the failure – or the software can function even with the failure.

Fault tolerance

Handling failure involves creating mechanisms to detect that something failed – or probably failed. It involves creating ways to recover from the failure and ways to restart services either on other systems or the original system if it’s still functional.

Two aspects of dealing with failure are availability and reliability.

Availability is a measure of the fraction of time that a system is usable. For example, a system that is 99.9% available may be down 0.1% of the time, which works out to a total downtime of about 8 hours and 46 minutes per year (0.001 × 8,766 hours ≈ 8.77 hours).

We address availability by adding redundancy but that’s not always simple because we need any redundant copies to be consistent.

Reliability can be confused with availability but is distinct from it. It is a measure of how long a system may function before failing. A common measure is MTBF (mean time between failures), although this isn’t statistically useful for the way many electronic devices, such as disk drives and flash memory, fail. A system may be unreliable but highly available. For instance, a computer may crash frequently but recover quickly.

Reliability also covers other topics, such as data not getting lost. This loss can be accidental, such as data corruption because of noise on the network, or malicious, such as an intruder changing the data. Thus, security mechanisms can help with reliability.

Handling failure is one of the central themes of distributed system design. We need to be able to handle detection, recovery, and restart.

Detection
Identify the cause of the failure
Recovery
Services - the distributed algorithms - need to work around the failure and continue to function properly. This might involve starting a service on a different system, electing a new coordinator, or stopping any attempts at communicating with the failed system.
Restart
At some point, the failed element may be brought back into the distributed system. It may be moments later, when a network cable is reinserted, or a far longer time if, for example, a failed processor needs to be replaced. Regardless of the elapsed time, the restarted system needs to reintegrate itself into the whole system. It may have missed messages and its knowledge of the world is outdated. For example, it may have hosted a replicated object store and missed getting information about new, updated, and deleted objects.

Fault tolerance needs to address both availability and reliability. Availability refers to the fraction of time that the system as a whole is usable. Since individual systems may fail, we achieve high availability via redundancy: deploying duplicate, triplicate (and more) systems. The design of redundant systems has to consider all of the three aforementioned points. Failure to properly address the restart of a system may create consistency problems, where one replicated server returns different data than another one.

Redundancy assumes that the overall system is designed to tolerate the failure of some components. For instance, if we have two systems, each with a downtime probability of 5%, and we need only one of them to be functioning, the probability that both systems will be down at the same time is P(A and B) = P(A) × P(B), or 5% × 5% = 0.25% (0.05 × 0.05 = 0.0025). Uptime is simply 100% minus downtime. By adding a redundant component, we increased the uptime from 95% to 100% - 0.25% = 99.75%.

This is a simple example of how redundancy can give us higher availability. If you cannot contact one computer, ask another one and hope that both aren’t down at the same time.

The converse to redundancy is when we design a system that requires all components to be functioning. With the same 5% downtime per component as in the previous example, the probability that the overall system is down is 1 - P(system A is up AND system B is up), which is 1 - (1 - 5%) × (1 - 5%), or 1 - 0.95 × 0.95 = 9.75%. Uptime is 1 - downtime, so in this case we have an uptime of 90.25% versus 95% for a single system. As we depend on more and more systems, the probability that at least one of them is down approaches 100%.

These examples illustrate series systems versus parallel systems. A series system fails if any of its components fail, while a parallel system fails only if all of its components fail. We try to avoid designing series systems.
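
The same arithmetic as a quick check, assuming each machine is down 5% of the time and fails independently of the other:

    p_down = 0.05

    parallel_down = p_down * p_down                   # need only one: both must fail
    series_down = 1 - (1 - p_down) * (1 - p_down)     # need both: either failing is enough

    print(1 - parallel_down)   # 0.9975 -> 99.75% available
    print(1 - series_down)     # 0.9025 -> 90.25% available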

Reliability deals with the integrity of data. We can have systems that appear to be functioning well but transmit garbled data. Or we might have malicious interference where an intruder is sending messages to confuse the systems. We can address message integrity with error detection but will also need to address issues of message authentication.

Failure can manifest itself in different ways:

Fail-stop
With fail-stop failure, the failed component simply stops functioning. Ideally, it will be able to detect its own failure and notify other members of the system first, but this is most often not possible. Halting refers to the explicit case where a component stops without any notice. We can try to detect failed components by sending messages over a network and setting timeouts for a response (a heartbeat-based detector is sketched after this list). Unfortunately, this is not foolproof because network latency is variable and the response might arrive after our timeout. Moreover, we can have problems with network connectivity between some hosts.
Fail-restart
Fail-restart is when a component restarts after a failure. The restart may be nearly instantaneous, so other systems didn’t notice, or it may be after a long interval. As we discussed earlier, the danger is stale state. The restarted component may have missed messages and hence has a view of the world that is obsolete.
Omission
Omission failure deals with networking. It is the failure to send or receive messages. This can be due to data corruption, queue overflows in routers, or overflows in the receive buffer in the operating system. An omission failure may cause a query or its response to get dropped, resulting in one system assuming that another one has failed.
Timing
With asynchronous networks such as IP, messages may take longer to arrive than we might expect. This can lead us to assume that a system is not responding, and hence not functioning, when it actually is operating. Another problem that is based on timing is that each system has its own clock and hence its own concept of the time of day. This can create undesirable behavior with process coordination, message ordering, and system logs.
Partition
A network of computers may be working but a link between two groups of systems may fail. For example, an Ethernet switch connecting two racks may fail or a cable may be disconnected. In this case, the network effectively fragments into two or more sub-networks that cannot communicate with each other. Each group of systems thinks the other group is dead.
Byzantine
Byzantine failures cover any failures where a component does not cease to function but instead produces faulty data. This can be due to bad hardware, software logic errors, network problems, or it can be due to malicious interference. To a large extent, we will address Byzantine failures on the network with the use of cryptography.
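
Here is the heartbeat-based detector mentioned under fail-stop, as a minimal sketch (the 5-second timeout is an arbitrary assumption): each node periodically reports that it is alive, and we suspect a node has failed if its last heartbeat is too old. On an asynchronous network this can only ever be a suspicion, not proof.

    import time

    HEARTBEAT_TIMEOUT = 5.0        # seconds; an assumed tuning value
    last_heartbeat = {}            # node id -> time its last heartbeat arrived

    def on_heartbeat(node_id):
        last_heartbeat[node_id] = time.time()

    def suspected_failed(node_id):
        last = last_heartbeat.get(node_id)
        return last is None or time.time() - last > HEARTBEAT_TIMEOUT

    on_heartbeat("node-7")
    print(suspected_failed("node-7"))   # False: heard from it recently
    print(suspected_failed("node-9"))   # True: never heard from it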

Regardless of the type of failure, a basic goal in distributed systems is to design a system that avoids a single point of failure. This is the case where one component is crucial to the functioning of the entire system. For example, we might have one process on one system that serves as a coordinator to dispatch and check on computation taking place on thousands of other systems (or keeps track of where various blocks of data live in a distributed storage system). A failure of the coordinator effectively causes the entire system to fail.

Global state

In a distributed environment, it helps for one process to know what other systems are doing. For instance, a process may need to know the currently active members of a group of processes that hold replicated data. A problem with distributed systems design is that nobody has the true global state of a system. Because we lack shared memory, we cannot instantaneously see the data or liveness of other processes. Any data that changes over the execution of a program is referred to as state. For example, state may be lists of live processes, group members, contents of a database, computation progress, lists of processes that have remote files open, etc.

A process obviously knows its own state. It can periodically report its state to other processes via network messages and it may receive updates from other processes over the network. However, these updates are neither continuous nor instantaneous. A process will only know the last reported state of other processes, which may not be equivalent to the current state of those processes.

One type of state is not just information about group membership or the status of processes but the data stored by network file systems, databases, and object stores. This data is shared among systems that are designed to act as replicas – redundant systems that will give us instantaneous backups in case one system dies or allow us to load balance requests among multiple systems. Here we also have to deal with the fact that not all processes will be updated instantaneously.

A restricted form of replication is a cache. A cache is simply local storage of frequently accessed data to reduce access latency. Instead of making a network request, a process can have a stored copy of the results. For example, a process may store the result set of common database queries. Caching can do a wonderful job of improving performance but also poses the risk of stale data – cached copies of data that are no longer valid since the original data has been modified.

Key approaches to distributed systems design

We will examine different types of systems. For many types of problems, we will see certain design elements come up repeatedly.

One technique that comes up repeatedly for handling large data sets is divide & conquer. This relies on breaking huge data sets into smaller chunks, called shards. Each system can then work on a shard. After that, the results can be merged together. Merging is usually faster than the processing of the data, especially if a lot of the data gets discarded, such as when doing a search.
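
A toy sketch of the idea (everything runs in one process here; in a real system each shard would live on its own machine and be searched in parallel): split the data into shards, search each shard independently, then merge the much smaller per-shard results.

    def shard(data, n):
        return [data[i::n] for i in range(n)]          # split the data set into n shards

    def search_shard(chunk, term):
        return [doc for doc in chunk if term in doc]   # work done independently per shard

    documents = ["alpha beta", "beta gamma", "gamma delta", "delta alpha"]
    per_shard = [search_shard(c, "gamma") for c in shard(documents, 3)]
    merged = [hit for hits in per_shard for hit in hits]   # the (cheap) merge step
    print(merged)    # ['beta gamma', 'gamma delta']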

High availability is always important. For that, we will turn to replication. For reducing latency, we’ll turn to caching, which is a form of replication. In both cases, we have to ensure any replicated data is consistent with the original – even if systems fail and restart at a later time.

Finally, the concepts of consensus and quorum will allow us to get a group of systems to agree on something, whether it’s the order in which to process requests, whether a certain process should be in charge of coordination, whether a certain process should get exclusive access to some resource, or whether a set of transactions can be made permanent.
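
As a tiny sketch of the quorum idea (not any specific consensus protocol): with N replicas, require W acknowledgements for a write and R replies for a read. If W + R > N, every read quorum overlaps every write quorum, so a read sees at least one copy of the latest acknowledged write.

    def quorums_overlap(n, w, r):
        # With n replicas, w write acks and r read replies, overlap is guaranteed
        # whenever w + r > n, because the two quorums must share a replica.
        return w + r > n

    print(quorums_overlap(n=5, w=3, r=3))   # True: majority quorums always overlap
    print(quorums_overlap(n=5, w=2, r=2))   # False: a read could miss the newest write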

Service models

In software design, we often turn to layered architectures, where we break up application functionality into multiple layers of abstraction. Each layer presents well-defined interfaces and hides the specifics of its implementation. For example, a typical computer system has an operating system that provides well-defined access to system resources; middleware, linked to the application as a set of libraries, that abstracts things such as message encoding, communication, encryption, and database access; and various layers of abstraction created by the application designer.

With network systems, we often experience similar layers of abstraction but this time across systems. When our network-based software architecture mimics a layered design, we use autonomous processes that communicate with each other via a network interface rather than procedure calls. Each such layer of abstraction is known as a tier in a multi-tier model. It is a generalization of a client-server model.

Centralized
The original, non-networking, computing model is a centralized one, where all computing takes place on a single system.
Client-server
This is the dominant model of interaction in a networked system. One application, called the client (and usually run by the end user), requests something from another application, called a server. The server provides a service. Examples of this are a web browser (client) requesting a web page from a web server, a mail application (client) accessing a mail server to get mailbox contents, or a print server being given content to print. In this model, clients communicate with the server and not with other clients. The model can be enhanced with multiple “layers,” or services, to mimic a layered architecture, resulting in a multi-tier system.
Microservices
In a microservices architecture, a service is implemented as a collection of individual components. Each component is a microservice – an autonomous piece of software that fulfills a specific function and is accessed as a service (effectively a client-server model) by other components.
Peer-to-peer
A peer-to-peer architecture employs a collection of applications, any of which can talk to any other. These applications are peers and are generally run by a collection of end users rather than some service provider. The name peer implies that there is no leader: applications all have equal capabilities. An appealing aspect of a peer-to-peer design is self-scalability. As more and more computers join the collection of peers, the system has more peers to do the work and can hence handle a large workload. Examples of peer-to-peer architectures are BitTorrent and Skype.
Hybrid
A difficulty with peer-to-peer architectures is that one often needs to do things such as keep track of peers, identify which system can take on work or has specific content, and handle user lookup and authentication. This led to a variation of the peer-to-peer model where a coordinator, a central server, is in place to deal with these centralized needs. However, the peers still handle all the bandwidth-intensive or compute-intensive work.
Processor pool
A processor pool model is one where we have a pool of computers offering processing capabilities and we can dispatch processes to them. A coordinator tracks who is running what and when processes terminate. This is a common environment for render farms in movie production but is also common for large workloads in big data analytics and machine learning.
Cloud computing
The term cloud computing is much more of a business and marketing term than a technical one. Engineers often drew the network as a cloud and marketing people ran with that visual. Cloud computing basically refers to accessing services that run on computers you don’t own. There are different commonly-used forms of this.
Software as a Service, SaaS, refers to remotely hosted software. Examples are the G suite of Google Apps, Microsoft 365 apps, and salesforce.com. In this environment, you’re directly running specific programs on some remote server, usually through a web browser.
With Platform as a Service (PaaS), instead of an application, you are now given some application platform. For example, you might get a database, a web server, or a software development environment. You can configure these platforms however you want, but you don’t have access to the underlying machine.
With Infrastructure as a Service (IaaS), you can assemble your own data center. You can get one or more virtual machines and storage servers. You can define connections to a local area network and even configure load balancers and firewalls. Microsoft Azure, Google, and Amazon Web Services (AWS) provide forms of IaaS. For example, Google’s IaaS is a collection of their Compute Engine, Cloud Storage, Virtual Private Cloud (managed networking), and Persistent Disk (a service that looks like a disk drive to a virtual machine).
Finally, there are dedicated cloud services for remote storage. Dropbox, Box, Google Drive, and Microsoft’s OneDrive are common examples of this. There are other forms of this. Amazon S3 Glacier is designed for inexpensive long-term archival storage while S3, Amazon’s simple storage service, provides object storage through a web interface. Databases are also available as cloud services.

References (partial)

  • Andrew S. Tanenbaum, Maarten Van Steen, Distributed Systems: Principles and Paradigms (2nd Edition). © 2006 Prentice Hall
  • B. Clifford Neuman. Scale in Distributed Systems. In Readings in Distributed Computing Systems. IEEE Computer Society Press
  • George Coulouris, Jean Dollimore, Tim Kindberg, Gordon Blair, Distributed Systems: Concepts and Design (5th edition). © 2011 Addison Wesley
  • Intel Streaming SIMD Extensions Technology, Intel, September 10, 2018.
Last modified October 3, 2023.