Exam 1 study guide

The one-hour study guide for exam 1

Paul Krzyzanowski

Latest update: Mon Dec 12 13:09:44 EST 2016

Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the one-hour time window in the title literally. This document is approximately 20 pages if you print it out.

Introduction

Go here for lecture notes

Why are distributed systems more interesting now than they may have been one or two dozen years ago? A number of advances in computing technology have had a profound effect on distributed systems. Since computer networking went mass market in the 1980s, local area network speeds have increased by a factor of a thousand and wide-area (Internet) speeds by even more. Connectivity within a local area network (LAN) moved from shared to switched networking, allowing the network to scale without increasing congestion. On the wide area, Internet access has become available to the population at large, not just to researchers on Department of Defense projects. In 1965, Intel co-founder Gordon Moore predicted that the number of transistors on integrated circuits, and hence processor performance, would double approximately every two years. This prediction, known as Moore’s Law, has turned out to hold true. Processor performance, system memory, and disk capacity have also increased more than a thousandfold over the past couple of decades.

Even though these improvements make it easier to store, process, and move data between systems, what are the motivations for distributed computing? There are several. First, with a single computer, performance does not scale linearly with price; with a collection of computers, this scaling may be possible. Second, distributing systems makes sense in certain environments: databases may reside in different locations than the user, for example. Third, we may need to interact with other people or remote data that are geographically dispersed. Metcalfe’s “Law” states that the value of a telecommunications network is proportional to the square of the number of connected users of the system. As with Moore’s Law, this isn’t a real law, of course. The “square” part comes from the fact that the number of edges in a fully-connected graph is proportional to the square of the number of vertices. A vertex represents a person and an edge represents the communication path from one person to another. Simply put, there’s a lot of value in being able to communicate with a lot of people. Without it, services such as Skype, Google, Twitter, eBay, Instagram, Facebook, and countless others would not be nearly as useful.

Taxonomy

One way of classifying system architectures is via Flynn’s taxonomy, proposed by Michael J. Flynn in 1972. He categorized computers based on the number of concurrent instruction streams and the number of data streams. SISD (single instruction stream, single data stream) refers to conventional single processor systems. SIMD (single instruction stream, multiple data streams) refers to single processor computers where each instruction may process a collection of data. Vector and array processors fall into this category. SIMD includes graphics processors, cell processors, Intel’s Streaming SIMD Extensions (SSE4) in their x86 family of processors, Intel’s Advanced Vector Extensions (AVX), the PowerPC’s AltiVec instructions, and the ARM NEON SIMD engine. Finally, MIMD (multiple instruction streams, multiple data streams) refers to any computer with multiple processors, where each processor operates on its own stream of data.

MIMD can be further categorized by identifying whether the system has a shared memory space or not. An architecture where multiple identical processors communicate with a single shared memory is called a multiprocessor. Examples are conventional PCs with multiple processors on a single system bus and multi-core systems.

Systems without shared memory are collections of separate computers, each with its own memory. They have to rely on a network to communicate and are sometimes referred to as networked computers or multicomputers.

Multiprocessor systems are characterized by three features: (1) the processors all share the same memory, (2) they all share the same clock, and (3) there is an all-or-nothing property to system failure. What this last item means is that if the system is dead, none of the processors are working. With a multicomputer system, it is certainly possible to have one computer functioning while another is not.

The most common architecture for a multiprocessor system is symmetric multiprocessing, or SMP. In this kind of system, all processors are connected to the same shared memory and run the same operating system. No one processor has faster or prioritized access to memory or system peripherals than any other.

Cache coherence

On a bus-based multiprocessor system, all the processors, system memory, and peripherals are connected to a shared system bus. Only one device can access the bus at any time. Since processors need to access memory continuously, there is constant contention for the system bus. Cache memory alleviates this: a cache is a small amount of memory local to each processor, with the expectation that most programs exhibit a certain amount of locality, allowing most memory references to be satisfied from the cache.

The problem with using cached data is that if a processor modifies a memory location, it will only modify the local copy in the cache. Other processors will read the previous value. This is unacceptable but can be remedied by having write operations pass through to the system bus so that main memory can be updated. This is known as write-through. There is a performance penalty for doing this because we need to use the bus for memory write operations, but read operations usually far outnumber write operations. The second problem is that, even if a processor updates the data in main memory, another processor may still access its own cached data, which is now obsolete. This can be remedied by having each processor’s cache controller monitor write operations on the system bus and detect whether others are modifying any cached addresses. This operation is known as snooping. Together, write-through and snooping comprise a snoopy cache. Virtually all bus-based multiprocessor systems employ a snoopy cache to ensure cache memory coherence.

Improving multiprocessor scalability

Because of their shared system bus, bus-based multiprocessor systems face increasing bus congestion as the number of CPUs in the system increases. Beyond eight or so processors, the effects of this congestion can become quite severe, rendering the architecture impractical for highly parallel systems.

From a performance point of view, a crossbar switch is an ideal solution. It allows the switching of any processor to any chunk of memory without interfering with the traffic from any other processor accessing any other chunk of memory. With a sufficiently large crossbar switch, memory can be divided into a lot of chunks and delays will only occur when two processors need to access the same region of memory at the same time.

The problem with a crossbar switch is that it is quite expensive, particularly in the sizes needed to support a large number of processors and a fine-grained partitioning of memory. An attempt to alleviate the cost of a crossbar switch is to decrease the aggregate number of switching elements (crosspoint switches) by increasing the number of switching stages. This is known as an omega network. Unfortunately, this slows down all memory accesses, as each request to memory has to pass through a number of crosspoint switches.

A compromise solution is a non-uniform memory architecture, or NUMA. This architecture associates part of the global system memory with each processor or a group of processors (node), typically on the same board. Every processor (or group/node) is able to access a portion of the system memory at high speed through a local bus. The remaining memory is accessible through a shared (but slower) backplane. The idea is that a program may be loaded in such a way that most memory references are from local memory (and fast).

NUMA is supported in many of today’s operating systems and processors. For example, in AMD’s HyperTransport technology (HTT), used in 64-bit Opteron CPUs, each CPU has its own bank of DDR memory and accesses another processor’s memory over the HyperTransport link. Intel announced support for NUMA in 2007 with their Nehalem and Tukwila processors. These processors can be connected to shared memory via the QuickPath Interconnect (QPI). The Intel Xeon supports an Integrated Memory Controller that provides a fast channel to local memory.

NUMA places more burden on software in order to achieve good performance. Since accessing memory on another processor takes more time than accessing local memory, the operating system’s memory manager needs to be aware of how memory is associated with processors and try to assign local memory. Moreover, when a process is rescheduled to run, the operating system should try to schedule it onto the same processor it ran on in the past so that most of its memory references remain local. This is called processor affinity.
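
Operating systems also expose affinity controls to applications. As a minimal sketch, assuming a Linux system, where Python’s os module wraps the underlying scheduling system calls:

    import os

    # See which CPUs the scheduler may currently run this process on
    # (pid 0 means "the calling process"). Linux-only calls.
    print("eligible CPUs:", os.sched_getaffinity(0))

    # Pin this process to CPU 0 so that, on a NUMA system, its memory
    # references can stay local to that processor's node.
    os.sched_setaffinity(0, {0})
    print("now pinned to:", os.sched_getaffinity(0))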

In older operating systems, a traditional scheduler used a single run queue to schedule processes onto available processors. A multi-queue scheduler with a separate run queue per processor was developed in the Linux 2.5 kernel to support the NUMA model (2002). Microsoft added NUMA support in their Windows Server 2003 and added further optimizations in Windows Server 2008.

Cache coherence in switched multiprocessor systems

Cache coherence is another issue that needs to be revisited with a NUMA architecture (or any switched architecture, for that matter). A snoopy cache will not work since the cache controllers on other processors do not see local memory traffic between other processors and their memory. The Stanford DASH project is an implementation of a cache coherent NUMA architecture (CC-NUMA), although the design works for any switched architecture.

Each processor is assigned a region of memory. This processor is known as the home node for that block of memory. Each processing node also maintains a directory: a table that identifies which remote nodes are using chunks (“lines”) of its local memory. It also identifies whether one of these remote nodes has the most recently modified (dirty) version.

When a processor gets a cache miss and needs to read main memory, it will either read memory from its home node or from a remote node. If the processor modifies a cached copy of remote memory, it sends a message to the home node informing it that it has the latest version. The home node will then send an invalidate message to each remote cache controller listed in its directory to invalidate its copy of remote memory. Similarly, when the processor modifies its local memory, it sends an invalidate message to each remote cache controller listed in its directory to invalidate its copy of remote memory.

Intel uses a similar approach in their processors. A CPU that needs to read memory sends a request over an inter-CPU switched network (QuickPath Interconnect) to the home agent, which is responsible for that region of memory. The home agent maintains a table called a directory that identifies the last processor to request (and hence cache) data at that location. The home agent forwards the request to the processor that has used that location previously and hence has the latest version of the memory’s contents in its cache (this processor is called the caching agent). The caching agent sends the results back to the requesting processor and a status update to the home agent. This technique is called home snoop.

An alternative technique, called source snoop, reduces the number of network hops needed to get the memory contents from three to two. Instead of just contacting the home agent, the requesting processor sends the request message to all processors. The one that has a cached version of the contents responds back with the contents.

Multicomputers

When we switch focus away from multiprocessors to multicomputers, the fundamental difference is that the processing elements no longer have access to the same shared memory system. Since the computers still need to communicate, an alternate communication mechanism is needed. This mechanism is the network interconnect. As with multiprocessors, the network interconnect may be bus-based or switched. A bus-based interconnect means that all systems are connected to the same communications bus and can see all the traffic on the network. The original design of Ethernet was bus-based. Today, bus-based Ethernet is generally not seen on local area networks, as switches have become cost-effective. A switched interconnect allows any pair of computers to communicate without affecting the bandwidth of other systems. However, switches generally simulate the features of a bus-based network to allow things like network broadcasts and multicasts to work.

Software

One goal in some distributed systems software is providing a single-system image. This is software that makes a collection of independent computers appear as a single system to the users. This involves hiding the fact that the system or computation is distributed. The software should “just work.” A few areas where we want to provide transparency include:

Location
The user should not be aware of where the software is actually running or where resources reside.
Migration
The user should not be aware of the fact that the location of resources may have moved from one place to another or that a process was restarted on another computer, possibly in another data center.
Replication
The user should not be aware that data might be replicated for fault tolerance or for proximity in order to provide faster access.
Concurrency
The user should not be aware that multiple processes might be accessing resources at approximately the same time. Results should appear as if all the processes ran one after another in some order. This means that some processes might be temporarily locked from being able to access a set of resources while another process is using them.

Service models

In software design, we often turn to layered architectures, where we break up application functionality into multiple layers of abstraction. Each layer presents well-defined interfaces and hides the specifics of its implementation. For example, a typical computer system has an operating system that provides well-defined access to system resources, middleware that is linked to the application as a set of libraries that abstract things such as message encoding, communication, encryption, and database access, and various layers of abstraction created by the application designer.

With networked systems, we often see a similar layering of abstraction. When our network-based software architecture mimics a layered design, we use autonomous processes that communicate with each other via a network interface. Each such layer of abstraction is known as a tier in a multi-tier model.

Centralized
The original, non-networking, computing model is a centralized one, where all computing takes place on a single system.
Client-server
This is the dominant model of interaction in a networked system. One application, called the client (and usually run by the end user), requests something from another application, called a server. The server provides a service. Examples of this are a web browser (client) requesting a web page from a web server, a mail application (client) accessing a mail server to get mailbox contents, or a print server being given content to print. In this model, clients communicate with the server and not with other clients. The model can be enhanced through a layered architecture to build a multi-tier system.
Peer-to-peer
A peer-to-peer architecture employs a collection of applications, any of which can talk to any other. These applications are peers and are generally run by a collection of end users rather than some service provider. The name peer implies that there is no leader: applications all have equal capabilities. An appealing aspect of a peer-to-peer design is self-scalability. As more and more computers join the collection of peers, the system has more peers to do the work and can hence handle a large workload. Examples of peer-to-peer architectures are BitTorrent and Skype.
Hybrid
A difficulty with peer-to-peer architectures is that one often needs to do things such as keep track of peers, identify which system can take on work or has specific content, and handle user lookup and authentication. This led to a variation of the peer-to-peer model where a coordinator, a central server, is in place to deal with these centralized needs. However, the peers still handle all the bandwidth-intensive or compute-intensive work.

Networking

Go here for lecture notes

Goal: Enable computers to communicate with each other; create the illusion of machine-to-machine and process-to-process communication channels.

We need a way for collections of systems (computers or other endpoint devices) to communicate. To do this, they use a communication network. If every communicating pair of hosts had a dedicated physical connection, then there would be no sharing and we would have a true physical circuit. This is not practical since it limits the ability for arbitrary computers to talk with each other concurrently. It’s also incredibly wasteful in terms of resource use: the circuit would be set up even if no data is flowing.

What is needed is a way to share the network infrastructure among all connected systems. The challenge in doing so is to avoid collisions, the case when two nodes on a network transmit at the same time, on the same channel, and on the same connection. Both signals get damaged and data is not transmitted. This is the multiple access problem: how do you coordinate multiple transmitters on a network to ensure that they can send their messages?

There are three broad approaches that enable us to do this:

  1. Channel partitioning divides the communication channel into “slots”. If the network is divided into short, fixed-length time slots, we have Time Division Multiplexing, or TDM. Each host must communicate only during its assigned time slot. Routers may connect multiple networks together. When two hosts need to communicate, they establish an end-to-end route called a virtual circuit. The setup of this route configures routers and assigns communication slots. This provides the illusion of a true circuit switched network in that all messages are guaranteed to arrive in order with constant latency and a guaranteed bandwidth. The switched telephone network is an example of virtual circuit switching, providing a maximum delay of 150 milliseconds and digitizing voice to a constant 64 kbps data rate.

    If the network is partitioned into frequency bands, or channels, then we have Frequency Division Multiplexing, or FDM. This defines a broadband network. Cable TV is an example of a broadband network, transmitting many channels simultaneously, each using a well-defined frequency band.

    The problem with a channel partitioning approach is that it is wasteful. Network bandwidth is allocated even if there is nothing to transmit.

  2. Taking turns ensures that we create some means of granting permission for a system to transmit. A polling protocol uses a master node to poll other nodes in sequence, offering each a chance to transmit their data. A token passing protocol circulates a special message, called a token, from machine to machine. When a node has a token, it is allowed to transmit and must then pass the token to its neighbor.

    The problem with the taking turns approach is that a dead master or lost token can bring the network to a halt. Handling failure cases is complex.

  3. A random access protocol does not use scheduled time slots and allows nodes to transmit at arbitrary times in variable size time slots. This is known as packet switching. Network access is achieved via statistical multiplexing. A data stream is segmented into multiple variable-size packets. Since these packets will be intermixed with others, each packet must be identified and addressed. These networks generally cannot provide guaranteed bandwidth or constant latency. Ethernet is an example of a packet-switched network.

Packet switching is the dominant means of data communication today. The packets in a packet-switched network are called datagrams, and are characterized by unreliable delivery with no guarantees on arrival time or arrival order. Each datagram is fully self-contained with no reliance on previous or future datagrams. This form of communication is also known as connectionless service. There is no concept of a connection and routers do not have to maintain any state as they have to do with a virtual circuit. However, each packet must be addressed with its destination.

Data networking is generally implemented as a layered stack of several protocols – each responsible for a specific aspect of networking. The OSI reference model defines seven layers of network protocols. Some of the more interesting ones are the network, transport, and presentation layers.

1. Physical
Deals with hardware, connectors, voltage levels, frequencies, etc.
2. Data link
Sends and receives packets from the physical network. Ethernet packet transmission is an example of this layer.
3. Network
Relays and routes data to its destination. This is where networking gets interesting because we are no longer confined to a single physical network but can route traffic between networks. IP, the Internet Protocol, is an example of this layer.
4. Transport
Provides a software endpoint for networking. Now we can talk application-to-application instead of machine-to-machine. TCP/IP and UDP/IP are examples of this layer.
5. Session
Manages multiple logical connections over a single communication link. Examples are SSL (Secure Sockets Layer) tunnels, remote procedure call connection management, and HTTP 1.1.
6. Presentation
Converts data between machine representations. Examples are data representation formats such as XML, XDR (for ONC remote procedure calls), NDR (for Microsoft COM+ remote procedure calls), and ASN.1 (used for encoding cryptographic keys and digital certificates).
7. Application
This is a catch-all layer that includes every application-specific communication protocol. For example, SMTP (sending email), IMAP (receiving email), FTP (file transfer), HTTP (getting web pages).

Ethernet is the most widely used data network for local area networking. Ethernet provides packet-based, unreliable, connectionless communications. It occupies layers one and two of the OSI model.

IP Networking

The Internet Protocol (IP) handles the interconnection of multiple local and wide-area networks. It is a logical network whose data is transported by physical networks (such as Ethernet, for example). The IP layer provides unreliable datagram delivery of packets between nodes (e.g., computers).

Since IP is a logical network, any computer that needs to send out IP packets must do so via the physical network. Typically, this is Ethernet, which uses a 48-bit Ethernet address that is completely unrelated to a 32-bit IP address. To send an IP packet out, the system needs to identify the physical destination address (MAC, or Media Access Control address) that corresponds to the desired IP destination. The Address Resolution Protocol, or ARP, accomplishes this. It works by broadcasting a request containing an IP address (the message asks, do you know the corresponding MAC address for this IP address?) and then waiting for a response from the computer with the corresponding IP address. To avoid doing this for every outgoing packet, it maintains a cache of most recently used addresses.

There are two transport-layer protocols on top of IP: TCP and UDP. TCP (Transmission Control Protocol) provides reliable byte stream (connection-oriented) service. This layer of software ensures that packets arrive at the application in order and that lost or corrupt packets are retransmitted. The transport layer keeps track of the destination so that the application can have the illusion of a connected data stream. UDP (User Datagram Protocol) provides datagram (connectionless) service. While UDP drops packets with corrupt data, it does not ensure in-order delivery or reliable delivery. Port numbers in both TCP and UDP are used to allow the operating system to direct the data to the appropriate application (or, more precisely, to the communication endpoint, or socket, that is associated with the communication stream).

On a packet switched network, connection-oriented service tries to give a datagram some of the characteristics of a circuit-switched or virtual circuit network. The software will typically send packet sequence numbers along with the data, buffer them in memory so they can be presented to the application in order, acknowledge received packets, and request a retransmit of missing or corrupt packets. The software will also keep track of source and destination addresses. We now have the illusion of having a network-level virtual circuit with its preset connection and reliable in-order message delivery. What we do not get is constant latency or guaranteed bandwidth.

The design of the Internet employs the end-to-end principle. This is a design philosophy that states that application-specific functions should, whenever possible, reside in the end nodes of a network and not in intermediary nodes, such as routers. Only if the functions cannot be implemented “completely and correctly,” should any logic migrate to the network elements. An example of this philosophy in action is TCP. TCP provides in-order delivery, reliable delivery, enforces flow control (ensuring that the receiver is not overwhelmed with data), and implements congestion control (dropping its transmission rate if packet losses increase). All this is implemented via software on the sender and receiver: routers are blissfully unaware of any of this.

A related principle is fate sharing, which is also a driving philosophy of Internet design. Fate sharing states that it is acceptable to lose the state information associated with an entity if, at the same time, the entity itself is lost. For example, it is acceptable to lose a TCP connection if a client or server dies. The argument is that the connection has no value in that case and will be reestablished when the computer recovers. However, it is not acceptable to lose a TCP connection if a router in the network dies. As long as alternate paths for packet delivery are available, the connection should remain alive.

Sockets are a general-purpose interface to the network provided to applications by the operating system. By this, we mean that they were not designed to support one specific network but rather provide a generic mechanism for inter-process communication. They are created with the socket system call and assigned an address and port number with the bind system call. For connection-oriented protocols, a socket on the server can be set to listen for connections with the listen system call. The accept call blocks until a connection is received, at which point the server receives a socket dedicated to that connection. A client establishes a connection with the connect system call. The “connection” is not a configuration of routers as with virtual circuits; it is just state that is maintained by the transport layer of the network stack in the operating system at both endpoints. After this, sending and receiving data is compatible with file operations: the same read/write system calls can be used. When communication is complete, the socket can be closed with the shutdown or close system calls.
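
As a minimal sketch of this sequence in Python (the port number is arbitrary, and the server runs in a thread only so the example is self-contained):

    import socket
    import threading

    # Server side: create a socket, bind it to an address and port,
    # and listen for incoming connections.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("localhost", 2211))
    listener.listen(5)

    def server():
        conn, addr = listener.accept()   # blocks until a client connects
        data = conn.recv(1024)           # read/write work like file I/O
        conn.sendall(b"hello, " + data)
        conn.close()

    threading.Thread(target=server, daemon=True).start()

    # Client side: create a socket and connect. The "connection" is just
    # transport-layer state at both endpoints, not a path through routers.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("localhost", 2211))
    client.sendall(b"world")
    print(client.recv(1024))             # prints b'hello, world'
    client.close()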

With sockets that use a connectionless protocol (e.g., UDP), there is no need to establish a connection or to close one. Hence, there is no need for the connect, listen, or shutdown system calls. The sendto and recvfrom system calls were created to send and receive datagrams since the read and write system calls do not enable you to specify the remote address. sendto allows you to send a datagram and specify its destination. recvfrom allows you to receive a datagram and identify who sent it.
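
A corresponding connectionless sketch needs only bind, sendto, and recvfrom; again, the port number is arbitrary:

    import socket

    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("localhost", 2212))    # no listen/accept for datagrams

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"a self-contained datagram", ("localhost", 2212))

    data, addr = receiver.recvfrom(1024)  # recvfrom identifies the sender
    print(data, "from", addr)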

Protocol encapsulation

We saw that if we want to send an IP packet out on an Ethernet network (IP is a logical network, so there is no physical IP network), we need to send out an Ethernet packet. The entire IP packet becomes the payload (data) of an Ethernet packet. Similarly, TCP and UDP have their own headers, distinct from IP headers (they need a port number, for example). A TCP or UDP packet is likewise treated as data by the IP layer. This wrapping process is known as protocol encapsulation.

Remote Procedure Calls

Go here for lecture notes

Goal: Provide a layer of abstraction for process-to-process communication that enables a process on one system to invoke a function, or service, on another system without having to deal with the problems of formatting data and parsing the request.

One problem with the interface offered by sockets was that it offered a send-receive model of interaction. However, most programs use a functional (procedure call) model. Remote procedure calls are a programming language construct (something provided by the compiler, as opposed to an operating system construct such as sockets). They provide the illusion of calling a procedure on a remote machine. During this time, execution of the local thread stops until the results are returned. The programmer is relieved of packaging data, sending and receiving messages, and parsing results.

The illusion of a remote procedure call is accomplished by generating stub functions. On the client side, the stub (sometimes known as a proxy) is a function with the same interface as the desired remote procedure. Its job is to take the parameters, marshal them into a network message, send them to the server, await a reply, and then unmarshal the results and return them to the caller. On the server side, the stub (sometimes known as a skeleton) is responsible for being the main program that registers the service and awaits incoming requests for running the remote procedure. It unmarshals the data in the request, calls the user’s procedure, and marshals the results into a network message that is sent back to the client.
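
As a toy sketch of both stubs, assuming a single add procedure, JSON marshaling, and UDP transport; the function names and port number are invented for illustration:

    import json
    import socket
    import threading

    def add(a, b):                       # the actual procedure (server side)
        return a + b

    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind(("localhost", 2213))

    def server_stub():                   # skeleton: unmarshal, call, marshal
        request, addr = srv.recvfrom(4096)
        msg = json.loads(request)
        result = {"add": add}[msg["proc"]](*msg["args"])
        srv.sendto(json.dumps({"result": result}).encode(), addr)

    threading.Thread(target=server_stub, daemon=True).start()

    def add_stub(a, b):                  # client proxy: same interface as add()
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(json.dumps({"proc": "add", "args": [a, b]}).encode(),
                    ("localhost", 2213))
        reply, _ = sock.recvfrom(4096)   # the calling thread blocks here
        return json.loads(reply)["result"]

    print(add_stub(3, 4))                # looks like a local call; prints 7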

There are a few hurdles to overcome in implementing remote procedure calls:

Parameter passing
Most parameters in our programs are passed by value. That is easy to do remotely: just send the data in a network message. Some parameters, however, are passed by reference. A reference is a memory address of the parameter. The problem with passing this is that memory is local, and a memory address passed from a client to a server will refer to memory on the server rather than to the intended contents on the client. There is no good solution to this except to understand the data that is being referenced and send it to the remote side (pass by value), where it will be placed in some temporary memory. A local reference to that copy can then be passed to the server function. Because the contents might have been modified by the function, the data will need to be sent back to the calling client and copied back to its original location.
Marshaling
All data that is sent needs to be represented as a series of bytes that can be placed into one or more network messages. This is known as marshaling. Not only must any data structure be sent in a serialized format (a sequence of bytes with no pointers), but the format of this marshaled data must be standardized between the client and server so that each side can make sense of the data it receives. Different processors and languages may use different conventions for integer sizes, floating point sizes and formats, placement of most significant bytes, and alignment of data.

The marshaled data that is sent over the network may contain data that makes it self-describing: identifying the individual parameters by name and type. Such a format is known as explicit typing. JSON and XML formats are examples of this. In contrast, implicit typing does not send such information and requires the remote side to know the precise expected sequence of parameters. (A short sketch contrasting the two appears after this list.)

Generating stubs
Since most languages do not support remote procedure calls natively, something has to generate client and server stubs. That is often a stand-alone program known as an RPC compiler. The RPC compiler takes an interface specification as input and generates client-side stubs (proxies) and server-side stubs (skeletons). The interface specification is written in an interface definition language (IDL) and defines remote classes, methods, and data structures. It contains all the information that the RPC compiler needs to generate stub functions.
Looking up services
A client process needs to find out how to set up a network connection to an appropriate service that hosts the remote procedures: which host and port to use. An RPC name server is a network service that a client can communicate with to query the host and port of a desired remote interface. The client sends the server a name identifying the interface. That “name” is often a number that uniquely identifies the service that hosts a set of functions on the server. The server returns a host and port number for the service that implements the functions. In many cases, the name server resides on the machine where the remote procedures run and the server will return only the port number. When a service starts up, it will register its interfaces with the RPC name server.
Handling failures
We don’t have a concept of local procedure calls not working. With remote calls, however, problems can arise. The server can stop working, or network connectivity may break or experience unexpected delays. These may prevent or delay requests from reaching the server or responses from reaching the client. To combat this, RPC libraries may attempt to retransmit requests if a response is not received in time. This may have the side-effect of invoking a procedure more than once if the network is slow. In some cases, no harm is done in doing this. Functions that may be run multiple times without undesirable side-effects are called idempotent functions. Functions that have undesirable side-effects if run multiple times (e.g., transfer $500 from my checking account to my savings account) are called non-idempotent functions. Most RPC systems offer at-least-once semantics, in which case a remote procedure will be executed one or more times (if there are network delays), or at-most-once semantics, in which case the RPC library will avoid resending the procedure request even if it does not get a timely response. Software that uses remote procedure calls has to be prepared to deal with errors that can arise from problems in contacting or getting a response from a remote procedure.
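
Here is the sketch promised in the marshaling discussion above, contrasting explicit and implicit typing. It marshals a call to a hypothetical set_temp(68, "F") procedure both ways; the procedure and field names are invented for illustration.

    import json
    import struct

    # Explicit typing: the message is self-describing; the receiver can
    # identify each parameter by name.
    explicit = json.dumps(
        {"proc": "set_temp", "params": {"degrees": 68, "scale": "F"}})

    # Implicit typing: just the values, packed in an agreed-upon order and
    # format (here, a 4-byte network-order integer followed by a one-byte
    # character); the receiver must already know this exact layout.
    implicit = struct.pack("!i1s", 68, b"F")

    print(explicit)   # readable and self-describing, but bulkier
    print(implicit)   # compact, but meaningless without the convention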

Remote Procedure Calls: case studies

Go here for lecture notes

Sun (ONC) RPC

Sun’s RPC, formally called ONC (Open Network Computing) RPC was one of the first RPC systems to achieve widespread use. It is still in use on virtually all UNIX-derived systems (Linux, OS X, *BSD, SunOS). It uses an RPC compiler called rpcgen that takes input from an interface definition language (IDL) file. This is a file that defines the interfaces to the remote procedures. From this, rpcgen creates client stub functions and a server stub program. These can be compiled and linked with the client and server functions, respectively.

Every interface (set of functions) is assigned a unique 32-bit number, known as a program number. When the server starts up, it binds a socket to any available port and registers that port number and interface’s program number with a name server, known as the portmapper, running on the same machine. A client, before invoking any remote procedure calls, contacts the portmapper on the desired server with the program number to find the port to which it needs to send its requests.

The choice of transport protocol, UDP or TCP, can be specified at run-time. All incoming and return parameters are marshaled into a standard format called XDR, or eXternal Data Representation.

DCE RPC

The Distributed Computing Environment, defined by the Open Group, created its own flavor of RPC which was similar to Sun’s. They also had the programmer specify an interface in an IDL, which they called the Interface Definition Notation (IDN).

To avoid the problem of picking a unique 32-bit identifier for the interface, DCE RPC provides the programmer with a program called uuidgen. This generates a universally unique ID (UUID) – a 128-bit number that is a function of the current time and Ethernet address.

The Distributed Computing Environment also introduced the concept of a cell, which is an administrative grouping of machines. Each cell has a cell directory server that maintains information about the services available within the cell. Each machine in the cell knows how to contact its cell directory server. When a server program starts up under DCE RPC, it registers its port and the interface’s UUID with a local name server (the DCE host dæmon, dced, which is similar to the portmapper). It also registers the UUID-to-host mapping with the cell directory server. This allows for location transparency for services: a client does not need to know what machine a service lives on a priori.

A standard way of encapsulating data (marshaling) is crucial since encodings may differ between different machines. Sun defined a standard format called XDR (eXternal Data Representation). Every participating system must convert (marshal) data into this format. DCE defined a format called NDR (Network Data Representation). However, instead of creating a single set of definitions, NDR defines a set of data representations that can be used. The hope is that the sender can find a format that will require minimal or no data conversion (hence, greater efficiency). If the client uses the same system architecture, it also will not need to convert data. This is known as a multi-canonical approach to data conversion.

As object oriented languages gained popularity in the late 1980s and 1990s, RPC systems like Sun’s and DCE’s proved incapable of handling some object oriented constructs, such as object instantiation or polymorphism (different functions sharing the same name, with the function distinguished by the incoming parameters). Creating objects, in particular, requires a need for memory allocation on the server and cleanup when these remote objects are no longer needed. This is called distributed garbage collection. A new generation of RPC systems dealt with these issues.

Microsoft COM+/DCOM & ORPC (MS-RPC)

Microsoft already had a mechanism in place for dynamically loading software modules, called components, into a process. This was known as COM, the Component Object Model and provided a well-defined mechanism for a process to identify and access interfaces within the component. The same model was extended to invoke remotely-located components and became the Distributed Component Object Model (DCOM), later fully merged with COM and called COM+. Because remote components cannot be loaded into the local process space, they have to be loaded by some process on the remote system. This process is known as a surrogate process. It runs on the server, accepting remote requests for loading components and invoking operations on them.

COM+ is implemented through remote procedure calls. The local COM object simply makes RPC requests to the remote implementation. Microsoft enhanced the DCE RPC protocol slightly to create what they called Object RPC (ORPC). This is essentially DCE RPC with the addition of support for an interface pointer identifier (IPID). The IPID provides the ability to identify a specific instance of a remote class. Interfaces are defined via the Microsoft Interface Definition Language (MIDL) and compiled into client and server side stubs. The client-side stub becomes the local COM object that is loaded when the object is activated. Like DCE, ORPC supports multi-canonical data representation.

Since objects can be instantiated and deleted remotely, the surrogate process needs to ensure that there isn’t a build-up of objects that are no longer needed by any client. DCOM accomplishes this via remote reference counting. This is an explicit action on the part of the client where the client sends requests to increment or decrement a reference count on the server. When the reference count for an object drops to zero, the surrogate process deletes that object. To guard against programming errors or processes that terminated abnormally, a secondary mechanism exists, called pinging. The client must periodically send the server a ping set – a list of all the remote objects that are currently active. If the server does not receive this information within a certain period, it deletes the objects. This is a form of leasing, where the object expires if a lease is not renewed periodically. However, there is a subtle difference. With leasing, the lifetime of an object is generally renewed whenever an object is accessed, so there is no need for the client to ping the server to renew a lease unless it has not accessed the object for the duration of the lease.

Java RMI

When Java was created, it was designed to be a language for deploying downloadable applets. In 1995, Sun extended Java to support Remote Method Invocation (RMI). This allows a programmer to invoke methods on objects that are resident on other JVMs.

Since RMI is designed for Java, there is no need for OS, language, or architecture interoperability. This allows RMI to have a simple and clean design. Classes that interact with RMI must simply play by a couple of rules. All parameters to remote methods must implement the Serializable interface. This ensures that the data can be serialized into a byte stream (marshaled) for transport over the network. Serialization is a core aspect of marshaling: converting data into a stream of bytes so that it can be sent over a network or stored in a file or database. All remote classes must extend the Remote interface. A remote interface is one whose methods may be invoked from a different Java virtual machine. Any class that is defined as extends Remote can be a remote object. Remote methods within that class must be capable of throwing a java.rmi.RemoteException. This is an exception that the client RMI library will throw if there is a communication error in calling the remote method.

RMI provides a naming service called rmiregistry to allow clients to locate remote object references. These objects are given symbolic names and looked up via a URI naming scheme (e.g., rmi://cs.rutgers.edu:2311/testinterface).

Java’s distributed garbage collection is somewhat simpler than Microsoft’s COM+. There are two operations that a client can send to the server: dirty and clean. When the first reference to a remote object is made, the client JVM sends a dirty message to the server for that object. As long as local references exist for the object, the client will periodically send dirty messages to the server to renew the lease on the object. When the local JVM’s garbage collector detects that there are no more references to the object, it sends a clean message for that object to the server.

Web Services

In the late 1990s, the web browser became the dominant model for user interaction on the Internet. It used HTTP, the Hypertext Transfer Protocol, over TCP for requests and responses and formatted web pages with HTML, the Hypertext Markup Language. While this created a decent user experience, dealing with fully-formatted pages was not good for programmatic access to that data. The user interface content (tables, images, text formatting) was a major component of the content. Approaches such as site scraping were often employed, where a program would request and receive an HTML page and then parse through tons of formatting directives to access the data it needed.

What we wanted was remotely-hosted services that programs, not users, can access. This enables machine-to-machine communication. Remote procedure calls would seem to be a natural choice for this but they also had issues when used on the Internet outside of an organization. Because of the convenience of “you don’t need to pick a port”, RPC solutions typically ran services over an arbitrary range of ports where the operating system selected some unused port and registered it with an RPC name server. This led to an administrative nightmare where an administrator could not set a firewall rule to allow or block a specific port. Even though some RPC solutions were designed to support multiple languages, most worked well with just one.

Web services are a set of protocols by which services can be published, discovered, and used over the Internet in a technology neutral form. This means that they are designed to be language and architecture independent. Applications will typically invoke multiple remote services across different systems, sometimes offered by different organizations.

The general principles of web services are:

  • Use text-based payloads, called documents, marshaling all data into formats such as XML or JSON. This ensures that the marshaling format does not favor a specific processor architecture or programming language. It also ensures that content-inspecting firewalls do not cause problems.

  • Use HTTP over TCP/IP for transport. This allows us to use existing infrastructure: web servers, firewalls, and load balancers.

  • Tolerate high latency. Servers are likely not to be on a local area network and may be slow to respond either due to their own load or due to network latency. This means that, where possible, programmers should strive for asynchronous interactions: dispatch a request and see if you can do something else useful before you get the response.

  • Tolerate a large number of clients. Applications that interact with web services tend to be more loosely-coupled than those that use distributed objects (remote procedure calls). Services may be run by different organizations and servers cannot count on well-behaved clients or a small number of them. When possible, the ideal design is a stateless one where each client request contains all the necessary data and the server does not have to store any state between requests. Documents, the unit of message exchange in web services, tend to be self-describing. That is, they will identify and itemize all the parameters (explicit typing) and also describe any additional state needed for the interaction.

The use of web services leads to a programming model called Service Oriented Architecture (SOA). Under SOA, an application is the integration of network-accessible services where each service has a well-defined interface. These services, or components, are unassociated and loosely coupled. By unassociated we mean that neither service depends on the other one; they are all mutually independent. By loosely coupled we mean that neither service needs to know about the internal structure of other services. To maintain this independence, web services generally forgo distributed garbage collection.

Functionally, you can do anything with web services that you can with distributed objects (RPC). The differences are usually philosophical. Web services focus on document exchange and are designed with high latency in mind. Document design is central to web services. Distributed objects tend to look at the world in a way where interfaces are the key parts of the design. The data structures (“documents”) passed in these interfaces just package the data for use by them.

XML RPC

XML-RPC was created in 1998 as a simple protocol that marshals all requests and responses into XML messages. It is essentially a marshaling protocol and the standard does not define an IDL or stub function generator. There are a lot of libraries to support XML-RPC and some languages implement it more transparently than others.
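
Python’s standard library is one example of fairly transparent support; a minimal sketch (the port number is arbitrary, and the server runs in a thread only to keep the example self-contained):

    import threading
    import xmlrpc.client
    from xmlrpc.server import SimpleXMLRPCServer

    server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
    server.register_function(lambda a, b: a + b, "add")  # expose one function
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # The proxy marshals the call into an XML request, sends it over HTTP,
    # and unmarshals the XML response.
    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))               # prints 5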

SOAP

XML-RPC took an evolutionary fork and, with the support of companies such as Microsoft and IBM, evolved into something known as SOAP, the Simple Object Access Protocol. The acronym has since been deprecated since SOAP has grown to the point where it is neither simple nor confined to accessing objects. XML-RPC is a subset of SOAP. In addition to remote procedure calls, SOAP added support for general-purpose messaging (send, receive, asynchronous notification). SOAP invocations are XML messages sent via the HTTP protocol. SOAP services can be described via a Web Services Description Language (WSDL) document. This is another XML document that essentially serves as an interface definition and defines all the names, operations, parameters, destination, and format of requests. WSDL is somewhat complex for human consumption. Typically, one creates an interface definition in a language such as Java and then uses a tool to translate that definition into a WSDL document. That WSDL document can then be fed to another tool (often by another programmer) to generate code that can be used to invoke remote functions or send remote messages.

In addition to WSDL, SOAP was also augmented with a directory service for storing and serving information about web services. The service is called UDDI, for Universal Description, Discovery, and Integration and uses SOAP to communicate, providing WSDL documents as a result. UDDI never really achieved widespread popularity or deployment.

JAX-WS: Java Web Services

As web services became popular, quite a few services to support them were created for the Java platform. One of the more popular ones, supported by Oracle (the owner of Java), is JAX-WS (Java API for XML Web Services). The goal of JAX-WS is to invoke Java web services using Java RMI. Unlike traditional Java RMI, interoperability is important since the remote side (client or server) may not necessarily use Java. JAX-WS uses SOAP and WSDL to achieve platform independence.

Microsoft .NET Remoting

A problem with Microsoft’s DCOM was that it was a somewhat low-level implementation. Clients had to provide reference counting of their objects explicitly, and the convenience of using DCOM depended very much on the language (easy in VB, difficult in C++). Moreover, the use of operating-system-selected random ports and a binary data format made it difficult to use over the Internet, where firewalls may block certain ports or requests need to be in an XML format and sent over HTTP.

With .NET, Microsoft provided (among other things) a runtime environment, a set of libraries, and a web server that provides inbuilt support for web services. For supporting functional access to web services, .NET provides a remoting capability that allows a process to interact with objects that reside on other processes and other machines.

.NET Remoting was created to be a successor to DCOM that would work more easily over the Internet (no random ports, for example, as well as the ability to use XML over HTTP). As with other RPC systems, client and server stub functions (proxies) are created. Microsoft’s Visual Studio application development environment hides the mechanics of this and integrates remote procedure calls directly into the language.

These stubs rely on the .NET runtime system to marshal parameters into one of several formats, which include SOAP and binary formats. The .NET runtime system then sends the message over a particular channel that was set up for transport. This channel may be HTTP using TCP/IP to send SOAP messages, TCP/IP with no wrapping protocol to send binary messages on a local-area network, or named pipes to send messages between processes on the same machine. Microsoft’s web server, IIS, can be configured to direct certain URLs that contain the SOAP (encapsulated in HTTP) request to specific objects running under the .NET framework; the object’s response goes back to the web server, which returns it to the caller.

.NET supports two forms of object activation:

  1. server activated objects are those where the client has no control in managing the lifespan of the object, and

  2. client activated objects are those whose lifetime is controlled by the client and for which distributed garbage collection must be handled. These behave like traditional remote objects.

Of server activated objects, two forms are supported:

a. Single call objects instantiate a new instance of the object for a call and immediately clean it up upon return. There is no ability to keep state.

b. Singleton objects share the same instance of the object among all callers. This is much like traditional function calls in non-object-oriented systems. Everyone shares the same state.

Client activated objects are created when the client requests a new object. This is similar to the way objects are handled in DCOM. .NET manages object lifetime of client activated objects via a Leasing Distributed Garbage Collector (LDGC). The lease manager on the server manages object lifetime by checking the object’s lease timer. When an object is created, it is given an initial lease time (five minutes by default). Each time a method is called on an object, the lease timer is renewed. The object will be cleaned up unless the client either references its objects or makes explicit calls to renew the lease time. This is a very similar concept to Java RMI with the exception that a client never sends a message stating that there are no more object references (Java sends a clean message to the server). There is no more reference counting as under DCOM.

REST

REST (REpresentational State Transfer) is a departure from the approach of SOAP. Instead of using HTTP simply as a conduit for sending and receiving XML messages, where everything you need is contained in the body, REST incorporates itself into the HTTP protocol. In particular, the URL incorporates the request and list of parameters. The protocol itself defines the core nature of the operation:

  • HTTP PUT: create something
  • HTTP GET: read something
  • HTTP POST: update something
  • HTTP DELETE: delete something

The body of the message will contain the document, which will be formatted data and not, as in the case of SOAP, a structure that also identifies the operations to be performed on the document.
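
As a sketch of this verb-to-operation mapping, using the third-party Python requests library against a hypothetical notes resource (the URL and resource layout are invented for illustration):

    import requests

    base = "http://api.example.com/notes"        # hypothetical resource

    requests.put(f"{base}/42", json={"text": "buy milk"})    # create note 42
    print(requests.get(f"{base}/42").json())                 # read it back
    requests.post(f"{base}/42", json={"text": "buy eggs"})   # update it
    requests.delete(f"{base}/42")                            # delete it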

Marshaling formats

When web services were first developed, the obvious marshaling format to use was XML. This was, in a rough way, what HTML used for describing the content of web pages. Its use was adopted by XML-RPC and SOAP and it remains heavily in use. However, it turned out to be a rather text-heavy protocol that was complex to parse. A lightweight alternative that gained much popularity is JSON (JavaScript Object Notation). Despite the JavaScript in the name, it was designed to be language-independent and easy to parse. It was touted as the “fat-free alternative to XML.” Even more efficient is the use of Google Protocol Buffers. This is a binary marshaling protocol and is not always suited for web services over HTTP but is phenomenally efficient for local services and for storing serialized data (e.g., saving objects in a file system).

Clock synchronization

Go here for lecture notes

Goal: Enable clocks on multiple machines to be synchronized to the same value.

No two clocks tick in perfect synchrony with each other. The difference between two clocks at any given instant is the clock skew. The rate at which the clocks are drifting is the clock drift. A linear compensation function adjusts the rate at which time is measured on a computer (e.g., number of ticks that make up a second).
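
As an illustration of a linear compensation function (all numbers here are made up): rather than jumping the clock, the system can stretch or shrink the apparent length of each tick until a measured skew is worked off:

    def compensated_seconds(hw_ticks, ticks_per_sec, skew, sync_interval):
        """Spread a measured skew (seconds, positive = clock is fast)
        over the next sync_interval seconds by scaling each tick."""
        rate = 1.0 - skew / sync_interval    # < 1 slows the clock down
        return (hw_ticks / ticks_per_sec) * rate

    # A clock found to be 0.3 s fast, amortized over the next 600 s:
    print(compensated_seconds(601_000, 1000, skew=0.3, sync_interval=600))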

Cristian’s algorithm sets the time on a client to the time returned by the server plus an offset that is one half of the transit time between the request and response messages: Tclient = Tserver + ½(Treceived - Tsent). It also allows one to compute the maximum error of the new time stamp. The error is ± ½[(round-trip time) - (best-case round-trip time)]. Errors are additive. If you incur an error of ±50 msec and the server’s clock source has an error of ±80 msec, your clock’s error is now ±130 msec.
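
A small sketch of the computation, with made-up timestamps (all values in seconds on the respective clocks):

    def cristian(t_sent, t_server, t_received, best_case_rtt=0.0):
        """Return (new_time, max_error). t_sent and t_received are read
        from the client's clock; t_server comes from the server's reply."""
        rtt = t_received - t_sent
        new_time = t_server + rtt / 2          # assume symmetric delays
        max_error = (rtt - best_case_rtt) / 2  # +/- seconds
        return new_time, max_error

    # Request left at 10.000, server replied 10.100, reply arrived at 10.050:
    print(cristian(10.000, 10.100, 10.050, best_case_rtt=0.010))
    # -> (10.125, 0.02): set the clock to 10.125 s, accurate to +/- 20 ms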

The Berkeley algorithm does not assume the presence of a server with an accurate time (i.e., one that keeps track of UTC time). Instead, one system is chosen to act as a coordinator. It requests the time from all systems in the group (including itself) and computes a fault-tolerant average (an arithmetic average, dismissing systems whose time values differ by more than a certain amount). It then sends each machine an offset by which to adjust its clock.
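
As a minimal sketch of the coordinator’s computation (the polled times, the outlier threshold, and the dismissal rule of measuring distance from the mean are all simplifications for illustration):

    def berkeley_offsets(times, threshold):
        avg = sum(times) / len(times)
        good = [t for t in times if abs(t - avg) <= threshold]  # drop outliers
        target = sum(good) / len(good)       # the fault-tolerant average
        return [target - t for t in times]   # offset to send each machine

    # The coordinator polled five machines (including itself); one is way off.
    print(berkeley_offsets([100.0, 102.0, 101.0, 104.0, 175.0], threshold=20.0))
    # -> [1.75, -0.25, 0.75, -2.25, -73.25]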

The Network Time Protocol, NTP, was created to allow a large set of machines to synchronize their clocks. A set of machines acts as time servers. This collection of machines is known as the synchronization subnet. The subnet is hierarchical: a time server’s stratum is the number of hops it is from a reference time source. Reference clocks (e.g., GPS receivers or atomic clocks) are at stratum 0, machines directly connected to a reference clock are at stratum 1, machines that synchronize from a stratum 1 server are at stratum 2, and so on. The Simple Network Time Protocol, SNTP, is a restricted form of NTP that does not support peer-to-peer synchronization of time servers. The peer-to-peer mode maintains state on drift and synchronization time and is intended for time servers to synchronize with each other. SNTP is essentially the same as Cristian’s algorithm. The formula for NTP and SNTP is time_offset = ½ (T2 - T1 + T3 - T4) where T1 is the time the request left the client, T2 is the time it arrived at the server, T3 is the time the response left the server, and T4 is the time that it arrived at the client. If we let TS = ½(T2+T3) then we arrive at Cristian’s formula.
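
As a minimal sketch of this computation (assuming the four timestamps are already available as doubles, in seconds):

/* SNTP-style offset from the four timestamps defined above */
double offset(double t1, double t2, double t3, double t4)
{
    /* ½(T2 - T1 + T3 - T4), which equals ½(T2 + T3) - ½(T1 + T4) */
    return ((t2 - t1) + (t3 - t4)) / 2.0;
}

Setting the client’s clock to T4 + offset yields ½(T2 + T3) + ½(T4 − T1), which is Cristian’s formula with Tserver = ½(T2 + T3).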

NTP encourages clients to try several servers. For each server contacted, the protocol computes

Offset:
The difference between the client’s clock and the server’s. This is how much the client needs to offset its clock to set it to the server’s time.
Delay:
This is an estimate of the time spent sending or receiving the message. It is one half of (the round-trip time minus the estimated time spent on the server).
Jitter:
Jitter measures the variation in delay among multiple messages to the server. It gives us an idea of how consistent the latency is between the client and server.
Dispersion:
Dispersion is the estimated maximum error for the computed offset. It takes into account the root delay (total delay, not just to the server but the delay from the server to the ultimate time source), estimated server clock drift, and jitter.

Given the choice of several NTP servers, NTP picks the server with the lowest dispersion; that is, the server from which it can get the most consistent and accurate time. If there is a tie, it then picks the one with the lowest stratum: the one closest to the master time source. Note that this may result in a client synchronizing from a higher-stratum server even if it can contact a lower-stratum one (one that is closer to the time source). This may happen if the client can access that server more reliably and with less delay and jitter than a “better” server.

The Precision Time Protocol (PTP, an IEEE standard) was designed with different goals than NTP. While NTP was designed for synchronizing time with machines across the Internet and with remote servers, PTP was designed for LANs: networks with low jitter and low latency. PTP is intended to achieve sub-microsecond precision among cooperating machines. Clock synchronization is initiated by a PTP master (unlike NTP, where the client host initiates the sync). The master announces its time to clients via a sync message. If a client is then interested in synchronizing its clock, it must communicate back to the master in order to discern the delay in communication. The client sends a delay request message and notes the time it was sent. The master sends back a delay response message containing the time of arrival of the delay request message. With this data, and assuming that network transit delay is the same in both directions, the client can compute its offset from the master’s clock.
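
A minimal sketch of the client’s arithmetic, assuming symmetric transit delay (the variable names are hypothetical): t1 is the master’s timestamp in the sync message, t2 the client’s clock when the sync arrived, t3 the client’s clock when it sent the delay request, and t4 the master’s clock when that request arrived (returned in the delay response).

/* PTP-style offset computation at the client */
double master_to_client = t2 - t1;   /* offset + one-way transit delay  */
double client_to_master = t4 - t3;   /* -offset + one-way transit delay */
double offset = (master_to_client - client_to_master) / 2.0;
/* subtracting `offset` from the client's clock aligns it with the master */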

Logical clocks

Go here for lecture notes

Goal: Allow processes on different systems to identify causal relationships among events, particularly among messages between these systems.

Lamport clocks allow one to assign sequence numbers (“timestamps”) to messages and other events so that all cooperating processes can agree on the order of related events. There is no assumption of a central time source and no concept of total ordering (when events took place). The central concept with logical clocks is the happened-before relation: a→b represents that event a occurred before event b. This order is imposed upon consecutive events at a process and also upon a message being sent before it is received at another process. Beyond that, we can use the transitive property of the relationship to determine causality: if a→b and b→c then a→c. If there is no causal relationship between two events (e.g., they occur on different processes that do not exchange messages or have not yet exchanged messages), the events are concurrent.

Lamport’s algorithm states that every event is timestamped (assigned a sequence number) and each message carries a timestamp of the sender’s clock (sequence number). A message comprises two events: (1) at the sender, the event of sending the message and (2) at the receiver, the event of receiving the message. The clock is a process-wide counter (e.g., a global variable) and is always incremented before each event. When a message arrives, if the receiver’s clock is less than or equal to the message’s timestamp, the clock is set to the message timestamp + 1. This ensures that the timestamp marking the receipt of a message is always greater than the timestamp of the corresponding send event.
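
A minimal sketch of these rules, where clock is the process-wide counter described above:

int clock = 0;                /* process-wide Lamport clock */

void on_event(void)           /* a local event or the sending of a message */
{
    clock++;                  /* increment before timestamping the event */
}

void on_receive(int t)        /* a message arrives carrying timestamp t */
{
    clock++;                  /* receiving is itself an event */
    if (clock <= t)
        clock = t + 1;        /* the receive stamp must exceed the send stamp */
}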

Vector clocks

One result of Lamport timestamps is that multiple events on different processes may all be tagged with the same timestamp. We can force each timestamp to be unique by suffixing it with a globally unique process number. While these new timestamps will not relate to real time ordering, each will be a unique number that can be used for consistent comparisons of timestamps among multiple processes (e.g., if we need to make a decision on who gets to access a resource based on comparing two timestamps).

A second deficiency with Lamport timestamps is that, by looking at timestamps, one cannot determine whether there is a causal relationship between two events. For example, just because event a has a timestamp of 5 and event b has a timestamp of 6, it does not imply that event a happened before event b.

A way to create timestamps that allow us to discern causal relationships is to use a vector clock. A vector clock is no longer a single value but rather a vector of numbers, with each element corresponding to a process. Before affixing a vector timestamp to an event, a process increments the element of its local vector that corresponds to its own position in the vector (for example, process 0 increments element 0 of its vector; process 1 increments element 1 of its vector). When a process receives a message, it sets the vector of the receive event by taking the higher of the two values, element by element, of its own vector and the vector received in the message. This becomes the new per-process vector from which future events on that process will be timestamped. Think of a vector timestamp as a set of version numbers, each representing a different author.

For example, in an environment of four processors, P0, … P3, P1 will “own” the second element of the vector. If one event on P1 is (2, 4, 0, 1) then the next event will be (2, 5, 0, 1). If the event after that is the receipt of a message with a timestamp of (3, 2, 9, 8) then the timestamp will be created by setting an element-by-element maximum of (2, 6, 0, 1) and (3, 2, 9, 8), which is (3, 6, 9, 8). We can illustrate this with the following pseudocode where e is the system’s vector counter and r is the received vector timestamp:

/* receive a message carrying vector timestamp r */
e[myproc]++;                 /* receiving is an event: increment our own entry */
for (i = 0; i < num_procs; i++) {
    if (e[i] < r[i])         /* take the element-by-element maximum */
        e[i] = r[i];
}

Two events are concurrent if, when compared element by element, neither vector timestamp is less than or equal to the other. For example, events (2, 4, 6, 8) and (3, 4, 7, 9) are not concurrent (i.e., they are causally related) because every element of the first vector is less than or equal to the corresponding element of the second vector. The vectors (2, 4, 6, 8) and (1, 5, 4, 9) represent concurrent events: neither vector is less than or equal to the other. For instance, 2 ≥ 1 (the first element of the first vector is greater than the first element of the second vector) but 4 ≤ 5 (the second element of the first vector is less than the second element of the second vector).
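
A minimal sketch of this comparison (a hypothetical helper; a and b are vector timestamps of length n):

/* Returns -1 if a happened before b, +1 if b happened before a,
   and 0 if the two events are concurrent. */
int vv_compare(int a[], int b[], int n)
{
    int i, a_le_b = 1, b_le_a = 1;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) a_le_b = 0;   /* a is not <= b */
        if (b[i] > a[i]) b_le_a = 0;   /* b is not <= a */
    }
    if (a_le_b && !b_le_a) return -1;  /* a -> b */
    if (b_le_a && !a_le_b) return +1;  /* b -> a */
    return 0;                          /* concurrent (or identical) */
}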

Note that in an environment where the overall set of processes is not known by each process, a vector timestamp may consist of a set of <process, number> tuples. In such a case, the timestamping and comparison process is exactly the same: numbers are compared with those of matching processes, and any process missing from a vector is treated as having a value of 0 in that vector.

Group communication

Goal: Provide mechanisms for a process to send a message to a group of processes instead of to one process. Consider the issues of reliability and message ordering.

Point-to-point communication is known as unicast. This is what we generally use for communication between a single client and server. There are other modes of communication. Broadcast is the sending of data to every host on the network. Anycast is point-to-point communication (as in unicast) but the receiver is the nearest one of a set of receivers that share the same address (for example, IPv6 uses this to allow a host to update the routing table of the nearest host). Anycast is a special-purpose form of unicast that we will not discuss here.

An alternative to point-to-point communication is group communication, or point-to-multipoint messaging. Group communication is known as multicast and provides the abstraction of allowing a process to send a single message that is delivered to all group members. True multicast is handled by the network and offers the efficiency of avoiding replicated messages. The sender sends a single message on a special multicast address. Interested parties listen on that address and the network takes care of the delivery. If multicast is not available, it can be simulated by invoking multiple unicasts, one to each recipient.

There are two considerations in group communication (multicast): reliability and message ordering.

Message receipt versus message delivery

We talk about a process receiving a message, which means that it is received from the network and handled by a multicast receiving algorithm. This algorithm decides when to deliver the message to the application logic.

Since receivers cannot control the order in which messages are received over the network, a layer of software is responsible for transmitting a multicast message and, at the receiver, receiving it and deciding when, and whether, to make it available to the application (i.e., deliver it). When a message is received, the multicast receiving algorithm may take one of three actions (a sketch of this decision follows the list):

  1. Discard the message. This may be done if the receiver is no longer a member of the group or if the message is a duplicate.

  2. Deliver the message. The message is placed into a FIFO (first-in, first-out) queue from which the application reads incoming multicast messages.

  3. Hold the message. The message is not ready to be delivered to the application. Most likely, this is because it has not arrived in the expected order and the algorithm must first receive an earlier message. Alternatively, it may need to be held to get feedback that all other members have received it before passing it to the application. In cases like this, the message is placed on a hold-back queue. When the next message is received, the algorithm may check the hold-back queue to determine whether any held messages can now be delivered or discarded.
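
A minimal sketch of this decision (every helper function here is a hypothetical placeholder for the group-membership, duplicate-detection, and ordering logic):

/* one step of a multicast receiving algorithm */
void on_receive(message_t *m)
{
    if (!in_group(m) || is_duplicate(m)) {
        discard(m);               /* action 1: not for us, or already seen */
    } else if (deliverable(m)) {
        deliver(m);               /* action 2: append to the delivery FIFO */
        retry_holdback_queue();   /* held messages may now be deliverable */
    } else {
        holdback(m);              /* action 3: wait for earlier messages */
    }
}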

Reliability

An atomic multicast requires that a message must reach all group members (if one host cannot get the message, no others can process it). This multicast must survive machines going down – both the sender and/or receivers. If a recipient is not certain that a message has been received by all group members, it cannot deliver it to the application. Because of this, it requires the most overhead in implementation, often employing persistent logs.

A reliable multicast is a multicast where the sender tries its best to ensure that messages get to group members. The sender sends a message and waits for an acknowledgement. If it doesn’t receive the acknowledgement in time, it retransmits the message. It retries again and again until, eventually, after a longer interval of time, it gives up and assumes the receiver is dead.

An unreliable multicast doesn’t wait for acknowledgements and will generally use whatever underlying multicast mechanism is provided.

Message ordering

The issue in message ordering is that multiple hosts can be sending messages to the entire group at the same time. Will each host receive all the messages in the exact same order? Will each host receive all the messages in the order they were sent?

Global time ordering requires that all messages arrive in the exact order they were sent: if host A sent a message 5 nanoseconds before host B did then all hosts should receive the message from host A first. This is impossible to implement (clocks can’t be that perfectly synchronized and the chance of a tie is always present; also, networks will not be able to guarantee this ordering if routing is involved). A more practical approach is total ordering. This requires that all messages arrive in the same order at all hosts. This can be easily achieved by providing a group-wide mechanism for obtaining a totally sequenced message ID: for example, from a sequence number server.

Causal ordering means that messages that are causally related (according to Lamport’s happened before definition) will be delivered in order to all group members. Concurrent messages may be delivered in any order. One way of implementing causal ordering is by having each process keep a precedence vector and sending it along with every message that is sent to the group. A precedence vector is similar to a vector timestamp with the exception that an event counter is not incremented for received messages. The vector applies only to events that relate to sending messages to a specific group. Each entry in the precedence vector represents the latest message sequence number that the corresponding process (group member) knows about.

When a process Psender sends a message, it increments the element of the vector that corresponds to its own entry:

Vsender[sender] = Vsender[sender] + 1

This vector is sent along with the message.

When a process Preceiver receives a message from Psender, it checks two conditions:

(1) The message must be the very next message from Psender. That is, the sequence number for Psender in the received vector must be exactly one greater than the sequence number we have for Psender in our own vector:

Vsender[sender] == Vreceiver[sender] + 1

This tells us that we are receiving messages in sequence from Psender. If the value is more than one greater, it means that we missed one or more messages from that sender. This message will have to go on the hold-back queue until the earlier messages arrive.

(2) The message should not be causally dependent on another message that the receiver has not yet seen. This means that every other element of the vector has to be less than or equal to the corresponding element of the vector at the receiver.

∀i, i ≠ sender: Vsender[i] ≤ Vreceiver[i]

If the vector from the sender contains some sequence number that is greater than the corresponding sequence number in the receiver’s vector, it means that the sending process has seen a message from some other process that the receiver has not yet processed. Since the precedence vector is used for group communication and the receiver is part of the group, the receiver needs to wait for that message to arrive.

If both conditions are satisfied, then the received message can be delivered to the application immediately. Otherwise, it is placed in the hold-back queue until the conditions can be satisfied. Causal ordering has the advantage that there is no need for a global sequencer as in total ordering. Note that the precedence vector technique requires reliable message delivery. Otherwise, the receiver will not know if a message was lost or if a message has just not yet arrived.
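
A minimal sketch of this delivery test (a hypothetical function; v is the received precedence vector, mine is the receiver’s vector, both of length n):

/* Returns 1 if a message from process `sender`, carrying precedence
   vector v, may be delivered; 0 if it must go on the hold-back queue. */
int can_deliver(int v[], int mine[], int sender, int n)
{
    int i;
    if (v[sender] != mine[sender] + 1)
        return 0;                 /* condition 1: not the next message */
    for (i = 0; i < n; i++)
        if (i != sender && v[i] > mine[i])
            return 0;             /* condition 2: unseen causal dependency */
    return 1;
}

On delivery, the receiver would set mine[sender] = v[sender] and then re-examine the hold-back queue.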

Sync ordering requires a special type of message — a sync message. When this is issued, any messages that were already sent have to be processed (and acknowledged to the senders). The sync assures that any messages sent before the sync will be processed before any messages after the sync (globally – for all hosts).

FIFO (first in, first out) ordering on messages ensures that messages from each source are delivered in order but messages from multiple sources may be interleaved in any order at the receiver. For example, if host A sends messages m1, m2, m3 and host B sends messages n1, n2, n3, it is valid for host C to receive the sequence m1, m2, m3, n1, n2, n3 and for host D to receive the sequence m1, n1, n2, m2, n3, m3.

Finally, an unordered multicast doesn’t impose any message ordering among messages from other hosts.

Mutual exclusion

Go here for lecture notes

Goal: Allow processes to request and be granted exclusive access to named resources. The algorithms should be designed to avoid starvation and ensure that exactly one requesting process gets the resource even if multiple processes make requests concurrently.

Mutual exclusion is responsible for making sure that only one process or thread accesses a resource at a time. In non-distributed systems, it is accomplished through mechanisms such as semaphores and monitors and, at a low level, through instructions such as test-and-set locks. None of these mechanisms work across networks of computers. We need an algorithm that will allow processes to compete for access to a resource and ensure that at most one process will be granted that access. The algorithm must be fair: it has to ensure that all processes that want a resource will eventually get a chance to get it. If this is not the case, starvation occurs, where a process will have to wait indefinitely for a resource.

The “resource” is an arbitrary name that all processes agree to use. It may refer to a critical section of code, a physical device, or some network service. A mutual exclusion algorithm allows a process to request access to this resource and be granted exclusive access to it. This is known as a lock. If the lock has a timeout associated with it, then it is known as a lease.

Distributed algorithms fall into three categories. A centralized approach uses a central coordinator that is responsible for granting access permissions.

A token-based approach allows a process to access a resource if it is holding a “token”; that is, it received an unsolicited message where ownership of the message allows the process to access the resource. Sending the message (token) means forfeiting access.

A contention-based algorithm is one where all processes coordinate together on deciding who gets access to the resource.

The centralized algorithm uses a server that accepts REQUEST messages for a resource. If nobody is using the resource, the server responds with a GRANT message. If somebody is using the resource, the server simply does not respond but queues the request. When a client is done with a resource, it sends the server a RELEASE message. The server then sends a GRANT message to the next client in the queue (if there is one).

The token ring algorithm creates a logical communication ring among the processes in the group. A message, called a token, is created for each resource. This token is passed from process to process along the ring. If a process receives a token and does not need to access that resource, it simply passes the token to the next process. Otherwise, it will hold on to the token until it is done with the resource.
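
A minimal sketch of each process’s token handling (the helpers and the flag are hypothetical):

/* called when the token for a resource arrives at this process */
void on_token(void)
{
    if (want_resource) {
        use_resource();       /* hold the token while using the resource */
        want_resource = 0;
    }
    send_token_to_next();     /* pass the token along the ring */
}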

Lamport’s algorithm requires each process that wants a resource to send a timestamped request for the resource to all processes in the group, including itself. Receipt of each message is immediately acknowledged by every process. Each process places the received message in a local priority queue that is sorted by the Lamport timestamp in each message (or any totally unique timestamp). A process decides whether it can access the resource by checking whether its own request is the earliest (first) one in its queue of all requests that it has received. If so, it accesses the resource and, when done, sends a release to all members (including itself). The receipt of a release message causes each process to remove that request from its ordered queue. If a process now finds its own request at the head of the queue, it knows that it is its turn to access the resource.

Summary: send a request message to everyone in the group, including yourself. Get an acknowledgement from everyone. Each process stores request messages in a priority queue sorted by timestamp. The process ID at the head of the queue can access the resource. When it is done, it sends a release message to everyone.
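
A minimal sketch of the local decision (hypothetical names; the request queue is assumed to be kept sorted by timestamp):

/* a process may use the resource when its own request heads the
   timestamp-ordered queue and every member has acknowledged it */
int may_enter(void)
{
    return acks_received == num_procs &&   /* one ack per group member */
           queue_head_pid() == my_pid;     /* our request is earliest */
}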

The Ricart & Agrawala algorithm, like Lamport’s, is also based on reliable multicasts. A process that wants to access a resource sends a request to all other processes in the group and waits for all of them to respond. If another process is currently using the resource, that process delays its response until it is done. If process A and process B send out requests concurrently (i.e., two systems want the same resource concurrently), each system compares the timestamp of the received request with that of its own request. If process A received a request from process B that is older (has a lower timestamp) than the one it sent, then process A will give process B priority to access the resource by sending a response to process B. Otherwise, if process A has the earlier timestamp, it will queue the request from B and continue to wait for all acknowledgements, sending a response to B (and any other processes that wanted the resource) only when it is done using the resource. The key point is that processes A and B will make the same comparison and exactly one will hold back its response.

Summary: send a request message to everyone in the group and wait for acknowledgements from everyone. Any process using the resource will delay sending an acknowledgement.

The Lamport and Ricart & Agrawala algorithms are similar in that they are both contention-based and truly distributed. Ricart & Agrawala’s requires fewer messages since there is no explicit release message that needs to be sent; acknowledgements are simply delayed. Neither is efficient compared with the centralized algorithm.
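
To make the message counts concrete for a group of N processes (counting only messages that cross the network): one access under Lamport’s algorithm takes roughly 3(N − 1) messages (N − 1 requests, N − 1 acknowledgements, and N − 1 releases), Ricart & Agrawala takes 2(N − 1) because the releases are folded into the delayed responses, while the centralized algorithm needs only three messages regardless of group size (request, grant, release).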

Election algorithms

Go here for lecture notes

Goal: Allow all processes in a group to elect exactly one of the processes in the group. All processes should agree to elect the same process.

Election algorithms are designed to allow a collection of processes to agree upon a single coordinator. All candidate processes are considered to be equally capable of acting as a coordinator. The task is to select exactly one of these processes — and make sure that every process is aware of the selection. In designing these algorithms, we assume that every process has a unique ID (for example, a process ID combined with the machine’s address). The algorithms we examine in this section will choose a winner (a coordinator) by selecting either the highest available process ID or the lowest one. It is important that whatever criterion is used will never result in one process choosing a different winner than another.

The bully algorithm selects the largest process ID as the coordinator. If a process detects a dead coordinator, it sends an ELECTION message to all processes with a higher ID number and waits for any replies. If it gets none within a certain time, it announces itself as a coordinator. When a process receives an ELECTION message it immediately sends a response back to the requestor (so the requestor won’t become the coordinator) and holds an election to see if there are any higher-numbered processes that will respond. If not, then it declares itself as coordinator.

The ring election algorithm requires a logical communication ring among the processes in the group (as we had in the token ring mutual exclusion algorithm). When a process detects a non-responding coordinator, it creates an ELECTION message containing its own process ID and sends it to the next process in the ring. If that process is dead, it sends the message to the following process, skipping dead processes, until it finds one that can receive the message. When a process receives an ELECTION message, it adds its own process ID to the body and sends the message on to its neighboring process. When the election message circulates back to the original sender (the sender detects this by seeing its process ID at the head of the list, indicating that it started the election), the sender looks at the list of active processes in the message body and chooses a coordinator from among them. Since multiple elections may have been started concurrently, the algorithm for choosing the leader must yield the same result for the different lists. Hence, selecting the highest or the lowest process ID will work. Selecting the first or last process in the list will not.

The Chang and Roberts ring algorithm optimizes the ring algorithm in two ways. First, there is no need to keep the entire list of live processes in the election messages; we can perform the election as we go along. If the goal is to vote for the highest-numbered live process ID, then a process receiving an election message does one of two things: (1) passes it untouched if the process ID in the message is greater than its own, or (2) replaces the process ID in the message with its own if its own ID is greater than the one in the message. When a process receives a message that contains its own process ID, it knows that the message has come full circle and it is the winner. It will then notify each process of the outcome.

The second optimization is to avoid having multiple election messages circulate if multiple processes have initiated elections concurrently. To do this, a process keeps track of whether it has participated in an election (i.e., has initiated or forwarded an election message). If a process receives a message with a smaller process ID than its own and it is already a participant in an election, it simply discards the message. If, on the other hand, the received message contains a greater process ID than its own, it forwards the message even if it is a participant in an election, since the message contains a candidate that takes priority over the previous one that was forwarded. Once the election results are announced, each process clears its participant status.
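
A minimal sketch of the forwarding rule at a process with ID me (forward() and announce_winner() are hypothetical helpers; participant is the flag described above):

/* handle an incoming election message carrying candidate ID `id` */
void on_election(int id)
{
    if (id == me) {
        announce_winner(me);   /* our ID came full circle: we are elected */
    } else if (id > me) {
        participant = 1;
        forward(id);           /* better candidate: pass it along unchanged */
    } else if (!participant) {
        participant = 1;
        forward(me);           /* replace the smaller ID with our own */
    }
    /* else: smaller ID and we are already participating - discard it */
}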

One problem with election algorithms is when a network gets partitioned (also known as segmented): one set of processes is separated from another and cannot communicate with the other. In this case, each segment may elect its own coordinator and problems can arise. This is known as a split brain situation. To combat this, a redundant network is needed or some alternate communication mechanism to enquire about the state of remote processes.

Another approach that is often taken is to require a majority of prospective coordinators to be available. A quorum is the minimum number of processes that need to be available to participate in an operation. In this case, the operation is an election and we need a quorum of over 50% of the processes. If a network is partitioned into two or more segments, at most one of the partitions will have over 50% of the processes. Those processes that do not will refuse to elect a coordinator.
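
For example, if a group of seven processes is partitioned into segments of four and three, only the four-process segment holds a quorum (4 > 7/2), so it alone may elect a coordinator; the three-process segment will refuse, avoiding a split brain.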