Final exam study guide

The three-hour study guide for the final exam

Paul Krzyzanowski

Latest update: Fri May 6 15:41:35 EDT 2016

Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the three hour time window in the title literally.

Terminology from the history of data communication

Data communication, the process of conveying information from one place to another, existed long before computers and dates back to earliest recorded history. As various pre-computer techniques for conveying data reliably over distances were invented, so too were a number of underlying mechanisms that proved to be useful in digital networks.

The most basic form of message delivery is moving a message from one entity (person) to another. This is known as point-to-point, or unicast delivery. Transmitting a single message so that everyone gets it is known as a broadcast. It’s the difference between a messenger and a smoke signal. Two crucial categories of data are control and message data. Message data is the information that needs to be delivered. Control data comprises the information that is used to manage the delivery of the message. It includes things such as acknowledgements, retransmission requests, and rate controls. In some cases, control data is sent via a separate communication channel than message data. In other cases, it is intermixed with some syntax for identifying which data is control and which is message.

Synchronization, in the context of messaging, is the coordination of activity between the sender and receiver of messages. It includes controls such as ready to send, ready to receive, transmission complete.

Congestion is the inability of a network element to receive or transmit messages at the desired rate, leading to a buildup or possibly a loss of messages and a deterioration in the quality of service. Flow control is the aspect of synchronization that deals with the rate of transmission, controlling when the sender transmits messages to keep the receiver or network from getting overwhelmed with data traffic. Rate control is the aspect of flow control that controls just the speed of transmission. Other elements of flow control may be directives to delay for a specific time or wait until some feedback is received. By reducing the flow of traffic, congestion can be alleviated.

An acknowledgement (also known as a positive acknowledgement) is a control message that states that a message was received. A negative acknowledgement is a control message that states that a message was not delivered to the destination or that it was received with errors. It is a form of error notification.

Packets may be lost or corrupted during transmission. With best-effort message delivery the network does the best it can do in providing service but makes no guarantee that data will be delivered to its destination or the time it will take for it to be delivered. Reliable delivery ensures that data does arrive reliably. Reliable delivery may be implemented as a layer of software on top of a network that provides best-effort delivery. With reliable delivery, if a message does not arrive or arrive correctly, the transmitter will be notified and will attempt to resend the message. Delivery time of the message can vary (since there is an extra delay to detect lost data dnt to retranmit it). With flow control mechanisms to relieve congestion as well as reliable delivery, there is generally no guarantee on the resulting bit rate between two communicating parties.

A key part of any communication is the encoding of data. This ranges from how data is represented in the medium (the component that carries data, whether it is radio frequency, a wire, or a piece of paper) to the meaning of the messages themselves. To expedite messaging in the past, table-based encoding was sometimes used to have a letter or a number represent an entire message instead of transmitting each character in the message.

A repeater, also known as an extender, is a device that regenerates a signal to cover longer distances. Pre-networking, this would be a relay station where one would switch horses or runners. These days, it is a often a device such as an ethernet extender that regenerates incoming signals to full strength, allowing longer runs of ethernet cable.

Origins of the Internet

The precursor to the Internet was ARPANET, a research project funded by DARPA, the Defense Advanced Research Projects Agency. This was an experiment in using packet switched networking to share computing resources over long distances. ARPANET got its start in late 1968 and was inspired by three developments. First, Leonard Kleinrock’s concept of packet switching: breaking messages into small chunks called packets and have them compete with packets from other messages on the same shared communication link. Second, J.C.R. Licklider’s vision of an “Intergalactic Network”, a globally connected set of computers where you can access data from or run programs on remote machines. Third, the demonstration of the viability of a wide-area network accomplished by connecting a computer in Massachusetts to one in California over a telephone line via a dial-up connection.

The crucial component in the early ARPANET was the Interface Message Processor, or IMP. This was the device that served as the interface between one or more connected host computers and the external packet-based ARPANET, processed the flow of packets, and sent them to other IMPs or to a connected host . It was the predecessor to the router. The ARPANET went live in 1969 with two computers and two more by the end of the year. In the early ARPANET, all the protocols for delivering packets were implemented in the IMPs. The software that ran on the computer and interfaced with the IMP was the Network Control Program, or NCP. This allowed applications to interface with the network and provided them with the abstraction of having a dedicated communication stream. It also handled flow control to the IMP and retransmission. As the ARPANET evolved, NCP became TCP, which handled reliable application-to-application communication.

Core principles

ARPANET was designed to support the interconnection of networks (inter-networking). It is a network of networks rather than a single network to which computers connect. For an organization to join the ARPANET, there was no requirement for it to have a specific internal network. This layer of networking would be a logical layer on top of any underlying physical network.
Because there was no assumption on the design of any physical network, the ARPANET assumed that communication is not guaranteed to be reliable. Instead, it would provide best effort packet delivery. Software, originally NCP and later TCP, provided the ability to detect errors or lost packets and request retransmission.
Routers connect the various individual networks and links together that make up the Internet. While routers are crucial in determining where to send packets, they were designed to not store information about the flow of packets. Any packet can be routed on its own with no a priori connection setup.
Finally, the entire network is decentralized. There is no central administration or control point. This aspect not only makes the network scalable (no single point of congestion) but aids in making it a fault tolerant network. If a router is not working, it is likely that there may be alternate paths to the destination.

The ARPANET clearly demonstrated the value of wide-area, hardware agnostic networking but access to it was restricted to organizations working on U.S. Department of Defense projects. Other networks were created to cater to non-defense communities and some of these, such as NSFNET and CSNET, also chose to use the IP platform.

NSFNET was a government funded (by the NSF, the National Science Foundation) network that was created in 1985 to connect NSF-funded supercomputing centers. It initially did permit commercial traffic on its networks.

By the late 1980s, the NSF was looking for commercial partners who would provide wide area networking services. A collection of these partners removed any need for government funding. In 1990, the ARPANET was decommisssioned and in 1995, the NFSNET backbone was transitioned to commercial networks, leading to the Internet of today.

LAN and Internet structure

The Internet comprises the network edge and the network core. The network core is the set of interconnected networks that provide wide area connectivity to the customers of the network. The network edge is the set of devices and local networks that connect to the core network. Your computers, TV sets, thermostats, and local network constitute a network edge. Your Internet service provider and its Internet service provider are components of the network core.

Local area networks

A local area network, or LAN, is a data communication network that covers a relatively small area, typically a building. It uses a the same network access protocol and usually the same transmission medium (e.g., Ethernet), allowing message delivery without the need to route messages through different networks. Devices that send and receive messages on the network are called hosts or nodes. These devices are peers, meaning that no device has more control or coordination of the network than any other device. Any host can initiate a data transfer with any other host on the LAN. LANs usually exhibit very low latency and a high data rate: typically 10s to a gigabit per second (Gbps) for wireless networks and a 1 gigabit per second (Gbps) or more for wired connections (although speeds as high as 10 and 100 Gbps are available).

Nodes connect to a local area network with an adapter. These are usually integrated onto the main circuit board but may be separate components, such as a USB ethernet adapter. Another term for these is NIC, which stands for Network Interface Controller.

The physical data communication links that the adapter uses to send and receive data are called media. Common examples are unshielded twisted pair (UTP) copper wire (e.g., ethernet cable), radio frequency (e.g., the 5 GHz frequency bands used by 802.11ac), coaxial cable (e.g., used by cable TV in the home and the MoCA standard, multimedia over coax), and optical fiber (which is not commonly used in the home).

The other end of the media terminates at a switch or hub. A hub is a device that acts as a central point for multiple LAN cables. It takes any data that comes in one port and sends it to all the other ports. A switch is similar to a hub but smarter. It looks at incoming data and determines the port or ports on which to transmit it. Switches have largely replaced hubs. They provide scalable bandwidth in that they do not introduce more network congestion as you add more hosts onto your LAN. Switches and hubs are link-layer devices (more on that later). That is, they move ethernet packets to their destination as opposed to relaying data between networks. They are responsible for creating the physical network. For wireless networks, a wireless access point serves as link-layer switch.

Access Network

The connection between the LAN and the Internet (via the Internet Service Provider) is called the access network. A residential gateway (a type of router) or access router connects a home or office LAN to the Internet. A modem, which stands for modulator/demodulator converts data between various analog formats as needed by the underlying media. Think of a modem as back-to-back NICs, each converting data to their type of media. Modems are generally built into access routers. Examples of access links are:

DSL (digital subscriber line): DSL uses existing copper telephone wiring between your home and the phone company’s central office. Since voice uses only the 0 - 4 kHz range of frequencies, there is a lot of untapped bandwidth in a phone wire. DSL uses the 4 kHz through 50 kHz frequency band for upstream data (data you send) and the 50 kHz through 1 MHz band for downstream data (data you receive). A DSL modem serves as an access router and modem. At the phone company’s central office, the access link terminates at a DSLAM (digital subscriber line multiplexor). From there, the data signals are sent to the data network (Internet) and the voice signals are sent to the phone network.
Cable: Internet service provided by a TV cable company uses the same coax cable that provides TV signals. With cable TV, hundreds of channels are transmitted at once, each occupying a different frequency band (high definition channels occupy a 6 MHz band and standard definition digital channels typically occupy a 1.5 GHz band). This type of transmission is called broadband. A certain number of channels are not used for TV services but instead are used for Internet access. The customer has an access router/modem that conforms to the DOCSIS (Data Over Cable Service Interface Specification) standard. Some number of channels are devoted to downstream communication (data you receive). Each channel provides 38 Mbps service. Another set of channels are devoted to upstream service, with each channel providing 27 Mbps service. The cable terminates at the cable company’s headend on a device called the CMTS, the cable modem termination system. Here, the Internet service frequencies are filtered to separate out the data and send it to the cable company’s Internet link. A key distinction between DSL and cable service is that the phone wire between the DSL modem and the phone company’s central office is a dedicated, non-shared line. The coax cable of cable TV service is shared among a neighborhood. However, even though the channel is shared, the capacity of coax media is greater than that of a phone line and its signal attenuation over distance is far lower.
FTTH (Fiber to the Home), FTTN (fiber to the Neighborhood): Fiber offers even greater bandwidth than cable and can propagate signals longer distances without degradation. Fiber to the Home (FTTH) is an optical fiber link that connects a home to a central office and delivers a combination of TV, telephone, and Internet service. Verizon’s FiOS service is an example of a service that delivers fiber to the home. Access links in this architecture often use optical splitters to allow a single fiber from the central office to fan out to several homes (typically 16–128). FTTH requires an end-to-end fiber infrastructure, which is costly to deploy (Verizon spent $23 billion through 2010 deploying FiOS at a cost of over $800 per home). An alternative architecture that is designed to be more cost effective is Fiber to the Neighborhood (FTTN). AT&T’s U-verse service is an example of this. A fiber runs from the provider’s central office to the neighborhood (within a half mile of the customers’ homes). There, it goes to a mini headend that uses copper wiring to connect to each customer’s home via VDSL (very high bitrate DSL).

ISPs

The organization that provides Internet service to a customer is called an Internet Service Provider (ISP). One ISP does not have access links to every Internet user in the world so there are thousands of ISPs: approximately 12,700 worldwide; 7,000 in the U.S. (about 100 or so larger-sized ones and lots of tiny regional ones). Most smaller ISPs that serve end users purchase Internet connectivity from other ISPs. ISP networsk are categorized by tiers. There isn’t a formal definition of what constitutes a tier but there are generally accepted conventions.

At the very top of the hierarchy, Tier 1 ISPs own the infrastructure that forms the backbone of the Internet. Each Tier 1 ISP has a peering agreement with every other Tier 1 ISP. This means that they agree to forward and receive traffic from each other without charging the other ISP for it. Tier 1 ISPs have access to the Internet routing table, also known as the global routing table. What this means that they know the top-level ISP to which any IP address should be sent. They also know which of their lower-tier ISPs receive any given IP address. As such, there is no concept of a “default route” at this level. With lower-tier ISPs, a router can give up if it does not know where a certain packet should go and just send it to a higher-level ISP. Tier 1 ISPs do not pay for data transit. Examples of Tier 1 ISPs include AT&T, Verizon, CenturyLink, Level 3, Telefónica, and NTT.

Tier 2 ISPs purchase data transit from Tier 1 ISPs and other Tier 2 ISPs but may also peer with other networks for direct connectivity and cost saving. They may then resell Internet access. Examples of Tier 2 ISPs include Comcast, British Telecom, Vodafone, and Sprint Communications.

Tier 3 ISPs occupy the lowest level and solely purchase Internet transit from one or more Tier 1 and Tier 2 IPSs. A Tier 3 ISP will typically provide coverage in a limited region of a country. Examples of these are Powernet Global, and Excel.Net. They concentrate on the retail and consumer markets.

A packet will often pass through several networks en route to its destination, both within and between ISPs. Each link terminates at a router which makes a decision on where the packet should be delivered. Each transmission of a packet is called a hop.

Within ISPs, edge routers are placed at the edge of an ISP’s network and communicate with other networks. For larger organizations, an edge router may sit at the edge of a customer’s network and connect to one or more ISPs. A core router is a router that connects routes on the Internet backbone. A core router may also be used in dispersed organizations to interconnect routers from multiple locations.

Networking overview

A network is inherently a shared resource. Lots of devices send and receive data on it. How do we share the network without clobbering each other’s data?

The most basic approach is to establish a dedicated connection from the source to the destination. This is a physical circuit and is what was used in the early days of the phone system: your phone line went to the central office where it connected to a patch cord that was in turn connected to the wire of the person with whom you were speaking. The same stream of electrons flowed from one end to the other. This is not a viable use of network resources. Moreover, we will likely have multiple applications running on a machine and using the network. We need to find a way to share network links.

One way can share the medium is to have each party communicate on different frequencies. This ability to transmit different data simultaneously is called broadband communication. Each transmission is assigned a frequency band: a portion of the total bandwidth. Broadband communications uses Frequency Division Multiplexing (FDM). Note that bidirectional communication on one channel is not possible: each sender-receiver set needs its own frequency band. Cable TV is an example of broadband.

An alternate method is to have everyone take turns in accessing the medium. This is called baseband communication. Each device is allowed full access to the medium’s bandwidth but only for a portion of time. Each communication session is assigned a specific set of short, fixed-length time slots during which it can transmit. This is called Time Division Multiplexing (TDM). Time Division Multiplexing is an example of circuit switching.

Both FDM and TDM are examples of circuit switching. Circuit switching sets up a dedicated communication channel, similar to a physical circuit. The key difference from a physical circuit is that the effective bandwidth is lower than the capacity of the medium since it is shared. With Frequency Division Multiplexing, the available bandwidth is sliced into frequency ranges. With Time Division Multiplexing, the available bandwidth is sliced into time slots.

An althernate way of sharing a medium is to use variable-size time slots with no scheduling on the transmitter’s part. This is called packet switching.

Circuit switching

Circuit switching requires a connection setup (or circuit setup). A control message is sent from the source to establish a path (route) from the source to the destination. Every switching element in the path agrees to the setup of the path and allocates the appropriate time slots (and other resources, such as memory buffers). The originating node is then informed that the connection is established and communication can take place. The path and all the switching resources (e.g., time slices or frequency bands) remain allocated to the communication session whether data is being sent or not. All data travels along this predetermined path from the source to the destination. When the communication is complete, the sender “releases” the circuit. This results in the sending of another control message through the path informing routers to de-allocate the resources they were using for the session.

Benefits of circuit switching are that it offers constant latency and guaranteed bandwidth. Another benefit is that data routing decisions do not have to be made for each message that is transmitted. They are made once at the start of the communication session. Data can flow through a router without having to be stored first until a decision is made where and when to transmit it. The downside of circuit switching is that each connection ties up bandwidth and switching resources whether any data is being transmitted or not. Conversely, if a connection needs to transfer a larger amount of data, it still needs to spread it over bandwidth given to it even if the rest of the network is not being used at the moment. In short, circuit switching does not use network resources efficiently. Each circuit is allocated a fixed bandwidth whether it is used or not.

Packet switching

With packet switching, a communication stream is broken into chunks of data called packets. Each packet must contain a destination address in its message header. The packets travel from the source node to their final destination via packet switches. Routers and ethernet switches are examples of packet switches. Routers are used to transmit packets between different networks and switches are used to transmit packets within a local area network. Each packet switch decides on the disposition of the packet (to what port it should be transmitted) based on the packet’s destination address. There is no need to retain memory of a predefined route for a stream of packets that represents a communication stream. In fact, there is no real concept of a communication stream because no routes have to be set up and no resources need to be reserved ahead of time. Because packet switching is designed for baseband networks, each packet has full use of the network link. If there are no other packets transmitted on the network, a node may see its available bandwidth approach the maximum capacity of the network link.

Packet switched traffic is known as datagram service in contrast to circuit switched virtual circuit service. Think of a datagram as a telegram or letter, where each message has to be addressed individually and may take a different path through the network. Think of a virtual circuit as a telephone call where the call is first set up but then gets an established route constant bandwidth for its duration.

Packet switching employs statistical multiplexing. Multiplexing means dividing a communication channel among multiple data streams. With TDM’s circuit switching, a communication channel was divided into fixed time slots per data stream. With packet switching, we are still sharing the network but now using variable time slots. What this means is that if a node has a lot of data to transmit and others do not then it can transmit large packets (or a lot of smaller packets) and use more of the network. If a node has little to transmit, it will use less of the network and more network capacity will be available for others. Of course, there might be times when a node may have to wait longer for the network to be free. If a lot of nodes have a lot of data to transmit, they will collectively have to wait longer to use the network, some more than others. Similarly, routers may end up queuing packets for a particular outbound port. This leads to variable latency. Packet switching is characterized by variable bandwidth and variable latency. With packet switching, an entire packet needs to be received by a router before it is transmitted on an outgoing link. This is called store and forward delivery and also contributes to network latency as we will see in our discusion on delay and throughput.

Despite its variable bandwidth and variable latency, packet switching allows for far more efficient use of the network than circuit switching and has no limit on the number of concurrent communication sessions. Because, on average, applications do not use the network non-stop, switching and link resources are wasted whenever data is not flowing on an established connection. With packet switching, there is no such reservation of these resources and more streams can be accommodated while providing the same bandwidth to applications. Packet switching is the dominant means of data communication. The Internet is built around packet switching.

Units

Throughout our discussions on networking, we will bring up units of measure. The three crucial ones for us are the size of data, the speed at which it moves, and the time that it takes for it to get somewhere.

Size: The fundamental unit of size is a bit (b). Eight bits make a byte (B). Packet sizes are generally measured in bytes (watch out for the factor of eight when working with bits per second!). Networks tend to use base–10 units, so a kilobyte (KB) is 1,000 bytes rather than the 1,024 bytes we are used to in programming. A kilobit (Kb) is 1,000 bits. A megabit (Mb) is 10⁶ bits and a gigabit (Gb) is 10⁹ bits. A megabyte (1 MB) is 1,000 KB or 10⁶ bytes or 8×10⁶ bits.
Time: Time is measured in seconds (s). One second (1 s) = 1,000 ms (milliseconds) = 10⁶ μs (microseconds) = 10⁹ ns (nanoseconds).
Rate: Rate is measured in bits per second (b/s or bps). Moving a megabit (1 Mb) of data over a 10 Mbps (megabit per second) network will take (1×10⁶ b ÷ 1×10⁶; bps) = 0.1 s, or 100 ms. Transmitting 1 kilobit on a 1 Gbps link will take (kilo ÷ giga) = (10³ ÷ 10⁹) = 10^–6 = 1 μs, or one millionth of a second.

Delay and throughput in networks

As a packet flows from its source to its ultimate destination, it goes through multiple routers. Each router introduces a delay as does the transit of the packet over the communication link.

With packet switching, a packet must be fully received by a router before it can be sent out. This is store and forward packet delivery. To see how this contributes to overall delay, let us consider each link. If data is transmitted at R bits per second and a packet is L bits long, it takes L/R seconds to transmit a packet from one link to the next. Since transmission on the next link will not start until the packet is received, each link adds a delay of L/R seconds. With N links (there are N–1 routers or transmitters but we also count the delay of the initial transmission), we have a total delay of N(L/R) seconds.

Network delay is due to four factors:

Processing delay. The processing delay is the computation that a router has to do to examine the header, check for packet errors, figure out the outbound port (route), and move data around. It is usually not a significant contributor to the overall delay and consumes a few microseconds.
Transmission delay. The transmission delay is the time that it takes to get a complete packet out onto the network. This is a function of the speed of the link (e.g., 1 Gbps) and the number of bits in the packet: (packet size ÷ transmission speed). If the packet size is L and the transmission rate is R, the transmission delay is L/R.
Propagation delay. The propagation delay is the time it actually takes the signal to move from one end of the medium to the other. While we might transmit the bits onto the network at, say, 100 megabits per second, there is a delay between the time that the signal is sent and the signal is received. This is the speed of signal propagation in the medium. For electrical signals in unshielded twisted pair or for light pulses in fiber optics, this value is approximately 2×10⁸ m/s (about 67% of the speed of light in a vacuum). An electrical signal propagates in air on a wireless network at approximately 3×10⁸ m/s. Depending on the distance the packet needs to travel, the delay may be from a few nanoseconds to a few tens of milliseconds. It might be considerably longer for satellite transmission due to the longer distance covered.
Queuing delay. With packet based networks, we can only transmit one packet onto a link at a time. Any other packets that need to go out on that link will need to wait in a queue. The queuing delay is a function of the amount of bits that are ahead of the packet (number of packets × the size of each packet) and the transmission rate of the outbound link. Queuing delay can vary a lot depending on how much data traffic is flowing over any particular link. It is dependent on how much traffic arrives at a router at approximately the same time that needs to go out on the same link and on how quickly the router can transmit the data out (see transmission delay).

One useful measure for estimating the likelihood of queuing delays is traffic intensity. Traffic intensity is the average packet transmission delay (L/R; see transmission delay) multiplied by the average rate of packet arrival. If the average rate of packet arrival is a, traffic intensity is La/R. It is technically a unitless quantity since packets/second × bits/packet ÷ bits/second cancel out. [Trivia - not on the exam: the unit of this measure is called an erlang and refers to the load on a network.]

If the traffic intensity is greater than one, that means that, on average, packets arrive faster than they can be transmitted and the queue will keep growing without bound. This assures us that the queue will eventually overflow and packets will have to be dropped, leading to packet loss.

If the traffic intensity is less than or equal to one, packets are arriving slower or at the same speed that they are being transmitted. This does not mean that packets will never get queued up. A number of packets may occasionally arrive in rapid succession — a burst — and will have to be queued. As traffic intensity approaches one, the probability that there will be bursts of packets that need to be enqueued increases drastically. Hence, as traffic intensity approaches 1, queuing delay starts to increase dramatically. In some cases, this will lead to lost packets due to limited queue sizes (routers have only a fixed amount of memory to devote to queues).

The total delay for a node is the sum of the four delays we just mentioned: processing + queue + transmission + propagation. The total delay for N links in a store-and-forward network is simply N times that amount.

Protocols

Data networking is generally implemented as a stack of several protocols – each responsible for a specific aspect of networking. The OSI reference model defines seven layers of network protocols.

1. Physical: Deals with hardware, connectors, voltage levels, frequencies, etc. This layer does not care about contents but defines what constitutes a 1 or a 0. Examples of this layer are USB and Bluetooth.
2. Data link: Sends and receives packets on the physical network. It may detect and possibly correct errors but only on this link (for example, queue overflow at a router may still cause packet loss). Ethernet packet transmission is an example of this layer.
3. Network: Relays and routes data to its destination. This is where networking gets interesting because we are no longer confined to a single physical network but can route traffic between networks. IP, the Internet Protocol, is an example of this layer.
4. Transport: Provides a software endpoint for networking. Now we can communicate application-to-application instead of machine-to-machine. Each application can create one or more distinct streams of data. TCP/IP and UDP/IP are examples of this layer.
5. Session: Manages multiple logical connections over a single communication link. Examples are SSL (Secure Sockets Layer) tunnels and remote procedure call connection management.
6. Presentation: Converts data between machine-specific data representations. Examples are data representation formats such as MIME (for media encoding on email), XML, XDR (for ONC remote procedure calls), NDR (for Microsoft COM+ remote procedure calls), and ASN.1 (used for encoding cryptographic keys and digital certificates).
7. Application: This is a catch-all layer that includes every application-specific communication protocol. For example, SMTP (sending email), IMAP (receiving email), FTP (file transfer), HTTP (getting web pages).

The OSI reference model gives us a terminology to discuss and compare different networks. Any specific network may not necessarily implement all these layers. The Internet protocol stack relies on layers 1 through 4 (physical through transport) but it is up to applications to implement and use session, presentation, and, of course, application layers.

A key aspect of this layering approach is that each layer only has to interact with the corresponding layer on the other side. For example, an application talks to another application. TCP on one system deals with issues of retransmission and message acknowledgement by talking to the TCP layer on the remote system. A layer also does not need to be aware of the implementation of layers above it or below it: they are just data sources and data sinks.

Protocol encapsulation

if we want to send an IP packet (layer 3) out on an Ethernet network (layers 1 and 2), we need to send out an Ethernet packet (an Ethernet NIC or transceiver knows nothing about IP). The entire IP packet becomes the payload (data) of an Ethernet packet. Similarly, TCP and UDP, layers above IP, have their own headers, distinct from IP headers (they need a port number, for example). A TCP or UDP packet is likewise treated simply as data by the IP layer. This wrapping process is known as protocol encapsulation. Each layer of the networking stack can ignore the headers outside of its layer and treat anything from higher layers simply as part of the payload that needs to be sent.

The Application Layer

Application architectures

There are two ways that network applications are structured: client-server and peer-to-peer.

Client-server: This is the dominant model of interaction. One application, called the client (and usually run by the end user), requests something from another application, called a server. The server provides a service. Examples of this are a web browser (client) requesting a web page from a web server, a mail application (client) accessing a mail server to get mailbox contents, or a print server being given content to print. In this model, clients communicate with the server and not with other clients.
Peer-to-peer: A peer-to-peer architecture employs a collection of applications, any of which can talk to any other. These applications are peers and are generally run by a collection of end users rather than some service provider. The name peer implies that there is no leader: applications all have equal capabilities. An appealing aspect of a peer to peer design is self-scalability. As more and more nodes join the collection of peers, the system has more peers to do the work and can hence handle a large workload. Examples of peer-to-peer architectures are BitTorrent and Skype.
Hybrid: A difficulty with peer-to-peer architectures is that one often needs to do things such as keep track of peers, identify which system can take on work or has specific content, and handle user lookup and authentication. This led to a variation of the peer-to-peer model where a coordinator, a central server, is in place to deal with these centralized needs. However, the peers still handle all the bandwidth-intensive or compute-intensive work.

No matter what architecture is used, there is still a fundamental client-server relationship. One system (a client) will send some request to another (a server).

Network Application API

When we write network-aware applications, they need to use the network to communicate with each other. These applications are no different than any other processes on the computer. Any process can access the network and it is up to the operating system to coordinate this access. When writing applications, the programmer will use a set of interfaces referred to as a Network API (Application Programming Interface) to interact with the network and not worry about the lower layers of the network. For example, a programmer does not need to know about ethernet or IP to communicate with another program but needs to have available the abstraction of being able to send data from one logical port on an application to one on another application.

The communication session is the conducted by the application layer protocol. This is a definition of the valid sequence of requests, responses, and their respective message formats for a particular network service. The protocol needs to be well-defined for applications to be able to communicate with each other.

Given a well-defined protocol, any application should be able to follow the rules of the protocol and create messages that the other side can understand, regardless of the implementation language or operating system. For instance, an iPad running an Swift should be able to talk to a mail server written in Java running on a Windows platform.

The Network API will have core services, such as those related to sending and receiving data, that are provided by the operating system. These are augmented with libraries to handle other functions, such as looking up names, converting data, and simplifying certain operating-system interfaces.

As programmers writing network-aware applications, we obviously need functions for sending and receiving data but we may want to be able to specify something about the behavior of that data over the network. For example:

Do we need reliable data transfer?: Is it important to the application that data arrives reliably and in order at the destination? It seems like the answer should always be yes, but we have to realize that ensuring reliability entails the detection, request for, and retransmission of lost packets. This adds a considerable delay to the delivery of that lost packet. For streaming media applications, such as telephony, that packet may arrive too late. In this case, it is useless to the application and the application could just as easily unsed best-effort service. Moreover, some applications may choose to handle retransmission requests themselves in a different manner. Applications that can handle unreliable media streams are called loss-tolerant applications.
Throughput: An application may have specific bandwidth requirements. For example, video and voice telephony applications may have minimum bandwidth needs. Applications with such needs are bandwidth sensitive applications. Applications that can adapt to whatever bandwidth is available (for example, switch to a lower bandwidth codec) are known as elastic applications.
Delay and Jitter: Interactive applications, such as voice and video telephony may want to ensure minimal network delay to reduce the delay to the receiver. Jitter is the variation in delay and these applications would also like to see low jitter.
Security: Applications may need to ensure that they are truly communicating with the proper computer and a legitimate application on that computer. They may be concerned about the integrity of the data that is being transmitted and want to ensure that it cannot be modified or read by outside parties.

These are all legitimate desires. Unfortunately, IP gives us no control over throughput, delay, jitter, and security. We can handle security at the application layer and we will later examine mechanisms that were added to IP to support some degree of control over packet delivery.

IP Transport Layers

Applications interact with IP’s transport layer. There are two dominant transport-layer protocols on top of IP (IP is the network layer): TCP and UDP (there are a few others, such as the SCTP, but these two dominate).

TCP, the Transmission Control Protocol, provides connection-oriented service. This does not imply that an actual network connection is being set up as one would for a circuit-switched network. We are strictly talking about transport-layer services here. For layer 2 and layer 3 protocols, a connection refers to setting up a pre-defined route (circuit) and providing that connection with guaranteed bandwidth. At the transport layer (4), we still strive strive to provide the illusion of a reliable bidirectional communication channel but it is all done in software on top of unreliable datagrams (IP). At the transport layer, the software does not have any control of the route that packets take or the bandwidth that is available to the connection.

The TCP layer of software ensures that packets are delivered in order to the application (buffering them in memory in the operating system if any arrive out of order) and that lost or corrupt packets are retransmitted. TCP keeps track of the destination so that the application can have the illusion of a connected data stream (just keep feeding data to the stream and don’t worry about addressing it). TCP provides a full-duplex connection, meaning that both sides can send and receive messages over the same link. TCP is stream oriented, meaning that data is received as a continuous stream of bytes and there is no preservation of message boundaries.

UDP, the User Datagram Protocol is designed as a very thin transport layer over IP. It provides connectionless service, also known as datagram service. While UDP drops packets with corrupt data, it does not ensure in-order delivery or reliable delivery. UDP’s datagram service preserves message boundaries. If you send n messages, you will receive n messages; they will not be combined into one message.

Port numbers in both TCP and UDP are used to allow the operating system to direct the data to the appropriate application or, more precisely, to the socket that is associated with the communication stream on the application. A port number is just a 16-bit number that is present in both TCP and UDP headers to identify a specific endpoint on a node.

Sockets

Sockets are an interface to the network provided to applications by the operating system. They were created at the University of California at Berkeley for 4.2BSD (a derivative of UNIX) in 1983 and most operating systems now support this interface. The purpose of sockets is to provide a protocol-independent interface for applications to communicate with each other. The underlying network does not have to by IP. Once set up, the socket looks like a file descriptor for an open file. With connection-oriented (e.g., TCP) sockets, you can use the regular file system read and write system calls to receive and send data.

Sockets are the mechanism that the operating system exposes to the user for accessing the network. A socket is created with the socket system call and assigned a local address and port number with the bind system call. The OS can fill in defaults if you do not want to specify a specific address and port. (Note that you specify an address because your system might have multiple IP addresses; one for each of its network interfaces). The socket also requires that the programmer identify the address family for the socket (e.g., the protocol stack: IP, IP version 6, Bluetooth, local) as well as the mode of communication (e.g., connection-oriented or datagrams).

Sockets for connection-oriented protocols (streams)

For connection-oriented protocols (TCP), a socket on a server can be set to listen for connections with the listen system call. This turns it into a listening socket. Its only purpose will now be to receive incoming connections.

The accept call waits for a connection on a listening socket. It blocks until a connection is received, at which point the server receives a new socket that is dedicated to that connection.

A client establishes a connection with the connect system call. After the connection is accepted by the server, both sides now have a socket on which they can communicate.

Sending and receiving data is compatible with file operations: the same read/write system calls can be used. Data communication is stream-oriented. A sender can transmit an arbitrary number of bytes and there is no preservation of message boundaries.

When communication is complete, the socket can be closed with the shutdown or close system calls.

Sockets for connectionless protocols (datagrams)

With connectionless protocols, there is no need to establish a connection or to close one. Hence, there is no need for the connect, listen, or shutdown system calls.

Unlike connection oriented sockets, data communication is message-oriented. A sender transmits a message. The size of the message is limited to the maximum size allowable by the underlying network (the MTU, maximum transfer unit).

Because you need to specify the destination as the operating system does not keep state of a “connection”, new system calls were created for sending and receiving messages. The sendto and recvfrom system calls are used to send and receive datagrams. sendto allows you to send a datagram and specify its destination. recvfrom allows you to receive a datagram and identify who sent it.

The Java interface to sockets

Java provides many methods to deal with sockets and some commonly-used ones consolidate the sequence of steps that need to be take place on the operating system. The constructor for the ServerSocket class creates a socket for the TCP/IP protocol, binds it to a specified port and host (using defaults if desired), and sets that socket to the listening state. A client’s Socket class constructor creates a socket for the TCP/IP protocol, binds it to any available local port and host, and connects to a specified host and port. It returns when the server accepts the connection. The connected socket object allows you to acquire an InputStream and OutputStream for communication.

Threads, concurrency, and synchronization

A process normally has one thread of execution, or flow of control. A process may be multithreaded, where the same program has multiple concurrent threads of execution.

In a multi-threaded process, all of the process’ threads share the same memory and open files. Within this shared memory, each thread gets its own stack, which is where return addresses from functions are placed and where local variables get allocated. Each thread also has its own instruction pointer and registers. Since memory is shared, it is important to note that there is no memory protection among the threads in a process. Global variables are freely accessible by all threads. In particular, the heap, the pool of memory that is used for dynamic memory allocation is shared and freely accessible to all threads in the process.

Advantages of threads

There are several benefits in using threads. Threads are more efficient than processes. The operating system does not need to create and manage a new memory map for a new thread (as it does for a process). It also does not need to allocate new structures to keep track of the state of open files and increment reference counts on open file descriptors. Threading maps nicely to multicore architectures and allows for the effective use of multiple process cores.

Threading also makes certain types of programming easy. While it’s true that there is a potential for bugs because memory is shared among threads, shared memory makes it trivial to share data among threads. The same global and static variables can be read and written among all threads in a process. For a network server process, threading is appealing because it becomes easy to write code that handles multiple client requests at the same time. A common programming model is to have one master thread that waits for client connections and then dispatches a worker thread to handle the request.

While thread usage differs slightly among languages, there always needs to be a mechanism to create a thread and to wait for threads to exit. When a thread is created, a specific method (function) is called in that new execution flow. The original execution flow (the one thread that started when the process began) continues normally. When that thread eventually returns from that method, the thread terminates. If a thread needs to wait for another thread, it can choose to block until the other thread terminates. This is called a join.

Mutual exclusion: avoiding stepping on each other

Because threads within a process share the same memory and hence share all global data (static variables, global variables, and memory that is dynamically-allocated via malloc or new), there is an opportunity for bugs to arise where multiple threads are reading and writing the same data at the same time. A race condition is a bug where the outcome of concurrent threads is unexpectedly dependent on a specific sequence of thread scheduling. Thread synchronization provides a way to ensure mutual exclusion, where we can have regions of code that only one thread can execute at a time. Any other thread that tries to run in that region of code will go to sleep (be blocked) until the lock is released when the current thread in that region leaves it.

Java allows a synchronized keyword to be added to a method to ensure that no more than one thread will be allowed to run in that method. If that degree of control is too coarse, Java also allows the programmer to use the synchronized keyword to define a region of code that will be locked by a variable called a monitor object. Any other thread that tries to enter any region of code that is synchronized by the same monitor object will be blocked. This region of code that provides mutual exclusion is called a synchronized block.

Domain Name System

A node on the Internet is identified by its IP address. For IP version 4, the most common version deployed today, an IP address is a 32-bit value that is expressed as a set of four bytes, each as a decimal number and separated by dots. For instance, the IP address of the Rutgers web server is 199.83.128.67. [IP version 6, which is rapidly expanding as we are out of IPv4 addresses in some areas, is a 128-bit value and is expressed as a set of 8 groups of four hexadecimal digits.] As humans, however, we prefer to identify endpoints by name rather than by a number. For example, we think of the Rutgers web server by its name, www.rutgers.edu. We will now explore the management of IP domain names and IP addresses and converting between them.

How are IP addresses assigned

IP addresses are distributed hierarchically. At the very top level, an organization called the IANA (Internet Assigned Numbers Authority) is responsible for the entire set of IP addresses. It allocates blocks of addresses to Regional Internet Registries (RIR). There are five RIRs, each responsible for a part of the world’s geography. For instance, the U.S. and Canada get addresses from ARIN, the American Registry for Internet Numbers. Countries in Europe and the mid-East get addresses from the RIPE Network Coordination Centre. These RIRs in turn allocate blocks of IP addresses to ISPs within their region. Since ISPs are tiered, an ISP may allocate a smaller block of addresses to a lower-tier ISP as well as to a company that subscribes to its services.

How are names assigned

In the early days of the ARPANET, each machine had to have a globally unique name. The Network Information Center (NIC) at the Stanford Research Institute (SRI) kept the master list of machine names and their corresponding IP addresses. This solution does not scale. As the number of hosts on the Internet grew larger, a domain hierarchy was imposed on the name space. This created a a tree-structured name space with name management delegated to the various nodes of the tree, where each node is responsible for the names underneath it. Rutgers, for example, can name a new machine on the Internet anything it wants as long as the name is unique within Rutgers and is suffixed with rutgers.edu. The textual representation of Internet domain names is a set of strings delimited by periods with each set representing a level in the naming hierarchy. The rightmost string is the highest level in the hierarchy.

The hierarchy of Internet domain names has a single root under which are top-level domains (TLDs). These are the .com, .edu, .org suffixes that you are familiar with. Currently, there are 1,239 top-level domains. They are divided into two categories: generic top-level domains and country-code top-level domains.

Generic TLDs (gTLD) include the .com, .edu, .gov, .net, etc. domains. Many of them date back to the first proposal of creating a domain hierarchy RFC 920. Each of these domain names is three or more characters long. As the Internet became international, country-specific domains were created. These Country-code TLDs (ccTLDs) are two-letter ISO 3166 country codes (e.g., .ad for Andorra, .dk for Denmark, .es for Spain, .us for the U.S.A.). The root of the domain hierarchy initially allowed only US-ASCII (Latin) characters. This rule changed in 2009 and a new set of Internationalized Domain Names for country code top-level domains (IDN ccTLD) became available. Examples of these domains are السعودية. for Saudi Arabia, .рф for Russia, and .中國 for mainland China. In 2011, internationalized domain names were approved for generic top-level domains (IDN gTLD), giving us domains such as .みんな (“everyone” in Japanese), .移动 (“mobile” in Chinese), and .дети (“kids” in Russian).

Each top-level domain has one administrator assigned to it. The IANA keeps track of the organizations that manage the various top-level domains. Until 1999, for example, a company called Network Solutions Inc. operated the .com, .org, and .net registries. Until that time, Network Solutions maintained the registry of names and processed registration requests from customers. Since then, the process has been decentralized to support shared registration. This allows multiple companies to provide domain registration services. One company is still assigned by the IANA to be the keeper of the master list for a specific top-level domain. This list of registered domain names for a particular TLD is called the domain name registry. The company that maintains this registry is called the domain name registry operator, also known as the network information center (NIC). The IANA keeps track of all these organizations. A domain name registrar is a company that provides domain registration services to customers, allowing them to register domain names for a fee. There are approximately 2,124 of these companies. Examples of these are GoDaddy (with over 60 million domains), Namecheap, eNom, and Tucows.

When you pay GoDaddy, the registrar, $11.99 to register poobybrain.com, it consults the .com domain name registry at Verisign, which is the registry operator for the .com domain. If the domain name is available, GoDaddy becomes the designated registrar for that domain. This means that Verisign knows that Go Daddy has information on the owner of poopybrain.com and that changes and requests to transfer ownership or the registrar of your domain will have to come from that registrar. Of the $11.99 that you paid GoDaddy, $7.85 went to Verisign as a registry fee (different TLDs have different fees; .net registration costs $7.46). A $0.18 yearly fee that goes to ICANN to manage the registry.

Associating names with addresses

We now saw how IP addresses are allocated and how domain names are registered. There is no correlation between the two of them. A domain name does not imply a specific address and adjacent IP address numbers may belong to completely different domain names. We need a way to look up www.rutgers.edu and find out that its address is 199.83.128.67 since the IP layer knows absolutely nothing about domain names. Since the network core has no interest in domain names, name-to-address resolution is handled at the network edge, in the application before it establishes a socket connection. The process of looking up a name is an application-layer protocol.

In the past, Stanford Research Institute’s Network Information Center maintained the entire list of hosts on the internet (in a hosts.txt file; /etc/hosts on Unix systems). This file would be periodically downloaded by every system on the Internet. Clearly, this solution was not sustainable. We already saw that it made managing unique names problematic. Moreover, with millions of hosts on the Internet, there was a lot of churn in this database. Downloading a new copy of every host on the Internet constantly just doesn’t make sense.

The system that was put in place was a database of DNS servers (Domain Name System servers). Like domain names themselves, DNS is a distributed, hierarchical database.

A DNS server is responsible for a managing a sub-tree in the domain name hierarchy. For example, a server might be responsible for everything under rutgers.edu or even just the machines under under cs.rutgers.edu. This sub-tree of a group of managed nodes is called a zone. Each authoritative name server is responsible for answering queries about its zone. The authoritative name server for rutgers.edu is therefore responsible for the rutgers.edu zone. The question now is, how do you find it?

A DNS server accepts queries from clients (called questions) and provides responses (called answers). By default, interactions with DNS servers use UDP for improved performance, although TCP is almost always supported as well.

Any DNS server can be found by starting at the top of the name hierarchy. There are 13 root name servers that can provide a list of authoritative name servers for all the top-level domains (you can download the list of root name servers here. By contacting any one of these servers, you can find out the address of a name server responsible for a specific top-level domain (such as .edu). Then, by querying a name server for that domain (e.g., the .edu name server), you can find a name server responsible for a name within that domain (such as rutgers.edu). The process can continue until you find a name server that is responsible for the zone that contains the host you need.

There are two basic approaches for name resolution when dealing with a hierarchy of name servers: iterative or recursive queries. An iterative query is a single query to a name server. That server will return the best answer it can with the knowledge it has. This can be the the information configured for its zone (e.g., the domain names for which it is responsible) or cached results. If it does not have an answer to the query, it may return a referral. A referral is the name server for the next lower layer, taking you closer to your destination. For example, the root server can return a referral to tell you how to get to the .edu name server. A query to the .edu name server can return a referral to tell you how to get to the rutgers.edu name server. The advantage of this approach is that each name server can be completely stateless. It either knows the answer or it does not.

With recursive resolution, the DNS server takes on the responsibility of performing the set of iterative queries to other DNS servers on behalf of the requestor and sends back a single answer. With recursive resolution, a DNS server may first send a query for the full domain name to the root name server. The root name server will return a referral to, for example, the .edu name server. A query to that server will then return a referral to the rutgers.edu name server. In reality, the recursive server will cache past lookups so it will likely know the addressses of recently-used top-level domains.

The advantage of recursive resultion is that it incurs less communication at the client, simplifies the client’s protocol, and allows for caching of results at all the intermediate servers. The disadvantage is that a recursive server has to keep state about the client’s request until it has completed all processing and is ready to send a response back to the client.

A DNS server is not obligated to support recursion. Most top-level DNS servers, such as root servers, do not support recursive queries.

How does a DNS query work?

The client interaction with DNS is via a DNS resolver. This is a a DNS server that is not necessarily part of the DNS hierarchy (that is, it does not have to be a server responsible for a zone). However, it is capable of taking a recursive request from a client and performing a set of iterative queries, going to the root servers if necessary, to get the result. Resolvers could be hosted on the client, within the organization, or by third parties such as Google Public DNS, or OpenDNS. Most ISPs provide a DNS resolver service. Many systems (such as Windows and Linux platforms) support extremely limited local DNS resolvers that are incapable of iterative queries and simply talk to another DNS server (e.g., a resolver hosted by the customer’s ISP). These limited DNS resolvers are called stub resolvers.

DNS resolvers maintain a local cache of frequently used lookups to avoid the overhead of repeated lookups for the same name and to avoid the overhead of iterative queries. For example, it does not make sense to look up the name server responsible for .com over and over for each query (it’s 192.5.6.30, by the way).

Let us look at the sequence of operations that a query for www.cs.rutgers.edu might take from an application. We assume that the client machine is configured to use OpenDNS as a DNS resolver service.

Rutgers registers its domain with educause.edu, the domain registrar for names in the edu TLD. It provides Educause with a list of DNS servers that can answer queries for names underneath rutgers.edu. Educause.edu, in turn, registers its DNS servers with ICANN, who is responsible for the data in the root name servers.

The client application contacts a local DNS stub resolver. This checks its cache to see if it already has the answer. It also checks a local hosts file to see if the answer is hard-coded in the configuration file. Giving up, it contacts a real DNS resolver (e.g., OpenDNS at 208.67.222.222) and sends it a query for “www.rutgers.edu”.
The OpenDNS resolver checks its cache and doesn’t know either, so it contacts one of the root name servers (let’s assume the resolver’s cache is completely empty). It sends a query of “www.cs.rutgers.edu” to the root server 198.41.0.4 (a.root_servers.net).
The root server doesn’t have the answer but it knows the DNS server responsible for the edu domain, so it sends a referral back to the OpenDNS resolver giving it a list of name servers responsible for edu.
The resolver now sends a query of “www.cs.rutgers.edu” to 192.41.162.32, one of the edu name servers (a.edu-servers.net at 192.5.6.30). It does not know the answer either but it does know the name servers for rutgers.edu, so it sends back a referral with a list of those servers.
The resolver now sends a query of “www.cs.rutgers.edu” to 192.230.122.7, one of the rutgers.edu name servers (ns8.a1.incapsecuredns.net). This happens to be an authoritative name server for the rutgers.edu zone and it returns back the address, 128.6.4.24. If cs.rutgers.edu was defined as a separate zone, the rutgers.edu DNS server would send a referral to yet another name server.
The query is now complete and the OpenDNS resolver sends the result back to the stub resolver that requested the query, which sends it back to the client application.

Inside a DNS server

DNS servers store various information about domain names. Each datum is called a resource record. A resource record contains a name, value, type of record, and a time to live value. Common records include:

Address (A record): identifies the IP address for a given host name.
Canonical name (CNAME record): identifies the real host name for an alias. For example, www.cs.rutgers.edu is really a CNAME (alias) to www3.srv.lcsr.rutgers.edu.
Name server (NS record): identifies the authoritative name servers for the domain.
Mail exchanger (MX record): identifies the mail server for a given host name.

DNS uses a simple request-response protocol. Each query message from a client has a corresponding response message from the server. The exact same binary message structure is used for all DNS messages. A flag field identifies whether the message is a query or a response and whether recursion is desired. A variable-length set of fields after the fixed-length message header contains questions (e.g., that you are looking for the A record of www.rutgers.edu) and answers (the responses to the questions).

Caching

As we mentioned earlier, DNS resolvers rely on caching to avoid performing the same queries over and over. Every DNS zone contains a time to live value, which is an estimate of how long it is safe for a resolver to keep the results for that zone cached. For example, systems under rutgers.edu have a TTL of 3600 seconds (1 hour) while systems under google.com have a TTL of 900 seconds (15 minutes).

Reverse DNS

DNS servers are also able to take an IP address as a query and resolve a domain name for the address. Doing this requires a different query path: the edu server has no idea what range of IP addresses were allocated to Rutgers; it just knows the name servers for Rutgers.

A special domain, in-addr.arpa is created for reverse lookups (arpa stands for Address & Routing Parameter Area). The IP address to be queried is written in reverse order, with the first byte last, to construct a name that looks like 24.4.6.128.in-addr.arpa for the address 128.6.4.24.

An organization has a range, or several ranges, of IP addresses assigned to it. It sets up a local DNS server with PTR (pointer) records that map IP addresses to names. It then tells its ISP what DNS servers are responsible for reverse DNS lookups. The ISP knows what range of addresses belong to the organization. If it gets a query for an address in that range, it now knows which name servers to send on a referral reply. A reverse query that starts at the root will contact the root name servers. These servers, in addition to knowing the name servers of TLDs, also know the name servers for the five RIRs (ARIN, RIPE NCC, etc.) - the entities that hand out IP addresses. The root server may return a referral for the ARIN server (responsible for IP addresses in North America). The ARIN server knows the blocks of IP addresses that were allocated to various ISPs and will send a referral to the name server for the appropriate ISP. That ISP, when queried, will then respond with a referral to the name server for the organization that owns that address.

DNS Terms Glossary

IANA: Internet Assigned Numbers Authority, the organization in charge of keeping track of IP addresses, port numbers, and other number-related aspects of the Internet
ICANN: Internet Corporation for Assigned Names and Numbers, the non-profit company that currently runs the IANA.
RIR: Regional Internet Registry, assigns IP addresses to ISPs within a geographic region.
TLDs: top-level domains.
gTLD: generic top-level domain (.com, .edu, .net, …).
ccTLD: country code top-level domain (.ac, .ae, .ie, .nl, .us).
IDN: Internationalized Domain Names.
Domain name registry: the database of registered domain names for a top-level domain.
Domain name registry operator: the company that keeps the database of domain names under a TLD.
NIC: network information center, another name for a domain name registry operator.
Domain name registrar: a company that lets you register a domain name.
DNS: Domain Name System.
Canonical name: a name that is an alias for a another domain name.
Authoritative name server: a name server that stores, and is responsible for, specific DNS records (as opposed to storing a cached copy)
Zone: a portion of the domain name space (a sub-tree) that is managed by a specific entity. E.g., rutgers.edu is a zone that manages all domains within rutgers.edu.
A (address) record: a DNS record that stores the IP address corresponding to a specific host name.
MX (mail exchanger) record: a DNS record that stores the name of the host that handles email for the domain.
DNS resolver: the client side program that is responsible for contacting DNS servers to complete a DNS query.
Reverse DNS: querying IP addresses to find the corresponding domain names.

HTTP

HTTP stands for Hypertext Transfer Protocol and is the web’s application-layer protocol for interacting between web browsers and web servers. It is a TCP, line-oriented, text-based protocol that consists of requests to the server followed by responses from the servers. The protocol is stateless. This means that the server does not store any state from previous requests. This simplifies the design of the protocol, simplifies recovery from crashes, and makes load balancing easier. Note that web application that use HTTP may impose their own state but it is not a part of the HTTP protocol.

Persistent vs. non-persistent connections

HTTP was originally designed to support non-persistent connections. This meant that the connection was alive for only a single request-response interaction. For each new request, the client had to re-establish a connection. That may have been fine in the earliest days of the web but a request for a page is now typically accompanied by multiple successive requests to download supporting files (stylesheet files and images). The overhead of the round-trip time in setting up a connection for each piece of content adds up. HTTP was enhanced to support persistent connections, where a client and server can exchange multiple request-response interactions on the same connection.

Requests and responses

The main function of HTTP is to request objects (content). These are identified in a browser by a URL (Uniform Resource Locator). A URL takes the format:

protocol://domain_name:port_number/path/to/object/the_object

Browsers support various protocols, not just HTTP. Common ones include HTTP, HTTP (HTTP that is made secure via SSL), FTP (file transfer protocol), and “file” (local files). If the protocol is “http” or “https”, the browser process it via its HTTP protocol module.

The HTTP protocol comprises requests and responses. Each of these messages is structured as a set of text headers, one per line, followed by a blank line, and optionally followed by content. The first line of a request contains a command. The three main HTTP requests are:

GET: request an object
HEAD: like GET but download only the headers for the object.
POST: upload a sequence of name/value pairs to the server. The data is present in the body of the message and is often the response to a form. An alternate way of uploading user data as a set of name/value pairs is to use the GET command and place the data as a set of parameters at the end of the URL. For example, http://www.pk.org/test?name=paul&id=12345&ref=zzz

Each HTTP response contains multiple headers, the first of which contains a status code and corresponding message.

Cookies

While the HTTP protocol itself does not require keeping state, HTTP provides a way for web servers to store state about past sessions from the browser. It does this through cookies. A cookie is a small amount of data that is associated with the web site. The data is created by the server when it gets an HTTP request from the client. It then sends that data back in a Set-Cookie line in the header of the HTTP response. Future HTTP requests to the same server will contain a Cookie line in the header and contain the data that is associated with the cookie.

This simple mechanism allows a web server to create a database entry indexed by a the cookie data to keep track of a user’s session. That database entry can include things such as shopping cart contents, authentication state, pages visited, time spent on a page, etc. The actual cookie itself does not need to store any of this; it just serves as a unique key into the database table.

Because a web page may contain content from other web sites (hence, other servers), it is possible that requests to those sites will result in the generation of cookies. A first-party cookie is one that comes from the web server that is serving your page request. A third-party cookie is one that comes from another web server that serves some content that is present on the page you originally requested. There has been concern over third party cookies in that they allow these parties, usually advertisers, to track your visits to specific web sites. Most web browsers block third-party cookies by default.

Caching

Caching avoids the need to request the same content over and over from the server. However, the challenge is to find out whether the content that is in the cache is still valid. HTTP provides a conditional GET mechanism that is triggered by two lines in the GET header.

When a browser requests content from a server via an HTTP GET message, the response headers include two lines. One is a Last-Modified header that contains the timestamp of the last modification time of that content. The second is a ETag header that contains a hash of the content. The client stores both of these values along with the copy of the content in its cache.

When the content is requested again, the client issues an HTTP GET request to the server but includes two lines in the headers. One is an If-Modified-Since line that contains the last modification time from the cache and the other is an If-None-Match line that contains the value from the ETag. This allows the server to check whether the content has changed since the version that the client cached. If it did not, the server responds with a Not Modified message and no content. If it did, the server responds just as it would with a regular GET request.

Caching proxies

Caching does not need to be handled only by the web browser. A caching proxy is an HTTP server that is hosted within a company’s LAN. All HTTP requests go to that server instead of to the actual destinations (browsers allow you to define a proxy server for some or all domains). The proxy server, in turn, passes the request out to the requested server. This seems like an extra layer of overhead (the client connects to the proxy and sends its request; the proxy then does the same). However, the proxy can cache content. If you request content that somebody else previously requested, it is likely to be cached and the organization incurs less traffic on its outside Internet connection. Even pages that change frequently will usually have significant portions that can be cached, such as CSS files, JavaScript files, and supporting graphics.

Optimizations

A web browser connects to a web server, issues an HTTP request for a web page (an html file), and then parses it to see what additional objects need to be requested to render the page. This typically includes multiple images, CSS files, and possibly additional content such as JavaScript files. Requests are then issued for these objects. One problem with issuing HTTP requests to a server one at a time is that one large response (or a slow one) can hold up all other requests that the client will make. This is called head-of-line blocking.

Parallel connections

One way to avoid head of line blocking is to have the browser open a separate TCP connection for each HTTP request. There are several downsides to this. Many web pages have many dozens or even hundreds of objects (think of a photo thumbnails gallery, for instance). Opening a large number of connections can take substantial time. Moreover, it can consumer substantial resources at the server since each connection requires kernel and application memory resources as well as CPU time. Because of this, browsers support parallel connections but usually limit them to a small number (typically four). Once you limit the number of connections, you again have the risk of head-of-line blocking. Another problem with pipelining is that there’s no assurance that it will work if a proxy is present. Just because your browser establishes several TCP connections to the proxy does not mean that the proxy will, in turn, establish those connections to the server.

Pipelining

Another performance optimization was HTTP pipelining. With pipelining, instead of waiting for each response, multiple HTTP requests can be dispatched one after another over one connection. However, the server is still obligated to issue responses in the order that the requests were received and head-of-line blocking is still an issue since one delayed or long response can hold up the responses behind it. Most browsers as well as proxies have disabled pipelining or do not implement it.

HTTP/2 Multiplexing

HTTP/2, the next major update to the HTTP protocol, which came out in 2015, supports the same commands as its predecessor, HTTP/1.1. However, it adds a number of optimizations.

HTTP/2 supports multiplexing. This allows multiple messages to be interleaved on one connection. It is a form of a session layer (implemented in the application, of course). A large response may be broken up into multiple chunks with other responses interleaved among it. The browser keeps track of the pieces and reassembles all the objects.

HTTP/2 Server Push

The HTTP/2 protocol adds a server push capability that allows the server to send objects to the client proactively. The client, upon receiving them, can add them to its cache. This is useful for objects such as stylesheets that are used by an HTML page. Normally, the browser would have to first receive the HTML page so it can parse it before issuing requests for objects that the page needs. If this information is given to the server, it can start sending these objects before the server requests them.

HTTP/2 Header compression

HTTP request and response headers tend to be verbose and are text-based. Their size often requires several round trips just to get the headers for a page out to the server. Compressing headers can make requests and responses shorter and speed up page loads.

FTP

FTP, the file transport protocol, is one of the earliest Internet protocols and was designed to transfer files between computers. The protocol uses TCP and is based on commands and responses. A command is a single line of ASCII text. A response is also a single line of text and contains a status code along with a message.

To communicate, a client establishes a TCP connection from some available port N to port 21 on the server. Commands and responses are sent over this communication channel. Some basic commands are USER to identify a user name, PASS to specify the password, GET to download a file, PUT to upload a file, and DIR to get a directory listing.

If the command is a request for data transfer (such as putting a file, getting a file, or getting a directory listing), the server initiates a TCP connection back to the client on port N+1. Data is then transferred over this channel (either from client to server or server to client, depending on the request) and the connection is then closed. FTP is unique compared with most other protocols in that it separates control and data channels. Control information is sent out of band, on a different channel than the data.

Passive mode

Because having a server connect to a client proved problematic in some environments, FTP supports an alternate mechanism, called passive mode, where the client connects to the server to set up the data channel. This is now the more popular mode of operation and some FTP clients, such as web browsers, only support this mode.

SMTP

The Simple Mail Transfer Protocol (SMTP) is designed for delivering mail to a server that hosts the recipient’s mailbox. It is a TCP-based protocol that is line-based and uses ASCII text for all interactions. An SMTP server acts as a client and a server. Typically a mail application uses SMTP to send a message to a user’s SMTP server (e.g., smtp.gmail.com). This server is often hosted by the organization that provide’s the sender’s email service (e.g., Google, Comcast, Rutgers). This SMTP server then queues the message for delivery. To deliver the message, it acts like a client. The SMTP server looks up the DNS MX (mail exchanger) record for the destination domain, connects to that SMTP server, and delivers the message. The receiving server places the message in the user’s mailbox. If the user has an account on that machine and runs the mail client locally, the mail client can access the mailbox and read the message. More often, the user is on a different system and needs to fetch messages. For that, mail retrieval protocols, such as POP or IMAP, must be used.

The SMTP protocol consists of server identification (HELO), specifying who the mail is from (MAIL FROM:), and then specifying one or more recipients, one per line (RCPT TO:). Finally, the message is send with the DATA command. The message is multiple lines of ASCII text and typically starts with the mail headers that you see in your email. It is useful to note that all those mail headers are of no value to SMTP; it just treats them as the message data. You can have a completely different list of names in the To: header than you specified in the SMTP RCPT TO commands and the mail will only be delivered to the recipients you listed with the RCPT TO commands.

SMTP is an example of a push protocol. The client takes content and sends it to the server. HTTP, on the other hand, is a pull protocol. The client connects to it and asks it for content.

Because SMTP was designed to handle only text-based interaction, sending mail containing binary data, such as a jpeg file, was problematic. To remedy this, an encoding format called MIME (Multipurpose Internet Mail Extensions) was created. This defines formats for encoding content in a suitable format for message delivery. A MIME header in the body of the email identifies the content type and encoding used. To support mail attachments and the encoding of multiple objects, multipart MIME headers in the message body allow one to identify multiple chunks of content. MIME has nothing to do with SMTP but is designed to cope with the restrictions that SMTP placed on the structure of a message (7-bit ASCII text with line breaks). It is up to mail clients to create and parse MIME encodings.

POP3

SMTP dealt only with mail delivery. POP3 is a TCP-based protocol to allow a user to connect to a remote mailbox, download, and delete messages. The entire protocol is text-based. A user authenticates with user and pass commands and then sends commands to list messages, retrieve a specific message, or delete a message.

POP3 supports two interaction models. The download-and-delete model has a client connect to a mail server, download messages to the client’s local mailbox, and then delete them from the server. With this model, the server is just a temporary repository for mail until the client gets around to downloading it. The problem with this model is that it does not work if you access mail from multiple devices. Once a message is deleted from the server, other devices cannot get it.

The download-and-keep model has the client connect to a mail server, download messages to the client’s local mailbox, but does not delete them from the server. They only get deleted when a user deletes them locally and the mail client connects back to the server and issues a POP3 delete command for those messages. With this behavior, a user can access messages from multiple devices.

The downside of POP3 is that it does not keep state across sessions. It does not know, for example, if a user marked several messages for deletion during a previous connection session.

IMAP

IMAP, the Internet Message Access Protocol was designed to operate on a mailbox remotely rather than POP’s approach of retrieving the contents of a mailbox onto a client. It can handle the case where multiple clients are accessing the same mailbox and can keep operations synchronized since state is maintained on the server.

IMAP also supports the ability to move messages into folders, search for specific messages on the server, mark messages for deletion prior to actually deleting them, and fetch headers or full messages. It allows the same offline convenience that POP does, where all content can be downloaded onto a client, but also offers full state tracking on the server.

Like POP, SMTP, and HTTP, IMAP commands are also sent as lines of ASCII text. Unlike those protocols, requests and responses can be handled asynchronously; a client can send multiple requests without first waiting for responses.

Peer to Peer Protocols

Traditional, and still the most common, network-based applications are those that follow a client-server model. A client needs a service (access to a file’s contents, for example) and contacts a server that can provide that service. A peer-to-peer model is an alternative application architecture that removes the need for dedicated servers and enables each host to participate in providing the service. Because all machines can both access as well as provide the service, they are called peers.

A true peer-to-peer architecture has no reliance on a central server. In practice, some peer-to-peer architectures are really hybrid architectures, where a central server may provide key authentication or location services. Desirable (but not necessary) characteristics of peer-to-peer application architectures are robustness and self-scalability. Robustness refers to the ability of the overall service to run even if some systems may be down. Self-scalability refers to the ability of the system to handle greater workloads as more peers are introduced into the system.

In our discussions, we focused on just one application domain: peer-to-peer file distribution.

For file distribution, there are four key operations (primitives): (1) how a peer joins and leaves a peer-to-peer system; (2) how peers register files and their metadata (names, attributes); (3) how search is handled; and (4) how files are downloaded.

The systems that we examine may or may not tackle all of these areas.

Napster

Napster is the earliest of peer-to-peer systems and is the system that put peer-to-peer file sharing on the map. It was built for sharing MP3 files. Napster is not a pure peer-to-peer architecture since it relies on a single server to keep track of which peer has which content.

A peer contacts the central server and publishes a list of files that it wants to share. Anyone who wants to find a file contacts the central server to get a list of peers that have the file. The peer then connects to any of the peers in that list and downloads the file. The download is either via a direct TCP connection to the server or, if the system is inaccessible because it is behind a firewall, it contacts the central server to send a message to the desired peer requesting that it connect and upload to the requestor.

The advantage of Napster is that it is a simple design. The use of a central server, while deviating from a true peer-to-peer model, establishes a single point of control and maintains all the information on the locations of content.

The downside is that the server can become a bottleneck with high query volumes. The failure of the central server causes the entire system to cease to operate.

Gnutella

After Napster was shut down by shutting down its central server, Gnutella set out to create an architecture that offers truly distributed file sharing. Unlike Napster, Gnutella can not be shut down since there is no central server.

Gnutella’s approach to finding content is based on query flooding. When a peer joins the system, it needs to contact at least one other Gnutella node and ask it for a list of nodes it knows about (its “friends”). This list of peers becomes its list of connected nodes. This builds an overlay network. An overlay network is a logical network that is formed by peer connections. Each peer knows of a limited set of other peers. These become its neighbors, and do not need to be physical neighbors. A peer is capable of communicating with any other peer; it is just the lack of knowing that the other peer exists that stops it.

To search for content, a peer sends a query message to its connected nodes. Each node that receives a query will respond if it has the content. Otherwise, it forwards the content to its connected nodes. This is the process of flooding. Once the content is found, the requesting peer downloads the content from the peer hosting the content via HTTP.

A facet of the original design of Gnutella was anonymity. Replies were sent replies through the same path that the queries took. A peer receiving a query would not know if it came from the requestor or from a peer just forwarding the request.

Gnutella has a significant architectural advantage over Napster. Its design is fully decentralized. There is no central directory and hence the service cannot be shut down. On the other hand, flooding-based search is inefficient compared to maintaining a single database. Search may require contacting a large number of systems and going through multiple hops. Well-known nodes (e.g., those that may be configured in default installations) may become overly congested.

A few optimizations were later added to Gnutella.

The process of routing replies through the query path was changed to sending responses directly to the requester to reduce response times.
If connecting to a peer that serves the content is not possible because of firewall restrictions at the peer, the requesting node can send a push request, asking the serving peer to send it the file.
Much of the Gnutella network was composed of end user’s personal machines and these had varying levels of uptime and connectivity. As such, not all peers are equal. With this in mind, Gnutella divided its peers into two categories: leaf nodes and ultrapeers. Leaf nodes are normal peers. They know of a small number of ultrapeers and may not have fast connections. Ultrapeers are peers that have a high degree of connectivity (32 or more connections to other ultrapeers) and can hence flood queries with more hops.

Kazaa

Kazaa was created a year after Gnutella with the core premise that not all nodes have equivalent capabilities as far as network connectivity and uptime are concerned. They introduced the concept of supernodes. These nodes have high uptime, fast connectivity, faster processors, and potentially more storage than regular nodes. They also know other supernodes. This is the same concept as Gnutella’s later enhancement with its addition of ultrapeers. A client (peer) needs to know of one supernode to join the system. It sends that supernode a list of all the files that it is hosting. Only supernodes are involved in the search process. Search is a flood over the overlay network as in Gnutella. Once a query reaches a supernode that has the requested content in its list, it sends a reply directly to the peer that initiated the query. The querying peer will then download the content from the peer that hosts the content.

BitTorrent

The design of BitTorrent was motivated by the flash crowd problem. How do you design a file sharing service that will scale as a huge number of users want to download a specific file? Systems such as Napster, Gnutella, and Kazaa all serve their content from the peer that hosts it. If a large number of users try to download a popular file, all of them will have to share the bandwidth that is available to the peer hosting that content.

The idea behind BitTorrent is to turn a peer that is downloading content into a server of that content. The more peers are downloading content, the more servers there will be for it. BitTorrent only focuses on the download problem and does not handle the mechanism for locating the content.

To offer content, the content owner creates a .torrent file. This file contains metadata, or information, about the file, such as the name, creation time, and size of the file. It also contains a list of hashes of blocks of the content. The content is logically divided into fixed-size blocks and the list of hashes in the .torrent file allows a downloading peer to validate that any downloaded blocks has been downloaded correctly. Finally, the .torrent file contains a list of trackers.

The tracker is a server running a process that manages downloads for a set of .torrent files. When a downloading peer opens a .torrent file, it contacts a tracker that is specified in that file. The tracker is responsible for keeping track of which peers have which have the content. There could be many trackers, each responsible for different torrents.

A seeder is a peer that has the entire file available for download by other peers. Seeders register themselves with trackers so that trackers can direct downloading peers to them. An initial seeder is the initial version of the file.

A leecher is a peer that is downloading files. To start the download, the leecher must have a .torrent file. That identifies the tracker for the contents. It contacts the tracker, which keeps track of the seed nodes for that file as well as other leechers, some of whom may have already downloaded some blocks of the file. A leecher contacts seeders and other leechers to download random blocks of the file. As it gets these blocks, it can make them available to other leechers. This is what allows download bandwidth to scale: every downloader increases overall download capacity. Once a file is fully downloaded, the leecher has the option of turning itself into a seeder and continue to offer serving the file.

BitTorrent scales very well. The more participants there are, the greater the aggregate bandwidth is. Peers may be given an incentive to share since BitTorrent software may choose to block downloads if you don’t offer uploads. The downside of BitTorrent is that unpopular files will not have leechers and will not offer this benefit of scale. Block sizes tend to be large (the default is often 256 KB with a maximum size of 4 MB). This makes the architecture not suitable for small files as the distributed download aspect won’t come into play unless a large number of leechers choose to act as future seeders. Finally, search is not a part of the protocol. A user needs to turn to some other mechanism to actually get the .torrent file.

Distributed Hash Tables

The systems we covered use one of three approaches for locating content:

Central server (Napster)
Flood (Gnutella, Kazaa)
Nothing (BitTorrent). Search is out of scope for BitTorrent and it relies on separate solutions to allow users to locate a .torrent file for the desired content.

Flooding can be an inefficient and indeterminate procedure for finding content. Some nodes may be slower than others and some may have fewer connections than others, resulting in more hops to query the same number of machines. Gnutella and Kazaa tried to ameliorate this somewhat by creating ultrapeers (supernodes) but the mechanism of the flood still exists.

In standalone systems, hash tables are attractive solutions for high-speed lookup tables. A hash function is applied to a search key. That result becomes an index into a table. Hash tables result in O(1) lookup performance versus the O(log N) time for a binary tree or search through a sorted table. Since there is a chance that multiple keys hash to the same value (known as a collision), each table entry, called a slot (or bucket), may contain a linked list or additional hash table.

A distributed hash table, or DHT, is a peer-to-peer version of a hash table: a distributed key, value database. The interface we want for a DHT is that a client will query a DHT server with a key to get the corresponding value. This DHT server may be a separate collection of peer-to-peer systems, all acting as one server from the client’s point of view or the querying client may also be a peer. The DHT software finds the host that holds the key, value pair and returns the corresponding value to the querying host. This should be done without the inefficiency of a flood. The specific implementation of a DHT that we examine is called Chord and it creates an overlay network that is a logical ring of peers.

Chord takes a large hash of a key (e.g., a 160-bit SHA–1 hash). Each node in the system is assigned a position in the ring by hashing its IP address. Because the vast majority of bucket positions will be empty, key, value data is stored either at the node to which the key hashes (if, by some chance, the key hashes to the same value that the node’s IP address hashed) or on a successor node, the next node that would be encountered as the ring is traversed clockwise. For a simple example, let us suppose that we have a 4-bit hash (0..15) and nodes occupying positions 2 and 7. If a key hashes to 4, the successor node is 7 and hence the machine at node 7 will be responsible for storing all data for keys that hash to 4. It is also responsible for storing all data to keys that hash to 3, 5, 6, and 7.

If a node only knows of its clockwise neighbor node, then any query that a node cannot handle will be forwarded to a neighboring node. This results in an unremarkable O(n) lookup time for a system with n nodes. An alternate, faster, approach is to have each node keep a list of all the other nodes in the group. This way, any node will be able to find out out which node is responsible for the data on a key simply by hashing the key and traversing the list to find the first node ≥ the hash of the key. This gives us an impressive O(1) performance at the cost of having to maintain a full table of all the nodes in the system on each node. A compromise approach to have a bounded table size is to use finger tables. A finger table is a partial list of nodes with each node in the table being a factor of two away from the current node. Element 0 of the table is the next node (2⁰ = 1 away), element 1 of the table is the node after that (2¹ = 2 away), element 2 of the table four nodes removed (2²), element 3 of the table eight nodes removed (2³), and so on. With finger tables, O(log n) nodes need to be contacted to find the owner of a key.

Transport Layer

The network layer (layer 3 of the OSI stack) is responsible for machine-to-machine communication. The transport layer, one layer higher (layer 4), provides logical communication channels between applications. An application can create an arbitrary number of these channels, each of which has another endpoint on some process running on some host. Writing data onto this channel delivers it to the application that is reading data on the other end of this channel. The transport layer is responsible for implementing this abstraction. Routers in the network are unaware of this concept since they only provide network layer (machine-to-machine) services.

There are multiple transport protocols available on top of IP, including TCP, UDP, and SCTP. TCP and UDP are by far the most popular of these. Two responsibilities of the transport layer are multiplexing and demultiplexing communication channels on the network and, in some cases, implementing reliable data transfer.

Incidentally, a packet at the transport layer is called a segment; it is called a datagram at the network layer and a frame at the datalink layer. We send Ethernet frames, which contain datagrams that are routed by routers. These datagrams, in turn, contain segments that the transport layer of the operating system’s network stack processes.

Transport layer multiplexing and demultiplexing

Multiplexing and demultiplexing are the software mechanisms in place to combine data from multiple logical communication channels on a machine into a single stream of packets on the network and then separate a stream of incoming datagrams into the appropriate communication channels. This is important since communication on multiple sockets shares the same network connection. We can have multiple distinct streams at the transport layer that appear as a single stream of data to the network layer.

Multiplexing is the process of taking data from multiple communication channels (sockets) and sending it out of the machine as a stream of datagrams. Demultiplexing is the opposite process: separating the incoming stream of datagrams into the individual messages for the individual sockets to which each segment is targeted.

The key to IP transport layer multiplexing and demultiplexing is the use of port numbers. Each transport layer segment contains source and destination port numbers. A port number is a 16-bit number that has a unique association to a socket (a communication endpoint) on each host. Naming a socket, also known as binding, is the process of associating a socket with a specific port number and address. The address is the local host’s IP address, of course. In the case where a host has several network interfaces, it will have that many IP addresses and it is possible to make the socket available on only one of these interfaces. More commonly, though, a special address, INADDR_ANY, is used to associate a socket with all available network interfaces. Port numbers are usually specified explicitly for server programs since clients will need to know where to contact them. For example, an SMTP mail server will typically listen for client connections on TCP port 25. A client, on the other hand, will generally not care what port it uses and specifying port 0 is a request for the operating system to pick any available unused port number.

UDP multiplexing and demultiplexing

UDP is an extremely lightweight transport layer protocol on top of IP. Unlike TCP, it does not offer reliable message delivery and it does not guarantee that messages will be received in the order that they were sent.

An incoming frame (e.g., ethernet packet) contains a protocol identifier that identifies the payload (the data part of the frame) as IP data. When that payload is passed to the IP layer, a field in the header of the IP datagram identifies the higher layer protocol as UDP. The UDP layer reads the destination port field in the UDP header and delivers the segment to the socket that is associated with that port number. The kernel maintains a hash table of socket structures that is indexed by a key that is created from the UDP destination port. With UDP, any segments addressed to a specific port number will be delivered to the socket that is identified with that port. We will see that this is different from TCP, which takes performs full demultiplexing based on the source and destination.

While UDP does not have the reliability and in-order delivery advantages of TCP (or, as we shall see, rate adjustment to deal with congestion), there are several reasons that make it attractive for certain applications:

Segments are sent immediately. When a user writes data to a UDP socket, it immediately goes down the layers of the network stack and is transmitted onto the network. TCP may wait for an acknowledgement or for sufficient data in its transmit buffer instead of transmitting immediately.
Message boundaries are preserved. TCP treats a communication stream as a sequence of bytes. The number of writes to a socket does not necessarily correspond to the number of messages that will be received at the other end.
No connection setup overhead. With UDP, the first segment that is sent on the network can contain application data. With TCP, we first need to establish a connection with a three-way handshake, which requires an overhead of sending a segment and receiving an acknowledgement before we can send a segment with data.
UDP is stateless. The kernel has to keep track of sockets, of course, but does there is no need to keep track of sequence numbers, buffer for out-of-order data, acknowledgements, etc. This uses less kernel memory and makes error recovery and load balancing easier: requests can be redirected to other hosts spontaneously.
Smaller headers. UDP has an eight byte header compared to TCP’s 20-byte header. This leads to smaller packets on the network.

UDP header and checksum

Figure 1. UDP header with IP pseudo header

The UDP header (Figure 1) is eight bytes long. It contains the source and destination ports, segment length and a checksum. The checksum is a simple error-detecting code that allows the UDP layer to check for segment corruption. If the received segment contains an error, it is dropped and not delivered to the socket. The checksum is computed over the UDP header, application data, and a pseudo IP header. The pseudo IP header contains a subset of fields from the IP header (source address, destination address, transport protocol ID, and UDP segment length). It is included in the checksum computation to ensure that the UDP layer will not get any misrouted segments (this is a safeguard but the IP header has its own checksum, which is computed in the same way).

The checksum is a 16-bit value. If the data being summed does not contain an even number of bytes (i.e., it does not have an integral multiple of 16-bit values), it is padded with a zero byte. The same ones’ complement algorithm is used to compute checksums for IP headers, UDP headers, and TCP headers. The value of the checksum field is set to zero during the computation. To compute the checksum, all 16-bit chunks of data are added together. Each time the addition of two numbers results in an overflow, a one is added to the result. Finally, the bits of the final result are inverted.

To validate the checksum, the receiver performs the same arithmetic, generating a checksum for the segment and pseudo IP header. Since the segment checksum is included in the header for this computation, the result for an error-free packet will be all ones (0xffff). This computation reliably detects single bit errors in the segment.

TCP multiplexing and demultiplexing

UDP offers only limited demultiplexing. Segments from multiple sockets (sources) that are directed to the same host address and port number are received by the socket on that host that is associated with that port number.

With TCP, a connected socket is associated is associated with four values:

the sender’s source address
recipient’s (destination) address
source port
destination port

Recall that with TCP sockets, a server first creates a socket whose sole purpose is listening for and accepting incoming connections. It does so with the listen system call. This socket is said to be in the LISTEN state. It will never be used for data transfer. Its only purpose is to accept incoming connections.

When an incoming TCP connection request arrives at the host, the kernel searches for a socket in the LISTEN state where the packet’s destination address and port match those of the socket (the address can be “any” - a wildcard). The kernel then creates a new socket. The remote address and port are copied from the TCP segment header onto the new socket structure. Once the connection is set up, this new socket is in the ESTABLISHED state, indicating to the kernel that it has a connection to another socket. Any incoming TCP data segment will go to the socket that is associated with the source and destination addresses and ports in the segment header.

Principles of reliable data transfer

Given that the underlying IP network does not guarantee packet delivery, if a transport layer protocol wants to provide reliable data delivery, it has to implement it via software. We will first look at evolving reliable data transfer (RDT) software in general before turning our attention to how it is implemented in TCP specifically.

If the underlying networking layer was indeed reliable, there would, of course, be no need for RDT. A sender would send a segment and a receiver would receive it and immediately deliver it to the application.

Stop-and-wait protocol

Let us now assume that all segments are received (this will not be a valid assumption in the real world of IP) but that some of them might have corrupted data. In this case, we may need to request retransmission of a segment because it arrived with errors. Automatic Repeat Request (ARQ) refers to a family of protocols that acknowledge received packets and request retransmission for bad packets.

An acknowledgement (ACK), also known as a positive acknowledgement, is a receiver feedback message that confirms the successful receipt of a message. A negative acknowledgement (NAK) is a feedback message that tells the sender that a message was not successfully received.

A simple protocol for providing RDT over a channel that always delivers segments but may introduce errors into them is to transmit one segment and wait for an ACK or NAK. A receiver sends an ACK if the segment was received without errors. Upon receipt of the ACK, the sender can transmit the next segment. A receiver sends a NAK if the segment was received with errors. Upon receipt of a NAK, the sender retransmits the same segment and again waits for an ACK or NAK. This form of ARQ protocol, where a segment will not be sent until the previously sent segment has been acknowledged, is called a stop-and-wait protocol.

The protocol we just outlined fails in that it recognizes that data can be corrupted in transit but does not take that possible corruption into account for ACK/NAK messages. We can modify the protocol by adding a checksum for ACK/NAK segments and, upon receipt, detect if those segments are corrupted. If corrupted, we will treat the message as a NAK and retransmit the segment. This can result in the receiver getting duplicate packets. If a receiver gets a duplicate packet, it will need to ignore the data but still send an ACK in return.

Now we need to distinguish new data from a retransmission. A sequence number allows us to do that. In the case of a stop-and-wait protocol, a one-bit sequence number suffices since we only need to distinguish between the current packet we’re waiting for and the retransmission of a correctly-received previous packet. This stop-and-wait protocol using a single-bit sequence number is called an alternating bit protocol.

Removing NAKs

We just saw two cases where a recipient gets a packet that it does not want: receipt of a duplicate packet (in which case it sends an ACK) and receipt of a corrupted packet (in which case it sends a NAK and awaits retransmission). We can remove the need for a NAK by using an ACK and adding a sequence number to it. The ACK will acknowledge the last packet that was correctly received. If the sender receives an ACK for a different number than the packet it most recently sent, it will treat that as a NAK and retransmit the packet.

RDT over a lossy channel

So far, we only considered the case where packet data might be corrupted but the packets were always delivered to their destination. Now we need to account for the fact that packets may be lost. This can be due to overflow of a queue at a router or to data corruption in the packet header that prevents the packet from being routed to its destination.

We place the burden of detecting a lost packet on the sender. The sender will not get an acknowledgement from a receiver in the case that a packet was never delivered or in the case that it was delivered but the acknowledgement message was lost. To detect this lack of acknowledgement, the sender will use a countdown timer. The timer is initialized when the packet is sent. If it times out before an acknowledgement is received, the sender retransmits the packet and reinitializes the timer. The timer should be set to some value that is longer than the average round-trip time so that the timeout will indicate a likely loss. Setting the timer to a too-short value will result in excess duplicate packets. Although the protocol can deal with them, we’d like to avoid an excessive amount of unnecessary retransmissions.

Pipelining

A stop-and-wait protocol will not transmit a packet until the previous packet has been successfully sent and acknowledged. Having to wait a round-trip delay before sending the next packet yields horrible network utilization, often far less than 1%. Recall that network utilization is the ratio of the actual traffic on the network to the traffic that the network can support.

A way to improve network utilization dramatically is to send successive packets without first waiting for acknowledgements of earlier packets. This technique is called pipelining and we will look at two approaches to pipelining: Go-Back-N and Selective Repeat. In order to send multiple packets without first waiting for an acknowledgement from each one, we will need to increase the range of sequence numbers so that we can identify packets and match an acknowledgement to a specific transmitted packet. We also need to save packets on the transmitter until they have been acknowledged by the receiver in case we need to re-send them. If the receiver gets out-of-sequence packets, it cannot deliver them to the application and may need to consider storing them in a receive buffer or else requesting those same packets from the sender again in the future.

Go-Back-N (GBN) Protocol

The Go-Back-N protocol allows a sender to send multiple packets without waiting for an acknowledgement. Each successive packet has a monotonically increasing sequence number. A window size defines the maximum number of packets that could be transmitted before waiting for acknowledgements. The base of the window is the earliest packet that has been sent but not yet acknowledged. When an acknowledgement is received for a sequence number that corresponds to that packet, it can be discarded (the sender will never need to retransmit it) and the window advances, or slides to the next unacknowledged packet and the new packet that entered the window can now be transmitted. This is why Go-Back-N is called a sliding window protocol.

The sender sends all packets that fall within the current window and starts a timer. The receiver expects to receive packets in the correct sequence number order but it may not get that. Whenever a packet is received correctly and is in the proper sequence, that packet is acknowledged with its sequence number and delivered to the application. The expected sequence number is incremented and the receiver waits for the next packet.

If the receiver gets a packet that has a different sequence number, it discards the packet and sends back a duplicate acknowledgement (that is, the acknowledgement for the previous sequence number it received). An acknowledgement number n indicates that the receiver has correctly received all packets up to and including packet n. This form of acknowledgement is called a cumulative acknowledgement. The receiver only needs to keep track of the next sequence number it needs and only stores one packet at a time.

When the sender receives an acknowledgement n, it advances the base of its window to n. Packets less than or equal to n can be discarded. Note that there is no harm in losing acknowledgements less than n; this acknowledgement indicates that all prior packets were received as well. If the acknowledgement number corresponds to the last packet that was sent, the sender has all outstanding acknowledgements and can stop the timer. Otherwise, the timer is restarted to wait for additional acknowledgments. If the timer expires, that means that some all transmitted packets have not been acknowledged; one or more packets have been lost (or the final acknowledgement has been lost). Upon timer expiration, the sender sends all packets that are in the current window.

Selective Repeat (SR) Protocol

With the Go-Back-N protocol, many packets can be in the pipeline: sent but not yet acknowledged. A single error in one of these packets will result in a timeout at the sender and hence a retransmission of all packets in the sender’s window. This can result in a lot of unnecessary retransmissions since the receiver will be getting packets that it has previously received correctly but discarded.

Selective Repeat (SR). Selective repeat, like Go-Back-N, is a sliding window protocol but allows the receiver to store and acknowledge out-of-order packets so that they do not need to be retransmitted.

Instead of cumulative acknowledgements, the receiver sends an acknowledgement for the specific packet that was received. The sender’s window slides when the earliest packet in the window is acknowledged and always starts at the first unacknowledged packet. The window itself may contain a mix of acknowledged and unacknowledged packets.

The receiver must also maintain a window since it may receive packets out of sequence and needs to buffer them. Whenever a packet is received in sequence, it can be delivered to the application. The receive window slides to the slot for the first non-received packet.

Every transmitted packet has a separate timer associated with it. If an acknowledgement is not received successfully within that time, that specific packet is retransmitted. The receiver will acknowledge the packet if it fits in the window or if it is sequenced before the window. This latter case means that the packet is a duplicate of a packet that was already received and delivered to the application. If the packet is beyond the receiver’s window, it has no room to accept the packet and will ignore it.

Transport Layer: TCP

TCP, the Transmission Control Protocol, is the dominant transport protocol on the Internet. Where UDP was a thin layer over IP that provided us with multiplexing and a limited demultiplexing service (the source host was not factored into the demultiplexing - that’s up the the application to process), TCP provides applications with a reliable, bidirectional communication channel. In addition, TCP attempts to be a good network citizen and manage the flow control of data to ensure that the receiver’s buffer does not overflow and to avoid network congestion.

TCP is a connection-oriented protocol. This does not mean that a virtual circuit is established in the network. Indeed, routers are not aware of TCP any more than they are of UDP or other transport layer protocols. TCP is a form of protocol design called end-to-end control, where only the endpoints are responsible for the integrity of the connection. Routers do not guarantee not to drop packets, ensure the are sequenced correctly, or correct errors. The “connection” is managed in software at both end systems. Because the hosts need to keep track of the state of the communication channel, TCP has an initial connection setup step that comprises a three-way handshake where parameters such as port numbers, sequence numbers, and buffers are established. Once the connection is established, all messages are acknowledged and retransmitted if lost. When the session is complete, TCP enters a teardown phase to ensure both sides are informed that the connection is no longer needed and that they can free up resources that were used for the connection.

TCP’s communication is full duplex. This means that if a process A established a TCP connection to process B then process B can use the same connection to send data to process A.

TCP Segments

TCP sends segments between the sender and receiver. A segment is simply the transport layer term for a packet. The sending process writes data to a socket. This data is copied to the operating system kernel and ends up in TCP’s send buffer, a pool of memory devoted to that specific connection. TCP reads chunks of data from this send buffer, creates TCP segments, and sends them down to the IP layer for transmission onto the network (the IP layer will, in turn, send the IP datagram to the data link layer, which will actually get it to the network). When an IP datagram arrives at the destination machine, a protocol field in the IP header identifies the message as a TCP message and IP forwards it up to the TCP driver. TCP then examines the source address, source port, destination address, and destination port for find the appropriate socket for this segment. The data is placed in that socket’s receive buffer, the counterpart to the send buffer that is a pool of memory used to hold received data that did not yet make it up to the application.

Note that, unlike in UDP, TCP views the data coming from the application as a stream of bytes rather than individual messages that need to be sent out. There is no assurance that writing, say, 20 bytes to a socket will result in 20 bytes of data being transmitted in a TCP segment. If there was other data in the send buffer that was ready to transmit, it could be combined with this new data. Similarly, the 20 bytes may not be transmitted immediately but combined with additional data that is written onto that socket.

Maximum Segment Size

When the TCP driver reads bytes from data in the send buffer to create outgoing segments, it makes sure that the number of bytes is grabs is less than the maximum segment size (MSS) to make sure it does not try to send a segment larger than the underlying network can transmit. Data link layers have a Maximum Transmission Unit (MTU), the largest payload that they can carry. For a TCP segment to fit into a data link frame, the IP header, TCP header, and application data must be no larger than the MTU. Ethernet supports an MTU of 1500 bytes (with support for an MTU of 9,000 bytes for jumbo frames in gigabit ethernet) while 802.11 (Wi-Fi) supports a 7981-byte MTU. Since IP and TCP headers are each 20 bytes long, the MSS is typically the MTU minus 40 bytes. Hence, a common MSS on an ethernet network is 1460 bytes (1500–40).

Path MTU Discovery

It is easy enough for the TCP layer to find out the MTU of the local link layer, subtract 40, and compute a value for the MSS. While this is fine for communication within the LAN, this does not ensure that some link that the packet traverses to the destination will not have a smaller MTU, resulting in the need to fragment the packet (which is undesirable). The path MTU is the minimum MTU of all the hops along the path the destination. Unfortunately, there is no foolproof way of determining this value. IP networks are required to support an MTU of 576 bytes (512 bytes of data plus up to 64 bytes for headers). However, most network links can support larger MTUs (for example, and MTU of 1500 bytes works on practically all routers in the Internet but support is not guaranteed).

Path MTU Discovery (RFC 1181 and RFC 1191) defines a method for a host to discover the path MTU and hence set its MSS to the maximum possible value.

It works by initially assuming that the path MTU is the MTU of the first hop (this is local to the host and easy to find). All initial datagrams are sent to the destination with the “don’t fragment” (DF) bit set in the IP header. If a router needs to route the datagram to a link with a smaller MTU, the router will discard the datagram and send back an ICMP[1] datagram containing an ICMP Destination Unreachable message with a code indicating fragmentation needed. The MTU of the outbound link is placed in the ICMP message. Upon getting this response, the sending host reduces its MTU to the returned value and tries again. Since routes may change periodically, the Path MTU process is repeated periodically (every 10 minutes on Linux and Windows systems by default).

TCP structure

A TCP segment comprises at least 20 bytes of a TCP header followed by a variable number of bytes of application data. The TCP header is prefixed by an IP header that, in turn, is encapsulated in a link layer header.

Some of the key parts of the TCP header are:

Source port number: Identifies the port number of the sender’s socket.
Destination port number: Identifies the port number of the receiver’s socket.
Checksum: Like UDP, this is a 16-bit 1s complement sum of the TCP header, application data, and IP pseudo header. The IP pseudo header is a subset of the fields in the IP header: source address, destination address, protocol ID, and the length of the TCP header and data. All the data is added in 16-bit chunks. Any overflow carries result in a 1 added to the result. The final result is complemented (bits inverted). When the receiver performs the same operation and includes the checksum field, the result for an uncorrupted packet is all 1s (0xffff).
Sequence number: Every byte that is transmitted is counted starting from some initial sequence number. The 32-bit sequence number of the TCP segment is the number of the first byte in the data. The sequence number is an essential part of TCP’s reliable data transfer service.
Acknowledgement (ACK) number: A receiver sends back a 32-bit acknowledgement number that is the sequence number of the next byte that it expects to receive. This too is an essential part of TCP’s reliable data transfer service.
Receive window: The receiver tells the sender the maximum number of bytes that it is capable of receiving. This 16-bit value takes priority over the MSS (maximum segment size).
Header length: The header length indicates the number of 32-bit words in the TCP header. Note that the IP header contains the total datagram length, which can be used to determine how much data is in the message. The basic TCP header is 20 bytes and the minimum value of the header length is hence 5 (20 bytes = 5 32-bit words). TCP supports optional data in its header and this may make the header larger.
TCP Options: If the header length is greater than 5, that means there are more than 20 bytes in the header and TCP options are present. TCP options are located at the end of the normal 20-byte header. While approximately 32 different options are defined, some are obsolete and many are never used. The most commonly used options include:

Maximum Segment Size: defines the maximum segment size that will be used during a connection between two hosts.
Window Scaling: extends the number of bits for the receive window to allow the TCP window size to be specified as a 30-bit number instead of a 16-bit number.
Selective Acknowledgements: allow the receiver to inform the sender if it received any out-of-order segments.
Timestamps: an extension to allow more accurate mechanism to measure segment delivery time, including retransmissions.

Flags: A number of 1-bit flags are present in the header. These are:

ACK: informs the recipient of the segment that the acknowledgement field contains a valid sequence number.
RST, SYN, FIN: These are used to set up and tear down the TCP connection. The SYN (“Synchronize”) flag is used in the handshake used to set up a connection. The FIN (“Finish”) flag is used to close a connection. The RST (“Reset”) flag indicates that a segment was received for a closed or nonexistent socket.
PSH (“Push”): Tells the receiver to pass the received data to the application layer immediately. This flag is not used in practice.
URG (“Urgent”): Tells the receiver that the application data contains a region of “urgent” data, possibly along with “non-urgent” data. The 16-bit urgent data pointer is an index to the last byte of this data. As with PSH, the concept of urgent data is not used.
NS (“Nonce Sum”), CWR (“Congestion Window Reduced”), and ECE (“Explicit Congestion Expected”) are all part of an Explicit Congestion Notification protocol, which is an extension to IP. Not all routers support this and we will not cover this extension.

TCP Sequence Numbers

TCP views application data as a sequence of bytes and each byte in the sequence is assigned a sequence number. The sequence number in each TCP segment is the sequence number of the first byte in that segment. For example, if the current sequence number is 500 and we transmit a segment containing 1460 bytes, the segment will contain a sequence number of 500. If the following segment contains 1000 bytes, the sequence number of the segment will be 1960, which is 500 (the last sequence number) plus 1460 (the number of bytes that was sent with the last segment). Sequence numbers are 32-bit values and do not have to start with 0. The value may wrap around the 32-bit boundary, so all sequencing is done modulo 2³² (mod 4,294,967,296).

TCP Acknowledgement numbers

Received TCP segments are acknowledged by the receiver. TCP uses pipelining and permits several segments to be sent at once prior to waiting for an acknowledgement (more on this later).

An acknowledgement number is present in a received segment and is a 32-bit number that indicates the sequence number of the next byte that the remote host is expecting to receive next. For example, in the above example we sent 1460 bytes with a sequence number of 500. Upon receipt of this segment, the return segment will contain an acknowledgement number of 1960, the sequence number of the next byte that the receiver needs. The message it just received contained bytes numbered 500 through 1959. When the receiver receives the next segment, 1000 bytes with a sequence number of 1960, it will return an acknowledgement number of 2960.

It would be a waste of network resources to send back a TCP segment containing nothing but an acknowledgement number. While this is inevitable in some cases, if the receiver happens to have data to transmit back to the sender, the acknowledgement number is simply set in the TCP header of the transmitted segment, completely avoiding the need to send a separate acknowledgement. Using an outgoing data segment to transmit an acknowledgement is known as a piggybacked acknowledgement.

TCP uses cumulative acknowledgements. The acknowledgement number that a receiver sends inside a TCP header is always the sequence number that the receiver wants to see next. Going back to the earlier example, if a receiver receives 1460 bytes with a sequence number of 500, it sends back an acknowledgement for the next byte it wants: 1960. Suppose that the next segment it receives contains 500 bytes with the sequence number 2960. This means that the desired segment was either lost or will be arriving out of sequence. Upon receiving the segment with sequence number 2960, the receiver generates an acknowledgement (all segments get acknowledged) but the acknowledgement number is the number of the earliest sequence it does not yet have: 1960. Hence, the sender will get back duplicate acknowledgements for 1960.

To avoid sending too many data-free acknowledgement segments, a receiver is allowed to wait up to 500 ms (half a second) before sending an acknowledgement. If another segment comes in during that interval, a cumulative acknowledgement must be sent. For a steady stream of messages, therefore, cumulative ACKs need to be sent for every other packet. However, any out-of-order segments must be acknowledged upon receipt.

A receiver does not have to store segments that were received out of sequence but it is more efficient to do so and practically every implementation of the TCP protocol does this. Continuing our example, suppose that the receiver does get the segment containing sequence number 1960 after receiving the segment withe the sequence number 2960; the segment really did arrive out of order and was not dropped. Now the receiver can fill in its hole in the receive buffer. It just got bytes 1960…2959 and it already had bytes 2960…3459 from the earlier receipt of the segment with sequence number 2060. The acknowledgement it sends now will be the cumulative acknowledgement – the sequence number of the next byte it needs: 3460. The sender will see acknowledgements of 1960, 1960, and 3460. Had the transmitted segments arrived in order, the acknowledgements would have been 1960, 2960, and 3460.

TCP Connection setup and teardown

Connection setup

TCP employs a three-way handshake to set up a connection: a process of SYN, SYN-ACK, and ACK. The client initiates the process; it creates a random initial sequence number (client_isn) and sends it to the server in a TCP segment with the SYN flag set. The server receives this and allocates send and receiver buffers as well as variables. This set of data, containing all the information about a connection, is called the transmission control block (TCB). The server then creates a SYN-ACK segment that acknowledges the received sequence number and contains a random sequence number from the receiver. This segment also has the SYN bit set. Upon receiving this, the client acknowledges the server’s sequence number by sending the final ACK segment to the server and allocates its own TCP buffers and variables for the connection.

SYN flooding

Kernel memory is finite and the operating system will not allocate an unlimited amount of memory for managing TCP connections. A denial-of-service attack called SYN flooding sends a large number of SYN segments to a machine but usually uses an unreachable return address to never complete the handshake to set up a legitimate connection. The recipient normally allocates memory for each SYN segment that it receives, expecting each to become a legitimate connection. Eventually, kernel memory is exhausted and the operating system will not allow any more incoming TCP connection requests, including, of course, legitimate ones. The operating system will continue to refuse incoming connections until those incomplete ones time out. The connection setup timeout is an administrator-configured value and can range from half a minute to several minutes.

Several approaches have been proposed to deal with SYN flooding. One of these is SYN cookies. The key realization is that the kernel allocates memory (the TCB) before the connection is fully set up. With the technique of SYN cookies, no state is saved (no memory allocated) upon the receipt of a connection request. Instead, any needed information is encoded into the initial sequence number. Since that sequence number (+1) will be sent back in the final ACK from the client, the server will be able to validate that the ACK, and hence the requesting client, is legitimate. The initial sequence number that the server creates is a hash of the source and destination IP addresses, ports, and some secret value known only to the server. A client will not be able to spoof this value since it does not know the secret but a server can easily validate it from the acknowledgement number in the ACK message from a legitimate client.

MSS announcement

TCP provides an option during connection setup to tell the other side its maximum segment size (MSS); that is, the largest size segment that it is willing to accept. If both machines are on the same LAN, the MSS is likely to be the MTU of the network interface minus the size of the protocol headers (20 bytes for IP and 20 more bytes for TCP), although it can differ even here if, for example, one device supports jumbo (9000-byte) Ethernet packets and the other does not. The Internet requirement is that all IP routers support an MSS of at least 536 bytes.

Invalid messages

If a host receives a TCP segment where the port numbers or source address to not match any connection (e.g., the socket is closed or there is no listener on that address), it will send back a reset segment, a TCP segment with the RST flag set. In the case of UDP, an attempt to send a message to a port that does not have any process listening on it will result in the generation of an ICMP message back to the sender.

Connection teardown

Either the sender or the receiver can decide to terminate a TCP connection. Termination involves telling the other side that you are finished sending data, getting an acknowledgement, and then freeing allocated resources for that connection.

This is a two step process between the two hosts (let’s call them A and B). The the host that initiates the teardown, host A, sends a finished message: a TCP segment with the FIN flag set. It then enters the FIN_WAIT_1 state and waits for an acknowledgement from host B. This FIN message is a promise that host A will not send any more data on the connection. Host B sends the acknowledgement and enters the CLOSE_WAIT state. It may still have data to send, however. Upon receiving the acknowledgement, host A enters the FIN_WAIT_2 state and waits for host B to terminate its side of the connection. Host B does the same thing that host A did: once it has no more data to send, it sends a TCP segment with the FIN flag set and enters the LAST_ACK state, which means it is waiting for the final acknowledgement from host A. When host A receives the FIN message, it knows that it will receive no more messages from host B. It sends the final ACK to host B and enters the TIME_WAIT state, which is a timeout period to ensure that host B receives the final ACK and there are no stray packets for that connection in the network.

Timeouts

Since TCP does not know about the congestion or performance of the underlying network and the network does not provide any guarantees for packet delivery, TCP relies on a time limit to determine if any transmitted segments were lost. If an acknowledgement is not received within a specific time limit then the sender may assume that the segment was lost. Since IP provides no guarantees on the timeliness of delivery, this retransmission timeout (RTO) can only be a best guess. We need a value that is long enough to avoid excessive timeouts and retransmissions but is short enough to avoid excessive waits.

TCP keeps track of the average round-trip time of segments. Since these may fluctuate over time, TCP uses an exponentially weighted moving average (EWMA), which places approximately 10–20% of the weight on the most recent RTT measurement. This average of the RTT is called the smoothed round trip time, or SRTT.

The second thing that TCP keeps track of is how much the recently measured round-trip time deviated from the SRTT. This too is is an EWMA function, placing approximately 25% of the weight on the most recent deviation. The average delay is called the round-trip time variation, or RTTVAR. It is an estimate of how much the RTT typically deviates from the average RTT and is an approximation to the standard deviation, which would be slower to compute.

The retransmission timeout (RTO) value is set to a function of the SRTT and RTTVAR:

timeout interval = SRTT + 4 * RTTVAR

However, whenever a timeout occurs, this value is doubled. The timeout value is doubled for each timed-out retransmission up to a value of 64 seconds. This doubling of the timeout is called an exponential backoff. Whenever an acknowledgement is received without a timeout, the timeout interval is reset to its base value.

TCP Reliable Data Transfer

When an application writes data to a socket, the data is copied over to the send buffer for that socket’s TCP connection. Some time later (the time is not specified in the standards), TCP will grab data sitting in the send buffer and create and transmit one or more TCP segments, each with the appropriate sequence number. The first of these segments will cause the RTO timer to be set. Whenever any segment containing an acknowledgement from the receiver arrives, the timer is reset. If the acknowledgement number is greater than the sequence number of the base of the send buffer, the sender can remove all bytes prior to that of the acknowledgement number from the send buffer since it knows that the receiver has them. If a timeout occurs, TCP will retransmit only one segment, the non-acknowledged segment with the smallest sequence number, double the timeout interval, and restart the timer.

Let us consider a few cases to illustrate how this works.

Lost ACK

If an acknowledgement is lost and the sender times out (meaning that there were no additional segments sent), the sender will retransmit that segment. The receiver, upon getting it, will see that it does not need that sequence number (it is a duplicate segment) and will discard the segment but will send an acknowledgement back to the sender.

Another case is when the sender sent two segments but the acknowledgement to the first one was lost. Because acknowledgements are cumulative and the sender gets the second, cumulative acknowledgement, it knows that the sender has received all the bytes and a retransmission is not necessary.

Delayed ACKs

If the network round-trip time suddenly became much longer than expected, acknowledgements to a sequence of segments might arrive after a timeout. When the timeout occurs, the sender will retransmit only the earliest unacknowledged segment and restart the timer. Suppose now that acknowledgements for the previously transmitted segments arrive. TCP will process them and adjust the base of its send buffer if those bytes are no longer needed. When the receiver gets the duplicate packet, it will discard it but send an acknowledgement. Some time later, the sender will receive this acknowledgement but see that it is a duplicate and hence discard it.

TCP Fast Retransmit

As we have seen, TCP uses pipelining and can send multiple segments before waiting for acknowledgements. If a receiver detects a missing sequence number, it means one of two things: a segment was lost (either discarded due to queue overflows or due to data corruption) or that a segment is delivered out of sequence. TCP does not use negative acknowledgements but sends an acknowledgement for every received out-of-sequence segment with the sequence number of the next byte it needs. In the case where a segment has indeed been lost, every segment after that will be acknowledged with the same sequence number. Hence, the receiver will see a stream of duplicate acknowledgements.

If TCP receives three segments with duplicate acknowledgement numbers, it assumes a segment was lost (the assumption is that it is unlikely that one segment was routed or delayed such that three others arrived first). Instead of waiting for an RTO to occur, TCP will transmit that missing segment (the one with the sequence number of the ACK). This technique is called a fast retransmit because the protocol retransmits the segment without waiting for the RTO to occur.

Selective Acknowledgements

TCP behaves in a similar manner, but not quite the same, as the Go-Back-N protocol. The sender only keeps track of the earliest sequence number that has been transmitted but not acknowledged. A distinction is that, in the case of a timeout, Go-Back-N will retransmit all segments in its window while TCP will only retransmit the lowest (earliest) segment.

With this behavior, having the receiver store out-of-order segments is optional. The receiver may store them. If it does, and the receipt of a segment fills in a hole (missing segment), then TCP will send back a cumulative ACK, ensuring that the sender will not have to retransmit those segments. However, the sender does not know if this will happen and has to hold on to all unacknowledged segments in its send buffer just in case any or all of those segments have to be resent.

An optional enhancement to the TCP protocol uses the options field in the header to allow the receiver to send over a list of {start byte, end byte} values to identify the specific range of bytes that have been received. These are called Selective Acknowledgements and make the protocol behave like a Selective Repeat protocol. The acknowledgement number in the TCP header is unchanged but the list of received byte ranges allows the sender to remove those enumerated segments from its send buffer since it knows that they will never have to be retransmitted.

Receiver-regulated flow control

It would be pointless to keep sending data to the receiver if the receiver’s application isn’t keeping up with reading it. In addition to slowing down transmission due to congestion (which we will look at next), TCP allows the receiver to tell the sender to tell it how much space it has in the receive window (the free space in the receive buffer). It does this by placing the size of the receive window into the receive window field in the TCP header in each segment that is sent to the sender.

One problem that can arise is if the receive window size is zero, the sender would stop transmitting and not get feedback from the receiver once the window size became bigger when the application consumed some data. The feedback is absent because the receiver already sent acknowledgements for all received segments. To remedy this, the sender uses probing: if the receive window is zero, the sender will periodically send a message with one byte of data. The receiver does not have to accept this data if the window size is truly zero (the sender will retransmit) but this gives the receiver a chance to send an acknowledgement segment with a non-zero receive window once there is room in the buffer.

TCP Congestion Control

TCP tries to be a good network citizen and decrease the rate at which a sender sends traffic based on its perception of congestion in the network. This is in addition to TCP’s flow control where a recipient can tell a sender to send less data by giving it a smaller window.

TCP uses a sliding window approach to flow control. The way that TCP controls the rate at which data is sent on a network is by adjusting the size of its sending window. Recall that a window size regulates the number of bytes that can be sent before waiting for acknowledgements from the recipient. We have seen that a window size of one MSS gives us a stop-and-wait protocol where we can send out only one segment before waiting for an acknowledgement for the receipt of that segment. In general, the transmission rate is the window size divided by the round-trip time. TCP’s use of the sliding window algorithm makes the protocol self-clocking: packet transmission is timed by the receipt of acknowledgements; each acknowledgement slides the window.

A TCP connection receives a maximum size that the receiver can accept in the receive window field of the TCP header (abbreviated as rwnd). In addition to that, the transmitter will dynamically adjust its transmit window size based on its guess of network congestion. This window size is called the congestion window, abbreviated as cwnd. At any point in time, the maximum window size will be the smaller of these two windows: rwnd and cwnd.

IP provides no guarantees on the accuracy or timeliness of end-to-end packet delivery through the network, nor does it provide any data on the levels of traffic in the network or queue sizes at routers. TCP, therefore, relies on making assumptions on congestion based on observed behavior. If an RTO occurs or three duplicate acknowledgements are received, the protocol assumes a segment was lost. Segment loss often implies congestion, so TCP will decrease its transmission rate by reducing the size of its congestion window, cwnd. On the other hand, if TCP gets acknowledgements for sent packets then there is no packet loss and therefore no congestion. In this case, TCP will increase its transmission rate by increasing the size of its cwnd. This continuous approach of increasing the transmission rate until packet loss occurs and then decreasing it is called bandwidth probing. The connection tries to see if it can transmit faster, backing down when problems arise, and then trying again.

AIMD

TCP’s congestion control is called Additive Increase / Multiplicative Decrease (AIMD). While a connection is experiencing no packet loss, it will increase its congestion window by one MSS every round-trip time (RTT), hence increasing its transmission rate. This is an additive increase (linear increase). If cwnd was 15 segments and all 15 segments were sent and acknowledged, the window is then increased to 16 segments.

In practice, we don’t wait for all the segments in the window to be sent out before increasing the windows size. We increase the window size (cwnd) fractionally for each arriving ACK. The number of segments that fit into a window is cwnd/MSS. After that many segments have been delivered, we want to increase cwnd by one segment (MSS bytes). That means for each ACK (each received segment), we will increase cwnd by this fractional amount: MSS divided by segments_per_cwnd, or MSS / cwnd/MSS. The assumption is that we always have data to transmit so every segment will be of size MSS.

If TCP feels that it has congestion (because of lost segments), the congestion window is halved. This is a multiplicative decrease. AIMD is a necessary condition for TCP congestion control to be stable in the system as a whole and ensure that some connections do not end up monopolizing the network link.

States of TCP congestion control

TCP operates in one of three states: Slow Start, Congestion Avoidance, and Fast Recovery.

If we start with a congestion window of one MSS and increase it linearly, it can take a long time before we reach an effective transmission rate. TCP Slow Start prevents this slow ramp at startup by increasing the cwnd size exponentially. The congestion window starts at one MSS and increases by one MSS with each received ACK, causing it to double every RTT. Slow Start starts off slowly but speeds up quickly. It continues to increase until cwnd reaches a threshold level, called ssthresh (slow start threshold). Then the protocol switches to Congestion Avoidance. Initially, ssthresh is effectively not set (set to a maximum value), so the rate of transmission continues to increase exponentially until a transmission times out waiting for an ACK. At this time, the protocol sets the threshold, ssthresh, to one half of the window size that resulted in the RTO and restarts the Slow Start process. This second time, it will ramp up to the threshold and then switch to the Congestion Avoidance state (unless packet loss occurs).

Congestion Avoidance is the normal state for TCP. The state is entered after Slow Start reaches its threshold, which is one half the last window size that experienced an RTO due to packet loss. Congestion avoidance continues to increase the window size (cwnd) but does so linearly: one MSS per RTT. This continues until one of two events occur. If an RTO occurs, then ssthresh is set to half of the cwnd (half of the current window size when the packet loss occurred), cwnd is set to one MSS, and the protocol moves to the Slow Start state. If three duplicate ACKs are received then instead of going all the way back to cwnd=1 and a Slow Start, the protocol switches to Fast Recovery.

Fast Recovery is entered only when three duplicate ACKs are received. The receipt of three or more duplicate ACKs is a sign that we lost a segment but data is still flowing between the sender and receiver since each of those ACKs was generated when a segment was received. They are duplicates because one needed segment was not received. Fast Recovery assumes that cwnd is the estimated system capacity. Instead of reducing data flow abruptly by going into Slow Start, ssthresh and the congestion window are cut to half of their current size (this is a multiplicative decrease). Fast Recovery loops, picking up every duplicate acknowledgement it receives and increases cwnd by 1 MSS each time it does so. This includes the three duplicates that caused TCP to enter this state. Once a non-duplicate acknowledgement is received, cwnd is set back to the threshold, ssthresh, and the state is switched to Congestion Avoidance (additive increase). Should a retransmission timeout occur, the same thing happens as anywhere else that an RTO occurs: ssthresh is set to half the window size, cwnd is set to one MSS, and TCP switches to the Slow Start state.

Summary of TCP congestion control

TCP congestion control is always operating in one of three states.

Whenever the congestion window, cwnd, is below the slow start threshold, ssthresh, the protocol is in the Slow Start state and the window increases exponentially.
Once cwnd reaches ssthresh, TCP enters the Congestion Avoidance state and grows linearly.
Whenever three duplicate ACKs are received, ssthresh and cwnd are halved and all duplicate ACKs are picked up before going back to the Congestion Avoidance state. Should an RTO occur in any state, ssthresh is set to cwnd/2, cwnd is set to one MSS, and TCP switches to the Slow Start state.

ICMP is the Internet Control Message Protocol, a IP network-layer protocol designed for sending various status and error messages to hosts and routers on the Internet. ↩

Network Layer

We looked at the transport layer (layer 3), which allowed applications to communicate with other applications over logical communication channels. The transport layer sits on top of the network layer (layer 4). The network layer provides host-to-host communication and is responsible for routing packets (called datagrams at the network layer) from a source host to a destination host.

A route is the path that a packet takes through the network. Routing is the process of moving the packet along the route. Routing algorithms figure out the route that the packet will take. A router is a host that forwards packets from an incoming link to a specific outgoing link as determined by the route. Forwarding is the process that a router uses to transfer a packet from an incoming link to the specific outgoing link. A router consults a forwarding table (also known as a routing table) that uses information in the packet headers to determine the outgoing link. The forwarding table is configured by routing algorithms.

Ideally, we might expect a variety of guarantees from the network. These include:

Guaranteed (lossless) datagram delivery
Time limits on datagram delivery
In-order datagram delivery
Guaranteed constant end-to-end bandwidth or the ability to offer a specific minimum bandwidth
The ability to specify maximum jitter. Jitter is the variation in the latency of packet delivery.
Security services, such as authenticating the source and destination as well as encrypting contents through the network.

The Internet Protocol, IP, gives us none of these. It provides best effort packet delivery but makes no guarantees on the reliability of delivery, bounds on delay, jitter, or packet order. Other network technologies, such as ATM (Asynchronous Transfer Mode), provide some of these capabilities. ATM is a virtual circuit (VC) network that provides logical connections at the network layer. All routers in the path are involved in setting up and maintaining the connection. For example, ATM’s CBR (Constant Bit Rate) service allows the connection to request a specific constant bandwidth and specify constraints on jitter and packet loss. The network will also guarantee in-order delivery.

A datagram network, such as IP, provides connectionless service at the network layer and relies on the transport layer to provide connection-oriented service. Only end hosts are involved in providing transport-layer service; the network layer itself is oblivious.

Virtual Circuit Networks

Before examining datagram networks, we’ll take a quick look at virtual circuit networks. Unlike datagram networks, virtual circuit networks require a connection setup phase where an end-to-end route is established and each router along the path agrees to participating in the path and commits necessary resources (e.g., buffers for queues) to ensure it can deliver the desired level of service being requested.

A host that initiates a connection request for a virtual circuit (a communication channel) identifies the virtual circuit with a number. As the path for a virtual circuit is set up, each router enters the input port/output port mapping for that path in its forwarding table and designates a virtual circuit number for the outgoing link (easier than allocating a virtual circuit number that may need to be unique globally). Unlike datagram routers, virtual circuit routers need to maintain connection state information. For communication, each packet only needs to contain a virtual circuit number. There is no need to specify the source or destination addresses since each forwarding table can look up the incoming interface and virtual circuit number, find the outgoing interface and change the virtual circuit number for the next link.

Datagram networks

With routers on datagram networks, there is no a priori setup of the route from source to destination. Indeed, the route may change during a communication session. Each datagram must be identified with the destination address of the endpoint. A router uses this destination address to forward the packet to the next network link. A forwarding table on a router allows it to determine the outgoing interface for a given datagram. IP addresses are 32 bits long (for IPv4; IPv6 addresses are 128 bits long). That gives 2³² (or 2¹²⁸ possible addresses. It is not feasible to have a table of over four billion entries. Instead, a forwarding table is based based on matching a prefix of a number of most significant (leftmost) bits in the address. The fewer bits in the prefix, the more addresses are matched for that prefix. Since the forwarding table may have a mix of longer and shorter prefixes, it uses a longest prefix matching rule so that longer, more specific, prefixes are tested prior to shorter, more general, prefixes.

Router Architecture

An IP router comprises two part: a control plane and a data plane. The control plane is responsible for the high-level software of the router. It runs a routing processor that implements the user interface, runs routing protocols, populates forwarding tables, implements the ICMP protocol, and controls queue behavior.

The data plane is responsible for packet forwarding. Its purpose is to move packets from the input port to the output port on a router as quickly as possible. Because of the need to move as manay as tens of millions of packets per second per port, the data plane is generally implemented in hardware. Note that on a router, a port refers to the input and output interfaces and has nothing to do with the use of the term port at the transport layer. A line card is the hardware that is responsible for implementing the input and output ports for a specific interface (e.g., an Ethernet interface). Because the router operates at layer 3 (the network layer), the data plane must process layers 1, 2, and 3 of the protocol stack:

At layer 1, it has to retime and regenerate the signal at the output port.
At layer 2, it has to create the new datalink headers and checksums for transmission onto the selected output port
At layer 3, it has to extract the destination address, search the forwarding table to determine the output port, decrement a time-to-live count, regenerate a checksum, and forward the datagram to the output port.

The input port of a router implements the link-layer protocol to accept incoming packets (frames) on the physical interface of the line card. It decapsulates (extracts the data encapsulated in the frame) the layer 3 datagram to get the IP packet, validates the protocol version number, and updates the packet’s time-to-live (TTL) field. If the TTL field reaches 0, the packet is dropped and a message is sent to the routing processor in the control plane to send back an error packet. The input port then performs a lookup in the forwarding table (using a longest prefix match) to determine the required output port to which the packet needs to be delivered.

The output port of a router accepts outbound datagrams and encapsulates them with the appropriate link-layer headers (e.g., Ethernet). Like the input port, it implements the link-layer protocol to transmit these outgoing packets (frames) on the physical interface of the line card.

A packet is delivered from the input port to the output port via the router’s switch fabric. This is a general term for the architecture that allows the movement of packets between the line cards. This packet delivery may need to be delayed if the switch fabric cannot currently accept the packet or if another input port is currently moving data to that same output port. In that case, the packet will need to wait in a queue at the input port.

A router will have queues at both input and output ports. The output port maintains a queue of packets received from the switch fabric and transmits them using the link-layer protocol for the outbound interface.

Queues, of course, are finite in size and have the risk of overflowing and therefore causing packet loss. If the queue at the output port is full, there is no room for the packet and it will have to be dropped or some other packet in that queue will have to be deleted. The simplest algorithm is first come, first served (FCFS) queuing. A more sophisticated one may place a priority on the source, destination, protocol, or even a service level that may be embedded in the packet. Active Queue Management (AQM) refers to the algorithm in place to make the decision of which packet gets sent next and which packet gets dropped if the queue is full.

At the input port, packets may be queued if they cannot be forwarded to the output port quickly enough. Queuing is susceptible to head-of-the-line blocking. If a packet cannot be immediately forwarded to an output port (typically because that port or switching fabric is in use by another line card), not only is the packet delayed but all the packets queued behind it are blocked.

There are several router architectures and the choice of design largely depends on cost and the performance needs of packet forwarding. Every one of these architectures is in use.

Conventional shared memory: A conventional shared memory design is not different from that of a PC with each input/output pair of ports functioning as an input/output device. An incoming packet at a port generates a system interrupt. The operating system copies the packet from the transceiver to the system’s memory. The processor runs a network stack whose network layer searches the routing table to determine where the packet needs to be forwarded. The packet then travels back down the stack onto the desired output port. The limitation of this design is that only one memory operation can take place at a time. Moreover, the single CPU and system bus can become bottlenecks.
Shared memory with distributed CPUs: To alleviate the CPU bottleneck, this design incorporates a CPU on each line card. This gives each line card the intelligence to process the three layers of the stack, and determine the destination port of the packet without having to make use of the central CPU, shared bus, or main (shared) memory. The central CPU is responsible for the control plane: providing the administrative interface and creating the forwarding table. It pushes a copy of the forwarding table onto each line card. The shared memory and shared bus are used to allow the processor on one line card to copy the packet to another line card. The fact that the bus is shared and only one memory operation can take place at a time in shared memory can still result in a bottleneck for moving packets between line cards.
Shared bus, no shared memory: To alleviate having to use shared memory as an intermediate buffer in copying packets, this design allows one line card to send a packet directly to the memory of another line card using a common shared bus. The shared bus is now the performance bottleneck.
Switched (crossbar) data path: The final variation of architectures removes the shared bus and replaces it with a crossbar switch. This is a considerably more expensive option but allows one line card interface to switch directly to another line card interface without affecting the ability of other line cards to communicate.

Network Layer: Internet Protocol

The Internet Protocol (IP) has three components:

The IP protocol itself, which deals with addressing hosts, formatting datagrams, fragmenting and reassembling datagrams, and forwarding datagrams through routers.
Routing protocols, which determine network connectivity and how forwarding tables are configured at routers
The Internet Control Message Protocol (ICMP), which is a network-layer protocol for error and status reporting.

The IP datagram

The IP datagram comprises a 20-byte header, a variable-size options field after the header, and the payload, which will typically be the TCP or UDP segment. It contains a 32-bit source IP address, which identifies the sender, and a 32-bit destination IP address, which identifies the recipient.

A time-to-live (TTL) field is a counter that is designed to keep packets from circulating indefinitely in the network case forwarding tables accidentally create cycles. An IP datagram is typically initialized with a TTL of 60 or 64 and the TTL is decremented by one each time it enters a router. If the TTL reaches zero, the router will discard the packet.

A protocol field in the datagram identifies the higher-layer protocol that is contained within the data. Common values are 6 to identify the data as a TCP segment and 17 to identify the data as a UDP segment.

A header checksum field contains a 16-bit header checksum. This is calculated with the same formula as UDP and TCP checksums. Only the IP header is checksummed. A router has to recompute the checksum since the TTL field (and possibly the options field) will change with each network hop.

Fragmentation and reassembly

If a router needs to forward a packet to a link that has a smaller MTU (maximum transmission unit) than the incoming link, it is possible that the IP packet may be too large to be transmitted as a single frame (packet) on the outgoing link. To handle this situation, IP supports fragmentation. If a packet is bigger than the MTU of the outgoing link, a router can split the datagram into two or more fragments. Each fragment is a separate IP datagram with its own IP header. When the fragments reach their ultimate destination, the receiving host must reassemble them into a complete packet before passing them to the transport layer.

Fragmentation is controlled by two data fields and two one-bit flags in the IP header. A don’t fragment (DF) bit tells a router that fragmentation is not permitted on a datagram. This may result in the inability to route the datagram. If a router makes the decision to fragment the datagram, the datagram is split into two or more fragments. Each transmitted IP datagram contains an identification number in the identification field of the IP header. This is set when the original datagram is created and is typically an incrementing counter for each successive datagram. When a datagram is fragmented, the IP header of each datagram holding a fragment contains the same ID number. This tells the receiver that those datagrams are part of the same original datagram. Each fragment also contains a 13-bit fragment offset. This is a number that is multiplied by eight to indicate where the data in this fragment belongs in the reassembled datagram. The first datagram contains an offset of zero. Each datagram fragment except for the last one has a more fragments (MF) bit set to one. The last fragment will have MF=0 and the fragment offset along with the IP length field will indicate the length of the final reassembled datagram.

IP addressing

Our discussion focuses on IP version 4, which is the most widely deployed version of IP. IPv4 addresses are 32-bits long. Every interface on an IP network must have a unique IP address. If a host has two interfaces (e.g., Ethernet and 802.11 links), it will have one IP address for each link. If a router has 128 ports, it will have 128 IP addresses.

We earlier discussed that it would be impractical for addresses to be randomly assigned as each router would have to have to be able to look up an individual address in a forwarding table of over four billion addresses. Moreover, routing algorithms would need to manage information about the route of every single address on the Internet. Instead, groups of adjacent addresses are assigned to an organization. Rutgers, for example, has been assigned all addresses with the top 16 bits of 128.6. A router would need to know where to forward anything that starts with 128.6 rather than maintain a table of all the 2¹⁶ (65,536) possible addresses that may start with 128.6. This ability to use one prefix to refer to a route that may span multiple sub-networks or hosts is called route aggregation.

A subnet (also called a subnetwork or a network) is a group of adjacent IP addresses that share a common prefix and are assigned to an organization. A subnet makes up a logical network that is connected to a router. For example, routers on the Internet needs to know how to route an address starting 128.6 to Rutgers. Subnets are expressed in CIDR (Classless Inter-Domain Routing) notation, whose format is a 32-bit IP address that comprises the identifying bits of the subnetwork followed by a slash an the number of bits that identify the subnetwork. For example, 128.6.0.0/16 means that the top (leftmost) 16 bits of the address 128.6.0.0 identify the subnetwork. The subnetwork logically divides an IP address into a network part (the bits that make up the subnet) and the host part (the bits that identify the host within the subnet).

A subnet mask (also called a netmask) is a bit mask that contains ones in the positions of the network bits of the address. For Rutgers, this means the top 16 bits will be one, resulting in a subnet mask of 255.255.0.0. A subnet mask is used to strip the host bits from the address to match prefixes in a forwarding table.

Subnetworks are hierarchical. An Internet service provider (ISP) will often be assigned large blocks of IP addresses by a Regional Internet Registry (RIR). Routers between ISPs will need to know which block of addresses is handled by which ISP. A specific ISP will allocate smaller blocks of IP addresses to organizations or lower-tiered ISPs. This is not relevant information outside of the ISP since outside routers only need to know how to reach one of the ISP’s routers. Routers within the ISP need to route to the organizations that were allocated those addresses. This process can continue iteratively. Within Rutgers, for example, are multiple networks that use blocks within the 128.6.0.0/16 allocation. For instance, the host aramis.rutgers.edu has an address of 128.6.4.2 and a netmask of 0xffffff00. This indicates that it is in a subnetwork that is defined by the prefix 128.6.4.0/24.

IP supports several special addresses: bit patterns that cannot be used as generic host addresses. An address of 255.255.255.255 represents a limited broadcast address. This is a broadcast address for the host’s network. Datagrams directed to this address will be delivered to all hosts on the directly-connected network but routers will not forward them to other networks (they are limited to the same local area network as the sender). An address with only the host bits set to one (e.g., 128.6.255.255) represents a directed broadcast address. Datagrams directed to this address will be routed to the specified subnet (if the router permits it) and delivered to all hosts on that subnet (they are directed to a specific subnet). Routers may be configured to forward these datagrams to ensure that they are delivered to subnets outside the directly-connected local area network.

Host configuration

A regular host on the Internet needs to know a few key parameters:

Its IP address, so it can identify itself in the source address field of an IP header.
Its subnet mask. Using the subnetmask along with the IP address, it can identify its own subnet and hence identify which addresses are on the local subnet and which ones need to be directed to a router.
Its gateway. This is a router on the LAN and the default address for non-local addresses that are not in a host’s local routing table. A gateway is a simple router that routes datagrams between the LAN and another network.
One or more domain name servers. It needs to know the address of at least one name server so that it can look up Internet domain names and find corresponding addresses.

These four parameters can be configured manually. Alternatively, the Dynamic Host Configuration Protocol (DHCP) can be used to do this automatically.

DHCP is a protocol to allow a client to get an IP address for itself as well as essential network configuration parameters. The challenge with developing such a protocol is that it has to work before the client has a valid address on the network. Hence, a conventional request-response protocol with source and destination addresses will not work. A requirement for DHCP is that the DHCP server has to be running on the same local area network as the host. If not, a DHCP Relay Agent must run that serves as a proxy and forwards requests and responses to the remote DHCP server.

DHCP uses limited broadcast messages (255.255.255.255). A client is allowed to send a limited broadcast and is capable of receiving one even if does not have an address assigned. DHCP works in four steps, with an acronym of D-O-R-A to describe them.

Discover. The client sends a limited broadcast DHCP Discover UDP message to port 67. This contains a random transaction identifier.
Offer. The server listens to broadcasts coming in on port 67. It gets the Discover message and responds back by sending a limited broadcast DHCP Offer UDP message to port 68. The response contains the following parameters:
- Matching transaction identifier
- Proposed IP address
- Subnet mask
- Lease time
Request. The client picks up the server’s Offer message. It compares the transaction identifier to ensure that the offer is not directed to another client. If there have been multiple DHCP servers and it received multiple offers, it selects the one it wants to accept and ignores the others. The client responds with a Request message that contains a copy of the parameters in the Offer.
ACK. The server associates the offered parameters with the host and sends back a DHCP ACK message acknowledging the association. The client can now configure its network with those parameters.

DHCP can be used in several scenarios:

Automatic allocation. DHCP can be used to assign a permanent IP address to a host.
Dynamic allocation. DHCP can be used to lease an address to a host. The host may use the address for a specified period of time. This allows the reuse of an address after it is no longer needed by the host. A Wi-Fi hotspot is a common example of this use of DHCP.
Manual allocation. An administrator can configure the DHCP server to assign a specific address in response to a DHCP Discover message. This is done by associating the host’s link layer address (e.g., Ethernet MAC address) with a specific IP address.

Network Address Translation (NAT)

In order to move datagrams between hosts on the Internet, each host interface needs to have a globally unique IP address. If this is not the case, routers will not be able to route the packet to that interface. The need for this, of course, creates a huge need for IP addresses. An organization with 10,000 hosts would need 10,000 IP addresses.

Network Address Translation (NAT) addresses this problem by allowing an organization to create a private IP address space within the organization while presenting one, or a small set of IP addresses to the outside Internet. As a packet flows through a NAT-enabled router, the router uses a NAT Translation Table to map a source {private-address, port₁} to a {public-address, port₂}. When packets flow back to the router from the outside, the router uses the NAT Translation Table to perform the inverse mapping of {public-address, port₂} to {private-address, port₁}.

To enable NAT, the gateway router has to look at, and possibly modify, the transport layer header since since a source port number may need to be changed to one that is not used by any other internal-external mapping at the router.

The private address space within the organization must contain a range of addresses that are not used by any hosts on the public Internet. Otherwise, there would be ambiguity as to which host is being addressed and where it is located. Hence private addresses are non-routable on the Internet and can only be used in internal networks. RFC 1918 defines three address blocks that can be used for these addresses.

Hosts in a NAT environment cannot accept incoming packets unless a host/port mapping has been established by an outgoing packet. As such, NAT is not particularly useful for servers but is incredibly useful for client machines.

ICMP

The Internet Control Message Protocol (ICMP) is a simple network-layer protocol that was designed to allow hosts and routers to communicate network-related information. ICMP is an eight byte or greater segment that sits in the payload (data section) of an IP datagram. It contains a checksum over the ICMP header and associated data as well as type and code fields, which define the purpose of the message. Depending on the message, four additional bytes may specify parameters to the message and optional data may contain the IP header and first eight bytes of the original datagram for which ICMP is generating a report.

The most common ICMP message types include an echo request (ping), echo response (ping), a destination unreachable status, a TTL exceeded warning, and a bad IP header error.

The ping program is an example of a service that uses ICMP. It creates a raw socket and generates an ICMP message of the type echo request (type 8). When the message is routed to the destination host, the ICMP protocol sends back an ICMP echo reply (type 0) datagram.

The traceroute program traces a route to a specific host. It also uses ICMP by sending a series of UDP segments to a bogus destination port on the desired host. Each UDP segment has a progressively longer time-to-live (TTL) value in the IP header. The first router will not route the datagram with a TTL of 1 since it decremented to 0 and hence expired. Instead, the router sends back an ICMP TTL exceeded warning message that contains the name and address of the router in the body of the ICMP message. The datagram with a TTL=2 will be routed by the first router but will be rejected by the second one, and so on.

IPv6

We have thus far discussed IP version 4, the most widely deployed version of IP. As IP was rapidly using up allocatable subnetworks due to its 32-bit address size, design on a successor protocol, called IPv6, began in the mid 1990s.

IPv6 uses a huge address space: 128-bit addresses compared with IPv4’s 32-bit addresses. A 128-bit address allows for 3.4×10³⁸ addresses, which is 8.9×10²⁸ times more than IPv4. Even though its addresses are longer, IPv6 uses a simplified header compared to its predecessor. It is a fixed-length headers with fewer fields. An optional extension to the the header supports less-frequently used options and additional capabilities. Under IPv6, routers will never fragment IPv6 datagrams. This differs from IPv4, where a router may do so if the outbound link has a smaller MTU. With IPv6, the sender is expected to perform a path MTU discovery ahead of time to determine the minimum transmission unit for the entire route. To handle cases where higher levels of software might create larger datagrams without checking the path MTU, IPv6 does support fragmentation by the sender. Since fragmentation is often not used, however, the fields related to managing it are relegated to this optional header extension. There is also no header checksum. The designers reasoned that the link layer has a checksum and TCP as well as UDP include critical IP fields in their checksum computation.

Transitioning to IPv6 has been a challenge in a world with widespread IPv4 deployment. IPv6 systems can bridge to IPv4 systems since the IPv4 address space is mapped onto a subset of the IPv6 space. The problem is that IPv4 systems cannot effectively communicate with IPv6 systems due to its larger address space. A system using IPv6 may not be visible to a system on an IPv4 network. Most systems today are dual-stack systems, with both network stacks implemented and capable of using either protocol. In areas with widespread IPv4 deployments, such as the U.S., IPv6 is finding most of its initial deployment in less visible areas, such as cable modems, set-top boxes, and VoIP (voice over IP) MTAs (multimedia terminal adapter).

Routing

Routing algorithm goals

Routers connect networks together at the network layer and are responsible for moving datagrams (routing) from one link to another. In many cases, a datagram will have to flow through multiple routers and there are multiple possible paths that the datagram can take to reach its destination. The goal of a routing algorithm is to figure out a good path, or route, for a datagram to take to get to its destination. By good, we mean an algorithm that will minimize the cost of the overall route. That cost may be either time (quickest route) or money (if there are financial costs that differ between different routes).

For purposes of analysis, a route may be represented as a connected graph, G = (N, E), where N is the set of nodes (vertices) (routers, in real life) and E is the set of edges (links between the routers in real life). A connected graph is one where, given two nodes a and b, there is some path from a to b. Each edge is identified by a pair of nodes. A node y is considered to be a neighbor of node x if the edge (x, y) exists in the graph. That is, (x, y) ∈ E.

Each edge has associated with it a value that represents the cost of the link. We represent the cost of an edge between nodes x and y as c(x, y). If there is no edge (x, y) in the graph then the cost c(x, y) is infinite (∞). If we need to route from node x to node y in this case, we will need to establish a path through some other nodes. For the purposes of our analysis, we will assume that a link has the same cost in each direction. That is, c(x, y) = c(y, x).

A path in a graph G = (N, E) is a sequence of nodes (x₁, x₂, … x_p) such that each of the pairs (x₁, x₂), (x₂, x₃), etc. are edges in E: a path is a sequence of edges. The cost of a path is the sum of the edge costs. Since there may be multiple paths from one node to another, one or more of these will be a least-cost path. If all edges have the same cost, that least-cost path will also be the shortest path.

Categories of routing algorithms

There are two categories of routing algorithms. A global routing algorithm relies on complete knowledge of the network graph. The algorithm, as input, knows the connectivity graph: all the nodes and edges. Algorithms in this category are known as link-state (LS) algorithms. Dijkstra’s shortest path algorithm is an example of an LS algorithm.

In a decentralized routing algorithm , no node has complete knowledge of all links in the graph. A node initially knows only about its direct links. Through an iterative process of exchanging lists of nodes and costs with its neighbors, a node will eventually compute the least-cost path to any destination in the graph. The distance-vector (DV) algorithm is an example of a decentralized routing algorithm.

Link-State: Dijkstra’s shortest path algorithm

Dijkstra’s algorithm is a global algorithm that assumes that the entire network topology (graph nodes, edges, and edge costs) is known to the algorithm. In an implementation, each node will need to broadcast any link change information to every other node so that all nodes will have an identical and complete view of the network.

Dijkstra’s algorithm is an iterative algorithm that, after k iterations, will compute the least-cost paths to k nodes from some given initial node. For each iteration, the algorithm keeps a list of nodes, N’ for which the lowest cost path has already been found (the current node requires no edge and is the initial element on this list). For each node in the graph, the algorithm stores:

D(v): the distance, our current knowledge of the least cost path to get to node v.
p(v): the previous node before node v along the least cost path that we have so far.

Initialization

Let us assume that we need to find the least-cost routes from some node u to all other nodes. The list of nodes with a known least-cost path, N’ is initialized to u, our starting node. The distance for each node v, D(v) is set to the cost of the edge from u to v, c(u, v). If there is no edge between the nodes, the cost is set to infinity. For each node v with a non-infinite cost, the previous node, p(v), is set to u since the entire path at this time is simply a single edge from u to v.

For each iteration

Each iteration picks a new node, n, and examines the total distance to each of n’s neighbor nodes from u through n. Here are the steps.

Pick a node n that is not in N’ and has the smallest distance D(n). This will be the node that we will examine this iteration and for which we will find the definitive least-cost path.
Add node n to the least-cost list N’.
For each neighbor m of node n that is not in N’, compute the cost of the route through n. This is D(n) + c(n, m); that is, the cost to n plus the cost of the edge from n to m.
If this computed cost to m through n is lower than the value we currently have for D(m), update D(m) with the new cost and set the previous node of m, p(m), to n since the path through n resulted in a lower cost.

Eventually, all nodes will be in the list N’ and there will be no more nodes left to process. At this time we have computed the least-cost paths from u to all nodes in the graph.

Finding the least-cost route

After running Dijkstra’s algorithm for a starting node u, we know the least cost to each node v and the node that is encountered right before v on that least-cost path: p(v). We can work backwards from this to compute the full route. For example, if the previous node for some node z is w, we can look up p(w) to find the node before w along the least cost path to u. Suppose that is r. We then look up p(r) to to find the node before r along the least cost path to u. Suppose that is u, our starting node. We now reconstructed the least-cost path: u → r → w → z.

A routing table at u is interested not in the last hop or the entire path, but only in the first hop along the least-cost path so it can forward its datagram to that next router. In the routing table, we would need an entry that states that datagrams for the range of addresses handled by z need to be forwarded to router r.

Oscillations

If we have an environment where link costs vary to reflect the current traffic volume on the link, the lowest-cost paths will favor uncongested links. This will cause routers to send more data over these low-traffic links, which will, in turn, increase the level of traffic on these links and take traffic away from what used to be high-traffic links. When the algorithm is re-run, lowest-cost routes will now be recomputed to be the formerly high-cost routes since we channeled traffic away from them. As the new routing table is used, we will see the same phenomenon happen again as traffic is shifted to what used to be low-volume links. This results in oscillations between network routes each time the LS algorithm is re-run. The best approach to avoiding these oscillations is to have routers run their LS algorithm at random times rather than all at the same time.

Bellman-Ford equation

The distance-vector algorithm is based on a simple principle that is embodied in the Bellman-Ford equation. This equation states that the least cost to get from node x to node y is to go through a neighbor node v where the cost form x to v plus the cost from v to y is the smallest.

Let’s look at a two-node example. Suppose that you are in Geneva, want to get to Munich, but must travel through either Zurich or Turin. Your cost is travel time. The paths from Zurich or Turin to Geneva may involve several more stops but you don’t care about that. You just know the cost to your neighbors, Zurich (3:00 hours) and Turin (2:40) and you know the cost from Zurich to Munich (3:15) and the cost from Turin to Munich (6:00). To determine the best first leg of the route, you want to minimize the overall cost. The cost (time) via Zurich is 3:00 + 3:15, or 6:15. The cost via Turin is 2:40 + 6:00, or 8:40. Hence, you choose to use Zurich as your first hop. Pretty simple, eh?

Distance-Vector Algorithm

The distance-vector (DV) algorithm is a distributed algorithm where a node (router) communicates only with its neighboring nodes. The DV algorithm is iterative. A node will send messages to its neighbors only when its local link costs change or it receives a distance vector update message that results in the node changing its least-cost route estimate to some destination.

A node n’s distance vector is a list of costs from n that each other node in the graph. Unknown costs are infinity. Each cost is considered to be an estimate as it may be based on incomplete data. Eventually, the algorithm converges and the costs become true least-cost values. In the DV algorithm, a node sends its distance vector to its neighboring nodes.

A node also keeps a distance vector table. This is a set of distance vectors that it has received from its neighbors. These distance vectors are used to allow a node to check whether it is more efficient to have a route that goes through a specific neighbor, m, than the node it currently thinks is the next hop on the shortest path.

Initialization

Initially, a node knows only of its neighbors and the cost to each neighbor. The distance vector is a set of (node, cost) tuples, with the cost being the cost of the direct link to the neighbor. Send a copy of the distance vector to each neighbor.

Operation

Suppose that a node n receives and saves a distance vector from another node, m. It then does a node-by-node comparison of all the nodes in both vectors. For a given node x:

Is the cost of routing to x through node m lower than the current cost estimate? That is, is the cost to x supplied in node m’s distance vector plus the cost of the link from x to m result in a smaller value than n currently has?
If yes, update n’s distance vector for the destination m to the value computed by going through x. If no, leave the distance to m unchanged. To build a routing table, n would also record that the currently-known lowest-cost to m is via x. Unlike the link-state algorithm, the distance-vector algorithm keeps track of first hops rather than last hops.
Anytime a node initializes or updates its distance vector, it sends a copy to each of its neighbors.

Eventually, no node will experience any changes to its distance vector and therefore will not send any updates to its neighbors. The algorithm has converged.

Poison reverse

If a link to a node fails at some point, a neighbor will mark the cost of that link as infinite. As long as alternate paths exist, the algorithm will converge and use alternate paths. However, consider (for example) a case where there is only a single link to node C from B. Node A tells its neighbors that it can route to C (by going through B). If the link between C and B fails, node B., using A’s information about the route, will attempt to use A as an alternate route, not realizing that the attempted route will be B → A → B → C. This will result in an infinite sequence of distance vector updates between B and A, each advertising a progressively higher cost as they factor an additional link cost between the AB link each time. This is known as the count-to-infinity problem.

A technique called poison reverse tries to mitigate this problem. Since A knows that it needs to route through B to get to C, whenever A sends its distance vector to B, it will set any costs whose first hop is B to infinity. This will avoid creating a routing loop.

Internet Routing

Autonomous Systems

An autonomous system (AS) is a collection of routers and hosts that are administered together and expose a single routing policy to other systems on the Internet. ASes provide a two-level routing hierarchy in the Internet. Organizations manage their own infrastructure and expose only limited connectivity to others. Each AS comprises one or more subnetworks and serves one or more ranges of IP addresses. It advertises the set of IP prefixes that it can route (via CIDR and route aggregation to minimize the length of the list). The AS is responsible for the routing of traffic within its AS, whether it is routing it to another AS or to a machine within its own AS. The Internet can be viewed as a set of connected ASes, with packets being sent from one AS, possibly routed through other ASes, and delivered to a target AS. Routing on the Internet takes place between ASes. Using ASes simplifies the problem of dealing with the billion hosts on the Internet.

An Intra-AS routing protocol, called an Interior Gateway Protocol (IGP) runs within an AS. It is up to the system administrator to pick a protocol (e.g., an LS or DV algorithm) and manage the routes between machines within the AS. The outside world does not see the routes within an AS. Some Intra-AS routing protocols are RIP and OSPF.

Gateway routers are routers within an AS that connect to other ASes and forward packets between them. In addition to running an intra-AS routing protocol with other routers inside the AS, they are also responsible for routing to destinations outside the AS by sending datagrams to other gateway routers. An Inter-AS routing protocol, called an Exterior Gateway Protocol (EGP) runs on gateway routers and enables them to learn of routes to addresses served by other ASes or routed by the gateway routers within that AS.

Logically, an external AS looks like a router. An AS will be connected to some other ASes and needs to learn which destinations are reachable via those ASes. Some of those ASes may need to relay the datagram in order to send it onto other ASes. If a desired subnet can be accessed through either of several ASes (that is, either of those ASes can route to that subnet, not that the subnet belongs to multiple ASes), then a common approach is to have the AS send the packet out onto another AS in the least-cost manner. This is called hot-potato routing. The goal is to find the lowest cost path to any gateway router that can then route to some AS that can deliver the packet. Since ASes are owned by different organizations, everyone on the Internet must agree to use the same inter-AS routing protocol. Currently, this protocol is BGP version 4.

Autonomous System Taxonomy

Each Autonomous System is assigned a unique ID by a Regional Internet Registry (the same organization that assigns blocks of IP addresses). Policies defined by the owner of the AS determine if the AS will route traffic to other ASes and, if it will, whose traffic it will route.

A Transit Autonomous System is an AS that provides the ability to route traffic from one AS to another AS. A Tier 1 ISP represents a transit autonomous AS (or set of ASes) that does not pay any other network for transit. It peers (and connects directly) with every other Tier 1 network so it can route an IP address directly to the Tier 1 network that oversees it.

A Transit relationship on the Internet is considered to be one where an AS sells access to the Internet. That is, it agrees to act as a router between ASes but traffic is metered and charged. A peering relationship is one where a pair of ASes agrees to exchange traffic with each other for no cost. A Tier 2 ISP is an AS (or set of ASes) that needs to purchase Transit to connect to some parts of the Internet. Establishing a peering relationship avoids the need to purchase Transit.

A stub Autonomous System is an AS that is connected only to one other AS (run by an ISP). Because it is connected to just one AS, it cannot provide transit. A multi-homed stub Autonomous System is an AS that is connected to multiple ASes (e.g., multiple ISPs) but will not offer routing services between them.

Routing Information Protocol (RIP)

The Routing Information Protocol (RIP) is an Intra-AS IP routing protocol (interior gateway protocol) that uses a form of the Distance-Vector algorithm. It counts only hops, so the cost of each link is one. RIP creates and manages a routing table on a router. For each destination (a subnet: a group of IP addresses), the table contains the number of hops to that destination and the address of the next router (the first hop).

As with the DV algorithm, each router sends periodic RIP advertisements to its neighbors. A RIP advertisement is just the routing table with hop counts. When another router receives such an advertisement, it compares the routes in the received table to see if routing that that node will result in fewer hops. If so, it will update its own routing table. It will also add any new subnets that it obtains from received tables.

Open Shortest Path First (OSPF)

Open Shortest Path First (OSPF) is another interior gateway protocol that was designed as a successor to RIP. It is a link-state algorithm based on Dijkstra’s shortest path algorithm. Because of this, every router constructs a complete graph of the systems in the AS. Any link changes are broadcast to all routers, not just a node’s neighbors.

To support larger networks, OSPF allows the network in an AS to be partitioned into multiple OSPF Areas. Each area runs its own OSPF link-state algorithm and routes any packets that are destined to go out-of-area to an area border router (ABR). The collection of these area border routers belong to a common backbone area. They summarize (aggregate) routes to the subnetworks in their own area and advertise them to other area border routers. Area border routers also run the OSPF algorithm not just to learn routes within its area but also to ensure that each ABR can route to the proper ABR in another area based on a prefix match of the destination IP address. OSPF Areas make a single AS look like a mini Internet.

Border Gateway Protocol (BGP)

Unlike RIP and OSPF, the Border Gateway Protocol (BGP) is an exterior gateway protocol: an inter-AS routing protocol. Gateway routers in an AS establish a TCP connection with gateway routers in other ASes. A pair of such routers are known as BGP peers and the TCP connection between them is known as an external BGP session (eBGP). In cases where an AS has more than one gateway router, other routers need to know which IP address prefixes are served by which gateway. Internal BGP sessions (iBGP) between a gateway router and other routers within the AS allow the gateway router to propagate information about external IP prefixes that it can reach. Typically, iBGP will run between the gateway router and the area border routers (ABRs) in an OSPF backbone area.

BGP peers exchange CIDR route prefixes and use a distance vector (DV) algorithm to establish least-cost paths. A gateway router advertises its prefix reachability, which means it tells its peers in neighboring ASes the routes that it is capable of handling (as CIDR prefixes). In this manner, each AS finds out which neighboring AS yields the lowest-cost path to a destination. Each BGP advertisement identifies a route, which consists of:

a CIDR prefix (e.g., 128.6.0.0/16)
a path, which is a list of ASes through which the advertisement passed. This allows an AS to detect and reject routing loops. The sending of a full path makes BGB a path vector protocol: a variation of the distance-vector protocol that sends entire paths.
a next-hop router, which identifies the address of the remote router that sent the advertisement and allows the gateway router in an AS to address the router in a remote AS. The next-hop value also allows an AS to choose to choose from several possible links to another AS.

An important facet of BGP is its use of policies. An import policy defines what routes an gateway router will reject. Policies are designed by an administrator and can set local preferences and policies such as not offering transit to certain ASes.

Multicast

A unicast is a point-to-point transmission. The communications we looked at to date were unicast messages: a node sends a message to a unique destination. A broadcast is a message that is received by everyone on a network. We encountered this when we examined IP addresses. An IP address with all bits set to 1 is a limited broadcast address where the datagram is delivered to all nodes on the local area network but not forwarded by routers. A directed broadcast address where only host bits are set to 1 (the low-order bits that make up the host, as opposed to the network bits) delivers a datagram to all hosts on the specified subnet. Routers must may forward the datagram to that subnetwork.

A multicast is a message that is delivered to all members of a group. That is, it is delivered to all hosts that have subscribed to receive the multicast. These group members do not all have to reside on the same local area network. A simple way of implementing a multicast is by an n-way unicast. The sending host simply iterates over the list of destinations and sends a datagram to each destination. This is inefficient since the sender must transmit N times the traffic. Moreover, the sender needs to know all of the recipients.

It is more efficient – and desirable – to have routers replicate packets. With uncontrolled flooding, a host sends a single datagram that is then replicated by each router and sent over every other interface in that router. The problem with this approach is that any cycles in the graph formed by router connections will create a broadcast storm: a router will have receive a duplicate datagram, send it out on each interface, receive one or more of those duplicates at a later time, duplicate and send that, and so on.

A controlled flooding approach ensures that these cycles do not occur. Sequence number controlled flooding requires each multicast packet to have a sequence number affixed to it. Each router keeps track of the sequence numbers of datagrams that it has recently routed. If it receives a datagram with a new sequence number, the router sends a copy over each link except the one it came in on. If the router gets a packet with a sequence number that it has already seen, it discards the packet. Reverse path forwarding (RPF) is a technique that does not require adding new data to a packet. When a router receives a datagram, it looks at the datagram’s source address and then consults its routing table to see whether the datagram arrived on the link that corresponds to the route to the source (which will be the shortest path to the source). If it does, then the router forwards a copy of the packet onto every other link. If it does not, that indicates that the packet is a duplicate that arrived through a cycle. That packet is discarded. A multicast based on controlled flooding using RPF is called a source-based tree: the flow of multicast packets forms a tree that is rooted at the initial sender. With a source-based tree, each sender establishes a separate shortest-path tree for a multicast group. Since RPF is used, routers have to keep track of the sender of the multicast packet.

Another approach to controlled flooding is not to send a copy of the datagram onto every link in the router. Instead, we create an overlay network that is a subset of the full network. This overlay network is a spanning tree, which is a graph that contains all the nodes of the network but only a subset of the edges such that the graph is connected (all nodes are reachable) but there are no cycles.

One way of constructing a spanning tree is to pick a random node to serve as a rendezvous point (also known as a center node). Every node that wants to receive multicasts will send a join message to that rendezvous point. As each join message propagates through routers, each router records the incoming and outgoing interfaces as being associated with that multicast address. The incoming and outgoing links become part of the spanning tree. When all interested nodes have sent join messages, the spanning tree is complete. Any multicast messages will be forwarded only along the links that make up the spanning tree. This multicasting technique is called a group-shared tree since only interested edge routers beome part of the tree. With a group-shared tree, a single tree is shared among all senders for a multicast group; each sender has to send its multicast packets to a common rendezvous point in the exact same manner as if it was sending unicast messages. Unlike a source-based tree, routers do not need to make routing decisions based on the sender’s address for a multicast destination.

IP multicast routing

IP multicast is designed, like IP, to be decentralized and span multiple physical networks. Membership is dynamic: a machine can join or leave a multicast group at any time. There is no central coordinator and no restriction on the number of hosts that can be in a group. Multicasting provides network efficiency. Datagrams in a multicast stream only need to be replicated when a router needs to send them to multiple network links. A sender transmits only a single stream of datagrams and only one stream of datagrams is ever needed on any network segment regardless of the number of receivers.

An IP multicast address (also known as a class D address) is an IP address that starts with the bit pattern 1110 and contains a 28-bit multicast group ID. With IPv6, a multicast group address starts with the bit pattern of eight 1s (1111 1111) and contains a 112-bit multicast group ID (the other buts are used for flags and to indicate scope; we will not cover that here). A host may join any IP multicast address and receive messages addressed to that multicast ID. The collection of systems that are interested in receiving multicast messages from a specific multicast group is called a host group.

Routers have to get involved to support multicasting beyond the local area network. Two protocols are used to implement multicasting. The Internet Group Management Protocol (IGMP) is designed for hosts to inform routers that they are interested in joining a multicast group. The Protocol Independent Multicast (PIM) protocol enables routers to tell their neighboring routers that they are, or no longer are, interested in receiving packets for a particular multicast group.

A host uses IGMP to send a multicast report message to join a specific multicast group. A multicast-aware router will get this message and now know that the link on which the message arrived needs to receive any packets addressed to that multicast group. Periodically, a router will send a membership query, asking hosts what multicast groups they want to receive. If no host on a network link responds, the router will assume that nobody on that LAN is interested in receiving multicast packets. If a host is interested, it has to re-send a multicast report. This technique of requiring renewals and deleting a subscription if no renewal is received is called soft state. A host may send an explicit leave group message to state that it is no longer interested but the soft state mechanism ensures that multicasts will not be sent onto the network even if that does not take place (if the machine dies, for example).

IGMP allows routers to know what multicast groups the nodes on its connected LANs are interested in. PIM, Protocol Independent Multicast, is responsible for conveying membership information among routers. It assumes the presence of other protocols to know the network topology and which routers are connected together. There are two approaches to multicasting on the WAN (wide-area network): flooding and sparse-mode multicast.

Flooding, also known as Dense Mode Multicast, uses a source-based tree. That is, all multicast traffic originates from the source and the message is duplicated at a router and sent to all of its connected routers. Each of those routers, in turn, duplicates and sends the message to all of its connected routers, and so on. Reverse path forwarding (RPF) is used to ensure that no forwarding cycles arise. PIM Dense Mode floods the entire network of connected multicast-aware routers. A router stops forwarding traffic only when it receives a Prune message from a router telling it that the router does not want the multicast traffic for that address. It does this if its downstream routers are not interested in that multicast stream. If a node on a LAN joins a multicast group at a later time and sends an IGMP message to a router, that router would then send a PIM Graft message to its connected routers to state interest in the stream. Dense mode only makes sense when there are interested receivers spread through most locations covered my multicast-aware routers. It is rarely used.

In contradistinction to Dense Mode, PIM Sparse Mode Multicast starts with requests from multicast receivers rather than flooding the network with traffic from the sender. Each multicast group must be associated with a router known as a rendezvous point. Edge routers that are interested in receiving a multicast group send join messages to that rendezvous point. This builds a spanning tree between the rendezvous point and the subscribers. This rendezvous point acts as a central point that senders route to when they are transmitting multicast streams. The advantage of PIM sparse mode multicast is that packets go only where needed.

Data Link Layer

The data link layer is concerned with sending and receiving data on the actual communication links that connect machines. These are the links that constitute a local area network (LAN). Packets at the data link layer are called frames. Medium Access Control (MAC) is the general term for the protocol that is used for transmitting and receiving link-layer frames. Interfaces at the link layer use link-layer addressing. A MAC address (for example, an Ethernet address) is different from, and unrelated to, an IP address. An Ethernet MAC address is globally unique to a device and there is no expected grouping of such addresses within a local area network. IP addresses on a LAN, on the other hand, will share a common network prefix.

Error Detection & Correction

MAC protocols usually employ error detection codes and sometimes employ error detection and correction codes. We’ve seen that IP used error detection codes to validate an IP header and both TCP and UDP used them to validate their headers and data. Why do we need this at the link layer? The basic advantage of detecting errors at the link layer is to catch bad frames early and avoid the overhead of propagating a useless frame up the network stack. If the firmware on a network card can detect an error, it can drop the packet and not even waste time interrupting the processor. If error correcting codes are used, the link layer now has the opportunity to correct a limited amount of bit errors in the received frame. This can avoid the costly delay of having to retransmit a packet.

Parity

The simplest form of error detection is parity. With parity, we add one bit to a block of data, called a parity bit. With even parity the parity bit is set such that the total number of one bits in the block of data (including parity) is an even number. With odd parity the parity bit is set such that the total number of ones is an odd number. Parity is simple to compute and requires a low overhead both in terms of storage and computation. However, it cannot detect an even number of bit errors (one error cancels the other). Moreover, in networks, bit errors typically occur in bursts, not single bits, so parity is a poor choice for this kind of application. Parity is popular in applications where errors rarely happen and, when they do, are usually single-bit errors. Memory chips are an example where parity codes are useful.

Two-dimensional parity

The use of parity is based on the assumption that errors are rare and multi-bit errors are even more rare. We can arrange a sequence of bits into a two-dimensional M × N array. Then we generate a parity bit for each row and another one for each column. Finally, we generate a parity bit for the intersection of the row and column parity bits. To validate data, we check the parity along each row and each column. If there is a single bit error in the data, the intersection of the row and column with parity errors will identify the bad bit.

Two-dimensional parity, which can be easily extended to n dimensions, is a simple example of an error correcting code (ECC). Error correcting codes were pioneered by Richard Hamming in 1950 and there are numerous such codes, multi-dimensional parity being one of the simplest. A data transmission that uses error correcting codes is said to use forward error correction (FEC).

Checksums

A checksum is any form of code that is uses the data to compute a small, fixed-size value that is sent along with the data and is used for error checking. The receiver can apply the same computation to validate the checksum. Any bit errors in the data will yield a high probability that the computed checksum will be a different value. We already examined a simple checksum earlier. Protocols such as IP, UDP, TCP, ICMP, OSPF, and IGMP all use the Internet checksum. In this checksum, we summed up the data 16 bits at a time, adding one whenever the sum of two values resulted in a carry (overflow), and inverting the bits of the result. The Internet checksum is extremely easy to compute put provides poor detection of errors in the data.

Cyclic redundancy check

A cyclic redundancy check (CRC) is a much more robust checksum that is particularly good at detecting multiple consecutive bad bits. An n bit CRC code will generally detect a burst of up to n bad bits. CRC codes are a type of code known as a linear block code. A block code treats the data as a sequence of fixed-length integers. The Internet checksum is also a form of a block code. A CRC is a bit more computationally intensive to compute than an Internet checksum but, when done at the MAC layer, is generally done by the hardware of the adapter, and therefore does not take time from the processor.

A CRC computation is similar to long division but with the use of exclusive-ors instead of subtraction. The entire data to be checksummed is treated as a large binary number that is left-shifted by the desired length of the CRC code (e.g., r bits). This number is “divided” by a value called the generator (abbreviated by G). The generator is pre-defined and agreed to by both sides. For an r-bit CRC code, the generator must be r + 1 bits long, start with a 1 and be an odd number (end with a 1). The “division” is a series of exclusive-ors on the data, aligning G with the leftmost 1 in the remaining data, exclusive-oring the result, and repeating until there is no more room to position r bits under the data. What’s left is the remainder, which is the CRC value.

The figure on the right shows a CRC computation with a Generator of 23 (10111) yielding a remainder of 11 (1011), which becomes the checksum. The data is sent with the CRC number in place of the 0 bits that filled the data on the right when we left shifted the data by r bits. The recipient performs the same division without left shifting the value it receives. If the remainder is 0, then this indicates that there is no data corruption.

Multiple access protocols

A multiple access protocol is MAC protocol that is used on a network where multiple hosts need to send messages on the same physical link (e.g., the same wire or range of radio frequencies). These types of links are broadcast links since multiple hosts are connected to the same medium. The multiple access problem is that of coordinating transmissions from multiple hosts to avoid collisions. A collision is when two or more hosts transmit at the same time on the same frequency, causing one transmission to interfere with the other, and damaging both signals. There are three strategies for handling this.

1. Channel partitioning

Channel partitioning means dividing a communication channel either into fixed time slots or fixed frequency bands. Time division multiplexing (TDM) divides a channel into time slots. For n hosts, each host gets 1/ n of the time slots. A host can transmit only during its defined time slot. Frequency division multiplexing (FDM) divides a channel into n frequency bands. A host can transmit at any time but can only transmit on its allotted frequency band. As with TDM, this channel partitioning scheme does not make efficient use of the bandwidth. Frequency bands or time slots go wasted if the nodes to which they are allocated have nothing to send.

2. Taking turns

MAC protocols based on taking turns allow each node to have full use of the network and coordinate access to make sure that no other node will be using the network. This ensures that collisions cannot occur.

A polling protocol uses a master that polls each of the nodes on the network, each in turn, to see if one of them wants to transmit data. This ensures that there are no collisions as the master will not poll another node until transmission is complete. However, a node incurs the delay of waiting to be polled (think about a LAN with thousands of nodes on it). Also, if the master dies, the network becomes inoperable.

A token passing protocol uses a special frame, called a token, that is passed around the nodes on the network in sequence. If a node gets a token, then it can transmit data. Once it is done transmitting, it passes the token to its neighbor. If it has nothing to transmit, it passes the token to its neighbor immediately. Unlike a polling protocol, there is no need for a master. However, it suffers from some of the same problems. A node has to wait for the token and, should a node fail to pass the token, no other node will be able to transmit.

3. Random access

With random access protocols, any node has full use of the channel but only one node at a time may use it. There are no scheduled time slots as in TDM. Because there is no schedule on who can transmit when, random access protocols have a chance of collision when multiple nodes decide to transmit at the same time.

In the Slotted ALOHA protocol, access to the network is divided into time slots. Time slots are not assigned to any node. A node can transmit at the start of any time slot and must finish transmission at the end of that slot. If two nodes happen to transmit during the same time slot, a collision results and both nodes will have to retransmit the frame. Each of them retransmits it in the next time slot with some predefined probability p. Imagine the decision is made by a node flipping a weighted coin to decide whether to retransmit in the next time slot. If the flip tells it not to retransmit, it makes the same decision for transmitting on the time slot after that (and so on). Slotted ALOHA introduced the concept of a randomized wait after a collision but it does not use the network very efficiently. For a large number of transmitting nodes, only about 37% of time slots will have useful data. The rest will be empty or have collisions.

Carrier Sense Multiple Access with Collision Detection (CSMA/CD) is a successor to Slotted ALOHA and is the protocol used on Ethernet on a shared link (which was the design of the original Ethernet). CSMA/CD has two parts to it: carrier sensing means that a node will not just transmit when it is ready to do so but will first listen to the communication channel and wait until it is clear (i.e., nobody else is transmitting); collision detection means that a node is listening to the channel at the same time that it is transmitting. If it detects interference (a collision), it immediately stops transmitting.

After a node detects a collision and stops transmitting, it will have to retransmit that frame. Before doing so, it will wait a random interval. The desired heuristic is to pick a random wait time from a long time interval if there is a lot of traffic on the network and to pick a random wait time from a short time interval if there is little traffic on the network. Of course, a network adapter has no idea what is going on in the rest of the network. Instead, each time the transmission of a frame results in a collision, a random backoff value will be chosen from a time interval that is double the previous time interval. This technique is called binary exponential backoff.

Ethernet

Ethernet was designed as a random access packet switched network that uses CSMA/CD over a shared wire. It has been evolving since the the mid–1970 and continues to evolve. Not only did it get faster (from 2.74 Mbps to 10 Gbps and beyond) but it moved from a shared bus topology to a switched star topology, where every node is connected by a dedicated cable to a central switch.

There are slight variations on the format of an Ethernet frame depending on which version of the protocol is used but the key parts of an Ethernet frame are:

MAC destination address
MAC source address
type, indicating the higher level protocol encapsulated within the data field (e.g., an IP datagram). It also identifies the version of the Ethernet protocol used in the frame.
payload: the data
CRC checksum

Surprisingly, the frame length is missing in the most common version (Ethernet II) since the end of the frame is recognized by the hardware when the carrier signal drops. The packet driver on the adapter counts bytes as they arrive and stops when the receiver hardware has no more bytes to receive.

Address Resolution Protocol and Neighbor Discovery Protocol

Ethernet addresses are 48-bit addresses that are globally unique. They are generally assigned by the manufacturer of the adapter. These MAC addresses are unrelated to IP addresses. Any IP datagram needs to be encapsulated in an Ethernet frame before it can be transmitted on an Ethernet network. The MAC address in this frame is the address of the next hop for that datagram. The Address Resolution Protocol (ARP) allows a host to find a MAC address that corresponds to a desired IP address.

If a host needs to send a datagram to a system on its subnetwork, it must encapsulate the datagram in an Ethernet frame whose destination address is the MAC address of the recipient. If a host needs to send a datagram to a system that is not on its subnetwork, it needs to look up the destination IP address in its routing table and encapsulate the datagram in an Ethernet frame whose destination address is the MAC address of the first-hop router that is responsible for that IP destination.

ARP maintains a cache of recently-used IP-MAC address mappings in an ARP table. If the IP address it needs is not in the table, then it broadcasts an ARP query that contains the desired IP address. Every adapter on the LAN will receive it (it’s an Ethernet broadcast). If one of the adapters owns that IP address, it responds directly to the sender with an ARP response containing the corresponding MAC address.

IPv6 does not support ARP and uses a similar but more efficient mechanism called the Neighbor Discovery Protocol (NDP). Every IPv6 host must listen to a special multicast IP address that is derived from its own IP address. The multicast Ethernet MAC address is, in turn, derived from that IP multicast address. Hence, a query (called a neighbor solicitation) will be sent as a multicast message and delivered only to those hosts whose IP addresses result to the same Ethernet MAC address. Most likely, there will only be one such address in a LAN. The benefit of this method over ARP is that every host on the LAN is not interrupted with a query.

Multicast addressing

As with IP, Ethernet defines a range of MAC addresses that are reserved for use as multicast addresses. An Ethernet controller may, depending on the hardware, be configured to receive multicast frames in several ways:

If a hash of a MAC address matches a certain value, accept the frame.
If a MAC address matches one of several MAC addresses, accept the frame. This is usually limited to a small number of addresses.
Accept all multicast frames in multicast promiscuous mode.

The link-layer software driver needs to be prepared to receive extraneous frames and discard them if they contain destination addresses that the host is not interested in receiving.

While IP and Ethernet MAC addresses are completely unrelated on a host, multicast addresses are related since there would otherwise be no way of identifying the set of MAC addresses that need to receive a specific IP multicast datagram. Each host needs to define a multicast MAC address that can be derived from the IP address so that each receiver can listen on that address and each transmitter can transmit to that address. Both IPv4 and IPv6 use a similar approach. A number of least-significant bits of the IP multicast address (23 out of 28 bits for IPv4; 32 out of 112 bits for IPv6) are copied onto a defined MAC multicast address. The IP driver will need to discard any extraneous multicast packets that arrive because the lower bits of the multicast address happened to be the same.

Link-layer switches: Switched Ethernet

Ethernet began as a shared network using a bus topology. As coaxial cables gave way to twisted pair wiring (phone wires), adapters on different hosts could no longer tap into the same cable. Ethernet moved to a star topology where a separate cable connected each adapter to a central point: an Ethernet hub. An Ethernet hub simulated a shared Ethernet network by taking every bit that was received on one interface and transmitting it onto every other interface.

Hubs evolved into switches. While switches look similar to hubs, they are more intelligent and improve the performance of Ethernet networks considerably. A switch essentially acts like a link-layer router. It forwards frames received on one interface (connector) to another. Unlike a router, a switch is self-learning. It learns which MAC addresses correspond to which interfaces on the switch by looking at the source address of frames that arrive on each interface. This mapping of MAC addresses to interfaces is stored in a forwarding table (also called a switch table or a MAC table). Because switches may be connected to each other, multiple MAC addresses may map to a single interface on a switch (that interface happens to connect to another switch). If a switch receives a frame with a destination address that is missing from its table, it simply sends a copy of that frame out onto all interfaces.

The biggest benefit from switches is that collisions can no longer occur. A link between a switch and a host is full duplex: frames can be sent concurrently with frames being received. The switch itself will queue and forward frames (this leads to the possibility of buffer overflow, but not collision). Adapters operating in full duplex mode have no need to do collision detection.

Virtual Local Area Networks (VLANs)

Some Ethernet switches can be configured to act as multiple logical, or virtual, switches, thereby creating several distinct local area networks. These networks are called Virtual Local Area Networks (VLANs) and they look and feel like distinct LANs. Each interface on the switch can be assigned, or reassigned, to one of these VLANs. Operations such as broadcasts and link-layer multicasts will remain within a single VLAN.

VLANs can be extended geographically or in capacity by connecting multiple switches together. Instead of running a cable for each VLAN between the switches, a single cable may be used to connect the switches to handle all the traffic that flows on all of the VLANs. This is called VLAN trunking. To identify which VLAN an Ethernet frame belongs to, the Ethernet frame is augmented with a VLAN tag. This tag is removed when the frame is switched to a non-trunk interface.

With VLAN switches, an administrator can define which VLAN any device resides on, regardless of which switch it is connected to. This allows the administrator to segment groups of computers (e.g., accounting vs. development vs. test) without having to connect each group to its own dedicated switch.

Wireless Networks

A base station is a device that sends and receives data to and from wireless hosts. It may also coordinate transmission among hosts, with directives on who can transmit, when, and what signal strength. The base station generally interfaces to other, usually wired, networks thereby connecting the wireless devices with the wider Internet. Examples of base stations are cell towers and wireless access points.

A wireless device, called a station may operate in infrastructure mode, in which case traditional network services such as DHCP, DNS, and routing as well as connectivity to the Internet are expected to be provided through the base station and the wired network to which it connects. Alternatively, hosts may operate in ad hoc mode, also known as peer-to-peer mode. In this case, there is no base station or back-end infrastructure available. Hosts are self-configuring and need to figure out how to communicate among themselves, which includes assigning unique addresses, discovering names, and deducing routes to each other.

802.11

802.11, also known as Wi-Fi, is a set of standards for wireless local area networking. Standards such as 802.11a, 802.11b, 802.11g, 802.11n, and 802.11ac define operating frequencies and data encoding techniques used for sending and receiving frames on a wireless LAN. Most 802.11 systems operate in either the 2.4 or 5 GHz frequency bands.

An 802.11 base station is known as an access point (AP). The collection of an access point and one or more mobile wireless devices (stations) is called a basic service set (BSS). Each BSS has a unique ID, called the BSSID, which happens to be the MAC address of the BSS’s access point. Devices that interact with an access point operate in infrastructure mode. The access point connects to a wired Ethernet infrastructure. 802.11 devices can also operate in ad hoc mode, where devices communicate directly with each other with no need of an access point.

Each access point, in addition to having a BSSID, has a Service Set Identifier (SSID), which is a human-friendly text name that is assigned by the administrator who configures the AP. Each BSS operates over a range of frequencies identified as a channel. Adjacent channels partly overlap with each other. For example, in the U.S., the allowable non-overlapping channels in the 2.4 GHz band are 1, 6, and 11. If adjacent BSSes use these distinct channels, then frames sent by a device in one BSS will never interfere with those in another BSS.

A station (a wireless endpoint device) needs to find and associate itself with an AP. Two techniques are available for this. With passive scanning, the AP periodically sends beacon frames, each containing the AP’s SSID and MAC address (BSSID). The beacon is typically sent every 100 ms. The station scans all channels, searching for beacon frames from any APs in the area.

With active scanning, a station may broadcast a probe request frame. It sends this to a broadcast address (ff:ff:ff:ff:ff:ff) and sets a timer to wait for any answers. If no answer has been received at the end of the time period, the station moves to the next channel and repeats the process. APs that receive the probe will respond with a probe response frame, that contains their identity, supported data rates, and other information. A station then selects an access point with which to associate. This may be done by the user or programmatically (e.g., the AP with the strongest signal or one that has been seen in the past). The station sends an association request frame, receives an association response from the AP and is now part of that AP’s BSS. It can then send a DHCP discovery message to configure itself on the IP network.

802.11 MAC

There are a couple of fundamental differences between 802.11 and Ethernet, apart from one being wireless and the other being wired. Operating in a wired medium results in considerably higher bit-error rates. Also, while an Ethernet transceiver, when operating on a shared medium, can listen while transmitting, an 802.11 transceiver cannot. Because of this, Ethernet was able to perform collision detection and stop transmitting the instant that it detected a collision. 802.11 cannot do so and will transmit a complete frame, even if it gets garbled in the air.

CSMA/CA

To deal with the inability to detect collisions, 802.11 employs a random access MAC protocol called carrier sense multiple access with collision avoidance (CSMA/CA). The key part of CSMA/CA is that, if stations sense a busy channel, they all perform a random backoff and then try again. Collisions are likely to occur if multiple stations wait for a busy channel to become clear and then start to transmit at the same time. To avoid this situation, CSMA/CA tells the stations to wait a random time after they sensed a busy channel became clear, and then transmit the frame. The random time counter counts down only when the channel is sensed to be clear; it pauses whenever a signal is detected on the channel. The idea is that stations pick different random values and when they are ready to try transmitting again, they will not transmit at the same time: somebody will end up going first, causing the channel to be busy for others.

802.11 ARQ

Since 802.11 cannot listen while transmitting to detect a collision and because there is a higher likelihood of data corruption on a wireless network, 802.11 adds an ARQ protocol (acknowledgements and retransmissions). After a node sends a frame, the receiver (AP or station) validates the CRC in the frame and sends back an acknowledgement frame. If the sender does not receive the acknowledgement, it increases its backoff value, counts down a random interval for the channel to be free to transmit, and retransmits the frame. Like CSMA/CD, 802.11’s CSMA/CA uses binary exponential backoff. Because 802.11 uses an ARQ protocol, the 802.11 frame contains a sequence number to allow a receiver to detect duplicate frames. After a certain number of retransmissions, the sender gives up, so reliable delivery is not assured.

Because of the higher likelihood of errors and the time expense of retransmission, 802.11n and 802.11ac systems also support the optional use of an error correcting code (low-density parity check code, LDPC). This is in addition to the CRC error-detection code.

802.11 RTS/CTS

A unique problem in wireless networks is that a station does not necessarily know if a channel is busy or not because it can be out of radio range of another station that is transmitting while both stations are in range of an access point. This is called the hidden node problem. In this case, a station may transmit when it thinks the channel is clear, not realizing that a transmission was already taking place between the AP and another station. This is unavoidable but is ameliorated via the optional use of RTS/CTS (Request to Send / Clear to Send). Before transmitting a frame, a station sends a Request to Send (RTS) message to the AP, asking to reserve the communication channel for a message of a specific size. The AP then responds with a Clear to Send (CTS) message, giving it permission to send the frame. The CTS message is broadcast so that all stations will get the message and know that they should not transmit anything during that interval. RTS and CTS frames are short and sending them reduces the chance of collision compared to sending a long data frame. The RTS/CTS mechanism serves as an extension of carrier sensing. It allows a station to be informed that a channel will be busy even if it may not have the ability to pick up the signal from the station that will be transmitting the data.

802.11 addresses and host migration

The 802.11 frame is conceptually similar to an Ethernet frame. In fact, they share the same MAC addressing scheme so that Ethernet addresses interoperate with 802.11 addresses. The access point serves as a bridge between the two protocols and converts 802.11 frames into Ethernet frames when they need to go on an Ethernet link and vice versa. One distinction between Ethernet and 802.11 frames is that an 802.11 frame has four address fields, three of which are used in infrastructure mode. One address identifies the wireless destination. Another identifies the wireless source. In cases where the frame is routed between the wireless and wired network, the third address identifies the MAC address of the wired device (regardless of whether it is a sender or a receiver). The reason for three addresses is that, unlike an Ethernet switch, an AP has a distinct MAC address and is addressed as an intermediate destination both for frames between wireless stations and for frames that go between and devices on the wired network. If a frame needs to go to the wired network, a station will address it to the AP, which will then create an Ethernet MAC frame that is addressed to the targeted wired destination.

A station is part of a BSS. That means it is associated with a single access point. To extend the range of a wireless LAN, multiple access points may be deployed that share the same SSID. A device can dissociate itself from one AP and associate itself with another one when it detects a stronger signal from the new one. This is called host migration. Since the APs are connected to the same LAN via the same switch (or a set of connected switches), there is no problem with the station keeping the same IP address and any state of TCP session. The challenge is to get the Ethernet switch to immediately route frames to the latest access point that the station has joined. To support host migration at the switch, an access point broadcasts an Ethernet frame that contains the migrated host’s source MAC address. The switch, being self-learning will immediately update its forwarding table, associating the device’s MAC address with that of the interface on which it arrived.

Power management

Most wireless devices are battery powered and power consumption is a key design factor. A transceiver on a wireless device can quickly go to sleep and wake up at a preset interval. Before a device puts its 802.11 transceiver to sleep, it sends a message to the AP stating that it is going to sleep (this is a bit in the MAC header). Upon receiving this message, the AP will queue but not send any frames targeted for that device. The device’s transceiver is programmed to wake up before the AP is scheduled to send its next beacon frame (which the AP does typically every 100ms). When the AP sends a beacon frame, it sends a list of MAC addresses of stations that have buffered frames (queued frames). The station will then wake up and request the frames via a polling message. Otherwise, it can go to sleep until the next beacon.

Quality of Service in Networks

IP was designed as a system that would provide best-effort packet delivery but with no guarantees on the path a packet will take, whether it gets dropped, or what order it arrives in. Hence, there is no concept of delivering any specific grade of service. As IP networks began to be used for carrying continuous media, such as voice and data, several approaches were taken to attempt to provide better controls for scheduling the delivery of IP packets.

Quality of service (QoS) on a network is characterized by four factors:

Bandwidth (or bit rate): the average number of bits per second over the network
Delay (or latency): the average time for data to get from one endpoint to another
Jitter: the variation in the delay
Errors (or packet loss): the percentage of packets that do not reach their destination or reach it with errors

Quality of service problems arise because we do not have unlimited resources.

Congestion: Congestion arises when more packets are flowing into a router than can flow out. It is largely due to bandwidth mismatch, such as having a 10 Gbps network routing traffic to a 1 Gbps link, or to aggregation, such as having four 1 Gbps links all sending traffic that needs to be routed over a single 1 Gbps link. If a router gets packets at a faster rate than it can route, the result will be packet loss. The router may also choose to block the arrival of new packets.
Latency: Latency is due to not only the propagation delay of transmitting data on a network but to various delays along the route. The sending host spends time in packetization, the creation of packets that need to be transmitted. Serialization is the queuing of packets are for transmission onto the network since they can only go out one at a time. Each router incurs a processing delay due to inspecting the packet and moving it onto the appropriate output queue. Then there is the queuing delay of waiting for the router to schedule the packet for transmission. Once it reaches the destination, there is the processing delay of depacketizing the packet and moving it through the network stack, delivering the data to the application, and scheduling the process to consume it.
Jitter: If multiple packets arrive to a router or switch at approximately the same time but need to be transmitted over a single link, they need to be queued since only one can be sent at a time. Some packet will be first and some packet will be last. This variation in dequeuing time creates jitter. Jitter is also caused by having some packets take a different route than others and, most significantly, by the retransmission of lost packets.
Packet loss: Packets can be corrupt or lost due to collision, interference, and signal degradation in the network. The most likely cause, however, is that of a router simply dropping packets because a queue is full. This condition is known as buffer overrun.

Achieving quality of service

Quality of service on a network is achieved via resource allocation and packet prioritization Network service quality can be strictly enforced or may be attempted via a best-effort approach. Hard QoS refers to a system that is designed to provide a guaranteed level of service via reservations while soft QoS refers to a system that uses a best-effort approach to try to deliver, but not guarantee, the desired level of service.

Two broad approaches to providing data channels with managed quality of service on a network are admission control and traffic control. Admission control is generally designed to enforce hard QoS. With admission control, we ask that applications first request a particular quality of service from the network. The network (i.e., all the routers in the path from the source to the destination) will reserve resources for switching and routing the packets to conform to that grade of service and grant the request to the application. If any router cannot commit to the needed resources, the request will be denied.

Traffic control provides soft QoS. Soft QoS on a network refers to prioritization of packets without any reservation of resources from routers or endpoints or any a priori negotiation for a level of service. With traffic control, applications may send data onto the network freely but the network elements (routers) will classify, queue, schedule, and sometimes drop packets based on various criteria.

A link scheduling discipline defines how packets are scheduled at the output queue. When we looked at router architecture, we considered only a single queue per output interface with packets transmitted in a first-in-first-out (FIFO) manner. This is the simplest approach but does not offer any opportunity for differentiating one datagram from another based on its service class.

A router can set up multiple queues for an output link and place different classes (e.g., based on addresses, ports, or other fields in the IP header) of packets onto different queues. A packet scheduler then picks the next packet from a specific queue for transmission. A round-robin scheduler will give each class of packet (queue) equal priority. A priority scheduler will always service high-priority queues first but can lead to starvation of low-priority queues: low-priority queues might never get serviced. A desirable characteristic of queue management is traffic isolation - to ensure that one class of service cannot adversely affect another class even if it has a higher A weighted fair queuing (WFQ) approach gives each queue a priority level but also a ensures a certain minimum percentage of the available link bandwidth so that starvation does not occur.

Traffic shaping and traffic policing

Traffic shaping is when a router queues packets in certain flows during peak usage for later retransmission when there is available bandwidth. With traffic policing, traffic that exceeds the allocated bandwidth for a particular flow is discarded.

A leaky bucket is a traffic shaping algorithm that coverts a bursty flow into a constant bitrate flow: it removes jitter. The bucket is represented by a queue that receives data at an irregular rate (in bursts). The queue is emptied at a constant rate (the hole at the bottom of the bucket), resulting in an absence of jitter. If there is nothing left to read in the queue, we have a buffer underrun and jitter occurs. If the queue is already full when data arrives, we have a buffer overrun and packet loss occurs.

A token bucket is a “bucket” that holds tokens that are generated at a constant rate. In order to transmit a packet, the bucket must be drained of the number of tokens that is proportionate to the packet’s size. The token bucket algorithm does not smooth out traffic like the leaky bucket does but ensures that an average bitrate is enforced for a specific flow. For example, a buildup of tokens can result in a burst of data.

DiffServ

Differentiated services (DiffServ) is a way for programmers to provide advisory information inside an IP header on how a packet should be processed by routers. A packet can be assigned a specific service code by setting a value in the 6-bit Differentiated Services Codepoint (DSCP) field of the IP header. DiffServ is an example of traffic classification. It is entirely up to the routers to decide how to process this information (e.g., via WFQ), what the service levels mean, or even whether to process it at all. Differentiated services are an example of soft QoS: there is no guarantee on the actual quality of service that will be delivered. A common use for DiffServ is to try to provide a better quality of service for voice over IP (VoIP) phone calls by tagging those packets for Expedited Forwarding (EF) service, which is defined to have the characteristics of low delay, low loss, and low jitter. Whether DiffServ values are used at all and how they are interpreted is strictly up to the ISP.

IntServ

Integrated Services (IntServ) is an approach that relies on end-to-end reservation of services. A transmitting host specifies its traffic (as a token bucket with a given rate and size) and requested level of service guarantee. Integrated Services system relies on the Reservation protocol, RSVP, which has been developed to allow a flow of packets to be routed with bitrate guarantees. RSVP is an example of a soft state protocol, meaning that state is not maintained perpetually but reservations expire unless they are refreshed periodically.

The problem with guaranteeing this is that all routers in the path from the source to the destination must be configured to support RSVP: each intermediate router must commit to reserving the needed amount of routing resources to guarantee the desired level of service. Integrated Services is an example of admission control while differentiated services is an example of traffic control. If one ISP in the path does not support this then all bets are off.

RTP and RTCP

The Real-Time Protocol (RTP) is an application-level protocol on top of UDP. It does not define any mechanisms for data delivery or QoS control. As with any UDP datagrams, neither in-order delivery nor delivery in general is assured. What RTP does is allow a sender to attach timestamps to packets so that a receiver can play them back at the same rate at which they were sent.

An RTP header contains: - payload type: identifies type of video or audio encoding. An application can change the encoding type mid-stream (e.g., to switch to a lower bandwidth codec, for example) - sequence number: app can detect missing packets & conceal data loss - timestamp: app can play back data at appropriate intervals - source ID of stream: uniquely identifies stream to allow demultiplexing multiple streams that may received at the same port.

RTP is widely used for voice and video, particularly for media transport in SIP (Session Initiation Protocol) systems.

The RTP Control Protocol (RTCP) is a companion protocol to RTP and is used to provide feedback from the receiver to the sender. By comparing the arrival time between packets with the difference in timestamps in the RTP header, a receiver can compute jitter. By checking sequence numbers, a receiver can compute packet loss. The receiver can periodically send this feedback through RTCP so that the sender can make adjustments to the data stream (for example, change to a lower bandwidth codec).

Firewalls

A firewall protects the junction between an untrusted network (e.g., external Internet) and a trusted network (e.g., internal network). Two approaches to firewalling are packet filtering and proxies. A packet filter, or screening router, determines not only the route of a packet but whether the packet should be dropped based on contents in the IP header, TCP/UDP header, and the interface on which the packet arrived. It is usually implemented inside a border router, also known as the gateway router that manages the flow of traffic between the ISP and internal network.

The packet filter evaluates a set of rules to determine whether to drop or accept a packet. This set of rules forms an access control list, often called a chain. Strong security follows a default deny model, where packets are dropped unless some rule in the chain specifically permits them. With stateless inspection, a packet is examined on its own with no context based on previously-seen packets. Stateful inspection allows the router to keep track of TCP connections and understand the relationship between packets. For example, a port that needs to be enabled for the FTP data channel once an FTP connection has been established or that return packets should be allowed into the network in response to outbound requests.

Packet filters traditionally do no look above the transport layer. Deep packet inspection (DPI) allows a firewall to examine application data as well and make decisions based on its contents. Deep packet inspection can validate the protocol of an application as well as check for malicious content such as malformed URLs or other security attacks.

An application proxy is software that presents the same protocol to the outside network as the application for which it is a proxy. For example, a mail server proxy will listen on port 25 and understand SMTP, the Simple Mail Transfer Protocol. The primary job of the proxy is to validate the application protocol and thus guard against protocol attacks (extra commands, bad arguments) that may exploit bugs in the service. Valid requests are then regenerated by the proxy to the real application that is running on another server and is not accessible from the outside network. The proxy is the only one that can communicate with the internal network. Unlike DPI, a proxy may modify the data stream, such as stripping headers or modifying machine names. It may also restructure the commands in the protocol used to communicate with the actual servers (that is, it does not have to relay everything that it receives).

A typical firewalled environment is a screened subnet architecture, with a separate subnet for systems that run externally-accessible services (such as web servers and mail servers) and another one for internal systems that do not offer services and should not be accessed from the outside. The subnet that contains externally-accessible services is called the DMZ (demilitarized zone). The DMZ contains all the hosts that may be offering services to the external network (usually the Internet). Machines on the internal network are not accessible from the Internet. All machines within an organization will be either in the DMZ or in the internal network.

Both subnets will be protected by screening routers. They will ensure that no packet from the outside network is permitted into the inside network. Logically, we can view our setup as containing two screening routers:

The exterior router allows IP packets only to the machines/ports in the DMZ that are offering valid services. It would also reject any packets that are masqueraded to appear to come from the internal network.
The interior router allows packets to only come from designated machines in the DMZ that need to access services in the internal network. Any packets not targeting the appropriate services in the internal network will be rejected. Both routers will generally allow traffic to flow from the internal network to the Internet, although an organization may block certain services (ports) or force users to use a proxy (for web access, for example).

Note that the two screening routers may be easily replaced with a single router since filtering rules can specify interfaces. Each rule can thus state whether an interface is the DMZ, internal network, or Internet (ISP).

A variation on screening routers is the use of intrusion detection systems (IDS). A screening router simply makes decisions based on packet headers. Intrusion detection systems try to identify malicious behavior. There are three forms of IDS:

A protocol-based IDS validates specific network protocols for conformance. For example, it can implement a state machine to ensure that messages are sent in the proper sequence, that only valid commands are sent, and that replies match requests.
A signature-based IDS is similar to a PC-based virus checker. It scans the bits of application data in incoming packets to try to discern if there is evidence of “bad data”, which may include malformed URLs, extra-long strings that may trigger buffer overflows, or bit patterns that match known viruses.
An anomaly-based IDS looks for statistical aberrations in network activity. Instead of having predefined patterns, normal behavior is first measured and used as a baseline. An unexpected use of certain protocols, ports, or even amount of data sent to a specific service may trigger a warning.

Secure communication and authentication

Cryptography

Cryptography deals with encrypting plaintext using a cipher, also known as an encryption algorithm, to create ciphertext, which is unintelligible to anyone unless they can decrypt the message.

A symmetric encryption algorithm uses the same secret key for encryption and decryption.

A public key algorithm uses a pair of keys: data encrypted with the first key can be decrypted only with the second key and vice versa. One of these keys is kept private (known only to the creator) and is known as the private key. The corresponding key is generally made visible to others and is known as the public key. Anything encrypted with the private key can only be decrypted with the public key. This is the basis for digital signatures because the encryption can only be performed by the key’s owner. Anything that is encrypted with a public key can be encrypted only with the corresponding private key. This is the basis for authentication and covert communication because decryption can only be performed by the owner, who is the only one who has the private key.

When data is transmitted, it is broken into blocks and each block is encrypted separately. This leads to two problems. If different communication sessions contain the same messages and use the same key, an intruder can see that the same data is being sent. Secondly, a malicious party can add or delete, add, or replace blocks (perhaps with random junk or perhaps with blocks that were captured from previous communication sessions). Cipher block chaining (CBC) addresses these problems. Every block of data is still encrypted with the same key. However, prior to being encrypted, the data block is exclusive-ored with the previous encrypted block. The receiver does the process in reverse: a block of received data is decrypted and then exclusive-ored with the previously-received block to obtain the original data. The very first block is exclusive-ored with a random initialization vector, which must be transmitted to the remote side. Note that CBC does not make the encryption more secure; it simply makes the result of each block of data dependent on the previous block so that data cannot be inserted or deleted in the message stream.

A cryptographic hash function is a one-way function whose output is always a fixed number of bits for any input. By one-way, we mean that there is no way to compute the input when given the output. For good cryptographic hash functions (e.g., SHA–1, SHA–2, SHA–3), it is highly unlikely that two messages will ever hash to the same value, it is extremely difficult to construct text that hashes to a specific value, and it is extremely difficult to modify the plaintext without changing its resultant hash. The hash function is the basis for message authentication codes and digital signatures. Note that when we talk about cryptography and mention phrases such as “extremely difficult”, we mean “impossible for all practical purposes,” not that “you can do it if you spend an entire week working on the problem.”

Secure communication

To communicate securely using a symmetric cipher, both parties need to have a shared secret key. Alice will encode a message to Bob using the key and Bob will use the same key to decode the message. If Alice wants to communicate with Charles, she and Charles will also need a secret key. The fact that every pair of entities will need a secret key leads to a phenomenon known as key explosion. Overall, in a system with n users, there will be O(n²) keys.

The biggest problem with symmetric cryptography is dealing with key distribution: how can Alice and Bob establish a key so they can communicate securely? The Diffie-Hellman key exchange algorithm allows one to do this. Each party will generate a private key and a public key (these are not encryption keys; they are just numbers — Diffie-Hellman does not implement public key cryptography — it is unfortunate that the term was used to describe these numbers). Alice can use her private key and Bob’s public key to compute a common key. Bob can compute the same common key by using his private key and Alice’s public key. They can then communicate securely by using the common key with a symmetric cipher.

Using true public key cryptography, such as RSA, if Alice encrypts a message with Bob’s public key, Bob will be the only one who can decrypt it since doing so will require Bob’s private key. Likewise, Bob can encrypt messages with Alice’s public key, knowing that only Alice will be able to decrypt them with her private key.

Session keys and hybrid cryptosystems

A session key is a random key that is generated for encrypting data for one communication session. It is useful because if the key is ever compromised, no lasting information is obtained: future communication sessions will use different keys. A hybrid cryptosystem uses public key cryptography to send a session key securely. The originator generates a random session key and encrypts it with the recipient’s public key. The recipient decrypts the message with the corresponding private key to extract the session key. After that, symmetric cryptography is used for communication, with messages encrypted with the session key. This has the advantages of higher performance (public key cryptography is much, much slower than symmetric cryptography), ease of communicating with multiple parties (just encrypt the session key with the public keys of each of the recipients), and allows the bulk of data to be encrypted with session keys instead of the hardly-ever-changing public keys.

Message authentication

Message Authentication Code (MAC) is a cryptographic hash of a message that is encrypted with a shared symmetric key. This MAC is sent along with the message. If an intruder modifies the message, the receiver will be able to validate that the message no longer matches the MAC: simply hash the message and compare it with the value of the decrypted MAC. An intruder cannot generate a new MAC because you need the secret key to do so.

A digital signature is similar to a MAC but uses public key cryptography. It is simply the hash of a message encrypted with the creator’s private key. Anyone who has the message signer’s public key can decrypt the hash and thus validate it against the message. Other parties, however, cannot recreate the signature. Even though they can generate the same hash for the message, they do not have the signer’s private key to encrypt that hash.

Public key authentication

As we saw with hybrid cryptosystems, public key cryptography makes key exchange simple. It also simplifies authentication. If Alice wants to authenticate herself to Bob (prove that she’s really Alice), Bob will generate a random bunch of bits, called a nonce, and ask Alice to encrypt the nonce with her private key (something only she can do). She will send the result back to Bob who will decrypt the data with Alice’s public key. If the result matches Bob’s original nonce, he is convinced that Alice has Alice’s private key and therefore is really Alice.

Digital certificates

For Bob to validate Alice in the above example, Bob must be confident that he really has Alice’s public key rather than someone else’s. X.509 digital certificates provide a way to do associate an identity with a public key and have some entity vouch for that association. A certificate is a data structure that contains user information and the user’s public key. This data structure also contains a signature of the certification authority (CA). The signature is created by taking a hash of the rest of the data in the structure and encrypting it with the private key of the CA. The CA is responsible for setting policies of how they validate the identity of the person who presents the public key for encapsulation in a certificate.

Transport Layer Security (Secure Sockets Layer)

Secure Sockets Layer (SSL, also known as TLS — Transport Layer Security) is a layer of software above TCP at the application layer designed to provide authentication and secure communication while giving the programmer the feel of a normal sockets interface. It makes it easy to add a secure transport onto insecure TCP socket-based protocols (e.g., HTTP and FTP). SSL uses a hybrid cryptosystem and relies on public keys for authentication. If both the sender and receiver have X.509 digital certificates, SSL can validate them and use nonce-based public key authentication to validate that each party has the corresponding private key. In some cases, it may validate the server only. If the server does not have a certificate, SSL will then use a public key simply to allow a symmetric session key to be passed securely from client to server. The client generates a session key and encrypts it with the server’s public key. This ensures that only the server will be able to decode the message and get the session key. After that, communication takes place using a symmetric algorithm and the client-generated session key. Each encrypted message that is sent contains a MAC (message authentication code) to allow the receiver to ensure that the message has not bee accidentally or maliciously modified.

VPNs

Virtual private networks (VPNs) allow disconnected local area networks to communicate securely over the public Internet, saving money by using a shared public network (Internet) instead of leased lines. This is achieved by tunneling, the encapsulation of an IP datagram within another datagram. In this case, a datagram that is destined for a remote subnet, which will often have local source and destination IP addresses that may not be routable over the public Internet, will be treated as payload and be placed inside a datagram that is routed over the public Internet. The source and destination addresses of this outer datagram are the VPN endpoints at both sides, usually the VPN-aware routers.

When the VPN endpoint (router) receives this encapsulated datagram, it extracts the data, which is a full IP datagram, and routes it on the local area network. This tunneling behavior gives us the virtual network part of the VPN.

To achieve security (the “private” part of VPN), an administrator setting up a VPN will usually be concerned that the data contents are not readable and the data has not been modified. To ensure this, the encapsulated packets can be encrypted and signed. Signing a packet enables the receiver to validate that the data has not been modified in transit. Encrypting ensures that intruders would not be able to make sense of the data, which is the encapsulated datagram.

VPNs usually provide several options for key management: shared private keys (AES or 3DES), Diffie-Hellman key exchange, or RSA public keys. IPsec is one of the most popular types of VPNs and has two variations. Authentication Header (AH) mode adds a signature to the encapsulated datagram but does not encrypt any data. Data is readable but cannot be modified. Authentication Header mode is rarely used since the overhead of encrypting data is quite low. The other variant is the Encapsulating Security Payload (ESP) mode, which adds a signature as with AH but also encrypts the entire datagram.

In an environment where an individual computer outside a trusted network needs to connect to another node securely, tunneling will not work since there is no gateway router that will extract an embedded datagram and no trusted local area network on which to route it. To support this environment, VPNs can operate in transport mode instead of tunnel mode. No tunneling takes place and hence there is no encapsulation of the full IP datagram. The IP payload (which will include TCP or UDP headers) is encrypted and signed but the original IP header is left unchanged.