Final Exam study guide

The three-hour study guide for the final exam

Paul Krzyzanowski

Latest update: Sun Feb 22 11:43:07 EST 2015

Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Don't take the three hour time window in the title literally.

Introduction

Some introductory terms:

  • Batch systems: operating systems that would run a program to completion and then load and run the next program.
  • Multiprogramming: keep several programs in memory at once and switch between them.
  • Time-sharing: multiprogramming with preemption, where the system may stop one process from executing and start another.
  • Portable operating system: an operating system that is not tied to a specific hardware platform. UNIX was one of the early portable operating systems. Windows NT was also written to be architecture-independent.
  • Microkernel operating system: an operating system that provides just the bare essential mechanisms that are necessary to interact with the hardware and manage threads and memory. Higher-level operating system functions, such as managing file systems, the network stack, or scheduling policies, are delegated to user-level processes that communicate with the microkernel via interprocess communication mechanisms (usually messages).

A mechanism is the presentation of a software abstraction: the functional interface. It defines how to do something. A policy defines how that mechanism behaves (e.g., enforcing permissions, time limits, or goals). Good design dictates that we keep mechanisms and policies separate.

Booting an operating system

The boot loader is a small program that is run at boot time that loads the operating system. Boot loaders are sometimes broken into several stages. A multi-stage boot loader starts off with a simple boot loader that then loads a more sophisticated boot loader that then loads the operating system. This cascaded boot sequence is called chain loading.

On a classic PC architecture, the computer starts running code in the system's BIOS on startup. This BIOS code performs a power-on test, identifies and initializes some hardware, and then loads the first block of the disk, which is known as the master boot record, or MBR. The MBR contains a table identifying disk partitions and code that will load the Volume Boot Record (VBR), which is the first disk block of the designated boot partition. The VBR contains code that reads successive blocks to load the second stage boot loader. This boot loader may present the user with a choice of operating systems to load and will have the intelligence to read those files from the file system. The Unified Extensible Firmware Interface (UEFI) is a successor to the BIOS and contains a boot manager that can directly boot files that are placed into an EFI boot partition. It has the intelligence to parse several file system types so that it can load a file that does not necessarily occupy consecutive disk blocks. The files that are loaded from the EFI partition are often boot loaders for individual operating systems that then load the file or files that comprise the operating system.

Operating Systems: essential concepts

An operating system does a number of things. It is a program that loads and runs other programs. It provides programs with a level of abstraction so they don't have to deal with the details of accessing hardware. It also manages access to resources, including the CPU (via the scheduler), memory (via the memory management unit), persistent files (via the file system), a communications network (via sockets and IP drivers), and devices (via device drivers).

The operating system runs with the processor in kernel mode (also called privileged, supervisor, or system mode). In this mode, the processor can execute privileged instructions that define interrupt vectors, enable or disable system interrupts, interact with I/O ports, set timers, and manipulate memory mappings. Programs other than the kernel run with the processor in user mode and do not have privileges to execute these instructions. A processor running in user mode can switch to kernel mode by executing a trap instruction, also known as a software interrupt. Some processors also offer explicit syscall instructions, which is a faster mechanism since it does not need to read the branch address from an interrupt vector table that is stored in memory but keeps the address in a CPU register. Either of these approaches switches the processor to kernel mode, saves the current program counter on the stack, and transfers execution to a predefined address for that trap. Control is switched back to user mode and the location right after the trap by executing a return from exception instruction.

Programs interact with operating systems via system calls. A system call uses the trap mechanism to switch control to operating system code running in kernel mode. The operating system can either service the request immediately or may need to put the process in a waiting state because it requires data that isn't ready (such as a disk block or network message). If an operating system decides that a process has been executing long enough, it may preempt that process to let another process run.
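
As a minimal illustration, assuming a Linux/POSIX system: the write() call below is a library wrapper that ultimately executes a trap/syscall instruction to enter the kernel, and syscall() issues the same request explicitly by system call number. This is a sketch, not part of the text above.

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const char msg[] = "hello via a system call\n";

        /* Ordinary library wrapper around the write system call */
        write(STDOUT_FILENO, msg, sizeof(msg) - 1);

        /* The same request issued explicitly by system call number (Linux-specific) */
        syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);

        return 0;
    }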

To allow the operating system to get control at regular intervals, a programmable interval timer can be configured to generate periodic hardware interrupts, for example every 10 milliseconds. When the timer generates the interrupt, control is transferred to an interrupt handler in the kernel and the processor is switched to kernel mode. When the operating system is ready to return, it issues a return from exception instruction. An interrupt or trap results in a mode switch, where execution is transferred from user mode to kernel mode. If the operating system decides it is time to replace the currently executing process with another process, it will save the current process' context and restore the saved context of another process. The context includes the values of a processor's registers, program counter, stack pointer, and memory mappings. Saving the context of one process and restoring that of another is called a context switch.

The operating system is responsible for managing system devices. The three categories of devices are:

  • character devices: any device whose data can be thought of as a byte stream. This includes keyboard input, mouse movements, printer output, camera inputs, etc.
  • block devices: any device that has persistent storage that is randomly addressable and read or written in fixed-size chunks (blocks). These devices include disks and flash memory. Because the data is persistent and addressable, it can be cached by the operating system so that future requests for cached content may be satisfied from system memory instead of accessing the device again. This cache of frequently-used blocks of data is called the buffer cache. Basically, any device that can be used to hold a file system is a block device.
  • network devices: packet-based communications networks.

Devices can be controlled by mapping their control and data registers onto memory. This is called memory-mapped I/O. The processor can get device status notifications (e.g., data is ready) either via a hardware interrupt or by periodically polling the hardware (e.g., when the interval timer goes off). Data can be transferred between the device and system via software that reads/writes the device's memory. This is called programmed I/O (PIO). Alternatively, the device may have access to the system's memory bus and can use direct memory access (DMA) to transfer data to/from system memory without using the processor.

Processes

A process is a program in execution. It comprises the state of the processor's registers and the memory map for the program. The memory map contains several regions that the operating system allocated and initialized when the process was first created. These regions are:

  • text: the machine instructions
  • data: initialized static and global data
  • bss: uninitialized static data that was defined in the program (e.g., global uninitialized strings, numbers, structures)
  • heap: dynamically allocated memory (obtained through memory allocation requests)
  • stack: the call stack, which holds not just return addresses but also local variables, temporary data, and saved registers

Text and data are present in the stored program. The size of the bss segment is also fixed and known in the stored program. The heap and stack are dynamically allocated. A process may be in one of three states:

  • running: the process is currently executing code on the CPU
  • ready: the process is not currently executing code but is ready to do so if the operating system would give it the chance
  • blocked (or waiting): the process is waiting for some event to occur (such as a requested I/O operation to complete) and will not be ready to execute instructions until this event occurs.

The operating system's process scheduler is responsible for moving processes between the ready and running states. A system where the operating system can save the context of a running program and restore the context of a ready program to give it a chance to run is called a preemptive multitasking system. Just about every operating system uses this today. Systems that allow a program to run until it terminates or blocks on I/O are called non-preemptive or cooperative systems.

The operating system keeps track of all processes via a list. Each element in that list is a pointer to a data structure that contains information about one process. This data structure is called a process control block (PCB). The process control block stores all pertinent information about the process, including its state, saved registers, memory map, owner ID, parent process, child processes, and open files. Each process is uniquely identified with a process ID (PID).

The basic process manipulation capabilities that POSIX systems (Unix, OS X, Linux, *BSD, etc.) offer include:

fork
fork creates a new process. The new process is a copy of the parent. It's running the same program and has the same open files. It is, however, a copy and not a reference to the parent. Any memory modifications or changes to open file status will be invisible to the parent and vice versa.
execve
execve does not create a new process. It loads a new program from the file system to overwrite the current process, reinitializing the process' memory map. It is often used immediately after fork to have the child process created by fork load and run a new program.
exit
exit tells the operating system to terminate the current process.
wait
wait allows a parent to wait for and detect the termination of child processes.
signal
signal allows a process to detect signals, which include software exceptions, user-defined signals, and death of child signals.
kill
kill sends a signal to a specified process. The signal can usually be detected with signal. It does not kill the process unless the signal is a kill (or terminate) signal.
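
A short sketch of these calls working together, assuming a POSIX system (the choice of /bin/ls as the program to run is just an example):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t pid = fork();                      /* create a copy of this process */

        if (pid < 0) {
            perror("fork");
            exit(1);
        } else if (pid == 0) {
            /* Child: replace this process image with a new program */
            char *argv[] = { "/bin/ls", "-l", NULL };
            char *envp[] = { NULL };
            execve("/bin/ls", argv, envp);
            perror("execve");                    /* reached only if execve failed */
            exit(1);
        } else {
            /* Parent: wait for the child to terminate */
            int status;
            waitpid(pid, &status, 0);
            printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
        }
        return 0;
    }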

Threads

A thread can be thought of as the part of a process that is related to the execution flow. A process may have one or more threads of execution. A process with more than one thread is said to be multithreaded. A thread-aware operating system keeps track of processes like we mentioned above: via a list of process control blocks. Each PCB now keeps a list of threads for that process (a process will have to have at least one thread). Thread-specific information (registers, program counter, stack pointer, priority) is stored within a thread control block (TCB). All other information that is shared among threads in a process is stored within the common PCB. Note that even though each thread gets its own stack, the stacks all reside in memory that is accessible to all threads within the process.

One advantage of threads is that creating threads and switching among threads in a process is more efficient than creating processes and context switching among processes. Threading also makes certain types of programming easier since all memory is shared among threads. Moreover, on multi-core or multiprocessor systems, the operating system can take advantage of the hardware architecture and schedule threads onto multiple processor cores to allow the threads to run in parallel.

Threads can be implemented in, and managed by, the operating system. These are known as kernel-level threads. It is also possible to create a thread library that will create and manage threads within a process. The library will save and restore registers and manage switching stacks among threads. It may ask the operating system for periodic software interrupts to enable it to preempt threads. To the operating system, this appears as one single-threaded process. These are called user-level threads. One caveat with user-level threads is that if one thread blocks on a system call then the entire process is blocked and no threads within that process get to run. Most operating systems offer non-blocking versions of system calls that a threading library can use to simulate blocking system calls.

The advantage of user-level threads is that they can be even more efficient than kernel-level threads since there is no need to even do a mode switch into the kernel. The threading library can also have its own thread scheduling algorithm that is optimized to a specific job. A disadvantage is that user-level threads cannot take advantage of multiprocessors. Threading models can be combined and user-level thread libraries can be used with operating systems that offer kernel threads. In the general case, hybrid threading maps N user-level threads onto M kernel-level threads.

The operations for creating and handling threads are conceptually similar to those for processes. Since threads share the same program and memory map, there is no concept of an execve operation. When a new thread is created, it starts execution at a given function in the program. When it returns from that function, the thread is destroyed. A thread can also exit explicitly via a system call. Instead of a wait operation, where a parent process waits for the termination of a child process, threads support a join operation, where any one thread within a process can wait for any other thread to terminate. There is no parent-child relationship in threads.

Finally, because threads access the same memory, it is important to allow a thread to grab a lock so that all other threads that want to grab the same lock will have to wait. The basic mechanism for this is a mutual exclusion lock.
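
A minimal POSIX threads sketch of two threads sharing memory and serializing access with a mutual exclusion lock (the counter and iteration count are illustrative):

    #include <stdio.h>
    #include <pthread.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);      /* enter the critical section */
            counter++;
            pthread_mutex_unlock(&lock);    /* leave the critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);             /* join: wait for a thread to terminate */
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);
        return 0;
    }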

Synchronization

Threads (or processes) are concurrent if they exist at the same time. They are asynchronous if they need to synchronize with each other occasionally. They are independent if they have no dependence on each other: it does not affect a thread whether another thread exists or not. Threads are synchronous if they synchronize frequently so that their relative order of execution is guaranteed.

A race condition is a bug where the outcome of concurrent threads is dependent on the precise sequence of execution (particularly, the interleaving of executing two threads). A race condition typically occurs when two or more threads try to read, write, and possibly make decisions based on the contents of memory that they are accessing concurrently. The regions of a program that try to access shared resources and cause race conditions are called critical sections. To avoid race conditions, we want to make sure that only one thread at a time can execute within a critical section. Enforcing this is called mutual exclusion. Deadlock is a condition where there is a circular dependency for resources among threads and each one is waiting on the other to release a lock (a simple case of deadlock is, "I need B before I release my lock on A" while "you need A before you release your lock on B"). Starvation is the case where the scheduler never gives a thread a chance to run. If a thread doesn't get to run, it may not get a chance to release any resources that it is holding.

Mutual exclusion can be achieved in a number of ways. Disabling interrupts when entering a critical section ensures that other threads will never get a chance to preempt you and run at all. This is a drastic solution. It also requires that you are running in kernel mode and have the privileges to disable interrupts. Moreover, doing so on a multiprocessor system will usually disable interrupts only on a single processor, allowing threads to run on other processors and access the critical section. Test and set locks (if (!locked) lock = 1) implemented in software are a simple approach but a buggy one because they introduce a race condition: a thread can be preempted between testing the value of a lock and setting it (grabbing it). This can allow multiple threads to get the same lock.

Processor manufacturers have introduced atomic instructions that can perform operations such as test-and-set as one indivisible operation, meaning that another thread cannot preempt the sequence of operations and they all execute sequentially and indivisibly. Some of these instructions are:

test-and-set
Set a memory location to 1 and return the previous value of the location. If the returned value is a 1 that means that somebody already grabbed the lock. If the value is 0 then you know that there was no lock before you executed the instruction and that you grabbed it while running the instruction.
compare-and-swap
Compare the value of a memory location with an old value that is passed to the instruction. If the two values match then write the new value into that memory location. With this instruction, we set a memory location to a specific value provided that somebody did not change the contents since we last read them. If they did, our new value will not be set.
fetch-and-increment
Increment a memory location and return the previous value of that location. This is analogous to the deli counter number dispensing machine that gives out incrementally increasing numbers. You grab a ticket (fetch-and-increment) and wait until it's your turn. When you're done, you increment turn so that the thread that grabbed the next ticket gets to go.

These atomic-instruction-based mechanisms all require looping in software to wait until the lock is released. This is called busy waiting or a spinlock. It is undesirable because it keeps the threads in a ready to run state even though they have no useful work to do except loop and wait for a value to change. Moreover, if a thread that is looping on a spinlock is a high priority one, it is always ready to run. With a priority scheduler, it may always get scheduled to run, starving a low priority thread that may be the one that has the lock that needs to be released. This situation is known as priority inversion: the low priority thread needs to be given a high priority so it can exit the critical section as quickly as possible and let the high priority thread get it.
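
As an illustration only (not from the text), a spinlock can be built on an atomic test-and-set, here using C11 atomics; the loop below is exactly the busy waiting described above:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void spin_lock(void) {
        /* atomic_flag_test_and_set sets the flag to 1 and returns its old value.
           If the old value was 1, someone else holds the lock: keep spinning. */
        while (atomic_flag_test_and_set(&lock))
            ;   /* busy wait */
    }

    void spin_unlock(void) {
        atomic_flag_clear(&lock);   /* release: set the flag back to 0 */
    }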

A more desirable approach than spinlocks is to have the operating system provide user processes with system calls that we can use for mutual exclusion and put processes that are waiting for a critical section into a waiting state. This way, a waiting process will not be running until it is allowed entry into the critical section.

Semaphores
A semaphore counts the number of wake-ups that are saved for future use. Each time you call down on a semaphore, you decrement its value. If the value is 0 and you call down then your thread goes to sleep. When a thread calls up on a semaphore, the operating system will wake up one of the threads that is sleeping on that semaphore. If no threads are sleeping on a semaphore then up will just increment the value of the semaphore. The most basic use of a semaphore is to initialize it to 1. When a thread wants to enter a critical section, it calls down and enters the section. When another thread tries to do the same thing, the operating system will put it to sleep because the value of the semaphore is already 0 due to the previous call to down. When the first thread is finished with the critical section, it calls up, which wakes up the other thread that's waiting to enter. In general, the semaphore is initialized to the number of threads that you want to allow into a section concurrently before further threads block. A short code sketch using POSIX semaphores appears below.
Event counters
An event counter works similarly to the way you would use a fetch-and-increment instruction but without the spinlock since the thread can go to sleep waiting for a value. Three operations are permitted: you can read the current value of an event counter, increment it, and wait until the event counter reaches a certain value. This last operation is a blocking one.
Condition variables (the synchronization mechanism used within monitors)
A condition variable supports two operations: wait until a condition variable is signaled and signal a condition variable. Signaling a condition variable will wake up one thread that is waiting on the condition variable.
Message passing

Message passing supports two operations: send and receive. The most common behavior of message passing is where a send operation does not block and a receive operation does block. We can use the blocking aspect of receive to coordinate mutual exclusion by sending a null message ahead of time. The first thread that does a receive will get the message and then enter the critical section. The second thread will block on the receive. A message passing system where the send blocks until the data is received via receive is called rendezvous. Rendezvous is efficient when messaging is done within the machine (versus a network) in that it does not require the message to be copied to an intermediate buffer. However, it forces the senders and receivers to be tightly synchronized.

Conventional message passing requires that the sender knows the identity of the thread/process that will be receiving the message. This is known as direct addressing. An alternate form of messaging is via mailboxes using indirect addressing. Mailboxes allow any number of threads to send to an intermediate queue, called a mailbox. Any number of threads can read from this queue. Hence, it is easy to support multiple readers and writers and neither the writers nor the readers have to know how to address each other.
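
To make the semaphore usage described above concrete, here is a minimal sketch using POSIX unnamed semaphores (the function names init, enter, and leave are illustrative, not part of any standard API):

    #include <semaphore.h>

    static sem_t sem;

    void init(void)  { sem_init(&sem, 0, 1); }  /* value 1: one thread may enter at a time */
    void enter(void) { sem_wait(&sem); }        /* "down": decrements; sleeps if the value is 0 */
    void leave(void) { sem_post(&sem); }        /* "up": increments, or wakes one sleeping thread */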

Scheduling

Processes execute for a while, then block on I/O, then execute some more, then block, etc. Each period of execution is called a CPU burst. Overall throughput is increased if another process can run when a process is in a blocked state. A scheduler is responsible for deciding what process should get to run next and, if the current process did not finish its CPU burst, whether the current process has run too long and needs to be preempted with another process. If a scheduler can preempt a process and context switch another process in then it is a preemptive scheduler. Otherwise it is a non-preemptive, or cooperative scheduler.

A time slice (also known as quantum) is the maximum time that a process will be allowed to run before it gets preempted and another process gets a chance to run. Short time slices are good for maximizing interactive performance. Long time slices reduce the overhead of context switching but hurt response time. The scheduling algorithm determines whether a process should be preempted and which process gets to run next. The dispatcher is the component of the scheduler that is responsible for restarting the chosen process. It does so by restoring the process' context (loading saved registers and memory mapping) onto the CPU and executing a return from interrupt instruction that will cause the process to execute from the location that was saved on the stack at the time that the program stopped running — either via an interrupt or a system call.

First-come, first served scheduling

This is a non-preemptive scheduler that uses a simple FIFO queue. New processes are added to the queue. When the running process blocks on I/O (or exits), the scheduler takes the earliest (first in) process off the queue and runs it. When a blocked process becomes ready to run, it goes to the end of the queue. The drawback of FCFS is that it is non-preemptive and processes with long CPU bursts will delay other processes.

Round robin scheduling

This is a preemptive version of first-come, first served. When a quantum expires for a running process, the process is preempted and brought to the rear of the queue. The ready process at the head of the queue gets to run next. It's a simple scheduler but giving every process an equal share of the CPU is not necessarily a good idea since it does not schedule highly interactive (I/O-intensive) processes more frequently.

Shortest remaining time first scheduling

This scheduler picks the process with the shortest estimated CPU burst time to run next. The idea is to minimize average response time and let I/O bound processes run first and, hopefully, quickly issue their I/O requests and block. It also maximizes I/O utilization since we give priority to processes that we expect to block quickly. The trick to this scheduler is that you need to estimate the next CPU burst time. A weighted exponential average of previous CPU bursts is used to do this. The weight may be adjusted to weigh recent CPU bursts more or less than historic averages.
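
A sketch of that weighted exponential average (the function name and parameters are illustrative, not from the text):

    /* Estimate the next CPU burst from the previous estimate and the last observed burst.
       alpha close to 1 weighs the most recent burst heavily; close to 0 weighs history heavily. */
    double next_burst_estimate(double prev_estimate, double last_burst, double alpha) {
        return alpha * last_burst + (1.0 - alpha) * prev_estimate;
    }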

Priority scheduling

A priority scheduler requires that each process be assigned a priority. It simply picks the highest priority process from the list of ready processes. Priorities may be internal or external. Internal priorities are determined by the system. External priorities are assigned by the administrator. Priorities may also be static or dynamic. Dynamic priorities are priorities that are adjusted by the system as a process executes. Static priorities remain fixed for the duration of execution. A danger with priority scheduling is starvation. Starvation occurs if there is always a higher priority process that is ready to run. The lower-priority process never gets scheduled. We saw an example of this in synchronization when we looked at priority inversion. Starvation can be handled by using dynamic priorities and temporarily boosting the priority of processes that have not been scheduled in a while (this is known as process aging) or, conversely, by penalizing processes that do get scheduled and use up their quantum.

Multilevel queues

Instead of a pure priority scheduler that expects processes to have unique priorities, we can set up priority classes. Each class has its own unique priority and within each class we have a queue of processes that are associated with that priority class. We may choose to give high-priority classes short time slices to ensure good interactive performance and low-priority classes longer time slices to run longer but less frequently. Each priority class can have its own scheduling algorithm to determine how to pick processes from that class. For example, some classes may use round robin scheduling while others may use shortest remaining time first scheduling.

Multilevel feedback queues

Multilevel feedback queues are multilevel queues with dynamic priorities thrown in. High priority queues often have a short time slice associated with them. If a process reaches the end of its time slice rather than blocking, the scheduler demotes it to the next lower-priority queue. Lower priority queues are then assigned increasingly longer time slices. The process will get knocked to a lower priority level each time it runs for the full quantum. The goal is to keep interactive processes with short CPU bursts at high priorities and lower the priorities of CPU-bound processes while giving them longer stretches of time to run (but less often). A drawback is that if an interactive process has a stretch of computation, it will get knocked to a low level and never get pushed up. Some systems deal with this by raising a process' priority whenever it blocks on I/O. Another approach is to raise the priority of a process that has not had a chance to run for a while. This is called process aging. [ Note that Linux's approach has been to give interactive tasks a high priority and a longer time slice, with the expectation that they would block before using up the time slice. Linux, however, does not use a true multilevel feedback queue. ]

Multiprocessor scheduling

With multiple processors, the goals in scheduling are load balancing and processor affinity. Processor affinity is the desire to reschedule a process onto the same processor on which it was previously scheduled to take advantage of the fact that the processor may still have memory used by that process in its cache. Hard affinity is the case where the process will always be rescheduled onto the same processor. Soft affinity is the case where the process may be rescheduled onto another processor to keep the load balanced and ensure that all processors have processes to run. Load balancing is the process of making sure that all processors have processes to run. Push migration is a periodic examination of the load (number of processes in the run queue) on each CPU. If one processor is more heavily loaded than another, some processes are moved to another CPU's queue. Pull migration is the case where a scheduler has no ready processes to run on a CPU and has to get one from another CPU's queue. Push and pull migration are often used together.

Real-time scheduling

Real-time processes have to deal with providing guaranteed response. A start time is the time at which a process must start in response to some event, such as an interrupt from a sensor. A deadline, also known as the stop time, is the time at which the task must complete. A hard deadline is one in which there is no value if the deadline is missed. A soft deadline is one where there is decreasing value in the computation as the deadline slips. Processes may be terminating processes, which run and then exit. Their deadline is the maximum amount of time before they exit. Processes may also be nonterminating processes. These run for a long time but perform periodic operations, such as encoding and decoding audio and video. Their deadline is not the time to exit but rather the time by which results have to be ready for each period of execution (e.g., decode video at 30 frames per second).

Real-time systems have to be designed to be able to respond to events quickly. Systems that can guarantee response times are called hard real-time systems. Those that cannot are soft real-time systems. Real-time operating systems tend to use priority schedulers (possibly with variations), have a guaranteed maximum time to service interrupts, can ensure that processes are fully loaded into the system's memory, provide efficient memory allocation, and support multi-threaded execution or preemptable system calls. You do not want your response to an event delayed because the operating system is busy doing something else.

Earliest deadline scheduling

Each process has a deadline. The scheduler simply picks the process that has the nearest deadline (i.e., the process in greatest danger of missing its deadline).

Least slack scheduling

For each process, we know its deadline and required amount of compute time. Slack is how much free time is left: the time to the deadline minus the compute time. This is the maximum amount of time that we can procrastinate before we devote 100% of our resources to running the process. Least slack scheduling picks the process with the smallest slack: the one we can least procrastinate with.
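
An illustrative slack computation (the names and time units are hypothetical):

    /* How long we can procrastinate and still finish by the deadline. */
    long slack(long deadline, long now, long compute_time_remaining) {
        return (deadline - now) - compute_time_remaining;
    }
    /* A least slack scheduler picks the ready process with the smallest slack(). */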

When we compare the effects of earliest deadline with least slack, we see that all processes will complete on time if there is sufficient compute time available. If there isn't, earliest deadline scheduling will likely complete some processes on time while others may be significantly late. Least slack scheduling will have the effect of having all processes be late by about the same amount.

Rate monotonic analysis

Rate monotonic analysis is a way to assign priorities to periodic tasks. It simply gives the highest priority to the process with the shortest period (highest frequency).

Memory Management

The most rudimentary approach to dealing with memory is single partition multiprogramming, where we allow only a single program to be loaded in memory at any given time. The program shares its memory with the operating system and nothing else. This was the approach adopted in early computers and early personal computers prior to time sharing.

Since we want to be able to switch between multiple executing programs, we need to have those programs loaded into memory. The more programs are loaded in memory, the greater is the chance that one of them is ready to run, as opposed to being blocked waiting on I/O. CPU utilization is the estimated percentage of time that the CPU is not idle. The number of processes in memory at one time – and candidates for running – is known as the degree of multiprogramming.

If multiple programs are loaded into memory at once, they will, of course, occupy different memory locations. Absolute code uses real memory addresses and expects to be loaded into and run from a specific memory location. Position independent code uses relative addresses throughout and may be loaded into any memory location. Of course, there's a risk that some absolute memory references may have been used. Another approach is not to rely on the compiler to generate position independent code but instead leave memory references blank and generate a relocation table that will allow those references to be filled in when the program is loaded.

A completely different approach to this problem is to rely on hardware to dynamically transform address references made by the process into actual memory locations. This is known as logical, or virtual addressing. A memory management unit (MMU) performs this translation. The simplest approach to logical addressing is to add a base address to each memory reference made by a process. This address will be different for each process and will be based on where in memory that process was loaded. The MMU can also check the virtual address against a limit to ensure that the process is accessing only memory within the process. This is base & limit addressing. Some hardware may support breaking up a process's memory into several logical segments (code, data, stack) and each such segment is assigned its own register containing the base offset address for that segment. A logical address is effectively a combination of a segment identifier and an offset within that segment. The Intel memory model refers to the combined segment value and offset as a far pointer. This system is known as segmentation.

The base & limit addressing approach requires each process to occupy contiguous memory. Segmentation allows several variable-sized chunks of contiguous memory but each segment (e.g., stack) must occupy a contiguous region. With multiple fixed partitions, the operating system divides memory into a bunch of partitions ahead of time. When a program needs to be loaded, it is loaded into an available partition that can hold it. If no partition is available, the loading will have to be delayed in a queue of requests. Unused space within a partition remains unused. This is internal fragmentation. If any partitions have no programs to run then their memory is always wasted. This is called external fragmentation. With variable partitions, the operating system creates a partition for a process on demand, using a block of contiguous memory that is big enough to hold it. As processes come and go, holes (unallocated regions) develop and there may not be new processes that fit into any available hole. Memory compaction can relocate processes in memory but is a time-consuming approach. If a process grows, it will need more memory and the operating system will need to find a bigger unallocated block and move the process, relocate other processes, or save the process's memory onto disk and wait until a big block frees up. All of these hugely impact the performance and responsiveness of a multitasking system.

Page-based virtual memory

With a page-based memory management unit, physical memory is divided into equal-sized chunks whose size is a power of two (that way, we can use a fixed number of bits to address a byte offset within the chunk). These chunks are called page frames. A logical memory address is divided into two components. The high-order bits of the address identify a page number and the low-order bits identify a page offset, the byte offset within a page. For example, a 32-bit address with 4 KB pages will use the lower 12 bits (2^12 = 4096) to identify the page offset and the 20 high bits to identify the page number. For each memory reference made by the processor, the memory management unit takes the logical address, extracts the page number, and uses that as an index into a page table. Each page table entry (PTE) contains a page frame number. The physical memory address generated by the MMU has the page number from the logical address replaced with the corresponding page frame number from the PTE. A direct mapping paging system contains a page table per process. When the operating system performs a context switch to run a specific process, it loads a page table base register with the address of that process' page table. Each PTE in a page table contains a page residence bit that indicates whether that specific page is mapped onto a page frame. If it is, then the PTE also contains the page frame number that corresponds to that page. If the page is not mapped onto a page frame, then a page fault is generated. This is an interrupt that is handled by the page fault handler of the operating system. The fault handler will examine the memory address of the instruction that faulted and decide whether the process needs to be terminated because of an invalid memory access or whether the access was indeed valid but the requested page was not mapped onto a page frame. The PTE also contains other flags, such as whether the page is writable, executable, and whether it has been accessed or modified.
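
A sketch of the translation just described for 32-bit addresses and 4 KB pages; the page table is shown as a plain array of frame numbers, whereas a real PTE would also carry residence and permission bits:

    #include <stdint.h>

    #define PAGE_SIZE   4096u
    #define OFFSET_BITS 12              /* 2^12 = 4096 */

    uint32_t translate(uint32_t vaddr, const uint32_t *page_table /* page -> frame */) {
        uint32_t page   = vaddr >> OFFSET_BITS;          /* high 20 bits: page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);       /* low 12 bits: byte offset */
        uint32_t frame  = page_table[page];              /* PTE lookup */
        return (frame << OFFSET_BITS) | offset;          /* physical address */
    }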

If we had to consult a memory-based table for every memory operation, the memory system would effectively be twice as slow since each memory access would require another memory access to look up the location of the page in the page table. To improve performance, the MMU employs an associative cache known as a translation lookaside buffer (TLB) to store recently-accessed page numbers and their corresponding PTEs. Since the corresponding PTE is a function of the per-process page table that we used to do the lookup, most MMUs employ an address space identifier (ASID) in their TLB to identify the process to which an entry belongs. During a context switch, the operating system must set the ASID so that only cached values for that specific ASID will be used. The hit ratio is the percentage of memory references that can be satisfied from the TLB without resorting to a page table lookup.

A page table allows the process to potentially be able to address all system memory. Much of that memory will not be used by the process but we still pay the penalty of having to maintain a possibly large page table (our earlier example of 20-bit page numbers means that we need a table with 1,048,576 entries). To save on memory, we can use a multilevel page table. The virtual address is split into three or more parts. The lowest bits still give us the offset within a page and page frame. The very highest bits serve as an offset into a top-level index table. Each entry here contains the base address of a partial page table. The next set of bits of the virtual address serves as an offset into this partial page table. Entries in this table are PTEs that include the page frame address and various access flags.

One approach to deal with very large address spaces is the inverted page table. Instead of having a per-process table that maps logical pages to physical page frames, we have a single table with one entry per page frame. For a given virtual address, we search this table for a page frame that contains that virtual address. Clearly, searching a table is not good for memory access performance. We need to rely on an associative cache and the use of hashing to perform a lookup.

A paging system has a number of attractive features:

  • It does not let the amount of physical memory limit the size of our process.
  • It allows the process to feel as if it owns the entire address space of the processor.
  • It allows memory allocation to be non-contiguous and has no external fragmentation problems, simplifying memory management and obviating any need for compaction.
  • It lets us keep more processes in memory than the sum of their memory needs so that we can keep the CPU utilization as high as possible.
  • It lets us keep just the parts of a process that we're using in memory; the rest may be stored on a disk.

Systems such as the Intel IA-32 architecture employ a combined segmentation and paging system. A logical address is offset by a segment base register. This resultant address is a linear, virtual address and is looked up in a two-level page table to generate a physical address.

Demand paging

Demand paging refers to the technique of allocating and loading memory pages on demand; that is, when a process makes a request for that region of memory. When a process makes a memory reference to a page that is not mapped onto a page frame, the MMU generates a page fault. The page fault handler checks whether the reference is a valid memory reference or not. If it is not valid then the process is terminated. If the address references executable code or static data, then those contents are present in the program file on the disk and are read from there. If the memory reference is to a stack or heap that grows then a new frame is allocated and initialized. Finally, if the memory reference refers to stack or heap data that was written out to a page file on disk because the memory manager needed a free page frame, then a page frame is allocated, the disk read is scheduled, and the operating system context switches to some other process while this one waits for the memory contents it needs to be loaded into memory.

Page replacement is the process of finding a page to save out onto disk to create a free page frame. Since it requires disk activity, which is much slower than memory, the hope of an operating system is to minimize paging — shuffling page contents between memory and the disk. Ideally, we would like to use a least recently used (LRU) algorithm to select a page for replacement. The snag is that there is no way to keep track of use counts in an MMU so we need to find an efficient algorithm that comes close to LRU. Memory management units have a "referenced" bit that tells us if a page was referenced. The clock algorithm (also known as the second chance algorithm) looks at page frames as a circular list. When a page needs to be selected for eviction, the algorithm starts at its current position and looks for page frames whose referenced bits are 0 (they have not been referenced in a while). As it does this, it sets the referenced bits of pages it skipped over to 0.
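
A sketch of the clock algorithm; the frame-table representation and function name are hypothetical:

    #include <stdbool.h>

    struct frame { bool referenced; /* set by the MMU when the page is accessed */ };

    /* Advance the clock hand until an unreferenced frame is found; return its index. */
    int clock_evict(struct frame *frames, int nframes, int *hand) {
        for (;;) {
            if (frames[*hand].referenced) {
                frames[*hand].referenced = false;   /* give it a second chance */
            } else {
                int victim = *hand;                 /* not referenced recently: evict */
                *hand = (*hand + 1) % nframes;
                return victim;
            }
            *hand = (*hand + 1) % nframes;
        }
    }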

The nth chance replacement algorithm is similar except that it keeps a counter per entry. When we search for a page, if a page frame's referenced bit is 1, we set it to 0 and clear the counter. If the page frame's referenced bit is 0, we increment the counter. We continue with this until we find a clear page whose counter is greater than or equal to some value N. This algorithm effectively counts the number of times that we examined this page and it has not been referenced.

As a process executes, it accesses certain memory locations. Over some small interval of time, a process is unlikely to reference all of its memory. Instead, there's a principle of locality at work: a process tends to reference the same pages over and over before moving on to a few other pages. This set of pages is known as the process' working set. Some processes may have bigger working sets than others. In all cases, for good performance, we want to have each process' working set in memory. If this is not the case then the system will exhibit thrashing: constantly paging to and from the disk. A process' resident set is the set of pages that is currently mapped to page frames. The working set model approximates the locality characteristics of a process to try to identify the set of pages that the process needs resident in memory to avoid thrashing. One way to manage the resident set is by monitoring page fault frequency. If a process does not have its working set in memory, it will generate page faults. This is an indication that it needs more memory. Conversely, if a process is never generating page faults, that could be an indication that it has too much memory allocated to it. We can set thresholds of page faults to decide what percentage of available page frames should be allocated to a specific process.

Devices

A block device provides structured access to the underlying hardware. Block devices are devices that can host a file system. They provide addressable (block numbers) I/O of fixed, equal-sized chunks of bytes. Since block data is persistent, it is suitable for caching in memory. The buffer cache is a pool of kernel memory that is allocated to hold frequently used blocks from block devices. By using the buffer cache, we can minimize the number of I/O requests that actually require a device I/O operation.

A character device provides unstructured access to the underlying hardware. Examples of character devices include printers, scanners, mice, and a video frame buffer. Unlike a block device, there is no concept of seeking: addressing data by its location. Whereas in a block device, we may have a concept of, say, block 1,523 containing a chunk of data that will never change unless we change it, a character device presents a continuous sequence of bytes.

A network device is similar to a character device except that instead of sending and receiving a stream of bytes, it sends and receives bytes in chunks (packets).

On POSIX systems, the kernel maintains two tables to keep track of devices: a block device table and a character device table. The major number is an index into either the block or character device table and allows the kernel to find the set of functions that are associated with that device: the device driver. The minor number is interpreted within the device driver. It usually identifies which specific instance of a device is being accessed. For example, SATA disks will have one functional interface (driver) but multiple disks. The major number will identify the SATA driver and the minor number will identify a specific disk.

Devices, particularly on POSIX-based systems, are usually presented as file names via the hierarchical file system name space. A device appears as a file within the file system. It can have an arbitrary name and there is no data associated with this file but the file's metadata identifies the type of device (block or character) and the device's major and minor numbers, which identify the specific instance of the device.
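
On Linux, for example, a device file's type and major/minor numbers can be read from its metadata with stat(); the device path below is just an example:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>   /* major() and minor() on Linux */

    int main(void) {
        struct stat sb;
        if (stat("/dev/sda", &sb) == 0 && S_ISBLK(sb.st_mode))
            printf("block device, major %u minor %u\n",
                   major(sb.st_rdev), minor(sb.st_rdev));
        return 0;
    }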

By placing devices in the same hierarchical name space as files and ensuring that each device must implement a well-defined set of I/O operations, the operating system can provide a degree of transparency where applications can often be unaware of whether they are interfacing with a file or a device. For example, the shell can just as easily read commands from a terminal window (device) as it can from a file containing a script. When a file is first opened, the kernel can check the file's attributes to determine if it is a device file. If so, it reads the type of device (block or character) and its major and minor numbers from the file's metadata. The kernel then sends operations to the driver for that device. The underlying file system where that device file resides does not get involved.

Execution contexts

Any kernel code is running in one of three contexts:

  1. interrupt context: this is invoked by the spontaneous change in the flow of execution when a hardware interrupt takes place. Any actions performed in this context cannot block because there is no process that requested the action that could be put to sleep and scheduled to wake up when an operation is complete.
  2. user context: this is the flow of control when a user process invokes a system call. The mode switch to kernel mode took place but the context is still that of a user process. If needed, the kernel may schedule an I/O operation, put the process in a blocked state, and context switch to another process.
  3. kernel context: the kernel itself has one or more worker threads that it schedules just like any other process. Even though these have no relation to user threads, they have a context and may block. Note: the kernel context is not the context of a process that has just made a system call; that is user context running in kernel mode.

Device drivers

Device drivers are modular and can be compiled into the kernel, loaded at initialization, or dynamically loaded at any time in the future. Upon initialization, they register themselves with the kernel's interrupt handler. When a specific interrupt occurs, the kernel handler calls the appropriate interrupt handling function in the driver. To ensure that the interrupt context does not take too long (since it cannot block), interrupt handling is split into two parts:

  1. The top half of a driver is the interrupt service routine that is registered with the interrupt handler. It tries to do as little as possible — generally grabbing data, placing it into a work queue (a buffer), and scheduling bottom half activity.
  2. The bottom half of a driver is the part that is scheduled by the top half for later execution and does much of the real work. Because it runs in a kernel thread (that handles work queues), it may perform blocking operations.

The top and bottom halves of a driver communicate via I/O queues. A device status table keeps track of devices and the current status of each device. Each device has an I/O queue associated with it. Any received I/O (reads from the device or writes by the user) is placed on the work queue and scheduled by a kernel thread.

Disk scheduling

Even though solid state flash storage is becoming increasingly popular, spinning magnetic disks are still the dominant device for storing large amounts of information. Compared to memory, a disk is an incredibly slow device. The buffer cache helps with the performance of frequently-accessed blocks. Beyond that, we rely on disk scheduling algorithms to optimize the flow of data between memory and the disk.

The simplest algorithm is First Come, First Served (FCFS), which processes requests in the order in which they are generated. The elevator algorithm (SCAN) takes advantage of the fact that a disk has a head that moves back and forth across the disk. It relies on the system to keep track of the current head position and direction. Requests are sorted in the order in which they will be encountered as the head moves from track 0 to the highest track on the disk. The head movement then reverses and any outstanding requests are processed from highest to lowest as the head seeks back to track 0.

The LOOK algorithm simply reverses the disk head's direction if there are no blocks scheduled beyond the current point. The Circular SCAN (C-SCAN) algorithm is the SCAN algorithm but schedules disk requests only when the head is moving in one direction to avoid preferential treatment of blocks in the middle. The Circular LOOK (C-LOOK) algorithm is like LOOK but schedules requests in only one direction. In the Shortest Seek Time First (SSTF) algorithm, the queue of current requests is sorted by cylinder numbers closest to the current head position and requests are serviced in that order.
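
As an illustration, a Shortest Seek Time First selection might look like the sketch below (the request-list representation is hypothetical):

    #include <stdlib.h>

    /* Return the index of the pending request closest to the current head position,
       or -1 if there are no pending requests. */
    int sstf_next(const int *requests, int n, int head) {
        int best = -1, best_dist = 0;
        for (int i = 0; i < n; i++) {
            int dist = abs(requests[i] - head);
            if (best < 0 || dist < best_dist) {
                best = i;
                best_dist = dist;
            }
        }
        return best;
    }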

File Systems

File systems provide us with the abstraction of persistent and protected objects sitting in a hierarchical name space. The actual data resides on block devices: disks or flash memory. The specific file system that is used on that device defines the data structure on the disk blocks that defines how directories, files, free space information, and permissions are stored.

Multiple individual file systems can be combined into an abstract view of a single hierarchical name space via a mount mechanism. By executing a mount system call and providing it with the block device, underlying file system type, and a mount point in the current file system, the operating system will place the root of the file system on that device right at the given mount point. The operating system will maintain a list of mount points so that attempts to traverse directories will switch to the appropriate device when a mount point is encountered.
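
On Linux, for instance, the mount system call takes the device, mount point, and file system type; the names below are illustrative and the call requires appropriate privileges:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void) {
        /* Mount the file system on /dev/sdb1 (assumed to be ext4) at /mnt/data */
        if (mount("/dev/sdb1", "/mnt/data", "ext4", 0, NULL) != 0)
            perror("mount");
        return 0;
    }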

In the past, operating systems would support only a single file system implementation. To support multiple file systems transparently, a layer of abstraction called the Virtual File System is used. This is an object-oriented approach to looking at file systems. System calls that interact with files communicate with the VFS layer. This layer implements high-level operations related to managing files and directories that are common to all file systems. The VFS layer then interfaces with one or more file system modules that implement the specific underlying file system. This file system module interacts with the buffer cache and with block device drivers to read and write data to the device containing the file system. The layers are:

  • system call interface: APIs for user programs
  • virtual file system: manages the namespace, keeps track of open files, reference counts, file system types, mount points, pathname traversal.
  • file system module (driver): understands how the file system is implemented on the disk. Can fetch and store metadata and data for a file, get directory contents, create and delete files and directories
  • buffer cache: no understanding of the file system; takes read and write requests for blocks or parts of a block and caches frequently used blocks.
  • device drivers: the components that actually know how to read and write data to the disk.

At the VFS layer, the crucial components are:

inode
An inode uniquely identifies a file. In the file system, it stores key information about the attributes of a file and the location of its data. At the VFS layer, it contains functional interfaces for creating files, directories, symbolic links, and reading and setting attributes. In short, it contains directory and file operations that do not relate to the file's data.
dentry
A dentry, short for directory entry, is an abstract representation of directory contents. It contains the filename as a string, its inode (so we can access its contents), and a pointer to its parent dentry. A directory file can be read and converted into a set of dentry structures.
file
The VFS interface keeps track of open files and contains a set of interfaces to open, close, read, and write files as well as to map them to memory or lock them. The file object represents an open file and keeps track of the access mode and reference counts.
superblock
Contains interfaces to get information about the file system, read/write/delete inodes, and lock the file system.

Managing files on disks

The lowest level of working with file systems and storage media is managing and allocating free disk blocks to files. For flash memory, there are no performance implications of using one block versus another one. For disks, however, that is not the case. Seek times greatly affect performance and the goal is to keep frequently-used or related blocks together. Ideally, files would use contiguous allocation. That is, if a file's data requires multiple disk blocks, the blocks would be contiguous. Unfortunately, as files get deleted and new ones get created, we'll find that we get external fragmentation as regions of free blocks too small for most files get scattered between existing files. Moreover, we usually don't know ahead of time how much space a file needs.

A compromise to contiguous allocation is to use extents. An extent is a contiguous set of blocks. A file will comprise one or more extents. File systems that use extents generally refer to extents (as a <block number, size> tuple) instead of block numbers when managing a file's data. A cluster is a logical block used throughout the file system that is a grouping of multiple physical blocks. For example, many disks provide 512 byte blocks but a file system may use 4096-byte clusters as its basic allocation unit, reading and writing eight blocks at a time. This results in better performance but also increases the amount of internal fragmentation in the file system.

The technique of linked allocation uses a few bytes of each block to store the block number of the next block in a file. An alternate implementation of this is a file allocation table (FAT). Here, a table of block numbers is created. The entry at table[block_num] contains the block number that follows the block block_num in the allocation list for a file. The directory, in addition to containing the file name, will also store the first block number that is allocated to the file. The entire chain of blocks can be read from the file allocation table.
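
A minimal sketch of following a FAT chain, assuming the table has already been read into memory and that a hypothetical FAT_EOC value marks the end of a chain:

	#include <stdio.h>

	#define FAT_EOC 0xFFFFFFFFu   /* hypothetical end-of-chain marker */

	/* Walk one file's chain: each FAT entry holds the number of the block that follows it. */
	void print_file_blocks(const unsigned int *fat, unsigned int first_block)
	{
	    for (unsigned int b = first_block; b != FAT_EOC; b = fat[b])
	        printf("block %u\n", b);
	}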

For FAT to be efficient, the entire table must be kept in memory. As an alternative approach, we can read in just the list of blocks used by a specific file when we open the file. What this means is that with each file we associate the entire list of blocks that the file uses. This approach is called indexed allocation but it isn't practical since these will be varying-size lists and, for big files, may be extremely long. A variation of this, combined indexing, uses fixed-length structures. The information structure for a file, the inode, contains a list of just the first several blocks of the file. These are direct block pointers. If a file needs to use more blocks, the inode contains an indirect block pointer. This is the number of a block that contains a list of direct block pointers (block numbers for data). If a file needs even more blocks, a double indirect block pointer in the inode will point to a block whose contents are a list of indirect block pointers. This is the approach used in the traditional UNIX file system as well as Berkeley's fast file system (FFS) and Linux's ext2, ext3, and ext4 file systems. A directory is just a data file that contains a series of names, each with its associated inode number. The inode contains file permission information, creation/modification/access times, an identification of whether the file is a regular file or a device, and direct, indirect, double, and (in some cases) triple indirect block numbers.
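
A rough sketch of how combined indexing maps a byte offset to the kind of pointer that covers it, assuming 4 KB blocks, 12 direct pointers, and 4-byte block numbers (example values only, not those of any particular file system):

	#define BLK          4096UL
	#define NDIRECT      12UL
	#define PTRS_PER_BLK (BLK / 4)   /* 1024 block pointers fit in one indirect block */

	/* Report which kind of pointer covers the block containing a given byte offset. */
	const char *region_for_offset(unsigned long offset)
	{
	    unsigned long blk = offset / BLK;
	    if (blk < NDIRECT)
	        return "direct";
	    blk -= NDIRECT;
	    if (blk < PTRS_PER_BLK)
	        return "single indirect";
	    blk -= PTRS_PER_BLK;
	    if (blk < PTRS_PER_BLK * PTRS_PER_BLK)
	        return "double indirect";
	    return "triple indirect";
	}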

File system implementation case studies

Unix File System (UFS)

The Unix File System is an implementation of the inode-based approach just described. When laid out on the disk, it contains three sections: a super block, inode blocks, and data blocks. The super block contains information about the file system, including a pointer to a linked list of free blocks. Directories are files that contain a list of names and corresponding inodes. An inode identifies a file as being a regular file, directory, or device file.

The performance of UFS was generally quite poor (2-4% of raw disk bandwidth). Several factors contribute to this inefficiency. The linked list approach to free block allocation leads to a high degree of fragmentation over time. In addition to our earlier encounters with the terms internal fragmentation and external fragmentation, we have another use of the term within file systems. Fragmentation is when a file's data is not allocated among contiguous blocks. The small block sizes typically used by UFS make that degree of fragmentation even higher than it would be otherwise and necessitate the use of double and triple indirect blocks for even not-very-large files. Finally, the fact that inodes reside at one end of the disk (beginning) and much of the data resides at the other end (middle to end) leads to large disk seeks between reading a file's inode and that file's data.

Berkeley Fast File System (FFS)

The Berkeley Standard Distribution variant of UNIX redesigned the file system to improve performance in several areas:

  1. Larger clusters. FFS chose a larger cluster size: at least 4 KB instead of the 512 bytes or 1 KB of UFS. This brings about an eight-fold improvement in contiguous allocation. Also, each block pointer in an inode or disk block now addresses much more data, so a file has to be far bigger before indirect, double indirect, and triple indirect blocks have to be used. To avoid too much internal fragmentation from large cluster sizes, FFS allowed the last cluster of a file to be a fragment, part of a cluster.
  2. Cylinder groups. Instead of having one region for inodes and another for data, multiple such regions were created. FFS looks like lots of little file systems laid out on one disk. In most cases, the data that corresponds to an inode will be located in the same cylinder group, minimizing seek times.
  3. Bitmap allocation. Instead of using a linked list of free blocks, a bitmap is used to keep track of which blocks are in use within each cylinder group. This makes it easy to find contiguous or nearby clusters for a file's data rather than just using any available free cluster. Moreover, FFS will preallocate adjacent clusters to a file on the chance that the file may need them as it grows.
  4. Prefetch. If two or more sequential blocks are read from a file, FFS assumes that file access is sequential and will prefetch additional blocks from that file.

These changes resulted in a 5-10 times performance improvement over UFS.

Linux ext2

Linux's ext2 file system is an adaptation of Berkeley's FFS. With bigger disks, space wasted due to internal fragmentation was no longer considered that precious and worth saving, so managing fragments was abandoned to simplify cluster allocation. Instead of cylinder groups, ext2 uses the term block group since modern disks make it impossible to identify cylinders; we just have the abstraction of a long linear list of blocks. To improve performance over FFS at the expense of possible disk corruption, ext2 caches all data aggressively in the buffer cache. FFS would flush metadata changes aggressively to minimize corruption of file structure even though the data was still at risk.

Journaling and Linux ext3

A file system is consistent if all free block and inode bitmaps are accurate, all allocated inodes are referenced by files or directories, and each data block is either marked free or is referenced by an inode. Whenever a file is modified, a disk block may be allocated, a free block bitmap modified, an inode modified, and a disk block written. If anything happens to the system, such as a spontaneous loss of power, only some of these changes may have been written out to the disk, resulting in an inconsistent state. The consistent update problem is the challenge of performing all updates atomically: either all of them should be applied to the file system or none should be, ensuring that the file system always remains in a consistent state. Journaling is a method of providing this guarantee of atomicity. All changes associated with a file modification are first written as a log to a journal (a dedicated area on the disk). The file system itself is not touched. When all changes are complete, the journal log is marked with a transaction-end. Only after the transaction has been made permanent on the disk is the same sequence of changes applied to the file system. When complete, the transaction entry is deleted from the journal. Should anything happen to the system, upon restart, any completed transactions in the journal are applied to the file system to bring it to a consistent state.
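
A conceptual sketch of full data journaling in C, where the helper functions (journal_append, journal_commit, fs_apply, journal_clear) are hypothetical stand-ins for the journal and file system code, not a real kernel API:

	struct change;                          /* opaque description of one disk update */
	void journal_append(struct change *c);  /* hypothetical helpers */
	void journal_commit(void);
	void fs_apply(struct change *c);
	void journal_clear(void);

	void journaled_update(struct change **changes, int n)
	{
	    for (int i = 0; i < n; i++)
	        journal_append(changes[i]);     /* 1. log every change; the file system is untouched */
	    journal_commit();                   /* 2. write the transaction-end marker to disk */
	    for (int i = 0; i < n; i++)
	        fs_apply(changes[i]);           /* 3. apply the same changes to the file system */
	    journal_clear();                    /* 4. delete the transaction from the journal */
	}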

The downside of journaling is performance: all disk operations are applied to the journal and then applied again to the file system. By giving up some integrity, there are several ways that performance can be improved.

Full data journaling
As described above, all disk operations are written to the journal first and then applied to the file system.
Ordered journaling
This writes data blocks first, then journals only the metadata (inode information, bitmaps, and superblock changes), terminates the transaction, and then applies all the metadata that was journaled onto the file system. The file system is always kept in a consistent state but in some cases may lead to partially modified files if, for example, data was written to existing and new blocks of a file but the system died before an end-of-transaction record was written.
Writeback journaling
This journaling does metadata journaling but without writing the data blocks first. Recently modified files can be corrupted after a crash. It's the fastest of the journaling options and a weak form of journaling since it can lead to data corruption. However, it leads to no more corruption than with non-journaled file systems and can save a lot of time in file system recovery.

Linux's ext3 file system is the same as ext2 but with the addition of a journal in each block group.

Linux ext4

The ext4 file system is an enhancement to ext3 and can support far bigger files. Two key enhancements were:

  1. Extent-based allocation. Recall that an extent is a set of contiguous clusters. Instead of listing block numbers in inodes, ext4 uses extents. An extent contains the starting cluster number of a group of clusters along with a cluster count. In cases where contiguous blocks can be allocated, this allows much more data to be addressed from the list in an inode. Since computing the block that holds a specific byte offset within a file is difficult with extents (you have to iterate through the list of extents), the use of indirect blocks has been replaced with an Htree (similar to a B-tree) data structure that makes it easy to traverse a tree to search for a block representing a specific offset within a file.
  2. Delayed allocation of disk space. Instead of allocating data clusters as a file grows, the blocks are kept in the buffer cache and allocated to physical blocks only when they are ready to be flushed. This increases the likelihood that we can allocate contiguous blocks for a file.

Microsoft NTFS

NTFS was the successor to Microsoft's FAT family of file systems and has been the file system of choice on every version of Windows Server and every consumer version since Windows Vista. NTFS allocates data in extents on top of clusters and is designed such that just about every structure in the file system appears as a file. Whereas Unix-derived file systems had distinct areas for inodes, journals, and bitmaps, under NTFS they are all just files.

The MFT, or Master File Table is the structure that keeps track of file records (similar to an inode table). It is structured as a B-tree to facilitate searching for a specific file record. Each file record is roughly similar to an inode in that it contains information about the file. A file record contains an arbitrarily long list of attributes. If the list is too long to fit within a single file record, additional extents are allocated, with an attribute in the record identifying their location. The data associated with a file record is just an attribute. Very small files or directories may contain all their data within a file record and not require any additional disk blocks.

Special File Systems and Files

Flash memory

There are a couple of differences between solid state flash memory and magnetic disk drives. NAND flash memory presents itself as a block-addressable device: data is read or written a page (block) at a time and each page is uniquely addressed. In this way, it is very similar to a magnetic disk drive. Because there is no moving disk head that must seek to a specific track to read or write a block, there is no seek latency with NAND flash, making the use of elevator algorithms pointless. Unlike a disk, data cannot be written directly to a page of NAND flash: the contents of the page must be erased first, turning write operations into an erase-write sequence. Unlike disk drives, each page of NAND flash can handle only a limited number of erase-write cycles before it wears out. This number is typically between 100,000 and 1,000,000. Because of this, a goal in NAND flash is wear leveling: distributing writes across all pages of the storage.

Dynamic wear leveling monitors high and low use areas of flash memory and at some point swaps high-use blocks (pages) with low-use blocks. Unused blocks are not affected. A more aggressive form of wear leveling is static wear leveling where even unchanging (static) contents are periodically moved to other blocks to give all blocks a chance to wear out evenly. Wear leveling is implemented either in a NAND flash controller or in a layer of software between the flash device driver and the block device. This layer of software is called a Flash Translation Layer (FTL). The FTL maintains a block lookup table that maps logical to physical blocks so the block can be identified and located even after it is moved.
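
A minimal sketch of the lookup table an FTL maintains (the table size and function names are made up for illustration):

	#define NBLOCKS 1024

	static unsigned int l2p[NBLOCKS];   /* l2p[logical block] = physical block */

	unsigned int ftl_lookup(unsigned int logical)
	{
	    return l2p[logical];            /* every read and write goes through the map */
	}

	void ftl_remap(unsigned int logical, unsigned int new_physical)
	{
	    /* after wear leveling copies the data to new_physical, only the mapping changes */
	    l2p[logical] = new_physical;
	}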

Log-structured file systems

To avoid reusing the same blocks over and over again, a log-structured file system is attractive for flash memory. This is a file system where each file operation is written as an entry to a transaction log (e.g., "new file 111: name=ABC", "new data block 0 for file 111: ...."). One entry may obsolete an older entry. When played from start to end, the log allows us to construct an up-to-date file system. Unlike traditional file systems, the entire file system is stored simply as a sequence of log entries.

YAFFS is a popular log-structured file system. All data is written to a log as chunks. Each chunk is either a data chunk or an object header, which represents file metadata or a directory. New chunks may cause certain old chunks to become obsolete. If all the chunks in a block are no longer valid, the block can be erased and reused. Garbage collection is performed periodically to reclaim free blocks. Free blocks can also be created by copying live (in-use) chunks from blocks that contain mostly deleted chunks onto other blocks.

On startup, the YAFFS driver scans the log and constructs an in-memory view of the file system hierarchy (minus the data, which will still be located in flash). Checkpointing is the process of writing out this file system hierarchy onto the file system. It speeds up mounts since the file system driver can read the last checkpoint and no longer needs to reconstruct the entire file system hierarchy.

Special devices

The classic view of a device driver is that of software that interfaces with some physical device and allows user processes to read, write, and otherwise control the device. The device, however, need not be a real device. Drivers can be written to create data streams for various purposes. A few examples of this are the null device (which discards any writes and always returns an end-of-file on reads), zero device (similar to the null device but always returns bytes containing values of 0), and the random device, which returns random numbers.

The loop device is a special device that is given a file name (via an ioctl system call) and creates a block device interface to that file. A block device interface is one where the system can only read and write fixed-size blocks of data. File systems are formatted on top of block devices (usually disk drives or NAND flash memory). Since a file can now present itself as a block device, one can now create a file system within that file and mount it just as one would a physical disk or flash memory.

Special file systems

Our classic view of a file system is that of a file system driver which lays out data structures on top of a block device and allows one to create, access, and delete files and directories. A file system driver can be more abstract, however. The process file system, for example, creates a file system view of processes and other aspects of the kernel. Each process is presented as a directory. Various aspects of the process, such as its command line, open files, memory map, signals, and statistics, are presented as files and directories within the process directory. A device file system presents all the kernel's devices as device files so that one does not need to explicitly create and remove them from a /dev directory in a physical file system. FUSE is a kernel module that allows one to create user-level file systems. The VFS layer sends requests to FUSE that, in turn, sends them to a user-level process that interprets the requests as it sees fit.

Client-server Networking

Data goes over a network in one of two ways: baseband or broadband. Only one node may transmit data at a time on a baseband network but, for that time, it has the full bandwidth of the network. A broadband network, on the other hand, has its available bandwidth divided into multiple channels, or frequency bands. Cable TV is an example of a broadband network. However, data services offered by cable providers are confined to two groups of channels (one set for downstream traffic and another set for upstream), making IP access effectively baseband within broadband. Don't confuse these terms with the marketing community's use of broadband to refer to any relatively high-speed network to the home.

Baseband networks have all nodes on the network share the same frequencies. Therefore, they cannot all talk at the same time and have to share the network. Two ways of sharing the network are circuit-switching and packet switching. A circuit-switched network divides the network into short, fixed-length time slots and each node is allowed the use of only specific time slots. This is called Time Division Multiplexing (TDM). With circuit switching, a dedicated path (route) is established between two endpoints. Circuit switching provides guaranteed bandwidth and constant latency but does not use networking resources efficiently since time slots may go unused if a node has nothing to transmit. The public telephone network is an example of circuit switching, providing a maximum delay of 150 milliseconds and digitizing voice to a constant 64 kbps data rate. Packet-switched networking uses variable-length time slots. Data is segmented into variable-size packets and each packet must be identified and addressed. This type of transmission generally cannot provide guaranteed bandwidth or constant latency. Ethernet is an example of a packet-switched network.

Data networking is usually implemented as layers (a stack of several protocols, each responsible for a specific aspect of networking). The OSI reference model defines seven layers of network protocols. Some of the more interesting ones are: the data link, network, transport, and presentation layers.

The data link layer (layer 2) transfers data within a physical local area network. The network layer (layer 3) manages routing: the journey of packets from one machine to another possibly across multiple networks. The transport layer (layer 4) manages the communication of data from one application to another rather than machine-to-machine delivery. The presentation layer (layer 6) manages the representation of data and handles any necessary conversion of data types across architectures (for example, different byte ordering in integers, different character representations).

IP Networking

The Internet Protocol (IP) handles the interconnection of multiple local and wide-area networks. It is a logical packet-switched network whose data is transported by one or more physical networks (such as Ethernet, for example). In the OSI model, IP is a layer 3 (network) protocol, while ethernet is a layer 2 (data link) protocol. Each machine on an IP network must have an IP address. As IP is a logical overlay network that connects multiple physical networks together, the IP address is separate from, and unrelated to, the address of any underlying network (e.g., ethernet address).

Since IP is a logical network, any machine that needs to send out IP packets must do so via the physical network. Typically, this is Wi-Fi or Ethernet, which uses a 48-bit Ethernet address that is unrelated to a 32-bit IP address. To send an IP packet out, the system needs to find the physical (Ethernet) destination address (the MAC, or Media Access Control address) that corresponds to the desired IP destination. The Address Resolution Protocol, or ARP, accomplishes this. It works by broadcasting a request containing an IP address (do you know the corresponding MAC address for this IP address?) and then waiting for a response from the machine with the corresponding IP address. To avoid doing this for every outgoing packet, it maintains a cache of most recently used addresses.

There are two transport-layer protocols on top of IP: TCP and UDP. TCP (Transmission Control Protocol) simulates virtual circuit (connection-oriented) service. This layer of software ensures that packets arrive in order to the application and lost or corrupt packets are retransmitted. The transport layer keeps track of the destination so that the application can have the illusion of a connected data stream. UDP (User Datagram Protocol) provides datagram (connectionless) service. While UDP drops packets with corrupt data, it does not ensure in-order delivery or reliable delivery. Port numbers in both TCP and UDP form the transport address and are used to allow the operating system to direct the data to the appropriate application endpoint (or, more precisely, to the socket that is associated with the communication stream).

Protocol encapsulation keeps the layers of the networking stack separate. A TCP or UDP packet, along with associated packet data, is treated simply as data within an IP packet. The IP header has an 8-bit protocol field that identifies the type of protocol that is encapsulated within it so that an IP driver can send the packet to the right transport-level protocol (e.g., to the TCP module). Likewise, an Ethernet packet (called a "frame") treats the entire IP packet as data. It has a two-byte value that identifies the type of protocol that it is encapsulating so that an ethernet driver can send the data to the correct protocol module (e.g., to the IP driver).

Sockets

Sockets are the interface to the network that is provided to applications by the operating system. A socket is a communication endpoint that allows one application to communicate with another application. Writing data to a socket allows a corresponding socket on another application to receive that data. The interface is usually a data network between machines but doesn't have to be — sockets can also be used for interprocess communication within the same system. Because sockets are designed to deal with application-to-application communication, they almost always interact at the transport layer of the OSI reference model (UDP/IP and TCP/IP if IP communications is used).

Sockets are created with the socket system call and are assigned an address and port number with the bind system call. For connection-oriented protocols, a socket on the server is set to listen for connections with the listen system call. The accept call blocks until a connection is received at a listening socket, at which point the server process is given a new socket dedicated to the new connection. A client can establish a connection with a server socket via the connect system call. After this, sending and receiving data is compatible with file operations: the same read/write system calls can be used. This is a key feature of sockets: once the network interface has been set up, the application can use file system I/O to communicate and does not have to think about networking. When communication is complete, the socket can be closed with the shutdown system call or the regular file system close system call.
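
A minimal sketch of a TCP client using this sequence of calls (the loopback address and port 8080 are arbitrary examples):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <sys/socket.h>

	int main(void)
	{
	    int fd = socket(AF_INET, SOCK_STREAM, 0);          /* create the endpoint */
	    if (fd < 0) { perror("socket"); return 1; }

	    struct sockaddr_in addr;
	    memset(&addr, 0, sizeof(addr));
	    addr.sin_family = AF_INET;
	    addr.sin_port = htons(8080);
	    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

	    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
	        perror("connect"); return 1;
	    }

	    write(fd, "hello\n", 6);                            /* ordinary file I/O from here on */
	    char buf[128];
	    ssize_t n = read(fd, buf, sizeof(buf));
	    if (n > 0)
	        fwrite(buf, 1, (size_t)n, stdout);
	    close(fd);
	    return 0;
	}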

User processes interact with sockets either through socket-specific system calls (socket, listen, connect, etc.) or through file system calls (read, write, etc.). However, sockets are not implemented as a file system under the VFS layer of the operating system. The distinction between files and sockets occurs in the kernel data structure referenced by the socket's file descriptor. Every socket has its own socket structure associated with it. This structure identifies the set of operations that can be performed on it as well as its data send and receive queues. The structure also identifies the networking protocol that is associated with it (e.g., TCP) so that it can queue outbound data for processing by the appropriate protocol layer.

The generic networking interface provided by sockets interacts with protocol-specific implementations. Just like the file system and devices, network protocols are modular and can be installed, registered, and removed dynamically.

We love layered protocols because they nicely partition the different things that need to happen in a networking stack. What we don't love is moving data around. Allocating, copying, and freeing data as it moves between layers of the protocol stack will severely impact performance. Instead, a packet is placed in a socket buffer (sk_buff) as soon as it is created: either at the system call (for sending data) or at the device driver (for received data). Instead of copying data, a pointer to the socket buffer moves between the queues of each layer of the networking stack. When an application needs to transmit data, the socket layer allocates a socket buffer with enough extra space to hold the process' data along with all the headers that will be needed as the packet goes down the protocol stack (e.g., TCP, IP, and ethernet headers). The data is copied from the user's process into the socket buffer in kernel space. The next time it is copied is when all protocol processing is complete and the data, with all of its headers, moves from the socket buffer out to the device.

An abstract device interface sits below the networking layers and right above the actual device drivers. This interface contains generic functions for initializing devices, sending data and allocating socket buffers for received data. The network device driver (e.g., an Ethernet or 802.11b/g/n/ac driver) interacts with the physical network and is responsible for actually transmitting data out to the network and grabbing received data from the hardware. As with character and block device drivers, network device drivers are modules that can be inserted, registered, and deleted dynamically.

With fast networks, the rate at which packets come in can generate thousands to hundreds of thousands of interrupts per second. To avoid this, Linux NAPI ("New" API) disables network device interrupts when a packet is received and reverts to periodic polling. If, at some time, a poll yields no data then interrupts are re-enabled and polling is stopped. This avoids unnecessary polling when there is no incoming traffic. Once a network interrupt is serviced, interrupts are again disabled and we repeat the same process.

Remote Procedure Calls

One problem with the interface offered by sockets is that it provides a send-receive model of interaction. However, most programs use a functional (procedure call) model of interaction. Remote procedure calls are a programming language construct (something provided by the compiler), as opposed to an operating system construct, such as sockets. They provide the illusion of calling a procedure on a remote machine. During this time, execution of the local thread stops until the results are returned. The programmer is relieved of packaging data, sending and receiving messages, and parsing results.

The illusion of a remote procedure call is accomplished by generating stub functions. On the client side, the client stub is a function with the same interface as the desired remote procedure. Its function, however, is to take the parameters, marshal them into a network message, send them to the server (using sockets), await a reply, and then unmarshal the results and return them to the caller. On the server side, the server stub (sometimes known as a skeleton) is responsible for being the main program that registers the service and awaits incoming requests for running the remote procedure. It unmarshals the data in the request, calls the user's procedure, and marshals the results into a network message that is sent back to the recipient.
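
A conceptual sketch of a client stub for a remote add(a, b) call; the transport helpers rpc_send and rpc_receive and the wire format are hypothetical:

	#include <stddef.h>
	#include <stdint.h>
	#include <arpa/inet.h>   /* htonl/ntohl for a common byte order on the wire */

	int rpc_send(const void *msg, size_t len);   /* hypothetical transport over a socket */
	int rpc_receive(void *msg, size_t len);

	int32_t add(int32_t a, int32_t b)            /* to the caller this looks like a local call */
	{
	    uint32_t request[2] = { htonl((uint32_t)a), htonl((uint32_t)b) };   /* marshal */
	    rpc_send(request, sizeof(request));      /* send the request and block */
	    uint32_t reply;
	    rpc_receive(&reply, sizeof(reply));      /* wait for the result */
	    return (int32_t)ntohl(reply);            /* unmarshal and return to the caller */
	}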

Network file systems

There are a couple of models for implementing distributed file systems: the download/upload model or the remote access model. In the download/upload model, an entire file is downloaded to the local machine when that file is opened and, if modified, uploaded when closed. In the remote access model, individual requests are forwarded to the remote server as remote procedures.

In a stateful file system, the server maintains varying amounts of state about client access to files (e.g., whether a file is open, whether a file has been downloaded, cached blocks, modes of access). In a stateless file system, the server maintains no state about a client's access to files. The design of a file system will influence the access semantics to files. Sequential semantics are what we commonly expect to see in file systems, where reads return the results of previous writes. Session semantics occur when an application owns the file for the entire access session, writing the contents of the file only upon close, thereby making the updates visible to others after the file is closed, and overwriting any modifications made by others prior to that.

Network file systems: examples

NFS

NFS was originally designed as a stateless, RPC-based model implementing commands such as read bytes, write bytes, link files, create a directory, and remove a file. Since the server does not maintain any state, there is no need for remote open or close procedures: these are used only on the client to keep track of the state locally. NFS works well in faulty environments: there's no state to restore if a client or server crashes. To improve performance, regardless of how little data an application requests, a client requests remote data a large chunk at a time (8 KB by default, but negotiated based on network speeds) and performs read-ahead (fetching future blocks before they are needed). It also caches data for some period of time. NFS suffers from ambiguous semantics because the server, as well as other clients, has no idea what blocks the client has cached and the client does not know whether its cached blocks are still valid. The system checks modification times if there are file operations to the server but otherwise invalidates the blocks after a few seconds. File locking could not be supported because of NFS's stateless design but was added through a separate lock manager that maintained the state of locks.

AFS

AFS was designed as an improvement over NFS to support file sharing on a massive scale. NFS suffered because clients would never cache data for a long time (not knowing if it would become obsolete) and had to frequently contact the server. AFS introduced the use of a partition on a client's disk to cache large amounts of remote files for a long time: a model of whole file caching and long-term caching. It supports a file download-upload model. The entire file is downloaded on first access (whole file download) and uploaded back to the server after a close only if it was modified. Because of this behavior, AFS provides session semantics: the last one to close a modified file wins and other changes (earlier closes) are lost.

During file access, the client need never bother the server: it already has the file. When a client first downloads a file, the server makes a callback promise: it maintains a list of each client that has downloaded a copy of a certain file. Whenever it gets an update for that file, the server goes through the list and sends a callback to each client that may have a cached copy so that it can be invalidated on the client. The next time the client opens that file, it will download it from the server. Files under AFS are shared in units called volumes. A volume is just a directory (with its subdirectories and files) on a file server that is assigned a unique ID among the cell of machines. If an administrator decides to move the volume to another server, the old server can issue a referral to the new server. This allows the client to remain unaware of resource movement.

SMB/CIFS

Microsoft's Server Message Block protocol was designed as a proprietary connection-oriented, stateful file system with the goal of providing strong consistency, full support for locking as well as other features provided by the native file system rather than client caching and performance. While it does not use remote procedure calls, its access principle is the same: it uses a remote access model as opposed to whole file downloads. Requests (message blocks) are functional messages, providing file access commands such as open, create, rename, read, write, and close.

Protection & Security

Protection is the mechanism that provides and enforces controlled access of resources to users and processes. Security is the set of policies that define authorized access to a system. The principle of least privilege is an approach to security that specifies that any entity (users, processes, functions) should be able to access only the resources that are essential to completing its task and no more. Privilege separation is a design where we segment a process into multiple parts, each granted only the privileges that it needs to operate. This allows us to partition a component that needs administrative access or needs to access a critical object away from other components and hence minimize any damage if the other components get compromised.

In the most general sense, an operating system is responsible for providing processes and threads with controlled and protected access to resources. The process scheduler is responsible for providing threads with access to the CPU; the memory manager and the MMU are responsible for giving processes protected access to their own memory address space; device drivers and the buffer cache provide access to peripheral devices; sockets provide access to the network; and the file system provides access to logical data on disks.

An operating system is responsible for ensuring that threads can access objects (devices, files, networks, etc.) only in accordance with security policies. Every thread operates in a protection domain, which defines what resources it is allowed to access. We can model the full system-wide set of access rights as an access matrix. Rows represent protection domains (e.g., users or group IDs under which a thread runs) and columns represent objects. The row, column intersection defines the access rights of a specific domain on a specific object. An access matrix is not an efficient structure to implement, so operating systems typically model it with access control lists. An access control list (ACL) is a column of an access matrix: a list of permissions for each domain that are assigned to a specific object. For example, a file may specify different read, write, or execute rights for different users. In Linux and most current operating systems, a domain is associated with a user.

A challenge with access control lists is that they may potentially be long and hence will not fit into a fixed-length inode structure. To address this, UNIX-derived systems provide a severely limited form of ACLs by having the inode contain read, write, and execute permissions for only the file owner's ID, the file's group ID, and everyone else. If a full ACL is used, it is stored as an extended attribute and uses additional disk block(s) apart from the inode.
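
A sketch of the classic owner/group/other check against an inode's mode bits (simplified: real systems also consider supplementary groups and the superuser):

	#include <stdbool.h>
	#include <sys/types.h>
	#include <sys/stat.h>    /* S_IRUSR, S_IRGRP, S_IROTH */

	/* May a thread running with credentials (uid, gid) read a file owned by
	   (file_uid, file_gid) with the given mode bits? */
	bool may_read(mode_t mode, uid_t file_uid, gid_t file_gid, uid_t uid, gid_t gid)
	{
	    if (uid == file_uid)
	        return (mode & S_IRUSR) != 0;   /* owner permissions apply */
	    if (gid == file_gid)
	        return (mode & S_IRGRP) != 0;   /* group permissions apply */
	    return (mode & S_IROTH) != 0;       /* everyone else */
	}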

Virtually all modern operating systems implement some form of access control lists. An alternate approach is a capability list, which represents a row of an access matrix. Each domain has a capability, which is an enumeration of all the objects in the system and the operations that the domain (user or group) can perform on them. Capabilities are typically implemented with items such as network services, where an authorization server may present a process with a capability token that it can pass on to an object and thereby prove that it is allowed to access the object.

The access control mechanisms available in most systems (and the ones we use most) follow a model called discretionary access control (DAC). This allows the thread to access objects to which it has permissions and also pass that information to other processes or write it to other objects. Access to objects is restricted based on the identity of users or groups. A mandatory access control (MAC) model is one where a centrally-controlled policy restricts the actions of domains on objects. Users cannot modify the policy. MAC models typically implement a Multi-Level Secure (MLS) access model, of which the Bell-LaPadula model is the most widely used example. In this model, objects are classified in a security hierarchy: unclassified, confidential, secret, and top-secret. Users (domains) are also assigned a security clearance. The overall policy is no read up; no write down. This means that a user cannot read objects that belong to a higher clearance level than the user is assigned. The user also cannot write or modify objects that belong to a lower clearance level.

This model, however, has limited use outside of governments as those policies do not apply well to civilian life. Another form of MAC is the use of an application sandbox (see later discussion). This allows an administrator or domain owner to specify restrictions on a per-process basis regardless of the user ID (protection domain) under which the process is executing. The process cannot override these restrictions.

Cryptography

Cryptography deals with encrypting plaintext using a cipher (also known as an encryption algorithm) to create ciphertext, which is unintelligible to anyone unless they can decrypt the message.

A restricted cipher is one where the workings of the cipher must be kept secret. There is no reliance on a key and the secrecy of the cipher is crucial to the value of the algorithm. This has obvious flaws: people in the know leaking the secret, coming up with a poor algorithm, and reverse engineering. For any use of serious encryption, we use well-tested, non-secret algorithms that rely on secret keys.

A symmetric encryption algorithm uses the same secret key for encryption and decryption. A public key algorithm uses one key for encryption and another for decryption. One of these keys is kept private (known only to the creator) and is known as the private key. The corresponding key is generally made visible to others and is known as the public key. Anything encrypted with the private key can only be decrypted with the public key. This is the basis for digital signatures. Anything that is encrypted with a public key can be decrypted only with the corresponding private key. This is the basis for authentication and covert communication.

A one-way function is one that can be computed relatively easily in one direction but there is no known way of computing the inverse function. One-way functions are crucial in a number of cryptographic algorithms, including digital signatures, Diffie-Hellman key exchange, and RSA public key cryptography. For Diffie-Hellman and RSA keys, they ensure that someone cannot generate the corresponding private key when presented with a public key. A particularly useful form of a one-way function is the hash function. This is a one-way function whose output is always a fixed number of bits for any input. For good cryptographic hash functions, it is highly unlikely that two messages will ever hash to the same value, it is extremely difficult to construct text that hashes to a specific value, and it is extremely difficult to modify the plaintext without changing its hash. This is called collision resistance. The hash function is the basis for message authentication codes and digital signatures. Note that when we talk about cryptography and use phrases such as "extremely difficult", we mean "impossible for all practical purposes," not that "you can do it if you spend an entire day or two working on the problem."

Secure communication

To communicate securely using a symmetric cipher, both parties need to have a shared secret key. Alice will encrypt a message to Bob using the key and Bob will use the key to decrypt the message. If Alice wants to communicate with Charles, she and Charles will also need a secret key. The fact that every pair of entities will need a secret key leads to a phenomenon known as key explosion. Overall, in a system with n users, there will be O(n²) keys.

The biggest problem with symmetric cryptography is dealing with key distribution: how can Alice and Bob establish a key so they can communicate securely? The Diffie-Hellman exponential key exchange algorithm allows one to do this. Each party will generate a private key and a public key (these are not encryption keys; they are just numbers. Diffie-Hellman does not implement public key cryptography, and it is unfortunate that the term "keys" was used to describe these numbers). It uses the one-way function a^b mod c in a way that allows Alice to compute a common key using her private key and Bob's public key. Bob can compute the same common key by using his private key and Alice's public key.
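
A toy example with tiny numbers (a real exchange uses enormous primes); both sides compute the same shared value from their own private number and the other side's public number:

	#include <stdio.h>

	/* compute (base^exp) mod m using small integers */
	static unsigned long powmod(unsigned long base, unsigned long exp, unsigned long m)
	{
	    unsigned long result = 1;
	    base %= m;
	    while (exp) {
	        if (exp & 1)
	            result = (result * base) % m;
	        base = (base * base) % m;
	        exp >>= 1;
	    }
	    return result;
	}

	int main(void)
	{
	    unsigned long g = 5, p = 23;         /* public parameters */
	    unsigned long a = 6, b = 15;         /* Alice's and Bob's private numbers */
	    unsigned long A = powmod(g, a, p);   /* Alice's public number */
	    unsigned long B = powmod(g, b, p);   /* Bob's public number */
	    /* both computations print the same shared key */
	    printf("Alice computes %lu, Bob computes %lu\n", powmod(B, a, p), powmod(A, b, p));
	    return 0;
	}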

Using true public key cryptography, such as the RSA algorithm, if Alice encrypts a message with Bob's public key, Bob will be the only one who can decrypt it since doing so will require Bob's private key. Likewise, Bob can encrypt messages with Alice's public key, knowing that only Alice will be able to decrypt them with her private key.

Session keys

A session key is a random key that is generated for encryption during a single communication session. It is useful because if the key is ever compromised, no lasting information is obtained: future communication sessions will use different keys. A hybrid cryptosystem uses public key cryptography to send a session key securely. The originator generates a random session key and encrypts it with the recipient's public key. The recipient decrypts the message with the corresponding private key to extract the session key. After that, symmetric cryptography is used for communication, with messages encrypted with the session key. This has the advantages of higher performance (public key cryptography is much, much slower than symmetric cryptography) and ease of communicating with multiple parties (just encrypt the session key with the public keys of each of the recipients). It also allows the bulk of data to be encrypted with session keys instead of the hardly-ever-changing public keys. Generating a public-private key pair is also much, much slower than generating a symmetric key, which is just a random number.

Digital signatures

With public key cryptography, a digital signature is simply the act of encrypting a hash of a message with the creator's private key. Anyone who has the public key can decrypt the hash and thus validate it against the message. Other parties cannot recreate the signature since they do not have the private key even though they can create the hash.

Authentication

The three factors of authentication are: something you have (such as a key or a card), something you know (such as a password or PIN), and something you are (biometrics). Combining these into a multi-factor authentication scheme can increase security against the chance that any one of the factors is compromised. For example, the use of a password combined with an access card is two-factor authentication. Using two passwords, however, is still single-factor authentication because the same factor is used twice.

Password Authentication Protocol (PAP)

The classic authentication method is the use of reusable passwords. This is known as the password authentication protocol, or PAP. With PAP, a password is associated with each user and stored in a file. The system asks you to identify yourself (login name) and then enter a password. If the password matches that which is associated with the login name on the system then you are authenticated.

One problem with the protocol is that if someone gets hold of the password file on the system, then they have all the passwords. The common way to thwart this is to store hashes of passwords instead of the passwords themselves. This takes advantage of one-way functions. To authenticate a user, check if


	hash(password) == stored_hashed_password(user)

Even if someone gets hold of the password file, they won't be able to reconstruct the original password from the hash. They'll have to resort to an exhaustive search or a dictionary attack and try different passwords, searching for one that hashes to the value in the file.

Hashes are vulnerable to a dictionary attack with pre-computed hashes. An attacker can, ahead of time, create a table of a few million passwords and their corresponding hashes. Then, when confronted with a hash value, one simply searches the table for a matching hash to find the password that created it. A similar vulnerability exists if Alice notices that Bob's password hash is the same as hers. She now knows that she and Bob share the same password.

Salt is extra data that is added to the password prior to hashing it. What salt does is change the input text and hence the resultant hash. When you created your password and stored its hash, the system generated a random salt to go along with it (and stored the value along with your identity and hashed password). Suppose your salt is "abc" and your password is "test123". We now hash "test123abc" (or some such combination) and get a value that is NOT hash("test123"). The attacker can see the salt value in the password file but there's no way to remove its effect from the hash value. Now, the attacker needs a much, much larger set of data to account for all possible salt values. This will generally not be feasible for decent-sized salt values. For example, Apache and BSD use a salt that's up to 8 characters. Even with just alphanumeric characters, each possible password can have over 200 trillion variations as an input to the hash function.
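
A sketch of verifying a salted password, where hash_hex() is a hypothetical stand-in for a real cryptographic hash function:

	#include <stdbool.h>
	#include <stdio.h>
	#include <string.h>

	void hash_hex(const char *input, char *out, size_t outlen);   /* hypothetical hash */

	bool check_password(const char *typed, const char *salt, const char *stored_hash)
	{
	    char salted[256], digest[128];
	    snprintf(salted, sizeof(salted), "%s%s", typed, salt);    /* combine password and salt */
	    hash_hex(salted, digest, sizeof(digest));                 /* hash the combination */
	    return strcmp(digest, stored_hash) == 0;                  /* compare against the stored hash */
	}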

The other problem with reusable passwords is that if a network is insecure, an eavesdropper may sniff the password from the network. A potential intruder may also simply observe the user typing a password. To thwart this, we turn to one-time passwords. If someone sees you type a password (or gets it from the network stream), it won't matter because that password will be useless for future logins.

Challenge Handshake Authentication

One way to avoid the hazards of reusable passwords is never to send data over the network that an attacker can capture and replay to authenticate onto the system in the future. The Challenge Handshake Authentication Protocol, CHAP, does this by generating and sending a "challenge", which is a random bunch of bits (more generally called a nonce). Both the client (e.g., user) and server share a secret key. The client generates a hash on data that includes the challenge and the secret key and sends the hash back to the server. The server can compute the same hash since it also has the challenge and knows the shared secret. If the computed hash matches the one it received, then authentication was successful. Anyone sniffing the network can get a challenge but will never see the shared secret and hence will not be able to generate a valid response.

	challenge: nonce
	response: hash(key, nonce)
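
A sketch of the client's side of this exchange, where hmac() is a hypothetical stand-in for a keyed cryptographic hash (a real implementation might use HMAC-SHA-256):

	#include <stddef.h>
	#include <stdint.h>

	void hmac(const uint8_t *key, size_t keylen,
	          const uint8_t *msg, size_t msglen, uint8_t out[32]);   /* hypothetical */

	void chap_respond(const uint8_t *secret, size_t secretlen,
	                  const uint8_t *nonce, size_t noncelen, uint8_t response[32])
	{
	    /* The shared secret never crosses the network; only the nonce and this hash do. */
	    hmac(secret, secretlen, nonce, noncelen, response);
	}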

SecurID®

RSA's SecurID is a two-factor authentication system that generates one-time passwords for response to a user login prompt. It relies on a user password (a PIN: Personal ID Number) and a small computing device called a token (an authenticator card, fob, or software). The token generates a new number every 30 seconds. The number is a function of the time of day and a seed: a number that is unique for each card. To authenticate to a server, you send a combination of your PIN and the number from the token in lieu of a password. A legitimate remote system will have your PIN as well as the token's seed and will be able to compute the same value to validate your password. An intruder would not know both the PIN and the token's seed and will never see the data on the network.

	password: hash(PIN, time, seed)

Public key authentication

A nonce is a random bunch of bits that is generated on the fly and is usually presented to the other party to prove that they are capable of encrypting something with a specific key that they possess. The use of a nonce is central to public key authentication (e.g., the kind used in SSL, the secure sockets layer). If I send you a nonce and you encrypt it with your private key and give me the result, I can decrypt that message using your public key. If the decryption matches the original nonce, this will convince me that only you could have encrypted the message since you are the only one who has access to your private key.

	challenge: nonce
	response: Ek(nonce)

Digital certificates

While public keys simplify authentication (just decrypt this with my public key and you know that I was the only one who could have encrypted it), the identity binding of the public key must be preserved. That is, you need to be convinced that my public key really is mine and not an impersonator's. X.509 digital certificates provide a solution for this. A certificate is a data structure that contains user information and the user's public key. This data structure also contains an identification of the certification authority (CA) and its signature. The signature is a hash of the certificate data that is encrypted using the certification authority's private key. The certification authority (CA) is responsible for setting policies to validate the identity of the person who presents the public key for encapsulation in a certificate. You can validate a user's certificate if you have the CA's public key. To validate, you generate a hash of the certificate (minus its signature). You then decrypt the signature on the certificate using the CA's public key. If the decrypted signature and your hash match, then you are convinced that no data in the certificate has been altered. The CA's public key is present in another certificate: the CA's certificate. CA certificates for several hundred well-known CAs are pre-installed in places such as iOS's Trust Store, OS X's keychain, and Microsoft's Trusted Root Certification Authorities store.
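
A conceptual sketch of the signature check, with hypothetical helpers (hash_of, decrypt_with_public_key) standing in for what a real certificate library does internally:

	#include <stdbool.h>
	#include <stddef.h>
	#include <string.h>

	struct cert {
	    const void *body;      size_t body_len;   /* all certificate fields except the signature */
	    const void *signature; size_t sig_len;    /* hash of the body, encrypted with the CA's private key */
	};

	size_t hash_of(const void *data, size_t len, unsigned char *out);                                   /* hypothetical */
	size_t decrypt_with_public_key(const void *pubkey, const void *in, size_t len, unsigned char *out); /* hypothetical */

	bool cert_is_valid(const struct cert *c, const void *ca_public_key)
	{
	    unsigned char h[64], d[64];
	    size_t hlen = hash_of(c->body, c->body_len, h);   /* hash the certificate data ourselves */
	    size_t dlen = decrypt_with_public_key(ca_public_key, c->signature, c->sig_len, d);
	    return hlen == dlen && memcmp(h, d, hlen) == 0;   /* match means the contents were not altered */
	}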

Security

Most security attacks on systems are not cryptographic: they do not find weaknesses in cryptographic algorithms and try to break ciphers. Most rely on bugs in software or the trust of individuals.

For protocols with no encryption that use a public (or sniffable) network, one can sniff the network traffic to extract logins and passwords (there's a lot of software that makes this easy; snort, for example, is one of the oldest and most popular). If one cannot find a password for a user, one can try guessing it. If one can't do that then one can try all combinations. An exhaustive search through every possible password may be time-prohibitive. A dictionary attack is one where you go through a dictionary of words and names and test them as potential passwords, applying common tricks such as prefixing and suffixing digits and substituting numbers that look like letters. Performing this attack becomes a lot easier if you are lucky enough to get the password hash that you are searching for. Then you can perform the search on your own machine without going through the login service on the target system.

A social engineering attack is one where you convince a person to give you the necessary credentials. You might do this by impersonating a system administrator and simply asking. A Trojan horse is a program that masquerades as a legitimate program and tricks you into believing that you are interacting with the trusted program. A common example is a program that masquerades as a login program and obtains an unsuspecting user's login and password. Phishing is an attack where email purports to come from a trusted party (such as your bank) and attempts to get the user to click on a link that will take them to what they believe is that party's web site. In reality, it is an intruder's site that is designed to look like the legitimate one. The goal here is also to collect data such as your login and password or perhaps your bank account, social security, or credit card numbers.

A buffer overflow bug is one where software expects to read a small amount of data into a fixed-size buffer but never checks to see whether the incoming data is bigger than the buffer. What ends up happening is that the software continues to append data to the buffer but is now writing into memory beyond that which is allocated to the buffer. If the buffer was declared as a local variable within a function, then its memory resides on the stack. At some point, the overflow data will clobber the return address for that function. A carefully crafted data stream can ensure that the return address gets modified with the address of other code in this same stream, which will get executed as soon as the function attempts to return. This technique is known as stack smashing.
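
A minimal illustration of the bug (the functions are hypothetical, not taken from any real program):

	#include <string.h>

	void vulnerable(const char *input)
	{
	    char buf[16];          /* lives on the stack, near the saved return address */
	    strcpy(buf, input);    /* no bounds check: input longer than 16 bytes overflows */
	}

	void safer(const char *input)
	{
	    char buf[16];
	    strncpy(buf, input, sizeof(buf) - 1);   /* bounded copy */
	    buf[sizeof(buf) - 1] = '\0';
	}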

The first approach to guard against stack smashing was to turn off execute permission in the memory management unit for memory pages that are allocated for the stack. In the past, this was not possible since neither AMD nor Intel architectures had MMUs that permitted this. With this in place, an attacker cannot inject executable code.

This defense was not sufficient, however. Since the return address can still be overwritten with a buffer overflow attack, an attacker can find some code within the program or any shared libraries that the program uses that performs a desired task and place that address as the return address on the stack. It is unlikely that there is a single segment of code that will do everything the attacker needs. A clever enhancement of this technique led to return oriented programming, or ROP. Here, the attacker finds a sequence of useful bits of code in various libraries. All of the code must be at the tail end of functions with a return instruction at the end. Once the function returns, it "returns" to an address that is read from the stack and is under the control of the attacker. The return address can then take the program to another snippet of code. Altogether, the attacker adds a sequence of return addresses on the stack to get each block of code to execute in the desired order.

Two additional defenses make this technique more difficult:

  • stack canaries: The compiler places a random value (called a canary) on the stack just before the region of data that is allocated to local variables, including buffers. A buffer overflow attack will overwrite this value before it overwrites the return address on the stack. Before returning from the function, the compiler adds code to check the value of the canary. If it is corrupt, the function will abort the program instead of returning.
  • address space layout randomization (ASLR): The loader randomly arranges the exact addresses where it places the stack, heap, and libraries. This makes it difficult for the attacker to jump to a known address in a library. However, this requires that all libraries are compiled to generate position-independent code.

A rootkit is code that hides the presence of users or additional software from the user. Sometimes, this is done by replacing commands that would present this data (e.g., ps, ls, who, ... on UNIX/Linux systems). In more sophisticated implementations, the rootkit modifies the operating system to intercept system calls.

The four A's of security are:

  1. Authentication: the process of binding an identity to the user. Note the distinction between authentication and identification. Identification is simply the process of asking you to identify yourself (for example, ask for a login name). Authentication is the process of proving that the identification is correct.
  2. Authorization: given an identity, making a decision on what access the user is permitted. Authorization is responsible for access control.
  3. Accounting: logging system activity so that any breaches can be identified (intrusion detection) or a post facto analysis can be performed.
  4. Auditing: inspecting the software and system configuration for security flaws.

Signed Software

The idea of signing software is to ensure that it has not been tampered with (for example, that a virus has not been attached to it). The most basic form of signing software is to do the same as we would for signing any other digital data: generate a hash of the entire package, encrypt it with the software publisher's private key, and attach it to the software as a signature. To validate the signature, you would compute the hash of the program and compare it with the decrypted signature. You would use the software publisher's public key to decrypt the signature. This key would be present in a digital certificate that is obtained from that publisher.

Computing a hash for software before we run it would involve scanning through the entire code base. Demand-paged virtual memory systems load pages of the program as needed, so this would greatly increase the startup time of the software. Instead, signed software will often contain a signature for every memory page. When a particular page is loaded, the operating system would check the signature for that page.

Sandboxes

A sandbox is an environment designed for running programs while restricting their access to certain resources, such as disk space, file access, amount of memory, and network connections. It allows users to set restriction policies so that they can run untrusted code with minimal risk of harming their system. It is a form of mandatory access control in that the process, even though it runs under the protection domain of the user, loses specific privileges and cannot regain them.

The simplest approach to sandboxing is the use of the chroot system call on Unix systems. This changes the root directory of the file system for a process from the real root to one that is specified by the parameter to chroot. For example, the call

	chroot("/var/spool/postfix");

will set the root directory of the process to /var/spool/postfix. No files outside of this directory will be visible. This form of sandbox is known as a chroot jail. Even if an attacker compromises the application, she will not be able to access any files or run any programs outside the jail. One weakness is that if the attacker gets administrative privileges, she can use the mknod system call to create a device file for the disk that holds the root file system and re-mount the file system. This approach also does not restrict network operations.
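
A slightly fuller sketch of setting up a chroot jail appears below; the directory and the numeric user and group IDs are made-up examples. Note that the process should also chdir into the jail and then give up its root privileges, or the escape described above remains possible.

    /* Sketch: confine a process to a chroot jail and drop root privileges. */
    #include <unistd.h>
    #include <stdlib.h>

    void enter_jail(void)
    {
        if (chroot("/var/spool/postfix") != 0 || chdir("/") != 0)
            exit(1);                      /* refuse to run outside the jail */

        /* Drop root so the process cannot use mknod or mount to escape. */
        if (setgid(1001) != 0 || setuid(1001) != 0)
            exit(1);
    }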

A rule-based sandbox provides the ability to set finer-grained policies on what an application is allowed to do. For example, an application may be disallowed from reading the file /etc/passwd, disallowed from writing any files, or not allowed to establish a TCP/IP connection. Sandboxing is currently supported on a wide variety of platforms at either the kernel or application level. One example is the Apple Sandbox on OS X. It allows a detailed set of policies to be enumerated that govern networking as well as file system reads and writes. These policies can include patterns that restrict file operations to specific directories or to files matching certain names. The policies are parsed and loaded into the kernel. The kernel runs the TrustedBSD subsystem, which was originally designed to enforce mandatory access control policies but in this case passes requests to a kernel extension that goes through the list of rules to determine whether a specific instance of a system call should be allowed or not.

The best-known example of a sandbox is the Java Virtual Machine, initially designed for running Java applets: compiled Java programs that would be downloaded and run dynamically when a web page was fetched. The Java sandbox has three parts to it:

  1. The byte-code verifier tries to ensure that the code looks like valid Java byte code with no attempts to circumvent access restrictions, convert data illegally, or forge pointers.
  2. The class loader enforces restrictions on whether a program is allowed to load additional applets and that an applet is not allowed to access classes belonging to other applets.
  3. The security manager is invoked to provide run-time verification of whether a program has rights to invoke certain methods, such as file I/O or network access.

Virtualization

As a general concept, virtualization is the addition of a layer of abstraction to physical devices. With virtual memory, a process has the impression that it owns the entire memory address space. Different processes can all access the same virtual memory location and the memory management unit (MMU) on the processor maps each access to the unique physical memory locations that are assigned to the process.

With storage virtualization, a computer gets a logical view of disks connected to a machine. In reality, those disks may be networked to a computer via a fibre channel switch or ethernet interface and may be parts of physical disks or collections of disks that appear to the computer as one disk.

CPU virtualization allows programs to execute on a machine that does not really exist. The instructions are interpreted by a program that simulates the architecture of the pseudo-machine. Early pseudo-machines included O-code for BCPL and P-code for Pascal. The most popular pseudo-machine today is the Java virtual machine.

Machine virtualization allows a physical computer to act like several real machines with each machine running its own operating system (on a virtual machine) and applications that interact with that operating system. The key to machine virtualization is to not allow each guest operating system to have direct access to certain privileged instructions in the processor. These instructions would allow an operating system to directly access I/O ports, MMU settings, the task register, the halt instruction and other parts of the processor that could interfere with the processor's behavior and with other operating systems. Instead, such instructions as well as system interrupts are intercepted by the Virtual Machine Monitor (VMM), also known as a hypervisor. The hypervisor arbitrates access to physical resources and presents a set of virtual device interfaces to each guest operating system (including the memory management unit, I/O ports, disks, and network interfaces).

Two configurations of virtual machines are hosted virtual machines and native virtual machines. With a hosted virtual machine, the computer has a primary operating system installed that has access to the raw machine (all devices, memory, and the file system). One or more guest operating systems can then be run on virtual machines. The VMM serves as a proxy, converting requests from the virtual machine into operations that are executed on the host operating system. A native virtual machine is one where there is no "primary" operating system that owns the system hardware; the hypervisor is in charge of access to the devices and provides each guest operating system with drivers for an abstract view of those devices.

Recent processors from Intel and AMD provide hardware support for virtualization (Intel VT-x and AMD-V), giving the hypervisor the ability to intercept privileged instructions executed by a guest. Prior to that, one of two approaches was used to implement machine virtualization. Binary translation pre-scans the instruction stream of any code that has to run in privileged mode and replaces all privileged instructions with traps to the VMM. Paravirtualization requires modifying the operating system to replace privileged instructions with calls to a VMM API; this, of course, requires access to the source code of the operating system.