Confinement

Isolating programs

Paul Krzyzanowski

March 12, 2025

Two lessons we learned from experience are that applications can be compromised and may not always be trustworthy.

The first risk is compromise. Server applications, in particular, such as web servers, databases, and mail servers have been compromised time and again. This is particularly harmful as they often run with elevated privileges and on systems on which normal users do not have accounts. This provides a way for an attacker to get access to a system.

The second risk is trust. We may not always trust an application. We cannot necessarily trust that the game we downloaded from some unknown developer will not try to upload our files, destroy our data, or try to change our system configuration. Unless we have the ability to scrutinize the codebase of a service, we will not know for sure if it tries to modify any system settings or writes files to unexpected places.

With this realization that we might not be immune to attacks, security in modern computing depends on confinement – creating isolation mechanisms that can protect sensitive processes and data from unauthorized access or modification. Traditionally, two widely used approaches for isolation have been containerization and full virtualization. While both provide security benefits, they differ in their design and security guarantees.

Access control isn’t enough

Our initial approach to achieving confinement may involve properly using access controls. For example, we can run server applications as low-privilege users and ensure we have set proper read/write/execute permissions on files, read/write/search permissions on directories, or even set up role-based policies.

However, access controls usually do not allow us to set permissions for “don’t allow access to anything else.” For example, we may want our web server to have access to all files in /home/httpd but nothing outside of that directory. Access controls do not let us express that rule. Instead, we are responsible for changing the protections of every file on the system and making sure it cannot be accessed by “other”. We also have to hope that no users change those permissions. In essence, we must disallow the ability for anyone to make files publicly accessible because we never want our web server to access them. We may be able to use mandatory access control mechanisms if they are available but, depending on the system, even these may not let us restrict access properly. More likely, the rules will be complex enough that we misunderstand them (a comprehension error) or make a configuration mistake, leaving parts of the system vulnerable. To summarize, even if we can get access controls to help, we will not have high assurance that they do.

Access controls also only focus on protecting access to files and devices. A system has other resources, such as CPU time, memory, disk space, and network. We may want to control how much of these an application is allowed to use. POSIX systems provide a setrlimit system call that allows one to set limits on certain resources for the current process and its children. These controls include the ability to set file size limits, CPU time limits, various memory size limits, and maximum number of open files.
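As a rough illustration of setrlimit, the sketch below lowers the soft limit on open file descriptors for the calling process. The function name limit_open_files and the value 64 in the usage note are illustrative assumptions, not part of any standard interface.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Lower the soft limit on open file descriptors for this process.
 * The hard limit is left alone: the requested value is clamped to it.
 * Returns 0 on success, -1 on error. */
int limit_open_files(rlim_t max_files)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == -1) {
        perror("getrlimit");
        return -1;
    }
    if (max_files > rl.rlim_max)    /* cannot exceed the hard limit */
        max_files = rl.rlim_max;
    rl.rlim_cur = max_files;
    if (setrlimit(RLIMIT_NOFILE, &rl) == -1) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

A server might call limit_open_files(64) early in its startup, before it begins handling requests. Child processes created afterward inherit the limit.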

We also may want to control the network identity for an application. All applications share the same IP address on a system, but this may allow a compromised application to exploit address-based access controls. For example, you may be able to connect to or even log into a system that believes you are a trusted computer. An exploited application may end up confusing network intrusion detection systems.

Just limiting access through resource limits and file permissions is also insufficient for services that run as root. If an attacker can compromise an app and get root access to execute arbitrary functions, she can change resource limits (just call setrlimit with different values), change any file permissions, and even change the IP address and domain name of the system.

In order to truly confine an application, we would like to create a set of mechanisms that enforce access controls to all of a system’s resources, are easy to use so that we have high assurance in knowing that the proper restrictions are in place, and work with a large class of applications. We can’t quite get all of this yet, but we can come close.

Early efforts at containment: chroot and BSD Jails

chroot

The oldest app confinement mechanism is Unix’s chroot system call and command, originally introduced in 1979 in the seventh edition¹. The chroot system call changes the root directory of the calling process to the directory specified as a parameter.

chroot("/home/httpd/html");

This call sets the root of the file system to /home/httpd/html for the process and any processes it creates. The process cannot see any files outside that subset of the directory tree. This isolation is often called a chroot jail.
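A minimal sketch of how a program might place itself in such a jail, assuming it starts with root privileges. The helper name enter_jail is hypothetical. The chdir call matters because chroot changes the root directory but leaves the current working directory where it was.

```c
#define _DEFAULT_SOURCE 1   /* exposes chroot() in glibc's <unistd.h> */
#include <stdio.h>
#include <unistd.h>

/* Confine the calling process to the directory tree under `jail`.
 * Requires root privileges. Returns 0 on success, -1 on error. */
int enter_jail(const char *jail)
{
    if (chroot(jail) == -1) {
        perror("chroot");
        return -1;
    }
    /* Move the working directory inside the new root; otherwise relative
     * paths could still refer to locations outside the jail. */
    if (chdir("/") == -1) {
        perror("chdir");
        return -1;
    }
    return 0;
}
```

After enter_jail("/home/httpd/html") succeeds, a path such as /index.html refers to /home/httpd/html/index.html on the real file system.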

Jailkits

If you run chroot, you will likely get an error along the lines of:

# chroot newroot
chroot: failed to run command ‘/bin/bash’: No such file or directory

This is because /bin/bash is not within the root (in this case, the newroot directory). You’ll then create a bin subdirectory and try running chroot again and get the same error:

# mkdir newroot/bin
# ln /bin/bash newroot/bin/bash
# chroot newroot
chroot: failed to run command ‘/bin/bash’: No such file or directory

You’ll find that this is also insufficient: you’ll need to bring in the shared libraries that /bin/bash needs by mounting /lib, /lib64, and /usr/lib within that root just to enable the shell to run. Otherwise, it cannot load the libraries it needs since it cannot see above its root (i.e., outside its jail). A jailkit simplifies the process of setting up a chroot jail by providing a set of utilities that create the desired environment within the jail and populate it with basic accounts, commands, and directories.

Problems with chroot

Chroot only limits access to the file system namespace. It does not restrict access to resources and does not protect the machine’s network identity. Applications that are compromised to give the attacker root access make the entire system vulnerable since the attacker has access to all system calls.

Chroot is available only to administrators. If this were not the case, then any user would be able to get root access within the chroot jail. You would:

Create a chroot jail
Populate it with the shell program and necessary support libraries
Create a link inside your jail to the /bin/su command (set user, which allows you to authenticate to become any user)
Create password files within the jail with a known password for root. On Linux systems, you would typically create an etc/passwd file that contains information about the user account (name, user ID, home directory, startup shell) and an etc/shadow file that contains the actual passwords.
Use the chroot command to enter the jail.
Run su root to become the root user. The command will prompt you for a password and validate it against the password file. Since all processes run within the jail, the password file is the one you set up in your jail, so you know the password.

You’re still in the jail but you have root access. Now you will need to escape from the jail.

Escaping from chroot

If someone manages to compromise an application running inside a chroot jail and become root, they are still in the jail but have access to all system calls, including privileged ones. For example, they can send signals to kill all other processes or shut down the system. This would be an attack on availability.

Attaining root access also provides a few ways of escaping the jail. On POSIX systems, all non-networked devices are accessible as files within the filesystem. Even memory is accessible via a file (/dev/mem). An intruder in a jail can create a memory device (on Linux, it is a character device with major number = 1, minor number = 1):

mknod mem c 1 1

With the memory device, the attacker can patch system memory to change the root directory of the jail. More simply, an attacker can create a block device with the same device numbers as that of the main file system. For example, the root file system on one of my Linux systems is /dev/sda1 with a major number of 8 and a minor number of 1. An attacker can recreate that in the jail:

mknod rootdisk b 8 1

and then mount it as a file system within the jail:

mount -t ext4 rootdisk myroot

Now the attacker, still in the jail, has full access to the entire file system, which is as good as being out of the jail. He can add user accounts, change passwords, delete log files, run any commands, and even reboot the system to get a clean login.

FreeBSD Jails

Chroot was good at confining the filesystem namespace of an application, but it provided no security if the application had root access, and it did nothing to restrict access to other resources.

FreeBSD Jails are an enhancement to the idea of chroot. Jails provide a restricted filesystem namespace, just like chroot does, but also place restrictions on what processes are allowed to do within the jail, including selectively removing privileges from the root user in the jail. For example, processes within a jail may be configured to:

Bind only to sockets with a specified IP address and specific ports
Communicate only with other processes within the jail and none outside
Not be able to load kernel modules, even if root
Have restricted access to system calls that include:
- Ability to create raw network sockets
- Ability to create devices
- Modify the network configuration
- Mount or unmount filesystems

FreeBSD Jails are a huge improvement over chroot since known escapes, such as creating devices and mounting filesystems and even rebooting the system are disallowed. Depending on the application, policies may be coarse. The changed root provides all-or-nothing access to a part of the file system. This does not make Jails suitable for applications such as a web browser, which may be untrusted but may need access to files outside of the jail. Think about web-based applications such as email, where a user may want to upload or download attachments. Jails also do not prevent malicious apps from accessing the network and trying to attack other machines … or from trying to crash the host operating system. Moreover, FreeBSD Jails is a BSD-only solution. With an estimated 0.95…1.7% share of server deployments, it is a great solution on an operating system that is not that widely used.

Linux containment mechanisms: namespaces, control groups, and capabilities

Linux’s answer to FreeBSD Jails was a combination of three elements: control groups, namespaces, and capabilities.

Control groups (cgroups)

Linux control groups, also called cgroups, allow you to allocate resources such as CPU time, system memory, disk bandwidth, and network bandwidth among user-defined groups of processes and to monitor their resource usage. This allows, for example, an administrator to allocate a larger share of the processor to a critical server application.

An administrator creates one or more cgroups and assigns resource limits to each of them. Then any application can be assigned to a control group and will not be able to use more than the resource limits configured in that control group. Applications are unaware of these limits. Control groups are organized in a hierarchy similar to processes. Child cgroups inherit some attributes from the parents.
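On systems that use the cgroup v2 interface, control groups appear as a directory tree (commonly mounted at /sys/fs/cgroup), and limits are set by writing values to files in a group's directory. The sketch below assumes that interface and assumes the memory controller is enabled for the parent group. The helper names cgroup_write and confine_memory, and the group path in the usage note, are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write `value` to the control file `file` inside `cgroup_dir`.
 * Returns 0 on success, -1 on error. */
int cgroup_write(const char *cgroup_dir, const char *file, const char *value)
{
    char path[4096];
    FILE *fp;
    int n = snprintf(path, sizeof path, "%s/%s", cgroup_dir, file);

    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    if (fputs(value, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}

/* Create the group directory if needed, cap its memory use at `max_bytes`,
 * and move process `pid` into the group. Returns 0 on success, -1 on error. */
int confine_memory(const char *cgroup_dir, long max_bytes, long pid)
{
    char buf[32];

    if (mkdir(cgroup_dir, 0755) == -1 && errno != EEXIST)
        return -1;
    snprintf(buf, sizeof buf, "%ld", max_bytes);
    if (cgroup_write(cgroup_dir, "memory.max", buf) == -1)
        return -1;
    snprintf(buf, sizeof buf, "%ld", pid);
    return cgroup_write(cgroup_dir, "cgroup.procs", buf);
}
```

Run with sufficient privilege, a call such as confine_memory("/sys/fs/cgroup/webserver", 256L * 1024 * 1024, (long)getpid()) would place the caller in a group limited to 256 MB. The confined process needs no changes and is not told about the limit.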

Linux namespaces

While chroot restricts the filesystem namespace, processes running under chroot can still see every other process running in the system and share the same network address, the same set of user IDs, and all the file systems mounted by the operating system. These processes can still see and communicate with any other processes in the system.

Chroot only restricted the filesystem namespace. The filesystem namespace is the best-known namespace in the system but not the only one. Linux namespaces provide control over how processes are isolated in the following namespaces:

Namespace	Description	Controls
IPC	System V IPC, POSIX message queues	Objects created in an IPC namespace are only visible to other processes in that namespace (CLONE_NEWIPC)
Network	Network devices, stacks, ports	Isolates IP protocol stacks, IP routing tables, firewalls, socket port numbers (CLONE_NEWNET)
Mount	Mount points	A set of processes can have their own distinct mount points and view of the file system (CLONE_NEWNS)
PID	Process IDs	Processes in different PID namespaces can have the same process IDs; a child cannot see processes in its parent’s or other namespaces (CLONE_NEWPID)
User	User & group IDs	Per-namespace user/group IDs. Also, you can be root in a namespace but have restricted privileges (CLONE_NEWUSER)
UTS	Host name and domain name	Setting the hostname and domain name will not affect the rest of the system (CLONE_NEWUTS)
Cgroup	Control group	Gives a process its own view of the control group hierarchy (CLONE_NEWCGROUP)

A process can dissociate any or all of these namespaces from its parent via the unshare system call. For example, after a process unshares the PID namespace, the child processes it creates are placed in a new PID namespace: they no longer see other processes on the system and will only see themselves and any processes they create.
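To make unshare concrete, here is a small sketch using the UTS namespace, chosen because its effect (a private hostname) is easy to observe. It combines CLONE_NEWUTS with CLONE_NEWUSER so that it can work without root on kernels that permit unprivileged user namespaces; that is an assumption about the system configuration, and the function reports when the kernel refuses. The function name try_private_hostname is hypothetical.

```c
#define _GNU_SOURCE         /* for unshare() and the CLONE_NEW* flags */
#include <sched.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that unshares its UTS namespace and sets a private hostname.
 * Returns 1 if the child got its own namespace and renamed it, 0 if the
 * kernel refused (for example, insufficient privilege), and -1 if fork or
 * waitpid failed. The parent's hostname is never modified. */
int try_private_hostname(const char *name)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* If unshare fails we exit immediately so that we never
         * change the hostname of the real system. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) == -1)
            _exit(1);
        if (sethostname(name, strlen(name)) == -1)
            _exit(2);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Whatever the return value, calling gethostname in the parent afterward yields the original name: the change, if any, was confined to the child's namespace.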

The Linux clone system call is similar to fork in that it creates a new process. However, it allows you to pass flags that will specify which parts of the execution context will be shared with the parent. For example, a cloned process may choose to share memory and open file descriptors, which will make it behave like threads. It can also choose to share – or not – any of the elements of the namespace.

Linux namespaces allow for the creation of isolated environments for processes, including separate namespaces for filesystem, network, process IDs, user IDs, and more. This enables finer-grained control over process isolation, allowing different processes to have their own distinct views of the system.

By creating a separate mount namespace, each isolated process can also have a distinct view of the file system’s mount point structure. This enables different roots for isolated processes with specific mount points tailored to their needs.

Capabilities

A problem that FreeBSD Jails tackled was that of restricting the power of root inside a Jail. You could be a root user but still be disallowed from executing certain system calls. POSIX (Linux) capabilities² tackle this issue as well.

Traditionally, Unix systems distinguished privileged versus unprivileged processes. Privileged processes were those that ran with a user ID of 0, called the root user. When running as root, the operating system would allow access to all system calls and all access permission checks were bypassed. You could do anything.

Linux capabilities identify groups of operations, called capabilities, that can be controlled independently on a per-thread basis. The list is somewhat long (roughly 40 groups of controls, with the exact number depending on the kernel version) and includes capabilities such as:

CAP_CHOWN: make arbitrary changes to file UIDs and GIDs
CAP_DAC_OVERRIDE: bypass read/write/execute checks
CAP_KILL: bypass permission checks for sending signals
CAP_NET_ADMIN: network management operations
CAP_NET_RAW: allow RAW sockets
CAP_SETUID: arbitrary manipulation of process UIDs
CAP_SYS_CHROOT: enable chroot

The kernel keeps track of four capability sets for each thread. A capability set is a list of zero or more capabilities. The sets are:

Permitted: If a capability is not in this set, the thread or its children can never acquire that capability. This limits the power of what a process and its children can do.
Inheritable: These capabilities will be inherited when a thread calls execve to execute a program (POSIX programs are executed with the same thread; we are not creating a new process)
Effective: This is the current set of capabilities that the thread is using. The kernel uses these to perform permission checks.
Ambient: This is similar to Inheritable and contains a set of capabilities that are preserved across an execve of a program that is not privileged. Running a setuid or setgid program clears the ambient set. Ambient capabilities were created to allow a partial use of root features in a controlled manner. They are useful for user-level device drivers or software that needs a specific privilege (e.g., for certain networking operations).
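On Linux, these sets can be inspected for a running process through the CapInh, CapPrm, CapEff, and CapAmb lines of /proc/self/status, each of which is a hexadecimal bitmask with one bit per capability. The sketch below reads one of those lines and tests a bit. The helper names read_cap_set and cap_in_set are hypothetical; the capability numbers come from the kernel header linux/capability.h.

```c
#include <inttypes.h>
#include <linux/capability.h>   /* CAP_NET_RAW, CAP_SYS_CHROOT, ... */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read one capability set of the calling process from /proc/self/status.
 * `field` is "CapInh", "CapPrm", "CapEff", or "CapAmb".
 * Returns 0 and stores the bitmask in *mask, or -1 if it cannot be read. */
int read_cap_set(const char *field, uint64_t *mask)
{
    char line[256], name[32];
    uint64_t value;
    FILE *fp = fopen("/proc/self/status", "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        /* Lines look like "CapEff:\t0000000000000000" */
        if (sscanf(line, "%31[^:]: %" SCNx64, name, &value) == 2 &&
            strcmp(name, field) == 0) {
            fclose(fp);
            *mask = value;
            return 0;
        }
    }
    fclose(fp);
    return -1;
}

/* Return 1 if capability number `cap` is present in `mask`, else 0. */
int cap_in_set(uint64_t mask, int cap)
{
    return (int)((mask >> cap) & 1);
}
```

For example, after read_cap_set("CapEff", &eff) succeeds, cap_in_set(eff, CAP_NET_RAW) tells whether the process may currently open raw sockets. For an ordinary user process the effective set is typically empty, while a traditional root process has every bit set.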

A child process created via fork (the standard way of creating processes) will inherit copies of its parent’s capability sets following the rules of which capabilities have been marked as inheritable.

A set of capabilities can be assigned to an executable file by the administrator. They are stored as a file’s extended attributes (along with access control lists, checksums, and arbitrary user-defined name-value pairs). When the program runs, the executing process may further restrict the set of capabilities under which it operates if it chooses to do so. For example, it may drop a capability after performing the one operation that required it, knowing that it will no longer need it.

The key concept of capabilities is that they allow us to provide a restricted set of privileged access to a process. A root user can assign a set of capabilities to a program file or a running process. The process does not need to run as root (user ID 0) and can be granted specific privileges that would normally not be available to a non-root user. Conversely, capabilities can be withheld from a process running as root to restrict what it can do. Capabilities remove the implicit association between the user ID and privileged operations.

For example, we can grant the ping command the ability to access raw sockets so it can send an ICMP ping message on the network but not have any other administrative powers. The application does not need to run as root and even if an attacker manages to inject code, the opportunities for attack will be restricted.

The Linux combination of cgroups, namespaces, and capabilities provides a powerful set of mechanisms to:

Set limits on the system resources (processor, disk, network) that a group of processes will use.
Constrain the namespace, making parts of the filesystem or the existence of other processes or users invisible.
Give restricted privileges to specific applications so they do not need to run as root.

This enables us to create stronger jails and have a fine degree of control as to what processes are or are not allowed to do in that jail.

While bugs have been found in these mechanisms, the more serious problem is that of comprehension. The system has become far, far more complex than it was in the days of chroot. A user has to learn quite a lot to use these mechanisms properly. Failure to understand their behavior fully can create vulnerabilities. For example, namespaces do not prohibit a process from making privileged system calls. They simply limit what a process can see. A process may not be able to send a kill signal to another process only because it does not share the same process ID namespace.

Together with capabilities, namespaces allow a restricted environment that also places limits on the abilities to perform operations even if a process is granted root privileges. This enables ordinary users to create namespaces. You can create a namespace and even create a process running as a root user (UID 0) within that namespace but it will have no capabilities beyond those that were granted to the user; the user ID of 0 gets mapped by the kernel to a non-privileged user.
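The UID mapping described above can be sketched as follows. A child process creates a user namespace and then writes a one-line mapping to /proc/self/uid_map that says "UID 0 inside corresponds to my ordinary UID outside." This assumes the kernel permits unprivileged user namespaces, which some systems disable. The names write_file and try_namespace_root are hypothetical.

```c
#define _GNU_SOURCE         /* for unshare() and CLONE_NEWUSER */
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write `text` to an existing file. Returns 0 on success, -1 on error. */
static int write_file(const char *path, const char *text)
{
    ssize_t n;
    int fd = open(path, O_WRONLY);

    if (fd == -1)
        return -1;
    n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/* Returns 1 if a child became UID 0 inside a new user namespace,
 * 0 if the kernel refused, and -1 if fork or waitpid failed. */
int try_namespace_root(void)
{
    char map[64];
    int status;
    uid_t outer_uid = getuid();
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (unshare(CLONE_NEWUSER) == -1)
            _exit(1);
        /* Format: <UID inside> <UID outside> <count> */
        snprintf(map, sizeof map, "0 %ld 1", (long)outer_uid);
        if (write_file("/proc/self/uid_map", map) == -1)
            _exit(2);
        /* Inside the namespace we now appear to be root. */
        _exit(getuid() == 0 ? 0 : 3);
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Even when this returns 1, the child's "root" identity is only meaningful inside its namespace. Outside it, the kernel still treats the process as the original unprivileged user.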

Containers

Software rarely lives as an isolated application. Some software requires multiple applications and most software relies on the installation of other libraries, utilities, and packages. Keeping track of these dependencies can be difficult. Worse yet, updating one shared component can sometimes cause another application to break. What was needed was a way to isolate the installation, execution, and management of multiple software packages that run on the same system.

Various attempts were undertaken to address these problems.

The most basic was to fix problems when they occurred. This required carefully following instructions for installing, updating, and configuring software and extensive testing of all services on the system when anything changed. Should something break, the service would be unavailable until the problems were fixed.
A drastic, but thorough, approach to isolation was to simply run each service on its own computer. That avoids conflicts in library versions and other dependencies. However, it is an expensive solution, is cumbersome, and is often overkill in most environments.
Finally, administrators could deploy virtual machines. This is a technology that allows one to run multiple operating systems on one computer and gives the illusion of services running on distinct systems. However, this is a heavyweight solution. Every service needs its own installation of the operating system and all supporting software for the service as well as standard services (networking, device management, shell, etc.). It is not efficient in terms of CPU, disk, or memory resources – or even administration effort.

Containers are a mechanism that was originally created not for security but to make it easy to package, distribute, relocate, and deploy collections of software. The focus of containers is not to enable end users to install and run their favorite apps but rather for administrators to be able to deploy a variety of services on a system. A container encapsulates all the necessary software for a service, all of its dependencies, and its configuration into one package that can be easily passed around, installed, and removed.

In many ways, a container feels like a virtual machine. Containers provide a service with a private process namespace, its own network interface, and its own set of libraries to avoid problems with incompatible versions used by other software. Containers also allow an administrator to give the service restricted powers even if it runs with root (administrator) privileges. Unlike a virtual machine, however, multiple containers on one system all share the same operating system and kernel modules.

Containers are not a new mechanism. They are implemented using Linux’s control groups, namespaces, and capabilities to provide resource access, isolation, and privilege control, respectively. They also make use of a copy on write (CoW) file system. This makes it easy to create new containers where the file system can track the changes made by that container over a clean base version of a file system. Containers can also take advantage of AppArmor, which is a Linux kernel module that provides a basic form of mandatory access controls based on the pathnames of files. It allows an administrator to restrict the ability of a program to access specific files even within its file system namespace.

The best-known and most widely-used container framework is Docker. A Docker Image is a file format that creates a package of applications, their supporting libraries, and other needed files. This image can be stored and deployed on many environments. Docker made it easy to deploy containers using git-like commands (docker push, docker commit) and also to perform incremental updates. By using a copy on write file system, Docker images can be kept immutable (read-only) while any changes to the container during its execution are stored separately.

As people found Docker useful, the next design goal was to make it easier to manage containers across a network of many computers. This is called container orchestration. There are many solutions for this, including Apache Mesos, Kubernetes, Nomad, and Docker Swarm. The best known of these is Kubernetes, which was designed by Google. It coordinates storage of containers, failure of hardware and containers, and dynamic scaling: deploying the container on more machines to handle increased load. Kubernetes is coordination software, not a container system; it relies on a separate container runtime (originally Docker, and in current releases runtimes such as containerd) to run the actual containers.

Even though containers were designed to simplify software deployment rather than provide security to services, they do offer several benefits in the area of security:

They make use of namespaces, cgroups, and capabilities with restricted capabilities configured by default. This provides isolation among containers.
Containers provide a strong separation of policy (defined by the container configuration) from the enforcement mechanism (handled by the operating system).
They improve availability by providing the ability to have a watchdog timer monitor running applications and restart them if necessary. With orchestration systems such as Kubernetes, containers can be re-deployed on another system if a computer fails.
The environment created by a container is reproducible. The same container can be deployed on multiple systems and tested in different environments. This provides consistency and aids in testing and ensuring that the production deployment matches the one used for development and test. Moreover, it is easy to inspect exactly how a container is configured. This avoids problems encountered by manual installation of components where an administrator may forget to configure something or may install different versions of a required library.
While containers add nothing new to security, they help avoid comprehension errors. Even default configurations will provide improved security over the defaults in the operating system, and configuring containers is easier than learning and defining the rules for capabilities, control groups, and namespaces. Administrators are more likely to get this right or import containers that are already configured with reasonable restrictions.

Containers are not a security panacea. Because all containers run under the same operating system, any kernel exploits can affect the security of all containers. Similarly, any denial of service attacks, whether affecting the network or monopolizing the processor, will impact all containers on the system. If implemented and configured properly, capabilities, namespaces, and control groups should ensure that privilege escalation cannot take place. However, bugs in the implementation or configuration may create a vulnerability. Finally, one has to be concerned with the integrity of the container itself. Who configured it, who validated the software inside of it, and is there a chance that it may have been modified by an adversary either at the server or in transit?

Virtual Machines

As a general concept, virtualization is the addition of a layer of abstraction to physical devices. With virtual memory, for example, a process has the impression that it owns the entire memory address space. Different processes can all access the same virtual memory location and the memory management unit (MMU) on the processor maps each access to the unique physical memory locations that are assigned to the process.

Process virtual machines present a virtual CPU that allows programs to execute on a processor that does not physically exist. The instructions are interpreted by a program that simulates the architecture of the pseudo machine. Early pseudo-machines included o-code for BCPL and P-code for Pascal. The most popular pseudo-machine today is the Java Virtual Machine (JVM). This simulated hardware does not even pretend to access the underlying system at a hardware level. Process virtual machines will often allow “special” calls to invoke system functions or provide a simulation of some generic hardware platform.

Operating system virtualization is provided by containers, where a group of processes is presented with the illusion of running on a separate operating system but in reality shares the operating system with other groups of processes – they are just not visible to the processes in the container.

System virtual machines allow a physical computer to act like several real machines with each machine running its own operating system (on a virtual machine) and applications that interact with that operating system. The key to this machine virtualization is to not allow each operating system to have direct access to certain privileged instructions in the processor. These instructions would allow an operating system to directly access I/O ports, MMU settings, the task register, the halt instruction and other parts of the processor that could interfere with the processor’s behavior and with the other operating systems on the system. Instead, a trap and emulate approach is used. Privileged instructions, as well as system interrupts, are caught by the Virtual Machine Monitor (VMM), also known as a hypervisor. The hypervisor arbitrates access to physical resources and presents a set of virtual device interfaces to each guest operating system (including the memory management unit, I/O ports, disks, and network interfaces). The hypervisor also handles preemption. Just as an operating system may suspend a process to allow another process to run, the hypervisor will suspend an operating system to give other operating systems a chance to run.

The two configurations of virtual machines are hosted virtual machines and native virtual machines. With a hosted virtual machine (also called a type 2 hypervisor), the computer has a primary operating system installed that has access to the raw machine (all devices, memory, and file system). This host operating system does not run in a virtual environment. One or more guest operating systems can then be run on virtual machines. The VMM serves as a proxy, converting requests from the virtual machine into operations that get sent to and executed on the host operating system. A native virtual machine (also called a type 1 hypervisor) is one where there is no “primary” operating system that owns the system hardware. The hypervisor is in charge of access to the devices and provides each operating system drivers for an abstract view of all the devices.

Security implications

Unlike app confinement mechanisms such as jails, containers, or sandboxes, virtual machines enable isolation all the way through the operating system. A compromised application, even with escalated privileges, can wreak havoc only within the virtual machine. Even compromises to the operating system kernel are limited to that virtual machine. However, a compromised virtual machine is not much different from having a compromised physical machine sitting inside your organization: not desirable and capable of attacking other systems in your environment.

Multiple virtual machines are usually deployed on one physical system. In cases such as cloud services (e.g., those provided by Amazon Web Services, Microsoft Azure, or Google Cloud), a single physical system may host virtual machines from different organizations or running applications with different security requirements. If a malicious application on a highly secure system can detect that it is co-resident on a computer that is hosting another operating system and that operating system provides fewer restrictions, the malware may be able to create a covert channel to communicate between the highly secure system with classified data and the more open system. A covert channel is a general term to describe the ability for processes to communicate via some hidden mechanism when they are forbidden by policy to do so. In this case, the channel can be created via a side channel attack. A side channel is the ability to get or transmit information using some aspects of a system’s behavior, such as changes in power consumption, radio emissions, acoustics, or performance. For example, processes on both systems, even though they are not allowed to send network messages, may create a means of communicating by altering and monitoring system load. The malware on the classified VM can create CPU-intensive tasks at specific times. Listener software on the unclassified VM can do CPU-intensive tasks at a constant rate and periodically measure their completion times. These completion times may vary based on whether the classified system is doing CPU-intensive work. The variation in completion times creates a means of sending 1s and 0s and hence transmitting a message.
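The listener's side of such a timing channel might look roughly like the sketch below: it runs a fixed amount of CPU work, measures how long the work took, and interprets a slow run as a 1. The function names, the iteration count, and the threshold are all illustrative assumptions. A real channel would also need calibration, synchronization between sender and listener, and error correction, none of which are shown.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#include <stdint.h>
#include <time.h>

/* Run a fixed amount of CPU work and return how long it took, in
 * nanoseconds. Contention from another busy VM on the same processor
 * tends to make this take longer. */
uint64_t timed_work_ns(unsigned long iterations)
{
    struct timespec start, end;
    volatile unsigned long sink = 0;  /* volatile keeps the loop from being optimized away */
    unsigned long i;
    int64_t ns;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
        sink += i;
    clock_gettime(CLOCK_MONOTONIC, &end);

    ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
       + (int64_t)(end.tv_nsec - start.tv_nsec);
    return (uint64_t)ns;
}

/* Interpret one sample: slower than the calibrated threshold means the
 * sender was generating load during this interval (bit 1). */
int decode_bit(uint64_t elapsed_ns, uint64_t threshold_ns)
{
    return elapsed_ns > threshold_ns;
}
```

The listener would call timed_work_ns once per agreed time slot and pass each result to decode_bit, using a threshold chosen by first measuring the work on an otherwise idle system.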

Microsoft Virtualization-Based Security (VBS)

Virtualization-Based Security (VBS) is a Windows security feature that uses hardware virtualization to create secure, isolated memory regions. This isolated environment is used to protect sensitive security features, such as credential protection and code integrity enforcement.

The main goal of VBS is to protect critical security functions from being compromised by malware, privilege escalation attacks, or other unauthorized system modifications. Unlike traditional security measures that rely solely on software-based access controls, VBS uses Microsoft’s Hyper-V hypervisor to strictly enforce isolation at the hardware level without deploying separate instances of the operating system.

VBS enclaves extend this concept to provide a more secure execution environment for security-focused operations. Within a VBS enclave, data and processes remain inaccessible to the normal Windows operating system, even if the OS itself is compromised. This level of protection keeps credentials, encryption keys, and authentication mechanisms out of reach of attacks that succeed in compromising the operating system kernel.

VBS Enclave Design

At the core of VBS enclaves is the Hyper-V hypervisor, a lightweight virtualization layer that operates below the operating system. The hypervisor is responsible for creating and managing secure memory regions that cannot be accessed by the main OS or any user-mode processes. This provides a fundamental security boundary that protects enclave data from unauthorized access.

To further strengthen this protection, Windows also employs a specialized execution environment known as the secure kernel. Unlike the traditional Windows kernel, the secure kernel operates in a highly restricted mode where only trusted, verified code is allowed to execute. The secure kernel works in conjunction with the hypervisor to enforce strict access controls, ensuring that even if an attacker gains administrative privileges over the Windows OS, they cannot manipulate or extract data from VBS enclaves. When encryption and decryption processes take place inside a VBS enclave, cryptographic keys remain inaccessible to any unauthorized process, even if the operating system is compromised. This makes VBS enclaves useful for securing sensitive communications, digital signatures, and encrypted data storage.

Here are a few examples of how VBS protections go beyond Linux mechanisms such as namespaces, cgroups, and capabilities:

Protecting credentials: Linux capabilities and namespaces can remove certain root privileges and filesystem access to limit damage. However, an attacker who gains root access can still read another process's memory, for example via /proc/<pid>/mem. VBS keeps authentication secrets in an isolated memory region that the normal OS, and therefore a root-level attacker, cannot read.
Enforcing code integrity: Withholding capabilities such as CAP_SYS_MODULE prevents a process from loading kernel modules, but a fully privileged root user or a compromised kernel can still load unsigned or malicious kernel modules. With VBS, the hypervisor enforces that only signed and verified kernel drivers can be loaded and run.
Protecting kernel memory (read-only): While Linux supports configuring hardened kernels, in general, a root user has several ways to modify kernel code and kernel data structures. With VBS, the hypervisor enforces read-only protection on kernel memory, so even code running in the kernel cannot change it.
Secure execution environments: On certain hardware (such as processors with Intel SGX), Linux supports enclave-based secure execution, but Linux namespaces do not provide these capabilities directly. VBS enclaves create hardware-protected execution environments within normal applications. Even the OS kernel cannot access enclave-protected memory.
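To make the credential example concrete, the following minimal sketch shows how process memory is exposed through the /proc filesystem on Linux. It is an illustration under stated assumptions: the helper name read_own_memory and the sample secret are hypothetical, and the code reads only its own memory via /proc/self/mem, which needs no special privilege. Reading another process's /proc/<pid>/mem is subject to the kernel's ptrace access checks, which a root user generally passes.

```python
import ctypes


def read_own_memory(address, length):
    """Read `length` bytes at `address` from this process via /proc/self/mem.

    Returns None if the file is missing or unreadable (e.g., on non-Linux systems).
    """
    try:
        # buffering=0 so only the requested byte range is read
        with open("/proc/self/mem", "rb", buffering=0) as mem:
            mem.seek(address)
            return mem.read(length)
    except (OSError, ValueError, OverflowError):
        return None


# Hypothetical secret sitting in ordinary process memory
secret = b"hunter2-session-key"
buffer = ctypes.create_string_buffer(secret)  # keeps the bytes at a known address

# The same bytes come back through the /proc interface
leaked = read_own_memory(ctypes.addressof(buffer), len(secret))
```

Nothing in ordinary process memory is hidden from a party that can open this file. The VBS approach described above instead moves the secret into memory that the normal OS has no mapping for.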

VBS Limitations

Despite their strong security guarantees, VBS enclaves are not without limitations. They require hardware support for virtualization, such as Intel VT-x or AMD-V. VBS also introduces some performance overhead due to the additional processing required to maintain secure execution environments, which can impact system responsiveness, particularly in latency-sensitive applications. Compatibility is also a challenge, as some legacy software may not function correctly when VBS is enabled.

Containerization vs. Virtualization vs. Virtualization-Based Security

Containerization, used in technologies like Docker and Kubernetes, provides process-level isolation by creating separate user-space environments that share the same underlying operating system kernel. Containers offer a lightweight and efficient method for running applications securely since they eliminate the need to duplicate the full operating system stack for each instance. However, because all containers share the same kernel, they remain vulnerable to kernel-level attacks. If an attacker gains access to the kernel, they could potentially compromise all running containers.

Full virtualization provides a much stronger form of isolation by running each virtual machine with its own dedicated OS and kernel. Hypervisors such as VMware, VirtualBox, and Microsoft Hyper-V allow multiple virtual machines (VMs) to operate on the same hardware, each isolated from the others. This approach greatly enhances security because a compromise in one virtual machine does not directly affect the others. However, full virtualization comes with significant performance overhead. Each VM requires its own OS, consuming additional memory and processing power, making it less efficient than containerization.

Microsoft’s Virtualization-Based Security (VBS) enclaves offer a hybrid approach that balances the efficiency of containers with the strong isolation properties of full virtualization. Unlike traditional virtual machines, VBS enclaves do not require an entire OS instance for each process, but unlike containers, they provide hardware-enforced memory isolation through virtualization. This means that even if an attacker gains control of the operating system kernel, they cannot access the secure enclave’s data or execution environment. VBS enclaves are particularly useful in protecting sensitive security functions such as cryptographic key management and authentication processes. By leveraging hardware-based protections, VBS enclaves ensure that critical security operations remain isolated from both user-space and kernel-space threats, providing a powerful mechanism for modern cybersecurity defenses.

Note that Wikipedia and many other sites refer to this as “Version 7 Unix”. Unix was under continuous evolution at Bell Labs from 1969 through approximately 1989. As such, it did not have versions. Instead, an updated set of manuals was published periodically. Installations of Unix have been referred to by the editions of their manuals. ↩︎
Linux capabilities are not to be confused with the concept of capability lists, which are a form of access control that Linux does not use. ↩︎