Containment and Application Isolation

Study Guide

Paul Krzyzanowski – 2025-10-25

Containment limits what a compromised process can do after an attack succeeds. Even with proper input validation, vulnerabilities may remain, and if an attacker gains control of a process, traditional access controls become ineffective since the operating system assumes the process acts within its assigned privileges. Containment creates isolation boundaries that confine the impact of a faulty or malicious program, preventing it from affecting the rest of the system.

Containment operates at multiple layers:

- Application sandboxing restricts what a single process can see and do.
- OS-level isolation primitives (namespaces, control groups, and capabilities) confine groups of processes.
- Containers package those primitives into portable runtime environments.
- Virtual machines isolate entire operating systems on virtualized hardware.

Application Sandboxing

A sandbox is a restricted execution environment that mediates interactions between an application and the operating system by limiting resource access, system calls, and visible state. Sandboxing evolved from early filesystem-based confinement to kernel-level and language-level environments that can restrict both native and interpreted code.

Filesystem-Based Containment

chroot

The chroot system call changes a process's view of the root directory to a specified path, so all absolute paths are resolved relative to that new root. Child processes inherit this environment, creating a chroot jail. This mechanism affects only the filesystem namespace and does not restrict privileges or system calls.
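
A minimal sketch of entering a jail (the /jail path is hypothetical, and the directory must already be populated with every binary and library the confined program needs):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Requires root (CAP_SYS_CHROOT). */
        if (chroot("/jail") != 0) { perror("chroot"); return 1; }
        /* Move the working directory inside the new root; without this,
           the process keeps a reference to the outside filesystem. */
        if (chdir("/") != 0) { perror("chdir"); return 1; }
        /* From here on, "/" resolves to /jail on the real filesystem. */
        execl("/bin/sh", "sh", (char *)NULL);
        perror("execl");
        return 1;
    }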

A process with root privileges inside a chroot jail can escape by:

- creating device files with mknod and using them to read or write the raw disk, and thus any file on the system
- creating a second jail inside the first and exploiting the fact that chroot does not change the working directory, as sketched below
- using other root privileges that chroot does not restrict, such as sending signals to, or tracing, processes outside the jail
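
The second escape is worth seeing concretely. Because chroot does not change the working directory, a root process can chroot into a subdirectory while its working directory stays outside the new root, then walk upward. A sketch of this well-known technique:

    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        /* Running as root inside a jail: create and enter a deeper jail
           WITHOUT chdir'ing into it, so the cwd stays outside its root. */
        mkdir("hole", 0755);
        chroot("hole");
        /* ".." traversal from a cwd outside the root is not stopped, so
           walk up until we reach the real filesystem root. */
        for (int i = 0; i < 64; i++)
            chdir("..");
        chroot(".");    /* reset the root to the real "/" */
        execl("/bin/sh", "sh", (char *)NULL);   /* shell outside the jail */
        return 1;
    }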

The chroot mechanism provides no limits on CPU, memory, or I/O usage and requires copying all dependent executables, libraries, and configuration files into the jail. While still used for testing or packaging, chroot is not suitable for reliable containment.

FreeBSD Jails

FreeBSD Jails extended chroot by adding process and network restrictions, but still lacked fine-grained resource management.

System Call-Based Sandboxes

The system call interface defines the actual power of a process, as every interaction with resources goes through a system call. A system call sandbox intercepts calls and applies a policy before allowing execution, with enforcement occurring either in user space or within the kernel.

User-Level Interposition

Early implementations operated entirely in user space, often using the ptrace debugging interface to monitor processes. Janus (UC Berkeley) and Systrace (OpenBSD) are examples of this approach that relied on user-level processes for policy enforcement. Each system call was intercepted and checked against a policy before being allowed or denied. A policy might allow file access under a specific directory but deny network activity.
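
The mechanism can be sketched in a few lines of C. This is not how Janus or Systrace were built in full, only the ptrace interposition loop they depended on (x86-64 register names assumed):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* let parent trace us */
            execlp("ls", "ls", (char *)NULL);
            _exit(1);
        }
        int status;
        waitpid(child, &status, 0);                 /* child stops at execve */
        while (!WIFEXITED(status)) {
            struct user_regs_struct regs;
            /* x86-64: orig_rax holds the system call number. A real monitor
               would check it (and its arguments) against a policy here.
               Each call stops twice: once on entry, once on exit. */
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            fprintf(stderr, "syscall %lld\n", (long long)regs.orig_rax);
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* run to next stop */
            waitpid(child, &status, 0);
        }
        return 0;
    }

Every stop requires the kernel to suspend the child and schedule the monitor, which is exactly the overhead discussed below.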

This approach had significant weaknesses:

- Performance: every system call forced context switches between the kernel, the monitor, and the monitored process.
- Race conditions: arguments such as pathnames could be changed by another thread between the monitor's check and the kernel's use of them (a time-of-check-to-time-of-use, or TOCTTOU, attack).
- Complexity: the monitor had to mirror kernel state, such as file descriptors and current directories, to evaluate policies correctly, and any mismatch created holes.

User-level interposition demonstrated feasibility but was not robust enough for production use.

Kernel-Integrated Filtering: seccomp-BPF

Linux moved sandbox enforcement into the kernel with Secure Computing Mode (seccomp). Modern systems use seccomp-BPF, which adds programmable filtering through BPF bytecode. The process installs a filter that the kernel runs on every system call the process attempts; the filter inspects the system call number and raw argument values and returns an action such as:

- SECCOMP_RET_ALLOW: let the call proceed
- SECCOMP_RET_ERRNO: fail the call with a chosen error code
- SECCOMP_RET_KILL_PROCESS (or KILL_THREAD): terminate the caller
- SECCOMP_RET_TRAP: deliver a signal to the process
- SECCOMP_RET_TRACE / SECCOMP_RET_USER_NOTIF: hand the call to a tracer or user-space agent
- SECCOMP_RET_LOG: log the call and allow it

Once installed, filters cannot be relaxed—only replaced with stricter ones.
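
A minimal working sketch in C, assuming an x86-64 Linux system: install a filter that turns execve() into an error while allowing everything else (a production filter would also verify the architecture field of seccomp_data):

    #include <errno.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        struct sock_filter filter[] = {
            /* Load the system call number from the seccomp_data buffer. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            /* If it is execve, fall through to the deny; else skip it. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execve, 0, 1),
            /* Deny: the call returns -1 with errno = EPERM. */
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
            /* Default: allow every other system call. */
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        /* Promise never to gain privilege; required to install a filter
           without CAP_SYS_ADMIN. */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
            perror("prctl(PR_SET_SECCOMP)");
            return 1;
        }

        execlp("ls", "ls", (char *)NULL);   /* now fails */
        perror("execlp");                   /* "Operation not permitted" */
        return 0;
    }

Running this prints "execlp: Operation not permitted": the filter, not file permissions, blocked the call.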

Advantages:

- Enforcement happens inside the kernel, so there is no per-call context switch to a monitor process and no check-versus-use race on the call itself.
- Filters are inherited across fork and exec and cannot be removed, so a compromised process cannot shed its sandbox.
- Blocking unneeded system calls shrinks the kernel attack surface exposed to the process.

Limitations:

- Filters see only raw argument values and cannot dereference pointers, so they cannot inspect pathnames or buffer contents.
- Writing a correct policy requires knowing every system call a program legitimately needs.
- Policies are expressed at the level of individual system calls, which makes them verbose and hard to maintain.

Seccomp-BPF is now widely used in browsers, container runtimes, and service managers to reduce kernel attack surfaces.

AppArmor

While seccomp-BPF provides powerful system call filtering, it cannot inspect pathnames passed as arguments to system calls. For example, it can allow or deny the open() system call entirely, but cannot distinguish between opening /etc/passwd versus /tmp/file. This limitation exists because seccomp-BPF operates at the system call interface and can only examine raw arguments like file descriptors and memory addresses, not the filesystem paths they reference.

AppArmor addresses this gap by enforcing Mandatory Access Control (MAC) policies based on pathnames. It operates as a Linux Security Module (LSM) in the kernel and mediates access to files and directories by checking the requested path against a per-program security profile. An AppArmor profile can specify rules like "allow read access to /var/www/**" or "deny write access to /etc/**."
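
A sketch of such a profile (the program path and rules are hypothetical; real profiles typically also include shared abstractions):

    # /etc/apparmor.d/usr.sbin.websrv -- profile for a hypothetical server
    /usr/sbin/websrv {
      /var/www/** r,          # read anything under the web root
      /etc/websrv.conf r,     # read its own configuration file
      deny /etc/** w,         # no writes anywhere under /etc
      network inet stream,    # permit TCP sockets
    }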

AppArmor complements seccomp-BPF: seccomp-BPF restricts which system calls a process can make, while AppArmor restricts which resources those calls can access. Together, they provide defense in depth—one limiting the interface to the kernel, the other limiting access to specific objects within the filesystem namespace.

Language-Based Sandboxing

Some sandboxes operate entirely in user space by running code inside managed execution environments called process virtual machines. These environments provide language-level isolation by interpreting or compiling bytecode to a restricted instruction set.

Common examples include:

- the Java Virtual Machine (JVM), which executes Java bytecode
- the .NET Common Language Runtime (CLR)
- JavaScript engines embedded in web browsers
- WebAssembly runtimes, which run sandboxed modules at near-native speed

These environments emulate a CPU and manage memory internally. Programs run as bytecode (which may be interpreted or compiled just-in-time) and cannot directly access hardware or invoke system calls. All external interaction goes through controlled APIs.

Strengths:

- Portability: the same bytecode runs anywhere the runtime exists.
- Memory safety: bounds checking and managed references prevent the pointer manipulation behind most memory-corruption exploits.
- Fine-grained mediation: the runtime controls every API a program can call, so policies can be expressed at a higher level than raw system calls.

Limitations:

- They protect only code running inside the runtime; native libraries and JIT compiler bugs can bypass the sandbox.
- The runtime itself is a large, complex program and has historically contained exploitable vulnerabilities.
- Interpretation and runtime checks add overhead compared with native execution.

Language-based sandboxes often coexist with kernel-level sandboxes. For instance, a web browser runs JavaScript inside an interpreter sandbox while using seccomp or Seatbelt to confine the browser process itself.

Sandbox Evolution

Application sandboxing evolved from restricting what a process can see to restricting what it can do:

- chroot and FreeBSD Jails narrowed the filesystem (and later process and network) view
- user-level interposition (Janus, Systrace) checked each system call from outside the process
- kernel-integrated filtering (seccomp-BPF) moved those checks into the kernel
- MAC frameworks such as AppArmor added pathname-based policies on top
- language runtimes confined code above the operating system entirely

OS-Level Isolation Primitives

System call sandboxes confine individual processes, but most applications consist of multiple cooperating processes. To contain such systems, the operating system must isolate groups of processes and the resources they share. Linux provides three kernel mechanisms for this purpose:

- namespaces, which control what a group of processes can see
- control groups (cgroups), which control how much they can consume
- capabilities, which control what privileged operations they can perform

Together, these mechanisms form the foundation for containers.

Namespaces

A namespace gives a process its own private copy of part of the system's global state. Processes that share a namespace see the same view of that resource, while those in different namespaces see distinct views. Each namespace type isolates one kernel subsystem.

Linux supports several namespace types:

- PID: process IDs; the first process in a new namespace sees itself as PID 1
- Mount: the set of mounted filesystems
- Network: interfaces, routing tables, firewall rules, and port numbers
- UTS: the hostname and domain name
- IPC: System V IPC objects and POSIX message queues
- User: user and group ID mappings, with capabilities scoped to the namespace
- Cgroup: the visible portion of the cgroup hierarchy
- Time: the view of the system's monotonic and boot-time clocks

Each namespace acts like a self-contained copy of a subsystem. Namespaces let multiple isolated environments run on a single kernel, providing the illusion of separate systems without hardware virtualization. However, they hide and partition resources but do not limit consumption.
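
For example, a process can give itself a private hostname with one call. A minimal sketch using the UTS namespace (needs CAP_SYS_ADMIN; an unprivileged process can get the same effect by also creating a user namespace):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Detach into a private UTS (hostname/domain) namespace. */
        if (unshare(CLONE_NEWUTS) != 0) { perror("unshare"); return 1; }

        /* Renames the machine only inside this namespace; the host's
           hostname is untouched. */
        sethostname("sandbox", 7);

        char name[64];
        gethostname(name, sizeof(name));
        printf("hostname in this namespace: %s\n", name);  /* "sandbox" */
        return 0;
    }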

Control Groups (cgroups)

A control group (cgroup) manages and limits resource usage. While namespaces define what a process can see, cgroups define how much of each resource it can use. Cgroups organize processes into a hierarchy of groups, each carrying limits on resource usage; every resource type is managed by a controller that measures consumption and enforces restrictions.

Common controllers manage:

- CPU: scheduling weight and hard caps on processor time
- Memory: maximum memory use, enforced by reclamation and the OOM killer
- Block I/O: bandwidth and request rates to storage devices
- PIDs: the number of processes a group may create

A service can belong to several cgroups with different controllers. The kernel tracks usage per group and enforces limits through scheduling and memory reclamation. If a process exceeds its memory quota, the kernel's out-of-memory (OOM) handler terminates it without affecting other groups.
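
Cgroups are configured by writing to control files rather than through system calls. A minimal sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and root privileges (the group name "demo" and the 256 MB figure are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *value) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); exit(1); }
        fputs(value, f);
        fclose(f);
    }

    int main(void) {
        /* Creating the directory creates the cgroup; the kernel fills it
           with control files such as memory.max and cgroup.procs. */
        mkdir("/sys/fs/cgroup/demo", 0755);

        /* Cap the group at 256 MB; exceeding it triggers reclamation
           and, if that fails, the group's OOM killer. */
        write_file("/sys/fs/cgroup/demo/memory.max", "268435456");

        /* Join the group; children inherit membership automatically. */
        char pid[32];
        snprintf(pid, sizeof(pid), "%d", getpid());
        write_file("/sys/fs/cgroup/demo/cgroup.procs", pid);
        return 0;
    }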

Namespaces and cgroups together isolate processes functionally and economically: each process group sees only its own resources and consumes only what it is permitted.

Capabilities

Traditional Unix privilege management treated the root user (UID 0) as all-powerful, checking only whether the process's effective user ID was zero. This binary model violated the principle of least privilege.

Capability Model

Capabilities break up root's privilege into specific pieces. The kernel no longer assumes that UID 0 can do everything by default; each privileged operation now requires the matching capability. Each capability represents authorization for a specific class of privileged operation, such as configuring network interfaces (CAP_NET_ADMIN) or loading kernel modules (CAP_SYS_MODULE). Under this model, UID 0 alone no longer implies complete control—the kernel checks both the user ID and capability bits before allowing any privileged action.
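
The difference is visible from user space. This sketch tries to nudge the system clock; whether it succeeds depends on the process holding CAP_SYS_TIME, which the kernel checks instead of simply testing for UID 0:

    #include <errno.h>
    #include <stdio.h>
    #include <sys/time.h>

    int main(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);    /* reading the clock is unprivileged */
        tv.tv_sec += 1;             /* try to push it forward one second */
        if (settimeofday(&tv, NULL) != 0)
            /* The kernel checked for CAP_SYS_TIME and we lack it. */
            perror("settimeofday");  /* "Operation not permitted" */
        else
            puts("clock set (process held CAP_SYS_TIME)");
        return 0;
    }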

Common Capabilities

Linux defines over 40 distinct capabilities. Some important examples include:

- CAP_NET_BIND_SERVICE: bind to privileged ports (below 1024)
- CAP_NET_ADMIN: configure interfaces, routing, and firewall rules
- CAP_SYS_ADMIN: a broad catch-all that includes mounting filesystems
- CAP_SYS_MODULE: load and unload kernel modules
- CAP_SYS_TIME: set the system clock
- CAP_DAC_OVERRIDE: bypass file permission checks
- CAP_KILL: send signals to processes owned by other users

For instance, a web server can be granted only CAP_NET_BIND_SERVICE to bind to port 80 while running as a non-root user. Even if compromised, it cannot mount filesystems, modify network routing, or change the system clock.
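
A sketch of that server's startup using the libcap library (link with -lcap; assumes the process begins with the capability available, e.g., it was started as root or the binary carries file capabilities):

    #include <stdio.h>
    #include <sys/capability.h>

    int main(void) {
        /* Build a capability set containing only CAP_NET_BIND_SERVICE. */
        cap_t caps = cap_init();        /* starts with all bits clear */
        cap_value_t keep[] = { CAP_NET_BIND_SERVICE };
        cap_set_flag(caps, CAP_PERMITTED, 1, keep, CAP_SET);
        cap_set_flag(caps, CAP_EFFECTIVE, 1, keep, CAP_SET);

        /* Apply it: every other capability is dropped. */
        if (cap_set_proc(caps) != 0) { perror("cap_set_proc"); return 1; }
        cap_free(caps);

        /* ... bind to port 80 here; mounting, routing changes, and
           clock changes now fail even if our UID is 0 ... */
        return 0;
    }

Alternatively, the capability can be attached to the executable itself (setcap cap_net_bind_service=+ep on the binary), as the next section describes.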

Applying Capabilities

Capabilities can be attached to executable files or granted to running processes. Once dropped, capabilities cannot be regained unless the process executes a binary that carries them as file capabilities. Entering a user namespace alters capability behavior: a process can appear to be root inside the namespace, but its capabilities apply only within that namespace, not to the host.

Root Under Capabilities

A process with UID 0 must still have the appropriate capabilities to perform privileged operations; the UID alone is not sufficient. A non-root process given a specific capability can perform only the operation covered by that capability. Processes can permanently relinquish capabilities, allowing them to perform initialization requiring privilege and then continue safely with minimal rights, implementing the principle of least privilege.

Integration

Together, these mechanisms implement the principle of least privilege at the operating-system level, restricting what a process can see, what it can consume, and what it can do.

Containerization

Containerization builds on namespaces, control groups, and capabilities to package applications and their dependencies into lightweight, portable units that behave like independent systems. Each container has its own processes, filesystem, network interfaces, and resource limits, yet all containers run as ordinary processes under the same kernel.

Purpose and Design

Containers were introduced primarily to simplify the packaging, deployment, and distribution of software services. They made it possible to bundle an application and its dependencies into a single, portable image that could run the same way in development, testing, and production. The underlying mechanisms were developed for resource management and process control, not for security. As container frameworks matured, these same mechanisms also provided practical isolation, making containers useful for separating services, though not as a strong security boundary.

Container Operation

Traditional virtualization runs multiple operating systems by emulating hardware, with each virtual machine including its own kernel and system libraries. This offers strong isolation but duplicates system components, consuming memory and startup time. Containers achieve similar separation with less overhead by virtualizing the operating system interface—the process and resource view provided by the kernel—rather than hardware.

How the three mechanisms combine in containers:

- Namespaces give each container a private view of its processes, filesystem, network interfaces, and hostname.
- Cgroups cap each container's CPU, memory, and I/O consumption.
- Capabilities are stripped so that even root inside the container cannot administer the host.

This layered design allows thousands of isolated services to run on one host without the duplication inherent in full virtual machines.

How Containers Work

Containers are a structured way to combine kernel features into a managed runtime. Each container starts as an ordinary process, but the container runtime (such as Docker, containerd, or LXC) configures it with:

  1. New namespaces for isolated process IDs, network stack, hostname, and filesystem

  2. Cgroups that define resource limits

  3. Restricted capabilities so even root inside the container has limited privileges

  4. A filesystem built from an image—a prebuilt snapshot containing all files, libraries, and configuration

Container runtimes automate the setup of kernel mechanisms and apply consistent, minimal-privilege defaults. Images are layered and can be stored in registries, making it easy to distribute and deploy applications consistently across different environments. This combination of isolation, resource control, and portability is why containers became central to modern software deployment.
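
For instance, a single Docker invocation (the image name is hypothetical) shows the runtime wiring together resource limits and capability restrictions on top of the namespaces it always creates:

    docker run --memory=256m --cpus=1 \
        --cap-drop=ALL --cap-add=NET_BIND_SERVICE \
        --read-only webapp-image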

Security Characteristics

Containers improve isolation but do not create a full security boundary. All containers share the same kernel, so a vulnerability in the kernel could allow one container to affect others. Within a container, the root user has administrative control inside that namespace but not on the host. However, kernel bugs or misconfigured capabilities can weaken that boundary.

To strengthen isolation, systems often combine containers with additional mechanisms:

- seccomp-BPF filters that block system calls the containerized service does not need
- MAC profiles (AppArmor or SELinux) that restrict which files and resources each container may touch
- user namespaces that map root inside the container to an unprivileged user on the host
- lightweight virtual machines wrapped around containers that run untrusted code

Containers provide meaningful isolation for ordinary services but are not appropriate for untrusted or hostile code without additional containment layers.

Practical Benefits

Beyond isolation, containers provide significant advantages:

- Consistency: an image runs the same way in development, testing, and production.
- Efficiency: containers share one kernel, so they start in milliseconds and use far less memory than virtual machines.
- Distribution: layered images in registries make software easy to ship and update.
- Scalability: orchestration systems can schedule, replicate, and restart containers automatically.

The same kernel features that provide containment also make containers predictable to manage and easy to orchestrate at scale.

Virtualization

Virtualization moves the boundary of isolation to the hardware level. A virtual machine (VM) emulates an entire computer system including CPU, memory, storage, and network interfaces. Each VM runs its own operating system and kernel, independent of the host. From the guest operating system's perspective, it has full control of the hardware, even though that hardware is simulated. This approach provides strong isolation because the guest cannot directly access the host's memory or devices.

Virtualization Mechanics

Virtualization creates the illusion that each operating system has exclusive access to the hardware. A software layer called a hypervisor or Virtual Machine Monitor (VMM) sits between the hardware and the guest operating systems. It intercepts privileged operations, manages memory and device access, and schedules CPU time among the guests.

When a guest operating system issues an instruction that would normally access hardware directly, the hypervisor traps that instruction, performs it safely on the guest's behalf, and returns the result. With modern hardware support, most instructions run directly on the CPU, with the hypervisor only intervening for privileged operations. This allows near-native performance while maintaining separation between guests.

Modern processors include hardware support for virtualization, allowing the CPU to switch quickly between executing guest code and hypervisor code, reducing overhead.

Hypervisor Types

Type 1 (bare-metal) hypervisors run directly on the hardware and manage the guest operating systems themselves, effectively serving as the host OS. They are more efficient and are the usual choice in data centers and clouds.

Type 2 (hosted) hypervisors run as applications under a conventional operating system and rely on that OS's device drivers. They are easier to install on desktop systems and are typically used for testing, development, or running an alternative OS.

Containers vs. Virtual Machines

A container isolates processes but shares the host kernel. A virtual machine isolates an entire operating system with its own kernel. This key difference means:

- Containers start quickly and add little overhead, since no second kernel or OS image is involved.
- A kernel vulnerability endangers every container on a host, while VM guests remain separated by the hypervisor.

VMs can run different operating systems simultaneously; containers must use the host kernel. In practice, many systems combine both: running containers inside VMs to balance efficiency with strong isolation.

Virtualization Advantages

Virtual machines offer benefits beyond isolation:

- entire systems can be snapshotted, cloned, migrated, and rolled back
- different operating systems can share the same hardware
- many underutilized servers can be consolidated onto fewer machines
- a compromised or corrupted guest can be discarded and restored from a clean image

Security Implications

Virtualization offers strong isolation because the hypervisor mediates all access to hardware. A guest cannot normally read or modify another guest's memory or the hypervisor itself. However, vulnerabilities still exist:

- Hypervisor bugs can let a guest break out and control the host (a VM escape).
- Guests that share physical hardware are exposed to side-channel attacks through shared caches and speculative execution.
- Emulated and shared devices add code, and therefore attack surface, at the boundary between guest and hypervisor.

Hypervisors are typically small and security-hardened, but their central role makes them high-value targets.

Containment Through Virtualization

From the perspective of containment, virtualization represents a deeper boundary. Process-level and container-level mechanisms rely on kernel enforcement. Virtualization adds a distinct kernel for each guest and isolates them with hardware-level checks. This separation makes virtualization the preferred choice for workloads requiring strong security guarantees, multi-tenant separation, or different operating systems.

In practice, many systems combine layers: containers run inside virtual machines, and those virtual machines run under a hypervisor on shared hardware. This layered approach provides both efficiency and assurance. Virtualization represents the deepest layer of software-based isolation—shifting enforcement from the kernel to the hardware level.

Key Takeaways

Containment operates at multiple layers, each providing different trade-offs between security, performance, and flexibility:

- Sandboxes (chroot, seccomp-BPF, AppArmor, language runtimes) confine what a single process can see and do.
- Namespaces, cgroups, and capabilities restrict what groups of processes can see, how much they can consume, and which privileged operations they can perform.
- Containers combine those primitives into portable, managed runtime environments that share the host kernel.
- Virtual machines place each workload behind its own kernel and a hypervisor, the strongest software-based boundary.

The progression from sandboxing to virtualization represents increasingly deeper isolation boundaries: from controlling what a process can see and do, to isolating groups of processes sharing a kernel, to separating entire operating systems with distinct kernels. Each layer builds on the principle of least privilege and defense in depth, restricting access and limiting the impact of compromise. Modern systems often combine multiple layers—running sandboxed applications in containers inside virtual machines—to balance efficiency with strong security guarantees.
