Containment limits what a compromised process can do after an attack succeeds. Even with proper input validation, vulnerabilities may remain, and if an attacker gains control of a process, traditional access controls become ineffective since the operating system assumes the process acts within its assigned privileges. Containment creates isolation boundaries that confine the impact of a faulty or malicious program, preventing it from affecting the rest of the system.
Containment operates at multiple layers:
- Application sandboxes restrict individual processes
- Containers isolate sets of processes
- Virtual machines emulate entire operating systems
- Hardware-based isolation enforces security boundaries below the operating system level
Application Sandboxing
A sandbox is a restricted execution environment that mediates interactions between an application and the operating system by limiting resource access, system calls, and visible state. Sandboxing evolved from early filesystem-based confinement to kernel-level and language-level environments that can restrict both native and interpreted code.
Filesystem-Based Containment
chroot
The chroot system call changes a process's view of the root directory to a specified path, so all absolute paths are resolved relative to that new root. Child processes inherit this environment, creating a chroot jail. This mechanism affects only the filesystem namespace and does not restrict privileges or system calls.
A process with root privileges inside a chroot jail can escape by:
- Manipulating directory structures (creating a subdirectory, chrooting into it, then traversing upward)
- Using ptrace to attach to processes outside the jail, if accessible
- Creating device nodes to access system memory or disk directly
The chroot mechanism provides no limits on CPU, memory, or I/O usage and requires copying all dependent executables, libraries, and configuration files into the jail. While still used for testing or packaging, chroot is not suitable for reliable containment.
FreeBSD Jails
FreeBSD Jails extended chroot by adding process and network restrictions, but still lacked fine-grained resource management.
System Call-Based Sandboxes
The system call interface defines the actual power of a process, as every interaction with resources goes through a system call. A system call sandbox intercepts calls and applies a policy before allowing execution, with enforcement occurring either in user space or within the kernel.
User-Level Interposition
Early implementations operated entirely in user space, often using the ptrace debugging interface to monitor processes. Janus (UC Berkeley) and Systrace (OpenBSD) took this approach, relying on a user-level process to enforce policy: each system call was intercepted and checked before being allowed or denied. A policy might allow file access under a specific directory but deny network activity.
This approach had significant weaknesses:
- Race conditions could occur between the check and the actual call (time-of-check-to-time-of-use vulnerabilities): a program could pass a safe filename during the check, then quickly change it to a sensitive file before execution
- Tracking all side effects of system calls was challenging (operations on file descriptors, file descriptor assignment and duplication, relative pathname parsing)
- Multithreaded programs could bypass monitoring
- Each system call introduced substantial overhead from context switches to the tracer
User-level interposition demonstrated feasibility but was not robust enough for production use.
Kernel-Integrated Filtering: seccomp-BPF
Linux moved sandbox enforcement into the kernel with Secure Computing Mode (seccomp). Modern systems use seccomp-BPF, which adds programmable filtering through BPF bytecode. The process installs a filter that the kernel executes whenever it attempts a system call, inspecting the system call number and arguments and returning actions such as:
- SECCOMP_RET_ALLOW: permit the call
- SECCOMP_RET_ERRNO: block the call and return an error to the caller
- SECCOMP_RET_TRAP: deliver a signal to the process
- SECCOMP_RET_KILL: terminate the process
Once installed, a filter cannot be removed or relaxed; a process may stack additional filters, but each one can only tighten the policy further.
Advantages:
- Enforcement in the kernel eliminates race conditions
- Fine-grained control over allowed calls and arguments
- Low runtime overhead compared to user-space approaches
Limitations:
- Policies are static and written in low-level BPF syntax
- Does not manage resources or filesystem visibility
Seccomp-BPF is now widely used in browsers, container runtimes, and service managers to reduce kernel attack surfaces.
AppArmor
While seccomp-BPF provides powerful system call filtering, it cannot inspect pathnames passed as arguments to system calls. For example, it can allow or deny the open() system call entirely, but cannot distinguish between opening /etc/passwd versus /tmp/file. This limitation exists because seccomp-BPF operates at the system call interface and can only examine raw arguments like file descriptors and memory addresses, not the filesystem paths they reference.
AppArmor addresses this gap by enforcing Mandatory Access Control (MAC) policies based on pathnames. It operates as a Linux Security Module (LSM) in the kernel and mediates access to files and directories by checking the requested path against a per-program security profile. An AppArmor profile can specify rules like "allow read access to /var/www/**" or "deny write access to /etc/**."
AppArmor complements seccomp-BPF: seccomp-BPF restricts which system calls a process can make, while AppArmor restricts which resources those calls can access. Together, they provide defense in depth—one limiting the interface to the kernel, the other limiting access to specific objects within the filesystem namespace.
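To make this concrete, a hypothetical AppArmor profile for an imaginary binary /usr/local/bin/webapp might look like the sketch below; the paths and rules are illustrative, not taken from a real policy:

```
# Hypothetical profile; binary name and paths are made up.
/usr/local/bin/webapp {
  #include <abstractions/base>

  # Read-only access to web content
  /var/www/** r,

  # May append to its own log file
  /var/log/webapp.log w,

  # Never modify system configuration, even if file permissions allow it
  deny /etc/** w,

  # TCP networking only
  network inet stream,
}
```

Because the profile is mandatory, these rules apply even when the process runs as root: the kernel checks the requested pathname against the profile on every access.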
Language-Based Sandboxing
Some sandboxes operate entirely in user space by running code inside managed execution environments called process virtual machines. These environments provide language-level isolation by interpreting or compiling bytecode to a restricted instruction set.
Common examples include:
- Java Virtual Machine (JVM): Verifies bytecode before execution, ensuring operations stay within defined type and memory bounds
- Microsoft .NET Common Language Runtime (CLR): Provides managed execution for C#, VB.NET, and other languages
- Python interpreter: Can confine execution by controlling access to modules
- JavaScript engines: Browser engines restrict access to the filesystem and network, allowing only specific APIs
These environments emulate a CPU and manage memory internally. Programs run as bytecode (which may be interpreted or compiled just-in-time) and cannot directly access hardware or invoke system calls. All external interaction goes through controlled APIs.
Strengths:
- Memory safety and portability across platforms
- No direct system calls
- Logical separation between user code and host resources
Limitations:
- Depend on runtime correctness: a flaw in the interpreter breaks isolation
- Limited ability to enforce fine-grained resource policies
- The runtime itself must be sandboxed at the OS level
Language-based sandboxes often coexist with kernel-level sandboxes. For instance, a web browser runs JavaScript inside an interpreter sandbox while using seccomp or Seatbelt to confine the browser process itself.
Sandbox Evolution
Application sandboxing evolved from restricting what a process can see to restricting what it can do:
- Filesystem-based approaches like chroot provided simple legacy compatibility but no control over system calls or privileges
- System call-based sandboxes at the kernel level offer fine-grained and efficient control but require complex or static configuration
- Language-based sandboxes provide memory-safe and portable environments but depend on runtime integrity
OS-Level Isolation Primitives
System call sandboxes confine individual processes, but most applications consist of multiple cooperating processes. To contain such systems, the operating system must isolate groups of processes and the resources they share. Linux provides three kernel mechanisms for this purpose:
- Namespaces: Define which resources a process can see
- Control groups (cgroups): Define how much of each resource a process can use
- Capabilities: Define what privileged actions a process may perform
Together, these mechanisms form the foundation for containers.
Namespaces
A namespace gives a process its own private copy of part of the system's global state. Processes that share a namespace see the same view of that resource, while those in different namespaces see distinct views. Each namespace type isolates one kernel subsystem.
Linux supports several namespace types:
- PID namespaces: Isolate process IDs so each namespace has its own PID 1; processes cannot see or signal those in other namespaces
- Mount namespaces: Allow each namespace to mount or unmount filesystems independently
- UTS namespaces: Isolate the hostname and domain name
- Network namespaces: Provide private network stacks with their own interfaces, routing tables, and sockets
- IPC namespaces: Isolate System V and POSIX IPC objects like shared memory or semaphores
- User namespaces: Map internal UIDs to different real UIDs on the host
- Cgroup namespaces: Control visibility of control-group resources
Each namespace acts like a self-contained copy of a subsystem. Namespaces let multiple isolated environments run on a single kernel, providing the illusion of separate systems without hardware virtualization. However, namespaces only hide and partition resources; they do not limit consumption.
Control Groups (cgroups)
A control group (cgroup) manages and limits resource usage. While namespaces define what a process can see, cgroups define how much of each resource it can use. A cgroup is a hierarchy of processes with limits on resource usage, where each type of resource is managed by a controller that measures consumption and enforces restrictions.
Common controllers manage:
- CPU: Scheduling and quotas
- Memory: Physical and swap memory limits
- PIDs: Process count limits to prevent fork bombs
A service can belong to several cgroups with different controllers. The kernel tracks usage per group and enforces limits through scheduling and memory reclamation. If a process exceeds its memory quota, the kernel's out-of-memory (OOM) handler terminates it without affecting other groups.
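For illustration, systemd (which places every service in its own cgroup) exposes these controllers through unit-file directives; the service name and limit values below are made up:

```
# example.service (hypothetical) - limits enforced via cgroup controllers
[Service]
ExecStart=/usr/local/bin/example-daemon
# memory controller: hard limit; the OOM killer fires beyond this
MemoryMax=512M
# cpu controller: at most half of one CPU's time
CPUQuota=50%
# pids controller: caps the number of tasks, preventing fork bombs
TasksMax=128
```

Under the hood these directives translate into writes to files like memory.max and pids.max in the service's cgroup directory, which is where the kernel actually enforces the limits.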
Namespaces and cgroups together isolate processes functionally and economically: each process group sees only its own resources and consumes only what it is permitted.
Capabilities
Traditional Unix privilege management treated the root user (UID 0) as all-powerful, checking only whether the process's effective user ID was zero. This binary model violated the principle of least privilege.
Capability Model
Capabilities break up root's privilege into specific pieces. The kernel no longer assumes that UID 0 can do everything by default; each privileged operation now requires the matching capability. Each capability represents authorization for a specific class of privileged operation, such as configuring network interfaces (CAP_NET_ADMIN) or loading kernel modules (CAP_SYS_MODULE). Under this model, UID 0 alone no longer implies complete control—the kernel checks both the user ID and capability bits before allowing any privileged action.
Common Capabilities
Linux defines over 40 distinct capabilities. Some important examples include:
- CAP_NET_ADMIN: Modify network configuration
- CAP_SYS_MODULE: Load and unload kernel modules
- CAP_SYS_TIME: Change the system clock
- CAP_NET_BIND_SERVICE: Bind to privileged ports (below 1024)
- CAP_DAC_OVERRIDE: Bypass file permission checks
For instance, a web server can be granted only CAP_NET_BIND_SERVICE to bind to port 80 while running as a non-root user. Even if compromised, it cannot mount filesystems, modify network routing, or change the system clock.
Applying Capabilities
Capabilities can be attached to executable files or granted to running processes. Once dropped, capabilities cannot be regained unless the process executes another binary that has them defined. Entering a user namespace alters capability behavior—a process can appear to be root inside the namespace, but its capabilities apply only within that namespace, not to the host.
Root Under Capabilities
A process with UID 0 must still have the appropriate capabilities to perform privileged operations; the UID alone is not sufficient. A non-root process given a specific capability can perform only the operation covered by that capability. Processes can permanently relinquish capabilities, allowing them to perform initialization requiring privilege and then continue safely with minimal rights, implementing the principle of least privilege.
Integration
- Namespaces isolate visibility by giving each process its own view of system resources
- Control groups enforce limits on resource consumption
- Capabilities break up root privilege into narrowly scoped rights
Together, these mechanisms implement the principle of least privilege at the operating-system level, restricting what a process can see, what it can consume, and what it can do.
Containerization
Containerization builds on namespaces, control groups, and capabilities to package applications and their dependencies into lightweight, portable units that behave like independent systems. Each container has its own processes, filesystem, network interfaces, and resource limits, yet all containers run as ordinary processes under the same kernel.
Purpose and Design
Containers were introduced primarily to simplify the packaging, deployment, and distribution of software services. They made it possible to bundle an application and its dependencies into a single, portable image that could run the same way in development, testing, and production. The underlying mechanisms were developed for resource management and process control, not for security. As container frameworks matured, these same mechanisms also provided practical isolation, making containers useful for separating services, though not as a strong security boundary.
Container Operation
Traditional virtualization runs multiple operating systems by emulating hardware, with each virtual machine including its own kernel and system libraries. This offers strong isolation but duplicates system components, consuming memory and startup time. Containers achieve similar separation with less overhead by virtualizing the operating system interface—the process and resource view provided by the kernel—rather than hardware.
How the three mechanisms combine in containers:
- Namespaces give each container its own process IDs, network stack, hostname, and filesystem view
- Cgroups limit how much CPU time, memory, and disk bandwidth each container can consume
- Capabilities restrict privileged operations so that even root inside a container is not root on the host
This layered design allows thousands of isolated services to run on one host without the duplication inherent in full virtual machines.
How Containers Work
Containers are a structured way to combine kernel features into a managed runtime. Each container starts as an ordinary process, but the container runtime (such as Docker, containerd, or LXC) configures it with:
- New namespaces for isolated process IDs, network stack, hostname, and filesystem
- Cgroups that define resource limits
- Restricted capabilities so even root inside the container has limited privileges
- A filesystem built from an image: a prebuilt snapshot containing all files, libraries, and configuration
Container runtimes automate the setup of kernel mechanisms and apply consistent, minimal-privilege defaults. Images are layered and can be stored in registries, making it easy to distribute and deploy applications consistently across different environments. This combination of isolation, resource control, and portability is why containers became central to modern software deployment.
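As a concrete illustration, an OCI runtime configuration (the config.json consumed by runc and similar runtimes) ties the three mechanisms together in one document; the fragment below is abbreviated, and the specific values are illustrative:

```json
{
  "process": {
    "capabilities": { "bounding": ["CAP_NET_BIND_SERVICE"] }
  },
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" },
      { "type": "mount" }, { "type": "uts" }, { "type": "ipc" }
    ],
    "resources": {
      "memory": { "limit": 268435456 },
      "pids": { "limit": 256 }
    }
  }
}
```

The runtime reads this file and performs the corresponding unshare, cgroup, and capability setup before starting the container's first process, which is why every container gets the same minimal-privilege defaults.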
Security Characteristics
Containers improve isolation but do not create a full security boundary. All containers share the same kernel, so a vulnerability in the kernel could allow one container to affect others. Within a container, the root user has administrative control inside that namespace but not on the host. However, kernel bugs or misconfigured capabilities can weaken that boundary.
To strengthen isolation, systems often combine containers with additional mechanisms:
- seccomp-BPF filters block dangerous system calls
- Mandatory Access Control (MAC) frameworks like SELinux or AppArmor restrict filesystem and process access
- Running containers inside virtual machines adds an extra hardware-enforced barrier
Containers provide meaningful isolation for ordinary services but are not appropriate for untrusted or hostile code without additional containment layers.
Practical Benefits
Beyond isolation, containers provide significant advantages:
- Portability: Applications run the same way in development, testing, and production because each container includes its dependencies
- Efficiency: Containers start quickly and use fewer resources than virtual machines
- Density: Many containers can share a single kernel, allowing high utilization of servers
- Manageability: Tools automate deployment, scaling, and monitoring
The same kernel features that provide containment also make containers predictable to manage and easy to orchestrate at scale.
Virtualization
Virtualization moves the boundary of isolation to the hardware level. A virtual machine (VM) emulates an entire computer system including CPU, memory, storage, and network interfaces. Each VM runs its own operating system and kernel, independent of the host. From the guest operating system's perspective, it has full control of the hardware, even though that hardware is simulated. This approach provides strong isolation because the guest cannot directly access the host's memory or devices.
Virtualization Mechanics
Virtualization creates the illusion that each operating system has exclusive access to the hardware. A software layer called a hypervisor or Virtual Machine Monitor (VMM) sits between the hardware and the guest operating systems. It intercepts privileged operations, manages memory and device access, and schedules CPU time among the guests.
When a guest operating system issues an instruction that would normally access hardware directly, the hypervisor traps that instruction, performs it safely on the guest's behalf, and returns the result. With modern hardware support, most instructions run directly on the CPU, with the hypervisor only intervening for privileged operations. This allows near-native performance while maintaining separation between guests.
Modern processors include hardware support for virtualization, allowing the CPU to switch quickly between executing guest code and hypervisor code, reducing overhead.
Hypervisor Types
Type 1 (bare-metal) hypervisors run directly on hardware and manage guest operating systems, with the hypervisor effectively serving as the host OS. They are more efficient and used in data centers and clouds.
Type 2 (hosted) hypervisors run as applications under a conventional operating system and use that OS's device drivers. They are easier to install on desktop systems and used for testing, development, or running alternative OSes.
Containers vs. Virtual Machines
A container isolates processes but shares the host kernel. A virtual machine isolates an entire operating system with its own kernel. This key difference means:
- VMs are more secure (stronger isolation, each has its own kernel) but heavier (duplicate OS components, slower startup)
- Containers are more efficient (shared kernel, fast startup, low overhead) but provide weaker isolation (kernel vulnerabilities affect all containers)
VMs can run different operating systems simultaneously; containers must use the host kernel. In practice, many systems combine both: running containers inside VMs to balance efficiency with strong isolation.
Virtualization Advantages
- Strong isolation: Each guest runs in its own protected memory space and cannot interfere with others
- Hardware independence: The hypervisor emulates a uniform hardware interface, allowing guests to run on different physical machines without modification
- Snapshotting and migration: The state of a VM (its memory, CPU, and disk) can be saved, cloned, or moved to another host
- Consolidation: Multiple virtual servers can share one machine, increasing hardware utilization and reducing costs
- Testing and recovery: Virtual machines can be paused, restored, or reset easily, supporting software development and disaster recovery
Security Implications
Virtualization offers strong isolation because the hypervisor mediates all access to hardware. A guest cannot normally read or modify another guest's memory or the hypervisor itself. However, vulnerabilities still exist:
- VM escape: A compromised guest gains control over the hypervisor or host, usually by exploiting vulnerabilities in how the hypervisor emulates devices. This breaks isolation and gives the attacker access to all other virtual machines on the same host
- Hypervisor vulnerabilities: Bugs in management interfaces or exposed APIs used for remote administration. Because the hypervisor controls all guest systems, these weaknesses are critical targets
- Side-channel attacks: Exploit shared hardware resources to infer information by measuring timing or behavior
- Shared-device risks: Multiple VMs using the same physical devices can allow information leakage or denial of service through poorly isolated drivers
Hypervisors are typically small and security-hardened, but their central role makes them high-value targets.
Containment Through Virtualization
From the perspective of containment, virtualization represents a deeper boundary. Process-level and container-level mechanisms rely on kernel enforcement. Virtualization adds a distinct kernel for each guest and isolates them with hardware-level checks. This separation makes virtualization the preferred choice for workloads requiring strong security guarantees, multi-tenant separation, or different operating systems.
In practice, many systems combine layers: containers run inside virtual machines, and those virtual machines run under a hypervisor on shared hardware. This layered approach provides both efficiency and assurance. Virtualization represents the deepest layer of software-based isolation—shifting enforcement from the kernel to the hardware level.
Key Takeaways
Containment operates at multiple layers, each providing different trade-offs between security, performance, and flexibility:
- Application sandboxes restrict individual processes through filesystem isolation, system call filtering, or language-based environments
- OS-level primitives (namespaces, cgroups, and capabilities) allow the kernel to isolate groups of processes, limit their resource consumption, and divide system privileges
- Containers combine these primitives into lightweight, portable units for application deployment
- Virtual machines provide the strongest isolation by emulating hardware and running separate operating systems
The progression from sandboxing to virtualization represents increasingly deeper isolation boundaries: from controlling what a process can see and do, to isolating groups of processes sharing a kernel, to separating entire operating systems with distinct kernels. Each layer builds on the principle of least privilege and defense in depth, restricting access and limiting the impact of compromise. Modern systems often combine multiple layers—running sandboxed applications in containers inside virtual machines—to balance efficiency with strong security guarantees.