Isolation, Kernel Organization, and System Call APIs

Introduction and Motivation: The Necessity of Isolation

  • Scenario Without Isolation: To understand isolation, one must consider a system lacking it. In such a system, several critical failures occur:

    • Single Buggy Process: A bug in a process can corrupt kernel memory, leading to a full system crash.

    • Malicious Process: Without boundaries, a malicious process can read secrets directly from other processes.

    • Greedy Process: A process can monopolize the CPU, causing starvation for all other tasks.

    • Crashing Driver: A faulty driver can take down the entire Operating System (OS).

  • Historical Context: Early Batch Systems (1950s–1960s):

    • These systems ran one job at a time.

    • A single bug required a physical reboot of the machine.

    • There were no concurrent users and no security model was required.

  • Drivers of Change: Several technological shifts made isolation mandatory:

    • Timesharing: Enabling multiple users on a single machine.

    • Networked Systems: The introduction of untrusted inputs from external networks.

    • Multi-process Workloads: The need to run many applications simultaneously.

    • Commercial Software: The prevalence of software with unknown quality and potential vulnerabilities.

Defining Isolation and Protection

  • Isolation (Property): The property that a fault, bug, or attack in one component cannot propagate beyond its defined boundary. Isolation is effectively about containment; the damage stays "inside the box."

  • Protection (Mechanism): The mechanism that enforces isolation. This is the hardware or software that prevents violations of the defined boundaries. Protection is the actual enforcement of the goal that is isolation.

  • Comparison of Isolation vs. Protection:

    • Nature: Isolation is a conceptual boundary; Protection is an enforcement mechanism.

    • Definition Source: Isolation is defined by OS/architecture design; Protection is defined by hardware and the kernel.

    • Examples: Isolation includes address spaces and privilege levels; Protection includes page table bits, CPL, and syscall gates.

    • Failure Mode: Isolation failure leads to spreading contamination; Protection failure is a hardware fault or kernel bug.

    • Cost: Isolation has conceptual overhead; Protection has runtime overhead.

The Four-Way Tradeoff of Isolation

Every isolation decision involves a tradeoff between four properties:

  1. Performance: More isolation means higher overhead (e.g., crossing into the kernel with a syscall costs more than an ordinary function call). Less isolation is faster because there are fewer boundary crossings.

  2. Safety: More isolation ensures faults are contained. Less isolation allows bugs to propagate freely.

  3. Usability: More isolation can feel restrictive (e.g., browser sandboxing). Less isolation is flexible but risky (e.g., sudo privilege escalation).

  4. Complexity: More isolation requires more kernel code (e.g., microkernels). Less isolation involves a simpler system (e.g., monolithic kernels).

The Isolation Taxonomy: What and Why We Isolate

  • Why Isolate Processes?

    • Fault Containment: Crashes should not propagate.

    • Security: Processes must not read memory belonging to others.

    • Resource Fairness: Preventing a single process from starving others of CPU or memory.

    • Accountability: The OS must identify which entity is performing specific actions.

  • Taxonomy of Layers:

    • App vs. App: Process A cannot read Process B's memory.

    • App vs. Kernel: User code cannot execute privileged instructions.

    • User vs. Admin: Normal users cannot modify system files.

    • Device vs. Device: Direct Memory Access (DMA) from a Network Interface Card (NIC) cannot access Graphics Processing Unit (GPU) memory.

  • Resource Isolation Mechanisms:

    • CPU Isolation: Enforced by preemptive scheduling and the hardware Current Privilege Level (CPL). Each process believes it owns the CPU.

    • Memory Isolation: Enforced by the Memory Management Unit (MMU) using per-process page tables. Kernel memory is mapped into the address space but remains inaccessible in user mode.

    • Device Isolation: I/O ports are restricted to Ring 0. An IOMMU is used for DMA isolation. Drivers typically live in the kernel.

    • Filesystem/Namespace Isolation: Unix permissions (owner/group/other), chroot jails, and Linux namespaces (mnt, pid, net, ipc).
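
The owner/group/other permission model can be illustrated with a small check on the mode bits. This is a minimal sketch only: the function name perm_allows and the WANT_* constants are hypothetical, and real kernels additionally handle root, supplementary groups, ACLs, and capabilities, none of which is modeled here.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Requested access, using the classic rwx bit values (hypothetical names). */
enum { WANT_READ = 4, WANT_WRITE = 2, WANT_EXEC = 1 };

/*
 * Simplified owner/group/other check.
 * Exactly one triplet applies: owner if the uid matches, otherwise
 * group if the gid matches, otherwise other.
 */
static bool perm_allows(mode_t mode, uid_t file_uid, gid_t file_gid,
                        uid_t uid, gid_t gid, int want)
{
    int bits;

    if (uid == file_uid)
        bits = (mode >> 6) & 7;   /* owner triplet */
    else if (gid == file_gid)
        bits = (mode >> 3) & 7;   /* group triplet */
    else
        bits = mode & 7;          /* other triplet */

    return (bits & want) == want; /* every requested bit must be granted */
}
```

For a file with mode 0640, the owner may read and write, a member of the file's group may only read, and everyone else is refused.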

Kernel Organization: Monolithic vs. Microkernels

  • The Kernel Content Question:

    • Must Be in Kernel: Hardware abstraction, memory management, CPU scheduling, and trap handling.

    • Probably in Kernel: Filesystems, device drivers, and the network stack.

    • Could/Should be in User Space: Filesystem servers, protocol stacks, applications, window systems, and language runtimes.

  • Monolithic Kernels (e.g., Linux, FreeBSD, xv6):

    • All OS services (Scheduler, FS, Network, Drivers) execute in the kernel address space with full privilege.

    • Advantages: High performance (no context switching for services), direct access between components, and simple communication through plain function calls instead of Inter-Process Communication (IPC).

    • Disadvantages: A bug anywhere can crash the system; large attack surface (Linux kernel is approximately 30 × 10^6 lines of code); difficult to formally verify.

  • Microkernels (e.g., Mach, seL4, QNX, Fuchsia/Zircon):

    • A minimal kernel handles only IPC, address spaces, scheduling, and Interrupt Request (IRQ) dispatch. Services like filesystems and drivers run in user space.

    • Advantages: Fault isolation (drivers can be restarted); smaller Trusted Computing Base (TCB); modularity; high security (seL4 is formally verified).

    • Disadvantages: High IPC overhead; I/O can be 10× to 100× slower; complex asynchronous design.

  • The IPC Performance Gap:

    • Microkernel Filesystem Read: Requires a syscall to kernel, message send to file server, context switch to file server, file server work, message reply back, and context switch back to caller.

    • Monolithic Filesystem Read: Requires a syscall to kernel, followed by a direct function call to filesystem code.
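
The two read paths can be compared by tallying their expensive events. The sketch below is an illustration under stated assumptions: each entry to or exit from the kernel counts as one mode switch, the struct and function names are hypothetical, and the per-event costs passed to estimate_ns are placeholder inputs, not measurements.

```c
/* Counts of the expensive events on one filesystem read. */
struct path_cost {
    int mode_switches;     /* user <-> kernel privilege transitions */
    int context_switches;  /* switches between address spaces */
    int ipc_messages;      /* messages relayed by the kernel */
};

/* Monolithic: enter the kernel, call the filesystem directly, return. */
static struct path_cost monolithic_read(void)
{
    struct path_cost c = { 2, 0, 0 };
    return c;
}

/*
 * Microkernel: caller traps in, the request message goes to the file
 * server (switch 1), the server runs in user mode, traps back in to
 * reply, and the kernel switches back to the caller (switch 2).
 */
static struct path_cost microkernel_read(void)
{
    struct path_cost c = { 4, 2, 2 };
    return c;
}

/* Combine the counts with caller-supplied per-event costs. */
static long estimate_ns(struct path_cost c, long mode_ns, long ctx_ns,
                        long msg_ns)
{
    return c.mode_switches * mode_ns
         + c.context_switches * ctx_ns
         + c.ipc_messages * msg_ns;
}
```

Whatever unit costs are plugged in, the microkernel path pays for two context switches and two messages that the monolithic path avoids entirely.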

Hardware Isolation Support

  • x86 Protection Rings:

    • Ring 0: Kernel (privileged).

    • Ring 1–2: Optional (rarely used, intended for OS services/drivers).

    • Ring 3: User applications (unprivileged).

  • Current Privilege Level (CPL):

    • Stored in the two lowest bits of the %cs (code segment) register.

    • CPL = 0: Ring 0 (Kernel Mode).

    • CPL = 3: Ring 3 (User Mode).

    • The hardware compares CPL against the Descriptor Privilege Level (DPL) when a segment or gate is used, and against the page-table U/S bit on every memory access.
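
Because the privilege level sits in the two lowest bits of the code-segment selector, a kernel can tell where a trap came from by masking the saved %cs value. A minimal sketch, with hypothetical helper names and example selector values chosen only for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_KERNEL 0u
#define RING_USER   3u

/* The low two bits of a code-segment selector hold the privilege level. */
static unsigned cpl_from_cs(uint16_t cs)
{
    return cs & 0x3u;
}

/* True if the saved %cs shows the CPU was in user mode when it trapped. */
static bool came_from_user(uint16_t saved_cs)
{
    return cpl_from_cs(saved_cs) == RING_USER;
}
```

For example, a selector of 0x08 (descriptor index 1, privilege 0) reads as kernel mode, while 0x1B (descriptor index 3, privilege 3) reads as user mode.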

  • Restricted Operations in Ring 3:

    • User mode cannot execute instructions like hlt, lidt, or cli.

    • User mode cannot modify page tables or change privilege levels directly (must use a trap).

    • I/O port access requires an IOPL check.

  • Privilege Transitions:

    • Entering Kernel (Ring 3 → Ring 0): Via system call (int 0x80 or syscall), hardware interrupts (timer, I/O), or exceptions (page fault).

    • Returning to User (Ring 0 → Ring 3): Via iret (restores CPL, %eip, %esp) or sysret (faster path).

System Call APIs

  • Definition: A controlled gate allowing user code to request kernel services without granting kernel privileges.

  • Syscall Counts:

    • Linux (x86-64): ∼350+

    • FreeBSD / macOS: ∼500+

    • Windows (NT): ∼450+

    • xv6: 21

  • Invocation Mechanisms:

    • int 0x80: Classic x86 software interrupt. CPU saves state, switches to kernel stack. Slower due to full interrupt path.

    • syscall / sysenter: Modern fast paths that do not go through the Interrupt Descriptor Table (IDT). For syscall, the kernel entry point is stored in the LSTAR MSR (Model-Specific Register).
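
From C on Linux, a system call can be issued by number through the C library's generic syscall() wrapper, which hides whichever entry instruction the architecture uses. This is a Linux-specific sketch; raw_getpid is a hypothetical helper name.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Invoke getpid by its syscall number instead of through the usual
 * getpid() library function. The kernel sees the same request either way.
 */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

The value returned should match what getpid() reports, since both end up at the same kernel handler.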

  • API Stability and Evolution:

    • A primary constraint for Linux is "not breaking userspace." Binaries from 2001 must run in 2024.

    • New features are added by creating new syscalls (e.g., pipe → pipe2, select → poll → epoll) rather than modifying existing ones.

    • API Design Challenges: You only ship it once; you cannot predict future needs (e.g., the need for O_CLOEXEC to prevent race conditions); versioning is effectively impossible.

xv6 Implementation Details

  • xv6 System Call Path:

    1. User program calls a function like write().

    2. User-side stub in usys.S executes: movl $SYS_write, %eax; int $0x40.

    3. CPU switches to CPL 0, saves state on the kernel stack, and jumps through the IDT entry to the matching vector in vectors.S.

    4. alltraps (in trapasm.S) saves the remaining registers and calls trap() in trap.c.

    5. trap() identifies T_SYSCALL and calls syscall().

    6. syscall() in syscall.c uses the value in %eax to index into a static dispatch table of function pointers.

    7. The handler (e.g., sys_write) executes, and the return value is placed back in proc->tf->eax.
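
The last two steps, table lookup and return-value placement, can be modeled in a few lines of ordinary C. This is a simplified user-space sketch, not the xv6 source: the syscall numbers, the stand-in handlers, and the one-field trap frame are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical numbers for this sketch; xv6 defines its own in syscall.h. */
enum { SYS_hello = 1, SYS_ticks = 2 };

/* Only the one trap-frame field the dispatcher touches. */
struct trapframe {
    uint32_t eax;   /* syscall number on entry, return value on exit */
};

/* Stand-in handlers returning fixed values. */
static int sys_hello(void) { return 42; }
static int sys_ticks(void) { return 1000; }

/* Static dispatch table indexed by syscall number; slot 0 stays NULL. */
static int (*const syscalls[])(void) = {
    [SYS_hello] = sys_hello,
    [SYS_ticks] = sys_ticks,
};

/* Look up the handler for tf->eax and store its result back in tf->eax. */
static void syscall_dispatch(struct trapframe *tf)
{
    uint32_t num = tf->eax;
    size_t n = sizeof(syscalls) / sizeof(syscalls[0]);

    if (num > 0 && num < n && syscalls[num] != NULL)
        tf->eax = (uint32_t)syscalls[num]();
    else
        tf->eax = (uint32_t)-1;   /* unknown syscall number */
}
```

The range check matters because the number comes from user space: an unchecked index would let a process make the kernel call through an arbitrary pointer.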

  • The Trap Frame: A structure containing state saved by hardware (%ss, %esp, %eflags, %cs, %eip, and error code) and software (general-purpose and segment registers).

  • Adding a Syscall to xv6:

    1. Define number in syscall.h.

    2. Implement handler in a sys*.c file.

    3. Register in the dispatch table in syscall.c.

    4. Declare for user space in user.h and usys.S.

    5. Validate arguments using argint(), argptr(), or argstr().
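
Argument validation exists because every pointer or address a process passes in is untrusted. The sketch below models the kind of bounds check such helpers rely on; it is a user-space illustration, and struct proc_model and fetch_int are hypothetical names rather than the xv6 definitions.

```c
#include <stdint.h>
#include <string.h>

/* Toy process: its user memory is the byte range [0, sz) of mem. */
struct proc_model {
    uint8_t *mem;
    uint32_t sz;
};

/*
 * Copy a 32-bit integer from user address addr into *out.
 * Returns 0 on success, -1 if any byte of the integer would lie
 * outside the process's memory. The subtraction form avoids overflow
 * when addr is close to UINT32_MAX.
 */
static int fetch_int(const struct proc_model *p, uint32_t addr, int32_t *out)
{
    if (addr >= p->sz || p->sz - addr < sizeof(*out))
        return -1;
    memcpy(out, p->mem + addr, sizeof(*out));
    return 0;
}
```

A handler that skipped this check could be tricked into reading kernel memory on the caller's behalf, which is exactly the propagation that isolation is meant to prevent.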

Virtual Memory and Time Slicing

  • Memory Isolation: Each process has its own page table. In xv6, KERNBASE = 0x80000000. Memory at or above this address is supervisor-only (U/S = 0).

  • Page Table Bits:

    • U/S: User/Supervisor (0 = kernel only).

    • R/W: Read/Write (0 = read-only).

    • P: Present (0 = fault on access).

    • NX: No-Execute (1 = instruction fetches from the page fault, so injected data cannot be run as code).
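
How the P, R/W, and U/S bits combine into an allow-or-fault decision can be sketched as a small predicate. The flag values are the low bits of an x86 page-table entry; the function name is hypothetical, NX is left out because it lives in the high bit of the wider 64-bit entry format, and the sketch assumes write protection also binds the kernel (as when CR0.WP is set).

```c
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86 page-table entry. */
#define PTE_P 0x001u  /* Present */
#define PTE_W 0x002u  /* Writable (R/W) */
#define PTE_U 0x004u  /* User-accessible (U/S) */

/*
 * Decide whether an access would succeed or fault.
 * Assumption: read-only pages reject writes in both modes.
 */
static bool access_allowed(uint32_t pte, bool user_mode, bool is_write)
{
    if (!(pte & PTE_P))
        return false;               /* not present: page fault */
    if (user_mode && !(pte & PTE_U))
        return false;               /* supervisor-only page */
    if (is_write && !(pte & PTE_W))
        return false;               /* read-only page */
    return true;
}
```

A kernel mapping is simply an entry without PTE_U: the same address translates fine in kernel mode and faults in user mode.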

  • Time Slicing: OS divides CPU time into quanta (e.g., 10 ms). A hardware timer interrupt triggers a context switch.

    • Cooperative Scheduling: Process voluntarily yields (dangerous, infinite loops stall system).

    • Preemptive Scheduling: Mandatory for multi-user systems; OS forces the yield.

  • Context Switch Costs: Register save/restore (∼100 ns), TLB flush (∼1000 ns), and cache cold start (on the order of μs).
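
These figures can be put in proportion with a one-line calculation. The sketch assumes exactly one switch per quantum and counts only the direct costs (about 100 ns for registers plus about 1000 ns for the TLB flush); the function name is hypothetical, and cache warm-up, which is harder to pin down, is ignored.

```c
/*
 * Fraction of CPU time lost to switching when every quantum is
 * followed by one context switch. Both arguments are in nanoseconds.
 */
static double switch_overhead(double quantum_ns, double switch_cost_ns)
{
    return switch_cost_ns / (quantum_ns + switch_cost_ns);
}
```

With a 10 ms quantum and 1100 ns of direct cost, the overhead is roughly 0.011%. Shrinking the quantum to 10 μs pushes it to nearly 10%, which is why quanta are kept in the millisecond range.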

Kernel Concurrency Models

  • Many-to-One: Multiple user threads map to one kernel thread. If one user thread blocks, all threads in that process block.

  • One-to-One (Linux pthreads, Windows): Each user thread has a corresponding kernel thread. Allows true parallelism. High memory overhead (∼8 KB per kernel stack).

  • Many-to-Many (Go goroutines, Solaris): M user threads multiplexed over N kernel threads. Complex implementation but supports massive concurrency.

  • Concurrency inside the Kernel: Multicore kernels must be re-entrant. Shared structures like the process table and memory allocator must be protected. xv6 protects them with spinlocks, and disables interrupts on the local CPU while a lock is held so that an interrupt handler cannot deadlock trying to take the same lock. Failure to synchronize leads to data races and catastrophic corruption.
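
The core of a spinlock is an atomic test-and-set loop. The following is a user-space sketch using C11 atomics with hypothetical function names; it cannot disable interrupts, which a kernel spinlock must also do, so it shows only the mutual-exclusion half of the mechanism.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct spinlock {
    atomic_flag locked;   /* set while some thread holds the lock */
};

static void spin_init(struct spinlock *lk)
{
    atomic_flag_clear(&lk->locked);
}

/* One attempt: returns true if this call took the lock. */
static bool spin_try_acquire(struct spinlock *lk)
{
    /* test_and_set returns the previous value; false means it was free. */
    return !atomic_flag_test_and_set_explicit(&lk->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is obtained. */
static void spin_acquire(struct spinlock *lk)
{
    while (!spin_try_acquire(lk))
        ;   /* spin */
}

/* Release the lock so another CPU's spin loop can succeed. */
static void spin_release(struct spinlock *lk)
{
    atomic_flag_clear_explicit(&lk->locked, memory_order_release);
}
```

The acquire and release orderings ensure that writes made inside the critical section are visible to the next holder of the lock.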

Questions & Discussion

  • What happens if the filesystem crashes? In a monolithic kernel, it results in a kernel panic and system reboot. In a microkernel, the server crashes but the kernel can restart it, potentially leaving other processes unaffected. In an exokernel, the crash is limited to the specific application managing its own storage.

  • Why does hardware enforce protection? Software is bypassable; a process could jump to any address. Hardware enforcement is unconditional and acts as the root of trust for the Trusted Computing Base (TCB).

  • Why are microkernels difficult? Primarily because every service request pays for IPC and context switches, and that cumulative latency can make I/O significantly slower than in monolithic equivalents.