Deep Dive into Operating System Internals

Hey Dev Community! 👋⚙️

I’m glad you’re here and reading this technical blog of mine!

This is a masterclass: the most comprehensive, professional, and deeply detailed dive into operating system internals you’ll read this year. We’ll go step by step, no hand‑waving, with practical Linux examples and careful explanations that connect hardware, kernels, and userland — so you can reason about performance, correctness, and design at a professional level. Bring coffee. This is long. And worth it.


Table of contents

  1. Scheduling fundamentals and Linux CFS in practice
  2. Multiprocessor scheduling: affinity, load balancing, NUMA, RT classes
  3. Virtual memory architecture: page tables, TLBs, faults, swapping, huge pages, COW
  4. Linux memory management: VMA, page cache, THP, NUMA policies, OOM
  5. File systems: VFS, inode/dentry/superblock, journaling, caching, ext4/XFS/btrfs/ZFS
  6. Kernel modules: architecture, driver interfaces, safety, char devices, sysfs/procfs/netlink
  7. Hands‑on: build a syscall‑like interface via a safe kernel module and user client
  8. Observability: ftrace, perf, eBPF, bpftrace, tracepoints, kgdb, kprobes
  9. Performance engineering: locking, scalability, isolation, cgroups/namespaces, NUMA tuning
  10. Security architecture: capabilities, LSM (SELinux/AppArmor), seccomp, signed modules
  11. Production guidance: ABI stability, kernel configs, CI for modules, panic hygiene
  12. Closing notes and next steps

  1. Scheduling fundamentals and Linux CFS in practice

Goals and constraints

  • Fairness, responsiveness, throughput, and predictability. OS scheduling must trade off context‑switch overhead vs responsiveness, and maintain per‑CPU invariants in SMP systems.

Canonical policies

  • Round Robin, Priority Scheduling, MLFQ — you know the classics. What matters in production: preemption, priority inversion handling, starvation prevention (aging), and latency targets.

Linux CFS (Completely Fair Scheduler)

  • Core idea: approximate an ideal fair queue. Each task has a virtual runtime (vruntime) scaled by weight (nice priority). The “leftmost” node in a red‑black tree (the task with the smallest vruntime) is picked next. (A sketch of the weighting follows this list.)
  • Time slice is dynamic: tasks with larger weights get more CPU; interactive tasks keep low latency via periodic checks and sleep accounting.
  • Key structures:
    • Per‑CPU runqueue.
    • cfs_rq with an rb‑tree of sched_entity nodes.
    • pick_next_task_fair() chooses the leftmost entity.
  • Latency tunables: sched_latency_ns, sched_min_granularity_ns, sched_wakeup_granularity_ns in /proc/sys/kernel/.
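
To make the weighting concrete, here is a userspace sketch modeled on the kernel’s calc_delta_fair(): vruntime advances more slowly for heavier (lower‑nice) tasks. The weights mirror the kernel’s nice‑to‑weight table, but the code itself is illustrative, not kernel source:

```c
/* Userspace sketch (not kernel code) of how CFS charges vruntime.
 * Weights taken from the kernel's nice-to-weight table. */
#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD 1024   /* weight of a nice-0 task */

/* vruntime advances more slowly for heavier (higher-weight) tasks */
static uint64_t calc_delta_fair(uint64_t delta_exec_ns, uint64_t weight)
{
    return delta_exec_ns * NICE_0_LOAD / weight;
}

int main(void)
{
    /* 1 ms of real execution, charged as vruntime: */
    printf("nice  0 (weight 1024): +%llu ns\n",
           (unsigned long long)calc_delta_fair(1000000, 1024));
    printf("nice -5 (weight 3121): +%llu ns\n",
           (unsigned long long)calc_delta_fair(1000000, 3121));
    return 0;
}
```

A nice −5 task accrues vruntime roughly 3× slower, so CFS lets it run roughly 3× as long before another task becomes “leftmost.”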

Real‑time classes

  • SCHED_FIFO: non‑timesliced, strict priority.
  • SCHED_RR: round‑robin within priority.
  • SCHED_DEADLINE: EDF‑like scheduler for tasks with deadlines (runtime/period).
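
A minimal sketch of switching the calling thread to SCHED_FIFO (priority 50 is an arbitrary choice; this requires CAP_SYS_NICE, so typically root):

```c
/* Minimal sketch: promote the calling thread to SCHED_FIFO priority 50. */
#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };

    /* pid 0 = the calling thread */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now SCHED_FIFO, priority %d\n", sp.sched_priority);
    return 0;
}
```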

Practical pitfalls

  • Priority inversion: solve via priority inheritance on mutexes.
  • Timer tick granularity: nohz_full can reduce scheduler overhead but risks latency spikes.
  • CPU isolation: isolate housekeeping CPUs for RT workloads.

  2. Multiprocessor scheduling: affinity, load balancing, NUMA, RT classes

Affinity and locality

  • CPU affinity keeps threads on the same core to preserve cache warmth. Use sched_setaffinity and cpusets.
  • Thread migration ruins cache locality and can explode latency under contention.
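
A minimal sketch of pinning the calling thread with sched_setaffinity (CPU 2 is an arbitrary choice; pick a core that matches your topology):

```c
/* Sketch: pin the calling thread to CPU 2 to preserve cache warmth. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                 /* allow CPU 2 only */

    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned, now running on CPU %d\n", sched_getcpu());
    return 0;
}
```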

Load balancing

  • Per‑CPU runqueues rebalance via push/pull. Balancing domains (LLC, NUMA node) define when to steal tasks.
  • sched_domain heuristics control periodic balancing; heavy RT load may disable balancing.

NUMA awareness

  • Memory close to CPU matters. Use NUMA‑aware schedulers and allocators; otherwise, remote memory access burns latency and bandwidth.
  • Bind threads and memory (numactl --cpunodebind --membind) for deterministic performance.

Real‑time on SMP

  • Avoid RT throttling; dedicate cores to RT with isolcpus, nohz_full, and rcu_nocbs.
  • Pin IRQs away from RT cores; use irqbalance cautiously.

  3. Virtual memory architecture: page tables, TLBs, faults, swapping, huge pages, COW

Address spaces

  • Each process has its own virtual address space partitioned into regions: code, data, heap, stack, mappings. Protection (r/w/x) enforced per region.

Page tables (x86‑64 example)

  • Multi‑level (typically 4‑5 levels): PML4 → PDPT → PD → PT → page.
  • Entries encode physical page frame number, flags (present, writable, user, accessed, dirty, NX).
  • Kernel vs user mappings: split high/low halves; kernel mappings are global.

TLB (Translation Lookaside Buffer)

  • Hardware cache of recent translations. TLB misses force page table walks.
  • PCIDs (process context identifiers) reduce TLB flushes on context switch.

Page faults

  • Demand paging loads a page from storage on first access.
  • Protection faults (e.g., write to read‑only) can trigger COW.
  • Minor vs major faults: minor when page is cached; major when disk I/O needed.
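
You can watch demand paging happen with getrusage(2); this sketch touches a fresh anonymous mapping and reports the fault counters before and after (the 64 MB size is arbitrary):

```c
/* Sketch: observe minor faults from demand paging via getrusage(2). */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static void report(const char *tag)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: minor=%ld major=%ld\n", tag, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    size_t len = 64UL * 1024 * 1024;   /* 64 MB anonymous mapping */
    report("before");

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 1, len);                 /* first touch -> demand paging -> minor faults */
    report("after touch");

    munmap(p, len);
    return 0;
}
```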

Swapping

  • When memory pressure rises, inactive pages move to swap.
  • Swappiness controls aggressiveness. Trade‑off: responsiveness vs memory overcommit.

Huge pages

  • 2MB/1GB pages reduce TLB pressure.
  • THP (transparent huge pages) automatically coalesce small pages. Watch for fragmentation and NUMA side effects.
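
A hedged sketch of the madvise(MADV_HUGEPAGE) hint; it only takes effect when THP is set to “madvise” or “always” in /sys/kernel/mm/transparent_hugepage/enabled:

```c
/* Sketch: ask the kernel to back a region with huge pages. The hint is
 * advisory; failure is non-fatal and placement depends on THP config. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16UL * 1024 * 1024;   /* multiple of the 2MB huge page size */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (madvise(p, len, MADV_HUGEPAGE) == -1)
        perror("madvise(MADV_HUGEPAGE)");   /* hint refused; carry on */

    /* after touching the pages, check AnonHugePages in /proc/<pid>/smaps */
    munmap(p, len);
    return 0;
}
```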

Copy‑on‑Write (COW)

  • fork() shares pages between parent/child until a write occurs, then the page is copied.
  • Great for performance, but be careful with large dirty working sets.
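
A tiny demo of COW: parent and child logically share the heap page after fork(), and the child’s write triggers a private copy, leaving the parent’s view unchanged:

```c
/* Sketch: demonstrate copy-on-write across fork(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *buf = malloc(4096);
    strcpy(buf, "original");

    pid_t pid = fork();           /* pages now shared read-only, marked COW */
    if (pid == 0) {
        strcpy(buf, "child");     /* write fault -> kernel copies the page */
        printf("child sees:  %s\n", buf);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", buf);   /* still "original" */
    free(buf);
    return 0;
}
```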

  4. Linux memory management: VMA, page cache, THP, NUMA policies, OOM

VMA (vm_area_struct)

  • Describes contiguous virtual memory regions with permissions and backing (file/anonymous).
  • Operations: mmap, munmap, mprotect manipulate VMAs; kernel merges and splits as needed.
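
A small sketch that makes VMAs visible: mprotect on a sub‑range of a mapping forces the kernel to split one VMA into three, which you can confirm in /proc/<pid>/maps:

```c
/* Sketch: map 4 pages, make the 2nd read-only, and observe the VMA split. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 4 * pg, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* changing protection on one interior page splits the VMA into three */
    if (mprotect(p + pg, pg, PROT_READ) == -1) { perror("mprotect"); return 1; }

    printf("inspect with: cat /proc/%d/maps\n", (int)getpid());
    pause();                      /* keep the mappings alive for inspection */
    return 0;
}
```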

Page cache

  • Unified cache for file data; all I/O goes through it unless O_DIRECT.
  • Dirty pages flushed by writeback; tunables in /proc/sys/vm/.

NUMA policies

  • MPOL_BIND, MPOL_INTERLEAVE, MPOL_PREFERRED.
  • Align thread placement with memory policy to avoid remote access.
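
For programmatic placement, a sketch using libnuma (assumptions: libnuma headers installed and the binary linked with -lnuma; node 0 exists on your machine):

```c
/* Sketch: allocate memory on a specific NUMA node so it stays local
 * to threads bound there. Build with: gcc numa_demo.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() == -1) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t len = 8UL * 1024 * 1024;
    void *p = numa_alloc_onnode(len, 0);   /* MPOL_BIND-style placement on node 0 */
    if (!p) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }

    /* ... run node-0-bound threads against this buffer ... */
    numa_free(p, len);
    return 0;
}
```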

OOM killer

  • When memory is exhausted, the kernel selects a victim based on a badness heuristic (memory usage, oom_score_adj).
  • Design for backpressure: signal, slow down, fail gracefully — don’t rely on OOM as control flow.
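
One practical knob: a process can volunteer itself as a preferred victim by raising its own oom_score_adj (range −1000..1000). A minimal sketch:

```c
/* Sketch: make the current process more likely to be chosen by the OOM
 * killer, e.g. for a disposable cache worker. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "500\n");          /* positive = more likely to be killed */
    fclose(f);
    return 0;
}
```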

  5. File systems: VFS, inode/dentry/superblock, journaling, caching, ext4/XFS/btrfs/ZFS

VFS layer

  • Abstracts filesystem operations: open/read/write/stat/mmap.
  • Core objects:
    • inode: metadata (size, permissions, blocks).
    • dentry: directory entries (names → inodes).
    • superblock: filesystem‑wide metadata and ops.
    • address_space: page cache mapping for file I/O.

Journaling

  • Write‑ahead log records intent before applying changes.
  • ext4 modes: data=journal, data=ordered, data=writeback. Ordered is the common default: metadata goes through the journal, and data blocks are written before the metadata commit.

COW filesystems

  • btrfs/ZFS write new blocks and update metadata atomically, enabling snapshots, checksums, deduplication.

Caching, writeback, and memory pressure

  • Page cache caches file data; readahead prefetches sequential pages.
  • Under memory pressure, cache is reclaimed; buffered I/O may stall on writeback.

Choosing a filesystem

  • ext4: balanced general purpose.
  • XFS: large files and parallel I/O.
  • btrfs/ZFS: data integrity, snapshots, but heavier metadata.

  6. Kernel modules: architecture, driver interfaces, safety, char devices, sysfs/procfs/netlink

Module architecture

  • module_init/module_exit lifecycle.
  • Object files linked against kernel headers; built via kbuild.
  • Symbols resolved by kernel; versioning matters (CONFIG_MODVERSIONS).

Safety fundamentals

  • No blocking in atomic context.
  • Always validate user pointers: copy_from_user/copy_to_user.
  • Use appropriate GFP flags (e.g., GFP_KERNEL, GFP_ATOMIC).
  • Clean up resources in error paths; refcounts for objects; RCU for read‑mostly structures.

Character devices

  • Implement file_operations (open/read/write/ioctl/mmap).
  • Create /dev node via udev or manual mknod.
  • For control paths, prefer ioctl or sysfs attributes; for bulk data, use read/write.

sysfs, procfs, netlink

  • sysfs: typed attributes under /sys for configuration/state.
  • procfs: process and kernel info under /proc.
  • netlink: message‑based channel between kernel and user space (good for complex control planes).

  7. Hands‑on: build a syscall‑like interface via a safe kernel module and user client

Adding a real syscall requires a kernel rebuild and ABI changes. In production, expose functionality through a device or netlink. We’ll build a robust char device with basic IOCTLs.

Kernel module (complete)

```c
// simplecall.c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/uaccess.h>
#include <linux/mutex.h>

#define DEVICE_NAME "simplecall"
#define CLASS_NAME  "simpcls"
#define BUF_SIZE    256

#define SC_IOCTL_MAGIC 'S'
#define SC_IOCTL_ECHO  _IOWR(SC_IOCTL_MAGIC, 0, int)

static int major;
static struct class *sc_class;
static struct device *sc_dev;
static struct cdev sc_cdev;
static char kernel_buf[BUF_SIZE];
static DEFINE_MUTEX(sc_mutex);

static int sc_open(struct inode *inodep, struct file *filep)
{
	return 0;
}

static int sc_release(struct inode *inodep, struct file *filep)
{
	return 0;
}

static ssize_t sc_read(struct file *filep, char __user *buf, size_t len, loff_t *off)
{
	size_t n;

	if (mutex_lock_interruptible(&sc_mutex))
		return -ERESTARTSYS;
	n = strnlen(kernel_buf, BUF_SIZE);
	if (len < n)
		n = len;
	if (copy_to_user(buf, kernel_buf, n)) {
		mutex_unlock(&sc_mutex);
		return -EFAULT;
	}
	mutex_unlock(&sc_mutex);
	return n;
}

static ssize_t sc_write(struct file *filep, const char __user *buf, size_t len, loff_t *off)
{
	size_t n = min(len, (size_t)(BUF_SIZE - 1));

	if (mutex_lock_interruptible(&sc_mutex))
		return -ERESTARTSYS;
	if (copy_from_user(kernel_buf, buf, n)) {
		mutex_unlock(&sc_mutex);
		return -EFAULT;
	}
	kernel_buf[n] = '\0';
	mutex_unlock(&sc_mutex);
	return n;
}

static long sc_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
{
	int val;

	switch (cmd) {
	case SC_IOCTL_ECHO:
		if (copy_from_user(&val, (int __user *)arg, sizeof(int)))
			return -EFAULT;
		val = val ^ 0x5A5A; /* toy transform */
		if (copy_to_user((int __user *)arg, &val, sizeof(int)))
			return -EFAULT;
		return 0;
	default:
		return -ENOTTY;
	}
}

static const struct file_operations sc_fops = {
	.owner          = THIS_MODULE,
	.open           = sc_open,
	.release        = sc_release,
	.read           = sc_read,
	.write          = sc_write,
	.unlocked_ioctl = sc_ioctl,
};

static int __init sc_init(void)
{
	int ret;
	dev_t dev;

	ret = alloc_chrdev_region(&dev, 0, 1, DEVICE_NAME);
	if (ret)
		return ret;
	major = MAJOR(dev);

	cdev_init(&sc_cdev, &sc_fops);
	ret = cdev_add(&sc_cdev, dev, 1);
	if (ret) {
		unregister_chrdev_region(dev, 1);
		return ret;
	}

	/* note: on kernels >= 6.4, class_create() takes only the name */
	sc_class = class_create(THIS_MODULE, CLASS_NAME);
	if (IS_ERR(sc_class)) {
		cdev_del(&sc_cdev);
		unregister_chrdev_region(dev, 1);
		return PTR_ERR(sc_class);
	}

	sc_dev = device_create(sc_class, NULL, dev, NULL, DEVICE_NAME);
	if (IS_ERR(sc_dev)) {
		class_destroy(sc_class);
		cdev_del(&sc_cdev);
		unregister_chrdev_region(dev, 1);
		return PTR_ERR(sc_dev);
	}

	pr_info("simplecall: loaded major=%d\n", major);
	return 0;
}

static void __exit sc_exit(void)
{
	dev_t dev = MKDEV(major, 0);

	device_destroy(sc_class, dev);
	class_destroy(sc_class);
	cdev_del(&sc_cdev);
	unregister_chrdev_region(dev, 1);
	pr_info("simplecall: unloaded\n");
}

module_init(sc_init);
module_exit(sc_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("You");
MODULE_DESCRIPTION("Safe syscall-like char device");
```

Makefile

```makefile
obj-m += simplecall.o
KDIR := /lib/modules/$(shell uname -r)/build

# recipe lines must be indented with a real tab
all:
	make -C $(KDIR) M=$(PWD) modules

clean:
	make -C $(KDIR) M=$(PWD) clean
```

User client (complete)

```c
// user_sc.c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define SC_IOCTL_MAGIC 'S'
#define SC_IOCTL_ECHO  _IOWR(SC_IOCTL_MAGIC, 0, int)

int main(void)
{
    int fd = open("/dev/simplecall", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    const char *msg = "hello kernel";
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

    char buf[256];
    int n = read(fd, buf, sizeof(buf) - 1);
    if (n < 0) { perror("read"); return 1; }
    buf[n] = '\0';
    printf("read: %s\n", buf);

    int x = 12345;
    if (ioctl(fd, SC_IOCTL_ECHO, &x) < 0) { perror("ioctl"); return 1; }
    printf("ioctl echo transformed: %d\n", x);

    close(fd);
    return 0;
}
```

Build/run steps

  • Build module: make
  • Insert: sudo insmod simplecall.ko
  • Confirm: dmesg | tail, ls -l /dev/simplecall
  • Build user client: gcc user_sc.c -o user_sc
  • Run: ./user_sc
  • Remove: sudo rmmod simplecall

  8. Observability: ftrace, perf, eBPF, bpftrace, tracepoints, kgdb, kprobes
  • ftrace: echo function > /sys/kernel/debug/tracing/current_tracer, then enable specific functions.
  • perf: perf stat -e cycles,instructions,cache-misses ./app and perf record/report for hotspots.
  • eBPF: use bpftrace scripts like bpftrace -e 'tracepoint:sched:sched_switch { printf("%s -> %s\n", comm, args->next_comm); }'
  • kprobes: dynamically hook kernel functions for debugging; uprobes for user space.
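
For kprobes specifically, a minimal module sketch is below (built with the same kbuild Makefile pattern as the earlier module). The probed symbol do_sys_openat2 is an assumption: symbol names vary across kernel versions, so check /proc/kallsyms first.

```c
/* kprobe_demo.c: log every entry to a probed kernel function (toy example;
 * real probes should log sparingly). */
#include <linux/module.h>
#include <linux/kprobes.h>

static struct kprobe kp = {
	.symbol_name = "do_sys_openat2",   /* assumption: verify on your kernel */
};

static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	pr_info("kprobe hit: %s\n", p->symbol_name);
	return 0;
}

static int __init kp_init(void)
{
	kp.pre_handler = handler_pre;
	return register_kprobe(&kp);
}

static void __exit kp_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kp_init);
module_exit(kp_exit);
MODULE_LICENSE("GPL");
```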

  9. Performance engineering: locking, scalability, isolation, cgroups/namespaces, NUMA tuning
  • Locking: prefer RCU for read‑mostly structures; avoid global locks; partition data per CPU.
  • Isolation: cpusets and isolcpus for dedicated workloads.
  • cgroups: limit CPU/memory/IO; use cpu.max, memory.max, io.max.
  • Namespaces: PID/mount/net/user for container isolation.
  • NUMA: bind threads and memory; avoid cross‑node contention; use huge pages if beneficial.

  10. Security architecture: capabilities, LSM (SELinux/AppArmor), seccomp, signed modules
  • Capabilities split root privileges into fine‑grained bits (CAP_SYS_ADMIN, etc.).
  • LSM: SELinux/AppArmor enforce mandatory policies; label files/processes; deny‑by‑default.
  • seccomp: restrict syscalls via allowlists; reduce attack surface (see the sketch after this list).
  • Signed modules/secure boot: only load trusted modules in production.
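
To make seccomp concrete, here is a minimal sketch of strict mode via prctl(2). Real sandboxes use SECCOMP_MODE_FILTER with a BPF program; strict mode just shows the idea:

```c
/* Sketch: seccomp strict mode. After the prctl call, only read(2),
 * write(2), _exit(2), and sigreturn(2) are permitted; any other syscall
 * kills the process with SIGKILL. */
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

int main(void)
{
    printf("entering strict seccomp...\n");
    fflush(stdout);                       /* flush before syscalls get restricted */

    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) == -1) {
        perror("prctl");
        return 1;
    }

    write(1, "still alive\n", 12);        /* write(2) is on the allowlist */
    /* open(2) here would be fatal: the kernel delivers SIGKILL */
    syscall(SYS_exit, 0);                 /* raw exit(2); exit_group(2) is NOT allowed */
}
```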

  11. Production guidance: ABI stability, kernel configs, CI for modules, panic hygiene
  • Keep module ABI compatible with target kernels; enable CONFIG_MODVERSIONS.
  • CI: build against multiple kernel versions/headers; run kselftests.
  • Panic hygiene: avoid BUG(); prefer graceful error handling; set panic_on_oops appropriately.

  12. Closing notes and next steps

You now have a professional, end‑to‑end picture of OS internals: how the scheduler chooses tasks, how virtual memory maps addresses, how filesystems ensure consistency, and how kernel modules safely extend the kernel — with real Linux code you can compile and run.

Next up in the series:

  • Distributed Systems & Networking: RDMA, NVMe‑oF, cluster scheduling, plus MPI and ZeroMQ practicals.
  • Compiler Design 101 (But Advanced): build a serious language and compiler in C++.

If you’re still here, you’re my kind of engineer. Follow along — we’ll keep writing, keep shipping, and keep turning curiosity into craft. Stay bold 🚀🔥.
