Deep Dive into Operating System Internals

Hey Dev Community! 👋⚙️

I’m glad you’re here and reading this technical blog of mine!

This is a masterclass: the most comprehensive, professional, and deeply detailed dive into operating system internals you’ll read this year. We’ll go step by step, no hand‑waving, with practical Linux examples and careful explanations that connect hardware, kernels, and userland — so you can reason about performance, correctness, and design at a professional level. Bring coffee. This is long. And worth it.


Table of contents

  1. Scheduling fundamentals and Linux CFS in practice
  2. Multiprocessor scheduling: affinity, load balancing, NUMA, RT classes
  3. Virtual memory architecture: page tables, TLBs, faults, swapping, huge pages, COW
  4. Linux memory management: VMA, page cache, THP, NUMA policies, OOM
  5. File systems: VFS, inode/dentry/superblock, journaling, caching, ext4/XFS/btrfs/ZFS
  6. Kernel modules: architecture, driver interfaces, safety, char devices, sysfs/procfs/netlink
  7. Hands‑on: build a syscall‑like interface via a safe kernel module and user client
  8. Observability: ftrace, perf, eBPF, bpftrace, tracepoints, kgdb, kprobes
  9. Performance engineering: locking, scalability, isolation, cgroups/namespaces, NUMA tuning
  10. Security architecture: capabilities, LSM (SELinux/AppArmor), seccomp, signed modules
  11. Production guidance: ABI stability, kernel configs, CI for modules, panic hygiene
  12. Closing notes and next steps

  1. Scheduling fundamentals and Linux CFS in practice

Goals and constraints

  • Fairness, responsiveness, throughput, and predictability. OS scheduling must trade off context‑switch overhead vs responsiveness, and maintain per‑CPU invariants in SMP systems.

Canonical policies

  • Round Robin, Priority Scheduling, MLFQ — you know the classics. What matters in production: preemption, priority inversion handling, starvation prevention (aging), and latency targets.

Linux CFS (Completely Fair Scheduler)

  • Core idea: approximate an ideal fair queue. Each task has a virtual runtime (vruntime) scaled by weight (nice priority). The “leftmost” node in a red‑black tree (the task with the smallest vruntime) is picked next. (A sketch of the weighting follows this list.)
  • Time slice is dynamic: tasks with larger weights get more CPU; interactive tasks keep low latency via periodic checks and sleep accounting.
  • Key structures:
    • Per‑CPU runqueue.
    • cfs_rq with an rb‑tree of sched_entity nodes.
    • pick_next_task_fair() chooses the leftmost entity.
  • Latency tunables: sched_latency_ns, sched_min_granularity_ns, sched_wakeup_granularity_ns in /proc/sys/kernel/.
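
To make the weighting concrete, here is a userspace sketch modeled on the kernel’s calc_delta_fair(): vruntime advances more slowly for heavier (lower‑nice) tasks. The weights mirror the kernel’s nice‑to‑weight table, but the code itself is illustrative, not kernel source:

```c
/* Userspace sketch (not kernel code) of how CFS charges vruntime.
 * Weights taken from the kernel's nice-to-weight table. */
#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD 1024   /* weight of a nice-0 task */

/* vruntime advances more slowly for heavier (higher-weight) tasks */
static uint64_t calc_delta_fair(uint64_t delta_exec_ns, uint64_t weight)
{
    return delta_exec_ns * NICE_0_LOAD / weight;
}

int main(void)
{
    /* 1 ms of real execution, charged as vruntime: */
    printf("nice  0 (weight 1024): +%llu ns\n",
           (unsigned long long)calc_delta_fair(1000000, 1024));
    printf("nice -5 (weight 3121): +%llu ns\n",
           (unsigned long long)calc_delta_fair(1000000, 3121));
    return 0;
}
```

A nice −5 task accrues vruntime roughly 3× slower, so CFS lets it run roughly 3× as long before another task becomes “leftmost.”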

Real‑time classes

  • SCHED_FIFO: non‑timesliced, strict priority.
  • SCHED_RR: round‑robin within priority.
  • SCHED_DEADLINE: EDF‑like scheduler for tasks with deadlines (runtime/period).
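
A minimal sketch of switching the calling thread to SCHED_FIFO (priority 50 is an arbitrary choice; this requires CAP_SYS_NICE, so typically root):

```c
/* Minimal sketch: promote the calling thread to SCHED_FIFO priority 50. */
#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };

    /* pid 0 = the calling thread */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now SCHED_FIFO, priority %d\n", sp.sched_priority);
    return 0;
}
```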

Practical pitfalls

  • Priority inversion: solve via priority inheritance on mutexes.
  • Timer tick granularity: nohz_full can reduce scheduler overhead but risks latency spikes.
  • CPU isolation: isolate housekeeping CPUs for RT workloads.

  2. Multiprocessor scheduling: affinity, load balancing, NUMA, RT classes

Affinity and locality

  • CPU affinity keeps threads on the same core to preserve cache warmth. Use sched_setaffinity and cpusets.
  • Thread migration ruins cache locality and can explode latency under contention.
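
A minimal sketch of pinning the calling thread with sched_setaffinity (CPU 2 is an arbitrary choice; pick a core that matches your topology):

```c
/* Sketch: pin the calling thread to CPU 2 to preserve cache warmth. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                 /* allow CPU 2 only */

    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned, now running on CPU %d\n", sched_getcpu());
    return 0;
}
```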

Load balancing

  • Per‑CPU runqueues rebalance via push/pull. Balancing domains (LLC, NUMA node) define when to steal tasks.
  • sched_domain heuristics control periodic balancing; heavy RT load may disable balancing.

NUMA awareness

  • Memory close to CPU matters. Use NUMA‑aware schedulers and allocators; otherwise, remote memory access burns latency and bandwidth.
  • Bind threads and memory (numactl --cpunodebind --membind) for deterministic performance.

Real‑time on SMP

  • Avoid RT throttling; dedicate cores to RT with isolcpus, nohz_full, and rcu_nocbs.
  • Pin IRQs away from RT cores; use irqbalance cautiously.

  3. Virtual memory architecture: page tables, TLBs, faults, swapping, huge pages, COW

Address spaces

  • Each process has its own virtual address space partitioned into regions: code, data, heap, stack, mappings. Protection (r/w/x) enforced per region.

Page tables (x86‑64 example)

  • Multi‑level (typically 4‑5 levels): PML4 → PDPT → PD → PT → page.
  • Entries encode physical page frame number, flags (present, writable, user, accessed, dirty, NX).
  • Kernel vs user mappings: split high/low halves; kernel mappings are global.

TLB (Translation Lookaside Buffer)

  • Hardware cache of recent translations. TLB misses force page table walks.
  • PCIDs (process context identifiers) reduce TLB flushes on context switch.

Page faults

  • Demand paging loads a page from storage on first access.
  • Protection faults (e.g., write to read‑only) can trigger COW.
  • Minor vs major faults: minor when page is cached; major when disk I/O needed.
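
You can watch demand paging happen with getrusage(2); this sketch touches a fresh anonymous mapping and reports the fault counters before and after (the 64 MB size is arbitrary):

```c
/* Sketch: observe minor faults from demand paging via getrusage(2). */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static void report(const char *tag)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: minor=%ld major=%ld\n", tag, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    size_t len = 64UL * 1024 * 1024;   /* 64 MB anonymous mapping */
    report("before");

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 1, len);                 /* first touch -> demand paging -> minor faults */
    report("after touch");

    munmap(p, len);
    return 0;
}
```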

Swapping

  • When memory pressure rises, inactive pages move to swap.
  • Swappiness controls aggressiveness. Trade‑off: responsiveness vs memory overcommit.

Huge pages

  • 2MB/1GB pages reduce TLB pressure.
  • THP (transparent huge pages) automatically coalesce small pages. Watch for fragmentation and NUMA side effects.
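
A hedged sketch of the madvise(MADV_HUGEPAGE) hint; it only takes effect when THP is set to “madvise” or “always” in /sys/kernel/mm/transparent_hugepage/enabled:

```c
/* Sketch: ask the kernel to back a region with huge pages. The hint is
 * advisory; failure is non-fatal and placement depends on THP config. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16UL * 1024 * 1024;   /* multiple of the 2MB huge page size */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (madvise(p, len, MADV_HUGEPAGE) == -1)
        perror("madvise(MADV_HUGEPAGE)");   /* hint refused; carry on */

    /* after touching the pages, check AnonHugePages in /proc/<pid>/smaps */
    munmap(p, len);
    return 0;
}
```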

Copy‑on‑Write (COW)

  • fork() shares pages between parent/child until a write occurs, then the page is copied.
  • Great for performance, but be careful with large dirty working sets.
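
A tiny demo of COW: parent and child logically share the heap page after fork(), and the child’s write triggers a private copy, leaving the parent’s view unchanged:

```c
/* Sketch: demonstrate copy-on-write across fork(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *buf = malloc(4096);
    strcpy(buf, "original");

    pid_t pid = fork();           /* pages now shared read-only, marked COW */
    if (pid == 0) {
        strcpy(buf, "child");     /* write fault -> kernel copies the page */
        printf("child sees:  %s\n", buf);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", buf);   /* still "original" */
    free(buf);
    return 0;
}
```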

  4. Linux memory management: VMA, page cache, THP, NUMA policies, OOM

VMA (vm_area_struct)

  • Describes contiguous virtual memory regions with permissions and backing (file/anonymous).
  • Operations: mmap, munmap, mprotect manipulate VMAs; kernel merges and splits as needed.
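
A small sketch that makes VMAs visible: mprotect on a sub‑range of a mapping forces the kernel to split one VMA into three, which you can confirm in /proc/<pid>/maps:

```c
/* Sketch: map 4 pages, make the 2nd read-only, and observe the VMA split. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 4 * pg, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* changing protection on one interior page splits the VMA into three */
    if (mprotect(p + pg, pg, PROT_READ) == -1) { perror("mprotect"); return 1; }

    printf("inspect with: cat /proc/%d/maps\n", (int)getpid());
    pause();                      /* keep the mappings alive for inspection */
    return 0;
}
```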

Page cache

  • Unified cache for file data; all I/O goes through it unless O_DIRECT.
  • Dirty pages flushed by writeback; tunables in /proc/sys/vm/.

NUMA policies

  • MPOL_BIND, MPOL_INTERLEAVE, MPOL_PREFERRED.
  • Align thread placement with memory policy to avoid remote access.
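
For programmatic placement, a sketch using libnuma (assumptions: libnuma headers installed and the binary linked with -lnuma; node 0 exists on your machine):

```c
/* Sketch: allocate memory on a specific NUMA node so it stays local
 * to threads bound there. Build with: gcc numa_demo.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() == -1) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t len = 8UL * 1024 * 1024;
    void *p = numa_alloc_onnode(len, 0);   /* MPOL_BIND-style placement on node 0 */
    if (!p) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }

    /* ... run node-0-bound threads against this buffer ... */
    numa_free(p, len);
    return 0;
}
```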

OOM killer

  • When memory is exhausted, the kernel selects a victim based on a badness heuristic (memory usage, oom_score_adj).
  • Design for backpressure: signal, slow down, fail gracefully — don’t rely on OOM as control flow.
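
One practical knob: a process can volunteer itself as a preferred victim by raising its own oom_score_adj (range −1000..1000). A minimal sketch:

```c
/* Sketch: make the current process more likely to be chosen by the OOM
 * killer, e.g. for a disposable cache worker. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "500\n");          /* positive = more likely to be killed */
    fclose(f);
    return 0;
}
```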

  5. File systems: VFS, inode/dentry/superblock, journaling, caching, ext4/XFS/btrfs/ZFS

VFS layer

  • Abstracts filesystem operations: open/read/write/stat/mmap.
  • Core objects:
    • inode: metadata (size, permissions, blocks).
    • dentry: directory entries (names → inodes).
    • superblock: filesystem‑wide metadata and ops.
    • address_space: page cache mapping for file I/O.

Journaling

  • Write‑ahead log records intent before applying changes.
  • ext4 modes: data=journal, data=ordered, data=writeback. Ordered is the common default: metadata goes through the journal, and data blocks are written before the metadata commit.

COW filesystems

  • btrfs/ZFS write new blocks and update metadata atomically, enabling snapshots, checksums, deduplication.

Caching, writeback, and memory pressure

  • Page cache caches file data; readahead prefetches sequential pages.
  • Under memory pressure, cache is reclaimed; buffered I/O may stall on writeback.

Choosing a filesystem

  • ext4: balanced general purpose.
  • XFS: large files and parallel I/O.
  • btrfs/ZFS: data integrity, snapshots, but heavier metadata.

  6. Kernel modules: architecture, driver interfaces, safety, char devices, sysfs/procfs/netlink

Module architecture

  • module_init/module_exit lifecycle.
  • Object files linked against kernel headers; built via kbuild.
  • Symbols resolved by kernel; versioning matters (CONFIG_MODVERSIONS).

Safety fundamentals

  • No blocking in atomic context.
  • Always validate user pointers: copy_from_user/copy_to_user.
  • Use appropriate GFP flags (e.g., GFP_KERNEL, GFP_ATOMIC).
  • Clean up resources in error paths; refcounts for objects; RCU for read‑mostly structures.

Character devices

  • Implement file_operations (open/read/write/ioctl/mmap).
  • Create /dev node via udev or manual mknod.
  • For control paths, prefer ioctl or sysfs attributes; for bulk data, use read/write.

sysfs, procfs, netlink

  • sysfs: typed attributes under /sys for configuration/state.
  • procfs: process and kernel info under /proc.
  • netlink: message‑based channel between kernel and user space (good for complex control planes).

  7. Hands‑on: build a syscall‑like interface via a safe kernel module and user client

Adding a real syscall requires a kernel rebuild and ABI changes. In production, expose functionality through a device or netlink. We’ll build a robust char device with basic IOCTLs.

Kernel module (complete)

```c
// simplecall.c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/uaccess.h>
#include <linux/mutex.h>

#define DEVICE_NAME "simplecall"
#define CLASS_NAME  "simpcls"
#define BUF_SIZE    256

#define SC_IOCTL_MAGIC 'S'
#define SC_IOCTL_ECHO  _IOWR(SC_IOCTL_MAGIC, 0, int)

static int major;
static struct class *sc_class;
static struct device *sc_dev;
static struct cdev sc_cdev;
static char kernel_buf[BUF_SIZE];
static DEFINE_MUTEX(sc_mutex);

static int sc_open(struct inode *inodep, struct file *filep)
{
	return 0;
}

static int sc_release(struct inode *inodep, struct file *filep)
{
	return 0;
}

static ssize_t sc_read(struct file *filep, char __user *buf, size_t len, loff_t *off)
{
	size_t n;

	if (mutex_lock_interruptible(&sc_mutex))
		return -ERESTARTSYS;
	n = strnlen(kernel_buf, BUF_SIZE);
	if (len < n)
		n = len;
	if (copy_to_user(buf, kernel_buf, n)) {
		mutex_unlock(&sc_mutex);
		return -EFAULT;
	}
	mutex_unlock(&sc_mutex);
	return n;
}

static ssize_t sc_write(struct file *filep, const char __user *buf, size_t len, loff_t *off)
{
	size_t n = min(len, (size_t)(BUF_SIZE - 1));

	if (mutex_lock_interruptible(&sc_mutex))
		return -ERESTARTSYS;
	if (copy_from_user(kernel_buf, buf, n)) {
		mutex_unlock(&sc_mutex);
		return -EFAULT;
	}
	kernel_buf[n] = '\0';
	mutex_unlock(&sc_mutex);
	return n;
}

static long sc_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
{
	int val;

	switch (cmd) {
	case SC_IOCTL_ECHO:
		if (copy_from_user(&val, (int __user *)arg, sizeof(int)))
			return -EFAULT;
		val = val ^ 0x5A5A; /* toy transform */
		if (copy_to_user((int __user *)arg, &val, sizeof(int)))
			return -EFAULT;
		return 0;
	default:
		return -ENOTTY;
	}
}

static const struct file_operations sc_fops = {
	.owner          = THIS_MODULE,
	.open           = sc_open,
	.release        = sc_release,
	.read           = sc_read,
	.write          = sc_write,
	.unlocked_ioctl = sc_ioctl,
};

static int __init sc_init(void)
{
	int ret;
	dev_t dev;

	ret = alloc_chrdev_region(&dev, 0, 1, DEVICE_NAME);
	if (ret)
		return ret;
	major = MAJOR(dev);

	cdev_init(&sc_cdev, &sc_fops);
	ret = cdev_add(&sc_cdev, dev, 1);
	if (ret) {
		unregister_chrdev_region(dev, 1);
		return ret;
	}

	/* note: on kernels >= 6.4, class_create() takes only the name */
	sc_class = class_create(THIS_MODULE, CLASS_NAME);
	if (IS_ERR(sc_class)) {
		cdev_del(&sc_cdev);
		unregister_chrdev_region(dev, 1);
		return PTR_ERR(sc_class);
	}

	sc_dev = device_create(sc_class, NULL, dev, NULL, DEVICE_NAME);
	if (IS_ERR(sc_dev)) {
		class_destroy(sc_class);
		cdev_del(&sc_cdev);
		unregister_chrdev_region(dev, 1);
		return PTR_ERR(sc_dev);
	}

	pr_info("simplecall: loaded major=%d\n", major);
	return 0;
}

static void __exit sc_exit(void)
{
	dev_t dev = MKDEV(major, 0);

	device_destroy(sc_class, dev);
	class_destroy(sc_class);
	cdev_del(&sc_cdev);
	unregister_chrdev_region(dev, 1);
	pr_info("simplecall: unloaded\n");
}

module_init(sc_init);
module_exit(sc_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("You");
MODULE_DESCRIPTION("Safe syscall-like char device");
```

Makefile

```makefile
obj-m += simplecall.o
KDIR := /lib/modules/$(shell uname -r)/build

# recipe lines must be indented with a real tab
all:
	make -C $(KDIR) M=$(PWD) modules

clean:
	make -C $(KDIR) M=$(PWD) clean
```

User client (complete)

```c
// user_sc.c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define SC_IOCTL_MAGIC 'S'
#define SC_IOCTL_ECHO  _IOWR(SC_IOCTL_MAGIC, 0, int)

int main(void)
{
    int fd = open("/dev/simplecall", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    const char *msg = "hello kernel";
    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

    char buf[256];
    int n = read(fd, buf, sizeof(buf) - 1);
    if (n < 0) { perror("read"); return 1; }
    buf[n] = '\0';
    printf("read: %s\n", buf);

    int x = 12345;
    if (ioctl(fd, SC_IOCTL_ECHO, &x) < 0) { perror("ioctl"); return 1; }
    printf("ioctl echo transformed: %d\n", x);

    close(fd);
    return 0;
}
```

Build/run steps

  • Build module: make
  • Insert: sudo insmod simplecall.ko
  • Confirm: dmesg | tail, ls -l /dev/simplecall
  • Build user client: gcc user_sc.c -o user_sc
  • Run: ./user_sc
  • Remove: sudo rmmod simplecall

  8. Observability: ftrace, perf, eBPF, bpftrace, tracepoints, kgdb, kprobes
  • ftrace: echo function > /sys/kernel/debug/tracing/current_tracer, then enable specific functions.
  • perf: perf stat -e cycles,instructions,cache-misses ./app and perf record/report for hotspots.
  • eBPF: use bpftrace scripts like bpftrace -e 'tracepoint:sched:sched_switch { printf("%s -> %s\n", comm, args->next_comm); }'
  • kprobes: dynamically hook kernel functions for debugging; uprobes for user space.
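
For kprobes specifically, a minimal module sketch is below (built with the same kbuild Makefile pattern as the earlier module). The probed symbol do_sys_openat2 is an assumption: symbol names vary across kernel versions, so check /proc/kallsyms first.

```c
/* kprobe_demo.c: log every entry to a probed kernel function (toy example;
 * real probes should log sparingly). */
#include <linux/module.h>
#include <linux/kprobes.h>

static struct kprobe kp = {
	.symbol_name = "do_sys_openat2",   /* assumption: verify on your kernel */
};

static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	pr_info("kprobe hit: %s\n", p->symbol_name);
	return 0;
}

static int __init kp_init(void)
{
	kp.pre_handler = handler_pre;
	return register_kprobe(&kp);
}

static void __exit kp_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kp_init);
module_exit(kp_exit);
MODULE_LICENSE("GPL");
```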

  9. Performance engineering: locking, scalability, isolation, cgroups/namespaces, NUMA tuning
  • Locking: prefer RCU for read‑mostly structures; avoid global locks; partition data per CPU.
  • Isolation: cpusets and isolcpus for dedicated workloads.
  • cgroups: limit CPU/memory/IO; use cpu.max, memory.max, io.max.
  • Namespaces: PID/mount/net/user for container isolation.
  • NUMA: bind threads and memory; avoid cross‑node contention; use huge pages if beneficial.

  10. Security architecture: capabilities, LSM (SELinux/AppArmor), seccomp, signed modules
  • Capabilities split root privileges into fine‑grained bits (CAP_SYS_ADMIN, etc.).
  • LSM: SELinux/AppArmor enforce mandatory policies; label files/processes; deny‑by‑default.
  • seccomp: restrict syscalls via allowlists; reduce attack surface (see the sketch after this list).
  • Signed modules/secure boot: only load trusted modules in production.
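
To make seccomp concrete, here is a minimal sketch of strict mode via prctl(2). Real sandboxes use SECCOMP_MODE_FILTER with a BPF program; strict mode just shows the idea:

```c
/* Sketch: seccomp strict mode. After the prctl call, only read(2),
 * write(2), _exit(2), and sigreturn(2) are permitted; any other syscall
 * kills the process with SIGKILL. */
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

int main(void)
{
    printf("entering strict seccomp...\n");
    fflush(stdout);                       /* flush before syscalls get restricted */

    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) == -1) {
        perror("prctl");
        return 1;
    }

    write(1, "still alive\n", 12);        /* write(2) is on the allowlist */
    /* open(2) here would be fatal: the kernel delivers SIGKILL */
    syscall(SYS_exit, 0);                 /* raw exit(2); exit_group(2) is NOT allowed */
}
```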

  11. Production guidance: ABI stability, kernel configs, CI for modules, panic hygiene
  • Keep module ABI compatible with target kernels; enable CONFIG_MODVERSIONS.
  • CI: build against multiple kernel versions/headers; run kselftests.
  • Panic hygiene: avoid BUG(); prefer graceful error handling; set panic_on_oops appropriately.

  12. Closing notes and next steps

You now have a professional, end‑to‑end picture of OS internals: how the scheduler chooses tasks, how virtual memory maps addresses, how filesystems ensure consistency, and how kernel modules safely extend the kernel — with real Linux code you can compile and run.

Next up in the series:

  • Distributed Systems & Networking: RDMA, NVMe‑oF, cluster scheduling, plus MPI and ZeroMQ practicals.
  • Compiler Design 101 (But Advanced): build a serious language and compiler in C++.

If you’re still here, you’re my kind of engineer. Follow along — we’ll keep writing, keep shipping, and keep turning curiosity into craft. Stay bold 🚀🔥.
