Yegane Golipour

Posted on Jul 16

Building My First Kernel: Understanding Bare-Metal Operating Systems

#c #assembly #kernel #programming

Introduction
Project Overview
Working Features
Concepts
Debugging
What I'd Like to Add Next
What I Learned
I'm Open to Feedback
References & Resources

Introduction

I've always been fascinated by operating systems. One of the main reasons I learned the C programming language was to understand OS development. So, after finishing my last project (a mini shell), I couldn’t resist developing a very minimal kernel for an x86 32-bit CPU.

Most of my time wasn’t spent coding. In fact it was spent understanding the concepts. Finding good resources was challenging, but I learned so much. It even helped me revisit the OS concepts I studied at university.

This post covers the current features of the kernel, the concepts I learned, an explanation of how each one works, the most challenging parts, and what I’d like to add next. Many code snippets were sourced from various tutorials, but understanding even 15 lines of code often took hours. This project wasn’t just about writing code; it was about grasping the architecture of how a kernel works. You can view the full code on my GitHub page.

Project Overview

My mini kernel currently prints a classic "Hello World" on the screen, handles interrupts (both exceptions and IRQs), and includes keyboard and timer drivers. It also supports minimal memory management, setting up a heap and implementing physical memory allocation and freeing.

Terminal output is handled via a serial driver.

Tools used:

Emulator: QEMU
Assembler: NASM
Cross Compiler: i686-elf-gcc
Linker: GNU Linker from Binutils
Bootloader: GRUB
Debugger: GDB with QEMU
Build: Makefile

Demo:

Boot Phases:

Working Features

These are the features currently implemented:

Display
- Prints: "Hello, Welcome To Yega Kernel!"
- Implemented via VGA text mode
Descriptor Tables
- Global Descriptor Table (GDT)
- Interrupt Descriptor Table (IDT)
Interrupts
- Remaps the PIC
- Handles CPU exceptions and hardware IRQs via ISRs
Drivers
- Keyboard driver (prints keys using VGA)
- Timer driver (ticks at 100 Hz)
Memory Management
- Inspects free memory using Multiboot memory map
- Initializes heap
- Implements kalloc and kfree
Error Handling
- Halts CPU and disables interrupts if boot errors occur

Concepts

Boot Flow

Understanding the boot sequence is essential. Here’s a simplified modern boot flow:

BIOS/UEFI starts and loads GRUB
GRUB parses the ELF kernel file
GRUB loads the kernel at a known location (1MB or 2MB)
GRUB jumps to kernel entry
Kernel sets up paging
Kernel enables paging
Kernel jumps to higher-half address space

Also, it’s critical to understand CPU modes like Real Mode, Protected Mode, and Long Mode.

At this stage, these were some of the big questions I had. Each one led me to dig deeper and truly understand how early system boot and memory addressing work:

Why do modern OSes still boot in Real Mode?

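It felt outdated and unnecessary. Why start in a 16-bit mode? It turns out the reason is legacy compatibility: x86 CPUs still come out of reset in Real Mode. On UEFI systems the firmware leaves Real Mode by itself early on, and only the legacy BIOS compatibility path (CSM) reproduces the classic Real Mode boot.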

How was all memory addressed with only a 20-bit address bus and 16-bit registers?

This confused me a lot. With a 20-bit address bus (1 MiB of addressable memory) and only 16-bit registers (which can hold offsets up to 64 KiB), how could we access the full memory? I learned that a segment register is combined with an offset (segment:offset): the CPU computes segment × 16 + offset, which yields a 20-bit physical address.
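To make the segment:offset arithmetic concrete, here is a minimal sketch in C. The function name real_mode_physical is made up for illustration, and the 20-bit mask models the wrap-around of the original 8086 address bus, which is an assumption of this sketch (with the A20 line enabled, later CPUs do not wrap).

```c
#include <stdint.h>

/* Real Mode address translation: physical = segment * 16 + offset.
 * Hypothetical helper; the mask models a 20-bit address bus that wraps. */
static uint32_t real_mode_physical(uint16_t segment, uint16_t offset)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset;
    return linear & 0xFFFFFu;
}
```

For example, B800:0000 resolves to 0xB8000, and both 07C0:0000 and 0000:7C00 resolve to the same physical address 0x7C00, which shows that many segment:offset pairs can name one location.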

Why does the bootloader load the kernel at low memory addresses like 1MB or 2MB?

I didn’t understand the choice at first. It is mostly rooted in convention: 1 MB is the top of what Real Mode can address, so loading the kernel there or above stays compatible with older memory maps and leaves low memory available for BIOS data.

Cross Compiler

A cross compiler is needed because the default host compiler includes platform-specific libraries (like glibc). With a cross compiler, you:

Generate binaries for a target system
Avoid linking host-specific libraries
Define your own runtime

These are some terms and questions I looked up:

Going Self-hosted
Bootstrapping
How to build a cross compiler

ELF (Executable and Linkable Format)

The ELF format tells the OS how to load and run your binary. It's used for:

Executables
Object files
Shared libraries
Core dumps

I also wondered why this file format is so special and why Linux uses it:

It's not just any binary format. It's a well-defined standard that the OS kernel, compiler, linker, loader, and debugger all agree on, so they can communicate how a program is laid out in memory and how to run it.

GNU Linker

At first, I was surprised that I had to write my own linker script. Until then, I thought the linker was only used to link object files (which is true, but it does a lot more than that).
The linker controls the memory layout of our program.

The linker:

Combines object files
Assigns memory addresses
Resolves symbols
Produces the final binary

A custom linker script lets you define memory layout: where code starts, where .text, .data, .bss go, and what the entry point is.

Multiboot Header

Based on the OSDev Tutorial:

There exists a Multiboot Standard that defines a simple interface between the bootloader and the operating system kernel.

It works by placing a few magic values in specific global variables (known as the multiboot header), which the bootloader searches for.

When the bootloader finds these values, it recognizes the kernel as multiboot-compatible, knows how to load it, and can even pass important information such as memory maps.

I used the NASM assembler. The code is based on this resource.
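The header itself lives in the assembly file, but the rule behind its three fields is easy to illustrate in C. The sketch below assumes the Multiboot 1 layout (magic, flags, checksum); the struct and function names are made up for illustration and are not taken from my kernel.

```c
#include <stdint.h>

#define MULTIBOOT_HEADER_MAGIC 0x1BADB002u  /* Multiboot 1 header magic */
#define MULTIBOOT_PAGE_ALIGN   (1u << 0)    /* align loaded modules on page boundaries */
#define MULTIBOOT_MEMORY_INFO  (1u << 1)    /* ask the bootloader for memory information */

/* Hypothetical C view of the three mandatory header fields. */
struct multiboot_header {
    uint32_t magic;
    uint32_t flags;
    uint32_t checksum;
};

static struct multiboot_header make_multiboot_header(uint32_t flags)
{
    struct multiboot_header h;
    h.magic = MULTIBOOT_HEADER_MAGIC;
    h.flags = flags;
    /* The bootloader accepts the header only if
     * magic + flags + checksum == 0 (mod 2^32). */
    h.checksum = (uint32_t)0 - (h.magic + h.flags);
    return h;
}
```

With flags set to MULTIBOOT_PAGE_ALIGN | MULTIBOOT_MEMORY_INFO, the checksum comes out as 0xE4524FFB, and the three fields sum to zero in 32-bit arithmetic.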

Important Note:

Since there is no stack set up yet and you must ensure the global variables are initialized correctly, this initialization has to be done in assembly.

Because I’ve always programmed at the user level, I didn’t realize that when the bootloader first loads our kernel, there is no stack; so using C right away is impossible. We must set up the stack and stack pointers in this assembly file first.

Also, make sure to preserve the EAX and EBX registers and pass them on to kernel_main!

I wasn’t aware of this until near the end of my project. GRUB passes a lot of crucial information to your kernel_main, with the most important being the address of the memory map. You can use this memory map to inspect the memory layout, which is essential for setting up memory management.

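GRUB stores the address of this information in the EBX register, and puts the magic value 0x2BADB002 in EAX so the kernel can verify that it was loaded by a Multiboot-compliant bootloader. You need to define structs that exactly match the layout GRUB uses. You can read more about this in the GRUB Multiboot Specification.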
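As a rough sketch of what the kernel can do with these two registers once the assembly stub has passed them along: the struct below covers only the leading fields of the Multiboot 1 information structure, and the names boot_info_t and multiboot_has_mmap are hypothetical, chosen for this example only.

```c
#include <stddef.h>
#include <stdint.h>

#define MULTIBOOT_BOOTLOADER_MAGIC 0x2BADB002u  /* value a Multiboot 1 loader leaves in EAX */
#define MULTIBOOT_INFO_MEM_MAP     (1u << 6)    /* flags bit 6: mmap_* fields are valid */

/* Leading fields of the Multiboot 1 info structure (EBX points at it). */
typedef struct {
    uint32_t flags;
    uint32_t mem_lower;
    uint32_t mem_upper;
    uint32_t boot_device;
    uint32_t cmdline;
    uint32_t mods_count;
    uint32_t mods_addr;
    uint32_t syms[4];      /* symbol table info, unused in this sketch */
    uint32_t mmap_length;
    uint32_t mmap_addr;
} boot_info_t;

/* Returns 1 when the magic matches and the memory map fields can be trusted. */
static int multiboot_has_mmap(uint32_t magic, const boot_info_t *info)
{
    if (magic != MULTIBOOT_BOOTLOADER_MAGIC || info == NULL)
        return 0;
    return (info->flags & MULTIBOOT_INFO_MEM_MAP) != 0;
}
```

A kernel_main that receives the EAX and EBX values as arguments could call this first and halt if it returns 0, instead of walking a memory map that may not be there.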

VGA Text Mode Buffer

VGA text mode buffer starts at 0xB8000. Writing here displays characters on the screen. It’s the easiest way to output text early in boot.
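Each cell in that buffer is a 16-bit value: the low byte is the character and the high byte is the color attribute (background in the upper nibble, foreground in the lower one). The sketch below shows the encoding; vga_entry and vga_write_at are illustrative names, and the buffer is passed in as a parameter so the same code can be tried on a plain array instead of real video memory.

```c
#include <stddef.h>
#include <stdint.h>

enum { VGA_WIDTH = 80, VGA_HEIGHT = 25 };

/* Pack one character cell: attribute in the high byte, character in the low byte. */
static uint16_t vga_entry(char c, uint8_t fg, uint8_t bg)
{
    uint8_t attr = (uint8_t)(((bg & 0x0F) << 4) | (fg & 0x0F));
    return (uint16_t)(((uint16_t)attr << 8) | (uint8_t)c);
}

/* Write a string starting at (row, col), stopping at the end of the row. */
static void vga_write_at(volatile uint16_t *buf, size_t row, size_t col,
                         const char *s, uint8_t fg, uint8_t bg)
{
    for (size_t i = 0; s[i] != '\0' && col + i < VGA_WIDTH; i++)
        buf[row * VGA_WIDTH + col + i] = vga_entry(s[i], fg, bg);
}
```

In a kernel the first argument would be (volatile uint16_t *)0xB8000; with fg = 15 and bg = 0 the text shows up white on black.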

Later, I added a serial driver for debugging via COM1.

Segmentation & Flat Memory Model

One of the many steps in building a mini hobby kernel is setting up the Global Descriptor Table (GDT). It’s a table of segment descriptors.

Segmentation was a big question for me. Back in early OSes, segmentation felt like a hack to access all memory (remember the 20-bit address bus and 16-bit registers?). It also helped protect processes, which is why segmentation and segment registers were necessary. But while reading about this, I wondered: what’s the point of segment registers in modern OSes running in protected mode? I spent a lot of time trying to find the answer, and honestly, I’m still not 100% convinced. One reason is that the CPU still depends on them: in protected mode every memory access goes through a segment register, and the descriptor it selects carries the privilege level and access rights that the CPU checks, so it won’t work properly without valid segments.

If you look at this table from OSDev, you’ll see that the base and limit for all segments are the same. That’s because we’re not really using a segmented memory model. Instead, there is one big segment that covers the entire memory, unlike the traditional segmented memory model:

Offset	Use	Content
0x0000	Null Descriptor	Base = 0 Limit = 0x00000000 Access Byte = 0x00 Flags = 0x0
0x0008	Kernel Mode Code Segment	Base = 0 Limit = 0xFFFFF Access Byte = 0x9A Flags = 0xC
0x0010	Kernel Mode Data Segment	Base = 0 Limit = 0xFFFFF Access Byte = 0x92 Flags = 0xC
0x0018	User Mode Code Segment	Base = 0 Limit = 0xFFFFF Access Byte = 0xFA Flags = 0xC
0x0020	User Mode Data Segment	Base = 0 Limit = 0xFFFFF Access Byte = 0xF2 Flags = 0xC
0x0028	Task State Segment	Base = &TSS Limit = sizeof(TSS)-1 Access Byte = 0x89 Flags = 0x0

It’s subtle in a flat memory model, but imagine someone tells the CPU: “Go to this address and run code” or “Go to this address and get data.” It sounds correct, but it’s not that simple. How does the CPU know if it’s allowed to execute code or read data from that address? What if it doesn’t have permission? What if it shouldn’t access that memory at all?

Segments tell the CPU where data starts, whether user-mode is allowed to access it, and more.

If we weren’t in a flat memory model, each segment would have different base and limit values.

PIC Remapping

source

The Programmable Interrupt Controller (PIC) — in legacy x86 systems, the Intel 8259A chip — is responsible for:

Accepting interrupt signals (IRQs) from hardware devices
Deciding which IRQ to send to the CPU based on priority and masking
Sending the corresponding interrupt vector to the CPU when requested

Without the PIC, the CPU would have no sane way to handle multiple interrupt sources.

Why remap the PIC?

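By default, the BIOS maps IRQ0-7 to interrupt vectors 8-15 (0x08-0x0F) and IRQ8-15 to vectors 0x70-0x77. In protected mode, vectors 0-31 are reserved for CPU exceptions, so the first range clashes with exceptions like double fault (vector 8), general protection fault (vector 13), or page fault (vector 14), making it impossible to distinguish hardware interrupts from CPU exceptions. So, remapping is essential to avoid this conflict.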
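The remap itself is a fixed sequence of initialization command words (ICW1 to ICW4) sent to both 8259A chips. The sketch below records the port writes in an array instead of touching hardware, so port_out8 is a stand-in for a real outb, and all names here are illustrative. It also leaves out the short I/O delays and the saving and restoring of the interrupt masks that a real implementation usually adds.

```c
#include <stddef.h>
#include <stdint.h>

#define PIC1_CMD  0x20
#define PIC1_DATA 0x21
#define PIC2_CMD  0xA0
#define PIC2_DATA 0xA1

/* Stand-in for outb: remembers each write so the sequence can be inspected. */
struct port_write { uint16_t port; uint8_t value; };
static struct port_write port_log[16];
static size_t port_log_len;

static void port_out8(uint16_t port, uint8_t value)
{
    if (port_log_len < sizeof(port_log) / sizeof(port_log[0])) {
        port_log[port_log_len].port = port;
        port_log[port_log_len].value = value;
        port_log_len++;
    }
}

static void pic_remap(uint8_t master_offset, uint8_t slave_offset)
{
    port_out8(PIC1_CMD, 0x11);            /* ICW1: begin init, ICW4 will follow */
    port_out8(PIC2_CMD, 0x11);
    port_out8(PIC1_DATA, master_offset);  /* ICW2: vector offset for IRQ0-7 */
    port_out8(PIC2_DATA, slave_offset);   /* ICW2: vector offset for IRQ8-15 */
    port_out8(PIC1_DATA, 0x04);           /* ICW3: slave attached to IRQ2 */
    port_out8(PIC2_DATA, 0x02);           /* ICW3: slave cascade identity */
    port_out8(PIC1_DATA, 0x01);           /* ICW4: 8086 mode */
    port_out8(PIC2_DATA, 0x01);
}
```

Calling pic_remap(0x20, 0x28) moves IRQ0-7 to vectors 32-39 and IRQ8-15 to vectors 40-47, which is why the keyboard (IRQ1) ends up on vector 33.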

This is a very good example, which I got from ChatGPT, of what happens when an interrupt occurs:
You press a key:

The keyboard controller signals the PIC by raising IRQ1
The PIC decides whether the CPU should be notified immediately and translates the IRQ number into an interrupt vector for the CPU's table
The PIC forwards this interrupt to the CPU
The CPU jumps to the ISR for vector 33 (IRQ1 once the master PIC is remapped to offset 32)
The OS handles the interrupt by talking to the keyboard controller via in and out instructions (or inportb/outportb, inportw/outportw, and inportd/outportd in C): it asks which key was pressed, does something about it (such as displaying the key on the screen and notifying the current application), and then returns to whatever code was executing when the interrupt came in
The ISR reads the scancode from I/O port 0x60
The ISR decodes the key and puts it in a buffer
The ISR sends EOI to the PIC
The CPU resumes the interrupted code

What's the difference between controller and driver? (for example keyboard controller and keyboard driver)

Controller: A physical chip managing hardware communication (e.g., keyboard controller with ports 0x60 for data and 0x64 for commands/status). It only sends raw scancodes. It doesn’t interpret key presses.
Driver: Kernel code that interacts with the controller, interprets scancodes, and acts on them (e.g., printing characters on the screen).
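To show the driver side of that split, here is a minimal sketch of the "interpret the scancode" step. It assumes scancode set 1 (what the controller usually delivers by default), includes only a handful of keys, and takes the scancode as a parameter; in a real ISR that byte would come from port 0x60 and an EOI would follow. All names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define KBD_BUF_SIZE 16

static char kbd_buf[KBD_BUF_SIZE];
static size_t kbd_len;

/* Tiny excerpt of scancode set 1; a real table covers the whole keyboard. */
static char scancode_to_ascii(uint8_t sc)
{
    switch (sc) {
    case 0x1E: return 'a';
    case 0x30: return 'b';
    case 0x2E: return 'c';
    case 0x39: return ' ';
    case 0x1C: return '\n';
    default:   return 0;    /* key not covered by this sketch */
    }
}

/* Called with the raw byte the controller produced. */
static void keyboard_handle_scancode(uint8_t sc)
{
    if (sc & 0x80)          /* bit 7 set: key release, ignored here */
        return;
    char c = scancode_to_ascii(sc);
    if (c != 0 && kbd_len < KBD_BUF_SIZE)
        kbd_buf[kbd_len++] = c;
}
```

Pressing and releasing "a" produces 0x1E followed by 0x9E; only the first byte ends up in the buffer.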

Memory Management

This is the part that I loved the most, because I could finally write some code myself and decide what kind of design I wanted.

Heap Initialization

I used the Multiboot memory map to find free memory blocks.

I wrote a function that checks the available blocks (up to 32, because at this point we cannot use dynamic memory allocation):

int find_available_memory(multiboot_info_t *mbi) {
    serial_writestring("\nflags= ");
    serial_writehex(mbi->flags);

    /* Bit 6 of flags says whether mmap_addr and mmap_length are valid,
       so check it before using them. */
    if (!CHECK_FLAG(mbi->flags, 6))
        return 0;

    serial_writestring("\nmmap addr= ");
    serial_writehex(mbi->mmap_addr);

    serial_writestring("\nmmap length= ");
    serial_writehex(mbi->mmap_length);

    int num_block = 0;
    uint8_t *mmap = (uint8_t *)mbi->mmap_addr;
    uint8_t *mmap_end = mmap + mbi->mmap_length;

    while (mmap < mmap_end) {
        multiboot_mmap_entry_t *entry = (multiboot_mmap_entry_t *)mmap;

        serial_writestring("\nentry addr= ");
        serial_writehex(entry->addr);

        serial_writestring("\nentry len= ");
        serial_writehex(entry->len);

        serial_writestring("\nentry type= ");
        serial_writehex(entry->type);

        serial_writestring("\nentry size= ");
        serial_writehex(entry->size);

        /* Type 1 marks RAM that is available for use. */
        if (num_block < MAX_MEMORY_BLOCKS && entry->type == 1) {
            uint64_t entry_start = entry->addr;
            uint64_t entry_end = entry_start + entry->len;

            if (entry_start <= KERNEL_END && entry_end > KERNEL_END) {
                free_memory_blocks[num_block].start = KERNEL_END;
                free_memory_blocks[num_block].end = entry_end;
                num_block++;
            } else if (entry_start > KERNEL_END) {
                free_memory_blocks[num_block].start = entry_start;
                free_memory_blocks[num_block].end = entry_end;
                num_block++;
            }
        }

        /* The size field does not count itself, so add it when stepping to the next entry. */
        mmap += entry->size + sizeof(entry->size);
    }

    return num_block;
}

This gives us important information about where to start our heap:

heap_start = free_memory_blocks[0].start;
heap_start = ALIGNUP(heap_start, PAGE_SIZE);
heap_end = heap_start + heap_size;
kalloc_ptr = heap_start;

`kalloc`

I use a pointer, kalloc_ptr, which points to the next free block on the heap.
During memory management initialization, kalloc_ptr is set to the heap start address (heap_start).
Each memory block has a header describing its size, whether it’s free, and a pointer to the next block:

typedef struct heap_block {
    size_t size;
    bool is_freed;
    struct heap_block *next;
} heap_block_t;

Allocation works like a simplified malloc:

I implemented a First-Fit strategy: scan the block list from head for the first freed block whose size is ≥ the requested size.
If there’s no existing free block big enough to satisfy the request, a new block is allocated at kalloc_ptr.

When allocating, kalloc_ptr is advanced by the total required size, which includes the block header:

size_t total_req = req + sizeof(heap_block_t);
...
kalloc_ptr += total_req;

We add sizeof(heap_block_t) because the header itself consumes memory. When returning a pointer to the user, we skip the header:

return (void *)(curr + 1);
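The curr + 1 trick relies on C pointer arithmetic: adding 1 to a heap_block_t pointer moves it forward by sizeof(heap_block_t) bytes, which is exactly where the payload begins. A small standalone sketch of both directions follows; block_header_t mirrors the header struct above, and payload_of and header_of are hypothetical helper names.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct block_header {
    size_t size;
    bool is_freed;
    struct block_header *next;
} block_header_t;

/* Header -> payload: step over exactly one header. */
static void *payload_of(block_header_t *h)
{
    return (void *)(h + 1);
}

/* Payload -> header: step back exactly one header. */
static block_header_t *header_of(void *payload)
{
    return ((block_header_t *)payload) - 1;
}
```

The two helpers are inverses of each other, which is what lets a free routine recover the header from nothing but the pointer the caller was given.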

This is the implementation of kalloc:

heap_block_t *head = NULL;

void *kalloc(size_t req) {

    req = ALIGNUP(req, ALIGN);
    size_t total_req = req + sizeof(heap_block_t);

    heap_block_t *curr = head;
    heap_block_t *prev = NULL;

    /* First fit: reuse the first freed block that is large enough. */
    while (curr) {
        if (curr->is_freed && curr->size >= req) {
            curr->is_freed = false;
            return (void *)(curr + 1);
        }

        prev = curr;
        curr = curr->next;
    }

    /* Only a brand-new block needs room at the end of the heap. */
    if ((kalloc_ptr + total_req) > heap_end) {
        serial_writestring("Not enough memory!\n");
        return NULL;
    }

    curr = (heap_block_t *)kalloc_ptr;
    curr->size = req;
    curr->is_freed = false;
    curr->next = NULL;

    if (!head)
        head = curr;
    else
        prev->next = curr;

    kalloc_ptr += total_req;

    return (void *)(curr + 1);
}

`kfree`

Freeing memory is much simpler.

The user passes a pointer to the allocated memory.
To access the block header, I subtract 1 from the pointer (since the header is placed just before the returned memory).
I mark the block as free.
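If the freed block is the last one (i.e. node->next == NULL), I unlink it from the list and move kalloc_ptr back to reclaim space. Without the unlink step, the list would still point at memory that the next kalloc can hand out again.

Here’s the implementation:

void kfree(void *ptr) {
  if (!ptr) return;
  heap_block_t *node = ((heap_block_t *)ptr) - 1;
  node->is_freed = true;
  if (node->next) return;
  /* Last block: unlink it first so the list never points into reclaimed space. */
  heap_block_t **link = &head;
  while (*link && *link != node)
    link = &(*link)->next;
  *link = NULL;
  kalloc_ptr -= node->size + sizeof(heap_block_t);
}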

Alignment

This part was honestly confusing at first. I always knew alignment happened “under the hood,” and I assumed the compiler took care of it. But working on my kernel project made it unavoidable.

For this project, I had to care about alignment because:

The heap had to be aligned to the page size, which is essential for the paging and virtual memory support I’ll add later.
Each allocation had to be aligned to 8 bytes, which is standard on 32-bit systems to avoid unaligned memory access penalties.

The alignment formula I used is from this Wikipedia article:

padding = (align - (offset & (align - 1))) & (align - 1)
        = -offset & (align - 1)

aligned = (offset + (align - 1)) & ~(align - 1)
        = (offset + (align - 1)) & -align

In my code, I implemented it like this:

#define ALIGNUP(offset, align) (((offset) + ((align) - 1)) & ~((align) - 1))
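A quick way to convince yourself that the formulas work is to run them on a few numbers. The sketch below restates them as functions (align_up and align_padding are illustrative names, not part of my kernel) and makes the hidden requirement explicit: the bit trick only works when align is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round offset up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t offset, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + (align - 1)) & ~(align - 1);
}

/* Number of bytes to add to offset to reach the next aligned address. */
static uintptr_t align_padding(uintptr_t offset, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)0 - offset) & (align - 1);
}
```

For instance, 13 aligned to 8 becomes 16 (3 bytes of padding), 16 stays 16, and 0x100001 aligned to a 4096-byte page becomes 0x101000.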

Debugging

I believe the most difficult part of writing C code is debugging. In my kernel project, I had to check many registers and figure out whether their values made sense.

First, I checked whether the segment register values were correct after initializing the GDT, based on how I had set them up. I did this:

Next, I checked whether interrupts are enabled. For this we have to look at the EFLAGS register:

`EFL` means `EFLAGS`, which is `0000 0000 0000 0000 0000 0010 0000 0110` (0x00000206) here.
Bit 9 (counting from 0 at the least significant bit) is the `IF` flag, which shows whether interrupts are enabled. In the value above, bit 9 is `1`, so interrupts are enabled.
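Counting bits by eye in a 32-bit binary string is error-prone, so it can help to let a few lines of C do it. The sketch below tests the IF bit of an EFLAGS value copied from the register dump; the macro and function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_IF (1u << 9)  /* interrupt enable flag: bit 9, counting from bit 0 */

/* True when the given EFLAGS value has interrupts enabled. */
static bool interrupts_enabled(uint32_t eflags)
{
    return (eflags & EFLAGS_IF) != 0;
}
```

Passing in the EFL value shown by the debugger answers the question directly, without having to count positions in the binary form.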

Then I checked whether the PIC is working correctly. The PIC has three registers:

- `IRR`: Interrupt Request Register (pending interrupts)
- `ISR`: In-Service Register (interrupts currently being handled)
- `IMR`: Interrupt Mask Register (masked/disabled IRQ lines)

1. If `IRR` is non-zero but `ISR` is zero, it means interrupts are being requested but the CPU isn’t acknowledging them.
2. If you see `ISR` non-zero forever, it means you forgot to send an `EOI` to `PIC` after handling the interrupt.
3. If you see `IMR` masking all, then no new interrupts will come.
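These three rules can be written down as a small classification function, which is handy when dumping the registers over serial. This is only a sketch: the enum and function names are invented, and a single snapshot cannot prove that `ISR` stays non-zero "forever", so that case is reported as "in service" and needs to be observed over time.

```c
#include <stdint.h>

typedef enum {
    PIC_IDLE,              /* nothing pending, nothing in service */
    PIC_NOT_ACKNOWLEDGED,  /* IRR != 0, ISR == 0: requests are not reaching the CPU */
    PIC_IN_SERVICE,        /* ISR != 0: if this never clears, an EOI is probably missing */
    PIC_ALL_MASKED         /* IMR == 0xFF: every IRQ line is masked */
} pic_state_t;

/* Classify one snapshot of a single PIC's registers. */
static pic_state_t pic_diagnose(uint8_t irr, uint8_t isr, uint8_t imr)
{
    if (imr == 0xFF)
        return PIC_ALL_MASKED;
    if (isr != 0)
        return PIC_IN_SERVICE;
    if (irr != 0)
        return PIC_NOT_ACKNOWLEDGED;
    return PIC_IDLE;
}
```

A snapshot with `IRR` = 0x03 and `ISR` = 0x00 (timer and keyboard both pending, nothing being serviced) falls into the "not acknowledged" case.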

For example one time I had a `General Protection Fault` exception:

and I got these values for `IRR`, `ISR`, `IMR`:

*Before pressing a key:*

The first bit of `IRR` is 1, which means `IRQ0` has fired.

*After pressing a key:*

The first and second bits of `IRR` are 1, so `IRQ0` and `IRQ1` have fired.

But because of an error, neither of them is being acknowledged by the CPU.

What I'd Like to Add Next

Virtual memory and paging
printf and basic stdio-like functions
Multitasking between dummy tasks
Porting my mini shell into the kernel

What I Learned

This project taught me more than university courses. Key takeaways:

Assembly for OS dev
Boot flow and GRUB
GDT and IDT setup
ISRs and IRQs
PIC remapping
CPU modes
Inline assembly
Cross compilation
Multiboot headers
PIT and timers
ELF structure
Memory alignment and heap allocators
VGA and serial debugging
GDB with QEMU

I'm Open to Feedback

If you've worked on kernels or OS dev and have feedback, suggestions, or corrections, I'd really appreciate it. Especially if you can point out where my understanding may be off or recommend what to explore next.