DEV Community

Ali Smadi

I Vibe Coded an Entire Operating System From Scratch — Here's What I Learned

I've always considered building an operating system the Mount Everest of programming. Not the "follow a tutorial and boot into Hello World" kind — I mean the real thing: UEFI boot, page tables, preemptive multitasking, a filesystem, a window manager, user-space applications, and a desktop environment. The kind of project that touches every layer of the stack and punishes every wrong assumption.

So naturally, I decided to see if Claude could help me build one.

The result is ASOS — a hobbyist x86-64 operating system written in C and Assembly that runs on 128 MB of RAM, built from absolute scratch. No Linux kernel underneath. No libc. No safety net. 25,709 lines of code (excluding the ported DOOM), 82 commits over 8 days, booting from UEFI firmware all the way to a desktop environment with a window manager, four GUI applications, an interactive shell, and yes — a playable port of the original 1993 DOOM.


This post isn't about OS development theory. It's about what actually happened when I sat down with Claude (both the web app and Claude Code) and tried to push AI-assisted development to its limits on one of the hardest possible projects. What worked, what broke, and the lessons that generalize far beyond operating systems.


The Setup: Two Claudes, One OS

My workflow split into two surfaces:

  1. Claude AI (web app) — my consultant on architecture, OS knowledge, and prompt engineer. I used it to understand what building an OS actually requires, make informed architectural decisions, and — crucially — craft detailed prompts for the next step.
  2. Claude Code (CC) — my executor. It received those prompts and wrote the actual C and Assembly, dealt with compilation errors, and iterated until things worked.

This separation turned out to be one of the most important decisions of the entire project. More on that later.

Why C and Not Rust?

My first instinct was to use Rust — a modern language, and memory safety for a kernel sounds like a no-brainer. Claude itself advised against it. Its reasoning: vastly more training data exists for OS development in C and Assembly than in Rust. The model would be more capable, hit fewer dead ends, and produce more reliable code in C.

This turned out to be a real insight about working with AI coding agents: you need to balance the agent's capabilities against the intrinsic properties of the language. The theoretically "better" language isn't better if the model can't execute fluently in it. C won on pragmatism.


The Architecture

Before a single line of code was written, I spent significant time with Claude mapping out the full architecture. Here's what ASOS looks like now from 30,000 feet:

Boot
├── UEFI Bootloader — gnu-efi, framebuffer/memory map, page tables, jump to kernel
└── Shared Boot Info — struct passed from bootloader to kernel

Kernel Core
├── Entry — BSS clear, init all subsystems, stack switch, launch first user process
├── Panic — kpanic(): print, halt
├── Memory Management
│   ├── PMM — bitmap frame allocator, 4 KB pages, zeroed frames
│   ├── VMM — 4-level page tables, per-process address spaces
│   └── Heap — free-list kmalloc/kfree, 16 MB
├── CPU Tables & Interrupts
│   ├── GDT/TSS — kernel/user segments, double-fault stack
│   └── IDT/ISR — 256 vectors, exception + IRQ dispatch
├── Hardware Drivers
│   ├── Serial (COM1), PIC, PIT (1000 Hz tick)
│   ├── PS/2 Keyboard + Mouse
│   └── ATA PIO disk driver
├── Filesystem
│   ├── GPT partition table parser
│   ├── FAT32 read/write with subdirectories
│   └── VFS layer with full path resolution
├── Process Management
│   ├── Preemptive round-robin scheduler (20 ms slices)
│   ├── ELF64 loader
│   ├── Per-process page tables + heap
│   └── 34 syscalls (process, file I/O, graphics, windowing)
└── Graphics & Windowing
    ├── Double-buffered 2D drawing engine
    ├── Layered compositor (16 windows, Painter's Algorithm)
    └── Taskbar, title bars, mouse cursor, drag/focus

User Space
├── libasos — freestanding C runtime (printf, malloc, string, syscall wrappers)
├── Desktop — init process, spawns shell, drives render loop
├── Shell — ~20 built-in commands, external program execution
├── Calculator, Drawing App, Text Editor, DOOM
└── All linked at 0x400000, start.asm → main → exit
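The "Shared Boot Info" struct in the tree above is the handoff point between the bootloader and the kernel. As a rough sketch of what such a struct tends to contain — field names and layout here are my assumptions, not ASOS's actual definition:

```c
#include <stdint.h>

/* Hypothetical boot-info handoff struct. The bootloader fills this in
 * before exiting UEFI boot services and passes a pointer to it at the
 * kernel entry point. Illustrative only — not ASOS's real layout. */
typedef struct {
    uint64_t framebuffer_base;              /* GOP framebuffer physical address */
    uint32_t fb_width, fb_height, fb_pitch; /* pixels, and bytes per scanline */
    uint64_t memmap_base;                   /* copy of the UEFI memory map */
    uint64_t memmap_entries;
    uint64_t memmap_desc_size;              /* descriptor stride, not sizeof() */
    uint64_t kernel_phys_base;              /* where the kernel ELF was loaded */
} boot_info_t;
```

Everything the kernel needs before it has drivers — where the screen is, which RAM is usable — has to travel through this one struct.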

Getting this architecture right before writing code was non-negotiable. And this is where trusting Claude's architectural judgment had its first failures. The prime example: Claude proposed a decoupled data-disk strategy — keeping data on a separate disk image from the boot disk — which spawned a family of persistent bugs it could not debug its way out of. Only when I explicitly challenged the approach did the model concede it was a bad path. The fix was reverting to a unified disk image.

That experience was a turning point. From then on, I committed to interrogating every milestone, technical dependency, and high-stakes decision before signing off on it, to keep the architecture coherent.

The Milestone Roadmap

With the architecture mapped out, I broke the entire project into discrete milestones — each one a self-contained chunk of work that I could hand to Claude Code as a single prompt. This is the actual sequence I followed:

  1. UEFI Bootloader — Load the kernel ELF into memory, set up a framebuffer via GOP, exit boot services, jump to the kernel. First sign of life: "ASOS" printed to both serial and screen.
  2. GDT + IDT + Exception Handlers — Set up a proper 64-bit GDT, load an IDT, wire up handlers for division error, page fault, general protection fault, and double fault. Now a fault gives a diagnostic message instead of a triple-fault reboot.
  3. Physical Memory + Virtual Memory + Kernel Heap — Parse the UEFI memory map, build a bitmap frame allocator, set up 4-level page tables (replacing the UEFI-provided ones), map the kernel at a higher-half virtual address, and get a free-list heap allocator working.
  4. Interrupts + Keyboard + Timer — PIC initialization, PS/2 keyboard driver, PIT programmed at 1000 Hz (1 ms tick for scheduling granularity later). After this: typing characters and measuring time.
  5. Storage + FAT32 Read — ATA PIO disk driver, GPT partition table parsing, read-only FAT32. Files can now be loaded from disk.
  6. Process Management (3 sub milestones) — Kernel threads first, then a preemptive round-robin scheduler, then full ring-3 user processes with TSS setup for ring transitions.
  7. Syscall Interface + ELF Loader (2 sub milestones) — SYSCALL/SYSRET via MSRs, a syscall table (read, write, exit, getpid, yield, sbrk, waitpid, spawn), and an ELF64 loader to run binaries from the FAT32 volume.
  8. Minimal C Runtime + Shell (2 sub milestones) — A freestanding libc (libasos) with printf, malloc, and string ops. A shell that reads commands and launches ELF binaries.
  9. Shell Ecosystem Syscalls — A rapid sequence of kernel syscalls to support the shell: readdir (for ls), pidof (find PID by name), kill + proclist (process management), working directory support (getcwd/chdir), filesystem stats (fsstat), and file I/O with offset/size control.
  10. FAT32 Write Support (2 sub milestones) — Write support for the filesystem, then subdirectory support (mkdir, rename, move, copy).
  11. Full Shell Commands — All ~20 built-in commands wired up: help, ls/l, cd/go, pwd/path, mkdir/md, touch/new, cp/copy, mv/move, rm/del, cat/show, head/top, tail/bottom, echo/say, kill/end, df/disk, clear/clean.
  12. PS/2 Mouse Driver — Mouse input for the GUI layer ahead.
  13. Graphics Framebuffer Library — Double-buffered 2D drawing engine with an 8×16 VGA font.
  14. Window Manager + Compositor — Layered compositor with Painter's Algorithm, up to 16 windows, title bars, focus, drag.
  15. Desktop Environment — Desktop process as init (PID 2), spawning the shell, driving the render loop.
  16. Terminal Emulator — A proper terminal window inside the desktop.
  17. GUI Toolkit + Syscall API — Windowing syscalls (win_create, win_update, key_poll) so user-space apps can create and manage windows.
  18. Launcher + Taskbar — A launcher button on the taskbar with shutdown and shortcuts to GUI apps.
  19. Desktop Apps — Calculator, text editor, and drawing app.
  20. Window Focus Z-ordering — Clicking a window brings it to front.
  21. Port DOOM — The original 1993 DOOM, running on a custom OS.
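To give a flavor of what a milestone like #3 actually entails, here's a minimal sketch of a bitmap frame allocator of the kind described above. All names and sizes are my assumptions for illustration, not ASOS's code:

```c
#include <stdint.h>
#include <stddef.h>

#define FRAME_SIZE 4096
#define MAX_FRAMES (128u * 1024 * 1024 / FRAME_SIZE)  /* 128 MB of RAM */

/* One bit per 4 KB frame: 1 = used, 0 = free. */
static uint8_t frame_bitmap[MAX_FRAMES / 8];

static void frame_set(size_t i)   { frame_bitmap[i / 8] |=  (uint8_t)(1u << (i % 8)); }
static void frame_clear(size_t i) { frame_bitmap[i / 8] &= (uint8_t)~(1u << (i % 8)); }
static int  frame_test(size_t i)  { return frame_bitmap[i / 8] & (1u << (i % 8)); }

/* First-fit scan; returns a physical address, or 0 if out of memory.
 * Frame 0 is treated as reserved so a return of 0 can mean "no memory".
 * A real PMM would also zero the frame and skip UEFI-reserved regions. */
uint64_t pmm_alloc_frame(void) {
    for (size_t i = 1; i < MAX_FRAMES; i++) {
        if (!frame_test(i)) {
            frame_set(i);
            return (uint64_t)i * FRAME_SIZE;
        }
    }
    return 0;
}

void pmm_free_frame(uint64_t addr) {
    frame_clear(addr / FRAME_SIZE);
}
```

The VMM and heap milestones then layer on top: page tables consume frames from this allocator, and kmalloc carves up pages the VMM has mapped.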

Each milestone was its own prompt cycle: I'd describe the goal to Claude (web app), it would generate a detailed implementation prompt, and Claude Code would execute it. The milestones built on each other — you can't write a shell without a filesystem, can't run user apps without a syscall interface, can't port DOOM without a graphics engine. Getting this ordering right was half the battle.


The Build: Day by Day

Days 1–2: Bootloader Through Memory Management

UEFI bootloader, GDT/IDT, physical memory manager (bitmap allocator), virtual memory manager (4-level page tables), kernel heap. This was the foundation, and it went relatively smoothly using Sonnet 4.6 on Claude Code.

One thing became immediately clear: Claude Code faces compilation errors after almost every milestone. C and Assembly for x86-64 kernel development is unforgiving — a wrong flag, a misaligned struct, a missing section .note.GNU-stack directive in NASM, and nothing links. But CC consistently managed to diagnose and fix its own compilation errors without intervention.

Days 3–4: Interrupts, Keyboard, Disk, FAT32

PIC initialization, PIT timer at 1000 Hz, PS/2 keyboard driver, ATA PIO disk I/O, GPT parsing, and then — FAT32.

FAT32 is where things went sideways.

The filesystem implementation introduced a class of bugs that Sonnet couldn't recover from. I burned through attempts and back-and-forths, and eventually the boot process itself broke. I switched to Opus for Claude Code, and it spent approximately 30 minutes untangling the mess — finding issues that Sonnet had been circling around without converging on fixes.

This taught me an expensive lesson about model selection: using Sonnet on complex tasks to save on usage limits was a false economy. The multiple back-and-forths cost more tokens total than Opus would have spent solving it correctly the first time. From that point on, my rule became: default to Sonnet, but switch to Opus immediately for anything genuinely complex.
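FAT32's failure mode makes some sense when you look at the logic involved: almost everything hinges on walking linked cluster chains through the FAT, where one unmasked bit or off-by-one corrupts the entire traversal. A simplified sketch of that core operation, using an in-memory FAT for illustration (not ASOS's actual implementation):

```c
#include <stdint.h>
#include <stddef.h>

#define FAT32_EOC 0x0FFFFFF8u  /* entries at or above this mark end-of-chain */

/* Follow a FAT32 cluster chain starting at `first`, writing up to `max`
 * cluster numbers into `out`. Returns the chain length. Real code reads
 * FAT sectors from disk; note each entry must be masked to 28 bits
 * (the top 4 bits are reserved) — a classic source of subtle bugs. */
size_t fat32_walk_chain(const uint32_t *fat, uint32_t first,
                        uint32_t *out, size_t max) {
    size_t n = 0;
    uint32_t c = first & 0x0FFFFFFF;
    while (n < max && c >= 2 && c < FAT32_EOC) {
        out[n++] = c;
        c = fat[c] & 0x0FFFFFFF;
    }
    return n;
}
```

Write support (milestone 10) is where it gets truly hairy: allocating clusters means updating the FAT, the directory entry, and the FSInfo sector consistently — miss one and the volume slowly corrupts.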

Day 5: Interactive Shell

On day 5, I had an OS booting into an interactive shell. The shell supported ~20 built-in commands (ls, cd, cat, cp, mv, rm, mkdir, etc.) and could execute external ELF programs via spawn + waitpid.
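A shell at this level is mostly a read-parse-dispatch loop: look the command up in a built-in table, and if it's not there, fall back to spawn + waitpid on an ELF binary. A rough sketch of the dispatch half — command names are from the post, but the table mechanics are my assumption:

```c
#include <string.h>
#include <stdio.h>

/* Hypothetical built-in command table, illustrative only. */
typedef int (*cmd_fn)(int argc, char **argv);

static int cmd_echo(int argc, char **argv) {
    for (int i = 1; i < argc; i++)
        printf("%s%s", argv[i], i + 1 < argc ? " " : "");
    printf("\n");
    return 0;
}

static const struct { const char *name; cmd_fn fn; } builtins[] = {
    { "echo", cmd_echo }, { "say", cmd_echo },  /* aliases share a handler */
};

/* Returns the built-in's exit code, or -1 if not found — in which case
 * the real shell would try spawn() + waitpid() on a FAT32 binary. */
int shell_dispatch(int argc, char **argv) {
    for (size_t i = 0; i < sizeof builtins / sizeof builtins[0]; i++)
        if (strcmp(argv[0], builtins[i].name) == 0)
            return builtins[i].fn(argc, argv);
    return -1;
}
```

The alias pairs ASOS ships (ls/l, cd/go, rm/del, …) fall out of this design for free: two table entries, one handler.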

I want to be honest about how this felt: beyond belief. Five days from an empty repo to an interactive operating system with a filesystem and process management. This is the kind of thing that would take a solo developer months, maybe longer, working from scratch.

Days 6–8: Graphics, Window Manager, Desktop, DOOM

Once the base OS was solid, I moved into higher-level territory: a double-buffered 2D graphics engine, a layered window compositor supporting 16 windows, a desktop environment with a taskbar and system clock, and four GUI applications.

Two interesting things happened here:

  1. Features started consuming fewer tokens. The higher-level code could lean on the kernel infrastructure already in place. A new user-space app is fundamentally simpler than a page table walker.

  2. Even Opus started leaving bugs. The window manager and desktop environment were where Opus finally hit its limits. Multiple rounds of debugging were needed — the compositor's layer ordering, focus handling, mouse hit-testing, and the interaction between the render loop and the shell's I/O all created subtle issues that required iteration.
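The compositor bugs are unsurprising given how stateful this layer is, even though Painter's Algorithm itself is simple: draw windows back-to-front so higher windows overpaint lower ones, and "bring to front" is just moving a window to the end of the z-order. A minimal sketch of that bookkeeping (ASOS's actual structures will differ):

```c
#include <stddef.h>

#define MAX_WINDOWS 16

/* z-order: index 0 = bottom, count-1 = top (drawn last, so it wins). */
static int z_order[MAX_WINDOWS];
static size_t z_count = 0;

void wm_add_window(int id) {
    if (z_count < MAX_WINDOWS) z_order[z_count++] = id;
}

/* Click-to-focus: move `id` to the top of the z-order. */
void wm_raise(int id) {
    size_t i = 0;
    while (i < z_count && z_order[i] != id) i++;
    if (i == z_count) return;          /* unknown window */
    for (; i + 1 < z_count; i++)       /* shift everything above it down */
        z_order[i] = z_order[i + 1];
    z_order[z_count - 1] = id;         /* place on top */
}

/* The compositor then walks z_order bottom-to-top, blitting each window's
 * buffer into the back buffer — Painter's Algorithm. */
```

The hard part isn't this; it's everything that consults the z-order concurrently — mouse hit-testing must search top-down while rendering goes bottom-up, and focus changes race with the render loop.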

Porting DOOM was its own adventure. It took Opus roughly 2–3 hours of total work to get the original 1993 DOOM running on ASOS — fixing incompatibilities, adapting the rendering pipeline, wiring up keyboard input. The game runs and is playable, though with some graphical flickering I chose not to spend more time on.
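Much of a DOOM port to a custom OS is exactly this kind of glue: DOOM renders into an 8-bit paletted 320×200 buffer, while a modern framebuffer wants 32-bit pixels, so the port needs a palette-expansion blit somewhere in its render path. A hedged sketch of the idea — not the actual ASOS port code:

```c
#include <stdint.h>
#include <stddef.h>

/* Expand an 8-bit indexed frame (DOOM's native output) into a 32-bit
 * framebuffer. `palette` holds 256 ARGB entries, which a real port would
 * build from DOOM's PLAYPAL lump. Illustrative glue code only. */
void blit_indexed(const uint8_t *src, uint32_t *dst,
                  const uint32_t *palette, size_t npixels) {
    for (size_t i = 0; i < npixels; i++)
        dst[i] = palette[src[i]];
}
```

The rest of the glue — timer ticks for the game loop, keyboard scancodes mapped to DOOM's key events, file I/O for the WAD — is the same pattern: small adapters between DOOM's assumptions and the OS's syscalls.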

Interestingly, porting DOOM was harder than many core OS components. My theory: building an OS from scratch is well-represented in the model's training data (OSDev wiki, xv6, countless hobby kernels). Porting a specific game to a novel, custom OS is a far more unusual task. The model had fewer patterns to draw from.


The Prompt Engineering That Actually Mattered

The single most impactful technique was using one Claude to generate prompts for another Claude.

My initial prompts to Claude Code were... fine. They worked for simple things. But OS development has an enormous surface area of decisions, and every unspecified detail becomes a default decision CC makes on its own — and those defaults can create entire classes of bugs.

So I developed a workflow:

  1. Describe the next milestone to Claude (web app)
  2. Claude generates a comprehensive, detailed prompt with all the decisions made explicit (or advises breaking the work into multiple sub-milestones)
  3. Feed that prompt to Claude Code

The prompts Claude generated were probably unnecessarily long and granular for the higher-level milestones. Later in the project, I switched to Gemini as my consultant (to save Claude usage limits), and Gemini produced shorter prompts that still got the job done. But for the low-level kernel work — bootloader, memory management, context switching — I believe that level of detail was genuinely necessary.

Audit Prompts

For complex, high-risk implementations, I used what I call "audit prompts" — prompts specifically designed to review and catch errors in code that CC had just written. These were separate from the implementation prompts, and they consistently found bugs that the initial implementation missed.

I also asked Claude to generate a prompt that instructs CC to verify everything after high-complexity steps. This kind of meta-prompting — using the AI to improve its own verification process — was surprisingly effective and managed to catch some issues even before the first execution.

The CLAUDE.md Force Multiplier

Generating a comprehensive CLAUDE.md file for the project was one of the highest-ROI things I did. This file describes the project structure, build system, architecture, conventions, and critical implementation notes. It means Claude Code doesn't waste tokens rediscovering the project every session. I estimate this alone reduced token consumption by 20–30%.
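For reference, a CLAUDE.md along these lines might be structured like this — a generic skeleton with invented specifics, not the actual ASOS file:

```markdown
# CLAUDE.md — (illustrative skeleton, not the real ASOS file)

## Project
Hobbyist x86-64 OS in C + NASM. No libc, no Linux. Boots via UEFI.

## Build
- How to build the bootloader, kernel, disk image, and user apps.
- Exact cross-toolchain and flags, and how to launch QEMU.

## Layout
- `boot/` UEFI bootloader · `kernel/` core, drivers, fs, scheduler
- `user/` libasos + apps, all linked at 0x400000

## Conventions / gotchas
- Kernel virtual-memory layout; never dereference physical addresses directly.
- Every NASM file needs a `.note.GNU-stack` section or linking fails.
- Where the syscall table lives and how to add a syscall end-to-end.
```

The "gotchas" section earns its keep the most: each hard-won debugging lesson written down there is a lesson the agent never has to relearn in a fresh session.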


Where Claude Failed — And What That Tells Us

Architectural Overconfidence

Early on, I fully trusted Claude with architectural decisions. It was very confident about its choices. Later, when those decisions led to failures, it essentially said "yeah, that was actually a bad approach," which quickly eroded my trust in its architectural confidence.

This is a pattern I've seen repeatedly: LLMs will present architectural decisions with high confidence regardless of whether they'll actually work. They don't model uncertainty well at the system design level. You need a human who understands architecture and system design reviewing these decisions before committing to them.

The Comprehension Debt Problem

This is the most important thing I learned, and it generalizes to every AI-assisted project.

AI generates massive amounts of code in very little time. At some point during ASOS, I reached a threshold where I couldn't fully understand what Claude was explaining to me about the code it had written. The low-level x86-64 paging details, the specifics of the SYSCALL/SYSRET MSR configuration, the FAT32 cluster chain traversal — I was forced to trust Claude's smaller decisions because being fully involved in every single one would have made the project take 10–100x longer or more.

That's the bargain. And for a hobby project, it's fine. But for professional work, this is a critical concern: comprehension debt accumulates silently as AI writes code faster than humans can review it. If you're shipping to production, someone needs to understand every line. The time you saved generating code gets paid back during review — or worse, during an incident.

Dependency Hell

I got into a prolonged battle with dependencies — toolchain setup, cross-compilation flags, library paths. This is exactly the kind of environment-specific, state-dependent problem that AI agents struggle with, because the solution depends on your specific machine state rather than universal programming knowledge.


Concrete Lessons for AI-Assisted Development

On Token Economy

I was on the Pro plan and hit my daily usage limit every single day. Here's what I learned about managing token consumption:

  • Use an external AI (Gemini, etc.) as your consultant instead of the Claude web app. Reserve Claude's limits for Claude Code execution.
  • /clear aggressively between unrelated tasks. Free, keeps sessions lean.
  • /compact to shrink context mid-session when you want continuity without the full history.
  • Default to Sonnet, switch to Opus only for genuinely hard problems. Opus consumes roughly 3x more tokens per request.
  • Invest heavily in CLAUDE.md. The upfront cost of documenting your project saves tokens on every subsequent session.
  • Write specific, detailed prompts. Vague prompts cause back-and-forth that burns tokens on clarification rather than progress.

On What Humans Actually Need to Do

AI coding agents are remarkably capable. But after building an entire OS with one, here's what I believe the human role needs to be:

  • Human approval for high-impact decisions. The model shouldn't unilaterally choose your memory layout, your syscall ABI, or your filesystem format.
  • Observability into reasoning, not just results. You need to understand why the agent made a choice, not just see the diff.
  • Clear override and rollback paths. When the agent goes down a bad path (and it will), you need to be able to revert cleanly.
  • Feedback loops to correct behavior. The agent learns within a session but not across sessions. Your CLAUDE.md and your prompts are the feedback mechanism.

And most importantly: depending on the project's size and complexity, you don't just need a human in the loop — you need a software engineer in the loop. Someone who understands architecture, system design, security, and the overall vision. AI is a force multiplier, but zero times any multiplier is still zero.

On Model Selection and Language Choice

  • The model's fluency in a language matters more than the language's theoretical advantages. C won over Rust here because the model could execute more reliably in C.
  • Claude is demonstrably better at systems programming tasks (like this OS) than at complex and novel business logic — at least in my experience as a senior SWE who works with both daily.
  • Harder doesn't always mean "more complex code." Porting DOOM was harder than writing the scheduler because it was a more novel task relative to training data. Novelty, not complexity, is what trips these models up.

The Numbers

| Metric | Value |
| --- | --- |
| Total lines of code (excluding DOOM) | 25,709 |
| Languages | C, x86-64 Assembly (NASM) |
| Source files | 331 |
| Git commits | 82 |
| Calendar days | 8 |
| Kernel syscalls | 34 |
| User-space applications | 8 (shell, desktop, calculator, drawing app, text editor, graphics test, loop test, DOOM) |
| Compositor windows | Up to 16 |
| Model used (execution) | Claude Sonnet 4.6 → Opus 4.6 |
| Model used (consulting) | Claude Opus → Gemini |
| Daily usage limit hit | Every single day |
| Cost in extra usage | ~$125 AUD / ~$88 USD |

Final Thoughts

Eight days ago, I had an empty repository and a question: how hard can it be?

The answer: extremely hard — but possible in a way it absolutely wasn't before. I went from zero to a desktop operating system with a window manager, a shell, GUI applications, and DOOM, in just over a week. That's not because OS development got easier. It's because the leverage AI provides is genuinely transformative when applied to the right kind of problem.

But the AI didn't build this OS. I built this OS, with AI. Every major architectural decision went through me. Every milestone was planned by a human who understood (or learned to understand) what an operating system actually needs. The AI wrote the code, which has traditionally been the biggest bottleneck. But the engineering judgment, the vision, the "wait, that's a terrible idea" moments — those were mine.


ASOS is open source: github.com/smadi-a/ASOS
