An AI wrote a working operating-system kernel from scratch in 38 minutes

#codingagents #anthropic #capabilities #systems

An Anthropic model built a bootable operating-system kernel from an empty folder in roughly thirty-eight minutes of compute time, across about two hundred unassisted back-and-forth turns. The kernel boots inside an emulator and passes its own built-in tests. Work of this kind normally takes specialists months.

Key facts

What: A blow-by-blow log shows one of the now-suspended models building bootable low-level systems code from an empty folder, the kind of feat that made regulators nervous.
When: 2026-06-22
Primary source: read the source

The full write-up is here: Tolmo: When the model writes the kernel. A kernel is the innermost core of an operating system: the part that talks directly to hardware, manages memory, and decides which program runs when. It is among the hardest, most unforgiving code in software: a single wrong assumption about how the processor works yields a dead screen with no error message. Kernels are normally the domain of small specialist teams working for months. Watching a model take an empty folder to a booting kernel in well under an hour is like handing a robot raw steel and an empty lot, then coming back to find a small, running engine.

The headline oversells it. What the model built is a minimal kernel shaped like the core of Windows — it boots and runs its self-checks, but it is not a full operating system. There is no login, no place to run programs; it is the engine block, not the finished car. It runs inside an emulator rather than on a real laptop. "An AI wrote Windows" is wrong. "An AI wrote, unassisted, the hardest layer of a real operating system, well enough to boot and self-test, in the time it takes to watch a sitcom" is right, and that is startling enough.

There is a near-poetic detail buried in the write-up: the project ran longer than the original session, and the later stretch had to switch to a different, older model because the model that started the job had been export-suspended partway through. That is the very shutdown described in this week's bigger story. The kernel demo is a live illustration of the exact capability that got the model pulled, cut short by the suspension itself.

How does a language model do this? It uses the same underlying machinery as chatbots, a system trained to predict the next chunk of text, wrapped in a loop that lets it act like a developer: write a file, try to compile it, read the error, fix it, try again, run the tests, repeat. That tight feedback cycle separates a model that can describe a kernel from one that can produce a working one. Each failed compile is information, and the model folds that information back in until the thing boots. For a broader picture of how these self-directed coding systems work, see our explainer on AI agents.
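For readers who think in code, the write-compile-fix cycle described above can be sketched in a few lines of Python. This is an illustration only, not the harness used in the write-up, which this article does not describe. Every name here (agent_loop, propose, build, toy_build, make_toy_model) is hypothetical, the "model" and the "compiler" are toy stand-ins, and the 200-turn cap simply echoes the rough turn count reported above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class BuildResult:
    ok: bool   # True when the build and self-tests succeed
    log: str   # compiler or test output that is fed back to the model


def agent_loop(
    propose: Callable[[Optional[str]], str],
    build: Callable[[str], BuildResult],
    max_turns: int = 200,  # illustrative cap, not a figure from any real harness
) -> Optional[Tuple[int, str]]:
    """Ask for source, try to build it, feed the error back, and repeat."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        source = propose(feedback)   # the "model" writes or revises the code
        result = build(source)       # compile and run the self-tests
        if result.ok:
            return turn, source      # success: stop iterating
        feedback = result.log        # each failure becomes the next input
    return None                      # gave up without a working build


# Toy stand-ins so the sketch runs without a real model or compiler.
REQUIRED = ["boot_entry", "memory_map", "scheduler"]  # hypothetical symbols


def toy_build(source: str) -> BuildResult:
    """Pretend compiler: fails on the first required symbol that is missing."""
    for name in REQUIRED:
        if name not in source:
            return BuildResult(False, f"undefined symbol: {name}")
    return BuildResult(True, "build ok, self-tests passed")


def make_toy_model() -> Callable[[Optional[str]], str]:
    """Pretend model: adds whatever symbol the last error complained about."""
    draft = []

    def propose(feedback: Optional[str]) -> str:
        if feedback is not None:
            draft.append(feedback.split(": ", 1)[1])
        return "\n".join(draft)

    return propose


if __name__ == "__main__":
    outcome = agent_loop(make_toy_model(), toy_build)
    print(outcome)
```

The point of the sketch is the shape of the loop, not the stubs: nothing in agent_loop knows how to write a kernel. All of the progress comes from routing each failure message back into the next attempt, which is why a cap on turns is the only stopping condition other than success.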

Why it matters is straightforward and double-edged. The same ability that lets a model stand up systems code from scratch is the ability that lets it understand, and potentially exploit, the systems code everyone else relies on. That dual-use quality is precisely what made this capability tier a target for the new oversight rules. It is also why this single anecdote has been passed around so widely: it is concrete in a way that benchmark charts never are. You don't need to trust a score; you can read the log.

The honest caveat: this is one impressive run, documented by one developer, and a curated success story is not the same as reliability. We don't see how many attempts failed, how brittle the result is, or how it would fare on hardware that doesn't behave as politely as an emulator. A model that can do this once under good conditions is genuinely remarkable; a model that can do it on demand, every time, would be a different and more consequential thing — and that second claim isn't established here.

Originally published on Ground Truth, where every claim is checked against the primary source.