A tiny Darwin Gödel Machine that edits itself and keeps only changes that verifiably score higher.
TL;DR: I built a small program that improves itself. It looks at the tasks it's failing, edits its own code to fix them, and keeps a change only if the change raises its score on a test. It goes from passing 1 of 8 tasks to 8 of 8, and the program, not a person, chose and applied every one of those fixes. It runs on a laptop in under a second. No fancy hardware, no API key.
The old dream: software that improves itself
Normally, software only gets better when we make it better. You write code, you find a bug, you fix it, you ship again. The program never improves on its own.
People have wanted "software that improves itself" for decades. The classic version, Jürgen Schmidhuber's "Gödel Machine," had one rule that made it impractical to build: before the program could change a line of its own code, it had to mathematically prove the change would help. Nobody has a practical way to produce such proofs for real code, so the idea stayed on paper.
In 2025, researchers found a way around it with the Darwin Gödel Machine. They dropped the "prove it first" rule and replaced it with something every engineer already trusts:
Try the change. Run the tests. If the score went up, keep it. If not, throw it away.
That's it. It's basically how we all work: make an edit, run the test suite, keep what passes. The twist is that the program is the one making the edits. In the paper, this let an AI coding agent improve its own tooling and go from solving 20.0% to 50.0% of SWE-bench, a hard benchmark built from real GitHub issues.
I wanted to actually see this happen, so I built the tiniest version I could.
The 10-second version
| | Start | After improving itself |
|---|---|---|
| What it can do | only uppercase | learned 6 more skills on its own |
| Test score | 🔴 1 / 8 | 🟢 8 / 8 |
| Who wrote the fixes? | n/a | the program did |
Start: ███░░░░░░░░░░░░░░░░░░░░░ 1/8 (only knows: uppercase)
+reverse ██████░░░░░░░░░░░░ 2/8
+dedup_csv █████████░░░░░░░░░ 3/8
+sum_csv ████████████░░░░░░ 4/8
+sort_csv ███████████████░░░ 5/8
+title ██████████████████ 6/8
+normalize_inputs ████████████████████ 8/8 ← one fix unlocked TWO tasks
✅ SOLVED 8/8
How it works (the whole thing)
There are only three pieces.
1. The "agent" is just a bag of skills. Each skill is a tiny function — uppercase text, reverse it, sort a list, etc. It starts out knowing almost nothing.
2. A test with known answers. Every task has a correct answer, so checking the score is a plain equality check: output == expected. No human grades it and no second AI judges it. Either it got the right answer or it didn't. (This "write a checker, then measure" idea is closely related to how today's reasoning models are trained on problems whose answers can be checked automatically.)
3. The loop. Over and over: look at what's failing, add one skill to try to fix it, re-run the test, and keep the change only if the score went up. It also saves every improved version, so it can branch off any of them later instead of getting stuck.
new_version = old_version + add_a_skill(things_it_is_failing)
if score(new_version) > score(old_version):  # did the test score actually improve?
    keep(new_version)                        # yes -> save it and build on it
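To make the three pieces concrete, here is a minimal runnable sketch of the same loop. It is not the code from the repo: `SKILL_LIBRARY`, `TASKS`, `failing`, and `improve` are hypothetical names I made up for illustration, the benchmark has only three toy tasks, and each task is assumed to name the one skill it needs. It also always builds on the newest saved version, which is a simplification of "branch off any of them."

```python
from typing import Callable, Dict, List, Tuple

Agent = Dict[str, Callable[[str], str]]

# Hypothetical fixed menu of skills the agent is allowed to add to itself.
SKILL_LIBRARY: Agent = {
    "uppercase": str.upper,
    "reverse": lambda s: s[::-1],
    "sort_csv": lambda s: ",".join(sorted(s.split(","))),
}

# Hypothetical benchmark: (skill the task needs, input, expected output).
TASKS: List[Tuple[str, str, str]] = [
    ("uppercase", "abc", "ABC"),
    ("reverse", "abc", "cba"),
    ("sort_csv", "b,c,a", "a,b,c"),
]


def score(agent: Agent) -> int:
    """Count tasks whose output equals the known answer."""
    return sum(
        1
        for name, text, expected in TASKS
        if name in agent and agent[name](text) == expected
    )


def failing(agent: Agent) -> List[str]:
    """Names of the skills needed by tasks the agent does not pass yet."""
    return [
        name
        for name, text, expected in TASKS
        if name not in agent or agent[name](text) != expected
    ]


def improve(agent: Agent) -> List[Agent]:
    """Add one skill at a time; keep a version only if its score went up."""
    archive = [dict(agent)]  # every kept version, oldest first
    improved = True
    while improved:
        improved = False
        parent = archive[-1]  # simplification: always build on the newest
        for name in failing(parent):
            if name not in SKILL_LIBRARY:
                continue  # no known fix for this task
            candidate = {**parent, name: SKILL_LIBRARY[name]}
            if score(candidate) > score(parent):
                archive.append(candidate)  # keep it and build on it
                improved = True
                break
    return archive


archive = improve({"uppercase": str.upper})
print([score(version) for version in archive])
```

The archive is a plain list here. A version closer to the paper would pick a parent from anywhere in the archive rather than always the last entry.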
The cool part: small fixes unlock big ones
One of the skills it adds, "clean up the input" (trim weird spacing), does nothing by itself. But the agent had earlier learned a "title-case" skill that kept breaking on messy text like " the quick fox ". The moment it adds the cleanup step, two stuck tasks start passing at once — that's the +2 jump at the end.
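Here is a small sketch of why the cleanup step matters. The function bodies are my own guesses at what such skills could look like, not the repo's code, and the messy input is a hypothetical example with extra spaces.

```python
def title(text: str) -> str:
    """Naive title-case: splits on single spaces, so odd spacing survives."""
    return " ".join(word.capitalize() for word in text.split(" "))


def normalize_inputs(text: str) -> str:
    """Trim the ends and collapse runs of whitespace to one space."""
    return " ".join(text.split())


messy = "  the   quick fox "  # hypothetical messy input
expected = "The Quick Fox"

print(title(messy) == expected)                    # title-case alone
print(normalize_inputs(messy) == expected)         # cleanup alone
print(title(normalize_inputs(messy)) == expected)  # cleanup, then title-case
```

Only the last line prints `True`: neither skill passes the task alone, but the combination does. Any other task that trips over the same messy spacing gets fixed by the same cleanup step.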
This is the whole point in miniature: the agent isn't just adding features. It's making itself better at getting better. A boring little fix becomes the stepping stone that makes later fixes work. The real research sees the same thing at full scale — the AI invents helpers like "try a few solutions and pick the best one," which then make every future fix more effective.
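For a feel of what a "try a few solutions and pick the best one" helper might look like, here is a rough sketch. It is my own toy illustration, not the helper from the paper: `pick_best` is a hypothetical name, and the candidates are ordinary string functions standing in for generated patches.

```python
from typing import Callable, List, Tuple


def pick_best(
    candidates: List[Callable[[str], str]],
    cases: List[Tuple[str, str]],
) -> Callable[[str], str]:
    """Return the candidate that passes the most (input, expected) cases."""

    def passes(fn: Callable[[str], str]) -> int:
        count = 0
        for text, expected in cases:
            try:
                if fn(text) == expected:
                    count += 1
            except Exception:
                pass  # a crashing candidate simply earns no credit
        return count

    # max() returns the first of several equally good candidates.
    return max(candidates, key=passes)


best = pick_best(
    [str.lower, str.title, str.upper],
    [("ab cd", "Ab Cd"), ("x", "X")],
)
print(best("hello world"))
```

The helper does not fix any task itself. It makes every later attempt more likely to succeed, which is the "better at getting better" part.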
Why I think this is where things are going
For ten years, the way to make AI better was: make the model bigger. The newer idea is to make it improve itself while it runs:
- This post — an agent that rewrites its own code.
- "Language Models Need Sleep" (2026) — agents that tidy up their own memory during an offline "sleep."
- Small models that think harder instead of being bigger.
The common thread: improvement is shifting from us retraining the model to the program improving itself, with a simple test telling it whether each change was good. Software that edits itself starts to feel less like a fixed program and more like something that grows.
Try it (under a minute)
git clone https://github.com/Shridhar-2205/living-software
cd living-software/01-self-rewriting-agent
python demo_cli.py # watch the score climb 1/8 → 8/8
pytest -q # the same claims, as automated tests
One honest note on safety: a real self-rewriting agent runs code it wrote itself, which is risky. In my version the "edits" come from a fixed list of safe skills, so no generated code ever runs. The loop has the same shape as the one in the research, without that risk. (The real one runs inside a sandbox for exactly this reason.)
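As a sketch of the "fixed list of safe skills" idea: an edit is only a name looked up in a table written by hand, and nothing is ever passed to `eval` or `exec`. The names `SAFE_SKILLS` and `apply_edit` are hypothetical and are not taken from the repo.

```python
from typing import Callable, Dict

# Hypothetical allowlist: the only code an "edit" can ever introduce.
SAFE_SKILLS: Dict[str, Callable[[str], str]] = {
    "uppercase": str.upper,
    "reverse": lambda s: s[::-1],
}


def apply_edit(
    agent: Dict[str, Callable[[str], str]], skill_name: str
) -> Dict[str, Callable[[str], str]]:
    """Return a new agent with one allowlisted skill added."""
    if skill_name not in SAFE_SKILLS:
        raise ValueError(f"unknown skill: {skill_name!r}")
    return {**agent, skill_name: SAFE_SKILLS[skill_name]}


agent = apply_edit({}, "reverse")
print(agent["reverse"]("abc"))
```

A proposed edit that names anything outside the table is rejected before any code runs.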
The takeaway
The old dream needed a mathematical proof before changing any code. The new version just needs a test. If you can write a check that says "this got better," you can let a program improve itself — and watch it find clever fixes you never wrote.
Shridhar Shah — Senior Software Engineer on the AI team at Cisco. Part 1 of Toward Living Software.
Source: Zhang, Hu, Lu, Lange, Clune, "Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents," arXiv:2505.22954 (2025) — reports SWE-bench 20.0% → 50.0%.