AAoM-04: A Python 3.12 Interpreter


January 2026

Happy New Year! This entry covers moonpython: a Python interpreter in MoonBit. After about two weeks of vibe coding with Codex (GPT-5.2 & GPT-5.2-Codex), the project now runs a large, pragmatic subset of Python 3.12.

Toolchain Updates

This time I used Codex CLI with GPT-5.2 and kept three MoonBit skills active (the same set as before: moonbit-lang, moonbit-agent-guide, moon-ide). Codex is slower than Claude but much more stable on long, reasoning-heavy tasks, and that tradeoff is a good fit for a language interpreter.

The moon-ide skill recently gained three commands that are especially handy in a growing interpreter codebase:

  • hover shows hover information for a symbol.
  • outline shows an outline of a specified file.
  • doc shows documentation for a symbol.

Codex can use these tools very skillfully.

Problem

A useful subset of Python needs to capture at least:

  • Dynamic semantics: scoping, closures, globals/nonlocals, and the descriptor model.
  • Generators, async/await, and exception handling.
  • Pragmatic import support and enough builtins to run real scripts.

I targeted Python 3.12 semantics, but intentionally skipped full stdlib parity, C extensions, packaging, and bytecode compatibility. The aim is correctness for a useful subset, a clean architecture, and repeatable testing. More specifically, moonpython is meant to be used as a library to run real Python snippets, not as a CPython replacement for large production-scale projects. Given that scope, a JIT is poor ROI: it usually demands deep, platform-specific optimization for each OS and architecture, and the effort is often 5-10x the cost of building the interpreter itself.
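
To make the first bullet above concrete, here is the kind of scoping behavior even a "useful subset" already has to model. This is plain CPython semantics with made-up names (make_counter, counter_total), nothing moonpython-specific:

# Closures, nonlocal, and global: standard CPython scoping rules that any
# pragmatic subset has to get right.
counter_total = 0

def make_counter():
    count = 0
    def bump():
        nonlocal count           # rebinds the enclosing function's local
        global counter_total     # rebinds the module-level name
        count += 1
        counter_total += 1
        return count
    return bump

c = make_counter()
print(c(), c(), counter_total)   # 1 2 2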

Runtime Reality Check

During implementation I kept adding new features: most builtins, async/await, generators, type hints, exception groups, and so on. Yet real-world Python projects still failed to run. The lesson was clear: the hardest part of an industrial-strength interpreter is not syntax coverage, but whether the runtime is dirty enough.

By "dirty," I mean the unglamorous compatibility details that real code quietly depends on: import caching rules, path search order, namespace packages, module metadata (__file__, __package__, __spec__), descriptor binding semantics, edge-case exception types, and even tiny differences in string/float formatting. A clean design is not enough; you have to copy CPython's weird corners. For example, __file__ is not part of "Python-the-language", but for file-backed modules it is a widely relied-upon convention that real projects assume exists. In practice, this means the import system must be almost CPython-identical, the object model has to match descriptor and attribute rules, and error messages must be stable. This is where most of the remaining work lies, not in adding new syntax nodes.

Python has documentation and PEPs, but there is no single executable specification; on the messy edges (especially imports), compatibility is ultimately defined by what CPython does and what the ecosystem expects. This is also why I treat CPython behavior as the ground truth: it is at least verifiable, and that matters enormously when you are building with AI in the loop. The downside is that it pushes moonpython toward a lot of edge-case handling code, because matching a living ecosystem inevitably means handling its corners.

If you want a vivid reminder of how these "small" features accumulate into ecosystem constraints, Armin Ronacher’s classic post Revenge of the Types is a great read.

Approach

Test Generation

Following the previous AAoM pattern, I built the test harness first. The script scripts/generate_spec_tests.py harvests a subset of snippets from CPython's Lib/test, runs them under a restricted builtins set, and emits MoonBit snapshot tests into spec_generated_test.mbt. That single file contains 2,709 generated tests. Together with the (AI) hand-written tests, the suite currently has 2,894 tests.

A typical generated test looks like this:

test "generated/expr/0001" {
  let source =
    #|'This string will not include \
    #|backslashes or newline characters.'
  let result = Interpreter::new_spec().eval_source(source)
  let expected = "[\"ok\", [\"Str\", \"This string will not include backslashes or newline characters.\"], \"\", \"\"]"
  assert_run(result, expected)
}

Limitations are inevitable. The harvesting is heuristic and can miss real-world patterns. The sandboxed evaluator only allows a small builtins set and a few imports, so many library behaviors are simply out of scope. Snippets that take too long are skipped, and some values (like NaN/Inf or arbitrary objects) are deliberately excluded from serialization. Error reporting is normalized rather than bit-for-bit identical with CPython. I also deliberately avoided generating an unbounded flood of tests early on. Based on my experience writing a WebAssembly runtime, doing that would likely have meant moonpython passing literally zero tests at the start. When everything is red, the agent tends to over-engineer locally (especially in the parser) to satisfy a huge, noisy failure surface, and it becomes hard to settle into an iterative development rhythm.

So I kept the generator constrained to produce a manageable bootstrap suite: enough to guide the first implementation and validate core semantics, but not so large that it prevents early wins. Once the interpreter becomes usable, the CPython Lib/test suite is the real endgame. Of course, I never expected the interpreter to pass it completely.
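
For intuition about what "restricted builtins" means above, here is a heavily simplified sketch of that evaluation step, using hypothetical names (run_snippet and ALLOWED are mine, not the real script's API; the real generator also handles statements, imports, and richer serialization):

# Simplified sketch: evaluate an expression under a whitelisted builtins dict
# and serialize the outcome into the snapshot string a generated test expects.
import builtins, json

ALLOWED = {name: getattr(builtins, name)
           for name in ("len", "range", "print", "isinstance", "ValueError")}

def run_snippet(source: str) -> str:
    env = {"__builtins__": ALLOWED}
    try:
        value = eval(compile(source, "<spec>", "eval"), env)
        return json.dumps(["ok", repr(value)])
    except Exception as exc:              # normalized, not bit-for-bit CPython
        return json.dumps(["error", type(exc).__name__])

print(run_snippet("len(range(10))"))      # ["ok", "10"]
print(run_snippet("open('x')"))           # ["error", "NameError"]: open is not whitelisted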

Long-Haul Grinding

These two weeks were effectively 24/7 for Codex. I would check in a few times a day, but most of the time the agent was working on its own. In total, Codex made 104 commits during this period.

At the beginning of the conversation, we repeatedly discussed shaping the spec-driven test harness and stabilizing CPython evaluation under a restricted builtins set. Once the generator was in place, I let Codex run unattended and only stepped in when the failure rate stopped decreasing. Intervention was not about fixing individual bugs. I would interrupt Codex only when I noticed clear conceptual errors. One recurring example was a misunderstanding of how "import" fields in moon.pkg.json should be authored, which led Codex to apply the same incorrect pattern repeatedly. When that happened, I wrote the correct convention into AGENTS.md or a SKILL.md, and then restarted the agent using codex resume --last.

Results

The interpreter is a direct AST evaluator (which may not be the best choice; more on that below).

The project layout:

  • lexer.mbt and parser.mbt implement a Python 3.12 grammar.
  • A dedicated spec file defines the public AST and value model.
  • runtime_*.mbt implements the runtime: builtins, scoping, exceptions, generators, and the object model.
  • cmd/main and cmd/repl provide a runner and a simple REPL; moonpython can also be used as a library.

More language features were implemented than I had expected (the snippet after this list exercises a few of them):

  • Python 3.12 syntax for match, with, and full f-strings.
  • Generators (yield, yield from) and async (async def, await, async for/with, async generators).
  • Exception groups and except* (PEP 654), plus tracebacks with line/column spans.
  • Type parameter syntax (PEP 695) parsed and preserved in the AST (runtime no-op for now).
  • Core data model: big ints, floats, complex numbers, bytes/bytearray, lists/tuples/sets/dicts, slicing assignment.
  • A file-based import system and a vendored CPython Lib/ snapshot for pure-Python modules.
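
For a taste of what those features cover, here is a plain Python 3.12 snippet exercising match, f-strings, and except*. Everything in it is standard CPython semantics, and classify is just a made-up example function:

# match statements, f-strings, and exception groups with except* (PEP 654),
# written against standard Python 3.12 semantics.
def classify(point):
    match point:
        case (0, 0):
            return "origin"
        case (x, 0) | (0, x):
            return f"on an axis at {x}"
        case (x, y):
            return f"({x}, {y})"

try:
    raise ExceptionGroup("batch", [ValueError("bad"), KeyError("missing")])
except* ValueError as eg:
    print("values:", [str(e) for e in eg.exceptions])
except* KeyError as eg:
    print("keys:", [type(e).__name__ for e in eg.exceptions])

print(classify((3, 0)))                   # on an axis at 3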

You can try some real-world Python programs with moonpython.

All 2,894 generated tests pass. So how many tests from CPython's Lib/test pass? Zero at the moment, because that suite currently aborts early due to missing support for variable annotations (PEP 526). The feature may already be supported by the time this post is published. In any case, passing the whole Lib/test suite remains a long-term goal, and I fully expect it to land as the project matures.

Time Investment:

  • One day to find and download the relevant standards and reference material;
  • Then, about two weeks of active development, mostly spent on the runtime (scoping + generators/async) and on shrinking the long tail of test failures.

Reflections and Takeaways

The Implementation Code Is Far from Clean

As I said, useful tools have to be dirty because real-world problems are dirty. Python's import system is a great example of why the runtime must be dirty: the spec itself is messy. Fortunately, this is exactly the kind of dirty work that AI is good at: it can grind through edge cases, keep the bookkeeping consistent, and iterate until imports behave the way the ecosystem expects.
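
One concrete instance of that messiness, in plain CPython (fake_mod is a made-up name): modules are cached in sys.modules, repeated imports are served from that cache, and real code pokes at the cache directly, so an interpreter has to reproduce all of it.

# Import caching rules an interpreter must replicate (standard CPython behavior).
import sys
import json

assert __import__("json") is json         # second import hits the cache
assert sys.modules["json"] is json

sys.modules["fake_mod"] = json            # code can inject modules directly...
import fake_mod                           # ...and import is satisfied from the cache
assert fake_mod is json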

Is AST-walking Interpretation a Good Idea?

Another counterintuitive lesson is that a pure AST-walking interpreter is not always simpler than "compile to bytecode & run a bytecode VM".

When Codex first proposed this straightforward design, I didn't object, since it was the simplest path. But once you need suspend/resume semantics like generators and async/await, plus correct try/finally and fine-grained tracebacks, an AST interpreter has to handle defunctionalized continuations by hand and often ends up re-creating a minimal VM anyway. Even worse, with an AST-walking interpreter there is no natural place to hang analyses and optimizations (which is why you normally analyze programs on an IR).
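
A tiny illustration of why suspend/resume is the painful part (plain CPython; countdown is a made-up example): the paused state below is "inside a while loop, inside a try, stopped at this yield", and close() must re-enter that nesting so the finally still runs. A bytecode VM gets this almost for free by saving an instruction pointer and a block stack; an AST walker has to encode that continuation by hand.

# Suspension in the middle of nested control flow (standard CPython semantics).
def countdown(n):
    try:
        while n > 0:
            yield n            # suspension point inside loop + try
            n -= 1
    finally:
        print("cleanup")       # must run on exhaustion and on early close()

g = countdown(3)
print(next(g), next(g))        # 3 2
g.close()                      # raises GeneratorExit at the paused yield,
                               # unwinds the try, and still prints "cleanup"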

Which Model Should We Choose?

I also spent several days with GPT-5.2-Codex, the latest agentic coding model released by OpenAI. Based on my observations, though, GPT-5.2-Codex is not always superior to GPT-5.2; for non-programming tasks, GPT-5.2 is clearly better. Overall, Codex (GPT-5.2) works well, though it usually takes several times longer than Claude (Opus-4.5) to complete easy tasks.

Although Codex is slow, its accuracy is significantly higher than Claude's, and it almost never causes rework. Claude, by contrast, often tries to tackle difficult problems through repeated trial-and-error; when an attempt fails, it frequently reverts all changes with git checkout, only to head back into the same dead end.

In this respect, Codex is highly reassuring. You can safely hand tasks over to it and step in only to update your SKILLs with necessary guidance. Once the rules were made explicit, Codex usually absorbed them and continued working productively with little further guidance. Btw, MoonBit's strong typing and predictable tooling turned out to be particularly helpful once the runtime logic grew large.

Multiple Codexes Work Together

In parallel, I was also using Codex to build other software. Increasingly, we do not need to micromanage AI: given enough time and a solid test loop, it can make major progress with minimal human interaction. That suggests a better workflow: keep multiple AI sessions open in separate terminals and let them work concurrently.

In theory, with git worktree, having multiple Codexes develop the same project in parallel should not be hard either. The real cost is the merge: resolving conflicts is a heavy cognitive load, especially when the changes are large and cross-cutting. For now I am having each Codex develop a separate project, and I will only try multi-agent work in the same repo when the payoff is clearer. At the moment, I am developing 5–8 projects in parallel with 5–8 coding agents. Next time, I will share some of the results.


Code is available on GitHub.
