December 2025
This post covers aaom-html: a WHATWG HTML5 parser implemented in MoonBit. As with the previous entries in this series, I treat the AI as a high-intensity pair programmer: I provide the goal, constraints, and review; Claude does the rapid implementation and iteration. The key was not "write a tokenizer and a tree builder first" but "build the test harness first"; after that, progress becomes a mostly automated grind of driving failures to zero.
Skill
This time, I use a moonbit-library-builder skill distilled from my experience building parsers with agents during the last two AAoM posts. Its header looks as follows:
---
name: moonbit-library-builder
description: "Build MoonBit libraries using spec-driven and test-driven development. Use when asked to implement a library, port a library from another language (JS, Rust, Go, Python), create a parser/compiler, or build any substantial MoonBit package. Triggers on requests like \"implement X in MoonBit\", \"port Y library to MoonBit\", \"create a Z parser\", or \"build a template engine\"."
---
The section titles look like this:
## Workflow
### Phase 1: Gather Specs
### Phase 2: Write Tests First
### Phase 3: Implement Incrementally
## Testing Patterns
I also annotate Common Pitfalls in parsers:
## Common Pitfalls
- **String indexing**: `s[i]` returns `UInt16`, not `Char`. Use `for char in str` for char access
- **Mutable fields**: Use `mut` keyword in struct field declarations
Approach
HTML5 is not hard because of its syntax. It's hard because of browser-grade error recovery: insertion modes, the adoption agency algorithm, foster parenting, foreign content (SVG/MathML), and a lot of stateful edge cases. HTML5 parsing is not defined by a context-free grammar; instead, the spec prescribes an extremely complicated state machine for tokenization and tree construction. Without a canonical test suite, you end up guessing.
My initial command was simple: "Implement a HTML5 spec‑compliant parser with WHATWG state machine, malformed input recovery". This command instructed Claude to make a plan for test suite generation.
Unfortunately, the number of tests was huge (~8k) and the test-gen script ran out of memory. Claude split the tests into batches of up to 500 tests each. I knew there were better ways to overcome this problem, but since Claude had already solved it, I didn't ask him to make any further changes. After all, the scripts are not vital to this project.
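For reference, here is a minimal Python sketch of the batching idea. This is my reconstruction rather than Claude's actual script; emit_test is a hypothetical placeholder for the code that renders one MoonBit test block.

import json
from pathlib import Path

BATCH_SIZE = 500  # cap per generated file, keeping the generator's memory use small

def emit_test(prefix: str, case: dict) -> str:
    """Placeholder: the real generator renders a complete MoonBit test body."""
    name = case.get("description", "unnamed").replace(" ", "_")
    return f'test "html5lib/tokenizer/{prefix}_{name}" {{\n  // ...\n}}\n\n'

def generate_tokenizer_batches(src_dir: str, out_dir: str) -> None:
    """Split html5lib tokenizer cases into .mbt files of at most BATCH_SIZE tests."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for test_file in sorted(Path(src_dir).glob("*.test")):
        cases = json.loads(test_file.read_text(encoding="utf-8")).get("tests", [])
        for start in range(0, len(cases), BATCH_SIZE):
            batch = cases[start : start + BATCH_SIZE]
            dest = out / f"{test_file.stem}_{start // BATCH_SIZE + 1:02d}_test.mbt"
            dest.write_text(
                "".join(emit_test(test_file.stem, c) for c in batch),
                encoding="utf-8",
            )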
As a result, 14 tokenizer test files and 4 tree test files were generated. Since each test in html5lib-tests already includes its expected output, e.g. for tokenizing:
{"description":"PLAINTEXT content model flag",
"initialStates":["PLAINTEXT state"],
"lastStartTag":"plaintext",
"input":"<head>&body;",
"output":[["Character", "<head>&body;"]]},
and for trees:
#data
<a><p></a></p>
#errors
(1,3): expected-doctype-but-got-start-tag
(1,10): adoption-agency-1.3
#document
| <html>
| <head>
| <body>
| <a>
| <p>
| <a>
we don't need to re-run a reference parser to compute expected trees; the .dat files already contain the canonical expected output. The job is to make our `doc.dump()` output stable and compatible with those expectations.
// for tokenizing
test "html5lib/tokenizer/namedEntities_bad_named_entity_hat_without_a_semicolon_492" {
let (tokens, _) = @html.tokenize("&Hat")
inspect(
tokens,
content="[Character('&'), Character('H'), Character('a'), Character('t'), EOF]",
)
}
// for trees
test "html5lib/tree/adoption01_2" {
let doc = @html.parse("<a>1<button>2</a>3</button>")
inspect(
doc.dump(),
content=(
#|<html>
#| <head>
#| <body>
#| <a>
#| "1"
#| <button>
#| <a>
#| "2"
#| "3"
),
)
}
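To give a feel for what the test generator has to do with these files, here is a hedged Python sketch of splitting a .dat file into cases. This is my reconstruction, not the actual script:

def parse_dat(text: str) -> list[dict]:
    """Split an html5lib-tests .dat file into cases.

    A case starts at a "#data" line; any other line starting with "#" opens a
    new section of the current case (#errors, #document, ...); all remaining
    lines belong to whichever section is currently open.
    """
    cases: list[dict] = []
    case = None
    section = ""
    for line in text.split("\n"):
        if line == "#data":
            case = {"data": []}
            cases.append(case)
            section = "data"
        elif case is not None and line.startswith("#"):
            section = line[1:]
            case[section] = []
        elif case is not None:
            case[section].append(line)
    # Join each section back into text, dropping the blank lines separating cases
    return [{k: "\n".join(v).rstrip("\n") for k, v in c.items()} for c in cases]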
Meanwhile, Claude also wrote a Python script to generate the entity table. It was amazing! Let me briefly explain. HTML5 defines 2,231 named character references, mapping names to characters such as `&` and `≂̸`. Claude downloaded the official table from https://html.spec.whatwg.org/entities.json and transformed it into entities.mbt:
fn init_entities() -> Map[String, Array[Int]] {
let m : Map[String, Array[Int]] = Map::new()
m["AElig"] = [198]
m["AElig;"] = [198]
...
}
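For illustration, here is a hedged Python sketch of such a generator (mine, not Claude's actual script). It assumes the published entities.json shape, where keys like "&AElig;" map to objects with a "codepoints" array:

import json
import urllib.request

ENTITIES_URL = "https://html.spec.whatwg.org/entities.json"

def generate_entities_mbt(out_path: str = "entities.mbt") -> None:
    """Download the WHATWG entity table and render it as a MoonBit lookup map.

    Keys in entities.json carry a leading "&" and, for the legacy forms, may
    omit the trailing ";" (both "&AElig;" and "&AElig" appear); both are kept.
    """
    with urllib.request.urlopen(ENTITIES_URL) as resp:
        entities = json.load(resp)
    lines = [
        "fn init_entities() -> Map[String, Array[Int]] {",
        "  let m : Map[String, Array[Int]] = Map::new()",
    ]
    for name in sorted(entities):
        codepoints = ", ".join(str(cp) for cp in entities[name]["codepoints"])
        lines.append(f'  m["{name.removeprefix("&")}"] = [{codepoints}]')
    lines.append("  m  // last expression is the return value")
    lines.append("}")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

Running generate_entities_mbt() reproduces a table in the same shape as the excerpt above.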
Generating tests is just the start. The time sink is making output stable so the suite is usable:
- Remove the `|` prefix from tree dumps (cleaner snapshots)
- Stable attribute sorting (lexicographic)
- Escaping control characters and C1 controls
- Matching MoonBit `inspect` escaping rules (`\b`, `\u{0X}`, soft hyphen, noncharacters, etc.)
- Using `#|` multi-line strings for readable expected trees
I can hardly imagine how painful it would have been to debug the script for all of this by myself.
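To make two of those items concrete (stripping the `|` prefix and emitting `#|` blocks), here is a hedged Python sketch. It is my reconstruction, not the actual script, and it omits the escaping and attribute-sorting steps:

def dump_to_snapshot(document_lines: list[str], indent: str = "      ") -> str:
    """Render an expected #document section as MoonBit #| multi-line string lines.

    html5lib dumps prefix every node line with "| "; dropping that prefix keeps
    the snapshots clean and lets doc.dump() produce the same shape directly.
    """
    return "\n".join(indent + "#|" + line.removeprefix("| ") for line in document_lines)

def emit_tree_test(name: str, html_input: str, document_lines: list[str]) -> str:
    """Emit one MoonBit snapshot test (string escaping of the input is omitted)."""
    return (
        f'test "html5lib/tree/{name}" {{\n'
        f'  let doc = @html.parse("{html_input}")\n'
        "  inspect(\n"
        "    doc.dump(),\n"
        "    content=(\n"
        f"{dump_to_snapshot(document_lines)}\n"
        "    ),\n"
        "  )\n"
        "}\n"
    )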
Once the entity definitions and conformance tests were done, the rest was straightforward. With the command "Continue until finishing all 8251 tests", Claude worked continuously for nearly 6 hours and submitted 23 code commits. When I checked his work status again, he had already passed 8244/8251 tests. But sadly, he then wasted nearly 3 hours on the final 7 tests without solving a single one. He was going in circles, getting stuck on the same point, constantly writing new code, finding no solution, then deleting it and repeating the same useless attempts. I asked Claude to think deeper, but in vain.
I suddenly remembered a news story I'd heard recently: GPT-5.2 was better at programming than Opus-4.5. I thought, why not let Codex (GPT-5.2) try these seven edge cases? So I launched Codex, selected the GPT-5.2 model, and chose the extra-high reasoning level. It must be said that he thought very slowly. But after about ten minutes, he came up with a very clear solution and solved the problem in another ten minutes. The value wasn't "more code faster" but a cleaner causal chain through the state machine (stack trace -> tokenizer/tree-builder trigger -> specific transition), which led to a fix quickly. That intelligence surprised me.
Results
8251 tests (including conformance tests and smoke tests) passed.
- Full WHATWG HTML5 specification compliance
- 80 tokenizer states
- 25 tree construction insertion modes
- 49 parse error types with graceful recovery
- 2,231 named character references
Reflections
The last 1% is reasoning-heavy. What surprised me most was GPT-5.2's ability to handle difficult problems. After completing this AAoM, I used GPT-5.2 to assist with other projects originally developed with Claude. Unlike Claude, Codex didn't show the forgetfulness I mentioned before. However, Codex's user experience, speed, and compliance are far inferior to Claude's. The best practice I've found so far is to have Claude do the actual work and have Codex review it in parallel, which works very well.
Time investment: ~7 hours of active development:
- 2 minutes: Elaborate the moonbit-library-builder skill.
- ~3 hours: Without human intervention, download html5lib-tests, write the test-gen scripts, implement the basic features, and pass most tests.
- ~3 hours: Try to handle the remaining 7 edge cases, without success.
- ~0.5 hour: GPT-5.2 helps solve the remaining 7 edge cases.
The code is available on GitHub.
I haven't decided what AAoM Day 4 will be yet, but I'm increasingly convinced that for spec-heavy projects the best use of AI isn't "write everything"; it's "make it converge inside a strong testing loop". Maybe I will try something different, like an interpreter for a subset of ECMAScript 2025 or Python 3.
