December 2025
Continuing the Agentic Adventures of MoonBit series, this time I tackle XML parsing. The goal: build a streaming XML parser that passes the official W3C XML Conformance Test Suite.
Skill
I'm still using Claude Code (Opus 4.5) with the MoonBit system prompt and IDE skill. In addition, I created a new skill named moonbit-lang to make the AI aware of best practices and common pitfalls of the MoonBit language. Its header looks like this:
---
name: moonbit-lang
description: "MoonBit language reference and coding conventions. Use when writing MoonBit code, asking about syntax, or encountering MoonBit-specific errors. Covers error handling, FFI, async, and common pitfalls."
---
# MoonBit Language Reference
@reference/fundamentals.md
@reference/error-handling.md
@reference/ffi.md
@reference/async-experimental.md
@reference/package.md
@reference/toml-parser-parser.mbt
In this skill doc, I also mention the official file I/O package moonbitlang/x/fs, which the AI is not familiar with. The complete skill doc and references are available on GitHub, where I will continuously update the skills I use.
The AI (both Codex and Claude) only reads the description at startup and reads the rest when needed. Even so, I keep the skill doc simple because, in my experience, excessively long documents hinder the AI's ability to pick up the details.
Problem
XML remains ubiquitous in configuration files, data interchange, and legacy systems. A conformant XML parser must handle:
- Element tags, attributes, and namespaces
- Entity references (&lt;, &amp;, and custom entities)
- CDATA sections and comments
- Processing instructions and XML declarations
- DTD internal subsets with entity declarations
My goal is to implement XML 1.0 with namespaces, entities, and DTD support. The challenge is that XML has many edge cases specified in the W3C standard. Rather than guessing what's correct, I use the official test suite as ground truth.
Test Generation
First, I download (or rather, have Claude download) the official W3C XML Conformance Test Suite:
curl -L -o xmlts.tar.gz "https://www.w3.org/XML/Test/xmlts20130923.tar.gz"
tar -xzf xmlts.tar.gz && mv xmlconf . && rm xmlts.tar.gz
It contains many valid and not-well-formed XML documents. I let Claude write a script, generate_conformance_tests.py, that generates MoonBit snapshot tests from the official test suite.
How to get the expected snapshot contents? Initially, I had Claude use quick-xml (Rust) as the reference parser. This worked for most tests, but quick-xml is intentionally lenient in some cases where strict XML compliance requires rejection. After hitting 23 test failures due to leniency differences, I switched to libxml2 (via Python's lxml) as the reference. libxml2 is the de-facto standard XML parser and matches W3C conformance closely.
Finally, the generated tests looked like this:
// valid
test "w3c/valid/valid_sa_001" {
// Test demonstrates an Element Type Declaration with Mixed ...
let xml = "<!DOCTYPE doc [\n<!ELEMENT doc (#PCDATA)>\n]>\n<doc></doc>\n"
let reader = Reader::from_string(xml)
let events : Array[Event] = []
for {
match reader.read_event() {
Eof => {
events.push(Eof)
break
}
event => events.push(event)
}
}
inspect(
to_libxml_format(events),
content="[DocType(\"doc\"), Empty({name: \"doc\", attributes: []}), Eof]",
)
}
// not well formed
test "w3c/not-wf/not_wf_sa_001" {
// Attribute values must start with attribute names, not "?".
let xml = "<doc>\n<doc\n?\n<a</a>\n</doc>\n"
let reader = Reader::from_string(xml)
let has_error = for {
try reader.read_event() catch {
_ => break true
} noraise {
Eof => break false
_ => continue
}
}
inspect(has_error, content="true")
}
A total of 735 tests were generated, comprising 14k lines of code. Including other tests written manually by Claude afterwards, the total number of tests is 800.
Parser Implementation
Since quick-xml was the initial reference, Claude followed a pull-parser architecture inspired by it, which I thought was fine for our goal. The API looks like this:
let reader = @xml.Reader::from_string(xml)
for {
  match reader.read_event() {
    Eof => break
    Start(elem) => println("Start: \{elem.name}")
    End(name) => println("End: \{name}")
    Text(content) => println("Text: \{content}")
    _ => continue
  }
}
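The post doesn't show the event type itself, so here is a rough sketch of the shapes implied by the usage above and by the generated tests. The variant payloads and field layouts are my guesses, and the real library almost certainly defines more variants:
// Sketch only: inferred from the examples in this post, not the
// library's actual definitions.
struct Element {
  name : String
  attributes : Array[(String, String)] // assumed name/value pairs
} derive(Show)

enum Event {
  Start(Element)  // <doc ...>
  End(String)     // </doc>
  Empty(Element)  // <doc ... />
  Text(String)    // character data
  CData(String)   // <![CDATA[ ... ]]>
  Comment(String) // <!-- ... -->
  PI(String)      // processing instruction (payload shape assumed)
  DocType(String) // <!DOCTYPE doc ...>
  Eof             // end of input
} derive(Show)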
Since lxml returns a tree structure while our parser emits events, I had Claude implement a to_libxml_format function that transforms our event stream to match lxml's output format exactly. This made test comparison straightforward.
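To give a flavor of what that transformation involves, here is a minimal sketch of one normalization it has to perform, written against the event type sketched above. This is illustrative, not the library's actual to_libxml_format: a Start event immediately followed by its matching End is collapsed into a single Empty event, which is how lxml reports <doc></doc> in the valid test shown earlier.
// Hypothetical helper, not the real to_libxml_format: collapse
// Start/End pairs with no content in between into Empty, the way
// lxml reports empty elements.
fn collapse_empty_elements(events : Array[Event]) -> Array[Event] {
  let out : Array[Event] = []
  let mut i = 0
  while i < events.length() {
    if i + 1 < events.length() {
      match (events[i], events[i + 1]) {
        (Start(elem), End(name)) => {
          if elem.name == name {
            out.push(Empty(elem))
            i = i + 2
            continue
          }
        }
        _ => ()
      }
    }
    out.push(events[i])
    i = i + 1
  }
  out
}
The real function presumably also mirrors other lxml behaviors, such as how surrounding whitespace and text nodes are reported.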
It took about 4 hours without human intervention (except the occasional "Please continue") to accomplish the basic parts. The most complex feature was DTD (Document Type Definition) parsing and validation. I used Claude's plan mode to structure the implementation. Here is the plan summary:
After about 1 hour, DTD was implemented and 726 tests passed. But it took 3 more hours to handle edge cases including entity value expansion, text splitting details and UTF-8 BOM handling.
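To illustrate the kind of entity-expansion edge case involved, here is a toy example of my own (not one of the W3C tests): an entity whose replacement text references another entity. Per XML 1.0, &outer; should expand to "value!", though exactly how the resulting text is split into events is an implementation detail:
// Illustrative example in the style of the usage snippet above.
let xml = "<!DOCTYPE doc [\n<!ENTITY inner \"value\">\n<!ENTITY outer \"&inner;!\">\n]>\n<doc>&outer;</doc>"
let reader = @xml.Reader::from_string(xml)
for {
  match reader.read_event() {
    Eof => break
    Text(content) => println("Text: \{content}") // expected to print "Text: value!"
    _ => continue
  }
}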
Results
In the end, all 800 W3C conformance tests passed. Note that the test-generation script skipped 59 tests: some were valid documents that lxml rejected, while the others were not well-formed documents that lxml accepted. The script classified these as "lxml implementation quirks". Since these edge cases were overly complicated, I didn't check carefully whether they really were lxml quirks; the remaining 800 tests were sufficient anyway.
So this library supports:
- XML 1.0 + Namespaces 1.0
- Pull-parser API for memory-efficient streaming
- Writer API for XML generation
- DTD support with entity expansion
Reflections
What Worked Well?
- Using an official test suite was invaluable. The W3C conformance tests cover edge cases I would never have thought to test manually: obscure character references, DTD quirks, namespace handling, and more.
- Switching the reference implementation when needed. quick-xml's leniency was a feature for its users but a problem for conformance testing; libxml2 provided the strict reference I needed.
- Plan mode for complex features like DTD parsing kept Claude organized. Without it, Claude would jump between fixing different issues without completing any.
The main problem I ran into was that Claude was prone to modifying tests instead of fixing bugs. It was a recurring issue. When tests failed, Claude would often:
- Modify test expectations to match incorrect output
- Update the test generator to skip failing tests
- Suggest marking tests as "lenient" and skipping them rather than fixing the parser
I had to repeatedly redirect: "Update the MoonBit implementation, not the tests."
Forgetting project conventions was also common. Claude would forget to use the moon-ide skill for code navigation, or would use anti-patterns like match (try? expr) instead of try/catch/noraise. Adding these rules to CLAUDE.md helped but didn't eliminate the issue. I searched for this issue in the community (reddit link) and found that it might be a bug in Opus 4.5 and Sonnet 4.5. I hope it will be fixed in the near future.
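For readers unfamiliar with MoonBit error handling, here is roughly what that contrast looks like; the helper functions are hypothetical and only serve to show the two styles side by side:
// Anti-pattern Claude kept reaching for: turn the error into a Result
// with `try?`, then match on Ok/Err.
fn next_or_eof_via_result(reader : Reader) -> Event {
  match (try? reader.read_event()) {
    Ok(event) => event
    Err(_) => Eof
  }
}

// Project convention: handle the raised error directly with
// try/catch/noraise, as in the generated tests above.
fn next_or_eof(reader : Reader) -> Event {
  try reader.read_event() catch {
    _ => Eof
  } noraise {
    event => event
  }
}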
In future work, I may need to implement or port a large number of parsers, so I want to turn my experience writing these parsers and building standards-based test-generation scripts into reusable skills or commands. Perhaps we will see the benefits next time.
Time investment: 10+ hours of active development:
- 2 hours: Collaborative exploration of how to write the test-generation script and its expected snapshots
- 4 hours: Implementing the basic features, without human intervention
- 1 hour: Planning and implementing DTD, namespaces, and entities
- 3 hours: Handling edge cases (fixing 17 test failures)
The code is available on GitHub.
