Pasha

I wrote my third XML parser. Here's why this one was different.

Hi, I'm Pasha, and I write XML parsers.

Not because the world needs another one. The world has xmlutil by Paul de Vrieze, which I will say specific nice things about further down. The world has JAXB. The world has, depending on how you count, several hundred XML libraries on Maven Central. Adding to the pile is not on anyone's wishlist.

And yet, here I am.

The first one I wrote years ago for a previous employer, behind a closed-source repo I no longer have access to. The second one, staks, is mine and works very well — if you write Kotlin and only Kotlin, and you are happy hand-rolling a small DSL per record. The third one is xml-fluss, which I just released, and the rest of this post is about why it exists.

So, let's get going.

The feed that started it

I have a soft spot for OPDS catalogs — Atom-flavored XML feeds for ebook libraries. They are exactly the kind of thing XML was invented for and exactly the kind of thing modern tooling makes you suffer to read.

Here is a fragment from an OPDS feed:

<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opds="http://opds-spec.org/2010/catalog">
  <title>My Books</title>
  <updated>2026-04-26T10:00:00Z</updated>
  <entry>
    <id>urn:isbn:9780000000001</id>
    <title>The Master and Margarita</title>
    <author><name>Mikhail Bulgakov</name></author>
    <link rel="http://opds-spec.org/acquisition"
          type="application/epub+zip"
          href="/get/1.epub"/>
  </entry>
  <entry> ... </entry>
  <entry> ... </entry>
  <!-- 200,000 more entries -->
</feed>

I want three things out of every <entry>: the <title>, the author's <name>, and the acquisition <link>'s href. I do not care about anything else. I am also aware that "200,000 more entries" is not a hypothetical — real catalogs ship feeds with hundreds of thousands of records. I am not going to load that into memory.

Why xmlutil wasn't quite the shape

Before I tell you what I did, a quick word on xmlutil. Paul de Vrieze has been working on it for years. It runs on the entire Kotlin Multiplatform target list (JVM, JS, Native, Android), plugs into kotlinx-serialization, handles QName-aware namespaces with prefix repair and Clark notation, encodes XML back out as well as decoding it, and supports mixed content, sealed-class polymorphism, and inheritance. If you need to encode XML, or your document matches a schema you fully control, that's the library to reach for. I mean it.

But.

The default deserialization path, XML.decodeFromString(...), builds the whole tree. Fine for config files, painful for a 2 GB feed.

There is a streaming escape hatch: decodeWrappedToSequence(reader). It is real and it works. I checked. The constraints: it expects a <container><Item/><Item/></container> shape, it is marked @OptIn(ExperimentalXmlUtilApi::class), and the part that mattered to me is that every field on every Item still has to be declared on a @Serializable data class, including the ones I do not care about, because that is how a binder works. Binders mirror.

What I wanted was a different shape of tool: find the records anywhere in this document, decode three fields per record, ignore the rest, never buffer the whole thing. That's a different model from a binder, and there's room for both.

Why staks wasn't the shape either

Quick aside on my own previous attempt. staks works, and I still use it for personal projects. Its DSL is designed for a Kotlin codebase with a Kotlin developer at the keyboard. The moment you put a Java consumer in the picture — say, a Spring Boot service in a polyglot codebase that wants the same parser — you discover that "tiny Kotlin DSL" is the worst surface area to expose to javac.

So when I started xml-fluss, I picked an architecture where the Kotlin user and the Java user could share the same runtime and the same annotation surface, and only the code-generator backend differs.

What xml-fluss looks like in five lines

Here is roughly what xmlutil would have me write:

@Serializable
@SerialName("entry")
data class Entry(
    @XmlElement val id: String,
    @XmlElement val title: String,
    @XmlElement val updated: String,
    @XmlElement val author: Author,
    @XmlElement val link: List<Link>,
    @XmlElement val summary: String? = null,
)
@Serializable data class Author(@XmlElement val name: String, ...)
@Serializable data class Link(@XmlAttribute val rel: String, @XmlAttribute val href: String, ...)

Here is what I get to write with xml-fluss:

@XmlRecord("//atom:entry")
@XmlNs("atom", "http://www.w3.org/2005/Atom")
@XmlNs("opds", "http://opds-spec.org/2010/catalog")
data class Book(
    @XmlChild("atom:title")            val title: String,
    @XmlChild("atom:author/atom:name") val author: String,
    @XmlChild("atom:link/@href")       val download: String?,
)

That is the entire description. No id, no updated, no summary, no nested Link class — because I do not need any of it. The KSP processor walks this data class at compile time and emits a BookParser object:

URI("https://example.org/opds/all").toURL().openStream().use { input ->
    BookParser.parse(input)
        .filter { it.download != null }
        .take(20)
        .collect { book -> println("${book.author} — ${book.title}") }
}

The parser sits on top of Aalto, a fast pull parser. It never holds more than one Book in memory. The path matcher is a small NFA — a nondeterministic finite automaton, which is the same machinery regex engines use under the hood. The compiled path becomes a graph of states, and as the parser walks the document it keeps a stack of which states are currently active. When a START_ELEMENT arrives that satisfies the descendant axis and the namespace, you enter record mode; when the matching END_ELEMENT arrives, you emit. The descendant axis (//) keeps a state alive across deeper elements so that //book matches whether the <book> is two levels down or twelve.
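To make that concrete, here is a minimal sketch of the idea — illustrative types only, not the xml-fluss internals: each path step becomes a state, and a stack of "active state" sets tracks where the matcher is as elements open and close. A step flagged as descendant (`//`) keeps itself alive at deeper levels.

```kotlin
// One step of a compiled path: the element name, and whether the
// descendant axis (//) allows intermediate elements before it matches.
data class Step(val name: String, val descendant: Boolean)

class PathMatcher(private val steps: List<Step>) {
    // stack[d] = the set of step indices still "alive" at document depth d
    private val stack = ArrayDeque<Set<Int>>().apply { addLast(setOf(0)) }

    /** Returns true when this START_ELEMENT completes the whole path. */
    fun startElement(name: String): Boolean {
        val active = stack.last()
        val next = mutableSetOf<Int>()
        var matched = false
        for (i in active) {
            val step = steps[i]
            if (step.descendant) next += i          // // keeps the state alive deeper down
            if (step.name == name) {
                if (i == steps.lastIndex) matched = true
                else next += i + 1                   // advance to the next step
            }
        }
        stack.addLast(next)
        return matched
    }

    fun endElement() { stack.removeLast() }
}
```

With `steps = [Step("entry", descendant = true)]` this matches an `<entry>` two levels down or twelve, because the descendant state is re-added at every depth; an anchored path like `/library/section` compiles to non-descendant steps, whose states die as soon as an unexpected element appears.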

Predicate filters evaluate at START_ELEMENT time using only the attributes present on the opening tag. That's the design, and it's what makes the streaming guarantee real. If the parser had to look ahead at the body of the element to decide whether to enter it, you would buffer.
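A tiny sketch of what that guarantee means in practice (illustrative types, not the library's API): everything a predicate like `//book[@featured='true' and @lang='en']` needs is already present on the START_ELEMENT event, so the enter-the-record decision is a plain function of the opening tag.

```kotlin
// A START_ELEMENT event carries the element name and its attributes —
// and nothing about the element body, which has not been read yet.
data class StartElement(val name: String, val attrs: Map<String, String>)

// //book[@featured='true' and @lang='en'] as a function of the event alone:
// no lookahead, therefore no buffering.
fun entersRecord(e: StartElement): Boolean =
    e.name == "book" &&
        e.attrs["featured"] == "true" &&
        e.attrs["lang"] == "en"
```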

A small mini-XPath, on purpose

I wanted just enough path syntax to express the things I actually do every week:

//entry                           // descendant axis
/library/section/author           // anchored
//entry[2]                        // positional
//book[@featured='true' and @lang='en']
{http://www.w3.org/2005/Atom}title
//link/@href                      // attribute leaf on a child

Boolean logic is and / or, with and binding tighter. Predicates use equality, inequality, and integer position. That's it. It is not XPath 3.1 and it is not trying to be — it is the minimum syntax that lets me select the records I want without writing a visitor. A more elaborate predicate engine would mean buffering, ambiguity around evaluation order, and a much bigger surface area to debug. I'd rather keep the engine small.
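The precedence falls out of the grammar shape. Here is a sketch of how that structure arises — an assumed recursive-descent layout, not the library's actual parser: or-expressions are built from and-expressions, so `and` groups first and `a or b and c` parses as `a or (b and c)`.

```kotlin
// Illustrative predicate AST; leaves stand in for attribute comparisons.
sealed interface Expr
data class Leaf(val name: String) : Expr
data class And(val l: Expr, val r: Expr) : Expr
data class Or(val l: Expr, val r: Expr) : Expr

class PredParser(private val tokens: List<String>) {
    private var pos = 0
    private fun peek() = tokens.getOrNull(pos)

    // or-expr = and-expr ("or" and-expr)*   — the outer, looser level
    fun parseOr(): Expr {
        var left = parseAnd()
        while (peek() == "or") { pos++; left = Or(left, parseAnd()) }
        return left
    }

    // and-expr = leaf ("and" leaf)*   — the inner, tighter level
    private fun parseAnd(): Expr {
        var left = parseLeaf()
        while (peek() == "and") { pos++; left = And(left, parseLeaf()) }
        return left
    }

    private fun parseLeaf(): Expr = Leaf(tokens[pos++])
}
```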

The Java side, where jspecify earns its keep

xml-fluss has a second module, xml-fluss-apt, that does the same thing for Java records via a plain javac annotation processor:

// package-info.java
@NullMarked
package my.books;

@XmlRecord("//atom:entry")
@XmlNs(prefix = "atom", uri = "http://www.w3.org/2005/Atom")
public record Book(
    @XmlChild(path = "atom:title")            String title,
    @XmlChild(path = "atom:author/atom:name") String author,
    @XmlChild(path = "atom:link/@href")
    @Nullable                                 String download
) {}

// usage
try (Stream<Book> books = BookParser.parse(in)) {
    books.limit(20).forEach(b -> System.out.println(b.author() + " — " + b.title()));
}

Two things are happening here that I am quietly proud of.

First: a single annotation surface (xmlfluss.*) on top of one shared runtime feeds two code generators (KSP for Kotlin, javac APT for Java records), so you get Flow<T> or Stream<T> depending on which side you write on.

Second: there is no separate nullable = true argument anywhere. On the Kotlin side this is free — the language already distinguishes String from String?, so the KSP processor just reads the type and that's the answer. Java doesn't have built-in null-safety, so the APT processor reads jspecify annotations instead. A record component sitting inside a @NullMarked scope without an explicit @Nullable is treated as required at parse time; the @Nullable ones can be absent without complaint. Same end result in both languages, sourced from the type system that each one already has.

Why invent another nullable=true argument when both languages already give the answer?

What it deliberately does not do

A short list of things xml-fluss is not, so you can rule it out fast if you need them:

  • It does not encode XML, and it cannot. This isn't a missing feature, it's structural. The annotation surface is a query over an unknown document. How would you render @XmlRecord("//author//book[3]") back out? Where does the third book go, under which author, with what surrounding tags that the data class never described? The path tells the parser where to look, not how to build. Encoding is a binder's job, because a binder knows the whole structure. If you need to write XML, that is what xmlutil is for.
  • It does not validate against a schema. Out of scope.
  • Predicates only see attributes of the current element, never text content of a child — see the streaming reason above.

If you need encoding or schema validation, reach for a different tool. Predicates evaluating only on element entry is a design choice, not an oversight, but if you have a clean idea for a richer predicate language that preserves the streaming guarantee, PRs welcome.

How to try it

// Gradle, Kotlin
plugins {
    kotlin("jvm")
    id("com.google.devtools.ksp") version "2.3.7"
}
dependencies {
    implementation("site.asm0dey.xmlfluss:xml-fluss-runtime:0.0.0.3")
    ksp("site.asm0dey.xmlfluss:xml-fluss-ksp:0.0.0.3")
}
// Gradle, Java records
dependencies {
    implementation("site.asm0dey.xmlfluss:xml-fluss-runtime:0.0.0.3")
    annotationProcessor("site.asm0dey.xmlfluss:xml-fluss-apt:0.0.0.3")
}

The repo is at github.com/asm0dey/xml-fluss — README, more examples, full annotation reference, the lot. Issues and PRs welcome. If you try it on a real feed and something explodes, that is the most useful gift you can give me right now.

If you only need to encode XML, or you have a schema you fully control and a binding model fits your shape, please use xmlutil. Paul has put a lot of careful work into it and it shows.

But if you have ever found yourself writing a SAX ContentHandler at midnight, or modeling 47 wrapper classes to pull three fields out of a feed, give xml-fluss a look.

I might be on parser number four by then.
