Mohammad Raziei

Posted on Apr 11 • Edited on Apr 20

pygixml 0.10.0 released — A Faster, Smarter XML Parser for Python

#python #showdev #xml #cython

XML parsing in Python has had three choices for over a decade: ElementTree (slow but built-in), lxml (fast but heavy), and minidom (don't). I wanted something that sits at the intersection of speed, simplicity, and a small install footprint.

That's what pygixml is — a Cython wrapper around pugixml, one of the fastest C++ XML parsers in existence.

Version 0.10.0 just dropped, and it's the most significant release so far. Let's walk through what's new.

The Numbers (50 iterations, 5 000 elements)

Library	Avg Time	Speedup vs ElementTree
pygixml	0.0009 s	9.2× faster
lxml	0.0041 s	2.0× faster
ElementTree	0.0083 s	1.0× (baseline)

Memory usage tells a similar story: pygixml uses 0.67 MB at 5 000 elements vs ElementTree's 4.84 MB. And the installed package is just 0.45 MB, vs lxml's 5.48 MB, according to the pip-size report.

If you care about these numbers, the full benchmark suite covers 6 XML sizes (100 to 10 000 elements) and is included in the repo. Run it yourself:

git clone https://github.com/MohammadRaziei/pygixml.git
cd pygixml

cmake -B build
cmake --build build --target run_full_benchmarks
# or
pip install .
python benchmarks/full_benchmark.py
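
For a quick sanity check without cloning the repo, a rough timing loop in the same spirit might look like the sketch below. This is not the repo's benchmark script: the `make_xml` and `avg_parse_time` helpers and the shape of the test document are assumptions made for illustration. Only `pygixml.parse_string` comes from the examples in this post, and it is timed only when pygixml is installed.

```python
import time
import xml.etree.ElementTree as ET


def make_xml(n_items):
    """Build a flat test document with n_items <item> children (hypothetical shape)."""
    items = "".join(
        '<item id="{0}"><name>Item {0}</name></item>'.format(i) for i in range(n_items)
    )
    return "<root>" + items + "</root>"


def avg_parse_time(parse, xml_text, iterations=50):
    """Return the mean wall-clock seconds per call of parse(xml_text)."""
    start = time.perf_counter()
    for _ in range(iterations):
        parse(xml_text)
    return (time.perf_counter() - start) / iterations


xml_text = make_xml(5000)
results = {"ElementTree": avg_parse_time(ET.fromstring, xml_text)}

try:
    import pygixml  # optional: only timed when installed
    results["pygixml"] = avg_parse_time(pygixml.parse_string, xml_text)
except ImportError:
    pass

baseline = results["ElementTree"]
for name, seconds in results.items():
    print("{}: {:.4f} s ({:.1f}x vs ElementTree)".format(name, seconds, baseline / seconds))
```

Absolute numbers will differ from the table above depending on hardware and document shape; the ratio between libraries is the part worth comparing.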

What's New in 0.10.0

1. `children()` — Iterate Direct Children (or All Descendants)

Before 0.10.0, iterating over an element's children required manual sibling walking:

# The old way — walk siblings manually
child = node.first_child()
while child:
    if child.name == "student":
        process(child)
    child = child.next_sibling

Now you get a clean Pythonic iterator:

# Direct children only (default)
for child in node.children():
    process(child)

# All descendants in depth-first order
for descendant in node.children(recursive=True):
    process(descendant)

Text, comment, and processing-instruction nodes are automatically skipped — you only get element nodes.
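
To make the relationship between the two styles concrete, here is a minimal sketch of how such an iterator can be layered on top of sibling links. The `Node` class and the `children` function below are toy stand-ins written for this illustration; they are not pygixml's `XMLNode` or its actual implementation.

```python
class Node:
    """Toy node with pugixml-style links (hypothetical, not pygixml's XMLNode)."""

    def __init__(self, name, kind="element"):
        self.name = name
        self.kind = kind  # "element", "text", "comment", ...
        self.first_child = None
        self.next_sibling = None

    def append(self, child):
        """Attach child as the last child and return it."""
        if self.first_child is None:
            self.first_child = child
        else:
            last = self.first_child
            while last.next_sibling is not None:
                last = last.next_sibling
            last.next_sibling = child
        return child


def children(node, recursive=False):
    """Yield element children; with recursive=True, all element descendants depth-first."""
    child = node.first_child
    while child is not None:
        if child.kind == "element":
            yield child
            if recursive:
                yield from children(child, recursive=True)
        child = child.next_sibling


root = Node("root")
a = root.append(Node("a"))
root.append(Node("hello", kind="text"))  # skipped by the iterator
b = root.append(Node("b"))
a.append(Node("a1"))

print([n.name for n in children(root)])                  # ['a', 'b']
print([n.name for n in children(root, recursive=True)])  # ['a', 'a1', 'b']
```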

2. `text()` — Recursive Text Extraction with Configurable Joins

Getting text out of an XML element shouldn't require walking the tree yourself. text() collects all text and CDATA nodes from the subtree and joins them:

doc = pygixml.parse_string("""
<article>
    <p>Hello <b>world</b>! This is <i>rich</i> text.</p>
</article>
""")
p = doc.root.child("p")

p.text()                     # "Hello\nworld!\nThis is\nrich\ntext."
p.text(recursive=False)      # "Hello "  (direct text only)
p.text(join=" ")             # "Hello world! This is rich text."

For simple cases where you just want the first child's text, child_value("tag") is still there and is slightly faster.
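
For comparison, the same extraction with the standard library's ElementTree might look like the sketch below. It shows ElementTree's behavior only; how pygixml trims or joins whitespace internally is not something this example claims to reproduce.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<article><p>Hello <b>world</b>! This is <i>rich</i> text.</p></article>"
)
p = root.find("p")

direct_text = p.text               # text before the first child element
all_text = "".join(p.itertext())   # every text node in the subtree, concatenated

print(repr(direct_text))  # 'Hello '
print(all_text)           # Hello world! This is rich text.
```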

3. `element.value = "text"` — Finally, This Works

Element nodes in pugixml don't store text directly — they contain child text nodes. In 0.10.0, setting .value on an element automatically creates or replaces that text child:

doc = pygixml.parse_string("<root><item/></root>")
item = doc.root.child("item")

item.value = "Hello"
print(item.value)   # "Hello"  ✅
print(item.text())  # "Hello"  ✅
# XML: <item>Hello</item>

# Replaces existing text
item.value = "World"
print(item.value)   # "World"

And reading back: element.value now returns the first text child's value (or None if there's no text), so set and get are symmetric.
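
The create-or-replace behavior described above can be modeled in a few lines. The `Element` and `TextNode` classes here are a hypothetical toy model of "text lives in a child node", not pygixml's code; in particular, where a newly created text child is placed among existing children is an assumption.

```python
class TextNode:
    """Toy text node holding a string value."""

    def __init__(self, value):
        self.value = value


class Element:
    """Toy element whose text lives in child TextNode objects (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.children = []  # mix of Element and TextNode objects

    def _first_text(self):
        for child in self.children:
            if isinstance(child, TextNode):
                return child
        return None

    @property
    def value(self):
        """Return the first text child's value, or None if there is no text."""
        text = self._first_text()
        return text.value if text is not None else None

    @value.setter
    def value(self, new_value):
        text = self._first_text()
        if text is None:
            self.children.append(TextNode(new_value))  # create the text child
        else:
            text.value = new_value                     # replace existing text


item = Element("item")
print(item.value)          # None
item.value = "Hello"
item.value = "World"
print(item.value)          # World
print(len(item.children))  # 1
```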

4. `from_mem_id_unsafe()` — O(1) Node Lookup

This is the most powerful — and most dangerous — feature in 0.10.0.

Every XMLNode exposes a mem_id property: a unique numeric identifier derived from the node's internal address. You can use it to reconstruct a node later:

# Fast: O(1), direct pointer cast
node = pygixml.XMLNode.from_mem_id_unsafe(node_id)

# Safe but O(n): walks the tree
node = root.find_mem_id(node_id)

The difference is O(1) vs O(n). But from_mem_id_unsafe treats the identifier as a raw pointer: if the document was freed or the node deleted, dereferencing it is undefined behavior and will typically crash the interpreter with a segfault.

When to use it: only in performance-critical paths where you've profiled and confirmed that find_mem_id's tree walk is a bottleneck. For most code, find_mem_id is the right choice.

A mem_id is a plain integer, so it is hashable and works well as a key for dictionary-based caching.

Why aren't `XMLNode` objects hashable?

You might wonder why you can't just do cache[node] = data. This is intentional: XMLNode objects are mutable. You can rename them, change their content, add children, and so on. In Python, an object whose equality depends on mutable state shouldn't be hashable, because its hash could change the moment you modify it, and any dict or set holding it would silently break. Using mem_id as the key makes the contract explicit: the integer is stable and hashable, while the node wrapper is transient.

Using nodes in dictionaries (the right way)

# Store node data by mem_id (a stable, hashable integer)
cache = {}
for node in doc:
    cache[node.mem_id] = {
        "xpath": node.xpath,
        "depth": node.xpath.count("/"),
    }

# Later, reconstruct the node (O(1) but unsafe)
for mem_id, metadata in cache.items():
    node = pygixml.XMLNode.from_mem_id_unsafe(mem_id)
    if node:  # Guards against a null handle only; a stale mem_id cannot be detected here
        process(node, metadata)

For safety, use find_mem_id (O(n) but returns None for deleted nodes):

node = root.find_mem_id(mem_id)
if node:
    process(node)
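
As an analogy for the safe path, the sketch below uses the standard library's ElementTree with CPython's `id()` standing in for `mem_id`. The `find_by_id` helper is hypothetical and only illustrates the idea of an O(n) walk that returns None once a node is no longer in the tree; it says nothing about how pygixml implements `find_mem_id`.

```python
import xml.etree.ElementTree as ET


def find_by_id(root, target_id):
    """Walk the tree (O(n)) and return the element whose id() matches, else None."""
    for element in root.iter():
        if id(element) == target_id:
            return element
    return None


root = ET.fromstring("<root><a/><b/></root>")
b = root.find("b")
b_id = id(b)

print(find_by_id(root, b_id) is b)  # True
root.remove(b)                      # detach <b/> from the tree
print(find_by_id(root, b_id))       # None
```

The walk costs a full traversal per lookup, which is the price of never touching memory that may have been released.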

5. `xpath` Property — Generate Absolute XPath to Any Node

doc = pygixml.parse_string("<root><book><title>Gatsby</title></book></root>")
title = doc.root.child("book").child("title")

print(title.xpath)  # /root[1]/book[1]/title[1]

This is a custom O(depth) algorithm that walks from the node up to the root, counting same-name siblings to produce accurate positional predicates. pugixml doesn't provide this natively — it's pygixml's own addition.
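
The idea can be sketched on a toy tree: walk from the node up to the root, and at each level count preceding siblings with the same name. The `Node` class and `absolute_xpath` function are hypothetical names for this illustration and are not taken from pygixml's source.

```python
class Node:
    """Toy element with a parent link (hypothetical, not pygixml's XMLNode)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def absolute_xpath(node):
    """Build /a[i]/b[j]/... by walking from node up to the root."""
    steps = []
    while node is not None:
        position = 1
        if node.parent is not None:
            # Count same-name siblings that come before this node.
            for sibling in node.parent.children:
                if sibling is node:
                    break
                if sibling.name == node.name:
                    position += 1
        steps.append("{}[{}]".format(node.name, position))
        node = node.parent
    return "/" + "/".join(reversed(steps))


root = Node("root")
book1 = Node("book", root)
book2 = Node("book", root)
title = Node("title", book2)

print(absolute_xpath(title))  # /root[1]/book[2]/title[1]
```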

6. `xml` Property — One-Liner XML Serialization

node.xml  # same as node.to_string() with 2-space indent

7. `ParseFlags` Enum

All 18 pugixml parse flags are now available as a proper IntFlag enum:

# Fastest parse — skip escapes, EOL normalization, whitespace
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)

# Combine specific flags
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)
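
If IntFlag is unfamiliar, the standalone sketch below shows how such flags combine and how membership is tested. `DemoParseFlags` is an illustrative enum defined here; its member names and numeric values are assumptions and do not mirror pygixml's or pugixml's actual flag values.

```python
from enum import IntFlag


class DemoParseFlags(IntFlag):
    """Illustrative flags only; names and values are NOT pygixml's."""

    NONE = 0
    COMMENTS = 1
    CDATA = 2
    ESCAPES = 4


flags = DemoParseFlags.COMMENTS | DemoParseFlags.CDATA

print(DemoParseFlags.CDATA in flags)    # True
print(DemoParseFlags.ESCAPES in flags)  # False
print(int(flags))                       # 3
```

Because IntFlag members are also integers, a combined value can be passed straight to a C-level API that expects a bitmask.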

8. Python 3.6–3.13 Support

pygixml works with every Python from 3.6 through 3.13. Type-stub (.pyi) generation via stubgen-pyx is only enabled on Python 3.9+ (where the package is available), so older versions still build fine, just without type stubs.

Full Feature Summary

Feature	pygixml	lxml	ElementTree
Parse speed (5K elements)	0.0009 s	0.0041 s	0.0083 s
Memory (5K elements)	0.67 MB	0.67 MB	4.84 MB
Runtime Dependencies	0	libxml2, libxslt	None (stdlib)
Package size	0.45 MB	5.48 MB	built-in
XPath 1.0	✅ full	✅ full	❌ limited
XSLT	❌	✅	❌
Schema validation	❌	✅	❌
`children()` iterator	✅	❌	❌
`text()` recursive	✅	❌	❌
`element.value = "text"`	✅	❌	❌
`xpath` property	✅	❌	❌
`mem_id` caching	✅	❌	❌

Installation

pip install pygixml

Zero Runtime Dependencies

This is a huge advantage that often gets overlooked.

lxml is built on the C libraries libxml2 and libxslt. Its official wheels bundle them, but a source build links against the system copies, and a security vulnerability or version conflict in those libraries then affects your environment.
pygixml bundles pugixml directly into the Python extension.

It has zero runtime dependencies. No libxml, no external binaries, no transitive dependency chains. Just a single install that works.

Pre-compiled wheels are available for Windows, Linux, and macOS.
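
One way to check the Python-level side of this claim on your own machine is to read the requirements a distribution declares in its metadata. The `declared_requirements` helper below is a hypothetical convenience wrapper around the standard library's `importlib.metadata` (Python 3.8+); it only sees declared Python dependencies, not any native libraries an extension links against.

```python
from importlib import metadata


def declared_requirements(dist_name):
    """Return the requirements a distribution declares, or None if it is not installed."""
    try:
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None


# An empty list means no declared dependencies; None means the package is not installed.
print(declared_requirements("pygixml"))
```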

Links

If this project helps you, a star on GitHub goes a long way. Thanks for reading.
