DEV Community

Mohammad Raziei

pygixml 0.10.0 released — A Faster, Smarter XML Parser for Python

XML parsing in Python has had three choices for over a decade: ElementTree (slow but built-in), lxml (fast but heavy), and minidom (don't). I wanted something that sits at the intersection of speed, simplicity, and a small install footprint.

That's what pygixml is — a Cython wrapper around pugixml, one of the fastest C++ XML parsers in existence.

Version 0.10.0 just dropped, and it's the most significant release so far. Let's walk through what's new.


The Numbers (50 iterations, 5 000 elements)

| Library     | Avg Time | Speedup vs ElementTree |
|-------------|----------|------------------------|
| pygixml     | 0.0009 s | 9.2× faster            |
| lxml        | 0.0041 s | 2.0× faster            |
| ElementTree | 0.0083 s | 1.0× (baseline)        |

Memory usage tells a similar story: pygixml uses 0.67 MB at 5 000 elements vs ElementTree's 4.84 MB. And the installed package is just 0.45 MB, vs lxml's 5.48 MB, according to the pip-size report.
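
The speedup column is the ElementTree baseline divided by each library's average time. As a rough sketch of how such a measurement could be taken with only the standard library (the helper names `make_xml`, `avg_parse_time`, and `speedup` are illustrative and not part of the pygixml benchmark suite, and the generated document shape is an assumption):

```python
import timeit
import xml.etree.ElementTree as ET


def make_xml(n):
    """Build a flat test document with n <item> elements (assumed shape)."""
    items = "".join(f'<item id="{i}"><name>n{i}</name></item>' for i in range(n))
    return f"<root>{items}</root>"


def avg_parse_time(parse, xml, iterations=50):
    """Average wall-clock seconds per call of parse(xml)."""
    return timeit.timeit(lambda: parse(xml), number=iterations) / iterations


def speedup(baseline_s, library_s):
    """How many times faster a library is than the baseline."""
    return baseline_s / library_s


xml = make_xml(5000)
baseline = avg_parse_time(ET.fromstring, xml)
print(f"ElementTree: {baseline:.4f} s per parse")

# Reproduce the speedup column from the reported averages.
print(round(speedup(0.0083, 0.0009), 1))  # 9.2
print(round(speedup(0.0083, 0.0041), 1))  # 2.0
```

Absolute times depend on the machine, so only the ratios are comparable across runs.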

If you care about these numbers, the full benchmark suite covers 6 XML sizes (100 to 10 000 elements) and is included in the repo. Run it yourself:

git clone https://github.com/MohammadRaziei/pygixml.git
cd pygixml

cmake -B build
cmake --build build --target run_full_benchmarks
# or
pip install .
python benchmarks/full_benchmark.py

What's New in 0.10.0

1. children() — Iterate Direct Children (or All Descendants)

Before 0.10.0, iterating over an element's children required manual sibling walking:

# The old way — walk siblings manually
child = node.first_child()
while child:
    if child.name == "student":
        process(child)
    child = child.next_sibling

Now you get a clean Pythonic iterator:

# Direct children only (default)
for child in node.children():
    process(child)

# All descendants in depth-first order
for descendant in node.children(recursive=True):
    process(descendant)

Text, comment, and processing-instruction nodes are automatically skipped — you only get element nodes.
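
For readers coming from ElementTree, the two modes roughly correspond to iterating an element directly versus calling `iter()` on it. The sketch below uses only the standard library, not pygixml, and assumes that "depth-first" means pre-order document order:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<school><class><student/><student/></class><teacher/></school>"
)

# Direct children: iterating an Element yields its immediate child elements.
direct = [child.tag for child in root]

# All descendants: iter() walks the subtree in document (pre-order,
# depth-first) order and yields the starting element first, so skip it.
descendants = [elem.tag for elem in root.iter() if elem is not root]

print(direct)       # ['class', 'teacher']
print(descendants)  # ['class', 'student', 'student', 'teacher']
```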

2. text() — Recursive Text Extraction with Configurable Joins

Getting text out of an XML element shouldn't require walking the tree yourself. text() collects all text and CDATA nodes from the subtree and joins them:

doc = pygixml.parse_string("""
<article>
    <p>Hello <b>world</b>! This is <i>rich</i> text.</p>
</article>
""")
p = doc.root.child("p")

p.text()                     # "Hello\nworld!\nThis is\nrich\ntext."
p.text(recursive=False)      # "Hello "  (direct text only)
p.text(join=" ")             # "Hello world! This is rich text."

For simple cases where you just want the first child's text, child_value("tag") is still there and is slightly faster.
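
To see which text fragments a subtree like this actually contains, here is a standard-library sketch using ElementTree's `itertext()`. It is an analogue of the idea, not pygixml's implementation; `collect_text` is a hypothetical helper, and how pygixml treats the whitespace around each fragment is not assumed here.

```python
import xml.etree.ElementTree as ET


def collect_text(elem, join=""):
    """Join every text fragment in elem's subtree, in document order."""
    return join.join(elem.itertext())


p = ET.fromstring("<p>Hello <b>world</b>! This is <i>rich</i> text.</p>")

print(list(p.itertext()))  # ['Hello ', 'world', '! This is ', 'rich', ' text.']
print(collect_text(p))     # Hello world! This is rich text.
print(repr(p.text))        # 'Hello '  (text before the first child element)
```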

3. element.value = "text" — Finally, This Works

Element nodes in pugixml don't store text directly — they contain child text nodes. In 0.10.0, setting .value on an element automatically creates or replaces that text child:

doc = pygixml.parse_string("<root><item/></root>")
item = doc.root.child("item")

item.value = "Hello"
print(item.value)   # "Hello"  ✅
print(item.text())  # "Hello"  ✅
# XML: <item>Hello</item>

# Replaces existing text
item.value = "World"
print(item.value)   # "World"

And reading back: element.value now returns the first text child's value (or None if there's no text), so set and get are symmetric.
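
The create-or-replace behavior can be modeled in a few lines of plain Python. The `Element` and `Text` classes below are hypothetical stand-ins, not pygixml's classes, and appending the new text child at the end is an assumption:

```python
class Text:
    """Stand-in for a text child node."""

    def __init__(self, value):
        self.value = value


class Element:
    """Stand-in for an element that stores text as child nodes."""

    def __init__(self, name):
        self.name = name
        self.children = []  # may hold Element and Text objects

    def _first_text(self):
        return next((c for c in self.children if isinstance(c, Text)), None)

    @property
    def value(self):
        # Getter: the first text child's value, or None if there is none.
        text = self._first_text()
        return None if text is None else text.value

    @value.setter
    def value(self, new_text):
        # Setter: replace the existing text child, or create one.
        text = self._first_text()
        if text is None:
            self.children.append(Text(new_text))  # placement is an assumption
        else:
            text.value = new_text


item = Element("item")
print(item.value)  # None
item.value = "Hello"
print(item.value)  # Hello
item.value = "World"
print(item.value, len(item.children))  # World 1
```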

4. from_mem_id_unsafe() — O(1) Node Lookup

This is the most powerful — and most dangerous — feature in 0.10.0.

Every XMLNode exposes a mem_id property: a unique numeric identifier derived from the node's internal address. You can use it to reconstruct a node later:

# Fast: O(1), direct pointer cast
node = pygixml.XMLNode.from_mem_id_unsafe(node_id)

# Safe but O(n): walks the tree
node = root.find_mem_id(node_id)

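from_mem_id_unsafe is O(1) because it treats the identifier as a raw pointer; find_mem_id is O(n) because it walks the tree. The price of the shortcut is safety: if the document was freed or the node deleted, the pointer dangles, and using it is undefined behavior that typically ends in a segfault.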

When to use it: only in performance-critical paths where you've profiled and confirmed that find_mem_id's tree walk is a bottleneck. For most code, find_mem_id is the right choice.
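
The safe lookup is easiest to picture as a plain tree walk that compares identifiers. The sketch below uses ElementTree and Python's built-in `id()` as a stand-in for mem_id; `find_by_id` is a hypothetical helper, not pygixml's `find_mem_id`:

```python
import xml.etree.ElementTree as ET


def find_by_id(root, target_id):
    """Walk the tree (O(n)) and return the element whose id() matches."""
    for elem in root.iter():
        if id(elem) == target_id:
            return elem
    return None


root = ET.fromstring("<root><a/><b/></root>")
b = root.find("b")
b_id = id(b)

print(find_by_id(root, b_id) is b)  # True

root.remove(b)  # b stays alive through the local variable, so its id is not reused
print(find_by_id(root, b_id))       # None
```

Because the walk only ever visits nodes that are still attached to the tree, a removed node is simply not found, which is the property the unsafe pointer cast cannot offer.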

A mem_id is a plain integer, so it is hashable and works as a dictionary key for caching.

Why aren't XMLNode objects hashable?

You might wonder why you can't just do cache[node] = data. This is intentional: XMLNode objects are mutable (you can rename them, change their content, add children, and so on). By Python convention, an object whose equality depends on mutable state shouldn't be hashable, because dictionary lookups would break the moment the object changed. Using mem_id as the key makes the contract explicit: the integer is hashable and stays the same for as long as the node exists, while the node wrapper is transient.
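
In plain Python, a class that defines `__eq__` without `__hash__` becomes unhashable automatically. Whether pygixml achieves it this way is an assumption; `NodeHandle` below is a hypothetical wrapper used only to show the effect:

```python
class NodeHandle:
    """Hypothetical wrapper compared by the address it points at."""

    def __init__(self, address):
        self.address = address

    def __eq__(self, other):
        # Defining __eq__ without __hash__ sets __hash__ to None.
        return isinstance(other, NodeHandle) and self.address == other.address


handle = NodeHandle(0x1000)

try:
    hash(handle)
    hashable = True
except TypeError:
    hashable = False

print(hashable)  # False

# The integer identifier is hashable, so it can serve as the key instead.
cache = {handle.address: "metadata"}
print(cache[0x1000])  # metadata
```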

Using nodes in dictionaries (the right way)

# Store node data by mem_id (a stable, hashable integer)
cache = {}
for node in doc:
    cache[node.mem_id] = {
        "xpath": node.xpath,
        "depth": node.xpath.count("/"),
    }

# Later, reconstruct the node (O(1) but unsafe)
for mem_id, metadata in cache.items():
    node = pygixml.XMLNode.from_mem_id_unsafe(mem_id)
    if not node.is_null():  # catches a null handle only; a freed or deleted node is NOT detected
        process(node, metadata)

For safety, use find_mem_id (O(n) but returns None for deleted nodes):

node = root.find_mem_id(mem_id)
if node is not None:
    process(node)

5. xpath Property — Generate Absolute XPath to Any Node

doc = pygixml.parse_string("<root><book><title>Gatsby</title></book></root>")
title = doc.root.child("book").child("title")

print(title.xpath)  # /root[1]/book[1]/title[1]

This is a custom algorithm that takes O(depth) steps from the node up to the root, counting same-name siblings at each level to produce accurate positional predicates. pugixml doesn't provide this natively; it's pygixml's own addition.
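
The same walk-up-and-count idea can be sketched with the standard library. This is not pygixml's code: ElementTree has no parent pointers, so the hypothetical `absolute_xpath` helper first builds a parent map, which makes this version O(n) rather than O(depth).

```python
import xml.etree.ElementTree as ET


def absolute_xpath(root, target):
    """Return an absolute XPath with positional predicates for target."""
    # ElementTree elements do not know their parent, so record it up front.
    parents = {child: parent for parent in root.iter() for child in parent}

    steps = []
    node = target
    while node is not None:
        parent = parents.get(node)
        if parent is None:
            index = 1  # the root element has no siblings
        else:
            # 1-based position among siblings that share the same tag.
            same_name = [c for c in parent if c.tag == node.tag]
            index = next(i for i, c in enumerate(same_name, 1) if c is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    return "/" + "/".join(reversed(steps))


root = ET.fromstring(
    "<root><book><title>Gatsby</title></book>"
    "<book><title>Dune</title></book></root>"
)
second_title = root.findall("book")[1].find("title")
print(absolute_xpath(root, second_title))  # /root[1]/book[2]/title[1]
```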

6. xml Property — One-Liner XML Serialization

node.xml  # same as node.to_string() with 2-space indent

7. ParseFlags Enum

All 18 pugixml parse flags are now available as a proper IntFlag enum:

# Fastest parse — skip escapes, EOL normalization, whitespace
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)

# Combine specific flags
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)
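
If IntFlag is unfamiliar, this standard-library sketch shows what combining and testing flags looks like. `DemoFlags` and its numeric values are made up for illustration and do not reflect pygixml's or pugixml's real values:

```python
from enum import IntFlag


class DemoFlags(IntFlag):
    """Hypothetical flags; names and values are illustrative only."""

    COMMENTS = 1
    CDATA = 2
    ESCAPES = 4


# Flags combine with bitwise OR and can be tested with `in`.
flags = DemoFlags.COMMENTS | DemoFlags.CDATA

print(DemoFlags.CDATA in flags)    # True
print(DemoFlags.ESCAPES in flags)  # False
print(int(flags))                  # 3
```

Because IntFlag members are also integers, a combined value can be passed anywhere a C-level bitmask is expected.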

8. Python 3.6–3.13 Support

pygixml works with every Python from 3.6 through 3.13. .pyi stub generation via stubgen-pyx is only enabled on Python 3.9+ (where the package is available), so older versions still build fine — just without type stubs.


Full Feature Summary

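| Feature                   | pygixml  | lxml     | ElementTree |
|---------------------------|----------|----------|-------------|
| Parse speed (5K elements) | 0.0009 s | 0.0041 s | 0.0083 s    |
| Memory (5K elements)      | 0.67 MB  | 0.67 MB  | 4.84 MB     |
| Package size              | 0.45 MB  | 5.48 MB  | built-in    |
| XPath 1.0                 | ✅ full  | ✅ full  | ❌ limited  |
| XSLT                      |          |          |             |
| Schema validation         |          |          |             |
| children() iterator       |          |          |             |
| text() recursive          |          |          |             |
| element.value = "text"    |          |          |             |
| xpath property            |          |          |             |
| mem_id caching            |          |          |             |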

Installation

pip install pygixml

Zero Runtime Dependencies

This advantage is easy to overlook.

  • lxml is built on the C libraries libxml2 and libxslt. Its prebuilt wheels bundle them, but a source build needs them on the system, and a security vulnerability or version conflict in either library carries over to your environment.
  • pygixml bundles pugixml directly into the Python extension.

It has zero runtime dependencies. No libxml, no external binaries, no transitive dependency chains. Just a single install that works.
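
One way to check a claim like this for any installed package is to read its declared requirements from the package metadata. This is a sketch using the standard library's `importlib.metadata` (Python 3.8+); `declared_dependencies` is a hypothetical helper, and it only sees Python-level requirements, not shared libraries linked into an extension module.

```python
from importlib import metadata


def declared_dependencies(package):
    """Return the package's declared requirements, or None if not installed."""
    try:
        # requires() returns None when the package declares no requirements.
        return metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return None


# An empty list means the installed package declares no requirements;
# None means the package is not installed in this environment.
print(declared_dependencies("pygixml"))
```

Requirements that belong to optional extras also show up in this list, marked with an `extra == "..."` condition.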

Pre-compiled wheels are available for Windows, Linux, and macOS.


Links

If this project helps you, a star on GitHub goes a long way. Thanks for reading.
