Beyond lxml: Faster and More Pythonic Parsing with pygixml and selectolax

For almost two decades, lxml has been the go-to choice for parsing XML and HTML in Python.
It’s fast, reliable, and feature-rich — a powerful C-based library that has served the ecosystem extremely well.

But the world has changed.
XML and HTML parsing have new performance demands, and developers expect cleaner, faster, and more Pythonic APIs.

That’s where pygixml (for XML) and selectolax (for HTML) come in: two modern parsing libraries built with Cython that pair low-level speed with high-level usability.

🕰️ A Brief Look Back at lxml

Let’s give credit where it’s due.
lxml revolutionized parsing when it came out — it combined the power of the C-based libxml2 with a clean Python API.
For years, it was the de facto standard for working with XML and HTML.

A simple example:

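from lxml import etree

xml = """<root><item>Alpha</item><item>Beta</item></root>"""
# fromstring() returns the root element, not an ElementTree
root = etree.fromstring(xml)

# XPath query: every <item> element anywhere in the document
for item in root.xpath("//item"):
    print(item.text)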

Output:

Alpha
Beta

Nothing wrong here — it works!
But as XML and HTML documents grow to many megabytes, parsing time and memory use start to matter. lxml’s API is also somewhat verbose, and its performance, though backed by C, leaves room for improvement on workloads of that size.

⚡ Meet pygixml — XML at C++ Speed

pygixml is a modern XML parser for Python, powered by pugixml (C++) and Cython.
It’s not just a wrapper — it’s a reimagined XML API that combines raw C++ speed with Pythonic usability.

Benchmarks show pygixml is 16× to 33× faster than ElementTree, and around 5× faster than lxml, depending on input size.
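
Speedup figures like these depend heavily on the document and the machine, so it is worth measuring on your own data. Below is a minimal timing sketch that uses only the standard library, with ElementTree as the baseline. The helper names make_xml and time_parser are hypothetical, and the synthetic document is only a stand-in for real input. To compare another parser, pass its string-parsing function in place of ET.fromstring.

```python
import timeit
import xml.etree.ElementTree as ET


def make_xml(n_items: int) -> str:
    """Build a synthetic document with n_items <item> children (hypothetical helper)."""
    items = "".join(f'<item id="{i}">value {i}</item>' for i in range(n_items))
    return f"<root>{items}</root>"


def time_parser(parse, xml: str, repeat: int = 5) -> float:
    """Return the best wall-clock time in seconds for a single parse(xml) call."""
    # Taking the minimum over several runs reduces noise from other processes.
    return min(timeit.repeat(lambda: parse(xml), number=1, repeat=repeat))


xml = make_xml(10_000)
baseline = time_parser(ET.fromstring, xml)
print(f"ElementTree: {baseline * 1000:.2f} ms for {len(xml)} characters")
```

Timing only the parse call keeps the comparison fair; if your workload is dominated by XPath queries or text extraction, time those steps as well.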

Example:

import pygixml

xml = """<root><item id="1">Alpha</item><item id="2">Beta</item></root>"""
doc = pygixml.parse_string(xml)

for node in doc.select_nodes("//item"):
    print(node.xpath, node.text())

Output:

/root/item[1] Alpha
/root/item[2] Beta

Each node exposes an .xpath property (in lxml, the equivalent requires calling getpath() on the element’s tree), and every node has a unique mem_id that you can use to reference a node or look it up again quickly.
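
To make the idea of a per-node path concrete, here is a rough sketch of how such positional paths can be computed with the standard library's ElementTree. It is not pygixml's implementation, and build_paths is a hypothetical helper. An index is added only when siblings share a tag, which is an assumption chosen to match the output shown above.

```python
import xml.etree.ElementTree as ET


def build_paths(root: ET.Element) -> dict:
    """Map every element to a positional XPath such as /root/item[2]."""
    paths = {root: f"/{root.tag}"}
    stack = [root]
    while stack:
        parent = stack.pop()
        # Count children per tag so an index is only added when it is needed.
        totals = {}
        for child in parent:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in parent:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            paths[child] = f"{paths[parent]}/{step}"
            stack.append(child)
    return paths


root = ET.fromstring('<root><item id="1">Alpha</item><item id="2">Beta</item></root>')
paths = build_paths(root)
for item in root.iter("item"):
    print(paths[item], item.text)
```

Having the path available as a property saves this bookkeeping, which is useful when logging or reporting where in a large document a value came from.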

Need to iterate?

for node in doc:
    print(node.name, node.text(recursive=False))

Want the full text recursively?

print(doc.first_child().text(recursive=True))
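
The difference between direct and recursive text is easiest to see on mixed content. The sketch below illustrates the concept with the standard library's ElementTree rather than pygixml. Whether pygixml's non-recursive text() also includes text that follows a child element is not stated here, so treat this as an illustration of the idea, not of pygixml's exact results.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<note>Hello <b>big</b> world</note>")

# Direct text only: the text that appears before the first child element.
direct = root.text
# Recursive text: every text fragment in document order, joined together.
recursive = "".join(root.itertext())

print(repr(direct))     # 'Hello '
print(repr(recursive))  # 'Hello big world'
```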

pygixml is designed for developers who work with huge XML files, complex XPath queries, and high-performance data pipelines.
It’s production-ready, thread-safe, and ridiculously fast.

📘 Full API documentation:
https://mohammadraziei.github.io/pygixml/api.html

🌐 And for HTML? Meet selectolax

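When it comes to HTML parsing, selectolax fills the same niche that pygixml does for XML.
It’s a Cython binding to two fast HTML5 engines written in C, Modest and lexbor, and offers a familiar, Pythonic interface. The HTMLParser class used below runs on Modest; the lexbor backend is available as selectolax.lexbor.LexborHTMLParser and exposes the same css() method.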

Example:

from selectolax.parser import HTMLParser

html = """<html><body><div class='post'>Hello</div><div class='post'>World</div></body></html>"""
tree = HTMLParser(html)

for node in tree.css("div.post"):
    print(node.text())

Output:

Hello
World

It even supports CSS selectors natively, making it ideal for scraping and lightweight DOM manipulation.
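
To see what a native CSS selector saves you, here is roughly the same extraction written against the standard library's html.parser, which has no selector support. PostCollector is a hypothetical helper written for this comparison. It handles only the div.post case and, as a simplification, folds any nested div into the text of the enclosing match.

```python
from html.parser import HTMLParser as StdHTMLParser


class PostCollector(StdHTMLParser):
    """Collect the text of every <div class="post"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._depth = 0  # greater than 0 while inside a matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # nested div inside a matching one
        elif "post" in classes:
            self._depth = 1
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.posts[-1] += data


html = """<html><body><div class='post'>Hello</div><div class='post'>World</div></body></html>"""
collector = PostCollector()
collector.feed(html)
print(collector.posts)
```

The selector version expresses the same intent in one line, and it extends to other selectors without any new state-tracking code.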

In spirit, selectolax feels like pygixml’s sibling — both written in Cython, both extremely fast, both with modern and minimal APIs.

🧠 Why Modern Parsers Make Sense Today

Here’s a summary of how these libraries stack up:

Library	Domain	Language Core	Speed	API Style	Notes
ElementTree	XML	C accelerator over expat (in CPython)	Slow	Standard	Built-in, minimal features
lxml	XML/HTML	C (libxml2) + Cython	Medium	Verbose	Mature but aging
pygixml	XML	C++ (pugixml) + Cython	Very Fast	Pythonic	XPath 1.0 support, mem_id, recursive text
selectolax	HTML	C (Modest or lexbor) + Cython	Very Fast	Pythonic	CSS selectors, minimal overhead

Both pygixml and selectolax share the same philosophy:

Use low-level performance engines (pugixml, lexbor)
Expose a simple, modern, Pythonic API
Strip away decades of legacy overhead

🚀 Time to Modernize Your Stack

If you’re still using lxml for XML or HTML in new projects, it might be time to consider a faster, cleaner alternative.

For XML: try pygixml
For HTML: try selectolax

Parsing in Python no longer has to be slow or clunky.
Modern Cython-powered parsers give you the best of both worlds — the speed of C/C++ and the elegance of Python.

🏁 Final Thoughts

lxml was a legend — and it still works fine.
But libraries like pygixml and selectolax show what parsing can feel like in 2025:
leaner, faster, and built for the way modern Python developers actually work.

If you deal with big XML or HTML workloads, give these tools a spin.
You might never look back.