DEV Community

Mohammad Raziei

Beyond lxml: Faster and More Pythonic Parsing with pygixml and selectolax

For almost two decades, lxml has been the go-to choice for parsing XML and HTML in Python.
It’s fast, reliable, and feature-rich — a powerful C-based library that has served the ecosystem extremely well.

But the world has changed.
XML and HTML parsing have new performance demands, and developers expect cleaner, faster, and more Pythonic APIs.

That’s where pygixml (for XML) and selectolax (for HTML) come in: two modern parsing libraries, built with Cython, that pair low-level speed with high-level usability.


🕰️ A Brief Look Back at lxml

Let’s give credit where it’s due.
lxml revolutionized parsing when it came out — it combined the power of the C-based libxml2 with a clean Python API.
For years, it was the de facto standard for working with XML and HTML.

A simple example:

from lxml import etree

xml = """<root><item>Alpha</item><item>Beta</item></root>"""
tree = etree.fromstring(xml)

for item in tree.xpath("//item"):
    print(item.text)

Output:

Alpha
Beta

Nothing wrong here — it works!
But as XML and HTML documents grow into the multi-megabyte range, lxml can start to struggle. Its API is somewhat verbose, and its performance, though backed by C, is not tuned for workloads of that size.


⚡ Meet pygixml — XML at C++ Speed

pygixml is a modern XML parser for Python, powered by pugixml (C++) and Cython.
It’s not just a wrapper — it’s a reimagined XML API that combines raw C++ speed with Pythonic usability.

Benchmarks show pygixml is 16× to 33× faster than ElementTree, and around 5× faster than lxml, depending on input size.
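
Speedups like these depend heavily on the input and the machine, so it is worth measuring on your own data. Here is a minimal timing sketch that uses only the standard library. The helpers `make_xml` and `best_time` are hypothetical names, the synthetic document is an assumption, and only the ElementTree baseline is registered; other parse functions (for example `pygixml.parse_string`, or lxml's `etree.fromstring`) can be added to the dictionary if those packages are installed.

```python
import timeit
import xml.etree.ElementTree as ET


def make_xml(n_items: int) -> str:
    """Build a synthetic document with n_items <item> children (illustrative data)."""
    items = "".join(f'<item id="{i}">value {i}</item>' for i in range(n_items))
    return f"<root>{items}</root>"


def best_time(parse, xml: str, repeat: int = 5, number: int = 10) -> float:
    """Return the best observed per-call time, in seconds, for parse(xml)."""
    timings = timeit.repeat(lambda: parse(xml), repeat=repeat, number=number)
    return min(timings) / number


# Parsers to compare, keyed by display name. Only the stdlib baseline is
# registered here; add other string-parsing callables to extend the comparison.
parsers = {"ElementTree": ET.fromstring}

xml = make_xml(10_000)
results = {name: best_time(parse, xml) for name, parse in parsers.items()}

baseline = results["ElementTree"]
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.2f} ms ({baseline / seconds:.1f}x vs ElementTree)")
```

Taking the minimum over several repeats reduces noise from other processes; it measures parsing only, not XPath queries or traversal.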

Example:

import pygixml

xml = """<root><item id="1">Alpha</item><item id="2">Beta</item></root>"""
doc = pygixml.parse_string(xml)

for node in doc.select_nodes("//item"):
    print(node.xpath, node.text())

Output:

/root/item[1] Alpha
/root/item[2] Beta

Each node exposes an .xpath property (lxml offers the equivalent only through the tree-level getpath() method, not on the node itself), and every node has a unique mem_id, letting you find or reference elements instantly.
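
To make the idea of a per-node path concrete, here is a rough standard-library sketch that derives positional paths of the same shape as the output above. This is not how pygixml computes its .xpath property; `positional_paths` is a hypothetical helper written for illustration, and it only adds an index when a tag repeats among siblings.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def positional_paths(root):
    """Yield (path, element) pairs, indexing repeated sibling tags from 1."""

    def walk(element, path):
        yield path, element
        totals = Counter(child.tag for child in element)  # siblings per tag
        seen = Counter()                                  # running position per tag
        for child in element:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            yield from walk(child, f"{path}/{step}")

    yield from walk(root, f"/{root.tag}")


root = ET.fromstring('<root><item id="1">Alpha</item><item id="2">Beta</item></root>')
paths = {path: el.text for path, el in positional_paths(root)}

for path, text in paths.items():
    print(path, text)
```

Computing paths this way costs a full traversal, which is why having the path available directly on each node is convenient for logging and debugging.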

Need to iterate?

for node in doc:
    print(node.name, node.text(recursive=False))

Want the full text recursively?

print(doc.first_child().text(recursive=True))
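
The difference between direct and recursive text matters most for mixed content. As a point of comparison, here is how the same distinction looks in the standard library's ElementTree; this is only an analogy, and pygixml's exact handling of text that follows a child element is not assumed here.

```python
import xml.etree.ElementTree as ET

el = ET.fromstring("<p>Hello <b>big</b> world</p>")

# .text holds only the text that appears before the first child element.
direct_text = el.text

# itertext() walks the element and all descendants in document order.
full_text = "".join(el.itertext())

print(repr(direct_text))
print(repr(full_text))
```

In ElementTree the trailing " world" is stored on the child's .tail rather than on the parent, which is why the direct text stops at "Hello ".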

pygixml is designed for developers who work with huge XML files, complex XPath queries, and high-performance data pipelines.
It’s production-ready, thread-safe, and ridiculously fast.

📘 Full API documentation:
https://mohammadraziei.github.io/pygixml/api.html


🌐 And for HTML? Meet selectolax

When it comes to HTML parsing, selectolax fills the same niche that pygixml does for XML.
It’s a Cython binding to fast C HTML5 engines (Modest, used by selectolax.parser.HTMLParser, and lexbor, used by selectolax.lexbor.LexborHTMLParser), and it offers a familiar, Pythonic interface.

Example:

from selectolax.parser import HTMLParser

html = """<html><body><div class='post'>Hello</div><div class='post'>World</div></body></html>"""
tree = HTMLParser(html)

for node in tree.css("div.post"):
    print(node.text())

Output:

Hello
World

It even supports CSS selectors natively, making it ideal for scraping and lightweight DOM manipulation.

In spirit, selectolax feels like pygixml’s sibling — both written in Cython, both extremely fast, both with modern and minimal APIs.


🧠 Why Modern Parsers Make Sense Today

Here’s a summary of how these libraries stack up:

| Library | Domain | Language Core | Speed | API Style | Notes |
|---|---|---|---|---|---|
| ElementTree | XML | Python | Slow | Standard | Built-in, minimal features |
| lxml | XML/HTML | C (libxml2) | Medium | Verbose | Mature but aging |
| pygixml | XML | C++ (pugixml) + Cython | Very Fast | Pythonic | Full XPath support, mem_id, recursive text |
| selectolax | HTML | C (lexbor) + Cython | Very Fast | Pythonic | CSS selectors, minimal overhead |

Both pygixml and selectolax share the same philosophy:

  • Use low-level performance engines (pugixml, lexbor)
  • Expose a simple, modern, Pythonic API
  • Strip away decades of legacy overhead

🚀 Time to Modernize Your Stack

If you’re still using lxml for XML or HTML in new projects, it might be time to consider a faster, cleaner alternative.

Parsing in Python no longer has to be slow or clunky.
Modern Cython-powered parsers give you the best of both worlds — the speed of C/C++ and the elegance of Python.


🏁 Final Thoughts

lxml was a legend — and it still works fine.
But libraries like pygixml and selectolax show what parsing can feel like in 2025:
leaner, faster, and built for the way modern Python developers actually work.

If you deal with big XML or HTML workloads, give these tools a spin.
You might never look back.

