Mohammad Raziei

pygixml — The Fastest XML Parser for Python (Beating lxml at Its Own Game)

XML parsing in Python has always been… a bit of a pain.
We’ve all used ElementTree or lxml, but once your XML files hit hundreds of megabytes, the slowdown becomes impossible to ignore.

So, I decided to build something faster — much faster.
Meet pygixml — a new Python XML parser built with pugixml (C++) and Cython, designed from the ground up for speed, simplicity, and modern Python usability.


⚡ Why Another XML Parser?

Because speed and developer experience shouldn’t be mutually exclusive.

lxml is already a C-based library and faster than ElementTree, but it can still slow down on very large documents, complex XPath queries, and deep node traversal.
pygixml builds on pugixml’s fast C++ core and adds a Pythonic layer of convenience-driven APIs on top.


🧪 Benchmark Results

Let’s get straight to the numbers.

Parser      | Speed          | Notes
------------|----------------|-------------------------------------------
ElementTree | baseline       |
lxml        | ~6× faster     | solid, but starts to choke with large XMLs
pygixml     | 16×–33× faster | scales beautifully as input size grows

Even compared to lxml, pygixml shows up to 5× performance gains depending on the dataset size.
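
If you want to check numbers like these on your own data, a small timing harness is enough. The one below is only a sketch, not the benchmark behind the table above: the synthetic document, the `make_xml` and `best_time` helpers, and the repeat count are all made up for illustration. ElementTree is always timed; lxml and pygixml are timed only if they happen to be installed.

```python
import time
import xml.etree.ElementTree as ET


def make_xml(n_items: int) -> str:
    """Build a synthetic XML document with n_items <item> children (illustrative data)."""
    items = "".join(
        f'<item id="{i}"><name>name-{i}</name></item>' for i in range(n_items)
    )
    return f"<root>{items}</root>"


def best_time(parse, xml_text: str, repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over several parse runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        parse(xml_text)
        timings.append(time.perf_counter() - start)
    return min(timings)


# ElementTree ships with Python, so it serves as the baseline.
parsers = {"ElementTree": ET.fromstring}

try:
    from lxml import etree  # optional, timed only if installed
    parsers["lxml"] = etree.fromstring
except ImportError:
    pass

try:
    import pygixml  # optional, timed only if installed
    parsers["pygixml"] = pygixml.parse_string
except ImportError:
    pass

xml_text = make_xml(10_000)
results = {name: best_time(fn, xml_text) for name, fn in parsers.items()}
baseline = results["ElementTree"]

for name, seconds in results.items():
    speedup = baseline / seconds
    print(f"{name:12s} {seconds * 1000:8.2f} ms  ({speedup:.1f}x vs ElementTree)")
```

Taking the minimum over several runs reduces noise from other processes. Real files with deep nesting, attributes, and namespaces may behave quite differently from this flat synthetic document, so treat the output as a rough indication only.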


🧩 Key Features

  • Full XPath 1.0 support integrated into the Python API.
  • Each node has a unique mem_id for direct lookup and tracking.
  • Every node exposes an xpath property — something even pugixml itself doesn’t provide.
  • Flexible text extraction via text(recursive=True|False) for precise content control.
  • Simple helper functions: parse_file() and parse_string().
  • Lightweight, thread-safe, and built for massive XML pipelines.

🚀 Getting Started

Install it with pip:

pip install pygixml

Parse your first XML:

import pygixml

doc = pygixml.parse_file("books.xml")
root = doc.first_child()

print(root.xpath)
for book in root.select_nodes("//book"):
    print(book.xpath, book.text(recursive=False))

🔄 Iterating Over Elements

pygixml provides an easy way to walk through nodes using __iter__() — similar to iterators in BeautifulSoup or lxml, but lightning-fast and memory-efficient.

import pygixml

xml = """<library>
  <book id="a"><title>Alpha</title></book>
  <book id="b"><title>Beta</title></book>
</library>"""

doc = pygixml.parse_string(xml)

for node in doc:
    print(node.tag.name, node.text(recursive=False))

You can also inspect each node’s XML snippet directly:

for node in doc:
    print(node.tag.name)
    print("Inner XML:", node.xml)
    print("Text:", node.text(recursive=True))

This makes it incredibly simple to traverse complex XML trees without needing XPath every time.
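
To make the difference between recursive and non-recursive text extraction concrete, here is a stand-in written with the standard library's ElementTree rather than pygixml. It assumes that `recursive=False` means "only the text that belongs to the node itself" and `recursive=True` means "the text of the node and all of its descendants". The helper names `direct_text` and `recursive_text` are hypothetical, and pygixml's exact whitespace handling may differ.

```python
import xml.etree.ElementTree as ET

xml = """<library>
  <book id="a"><title>Alpha</title> first edition</book>
  <book id="b"><title>Beta</title></book>
</library>"""

root = ET.fromstring(xml)


def direct_text(elem: ET.Element) -> str:
    """Text that belongs to elem itself, excluding text inside child elements."""
    # In ElementTree, text after a child's closing tag is stored in child.tail.
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts).strip()


def recursive_text(elem: ET.Element) -> str:
    """Text of elem and all of its descendants, in document order."""
    return "".join(elem.itertext()).strip()


for book in root.iter("book"):
    print(book.get("id"), repr(direct_text(book)), repr(recursive_text(book)))
```

For the first book, the direct text is 'first edition' while the recursive text is 'Alpha first edition'. The second book has no direct text at all, only the 'Beta' inside its title.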


🔍 Advanced Example — Working with mem_id and XPath

The mem_id property lets you uniquely identify nodes, and xpath gives you their absolute path.

import pygixml 

xml = """<root><item id="a"><name>Alpha</name></item>
         <item id="b"><name>Beta</name></item></root>"""

doc = pygixml.parse_string(xml)
items = doc.select_nodes("//item")  # the absolute path "/root/item" selects the same nodes
root = doc.first_child()

for node in items:
    print("mem_id:", node.mem_id)
    print("parent:", node.parent)
    print("xpath:", node.xpath)
    print("text:", node.text(recursive=True))
    print(node == root.select_node(node.xpath).node) # True

You can also fetch a node directly using its mem_id:

node = doc.find_mem_id(items[0].mem_id)
print("Found:", node.xpath)

🧠 Why It’s More Than a Wrapper

While pygixml uses pugixml internally, it’s not a thin binding — it’s a full XML toolkit written with Python developers in mind.
Features like node search by memory ID, built-in XPath property access, and intuitive recursive text extraction don’t exist in pugixml.
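
To give a feel for what a per-node xpath property involves, here is one way to compute an absolute, positional path for a node. This is a sketch using the standard library's ElementTree, not pygixml's actual implementation. The function name `absolute_xpath` and the path format (a position on every step except the root) are assumptions, so pygixml's output may look different.

```python
import xml.etree.ElementTree as ET


def absolute_xpath(root: ET.Element, target: ET.Element) -> str:
    """Return a positional absolute path such as /root/item[2]/name[1]."""
    # ElementTree elements do not store their parent, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # XPath positions are 1-based and count only same-named siblings.
        same_name = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_name.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


xml = """<root><item id="a"><name>Alpha</name></item>
         <item id="b"><name>Beta</name></item></root>"""

root = ET.fromstring(xml)
second_name = root.findall("item")[1].find("name")

path = absolute_xpath(root, second_name)
print(path)

# Round trip: ElementTree evaluates paths relative to the root element,
# so strip the leading "/root/" before looking the node up again.
found = root.find(path[len("/" + root.tag + "/"):])
print(found is second_name)
```

The round trip at the end mirrors the `node == root.select_node(node.xpath).node` check from the earlier example: a path is only useful as an identifier if resolving it leads back to the same node.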

If you’ve ever wanted lxml-level power without its overhead, pygixml is for you.


📘 Documentation

You can explore the full API reference here:
👉 https://mohammadraziei.github.io/pygixml/api.html


🏁 Final Thoughts

Parsing XML shouldn’t be the bottleneck in your pipeline.
With pygixml, you get the speed of C++ and the simplicity of Python, in one elegant library.

👉 Try it here: github.com/mohammadraziei/pygixml

If you have ideas, performance tests, or contributions — I’d love to hear them.
