XML parsing in Python has always been… a bit of a pain.
We’ve all used ElementTree or lxml, but once your XML files hit hundreds of megabytes, the slowdown becomes impossible to ignore.
So, I decided to build something faster — much faster.
Meet pygixml — a new Python XML parser built with pugixml (C++) and Cython, designed from the ground up for speed, simplicity, and modern Python usability.
⚡ Why Another XML Parser?
Because speed and developer experience shouldn’t be mutually exclusive.
Although lxml is already C-based and faster than ElementTree, it still struggles with very large XML files, complex XPath queries, and deep node traversal.
pygixml, on the other hand, leverages pugixml’s blazing-fast C++ core but adds a Pythonic layer with powerful, convenience-driven APIs.
🧪 Benchmark Results
Let’s get straight to the numbers.
| Parser | Speed relative to ElementTree | Notes |
|---|---|---|
| ElementTree | 1× | baseline |
| lxml | ~6× faster | solid, but starts to choke with large XMLs |
| pygixml | 16×–33× faster | scales beautifully as input size grows |
Even compared to lxml, pygixml is up to 5× faster, depending on the dataset size.
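If you want to check numbers like these on your own machine, here is a minimal timing sketch. It is not the benchmark behind the table above: the synthetic document, the helper names (make_xml, collect_parsers, benchmark) and the best-of-five timing are my own assumptions. ElementTree is always measured; lxml and pygixml are only included if they happen to be installed.

```python
import timeit
import xml.etree.ElementTree as ET


def make_xml(n_books: int) -> str:
    """Build a synthetic <library> document with n_books <book> elements."""
    books = "".join(
        f'<book id="b{i}"><title>Title {i}</title></book>' for i in range(n_books)
    )
    return f"<library>{books}</library>"


def collect_parsers() -> dict:
    """Return a name -> parse-callable mapping for the parsers available here."""
    parsers = {"ElementTree": ET.fromstring}
    try:
        from lxml import etree
        parsers["lxml"] = lambda s: etree.fromstring(s.encode("utf-8"))
    except ImportError:
        pass  # lxml is optional
    try:
        import pygixml
        parsers["pygixml"] = pygixml.parse_string
    except ImportError:
        pass  # pygixml is optional
    return parsers


def benchmark(xml: str, repeat: int = 5) -> dict:
    """Best-of-`repeat` wall-clock parse time (seconds) for each parser."""
    return {
        name: min(timeit.repeat(lambda: parse(xml), number=1, repeat=repeat))
        for name, parse in collect_parsers().items()
    }


results = benchmark(make_xml(1000))
baseline = results["ElementTree"]
for name, seconds in results.items():
    # Speed-up is reported relative to ElementTree, like the table above.
    print(f"{name:12s} {seconds * 1e3:8.3f} ms   {baseline / seconds:5.1f}x")
```

Raise the argument to make_xml to see how the ratios change with input size; a small document like this one mostly measures call overhead.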
🧩 Key Features
- Full XPath 1.0 support integrated into the Python API.
- Each node has a unique mem_id for direct lookup and tracking.
- Every node exposes an xpath property, something even pugixml itself doesn’t provide.
- Flexible text extraction via text(recursive=True|False) for precise content control.
- Simple helper functions: parse_file() and parse_string().
- Lightweight, thread-safe, and built for massive XML pipelines.
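To illustrate the recursive versus non-recursive text idea from the list, here is a small standard-library sketch. It is an analogy built on ElementTree, not pygixml's implementation, and it assumes that recursive=False means "only the text directly inside this node" while recursive=True means "this node's text plus the text of all descendants". The helper name text is hypothetical.

```python
import xml.etree.ElementTree as ET


def text(node: ET.Element, recursive: bool = True) -> str:
    """Hypothetical ElementTree stand-in for a text(recursive=...) helper."""
    if recursive:
        # All text below this node, in document order.
        return "".join(node.itertext())
    # Direct text only: the node's leading text plus the text that
    # follows each child element (ElementTree stores that as child.tail).
    parts = [node.text or ""]
    parts.extend(child.tail or "" for child in node)
    return "".join(parts)


book = ET.fromstring("<book>Intro <title>Alpha</title> outro</book>")
print(repr(text(book, recursive=True)))   # 'Intro Alpha outro'
print(repr(text(book, recursive=False)))  # 'Intro  outro'
```

The second result skips the title's own text, which is the kind of control the recursive flag is meant to give you.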
🚀 Getting Started
Install it with pip:
pip install pygixml
Parse your first XML:
import pygixml
doc = pygixml.parse_file("books.xml")
root = doc.first_child()
print(root.xpath)
for book in root.select_nodes("//book"):
    print(book.xpath, book.text(recursive=False))
🔄 Iterating Over Elements
pygixml provides an easy way to walk through nodes using __iter__() — similar to iterators in BeautifulSoup or lxml, but lightning-fast and memory-efficient.
import pygixml
xml = """<library>
<book id="a"><title>Alpha</title></book>
<book id="b"><title>Beta</title></book>
</library>"""
doc = pygixml.parse_string(xml)
for node in doc:
    print(node.tag.name, node.text(recursive=False))
You can also inspect each node’s XML snippet directly:
for node in doc:
    print(node.tag.name)
    print("Inner XML:", node.xml)
    print("Text:", node.text(recursive=True))
This makes it incredibly simple to traverse complex XML trees without needing XPath every time.
🔍 Advanced Example — Working with mem_id and XPath
The mem_id property lets you uniquely identify nodes, and xpath gives you their absolute path.
import pygixml
xml = """<root><item id="a"><name>Alpha</name></item>
<item id="b"><name>Beta</name></item></root>"""
doc = pygixml.parse_string(xml)
items = doc.select_nodes("//item")  # or doc.select_nodes("item")
root = doc.first_child()
for node in items:
    print("mem_id:", node.mem_id)
    print("parent:", node.parent)
    print("xpath:", node.xpath)
    print("text:", node.text(recursive=True))
    print(node == root.select_node(node.xpath).node)  # True
You can also fetch a node directly using its mem_id:
node = doc.find_mem_id(items[0].mem_id)
print("Found:", node.xpath)
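To make "absolute path" concrete, here is a rough standard-library sketch of how such a path can be derived for any element. It uses ElementTree rather than pygixml, the function name absolute_xpath is hypothetical, and the output format (a positional index on every step below the root) is my assumption. The strings pygixml actually returns may look different.

```python
import xml.etree.ElementTree as ET


def absolute_xpath(root: ET.Element, target: ET.Element) -> str:
    """Build an absolute, index-qualified path from root down to target."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        if parent is None:
            # Reached the root element: it needs no positional index.
            steps.append(f"/{node.tag}")
        else:
            # XPath positions are 1-based and count same-named siblings only.
            same_tag = [c for c in parent if c.tag == node.tag]
            index = next(i for i, c in enumerate(same_tag, 1) if c is node)
            steps.append(f"/{node.tag}[{index}]")
        node = parent
    return "".join(reversed(steps))


root = ET.fromstring(
    '<root><item id="a"><name>Alpha</name></item>'
    '<item id="b"><name>Beta</name></item></root>'
)
second_item = root.findall("item")[1]
print(absolute_xpath(root, second_item))               # /root/item[2]
print(absolute_xpath(root, second_item.find("name")))  # /root/item[2]/name[1]
```

A path built this way identifies exactly one node, which is why evaluating a node's own path can return that same node.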
🧠 Why It’s More Than a Wrapper
While pygixml uses pugixml internally, it’s not a thin binding — it’s a full XML toolkit written with Python developers in mind.
Features like node search by memory ID, built-in XPath property access, and intuitive recursive text extraction don’t exist in pugixml.
If you’ve ever wanted lxml-level power without its overhead, pygixml is for you.
📘 Documentation
You can explore the full API reference here:
👉 https://mohammadraziei.github.io/pygixml/api.html
🏁 Final Thoughts
Parsing XML shouldn’t be the bottleneck in your pipeline.
With pygixml, you get the speed of C++ and the simplicity of Python, in one elegant library.
👉 Try it here: github.com/mohammadraziei/pygixml
If you have ideas, performance tests, or contributions — I’d love to hear them.
