Mohammad Raziei

How to Parse XML Fast in 2026 (Python)

JSON won the internet. We all know that. But XML never left — it just moved
into the places where reliability matters more than trendiness.

If you work with Maven configs, Android manifests, Office Open XML (.docx/.xlsx),
SVG, RSS feeds, DocBook, SOAP services, or any enterprise integration layer, you're
still parsing XML. And in 2026, there's no excuse for it being slow.

The Problem with XML Parsing in Python

Python's standard library ships with xml.etree.ElementTree. It works. It's
fine for small files. But the moment your XML grows beyond a few hundred
elements, ElementTree becomes a bottleneck — because it builds a full Python
object for every single node, attribute, and text node in the tree.
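You can see this object-per-node cost directly with the standard library. The sketch below (illustrative, not part of pygixml) parses a 1,000-element document and confirms that every node materializes as a full Python `Element` instance:

```python
import xml.etree.ElementTree as ET

# Build a document with 1,000 <item> elements.
xml = "<root>" + "<item>x</item>" * 1000 + "</root>"
root = ET.fromstring(xml)

# Every node in the tree is a full Python Element instance,
# allocated eagerly during the parse.
elements = list(root.iter())
print(len(elements))  # 1001: the root plus 1,000 items
```

Each of those objects carries its own tag, attribute dict, and child list, which is exactly the overhead that grows with document size.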

The usual answer is lxml, which wraps libxml2 in C. It's fast and
feature-rich. But it's also a 5.5 MB install with a heavy dependency chain,
and its Python bindings add overhead on every call.

So what if you want the fastest possible parse, a tiny footprint, and a
clean Python API?

That's the question that led me to build pygixml: a Cython wrapper around
pugixml, one of the fastest C++ XML parsers in existence.

Let me show you the numbers first, then we'll get into the code.

The Numbers

Here's what happens when you parse a 5,000-element XML document with the
three most common Python XML libraries:

| Library | Parse Time | Speedup vs ElementTree |
|---|---|---|
| pygixml | 0.0009 s | 8.6× faster |
| lxml | 0.0041 s | 1.9× faster |
| ElementTree | 0.0076 s | 1.0× (baseline) |

And memory usage during the same parse:

| Library | Peak Memory |
|---|---|
| pygixml | 0.67 MB |
| lxml | 0.67 MB |
| ElementTree | 4.84 MB |

ElementTree uses 7× more memory because it materializes every node as a
full Python object. pygixml and lxml stay in C/C++ land until you
explicitly access data.
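The gap is easy to reproduce with `tracemalloc` alone. Here's an illustrative sketch for the ElementTree side; absolute numbers depend on your machine and Python version:

```python
import tracemalloc
import xml.etree.ElementTree as ET

# Roughly the shape of the benchmark document: 5,000 elements.
xml = "<root>" + "<item a='v'>text</item>" * 5000 + "</root>"

tracemalloc.start()
root = ET.fromstring(xml)
_, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
tracemalloc.stop()

print(f"ElementTree peak: {peak / 2**20:.2f} MB")
```

Run the same measurement around a pygixml or lxml parse and the peak stays far lower, because the tree itself lives outside the Python heap.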

The installed package size tells its own story:

| Package | Size |
|---|---|
| pygixml | 0.43 MB |
| lxml | 5.48 MB |

That's a 12× difference. If you're building a Docker image, Lambda function,
or anything where size matters, it adds up.

All benchmarks run on the same machine with time.perf_counter() across 5
warmed-up iterations. You can reproduce them yourself — the code is in the
benchmarks/ directory.
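If you want to see the shape of such a timing loop, here's a minimal harness in the same spirit — a sketch with names of my own choosing, not the actual benchmark code, which lives in benchmarks/:

```python
import time
import xml.etree.ElementTree as ET

def bench(parse, payload, warmup=2, iters=5):
    """Time parse(payload) over several warmed-up iterations."""
    for _ in range(warmup):       # warm caches, imports, allocator pools
        parse(payload)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        parse(payload)
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times), min(times)

xml = "<root>" + "<item>x</item>" * 5000 + "</root>"
avg, best = bench(ET.fromstring, xml)
print(f"avg {avg:.6f}s  min {best:.6f}s")
```

Reporting both the average and the minimum is deliberate: the minimum approximates the noise-free cost, while the average reflects what you'd see in practice.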

How pygixml Works Under the Hood

Three things make this fast:

  1. No Python object per node — the entire parsed tree lives in C++ memory. pygixml only creates a Python wrapper when you explicitly access a node.
  2. Zero-copy Cython bridge — data doesn't get copied between C++ and Python. Strings are encoded in-place.
  3. pugixml's custom allocator — pugixml uses a block-based memory pool instead of per-node malloc, which means fewer syscalls and better cache locality.
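Point 1 is the big one. Here's a toy sketch of the wrap-on-access pattern in pure Python — purely illustrative, not pygixml's internals (the real store is C++ memory managed by pugixml, and `LazyNode` is a name I made up):

```python
class LazyNode:
    """A lightweight handle into an opaque node store.

    The tree data lives in one shared structure; a wrapper object
    is only created when a node is actually touched."""

    def __init__(self, store, node_id):
        self._store = store   # dict standing in for C++-side storage
        self._id = node_id    # cheap handle, not a copy of the node

    @property
    def name(self):
        return self._store[self._id]["name"]

    def child(self, name):
        for cid in self._store[self._id]["children"]:
            if self._store[cid]["name"] == name:
                return LazyNode(self._store, cid)  # wrapper made on demand
        return None

store = {
    0: {"name": "library", "children": [1]},
    1: {"name": "book", "children": []},
}
root = LazyNode(store, 0)
print(root.child("book").name)  # book
```

Parsing a million-node document creates zero wrappers; touching three nodes creates three.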

Getting Started

pip install pygixml

One dependency-free install, 430 KB.

Parsing XML

import pygixml

xml = """
<library>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2" category="fiction">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
</library>
"""

doc = pygixml.parse_string(xml)
root = doc.root

# Access children
book = root.child("book")
print(book.name)                      # book
print(book.attribute("id").value)     # 1
print(book.child("title").text())     # The Great Gatsby

The API is deliberately simple: properties for simple access
(node.name, node.value, node.type), and methods for operations that take
arguments (node.child(name), node.text()). No surprises.
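For comparison, the same traversal with the standard library's ElementTree — the pygixml calls above map almost one-to-one onto it, which is what makes switching painless:

```python
import xml.etree.ElementTree as ET

xml = """
<library>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
    </book>
</library>
"""

root = ET.fromstring(xml)
book = root.find("book")
print(book.tag)                  # book
print(book.get("id"))            # 1
print(book.find("title").text)   # The Great Gatsby
```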

XPath Queries

This is where pygixml really shines. pugixml's XPath engine is fast,
standards-compliant (XPath 1.0), and fully exposed:

# All fiction books
fiction = root.select_nodes("book[@category='fiction']")
print(f"Found {len(fiction)} fiction books")

# Single match
match = root.select_node("book[@id='2']")
if match:
    print(match.node.child("title").text())   # 1984

# Pre-compile for repeated use
query = pygixml.XPathQuery("book[year > 1950]")
recent = query.evaluate_node_set(root)

# Scalar evaluations
avg = pygixml.XPathQuery(
    "sum(book/price) div count(book)"
).evaluate_number(root)
print(f"Average price: ${avg:.2f}")

has_orwell = pygixml.XPathQuery(
    "book[author='George Orwell']"
).evaluate_boolean(root)
print(f"Has Orwell: {has_orwell}")
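For contrast, ElementTree's `find`/`findall` only accept a small XPath subset — simple paths and attribute predicates — so most of the queries above have no stdlib equivalent. An illustrative sketch:

```python
import xml.etree.ElementTree as ET

xml = """
<library>
    <book id="1"><year>1925</year></book>
    <book id="2"><year>1949</year></book>
</library>
"""
root = ET.fromstring(xml)

# Simple attribute predicates work in the stdlib subset...
print(root.find("book[@id='2']") is not None)   # True

# ...but numeric comparisons, sum(), count(), etc. are rejected.
try:
    root.find("book[year > 1940]")
except SyntaxError as err:
    print("unsupported:", err)
```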

Creating XML

doc = pygixml.XMLDocument()
root = doc.append_child("catalog")
item = root.append_child("product")
item.append_child("name").set_value("Laptop")
item.append_child("price").set_value("999.99")

doc.save_file("catalog.xml")

Modifying XML

doc = pygixml.parse_string("<person><name>John</name></person>")
root = doc.root

root.child("name").set_value("Jane")
root.child("name").name = "full_name"
root.append_child("age").set_value("30")

print(root.xml)
# <person>
#   <full_name>Jane</full_name>
#   <age>30</age>
# </person>

Performance Tuning: Parse Flags

Here's a feature most Python XML libraries don't expose: parse flags.
pygixml gives you a ParseFlags enum with 18 options to control exactly
how pugixml processes your input.

# Fastest possible parse — skip everything optional
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)

# Pick exactly what you need
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)

ParseFlags.MINIMAL skips escape processing, EOL normalization, and
attribute whitespace conversion. On real-world XML with lots of escaped
content (&amp;, &lt;, etc.), this can give you a noticeable speed boost
over the default.

Which Library Should You Use?

| | pygixml | lxml | ElementTree |
|---|---|---|---|
| Parse speed | Fastest | Fast | Slowest |
| Memory | Low | Low | High (7×) |
| Package size | 0.43 MB | 5.48 MB | Built-in |
| XPath | 1.0 | 1.0 | Limited subset |
| XSLT | No | Yes | No |
| Schema validation | No | Yes | No |
| Dependencies | None | libxml2, libxslt | None |

The Full Benchmark

If you want to run the numbers yourself:

git clone https://github.com/MohammadRaziei/pygixml.git
cd pygixml

The project uses CMake for its build system, so benchmarks are built-in targets:

# Full suite: parsing (6 sizes), memory, package size
cmake --build build --target run_full_benchmarks

# Legacy parsing-only benchmark
cmake --build build --target run_benchmarks

# Or directly with Python
python benchmarks/full_benchmark.py

Here's the actual output from a recent run:

=====================================================================
PARSING PERFORMANCE
=====================================================================
    Size | Library      |    Avg (s) |    Min (s) |  Speedup vs ET
----------------------------------------------------------------------
     100 | pygixml      |   0.000008 |   0.000008 |          14.4x
     100 | lxml         |   0.000094 |   0.000088 |           1.2x
     100 | elementtree  |   0.000112 |   0.000108 |           1.0x
----------------------------------------------------------------------
     500 | pygixml      |   0.000097 |   0.000096 |           5.8x
     500 | lxml         |   0.000394 |   0.000385 |           1.4x
     500 | elementtree  |   0.000558 |   0.000542 |           1.0x
----------------------------------------------------------------------
    1000 | pygixml      |   0.000147 |   0.000143 |           7.8x
    1000 | lxml         |   0.001127 |   0.001052 |           1.0x
    1000 | elementtree  |   0.001146 |   0.001114 |           1.0x
----------------------------------------------------------------------
    5000 | pygixml      |   0.000883 |   0.000880 |           8.6x
    5000 | lxml         |   0.004108 |   0.003907 |           1.9x
    5000 | elementtree  |   0.007614 |   0.006634 |           1.0x
----------------------------------------------------------------------
   10000 | pygixml      |   0.001649 |   0.001635 |           9.8x
   10000 | lxml         |   0.009095 |   0.008174 |           1.8x
   10000 | elementtree  |   0.016108 |   0.013917 |           1.0x
----------------------------------------------------------------------

Memory usage (tracemalloc peak):

| Size | pygixml | lxml | ElementTree |
|---|---|---|---|
| 1,000 | 0.13 MB | 0.13 MB | 1.01 MB |
| 5,000 | 0.67 MB | 0.67 MB | 4.84 MB |
| 10,000 | 1.34 MB | 1.34 MB | 9.68 MB |

Package size:

| Package | Size |
|---|---|
| pygixml | 0.43 MB |
| lxml | 5.48 MB |

Wrap-Up

XML isn't going anywhere. The tools we use to process it matter more than
we think — especially when that XML is on the critical path of a request,
a batch job, or a data pipeline.

pygixml brings one of the fastest C++ XML parsers to Python with minimal
friction. Same API patterns you already know. Same XPath you already use.
Just faster.

If you try it out, I'd love to hear about your use case. And if the project
helps you, a star on GitHub goes a long way.

Have a different XML parsing strategy? Drop it in the comments — I'm
always looking for better approaches.
