When people benchmark XML tools, they often compare the wrong things.
They compare a full-document parser to a streaming parser. They
compare one tool that returns DOM objects to another that returns
arrays. They compare a micro-example that does not resemble
production code. Or they publish a time number without showing what
work was actually done.
That is not useful.
For real PHP projects, the right benchmark question is usually much
narrower:
If the task is to stream through a large XML feed, extract repeated
business records, and turn them into plain PHP arrays, what do I gain
by using raw XMLReader directly, and what do I gain by using a
focused extraction library such as XmlExtractKit?
That is the comparison I care about.
This article shows how I would benchmark that scenario in a way that
is both fair and technically honest.
What the benchmark should measure
For extraction-heavy workloads, “total runtime” is not enough.
A useful benchmark should measure at least four things:
- total wall-clock time;
- peak memory usage;
- time to first useful record;
- amount of userland code needed to express the task.
That last one is not a machine metric, but it matters. In many
integrations, the long-term cost is not CPU time. It is the amount of
extraction glue code you end up carrying from project to project.
The scenario: one realistic extraction task
To keep the comparison fair, both approaches should solve the same
task.
Here is the scenario:
- the input is a large XML feed;
- the feed contains repeated <offer> records;
- each offer has nested elements and attributes;
- we only care about offer records, not the rest of the document;
- the output for each record is a plain PHP array shaped for application use.
That is much closer to a real import or ETL job than an abstract
“parse XML” benchmark.
A minimal example of the feed structure looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<catalog generated_at="2026-04-01T08:00:00Z" region="eu">
<offer id="1001" available="true">
<sku>KB-1001</sku>
<name>Mechanical Keyboard</name>
<brand>Acme</brand>
<category>Keyboards</category>
<price currency="USD">129.90</price>
<stock>14</stock>
</offer>
<service id="svc-1">
<name>Extended Warranty</name>
</service>
<offer id="1002" available="false">
<sku>MS-1002</sku>
<name>Wireless Mouse</name>
<brand>Acme</brand>
<category>Mice</category>
<price currency="USD">39.90</price>
<stock>0</stock>
</offer>
</catalog>
In the benchmark, all implementations should produce records like
this:
[
'external_id' => '1001',
'available' => true,
'sku' => 'KB-1001',
'name' => 'Mechanical Keyboard',
'brand' => 'Acme',
'category' => 'Keyboards',
'price' => '129.90',
'currency' => 'USD',
'stock' => '14',
]
That common output shape is important. If the two approaches do
different work, the benchmark is meaningless.
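One way to enforce that requirement, rather than just assert it, is to diff the two record streams before timing anything. Here is a sketch; countRecordMismatches is a helper name I am introducing for illustration, and it would be fed the two iterator functions shown later in the article:

```php
<?php

declare(strict_types=1);

/**
 * Compare two record streams and count differing records.
 * Feed it the outputs of both implementations on the same fixture;
 * a fair benchmark requires zero mismatches and equal lengths.
 *
 * @param iterable<array<string, mixed>> $left
 * @param Iterator<array<string, mixed>> $right
 */
function countRecordMismatches(iterable $left, Iterator $right): int
{
    $mismatches = 0;
    foreach ($left as $record) {
        if (!$right->valid() || $record !== $right->current()) {
            $mismatches++;
        }
        $right->next();
    }
    // Any leftover records on the right side also count as mismatches.
    while ($right->valid()) {
        $mismatches++;
        $right->next();
    }

    return $mismatches;
}
```

Running it over both implementations on the benchmark fixture should return 0; anything else means the two sides are not doing the same work.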
What exactly is being compared
I would compare these two implementations:
1. Raw XMLReader
This is the “do it yourself” baseline:
- move the cursor manually;
- detect <offer> nodes;
- read attributes and child elements;
- assemble arrays by hand;
- yield one normalized record at a time.
2. XmlExtractKit (sbwerewolf/xml-navigator)
This is the extraction-first approach:
- still stream through the XML using XMLReader underneath;
- use FastXmlParser::extractPrettyPrint() to yield only matching nodes;
- normalize the resulting arrays into your application format.
This is the comparison that matters in practice because these two
approaches solve the same class of problem: streaming extraction of
repeated records.
Why I am not publishing arbitrary numbers here
A benchmark article becomes misleading very quickly when it includes
numbers without context.
Runtime depends on all of these things:
- PHP version and build;
- enabled extensions;
- CPU and storage;
- XML shape and depth;
- number of attributes;
- number of repeated child elements;
- whether your normalization step is trivial or heavy.
Because of that, I think the honest way to present this benchmark is:
- show the exact task;
- show both implementations;
- show the harness;
- explain what to measure;
- tell readers how to interpret the results.
That way the article stays useful even when the raw numbers differ
from machine to machine.
Step 1: generate a reproducible XML fixture
For a benchmark, hand-written miniature XML is not enough.
You want a generated fixture with many repeated records so that the
streaming behavior becomes visible.
Here is a simple generator:
<?php
declare(strict_types=1);
function generateCatalogFixture(string $path, int $offers = 9999): void
{
$fh = fopen($path, 'wb');
if ($fh === false) {
throw new RuntimeException(
'Cannot open fixture file for writing.'
);
}
$date = date('Y-m-d');
fwrite($fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($fh, "<catalog generated_at=\"$date\" region=\"eu\">\n");
for ($i = 1; $i <= $offers; $i++) {
$available = $i % 7 === 0 ? 'false' : 'true';
$brand = 'Brand-' . ($i % 13);
$category = 'Category-' . ($i % 9);
$price = number_format(10 + ($i % 1000) / 10, 2, '.', '');
$stock = (string) ($i % 25);
fwrite($fh, " <offer id=\"{$i}\" available=\"{$available}\">\n");
fwrite($fh, " <sku>SKU-{$i}</sku>\n");
fwrite($fh, " <name>Product {$i}</name>\n");
fwrite($fh, " <brand>{$brand}</brand>\n");
fwrite($fh, " <category>{$category}</category>\n");
fwrite($fh, " <price currency=\"USD\">{$price}</price>\n");
fwrite($fh, " <stock>{$stock}</stock>\n");
fwrite($fh, " </offer>\n");
if ($i % 1000 === 0) {
fwrite($fh, " <service id=\"svc-{$i}\"><name>Warranty</name></service>\n");
}
}
fwrite($fh, "</catalog>\n");
fclose($fh);
}
This gives both implementations identical input and enough repeated
records to make the comparison meaningful.
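Before running the real measurement, a quick sanity check confirms the fixture contains what you think it does. This sketch assumes generateCatalogFixture() from above; the small record count keeps the in-memory inspection cheap:

```php
<?php

declare(strict_types=1);

// Sanity-check a small fixture before running the real benchmark.
// Assumes generateCatalogFixture() from the generator above is loaded.
$fixture = sys_get_temp_dir() . '/catalog-sanity.xml';
generateCatalogFixture($fixture, 100);

$xml = file_get_contents($fixture);
if ($xml === false) {
    throw new RuntimeException('Cannot read fixture back.');
}

// Reading the whole file is fine here because the sanity fixture is
// tiny; the benchmark fixture itself should only ever be streamed.
printf(
    "offers: %d, bytes: %d\n",
    substr_count($xml, '<offer '),
    strlen($xml)
);
```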
Step 2: the raw XMLReader baseline
The baseline should be direct and honest. It should not be
intentionally ugly, but it should reflect the kind of code people
really end up writing when they solve the problem manually.
Here is one way to do that:
<?php
declare(strict_types=1);
/**
* @return Generator<int, array<string, mixed>>
*/
function iterateOffersWithXmlReader(string $path): Generator
{
$reader = XMLReader::open($path);
if ($reader === false) {
throw new RuntimeException('Cannot open XML file.');
}
try {
while ($reader->read()) {
if (
$reader->nodeType !== XMLReader::ELEMENT
|| $reader->name !== 'offer'
) {
continue;
}
$depth = $reader->depth;
$offer = [
'external_id' => $reader->getAttribute('id') ?? '',
'available' => ($reader->getAttribute('available') ?? '') === 'true',
'sku' => '',
'name' => '',
'brand' => '',
'category' => '',
'price' => '',
'currency' => '',
'stock' => '0',
];
while ($reader->read()) {
if (
$reader->nodeType === XMLReader::END_ELEMENT
&& $reader->name === 'offer'
&& $reader->depth === $depth
) {
break;
}
if ($reader->nodeType !== XMLReader::ELEMENT) {
continue;
}
switch ($reader->name) {
case 'sku':
$offer['sku'] = $reader->readString();
break;
case 'name':
$offer['name'] = $reader->readString();
break;
case 'brand':
$offer['brand'] = $reader->readString();
break;
case 'category':
$offer['category'] = $reader->readString();
break;
case 'price':
$offer['currency'] = $reader->getAttribute('currency') ?? '';
$offer['price'] = $reader->readString();
break;
case 'stock':
$offer['stock'] = $reader->readString();
break;
}
}
yield $offer;
}
} finally {
$reader->close();
}
}
There is nothing wrong with this code. In fact, for some one-off jobs
it is a perfectly reasonable solution.
But it already illustrates the tradeoff:
- manual cursor handling;
- explicit depth management;
- hand-written field extraction;
- record assembly mixed with traversal logic.
That is exactly what I want the benchmark to capture.
Step 3: the XmlExtractKit implementation
Now compare it with the extraction-first version.
XmlExtractKit still uses the streaming model, but it moves the code
closer to the business task: extract matching nodes, then normalize
them.
<?php
declare(strict_types=1);
use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;
/**
* @param array<string, mixed> $node
* @return array<string, mixed>
*/
function normalizeOffer(array $node): array
{
$offer = $node['offer'] ?? [];
$attributes = $offer['@attributes'] ?? [];
$price = $offer['price'] ?? [];
$priceAttributes = is_array($price) ? ($price['@attributes'] ?? []) : [];
return [
'external_id' => ($attributes['id'] ?? ''),
'available' => (($attributes['available'] ?? '')) === 'true',
'sku' => ($offer['sku'] ?? ''),
'name' => ($offer['name'] ?? ''),
'brand' => ($offer['brand'] ?? ''),
'category' => ($offer['category'] ?? ''),
'price' => is_array($price) ? ($price['@value'] ?? '') : $price,
'currency' => ($priceAttributes['currency'] ?? ''),
'stock' => ($offer['stock'] ?? '0'),
];
}
/**
* @return Generator<int, array<string, mixed>>
*/
function iterateOffersWithXmlExtractKit(string $path): Generator
{
$reader = XMLReader::open($path);
if ($reader === false) {
throw new RuntimeException('Cannot open XML file.');
}
try {
foreach (
FastXmlParser::extractPrettyPrint(
$reader,
static fn (XMLReader $cursor): bool =>
$cursor->nodeType === XMLReader::ELEMENT
&& $cursor->name === 'offer'
) as $node
) {
yield normalizeOffer($node);
}
} finally {
$reader->close();
}
}
The normalization step is explicit in both versions, which is good.
The difference is where the complexity lives.
With raw XMLReader, traversal and record assembly are tightly
coupled.
With XmlExtractKit, traversal stays streaming-based, but the
extraction phase is lifted into a more reusable form.
Step 4: use a benchmark harness that measures the right things
A benchmark harness should consume the records fully; otherwise the
result is misleading.
It should also record the time to the first yielded record, not just
the final completion time.
Here is a simple harness:
<?php
declare(strict_types=1);
/**
* @param callable(): iterable<array<string, mixed>> $factory
* @return array<string, int|float>
*/
function benchmarkExtraction(callable $factory): array
{
gc_collect_cycles();
if (function_exists('memory_reset_peak_usage')) {
memory_reset_peak_usage();
}
$startedAt = hrtime(true);
$firstRecordMs = null;
$count = 0;
$checksum = 0;
foreach ($factory() as $record) {
$count++;
$checksum += strlen((string) ($record['external_id'] ?? ''));
if ($firstRecordMs === null) {
$firstRecordMs = (hrtime(true) - $startedAt) / 1_000_000;
}
}
$elapsedMs = (hrtime(true) - $startedAt) / 1_000_000;
$peakMb = memory_get_peak_usage(true) / 1024 / 1024;
return [
'records' => $count,
'checksum' => $checksum,
'first_record_ms' => $firstRecordMs ?? 0.0,
'elapsed_ms' => $elapsedMs,
'peak_memory_mb' => $peakMb,
];
}
And here is how you would run it:
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
$fixture = __DIR__ . '/catalog-benchmark.xml';
generateCatalogFixture($fixture, 50000);
$xmlReaderResult = benchmarkExtraction(
static fn (): iterable => iterateOffersWithXmlReader($fixture)
);
$xmlExtractKitResult = benchmarkExtraction(
static fn (): iterable => iterateOffersWithXmlExtractKit($fixture)
);
var_export([
'xmlreader' => $xmlReaderResult,
'xmlextractkit' => $xmlExtractKitResult,
]);
That is enough to produce a reproducible benchmark on your own
machine.
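A single run can be noisy (filesystem cache, opcache warm-up, background load). One refinement, sketched here on top of benchmarkExtraction() from the harness above, is to repeat each measurement and report the median elapsed time rather than a single sample:

```php
<?php

declare(strict_types=1);

/**
 * Run a benchmark several times and report the median elapsed time.
 * Assumes benchmarkExtraction() from the harness above is loaded.
 *
 * @param callable(): iterable<array<string, mixed>> $factory
 */
function medianElapsedMs(callable $factory, int $runs = 5): float
{
    $samples = [];
    for ($i = 0; $i < $runs; $i++) {
        $samples[] = benchmarkExtraction($factory)['elapsed_ms'];
    }
    sort($samples);

    // Median: middle element, or mean of the two middle elements.
    $mid = intdiv($runs, 2);

    return $runs % 2 === 1
        ? $samples[$mid]
        : ($samples[$mid - 1] + $samples[$mid]) / 2;
}
```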
What I would expect to see
I would not assume that one implementation wins every metric.
That is exactly why this comparison is interesting.
Peak memory
If both solutions are truly streaming and process one record at a
time, peak memory should stay controlled in both implementations.
If one of them starts materializing too much intermediate state,
this metric will reveal it quickly.
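Peak memory is a single number; to see whether memory actually stays flat across the run, you can sample usage every N records. A sketch (the default batch size of 5000 is arbitrary):

```php
<?php

declare(strict_types=1);

/**
 * Wrap a record iterator and log memory usage every $every records.
 * If the stream is truly record-at-a-time, the samples stay flat
 * instead of climbing with the record count.
 *
 * @param iterable<array<string, mixed>> $records
 * @return Generator<int, array<string, mixed>>
 */
function withMemorySamples(iterable $records, int $every = 5000): Generator
{
    $count = 0;
    foreach ($records as $record) {
        $count++;
        if ($count % $every === 0) {
            fprintf(
                STDERR,
                "after %d records: %.1f MiB\n",
                $count,
                memory_get_usage(true) / 1024 / 1024
            );
        }
        yield $record;
    }
}
```

Wrapping either iterator in withMemorySamples() before feeding it to the harness leaves the records untouched while the samples go to stderr.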
Time to first record
This is one of the most underrated metrics for XML extraction
workloads.
If your pipeline can start processing useful data almost immediately,
that is a real engineering advantage. It matters for imports,
progress reporting, partial processing, and backpressure-aware
systems.
Total runtime
This matters, but it should not dominate the interpretation.
A lower-level implementation may sometimes squeeze out a small
performance advantage. But if that advantage comes with much more
traversal glue, branching, and duplicated code, it may not be the
better engineering choice.
Userland code size and complexity
This is not a synthetic concern.
In production codebases, a solution that is 5% faster but
significantly harder to review, extend, and reuse is often the more
expensive solution.
That is why I would always report both machine metrics and code-shape
metrics side by side.
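Code shape can be reported with a blunt but reproducible proxy: count the non-blank, non-comment lines in each implementation file. This is only a sketch, and a crude one, but it turns "amount of glue code" into a number you can put next to the runtime table:

```php
<?php

declare(strict_types=1);

/**
 * Count non-blank, non-comment lines in a PHP source file.
 * A crude proxy for "amount of userland code"; it ignores blank
 * lines, line comments, and docblock lines.
 */
function countEffectiveLines(string $path): int
{
    $source = file_get_contents($path);
    if ($source === false) {
        throw new RuntimeException("Cannot read {$path}.");
    }

    $count = 0;
    foreach (explode("\n", $source) as $line) {
        $trimmed = trim($line);
        if ($trimmed === ''
            || str_starts_with($trimmed, '//')
            || str_starts_with($trimmed, '/*')
            || str_starts_with($trimmed, '*')
            || str_starts_with($trimmed, '#')
        ) {
            continue;
        }
        $count++;
    }

    return $count;
}
```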
What would make the benchmark unfair
A benchmark like this becomes unreliable very quickly if you do any
of the following:
- compare different output shapes;
- parse different XML structures;
- make one side do more normalization work;
- compare a streaming solution to a full-document solution;
- benchmark tiny files that do not stress the streaming model;
- omit the code and only publish the result table.
The point is not to “win.”
The point is to understand the tradeoff for a specific extraction
task.
My interpretation of this comparison
This is how I would read the result after running the benchmark.
If raw XMLReader is slightly faster but the difference is small,
I would still strongly consider XmlExtractKit for repeated
integration work because the extraction code is easier to reason
about and easier to reuse.
If XmlExtractKit comes out close on runtime and similar on peak
memory, that is already a strong result for the library because it
means the higher-level extraction model is not buying convenience
at an unreasonable systems cost.
If the XML task is extremely narrow and unlikely to be reused, raw
XMLReader may still be the right answer.
But if the workload looks like real feed processing and the
extraction pattern shows up again and again, the benefit of moving
from cursor choreography to extraction-oriented code becomes very
tangible.
Conclusion
The most useful XML benchmark is not “which parser is fastest in the
abstract.”
It is:
for this exact extraction task, on this XML shape, with this output
model, what do I gain in runtime, memory, first-record latency, and
maintainability?
That is why I think raw XMLReader vs XmlExtractKit is the
comparison worth making.
They belong to the same real-world decision point:
- write the traversal and extraction layer yourself;
- or keep the streaming model but use a focused library to reduce the
amount of glue code.
For large XML feeds in modern PHP, that is a benchmark that actually
tells you something useful.
Try it
composer require sbwerewolf/xml-navigator
Explore the demo project
git clone https://github.com/SbWereWolf/xml-extract-kit-demo-repo.git
cd xml-extract-kit-demo-repo
composer install