When people benchmark XML tools, they often compare the wrong things.
They compare a full-document parser to a streaming parser. They
compare one tool that returns DOM objects to another that returns
arrays. They compare a micro-example that does not resemble
production code. Or they publish a time number without showing what
work was actually done.
That is not useful.
For real PHP projects, the right benchmark question is usually much
narrower:
If the task is to stream through a large XML feed, extract repeated
business records, and turn them into plain PHP arrays, what do I gain
by using raw XMLReader directly, and what do I gain by using a
focused extraction library such as XmlExtractKit?
That is the comparison I care about.
This article shows how I would benchmark that scenario in a way that
is both fair and technically honest.
What the benchmark should measure
For extraction-heavy workloads, “total runtime” is not enough.
A useful benchmark should measure at least four things:
- total wall-clock time;
- peak memory usage;
- time to first useful record;
- amount of userland code needed to express the task.
That last one is not a machine metric, but it matters. In many
integrations, the long-term cost is not CPU time. It is the amount of
extraction glue code you end up carrying from project to project.
The scenario: one realistic extraction task
To keep the comparison fair, both approaches should solve the same
task.
Here is the scenario:
- the input is a large XML feed;
- the feed contains repeated <offer> records;
- each offer has nested elements and attributes;
- we only care about offer records, not the rest of the document;
- the output for each record is a plain PHP array shaped for application use.
That is much closer to a real import or ETL job than an abstract
“parse XML” benchmark.
A minimal example of the feed structure looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<catalog generated_at="2026-04-01T08:00:00Z" region="eu">
<offer id="1001" available="true">
<sku>KB-1001</sku>
<name>Mechanical Keyboard</name>
<brand>Acme</brand>
<category>Keyboards</category>
<price currency="USD">129.90</price>
<stock>14</stock>
</offer>
<service id="svc-1">
<name>Extended Warranty</name>
</service>
<offer id="1002" available="false">
<sku>MS-1002</sku>
<name>Wireless Mouse</name>
<brand>Acme</brand>
<category>Mice</category>
<price currency="USD">39.90</price>
<stock>0</stock>
</offer>
</catalog>
In the benchmark, all implementations should produce records like
this:
[
'external_id' => '1001',
'available' => true,
'sku' => 'KB-1001',
'name' => 'Mechanical Keyboard',
'brand' => 'Acme',
'category' => 'Keyboards',
'price' => '129.90',
'currency' => 'USD',
'stock' => '14',
]
That common output shape is important. If the two approaches do
different work, the benchmark is meaningless.
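One way to enforce that requirement, rather than just assert it, is to diff the two record streams before timing anything. Here is a sketch; countRecordMismatches is a helper name I am introducing for illustration, and it would be fed the two iterator functions shown later in the article:

```php
<?php

declare(strict_types=1);

/**
 * Compare two record streams and count differing records.
 * Feed it the outputs of both implementations on the same fixture;
 * a fair benchmark requires zero mismatches and equal lengths.
 *
 * @param iterable<array<string, mixed>> $left
 * @param Iterator<array<string, mixed>> $right
 */
function countRecordMismatches(iterable $left, Iterator $right): int
{
    $mismatches = 0;
    foreach ($left as $record) {
        if (!$right->valid() || $record !== $right->current()) {
            $mismatches++;
        }
        $right->next();
    }
    // Any leftover records on the right side also count as mismatches.
    while ($right->valid()) {
        $mismatches++;
        $right->next();
    }

    return $mismatches;
}
```

Running it over both implementations on the benchmark fixture should return 0; anything else means the two sides are not doing the same work.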
What exactly is being compared
I would compare these two implementations:
1. Raw XMLReader
This is the “do it yourself” baseline:
- move the cursor manually;
- detect <offer> nodes;
- read attributes and child elements;
- assemble arrays by hand;
- yield one normalized record at a time.
2. XmlExtractKit (sbwerewolf/xml-navigator)
This is the extraction-first approach:
- still stream through the XML using XMLReader underneath;
- use FastXmlParser::extractPrettyPrint() to yield only matching nodes;
- normalize the resulting arrays into your application format.
This is the comparison that matters in practice because these two
approaches solve the same class of problem: streaming extraction of
repeated records.
Why I am not publishing arbitrary numbers here
A benchmark article becomes misleading very quickly when it includes
numbers without context.
Runtime depends on all of these things:
- PHP version and build;
- enabled extensions;
- CPU and storage;
- XML shape and depth;
- number of attributes;
- number of repeated child elements;
- whether your normalization step is trivial or heavy.
Because of that, I think the honest way to present this benchmark is:
- show the exact task;
- show both implementations;
- show the harness;
- explain what to measure;
- tell readers how to interpret the results.
That way the article stays useful even when the raw numbers differ
from machine to machine.
Step 1: generate a reproducible XML fixture
For a benchmark, hand-written miniature XML is not enough.
You want a generated fixture with many repeated records so that the
streaming behavior becomes visible.
Here is a simple generator:
<?php
declare(strict_types=1);
function generateCatalogFixture(string $path, int $offers = 9999): void
{
$fh = fopen($path, 'wb');
if ($fh === false) {
throw new RuntimeException(
'Cannot open fixture file for writing.'
);
}
$date = date('Y-m-d');
fwrite($fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($fh, "<catalog generated_at=\"$date\" region=\"eu\">\n");
for ($i = 1; $i <= $offers; $i++) {
$available = $i % 7 === 0 ? 'false' : 'true';
$brand = 'Brand-' . ($i % 13);
$category = 'Category-' . ($i % 9);
$price = number_format(10 + ($i % 1000) / 10, 2, '.', '');
$stock = (string) ($i % 25);
fwrite($fh, " <offer id=\"{$i}\" available=\"{$available}\">\n");
fwrite($fh, " <sku>SKU-{$i}</sku>\n");
fwrite($fh, " <name>Product {$i}</name>\n");
fwrite($fh, " <brand>{$brand}</brand>\n");
fwrite($fh, " <category>{$category}</category>\n");
fwrite($fh, " <price currency=\"USD\">{$price}</price>\n");
fwrite($fh, " <stock>{$stock}</stock>\n");
fwrite($fh, " </offer>\n");
if ($i % 1000 === 0) {
fwrite($fh, " <service id=\"svc-{$i}\"><name>Warranty</name></service>\n");
}
}
fwrite($fh, "</catalog>\n");
fclose($fh);
}
This gives both implementations identical input and enough repeated
records to make the comparison meaningful.
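Before running the real measurement, a quick sanity check confirms the fixture contains what you think it does. This sketch assumes generateCatalogFixture() from above; the small record count keeps the in-memory inspection cheap:

```php
<?php

declare(strict_types=1);

// Sanity-check a small fixture before running the real benchmark.
// Assumes generateCatalogFixture() from the generator above is loaded.
$fixture = sys_get_temp_dir() . '/catalog-sanity.xml';
generateCatalogFixture($fixture, 100);

$xml = file_get_contents($fixture);
if ($xml === false) {
    throw new RuntimeException('Cannot read fixture back.');
}

// Reading the whole file is fine here because the sanity fixture is
// tiny; the benchmark fixture itself should only ever be streamed.
printf(
    "offers: %d, bytes: %d\n",
    substr_count($xml, '<offer '),
    strlen($xml)
);
```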
Step 2: the raw XMLReader baseline
The baseline should be direct and honest. It should not be
intentionally ugly, but it should reflect the kind of code people
really end up writing when they solve the problem manually.
Here is one way to do that:
<?php
declare(strict_types=1);
/**
* @return Generator<int, array<string, mixed>>
*/
function iterateOffersWithXmlReader(string $path): Generator
{
$reader = XMLReader::open($path);
if ($reader === false) {
throw new RuntimeException('Cannot open XML file.');
}
try {
while ($reader->read()) {
if (
$reader->nodeType !== XMLReader::ELEMENT
|| $reader->name !== 'offer'
) {
continue;
}
$depth = $reader->depth;
$offer = [
'external_id' => $reader->getAttribute('id') ?? '',
'available' => ($reader->getAttribute('available') ?? '') === 'true',
'sku' => '',
'name' => '',
'brand' => '',
'category' => '',
'price' => '',
'currency' => '',
'stock' => '0',
];
while ($reader->read()) {
if (
$reader->nodeType === XMLReader::END_ELEMENT
&& $reader->name === 'offer'
&& $reader->depth === $depth
) {
break;
}
if ($reader->nodeType !== XMLReader::ELEMENT) {
continue;
}
switch ($reader->name) {
case 'sku':
$offer['sku'] = $reader->readString();
break;
case 'name':
$offer['name'] = $reader->readString();
break;
case 'brand':
$offer['brand'] = $reader->readString();
break;
case 'category':
$offer['category'] = $reader->readString();
break;
case 'price':
$offer['currency'] = $reader->getAttribute('currency') ?? '';
$offer['price'] = $reader->readString();
break;
case 'stock':
$offer['stock'] = $reader->readString();
break;
}
}
yield $offer;
}
} finally {
$reader->close();
}
}
There is nothing wrong with this code. In fact, for some one-off jobs
it is a perfectly reasonable solution.
But it already illustrates the tradeoff:
- manual cursor handling;
- explicit depth management;
- hand-written field extraction;
- record assembly mixed with traversal logic.
That is exactly what I want the benchmark to capture.
Step 3: the XmlExtractKit implementation
Now compare it with the extraction-first version.
XmlExtractKit still uses the streaming model, but it moves the code
closer to the business task: extract matching nodes, then normalize
them.
<?php
declare(strict_types=1);
use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;
/**
* @param array<string, mixed> $node
* @return array<string, mixed>
*/
function normalizeOffer(array $node): array
{
$offer = $node['offer'] ?? [];
$attributes = $offer['@attributes'] ?? [];
$price = $offer['price'] ?? [];
$priceAttributes = is_array($price) ? ($price['@attributes'] ?? []) : [];
return [
'external_id' => ($attributes['id'] ?? ''),
'available' => (($attributes['available'] ?? '')) === 'true',
'sku' => ($offer['sku'] ?? ''),
'name' => ($offer['name'] ?? ''),
'brand' => ($offer['brand'] ?? ''),
'category' => ($offer['category'] ?? ''),
'price' => is_array($price) ? ($price['@value'] ?? '') : $price,
'currency' => ($priceAttributes['currency'] ?? ''),
'stock' => ($offer['stock'] ?? '0'),
];
}
/**
* @return Generator<int, array<string, mixed>>
*/
function iterateOffersWithXmlExtractKit(string $path): Generator
{
$reader = XMLReader::open($path);
if ($reader === false) {
throw new RuntimeException('Cannot open XML file.');
}
try {
foreach (
FastXmlParser::extractPrettyPrint(
$reader,
static fn (XMLReader $cursor): bool =>
$cursor->nodeType === XMLReader::ELEMENT
&& $cursor->name === 'offer'
) as $node
) {
yield normalizeOffer($node);
}
} finally {
$reader->close();
}
}
The normalization step is explicit in both versions, which is good.
The difference is where the complexity lives.
With raw XMLReader, traversal and record assembly are tightly
coupled.
With XmlExtractKit, traversal stays streaming-based, but the
extraction phase is lifted into a more reusable form.
Step 4: use a benchmark harness that measures the right things
A benchmark harness should consume the records fully; otherwise the
result is misleading.
It should also record the time to the first yielded record, not just
the final completion time.
Here is a simple harness:
<?php
declare(strict_types=1);
/**
* @param callable(): iterable<array<string, mixed>> $factory
* @return array<string, int|float>
*/
function benchmarkExtraction(callable $factory): array
{
gc_collect_cycles();
if (function_exists('memory_reset_peak_usage')) {
memory_reset_peak_usage();
}
$startedAt = hrtime(true);
$firstRecordMs = null;
$count = 0;
$checksum = 0;
foreach ($factory() as $record) {
$count++;
$checksum += strlen((string) ($record['external_id'] ?? ''));
if ($firstRecordMs === null) {
$firstRecordMs = (hrtime(true) - $startedAt) / 1_000_000;
}
}
$elapsedMs = (hrtime(true) - $startedAt) / 1_000_000;
$peakMb = memory_get_peak_usage(true) / 1024 / 1024;
return [
'records' => $count,
'checksum' => $checksum,
'first_record_ms' => $firstRecordMs ?? 0.0,
'elapsed_ms' => $elapsedMs,
'peak_memory_mb' => $peakMb,
];
}
And here is how you would run it:
<?php
declare(strict_types=1);
require_once __DIR__ . '/vendor/autoload.php';
$fixture = __DIR__ . '/catalog-benchmark.xml';
generateCatalogFixture($fixture, 50000);
$xmlReaderResult = benchmarkExtraction(
static fn (): iterable => iterateOffersWithXmlReader($fixture)
);
$xmlExtractKitResult = benchmarkExtraction(
static fn (): iterable => iterateOffersWithXmlExtractKit($fixture)
);
var_export([
'xmlreader' => $xmlReaderResult,
'xmlextractkit' => $xmlExtractKitResult,
]);
That is enough to produce a reproducible benchmark on your own
machine.
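A single run can be noisy (filesystem cache, opcache warm-up, background load). One refinement, sketched here on top of benchmarkExtraction() from the harness above, is to repeat each measurement and report the median elapsed time rather than a single sample:

```php
<?php

declare(strict_types=1);

/**
 * Run a benchmark several times and report the median elapsed time.
 * Assumes benchmarkExtraction() from the harness above is loaded.
 *
 * @param callable(): iterable<array<string, mixed>> $factory
 */
function medianElapsedMs(callable $factory, int $runs = 5): float
{
    $samples = [];
    for ($i = 0; $i < $runs; $i++) {
        $samples[] = benchmarkExtraction($factory)['elapsed_ms'];
    }
    sort($samples);

    // Median: middle element, or mean of the two middle elements.
    $mid = intdiv($runs, 2);

    return $runs % 2 === 1
        ? $samples[$mid]
        : ($samples[$mid - 1] + $samples[$mid]) / 2;
}
```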
What I would expect to see
I would not assume that one implementation wins every metric.
That is exactly why this comparison is interesting.
Peak memory
If both solutions are truly streaming and process one record at a
time, peak memory should stay controlled in both implementations.
If one of them starts materializing too much intermediate state,
this metric will reveal it quickly.
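Peak memory is a single number; to see whether memory actually stays flat across the run, you can sample usage every N records. A sketch (the default batch size of 5000 is arbitrary):

```php
<?php

declare(strict_types=1);

/**
 * Wrap a record iterator and log memory usage every $every records.
 * If the stream is truly record-at-a-time, the samples stay flat
 * instead of climbing with the record count.
 *
 * @param iterable<array<string, mixed>> $records
 * @return Generator<int, array<string, mixed>>
 */
function withMemorySamples(iterable $records, int $every = 5000): Generator
{
    $count = 0;
    foreach ($records as $record) {
        $count++;
        if ($count % $every === 0) {
            fprintf(
                STDERR,
                "after %d records: %.1f MiB\n",
                $count,
                memory_get_usage(true) / 1024 / 1024
            );
        }
        yield $record;
    }
}
```

Wrapping either iterator in withMemorySamples() before feeding it to the harness leaves the records untouched while the samples go to stderr.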
Time to first record
This is one of the most underrated metrics for XML extraction
workloads.
If your pipeline can start processing useful data almost immediately,
that is a real engineering advantage. It matters for imports,
progress reporting, partial processing, and backpressure-aware
systems.
Total runtime
This matters, but it should not dominate the interpretation.
A lower-level implementation may sometimes squeeze out a small
performance advantage. But if that advantage comes with much more
traversal glue, branching, and duplicated code, it may not be the
better engineering choice.
Userland code size and complexity
This is not a synthetic concern.
In production codebases, a solution that is 5% faster but
significantly harder to review, extend, and reuse is often the more
expensive solution.
That is why I would always report both machine metrics and code-shape
metrics side by side.
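Code shape can be reported with a blunt but reproducible proxy: count the non-blank, non-comment lines in each implementation file. This is only a sketch, and a crude one, but it turns "amount of glue code" into a number you can put next to the runtime table:

```php
<?php

declare(strict_types=1);

/**
 * Count non-blank, non-comment lines in a PHP source file.
 * A crude proxy for "amount of userland code"; it ignores blank
 * lines, line comments, and docblock lines.
 */
function countEffectiveLines(string $path): int
{
    $source = file_get_contents($path);
    if ($source === false) {
        throw new RuntimeException("Cannot read {$path}.");
    }

    $count = 0;
    foreach (explode("\n", $source) as $line) {
        $trimmed = trim($line);
        if ($trimmed === ''
            || str_starts_with($trimmed, '//')
            || str_starts_with($trimmed, '/*')
            || str_starts_with($trimmed, '*')
            || str_starts_with($trimmed, '#')
        ) {
            continue;
        }
        $count++;
    }

    return $count;
}
```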
What would make the benchmark unfair
A benchmark like this becomes unreliable very quickly if you do any
of the following:
- compare different output shapes;
- parse different XML structures;
- make one side do more normalization work;
- compare a streaming solution to a full-document solution;
- benchmark tiny files that do not stress the streaming model;
- omit the code and only publish the result table.
The point is not to “win.”
The point is to understand the tradeoff for a specific extraction
task.
My interpretation of this comparison
This is how I would read the result after running the benchmark.
If raw XMLReader is slightly faster but the difference is small,
I would still strongly consider XmlExtractKit for repeated
integration work because the extraction code is easier to reason
about and easier to reuse.
If XmlExtractKit comes out close on runtime and similar on peak
memory, that is already a strong result for the library because it
means the higher-level extraction model is not buying convenience
at an unreasonable systems cost.
If the XML task is extremely narrow and unlikely to be reused, raw
XMLReader may still be the right answer.
But if the workload looks like real feed processing and the
extraction pattern shows up again and again, the benefit of moving
from cursor choreography to extraction-oriented code becomes very
tangible.
Conclusion
The most useful XML benchmark is not “which parser is fastest in the
abstract.”
It is:
for this exact extraction task, on this XML shape, with this output
model, what do I gain in runtime, memory, first-record latency, and
maintainability?
That is why I think raw XMLReader vs XmlExtractKit is the
comparison worth making.
They belong to the same real-world decision point:
- write the traversal and extraction layer yourself;
- or keep the streaming model but use a focused library to reduce the
amount of glue code.
For large XML feeds in modern PHP, that is a benchmark that actually
tells you something useful.
Try it
composer require sbwerewolf/xml-navigator
Explore the demo project
git clone https://github.com/SbWereWolf/xml-extract-kit-demo-repo.git
cd xml-extract-kit-demo-repo
composer install