Nicholas Volkhin

XMLReader vs XmlExtractKit for Real XML Extraction Tasks in PHP

When PHP developers compare XML approaches, the comparison often
starts in the wrong place.

It usually becomes a vague question like this:

"What is the best XML library for PHP?"

That is too broad to be useful.

In real projects, the question is usually much narrower:

  • I have a large XML file;
  • it contains repeated business records;
  • I only need some of those records;
  • I want application-friendly PHP data, not a full in-memory XML tree.

That is not a general XML problem.

It is an extraction task.

And for this kind of work, the most honest comparison is often not
between two third-party packages. It is between:

  • raw XMLReader, where you write the extraction logic yourself;
  • a focused extraction toolkit, where the streaming model stays
    the same but the glue code becomes reusable.

In my case, that focused toolkit is XmlExtractKit, published as
sbwerewolf/xml-navigator.

This article compares both approaches on the same practical task.

The task

Suppose we have a large XML feed that contains repeated <offer>
records, mixed with other node types that we do not care about.

We want to extract each offer into a PHP array with a shape like this:

[
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.90,
    'currency' => 'USD',
]

Here is the sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <offer id="1001" available="true">
    <name>Keyboard</name>
    <price currency="USD">49.90</price>
  </offer>
  <service id="s-1">
    <name>Warranty</name>
  </service>
  <offer id="1002" available="false">
    <name>Mouse</name>
    <price currency="USD">19.90</price>
  </offer>
</catalog>

This is a simple example, but it is representative of a lot of real
XML integration work: repeated nodes, some attributes, some nested
values, and other elements that should be ignored.

Option 1: raw XMLReader

The low-level, memory-safe baseline in PHP is XMLReader: it walks the
document as a forward-only stream instead of building a full tree in
memory.

That makes it the right foundation for large-file extraction.

Here is one way to solve the task with plain XMLReader and a small
amount of helper parsing:

$reader = new XMLReader();

if (! $reader->open('feed.xml')) {
    throw new RuntimeException('Cannot open XML file.');
}

$rows = [];

while ($reader->read()) {
    // Skip everything that is not an opening <offer> element.
    if (
        $reader->nodeType !== XMLReader::ELEMENT
        || $reader->name !== 'offer'
    ) {
        continue;
    }

    // Parse only this node's fragment with SimpleXML;
    // the rest of the file is never held in memory.
    $offerXml = $reader->readOuterXML();
    $offer = simplexml_load_string($offerXml);

    if ($offer === false) {
        continue;
    }

    // Map attributes and child values into the target array shape.
    $rows[] = [
        'id' => (string) $offer['id'],
        'available' => ((string) $offer['available']) === 'true',
        'name' => (string) $offer->name,
        'price' => (float) $offer->price,
        'currency' => (string) $offer->price['currency'],
    ];
}

$reader->close();

var_export($rows);

Output:

array (
  0 =>
  array (
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.9,
    'currency' => 'USD',
  ),
  1 =>
  array (
    'id' => '1002',
    'available' => false,
    'name' => 'Mouse',
    'price' => 19.9,
    'currency' => 'USD',
  ),
)

This is a perfectly valid solution.

It is memory-safe in the important sense: we are not loading the
whole XML document into memory. We are moving through the stream and
extracting matching nodes.

For a one-off task, this may be enough.

But there are tradeoffs.

What the raw XMLReader version costs you

The raw XMLReader version works, but its cost is not obvious when
the example is this small.

The real cost shows up later:

  • matching logic has to be repeated or abstracted;
  • field extraction rules are embedded directly in the loop;
  • nested XML handling becomes more verbose;
  • attributes and text values require repeated manual decisions;
  • optional fields quickly add conditionals;
  • the same extraction pattern gets reimplemented across projects.

This is the critical point: the issue is not whether XMLReader is
capable. It absolutely is.

The issue is whether low-level cursor code is the right place to
keep business extraction logic once the project grows beyond a toy
example.
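
To make that concrete, here is a sketch of how the raw version tends to
evolve: the matching and fragment parsing move into a reusable
generator, and only the per-task mapping stays inline. This is my own
illustration, not the library's approach; the `streamElements()` helper
name is invented, and the sample feed is recreated inline so the sketch
is runnable.

```php
<?php

// Recreate the article's sample feed so the sketch is runnable.
$feedXml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <offer id="1001" available="true">
    <name>Keyboard</name>
    <price currency="USD">49.90</price>
  </offer>
  <service id="s-1">
    <name>Warranty</name>
  </service>
  <offer id="1002" available="false">
    <name>Mouse</name>
    <price currency="USD">19.90</price>
  </offer>
</catalog>
XML;
file_put_contents('feed.xml', $feedXml);

/**
 * Streams the file and yields one SimpleXMLElement per matching element.
 * Hypothetical helper for illustration; not part of any library.
 *
 * @return Generator<SimpleXMLElement>
 */
function streamElements(string $path, string $elementName): Generator
{
    $reader = new XMLReader();

    if (! $reader->open($path)) {
        throw new RuntimeException("Cannot open XML file: {$path}");
    }

    try {
        while ($reader->read()) {
            if (
                $reader->nodeType !== XMLReader::ELEMENT
                || $reader->name !== $elementName
            ) {
                continue;
            }

            $node = simplexml_load_string($reader->readOuterXML());

            if ($node !== false) {
                yield $node;
            }
        }
    } finally {
        // Close the cursor even if the consumer stops iterating early.
        $reader->close();
    }
}

// The per-task code shrinks to just the mapping.
$rows = [];

foreach (streamElements('feed.xml', 'offer') as $offer) {
    $rows[] = [
        'id' => (string) $offer['id'],
        'available' => ((string) $offer['available']) === 'true',
        'name' => (string) $offer->name,
        'price' => (float) $offer->price,
        'currency' => (string) $offer->price['currency'],
    ];
}
```

That refactor is exactly the first step toward a toolkit: the traversal
and fragment-parsing decisions are made once, and each new feed only
supplies its own mapping.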

Option 2: XmlExtractKit on top of XMLReader

Now let us solve the same extraction task using XmlExtractKit.

The important thing to understand is that the streaming model does
not change. Under the hood, the workflow is still based on XMLReader.

What changes is the level of abstraction.

Instead of manually managing cursor flow and converting node
fragments inline, the library lets me say:

  • stream through the XML;
  • select matching nodes;
  • receive structured PHP arrays for those nodes.

Here is the same scenario using FastXmlParser::extractHierarchy()
and XmlElement:

use SbWereWolf\XmlNavigator\Navigation\XmlElement;
use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;

require_once __DIR__ . '/vendor/autoload.php';

$reader = new XMLReader();

if (! $reader->open('feed.xml')) {
    throw new RuntimeException('Cannot open XML file.');
}

$rows = [];

foreach (
    FastXmlParser::extractHierarchy(
        $reader,
        // The matcher decides which nodes get extracted.
        static fn (XMLReader $cursor): bool =>
            $cursor->nodeType === XMLReader::ELEMENT
            && $cursor->name === 'offer'
    ) as $offerData
) {
    // Wrap the extracted structure for convenient navigation.
    $offer = new XmlElement($offerData);
    $name = $offer->pull('name')->current();
    $price = $offer->pull('price')->current();

    $rows[] = [
        'id' => $offer->get('id'),
        'available' => $offer->get('available') === 'true',
        'name' => $name?->value() ?? '',
        'price' => (float) ($price?->value() ?? 0),
        'currency' => $price?->get('currency') ?? '',
    ];
}

$reader->close();

var_export($rows);

The result is the same kind of application-level array:

array (
  0 =>
  array (
    'id' => '1001',
    'available' => true,
    'name' => 'Keyboard',
    'price' => 49.9,
    'currency' => 'USD',
  ),
  1 =>
  array (
    'id' => '1002',
    'available' => false,
    'name' => 'Mouse',
    'price' => 19.9,
    'currency' => 'USD',
  ),
)

That is the key comparison.

Both approaches are streaming-based. Both avoid loading the full XML
document into memory. Both can solve the same extraction task.

The difference is where the complexity lives.

The practical difference

With raw XMLReader, the extraction loop carries several
responsibilities at once:

  • traversal;
  • node matching;
  • fragment parsing;
  • data mapping;
  • shape normalization.

With XmlExtractKit, traversal remains streaming-based, but extraction
becomes more explicit and reusable.

That matters because most XML integration code is not judged only by
whether it works today. It is judged by what happens when you need to:

  • add another field;
  • support optional nodes;
  • process another repeated element type;
  • reuse the same extraction pattern in a second project;
  • hand the code to someone else six months later.
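
Take "support optional nodes" as an example. Suppose some offers later
gain an optional <oldprice> child (an element invented here purely for
illustration). In the raw-loop style, every such field adds one more
inline conditional next to the existing rules:

```php
<?php

// Hypothetical: an offer that carries an optional <oldprice> child.
$offer = simplexml_load_string(
    '<offer id="1001" available="true">'
    . '<name>Keyboard</name>'
    . '<price currency="USD">49.90</price>'
    . '<oldprice>59.90</oldprice>'
    . '</offer>'
);

$row = [
    'id' => (string) $offer['id'],
    'available' => ((string) $offer['available']) === 'true',
    'name' => (string) $offer->name,
    'price' => (float) $offer->price,
    'currency' => (string) $offer->price['currency'],
    // One more optional field, one more inline decision.
    'old_price' => isset($offer->oldprice)
        ? (float) $offer->oldprice
        : null,
];
```

One conditional is harmless; a dozen of them, spread through a loop
that also handles traversal and matching, is where maintenance cost
accumulates.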

In other words, the comparison is not just about performance. It is
about where you want complexity to accumulate.

What raw XMLReader is still excellent for

It is worth being very clear here: this is not an argument against
XMLReader.

XMLReader is the right foundation for large XML handling in PHP.

And there are cases where staying close to the metal is still the
best option:

  • the task is small and one-off;
  • you need very custom cursor-level logic;
  • the extraction rules are extremely specific;
  • introducing another abstraction would not pay for itself.

When that is the case, use raw XMLReader and move on.

That is a completely reasonable engineering choice.

Where XmlExtractKit starts paying off

A focused extraction toolkit starts making sense when the job repeats.

That usually means one or more of these are true:

  • XML files are large enough that streaming is mandatory;
  • extraction is a recurring integration pattern;
  • the codebase needs arrays, not XML trees;
  • multiple projects solve similar feed or import tasks;
  • you want a stable intermediate representation of XML records;
  • you want the extraction code to read like the task, not like cursor choreography.
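
One way to make extraction "read like the task" is a declarative field
map applied to each record. The sketch below uses plain SimpleXML and
an invented applyMap() helper; it is not XmlExtractKit's API, just an
illustration of the shape such code takes:

```php
<?php

// Sketch: field rules live in one declarative place, away from traversal.
// applyMap() and the map format are invented for illustration.
function applyMap(SimpleXMLElement $node, array $map): array
{
    $row = [];

    foreach ($map as $field => $extract) {
        $row[$field] = $extract($node);
    }

    return $row;
}

$offerMap = [
    'id' => fn (SimpleXMLElement $o) => (string) $o['id'],
    'available' => fn (SimpleXMLElement $o) => ((string) $o['available']) === 'true',
    'name' => fn (SimpleXMLElement $o) => (string) $o->name,
    'price' => fn (SimpleXMLElement $o) => (float) $o->price,
    'currency' => fn (SimpleXMLElement $o) => (string) $o->price['currency'],
];

$offer = simplexml_load_string(
    '<offer id="1002" available="false">'
    . '<name>Mouse</name>'
    . '<price currency="USD">19.90</price>'
    . '</offer>'
);

$row = applyMap($offer, $offerMap);
```

Adding a field means adding one line to the map, not one more branch
inside a cursor loop.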

That is the use case I built sbwerewolf/xml-navigator for.

I did not want a general-purpose XML mega-toolkit. I wanted a
practical way to keep the memory-safe streaming model while reducing
how much extraction glue code I had to keep rewriting.

A more honest way to compare XML tools

One of the reasons XML discussions become unhelpful is that people
compare tools that are not aimed at the same job.

A better comparison framework looks like this:

  • DOM / SimpleXML when the document is small and full-tree convenience matters;
  • raw XMLReader when the file is large and the task is custom enough that low-level control is worth it;
  • XmlExtractKit when the file is large, the task is extraction-focused, and you want structured arrays instead of repeated cursor glue.

That is much more useful than asking for a universal winner.

There is no universal winner.

There is only a better fit for the task in front of you.

So which one should you choose?

Here is my practical answer.

Choose raw XMLReader when:

  • you want maximal control;
  • the task is narrow;
  • the extraction code will probably never be reused;
  • a little extra boilerplate is acceptable.

Choose XmlExtractKit when:

  • you keep solving the same extraction problem repeatedly;
  • you want the XML stage to produce structured PHP arrays;
  • you want extraction code that is easier to read and maintain;
  • you want to stay streaming-first without hand-writing the same conversion patterns again and again.

Conclusion

For real XML extraction tasks in PHP, the main decision is usually
not "which XML package is best?"

It is this:

Do I want to keep solving this at the raw XMLReader level, or do I
want a reusable extraction-oriented layer on top of the same
streaming model?

That is the honest comparison.

XMLReader is still the correct low-level foundation for large XML
files.

But if your actual problem is repeated extraction of business records
into plain PHP arrays, then XmlExtractKit
(sbwerewolf/xml-navigator) is designed to make that workflow
cleaner, more reusable, and easier to maintain.

Try it

composer require sbwerewolf/xml-navigator
