Nicholas Volkhin

How to Parse Large XML Files in PHP Without Running Out of Memory

XML is still everywhere: supplier feeds, marketplace catalogs, partner
exports, legacy APIs, SOAP-ish payloads, ETL jobs. None of that is
glamorous, but plenty of production systems still depend on it.

The real problem starts when the file is no longer small.

At that point, the question is not really "How do I parse XML in
PHP?"
It becomes:

How do I process a large XML document safely, extract only the
records I care about, and keep the rest of my application working
with normal PHP data structures?

That is a very different problem.

In many real-world integrations, you do not need the whole XML
document in memory. You do not need to traverse every branch of the
tree. You do not need a rich DOM-style model.

You usually need something much simpler:

  • scan the file efficiently;
  • find repeated business records such as product, offer, or item;
  • extract those records;
  • turn them into arrays;
  • pass them to the rest of your pipeline.

That is the approach I use in modern PHP projects, and it is the one
I recommend for large XML workloads.

Why naive XML parsing stops working

For small files, the usual PHP XML tools are perfectly fine.

A typical first solution looks like this:

$xml = simplexml_load_file('feed.xml');

foreach ($xml->products->product as $product) {
    // process product
}

There is nothing wrong with that when the file is small and the
document structure is simple.

The trouble is that this style of code implicitly treats the XML file
as something you want to load and work with as a whole. For large
feeds, that is often the wrong tradeoff.

If you only need repeated business records from a large XML file,
materializing the entire document in memory is unnecessary work. It
also makes your pipeline more fragile as feeds grow over time.
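
To see the tradeoff concretely, here is a rough measurement sketch: build a synthetic feed, load it whole with SimpleXML, and compare peak memory against the file size. The element names and the record count are illustrative only.

```php
<?php
// Build a synthetic feed with many repeated <product> records,
// then load it whole and inspect peak memory usage.
$path = tempnam(sys_get_temp_dir(), 'feed-');
$fh = fopen($path, 'w');
fwrite($fh, '<catalog>');
for ($i = 0; $i < 50000; $i++) {
    fwrite($fh, "<product><id>$i</id><name>Item $i</name></product>");
}
fwrite($fh, '</catalog>');
fclose($fh);

$xml = simplexml_load_file($path); // materializes the entire tree at once

echo 'file size:   ', filesize($path), " bytes\n";
echo 'peak memory: ', memory_get_peak_usage(true), " bytes\n";

unlink($path);
```

On a typical run the peak memory is a multiple of the file size and grows with the feed, while a streaming reader keeps it roughly constant.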

This is why large-XML handling should start with a different mental
model:

Do not load the document. Stream through it and extract only what
matters.

The real task is usually extraction, not XML manipulation

In practice, most XML processing jobs in application code look like
this:

  • the file contains many repeated records;
  • you only need a subset of them;
  • you only need some fields from each record;
  • the result will end up in arrays, JSON, a database, or a queue.

That means the business task is usually not "work with XML as a
document."

It is:

Find the repeated records I care about and turn them into
application-friendly data.

That distinction matters because it leads directly to the right
low-memory approach.

The memory-safe foundation: XMLReader

In PHP, the standard low-level tool for memory-safe XML traversal is
XMLReader.

Instead of loading the entire document, it lets you move through the
XML cursor-style, node by node.

That is exactly what you want when the file is large.

Here is a minimal baseline example:

$reader = new XMLReader();

if (! $reader->open('feed.xml')) {
    throw new RuntimeException('Cannot open XML file.');
}

while ($reader->read()) {
    if (
        $reader->nodeType === XMLReader::ELEMENT
        && $reader->name === 'product'
    ) {
        // Serialize only this <product> subtree; the rest of the
        // document is never held in memory at once.
        $nodeXml = $reader->readOuterXML();

        $product = simplexml_load_string($nodeXml);

        $data = [
            'id' => (string) $product->id,
            'name' => (string) $product->name,
            'price' => (float) $product->price,
            'available' => (string) $product->available,
        ];

        // process $data immediately, then let it go out of scope
    }
}

$reader->close();

This is already much better than loading the full file up front.

It gives you the right execution model:

  • sequential reading;
  • low memory pressure;
  • immediate processing of extracted records.

If your XML task is simple and one-off, this may be enough.
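
When the same loop starts appearing in more than one place, a common refactor is to wrap it in a generator so callers can simply foreach over records. This is a hand-rolled sketch, not a library API; the function name and the feed structure are assumptions.

```php
<?php
// Sketch: a generator wrapping the XMLReader loop, yielding one
// plain array per <product>. Only the current record is in memory.
function productRecords(string $path): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open XML file: $path");
    }
    try {
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT
                && $reader->name === 'product'
            ) {
                $product = simplexml_load_string($reader->readOuterXML());
                yield [
                    'id'    => (string) $product->id,
                    'name'  => (string) $product->name,
                    'price' => (float) $product->price,
                ];
            }
        }
    } finally {
        $reader->close();
    }
}

// Usage with a small inline fixture:
$path = tempnam(sys_get_temp_dir(), 'feed-');
file_put_contents(
    $path,
    '<catalog><product><id>1</id><name>Keyboard</name>'
    . '<price>49.90</price></product></catalog>'
);

$records = iterator_to_array(productRecords($path), false);
unlink($path);
```

The `finally` block guarantees the reader is closed even if the consumer stops iterating early.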

But once you do this in more than one project, the weak points show
up quickly.

Where raw XMLReader starts to hurt

XMLReader is powerful, but it is also low-level.

The moment your extraction task becomes slightly more realistic, you
start accumulating glue code:

  • repeated node-selection logic;
  • conversion of XML fragments into arrays;
  • nested element handling;
  • attributes versus values;
  • optional nodes;
  • repeated fields like multiple <picture> tags;
  • serialization to JSON-friendly structures;
  • duplicated extraction code across projects.
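
To make that glue concrete, here is the kind of converter that tends to get rewritten per project: a hand-rolled sketch that turns a SimpleXMLElement fragment into a plain array, handling attributes, leaf values, and repeated children. It is illustrative only, not the library's implementation.

```php
<?php
// Convert an XML fragment into a plain array:
// - attributes go under '@attributes';
// - a plain leaf becomes its text value;
// - a leaf with attributes keeps its text under '@value';
// - repeated children (e.g. multiple <picture>) become lists.
function fragmentToArray(SimpleXMLElement $el): array|string
{
    $result = [];
    $seen = [];
    foreach ($el->attributes() as $name => $value) {
        $result['@attributes'][$name] = (string) $value;
    }
    foreach ($el->children() as $name => $child) {
        $value = fragmentToArray($child);
        if (array_key_exists($name, $result)) {
            if (!isset($seen[$name])) {
                $result[$name] = [$result[$name]]; // promote to a list
                $seen[$name] = true;
            }
            $result[$name][] = $value;
        } else {
            $result[$name] = $value;
        }
    }
    if ($el->count() === 0) {
        $text = trim((string) $el);
        if ($result === []) {
            return $text;              // plain leaf: just the text
        }
        if ($text !== '') {
            $result['@value'] = $text; // leaf with attributes and text
        }
    }
    return $result;
}

$offer = simplexml_load_string(
    '<offer id="1"><price currency="USD">49.90</price>'
    . '<picture>a.jpg</picture><picture>b.jpg</picture></offer>'
);
$arr = fragmentToArray($offer);
```

Every project ends up with a slightly different version of this function, which is exactly the duplication problem.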

At that point, memory is no longer the only concern.

Maintainability becomes the real cost.

This is the line I care about most in application code: not just "can
I stream it," but "can I keep the extraction logic readable after the
third similar integration?"

A more practical extraction-first approach

This is exactly why I built XmlExtractKit for PHP, published as
sbwerewolf/xml-navigator.

The goal is not to replace XMLReader, but to keep its streaming
model while moving application code closer to the actual business task.

Instead of managing the cursor manually and assembling records by
hand, I want code that says:

  • open a large XML stream;
  • match the elements I care about;
  • get plain PHP arrays back.

Here is a streaming example using the library:

use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;

require_once __DIR__ . '/vendor/autoload.php';

$uri = tempnam(sys_get_temp_dir(), 'xml-extract-kit-');
file_put_contents($uri, <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <offer id="1001" available="true">
    <name>Keyboard</name>
    <price currency="USD">49.90</price>
  </offer>
  <service id="s-1">
    <name>Warranty</name>
  </service>
  <offer id="1002" available="false">
    <name>Mouse</name>
    <price currency="USD">19.90</price>
  </offer>
</catalog>
XML);

$reader = XMLReader::open($uri);

if ($reader === false) {
    throw new RuntimeException('Cannot open XML file.');
}

$offers = FastXmlParser::extractPrettyPrint(
    $reader,
    static fn (XMLReader $cursor): bool =>
        $cursor->nodeType === XMLReader::ELEMENT
        && $cursor->name === 'offer'
);

foreach ($offers as $offer) {
    echo json_encode(
        $offer,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES
    ) . PHP_EOL;
}

$reader->close();
unlink($uri);

The output is application-friendly:

{
  "offer": {
    "@attributes": {
      "id": "1001",
      "available": "true"
    },
    "name": "Keyboard",
    "price": {
      "@value": "49.90",
      "@attributes": {
        "currency": "USD"
      }
    }
  }
}
{
  "offer": {
    "@attributes": {
      "id": "1002",
      "available": "false"
    },
    "name": "Mouse",
    "price": {
      "@value": "19.90",
      "@attributes": {
        "currency": "USD"
      }
    }
  }
}

This is still a streaming workflow. The difference is that the code
is now centered on the extraction task instead of low-level cursor management.

That becomes more valuable when the XML structure is nested,
partially optional, or reused across multiple integrations.

Why plain arrays are often the right output

A lot of application code does not really want XML.

It wants data.

Once the relevant record has been extracted, the rest of the system
usually prefers:

  • plain arrays;
  • normalized values;
  • JSON-ready structures;
  • data that can be validated, transformed, and persisted.
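
For example, one extracted record, shaped like the JSON output shown earlier, flattens into a row in a few lines of plain PHP. The column names here are illustrative.

```php
<?php
// Flatten one extracted offer into a row ready for a database or queue.
$offer = [
    'offer' => [
        '@attributes' => ['id' => '1001', 'available' => 'true'],
        'name' => 'Keyboard',
        'price' => [
            '@value' => '49.90',
            '@attributes' => ['currency' => 'USD'],
        ],
    ],
];

$row = [
    'id'        => (int) $offer['offer']['@attributes']['id'],
    'name'      => $offer['offer']['name'],
    'price'     => (float) $offer['offer']['price']['@value'],
    'currency'  => $offer['offer']['price']['@attributes']['currency'],
    'available' => $offer['offer']['@attributes']['available'] === 'true',
];

echo json_encode($row), PHP_EOL;
```

From here on, the rest of the pipeline never has to know the data started life as XML.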

That is why I think "XML extraction" is a more useful framing than
"XML handling."

Most business systems do not want to live inside an XML tree. They
want to move past it as quickly as possible.

If the XML document is just a transport format, then the best
workflow is usually:

XML stream -> selected nodes -> PHP arrays

That is the design center of my library.

When this approach makes sense

This style of XML processing works especially well when:

  • the XML file is large;
  • the document contains many repeated records;
  • you only need part of the document;
  • the extracted data should be processed immediately;
  • the rest of the application works with arrays, not DOM objects.

Typical examples include:

  • supplier and marketplace feeds;
  • product catalogs;
  • partner imports and exports;
  • ETL jobs;
  • queue payload preparation;
  • legacy integration endpoints that still speak XML.

When you probably do not need it

There are also cases where this is the wrong tool.

You probably do not need a streaming extraction approach when:

  • the XML is small;
  • loading the whole file is acceptable;
  • you need full-document manipulation;
  • your task is closer to DOM transformation than record extraction;
  • the XML structure is simple enough that a tiny one-off script is enough.

That is important to say explicitly.

Not every XML task needs an extraction-first workflow. But the ones
that do usually benefit from it immediately.

A useful rule of thumb

Here is the simplest practical rule I know:

  • if the XML is small and you need the whole document, convenience APIs are fine;
  • if the XML is large and you only need repeated records, stream it;
  • if you keep solving the same streaming extraction problem in multiple projects, stop writing the same glue code over and over.

That is the point where a focused library becomes worth it.

Conclusion

Large XML files are not primarily a parsing problem.

They are an extraction problem.

If you treat them like full in-memory documents, you often pay too
much in memory and complexity. If you treat them like streams of
repeated business records, the solution becomes safer, simpler, and
much easier to fit into modern PHP pipelines.

XMLReader gives you the right low-level foundation for that model.

And if your real task is not "load XML," but "extract matching
records and turn them into plain PHP arrays," then XmlExtractKit
(sbwerewolf/xml-navigator) was built exactly for that workflow.

Try it

composer require sbwerewolf/xml-navigator
