Supplier and marketplace integrations are one of the places where XML
refuses to die.
That is not a complaint. It is just the shape of the problem.
If you build import pipelines, catalog sync jobs, price updates,
availability updates, or partner data bridges, sooner or later you
will meet an XML feed that contains the data you need in a format
your application does not really want.
In those situations, the most important design decision is usually
not about XML itself.
It is about the processing model.
Do you treat the feed as a document that your application should
“work with”? Or do you treat it as an external transport format that
should be scanned, filtered, converted, normalized, and handed off to
the rest of your pipeline as plain PHP data?
For supplier and marketplace feeds, I strongly prefer the second model.
That is the approach behind XmlExtractKit, published as
sbwerewolf/xml-navigator.
The real shape of feed-processing work
When people hear “XML parsing,” the task can sound abstract.
Supplier and marketplace feeds are not abstract.
They usually look more like this:
- a large file with repeated business records;
- products, offers, items, categories, stock entries, prices, or media blocks;
- partial updates, optional fields, nested elements, repeated child tags, and attributes;
- a downstream pipeline that wants arrays, validated records, database writes, or queue jobs.
That means the actual engineering task is usually not:
“Parse XML.”
It is:
“Extract repeated records from an external feed and transform them
into a predictable internal format.”
That framing leads to much better implementation choices.
What makes feeds different from small XML documents
For small XML documents, convenient full-document APIs are often perfectly fine.
Feeds are different for a few reasons.
1. They are repetitive by nature
A feed usually contains the same business structure again and again:
<offer>, <product>, <item>, <entry>
That is a strong signal that you should process the XML as a sequence
of records, not as one big tree you want to keep in memory.
2. You rarely need everything
A typical import job does not need every element in the feed.
It may only need:
- the offer identifier;
- availability;
- price;
- currency;
- category;
- a few images;
- update timestamps;
- one or two custom attributes.
The rest is often irrelevant for the current pipeline step.
3. The output is almost never “more XML”
Your import layer usually wants:
- associative arrays;
- normalized field values;
- database rows;
- JSON payloads;
- DTOs;
- queue messages.
That is why feed work is usually an extraction and normalization
problem, not an XML-manipulation problem.
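As a tiny illustration of that hand-off, here is a sketch that takes an already-normalized record and turns it into a JSON queue payload. The field names and the message envelope are illustrative assumptions, not something the library prescribes:

```php
<?php

// A normalized record, as the later pipeline stages would produce it.
// Field names here are illustrative.
$record = [
    'external_id' => '1001',
    'sku'         => 'KB-1001',
    'price'       => '129.90',
    'currency'    => 'USD',
];

// Downstream wants JSON, not XML: wrap the record in a small message
// envelope and encode it for a queue.
$message = json_encode(
    ['type' => 'offer.updated', 'payload' => $record],
    JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES
);

echo $message, PHP_EOL;
```

The point is that the XML shape disappears entirely at this boundary; only plain data crosses it.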
A representative feed example
Here is a simplified feed structure that is close to what many
supplier and marketplace pipelines deal with:
<?xml version="1.0" encoding="UTF-8"?>
<catalog generated_at="2026-04-01T08:00:00Z" region="eu">
    <offer id="1001" available="true">
        <sku>KB-1001</sku>
        <name>Mechanical Keyboard</name>
        <brand>Acme</brand>
        <category>Keyboards</category>
        <price currency="USD">129.90</price>
        <oldprice currency="USD">149.90</oldprice>
        <picture>https://cdn.example.test/kb-1001-front.jpg</picture>
        <picture>https://cdn.example.test/kb-1001-side.jpg</picture>
        <stock>14</stock>
    </offer>
    <offer id="1002" available="false">
        <sku>MS-1002</sku>
        <name>Wireless Mouse</name>
        <brand>Acme</brand>
        <category>Mice</category>
        <price currency="USD">39.90</price>
        <picture>https://cdn.example.test/ms-1002.jpg</picture>
        <stock>0</stock>
    </offer>
</catalog>
This is already enough to illustrate the real shape of the work.
The import pipeline usually does not want to keep this XML structure
around.
It wants to turn each <offer> into something like:
[
    'external_id' => '1001',
    'sku' => 'KB-1001',
    'name' => 'Mechanical Keyboard',
    'brand' => 'Acme',
    'category' => 'Keyboards',
    'price' => '129.90',
    'currency' => 'USD',
    'old_price' => '149.90',
    'available' => true,
    'stock' => 14,
    'pictures' => [
        'https://cdn.example.test/kb-1001-front.jpg',
        'https://cdn.example.test/kb-1001-side.jpg',
    ],
]
That is the internal target.
Once you are clear about that, the XML side becomes much easier to
reason about.
The two stages that matter most
For feed pipelines, I think it helps to split the work into two
explicit stages.
Stage 1: extraction
This is where you identify the repeated record you care about and
convert it into a predictable PHP structure.
Stage 2: normalization
This is where you adapt that structure to your own application model:
- rename fields;
- cast values;
- collapse optional fields;
- map categories;
- validate currency or stock rules;
- prepare records for persistence or messaging.
Trying to collapse these two stages into one giant parsing function
usually makes the code harder to maintain.
Why streaming is such a good fit for feeds
Supplier and marketplace feeds are one of the best use cases for
streaming XML traversal.
The reasons are practical:
- files can become large over time;
- records are naturally repeated;
- each record can often be processed independently;
- you usually do not need the whole document tree;
- early filtering is valuable.
This is exactly where XMLReader and extraction-first libraries
built on top of it become useful.
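For reference, the raw pattern those libraries build on can be sketched with nothing but core extensions: XMLReader hops from record to record, and SimpleXML materializes only the current small fragment. This is a minimal sketch, not the library's internals:

```php
<?php

// Minimal streaming pattern: only one <offer> subtree is ever
// turned into a tree in memory at a time.
$path = tempnam(sys_get_temp_dir(), 'feed-');
file_put_contents($path, <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
    <offer id="1001"><sku>KB-1001</sku></offer>
    <offer id="1002"><sku>MS-1002</sku></offer>
</catalog>
XML);

$reader = new XMLReader();
$reader->open($path);

$skus = [];

// Skip forward to the first <offer> element.
while ($reader->read() && $reader->name !== 'offer');

// Hop from one <offer> sibling to the next, skipping each subtree
// after handing it off to SimpleXML.
while ($reader->name === 'offer') {
    $offer = simplexml_load_string($reader->readOuterXml());
    $skus[] = (string) $offer->sku;
    $reader->next('offer');
}

$reader->close();
unlink($path);

print_r($skus); // KB-1001, MS-1002
```

Extraction-first libraries wrap this loop so that the filter and the record conversion are the only parts you still write yourself.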
With XmlExtractKit, I usually approach these feeds as “find
repeated offers and turn them into arrays.”
Here is a streaming extraction example using
FastXmlParser::extractPrettyPrint():
use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;

require_once __DIR__ . '/vendor/autoload.php';

// Write a sample feed to a temporary file; note the <service>
// element that the extraction filter will skip.
$uri = tempnam(sys_get_temp_dir(), 'supplier-feed-');
file_put_contents($uri, <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<catalog generated_at="2026-04-01T08:00:00Z" region="eu">
    <offer id="1001" available="true">
        <sku>KB-1001</sku>
        <name>Mechanical Keyboard</name>
        <brand>Acme</brand>
        <category>Keyboards</category>
        <price currency="USD">129.90</price>
        <picture>https://cdn.example.test/kb-1001-front.jpg</picture>
        <picture>https://cdn.example.test/kb-1001-side.jpg</picture>
        <stock>14</stock>
    </offer>
    <service id="svc-1">
        <name>Extended Warranty</name>
    </service>
    <offer id="1002" available="false">
        <sku>MS-1002</sku>
        <name>Wireless Mouse</name>
        <brand>Acme</brand>
        <category>Mice</category>
        <price currency="USD">39.90</price>
        <picture>https://cdn.example.test/ms-1002.jpg</picture>
        <stock>0</stock>
    </offer>
</catalog>
XML);

$reader = XMLReader::open($uri);
if ($reader === false) {
    throw new RuntimeException('Cannot open XML feed.');
}

// Stream the feed and extract only <offer> elements.
$offers = FastXmlParser::extractPrettyPrint(
    $reader,
    static fn (XMLReader $cursor): bool =>
        $cursor->nodeType === XMLReader::ELEMENT
        && $cursor->name === 'offer'
);

foreach ($offers as $offer) {
    echo json_encode(
        $offer,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES
    ) . PHP_EOL;
}

$reader->close();
unlink($uri);
That extraction result is already close to what the rest of the
application needs.
It is still XML-derived data, but it is no longer trapped in XML
traversal logic.
Why readable arrays help so much in feed work
In feed-processing pipelines, readability is not cosmetic.
It directly affects how quickly you can:
- inspect bad records;
- log partial failures;
- test normalization rules;
- compare incoming and outgoing payloads;
- reason about optional fields;
- support multiple partner formats.
That is why array output is so practical.
For example, one extracted <offer> might look like this after
extractPrettyPrint():
[
    'offer' => [
        '@attributes' => [
            'id' => '1001',
            'available' => 'true',
        ],
        'sku' => 'KB-1001',
        'name' => 'Mechanical Keyboard',
        'brand' => 'Acme',
        'category' => 'Keyboards',
        'price' => [
            '@value' => '129.90',
            '@attributes' => [
                'currency' => 'USD',
            ],
        ],
        'picture' => [
            'https://cdn.example.test/kb-1001-front.jpg',
            'https://cdn.example.test/kb-1001-side.jpg',
        ],
        'stock' => '14',
    ],
]
That is a much better input for normalization code than a
half-processed XML cursor state.
The normalization step is where your business rules belong
Once the feed record is in array form, you can normalize it with
ordinary PHP code.
For example:
/**
 * @param array<string, mixed> $record
 * @return array<string, mixed>
 */
function normalizeOffer(array $record): array
{
    $offer = $record['offer'];

    // A single <picture> arrives as a string, repeated ones as a list.
    $pictures = $offer['picture'] ?? [];
    if (! is_array($pictures)) {
        $pictures = [$pictures];
    }

    return [
        'external_id' => $offer['@attributes']['id'] ?? null,
        'available' => ($offer['@attributes']['available'] ?? 'false') === 'true',
        'sku' => $offer['sku'] ?? null,
        'name' => $offer['name'] ?? null,
        'brand' => $offer['brand'] ?? null,
        'category' => $offer['category'] ?? null,
        'price' => $offer['price']['@value'] ?? null,
        'currency' => $offer['price']['@attributes']['currency'] ?? null,
        'stock' => isset($offer['stock']) ? (int) $offer['stock'] : null,
        'pictures' => array_values(array_filter($pictures, 'is_string')),
    ];
}
This is where business logic belongs.
Not in low-level XML traversal. Not in cursor movement. Not in string
fragments.
A clean import architecture keeps those concerns separate.
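Validation rules sit naturally in the same layer. Here is a sketch of a guard that runs after normalization; the function name and the rules are illustrative assumptions, not part of the library:

```php
<?php

// Illustrative post-normalization guard: reject records that are
// missing required fields or carry impossible values.
function isImportable(array $offer): bool
{
    return is_string($offer['sku'] ?? null)
        && $offer['sku'] !== ''
        && is_string($offer['currency'] ?? null)
        && ($offer['stock'] ?? 0) >= 0;
}

$good = ['sku' => 'KB-1001', 'currency' => 'USD', 'stock' => 14];
$bad  = ['sku' => '', 'currency' => 'USD', 'stock' => 14];

var_dump(isImportable($good)); // bool(true)
var_dump(isImportable($bad));  // bool(false)
```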
Repeated tags, attributes, and optional fields are not edge cases
In feed processing, these are normal conditions:
- multiple images;
- optional old price;
- empty stock fields;
- attributes that carry business meaning;
- tags that are present for some suppliers and absent for others.
That is another reason I prefer extraction to arrays early.
Once the record is in a stable PHP structure, handling these cases
becomes straightforward.
You can:
- default missing fields;
- cast types;
- merge repeated tags into lists;
- strip noise;
- build validation rules around familiar array shapes.
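The repeated-tag case in particular collapses into a one-line helper once the record is an array. A sketch, with an assumed helper name:

```php
<?php

// A single tag arrives as a scalar, repeated tags as a list, and a
// missing tag as nothing at all; this helper makes all three uniform.
function asList(mixed $value): array
{
    if ($value === null) {
        return [];
    }

    return is_array($value) ? array_values($value) : [$value];
}

var_export(asList('only-one.jpg'));    // one tag → list of one
echo PHP_EOL;
var_export(asList(['a.jpg', 'b.jpg'])); // repeated tags → kept as-is
echo PHP_EOL;
var_export(asList(null));               // missing tag → empty list
```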
One feed is manageable. Ten feeds expose architecture problems
A lot of parsing approaches look acceptable when there is only one
partner.
The trouble begins when the system grows:
- supplier A sends <offer>;
- supplier B sends <item>;
- marketplace C adds nested media blocks;
- another feed uses attributes where the previous one used child elements;
- one integration sends a full nightly catalog;
- another sends partial incremental updates.
At that point, the quality of your processing model matters much more
than the convenience of a single parser call.
The goal is not just “parse this file.”
The goal is to build a repeatable pattern:
- extract repeated records;
- convert them into stable PHP structures;
- normalize them into your domain shape;
- pass them downstream.
That pattern scales much better than spreading XML handling rules
throughout the codebase.
A useful split for real projects
For supplier and marketplace XML feeds, I think the cleanest split is
this:
Integration edge
- read the XML stream;
- extract only target records;
- convert them into arrays.
Normalization layer
- cast and validate fields;
- reconcile naming differences;
- apply partner-specific mapping rules;
- create consistent internal records.
Application layer
- persist catalog data;
- emit events;
- update search indexes;
- enqueue downstream jobs.
This keeps XML where it belongs: at the edge.
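Wired together, the three layers can be sketched with plain callables. All names here are illustrative; in a real project each layer would be its own class or service:

```php
<?php

// Integration edge: yields extracted record arrays (stubbed with a
// generator instead of a real XML stream).
$extract = static function (): Generator {
    yield ['offer' => ['sku' => 'KB-1001', 'stock' => '14']];
    yield ['offer' => ['sku' => 'MS-1002', 'stock' => '0']];
};

// Normalization layer: partner-specific mapping and casting.
$normalize = static fn (array $record): array => [
    'sku'   => $record['offer']['sku'],
    'stock' => (int) $record['offer']['stock'],
];

// Application layer: persistence stand-in that collects the rows.
$rows = [];
$persist = static function (array $row) use (&$rows): void {
    $rows[] = $row;
};

foreach ($extract() as $record) {
    $persist($normalize($record));
}

var_export($rows);
```

Swapping in a second supplier then means replacing only the edge and the mapping, never the application layer.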
When a full-document approach is still fine
Not every feed needs streaming.
If the XML is small and the structure is simple, a full-document
approach may be completely acceptable.
But supplier and marketplace integrations tend to drift in one
direction over time:
- more records;
- more nested data;
- more optional fields;
- more partner variants;
- more operational pressure.
That is why an extraction-first model is often the safer default.
It is not about premature optimization.
It is about choosing a processing pattern that continues to work when
the feed stops being toy-sized.
Conclusion
Supplier and marketplace XML feeds are rarely difficult because XML
is mysterious.
They are difficult because they combine repetition, size, optional
structure, external control, and business-specific normalization
rules.
That is why I think the most productive way to handle them in PHP is:
- stream the feed when needed;
- extract repeated records instead of loading everything;
- convert XML into plain arrays early;
- keep normalization and business rules outside low-level XML traversal.
That is the workflow I wanted from XmlExtractKit.
Not a giant XML abstraction layer. Not an attempt to make XML
pleasant.
Just a practical path from external XML feeds to application-ready
PHP data.
The library is available via Composer:

composer require sbwerewolf/xml-navigator