Supplier and marketplace integrations are one of the places where XML
refuses to die.
That is not a complaint. It is just the shape of the problem.
If you build import pipelines, catalog sync jobs, price updates,
availability updates, or partner data bridges, sooner or later you
will meet an XML feed that contains the data you need in a format
your application does not really want.
In those situations, the most important design decision is usually
not about XML itself.
It is about the processing model.
Do you treat the feed as a document that your application should
“work with”? Or do you treat it as an external transport format that
should be scanned, filtered, converted, normalized, and handed off to
the rest of your pipeline as plain PHP data?
For supplier and marketplace feeds, I strongly prefer the second model.
That is the approach behind XmlExtractKit, published as
sbwerewolf/xml-navigator.
The real shape of feed-processing work
When people hear “XML parsing,” the task can sound abstract.
Supplier and marketplace feeds are not abstract.
They usually look more like this:
- a large file with repeated business records;
- products, offers, items, categories, stock entries, prices, or media blocks;
- partial updates, optional fields, nested elements, repeated child tags, and attributes;
- a downstream pipeline that wants arrays, validated records, database writes, or queue jobs.
That means the actual engineering task is usually not:
“Parse XML.”
It is:
“Extract repeated records from an external feed and transform them
into a predictable internal format.”
That framing leads to much better implementation choices.
What makes feeds different from small XML documents
For small XML documents, convenient full-document APIs are often perfectly fine.
Feeds are different for a few reasons.
1. They are repetitive by nature
A feed usually contains the same business structure again and again:
<offer>, <product>, <item>, <entry>
That is a strong signal that you should process the XML as a sequence
of records, not as one big tree you want to keep in memory.
2. You rarely need everything
A typical import job does not need every element in the feed.
It may only need:
- the offer identifier;
- availability;
- price;
- currency;
- category;
- a few images;
- update timestamps;
- one or two custom attributes.
The rest is often irrelevant for the current pipeline step.
3. The output is almost never “more XML”
Your import layer usually wants:
- associative arrays;
- normalized field values;
- database rows;
- JSON payloads;
- DTOs;
- queue messages.
That is why feed work is usually an extraction and normalization
problem, not an XML-manipulation problem.
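As a tiny illustration of that hand-off, here is a sketch that takes an already-normalized record and turns it into a JSON queue payload. The field names and the message envelope are illustrative assumptions, not something the library prescribes:

```php
<?php

// A normalized record, as the later pipeline stages would produce it.
// Field names here are illustrative.
$record = [
    'external_id' => '1001',
    'sku'         => 'KB-1001',
    'price'       => '129.90',
    'currency'    => 'USD',
];

// Downstream wants JSON, not XML: wrap the record in a small message
// envelope and encode it for a queue.
$message = json_encode(
    ['type' => 'offer.updated', 'payload' => $record],
    JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES
);

echo $message, PHP_EOL;
```

The point is that the XML shape disappears entirely at this boundary; only plain data crosses it.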
A representative feed example
Here is a simplified feed structure that is close to what many
supplier and marketplace pipelines deal with:
<?xml version="1.0" encoding="UTF-8"?>
<catalog generated_at="2026-04-01T08:00:00Z" region="eu">
    <offer id="1001" available="true">
        <sku>KB-1001</sku>
        <name>Mechanical Keyboard</name>
        <brand>Acme</brand>
        <category>Keyboards</category>
        <price currency="USD">129.90</price>
        <oldprice currency="USD">149.90</oldprice>
        <picture>https://cdn.example.test/kb-1001-front.jpg</picture>
        <picture>https://cdn.example.test/kb-1001-side.jpg</picture>
        <stock>14</stock>
    </offer>
    <offer id="1002" available="false">
        <sku>MS-1002</sku>
        <name>Wireless Mouse</name>
        <brand>Acme</brand>
        <category>Mice</category>
        <price currency="USD">39.90</price>
        <picture>https://cdn.example.test/ms-1002.jpg</picture>
        <stock>0</stock>
    </offer>
</catalog>
This is already enough to illustrate the real shape of the work.
The import pipeline usually does not want to keep this XML structure
around.
It wants to turn each <offer> into something like:
[
    'external_id' => '1001',
    'sku' => 'KB-1001',
    'name' => 'Mechanical Keyboard',
    'brand' => 'Acme',
    'category' => 'Keyboards',
    'price' => '129.90',
    'currency' => 'USD',
    'old_price' => '149.90',
    'available' => true,
    'stock' => 14,
    'pictures' => [
        'https://cdn.example.test/kb-1001-front.jpg',
        'https://cdn.example.test/kb-1001-side.jpg',
    ],
]
That is the internal target.
Once you are clear about that, the XML side becomes much easier to
reason about.
The two stages that matter most
For feed pipelines, I think it helps to split the work into two
explicit stages.
Stage 1: extraction
This is where you identify the repeated record you care about and
convert it into a predictable PHP structure.
Stage 2: normalization
This is where you adapt that structure to your own application model:
- rename fields;
- cast values;
- collapse optional fields;
- map categories;
- validate currency or stock rules;
- prepare records for persistence or messaging.
Trying to collapse these two stages into one giant parsing function
usually makes the code harder to maintain.
Why streaming is such a good fit for feeds
Supplier and marketplace feeds are one of the best use cases for
streaming XML traversal.
The reasons are practical:
- files can become large over time;
- records are naturally repeated;
- each record can often be processed independently;
- you usually do not need the whole document tree;
- early filtering is valuable.
This is exactly where XMLReader and extraction-first libraries
built on top of it become useful.
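For reference, the raw pattern those libraries build on can be sketched with nothing but core extensions: XMLReader hops from record to record, and SimpleXML materializes only the current small fragment. This is a minimal sketch, not the library's internals:

```php
<?php

// Minimal streaming pattern: only one <offer> subtree is ever
// turned into a tree in memory at a time.
$path = tempnam(sys_get_temp_dir(), 'feed-');
file_put_contents($path, <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
    <offer id="1001"><sku>KB-1001</sku></offer>
    <offer id="1002"><sku>MS-1002</sku></offer>
</catalog>
XML);

$reader = new XMLReader();
$reader->open($path);

$skus = [];

// Skip forward to the first <offer> element.
while ($reader->read() && $reader->name !== 'offer');

// Hop from one <offer> sibling to the next, skipping each subtree
// after handing it off to SimpleXML.
while ($reader->name === 'offer') {
    $offer = simplexml_load_string($reader->readOuterXml());
    $skus[] = (string) $offer->sku;
    $reader->next('offer');
}

$reader->close();
unlink($path);

print_r($skus); // KB-1001, MS-1002
```

Extraction-first libraries wrap this loop so that the filter and the record conversion are the only parts you still write yourself.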
With XmlExtractKit, I usually approach these feeds as “find
repeated offers and turn them into arrays.”
Here is a streaming extraction example using
FastXmlParser::extractPrettyPrint():
use SbWereWolf\XmlNavigator\Parsing\FastXmlParser;

require_once __DIR__ . '/vendor/autoload.php';

// Write a sample feed to a temporary file; note the <service>
// element that the extraction filter will skip.
$uri = tempnam(sys_get_temp_dir(), 'supplier-feed-');
file_put_contents($uri, <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<catalog generated_at="2026-04-01T08:00:00Z" region="eu">
    <offer id="1001" available="true">
        <sku>KB-1001</sku>
        <name>Mechanical Keyboard</name>
        <brand>Acme</brand>
        <category>Keyboards</category>
        <price currency="USD">129.90</price>
        <picture>https://cdn.example.test/kb-1001-front.jpg</picture>
        <picture>https://cdn.example.test/kb-1001-side.jpg</picture>
        <stock>14</stock>
    </offer>
    <service id="svc-1">
        <name>Extended Warranty</name>
    </service>
    <offer id="1002" available="false">
        <sku>MS-1002</sku>
        <name>Wireless Mouse</name>
        <brand>Acme</brand>
        <category>Mice</category>
        <price currency="USD">39.90</price>
        <picture>https://cdn.example.test/ms-1002.jpg</picture>
        <stock>0</stock>
    </offer>
</catalog>
XML);

$reader = XMLReader::open($uri);
if ($reader === false) {
    throw new RuntimeException('Cannot open XML feed.');
}

// Stream the feed and extract only <offer> elements.
$offers = FastXmlParser::extractPrettyPrint(
    $reader,
    static fn (XMLReader $cursor): bool =>
        $cursor->nodeType === XMLReader::ELEMENT
        && $cursor->name === 'offer'
);

foreach ($offers as $offer) {
    echo json_encode(
        $offer,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES
    ) . PHP_EOL;
}

$reader->close();
unlink($uri);
That extraction result is already close to what the rest of the
application needs.
It is still XML-derived data, but it is no longer trapped in XML
traversal logic.
Why readable arrays help so much in feed work
In feed-processing pipelines, readability is not cosmetic.
It directly affects how quickly you can:
- inspect bad records;
- log partial failures;
- test normalization rules;
- compare incoming and outgoing payloads;
- reason about optional fields;
- support multiple partner formats.
That is why array output is so practical.
For example, one extracted <offer> might look like this after
extractPrettyPrint():
[
    'offer' => [
        '@attributes' => [
            'id' => '1001',
            'available' => 'true',
        ],
        'sku' => 'KB-1001',
        'name' => 'Mechanical Keyboard',
        'brand' => 'Acme',
        'category' => 'Keyboards',
        'price' => [
            '@value' => '129.90',
            '@attributes' => [
                'currency' => 'USD',
            ],
        ],
        'picture' => [
            'https://cdn.example.test/kb-1001-front.jpg',
            'https://cdn.example.test/kb-1001-side.jpg',
        ],
        'stock' => '14',
    ],
]
That is a much better input for normalization code than a
half-processed XML cursor state.
The normalization step is where your business rules belong
Once the feed record is in array form, you can normalize it with
ordinary PHP code.
For example:
/**
 * @param array<string, mixed> $record
 * @return array<string, mixed>
 */
function normalizeOffer(array $record): array
{
    $offer = $record['offer'];

    // A single <picture> arrives as a string, repeated ones as a list.
    $pictures = $offer['picture'] ?? [];
    if (! is_array($pictures)) {
        $pictures = [$pictures];
    }

    return [
        'external_id' => $offer['@attributes']['id'] ?? null,
        'available' => ($offer['@attributes']['available'] ?? 'false') === 'true',
        'sku' => $offer['sku'] ?? null,
        'name' => $offer['name'] ?? null,
        'brand' => $offer['brand'] ?? null,
        'category' => $offer['category'] ?? null,
        'price' => $offer['price']['@value'] ?? null,
        'currency' => $offer['price']['@attributes']['currency'] ?? null,
        'stock' => isset($offer['stock']) ? (int) $offer['stock'] : null,
        'pictures' => array_values(array_filter($pictures, 'is_string')),
    ];
}
This is where business logic belongs.
Not in low-level XML traversal. Not in cursor movement. Not in string
fragments.
A clean import architecture keeps those concerns separate.
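Validation rules sit naturally in the same layer. Here is a sketch of a guard that runs after normalization; the function name and the rules are illustrative assumptions, not part of the library:

```php
<?php

// Illustrative post-normalization guard: reject records that are
// missing required fields or carry impossible values.
function isImportable(array $offer): bool
{
    return is_string($offer['sku'] ?? null)
        && $offer['sku'] !== ''
        && is_string($offer['currency'] ?? null)
        && ($offer['stock'] ?? 0) >= 0;
}

$good = ['sku' => 'KB-1001', 'currency' => 'USD', 'stock' => 14];
$bad  = ['sku' => '', 'currency' => 'USD', 'stock' => 14];

var_dump(isImportable($good)); // bool(true)
var_dump(isImportable($bad));  // bool(false)
```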
Repeated tags, attributes, and optional fields are not edge cases
In feed processing, these are normal conditions:
- multiple images;
- optional old price;
- empty stock fields;
- attributes that carry business meaning;
- tags that are present for some suppliers and absent for others.
That is another reason I prefer extraction to arrays early.
Once the record is in a stable PHP structure, handling these cases
becomes straightforward.
You can:
- default missing fields;
- cast types;
- merge repeated tags into lists;
- strip noise;
- build validation rules around familiar array shapes.
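The repeated-tag case in particular collapses into a one-line helper once the record is an array. A sketch, with an assumed helper name:

```php
<?php

// A single tag arrives as a scalar, repeated tags as a list, and a
// missing tag as nothing at all; this helper makes all three uniform.
function asList(mixed $value): array
{
    if ($value === null) {
        return [];
    }

    return is_array($value) ? array_values($value) : [$value];
}

var_export(asList('only-one.jpg'));    // one tag → list of one
echo PHP_EOL;
var_export(asList(['a.jpg', 'b.jpg'])); // repeated tags → kept as-is
echo PHP_EOL;
var_export(asList(null));               // missing tag → empty list
```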
One feed is manageable. Ten feeds expose architecture problems
A lot of parsing approaches look acceptable when there is only one
partner.
The trouble begins when the system grows:
- supplier A sends <offer>;
- supplier B sends <item>;
- marketplace C adds nested media blocks;
- another feed uses attributes where the previous one used child elements;
- one integration sends a full nightly catalog;
- another sends partial incremental updates.
At that point, the quality of your processing model matters much more
than the convenience of a single parser call.
The goal is not just “parse this file.”
The goal is to build a repeatable pattern:
- extract repeated records;
- convert them into stable PHP structures;
- normalize them into your domain shape;
- pass them downstream.
That pattern scales much better than spreading XML handling rules
throughout the codebase.
A useful split for real projects
For supplier and marketplace XML feeds, I think the cleanest split is
this:
Integration edge
- read the XML stream;
- extract only target records;
- convert them into arrays.
Normalization layer
- cast and validate fields;
- reconcile naming differences;
- apply partner-specific mapping rules;
- create consistent internal records.
Application layer
- persist catalog data;
- emit events;
- update search indexes;
- enqueue downstream jobs.
This keeps XML where it belongs: at the edge.
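Wired together, the three layers can be sketched with plain callables. All names here are illustrative; in a real project each layer would be its own class or service:

```php
<?php

// Integration edge: yields extracted record arrays (stubbed with a
// generator instead of a real XML stream).
$extract = static function (): Generator {
    yield ['offer' => ['sku' => 'KB-1001', 'stock' => '14']];
    yield ['offer' => ['sku' => 'MS-1002', 'stock' => '0']];
};

// Normalization layer: partner-specific mapping and casting.
$normalize = static fn (array $record): array => [
    'sku'   => $record['offer']['sku'],
    'stock' => (int) $record['offer']['stock'],
];

// Application layer: persistence stand-in that collects the rows.
$rows = [];
$persist = static function (array $row) use (&$rows): void {
    $rows[] = $row;
};

foreach ($extract() as $record) {
    $persist($normalize($record));
}

var_export($rows);
```

Swapping in a second supplier then means replacing only the edge and the mapping, never the application layer.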
When a full-document approach is still fine
Not every feed needs streaming.
If the XML is small and the structure is simple, a full-document
approach may be completely acceptable.
But supplier and marketplace integrations tend to drift in one
direction over time:
- more records;
- more nested data;
- more optional fields;
- more partner variants;
- more operational pressure.
That is why an extraction-first model is often the safer default.
It is not about premature optimization.
It is about choosing a processing pattern that continues to work when
the feed stops being toy-sized.
Conclusion
Supplier and marketplace XML feeds are rarely difficult because XML
is mysterious.
They are difficult because they combine repetition, size, optional
structure, external control, and business-specific normalization
rules.
That is why I think the most productive way to handle them in PHP is:
- stream the feed when needed;
- extract repeated records instead of loading everything;
- convert XML into plain arrays early;
- keep normalization and business rules outside low-level XML traversal.
That is the workflow I wanted from XmlExtractKit.
Not a giant XML abstraction layer. Not an attempt to make XML
pleasant.
Just a practical path from external XML feeds to application-ready
PHP data.
The library is available via Composer:

composer require sbwerewolf/xml-navigator