Peter Y

How We Generate 100+ Product Feeds From 300k SKUs Without Hitting the Database

Generating product feeds (Google Shopping, Facebook, marketplaces) is a boring problem until your catalog has 300,000 SKUs. Then it becomes a nightmare.

The naive approach — load each product from PrestaShop, compute its price, check availability, format the output — hits the database with ~80 queries per product. Multiply that by 300k products, add network latency to a clustered database, and you're looking at hours of generation time. Per feed. And we have over a hundred feeds: 10 different feed types × 4 languages × 3 shops.

We tried the "proper" engineering approach first. It failed. Then I built something dumb and fast that actually works.

Why Feeds Are Hard in PrestaShop

You can't just dump product data with SQL queries. I mean, you technically can, but you'll regret it.

PrestaShop computes a lot of things at runtime. Product price depends on specific price rules, group discounts, cart rules, tax rules, country settings, and a dozen admin toggles. Availability depends on stock management mode, pack stock type, combination stock, and warehouse config. Even something simple like "product name" goes through language layers and shop context.

Reproducing all of that in raw SQL means reverse-engineering the entire PrestaShop business logic layer. Someone might pull it off, but the result will be fragile, it won't respect admin settings, and it'll break on every PrestaShop update.

So you're stuck loading products through PrestaShop's own objects. And that means PHP, and that means queries. Lots of them.
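
To give a feel for the fan-out, here's roughly what loading a single product through the core classes looks like (an illustrative sketch, not our code; the product ID is made up and exact method signatures vary a bit between PrestaShop versions):

```php
// Rough illustration: one product loaded through PrestaShop's core classes.
// Each call below fans out into multiple SQL queries behind the scenes.

$idProduct = 12345; // made-up ID, purely for illustration
$idLang    = (int) Context::getContext()->language->id;

// The constructor alone hydrates the product and its language fields.
$product = new Product($idProduct, true, $idLang);

// Price goes through specific prices, group discounts, tax rules...
$priceTaxIncl = Product::getPriceStatic($idProduct, true);

// Stock depends on stock management mode, combinations, warehouses...
$quantity = StockAvailable::getQuantityAvailableByProduct($idProduct);

// Categories and images are yet more queries each.
$categories = Product::getProductCategories($idProduct);
$cover      = Image::getCover($idProduct);
```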

The "Proper" Solution That Failed

We hired a specialist to build a proper feed generation pipeline. His architecture:

  • Message broker for events
  • Enrichment service on Symfony
  • Separate PrestaShop instance for data hydration
  • Event-driven pipeline with object serialization

Five months later: nothing shipped. Not one working feed. The architecture was theoretically sound but practically impossible to finish. We let him go.

The Insight

Here's what I realized. We already have a process that loads every product through PrestaShop's full business logic: our product update cron.

In our setup, product changes arrive from an external source via a Redis queue. A cron job runs every 30 seconds, picks up SKUs that changed, and performs a full product update in PrestaShop. After the update, all the expensive stuff is already computed — prices, stock, categories, attributes, descriptions, tax rules. It's all sitting right there in PHP memory.

What if, at that exact moment, we just... grabbed it?

And it gets better. At that same point in the code we're already updating the Elasticsearch search index for the storefront — because the data is the same. So we're already building a product object for search. Adding a second write for feed data is almost free.

The Solution: Capture Once, Feed Forever

During the product update cycle, after all business logic has executed, we collect every field that any feed might ever need into a single associative array. Price, stock, description, categories, attributes, images, EAN, weight, shipping — everything.

Then we:

  1. json_encode it
  2. gzencode it (compress)
  3. base64_encode it (for safe storage)
  4. Write it to Elasticsearch as one document per SKU

The Elasticsearch index is dead simple: SKU as the ID, compressed payload, a timestamp, and an "indexed" flag. One document per product. ~4KB per document compressed. The whole index for 300k products is about 4GB.
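
In code, the capture step is only a few lines. A minimal sketch, assuming an already-configured elasticsearch-php client; the index name and field names here are illustrative, not our exact ones:

```php
// Minimal sketch of the capture step, run inside the product update cron
// after PrestaShop has finished computing prices, stock, attributes, etc.
// Assumptions: $esClient is a configured elasticsearch-php client;
// 'product_feed_source' and the field names are placeholders.

function captureProductSnapshot(object $esClient, string $sku, array $productData): void
{
    $json    = json_encode($productData, JSON_UNESCAPED_UNICODE); // 1. serialize
    $gzipped = gzencode($json, 6);                                // 2. compress
    $payload = base64_encode($gzipped);                           // 3. safe string for storage

    $esClient->index([                                            // 4. one document per SKU
        'index' => 'product_feed_source',
        'id'    => $sku,
        'body'  => [
            'payload'    => $payload,
            'updated_at' => date('c'),
            'indexed'    => true,
        ],
    ]);
}
```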

No extra database queries. No separate pipeline. The data piggybacks on a process that was already running.

Feed Generation: Just Scroll and Write

When it's time to generate feeds, the process is trivial:

  1. Open an Elasticsearch scroll query over the entire index
  2. For each document: decompress → json_decode → you have all product data
  3. Write to whatever format the feed needs (CSV, XML, JSON)

The key trick: we write all languages and all shops in a single pass. One scroll through 300k documents, and we're writing to all output files simultaneously. No need to iterate the catalog multiple times.
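
A minimal sketch of that loop, under the same assumptions (an elasticsearch-php client, plus a $writers map standing in for whatever CSV/XML/JSON writers the feeds need):

```php
// Sketch of the generation loop: one scroll over the index, every output
// file written in the same pass. Assumptions: $esClient is an
// elasticsearch-php client; $writers is a map of already-open feed writers
// (one per feed type / language / shop), each exposing a write() method.

$response = $esClient->search([
    'index'  => 'product_feed_source',
    'scroll' => '5m',
    'size'   => 1000,
    'body'   => ['query' => ['match_all' => new \stdClass()]],
]);

while (!empty($response['hits']['hits'])) {
    foreach ($response['hits']['hits'] as $hit) {
        // Reverse the capture step: base64 -> gunzip -> json_decode.
        $product = json_decode(
            gzdecode(base64_decode($hit['_source']['payload'])),
            true
        );

        // Write this product to every feed at once.
        foreach ($writers as $writer) {
            $writer->write($product);
        }
    }

    // Fetch the next batch.
    $response = $esClient->scroll([
        'scroll_id' => $response['_scroll_id'],
        'scroll'    => '5m',
    ]);
}
```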

Numbers

  • 216,000 active SKUs
  • One feed (all languages, all shops): 6 minutes
  • All feeds combined: ~35 minutes
  • Write speed: ~500 SKUs/second
  • Database load during feed generation: zero

For comparison: the naive approach (load each product via PrestaShop objects) would need ~80 queries per SKU. That's 24 million queries for one feed. On a clustered database with network latency — we're talking hours per feed.

Adding New Feeds Takes Minutes

This is the part I'm most happy with.

The compressed JSON in Elasticsearch contains a superset of all fields any feed could need. When a manager says "we need a new feed for marketplace X," I don't build a new data pipeline. I just:

  1. Look at what fields the marketplace requires
  2. Check if they're in the universal object (they usually are)
  3. Write a simple transformer: read field A, format it as column B
  4. Add it to the feed generation cron

A new feed type takes maybe an hour. Most of that is reading the marketplace's spec.

If a required field isn't in the universal object yet, I add it in the product update step, wait for one full update cycle, and it's available everywhere.
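
To make the transformer idea concrete, here's an illustrative sketch for a hypothetical marketplace; the snapshot fields and output columns are made up, not a real spec:

```php
// Illustration only: a transformer for a hypothetical marketplace feed.
// Both the universal-snapshot field names and the output columns are
// placeholders; the real ones come from the marketplace's spec.

function toMarketplaceRow(array $product, string $lang): array
{
    return [
        'id'           => $product['sku'],
        'title'        => $product['name'][$lang] ?? '',
        'price'        => number_format($product['price_tax_incl'], 2, '.', ''),
        'availability' => ($product['quantity'] ?? 0) > 0 ? 'in stock' : 'out of stock',
        'gtin'         => $product['ean13'] ?? '',
        'image_link'   => $product['cover_image_url'] ?? '',
    ];
}
```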

Why This Works (And What It's Actually Called)

After building this, I looked up whether the approach has a name. Turns out it's a combination of two well-known patterns — I just didn't know that when I built it.

Materialized View — a precomputed, stored result of a complex query, optimized for reading. That's exactly what our compressed JSON in Elasticsearch is: a materialized view of PrestaShop's business logic output. The classic version lives in a database (PostgreSQL has them built-in, for example). Ours lives in Elasticsearch because the "query" isn't SQL — it's the result of PHP-level computations that can't be expressed in SQL at all.

Event-Carried State Transfer — a pattern from event-driven architecture where instead of telling consumers "something changed, go fetch the data yourself," you send the full state along with the event. That's exactly what we do: when a product updates, we don't just flag it for later processing. We capture the complete product snapshot right there and store it. Feed generators never need to go back to the source.

The twist is that both patterns are usually discussed in the context of microservices and distributed systems. Nobody talks about applying them inside a PHP monolith to solve a feed generation problem. But that's what works.

The insight isn't architectural theory. It's practical: don't go get data when you can grab it while it's already in your hands.

Trade-offs

It's not perfect.

Feed data is only as fresh as the last product update cycle. If a product updates at 10:00 and feeds generate at 10:30, there's a 30-minute gap. For our B2B use case this is fine. For a flash-sale store it might not be.

The universal object can get bloated. Ours has maybe 40-50 fields per product. Not terrible, but it needs occasional cleanup.

You need the product update pipeline to begin with. If your products are edited manually in the PrestaShop admin, this approach doesn't apply directly; you'd need to hook into PrestaShop's save events instead.

TL;DR

  • Don't generate feeds by loading products from the database. At scale, it's impossibly slow.
  • Don't build a separate data pipeline. It's months of work and probably won't ship.
  • Capture product data during your existing update process, when all business logic has already executed.
  • Store it compressed in Elasticsearch. One document per SKU, ~4KB each.
  • Generate feeds by scrolling Elasticsearch and writing files. Zero database load, 500 SKUs/second.
  • New feed types take an hour to add, not weeks.

Sometimes the best architecture is no architecture. Just write stuff down when you already have it.


Tags: prestashop, ecommerce, elasticsearch, performance
