Fırat Özgür Özcan

Posted on • Originally published at haftalikaktuel.com

Transforming Unstructured Retail Catalogs into Structured Data using AI

Comparing products on e-commerce platforms is straightforward because the data is already structured. However, when dealing with weekly promotional catalogs published by traditional retail chains, you are often left with giant JPEG images containing hundreds of products, prices, and scattered text.

In this post, I will share the high-level architecture behind Haftalikaktuel, a platform we built to ingest these unstructured catalog images, parse them using AI, and turn them into a fully structured, searchable comparison engine.

🏗️ High-Level Architecture

The system is decoupled into three main operational stacks:

  1. Frontend (Public Web): Built with Next.js (App Router) leveraging React Server Components.
  2. Data Extraction Pipeline: A Python-based async processing pipeline handling orchestration and extraction.
  3. Data & Search Layer: A combination of document databases, vector search engines, and object storage for optimized assets.

The core challenge lies in the data pipeline: moving beyond traditional web scraping.

🧠 From Vision to Data: Object Detection meets Multimodal LLMs

Extracting product data from a complex catalog page using traditional OCR is nearly impossible. Text layouts are chaotic, and price tags can be positioned anywhere relative to the product image.

We solved this using a two-stage machine learning pipeline:

1. Object Detection

Before reading any text, we run the raw catalog pages through a custom object detection model (based on YOLO architecture). This model is trained to identify the bounding boxes of individual product regions, allowing us to crop the giant page into smaller, isolated product images.
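The detect-then-crop step can be sketched roughly as below. The `ultralytics` package, the weights filename, the confidence threshold, and the padding are illustrative assumptions, not the production setup; only the box-clamping helper is pure logic.

```python
def clamp_box(box, width, height, pad=4):
    """Clamp a float (x1, y1, x2, y2) box to the page bounds, with padding
    so tight detections don't shave off the edge of a price tag."""
    x1, y1, x2, y2 = box
    return (
        max(0, int(x1) - pad),
        max(0, int(y1) - pad),
        min(width, int(x2) + pad),
        min(height, int(y2) + pad),
    )


def crop_products(page_path, weights="products.pt", min_conf=0.5):
    # Hypothetical use of a custom-trained Ultralytics YOLO model; needs the
    # `ultralytics` and `Pillow` packages plus a weights file, so imports are local.
    from ultralytics import YOLO
    from PIL import Image

    page = Image.open(page_path)
    result = YOLO(weights)(page_path)[0]
    crops = []
    for box, conf in zip(result.boxes.xyxy.tolist(), result.boxes.conf.tolist()):
        if conf >= min_conf:  # drop low-confidence detections
            crops.append(page.crop(clamp_box(box, page.width, page.height)))
    return crops
```

Each returned crop then flows into the extraction stage independently, so a single misdetected region never poisons the rest of the page.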

2. Semantic Extraction via Vision LLMs

Each cropped product image is then sent to a multimodal LLM (we primarily utilize Google Gemini models). Instead of just extracting raw text, we prompt the model to return a structured JSON object containing:

  • Product Name & Brand
  • Exact Price
  • Key Attributes (weight, color, etc.)
  • A Normalized Entity Category

This structured payload is then persisted into our database, while the cropped images are optimized and sent to our object storage.

⚡ Frontend & Performance Optimization

Once the data is structured, it needs to be served fast. We use Next.js with a heavy reliance on Server Components (RSC) and Incremental Static Regeneration (ISR).

A major performance bottleneck we encountered was on-the-fly image optimization. Processing thousands of product images per day overloaded the default frontend image optimizer. We solved this by shifting the optimization logic to the backend pipeline. Images are pre-processed into multiple sizes and WebP formats before they hit the storage bucket, allowing the Next.js frontend to simply map to the correct static URL without consuming compute resources.
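The pre-rendering step can be sketched as follows. The breakpoint widths, filename scheme, and `render_variants` helper are illustrative assumptions; the planning function simply guarantees we never upscale a small crop and that every variant lands as WebP.

```python
VARIANT_WIDTHS = (320, 640, 960, 1280)  # illustrative responsive breakpoints


def plan_variants(stem: str, original_width: int) -> list[tuple[str, int]]:
    """Return the (filename, width) pairs to pre-render for one product image.

    Only widths no larger than the source are generated, so small crops are
    never upscaled; every variant is written as WebP.
    """
    widths = [w for w in VARIANT_WIDTHS if w <= original_width] or [original_width]
    return [(f"{stem}_{w}.webp", w) for w in widths]


def render_variants(src_path: str, stem: str):
    # Sketch of the actual resize step using Pillow (optional dependency).
    from PIL import Image

    with Image.open(src_path) as im:
        for name, w in plan_variants(stem, im.width):
            h = round(im.height * w / im.width)  # keep aspect ratio
            im.resize((w, h)).save(name, "WEBP")
```

Because the filenames are deterministic, the frontend can build a `srcset` from the product ID alone, with no runtime optimizer in the path.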

🔍 Search: Hybrid & Vector-Based

To allow users to search through thousands of active and historical products, traditional full-text search wasn't enough. A user searching for "laundry detergent" needs to see "Omo" even if the word "detergent" isn't explicitly in the extracted title.

  • Hybrid Search Engine: We use a dedicated search engine that supports both keyword matching and vector search.
  • Vector Caching: Generating embeddings for every search query adds latency. We implemented an in-memory caching layer that stores the vector embeddings of frequent queries and semantic product recommendations, dropping search latency to milliseconds.

🛠️ Key Challenges

  1. LLM Latency: Relying on external AI APIs creates variable pipeline latency. We mitigated this by fully decoupling the extraction process into background worker queues.
  2. AI Hallucinations: Vision models sometimes misread prices. To prevent bad data from reaching the public site, we built an internal backoffice with threshold rules and a manual review queue for low-confidence extractions.
  3. Entity Normalization: Different retailers call the same product by different names. We use semantic similarity to group these disparate products under single "SEO Entities" before they hit the database.
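The grouping step for entity normalization can be sketched as a greedy pass over embedding vectors; the similarity threshold and the use of the first member as the group representative are simplifying assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_entities(items, threshold=0.9):
    """Greedily assign each (name, vector) pair to the first existing group
    whose representative vector is similar enough; otherwise open a new group."""
    groups = []  # list of (representative_vector, [names])
    for name, vec in items:
        for rep, names in groups:
            if cosine(rep, vec) >= threshold:
                names.append(name)
                break
        else:
            groups.append((vec, [name]))
    return [names for _, names in groups]
```

Each resulting group then maps to a single "SEO Entity" page, regardless of how each retailer spelled the product name.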

Conclusion

Combining computer vision and modern LLMs with traditional web stacks transforms what used to be a highly manual, multi-day data entry job into a completely autonomous system.

If you are dealing with unstructured data, utilizing multimodal LLMs as parsing tools rather than just text generators is a game changer. You can check out the live result at haftalikaktuel.com. If you have questions about the architecture, let's discuss them in the comments!
