Linghua Jin

Posted on Dec 12, 2025

From Catalog Chaos to Real-Time Recommendations: Building a Product Graph with LLMs and Neo4j

#ai #python #machinelearning #neo4j

Stop Building Dumb Recommendation Engines (Here's How to Build a Smart One)

Most product recommendation systems I've seen are basically fancy keyword matchers. They work okay when you have millions of clicks to analyze, but they completely fall apart when:

You launch a new product with zero interaction data 📉
Your catalog is a mess of inconsistent tags and descriptions 🤦
You want to explain WHY you're recommending something (not just show a black-box score)

I just built a real-time recommendation engine that actually understands products using LLMs and graph databases. Here's the surprising part: the core logic is only ~100 lines of Python.

The Secret Sauce: Product Taxonomy + Knowledge Graphs

Instead of relying on user behavior alone, we're teaching an LLM to understand:

What a product actually is (fine-grained taxonomy like "gel pen" not "office supplies")
What people buy together (complementary products like "gel pen" → "notebook", "pen holder")

Then we throw all of this into a Neo4j graph database where relationships become first-class citizens. Now you can query things like "show me all products whose taxonomy is complementary to this gel pen."

Real-World Example: The Gel Pen Problem

Say someone's browsing a gel pen on your e-commerce site. A traditional recommender might show:

Other gel pens (same category)
Popular items (based on sales)
Random "customers also bought" (if you have enough data)

With our approach, the LLM analyzes the product description and extracts:

Primary taxonomy: gel pen, writing instrument
Complementary taxonomy: notebook, pencil case, desk organizer

Now your graph knows these relationships. When someone views the gel pen, you can traverse the graph to find notebooks, pencil cases, and desk organizers, each with an explainable connection back to the pen.
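To make the traversal concrete, here's a minimal in-memory sketch of the same lookup, no Neo4j required. The `CATALOG` dictionary, the product ids, and the `recommend` helper are all made up for illustration; the real pipeline stores these links as graph relationships.

```python
from collections import defaultdict

# Hypothetical catalog: product id -> (own taxonomies, complementary taxonomies).
CATALOG = {
    "gel-pen-01": (
        {"gel pen", "writing instrument"},
        {"notebook", "pencil case", "desk organizer"},
    ),
    "notebook-01": ({"notebook"}, {"gel pen", "bookmark"}),
    "organizer-01": ({"desk organizer"}, {"gel pen", "sticky notes"}),
    "gel-pen-02": ({"gel pen"}, {"notebook"}),
}


def recommend(product_id: str) -> dict[str, list[str]]:
    """Return {recommended product id: taxonomies that explain the match}."""
    _, wanted = CATALOG[product_id]
    reasons: dict[str, set[str]] = defaultdict(set)
    for other_id, (taxonomies, _) in CATALOG.items():
        if other_id == product_id:
            continue
        # A product qualifies when one of its own taxonomies is listed as
        # complementary to the product being viewed.
        for taxonomy in taxonomies & wanted:
            reasons[other_id].add(taxonomy)
    return {pid: sorted(names) for pid, names in reasons.items()}


print(recommend("gel-pen-01"))
```

The second gel pen is not recommended, because "gel pen" is not in the first pen's complementary list. The returned taxonomy names are the "why" behind each recommendation.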

The Architecture (Simplified)

Here's the 30,000-foot view:

Product JSONs → CocoIndex Pipeline → LLM Extraction → Neo4j Graph

1. Ingest Products as a Stream

We watch a folder of product JSON files with auto-refresh:

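import datetime
import cocoindex

# Inside the flow definition: watch the "products" folder and re-scan it every 5 seconds.
data_scope["products"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(
        path="products",
        included_patterns=["*.json"],
    ),
    refresh_interval=datetime.timedelta(seconds=5),
)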

The source is re-scanned every 5 seconds, so a changed product file triggers a pipeline update on the next refresh. No manual rebuilds.

2. Clean and Normalize Data

We map raw JSON into a clean structure:

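from jinja2 import Template  # assumed: .render(**kwargs) matches Jinja2's Template API

@cocoindex.op.function(behavior_version=2)
def extract_product_info(product: cocoindex.typing.Json, filename: str) -> ProductInfo:
    # ProductInfo and PRODUCT_TEMPLATE are defined elsewhere in the example.
    return ProductInfo(
        id=filename.removesuffix(".json"),  # file name without extension is the product id
        url=product["source"],
        title=product["title"],
        # "$1,299.00" -> 1299.0
        price=float(product["price"].lstrip("$").replace(",", "")),
        detail=Template(PRODUCT_TEMPLATE).render(**product),
    )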

The detail field becomes a markdown "product sheet" that we feed to the LLM.
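The price and id handling are the parts most likely to trip on messy catalog data, so here they are pulled out into standalone helpers. This is only a sketch: `parse_price` and `product_id_from_filename` are hypothetical names, and the whitespace trimming is my addition rather than something the original function does.

```python
def parse_price(raw: str) -> float:
    """Convert a display price such as "$1,299.00" into a float.

    Mirrors the expression used in extract_product_info, plus whitespace
    trimming (an addition of this sketch, not part of the original).
    """
    cleaned = raw.strip().lstrip("$").replace(",", "")
    return float(cleaned)  # raises ValueError on inputs like "N/A"


def product_id_from_filename(filename: str) -> str:
    """Derive the product id by dropping a trailing ".json" (Python 3.9+)."""
    return filename.removesuffix(".json")


print(parse_price("$1,299.00"))
print(product_id_from_filename("12345.json"))
```

If your catalog has prices in other currencies or missing prices, this is where you would decide whether to skip the product or fail loudly.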

3. Let the LLM Do the Heavy Lifting

Instead of prompt strings, we define the taxonomy contract as dataclasses:

import dataclasses

@dataclasses.dataclass
class ProductTaxonomy:
    """
    A concise noun or short phrase based on core functionality.
    Use lowercase, avoid brands/styles.
    Be specific: "pen" not "office supplies".
    """
    name: str

@dataclasses.dataclass
class ProductTaxonomyInfo:
    taxonomies: list[ProductTaxonomy]
    complementary_taxonomies: list[ProductTaxonomy]

Then we call the LLM:

taxonomy = data["detail"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI,
            model="gpt-4.1"
        ),
        output_type=ProductTaxonomyInfo
    )
)

The LLM reads the markdown description and returns structured JSON matching our schema. No parsing nightmares.
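To show the shape of data that schema describes, here's a small sketch that loads a JSON payload into the same dataclasses. The sample payload is hand-written for illustration, not real model output, and `from_json` is a hypothetical helper; in the actual pipeline the extraction step hands you the structured result directly.

```python
import dataclasses
import json


@dataclasses.dataclass
class ProductTaxonomy:
    name: str


@dataclasses.dataclass
class ProductTaxonomyInfo:
    taxonomies: list[ProductTaxonomy]
    complementary_taxonomies: list[ProductTaxonomy]


def from_json(payload: str) -> ProductTaxonomyInfo:
    """Build the dataclasses from a JSON string shaped like the schema."""
    raw = json.loads(payload)
    return ProductTaxonomyInfo(
        taxonomies=[ProductTaxonomy(**t) for t in raw["taxonomies"]],
        complementary_taxonomies=[
            ProductTaxonomy(**t) for t in raw["complementary_taxonomies"]
        ],
    )


# Hand-written sample, not actual model output.
SAMPLE = """
{
  "taxonomies": [{"name": "gel pen"}, {"name": "writing instrument"}],
  "complementary_taxonomies": [{"name": "notebook"}, {"name": "pencil case"}]
}
"""

info = from_json(SAMPLE)
print([t.name for t in info.taxonomies])
```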

4. Build the Knowledge Graph in Neo4j

We export three things:

Product nodes: id, title, price, url
Taxonomy nodes: unique labels like "gel pen", "notebook"
Relationships: PRODUCT_TAXONOMY and PRODUCT_COMPLEMENTARY_TAXONOMY

product_node.export(
    "product_node",
    cocoindex.storages.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.storages.Nodes(label="Product")
    ),
    primary_key_fields=["id"],
)

Nodes are keyed by primary key on export, so the same key always maps to the same node (a plain Neo4j CREATE would not deduplicate on its own). If five products all mention "notebook" as a complementary taxonomy, they all link to the same Taxonomy node.
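The deduplication described above can be pictured as an upsert keyed by label and primary key. The sketch below models it with plain dictionaries; `upsert_node` and `link` are hypothetical helpers for illustration, not CocoIndex or Neo4j APIs.

```python
# (label, primary key) -> properties
nodes: dict[tuple[str, str], dict] = {}
# (product id, relationship type, taxonomy name)
edges: set[tuple[str, str, str]] = set()


def upsert_node(label: str, key: str, **props) -> None:
    """Create the node if missing, otherwise update its properties in place."""
    nodes.setdefault((label, key), {}).update(props)


def link(product_id: str, rel: str, taxonomy: str) -> None:
    """Ensure the taxonomy node exists, then record the relationship."""
    upsert_node("Taxonomy", taxonomy)
    edges.add((product_id, rel, taxonomy))


# Five products that all name "notebook" as a complementary taxonomy.
for i in range(5):
    pid = f"product-{i}"
    upsert_node("Product", pid, title=f"Product {i}")
    link(pid, "PRODUCT_COMPLEMENTARY_TAXONOMY", "notebook")

taxonomy_nodes = [key for key in nodes if key[0] == "Taxonomy"]
print(len(taxonomy_nodes), len(edges))
```

Five products produce five relationships but only one Taxonomy node, which is what makes the later "find everything connected to notebook" traversal possible.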

Running It Live

Once you've set up Postgres (for CocoIndex's incremental processing) and Neo4j, it's just:

pip install -e .
cocoindex update --setup main

You'll see:

documents: 9 added, 0 removed, 0 updated

Then open Neo4j Browser at http://localhost:7474 and run:

MATCH p=()-->() RETURN p

Boom. Your entire product graph visualized.
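Once the graph is populated, a recommendation lookup could look something like the sketch below, using the official `neo4j` Python driver. Treat the details as assumptions: the Taxonomy property name `value`, the connection URI, the credentials, and the helper name `recommend_from_graph` are placeholders, so check them against your own graph before running it.

```python
# Assumed schema: (:Product {id, title}) and (:Taxonomy {value}).
# The property name "value" is a guess; adjust it to match your graph.
RECOMMEND_QUERY = """
MATCH (p:Product {id: $product_id})
      -[:PRODUCT_COMPLEMENTARY_TAXONOMY]->(t:Taxonomy)
      <-[:PRODUCT_TAXONOMY]-(rec:Product)
WHERE rec.id <> p.id
RETURN rec.id AS id, rec.title AS title, collect(t.value) AS via
"""


def recommend_from_graph(session, product_id: str) -> list[dict]:
    """Run the query on an open neo4j session and return plain dicts."""
    result = session.run(RECOMMEND_QUERY, product_id=product_id)
    return [{"id": r["id"], "title": r["title"], "via": r["via"]} for r in result]


def main() -> None:
    # Imported here so this module loads even without the driver installed.
    from neo4j import GraphDatabase

    # URI and credentials are placeholders.
    driver = GraphDatabase.driver(
        "bolt://localhost:7687", auth=("neo4j", "password")
    )
    with driver, driver.session() as session:
        for row in recommend_from_graph(session, "some-product-id"):
            print(row)

# Call main() yourself once Neo4j is running and the graph has been built.
```

The `via` column carries the taxonomy names that connect the two products, so every recommendation comes with its explanation attached.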

Why This Actually Works

This pattern punches above its weight because:

LLMs are stupid good at text understanding: You offload all the messy natural language interpretation to a model you control with schema and docstrings.
Graphs are made for relationships: You get explainable connections and can run graph algorithms on top (PageRank, community detection, shortest path, etc.).
Incremental updates are free: CocoIndex handles all the plumbing. Add a product file, get an updated graph.

What You Can Build Next

Add brand, material, or use-case taxonomies as separate node types
Plug in clickstream data to weight edges or create FREQUENTLY_BOUGHT_WITH relationships
Swap OpenAI for Ollama (on-prem LLMs) when you need full control
Layer on graph algorithms to find product clusters or detect trending categories
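As a starting point for the clickstream idea, here's a rough sketch of turning order history into pair counts that could become weights on FREQUENTLY_BOUGHT_WITH relationships. The `ORDERS` data and the `co_purchase_counts` helper are hypothetical; how you load real orders and write the weights into Neo4j is up to you.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order data: each inner list is one basket of product ids.
ORDERS = [
    ["gel-pen-01", "notebook-01"],
    ["gel-pen-01", "notebook-01", "organizer-01"],
    ["notebook-01", "organizer-01"],
]


def co_purchase_counts(orders: list[list[str]]) -> Counter:
    """Count how often each unordered product pair appears in the same order."""
    counts: Counter = Counter()
    for basket in orders:
        # sorted(set(...)) drops duplicates and gives each pair a stable order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


weights = co_purchase_counts(ORDERS)
print(weights[("gel-pen-01", "notebook-01")])
```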

Try It Yourself

Full working code is open-source:

👉 CocoIndex Product Recommendation Example

The repo includes:

Complete flow definition
LLM extraction ops
Neo4j mappings
Sample product JSONs

If you're experimenting with LLM-native data pipelines or graph-based recommendations, I'd love to hear what you're building. Drop a comment or tag me!

P.S. If you found this useful, give the CocoIndex repo a star ⭐ — we're constantly adding new examples and features.

P.P.S. You can also explore the pipeline visually with CocoInsight (free beta) — it's like DevTools for your data pipeline, with zero data retention.

DEV Community