Divyanshu Singh
How I built a RAG-powered Anime Recommendation Engine with Python & FastAPI (Open Sourcing the Journey)

MyAnimeList recommendations were broken, so I scraped 108 years of history to fix them.

Standard anime search engines rely on keyword matching. If you search for "Cyberpunk", they look for the tag "Sci-Fi". I wanted to search by "Vibe" (e.g., "Anime that feels like a warm hug" or "Neon-soaked tragedy").

So, I spent the last 2 months building AiMi: A production-grade Hybrid RAG engine.

Here is the full technical breakdown of how I built it, the architectural challenges I faced (handling 8,000+ embeddings on CPU vs GPU), and the code behind the viral "Anime Receipts" generator.

1. The Data: 108 Years of History (1917-2025)

Garbage in, garbage out. Before building the model, I needed a dataset that didn't exist.

I aggregated data from AniDB and MAL to create a unified database of 8,248 anime.
The biggest challenge was Normalization.

  • Ratings: AniDB scores are floats (0-10), while MAL scores are integers. I normalized everything onto a single 0-10 float scale so no precision is lost.
  • Context: Raw synopses aren't enough for RAG. I engineered a canonical_embedding_text field that blends Themes + Character Archetypes + Emotional Tone into a single dense text block (sketch below).
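To make that concrete, here is a minimal sketch of how such a field can be assembled (the field names are illustrative, not the exact schema):

# Sketch: assembling canonical_embedding_text (field names are illustrative)
def build_embedding_text(row: dict) -> str:
    parts = [
        f"Title: {row['title']}",
        f"Themes: {', '.join(row['themes'])}",
        f"Character archetypes: {', '.join(row['archetypes'])}",
        f"Emotional tone: {row['emotional_tone']}",
        f"Synopsis: {row['synopsis']}",
    ]
    # One dense, self-describing block of text per anime, ready to embed
    return " | ".join(parts)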

🎁 Free Resource: I’ve open-sourced a 500-row sample of this cleaned dataset on Kaggle for anyone who wants to test their own models:
Download Sample Dataset


2. The Engine: Hybrid RAG Architecture

Most RAG tutorials are "Hello World" toys. I needed this to run in production.

I settled on a Hybrid Search architecture to balance "Vibe" (Semantic) with "Precision" (Keywords).

A. The Embedding Model

I chose Nomic v1.5 over OpenAI.

  • Why? In my retrieval tests it outperformed other open-source models like Jina, Stella, and Alibaba-NLP for structured retrieval tasks.
  • Cost: It runs locally. No API bills.
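Loading it locally is a few lines with sentence-transformers. A minimal sketch (note that Nomic v1.5 expects task prefixes on queries and documents):

# Loading Nomic v1.5 locally via sentence-transformers (minimal sketch)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Nomic v1.5 distinguishes queries from documents via task prefixes
query_emb = model.encode("search_query: neon-soaked tragedy", normalize_embeddings=True)
doc_text = "search_document: Title: Texhnolyze | Emotional tone: bleak, neon-soaked tragedy"
doc_emb = model.encode(doc_text, normalize_embeddings=True)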

B. The "Keyword Boosting" Layer (BM25-Style Logic)

Vector search is bad at specific nouns. If a user searches for "Anime about a notebook", vectors might give you generic "School Life" anime.
I implemented Python-native boosting logic to force exact-term matches to the top.

Here is the actual search logic from my pipeline:

# Inside robust_rag_pipeline.py

def search(self, query: str, top_k: int = 10):
    # 1. Vector Search (Nomic)
    # FAISS expects a 2D array of shape (1, dim), so we encode a one-element list
    query_emb = self.model.encode([query])
    scores, ids = self.index.search(query_emb, top_k)

    # 2. Keyword Boosting (Safety Net)
    # Extract rare nouns (>4 chars) like "Pancreas" or "Notebook"
    query_words = {w.lower() for w in query.split() if len(w) > 4}

    results = []
    for idx, score in zip(ids[0], scores[0]):  # row 0 = hits for our single query
        anime = self.dataset.iloc[idx]
        anime_text = (anime['Synopsis'] + " " + anime['Title']).lower()

        # Check for exact matches
        matches = sum(1 for q in query_words if q in anime_text)

        # Boost score by 5% per match (max 15%), capped just below 1.0
        boost = min(matches * 0.05, 0.15)
        final_score = min(score + boost, 0.9999)
        results.append((int(idx), float(final_score)))

    # Re-rank by boosted score
    return sorted(results, key=lambda r: r[1], reverse=True)

3. Solving the "Hardware-Aware" Problem

I wanted this to run on a MacBook Air (CPU) and a Gaming Rig (NVIDIA GPU) without changing code.

  • On GPU: The system loads a local LLM (Qwen-2.5-1.5B) to enable HyDE (Hypothetical Document Embeddings). It "translates" queries like "No fanservice" into "Wholesome, family friendly" before searching.
  • On CPU: It gracefully degrades to "Lightweight Mode" (Nomic + Boosting only) to prevent timeouts.

This "Self-Healing" initialization was critical for deployment:

# Hardware-Aware Initialization
# device = "cuda" if torch.cuda.is_available() else "cpu"
if device == 'cuda':
    logger.info("⚡ GPU Detected: Enabling HyDE (Generative Intent).")
    self.llm_pipeline = load_qwen_model()
else:
    logger.warning("⚠️ CPU Detected: Skipping LLM to prevent timeouts.")
    self.llm_pipeline = None
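In GPU mode, the HyDE step itself is small. Here's a simplified sketch (the prompt wording is illustrative, not the exact production prompt; it assumes a Hugging Face text-generation pipeline):

# HyDE sketch: turn the user's query into a hypothetical synopsis, then embed THAT
def hyde_rewrite(self, query: str) -> str:
    if self.llm_pipeline is None:
        return query  # CPU / Lightweight Mode: fall back to the raw query

    prompt = (
        "Rewrite this anime request as a short synopsis of a show that matches it.\n"
        f"Request: {query}\nSynopsis:"
    )
    out = self.llm_pipeline(prompt, max_new_tokens=80, do_sample=False)
    # The pipeline returns the prompt plus completion; keep only the completion
    return out[0]["generated_text"][len(prompt):].strip()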

4. The Viral Feature: Generating Receipts with Playwright

Data is boring if you can't share it. I wanted users to be able to visualize their watch history as "Store Receipts."

I used Playwright (headless browser) to render HTML templates into High-DPI images.

(Example output: front and back of a "Tamako Market" receipt.)

The Challenge: Performance. Generating 100 receipts sequentially took forever.
The Solution: asyncio.gather.

# Batch Processing Receipts
async def convert_all(receipts):
    async with async_playwright() as p:
        browser = await p.chromium.launch()

        # Schedule one render task per receipt
        # (each task opens its own browser context for isolation)
        tasks = [render_receipt(browser, receipt) for receipt in receipts]

        # Execute all renders in parallel
        await asyncio.gather(*tasks)
        await browser.close()
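For completeness, here's a simplified version of the per-receipt task (receipt.html and receipt.out_path are illustrative attribute names):

import asyncio
from playwright.async_api import async_playwright

# Per-receipt render task (simplified sketch)
async def render_receipt(browser, receipt):
    context = await browser.new_context(device_scale_factor=2)  # High-DPI output
    page = await context.new_page()
    await page.set_content(receipt.html)  # receipt.html / receipt.out_path are illustrative
    await page.locator("#receipt").screenshot(path=receipt.out_path)
    await context.close()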

This reduced generation time from ~40 seconds to ~3 seconds for a 100-receipt batch.


5. Build it Yourself (Source Code)

I believe in Source-Available software. You shouldn't have to spend 2 months scraping and refactoring like I did.

I have packaged the Entire Ecosystem into a "Business-in-a-Box" for developers who want to launch their own Anime SaaS or learn advanced RAG patterns.

📦 What's in the box?

  1. The 8k RAG Dataset (Parquet).
  2. The Recommendation Engine (FastAPI + Streamlit Source Code).
  3. The Receipt Generator (Playwright + Async Logic).
  4. The Asset Library (2.3GB of Posters/Logos).

You can clone this, white-label it, and launch your own version today.

🚀 Launch Special (Limited Time)

To celebrate the launch, I'm offering a 10% Discount on the Ultimate Tier.

  • Code: AIMILAUNCH
  • Note: I will be raising the prices by $50 after the first 50 sales. Lock it in now.

💎 Get the Ultimate Ecosystem (Tier 3)

(If you just want the raw data, Tier 1 is available for $49 here.)


Let me know if you have any questions about the Nomic vs. OpenAI benchmarks or the HyDE implementation in the comments!

Go make something impossible. 🚀
