DEV Community

Ksirailway Base

Your RAG pipeline is only as good as the shit you put in your vector database

I’m continuing my series of posts about my RAG assistant for grammar textbooks. The first version worked. Technically. You ask a question -> you get an answer.
And then I started testing it like a regular student would… and I was blown away.
Instead of helping me learn, the model was just solving the exercises, spitting out ready-made answers. At first I thought, “Well, the prompt is bad.” I was wrong.
The problem was how I fed the book into the model.
How I did it at first (shame on me)

# pdf → chunks of N tokens → Chroma
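For concreteness, that whole first version fits in a few lines. This is my reconstruction of the idea, not the original code; `split_into_chunks` and the 200-token window are assumptions:

```python
def split_into_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Cut text into windows of ~chunk_size whitespace tokens, blind to
    pages, sections, and the exercise/rule distinction -- the core mistake."""
    tokens = text.split()
    return [
        " ".join(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```

Every chunk goes straight into Chroma with nothing but its text, so the store has no way to tell a rule apart from the exercise that drills it.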

That's it. The "fill in the blanks" exercise and the rule associated with it looked almost identical when embedded. When a student asks for help with the exercise, the system pulls the answer key from another page. The model happily solves the problem for the student.
I was seriously cringing when I realized how stupid that was.
What I did instead
I wrote a parser that understands the textbook's structure before anything gets sent to Chroma.

Here's what the main classification looks like:

def _classify_content_type(text: str, section: str) -> str:
    """Label a chunk by running compiled keyword regexes over the section
    heading plus the chunk text. Order matters: exercise cues are checked
    first, so an exercise *about* a rule is still tagged as an exercise."""
    combined = f"{section} {text}"

    if _EXERCISE_KEYWORDS.search(combined):
        return "exercise"
    if _REFERENCE_KEYWORDS.search(combined):
        return "reference"
    if _VOCAB_KEYWORDS.search(combined):
        return "vocabulary"
    if _EXAMPLE_KEYWORDS.search(combined):
        return "example"
    if _GRAMMAR_KEYWORDS.search(combined):
        return "rule"
    return "other"

And then each chunk is assigned rich metadata:

metadata = {
    "book": book_name,
    "page": page_num,
    "chapter": hierarchy.get("chapter", ""),
    "section": section,
    "content_type": content_type,
    "task_pattern": "fill_blank" | "choose" | "rewrite" | ...,
    "grammar_terms": "present perfect, passive voice",
    "related_rule": "Unit 5 > 5.1",  
    ...
}
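The payoff comes at query time: once chunks carry a `content_type`, the retriever can be told which kinds of chunks a question is allowed to see (Chroma supports this natively via the `where=` metadata filter on `query`). Here is a stdlib sketch of the idea; the chunk dicts and `filter_candidates` are my illustration, not the post’s actual code:

```python
def filter_candidates(chunks: list[dict], allowed_types: set[str]) -> list[dict]:
    """Keep only chunks whose content_type is allowed for this query mode."""
    return [c for c in chunks if c["metadata"]["content_type"] in allowed_types]

chunks = [
    {"text": "Present perfect: have/has + past participle.",
     "metadata": {"content_type": "rule"}},
    {"text": "Answer key: 1) has gone 2) have seen",
     "metadata": {"content_type": "reference"}},
]

# Tutoring mode: the model may see rules and examples, never answer keys,
# so it can no longer "helpfully" hand the student a solved exercise.
tutor_context = filter_candidates(chunks, {"rule", "example"})
# tutor_context contains only the "rule" chunk
```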

The main lesson that really hit home for me:
Everyone goes on and on about embeddings, chunk size, hybrid search, rerankers, and which prompt is best. All of that matters.
But if your data doesn’t tell the model the difference between a rule and an exercise, even Claude 4 Opus will produce crap. The model will just invent structure you never gave it.
GitHub link
HuggingFace Demo
