I’m continuing my series of posts about my RAG assistant for textbook grammar. The first version worked. Technically. You ask a question -> you get an answer.
And then I started testing it like a regular student… and I was blown away.
Instead of helping me learn, the model was just solving the exercises. It was spitting out ready-made answers. At first I thought, “Well, the prompt is bad.” I was wrong.
The problem was how I fed the book into the model.
How I did it at first (shame on me)
# pdf → chunks of N tokens each → Chroma
That's it. Once embedded, a "fill in the blanks" exercise and the rule it practices looked almost identical. So when a student asked for help with an exercise, the system would pull the answer key from another page, and the model would happily solve the problem for them.
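To make the failure concrete, here's a minimal sketch of that kind of naive chunking. It is not the original code: `naive_chunks`, the sample `page` text, and the use of whitespace-separated words as a stand-in for real tokens are all assumptions for illustration.

```python
def naive_chunks(text: str, n_tokens: int) -> list[str]:
    """Split text into consecutive chunks of n_tokens whitespace-separated words.

    Whitespace words stand in for real tokenizer tokens here (an assumption).
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n_tokens])
        for i in range(0, len(tokens), n_tokens)
    ]


# Hypothetical page: a rule, then an exercise, then an answer key.
page = (
    "Present perfect: use have/has plus the past participle. "
    "Exercise 1. Fill in the blanks: She ___ (finish) her homework. "
    "Answer key: 1. has finished"
)

chunks = naive_chunks(page, n_tokens=12)

for chunk in chunks:
    # No content type, no section, no link to a rule: just raw text.
    print(repr(chunk))
```

With this sample, the first chunk ends up holding the rule plus the start of the exercise, and the second holds the blank together with its answer. Nothing in either chunk tells the retriever which part is which.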
I was seriously cringing when I realized how stupid that was.
What I did instead
I wrote a parser that understands the textbook's structure before anything gets sent to Chroma.
Here's what the main classification looks like:
def _classify_content_type(text: str, section: str) -> str:
    combined = f"{section} {text}"
    if _EXERCISE_KEYWORDS.search(combined):
        return "exercise"
    if _REFERENCE_KEYWORDS.search(combined):
        return "reference"
    if _VOCAB_KEYWORDS.search(combined):
        return "vocabulary"
    if _EXAMPLE_KEYWORDS.search(combined):
        return "example"
    if _GRAMMAR_KEYWORDS.search(combined):
        return "rule"
    return "other"
And then each chunk is assigned rich metadata:
metadata = {
    "book": book_name,
    "page": page_num,
    "chapter": hierarchy.get("chapter", ""),
    "section": section,
    "content_type": content_type,
    "task_pattern": "fill_blank",  # or "choose", "rewrite", ...
    "grammar_terms": "present perfect, passive voice",
    "related_rule": "Unit 5 > 5.1",
    # ...
}
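This post doesn't show how the metadata gets used at query time, so treat the following as one plausible sketch rather than the actual retrieval code. It keeps only rule and example chunks and puts the rule linked through `related_rule` first. The function name `select_teaching_context`, the `id` field, and the assumption that `related_rule` matches a rule chunk's `section` string are all hypothetical.

```python
from typing import Any

Chunk = dict[str, Any]

# Content types that explain the grammar instead of solving the task.
TEACHING_TYPES = {"rule", "example"}


def select_teaching_context(candidates: list[Chunk], exercise: Chunk) -> list[Chunk]:
    """Drop non-teaching chunks and put the exercise's linked rule first."""
    allowed = [c for c in candidates if c["content_type"] in TEACHING_TYPES]
    linked = exercise.get("related_rule", "")
    # sorted() is stable: chunks whose section equals the linked rule
    # (key False) come before the rest (key True), order otherwise kept.
    return sorted(allowed, key=lambda c: c["section"] != linked)


# Hypothetical retrieval candidates, carrying only the fields used here.
candidates = [
    {"id": "a", "content_type": "exercise", "section": "Unit 5 > 5.2"},
    {"id": "b", "content_type": "example", "section": "Unit 4 > 4.3"},
    {"id": "c", "content_type": "rule", "section": "Unit 5 > 5.1"},
    {"id": "d", "content_type": "reference", "section": "Appendix"},
]
exercise = {
    "content_type": "exercise",
    "section": "Unit 5 > 5.2",
    "related_rule": "Unit 5 > 5.1",
}

context = select_teaching_context(candidates, exercise)
print([c["id"] for c in context])
```

The same restriction could probably be pushed down into the vector store as a metadata filter on `content_type` instead of filtering in Python afterwards; which is better depends on how many candidates you retrieve.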
The main lesson that really hit home for me:
Everyone goes on and on about embeddings, chunk size, hybrid search, rerankers, and which prompt is best. That’s important.
But if your system can't tell a rule from an exercise, then even with Claude 4 Opus behind it the result is still going to be crap. The model will start inventing the structure that you didn't give it.
GitHub link
HuggingFace Demo