DEV Community

우병수

Posted on • Originally published at techdigestor.com

Building a Free AI Layer Over NCERT/CBSE: How I Wired Together Ollama, LangChain, and a React Frontend for Indian Classrooms

TL;DR: A teacher in a government school in Rajasthan or Bihar has 40 NCERT PDFs sitting on a pendrive. She can open them, ctrl+F through them, maybe print a few pages.


What's in this article

  1. The Actual Problem: NCERT PDFs Are a Graveyard of Unsearchable Knowledge
  2. The Stack I Landed On (After Trying Three Others)
  3. Step 1 — Install Ollama and Pull a Model That Actually Fits in RAM
  4. Step 2 — Ingesting NCERT PDFs into ChromaDB
  5. Step 3 — Wiring the RAG Chain in LangChain
  6. Step 4 — The Frontend: Open WebUI vs Building Your Own
  7. Gotchas I Hit That Took Hours to Debug
  8. What This Actually Costs to Run

The Actual Problem: NCERT PDFs Are a Graveyard of Unsearchable Knowledge

A teacher in a government school in Rajasthan or Bihar has 40 NCERT PDFs sitting on a pendrive. She can open them, ctrl+F through them, maybe print a few pages. What she cannot do is ask them anything. A student comes up and says "explain photosynthesis like I'm 12" — and that PDF just stares back. No context, no adaptation, no patience. The content is there. The intelligence layer is not.

The paid tools that could close this gap are effectively locked out. OpenAI's API costs money per token. ChatGPT Plus is $20/month — which is real money in a context where school IT budgets don't exist. Google Gemini's free tier has rate limits that evaporate fast in a classroom of 40 students firing questions simultaneously. What we actually need is something that runs on the hardware that's already there: a mid-range laptop, no GPU required, no internet dependency, zero recurring cost.

The gap here is embarrassingly specific. NCERT is one of the most thoroughly structured free curricula on the planet — chapters are numbered, concepts build on each other, diagrams are labeled, exercises are categorized by difficulty. India has put decades of pedagogical work into these books and made them freely downloadable at ncert.nic.in. But there is not a single open source AI tool built specifically to treat that corpus as a knowledge base. Everything AI-in-education coming out of edtech startups is either English-private-school-focused or built on top of paid APIs with a thin wrapper. Nobody has actually ingested the NCERT corpus locally and built a retrieval layer around it.

What we're building is a local RAG (Retrieval-Augmented Generation) pipeline that does exactly this. The architecture is straightforward: you ingest the PDFs, chunk the text intelligently (respecting paragraph breaks rather than cutting at arbitrary offsets), embed those chunks using a model that runs on CPU, store the embeddings locally, and then wire up a small local LLM that answers questions by pulling the relevant NCERT context first. The model is far less likely to hallucinate about "what NCERT says" because its answer is grounded in the actual retrieved passage, and the prompt tells it to refuse when the passage doesn't cover the question. The whole thing runs offline. A teacher can set it up once, and it works in a school with no WiFi.

The thing that caught me off guard when I first tried to prototype this: NCERT PDFs are not clean. They're scanned in some editions, use non-standard fonts in others, and the Hindi-medium PDFs especially come out as near-garbage when you run a naive pdftotext on them. So the "ingest PDFs" step is actually the hardest part — not the model, not the retrieval, not the UI. Before any of the interesting AI work happens, you're doing OCR triage. That's where we'll spend real time in this guide.

The Stack I Landed On (After Trying Three Others)

The thing that caught me off guard wasn't choosing between LLMs — it was how much the serving layer matters when you're running on school hardware that hasn't been upgraded since 2019. I burned two weeks on llama.cpp direct before switching to Ollama, and the gap in usability is not subtle.

Ollama Over llama.cpp Direct and GPT4All

llama.cpp gives you raw control, but you're managing process lifecycle, model loading flags, and a janky HTTP server you bolt on separately. GPT4All has a decent GUI but the API surface kept changing under me and it doesn't expose enough of the generation parameters to tune for bilingual (Hindi/English) content. Ollama wraps all of that into a clean REST interface that's been stable enough to build on. One command to pull a model, one command to serve it:

# Pull a quantized model that actually fits in 16GB RAM
ollama pull mistral:7b-instruct-q4_K_M

# Serves on localhost:11434 by default
ollama serve

# Sanity check
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct-q4_K_M",
  "prompt": "Explain photosynthesis for Class 8 NCERT students.",
  "stream": false
}'

The q4_K_M quantization is the sweet spot I landed on — not the smallest, but meaningfully better output quality than q4_0 for mixed-script content, and it still fits comfortably under 10GB VRAM (or RAM, if you're CPU-only). More on that in a second.
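The same sanity check can be scripted, which is handy once the rest of the pipeline is in Python anyway. This is a minimal standard-library sketch of the curl call above; the names build_payload and ask_ollama are my own, and it assumes Ollama is serving on its default localhost:11434.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default address

def build_payload(model: str, prompt: str) -> bytes:
    """Encode the same JSON body the curl example sends."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")

def ask_ollama(model: str, prompt: str, timeout: float = 120.0) -> str:
    """POST the prompt to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": false the server replies with a single JSON object
        # whose "response" field holds the full completion.
        return json.loads(response.read())["response"]
```

Calling ask_ollama("mistral:7b-instruct-q4_K_M", "Explain photosynthesis for Class 8 NCERT students.") should print the same text the curl call returns, provided the server is up and the model has been pulled.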

ChromaDB, Not Weaviate

I looked at Weaviate first because the ecosystem articles all recommend it. Then I tried to run it on the school's refurb box and watched it immediately allocate 4GB just idling. Weaviate is a distributed system wearing a single-node costume; it's designed for clusters. ChromaDB, by contrast, has a persistent local mode that stores everything on disk as a SQLite database plus HNSW index files. No separate process, no Docker-in-Docker nonsense, just:

pip install chromadb

# In code — persistent storage to a local directory
import chromadb
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection("ncert_chapters")

The total idle memory footprint with ChromaDB is under 200MB. That matters when your machine also has to run Open WebUI and Ollama simultaneously.

LangChain Python — Version Pin This or You Will Regret It

LangChain's 0.1 → 0.2 migration broke basically every import path I had. from langchain.vectorstores import Chroma became from langchain_chroma import Chroma. The RetrievalQA chain got deprecated in favor of LCEL (LangChain Expression Language). I'm not complaining — LCEL is actually cleaner — but if you don't pin versions, a pip install --upgrade three months in will silently break your RAG chain at 2am before an exam week. My requirements file:

# requirements.txt — don't let this float
langchain==0.2.16
langchain-community==0.2.16
langchain-chroma==0.1.4
langchain-ollama==0.1.3
pypdf==4.3.1        # handles the NCERT PDFs well

I picked Python over JS specifically for the document loaders. LangChain's PyPDFLoader with pypdf backend handles NCERT PDFs — which are a mix of Devanagari, transliterated Hindi, and English — far better than anything in the JS ecosystem right now. The JS loaders kept mangling the Unicode on Chapter 1 of the Class 10 Science book and I spent a day debugging before I just switched runtimes entirely.
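Going back to the version pins for a moment: a pinned requirements file only helps if the environment still matches it. Below is a small sketch that compares installed versions against the pins, using only the standard library. The function names are my own, and the PINS dict simply repeats the requirements file above.

```python
from importlib import metadata

# Pins copied from the requirements.txt above
PINS = {
    "langchain": "0.2.16",
    "langchain-community": "0.2.16",
    "langchain-chroma": "0.1.4",
    "langchain-ollama": "0.1.3",
    "pypdf": "4.3.1",
}

def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def find_drift(pins: dict, lookup=installed_version) -> dict:
    """Map each package that doesn't match its pin to an (expected, found) pair."""
    drift = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            drift[package] = (expected, found)
    return drift
```

One way to use it is to call find_drift(PINS) at the top of the server's start script and refuse to start if the result is non-empty, so a stray upgrade fails loudly instead of at query time.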

Open WebUI as the Frontend

I was planning to build a React frontend until I found Open WebUI. It connects directly to Ollama's API endpoint, has model switching built in, supports conversation history, and the setup is literally one Docker command or a pip install open-webui. Students get a ChatGPT-style interface and I didn't write a single line of frontend code. You configure it by setting OLLAMA_BASE_URL=http://localhost:11434 and it autodiscovers whatever models you've pulled. The one honest limitation: the UI isn't localized to Hindi, so for younger students (Class 6–7), you need a teacher mediating.

Why Not LlamaIndex

LlamaIndex is genuinely solid. I'm not dunking on it. But when I tested it against NCERT's PDFs — some of which have two-column layouts, embedded diagrams with caption text, and chapter headers in bold Devanagari — LangChain's loader pipeline gave me cleaner text chunks out of the box. LlamaIndex requires more custom node parsing to handle those edge cases. If your corpus is clean English PDFs, go with either. If it's government textbooks that were clearly scanned, typeset in PageMaker, and then exported to PDF by a committee, stick with LangChain's loaders.

The Hardware Reality

This entire stack runs on a ₹35,000 refurbished HP ProDesk with a Core i5-8500, 16GB DDR4, and a 512GB SSD. No GPU. CPU inference with mistral:7b-instruct-q4_K_M gives you roughly 3–5 tokens/second on that hardware — slow enough that you'll want to set user expectations ("it thinks for a few seconds"), but totally usable for a student reading a response, not watching a live stream. If you go larger (13B+), you're waiting 30+ seconds per response and students will think it's broken. The 7B quantized models are the practical ceiling for CPU-only school servers right now.
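To make those throughput numbers concrete, here is the arithmetic as a tiny sketch. The answer lengths (60, 150, 300 tokens) are my own illustrative assumptions, and the estimate ignores prompt processing and model load time, so real waits will be somewhat longer.

```python
def response_seconds(answer_tokens: int, tokens_per_second: float) -> float:
    """Time to generate an answer, ignoring prompt processing and model load."""
    return answer_tokens / tokens_per_second

# Roughly 3-5 tokens/second on the hardware described above
for tokens in (60, 150, 300):
    slow = response_seconds(tokens, 3.0)
    fast = response_seconds(tokens, 5.0)
    print(f"{tokens:>3} tokens: {fast:.0f}-{slow:.0f} s")
```

A 60-token answer lands at 12 to 20 seconds, while a 300-token answer takes a minute or more, so keeping answers short matters as much as model choice on this hardware.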

Step 1 — Install Ollama and Pull a Model That Actually Fits in RAM

The thing that burned me first was pulling a model that looked fine on paper and then watching a student wait 45 seconds for a response about friction. That's not a latency problem — that's a classroom management problem. Teachers will abandon the tool in a week. So before anything else, let's get the model choice right.

Install Ollama on Ubuntu 22.04 with the one-liner:

curl -fsSL https://ollama.com/install.sh | sh

This sets up the systemd service automatically, drops the binary into /usr/local/bin/ollama, and starts the daemon. Takes about 30 seconds on a decent connection. Now the model choice actually matters here — if your machine has 16GB RAM, pull Mistral 7B instruct:

# Good for English-heavy NCERT content, Class 6–12 range
ollama pull mistral:7b-instruct

If you're running on 8GB RAM (common on older school server hardware), pull the 2B variant of Gemma 2 instead:

# Fits in 8GB RAM, still coherent for factual CBSE Q&A
ollama pull gemma2:2b

Don't pull llama3.1:8b on a no-GPU machine. I made this mistake. On CPU-only hardware — even a decent 8-core machine — you're looking at 40–50 second response times per query. That's not a guess; I clocked 47 seconds on a Core i5 machine with 16GB RAM. Gemma2:2b on the same machine was under 12 seconds. The model quality difference for Class 9 Science explanations is genuinely not worth the wait.

Test immediately after pulling. Run this exact query because it's representative of what students will actually ask:

ollama run mistral:7b-instruct "Explain Newton's second law for Class 9"

Time it. If you get a full response in under 20 seconds, you're in good shape for classroom use. If it's 30+ seconds, drop to gemma2:2b. The response quality from Mistral 7B instruct on NCERT-level content is solid — it's been trained on enough textbook-style English that its explanations land at the right reading level without extra prompting.
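If you'd rather not time it by hand, the check can be scripted. This is a rough sketch that shells out to the ollama CLI and applies the thresholds from the paragraph above; the function names are mine, and the "borderline" band between 20 and 30 seconds is my own label for the range the thresholds leave open.

```python
import subprocess
import time

def time_query(model: str, prompt: str) -> float:
    """Run one prompt through the ollama CLI and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(["ollama", "run", model, prompt], check=True, capture_output=True)
    return time.perf_counter() - start

def verdict(seconds: float) -> str:
    """Apply the thresholds from the text: under 20 s is fine, 30 s or more is too slow."""
    if seconds < 20:
        return "ok for classroom use"
    if seconds >= 30:
        return "too slow, drop to gemma2:2b"
    return "borderline, test with more questions"
```

The first call after a restart also pays for loading the model into RAM, so it is worth calling time_query twice and judging the second result.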

Now the gotcha that isn't in the README: Ollama binds to 127.0.0.1:11434 by default. That's fine if every student is on the same machine, which they're not. For LAN access from student devices, you need to override the systemd environment. Don't edit the main service file directly — use a drop-in override so updates don't clobber your config:

# Create the override directory and file
sudo mkdir -p /etc/systemd/system/ollama.service.d/
sudo nano /etc/systemd/system/ollama.service.d/override.conf

Paste this into the file:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify it's listening on all interfaces
ss -tlnp | grep 11434
# Should show: 0.0.0.0:11434, not 127.0.0.1:11434

One more thing — Ollama doesn't have auth by default. Once you bind to 0.0.0.0, anyone on the LAN can hit the API. For a school intranet that's usually acceptable, but don't expose port 11434 to the internet directly. Put Nginx in front of it if you need to add basic auth or rate limiting — we'll get to that in a later step.
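After changing the bind address, it helps to confirm from a student machine that the port is actually reachable, since a firewall on the server can still block it. A minimal standard-library sketch (the function name is my own):

```python
import socket

def can_reach(host: str, port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

From a student laptop you would call something like can_reach("192.168.1.50"), where the address is a placeholder for your server's LAN IP. A False result with Ollama running usually points at the bind address or a firewall rule.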

Step 2 — Ingesting NCERT PDFs into ChromaDB

The thing that surprised me most was how easy the source material is to get, even if the files themselves are messy. NCERT publishes all textbooks as free PDFs at ncert.nic.in. They are free to download, but NCERT retains the copyright rather than releasing them into the public domain, so check the site's terms before redistributing them. The structure is predictable too: each grade has a folder per subject. You can bulk-download an entire grade with:

# Downloads all PDFs for Class 10 — adjust the path pattern per grade
wget -r -A.pdf -nd -P ./ncert_pdfs/class10 \
  https://ncert.nic.in/textbook/pdf/

That said, the folder structure on their server isn't perfectly consistent across grades, so you'll end up doing a few manual spot-checks. Class 6–12 math and science are well-structured. History and civics PDFs for Class 6–8 are where things get messy — more on that in a moment.

Get the environment set up with pinned versions. I'm specifying langchain==0.2.16 deliberately — the 0.3.x series moved things around in ways that break community loaders silently, and you don't want to debug that mid-project.

python3 -m venv ncert-env
source ncert-env/bin/activate

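# Same pins as the requirements.txt above; langchain-chroma and
# langchain-ollama are needed for the chain in Step 3
pip install \
  langchain==0.2.16 \
  langchain-community==0.2.16 \
  langchain-chroma==0.1.4 \
  langchain-ollama==0.1.3 \
  chromadb \
  pypdf==4.3.1 \
  pdfplumber \
  sentence-transformers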

The embedding model I landed on is all-MiniLM-L6-v2 from sentence-transformers. It's 90MB, runs fine on a CPU-only machine (important if you're keeping this free), and handles mixed Hindi-English text reasonably well — not perfectly, but well enough that Chapter titles in Devanagari don't completely break retrieval. If your target content is mostly English-medium NCERT, it's more than sufficient.

Here's the real problem nobody mentions in the tutorials: NCERT PDFs for Class 6, 7, and 8 — especially social studies — contain scanned page images embedded in otherwise valid PDF containers. pypdf will parse these files without errors but return empty strings for the scanned pages. You get zero chunks, zero warnings, and your vector DB silently misses entire chapters. The fix is a fallback to pdfplumber, which at least surfaces the issue, and you can log which pages came back empty:

import pdfplumber
from pypdf import PdfReader

def extract_text_with_fallback(pdf_path):
    """Return one text string per page, warning about pages that look scanned."""
    pages = []
    reader = PdfReader(pdf_path)
    # Open pdfplumber once per file instead of once per sparse page
    with pdfplumber.open(pdf_path) as plumber:
        for i, page in enumerate(reader.pages):
            text = page.extract_text() or ""
            if len(text.strip()) < 50:  # threshold: scanned pages return near-nothing
                # fallback: pdfplumber sometimes recovers more from embedded fonts
                plumber_text = plumber.pages[i].extract_text() or ""
                if len(plumber_text.strip()) > len(text.strip()):
                    text = plumber_text
                else:
                    print(f"WARNING: Page {i+1} in {pdf_path} appears scanned, OCR needed")
            pages.append(text)
    return pages

For genuinely scanned pages you'll eventually need Tesseract OCR, but that's a separate pass — don't block your initial ingestion on it. Log the gaps and move on.
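"Log the gaps" can be as simple as a JSON file mapping each PDF to the pages that came back empty, which the later OCR pass can read. A minimal sketch, assuming the per-page text list produced by a function like extract_text_with_fallback; the helper names and the ocr_todo.json filename are my own.

```python
import json

def find_sparse_pages(pages: list, threshold: int = 50) -> list:
    """Return 1-based page numbers whose extracted text is shorter than threshold."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if len(text.strip()) < threshold
    ]

def record_gaps(gaps: dict, pdf_path: str, pages: list) -> None:
    """Add this PDF's sparse pages to the gaps dict, skipping PDFs with none."""
    sparse = find_sparse_pages(pages)
    if sparse:
        gaps[pdf_path] = sparse

def save_gaps(gaps: dict, out_path: str = "ocr_todo.json") -> None:
    """Write the collected gaps to disk for a later OCR pass."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(gaps, handle, indent=2)
```

The 50-character threshold mirrors the one used in the extraction code, so a page flagged there is also recorded here.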

Chunk size 800 with overlap 100 is not arbitrary. NCERT paragraphs are dense and self-contained — a typical explanation paragraph runs 120–200 words. At 800 characters you usually get one complete concept with surrounding context. I tried 500 and retrieval kept cutting mid-explanation. I tried 1200 and the chunks became too broad, pulling in unrelated concepts that hurt precision. The 100-character overlap handles sentences that bridge chunk boundaries — things like "Therefore, from the above..." that make no sense without the preceding chunk.
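To see what the overlap does mechanically, here is a deliberately simplified sliding-window chunker. It is not what RecursiveCharacterTextSplitter does internally (the real splitter prefers to cut at paragraph and sentence separators, so chunk boundaries vary); it only illustrates how an 800-character window with a 100-character overlap repeats the tail of one chunk at the head of the next.

```python
def sliding_chunks(text: str, chunk_size: int = 800, overlap: int = 100) -> list:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, each chunk starts 700 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With a 2,000-character page this gives three chunks starting at offsets 0, 700, and 1400, and the last 100 characters of each chunk reappear at the start of the next one. That repeated span is what keeps a sentence like "Therefore, from the above..." attached to its context.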

Here's the full ingest script you can actually run:

import os
import chromadb
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Persistent storage — NEVER use the in-memory client for this.
# In-memory means re-embedding 500+ PDFs on every restart.
client = chromadb.PersistentClient(path="./ncert_db")

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " "],  # respects paragraph breaks first
)

def ingest_pdf(pdf_path: str, collection_name: str = "ncert_docs"):
    print(f"Loading: {pdf_path}")
    loader = PyPDFLoader(pdf_path)

    try:
        pages = loader.load()
    except Exception as e:
        print(f"PyPDFLoader failed on {pdf_path}: {e}")
        return

    # Filter out empty pages before splitting — saves you from empty chunks
    pages = [p for p in pages if len(p.page_content.strip()) > 50]

    if not pages:
        print(f"SKIP (all pages empty — likely scanned): {pdf_path}")
        return

    chunks = splitter.split_documents(pages)
    print(f"{len(pages)} pages, {len(chunks)} chunks")

    # Chroma.from_documents creates the collection if needed and adds the chunks;
    # re-running on the same PDF adds them again rather than updating in place
    Chroma.from_documents(
        documents=chunks,
        embedding=embedding_model,
        client=client,
        collection_name=collection_name,
    )

if __name__ == "__main__":
    pdf_dir = "./ncert_pdfs"
    for root, _, files in os.walk(pdf_dir):
        for fname in files:
            if fname.endswith(".pdf"):
                ingest_pdf(os.path.join(root, fname))

    print("Done. Collections:", client.list_collections())

One operational thing: the first full run across all grades (Class 6–12, all subjects) takes around 45–90 minutes on a standard laptop CPU because the embedding model processes every chunk sequentially. Run it overnight. After that, PersistentClient loads from disk in seconds and you're not re-embedding anything. Keep the ./ncert_db directory in your .gitignore — it'll be several hundred MB and shouldn't go into version control.
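One caveat on re-runs: as far as I can tell, Chroma.from_documents assigns fresh random IDs when none are supplied, so ingesting the same PDF twice stores its chunks twice. A possible workaround is to derive a stable ID from each chunk's source, page, and text, and pass the list through the ids argument. The sketch below shows only the ID derivation; the helper names are my own.

```python
import hashlib

def chunk_id(source: str, page: int, text: str) -> str:
    """Build a stable ID from where a chunk came from and what it contains."""
    digest = hashlib.sha256()
    for part in (source, str(page), text):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields can't blur together
    return digest.hexdigest()[:32]

def ids_for(chunks: list) -> list:
    """Compute IDs for a list of (source, page, text) tuples, in order."""
    return [chunk_id(source, page, text) for source, page, text in chunks]
```

With LangChain Document objects, the tuples would come from each chunk's metadata (PyPDFLoader records source and page) plus its page_content. Treat the exact wiring as an assumption to verify against your pinned versions.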

Step 3 — Wiring the RAG Chain in LangChain

The part that bit me first wasn't the embeddings or the vector store — it was LangChain's own versioning. If you're on langchain>=0.2.0 (which you should be), RetrievalQA.from_chain_type is deprecated and throws a wall of warnings. The replacement is LCEL — the pipe syntax — and it's actually cleaner once you stop fighting the migration. Here's what the old code looked like versus what you need now:

# OLD — don't use this, langchain 0.1.x style
from langchain.chains import RetrievalQA

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# NEW — LCEL pipe syntax, langchain>=0.2.0
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The LCEL version is more explicit about what's flowing where, which matters when you're debugging why a Class 8 student is getting a university-level answer. Now for the system prompt — this is the part most tutorials skip entirely. A generic "answer based on context" prompt is useless here because Mistral 7B, without grade context, will happily explain photosynthesis using terms like "thylakoid membrane proton gradient" to a kid who just needs to know plants need sunlight. The prompt that actually works:

from langchain_core.prompts import ChatPromptTemplate

SYSTEM = """You are a tutor for Indian school students. The student is in Class {grade}.
Answer only based on the provided NCERT context below.
Use simple language appropriate for Class {grade}.
If the answer is not in the context, say exactly: "This topic isn't covered in your NCERT chapter."
Do not guess or add information from outside the context.

Context:
{context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM),
    ("human", "{question}")
])

The full chain wiring with OllamaLLM and your ChromaDB retriever from Step 2 looks like this:

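from langchain_ollama import OllamaLLM
from langchain_chroma import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# grade comes from your UI/session; hardcode for testing
GRADE = "10"

# Use the exact tag pulled in Step 1
llm = OllamaLLM(model="mistral:7b-instruct", temperature=0.1)  # low temp = fewer hallucinations

# Must be the same embedding model and directory used at ingest time in Step 2;
# query vectors from a different model won't match the stored ones
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)
vectorstore = Chroma(
    persist_directory="./ncert_db",
    embedding_function=embeddings,
    # must match the collection_name passed to ingest_pdf() in Step 2
    collection_name=f"ncert_grade_{GRADE}"   # per-grade collections = cleaner retrieval
)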

retriever = vectorstore.as_retriever(
    search_type="mmr",          # MMR reduces redundant chunks from the same paragraph
    search_kwargs={"k": 4, "fetch_k": 10}
)

def format_docs(docs):
    # also print source metadata during dev so you can verify chapter provenance
    for d in docs:
        print(f"  [chunk from: {d.metadata.get('source', 'unknown')}]")
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
        "grade": lambda _: GRADE   # inject grade into every prompt call
    }
    | prompt
    | llm
    | StrOutputParser()
)

One gotcha: ChatPromptTemplate with a system message that has {grade} requires that key to be in the runnable dict — if you forget to pass it, you get a cryptic KeyError at invocation time, not at chain construction time. LangChain won't complain until you actually call chain.invoke(). Test everything from CLI before you wire up any frontend. Here's a real Class 10 Science question and how to verify the retrieved chunks are coming from the right chapter:

# test_chain.py: run this before touching Streamlit or FastAPI
from rag_chain import chain  # assumes the Step 3 code is saved as rag_chain.py

question = "What is the role of the large intestine in human digestion?"

# invoke with debug output from format_docs
answer = chain.invoke(question)

print("\n--- ANSWER ---")
print(answer)

# expected chunk sources should reference something like:
# ncert_class10_science_chapter6.pdf or Life Processes
# if you see chunks from Class 10 Chemistry instead, your collection naming is wrong

When I ran this, the first thing I caught was that two chunks were being pulled from the Class 10 Chemistry chapter on chemical reactions — because I hadn't namespaced my ChromaDB collections by grade AND subject, just by grade. The fix was naming collections ncert_grade10_science rather than ncert_grade10. That format_docs print statement saved me from shipping a bug where a biology question returned chemistry context. Do this CLI verification with at least three questions per subject before you build any UI layer on top of it.
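A note on the search_type="mmr" setting used in the retriever: MMR (maximal marginal relevance) first fetches a wider candidate set (fetch_k=10 here), then picks k=4 of them one at a time, each time preferring chunks that are relevant to the question but not too similar to the chunks already picked. The pure-Python sketch below illustrates the idea on toy vectors. It is not LangChain's implementation, and the 0.5 weighting is an assumption chosen for the example.

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query: list, candidates: list, k: int = 4, lambda_mult: float = 0.5) -> list:
    """Return indices of up to k candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Penalty: similarity to the closest chunk already chosen
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two identical candidate vectors and one different one, plain similarity ranking returns both duplicates first. MMR returns one duplicate and then the different chunk, which is the behavior you want when several overlapping chunks from the same paragraph all match the question.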

Step 4 — The Frontend: Open WebUI vs Building Your Own

The part that tripped me up hardest was thinking I could just point Open WebUI at Ollama and call it done. You can — but then all your carefully built RAG chain, your NCERT chapter chunking, your grade-aware context injection, none of it fires. Open WebUI bypasses your Python stack entirely and talks raw to Ollama. So let me walk through the three real options, because the choice depends on how far you are from a working demo versus a classroom-ready product.

Option A — Open WebUI (the fast path)

One command and you have a ChatGPT-looking interface that runs on any browser in the school's LAN:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open WebUI auto-discovers Ollama on host.docker.internal:11434 because of that --add-host flag. Without it, the container can't see your host machine's Ollama process and you'll spend 45 minutes wondering why the model list is empty. Students or teachers hit http://<server-ip>:3000, create a login, and they're talking to whichever model you pulled in Step 1. The first-run admin account becomes the classroom admin. That's genuinely good enough for a proof-of-concept you can show a principal next Tuesday.

The real limitation: Open WebUI's Ollama integration routes directly to the model. Your RAG pipeline — the LangChain chain that retrieves relevant NCERT passages, filters by chapter, and injects grade-level context — never gets called. Open WebUI does support custom OpenAI-compatible endpoints, which is exactly the escape hatch you need.

Option B — FastAPI wrapper (the right move for RAG)

Write a 40-line FastAPI app that accepts /v1/chat/completions in OpenAI format, runs your RAG chain internally, then returns a response in OpenAI format. Open WebUI thinks it's talking to OpenAI. Your chain fires on every message.

from fastapi import FastAPI
from pydantic import BaseModel
from typing import List
import time, uuid

from rag_chain import chain  # the LCEL chain from Step 3, saved as rag_chain.py

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: List[Message]

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    # Plain def on purpose: FastAPI runs it in a threadpool, so the blocking
    # chain call doesn't stall other students' requests
    user_query = req.messages[-1].content
    # the Step 3 chain takes the question as a plain string
    answer = chain.invoke(user_query)

    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": answer},
            "finish_reason": "stop"
        }]
    }

# run with: uvicorn app:app --host 0.0.0.0 --port 8000

Then in Open WebUI, go to Settings → Connections → Add OpenAI API, set the base URL to http://host.docker.internal:8000/v1 (Open WebUI appends /chat/completions and /models to whatever base URL you give it) and any dummy string as the API key (your FastAPI app doesn't check it). Now every message goes through your chain. The gotcha: Open WebUI sends a GET /v1/models request on startup to populate the model dropdown, so add that endpoint or it'll show an error even though chat still works:

@app.get("/v1/models")
async def models():
    return {"object": "list", "data": [
        {"id": "ncert-rag", "object": "model", "owned_by": "local"}
    ]}

Option C — Custom React + Vite (more work, better curriculum UX)

If teachers need subject dropdowns (Physics / Chemistry / Biology), grade selectors (Class 9 / 10 / 11 / 12), and chapter navigation baked into the UI, Open WebUI can't give you that without plugin gymnastics. A minimal Vite + React app can. The frontend just posts to your FastAPI wrapper, but now the UI can pass structured metadata with every request:

const response = await fetch("http://server-ip:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "ncert-rag",
    messages: [{ role: "user", content: userMessage }],
    // extend the body with curriculum context your FastAPI picks up
    grade: selectedGrade,
    subject: selectedSubject,
    chapter: selectedChapter
  })
});

Your FastAPI endpoint reads those extra fields and narrows the vector search to the right chapter's embeddings before calling Ollama. The trade-off is real: you're now maintaining a frontend, handling auth yourself, and explaining to a teacher why the chat doesn't work in IE 11 on the school's 2014 Dell. Bootstrap this only after the demo lands and you have actual teacher feedback on what the UX needs to do.
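On the server side, "narrowing the vector search" could look like turning those extra fields into a metadata filter. The sketch below builds a Chroma-style filter dict. The metadata keys grade, subject, and chapter are hypothetical: the Step 2 ingest script does not write them, so you would need to attach them to each chunk at ingest time for this to work.

```python
def build_where(grade=None, subject=None, chapter=None):
    """Turn optional curriculum fields into a Chroma-style metadata filter."""
    conditions = []
    if grade is not None:
        conditions.append({"grade": str(grade)})
    if subject:
        conditions.append({"subject": subject.lower()})
    if chapter is not None:
        conditions.append({"chapter": int(chapter)})
    if not conditions:
        return None            # no filter: search the whole collection
    if len(conditions) == 1:
        return conditions[0]   # a single condition needs no operator
    return {"$and": conditions}  # Chroma combines multiple conditions with $and
```

The result would be passed to the retriever through its search kwargs as the filter. Note also that the ChatRequest model from Option B would need grade, subject, and chapter declared as optional fields, since Pydantic drops undeclared fields by default.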

Honest take on sequencing

Start with Option A + Option B together — Open WebUI in Docker plus the FastAPI wrapper running on the same machine. You get a working, RAG-backed demo in one afternoon. The wrapper is the architectural piece that matters: once it's there, you can swap the frontend for anything. I'd only invest in the custom React app (Option C) after you've watched three teachers actually use Open WebUI and heard them say "I wish I could filter by chapter" out loud. Build UX for real complaints, not anticipated ones.

The FastAPI Wrapper That Makes Open WebUI Talk to Your RAG Chain

The thing that caught me off guard here is that Open WebUI doesn't care what's actually running behind your API — it just wants something that speaks the OpenAI chat completions format. That's the key insight. You're not integrating with Open WebUI; you're lying to it in a language it already understands. FastAPI makes this almost embarrassingly easy to fake.

Here's the full working wrapper. This wires directly to the LangChain RAG chain from Step 3 — the one that returns a string from chain.invoke(). The critical part is matching the response shape exactly, including the object, choices, and nested message fields. Open WebUI will silently display nothing if you get the structure wrong, which is a terrible debugging experience.

# app.py
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import time, uuid

# Import your chain from Step 3
from rag_chain import chain  # chain.invoke("question text") returns a string

app = FastAPI()

# Register CORS before the app starts serving
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],       # lock this down in prod, wildcard is fine for dev
    allow_credentials=False,   # wildcard origins can't be combined with credentials
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()

    # Grab the last user message; Open WebUI sends full history
    messages = body.get("messages", [])
    user_query = next(
        (m["content"] for m in reversed(messages) if m["role"] == "user"),
        ""
    )

    # Your RAG chain does the actual heavy lifting. It takes the question as a
    # plain string; ainvoke keeps the event loop free while the model generates.
    answer = await chain.ainvoke(user_query)

    return JSONResponse({
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "ncert-rag",
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": answer
            },
            "finish_reason": "stop"
        }],
        "usage": {
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "total_tokens": 0
        }
    })

# Open WebUI probes this endpoint to list available models
@app.get("/v1/models")
async def list_models():
    return {
        "object": "list",
        "data": [{
            "id": "ncert-rag",
            "object": "model",
            "created": int(time.time()),
            "owned_by": "local"
        }]
    }

Spin it up with:

uvicorn app:app --host 0.0.0.0 --port 8001 --reload

The --reload flag is useful while you're iterating on the chain. Once you hit port 8001 directly with curl and see a valid completion response, you're ready to point Open WebUI at it:

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is photosynthesis?"}]}'

In Open WebUI, navigate to Settings → Connections → OpenAI API. Set the base URL to http://localhost:8001/v1 and use any non-empty string as the API key — FastAPI isn't checking it, but Open WebUI refuses to save a blank field. After saving, click the refresh icon next to the connection; if your /v1/models endpoint is working, you'll see ncert-rag appear in the model dropdown immediately.
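Because a wrong response shape fails silently in the UI, it can save time to validate the wrapper's JSON before involving Open WebUI at all. The checker below is a sketch that only covers the fields the wrapper sets (object, choices, and the nested message), not the full OpenAI schema; the function name is my own.

```python
def completion_problems(response: dict) -> list:
    """List what is missing from an OpenAI-style chat completion response."""
    problems = []
    if response.get("object") != "chat.completion":
        problems.append('"object" should be "chat.completion"')
    choices = response.get("choices")
    if not isinstance(choices, list) or not choices:
        problems.append('"choices" should be a non-empty list')
        return problems
    message = choices[0].get("message") if isinstance(choices[0], dict) else None
    if not isinstance(message, dict):
        problems.append('choices[0] should contain a "message" object')
        return problems
    if message.get("role") != "assistant":
        problems.append('message "role" should be "assistant"')
    if not isinstance(message.get("content"), str):
        problems.append('message "content" should be a string')
    return problems
```

Feed it the parsed JSON from the curl call; an empty list means the shape matches what the wrapper is supposed to return.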

The CORS error is the one that kills most people on first try. Open WebUI listens on port 8080 by default (typically published as 3000 when you run the Docker image with -p 3000:8080), and browsers enforce CORS even for requests to localhost on a different port. You'll see something like Access to fetch at 'http://localhost:8001/v1/chat/completions' from origin 'http://localhost:3000' has been blocked by CORS policy in the browser console. The CORSMiddleware block in the code above fixes this, provided it's registered on the same app object that serves your routes. In my case it only started working once I moved the middleware above the route definitions, after 20 wasted minutes of it doing nothing. One other gotcha: if Open WebUI runs inside Docker, localhost in the base URL refers to the container itself, not your host. Use your host machine's LAN IP or the Docker bridge IP (usually 172.17.0.1 on Linux) instead of localhost in the base URL.

Gotchas I Hit That Took Hours to Debug

These are the issues that cost me the most time. None of them show up in the LangChain docs, the ChromaDB README, or any tutorial that uses clean English PDFs. You only find them after your students start getting half-sentences and wrong-grade content back.

Hindi/Devanagari Text From NCERT PDFs Comes Out Garbled

I burned almost a full afternoon on this. NCERT PDFs — especially the older scanned-and-OCR'd ones from before 2018 — store Devanagari using non-standard font encodings. Both pdfminer.six and pypdf try to decode these and produce absolute garbage: random ASCII characters, reversed syllables, missing matras. Your chunks look fine at a glance in the terminal until you realize the embedding model is getting fed gibberish.

The fix that actually worked was switching to pdfplumber with tolerance tuning:

import pdfplumber

with pdfplumber.open("ncert_history_class10.pdf") as pdf:
    for page in pdf.pages:
        # 3 is pdfplumber's default for both tolerances; set explicitly so it's easy to tune.
        # Lower values insert more spaces between glyphs, higher values merge neighbours.
        text = page.extract_text(x_tolerance=3, y_tolerance=3)
        if text:
            print(text[:200])

Not perfect for every PDF in the corpus, but it got me from ~40% readable chunks to ~92% on the history and social science books. The science books in Hindi are still rough — those seem to be actual scans with embedded image text, and at that point you need Tesseract with the hin language pack.
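A rough way to put a number on "readable" is to measure how much of each extracted chunk actually falls inside the Unicode Devanagari block. The sketch below is one possible heuristic, not necessarily how the percentages above were measured; the function names and the 0.8 threshold are assumptions for illustration.

```python
import unicodedata

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F  # Unicode Devanagari block

def devanagari_ratio(text: str) -> float:
    """Fraction of letter/mark characters that fall inside the Devanagari block."""
    # Ignore digits, punctuation and whitespace; they say nothing about the script.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END for ch in relevant)
    return hits / len(relevant)

def looks_readable(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical cutoff: treat a Hindi chunk as usable above the threshold."""
    return devanagari_ratio(text) >= threshold
```

Run it over every chunk from a Hindi-medium book before embedding. Chunks that come back as mostly ASCII letters are the legacy-font garbage and are candidates for the Tesseract path instead.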

ChromaDB Version Conflicts Will Silently Corrupt Your Setup

langchain-community 0.2.x internally expects ChromaDB's new client API that shipped in 0.4.0. If pip resolves to 0.3.x — which happens more often than you'd expect in school lab environments where someone already had a partial install — you get errors that look like attribute errors on the Collection object, not version mismatch errors. Took me forever to trace it back. Pin this hard in your requirements.txt:

# requirements.txt
langchain-community==0.2.16
langchain-core==0.2.38
chromadb==0.5.3   # do NOT let this float — 0.4.x and 0.3.x both have API differences
sentence-transformers==3.0.1

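Run pip install -r requirements.txt in a fresh venv first, then pip check, to verify the resolution is clean before deploying to student machines. Don't add --no-deps here: it skips dependency resolution entirely, which is exactly the thing you're trying to test.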
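Because a mismatch surfaces as a confusing attribute error instead of a version error, it can help to check the pins at startup. This is a minimal sketch using the standard library's importlib.metadata; the helper names are made up for illustration, and the pins are copied from the requirements file above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements.txt above
PINNED = {
    "langchain-community": "0.2.16",
    "langchain-core": "0.2.38",
    "chromadb": "0.5.3",
    "sentence-transformers": "3.0.1",
}

def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def find_mismatches(pinned=PINNED, lookup=installed_version):
    """Map each package that is missing or off-pin to (expected, found)."""
    problems = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            problems[package] = (expected, found)
    return problems
```

Calling find_mismatches() at the top of the ingest script and refusing to continue when it returns a non-empty dict turns a silent misconfiguration into a readable message.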

Ollama Truncates Dense NCERT Passages at 2048 Tokens

Ollama's default context window (num_ctx) was 2048 tokens when I set this up, no matter what the underlying Mistral model supports. That sounds like a lot until you're feeding it a chunk from the NCERT Class 10 History chapter on Nationalism, plus the system prompt, plus the question. You get answers that stop mid-sentence with no warning: no error, no ellipsis, just a clean cut. I wasted time thinking it was a network timeout before I checked the Ollama logs.

Create a custom Modelfile and rebuild the model definition locally:

# Modelfile.mistral-ncert
FROM mistral

# 4096 is the safe ceiling for most NCERT chunks + system prompt overhead
# Going to 8192 is possible but slows inference noticeably on CPU-only machines
PARAMETER num_ctx 4096
PARAMETER temperature 0.3
SYSTEM "You are a helpful tutor for Indian school students studying NCERT/CBSE curriculum. Answer based only on the provided context."
Then register the model with Ollama and test it:
ollama create mistral-ncert -f Modelfile.mistral-ncert
ollama run mistral-ncert "What was the Rowlatt Act?"
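To catch an overflow before the model silently cuts things off, you can estimate the prompt size up front. The sketch below uses a crude characters-per-token ratio, which is an assumption (Devanagari text usually tokenizes into more tokens per character than English); the function names and the 512-token answer reserve are also illustrative.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the 4-chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1

def fits_in_context(system_prompt, chunks, question, num_ctx=4096, reserve_for_answer=512):
    """True if the prompt pieces plus a reserved answer budget fit inside num_ctx."""
    parts = [system_prompt, question, *chunks]
    prompt_tokens = sum(estimate_tokens(part) for part in parts)
    return prompt_tokens + reserve_for_answer <= num_ctx
```

If this returns False for a typical query, either retrieve fewer chunks, shrink the chunk size at ingest, or raise num_ctx and accept the slower inference.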

Port 11434 Blocked on School Network Switches

School IT departments lock down non-standard ports at the switch level, not just the firewall. Port 11434 is Ollama's default and it was blocked cold on every school switch I tested on. The fix is straightforward — run Ollama on 8080 which is almost always whitelisted for HTTP traffic:

# systemd override — create /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:8080"
Then reload systemd and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama

# verify it's listening
curl http://localhost:8080/api/tags

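Then update your LangChain Ollama base URL to http://localhost:8080 and you're done (check first that nothing else on that machine, such as Open WebUI, is already using 8080). If 8080 is also blocked, try 443, which is almost never filtered, but note that binding to a port below 1024 requires root or CAP_NET_BIND_SERVICE.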
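Before changing any config, it's worth confirming from a student machine which ports actually get through the switch. Here is a small sketch using only the standard library; the function name is hypothetical, and you would call it with the address of the machine running Ollama.

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts alike
        return False
```

Looping over 11434, 8080 and 443 with this from a lab PC tells you which port to settle on. A False only means the TCP handshake failed, so confirm the service is actually listening on that port before blaming the network.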

Retrieval Returning Wrong Grade Content Without Metadata Filtering

This one is subtle and took me a while to notice because the answers looked plausible. A Class 12 Chemistry student asks about electrochemistry, and the retriever confidently pulls a chunk from the Class 6 Science chapter on basic materials because the word "electrode" appears there too. Without grade-level filtering, your vector similarity search has no concept of curriculum level.

The fix is two-part: store metadata at ingest time, then filter at query time. During ingest:

from langchain_community.vectorstores import Chroma

# Every chunk needs grade and subject in metadata; don't skip this
# (values are hardcoded here for a single book; derive them per PDF in a real ingest)
docs_with_metadata = []
for chunk in chunks:
    chunk.metadata.update({
        "grade": "12",
        "subject": "chemistry",
        "chapter": "electrochemistry",
        "board": "CBSE"
    })
    docs_with_metadata.append(chunk)

vectorstore = Chroma.from_documents(
    docs_with_metadata,
    embedding_function,
    persist_directory="./ncert_chroma_db"
)

Then at query time, pass the filter argument, which LangChain forwards to ChromaDB as its where clause:

results = vectorstore.similarity_search(
    query="explain electrolytic cells",
    k=4,
    # ChromaDB 0.4+ rejects a where dict with more than one key, so combine conditions with $and
    filter={"$and": [{"grade": "12"}, {"subject": "chemistry"}]}
)

The gotcha inside the gotcha: ChromaDB's where filter matches on exact value and type. If you stored grade as an integer during ingest and query with the string form, it silently returns zero results. Standardize on strings throughout.
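One way to make the "strings throughout" rule hard to break is to route both ingest and query through the same small helpers. This is a sketch with hypothetical function names; it assumes that recent ChromaDB versions expect multiple conditions to be wrapped in a single $and operator.

```python
def normalize_metadata(metadata: dict) -> dict:
    """Store every metadata value as a string so ingest and query types always agree."""
    return {key: str(value) for key, value in metadata.items()}

def build_where(**conditions):
    """Build a ChromaDB-style filter; multiple conditions are combined with $and."""
    clauses = [{key: str(value)} for key, value in conditions.items() if value is not None]
    if not clauses:
        return None  # no filtering requested
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}
```

With these in place, similarity_search(query, k=4, filter=build_where(grade=12, subject="chemistry")) works whether the caller passes 12 or "12".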

What This Actually Costs to Run

The thing that surprises most people when I show them this setup: the ongoing monetary cost is basically zero. The thing that surprises them less but hurts more: the time cost is real, and nobody budgets for it.

On hardware, I've run Mistral 7B comfortably on a refurbished Dell OptiPlex with an Intel i5-10500 and 16GB DDR4. Inference latency sits around 2-4 seconds per response for a typical NCERT question — acceptable for a student reading the reply, not acceptable if you're expecting ChatGPT speed. That machine handles 5-10 concurrent students without sweating. Beyond that, you'll see queuing. If your school has a computer lab of 30+, either shard the load across two machines or look at quantized 4-bit GGUF variants that cut memory pressure significantly. Anything with an AMD Ryzen 5 5600 or Intel i5 10th gen and 16GB RAM is your minimum viable floor.

Power draw is genuinely boring, which is good. A low-power desktop under inference load pulls 60-80W, about what a single old incandescent bulb draws. Over a full school day (say, 8 hours, even if it sat at that load the whole time), you're talking at most 0.5-0.65 kWh. At typical Indian grid rates that's a few rupees a day. This matters when you're pitching the project to a school administration that's used to paying electricity bills for 30 CRT monitors.
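The arithmetic behind that estimate is simple enough to rerun with your own numbers. In this sketch the tariff of 8 INR per kWh is purely an illustrative assumption, not a figure from this setup; the result scales linearly with whatever rate your school actually pays.

```python
def daily_energy_kwh(watts: float, hours: float) -> float:
    """Energy in kWh for a device drawing `watts` continuously for `hours`."""
    return watts * hours / 1000.0

def daily_cost_inr(watts: float, hours: float, tariff_inr_per_kwh: float) -> float:
    """Daily electricity cost in rupees at the given tariff."""
    return daily_energy_kwh(watts, hours) * tariff_inr_per_kwh

ASSUMED_TARIFF = 8.0  # INR per kWh, an illustrative figure only

for watts in (60, 80):
    kwh = daily_energy_kwh(watts, hours=8)
    cost = daily_cost_inr(watts, hours=8, tariff_inr_per_kwh=ASSUMED_TARIFF)
    print(f"{watts} W for 8 h: {kwh:.2f} kWh, about INR {cost:.2f}")
```

Since real use is intermittent, treat these as upper bounds for a machine of this class.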

The software stack is free to run and stays that way. Ollama is MIT licensed, LangChain is MIT, and ChromaDB is Apache 2.0. Open WebUI started out MIT but has since changed its license terms, so read the current LICENSE file before you rebrand or redistribute it. No vendor lock-in, no "free tier" that expires, no sales call when you hit a usage ceiling. What you do need to watch: ChromaDB's persistence layer is file-based by default and can silently corrupt if you kill the process mid-write. Run the server with chroma run --path pointed at a dedicated directory on an ext4 volume, not tmpfs.

The real cost is maintenance time, and the way to contain it is automation. I built a health-check script that runs every 5 minutes via cron and sends a WhatsApp alert through Twilio's free tier if either Ollama or ChromaDB stops responding. Twilio's free tier gives you a sandbox number with enough credits for alert traffic at this scale. Here's the actual script:

#!/usr/bin/env python3
# health_check.py — runs via cron every 5 minutes
import requests
from twilio.rest import Client
import os

# Ollama's default port; use 8080 here if you moved it as described in the port section
OLLAMA_URL = "http://localhost:11434/api/tags"
CHROMA_URL = "http://localhost:8000/api/v1/heartbeat"

TWILIO_SID = os.environ["TWILIO_ACCOUNT_SID"]
TWILIO_TOKEN = os.environ["TWILIO_AUTH_TOKEN"]
TWILIO_FROM = "whatsapp:+14155238886"   # Twilio sandbox number
ALERT_TO = "whatsapp:+91XXXXXXXXXX"     # your number

def check(url, name):
    try:
        r = requests.get(url, timeout=5)
        r.raise_for_status()
    except Exception as e:
        send_alert(f"[AI Classroom] {name} is DOWN: {e}")

def send_alert(msg):
    client = Client(TWILIO_SID, TWILIO_TOKEN)
    client.messages.create(body=msg, from_=TWILIO_FROM, to=ALERT_TO)
    print(msg)

check(OLLAMA_URL, "Ollama")
check(CHROMA_URL, "ChromaDB")

Wire it into cron with */5 * * * * /path/to/venv/bin/python /opt/classroom/health_check.py and you'll know within 5 minutes if anything dies. Cron doesn't load your shell profile, so make sure TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN are set in the crontab or in a wrapper script. The Twilio sandbox requires the recipient to opt in first (send "join" followed by your sandbox code to the number), which is a one-time setup. For a broader look at how free and low-cost tooling choices compound across a small-budget operation, including adjacent infrastructure picks that translate directly to school deployments, the Essential SaaS Tools for Small Business in 2026 guide covers that territory well. The pattern of "free tier for alerts, open source for core, own your data" applies whether you're running a classroom AI layer or a five-person startup.
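One thing to be aware of with a five-minute cron job: while a service stays down, the script will fire an alert on every run, which eats through trial credits quickly. A possible way around that is a small state file with a cooldown. The sketch below is an assumption about how you might add it, not part of the script above; the function name, state file and one-hour cooldown are all illustrative.

```python
import json
import time
from pathlib import Path

def should_alert(name: str, is_down: bool, state_file: Path,
                 cooldown_s: int = 3600, now=None) -> bool:
    """Return True only when `name` is down and no alert went out within the cooldown."""
    now = time.time() if now is None else now
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    if not is_down:
        # Service recovered: forget the last alert so the next outage notifies at once
        state.pop(name, None)
        decision = False
    else:
        last_sent = state.get(name)
        decision = last_sent is None or now - last_sent >= cooldown_s
        if decision:
            state[name] = now
    state_file.write_text(json.dumps(state))
    return decision
```

In the health check you would call should_alert(name, True, state_path) inside the except branch and should_alert(name, False, state_path) after a successful request, sending the WhatsApp message only when it returns True.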

Scaling It: From One Classroom to a District

The thing that catches most people off guard here is how far you can push a CPU-only setup before it falls apart. I ran Mistral 7B on a 16-core Ryzen server with 32GB RAM — no GPU — and it held up fine for individual tutoring sessions. But "fine for one student" and "fine for a class of 30 hitting it simultaneously" are completely different claims. Before you tell any administrator that this system can handle their school, run locust against it first.

# locustfile.py: hammer your FastAPI endpoint before you make promises
from locust import HttpUser, task, between

class StudentUser(HttpUser):
    wait_time = between(5, 15)  # students think before submitting

    @task
    def ask_question(self):
        # Targets the OpenAI-compatible route exposed by the FastAPI wrapper in Step 4
        self.client.post("/v1/chat/completions", json={
            "messages": [{
                "role": "user",
                "content": "Explain photosynthesis as per NCERT Class 10"
            }]
        }, timeout=60)  # 60s timeout; CPU inference can be slow

# run with: locust -f locustfile.py --headless -u 10 -r 2 --host http://localhost:8001
# watch response times; if p95 goes past 30 seconds, students will bail

On a CPU-only machine, I saw ~10 concurrent users as the practical ceiling before response times crossed 15 seconds per query. That's usable for a small lab where students work asynchronously. The moment you get an NVIDIA RTX 3060 or better into the picture, the math changes completely. Ollama picks up CUDA automatically once the NVIDIA driver is installed (plus the NVIDIA Container Toolkit if you run it in Docker); you don't touch a single Ollama environment variable. Inference drops from 15 seconds to under 3 seconds on Mistral 7B, and suddenly 30+ concurrent users becomes realistic. The 3060's 12GB of VRAM comfortably holds a 4-bit quantized Mistral 7B (roughly 4-5GB of weights) fully in GPU memory without layer offloading.

For multi-school deployments, the architecture that actually works is a centralized backend with thin school-side clients. Run ChromaDB and Ollama on one beefy central server (either a district server room or a powerful machine in the main school). Push only the FastAPI wrapper and Open WebUI to machines at satellite schools. Everything runs over LAN — no internet dependency for inference, which matters because school internet is notoriously unreliable during exams when you need it most.

# docker-compose.yml — bring up the full stack with one command
# Run this on your central server; school machines only run the webui service

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]  # remove this block entirely if no GPU
    restart: unless-stopped

  chroma:
    image: chromadb/chroma:0.5.3  # pin this — chroma breaks APIs between minor versions
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE
    restart: unless-stopped

  api:
    build: ./fastapi-app
    ports:
      - "8080:8080"
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - CHROMA_HOST=chroma
      - CHROMA_PORT=8000
    depends_on:
      - ollama
      - chroma
    restart: unless-stopped

  webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  chroma_data:

Do not use the embedded, SQLite-backed ChromaDB client in a multi-user environment. The default PersistentClient opens a SQLite file directly from your app process, and SQLite allows only one writer at a time. When three students ask questions simultaneously and those requests trigger document retrieval alongside embedding writes, you will get database locked errors. I hit this about two days into a pilot and it was maddening to debug because it only appeared under concurrent load. Switch to HTTP client mode in your FastAPI app and point it at the containerized ChromaDB server, so that a single server process owns the storage:

import chromadb

# BAD — don't do this in production, SQLite can't handle concurrent writes
# client = chromadb.PersistentClient(path="./chroma_db")

# GOOD — ChromaDB running as a separate service handles concurrency correctly
client = chromadb.HttpClient(
    host='chroma-server',  # or 'chroma' if using the compose file above
    port=8000
)

collection = client.get_or_create_collection(
    name="ncert_chapters",
    metadata={"hnsw:space": "cosine"}  # cosine works better than L2 for text embeddings
)

One operational gotcha: when Ollama is installed natively on the central server and school machines need to reach it, it only binds to 127.0.0.1 by default. Set OLLAMA_HOST=0.0.0.0 in its environment or it won't be reachable from other machines on the LAN. This isn't in the quick-start docs and it will burn you. Also, a fresh Ollama container ships with no models, so pull them explicitly after it starts: docker compose exec ollama ollama pull mistral:7b-instruct-q4_K_M. With the ollama_data volume from the compose file they persist across restarts; without a volume they vanish with the container. The q4_K_M quantization is the right balance: smaller than q8 so it fits in VRAM, but noticeably sharper than q2 or q3, which start hallucinating chapter content.

What I Would Do Differently Starting Over

The most expensive mistake I made was spending three weeks optimizing around mistral:7b before realizing gemma2:2b was the right default for this use case. For NCERT content — which is structured, curriculum-aligned, and repetitive in the best way — the answer quality gap between a 7B and a 2B model is genuinely smaller than benchmarks suggest. The 7B model sounds more confident, but confidence isn't the bottleneck when a Class 8 student asks "what is photosynthesis" — correctness and speed are. On my M1 MacBook, mistral:7b takes 8–12 seconds for first token; gemma2:2b is under 3. During live classroom demos, that difference is the gap between "wow" and "is it broken?"

# Pull both and feel the difference yourself
ollama pull gemma2:2b
ollama pull mistral:7b

# Quick comparison — same NCERT prompt, time both
time ollama run gemma2:2b "Explain the water cycle as described in NCERT Class 7 Chapter 16"
time ollama run mistral:7b "Explain the water cycle as described in NCERT Class 7 Chapter 16"

The metadata schema mistake will haunt you. I ingested about 50,000 chunks from NCERT PDFs before building a proper schema, figuring I'd "add metadata later." Retrofitting grade, subject, chapter, and board fields into an existing ChromaDB collection is painful: the update call itself is the easy part, but by then nothing in the collection tells you which chunk came from which chapter. You end up either re-ingesting everything (expensive if you've done cleanup) or writing fragile filename-parsing logic to backfill. Do this on day zero:

# Every chunk should have this metadata structure from the start
chunk_index = 42
chunk_text = "..."  # placeholder: the text of this chunk from your splitter

metadata = {
    "grade": "8",
    "subject": "science",
    "chapter": "5",
    "chapter_title": "Coal and Petroleum",
    "board": "CBSE",
    "source_pdf": "ncert_sci_8.pdf",
    "chunk_index": chunk_index
}

collection.add(
    documents=[chunk_text],
    metadatas=[metadata],
    ids=[f"ncert_sci_8_ch5_{chunk_index}"]  # stable, human-readable ID per chunk
)
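To keep that structure from drifting between ingest scripts, it can be pinned down in a small dataclass that validates itself and generates the chunk ID. This is a sketch, not the schema code from this project; the class and method names are hypothetical, and the field set simply mirrors the dict above.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChunkMeta:
    grade: str
    subject: str
    chapter: str
    chapter_title: str
    source_pdf: str
    chunk_index: int
    board: str = "CBSE"

    def __post_init__(self):
        # Fail at ingest time rather than discovering empty filters at query time
        for name in ("grade", "subject", "chapter"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string, got {value!r}")

    def chunk_id(self) -> str:
        """Deterministic ID in the same style as the example above."""
        stem = self.source_pdf.rsplit(".", 1)[0]
        return f"{stem}_ch{self.chapter}_{self.chunk_index}"

    def to_chroma(self) -> dict:
        """Plain dict suitable for the metadatas argument of collection.add."""
        return asdict(self)
```

A record built as ChunkMeta("8", "science", "5", "Coal and Petroleum", "ncert_sci_8.pdf", 42) yields the ID ncert_sci_8_ch5_42, and passing an integer grade raises immediately instead of poisoning the collection.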

The teacher-facing upload UI is the thing I regret skipping most. Right now, adding a new textbook to the system means SSH-ing into the box, dropping a PDF into a specific directory, running the ingest script, and hoping ChromaDB doesn't choke on a malformed page. That's fine for me. It completely blocks every non-technical person — teachers, curriculum coordinators, school admins — who should be the ones managing this. If I were starting over, I'd build a dead-simple web UI on day one: drag-and-drop PDF, select grade/subject/chapter from dropdowns, click ingest. Flask + a background Celery task or even just a subprocess call is enough. The backend doesn't need to be elegant; it needs to exist.

# Minimum viable ingest endpoint; add auth before this touches prod
import subprocess
import sys

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)

@app.route("/ingest", methods=["POST"])
def ingest_pdf():
    f = request.files["pdf"]
    grade = request.form["grade"]
    subject = request.form["subject"]
    chapter = request.form["chapter"]

    filename = secure_filename(f.filename)
    path = f"/data/pdfs/{filename}"
    f.save(path)

    # Fire and forget; show status via polling endpoint
    # sys.executable keeps the ingest inside the same virtualenv as this app
    subprocess.Popen([sys.executable, "ingest.py", path, grade, subject, chapter])
    return jsonify({"status": "ingesting", "file": filename})
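Since the form values end up as chunk metadata, it's worth rejecting bad input before the ingest process starts. The sketch below shows one way to validate the three fields; the function name and the subject list are hypothetical and would need to match whatever vocabulary your own ingest uses.

```python
ALLOWED_SUBJECTS = {"science", "maths", "history", "geography", "hindi", "english"}  # hypothetical list

def validate_ingest_form(form: dict) -> list:
    """Return a list of human-readable problems; an empty list means the form is usable."""
    errors = []
    grade = str(form.get("grade", "")).strip()
    # isdecimal() only accepts characters that int() can parse, including Devanagari digits
    if not grade.isdecimal() or not 1 <= int(grade) <= 12:
        errors.append("grade must be a whole number from 1 to 12")
    subject = str(form.get("subject", "")).strip().lower()
    if subject not in ALLOWED_SUBJECTS:
        errors.append(f"subject must be one of {sorted(ALLOWED_SUBJECTS)}")
    if not str(form.get("chapter", "")).strip():
        errors.append("chapter is required")
    return errors
```

In the endpoint you would call this on request.form and return the error list with a 400 status when it is non-empty, so a teacher sees what to fix instead of a silent failed ingest.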

Testing with actual students in week six instead of week one was a real mistake. I spent those six weeks running synthetic eval sets — questions I wrote myself, questions GPT-4 generated, questions from past NCERT exam papers. None of it matched what students actually asked. Real Class 9 students ask things like "bhaiya carbon ka formula kya hai" (mixing Hindi mid-sentence), they ask about diagrams that exist only as images in the PDF, and they ask follow-up questions that assume the previous answer was stored as context. Every single one of those patterns exposed a retrieval or prompting gap that no benchmark caught. Get five students in front of the thing during week one and just watch them use it. You'll rebuild your chunking strategy after the first session.


Disclaimer: This article is for informational purposes only. The views and opinions expressed are those of the author(s) and do not necessarily reflect the official policy or position of Sonic Rocket or its affiliates. Always consult with a certified professional before making any financial or technical decisions based on this content.


Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.
