Most RAG tutorials just say "chunk your PDF and call OpenAI". I wanted to build something more real — a proper pipeline that actually ingests, cleans, embeds, and serves knowledge from Isaac Newton's Wikipedia page end to end.
The result is Newton LLM. You can now ask things like "What were Newton's contributions to calculus?" and get proper answers with sources instead of made-up ones.
Here's how I actually built it and what I learned.
The Problem With Most RAG Demos
Every YouTube RAG tutorial follows the same boring steps: load PDF, split into chunks, put in vector store, done.
But nobody talks about the real issues:
How do you keep the data fresh when the source changes?
How do you clean messy web data before embedding?
How do you separate the ingestion part from the serving part?
How do you make the whole thing actually deployable?
Newton LLM tries to solve these. It's not just a notebook; it's a small system.
Architecture Overview
The system has two main layers:
Data Ingestion Layer (the offline part)
Source → Airflow → MongoDB
I pull data from Wikipedia about Newton: his life, physics, math, optics, and so on.
Apache Airflow runs the whole ETL pipeline through a DAG. It fetches, cleans, and transforms the raw content. No random scripts or cron jobs. Airflow handles retries, scheduling and monitoring.
MongoDB stores the cleaned documents. This is my "source of truth" before anything gets embedded.
Why not embed straight from Wikipedia? Because raw scraped pages are full of garbage — menus, references, bad HTML. You need to clean it first. MongoDB gives me a clean staging area.
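To give a feel for the kind of cleanup involved, here's a minimal sketch using only the standard library. `clean_wiki_text` is a hypothetical helper, not the actual pipeline code; real scraped pages deserve a proper HTML parser, but the categories of noise are the same:

```python
import re

def clean_wiki_text(raw: str) -> str:
    """Rough cleanup of scraped Wikipedia text before embedding."""
    text = re.sub(r"<[^>]+>", " ", raw)       # drop leftover HTML tags
    text = re.sub(r"\[\d+\]", "", text)       # drop citation markers like [12]
    text = re.sub(r"\[edit\]", "", text)      # drop section edit links
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text
```

Garbage in the staging store means garbage in the vector store, so it pays to do this before MongoDB, not after.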
RAG Serving Layer (the online part)
Qdrant ← Batch Embeddings ← MongoDB
Since Newton's Wikipedia page doesn't change often, I use batch embedding instead of embedding live. Documents go from MongoDB → embedding model → Qdrant in scheduled batches. It's cheaper and faster.
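The batching half of that step is simple enough to sketch with the standard library. The names here are illustrative, not Newton LLM's actual code; the real embedding call and Qdrant upsert would slot into the commented loop:

```python
from typing import Iterable, Iterator

def batched(docs: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield documents in fixed-size batches for the embedding step."""
    batch: list[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# In the real pipeline, each batch gets embedded and written out, roughly:
# for batch in batched(mongo_docs, 64):
#     vectors = embed([d["text"] for d in batch])  # hypothetical embed() call
#     qdrant.upsert(...)                           # write vectors + payloads
```

Batching amortizes model and network overhead across documents, which is where most of the cost savings over one-at-a-time embedding comes from.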
When a user asks a question:
User Question
→ FastAPI gets it
→ Query gets embedded
→ Qdrant finds similar chunks
→ Retrieved docs + question → LLM
→ Answer with sources
The LLM always answers from retrieved context, which cuts down hallucinations a lot.
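Under the hood, the "finds similar chunks" step is nearest-neighbor search over embedding vectors. Here's a toy pure-Python version of what Qdrant does at scale, just to make the mechanism concrete (`top_k` and the `index` structure are illustrative, not Qdrant's API):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the k chunk texts whose vectors are closest to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

A real vector store replaces the brute-force sort with an approximate index (HNSW in Qdrant's case), but the retrieval contract is the same: embed the query, rank stored vectors by similarity, hand the top chunks to the LLM.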
Tech Stack
Orchestration: Apache Airflow (for DAGs, retries, monitoring)
Document Store: MongoDB (flexible for messy Wikipedia data)
Vector Store: Qdrant (fast and open source)
Backend: FastAPI (quick and clean)
Frontend: Next.js / Streamlit (Next for real use, Streamlit for quick tests)
Key Decisions
Batch Embedding > Real-time Embedding
Most tutorials embed on the fly. For static data like this, it's wasteful to keep re-embedding the same content. I run batch embedding once, or on a schedule, and save a lot of time and money.
Airflow instead of simple Python script
I could have just written one scrape_and_embed.py file. But Airflow gives me retries, proper logging, scheduling, and clean separation between steps. If Wikipedia is down, it retries automatically. For anything bigger than a toy project, orchestration actually matters.
Separating Ingestion from Serving
The scraping/cleaning part and the answering part are completely separate. Ingestion can break or update without touching the live RAG system. The serving layer just reads from Qdrant.
What I'd Do Differently Next Time
Add a reranker — plain vector search alone isn't enough; a cross-encoder reranker on top of it would noticeably improve result quality.
Build evaluation from the start — without proper eval, you don't know if your changes actually help.
Add more sources — right now only Wikipedia. Academic papers would make it way stronger.
Try hybrid search — combine vector search with keyword search (BM25).
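On the hybrid search idea: a common way to merge the vector and BM25 result lists is reciprocal rank fusion (RRF). This is a sketch of the technique, not something Newton LLM implements today:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids via reciprocal rank fusion.

    Each document scores 1 / (k + rank) in every list it appears in;
    k=60 is the conventional default from the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of RRF is that it only needs ranks, not scores, so you never have to normalize cosine similarities against BM25 scores — documents that both retrievers like simply float to the top.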
Final Thoughts
Building a simple RAG demo is easy. Building something that actually works properly is much harder. Most of the work is in the boring parts: cleaning data, setting up orchestration, separating concerns, and deciding when to use batch vs real-time.
Newton LLM showed me that good retrieval matters more than which LLM you use. If your pipeline is solid, even a smaller model gives good answers.
If you're building RAG, focus on the data pipeline first, not fancy prompts.
