If you're new to this project, start with the original guide here: Building a RAG-powered PDF Chatbot - V1
Follow-up guide after the first iteration of the bot: Refactoring RAG PDFBot - V2
In Version 2, we built on our Version 1 foundation by splitting everything into separate files. That was great - we cleaned up our monolithic code and gave our chatbot more structure. But let's be honest: everything still lived inside one Streamlit app. The logic for uploading files, generating answers, and even managing the vector store - all of it was handled inside Streamlit. That's fine for a prototype, but not quite production-ready.
With Version 3, we've taken a major step forward.
Source Code V3: Zlash65/rag-bot-fastapi
What's New in Iteration 3?
We've split the application into a real Frontend and Backend:
- Frontend: Built using Streamlit, it handles all the UI.
- Backend: Powered by FastAPI, it takes care of PDF processing, vector storage, querying, and AI interactions.
 
Why this split?
- Separation of Concerns: The UI doesn't need to know how the AI logic or embeddings work.
- Flexibility: Want to use Gradio or React for your UI? Now you can, without touching the backend.
- Scalability: This separation allows better logging, monitoring, and potential deployment on different servers.
 
Here's a quick look.
Our Project Structure
We now have two separate folders:
  
  
client/ - Streamlit Frontend
client/
├── app.py                      # Main entrypoint for Streamlit
├── components/                 # Chat UI, inspector, sidebar
│   ├── chat.py
│   ├── inspector.py
│   └── sidebar.py
├── state/
│   └── session.py              # Session setup and helper functions
├── utils/
│   ├── api.py                  # API calls to FastAPI server
│   ├── config.py               # API URL config
│   └── helpers.py              # High-level API abstractions
├── requirements.txt
└── README.md
- Stateless API interactions via requests
- UI elements handled via sidebar, chat input, and toggleable views
- Modular components for Chat, Inspector, and Uploads
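To give a feel for what that stateless API layer looks like in practice, here is a rough sketch of the kind of helpers utils/api.py could contain. The function names and payload shapes below are illustrative assumptions, not the repo's exact code:

```python
# utils/api.py - illustrative sketch of the client-side HTTP layer
# (function names and payload shapes are assumptions, not the repo's exact code)
import requests

API_URL = "http://localhost:8000"  # would normally be read from utils/config.py


def get_providers() -> list:
    """Ask the backend which LLM providers are available."""
    response = requests.get(f"{API_URL}/llm", timeout=30)
    response.raise_for_status()
    return response.json()


def upload_pdfs(uploaded_files) -> dict:
    """Send the user's PDFs to the backend for chunking, embedding, and storage."""
    files = [
        ("files", (f.name, f.getvalue(), "application/pdf"))
        for f in uploaded_files  # Streamlit UploadedFile objects
    ]
    response = requests.post(f"{API_URL}/upload_and_process_pdfs", files=files, timeout=120)
    response.raise_for_status()
    return response.json()
```

Because every helper just wraps a plain HTTP call, the Streamlit app never has to import LangChain, Chroma, or any model SDK.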
 
  
  
server/ - FastAPI Backend
server/
├── api/
│   ├── routes.py               # API endpoints for upload, chat, models etc.
│   └── schemas.py              # Input/output data validation with Pydantic
├── core/
│   ├── document_processor.py   # PDF handling: save, chunk, split
│   ├── llm_chain_factory.py    # LLM, embeddings, chain creation
│   └── vector_database.py      # ChromaDB handling: load, upsert, search
├── config/
│   └── settings.py             # API keys, model setup, directories
├── utils/
│   └── logger.py               # Logging for debugging and monitoring
├── main.py                     # FastAPI app setup
├── requirements.txt
└── README.md
Our backend now has full control over:
- Which LLM provider is used
- Which model the user selects
- How PDFs are stored and processed
- Which embeddings we generate
- What responses we send back
This means we can extend things easily - add another model, another embedding technique, or swap the vector store - without touching our frontend.
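The entry point itself stays tiny. A stripped-down main.py might look something like this - a sketch that assumes api/routes.py exposes an APIRouter, so details may differ from the actual repo:

```python
# main.py - minimal FastAPI app setup (illustrative sketch)
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from api.routes import router  # assumes routes.py exposes an APIRouter

app = FastAPI(title="RAG PDF Bot API")

# Let the Streamlit client (served from a different origin) call these endpoints
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(router)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```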
What Changed from Iteration 2
Here's a quick breakdown of how we evolved:
| Feature | Iteration 2 | Iteration 3 | 
|---|---|---|
| Codebase | One Streamlit app | Separate client (UI) + server (logic) | 
| PDF Handling | Inside frontend | Via FastAPI API | 
| LLM Response | Direct from Streamlit | API-based response | 
| Embeddings + Vectorstore | Managed by UI | Fully controlled by backend | 
| Inspector | Inside sidebar (cramped) | Main UI toggle - cleaner | 
| Extending Models | Needed code change in UI | Plug-and-play via config | 
| File Validation | None | PDF size/type check in backend | 
| Future Extensions | Hard | Clean hooks for scaling | 
| UX | Basic | Toggle-based views, downloads, resets | 
| Text Splitting | RecursiveCharacterTextSplitter | TokenTextSplitter (LLM-aware, cleaner splits) | 
Why We Switched to TokenTextSplitter
In earlier versions, we used RecursiveCharacterTextSplitter to chunk our documents. It works by splitting the text at "natural" breakpoints - like paragraphs, then sentences, then words, then characters - to get close to the target chunk size in characters.
But here's the problem: LLMs like GPT, Claude, or Gemini don't read text in characters - they read tokens. A token is roughly 3-4 characters, or about 0.75 words. That means a 1,000-character chunk might come out to 250 tokens, or 350, or more, depending on the text - it's unpredictable.
To fix this, we now use TokenTextSplitter, which splits based on actual token counts, giving precise control over chunk size and overlap. This leads to more reliable inputs and avoids going over model limits.
Simple Example
Let's take this sentence:
"LangChain helps developers build applications with LLMs more efficiently."
That's about 75 characters but only around 12 tokens.
RecursiveCharacterTextSplitter
- With RecursiveCharacterTextSplitter(chunk_size=30), we might get:
Chunk 1: "LangChain helps developers "
Chunk 2: "build applications with LLMs "
Chunk 3: "more efficiently."
Visually clean, but token count varies and could overflow model limits.
Notice how the bot was not able to give a correct response to our question because of improper chunking.
TokenTextSplitter
- With TokenTextSplitter(chunk_size=10, chunk_overlap=2), we get chunks like:
Chunk 1: ["Lang", "Chain", "helps", ..., "LLMs", "more"]
Chunk 2: ["LLMs", "more", "efficiently", "."]
Each chunk is capped at 10 tokens (with the 2-token overlap carried over), making chunk sizes predictable and LLM-friendly.
We get more accurate responses when splitting chunks by token.
By using TokenTextSplitter, we gain better control, consistency, and contextual accuracy - making our RAG pipeline more reliable.
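If you want to see the difference yourself, a small comparison script along these lines works. The import path below assumes a recent LangChain release where the splitters live in langchain_text_splitters (older versions import them from langchain.text_splitter), and TokenTextSplitter needs tiktoken installed:

```python
# Compare character-based vs token-based chunking on the example sentence
from langchain_text_splitters import RecursiveCharacterTextSplitter, TokenTextSplitter

text = "LangChain helps developers build applications with LLMs more efficiently."

# Character-based: chunk_size counts characters, so the token count per chunk varies
char_splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=0)
print(char_splitter.split_text(text))

# Token-based: chunk_size counts tokens (tiktoken under the hood),
# so each chunk maps directly onto the model's context budget
token_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=2)
print(token_splitter.split_text(text))
```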
Better User Experience
Previously, the inspector tool was a bit hidden. We crammed it into the sidebar and showed the results there too - not the best experience.
In this iteration, we made it visible in the main chat area. There's a toggle in the sidebar where we can switch between:
- Chat View
- Inspector View
 
Now, we get full-width, readable responses whether we're chatting with PDFs or inspecting our vectorstore. It's simple and intuitive.
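The toggle itself is only a few lines of Streamlit. A simplified version of the idea (the widget and labels here are illustrative, not the exact sidebar.py code) might look like:

```python
# components/sidebar.py - simplified view toggle (illustrative)
import streamlit as st

view = st.sidebar.radio("View", ["Chat", "Inspector"])

if view == "Chat":
    st.title("Chat with your PDFs")
    # render chat history here and collect input with st.chat_input()
else:
    st.title("Vector Store Inspector")
    # render document counts and similarity-search results here
```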
Why a Production-Ready Backend Matters
Splitting the codebase into frontend and backend isn't just good structure - it unlocks real power:
- Async APIs by Default: Our FastAPI backend supports async endpoints. That means heavy operations like PDF uploads can later be offloaded to background task queues like Celery or RQ, keeping the app responsive.
- Plug-and-Play Model Integration: Want to add OpenAI, Cohere, or any other LLM provider? Just update the model config in settings.py - the frontend automatically reflects the new options, no need to touch UI code (see the sketch after this list).
- Independent Scalability: The backend can now be scaled separately. You could deploy it on a more powerful server or container, while keeping the Streamlit frontend lightweight.
- Extendability: You can now plug in:
  - Authentication & authorization
  - Persistent chat history
  - User sessions
  - Rate limiting
  - Admin dashboards
  - and more...
- Cleaner Logs & Traceability: Errors, API calls, and internal processing can now be logged systematically using utils/logger.py.
- Ready for Containerization: Frontend and backend can be deployed on different services and containers (e.g. Streamlit Cloud + Render, EC2, etc.).
- Frontend Agnostic: Want a more custom UI? You can now build one in React, Gradio, or even mobile - and keep using the same backend APIs.
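To make the plug-and-play point concrete, the model configuration can live entirely in config/settings.py. The providers, model names, and paths below are placeholders, not the repo's actual values:

```python
# config/settings.py - illustrative sketch of centralized configuration
import os

# Each provider maps to the models the frontend should offer (placeholder values)
SUPPORTED_MODELS = {
    "groq": ["llama-3.1-8b-instant", "llama-3.3-70b-versatile"],
    "google": ["gemini-1.5-flash", "gemini-1.5-pro"],
}

# API keys come from the environment so they never touch UI code
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# Where uploaded PDFs and the Chroma index live on disk
UPLOAD_DIR = "data/uploads"
CHROMA_DIR = "data/chroma"
```

Adding a new provider then becomes a one-line change here, and the /llm endpoints pick it up automatically.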
How the Frontend Talks to the Backend
The Streamlit frontend acts purely as a UI renderer. Every interaction routes through the FastAPI backend via well-defined HTTP endpoints:
- Model & Provider Fetching:
  - GET /llm → Fetches available providers.
  - GET /llm/{model_provider} → Fetches models for the selected provider.
- PDF Upload & Processing:
  - POST /upload_and_process_pdfs → Uploads the selected PDFs, splits them, creates embeddings, and stores them.
- Inspector Tools:
  - GET /vector_store/count/{model_provider} → Gets the number of indexed documents.
  - POST /vector_store/search → Returns the top document matches for a query.
- Chat Endpoint:
  - POST /chat → Sends the user message + model info and returns the LLM-generated response.
This setup ensures separation of responsibilities:
- Frontend: handles layout, inputs, and displaying results.
- Backend: handles logic, computation, storage, and integration with LLMs.
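Tying it together, the route layer might look roughly like the sketch below. The handler bodies are stubs and the helper imports are assumptions; the real routes.py holds the full logic:

```python
# api/routes.py - illustrative sketch of the endpoints listed above
from fastapi import APIRouter, File, UploadFile
from pydantic import BaseModel

from config.settings import SUPPORTED_MODELS  # placeholder config, as sketched earlier

router = APIRouter()


class ChatRequest(BaseModel):
    question: str
    model_provider: str
    model_name: str


@router.get("/llm")
def list_providers() -> list:
    return list(SUPPORTED_MODELS.keys())


@router.get("/llm/{model_provider}")
def list_models(model_provider: str) -> list:
    return SUPPORTED_MODELS.get(model_provider, [])


@router.post("/upload_and_process_pdfs")
async def upload_and_process_pdfs(files: list[UploadFile] = File(...)):
    # save PDFs -> split into token chunks -> embed -> upsert into Chroma
    return {"status": "processed", "files": [f.filename for f in files]}


@router.post("/chat")
def chat(request: ChatRequest):
    # build the retrieval chain for the selected provider/model and return its answer
    return {"answer": "..."}
```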
 
Why We Still Use Streamlit for the Frontend
While we've upgraded our backend, we're sticking with Streamlit for the frontend - for now. Here's why:
- Rapid Prototyping: We can build interactive UIs in minutes, not days.
- Built-in Components: Features like chat_input, expander, sidebar, and st.tabs simplify layout.
- Focus on Learning AI: We avoid the overhead of building a custom UI in React or HTML/CSS - saving our energy for improving LLM workflows.
 
Eventually, we'll likely switch to a custom-built frontend. But until then, Streamlit lets us move fast and learn faster.
Architecture Benefits at a Glance
Here's what we gain by this separation of concerns:
- Maintainability: Code is modular and easier to debug or extend.
- Scalability: Frontend and backend can grow independently.
- Developer Experience: Adding new models, chains, or workflows is no longer scary.
- Deployment Flexibility: Deploy frontend and backend to different services with ease.
- Tooling Support: Easier to add monitoring, tracing, logging, background jobs, or security layers.
 
This structure mirrors how real-world AI products are built.
Recap
Let's summarize our journey so far:
- Iteration 1: A single-file prototype with Streamlit and FAISS. Quick and dirty.
 - Iteration 2: Modularized the logic but still kept everything inside one Streamlit app.
 - Iteration 3: Split into a decoupled frontend (Streamlit) and backend (FastAPI), creating a scalable, production-leaning RAG bot.
 
From hacking things together to building an extensible, maintainable system - we're not just playing with AI anymore. We're engineering real tools.
Source Code
Version 1: Zlash65/rag-bot-basic
Version 2: Zlash65/rag-bot-chroma
Version 3: Zlash65/rag-bot-fastapi
Final thoughts
We started with a single-file prototype. Then we broke things into modules. Now we've split the app into an actual frontend and backend. And that's a huge deal.
If you've made it this far, you've not only built something functional - you've learned how real-world AI tools are structured.
Don't stop here. Keep exploring. Keep tweaking. Build weird stuff. Break things and fix them.
This is how engineers grow.
Letβs keep shipping and improving - one iteration at a time.
Happy building! π οΈ




    