📌 If you're new to this project, start with the original guide here: Building a RAG-powered PDF Chatbot - V1
📌 Follow-up guide after the first iteration of the bot: Refactoring RAG PDFBot - V2
In Version 2, we built on our Version 1 foundation by splitting everything into separate files. That was great - we cleaned up our monolithic code and gave our chatbot more structure. But let's be honest: everything still lived inside one Streamlit app. The logic for uploading files, generating answers, and even managing the vector store - all of it was handled inside Streamlit. That's fine for a prototype, but not quite production-ready.
With Version 3, we've taken a major step forward.
📦 Source Code V3: Zlash65/rag-bot-fastapi
🚀 What's New in Iteration 3?
We've split the application into a real Frontend and Backend:
- Frontend: Built using Streamlit, it handles all the UI.
- Backend: Powered by FastAPI, it takes care of PDF processing, vector storage, querying, and AI interactions.
❓ Why this split?
- Separation of Concerns: the UI doesn't need to know how the AI logic or embeddings work.
- Flexibility: Want to use Gradio or React for your UI? Now you can, without touching the backend.
- Scalability: This separation allows better logging, monitoring, and potential deployment on different servers.
👀 Here's a quick look
🧱 Our Project Structure
We now have two separate folders:
📁 client/
- Streamlit Frontend
client/
├── app.py              # Main entrypoint for Streamlit
├── components/         # Chat UI, inspector, sidebar
│   ├── chat.py
│   ├── inspector.py
│   └── sidebar.py
├── state/
│   └── session.py      # Session setup and helper functions
├── utils/
│   ├── api.py          # API calls to FastAPI server
│   ├── config.py       # API URL config
│   └── helpers.py      # High-level API abstractions
├── requirements.txt
└── README.md
- Stateless API interactions via `requests`
- UI elements handled via sidebar, chat input, and toggleable views
- Modular components for Chat, Inspector, and Uploads
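As an illustration, a `requests`-based helper in `client/utils/api.py` might look roughly like this - the payload fields and response shapes below are assumptions for the sketch, not the repo's exact contract:

```python
# client/utils/api.py - illustrative sketch, not the exact code from the repo
import requests

API_URL = "http://localhost:8000"  # in the repo this would come from utils/config.py


def upload_pdfs(files) -> dict:
    """Send the selected PDFs to the backend for chunking + embedding."""
    payload = [("files", (f.name, f.getvalue(), "application/pdf")) for f in files]
    response = requests.post(f"{API_URL}/upload_and_process_pdfs", files=payload, timeout=300)
    response.raise_for_status()
    return response.json()


def send_chat_message(message: str, model_provider: str, model_name: str) -> dict:
    """Send a user question plus model info and return the backend's response."""
    body = {"message": message, "model_provider": model_provider, "model_name": model_name}
    response = requests.post(f"{API_URL}/chat", json=body, timeout=120)
    response.raise_for_status()
    return response.json()
```

The point is that the frontend never touches LangChain, Chroma, or any LLM SDK directly - it only knows how to make HTTP calls.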
📁 server/
- FastAPI Backend
server/
├── api/
│   ├── routes.py               # API endpoints for upload, chat, models, etc.
│   └── schemas.py              # Input/output data validation with Pydantic
├── core/
│   ├── document_processor.py   # PDF handling: save, chunk, split
│   ├── llm_chain_factory.py    # LLM, embeddings, chain creation
│   └── vector_database.py      # ChromaDB handling: load, upsert, search
├── config/
│   └── settings.py             # API keys, model setup, directories
├── utils/
│   └── logger.py               # Logging for debugging and monitoring
├── main.py                     # FastAPI app setup
├── requirements.txt
└── README.md
Our backend now has full control over:
- What LLM is used
- Which model the user selects
- How PDFs are stored and processed
- What embeddings we generate
- What responses we send back
This means we can extend things easily - add another model, another embedding technique, or swap the vector store - without touching our frontend.
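To make that concrete, here's a rough sketch of how a chat endpoint in `api/routes.py` could be wired up with a Pydantic schema. The field names and the `generate_answer` helper are placeholders for illustration, not the repo's exact code:

```python
# server/api/routes.py - simplified sketch; real schema fields and helpers may differ
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class ChatRequest(BaseModel):
    message: str
    model_provider: str
    model_name: str


class ChatResponse(BaseModel):
    answer: str


def generate_answer(message: str, provider: str, model: str) -> str:
    """Placeholder for the real RAG chain call (retrieval + LLM) built in core/."""
    return f"[{provider}/{model}] would answer: {message}"


@router.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest) -> ChatResponse:
    # Validation happens automatically via the Pydantic schema above
    answer = generate_answer(request.message, request.model_provider, request.model_name)
    return ChatResponse(answer=answer)

# main.py then only needs to create the app and call app.include_router(router)
```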
🔄 What Changed from Iteration 2
Here's a quick breakdown of how we evolved:
Feature | Iteration 2 | Iteration 3 |
---|---|---|
Codebase | One Streamlit app | Separate client (UI) + server (logic) |
PDF Handling | Inside frontend | Via FastAPI API |
LLM Response | Direct from Streamlit | API-based response |
Embeddings + Vectorstore | Managed by UI | Fully controlled by backend |
Inspector | Inside sidebar (cramped) | Main UI toggle - cleaner |
Extending Models | Needed code change in UI | Plug-and-play via config |
File Validation | None | PDF size/type check in backend |
Future Extensions | Hard | Clean hooks for scaling |
UX | Basic | Toggle-based views, downloads, resets |
Text Splitting | RecursiveCharacterTextSplitter | TokenTextSplitter (LLM-aware, cleaner splits) |
✂️ Why We Switched to TokenTextSplitter
In earlier versions, we used `RecursiveCharacterTextSplitter` to chunk our documents. It works by splitting the text at "natural" breakpoints - paragraphs, then sentences, then words, then characters - to get close to the target chunk size in characters.
But here's the problem: LLMs like GPT, Claude, or Gemini don't read text in characters - they read tokens. A token is roughly 3-4 characters, or about 0.75 words. That means a 1000-character chunk might be around 250 tokens, or noticeably more if the text tokenizes poorly. It's unpredictable.
To fix this, we now use `TokenTextSplitter`, which splits based on actual token counts, giving precise control over chunk size and overlap. This leads to more reliable inputs and avoids going over model limits.
🔬 Simple Example
Let's take this sentence:
"LangChain helps developers build applications with LLMs more efficiently."
That's around 75 characters but only about 12 tokens.
RecursiveCharacterTextSplitter
- With `RecursiveCharacterTextSplitter(chunk_size=30)`, we might get:
Chunk 1: "LangChain helps developers "
Chunk 2: "build applications with LLMs "
Chunk 3: "more efficiently."
Visually clean, but token count varies and could overflow model limits.
⚠️ Notice how the bot was not able to give a correct response to our question because of improper chunking.
TokenTextSplitter
- With `TokenTextSplitter(chunk_size=10, chunk_overlap=2)`, we get chunks like:
Chunk 1: the first 10 tokens - roughly "LangChain helps developers build applications with LLMs more"
Chunk 2: the remaining tokens, with the last 2 tokens of Chunk 1 repeated as overlap
Each chunk is capped at 10 tokens, making its size predictable and LLM-friendly.
✅ We get more accurate responses when splitting chunks by token
By using `TokenTextSplitter`, we gain better control, consistency, and contextual accuracy - making our RAG pipeline more reliable.
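If you want to see the difference yourself, a minimal comparison looks something like this - assuming a recent LangChain setup where the splitters live in the `langchain-text-splitters` package (older versions import them from `langchain.text_splitter`) and `tiktoken` is installed:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter, TokenTextSplitter

text = "LangChain helps developers build applications with LLMs more efficiently."

# Character-based: chunk_size is measured in characters
char_splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=0)
print(char_splitter.split_text(text))

# Token-based: chunk_size is measured in tokens (uses tiktoken under the hood)
token_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=2)
print(token_splitter.split_text(text))
```

Running both makes it easy to see that only the token-based splitter guarantees chunks that fit a token budget.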
✨ Better User Experience
Previously, the inspector tool was a bit hidden. We crammed it into the sidebar and showed the results there too - not the best experience.
In this iteration, we made it visible in the main chat area. There's a toggle in the sidebar where we can switch between:
- 💬 Chat View
- 🔍 Inspector View
Now, we get full-width, readable responses whether we're chatting with PDFs or inspecting our vectorstore. It's simple and intuitive.
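As a rough illustration (the actual layout lives in the `components/` modules and may differ), the toggle can be as simple as a sidebar radio that decides which view to render:

```python
# client/app.py - illustrative sketch of the view toggle
import streamlit as st

view = st.sidebar.radio("View", ["Chat", "Inspector"])

if view == "Chat":
    st.header("Chat with your PDFs")
    # render the chat history and st.chat_input here
else:
    st.header("Vector store inspector")
    # render document counts and similarity-search results here
```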
⚡ Why a Production-Ready Backend Matters
Splitting the codebase into frontend and backend isn't just good structure - it unlocks real power:
- Async APIs by Default: Our FastAPI backend supports async endpoints. That means heavy operations like PDF uploads can later be offloaded to background task queues like Celery or RQ, keeping the app responsive.
- Plug-and-Play Model Integration: Want to add OpenAI, Cohere, or any other LLM provider? Just update the model config in `settings.py` - the frontend automatically reflects the new options, no need to touch UI code (see the sketch after this list).
- Independent Scalability: The backend can now be scaled separately. You could deploy it on a more powerful server or container, while keeping the Streamlit frontend lightweight.
- Extendability: You can now plug in:
  - Authentication & authorization
  - Persistent chat history
  - User sessions
  - Rate limiting
  - Admin dashboards
  - and more...
- Cleaner Logs & Traceability: Errors, API calls, and internal processing can now be logged systematically using `utils/logger.py`.
- Ready for Containerization: Frontend and backend can be deployed on different services and containers (e.g. Streamlit Cloud + Render, EC2, etc.).
- Frontend Agnostic: Want a more custom UI? You can now build one in React, Gradio, or even mobile - and keep using the same backend APIs.
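For instance, the plug-and-play model config could be as simple as a dictionary in `config/settings.py`. The provider and model names below are purely illustrative, not taken from the repo:

```python
# server/config/settings.py - hypothetical provider/model registry
import os

SUPPORTED_MODELS = {
    "groq": ["llama-3.1-8b-instant", "llama-3.3-70b-versatile"],
    "google": ["gemini-1.5-flash", "gemini-1.5-pro"],
    # Adding a provider is just another entry here (plus its API key below);
    # if the /llm endpoints expose this dict, the frontend picks it up automatically.
}

API_KEYS = {
    "groq": os.getenv("GROQ_API_KEY"),
    "google": os.getenv("GOOGLE_API_KEY"),
}
```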
🔗 How the Frontend Talks to the Backend
The Streamlit frontend acts purely as a UI renderer. Every interaction routes through the FastAPI backend via well-defined HTTP endpoints:
- Model & Provider Fetching:
  - `GET /llm` → Fetches available providers.
  - `GET /llm/{model_provider}` → Fetches models for the selected provider.
- PDF Upload & Processing:
  - `POST /upload_and_process_pdfs` → Uploads selected PDFs, splits them, creates embeddings, and stores them.
- Inspector Tools:
  - `GET /vector_store/count/{model_provider}` → Gets the number of indexed documents.
  - `POST /vector_store/search` → Returns top document matches for a query.
- Chat Endpoint:
  - `POST /chat` → Sends the user message + model info and returns the LLM-generated response.
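To show how the frontend consumes these endpoints, the sidebar could populate its model dropdowns with something like this - a sketch only; in the repo these calls are wrapped in `utils/api.py` and `utils/helpers.py`, and the exact response shapes may differ:

```python
# client/components/sidebar.py - illustrative sketch, assumes /llm returns JSON lists
import requests
import streamlit as st

API_URL = "http://localhost:8000"

# Ask the backend which providers and models are available
providers = requests.get(f"{API_URL}/llm", timeout=30).json()
provider = st.sidebar.selectbox("Provider", providers)

models = requests.get(f"{API_URL}/llm/{provider}", timeout=30).json()
model = st.sidebar.selectbox("Model", models)
```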
This setup ensures separation of responsibilities:
- Frontend: handles layout, inputs, and displaying results.
- Backend: handles logic, computation, storage, and integration with LLMs.
🖥️ Why We Still Use Streamlit for the Frontend
While we've upgraded our backend, we're sticking with Streamlit for the frontend - for now. Here's why:
- 🧱 Rapid Prototyping: We can build interactive UIs in minutes, not days.
- 💬 Built-in Components: Features like `chat_input`, `expander`, `sidebar`, and `st.tabs` simplify layout.
- 🧪 Focus on Learning AI: We avoid the overhead of building a custom UI in React or HTML/CSS - saving our energy for improving LLM workflows.
Eventually, we'll likely switch to a custom-built frontend. But until then, Streamlit lets us move fast and learn faster.
📦 Architecture Benefits at a Glance
Here's what we gain by this separation of concerns:
- ✅ Maintainability: Code is modular and easier to debug or extend.
- ✅ Scalability: Frontend and backend can grow independently.
- ✅ Developer Experience: No fear of adding new models, chains, or workflows.
- ✅ Deployment Flexibility: Deploy frontend and backend to different services with ease.
- ✅ Tooling Support: Easier to add monitoring, tracing, logging, background jobs, or security layers.
This structure mirrors how real-world AI products are built.
📝 Recap
Let's summarize our journey so far:
- Iteration 1: A single-file prototype with Streamlit and FAISS. Quick and dirty.
- Iteration 2: Modularized the logic but still kept everything inside one Streamlit app.
- Iteration 3: Split into a decoupled frontend (Streamlit) and backend (FastAPI), creating a scalable, production-leaning RAG bot.
From hacking things together to building an extensible, maintainable system - we're not just playing with AI anymore. We're engineering real tools.
📦 Source Code
Version 1: Zlash65/rag-bot-basic
Version 2: Zlash65/rag-bot-chroma
Version 3: Zlash65/rag-bot-fastapi
💭 Final thoughts
We started with a single-file prototype. Then we broke things into modules. Now, we've split the app into an actual frontend and backend. And that's a huge deal.
If you've made it this far, you've not only built something functional - you've learned how real-world AI tools are structured.
Don't stop here. Keep exploring. Keep tweaking. Build weird stuff. Break things and fix them.
This is how engineers grow.
Let's keep shipping and improving - one iteration at a time.
Happy building! π οΈ