If you're new to this project, start with the original guide here: Building a RAG-powered PDF Chatbot - V1
Follow-up guide after the first iteration of the bot: Refactoring RAG PDFBot - V2
In Version 2, we built on our Version 1 foundation by splitting everything into separate files. That was great - we cleaned up our monolithic code and gave our chatbot more structure. But let's be honest: everything still lived inside one Streamlit app. The logic for uploading files, generating answers, and even managing the vector store - all of it was handled inside Streamlit. That's fine for a prototype, but not quite production-ready.
With Version 3, we've taken a major step forward.
Source Code V3: Zlash65/rag-bot-fastapi
What's New in Iteration 3?
We've split the application into a real Frontend and Backend:
- Frontend: Built using Streamlit, it handles all the UI.
- Backend: Powered by FastAPI, it takes care of PDF processing, vector storage, querying, and AI interactions.
 
Why this split?
- Separation of Concerns: The UI doesn't need to know how the AI logic or embeddings work.
- Flexibility: Want to use Gradio or React for your UI? Now you can, without touching the backend.
- Scalability: This separation allows better logging, monitoring, and potential deployment on different servers.
 
Here's a quick look.
Our Project Structure
We now have two separate folders:
  
  
client/ - Streamlit Frontend
client/
├── app.py                      # Main entrypoint for Streamlit
├── components/                 # Chat UI, inspector, sidebar
│   ├── chat.py
│   ├── inspector.py
│   └── sidebar.py
├── state/
│   └── session.py              # Session setup and helper functions
├── utils/
│   ├── api.py                  # API calls to FastAPI server
│   ├── config.py               # API URL config
│   └── helpers.py              # High-level API abstractions
├── requirements.txt
└── README.md
- Stateless API interactions via requests
- UI elements handled via sidebar, chat input, and toggleable views
- Modular components for Chat, Inspector, and Uploads
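To give a feel for what that stateless API layer looks like in practice, here is a rough sketch of the kind of helpers utils/api.py could contain. The function names and payload shapes below are illustrative assumptions, not the repo's exact code:

```python
# utils/api.py - illustrative sketch of the client-side HTTP layer
# (function names and payload shapes are assumptions, not the repo's exact code)
import requests

API_URL = "http://localhost:8000"  # would normally be read from utils/config.py


def get_providers() -> list:
    """Ask the backend which LLM providers are available."""
    response = requests.get(f"{API_URL}/llm", timeout=30)
    response.raise_for_status()
    return response.json()


def upload_pdfs(uploaded_files) -> dict:
    """Send the user's PDFs to the backend for chunking, embedding, and storage."""
    files = [
        ("files", (f.name, f.getvalue(), "application/pdf"))
        for f in uploaded_files  # Streamlit UploadedFile objects
    ]
    response = requests.post(f"{API_URL}/upload_and_process_pdfs", files=files, timeout=120)
    response.raise_for_status()
    return response.json()
```

Because every helper just wraps a plain HTTP call, the Streamlit app never has to import LangChain, Chroma, or any model SDK.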
 
  
  
server/ - FastAPI Backend
server/
├── api/
│   ├── routes.py               # API endpoints for upload, chat, models etc.
│   └── schemas.py              # Input/output data validation with Pydantic
├── core/
│   ├── document_processor.py   # PDF handling: save, chunk, split
│   ├── llm_chain_factory.py    # LLM, embeddings, chain creation
│   └── vector_database.py      # ChromaDB handling: load, upsert, search
├── config/
│   └── settings.py             # API keys, model setup, directories
├── utils/
│   └── logger.py               # Logging for debugging and monitoring
├── main.py                     # FastAPI app setup
├── requirements.txt
└── README.md
Our backend now has full control over:
- Which LLM provider is used
- Which model the user selects
- How PDFs are stored and processed
- Which embeddings we generate
- What responses we send back
This means we can extend things easily - add another model, another embedding technique, or swap the vector store - without touching our frontend.
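The entry point itself stays tiny. A stripped-down main.py might look something like this - a sketch that assumes api/routes.py exposes an APIRouter, so details may differ from the actual repo:

```python
# main.py - minimal FastAPI app setup (illustrative sketch)
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from api.routes import router  # assumes routes.py exposes an APIRouter

app = FastAPI(title="RAG PDF Bot API")

# Let the Streamlit client (served from a different origin) call these endpoints
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(router)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```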
What Changed from Iteration 2
Here's a quick breakdown of how we evolved:
| Feature | Iteration 2 | Iteration 3 | 
|---|---|---|
| Codebase | One Streamlit app | Separate client (UI) + server (logic) | 
| PDF Handling | Inside frontend | Via FastAPI API | 
| LLM Response | Direct from Streamlit | API-based response | 
| Embeddings + Vectorstore | Managed by UI | Fully controlled by backend | 
| Inspector | Inside sidebar (cramped) | Main UI toggle - cleaner | 
| Extending Models | Needed code change in UI | Plug-and-play via config | 
| File Validation | None | PDF size/type check in backend | 
| Future Extensions | Hard | Clean hooks for scaling | 
| UX | Basic | Toggle-based views, downloads, resets | 
| Text Splitting | RecursiveCharacterTextSplitter | TokenTextSplitter (LLM-aware, cleaner splits) | 
Why We Switched to TokenTextSplitter
In earlier versions, we used RecursiveCharacterTextSplitter to chunk our documents. It works by splitting the text at "natural" breakpoints - like paragraphs, then sentences, then words, then characters - to get close to the target chunk size in characters.
But here's the problem: LLMs like GPT, Claude, or Gemini don't read text in characters - they read tokens. A token is roughly 3-4 characters, or about 0.75 words. That means a 1,000-character chunk might come out to 250 tokens, or 350, or more, depending on the text - it's unpredictable.
To fix this, we now use TokenTextSplitter, which splits based on actual token counts, giving precise control over chunk size and overlap. This leads to more reliable inputs and avoids going over model limits.
Simple Example
Let's take this sentence:
"LangChain helps developers build applications with LLMs more efficiently."
That's about 75 characters but only around 12 tokens.
RecursiveCharacterTextSplitter
- With RecursiveCharacterTextSplitter(chunk_size=30), we might get:
Chunk 1: "LangChain helps developers "
Chunk 2: "build applications with LLMs "
Chunk 3: "more efficiently."
Visually clean, but token count varies and could overflow model limits.
Notice how the bot was not able to give a correct response to our question because of improper chunking.
TokenTextSplitter
- With TokenTextSplitter(chunk_size=10, chunk_overlap=2), we get chunks like:
Chunk 1: ["Lang", "Chain", "helps", ..., "LLMs", "more"]
Chunk 2: ["LLMs", "more", "efficiently", "."]
Each chunk is capped at 10 tokens (with the 2-token overlap carried over), making chunk sizes predictable and LLM-friendly.
We get more accurate responses when splitting chunks by token.
By using TokenTextSplitter, we gain better control, consistency, and contextual accuracy - making our RAG pipeline more reliable.
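If you want to see the difference yourself, a small comparison script along these lines works. The import path below assumes a recent LangChain release where the splitters live in langchain_text_splitters (older versions import them from langchain.text_splitter), and TokenTextSplitter needs tiktoken installed:

```python
# Compare character-based vs token-based chunking on the example sentence
from langchain_text_splitters import RecursiveCharacterTextSplitter, TokenTextSplitter

text = "LangChain helps developers build applications with LLMs more efficiently."

# Character-based: chunk_size counts characters, so the token count per chunk varies
char_splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=0)
print(char_splitter.split_text(text))

# Token-based: chunk_size counts tokens (tiktoken under the hood),
# so each chunk maps directly onto the model's context budget
token_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=2)
print(token_splitter.split_text(text))
```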
Better User Experience
Previously, the inspector tool was a bit hidden. We crammed it into the sidebar and showed the results there too - not the best experience.
In this iteration, we made it visible in the main chat area. There's a toggle in the sidebar where we can switch between:
- Chat View
- Inspector View
 
Now, we get full-width, readable responses whether we're chatting with PDFs or inspecting our vectorstore. It's simple and intuitive.
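The toggle itself is only a few lines of Streamlit. A simplified version of the idea (the widget and labels here are illustrative, not the exact sidebar.py code) might look like:

```python
# components/sidebar.py - simplified view toggle (illustrative)
import streamlit as st

view = st.sidebar.radio("View", ["Chat", "Inspector"])

if view == "Chat":
    st.title("Chat with your PDFs")
    # render chat history here and collect input with st.chat_input()
else:
    st.title("Vector Store Inspector")
    # render document counts and similarity-search results here
```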
Why a Production-Ready Backend Matters
Splitting the codebase into frontend and backend isn't just good structure - it unlocks real power:
- Async APIs by Default: Our FastAPI backend supports async endpoints. That means heavy operations like PDF uploads can later be offloaded to background task queues like Celery or RQ, keeping the app responsive.
- Plug-and-Play Model Integration: Want to add OpenAI, Cohere, or any other LLM provider? Just update the model config in settings.py - the frontend automatically reflects the new options, no need to touch UI code (see the sketch after this list).
- Independent Scalability: The backend can now be scaled separately. You could deploy it on a more powerful server or container, while keeping the Streamlit frontend lightweight.
- Extendability: You can now plug in:
  - Authentication & authorization
  - Persistent chat history
  - User sessions
  - Rate limiting
  - Admin dashboards
  - and more...
- Cleaner Logs & Traceability: Errors, API calls, and internal processing can now be logged systematically using utils/logger.py.
- Ready for Containerization: Frontend and backend can be deployed on different services and containers (e.g. Streamlit Cloud + Render, EC2, etc.).
- Frontend Agnostic: Want a more custom UI? You can now build one in React, Gradio, or even mobile - and keep using the same backend APIs.
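To make the plug-and-play point concrete, the model configuration can live entirely in config/settings.py. The providers, model names, and paths below are placeholders, not the repo's actual values:

```python
# config/settings.py - illustrative sketch of centralized configuration
import os

# Each provider maps to the models the frontend should offer (placeholder values)
SUPPORTED_MODELS = {
    "groq": ["llama-3.1-8b-instant", "llama-3.3-70b-versatile"],
    "google": ["gemini-1.5-flash", "gemini-1.5-pro"],
}

# API keys come from the environment so they never touch UI code
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# Where uploaded PDFs and the Chroma index live on disk
UPLOAD_DIR = "data/uploads"
CHROMA_DIR = "data/chroma"
```

Adding a new provider then becomes a one-line change here, and the /llm endpoints pick it up automatically.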
How the Frontend Talks to the Backend
The Streamlit frontend acts purely as a UI renderer. Every interaction routes through the FastAPI backend via well-defined HTTP endpoints:
- Model & Provider Fetching:
  - GET /llm → Fetches available providers.
  - GET /llm/{model_provider} → Fetches models for the selected provider.
- PDF Upload & Processing:
  - POST /upload_and_process_pdfs → Uploads the selected PDFs, splits them, creates embeddings, and stores them.
- Inspector Tools:
  - GET /vector_store/count/{model_provider} → Gets the number of indexed documents.
  - POST /vector_store/search → Returns the top document matches for a query.
- Chat Endpoint:
  - POST /chat → Sends the user message + model info and returns the LLM-generated response.
This setup ensures separation of responsibilities:
- Frontend: handles layout, inputs, and displaying results.
- Backend: handles logic, computation, storage, and integration with LLMs.
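Tying it together, the route layer might look roughly like the sketch below. The handler bodies are stubs and the helper imports are assumptions; the real routes.py holds the full logic:

```python
# api/routes.py - illustrative sketch of the endpoints listed above
from fastapi import APIRouter, File, UploadFile
from pydantic import BaseModel

from config.settings import SUPPORTED_MODELS  # placeholder config, as sketched earlier

router = APIRouter()


class ChatRequest(BaseModel):
    question: str
    model_provider: str
    model_name: str


@router.get("/llm")
def list_providers() -> list:
    return list(SUPPORTED_MODELS.keys())


@router.get("/llm/{model_provider}")
def list_models(model_provider: str) -> list:
    return SUPPORTED_MODELS.get(model_provider, [])


@router.post("/upload_and_process_pdfs")
async def upload_and_process_pdfs(files: list[UploadFile] = File(...)):
    # save PDFs -> split into token chunks -> embed -> upsert into Chroma
    return {"status": "processed", "files": [f.filename for f in files]}


@router.post("/chat")
def chat(request: ChatRequest):
    # build the retrieval chain for the selected provider/model and return its answer
    return {"answer": "..."}
```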
 
Why We Still Use Streamlit for the Frontend
While we've upgraded our backend, we're sticking with Streamlit for the frontend - for now. Here's why:
- Rapid Prototyping: We can build interactive UIs in minutes, not days.
- Built-in Components: Features like chat_input, expander, sidebar, and st.tabs simplify layout.
- Focus on Learning AI: We avoid the overhead of building a custom UI in React or HTML/CSS - saving our energy for improving LLM workflows.
 
Eventually, we'll likely switch to a custom-built frontend. But until then, Streamlit lets us move fast and learn faster.
Architecture Benefits at a Glance
Here's what we gain by this separation of concerns:
- Maintainability: Code is modular and easier to debug or extend.
- Scalability: Frontend and backend can grow independently.
- Developer Experience: Adding new models, chains, or workflows is no longer scary.
- Deployment Flexibility: Deploy frontend and backend to different services with ease.
- Tooling Support: Easier to add monitoring, tracing, logging, background jobs, or security layers.
 
This structure mirrors how real-world AI products are built.
Recap
Let's summarize our journey so far:
- Iteration 1: A single-file prototype with Streamlit and FAISS. Quick and dirty.
 - Iteration 2: Modularized the logic but still kept everything inside one Streamlit app.
 - Iteration 3: Split into a decoupled frontend (Streamlit) and backend (FastAPI), creating a scalable, production-leaning RAG bot.
 
From hacking things together to building an extensible, maintainable system - we're not just playing with AI anymore. We're engineering real tools.
Source Code
Version 1: Zlash65/rag-bot-basic
Version 2: Zlash65/rag-bot-chroma
Version 3: Zlash65/rag-bot-fastapi
Final thoughts
We started with a single-file prototype. Then we broke things into modules. Now we've split the app into an actual frontend and backend. And that's a huge deal.
If you've made it this far, you've not only built something functional - you've learned how real-world AI tools are structured.
Don't stop here. Keep exploring. Keep tweaking. Build weird stuff. Break things and fix them.
This is how engineers grow.
Letβs keep shipping and improving - one iteration at a time.
Happy building! π οΈ




    