DEV Community

Swapnil Take

Building a Retrieval-Augmented Generation (RAG) System with LangChain, ChromaDB, and Local LLMs

The Problem: The "Documentation Drain"

We’ve all been there: you need a specific piece of SQL syntax or a complex join-optimization strategy, and you're stuck searching a 200-page PDF. General-purpose AI models like ChatGPT are great, but they don't know the specifics of your project's internal documentation.

The goal was to build a system that:

  • Reads the entire PDF.
  • Indexes it for instant retrieval.
  • Answers complex queries using a local model for privacy and speed.

The Tech Stack (2026 Edition)

To keep the project modern and efficient, I used a modular stack:

  • Language: Python 3.12+, managed by uv (a very fast package manager).
  • Orchestration: LangChain and langchain-classic for the RAG pipeline.
  • Vector database: ChromaDB for persistent, local storage.
  • Models: Google Gemini 2.5 Flash (for heavy lifting) and Qwen 3 0.6B (F16), running locally via Docker.
  • Frontend: Streamlit for a clean, browser-based chat interface.

Implementation: Step-by-Step

1. Data Ingestion & Chunking
A 200-page PDF is too large for an LLM to "read" all at once. We broke the document into smaller, overlapping chunks using the RecursiveCharacterTextSplitter. This ensures that context (like a SQL statement that spans two pages) isn't cut in half.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

raw_documents = PyPDFLoader("manual.pdf").load()  # substitute your PDF's path
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
chunks = text_splitter.split_documents(raw_documents)
```

2. The Search Engine (Vectorization)

We converted these text chunks into mathematical vectors using Google’s Gemini Embeddings. These vectors are stored in a local ChromaDB folder. The beauty of this is persistence: once you index the document, you never have to do it again. You can simply "load" the database in milliseconds.
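The build-once, load-forever logic can be sketched roughly like this. The embedding model name, the folder name, and the helper functions are illustrative assumptions, not the project's exact code:

```python
import os

PERSIST_DIR = "chroma_db"  # hypothetical folder name for the local index

def index_exists(persist_dir: str = PERSIST_DIR) -> bool:
    """True once ChromaDB has already persisted an index, so we can skip re-embedding."""
    return os.path.isdir(persist_dir) and len(os.listdir(persist_dir)) > 0

def get_vector_store(chunks=None, persist_dir: str = PERSIST_DIR):
    # Imports live inside the function so this sketch loads without the packages installed.
    from langchain_chroma import Chroma
    from langchain_google_genai import GoogleGenerativeAIEmbeddings

    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")  # model name is an assumption
    if index_exists(persist_dir):
        # This is where persistence pays off: load the existing index in milliseconds.
        return Chroma(persist_directory=persist_dir, embedding_function=embeddings)
    # First run: embed every chunk once and write the index to disk.
    return Chroma.from_documents(chunks, embeddings, persist_directory=persist_dir)
```

Retrieval is then a one-liner along the lines of `get_vector_store().as_retriever().invoke(question)`.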

3. Going Local: Docker + Qwen 3

For maximum privacy and zero API costs, I integrated a local model. Using Docker Model Runner, I deployed Qwen 3 (0.6B). Even at a small parameter size, it handles SQL generation and technical explanations with impressive accuracy when grounded by the retrieved context.
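Docker Model Runner exposes an OpenAI-compatible API, so LangChain can talk to the local Qwen model through the standard OpenAI client. The port, path, and model tag below are assumptions based on Model Runner's defaults; check them against your own setup:

```python
# Default host endpoint for Docker Model Runner's OpenAI-compatible API
# (requires host TCP access enabled in Docker Desktop; the port may differ).
LOCAL_BASE_URL = "http://localhost:12434/engines/v1"

def local_llm_config(model: str = "ai/qwen3:0.6B-F16") -> dict:
    """Connection settings for the local model; no real API key is needed."""
    return {"base_url": LOCAL_BASE_URL, "api_key": "not-needed", "model": model}

def get_local_llm():
    # Lazy import so this sketch loads even without langchain-openai installed.
    from langchain_openai import ChatOpenAI
    return ChatOpenAI(**local_llm_config())
```

Swapping between the cloud model and the local one then comes down to which chat model object you hand to the chain.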

4. The User Interface

While Jupyter Notebooks are great for development, they aren't for "users." I wrapped the entire logic in a Streamlit application. This converted the Python code into a professional web interface with a chat history, loading spinners, and source attribution (telling the user exactly which page of the PDF the answer came from).
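A minimal version of that Streamlit wrapper could look like the sketch below. `rag_chain` is a placeholder for whatever function runs the retrieval pipeline and returns an answer plus the retrieved chunks; it and the page-attribution format are my own assumptions:

```python
def format_sources(docs) -> str:
    """Build the source-attribution line; PyPDFLoader stores the page number in metadata."""
    pages = sorted({d.metadata["page"] for d in docs if "page" in d.metadata})
    return "Sources: " + ", ".join(f"page {p}" for p in pages)

def main():
    import streamlit as st  # lazy import: only needed when the app actually runs

    st.title("SQL Manual Assistant")
    if "history" not in st.session_state:
        st.session_state.history = []
    for role, text in st.session_state.history:
        st.chat_message(role).write(text)

    if question := st.chat_input("Ask the manual..."):
        st.chat_message("user").write(question)
        with st.spinner("Retrieving..."):
            answer, docs = rag_chain(question)  # hypothetical pipeline entry point
        reply = f"{answer}\n\n{format_sources(docs)}"
        st.chat_message("assistant").write(reply)
        st.session_state.history += [("user", question), ("assistant", reply)]
```

Run it with `streamlit run app.py` and Streamlit handles the chat layout, spinners, and session state for free.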

Key Challenges Overcome

Rate Limiting: When embedding 200 pages for the first time, I hit Google's "Resource Exhausted" errors. I solved this by implementing Exponential Backoff logic—the script now automatically waits and retries if the API gets overwhelmed.
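The retry logic itself is only a few lines. This is a generic sketch of the pattern (the helper name and defaults are mine); it wraps any flaky call, e.g. `with_backoff(lambda: embeddings.embed_documents(batch))`:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retriable=(Exception,)):
    """Run fn(), retrying with exponentially growing waits when it raises."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retriable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            # Jitter spreads out retries so parallel workers don't hit the API in lockstep.
            time.sleep(delay + random.uniform(0, delay))
```

In production you would narrow `retriable` to the specific "Resource Exhausted" exception class rather than catching everything.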

Library Shifts: In 2026, LangChain moved to a modular structure. I adapted by using langchain-classic to keep the robust retrieval chains working while staying compatible with the latest updates.

The Result

What started as a static PDF is now a dynamic SQL expert. I can ask, "How do I create a customer table with specific DB2 constraints?" and within seconds, the bot retrieves the correct syntax from the manual and formats the code for me.

For any software engineer working with heavy documentation, this RAG setup is a game-changer. It moves us from "searching for info" to "acting on info."

YouTube
GitHub
