Niyati Gupta
Stop Paying for OpenAI: Build Your Own Local RAG Pipeline in Python- a PDF Chatbot

STOP wasting money on OpenAI API credits. As we head into 2026, your local machine is more than capable of handling your data for free.

I’ve developed a PDF Chatbot that leverages HuggingFace and LangChain to deliver a premium AI experience with $0 overhead. No hidden fees, no API keys—just pure, local RAG (Retrieval-Augmented Generation).

To understand how the chatbot functions, here is a high-level overview of the logic flow:

Architecture

In essence, the system follows a streamlined RAG (Retrieval-Augmented Generation) pipeline:

PDF Upload → Text Chunking → Vector Embedding → FAISS Storage → User Query → Context Retrieval → AI Response

Ready to take control of your data? Let's dive into the build. Here is the step-by-step guide.

Step 1: Environment Setup

Before we begin coding, ensure you have the following prerequisites installed on your system:

  • Python 3.9+: The core programming language.
  • PyCharm (or your preferred IDE): For writing and managing the project.

Step 2: Library Installation

Open your PyCharm terminal and run the following commands to install the necessary dependencies. We will be using Streamlit for the UI, PyPDF2 for PDF processing, and LangChain to orchestrate our AI workflow.

Bash

pip install streamlit pypdf2 
pip install langchain langchain-text-splitters langchain-community
pip install sentence-transformers faiss-cpu transformers

Note: Since we are building an all-local solution, we also install sentence-transformers for our embeddings and faiss-cpu for our vector database.

Step 3: Initializing the UI with Streamlit

First, we import the Streamlit library. Streamlit is a powerful framework that allows developers to turn Python scripts into interactive web applications with minimal effort.

Python

import streamlit as st

# Configure the web application
st.set_page_config(page_title="PDF Chatbot", layout="wide")
st.header("PDF Chatbot")

Heading

What is happening here?

  • import streamlit as st: We import the library with the alias st to keep our code concise.

  • st.set_page_config(...): This is a global configuration for your app.

  • page_title: Sets the name that appears on the browser tab.

  • layout="wide": By default, Streamlit centers content in a narrow column. Setting this to "wide" allows us to utilize the full width of the screen, which is better for chatbots.

  • st.header(...): This creates a large, bold title at the top of your webpage to greet the user.

Step 4: Building the Sidebar and File Uploader

To keep the interface clean, we will place the document controls in a sidebar. This allows the user to upload files without cluttering the main chat area.

Python

from PyPDF2 import PdfReader

with st.sidebar:
    st.title("Upload Document")
    file = st.file_uploader(
        "Upload a PDF and ask questions",
        type=["pdf"]
    )

Sidebar

Breakdown of the Logic:

  • with st.sidebar:: This context manager tells Streamlit to place all the indented code inside the left-hand sidebar.

  • st.file_uploader(...): This creates a functional drag-and-drop zone.

  • type=["pdf"]: This is a security and usability filter. It restricts the user to selecting only PDF files, preventing errors from unsupported formats.

  • The file variable: Once a user uploads a document, the binary data is stored in the file variable. If no file is uploaded, this variable remains None.

Step 5: Extracting Text from the PDF

Now that we have the file, we need to read its contents. Since a PDF is a complex file format, we use the PdfReader class from the PyPDF2 library to loop through the pages and extract the raw text.

Python

# Initialize an empty string to store the document text
text = ""

if file is not None:
    pdf_reader = PdfReader(file)

    # Loop through each page in the document
    for page in pdf_reader.pages:
        page_text = page.extract_text()

        # Append the extracted text to our main string
        if page_text:
            text += page_text

How this works:

  • The if file is not None check: This is a safety guard. It ensures the script doesn't crash if the user hasn't uploaded a file yet.
  • PdfReader(file): This initializes the reader object, which acts as a map of the entire PDF structure.
  • The for loop: We iterate through pdf_reader.pages to ensure we capture the content of every single page.
  • extract_text(): This method attempts to find all characters on the page and convert them into a standard Python string.
  • The if page_text: condition: Sometimes pages contain only images or are empty. This check ensures we only add actual text to our final text variable.

Step 6: Document Chunking with LangChain

Large Language Models have a "context window," meaning they can only read a certain amount of text at one time. To handle large PDFs, we must break the text into smaller, manageable pieces called chunks.

For this, we use the RecursiveCharacterTextSplitter.

Python

from langchain_text_splitters import RecursiveCharacterTextSplitter

if text:
    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150,
        length_function=len
    )

    # Split the unified text into a list of strings
    chunks = text_splitter.split_text(text)

Why these specific settings?

  • chunk_size=1000: This sets the maximum number of characters each "piece" of text will contain. It’s small enough for the AI to process quickly but large enough to keep the meaning intact.
  • chunk_overlap=150: This is a crucial setting! By overlapping the chunks, we ensure that a sentence cut in half at the end of Chunk A is fully captured at the start of Chunk B. This prevents the AI from losing the context between sections (see the short sketch after this list).
  • length_function=len: This tells LangChain to use Python's built-in len() function, meaning we measure chunk size by character count, so "Hello World" counts as 11 characters. Since different AI models have different limits, knowing exactly how your chunks are measured ensures you don't accidentally send a piece of text that is too large for the model to handle.
  • RecursiveCharacterTextSplitter: Unlike a simple split, this "intelligent" splitter tries to break text at natural points like paragraphs and sentences first, making the chunks much more readable for the AI.
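
To see what the overlap actually does, here is a tiny, self-contained sketch. The small chunk_size and the sample sentence are made up purely for illustration; the real app uses the settings shown above.

Python

from langchain_text_splitters import RecursiveCharacterTextSplitter

sample = "LangChain splits long documents into chunks. Overlap keeps neighbouring chunks connected."

# Deliberately tiny sizes so the overlap is easy to see
demo_splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=10,
    length_function=len
)

for i, chunk in enumerate(demo_splitter.split_text(sample)):
    print(i, repr(chunk))

# Typically, the last few words of one chunk reappear at the
# start of the next, so no sentence is lost between chunks.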

Step 7: Generating Vector Embeddings

Computers cannot "read" text the way humans do; they process numbers. We use Embeddings to convert our text chunks into long lists of numbers (vectors). These numbers represent the semantic meaning of the text.

For this project, we are using a high-efficiency, open-source model from HuggingFace: all-MiniLM-L6-v2.

Python

from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize the Embedding Model (100% Free & Local)
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

Why "all-MiniLM-L6-v2"?

  • Efficiency: This is a "mini" model, meaning it is incredibly fast and designed to run on a standard CPU. You don't need an expensive GPU to get great results.
  • Accuracy: Despite its small size, it is excellent at mapping similar concepts together in a mathematical space.
  • Local Execution: Unlike OpenAI's text-embedding-ada-002, this model is downloaded to your machine. This means your data never leaves your computer, and you never pay a cent for API calls.

💡 Visualizing the "Numbers" (The Example)
Here is a simple way to picture it: if you have two chunks—one about "Apples" and one about "Oranges"—the model converts them into lists of numbers like this:

  • Chunk 1 (Apples): [0.12, -0.59, 0.88, ...]
  • Chunk 2 (Oranges): [0.15, -0.55, 0.82, ...]

Because the numbers are very similar, the computer "understands" that apples and oranges are related (they are both fruits), even if the words are spelled differently. This is how our chatbot finds the right information later!
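
If you want to see those numbers for yourself, here is a minimal sketch that reuses the embeddings object from the snippet above (the sample sentences are just placeholders):

Python

# Assumes the `embeddings` object created in the previous snippet
vec_apples = embeddings.embed_query("Apples are a sweet fruit.")
vec_oranges = embeddings.embed_query("Oranges are a citrus fruit.")

print(len(vec_apples))   # 384 numbers for all-MiniLM-L6-v2
print(vec_apples[:5])    # first few values of the vector

# A rough similarity check: the dot product of two vectors
# tends to be larger when their meanings are closer.
similarity = sum(a * b for a, b in zip(vec_apples, vec_oranges))
print(similarity)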

Step 8: Storing Data in the FAISS Vector Database

Now that we have our text chunks and their corresponding "number versions" (embeddings), we need a way to store them so we can search them instantly. We use FAISS (Facebook AI Similarity Search), a high-performance vector database.

Python

from langchain_community.vectorstores import FAISS

# Store embeddings in a searchable FAISS index
vector_store = FAISS.from_texts(chunks, embeddings)
st.success("Document indexed successfully!")

Why use a Vector Database like FAISS?
Traditional databases search for exact words. A vector database like FAISS searches for concepts.

  • Lightning Fast Retrieval: Developed by Facebook AI Research, FAISS is optimized to search through millions of vectors in milliseconds.
  • Vector Mapping: FAISS creates an index that maps each mathematical vector to its original text chunk:
    • Vector A [0.12, 0.88, ...] → Chunk 1
    • Vector B [0.91, 0.33, ...] → Chunk 2
  • Similarity Search: When you ask a question, FAISS calculates the "mathematical distance" (using metrics like Cosine Similarity or L2 distance) between your question and the stored chunks to find the closest match.

Why this makes your Chatbot "Smart":

  • Relevance: It only sends the most relevant pieces of information to the AI.
  • No Hallucinations: By giving the AI specific "Context," we prevent it from making things up. It can only answer based on what it finds in the FAISS index.
  • Efficiency: Instead of reading a 100-page PDF, the AI only reads the 3 or 4 most relevant paragraphs.
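
As a quick illustration of how this search looks in code, here is a small sketch that reuses the vector_store built above and also prints the raw distance scores. The question string is just an example; with the default FAISS setup in LangChain, lower scores mean closer matches.

Python

# Assumes the `vector_store` object created in the snippet above
results = vector_store.similarity_search_with_score(
    "What is the refund policy?",  # example question
    k=3
)

for doc, score in results:
    # Lower L2 distance = closer match to the question
    print(round(score, 3), doc.page_content[:80])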

Step 9: Initializing the "AI Brain" (Flan-T5)

With our data indexed and searchable, we now need an intelligent engine to read the retrieved text and write a human-like response. We will use the Hugging Face Pipeline API to load google/flan-t5-large.

Python

from transformers import pipeline

# Load the LLM (Local Execution)
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-large",
    max_new_tokens=256,
)

Why Flan-T5?

  • The "Flan" Advantage: Unlike standard models, "Flan" models are fine-tuned on a large collection of instructions, making them exceptionally good at following directions (like "Answer this question based only on this text").
  • Resource Efficient: Its relatively small footprint allows it to run comfortably on most modern laptops without a dedicated GPU.

Key Parameters:

  • pipeline("text2text-generation"): This helper function from Hugging Face simplifies the complex task of loading the tokenizer and the model into a single, easy-to-use object.
  • max_new_tokens=256: This limits the length of the AI's response. It ensures the answer is concise and prevents the model from "looping" or rambling.

What is a "Pipeline" anyway?

If you are new to AI, the word "Pipeline" might sound complicated. But in reality, it’s just a ready-made workflow that handles several messy steps for you in the correct order.

Think of a Coffee Machine:

When you want a coffee, you don't manually:

  • Grind the beans.
  • Boil the water to the exact temperature.
  • Manage the steam pressure.

You just press one button, and the machine does all those steps internally.

In the world of AI, the Pipeline is that one button.

The Pipeline handles all the "behind-the-scenes" work for you:

  • Loads the Tokenizer: It prepares the tool that turns words into numbers.
  • Loads the Model: It downloads and sets up the "AI Brain" (Flan-T5).
  • Pre-processes: It cleans your input text so the AI can understand it.
  • Inference: It feeds the data through the model to get an answer.
  • Post-processes: It turns the AI's "number output" back into human words.

Without the pipeline, you would have to write 20+ lines of code just to get the model started. With the pipeline, you just provide the input, and it gives you the output!

When you write this single line of code:

generator = pipeline("text2text-generation", model="google/flan-t5-large")

At this point: generator is a Pipeline object. But this Pipeline object has a special method inside it called __call__().
Because of that, Python allows:

generator("Explain FAISS")

When you do:

generator(prompt)

Python actually does:

generator.__call__(prompt)

Inside __call__():

  • Text is tokenized
  • Tokens are passed to FLAN-T5
  • Output tokens are generated
  • Tokens are converted back to text
  • Final answer is returned
  • All of this is hidden from you (a rough manual equivalent is sketched below).
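
If you are curious what those hidden steps look like, here is a minimal sketch of a manual equivalent using the standard transformers classes (AutoTokenizer and AutoModelForSeq2SeqLM). The pipeline's real internals are more involved, so treat this as an illustration rather than its actual source code.

Python

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the tokenizer and the model (the pipeline does this once)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# 2. Pre-process: turn the prompt into token IDs
inputs = tokenizer("Explain FAISS", return_tensors="pt")

# 3. Inference: generate output token IDs
output_ids = model.generate(**inputs, max_new_tokens=256)

# 4. Post-process: turn the token IDs back into human words
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(answer)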

Step 10: The Query Interface

The final step in our UI is to create an input field where the user can actually type their question. In Streamlit, this is handled by a single, clean function.

Python

# Create an input field for the user's question
query = st.text_input("Ask a question from the document")

How it works:

  • st.text_input: This command renders a text box on your web app.
  • User Interaction: As soon as the user types their question and hits "Enter," the text is captured and stored in the query variable.
  • Reactive UI: In Streamlit, the moment the query variable changes, the script re-runs, triggering the search and generation process we are about to build next.

Step 11: Retrieval - Finding the Right Information

We don't want to send the entire PDF to the AI; that would be too much data and very slow. Instead, we use our FAISS database to find only the most relevant sections.

Python

if query:
    # 1. Retrieve the most relevant chunks
    docs = vector_store.similarity_search(query, k=4)

    # 2. Combine the retrieved chunks into one block of text
    context = "\n\n".join([doc.page_content for doc in docs])

What is happening here?

  • Vector Comparison: When you call similarity_search(query), the system converts your question into numbers (embeddings) and compares them against all the chunks in the database.
  • The "Top 4" (k=4): The parameter k=4 tells FAISS to return the four best-matching pieces of text. This ensures we get enough information to form a complete answer without overwhelming the AI.
  • Context Joining: We use "\n\n".join(...) to take those four separate pieces and glue them together into one large string called context. This "context" acts as the open book that the AI will read to find your answer.

Step 12: Prompt Engineering & Final Answer

Now that we have the relevant pieces of the PDF (the Context), we need to give them to the AI. But we don't just throw the text at it—we provide a Prompt that acts as a set of instructions.

Python

# Build the instruction (Prompt)
prompt = f"""
You are a helpful question-answering assistant.

Using ONLY the information from the context,
answer the question in complete sentences.

If the context contains a list or multiple conditions,
combine all of them into a single clear answer.

Do NOT copy bullet labels like (a), (b), (c).
Do NOT repeat headings or article numbers.

Context:
{context}

Question:
{query}
"""

Why do we need such a detailed prompt?

Without instructions, an AI might try to answer using its own general knowledge from the internet. By using a strict prompt, we set "Guardrails":

  • No Outside Knowledge: We tell the AI to use ONLY the provided context. This ensures the answer comes from your PDF, not a random website.
  • Clean Formatting: By telling it to ignore bullet labels like (a) or (b), we make sure the final answer looks like a natural human response rather than a copy-paste job.
  • Avoid Hallucinations: If the answer isn't in the context, these instructions help the AI stay honest instead of making things up.

Step 13: Generating and Displaying the Answer

This is the moment of truth. We feed our carefully crafted prompt into the "AI Brain" we loaded earlier and display the result on the screen.

Python

# 1. Generate the answer using the pipeline
result = generator(prompt)

# 2. Extract the text from the result list
answer = result[0]["generated_text"]

# 3. Display the result in the UI
st.subheader("Answer")
st.write(answer)

What is happening in these final lines?

  • generator(prompt): The model takes your prompt (Instructions + Context + Question) and runs it through its neural network. This process is called Inference.
  • The "Result" Format: Hugging Face pipelines usually return data in a list format, like this:** [{'generated_text': 'The answer is...'}]. That is why we use result[0]["generated_text"]** to grab just the text we need.
  • st.subheader & st.write: These Streamlit functions format the output beautifully, making the answer stand out so the user can read it clearly.

Step 14: Handling the "Empty State"

When a user first opens your app, they haven't uploaded a file yet. Without this final piece of logic, the screen would just be blank. We use st.info to provide a clear instruction to the user.

Python

else:
    st.info("Upload a PDF to start chatting.")


else block

Why is this important?

  • User Guidance: It acts as a "Call to Action" (CTA). It tells the user exactly what their first move should be.
  • Conditional Logic: This else block is connected to our very first if text: statement. If there is no text (because no file was uploaded), the app displays this friendly blue info box instead of trying to run the AI logic.

That's it!!!

The Final Result: See it in Action

After you have uploaded the PDF:

After uploading PDF

Now, ask your questions!

Final Result

Get the Source Code

I have uploaded the complete project to GitHub for you to explore and run yourself.

Note on the GitHub Version: While this tutorial uses the large model for better accuracy, the version on GitHub is configured with google/flan-t5-small and k=2. I made these adjustments to ensure the app remains lightweight and runs smoothly within the resource limits of Streamlit Cloud. If you are running this locally on a powerful machine, feel free to switch back to the large version!
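
For reference, that lightweight configuration only touches two lines. Here is a minimal sketch of the swap; treat it as a guide under the assumptions above rather than a copy of the repository's exact code.

Python

# Lighter model for constrained environments such as Streamlit Cloud
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    max_new_tokens=256,
)

# Retrieve fewer chunks to keep the prompt short
docs = vector_store.similarity_search(query, k=2)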

Here's the GitHub link: PDF-Chatbot

Conclusion

Building a local PDF chatbot is more than just a cool project; it’s a major step toward data privacy and cost-efficient AI. By moving away from expensive API keys and leveraging the power of Open Source tools like LangChain, FAISS, and HuggingFace, you now have a private assistant that works entirely on your terms.
