Dheeraj Malhotra

Build Your Own AI Chatbot: A Complete Guide to Local Deployment with ServBay, Python, and ChromaDB

In an era where data privacy is paramount, setting up your own large language model (LLM) locally provides a crucial solution for companies and individuals alike. This tutorial guides you through creating a custom chatbot using ServBay, Python 3, and ChromaDB, all hosted locally on your system. Apart from ServBay, you don't need to download any other software. Here are the key reasons to follow this tutorial:

  1. Complete Customization: Full control over configuration allows you to tailor the model to your specific needs without relying on third-party services.

  2. Improved Privacy: Deploying your LLM locally protects sensitive information from the risks of online transmission, which is crucial for organizations handling private data.

  3. Data Security Assurance: Minimizes security threats by keeping training materials, such as PDF files, secure within your environment, reducing exposure to external risks.

  4. Control Over Data Management: Freedom to handle and process data as desired, including embedding proprietary information into a ChromaDB vector store, ensuring alignment with your standards.

  5. Internet Independence: Ensures consistent access to your chatbot without needing an internet connection, maintaining service even when offline.

This tutorial aims to guide you in building a robust and secure local chatbot that prioritizes your privacy and control.


Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an advanced technique that combines the strengths of information retrieval and text generation to create more accurate and contextually relevant responses. Here's a breakdown of how RAG works and why it's beneficial:

What is RAG?

RAG is a hybrid model that enhances the capabilities of language models by incorporating an external knowledge base or document store. The process involves two main components:

  • Retrieval: In this phase, the model retrieves relevant documents or pieces of information from an external source, such as a database or a vector store, based on the input query.
  • Generation: The retrieved information is then used by a generative language model to produce a coherent and contextually appropriate response.

How Does RAG Work?

  • Query Input: The user inputs a query or question.
  • Document Retrieval: The system uses the query to search an external knowledge base, retrieving the most relevant documents or snippets of information.
  • Response Generation: The generative model processes the retrieved information, integrating it with its own knowledge to generate a detailed and accurate response.
  • Output: The final response, enriched with specific and relevant details from the knowledge base, is presented to the user.
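
To make this flow concrete, here is a minimal retrieve-then-generate sketch in Python. It is an illustration of the pattern only, not the code used later in this tutorial: the retrieve() helper and the toy document list are hypothetical stand-ins for a real vector store, and it assumes an Ollama server reachable on the default port 11434 with a model such as mistral pulled (and the requests package installed).

# rag_sketch.py: minimal retrieve-then-generate loop (illustration only)
import requests

# Toy "knowledge base"; a real RAG app would use a vector store such as ChromaDB.
DOCUMENTS = [
    "ServBay bundles Ollama so language models can be served locally.",
    "ChromaDB stores document embeddings for similarity search.",
    "Flask exposes the embed and query endpoints over HTTP.",
]

def retrieve(query, k=2):
    # Placeholder retrieval: rank documents by naive keyword overlap.
    # A real system would rank by embedding similarity instead.
    words = set(query.lower().split())
    return sorted(DOCUMENTS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def generate(query, context):
    # Ask the local LLM to answer using only the retrieved context.
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]

if __name__ == "__main__":
    question = "What does ChromaDB do?"
    context = "\n".join(retrieve(question))   # step 1: retrieval
    print(generate(question, context))        # step 2: generation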

Benefits of RAG

  • Enhanced Accuracy: By leveraging external data, RAG models can provide more precise and detailed answers, especially for domain-specific queries.
  • Contextual Relevance: The retrieval component ensures that the generated response is grounded in relevant and up-to-date information, improving the overall quality of the response.
  • Scalability: RAG systems can be easily scaled to incorporate vast amounts of data, enabling them to handle a wide range of queries and topics.
  • Flexibility: These models can be adapted to various domains by simply updating or expanding the external knowledge base, making them highly versatile.

Why Use RAG Locally?

  • Privacy and Security: Running a RAG model locally ensures that sensitive data remains secure and private, as it does not need to be sent to external servers.
  • Customization: You can tailor the retrieval and generation processes to suit your specific needs, including integrating proprietary data sources.
  • Independence: A local setup ensures that your system remains operational even without internet connectivity, providing consistent and reliable service.

By setting up a local RAG application with tools like Ollama, Python, and ChromaDB, you can enjoy the benefits of advanced language models while maintaining control over your data and customization options.


ServBay

ServBay is an integrated, graphical, one-click local web development environment designed for web developers, Python developers, AI developers, and PHP developers, and it is particularly well suited to macOS. It includes a range of commonly used web development services and tools, covering web servers, databases, programming languages, mail servers, queue services, and more. ServBay aims to provide developers with a convenient, efficient, and unified development environment.

Core Features of ServBay

  1. Support for Multiple Python Versions: Run multiple Python versions simultaneously to meet the needs of different projects.
  2. Custom Domain Names and SSL Support: Easily configure local domain names and SSL certificates to simulate real production environments.
  3. Quick Operations: Supports startup on boot, quick access via the menu bar, and command-line management to enhance development efficiency.
  4. Unified Service Management: Integrates Python, PHP, Node.js, and Ollama, making it easy to manage multiple development services.
  5. Clean System Environment: Avoids system pollution by running all services in isolated environments.
  6. Tunneling and Sharing: Supports tunneling (intranet penetration) for local websites, making it easier to share development results with team members.

ServBay Installation Guide

Requirements: macOS 12.0 Monterey or later
Download the Latest Version of ServBay

Installation:

  • Double-click the downloaded .dmg file to open it.
  • In the opened window, drag the ServBay.app icon into the Applications folder.


  • When using ServBay for the first time, initialization is required. Generally, you can select the default installation, or optionally select Ollama for AI programming support.


  • After installation is complete, open ServBay.
  • Enter your password when prompted. Once installation is complete, you can find ServBay in the Applications directory.
  • Access the main interface.

In addition to Python, ServBay also provides robust support for PHP and Node.js, covering a wide range of versions from PHP 5.6 to PHP 8.5 and Node.js 12 to Node.js 23.
One of ServBay's key features is the ability to quickly switch between different software versions. This flexibility is essential for developers who need to test and deploy applications in various environments.

One-click installation of all Python versions


One-click installation of all Ollama models



Prerequisites

Before diving into the setup, ensure you have the following prerequisites in place:

  • Python 3: Python is a versatile programming language that you'll use to write the code for your RAG app.
  • ChromaDB: A vector database that will store and manage the embeddings of your data.
  • ServBay: To download and serve custom LLMs on your local machine.

Step 1: Install Python 3 and setup your environment

To install and set up your Python 3 environment, follow these steps:
Click the Python button in ServBay, then select a Python version.
Then confirm that Python 3 is installed and runs successfully:

$ python3 --version
# Python 3.12.9

Create a folder for your project. For example, local-rag:

$ mkdir local-rag
$ cd local-rag

Create a virtual environment named venv:

$ python3 -m venv venv

Activate the virtual environment:

$ source venv/bin/activate
# On Windows:
# venv\Scripts\activate

Step 2: Install ChromaDB and other dependencies

Install ChromaDB using pip:

$ pip install -q chromadb

Install the LangChain tools (plus python-dotenv, which the app uses to load the .env file) to work seamlessly with your model:

$ pip install -q unstructured langchain langchain-community langchain-text-splitters python-dotenv
$ pip install -q "unstructured[all-docs]"

Install Flask to serve your app as an HTTP service:

$ pip install -q flask
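
Optionally, you can confirm that the core packages installed cleanly before moving on. The snippet below is just a sanity check of the fresh virtual environment (it only creates an in-memory ChromaDB collection); it is not part of the app itself:

# sanity_check.py: optional smoke test for the freshly installed dependencies
from importlib.metadata import version

import chromadb
import flask          # noqa: F401 (import check only)
import langchain      # noqa: F401

client = chromadb.Client()  # in-memory client; nothing is written to disk
collection = client.get_or_create_collection("smoke-test")

print("chromadb:", version("chromadb"), "| langchain:", version("langchain"), "| flask:", version("flask"))
print("created in-memory collection:", collection.name)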

Step 3: Install Ollama

To install Ollama, follow these steps:
Click the AI button in ServBay, then select a model you like.

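
ServBay manages the Ollama service for you, but it is worth confirming that the Ollama API is reachable and that the model you selected has finished downloading before wiring it into the app. The sketch below assumes ServBay exposes Ollama on the standard port 11434 (adjust the URL if your setup differs) and that the requests package is available:

# check_ollama.py: optional check that the local Ollama API is up and lists your models
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Installed Ollama models:", models or "none yet; pull one from the ServBay AI panel")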


Build the RAG app

Now that you've set up your environment with Python, Ollama, ChromaDB and other dependencies, it's time to build your custom local RAG app. In this section, we'll walk through the hands-on Python code and provide an overview of how to structure your application.
app.py
This is the main Flask application file. It defines routes for embedding files into the vector database and for retrieving responses from the model.

import os  
from dotenv import load_dotenv  
from flask import Flask, request, jsonify  
from embed import embed  
from query import query  
from get_vector_db import get_vector_db  

# Load environment variables from the .env file
load_dotenv()

# Set up the temporary folder for uploaded files
TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')
os.makedirs(TEMP_FOLDER, exist_ok=True)

app = Flask(__name__)  

@app.route('/embed', methods=['POST'])  
def route_embed():  
    if 'file' not in request.files:  
        return jsonify({"error": "No file part"}), 400  

    file = request.files['file']  
    if file.filename == '':  
        return jsonify({"error": "No selected file"}), 400  

    embedded = embed(file)  
    if embedded:  
        return jsonify({"message": "File embedded successfully"}), 200  

    return jsonify({"error": "File embedded unsuccessfully"}), 400  

@app.route('/query', methods=['POST'])  
def route_query():  
    data = request.get_json()  
    response = query(data.get('query'))  
    if response:  
        return jsonify({"message": response}), 200  

    return jsonify({"error": "Something went wrong"}), 400  

if __name__ == '__main__':  
    app.run(host="0.0.0.0", port=8080, debug=True)  

embed.py
This module handles the embedding process, including saving uploaded files, loading and splitting data, and adding documents to the vector database.

import os  
from datetime import datetime  
from werkzeug.utils import secure_filename  
from langchain_community.document_loaders import UnstructuredPDFLoader  
from langchain_text_splitters import RecursiveCharacterTextSplitter  
from get_vector_db import get_vector_db  

TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')  

# Function to check if the uploaded file is allowed (only PDF files)  
def allowed_file(filename):  
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in {'pdf'}  

# Function to save the uploaded file to the temporary folder  
def save_file(file):  
    # Save the uploaded file with a secure filename and return the file path  
    timestamp = datetime.now().timestamp()  
    filename = f"{timestamp}_{secure_filename(file.filename)}"  
    file_path = os.path.join(TEMP_FOLDER, filename)  
    file.save(file_path)  
    return file_path  

# Function to load and split the data from the PDF file  
def load_and_split_data(file_path):  
    # Load the PDF file and split the data into chunks  
    loader = UnstructuredPDFLoader(file_path=file_path)  
    data = loader.load()  
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)  
    chunks = text_splitter.split_documents(data)  
    return chunks  

# Main function to handle the embedding process  
def embed(file):  
    # Check if the file is valid, save it, load and split the data, add to the database, and remove the temporary file  
    if file.filename != '' and file and allowed_file(file.filename):  
        file_path = save_file(file)  
        chunks = load_and_split_data(file_path)  
        db = get_vector_db()  
        db.add_documents(chunks)  
        db.persist()  
        os.remove(file_path)  
        return True  
    return False  

query.py
This module processes user queries by generating multiple versions of the query, retrieving relevant documents, and providing answers based on the context.

import os  
from langchain_community.chat_models import ChatOllama  
from langchain.prompts import ChatPromptTemplate, PromptTemplate  
from langchain_core.output_parsers import StrOutputParser  
from langchain_core.runnables import RunnablePassthrough  
from langchain.retrievers.multi_query import MultiQueryRetriever  
from get_vector_db import get_vector_db  

LLM_MODEL = os.getenv('LLM_MODEL', 'deepseek-r1:1.5b')  

# Function to get the prompt templates for generating alternative questions and answering based on context  
def get_prompt():  
    QUERY_PROMPT = PromptTemplate(  
        input_variables=["question"],  
        template="""You are an AI language model assistant. Your task is to generate five  
        different versions of the given user question to retrieve relevant documents from  
        a vector database. By generating multiple perspectives on the user question, your  
        goal is to help the user overcome some of the limitations of the distance-based  
        similarity search. Provide these alternative questions separated by newlines.  
        Original question: {question}"""  
    )  

    template = """Answer the question based ONLY on the following context:  
    {context}  
    Question: {question}"""  

    prompt = ChatPromptTemplate.from_template(template)  
    return QUERY_PROMPT, prompt  

# Main function to handle the query process  
def query(input):  
    if input:  
        # Initialize the language model with the specified model name  
        llm = ChatOllama(model=LLM_MODEL)  

        # Get the vector database instance  
        db = get_vector_db()  

        # Get the prompt templates  
        QUERY_PROMPT, prompt = get_prompt()  

        # Set up the retriever to generate multiple queries using the language model and the query prompt  
        retriever = MultiQueryRetriever.from_llm(db.as_retriever(), llm, prompt=QUERY_PROMPT)  

        # Define the processing chain to retrieve context, generate the answer, and parse the output  
        chain = ({"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser())  

        response = chain.invoke(input)  
        return response  

    return None  

get_vector_db.py
This module initializes and returns the vector database instance used for storing and retrieving document embeddings.

import os  
from langchain_community.embeddings import OllamaEmbeddings  
from langchain_community.vectorstores.chroma import Chroma  

CHROMA_PATH = os.getenv('CHROMA_PATH', 'chroma')  
COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'local-rag')  
TEXT_EMBEDDING_MODEL = os.getenv('TEXT_EMBEDDING_MODEL', 'nomic-embed-text')  

def get_vector_db():  
    # Create an instance of the embedding model  
    embedding = OllamaEmbeddings(model=TEXT_EMBEDDING_MODEL, show_progress=True)  

    # Initialize the Chroma vector store with specified parameters  
    db = Chroma(  
        collection_name=COLLECTION_NAME,  
        persist_directory=CHROMA_PATH,  
        embedding_function=embedding  
    )  

    return db  
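
If you later want to confirm that embedded chunks actually landed in Chroma, you can query the store directly with the same get_vector_db() helper. This is an optional debugging aid, not part of the app; it assumes you have already embedded at least one document and that the nomic-embed-text model is installed in Ollama:

# inspect_db.py: optional debugging aid to peek into the persisted Chroma collection
from dotenv import load_dotenv

load_dotenv()  # pick up CHROMA_PATH / COLLECTION_NAME / TEXT_EMBEDDING_MODEL before import

from get_vector_db import get_vector_db

db = get_vector_db()
stored = db.get()  # returns the stored ids, documents, and metadatas
print("Chunks in collection:", len(stored["ids"]))

# Raw similarity search, bypassing the LLM, to see which chunks would be retrieved.
for doc in db.similarity_search("What is this document about?", k=3):
    print("---")
    print(doc.page_content[:200])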

Run your app!

Create a .env file in your project root to store your environment variables (the models you name here must be installed in your local Ollama):

TEMP_FOLDER = './_temp'
CHROMA_PATH = 'chroma'
COLLECTION_NAME = 'local-rag'
LLM_MODEL = 'mistral'
TEXT_EMBEDDING_MODEL = 'nomic-embed-text'

Run the app.py file to start your app server:

python3 app.py

Once the server is running, you can start making requests to the following endpoints:

  • Example command to embed a PDF file (e.g., resume.pdf):

$ curl --request POST \
  --url http://localhost:8080/embed \
  --header 'Content-Type: multipart/form-data' \
  --form file=@/Users/liyinan/Documents/works/matrix_multi.pdf

# Response
{
  "message": "File embedded successfully"
}
  • Example command to ask a question to your model:


$ curl --request POST \
  --url http://localhost:8080/query \
  --header 'Content-Type: application/json' \
  --data '{ "query": "Who is Nasser?" }'

# Response
{
  "message": "Nasser Maronie is a Full Stack Developer with experience in web and mobile app development. He has worked as a Lead Full Stack Engineer at Ulventech, a Senior Full Stack Engineer at Speedoc, a Senior Frontend Engineer at Irvins, and a Software Engineer at Tokopedia. His tech stacks include Typescript, ReactJS, VueJS, React Native, NodeJS, PHP, Golang, Python, MySQL, PostgresQL, MongoDB, Redis, AWS, Firebase, and Supabase. He has a Bachelor's degree in Information System from Universitas Amikom Yogyakarta."
}

Conclusion

By following these instructions, you can effectively run and interact with your custom local RAG app using Python, Ollama, and ChromaDB, tailored to your needs. Adjust and expand the functionality as necessary to enhance the capabilities of your application.
By harnessing the capabilities of local deployment, you not only safeguard sensitive information but also optimize performance and responsiveness. Whether you're enhancing customer interactions or streamlining internal processes, a locally deployed RAG application offers flexibility and robustness to adapt and grow with your requirements.
