Gabriel Melendez
RAG-Powered Chat: OpenAI & ChromaDB Integration

In the world of AI-driven applications, the ability to chat with your own data is a game-changer. Imagine an assistant that doesn't just rely on its pre-trained knowledge but can instantly learn from documents you provide. That's the power of Retrieval Augmented Generation (RAG), and today we're going to build a complete web application that brings this concept to life.

This guide will walk you through the development of "Chat RAG", a full-stack application featuring a React/TypeScript frontend and a Python/FastAPI backend. We'll cover everything from setting up the architecture to deploying the final product. The codebase is available on GitHub.

Repo: Chat-RAG-Assistant

What we are building

Our application will allow users to:

  • Upload their documents (PDF, DOCX, TXT, etc.).

  • Ask questions in a real-time chat interface.

  • Receive answers generated by an LLM, backed by the context of the uploaded documents.

  • See which parts of the documents were used to generate the answer.

Final Results

(Screenshots of the finished application.)

Let's get started.

The Architectural Blueprint

Our architecture is designed to be modular, scalable, and maintainable.

(Architecture diagram.)

Prerequisites

Before you write a single line of code, make sure you have the following installed:

  • Node.js (v18 or newer)
  • Python (v3.12 or newer)
  • PostgreSQL (v13 or newer)
  • Docker (optional, but highly recommended for easy setup)

Project Structure

We'll organize our monorepo logically:

Chat_RAG/
├── backend/ # FastAPI backend application
├── frontend/ # React frontend application
├── docs/ # Documentation, including deployment guides
├── scripts/ # Helper scripts for setup and execution
└── docker-compose.yml # Defines our multi-container setup

Important: Database Creation

Before starting the backend, ensure that the target database (e.g., "chatrag_db" for PostgreSQL) already exists. The backend will automatically create all required tables on startup if the database exists and the configured user has permissions. However, it will not create the database itself.
To create the database manually, use your preferred database administration tool or run:
"CREATE DATABASE chatrag_db;"
After the database is created, you can start the backend and it will handle table creation automatically.

Setting Up the FastAPI Server

FastAPI gives us a robust framework for building our API. The entry point is "backend/app/main.py".

We structure our application into several directories:

(Backend directory structure.)
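
To give you an idea of how the pieces plug together, here's a minimal sketch of what "backend/app/main.py" could contain. The router modules, the database/settings imports, and the CORS origin below are assumptions about the layout, not verbatim from the repo:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.api.routes import chat, documents  # assumed router modules
from app.core.database import Base, engine  # assumed SQLAlchemy setup

app = FastAPI(title="Chat RAG API")

# Allow the Vite dev server to call the API from a different origin.
app.add_middleware(
  CORSMiddleware,
  allow_origins=["http://localhost:5173"],
  allow_credentials=True,
  allow_methods=["*"],
  allow_headers=["*"],
)

@app.on_event("startup")
def create_tables():
  # Creates all tables in the existing database (the database itself must already exist).
  Base.metadata.create_all(bind=engine)

app.include_router(documents.router, prefix="/api/documents", tags=["documents"])
app.include_router(chat.router, prefix="/api/chat", tags=["chat"])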

The RAG Service: From Document to Answer

The core of our application is the RAG pipeline, which we'll build within the backend/app/services/ directory. Here’s how it works:

  • Document Ingestion (document_processor.py): When a user uploads a file, we can't just feed the whole thing to the LLM. We need to break it down.
  1. Load: We use LangChain's document loaders (PyPDFLoader, Docx2txtLoader, etc.) to read the content.

  2. Split: The text is split into smaller, manageable "chunks" using a RecursiveCharacterTextSplitter. This ensures that we can find very specific pieces of information.

import uuid
from pathlib import Path

import aiofiles
from langchain.text_splitter import RecursiveCharacterTextSplitter

class DocumentProcessor:
  def __init__(self):
    # Directory where uploaded files are stored (path is illustrative); created on first use.
    self.upload_dir = Path("uploads")
    self.upload_dir.mkdir(parents=True, exist_ok=True)
    self.text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=1000,
      chunk_overlap=200,
      separators=["\n\n", "\n", " ", ""],
      length_function=len,
    )

  async def save_uploaded_file(self, file_content: bytes, filename: str) -> str:
    # Store under a unique name (keeping the original extension) to avoid collisions.
    unique_filename = f"{uuid.uuid4()}{Path(filename).suffix}"
    file_path = self.upload_dir / unique_filename
    async with aiofiles.open(file_path, 'wb') as f:
      await f.write(file_content)
    return str(file_path)

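The snippet above handles saving the upload. The load-and-split steps live in a process_document method; the repo's exact implementation isn't reproduced here, but a minimal sketch, with an illustrative loader mapping and return shape, could look like this:

from pathlib import Path

from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader, TextLoader

# Map file extensions to LangChain loaders (extend as needed).
LOADERS = {".pdf": PyPDFLoader, ".docx": Docx2txtLoader, ".txt": TextLoader}

class DocumentProcessor:
  # ... __init__ and save_uploaded_file as above ...

  async def process_document(self, file_path: str, document_model) -> dict:
    # Load: read the raw content with the loader that matches the file type.
    loader_cls = LOADERS.get(Path(file_path).suffix.lower(), TextLoader)
    documents = loader_cls(file_path).load()
    # Split: break the content into overlapping chunks for precise retrieval.
    chunks = self.text_splitter.split_documents(documents)
    return {"chunks": chunks, "chunk_count": len(chunks)}
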
  • Embeddings and Vector Storage ("vector_store.py"). Next, we need to convert these text chunks into a format that a machine can understand and compare: numerical vectors, or "embeddings".
  1. Embed: We use an embedding model, like OpenAI's "text-embedding-ada-002", to generate a vector for each text chunk.

  2. Store: These vectors, along with their corresponding text chunks, are stored in ChromaDB.

from typing import List

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

from app.core.config import settings  # project settings module (path may differ)

class VectorStoreService:
  def __init__(self):
    self.embeddings = OpenAIEmbeddings(api_key=settings.openai_api_key,
                       model=settings.embedding_model)
    self.vector_store = Chroma(
      persist_directory=settings.chroma_persist_directory,
      embedding_function=self.embeddings,
      collection_name="documents",
    )

  async def add_documents(self, documents: List[Document], document_id: int, document_filename: str) -> str:
    # Attach metadata so every chunk can be traced back to its source document.
    for i, doc in enumerate(documents):
      doc.metadata.update({
        "document_id": document_id,
        "document_filename": document_filename,
        "chunk_index": i,
        "chunk_id": f"{document_id}_{i}",
      })
    # Deterministic IDs make it easy to update or delete a document's chunks later.
    ids = [f"{document_id}_{i}" for i in range(len(documents))]
    self.vector_store.add_documents(documents, ids=ids)
    self.vector_store.persist()
    return "documents"  # the collection name serves as the vector store ID

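We'll also need to query this store during the retrieval step. The repo's exact query method isn't shown here; a minimal sketch of such a helper (the wrapper name and default k are my own choices) could be:

class VectorStoreService:
  # ... __init__ and add_documents as above ...

  def similarity_search(self, query: str, k: int = 4) -> List[Document]:
    # Embed the query and return the k most similar chunks from ChromaDB.
    return self.vector_store.similarity_search(query, k=k)
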
  • The Retrieval and Generation Loop (rag_service.py). This is where we answer the user's question.
  1. Receive Query: The user sends a message through the chat interface.

  2. Retrieve Context: We take the user's query, create an embedding for it, and use that to search ChromaDB. The database returns the most relevant text chunks from the documents. This is the "Retrieval" in RAG.

  3. Augment Prompt: We construct a detailed prompt for the LLM. This prompt includes the original user query and the retrieved text chunks as context.

  4. Generate Answer: This complete prompt is sent to the LLM (e.g., "gpt-5-nano-2025-08-07"). The model generates an answer based on the provided context. This is the "Generation" in RAG.

from typing import Any, Dict
from sqlalchemy.orm import Session

from app.models.document import DocumentModel, ProcessingStatus  # adjust to your project layout
from app.services.document_processor import DocumentProcessor
from app.services.vector_store import VectorStoreService

class RAGService:
  def __init__(self):
    self.document_processor = DocumentProcessor()
    self.vector_store = VectorStoreService()

  async def process_document_pipeline(self, document_model: DocumentModel, db: Session) -> Dict[str, Any]:
    # Mark the document as in-progress so the UI can reflect its state.
    document_model.processing_status = ProcessingStatus.PROCESSING
    db.commit()
    processing_result = await self.document_processor.process_document(document_model.file_path, document_model)
    vector_store_id = await self.vector_store.add_documents(
      documents=processing_result['chunks'],
      document_id=document_model.id,
      document_filename=document_model.original_filename,
    )
    # Update the DB model with metadata (chunk count, preview, etc.) before returning.
    return {"vector_store_id": vector_store_id, **processing_result}

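The ingestion pipeline above is only half of the service. The answer path isn't reproduced in full here, but a minimal sketch, reusing the similarity_search helper from earlier and assuming illustrative setting names and prompt wording, might look like this:

from langchain_openai import ChatOpenAI

from app.core.config import settings  # assumed settings module

class RAGService:
  # ... __init__ and process_document_pipeline as above ...

  async def answer_question(self, question: str) -> dict:
    # Retrieve: fetch the most relevant chunks for the user's query.
    relevant_chunks = self.vector_store.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in relevant_chunks)
    # Augment: combine the retrieved context with the original question.
    prompt = (
      "Answer the question using only the context below.\n\n"
      f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Generate: send the augmented prompt to the LLM (settings.llm_model is an assumed config name).
    llm = ChatOpenAI(model=settings.llm_model, api_key=settings.openai_api_key)
    response = await llm.ainvoke(prompt)
    return {
      "answer": response.content,
      "sources": [doc.metadata for doc in relevant_chunks],  # which chunks were used
    }

A streaming variant of this method (yielding tokens from llm.astream(prompt) instead of returning the full answer) is what the WebSocket endpoint in the next section would call.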

Real-Time Chat with WebSockets

To create a fluid, ChatGPT-like experience, we use WebSockets. The endpoint in "api/routes/chat.py" handles the connection. When a user sends a message, it's passed to our rag_service, and the generated response is streamed back token-by-token over the same WebSocket connection. This provides instant feedback to the user.
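
As a rough sketch of what that endpoint could look like (the route path and the stream_answer generator, a streaming variant of answer_question above, are illustrative names, not the repo's exact API):

from fastapi import APIRouter, WebSocket, WebSocketDisconnect

from app.services.rag_service import RAGService  # adjust to your layout

router = APIRouter()
rag_service = RAGService()

@router.websocket("/ws/chat")
async def chat_websocket(websocket: WebSocket):
  await websocket.accept()
  try:
    while True:
      # Wait for the next user message.
      question = await websocket.receive_text()
      # Stream the RAG answer back token-by-token (stream_answer is a hypothetical
      # async generator that yields tokens, e.g. via llm.astream(prompt)).
      async for token in rag_service.stream_answer(question):
        await websocket.send_json({"type": "token", "content": token})
      await websocket.send_json({"type": "done"})
  except WebSocketDisconnect:
    pass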

The Frontend

We'll use React, TypeScript, and Vite to build a clean, responsive UI. I won't dive deep into the frontend here; check out the codebase in the GitHub repo if you're interested.

Dockerizing for Consistency and Deployment

To ensure our application runs the same way everywhere, we use Docker. Our "docker-compose.yml" file defines two main services: "backend" and "frontend".

  • Backend: Builds the Docker image from "backend/Dockerfile", exposes the port, and sets the necessary environment variables for the database URL, API keys, etc.

  • Frontend service: Builds from "frontend/Dockerfile" and starts the Vite development server.

  • Volumes: We define a named volume "chroma_db" to persist the vector database on disk. This ensures that our document embeddings aren't lost every time we restart the containers.

With Docker, starting the entire application is as simple as running one command: "docker-compose up --build".
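
For reference, a stripped-down docker-compose.yml along these lines would do the job; the service names, ports, and environment variables here are illustrative, not copied from the repo:

services:
  backend:
    build: ./backend
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - chroma_db:/app/chroma_db   # persist embeddings across container restarts
  frontend:
    build: ./frontend
    ports:
      - "5173:5173"
    depends_on:
      - backend

volumes:
  chroma_db: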

Conclusion and What's Next

Congratulations! You now have a blueprint for building a sophisticated, full-stack RAG chat application. We've designed a modular system with a clear separation of concerns, from data processing on the backend to state management on the frontend.
This project is a fantastic starting point. Here are a few ideas for taking it to the next level:

  • User Authentication: Implement a login system to provide private document storage for each user.

  • Advanced RAG: Experiment with more advanced retrieval strategies like re-ranking or query transformations to improve accuracy.

  • Model Flexibility: Add a UI element to allow users to switch between different LLMs or adjust parameters like temperature.

Happy coding!
