Rishi Kumar

Posted on Dec 16, 2024

Customize ChatGPT for Your Codebase : OpenAI

#openai #rag #llm #chatgpt

Today, developers often turn to ChatGPT for help with code errors. However, without access to their specific codebase, ChatGPT’s solutions can be too generic. While uploading relevant files can provide better context, sharing the entire codebase is usually impractical. To achieve more accurate, code-specific assistance, developers can either fine-tune a language model with their own code or use Retrieval-Augmented Generation (RAG). These approaches enable more tailored and effective solutions, enhancing the debugging and development process.

RAG is like a smart helper that looks for answers in your stuff (codebase) and then talks to you about it. It helps you find the right information and explains it to you.

Learn how to set up a RAG system tailored to your codebase, enabling quick access to information and efficient resolution of code-related issues.

Table of Contents

Introduction to RAG
Key Benefits of Using RAG for Codebases
Requirements and Initial Setup
Creating RAG Vectors: Processing and Embedding the Codebase
Querying the RAG System: Retrieving Information and Solutions
Conclusion and Future Enhancements

Introduction to RAG
Retrieval-augmented generation (RAG) combines the strengths of information retrieval systems and generative AI models. By leveraging a vector store to index and retrieve relevant information from a large corpus, RAG enhances the generative capabilities of AI, allowing it to produce more accurate and contextually relevant responses.

In the context of software development, RAG can be employed to:

Navigate Complex Codebases: Quickly locate specific functions, classes, or modules.
Diagnose and Resolve Errors: Provide solutions based on existing code patterns and documentation.
Enhance Documentation: Generate contextual explanations or summaries of code segments.

We will focus on setting up a RAG system to interact with your codebase, facilitating efficient information retrieval and problem-solving.

Requirements and Initial Setup
Before diving into the implementation, ensure you have the following prerequisites:

1. Environment Setup
Python 3.7 or higher: Ensure Python is installed on your system. You can download it from python.org.

2. API Access
OpenAI API Key: Sign up for an OpenAI account and obtain an API key from the OpenAI Dashboard.

3. Project Directory Structure
Organize your project directory as follows:

rag_project/
│
├── codebase/     # Your existing codebase (e.g., JavaScript files)
├── .env          # Environment variables (contains OPENAI_API_KEY)
├── create_rag_vectors.py  # Script to process and embed codebase
├── query_rag.py           # Script to query the RAG system
├── requirements.txt       # Python dependencies

4. Installing Dependencies
Create a requirements.txt file with the following content:

dotenv
langchain
faiss-cpu
tiktoken

Install the dependencies using pip:

pip install -r requirements.txt

Note: Depending on your system and specific requirements, you might need to install additional dependencies or use faiss-gpu if you have GPU support.

5. Configuring Environment Variables
Create a .env file in your project root and add your OpenAI API key:

OPENAI_API_KEY=your_openai_api_key_here

Ensure this file is added to .gitignore to prevent accidental exposure of sensitive information.

Creating RAG Vectors: Processing and Embedding the Codebase

The first step in setting up the RAG system is to process your codebase, preprocess the code files, split them into manageable chunks, and create vector embeddings using OpenAI's embeddings. These vectors are then stored in a FAISS index for efficient retrieval.

1. Script Overview
We'll create a Python script named create_rag_vectors.py that performs the following tasks:

Load and Preprocess Code Files: Read code files, remove comments and excessive whitespace.
Split Documents into Chunks: Break down large code files into smaller segments for effective embedding.
Create Vector Store: Generate embeddings and store them using FAISS.
Track Processing and Token Consumption: Log which files are being processed and count the total tokens used.

2. The create_rag_vectors.py Script
Below is the complete script with detailed explanations embedded as comments.


import os
import re
from pathlib import Path
from dotenv import load_dotenv
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
import tiktoken
import logging

# Configure logging for better visibility and control
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Load environment variables from .env file
load_dotenv()

# Retrieve API key
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

# Initialize embedding model
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# Define the path to your codebase
codebase_path = Path("codebase")  # Update this path as needed

# Initialize the tokenizer
# 'cl100k_base' is suitable for models like text-embedding-ada-002
encoder = tiktoken.get_encoding("cl100k_base")

def preprocess_code(code_str: str, extension: str) -> str:
    """
    Preprocesses code by removing comments and excessive whitespace based on file extension.
    """
    # Remove comments based on file extension
    if extension in [".ts", ".js"]:
        # Remove multi-line comments
        code_str = re.sub(r"/\*[\s\S]*?\*/", "", code_str)
        # Remove single-line comments
        code_str = re.sub(r"//.*", "", code_str)
    elif extension == ".py":
        # Remove single-line comments
        code_str = re.sub(r"#.*", "", code_str)
        # Remove multi-line docstrings
        code_str = re.sub(r'"""[\s\S]*?"""', "", code_str)
        code_str = re.sub(r"'''[\s\S]*?'''", "", code_str)
    # Remove excessive whitespace
    code_str = re.sub(r"\s+", " ", code_str)
    return code_str

def load_code_files(path: Path, extensions: list = [".js"]) -> list:
    """
    Loads and preprocesses code files from the specified path.
    Supports recursive directory traversal.
    """
    documents = []
    if path.is_dir():
        # Recursively find all files with the given extensions
        for ext in extensions:
            for file in path.rglob(f"*{ext}"):
                logging.info(f"Processing file: {file}")
                try:
                    with open(file, "r", encoding="utf-8", errors="ignore") as f:
                        content = f.read()
                        # Preprocess the code
                        content = preprocess_code(content, ext)
                        # Append as a Document
                        documents.append(Document(page_content=content, metadata={"source": str(file)}))
                except Exception as e:
                    logging.error(f"Error reading {file}: {e}")
    elif path.is_file() and path.suffix in extensions:
        logging.info(f"Processing file: {path}")
        try:
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                content = f.read()
                # Preprocess the code
                content = preprocess_code(content, path.suffix)
                # Append as a Document
                documents.append(Document(page_content=content, metadata={"source": str(path)}))
        except Exception as e:
            logging.error(f"Error reading {path}: {e}")
    else:
        logging.warning(f"Path {path} is neither a directory nor a supported file.")
    return documents

def count_tokens(chunks: list, encoder) -> int:
    """
    Counts the total number of tokens in all chunks using the specified encoder.
    """
    total_tokens = 0
    for chunk in chunks:
        tokens = encoder.encode(chunk.page_content)
        total_tokens += len(tokens)
    return total_tokens

def main():
    # Step 1: Load code documents
    logging.info("Loading code files...")
    documents = load_code_files(codebase_path)
    logging.info(f"Loaded {len(documents)} documents.")

    # Step 2: Split into chunks
    logging.info("Splitting documents into chunks...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,       # Adjust based on your needs
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_documents(documents)
    logging.info(f"Created {len(chunks)} chunks.")

    # Step 3: Count tokens in all chunks
    logging.info("Counting tokens in chunks...")
    total_tokens = count_tokens(chunks, encoder)
    logging.info(f"Total tokens consumed: {total_tokens}")

    # Step 4: Create vector store
    logging.info("Creating vector store...")
    vector_store = FAISS.from_documents(chunks, embeddings)
    logging.info("Vector store created.")

    # Step 5: Save the vector store to disk
    index_path = "faiss_index.pkl"
    vector_store.save_local(index_path)
    logging.info(f"Vector store saved to {index_path}.")

if __name__ == "__main__":
    main()

3. Script Breakdown

Environment Variables and API Initialization:
- The script loads environment variables, retrieves the OpenAI API key, and initializes the embedding model using OpenAIEmbeddings.
Preprocessing Function (preprocess_code):
- Removes comments and excessive whitespace based on the file extension, enhancing the quality of the embeddings.
Loading and Processing Code Files (load_code_files):
- Recursively traverses the specified codebase directory, processes supported file types (e.g., .js), and appends them as Document instances.
Token Counting (count_tokens):
- Utilizes the tiktoken library to accurately count the number of tokens in each chunk, aiding in tracking API usage.
Vector Store Creation:
- Splits documents into chunks using RecursiveCharacterTextSplitter to manage large files effectively.
- Generates embeddings for each chunk and stores them in a FAISS vector store for efficient retrieval.
Logging:
- Implements Python’s built-in logging module for informative and structured logs, replacing simple print statements.

4. **Running the Script**
Ensure that your codebase is placed inside the codebase directory. Then, execute the script:

python create_rag_vectors.py

Expected Output:
2024-04-27 12:00:00 - INFO - Loading code files... 2024-04-27 12:00:01 - INFO - Processing file: codebase/src/app.js 2024-04-27 12:00:02 - INFO - Processing file: codebase/src/utils/helpers.js 2024-04-27 12:00:03 - INFO - Processing file: codebase/tests/test_app.js 2024-04-27 12:00:04 - INFO - Loaded 3 documents. 2024-04-27 12:00:04 - INFO - Splitting documents into chunks... 2024-04-27 12:00:05 - INFO - Created 5 chunks. 2024-04-27 12:00:05 - INFO - Counting tokens in chunks... 2024-04-27 12:00:06 - INFO - Total tokens consumed: 3500 2024-04-27 12:00:06 - INFO - Creating vector store... 2024-04-27 12:00:07 - INFO - Vector store created. 2024-04-27 12:00:07 - INFO - Vector store saved to faiss_index.pkl

Note: The timestamps and counts will vary based on your actual codebase.

Querying the RAG System: Retrieving Information and Solutions

With the vector store created and saved, the next step is to build a querying mechanism that allows you to interact with your codebase using natural language. This enables you to ask questions or seek solutions related to your code, and the RAG system will retrieve and generate relevant responses.

1. Script Overview
We'll create another Python script named query_rag.py that performs the following tasks:

Load the Vector Store: Retrieve the FAISS index from disk.
Initialize the Chat Model: Use OpenAI's chat models to generate responses.
Set Up the RetrievalQA Chain: Combine retrieval and generation for coherent answers.
Interact with the User: Accept user queries and provide responses based on the codebase.

2. The query_rag.py Script
Below is the complete script with detailed explanations embedded as comments.


import os
from dotenv import load_dotenv
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
import logging
import tiktoken

# Configure logging for better visibility and control
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Load environment variables from .env file
load_dotenv()

# Retrieve API key
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

# Initialize embedding model
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# Path to the saved FAISS index
index_path = "faiss_index.pkl"

# Initialize the tokenizer
encoder = tiktoken.get_encoding("cl100k_base")

def load_vector_store(index_path: str, embeddings) -> FAISS:
    """
    Loads the FAISS vector store from the specified index path.
    """
    logging.info("Loading vector store from disk...")
    vector_store = FAISS.load_local(index_path, embeddings)
    logging.info("Vector store loaded successfully.")
    return vector_store

def initialize_qa_chain(vector_store: FAISS) -> RetrievalQA:
    """
    Initializes the RetrievalQA chain with the vector store and chat model.
    """
    chat_model = ChatOpenAI(
        openai_api_key=api_key,
        temperature=0.2,  # Lower temperature for more deterministic responses
        model_name="gpt-4"  # Specify the model as needed
    )
    qa_chain = RetrievalQA.from_chain_type(
        llm=chat_model,
        chain_type="stuff",  # Simple chain type; explore others as needed
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    return qa_chain

def main():
    # Step 1: Load the vector store
    vector_store = load_vector_store(index_path, embeddings)

    # Step 2: Initialize the RetrievalQA chain
    qa_chain = initialize_qa_chain(vector_store)

    logging.info("RAG system is ready. You can start querying.")
    logging.info("Type 'exit' to terminate the session.")

    while True:
        try:
            # Accept user input
            query = input("\nEnter your query: ")
            if query.lower() in ["exit", "quit"]:
                logging.info("Exiting the RAG system. Goodbye!")
                break

            # Get the response from the QA chain
            response = qa_chain({"query": query})

            # Display the answer
            print("\nAnswer:")
            print(response["result"])

            # Optionally, display source documents
            if response.get("source_documents"):
                print("\nSource Documents:")
                for doc in response["source_documents"]:
                    print(f"- {doc.metadata['source']}")

        except KeyboardInterrupt:
            logging.info("Interrupted by user. Exiting...")
            break
        except Exception as e:
            logging.error(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

3. Script Breakdown

Environment Variables and API Initialization:
- The script loads environment variables, retrieves the OpenAI API key, and initializes the embedding model using OpenAIEmbeddings.
Loading the Vector Store (load_vector_store):
- Retrieves the FAISS index from disk, enabling the system to access the preprocessed and embedded codebase.
Initializing the RetrievalQA Chain (initialize_qa_chain):
- Combines the vector store with OpenAI's chat models to create a QA system capable of generating contextually relevant answers based on retrieved information.
Interactive Query Loop:
- The script enters a loop where it continuously accepts user queries.
- For each query, it retrieves relevant information from the vector store and generates a coherent response.
- The user can exit the loop by typing 'exit' or 'quit'.
Logging and Error Handling:
- Utilizes the logging module to provide informative messages and handle potential errors gracefully.

4. Running the Script

Execute the script to start interacting with your RAG system:

python query_rag.py

Sample Interaction:


2024-04-27 12:10:00 - INFO - Loading vector store from disk...
2024-04-27 12:10:01 - INFO - Vector store loaded successfully.
2024-04-27 12:10:01 - INFO - RAG system is ready. You can start querying.
2024-04-27 12:10:01 - INFO - Type 'exit' to terminate the session.

Enter your query: How does the authentication mechanism work in app.js?

Answer:
The authentication mechanism in `app.js` is implemented using JSON Web Tokens (JWT). When a user logs in, the server generates a JWT that contains the user's information and sends it back to the client. The client then includes this token in the headers of subsequent requests to authenticate and authorize access to protected routes.

Source Documents:
- codebase/src/app.js
Note: The above interaction assumes that your app.js file contains relevant authentication logic. The response is generated based on the information embedded in the FAISS index.

Conclusion and Future Enhancements
You've successfully set up a Retrieval-Augmented Generation (RAG) system tailored for your codebase. This system empowers you to:

Efficiently Retrieve Information: Quickly locate and understand specific parts of your codebase without manual searching.
Streamline Debugging: Obtain solutions and insights related to code errors based on existing code patterns.
Enhance Documentation and Knowledge Sharing: Facilitate better collaboration and knowledge retention within your development team.

Future Enhancements:

Modify the extensions list in create_rag_vectors.py to include other programming languages or file types relevant to your project.
For larger codebases, implement parallel processing or batch token counting to improve performance.
Develop a web-based or GUI interface for more user-friendly interactions.
Integrate with chat platforms like Slack or Microsoft Teams for seamless accessibility.
Automate the RAG vector creation process to run on code commits or merges, ensuring the vector store remains up-to-date.
Restrict access to the RAG system based on user roles to maintain codebase security.
Explore different retrieval strategies or vector store optimizations to enhance response accuracy and speed.

We demonstrated a simple RAG system to give you an basic understanding of how it works. By continuously improving and expanding your RAG system, you can make your development process smoother, making it easier to manage code and fix errors.

Top comments (2)

SeongMin Kim • Dec 16 '24

Thanks for great Article.

I wonder : Every time we commit some changes to codebase, should we proceed "Creating RAG Vectors: Processing and Embedding the Codebase" steps ?

Rishi Kumar • Dec 17 '24

Yes, that’s right. But If you set up your vector database to focus on specific files or features, you can update only the parts that change. By monitoring your repository for updates and using an AI agent to reprocess just the modified sections, you can keep your vector database up to date without having to redo the entire codebase.

PS: in article i used very basic level example.