Sunil Kumar Dash

for Composio

Posted on Sep 3, 2024

I saved 30 hours of coding with this search tool that chats with codebases at 91% accuracy! 🤯

#ai #python #webdev #programming

TL;DR

I was recently assigned a challenging project that involved working with an existing Django repository, and I immediately realized I knew as much about Django as my pet goldfish.

You might be asking: why not use ChatGPT? The problem was that the entire codebase could not fit in ChatGPT’s prompt, and even if it could, the answers would have been highly unreliable.

So, I built an AI bot that lets you chat with any codebase with maximum accuracy using a code indexing tool.

So, here’s how I did it.

Used a code indexing tool to analyze and index the entire codebase.
Built an AI bot that accepts questions, understands the context, and retrieves relevant code chunks.
The bot then analyzes the retrieved code and answers accordingly.

The crux of this workflow is the code indexing tool, which intelligently parses an entire codebase and indexes the code in a vector database.

What is RAG?

RAG stands for Retrieval-Augmented Generation. As the name suggests, RAG involves retrieving data from knowledge sources, such as vector DBs and web pages, and then generating an answer from that data using an LLM.

The key components of a typical RAG system are:

Embedding model: A deep learning model that creates embeddings of data (text, images, etc.).
Vector DB: Stores and manages the vector embeddings.
LLM: Another deep learning model, used to generate text responses.
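To make the retrieve, augment, generate flow concrete, here is a minimal sketch in plain Python. It is only an illustration: the word-overlap `score` stands in for a real embedding model, the list of strings stands in for a vector DB, and `llm` is any callable you supply. All function names here are hypothetical.

```python
def score(question, document):
    """Toy relevance score: number of words the question and document share."""
    return len(set(question.lower().split()) & set(document.lower().split()))


def retrieve(question, documents, k=1):
    """Retrieval step: return the k documents most relevant to the question."""
    return sorted(documents, key=lambda d: score(question, d), reverse=True)[:k]


def build_prompt(question, context):
    """Augmentation step: put the retrieved text in front of the question."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


def answer(question, documents, llm):
    """Generation step: `llm` is any callable that maps a prompt to text."""
    return llm(build_prompt(question, retrieve(question, documents)))
```

In a real system, `score` would be replaced by a similarity search over embeddings, and `llm` by a call to a hosted model.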

Here is a diagram of a typical RAG workflow.

Embeddings and Vector databases

Before moving ahead, let’s get acquainted quickly with embeddings and vector databases.

Embeddings

Embeddings, or vectors, represent data (text, images, etc.) numerically in a multi-dimensional space. Deep learning models trained on millions of examples learn the relationships, or proximity, between different data points.

For example, the embedding of the term ‘Donald Trump’ will be closer to ‘US’ than to ‘China’, and the words ‘Cat’ and ‘Kitten’ will be close to each other.

Embeddings are used to calculate the semantic similarity between sentences. We can extend this concept to code as well.
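Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the numbers here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: "cat" and "kitten" point the same way, "car" does not.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

cat_kitten = cosine_similarity(vectors["cat"], vectors["kitten"])
cat_car = cosine_similarity(vectors["cat"], vectors["car"])
print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs car:    {cat_car:.3f}")
```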

Vector Databases

Traditional DBs are not suitable for managing embeddings. We need specialized DBs and algorithms to store and retrieve data. These are called vector databases.

Indexing techniques are the methods we use to organize data for search and storage in a vector database.

Vector databases use methods like HNSW and IVF for indexing and similarity search, and techniques like BM25 and hybrid search for querying.
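To give a feel for what an index like IVF does, here is a toy sketch of the idea: vectors are grouped into buckets by their nearest centroid, and a query only scans the bucket it falls into instead of every stored vector. This is a simplification under stated assumptions: the class name `TinyIVF` is hypothetical, and the centroids are fixed by hand, whereas real implementations typically learn them (for example with k-means) and may probe several buckets.

```python
import math


class TinyIVF:
    """Toy inverted-file index: one bucket of vectors per centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def _nearest_centroid(self, vec):
        # Index of the centroid with the smallest Euclidean distance to vec.
        return min(
            range(len(self.centroids)),
            key=lambda i: math.dist(vec, self.centroids[i]),
        )

    def add(self, key, vec):
        self.buckets[self._nearest_centroid(vec)].append((key, vec))

    def search(self, query, k=1):
        # Only the query's own bucket is scanned, which is what makes IVF fast
        # and also why its results are approximate.
        bucket = self.buckets[self._nearest_centroid(query)]
        ranked = sorted(bucket, key=lambda item: math.dist(query, item[1]))
        return [key for key, _ in ranked[:k]]
```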

The best part is that you do not have to worry about any of this. The CodeAnalysis tool from Composio abstracts away all the complexity.

Composio - Open-source platform for AI tools & Integrations

Here’s a quick introduction about us.

Composio is an open-source tooling infrastructure for building robust and reliable AI applications. We provide 100+ tools and integrations across industry verticals, from CRM, HRM, and Sales to Productivity, Dev, and Social Media.

We also provide local tools such as CodeAnalysis, RAG, SQL, etc.

This article discusses using the CodeAnalysis tool to index a codebase for question answering.

Please help us with a star. 🥹

It would help us to create more articles like this 💖

Star the Composio repository ⭐

How does it work?

This project explains how to build an AI tool that lets you conveniently chat with any code base.

Input Repository Path: Provide the path to a local codebase.
Code Analysis and Indexing: The tool analyzes the code using a code analysis tool and indexes it into a vector database.
Query with Prompts: After indexing, you can submit prompts or questions related to the codebase.
Retrieve and Respond: The tool fetches relevant code snippets from the database and generates responses based on the code content.

Here is an overall workflow of the project.

Technical Description

Under the hood, the AI bot receives the path string to the codebase and performs the following actions.

Generates a Fully Qualified Domain Name (FQDN) cache for code entities.
Creates an index of Python files.
Builds a vector database from chunked code for efficient searching.
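To illustrate what "chunking code" and "qualified names for code entities" can look like, here is a rough sketch using Python's standard `ast` module. It is not Composio's implementation, just one plausible way to split a Python file into function and class chunks keyed by a dotted name. The function name `chunk_python_source` is hypothetical, and the sketch only looks at definitions at module level or nested directly inside other definitions.

```python
import ast


def chunk_python_source(source, module_name):
    """Split source into function/class chunks keyed by a dotted qualified name."""
    tree = ast.parse(source)
    chunks = {}

    def visit(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}.{child.name}"
                # get_source_segment returns the exact text of the definition.
                chunks[name] = ast.get_source_segment(source, child)
                # Recurse so methods and nested functions get their own chunks.
                visit(child, name)

    visit(tree, module_name)
    return chunks
```

Each chunk could then be embedded and stored in the vector database, with the qualified name kept as metadata so answers can point back to the exact class or function.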

Tech Stack

CrewAI: For building the Agent.
Composio: CodeIndexing and CodeAnalysis tool

Let’s get started ✨

Begin by creating a Python virtual environment.

python -m venv code-search
cd code-search
source bin/activate

Now, install the following dependencies.

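pip install composio-core
pip install crewai
pip install composio-crewai
pip install python-dotenv langchain-openai

composio-core: The core Composio library, used to access the tools.
crewai: Agentic framework for building agents.
composio-crewai: CrewAI plugin for Composio.
python-dotenv and langchain-openai: Provide load_dotenv and ChatOpenAI, which the script imports.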

Set up Composio

Next, set up Composio.

composio login

You will be directed to the login page.

Once you log in, an authentication key pops up. Copy it and paste it into your terminal.

You will also need an OpenAI API key.

Next, create a .env file and add an environment variable for the OpenAI API key.

OPENAI_API_KEY=your API key

To create an OpenAI API key, go to the official site and create an API key in the dashboard.

Importing Libraries

Let’s import required libraries and modules and load environment variables.

import os
from dotenv import load_dotenv
from crewai import Agent, Task, Crew
from langchain_openai import ChatOpenAI
from composio_crewai import ComposioToolSet, Action, App

# Load environment variables
load_dotenv()

This imports the libraries and loads the environment variables from the .env file.
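A missing API key usually only surfaces later as an authentication error from the LLM call, after the slow indexing step has already run. As an optional safeguard, you could check for the key right after load_dotenv(). The helper below is a small sketch; the name `require_env` is hypothetical and not part of any library.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value
```

In the script, this could be called as `require_env("OPENAI_API_KEY")` before any indexing starts.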

Defining helper functions

In this section, we will define three helper functions.

get_repo_path(): Prompts the user for a valid repository path.
create_composio_toolset(repo_path): Creates a ComposioToolSet instance for accessing tools.
create_agent(tools, llm): Creates a code analysis agent using CrewAI.

So, let’s take a look at the code.

get_repo_path()

def get_repo_path():
    """
    Prompt the user for a valid repository path.

    Returns:
        str: A valid directory path.
    """
    while True:
        path = input("Enter the path to the repo: ").strip()
        if os.path.isdir(path):
            return path
        print("Invalid path. Please enter a valid directory path.")

The function simply asks the user for the path to the repository directory in the terminal and keeps prompting until the path is a valid directory.
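One thing to be aware of: os.path.isdir does not expand `~` or environment variables, so an input like `~/projects/app` would be rejected by the function above. If that matters to you, a small normalization step can be applied before the check. This is an optional sketch, and `normalize_repo_path` is a hypothetical helper name.

```python
import os


def normalize_repo_path(raw):
    """Expand '~' and environment variables, then return an absolute path."""
    expanded = os.path.expanduser(os.path.expandvars(raw.strip()))
    return os.path.abspath(expanded)
```

Inside get_repo_path(), the input could be passed through this helper before calling os.path.isdir.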

create_composio_toolset(repo_path)

def create_composio_toolset(repo_path):
    """
    Create a ComposioToolSet instance using the given repository path.

    Args:
        repo_path (str): Path to the repository to analyze.

    Returns:
        ComposioToolSet: Configured ComposioToolSet instance.
    """
    return ComposioToolSet(
        metadata={
            App.CODE_ANALYSIS_TOOL: {
                "dir_to_index_path": repo_path,
            }
        }
    )

The above function returns an instance of ComposioToolSet configured for the CODE_ANALYSIS_TOOL. The tool accepts the codebase path and is responsible for creating indexes of the code files.

create_agent(tools, llm)

def create_agent(tools, llm):
    """
    Create a Code Analysis Agent with the given tools and language model.

    Args:
        tools (list): List of tools for the agent to use.
        llm (ChatOpenAI): Language model instance.

    Returns:
        Agent: Configured Code Analysis Agent.
    """
    return Agent(
        role="Code Analysis Agent",
        goal="Analyze codebase and provide insights using Code Analysis Tool",
        backstory=(
            "You are an AI agent specialized in code analysis. "
            "Your task is to use the Code Analysis Tool to extract "
            "valuable information from the given codebase and provide "
            "insightful answers to user queries."
        ),
        verbose=True,
        tools=tools,
        llm=llm,
    )

This function returns a CrewAI agent. The agent is defined with:

role: Role assigned to the agent.
goal: Final goal of the agent.
backstory: Provides additional context to the LLM for answer generation.
tools: The tools available to the agent, here the CODE_ANALYSIS_TOOL.
llm: The ChatOpenAI instance received as an argument.

Defining the main() function

Finally, let’s define the main() function.

def main():
    # Get repository path
    repo_path = get_repo_path()

    # Initialize ComposioToolSet
    composio_toolset = create_composio_toolset(repo_path)

    # create a code index for the repo.
    print("Generating FQDN for codebase, Indexing the codebase, this might take a while...")
    resp = composio_toolset.execute_action(
        action=Action.CODE_ANALYSIS_TOOL_CREATE_CODE_MAP,
        params={},
    )

    print("Indexing Result:")
    print(resp)
    print("Codebase indexed successfully.")

    # Get tools for Code Analysis
    tools = composio_toolset.get_tools(apps=[App.CODE_ANALYSIS_TOOL])

    # Initialize language model
    llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

    # Create agent
    agent = create_agent(tools, llm)

    # Get user question
    question = input("Enter your question about the codebase: ")

    # Create task
    task = Task(
        description=f"Analyze the codebase and answer the following question:\n{question}",
        agent=agent,
        expected_output="Provide a clear, concise, and informative answer to the user's question.",
    )

    # Create and execute crew
    crew = Crew(agents=[agent], tasks=[task])
    result = crew.kickoff()

    # Display analysis result
    print("\nAnalysis Result:")
    print(result)

if __name__ == "__main__":
    main()

This is what is happening in the above code.

We start by calling the get_repo_path() function, which asks for the repository directory, and then create the ComposioToolSet instance.
Next, using the CODE_ANALYSIS_TOOL_CREATE_CODE_MAP action, we create the vector index of the code files. This is the longest and most compute-intensive phase, so it may take a while. The tool crawls through the repository, intelligently chunks the code, and creates vector indexes in a vector database.
Then, we fetch the CODE_ANALYSIS_TOOL tools, create the ChatOpenAI instance, and finally create the AI agent.
In the next step, we ask the user for a question about the codebase from the terminal.
Now, we define the task; this gives the agent a purpose. The CrewAI Task is defined with:
- description: A clear description of the task.
- agent: The agent we defined earlier.
- expected_output: The expected outcome from the AI agent.
Finally, we kick off the Crew and print the result.
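As written, the script answers a single question and exits. If you would rather keep chatting with the codebase after indexing once, one option is a small question loop like the sketch below. It is only a suggestion: `chat_loop` is a hypothetical helper, and `answer` is assumed to be a function you write that wraps the Task creation and crew.kickoff() steps for one question.

```python
def chat_loop(answer, read=input, write=print):
    """Keep answering questions until the user submits a blank line or 'exit'.

    Returns the number of questions answered.
    """
    count = 0
    while True:
        question = read("Enter your question about the codebase (blank to quit): ").strip()
        if not question or question.lower() == "exit":
            return count
        write(answer(question))
        count += 1
```

The `read` and `write` parameters default to the terminal, but passing them in makes the loop easy to test without a real agent.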

Once you are done, run the Python script.

python code_qa.py

This will initiate the entire flow.

The first run will take a while, as the tool crawls and indexes the code files.

So, here is the tool in action. 👇

You can find the complete code here: Code Indexing AI tool

Thank you for reading the article.

Next Steps

In this article, you built a complete AI tool that lets you ask questions about your codebase and get answers.

If you liked the article, explore and star the Composio repository for more AI use cases.


Top comments (10)

Nevo David • Sep 3 '24

This is so useful for so many developers.

Sunil Kumar Dash • Sep 3 '24

Thank you so much, Nevo.

johnwings21 • Sep 3 '24

Looks good, I will give it a try.

Sunil Kumar Dash • Sep 4 '24

Sure John

Bonnie • Sep 3 '24

That intro made me keep reading more.

Awesome and well written article, Sunil.

Sunil Kumar Dash • Sep 4 '24

Thank you so much, @the_greatbonnie,

Finndersen • Jan 14

Really interesting! I have a couple of questions:

Where/how is the CODE_ANALYSIS_TOOL_CREATE_CODE_MAP tool implemented? Is it closed-source within the Composio platform? If so, how does it access and process the local code repository?
I see you print the resp output from the CODE_ANALYSIS_TOOL_CREATE_CODE_MAP call, however it's never provided to the agent? How does the agent/LLM get that information in its context?