Debugging large code bases with ChromaDB and Langchain

Shannon Lal

Over the last week, I've been diving back into Langchain for an upcoming project. While working through some code, I hit an edge case that stumped me. My first instinct was to turn to Anthropic's Claude and OpenAI's GPT-4 for help, but their suggestions didn't quite cut it. Frustrated, I turned to the usual suspects - Google and StackOverflow - but came up empty-handed there too.

I started digging into Langchain's source code and I managed to pinpoint the exact line throwing the error, but understanding why my code was triggering it remained a mystery. At this point, I'd normally fire up the debugger and start stepping through the code line by line. But then a thought struck me: what if I could leverage the power of Large Language Models (LLMs) to analyze the entire Langchain codebase? I was curious to see if I could load the source code into Claude and get it to help me solve my problem, combining the LLM's vast knowledge with the specific context of Langchain's internals.

To do this, I needed to do the following using Langchain:

  1. Connect to the Langchain GitHub repository
  2. Download and chunk all the Python files
  3. Store the chunks in a Chroma vector database
  4. Create an agent to query this database

Here is the code I used to download the source files and store the results in ChromaDB:

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import GithubFileLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

# Load environment variables from .env file
load_dotenv()

# Step 1: Get GitHub access token and repo from .env
ACCESS_TOKEN = os.getenv("GITHUB_TOKEN")
REPO = "langchain-ai/langchain"

# Step 2: Initialize the GithubFileLoader
loader = GithubFileLoader(
    repo=REPO,
    access_token=ACCESS_TOKEN,
    github_api_url="https://api.github.com",
    branch="master",
    file_filter=lambda file_path: file_path.endswith(
        ".py"
    )
)

# Step 3: Load all documents
documents = loader.load()

# Step 4: Split the documents into ~1000-character chunks for embedding
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Step 5: Initialize the vector store
# disallowed_special=() stops tiktoken from raising on special tokens
# (e.g. "<|endoftext|>") that can appear in source files
embeddings = OpenAIEmbeddings(disallowed_special=())
vectorstore = Chroma.from_documents(
    texts,
    embeddings,
    persist_directory="./chroma_db",
    collection_name="lang-chain",
)

vectorstore.persist()

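Note that the script above expects a couple of environment variables in the .env file. The variable names below are the ones the script and libraries read (GITHUB_TOKEN by the script, OPENAI_API_KEY by OpenAIEmbeddings, ANTHROPIC_API_KEY by ChatAnthropic later on); the values are placeholders:

# .env (values are placeholders)
GITHUB_TOKEN=ghp_your_personal_access_token
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key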

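Before wiring up a chain, it helps to sanity-check what was stored. Here is a minimal sketch that reloads the persisted database and runs a plain similarity search; the query string is just an example:

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(disallowed_special=())
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="lang-chain",
)

# Example query; swap in something related to your actual error
docs = vectorstore.similarity_search("CharacterTextSplitter chunk_overlap", k=5)
for doc in docs:
    print(doc.metadata.get("source"), doc.page_content[:100])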
The following code shows how I created a simple Langchain chain to query the code:



from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Load API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY) from .env
load_dotenv()

# Initialize embeddings and load the persisted Chroma database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="lang-chain",
)

# Create a retriever; k=200 pulls many chunks into the prompt, which
# Claude 3.5 Sonnet's 200k-token context window can accommodate
retriever = vectorstore.as_retriever(search_kwargs={"k": 200})

# Initialize the language model
llm = ChatAnthropic(
    model="claude-3-5-sonnet-20240620",
    temperature=0.5,
    max_tokens=4000,
    top_p=0.9,
    max_retries=2
)


messages = [
    ("system", """TODO: Put in your specific system details.

Use the following Langchain source code as context when answering:
{context}"""),
    ("human", "{question}"),
]


prompt = ChatPromptTemplate.from_messages(messages)

def format_docs(docs):
    # Join the retrieved code chunks into a single string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

# Define the chain
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

question = "The question you want to ask to help debug your code"
result = chain.invoke(question)

By downloading and storing the entire Langchain codebase in a vector database, we can now automatically include relevant code snippets in our prompts to answer specific questions. This approach leverages ChromaDB, which lets us store the code locally and use collections to manage different codebases or branches, as sketched below. The result is a powerful way to contextualize our queries and get more accurate, code-specific responses from LLMs.
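As a sketch of that idea, separate collections can live side by side in the same persisted directory; the second collection name here is hypothetical:

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(disallowed_special=())

# The collection created earlier for the Langchain codebase
langchain_code = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="lang-chain",
)

# Hypothetical second collection for another codebase or branch
legacy_app = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="legacy-app-main",
)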

While this technique proved effective in solving my Langchain issue, it's worth noting that it took about 5-6 iterations of prompt refinement to reach a solution. Although it required some effort, this approach ultimately unblocked my progress and allowed me to move forward with my project. The key to success lies in crafting well-structured prompts with relevant context, which is crucial for obtaining useful responses from the LLM.

While I applied this method to Langchain, it's a versatile technique that could be used with any repository, especially legacy codebases. Reflecting on past experiences where I've inherited complex, poorly documented systems, a tool like this would have significantly accelerated the process of understanding, fixing, and refactoring existing code. This approach is a valuable addition to a developer's toolkit, particularly when dealing with large, complex codebases.
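For reference, here is a hypothetical system message in the spirit of what that refinement converged on (illustrative, not my exact prompt); it would slot into the messages list above:

messages = [
    ("system", """You are an expert on the Langchain codebase. Answer using
only the source code provided in the context below; if the answer is not in
the context, say so and suggest where in the repository to look next.

Context:
{context}"""),
    ("human", "{question}"),
]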

Top comments (1)

Igor Hut

Great idea! Thank you for sharing this one with the rest of us. Have you tried using chat.langchain.com/? Do you have any idea what the cost of embedding the whole LC code base was?