Golevka
From 2 A.M. Frustrations to Smarter Repositories: How I Built an AI Assistant for GitHub

Preface

As a software developer, I’ve always found the process of finding answers on GitHub frustratingly inefficient. The familiar drill: spending hours scouring search engines or jumping between similar GitHub Issues, hoping to find a solution. And even when you finally give up and create a new issue, the waiting game begins — anywhere from half a day to several days before getting a maintainer’s response. I’m sure many developers can relate to this pain point.

2023 was my “AI awakening year.” Like many developers, I dove headfirst into using ChatGPT for everyday coding challenges. At first, it felt like magic: a well-crafted prompt could unlock just the answer I needed. But as the honeymoon phase ended, I started noticing cracks. ChatGPT, while confident, often strayed into outright nonsense — especially when it came to niche open-source projects or specific framework quirks. It was like asking your overly enthusiastic coworker who pretends to know everything. We all know that guy, right?

You might be thinking what I thought next: wouldn’t it be amazing if every GitHub repository had its own “AI butler”? Imagine having an AI assistant that understands the complete context of your repository, ready to answer questions at any time. Not just a general-purpose ChatGPT, but a custom-tailored intelligent assistant that knows every corner of your codebase and understands the history behind every issue. This idea sparked something in me, and I decided to turn it into reality…

Building the GitHub Assistant Prototype

While researching existing solutions, I discovered that many developers were already pushing the boundaries of large language models. OpenAI's GPTs, for instance, let models call functions to fetch private, real-time data. Products like GitHub Copilot have successfully implemented complex conversational AI systems, popularizing the concept of intelligent agents. As I dug deeper into these implementations, I realized that building an agent might be simpler than I initially thought: by using a large language model as the core and combining it with various GitHub tools, we could create a straightforward yet reliable GitHub Agent.
A Simple GitHub Agent Diagram

Eager to test the feasibility of this idea, I quickly set up a prototype. I started with a ChatGPT account to access the language model service and prepared a Python development environment. After studying various community implementations, I decided to leverage LangChain’s toolkit for building the agent — it seemed like the perfect foundation for what I had in mind.

Following the design shown above, our first requirement was a stable LLM hub that could not only provide model services directly but also integrate with various tools. The system prompt proved crucial in enabling the model to select appropriate tools based on user queries and generate accurate responses. I started by defining the model’s role as a repository assistant:

You are a skilled assistant dedicated to {repo_name}, capable of delivering comprehensive insights and solutions pertaining to {repo_name}. You excel in fixing code issues correlated with {repo_name}.

The system prompt plays a critical role in the model’s reasoning process. Through carefully crafted prompts, I could explicitly define the agent’s capabilities, ensuring these skills would be applied consistently when processing user inputs. Through experimentation, I discovered that these skill definitions needed to strike a delicate balance — too specific, and they would constrain the model’s capabilities; too broad, and they wouldn’t effectively guide tool usage. After careful consideration, I started with defining the agent’s interaction capabilities:

### Skill 1: Engaging Interaction
Your primary role involves engaging with users, offering them in-depth responses to their {repo_name} inquiries in a conversational fashion.

Next, I wanted to enable the assistant to search for relevant information both on the internet and within the repository, broadening its knowledge base and improving the relevance and accuracy of its responses.

### Skill 2: Insightful Information Search
For queries that touch upon unfamiliar zones, you are equipped with the following knowledge lookup tools, used to gather necessary details:
   - search_knowledge: This is your initial resource for queries concerning ambiguous topics about {repo_name}. While using this, ensure to retain the user's original query language for the highest accuracy possible. Therefore, a specific question like '{repo_name} 的特性是什么?' should be searched as '{repo_name} 的特性是什么?'.
   - tavily_search_results_json: Should search_knowledge fail to accommodate the required facts, this tool would be the next step.
   - search_repo: This tool is used to retrieve basic information about a GitHub repository, including star count, fork count, and commit count.

Finally, I empowered the Agent to make intelligent decisions about whether to provide direct answers or guide users toward creating an issue.

### Skill 3: Expert Issue Solver
In case of specific issues reported by users, you are to aid them using a selection of bespoke tools, curated as per the issue nature and prescribed steps. The common instances cater to:
   - Routine engagement with the user.
   - Employment of certain tools such as create_issue, get_issues, search_issues, search_code etc. when the user is facing a specific hurdle.

If you directly ask an LLM about a repository’s star count, it typically can’t provide accurate numbers. To solve this, I created a tool — essentially a function — that allows the GitHub Assistant to fetch and return precise repository metrics:

import json

from github import Github

g = Github()  # pass an access token here to raise the API rate limit

def search_repo(repo_name):
    """
    Get basic information of a GitHub repository including star count,
    fork count, and commit count.

    :param repo_name: Name of the repository in the format 'owner/repo'
    :return: A JSON string with basic repo information.
    """
    repo = g.get_repo(repo_name)
    return json.dumps({
        "full_name": repo.full_name,
        "stars": repo.stargazers_count,
        "forks": repo.forks_count,
        "commits": repo.get_commits().totalCount,
    })

With the prompts, language model service, and GitHub tools in place, I used LangChain’s AgentExecutor and related methods to combine these components into a basic GitHub Assistant prototype. The implementation converts user queries into OpenAI-compatible message formats before passing them to the Assistant for processing. The complete implementation details are available in our GitHub repository for those interested in the technical specifics.
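The wiring between the model and its tools can be illustrated with a toy dispatch registry. This is a simplified sketch of the idea, not LangChain's actual AgentExecutor; the tool body here is a placeholder rather than a real GitHub API call:

```python
# Hypothetical minimal tool registry; in practice LangChain's AgentExecutor
# performs this dispatch after the model emits a function call.
TOOLS = {}

def tool(fn):
    """Register a function so the model can invoke it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def search_repo(repo_name):
    """Placeholder tool body; the real version queries the GitHub API."""
    return f"stats for {repo_name}"

def run_tool(name, **kwargs):
    """Execute the tool the model selected based on the system prompt."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

The agent loop then becomes: send the user message plus tool schemas to the model, and whenever the model responds with a tool call, run it through `run_tool` and feed the result back.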

Building a More Practical GitHub Assistant

With the basic prototype in hand, it was time to put it to work in real-world scenarios. Thanks to GitHub’s webhook functionality, which triggers notifications for common repository actions like code submissions and issue creation, I only needed to provide an HTTP endpoint to establish a connection between GitHub and my Assistant prototype. Once connected, I wanted the GitHub Assistant to excel in two key areas:

  • Code Review: The ability to analyze submitted code, checking for compliance with standards and potential bugs.
  • Issue Response: The capability to comprehend and respond to newly created issues based on their titles and content, while also participating in ongoing issue discussions.
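The webhook endpoint's first job is simply routing. A minimal sketch of that decision, using GitHub's standard webhook event and action names (the agent names are placeholders, not PeterCat's actual identifiers):

```python
def route_event(event_type, payload):
    """Map a GitHub webhook event to the agent that should handle it.

    event_type comes from the X-GitHub-Event header; "action" is part of
    the webhook payload.
    """
    action = payload.get("action")
    if event_type == "pull_request" and action in ("opened", "synchronize"):
        return "pr_review_agent"
    if event_type in ("issues", "issue_comment") and action in ("opened", "created"):
        return "issue_agent"
    return None  # ignore events neither agent cares about
```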

These two capabilities, while equally important, required different specialized skills. I decided to take a divide-and-conquer approach, creating two distinct agents based on the original prototype. Each agent would focus on its specific domain of expertise, ultimately leading to higher quality outputs.

Pull Request (PR) Review Functionality

When a user submits code to the repository, GitHub notifies the GitHub Assistant through webhooks with relevant information. At this point, the Assistant determines whether the PR review agent needs to be activated.

The PR review agent shares a similar structure with the original GitHub Assistant prototype. I first endowed it with the identity of a professional code reviewer, tasking it with evaluating code across four key dimensions: functionality, logical errors, security vulnerabilities, and major performance issues.

Character Description
You are an experienced Code Reviewer, specializing in identifying critical functional issues, logical errors, vulnerabilities, and major performance problems in Pull Requests (PRs).

To make the PR review agent function like a real engineer, I wanted it to provide both high-level PR summaries and detailed code comments. To achieve this, I developed two tools for interacting with GitHub repositories: create_pr_summary for posting general comments in the PR discussion area, and create_review_comment for adding line-specific comments in code commits. These tools correspond to the agent’s two main tasks: summarizing the overall PR and conducting detailed code reviews.

You are an AI Assistant specialized in reviewing pull requests with a focus on critical issues.
You are equipped with two tools to leave a summary and code review comments:
- create_pr_summary: Used to create a summary of the PR.
- create_review_comment: Used to leave a review comment on specific files.

Code review demands precision and thoroughness. Drawing inspiration from the chain-of-thought approach, I structured the prompts to define specific task methodologies for optimal results.

Task 1: PR Summary

For PR summaries, I emphasized the importance of following a specific markdown format while keeping the content concise. These summaries are posted to the GitHub PR comments section using the create_pr_summary tool.

## Task 1: Summarize the Pull Request
Using `create_pr_summary` tool to create PR summary.
Provide your response in markdown with the following content. Follow the user's language.
  - **Walkthrough**:  A high-level summary of the overall change instead of specific files within 80 words.
  - **Changes**: A markdown table of files and their summaries. Group files with similar changes together into a single row to save space.

Here's how it performs in a real repository: the walkthrough and changes table are posted directly to the PR comments section.

Task 2: Line-by-Line Code Review

Compared to PR summaries, reviewing user code presented a more significant challenge. Before implementing the automated review process, I needed to establish mechanisms for users to opt-out of code reviews when desired and automatically skip reviews for draft PRs.

Skip Task Whitelist
**SKIP_KEYWORDS**: A list of keywords. If any of these keywords are present in the PR title or description, the corresponding task will be skipped.
- Examples: "skip", "ignore", "wip", "merge", "[skip ci]"
- If the draft flag is set to true, the task should be skipped.
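The opt-out rule above boils down to a small predicate. A sketch, with the keyword list taken from the prompt:

```python
SKIP_KEYWORDS = ["skip", "ignore", "wip", "merge", "[skip ci]"]

def should_skip_review(title, body, is_draft):
    """Return True if a PR opts out of automated review.

    Draft PRs are always skipped; otherwise we look for any skip keyword
    in the title or description (case-insensitive).
    """
    if is_draft:
        return True
    text = f"{title} {body or ''}".lower()
    return any(keyword in text for keyword in SKIP_KEYWORDS)
```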

Next, I crafted specific instructions emphasizing the focus on logical and functional changes while ignoring formatting modifications.

Review the diff for significant errors in the updated files. Focus exclusively on logical, functional issues, or security vulnerabilities. Avoid comments on stylistic changes, minor refactors, or insignificant issues.

Most crucially, I needed to ensure the Agent could provide precise, location-specific comments within the code.

### Specific instructions:
- Take into account that you don't have access to the full code but only the code diff.
- Only comment on code that introduces potential functional or security errors.
- If no critical issues are found in the changes, do not provide any comments.

Unlike humans who can visually parse GitHub’s code review interface, enabling GitHub Assistant to perform code reviews required transforming PR code changes into a more machine-friendly format, rather than working with GitHub’s binary file representations.

Code repositories contain diverse file types, but not all warrant review. Some files have limited review value, while others are too complex for current language models to process effectively. I implemented path-based filtering to ignore certain special files, such as build artifacts, images, and project configuration files.

To enable the model to precisely reference specific lines of code, I developed tools to process each line of code, annotating them with line numbers and clearly distinguishing between code additions and deletions.
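A minimal sketch of such an annotator for unified diff hunks (the exact output format PeterCat uses may differ; this version prefixes each surviving line with its new-file line number):

```python
def annotate_diff(diff_text):
    """Annotate a unified diff hunk with new-file line numbers so the
    model can reference exact positions in review comments."""
    annotated = []
    new_line = 0
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            # Hunk header, e.g. "@@ -10,3 +12,4 @@": extract the new-file start
            new_start = line.split("+")[1].split(",")[0].split(" ")[0]
            new_line = int(new_start)
            annotated.append(line)
        elif line.startswith("+"):
            annotated.append(f"{new_line}: {line}")
            new_line += 1
        elif line.startswith("-"):
            annotated.append(f"   {line}")  # deleted lines have no new-file number
        else:
            annotated.append(f"{new_line}: {line}")  # unchanged context line
            new_line += 1
    return "\n".join(annotated)
```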


After these preprocessing steps, I package the processed content as user messages for the Agent, rather than embedding them in system prompts. In the prompts, I specify that the code has been formatted according to a specific structure, which is crucial for the model to correctly interpret code changes and conduct meaningful reviews.

### Input format
- The input format follows Github diff format with addition and subtraction of code.
- The + sign means that code has been added.
- The - sign means that code has been removed.

The PR review Agent then analyzes the code according to the prompt requirements, providing line-specific review comments using the `create_review_comment` tool to add feedback at precise locations in the PR.


Additional Optimization Tips

While it’s straightforward to make language models generate output, it’s more challenging to prevent unnecessary comments. To address this, I implemented a mechanism allowing users to skip code reviews through specific keywords in PR titles or descriptions.


To handle edge cases, I added constraints and clarifications at the end of the prompts. These additions help the Agent focus on new and modified code while avoiding comments on minor style inconsistencies, formatting issues, or changes that don’t affect functionality. Given GitHub’s international nature, I also ensured that the output language matches the language used in PR titles and comments.

# Constraints
- Strictly avoid commenting on minor style inconsistencies, formatting issues, or changes that do not impact functionality.
- Do not review files outside of the modified changeset (i.e., if a file has no diffs, it should not be reviewed).
- Only flag code changes that introduce serious problems (logical errors, security vulnerabilities, typo or functionality-breaking bugs).
- Respect the language of the PR's title and description when providing summaries and comments (e.g., English or Chinese).

Issue Handling and Discussion

Similar to the PR review functionality, GitHub Assistant can respond to issues and issue_comment events from the GitHub platform. The Issue handling Agent follows a similar implementation to the prototype Agent, with a focus on effectively utilizing provided tools such as open internet search and repository issue lookup capabilities. I included specific instructions to optimize tool usage and enhance effectiveness.

* If the found issue_number is the same as this issue_number: {issue_number}, it means no similar issues were found, You don't need to mention the issue again.
* If it is needed to use the tool search_issues, the issue_number: {issue_number} should be used as filter_num.

Compared to the original GitHub Assistant prototype, the Issue handling Agent places stronger emphasis on factual accuracy. It’s designed to provide serious, well-researched responses without making assumptions or pretending to know more than it does.

* If you don’t have any useful conclusions, use your own knowledge to assist the user as much as possible, but do not fabricate facts.
* Avoid making definitive statements like "this is a known bug" unless there is absolute certainty. Such irresponsible assumptions can be misleading.

Here's how the bot performs when handling issues in open source repositories: it searches for similar historical issues first, then composes a grounded reply in the issue thread.

Building a More Reliable GitHub Assistant

After considerable effort, I successfully integrated the GitHub Assistant prototype seamlessly with GitHub. However, real-world usage revealed its limitations: when responding to issues, it essentially just relocated the language model service without addressing its inherent weaknesses. I wanted it to have a deeper understanding of repository documentation and code, keep track of the latest code changes, and learn from historical issue discussions to provide better solutions.

Research led me to two main approaches for addressing these limitations: fine-tuning and RAG (Retrieval-Augmented Generation). Fine-tuning, which involves combining new data with the existing model through additional training, requires significant computational resources and isn’t ideal for frequently changing codebases. The RAG approach, on the other hand, not only requires fewer resources but also adapts dynamically to repository updates, making it a perfect fit for the ever-changing nature of codebases.

RAG Core Process Diagram

Implementing RAG capabilities required a two-step approach. First, I needed to vectorize valuable repository content, including code and historical issues, storing them in a vector database. Second, I had to develop knowledge retrieval tools that could use vector-based search to retrieve relevant content. These retrieved results would then be fed to the language model, enabling GitHub Assistant to provide more accurate and timely responses.

Step 1: Repository Vectorization

Given the large number of files and diverse file types in code repositories, it wouldn’t be efficient to simply feed all repository content directly into a vectorization model. Instead, I took a file-level granular approach, recursively traversing repository files and creating vectorization tasks for each. To prevent this process from blocking instance creation, I implemented AWS Lambda Functions to break down vectorization into asynchronous tasks.


GitHub File Download

Developers can use GitHub’s open APIs to fetch file contents from specific repository paths. PeterCat uses these APIs to download files from specified repository locations.

import base64

from github import Github

github = Github()  # pass an access token for private repositories

def download_file(repo_info, path, commit_id):
    """Fetch a file's content from a repository at a specific commit."""
    repo = github.get_repo(repo_info)  # e.g. "petercat-ai/petercat"
    file_content = repo.get_contents(path, ref=commit_id)
    file_sha = file_content.sha  # used later for duplicate checking
    return base64.b64decode(file_content.content).decode("utf-8")

Before performing text vectorization, I implemented SHA-based duplicate checking against the vector database. If a file’s SHA already exists in the database, we skip the vectorization process. In code repositories, besides source code files, most valuable information exists in Markdown format, such as README.md files. To minimize noise in our dataset, we exclude all non-Markdown files from processing.

Additionally, historical Issue information holds tremendous value — an area often overlooked by large language models like GPT. However, not all Issues are worth storing, as low-quality content can degrade RAG retrieval effectiveness. To address this, I implemented filtering criteria to only include closed Issues with high engagement levels.
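That filter can be sketched as a simple predicate over issue metadata. The comment threshold below is an illustrative choice, not PeterCat's actual value:

```python
def is_worth_indexing(issue, min_comments=3):
    """Keep only closed issues with enough discussion to be useful in RAG.

    `issue` is a dict with the fields GitHub's API returns; the
    min_comments threshold is a hypothetical engagement cutoff.
    """
    return issue["state"] == "closed" and issue["comments"] >= min_comments
```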

Text Vectorization

After collecting Markdown files and high-quality Issue content from the repository, the next step is vectorization. Due to input length constraints, we need to split long texts into smaller chunks based on a defined CHUNK_SIZE. When text is split into separate blocks, processing each block independently can result in lost context between blocks. To address this, we implement overlapping regions (CHUNK_OVERLAP) between blocks, ensuring that important contextual information is shared across different chunks. This overlap helps minimize the boundary effect and enables the RAG algorithm to capture transition information more accurately.

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)
docs = text_splitter.split_documents(documents)

The split text chunks are then vectorized using OpenAI’s embedding model, and the resulting vectors are stored in a Supabase database.

Step 2: Content Retrieval

When users interact with GitHub Assistant, we search for relevant content within the repository based on their input. During this process, we first vectorize the user’s input and then match it against our vector database. To facilitate this, we create a similarity search function in Supabase based on embeddings, structured like this:

-- Wrapped in a full function signature for completeness; the function name
-- and argument types are illustrative and depend on your setup.
create or replace function match_rag_docs (query_embedding vector, match_count int)
returns table (id bigint, content text, similarity float)
language plpgsql
as $$
begin
  return query
  select
    rag_docs.id,
    rag_docs.content,
    1 - (rag_docs.embedding <=> query_embedding) as similarity
  from rag_docs
  order by rag_docs.embedding <=> query_embedding
  limit match_count;
end;
$$;
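The ordering the SQL produces can be reproduced in plain Python, which also makes the `<=>` operator concrete: it is cosine distance, i.e. 1 minus cosine similarity. A small sketch (document and field names are illustrative):

```python
import math

def cosine_distance(a, b):
    """Equivalent of pgvector's `<=>` operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_matches(query_embedding, docs, k=3):
    """Rank stored chunks by ascending distance, mirroring the SQL ORDER BY."""
    ranked = sorted(docs, key=lambda d: cosine_distance(query_embedding, d["embedding"]))
    return ranked[:k]
```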

The vector-based search results aren’t always perfectly aligned with user queries. Therefore, we pass the retrieved text content through the language model for comprehension and refinement, ultimately producing responses that better match user needs. Thanks to this focused approach, GitHub Assistant can provide more specialized answers compared to direct language model outputs.

General-Purpose Bot Responses VS Domain-Specific Bot Responses

Evolving into a GitHub Assistant Factory

As the prototype matured, I started thinking about how to benefit more open-source projects. To realize the vision of providing every GitHub repository with its own specialized AI assistant, I needed to create a GitHub Assistant factory — a system where users could simply provide repository information and receive a custom-tailored GitHub Assistant in return.


Building an assistant factory required additional support, and fortunately, I work for a company that strongly encourages open source activities. Our department, AFX, has produced several successful open source projects like Ant Design, UMI, and Mako. I quickly assembled a team of talented colleagues to help build and open-source this assistant factory.

During our project naming discussions, an interesting idea emerged — Peter Cat. The name has a clever bilingual wordplay: in Chinese, “Peter” sounds like “Pi Tao” (皮套), which means “to wear a suit” or “to put on a costume.” This double meaning perfectly reflects the essence of these AI assistants: like digital avatars wearing custom-tailored suits, each uniquely designed to serve their respective repositories. The “Cat” suffix pays homage to GitHub’s iconic black cat mascot, and thus the name “PeterCat” was born.

PeterCat is a brand-new project that allows us to freely experiment with different technology stacks. Given our diverse product requirements — including a GitHub App, third-party user portals, and PeterCat’s official website — we adopted a decoupled frontend-backend architecture, connecting services and products through HTTP interfaces.


Creating AI Assistants with Freedom

We aimed to make the AI assistant creation process as simple and intuitive as possible. Users only need to input a repository URL, and PeterCat automatically generates a well-configured assistant — complete with an avatar, name, and even personality traits (defined through system prompts). In most cases, these default settings work perfectly fine.


However, we wanted to make the process even more engaging. So we developed an “assistant for creating assistants” — a specialized wizard for customizing GitHub Assistants. You can chat directly with it, telling it exactly what kind of assistant you want. For instance, you might say “I want an assistant focused on performance optimization” or “Help me create an assistant that’s good at guiding newcomers.” The best part? The preview window updates in real-time as you speak, letting you instantly see your “creation” take shape.


The implementation principles behind this are similar to the Agents mentioned earlier — in fact, this simple, composable pattern proves highly effective in practice.

One-Click Assistant Deployment to Repositories

The final step involves helping users deploy their AI assistants to actual repositories. In our early prototype, we used webhooks to connect assistants with repositories. However, to make this process truly seamless for users, we needed a more comprehensive solution.

We ultimately decided to develop a GitHub App to serve as the connection bridge. Users simply need to authorize the GitHub App like any other application, and PeterCat automatically retrieves their repository list. Then, users can select a target repository from the list on the PeterCat website after creating their GitHub Assistant, and the AI assistant is officially integrated into the repository. From that point forward, it actively participates in both Issue discussions and PR reviews.


Summary and Future Outlook

From concept to reality, PeterCat has embarked on an exciting journey. In September 2024, we officially announced the project’s open source release at the Shanghai INCLUSION·Conference on the Bund. Within just three months, it garnered over 850 stars, with 178 open source projects adopting our AI assistant. While these numbers are impressive, what truly gratifies me isn’t the growth rate, but rather each individual problem successfully solved.

For instance, in one case, our AI assistant helped a user resolve issues with the Ant Design table component through multiple rounds of dialogue, saving them hours of frustration and allowing them to focus on building features instead. User feedback like this makes all our efforts worthwhile — it’s exactly the impact we hoped to achieve: making it easier for developers to use and contribute to open source projects.


Reflecting on this project, the most profound realization has been the power of the open source community. Standing on the shoulders of numerous excellent open source projects enabled us to transform our idea into reality in such a short time. Technologies like LangChain, FastAPI, and Supabase not only provided robust support for our project but also demonstrated the democratization of technology — any developer with an idea can now access cutting-edge technical practices through the open source community.

And that’s the beauty of open source — it’s not about gatekeeping technology; it’s about building a staircase for collective progress. If you’ve ever felt this spark, why not join us? Write your own story in code, and let’s shape the future of open source, one pull request at a time.

While PeterCat is still in its early stages, it shows promising potential. As a developer, I’ve mapped out several growth directions for the next phase:

  • First, breaking down repository isolation: Currently, each AI assistant works independently, somewhat like a “frog in a well.” By implementing a multi-agent architecture, we can enable assistants to collaborate and share knowledge bases, providing more comprehensive solutions.
  • Second, enhancing code comprehension: There’s significant room for improvement in handling complex business logic and cross-file context scenarios. This requires deeper investment in code semantic understanding.
  • Third, seamless IDE integration: We plan to develop VS Code plugins, allowing developers to access assistance without switching windows, creating a more natural development experience.
  • Finally, empowering users with greater control: While RAG technology has solved many challenges, users remain the domain experts. We want to make it easier for users to manage and optimize their knowledge bases, as they best understand their project requirements.

Come chat with our friendly cat! 🐾

If you’re interested in this project, come join us in petting the cat (or rather, writing code)! Whether you want to contribute code or simply star the project, any support is the best encouragement for this AI kitty.

👉 Try it online: https://petercat.ai

🐱 GitHub Repository: https://GitHub.com/petercat-ai/petercat

☀ Learn More: https://medium.com/@petercat.assistant/we-open-sourced-a-kitty-project-petercat-4d8e06a349d5

