As a developer fascinated by artificial intelligence, I have embarked on a new journey to deepen my knowledge of NLP and the problem of code search. After a preliminary analysis of CodeBERT-based solutions, I remain skeptical as to whether a single AI model can generate embeddings that work equally well for both programming languages (PL) and natural languages (NL). The two domains differ vastly in semantics: each follows its own rules, deeply tied to its unique use cases.
In this post, I want to share some of my thoughts and findings, particularly regarding open-source tools like SeaGOAT and GitHub's approach to code search as described on their engineering blog.
A Look at SeaGOAT: Combining Simplicity with Functionality
SeaGOAT is an open-source tool written in Python that employs two “engines” for code search. The first is ripgrep, a traditional text-searching tool. In essence, SeaGOAT uses it by breaking a user's query into individual words and then retrieving every line from the repository that contains at least one of those words. The simplicity here is notable: it relies on the assumption that, for example, a function handling map rendering will likely include the word "map," which a user might also use in their search.
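To make that concrete, here is a minimal sketch of what such a keyword pass might look like; the word-splitting logic and the ripgrep flags are my own illustration, not SeaGOAT's actual implementation:

```python
import re
import subprocess

def keyword_search(query: str, repo_path: str) -> list[str]:
    """Split the query into words and return every matching line,
    in the spirit of SeaGOAT's ripgrep pass (illustrative sketch)."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    # Build an alternation so a line matches if it contains ANY query word.
    pattern = "|".join(re.escape(w) for w in words)
    result = subprocess.run(
        ["rg", "--ignore-case", "--line-number", pattern, repo_path],
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()

# Example: lines mentioning "map" or "rendering" anywhere in the repo.
matches = keyword_search("map rendering", ".")
```

Every line that mentions any query word comes back, which keeps recall high at the cost of precision.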
The second mechanism is chromadb, a database designed to store embeddings. SeaGOAT generates these embeddings with all-MiniLM-L6-v2, the default model used by chromadb. While this model performs well at generating vector representations, it wasn't trained on code, and therein lies the problem. It faces the same semantic challenge I mentioned earlier: trying to generate consistent vector embeddings for both natural and programming languages. Because of this, I chose to skip further tests with SeaGOAT and instead turn my attention to GitHub's approach.
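For reference, here is roughly what that embedding pass looks like with chromadb's default configuration; the collection name and document contents below are invented for illustration:

```python
import chromadb

# An in-memory client; chromadb's default embedding function runs
# all-MiniLM-L6-v2 (via sentence-transformers) under the hood.
client = chromadb.Client()
collection = client.create_collection(name="code_chunks")

# Index a few code fragments as plain text (contents invented here).
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "def render_map(tiles): ...",
        "def parse_config(path): ...",
    ],
)

# A natural-language query is embedded with the same model and matched
# by vector similarity against the stored chunks.
results = collection.query(query_texts=["draw the map on screen"], n_results=1)
print(results["ids"])  # expected to surface chunk-1
```

The convenience is real, but both the query and the code pass through one general-purpose text model, which is exactly the mismatch described above.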
GitHub’s Solution: A Two-Model System
GitHub is a huge commercial operation, and code search is one of the most important challenges it has had to face. Initially, GitHub's search relied on keyword matching, a straightforward approach similar to ripgrep's. Their latest work, however, presents a much more nuanced solution.
Their system uses two AI models in tandem:
The Documentation Model: This model is trained on the task of generating documentation for code. It takes programming language (PL) as input and maps it into an embedding space tied to natural language (NL).
The Search Query Model: This model is tuned to the same embedding space but works in the opposite direction. It takes natural language (NL) queries as input and generates embeddings in the same vector space as the documentation model.
The brilliance of this system lies in its duality. Both models process entirely different types of input, yet their outputs exist within the same semantic vector space. This allows for meaningful matches between user queries and code fragments, despite the inherent differences in the languages being processed.
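GitHub hasn't released these models, so purely as a thought experiment, here is a sketch of the retrieval mechanics that a shared embedding space enables; the snippets, the dimension, and the random vectors standing in for model outputs are all my own placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder vectors standing in for the two models' outputs.
# In the real system, code_vectors would come from the documentation
# model (PL -> shared space) and query_vector from the query model
# (NL -> same space); here they are random, purely to show the mechanics.
code_snippets = ["def render_map(...)", "def parse_config(...)"]
code_vectors = rng.normal(size=(2, 384))
query_vector = rng.normal(size=384)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Because both models emit vectors in one space, ranking reduces to
# nearest-neighbour search, regardless of which language produced them.
scores = [cosine_similarity(query_vector, v) for v in code_vectors]
best = code_snippets[int(np.argmax(scores))]
print(best)
```

Once both encoders target the same space, search collapses into a nearest-neighbour lookup, no matter which side of the PL/NL divide each vector came from.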
This approach, in my opinion, feels far more intuitive and semantically accurate than other solutions I’ve encountered. By allowing each model to specialize in its domain while sharing a unified embedding space, GitHub has created a system that respects the nuances of both natural and programming languages.
Closing Thoughts
The more I explore, the more I realize the depth of the code search problem. While tools like SeaGOAT offer valuable insights, the sophistication of GitHub’s solution sets a high bar for others. Their two-model approach, bridging the gap between PL and NL, seems like a step in the right direction.
As I continue my exploration, I’m eager to delve deeper into these dual-model architectures and understand how they might be adapted or extended for even more effective code search solutions.
For now, this journey remains ongoing, and I’m grateful for the learning opportunities it provides. If you’ve worked on similar problems or have insights to share, I’d love to hear your thoughts in the comments below.