Maxim Saplin

Exploring Cody - An AI Coding Assistant That Knows Your Codebase

Recently, AI coding assistants have gained popularity, promising to revolutionise the way we code (or get us all fired). The tide began in 2021 with GitHub Copilot, and ChatGPT took it by storm in late 2022.

These two products represent the two flavours of assistant:

  • Text completion plugins - you type in the IDE and get suggestions that you can accept to have a code snippet inserted at the cursor position. Examples: GitHub Copilot, Tabnine, Amazon CodeWhisperer.
  • Chats - you talk in a separate window and then copy and paste the snippet into the IDE. Examples: GitHub Copilot X, ChatGPT.

In this blog post, I will explore the capabilities of Sourcegraph's Cody, an AI coding assistant that leverages codebase understanding to provide contextualized suggestions and recommendations.

Current Constraints of AI Coding Assistants

While Copilot sparked cautious interest, it was ChatGPT that blew my mind. Being able to write runnable code in unfamiliar languages, ideate and iterate over my own code, discuss optimisation problems, and save time bootstrapping new projects or quickly fixing seldom-used parts - it felt like true magic!

Yet the disappointment didn't take long: I found myself using AI for coding less and less often. I can outline two major reasons:

  1. Crisis of faith. Too often LLM hallucinations and inaccuracies derailed the process: catching tricky bugs at runtime, spending more time debugging. After all, reading code you didn't write requires more effort.
    • IMO writing code is also a learning process - it gives you time and depth. Generating code takes that away.
  2. Too much work assembling relevant context. When defining a problem for ChatGPT you must invest focus and effort into writing out exhaustively what it is you want, making sure you provide everything that's needed, copying and pasting code snippets from various parts of the solution.

With the first part there's little that can be done (at least for now) - anyone using AI should get used to extensively reviewing its output and build up that skill.

For the second part, the task looks like a puzzle rather than a mystery - something that can be solved with the right tooling.

I even tried creating my own AI plugin for VSCode, which partly solved the second problem (it tries to smartly prime the OpenAI model with surrounding code and plug the output directly into the IDE).
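
The idea behind that plugin is simple enough to sketch. Below is a minimal, hypothetical illustration of the "prime the model with surrounding code" approach (not the actual plugin's code, and written in Dart rather than a VSCode extension's TypeScript): take a window of lines around the cursor and prepend them to the user's instruction before sending the request.

```dart
import 'dart:math' as math;

/// Hypothetical sketch: build a prompt by pairing the user's instruction
/// with the code surrounding the cursor, so the model gets local context.
String buildPrompt(List<String> fileLines, int cursorLine, String instruction,
    {int window = 30}) {
  final start = math.max(0, cursorLine - window);
  final end = math.min(fileLines.length, cursorLine + window);
  final surroundingCode = fileLines.sublist(start, end).join('\n');
  return 'Here is the code around my cursor:\n'
      '$surroundingCode\n\n'
      'Task: $instruction';
}

void main() {
  final lines = List.generate(100, (i) => 'line ${i + 1}');
  print(buildPrompt(lines, 42, 'Add null checks to this method'));
}
```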

Still, a solution that lets a Large Language Model easily navigate the entire code base, gives it an understanding of the code structure, locates relevant dependencies and assembles a good, exhaustive context for the given task... That seemed like the next level - one that could bring my trust in AI coding back and increase my usage of AI tooling.

Codebase-aware Assistants

The first tool that specifically pitched its superpowers via solution-wide scope was Replit. Yet the feature is only available as part of their online IDE, and there are no plugins for your IDE of choice.

Hence the second tool.

Cody

... is an AI coding assistant that writes code and answers questions for you by reading your entire codebase and the code graph.
Cody uses a combination of Sourcegraph's code graph and Large Language Models (LLMs) to eliminate toil and keep human devs in flow. You can think of Cody as your coding assistant who has read through all the code in open source, all the questions on StackOverflow, and your own entire codebase, and is always there to answer questions you might have or suggest ways of doing something based on prior knowledge.

To start using the assistant you need to download the desktop app, log in and point it to a repository folder. From that point you can start "chatting to your code" via the UI. For closer integration there are IDE plugins, which use the Cody app as a local server executing the requests. I used the one for VSCode (Cody AI).

My Trials

I tested Cody on two different Flutter repositories: a small, freshly created project (https://github.com/maxim-saplin/ambilytics) and a larger, longer-lived project (https://github.com/maxim-saplin/data_table_2). My trials aimed to assess Cody's ability to help with routine, day-to-day tasks.

First Impressions

After I pointed Cody to one of my GitHub repos, it could navigate the full directory structure and answer questions about the overall codebase. However, it still exhibited the typical LLM limitations around factual consistency that I've seen with ChatGPT.


When I asked it to list all the .dart files, it omitted the test files located in a separate /tests folder. When I then asked whether there were any tests, it was able to list those files (again, a pattern I've noticed in ChatGPT, where a follow-up question gets the LLM to correct its previous error). Lastly, when I asked it to clarify this inconsistency, it gave a nonsensical response.


So while Cody scans the full codebase, it doesn't necessarily develop a coherent understanding of it.

Task 1: Import Missing Dependencies 🍊

I started with the smaller repo. My first task was to add multiple missing dependencies for the currently open .dart file (copied and pasted into the solution folder) and update the pubspec.yaml file (think package.json). It was a partial success: most dependencies were identified, some incorrectly (flutter_test instead of test). Cody correctly identified pubspec.yaml as the file to be changed (even though it was not open in the editor). I had to insert the dependencies manually (no automation). Moreover, the suggested versions were outdated (stale training data and no connection to the package manager).
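
To make the flutter_test vs test mix-up concrete, here is a minimal, hypothetical example of the distinction (not the actual ambilytics code; the version number is a placeholder): plain Dart unit tests only need the standalone test package, while flutter_test is the Flutter SDK's widget-testing package and is declared differently in pubspec.yaml.

```dart
// A plain Dart unit test needs the standalone `test` package, declared in
// pubspec.yaml under dev_dependencies, e.g. `test: ^1.24.0` (placeholder
// version - check pub.dev, since Cody's suggested versions were stale).
//
// `flutter_test` is only needed for widget tests and is declared as:
//   dev_dependencies:
//     flutter_test:
//       sdk: flutter
import 'package:test/test.dart';

void main() {
  test('runs without the Flutter SDK', () {
    expect(1 + 1, 2);
  });
}
```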


Task 2: Generate Unit Tests 🍅

My second task was to generate unit tests from scratch using Cody's Recipes tab. Unfortunately, the tests it produced wouldn't compile - they referenced non-existent variables and imported the wrong testing packages. When I asked why Cody suggested inserting test-specific code into production code, it offered to use mocks.

Later, when I wrote these tests myself (with no help from AI), I realized it required some upfront design and refactoring to make the code testable. Cody's strictly bottom-up approach therefore resulted in low-quality output, whereas a more experienced developer would start top-down, with a testing strategy.
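
For illustration, the kind of refactoring I mean is plain dependency injection - a minimal sketch (hypothetical classes, not the actual ambilytics code) showing how test-specific code stays out of production:

```dart
import 'package:test/test.dart';

// Production code depends on an abstraction...
abstract class AnalyticsSink {
  void sendEvent(String name);
}

class EventLogger {
  EventLogger(this.sink); // the real analytics client is injected at runtime
  final AnalyticsSink sink;

  void logButtonTap(String buttonId) => sink.sendEvent('tap_$buttonId');
}

// ...so the test can pass a hand-rolled fake, with no test code
// leaking into the production classes.
class FakeSink implements AnalyticsSink {
  final events = <String>[];
  @override
  void sendEvent(String name) => events.add(name);
}

void main() {
  test('logButtonTap forwards a prefixed event to the sink', () {
    final sink = FakeSink();
    EventLogger(sink).logButtonTap('buy');
    expect(sink.events, ['tap_buy']);
  });
}
```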

That looks like a failure to me.


Task 3: Code Smell Recipe 🍊

Cody did better at finding code smells - it flagged some unused imports and long methods for me to clean up. This kind of task plays more to the LLM's strengths since it's just pattern matching on the existing code rather than synthesizing brand new code.

The suggestions were helpful, though not integrated into my IDE. I had to manually scan the transcript, search for each occurrence in the file and apply each recommendation, rather than just clicking on a suggestion and jumping to the right place.

Again, a minimal productivity boost, yet the task got done - a partial success in my book.


Task 4: Code Completion 🍅

Unfortunately, it was slow to the point of being unusable compared to CodeWhisperer. I didn't even notice the feature right away because it couldn't keep up with my typing and auto suggestions were hard to come by. Failure.

Task 5: Fixing Failing Unit Test 🍅

For my fifth task I switched back to the larger repository and asked Cody to fix a failing unit test.

I copied and pasted the error message from the failing test. My expectation was that the LLM would review both the class being tested and the unit test and gain a deep understanding of both parts.

The root cause: the test expected the wrong icons (the correct icons can be found in a private class at the bottom of the same file).

Cody failed, giving only generic directions and no fixed code snippet.
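
For context, the fix I was hoping for is essentially a one-line change to the test's expectation. An illustrative sketch of its shape (not the real data_table_2 test - the widget builder and icon names here are placeholders):

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

// Hypothetical stand-in for the real widget under test.
Widget buildTestTable() => MaterialApp(home: const Icon(Icons.arrow_drop_up));

void main() {
  testWidgets('sort arrow shows the icon the widget actually renders',
      (tester) async {
    await tester.pumpWidget(buildTestTable());

    // Before (failing): the test expected a different icon, e.g.
    //   expect(find.byIcon(Icons.arrow_upward), findsOneWidget);
    // After: assert on the icon defined in the widget's own (private) class.
    expect(find.byIcon(Icons.arrow_drop_up), findsOneWidget);
  });
}
```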


Task 6: Generating Simple Unit Test 🍅

My final task involved generating a simple unit test. Cody failed to find the correct context, and the tests it created didn't make sense and tested nothing. Failure.


Test Results

🍏 Success: 0/6
🍊 Partial success: 2/6
🍅 Failure: 4/6

Conclusion: Cody fails to impress.

Cody's code-structure awareness adds little value. It can't understand the codebase well, pick up local practices, or find relevant dependencies.

First of all, code awareness didn't give it any edge. I expected better and more work done with less effort on my end (explaining and prepping), but it felt like I was still using ChatGPT with IDE integration, and there was little to no code awareness in our dialogs.

Secondly, I expected more automation, with the tool making changes across multiple parts of the solution rather than me copy-and-pasting everywhere. While no such promise was made, this is the intuitive "next big thing" feature that would bring AI coding assistants closer to real developers. Not yet.

Top comments (6)

Deepak Kumar

Hey @maximsaplin, this is great feedback. I'm not a Dart user, but I tried a few of your tests here and I think I got the expected results. When asked to find the generated tests, Cody listed the files where tests are present. I also tried generating the unit test with the recipes and I feel I got a correct result.

My take on using an AI coding assistant is that if you provide it better prompts, you are going to get better results. It's like searching on Google: if your search query is not precise, it will show you something else. And agreed, it's not always 100% CORRECT, but if it's 75% CORRECT, it is still improving developers' productivity.

(screenshots attached)

Maxim Saplin

Hi Kumar, thanks for the comment. Which language did you try Cody with?

Regarding good prompts being key to good results, I don't completely agree. Firstly, in many tests I used the Recipes tab (with Cody's precooked prompts).

Secondly, a good prompt is a big prompt, because apparently it must give enough context - and that's time-consuming. After all, it's the job of a good tool to manage the context for you (and save you time and focus).

Thirdly, if we speak of a next-level coding assistant, prompt engineering (as a technique, trickery, magic or the routine work of copy-pasting) must become a thing of the past, with natural language being enough.

Deepak Kumar • Edited

Hey Maxim, I have been using Cody for a while with my projects which are based on Python, TypeScript and JS.

With the Recipes thing you mentioned, were the generated results 100% incorrect? When I use it to generate my tests, most of the time it produces 90-95% CORRECT results, and the rest I need to change. So that's where it saves my time and makes me productive.

I won't say that a good prompt is a big prompt - if you look at the screenshots I shared asking about the test files, I used just a one-liner prompt with enough context about the results I wanted.

Like, I won't say it's a next-level coding assistant, but it still stands at a good point for now: something right in your IDE that you can ask questions, get help resolving issues, and much more with its other features.

Maxim Saplin

Oh, it's just now that I noticed your screenshots.

  1. The test doesn't make sense. First of all, it doesn't have a valid description of what it does. Secondly, judging by the look of the produced code, it won't compile (controller._attahc() is an attempt to access a private member). Lastly, even if you got it to compile, it still wouldn't make any sense.

  2. Same result as I got when I specifically asked for tests. No test files were listed when I asked to list all files.

Speaking of prompts, apparently behind the scenes Cody is cooking up a nice fat prompt with proper context and relevant info.

Deepak Kumar

Why don't the tests make any sense? I'm not a Dart user, so I can't say whether the tests are correct or not, but I would like to hear from you.

It did list out the test files when asked, I think. Not all of them, but the ones it listed were CORRECT, and it also gave some description of the test files.

Yeah, I know that behind the scenes it builds a prompt with the required context, but that's how any AI application on top of LLM models is made, isn't it?

Maxim Saplin

A test is supposed to validate some meaningful behavior, and the code produced does nothing like that. Specifically, the controller's job is to manage the widget's state, and it has plenty of methods for doing so, such as changing the page number, going to a specific row, etc. The generated code is just a random, non-compilable Dart snippet with no purpose.
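
To illustrate what "meaningful" means here, a minimal sketch of a controller test - using a made-up paging controller rather than data_table_2's real API, so the point doesn't hinge on package internals: call a public method, then assert on the observable state it promises to change.

```dart
import 'package:test/test.dart';

/// Made-up stand-in for a paging controller (not data_table_2's real API).
class PagingController {
  PagingController({this.rowsPerPage = 10});

  final int rowsPerPage;
  int currentPage = 0;

  /// Jump to the page that contains the given row.
  void goToRow(int row) => currentPage = row ~/ rowsPerPage;
}

void main() {
  test('goToRow moves the controller to the page containing that row', () {
    final controller = PagingController(rowsPerPage: 10);

    controller.goToRow(25);

    // The observable state changed the way the method promises.
    expect(controller.currentPage, 2);
  });
}
```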