What I learned in 6 months of working on a CodeGen dev tool GPT Pilot

For the past 6 months, I’ve been working on GPT Pilot (https://github.com/Pythagora-io/gpt-pilot) to understand how much we can really automate coding with AI, so I wanted to share our learnings so far and how far it’s able to go.

When I started building GPT Pilot, I wrote this blog post on how it is envisioned. Now, I want to share my revised thinking and go into what worked and what didn’t work.

Finally, you’ll see examples of apps that were created with GPT Pilot and how you can create an app with a real AI pair programmer.

What is the idea behind GPT Pilot?

Artisan thinking about his creation

GPT Pilot is envisioned as a real AI developer – not an autocomplete or a chatbot. Rather, it is a developer who creates a plan for how your app or feature should be built and then starts coding. It aims to do most of the coding by itself, but when it gets stuck, needs clarification about the given requirements, or requires a code review, it asks you for help.

Is AI like a junior developer? Or…

I often see CodeGen GPT-4-based tools that say they are building an AI junior developer. Somehow, I’ve always had a problem with that, because when I use ChatGPT for coding, it gives me answers and ideas that only a super-senior person could give – something that absolutely no junior dev would even be able to grasp. Still, no LLM can build an app nearly as well as a senior developer can, but the knowledge GPT-4 has about coding is way beyond any junior developer. I would say that GPT-4 has so much knowledge about every part of software development that it’s like the most senior developer in the world, but with the memory of a goldfish.

I picture it as a superhuman robot that just stands in the middle of a room and can only do a single small action at a time; it cannot combine many actions and work repetitively. You must tell it exactly what it should do next. This is what we’re after with GPT Pilot – we want to create a framework of thinking for the LLM that gets that superhuman robot to continuously work by revising its previous actions, having a feedback loop, and determining what it should do next in order to finish the end goal, which is to build a production-ready application.

In the blog post I mentioned above, I outlined the main pillars on which GPT Pilot was built. But these have changed a bit based on our team’s learnings, so here are the revised pillars:

  1. A human is needed to supervise the AI not only because AI is not good enough but also because you might want to change how something works or looks after it’s implemented. It’s common for a developer or product manager, once they see what an implementation looks like, to decide to change it. Or, you realize there are more edge cases than you initially anticipated and think it’s easier to refactor your current implementation than to fix every issue. The problem is when you finish the entire app and then try to refactor – this is when it becomes much harder because every change will impact all the other features. On the other hand, if you do the refactor before you commit your changes, you’ll be able to proceed with the next features on top of well-written code. This is why it’s crucial for an AI developer to have a human in the loop whenever a task is implemented. This way, the human can review the implementation of each task (just like a code review before merging a PR) before GPT Pilot continues onto the next task. If a human tells GPT Pilot what is wrong, it will be much easier to fix the issues within the task itself. At the same time, the LLM has the context of what needs to be done in the task and what has been done so far.
  2. AI can iterate over its own mistakes. I have a feeling that many people judge ChatGPT’s ability to write code by how well it delivers the first time you ask it to code something. If it doesn’t produce working code, many will think it’s not impressive. In reality, humans almost never write working code on the first try. Instead, you write code, run it, see the errors, and iterate. This is exactly what GPT Pilot enables GPT-4 to do – after it writes code, GPT Pilot can run the code, take the output, and ask the LLM if the output is correct, if something should be fixed, and if so, how (a minimal sketch of this loop follows the list below).
  3. Software development can be orchestrated. There are many repetitive routines that all developers go through when building an app. One of the routines can be – write code, run it, read the errors, change code, rerun it, etc. Another higher-level one can be – take a task, implement it, test the implementation (repeat until all tests pass), send it for review, fix the issues (repeat until the reviewer approves), and deploy. Many of these routines can be orchestrated if we have an intelligent decision-maker in the loop (like an LLM).
  4. The coding process is not a straight line. When we created the first version of GPT Pilot, we thought it would need to iterate over tasks, implement code, fix it, and move on. In reality, you don’t continuously progress when coding an app – you rewrite your code all the time. Sometimes, you refactor the codebase because, after the initial implementation, you realize there is a better way to implement something. Other times you do it because of a change in requirements. Like I mentioned in #1, after you see that a solution isn’t working, you sometimes need to roll back a bunch of changes, think about an alternative solution to the problem, and try solving it that way. To make GPT Pilot, or any other AI developer, work at scale, it needs to have a mechanism that will enable it to go back, choose an alternative path, and reimplement a task.
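
To make pillars #2 and #3 more concrete, here is a minimal sketch of that run-and-fix loop, written against the OpenAI Python client. The function names, prompts, and file handling are illustrative assumptions, not GPT Pilot’s actual internals:

```python
# Illustrative sketch of the "write code, run it, read the output, fix it" loop.
# Everything here (function names, prompts) is made up for illustration;
# it is not GPT Pilot's real code.
import subprocess
from openai import OpenAI

client = OpenAI()

def run_app(entry_point: str) -> tuple[int, str]:
    """Run the generated app and capture its combined output."""
    result = subprocess.run(
        ["python", entry_point], capture_output=True, text=True, timeout=60
    )
    return result.returncode, result.stdout + result.stderr

def debug_loop(entry_point: str, max_iterations: int = 5) -> bool:
    """Run the code, feed the output back to the LLM, and apply its fix, repeatedly."""
    for _ in range(max_iterations):
        exit_code, output = run_app(entry_point)
        if exit_code == 0:
            return True  # the app ran without errors, we're done

        code = open(entry_point).read()
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a developer fixing your own code."},
                {"role": "user", "content": (
                    f"Here is the current code:\n{code}\n\n"
                    f"Running it produced this output:\n{output}\n\n"
                    "Return the full corrected file contents only."
                )},
            ],
        )
        with open(entry_point, "w") as f:
            f.write(response.choices[0].message.content)
    return False  # still failing after the allowed number of iterations
```

The point is not the prompt itself but the loop around it: the orchestration layer decides what to run, what output to feed back, and when to stop, which is what pillar #3 means by orchestrating routines on top of an LLM.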

What did we learn?

Artisan got an idea for his AI robot

LLMs, in general, are a new technology that everyone is trying to understand – how they work, what can be done better, how to do proper prompt engineering, etc. Our approach is to focus on building the application layer instead of working on getting LLMs to output better results. The reasoning is that LLMs will get better, and if we spend weeks optimizing a prompt, it might be completely solved with the new version of GPT. Instead, we’re focusing on what the user experience should look like and which mechanisms are needed to control the LLM so that it can continuously work, getting closer and closer to the final solution. So, here are our learnings so far:

  1. The initial description of the app is much more important than we thought. Our original thinking was that, with the human’s input, GPT Pilot would be able to navigate in the right direction and get closer and closer to the working solution, even if the initial description was vague. However, GPT Pilot’s thinking branches out across prompts, beginning with the initial description. If something is misleading in the initial prompt, all the other info that GPT Pilot has will lead in the wrong direction, and when you correct it down the line, it will already be so deep down the wrong path that it will be almost impossible to get it back on track. Now, as I’m writing this, it seems so obvious, but that is something we needed to learn – to focus much more on the initial description. So, we built a new agent called “Spec Writer,” which works with you to break down the project requirements before it starts coding.
  2. Coding is not a straight line. As I mentioned above in the pillars section, refactoring happens all the time, and GPT Pilot must do so as well. We haven’t implemented a solution for this yet, but it will likely work by adding the ability for GPT Pilot to create markers around its decision tree so that whenever something isn’t working, it can review markers and think about where it could have made a wrong turn.
  3. Agents can review themselves. My thinking was that if one agent reviews what another agent did, it would be redundant because it’s the same LLM reprocessing the same information. But it turns out that when an agent reviews the work of another agent, it works amazingly well. We have 2 different “Reviewer” agents that review how the code was implemented. One does it on a high level, such as how the entire task was implemented, and the other reviews each change before it is applied to a file (like doing a `git add -p`).
  4. LLMs work best when they can focus on one problem compared to multiple problems in a single prompt. For example, if you tell GPT Pilot to make 2 different changes in a single description, it will have difficulty focusing on both. So, we split each human input into multiple pieces in case the input contains several different requests.
  5. Verbose logs help. This is very obvious now, but initially, we didn’t tell GPT-4 to add any logs around the code. Now, it creates code with verbose logging, so that when you run the app and encounter an error, GPT-4 will have a much easier time debugging when it sees which logs have been written and where those logs are in the code.
  6. Splitting the codebase into smaller files helps a lot. This is also an obvious conclusion, but we had to learn it. It’s much easier for GPT-4 to implement features and fix bugs if the code is split into many files instead of a few large ones. Just like we, humans, think – it’s much easier to process smaller chunks of code rather than big ones. We do that by creating levels of abstraction – functions, classes, etc. One of the easiest ways to get the LLM to create a more manageable abstraction is just to tell it to create more modular code and split it into more files. It works amazingly well, and the end result is also better for us, developers.
  7. For a human to be able to fix the code, they need to be clearly shown what was written in the task and the idea behind it. GPT Pilot’s goal is to do 90% of all coding tasks and leave the other 10% to a human. This 10% usually comprises fixes or small changes that the LLM struggles to notice, but for the human, it might be a simple change. However, the problem is that it’s not easy to tell the human what is not working and what code they should look at. If GPT Pilot writes 3,000 lines of code, the human developer, if they want to help GPT Pilot, needs to understand the entire codebase before diving into the actual code. In future versions of GPT Pilot, the human developer will have a detailed explanation of what code has been added to the current task and the reasoning behind it. This way, you will be able to help GPT Pilot.
  8. Humans are lazy. LLMs are better off asking humans questions rather than letting humans think about all the possible errors. Again, it’s very obvious looking back, but one of the things we noticed was that people were much more willing to answer questions that GPT Pilot asked them instead of having an open-ended input field where they could write unconstrained feedback.
  9. It’s hard to get an LLM to think outside the box. This was one of the biggest learnings for me. I thought you could prompt GPT-4 by giving it a couple of solutions it had already used to fix an issue and tell it to think of another solution. However, this is not remotely as easy as it sounds. What we ended up doing was asking the LLM to list all the possible solutions it could think of and save them into memory. When we needed to try something else, we pulled the alternative solutions and told it to try a different, but specific, solution (see the sketch after this list).
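
To illustrate #9, here is a minimal sketch of the “enumerate the alternatives up front, then hand the LLM one specific untried option” approach. The prompts and function names are assumptions for illustration, not GPT Pilot’s actual implementation:

```python
# Sketch of the "list all possible solutions first, then try a specific one" idea.
# Names and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()

def list_possible_solutions(problem: str) -> list[str]:
    """Ask the LLM to enumerate every fix it can think of, one per line."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Here is a bug we need to fix:\n{problem}\n\n"
                "List every distinct solution you can think of, one per line, "
                "without implementing any of them."
            ),
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip(" -*") for line in lines if line.strip()]

def next_untried_solution(solutions: list[str], tried: set[int]):
    """Pick a specific, concrete alternative instead of asking for 'something else'."""
    for i, solution in enumerate(solutions):
        if i not in tried:
            return i, solution
    return None  # every stored alternative has been attempted
```

Instead of asking the model to “try something different,” the orchestration layer stores the full list of candidate solutions and asks for one concrete, previously untried option at a time, which is what we found the model can actually follow.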

Apps created with GPT Pilot

Artisan is happy about the AI robot he created

Currently, you can create simple but non-trivial apps with GPT Pilot. We’ve also seen people create very impressive apps with GPT Pilot, such as an app that can fine-tune a ResNet model to count palm trees and then, when you upload an image, count the trees in it. Here are a couple of apps we created, along with the code, stats, and demos:

Prompt Lab (DEMO)

Think of this as OpenAI Playground on steroids – it enables you to load a conversation from a JSON array or enter it manually, run the conversation with the LLM X number of times, save the conversation to the database, and load it back in. We actually use this app when we engineer a prompt and want to see how it behaves. The Playground is not good enough because it’s time-consuming to repeatedly rerun a single prompt. With Prompt Lab, you can choose how many times to run it and let the app run the prompt repeatedly. Once it’s finished, you can analyze the results. This might be useful for people who are building an AI app and need to optimize their prompts.
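
For anyone building something similar, the core of that workflow boils down to replaying one conversation N times and collecting the outputs. A simplified sketch, not the app’s actual code:

```python
# Simplified sketch of the Prompt Lab idea: replay the same conversation N times
# and collect the outputs so they can be compared.
import json
from openai import OpenAI

client = OpenAI()

def run_conversation(messages: list[dict], runs: int = 10, model: str = "gpt-4") -> list[str]:
    """Send the same message list `runs` times and return every completion."""
    results = []
    for _ in range(runs):
        response = client.chat.completions.create(model=model, messages=messages)
        results.append(response.choices[0].message.content)
    return results

# Load a conversation from a JSON array, as the app does, and run it repeatedly.
with open("conversation.json") as f:
    conversation = json.load(f)

outputs = run_conversation(conversation, runs=10)
```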

⏳ Time spent: ~2 days
💾 Github repo

SQLite database analyzer tool (DEMO)

This is also an app we use internally to analyze a local SQLite database. It pulls the data from the database in a format that’s very specific to the GPT Pilot database structure, but it can easily be modified to fit other structures. It reads and uploads your SQLite database, splits the rows by a specific field, unpacks the rows into values, loads the LLM conversation data into a form, and enables you to simply change one of the messages and submit the LLM request to GPT-4 to see how the results will look. This way, you can analyze the conversations GPT Pilot’s agents have with the LLM and easily explore what would happen if the prompts were different.
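
Under the hood, this kind of tool mostly comes down to Python’s built-in sqlite3 module. Here is a simplified sketch; the table name, column names, and split field are placeholders, since the real app is tied to GPT Pilot’s own schema:

```python
# Simplified sketch of pulling LLM conversations out of a SQLite database.
# The table name, column names, and split field are placeholders.
import json
import sqlite3

def load_conversations(db_path: str, split_field: str = "agent") -> dict[str, list]:
    """Read all rows, group them by a chosen field, and unpack the stored message JSON."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # allows accessing columns by name
    grouped: dict[str, list] = {}
    for row in conn.execute("SELECT * FROM llm_requests"):
        key = row[split_field]
        messages = json.loads(row["messages"])  # each row stores a JSON array of messages
        grouped.setdefault(key, []).append(messages)
    conn.close()
    return grouped
```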

⏳ Time spent: ~2 days
💾 Github repo

Code Whisperer (DEMO)

Code Whisperer is a fun project we created as an example to showcase GPT Pilot. The idea is that you can use it to ask the LLM questions about your codebase. You paste in the link to a public Github repo. It then clones the repository, sends the relevant files to the LLM for analysis (which creates a description for each file about what the code does), and saves those descriptions into the database. After that, you can ask the app questions about the codebase, and it shows you the response. In this demo, we use GPT-3.5.
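
Conceptually, the flow is: clone the repo, describe every file, then answer questions with those descriptions as context. A rough sketch with made-up prompts (the real app stores the descriptions in a database rather than keeping them in memory):

```python
# Rough sketch of the Code Whisperer flow; prompts and structure are illustrative only.
import subprocess
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def describe_repo(repo_url: str, workdir: str = "repo") -> dict[str, str]:
    """Clone a public repo and ask GPT-3.5 to describe what each source file does."""
    subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)
    descriptions = {}
    for path in Path(workdir).rglob("*.py"):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": f"Describe what this file does:\n\n{path.read_text()}"}],
        )
        descriptions[str(path)] = response.choices[0].message.content
    return descriptions

def ask(descriptions: dict[str, str], question: str) -> str:
    """Answer a question about the codebase using the stored file descriptions."""
    context = "\n".join(f"{name}: {desc}" for name, desc in descriptions.items())
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Codebase summary:\n{context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content
```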

⏳ Time spent: 7 hours
💾 Github repo

Star History (DEMO)

I’ve been releasing open-source projects for years now, and I’ve always wanted to see how fast my Github repo is growing compared to other successful repositories on https://star-history.com/. The problem is that on Star History, I’m unable to zoom into the graph, so a new repo that has 1,000 stars cannot be compared with a big repo that has 50,000, because you can’t see how the bigger repo grew in its early days. So, I asked GPT Pilot to build this functionality. It scrapes Github repos for stargazers, saves them into the database, plots them on a graph, and enables the graph to be zoomed in and out.
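
If you want to reproduce the data side of this yourself, the GitHub API returns star timestamps when you request the `application/vnd.github.star+json` media type. A simplified sketch (the real app plots and zooms the graph in its own UI; matplotlib stands in for that here):

```python
# Sketch of fetching star timestamps from the GitHub API and plotting cumulative growth.
# A personal access token is assumed to avoid strict rate limits.
import datetime
import requests
import matplotlib.pyplot as plt

def fetch_star_dates(repo: str, token: str) -> list[str]:
    """Return the starred_at timestamp of every stargazer, paging through the API."""
    headers = {
        "Accept": "application/vnd.github.star+json",  # include starred_at in the response
        "Authorization": f"Bearer {token}",
    }
    dates, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/stargazers",
            headers=headers, params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        dates.extend(star["starred_at"] for star in batch)
        page += 1
    return dates

dates = fetch_star_dates("Pythagora-io/gpt-pilot", token="YOUR_GITHUB_TOKEN")
timestamps = [datetime.datetime.fromisoformat(d.replace("Z", "+00:00")) for d in dates]
plt.plot(timestamps, range(1, len(timestamps) + 1))
plt.xlabel("date")
plt.ylabel("cumulative stars")
plt.show()
```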

⏳ Time spent: 6 hours
💾 Github repo

Conclusion

I hope you gained some insight into the current state, problems, and findings that we deal with at GPT Pilot.

So, to recap:

We think that a real AI developer tool should be based on the following pillars:

  • A human is needed to supervise the AI
  • We should enable the AI to iterate over its own mistakes
  • Software development can be orchestrated, and the orchestration should be implemented as a layer on top of LLMs
  • The AI developer should be able to refactor the codebase because, in reality, the coding process is not a straight line

So far, we’ve learned that:

  • The initial app description is much more important than we thought
  • Coding is not a straight line
  • Agents can review themselves
  • LLMs work best when they can focus on one problem instead of multiple problems in a single prompt
  • Verbose logs help a lot
  • Splitting the codebase into smaller files helps a lot
  • For a human to be able to fix the code, they must be clearly shown what has been written and the idea behind it
  • Humans are lazy
  • It’s hard to get an LLM to think outside the box

What do you think? Have you noticed any of these behaviors in your interactions with LLMs, or do you have a different opinion on any of these?

I would be super happy to hear from you, so leave a comment below or shoot me an email at zvonimir@gpt-pilot.ai.

You can try creating an app with AI using GPT Pilot here: https://github.com/Pythagora-io/gpt-pilot

If you liked this post, it would mean a lot if you starred the GPT Pilot Github repo, and if you try it out, let us know what you think.

Top comments (5)

Matija Sosic

Amazing insights Zvonimir!

zvone187

Thanks @matijasos 🙏

Tristan Drummond

Loving GPT Pilot, it's great!

Mithilesh Pandit

Learned so much from this one article!

Ivan Gavran

A very interesting post, thanks!