TL;DR
Large Language Models (LLMs) still make a lot of mistakes in code generation, as of February 2025.
We can use strictly defined shorthand algorithms (along with some additional comments) to generate better results. The reason: natural human languages (English or any other) are inherently ambiguous, while algorithms and programming languages are not!
What is it about?
Code generation with Artificial Intelligence is not yet up to the mark. Claude 3.5 Sonnet is better than most, but even it makes errors that are time-consuming to debug.
I'll continue relying on tools such as GitHub Copilot to boost productivity. That said, making the most of Generative AI for coding still feels elusive! Hopefully, that will change in 2025.
Alternatives to GitHub Copilot, like Cursor and Windsurf, are on my radar, but I'd rather learn to use Copilot more efficiently before diving fully into the other options.
A little more background
Earlier today, while I was using GitHub Copilot to write a CLI shell script, it kept making mistakes that were time-consuming to fix.
The script syncs Firefox profiles between two different computers: in my specific case, two laptops, both running Ubuntu.
Instead of keeping hundreds of open tabs, I usually create multiple Firefox profiles from the command line. Each profile is fine-tuned for specific tasks, projects, etc. As web developers, we have to use different browsers, but my personal preference is still Firefox.
Firefox's built-in sync option could have been used for this, but it has other drawbacks, so syncing this way felt easier.
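For context, creating and launching a dedicated profile from the terminal takes one command each. Here's a minimal sketch (the profile names are only illustrative examples):
firefox -CreateProfile "webdev"       # create a new profile named "webdev"
firefox -P "webdev" --no-remote &     # launch that profile in its own browser instance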
AI code generation - friend or foe?
A few days ago, I used GitHub Copilot to develop a browser plugin in vanilla JavaScript. It did a much better job there. However, for this shell script project, it gave me more headaches than solutions!
So basically, it's still hit or miss for me. Overall, it's productive, though there's room for improvement.
It's possible that other people are using these LLMs more efficiently; however, for the sake of this post, let's assume that's not the case.
My observation
In my experience, AI code generation works pretty well in visual projects such as UI development or JavaScript-based animation (using three.js or p5.js).
However, in projects that require highly precise, complex logical reasoning, these tools sometimes make terrible mistakes that are time-consuming to debug!
Why is this the case? Is it because we can easily ignore mistakes in visual output and consider them acceptable variations?
Or is it because non-programmers and beginners usually ask for this kind of code generation and are easily mesmerized? They then share their experiences on social media, where others also get mesmerized, and neither group is capable of, or willing to, find the issues in the generated code.
Or is it because, as humans, we tend to share only the generated content that performs well? When it doesn't work as expected, we avoid posting about it on social media, assuming it's our own fault for not instructing the AI well enough.
The reality is that different models perform differently in visual programming versus mission-critical, business-logic-driven programming.
Besides, in visual programming, an 80% accurate result is understandably a far better outcome than a 60% accurate one. However, in mission-critical business-logic programming (e.g. banking or medical software, where the error tolerance is 0%), both 80% and 60% accuracy may be considered complete failures.
A very important consideration
Let's extend the above context even further. Consider the following two hypothetical situations:
1) An LLM generates 80% accurate code, where the remaining 20% inaccurate code is very difficult to debug!
2) Another LLM generates 60% accurate code, but the remaining 40% inaccurate code is very easy to debug.
If this is the case, then the second result would most likely be far preferable for many developers!
Hence, this scenario must also be considered when judging the code-generation performance of an LLM.
Update (March 01, 2025): Let's see what Twitter thinks!
The good parts
In any case, I've noticed something else interesting about AI-generated code: models usually produce much better results from shorthand, strictly defined algorithms than from natural languages like English.
Yes, most LLMs work pretty well when we use pseudocode as prompts!
For example, the following is an algorithm-like prompt that extracts Firefox profile names from their corresponding directory names. It worked pretty well for me:
Convert the following algorithm to shell code, lines starting with # are comments with additional instructions:
profiles=array
for each directory in sync_from
    # ending = * after 1st dot (non empty)
    if directory name = "something.ending"
        profile_name=ending
        profiles[]=profile_name
    end_if
end_for
This resulted in the following Shell script using Claude 3.5 Sonnet in GitHub Copilot:
#!/bin/bash

# Initialize profiles array
declare -a profiles

# Iterate through directories in sync_from
for dir in "$sync_from"/*; do
    # Check if it's a directory
    if [ -d "$dir" ]; then
        # Get base name of directory
        base_name=$(basename "$dir")

        # Extract part after first dot
        profile_name=${base_name#*.}

        # Check if profile_name is not empty and not equal to base_name
        if [ "$profile_name" != "$base_name" ] && [ -n "$profile_name" ]; then
            # Add to array
            profiles+=("$profile_name")
        fi
    fi
done
And the following Shell script with DeepSeek R1:
#!/bin/bash
profiles=()
for dir in "$sync_from"/*; do
    if [[ -d "$dir" ]]; then
        dirname=$(basename "$dir")
        first_part="${dirname%%.*}"
        ending="${dirname#*.}"
        if [[ "$dirname" != "$first_part" && -n "$ending" ]]; then
            profiles+=("$ending")
        fi
    fi
done
Suffice it to say, both worked.
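Both snippets lean on the same Bash parameter expansion trick: ${name#*.} strips everything up to and including the first dot, leaving just the profile name. To sanity-check either one, I only needed to point sync_from at a profiles directory and print the result. A minimal sketch, assuming the default Firefox profile location on Linux (adjust the path for your own setup):
sync_from="$HOME/.mozilla/firefox"   # assumed location of the Firefox profile directories
# ... run either generated snippet from above here ...
# Print the extracted names, e.g. "default-release" from a directory like "abcd1234.default-release"
printf '%s\n' "${profiles[@]}"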
However, when I tried to generate the same shell script with natural language (i.e. plain English), both models produced unexpected results. I was finally able to get the expected output after a few trials, but that's not an efficient way of programming!
Multiple rounds of trial and error and hands-on fixes were needed to get English prompts to work as expected, whereas with the pseudocode prompt, both Claude 3.5 Sonnet and DeepSeek R1 produced predictable results on the first attempt!
Questions for the developer community:
What's your experience with AI in software development? Did you get a 10x productivity boost using different LLMs, as some people claim?
Do you know of any better way, or a better algorithm that produces predictable results with different Generative AI models?
Please let me know!
If you prefer, you may also engage with me in the following related Tweet:
Hopefully my experience was helpful to you. Happy Coding! 🥰
Top comments (3)
Yeah, it will take a bit more effort to make LLMs useful for enterprise codebases: not only better LLMs, but complete systems around them. The more freedom there is (like with generic prompts to implement a snake game), the easier it is for a model to come up with something. But I would bet that Google could also find a good implementation quite easily somewhere on the web.
I agree: generic prompts, where the LLM understands not only the language of the user but also the intent of the prompt, are the right way forward.
Intent understanding can be supported by tooling. For example, GitHub Copilot, Cursor, etc. can already examine the development environment along with other files in it, including UI/UX design image files. Using these, LLMs should eventually be able to achieve much better results, but we're definitely not there yet.
Hi Fayaz,
Great job! Fantastic article with clear explanations.
Regards,
Ram