TL;DR
Large Language Models (LLMs) still make a lot of mistakes in code generation, as of February 2025.
We can use strictly defined shorthand algorithms (along with some additional comments) to generate better results. The reason: natural human languages (English or any other) are inherently ambiguous, while algorithms and programming languages are not!
What is it about?
Code generation with Artificial Intelligence is not yet up to the mark. Claude 3.5 Sonnet is better than most, but even it makes errors that are time-consuming to debug.
I'll continue relying on tools such as GitHub Copilot to boost productivity. That said, making the most of Generative AI for coding still feels elusive! Hopefully, that will change in 2025.
Alternatives to GitHub Copilot, like Cursor and Windsurf, are on my radar, but I'd rather learn to use Copilot more efficiently before diving fully into the other options.
A little more background
Earlier today, while I was using GitHub Copilot to write a CLI shell script, it kept making mistakes that were time-consuming to fix.
The script syncs Firefox profiles between two different computers: in my specific case, two laptops, both running Ubuntu.
Instead of keeping hundreds of open tabs, I usually create multiple Firefox profiles from the command line. Each profile is fine-tuned for specific tasks, projects, etc. As web developers, we have to use different browsers, but my personal preference is still Firefox.
Firefox's built-in sync option could have been used for this, but it has other drawbacks, so syncing this way felt easier.
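For context, creating and launching a dedicated profile from the terminal takes one command each. Here's a minimal sketch (the profile names are only illustrative examples):
firefox -CreateProfile "webdev"       # create a new profile named "webdev"
firefox -P "webdev" --no-remote &     # launch that profile in its own browser instance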
AI code generation - friend or foe?
A few days ago, I used GitHub Copilot to develop a browser plugin in vanilla JavaScript. It did a much better job there. However, for this shell script project, it gave me more headaches than solutions!
So basically, it's still hit or miss for me. Overall, it's productive, though there's room for improvement.
It's possible that other people are using these LLMs more efficiently; however, for the sake of this post, let's assume that's not the case.
My observation
In my experience, AI code generation works pretty well in visual projects such as UI development or JavaScript-based animation (using three.js or p5.js).
However, in projects that require highly precise, complex logical reasoning, these tools sometimes make terrible mistakes that are time-consuming to debug!
Why is this the case? Is it because we can easily ignore mistakes in visual output and consider them acceptable variations?
Or is it because non-programmers and beginners usually ask for this kind of code generation and are easily mesmerized? They then share their experiences on social media, where others also get mesmerized, and neither group is capable of, or willing to, find the issues in the generated code.
Or is it because, as humans, we tend to share only the generated content that performs well? When it doesn't work as expected, we avoid posting about it on social media, assuming it's our own fault for not instructing the AI well enough.
The reality is that different models perform differently in visual programming versus mission-critical, business-logic-driven programming.
Besides, in visual programming, an 80% accurate result is understandably a far better outcome than a 60% accurate one. However, in mission-critical business-logic programming (e.g. banking or medical software, where the error tolerance is 0%), both 80% and 60% accuracy may be considered complete failures.
A very important consideration
Let's extend the above context even further. Consider the following two hypothetical situations:
1) An LLM generates 80% accurate code, where the remaining 20% inaccurate code is very difficult to debug!
2) Another LLM generates 60% accurate code, but the remaining 40% inaccurate code is very easy to debug.
If this is the case, then the second result would most likely be far preferable for many developers!
Hence, this scenario must also be considered when judging the code-generation performance of an LLM.
Update (March 01, 2025): Let's see what Twitter thinks!
The good parts
In any case, I've noticed something else interesting about AI-generated code: models usually produce much better results from shorthand, strictly defined algorithms than from natural languages like English.
Yes, most LLMs work pretty well when we use pseudocode as prompts!
For example, the following is an algorithm-like prompt that extracts Firefox profile names from their corresponding directory names. It worked pretty well for me:
Convert the following algorithm to shell code, lines starting with # are comments with additional instructions:
profiles=array
for each directory in sync_from
    # ending = * after 1st dot (non empty)
    if directory name = "something.ending"
        profile_name=ending
        profiles[]=profile_name
    end_if
end_for
This resulted in the following Shell script using Claude 3.5 Sonnet in GitHub Copilot:
#!/bin/bash

# Initialize profiles array
declare -a profiles

# Iterate through directories in sync_from
for dir in "$sync_from"/*; do
    # Check if it's a directory
    if [ -d "$dir" ]; then
        # Get base name of directory
        base_name=$(basename "$dir")

        # Extract part after first dot
        profile_name=${base_name#*.}

        # Check if profile_name is not empty and not equal to base_name
        if [ "$profile_name" != "$base_name" ] && [ -n "$profile_name" ]; then
            # Add to array
            profiles+=("$profile_name")
        fi
    fi
done
And the following Shell script with DeepSeek R1:
#!/bin/bash
profiles=()
for dir in "$sync_from"/*; do
    if [[ -d "$dir" ]]; then
        dirname=$(basename "$dir")
        first_part="${dirname%%.*}"
        ending="${dirname#*.}"
        if [[ "$dirname" != "$first_part" && -n "$ending" ]]; then
            profiles+=("$ending")
        fi
    fi
done
Suffice it to say, both worked.
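Both snippets lean on the same Bash parameter expansion trick: ${name#*.} strips everything up to and including the first dot, leaving just the profile name. To sanity-check either one, I only needed to point sync_from at a profiles directory and print the result. A minimal sketch, assuming the default Firefox profile location on Linux (adjust the path for your own setup):
sync_from="$HOME/.mozilla/firefox"   # assumed location of the Firefox profile directories
# ... run either generated snippet from above here ...
# Print the extracted names, e.g. "default-release" from a directory like "abcd1234.default-release"
printf '%s\n' "${profiles[@]}"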
However, when I tried to generate the same shell script with natural language (i.e. plain English), both models produced unexpected results. I was finally able to get the expected output after a few trials, but that's not an efficient way of programming!
Multiple rounds of trial and error and hands-on fixes were needed to get English prompts to work as expected, whereas with the pseudocode prompt, both Claude 3.5 Sonnet and DeepSeek R1 produced predictable results on the first attempt!
Questions for the developer community:
What's your experience with AI in software development? Did you get a 10x productivity boost using different LLMs, as some people claim?
Do you know of any better way, or a better algorithm that produces predictable results with different Generative AI models?
Please let me know!
If you prefer, you may also engage with me in the following related Tweet:
Hopefully my experience was helpful to you. Happy Coding! 🥰
Top comments (3)
Yeah, it will take a bit more effort to make LLMs useful for enterprise codebases: not only better LLMs, but complete systems around them. The more freedom there is (like with generic prompts to implement a snake game), the easier it is for a model to come up with something. But I would bet that Google could also find a good implementation quite easily somewhere on the web.
I agree: generic prompts, where the LLM understands not only the language of the user but also the intent of the prompt, are the right way forward.
Intent understanding can be supported by tooling. For example, GitHub Copilot, Cursor, etc. can already examine the development environment along with other files in it, including UI/UX design image files. Using these, LLMs should eventually be able to achieve much better results, but we're definitely not there yet.
Hi Fayaz,
Great job! Fantastic article with clear explanations.
Regards,
Ram