Over the past year, like many others in the tech space, I've witnessed a generative AI frenzy. There's an inverse relationship at play: the less technical and more managerial a person or organization becomes, the stronger their push for embracing this "revolutionary" technology (which, let's be honest, has been around since the 1970s). Promises of 10x delivery, hell, even 30x delivery, echo through conference rooms. "Just fully augment yourselves with these tools and software engineering will be obsolete," they say.
Call me a cynic, but I don't take bold claims at face value. So I did what all software engineers do: I found a problem I needed to solve and started building, this time with GitHub Copilot as my co-pilot. Even with a modest 10x improvement, I should have had a finished product in no time, right?
Not exactly.
I previously wrote about the first phase of this experiment, where I encountered what most developers discover: a period of almost godlike rapid progress, completely undone by refactoring hell and an ungodly spaghetti mess of code. Yet I remained undeterred, determined to find the best practices for using these tools or drive myself insane trying.
So what happened after another month of experimentation? Did I produce my dream application, or is it dead in the water? Let's find out together.
Background: Building a Password Manager in Go
For those who missed my earlier article, here's the context. To test the effectiveness of AI coding tools and strategies for working with them, I decided to build a desktop application in an unfamiliar language. The application: a password manager (I'm tired of cloud-based password storage and wanted my own tooling). The language: Go.
Coming from a background in distributed systems built with Java and TypeScript, I saw this as an excellent opportunity to learn something new while relying on AI models for guidance. As it stands, the application builds cross-platform, stores secrets securely, and (according to GPT-4) can share secrets over LAN and Bluetooth.
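For readers who haven't built one of these before, the core idea is simple: a key is derived from a master password and used to encrypt everything on disk. A minimal sketch of that derivation step in Go might look like the following; the package name, function name, and Argon2 parameters are my own illustration, not the project's actual code:

```go
// Illustrative only: deriving a vault key from a master password.
package vault

import "golang.org/x/crypto/argon2"

// DeriveVaultKey stretches the master password into a 32-byte key that can
// then seal the secrets file with an AEAD such as AES-GCM.
func DeriveVaultKey(masterPassword, salt []byte) []byte {
	// time=1 pass, memory=64 MiB, threads=4, 32-byte output
	return argon2.IDKey(masterPassword, salt, 1, 64*1024, 4, 32)
}
```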
Agentic Coding Meets TDD: A Fundamental Tension
One of my main concerns with the push for agentic workflows is that they fundamentally conflict with my core development practices: Test-Driven Development (TDD) and Behavior-Driven Development (BDD). I've found tremendous success with this strategy. It keeps solutions concise, stable, and always focused on the end user. Nothing exists without good reason, and refactoring can be done proactively with minimal technical debt. From a business perspective, you get a more stable, easier-to-change product that serves user needs while remaining comprehensible to the developers who built it.
So I entered the second phase with a new plan: How can I work with LLMs while maintaining a TDD/BDD workflow? As it turns out, with considerable difficulty.
To test this approach, I outlined a request for a new feature: create a mechanism to securely share secrets between machines. Unlike my previous attempts, I gave the agent (GPT-4) one key instruction: no implementation yet, just method signatures, test cases, and the reasoning behind decisions. Only then was it to write an integration test covering the full journey plus several negative cases.
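To make that request concrete, here is roughly what a "signatures and test cases only" pass can produce. Everything below is illustrative; SecretSharer and the scenario list are my stand-ins, not the project's real API:

```go
// Illustrative sketch of a signatures-first pass, before any implementation.
package sharing

import "context"

// SecretSharer captures the agreed behaviour for sharing secrets between machines.
//
// Planned test cases (implementation comes later):
//   - the full journey: export a bundle on one machine, import it on another
//   - a tampered bundle is rejected
//   - a bundle exported for a different recipient cannot be imported
type SecretSharer interface {
	// ExportBundle encrypts the selected secrets for a specific recipient key.
	ExportBundle(ctx context.Context, secretIDs []string, recipientKey []byte) ([]byte, error)

	// ImportBundle verifies, decrypts, and stores a bundle received from a peer.
	ImportBundle(ctx context.Context, bundle []byte) error
}
```

Having the signatures and the scenario list on record gave me something concrete to challenge before a single line of logic existed.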
Initially, everything went well. I remained involved in decision-making, code quality improved, and test coverage reached an all-time high. Had I discovered the magic formula for peak AI utilization?
Problems emerged during implementation. A persistent issue with these models is their reluctance to use dependency injection in classes, triggering refactor cycle number one. Then came test mock injection, where despite instructions to use Mockery (which I'd set up in phase one), GPT continued creating unnecessary custom mocks.
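For clarity, this is the constructor-injection shape I kept asking for: depend on an interface and pass the dependency in, so Mockery can generate the test double instead of the agent hand-rolling one. The types here are illustrative, not the project's actual code:

```go
// A minimal sketch of constructor-based dependency injection in Go.
package sharing

// Transport abstracts how an encrypted bundle reaches a peer; being an
// interface, it is something Mockery can generate a mock for.
type Transport interface {
	Send(addr string, bundle []byte) error
}

// Service owns the business logic and receives its dependencies explicitly.
type Service struct {
	transport Transport
}

// NewService injects the transport rather than constructing it internally,
// which keeps the logic testable without custom hand-written mocks.
func NewService(t Transport) *Service {
	return &Service{transport: t}
}

func (s *Service) Share(addr string, bundle []byte) error {
	return s.transport.Send(addr, bundle)
}
```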
Despite these hiccups, progress seemed rapid. After a few days, I had something that would supposedly work, until GPT changed its mind and decided to completely redo the implementation. The encryption keys were wrong, so let's add more. More helper methods. More custom code that could be handled by third-party libraries. Fortunately, I remained highly involved at this stage, with changes made one file at a time, allowing me to challenge decisions and learn from mistakes.
Eventually, with new keys added, full keychain integration, and a robust ephemeral/hybrid key generation system, we seemed to be in business. Debugging took considerable time, but my solid test cases, especially the integration test, revealed issues quickly and helped me understand the often convoluted logic the agent produced.
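To give a sense of what "ephemeral/hybrid key generation" means in practice, here is a heavily simplified sketch using Go's crypto/ecdh, HKDF, and AES-GCM. It is my own reconstruction with made-up names and an arbitrary info string, not the code the agent produced:

```go
// Rough sketch of a hybrid exchange: an ephemeral key pair plus the
// recipient's long-term public key derive a one-off symmetric key.
package sharing

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/ecdh"
	"crypto/rand"
	"crypto/sha256"
	"io"

	"golang.org/x/crypto/hkdf"
)

// EncryptForPeer seals a bundle for one recipient. It returns the ephemeral
// public key (sent alongside the ciphertext) and the sealed bundle.
func EncryptForPeer(recipientPub *ecdh.PublicKey, bundle []byte) (ephemeralPub, ciphertext []byte, err error) {
	ephemeral, err := ecdh.X25519().GenerateKey(rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	shared, err := ephemeral.ECDH(recipientPub)
	if err != nil {
		return nil, nil, err
	}
	// Stretch the raw shared secret into a 32-byte AES key.
	key := make([]byte, 32)
	if _, err := io.ReadFull(hkdf.New(sha256.New, shared, nil, []byte("bundle-v1")), key); err != nil {
		return nil, nil, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, nil, err
	}
	return ephemeral.PublicKey().Bytes(), gcm.Seal(nonce, nonce, bundle, nil), nil
}
```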
This marked the end of the first part of my feature work. The business logic functioned, tests passed, and the code appeared well-structured with solid design patterns and high coverage. This was my greatest project success and the most enjoyable phase.
The Agentic Death Spiral: Enter YOLO Development
Around this time, I received a Copilot update featuring a fancy new model: GPT-4o (and 4o-mini). Excited to test the hype, I decided to tackle the substantial technical debt from earlier poor code. I gave it a suitably complex task that all previous models had failed at: review the codebase, identify improvement areas, and create a refactoring plan.
The result was pleasantly surprising. Within ten minutes, I had a detailed issue list and a multi-step remediation plan. I approved the refactor and let the agent take control. The outcome: hundreds of deleted lines, better test assertions, and numerous cognitive complexity warnings banished to the ether. I had seemingly solved the elusive problem of AI-driven code improvement.
But pride comes before a fall, and cracks were already appearing. The agent reverted to large multi-file changes and dreaded 1000-character word salads detailing these modifications. I became detached, skimming outputs and simply entering "go ahead with the next change" when prompted. The roles reversed: I became the passenger, with GPT driving. Thus began the misery of YOLO development.
The unraveling accelerated when I decided to add transport mechanisms for sending secret bundles between devices. First came LAN support with hundreds of lines of custom code, all of which I had to remove in favor of a dedicated library. (Why do agents always avoid libraries?) Did this work? I have no idea. The AI wrote the tests, and I was so disconnected that I barely read them.
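The article doesn't hinge on which library I swapped in, so purely as an illustration of how little custom code the job actually needs, here is what moving a bundle over the LAN looks like with nothing but the standard library's net/http (endpoint path and names are mine, not the project's):

```go
// Sketch: LAN transfer of an encrypted bundle using only net/http.
package transport

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// Serve accepts encrypted bundles from peers and hands them to handle.
func Serve(addr string, handle func(bundle []byte)) error {
	mux := http.NewServeMux()
	mux.HandleFunc("/bundle", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		handle(body)
		w.WriteHeader(http.StatusNoContent)
	})
	return http.ListenAndServe(addr, mux)
}

// Send posts an encrypted bundle to a peer on the local network.
func Send(peerAddr string, bundle []byte) error {
	resp, err := http.Post(fmt.Sprintf("http://%s/bundle", peerAddr), "application/octet-stream", bytes.NewReader(bundle))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("peer returned %s", resp.Status)
	}
	return nil
}
```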
The UI was built in record time. I got so lazy that I simply prompted the AI to "improve the UI and make it look nicer," which it interpreted as an invitation to remove half the options and create the world's worst modal menu with only a theme-changing option. However, having little interest in UI development, I was happy to let the model lead.
YOLO development reached its final form when I prompted for Bluetooth setup. At this point, I wasn't even reading prompt outputs, let alone code. I seemed to be making progress, so I didn't care, until being informed that no method existed for setting up a Bluetooth adapter on Windows in Go. The model attempted to create a custom C# package for some inexplicable reason, but I'd had enough. I just wanted the work completed so I could move on. The code had become a lost cause, and the functionality I'd built this for (peer-to-peer secret bundle sharing) was untestable and unrecognizable from the carefully crafted code in the first phase.
Is YOLO Development Inevitable with Agentic Coding?
The answer likely varies from person to person, but for me it comes down to a few factors:
Model Complexity and Autonomy: The more the model does and the fewer mistakes it makes, the higher the likelihood of YOLO development. This represents the critical paradox of LLM-assisted development. I can only see this worsening as models improve.
Codebase Size and Quality: Larger, more complex codebases with worse quality and more machine-generated code increase the chance of total developer disconnect. Fortunately, models perform poorly on very large codebases, which ironically offers some protection against this issue.
Developer Technical and Product Knowledge: As the project progressed (especially with GPT-4o), the agent suggested increasingly incomprehensible changes. This was particularly true with transport logic, where I had limited knowledge. Consequently, I had no idea what the code was doing and was too checked out to read and understand it.
Lessons Learned: Finding Signal in the Noise
Despite my complaints above (and believe me, I love complaining—just ask anyone I've done a retrospective with), I remain interested in the AI-assisted development paradigm. I have more experiments planned and still hope to eventually develop a system that enables my best possible work, both personally and professionally.
There's much to extract from this experience. Even if it ultimately failed to maintain my engagement, I learned considerably. Here are my key takeaways:
Request Small, Concise Changes: Be crystal clear with the model (repeatedly) that it should do one thing at a time. Treat the model as a software novice: outline the desired steps and avoid allowing direct file changes. I've found much more success copying from the output into files myself, as it forces me to read the code and understand where it fits.
Test, Test, Test: Robust testing was a lifesaver. Models are smart enough to run tests after changes, making issue resolution much faster. When models made large changes, seeing tests fail as expected was reassuring and prevented regressions. Just ensure you read the tests: models tend to create redundant assertions while missing key checks.
BDD Outperforms TDD: One major takeaway was how successful BDD testing proved. When I wrote scenarios upfront (even when the model wrote test code), output quality was higher and more product-oriented. I could also better guide the model toward correct choices and priority-based feature implementation. TDD results were mixed—models frequently ignored tests despite repeated prompting. A proper ML-based TDD solution seems distant with current practices.
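As an example of what I mean by writing scenarios up front, here is the style of behaviour-first test skeleton I would start from before any implementation exists. The scenario names are illustrative, not the project's real suite:

```go
// Behaviour-named scenarios written before the implementation; the model
// fills in the bodies later, but the priorities are already on record.
package sharing_test

import "testing"

func TestSecretSharingBehaviour(t *testing.T) {
	scenarios := []struct {
		name string
		run  func(t *testing.T)
	}{
		{"a user can share a secret with a trusted device", nil},
		{"sharing fails clearly when the peer is unreachable", nil},
		{"a revoked device can no longer import new bundles", nil},
	}
	for _, s := range scenarios {
		t.Run(s.name, func(t *testing.T) {
			if s.run == nil {
				t.Skip("scenario defined, implementation pending")
			}
			s.run(t)
		})
	}
}
```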
Take Breaks and Stay Vigilant: The longer I worked with these models, the more detached I became. While playing Dwarf Fortress during code generation sounds appealing, it split my focus and let poor code slip in. Dedicated sessions within set timeframes work best, with the added benefit of reflection time to refine the process.
Don't Allow Changes You Don't Understand: This is self-explanatory. If the model adds logic you can't reasonably decipher, it shouldn't be added. It's fine to be unclear on syntax occasionally, but core flow should be easily understood and sensible to the engineer in charge.
But You Don't Need to Understand Scripts: One significant model win is their scripting prowess. I heavily leveraged models for deployment and binary build scripts for each OS. While understanding these better might have been valuable, scripts like these are single-purpose: they work or they don't. Though debugging took longer, making sure I could test them locally meant this area worked very well.
Use Linters, Code Coverage, and Proper CI: Models excel at recognizing build checks. These tools proved extremely useful, allowing me to mandate code quality and test coverage in builds while providing a way for models to ensure changes pass local CI builds.
Know and Use Design Patterns: Planning changes ahead is crucial. I found models poor at implementing design patterns, leading to excessive duplication, helper reliance, and overly long files. Create packages upfront with proper naming and strongly instruct pattern adherence for much better results.
Experience Matters: As mentioned in my previous article, these tools amplify developer skill and experience. They can't think for you—the more input and guidance provided, the better the results. I also find that highly granular input across multiple prompts performs much better than single giant prompts or vague requests.
Conclusion: Software Engineering Isn't Going Anywhere
So concludes my first LLM-driven project. I leave with more questions than answers, so hopefully there will be future articles on TDD, professional usage, and other holistic topics. One thing I'm certain of: software engineering isn't disappearing. If anything, skills and experience are now more important than ever.
But what about speed, and that all-important 10x improvement? I didn't see it. It took more than a month to finish this, and I'm not convinced it was any quicker than if I had simply hand-written all the code. The quality would probably have been better in that case as well.
The promise of AI-assisted development remains compelling, but the path forward requires careful navigation of its inherent paradoxes. The better these tools become, the more vigilant we must be about maintaining our agency as developers. The future lies not in replacement, but in thoughtful collaboration—where we remain firmly in the driver's seat.
For context, you can read my previous article and view the project repository.