Ashley Childress

Posted on Mar 4

I Stopped Reviewing Code: A Backend Dev’s Experiment with Google Gemini

#devchallenge #geminireflections #gemini

This is a submission for the Built with Google Gemini: Writing Challenge

🦄 I’ve been officially obsessed with AI for nearly a year now. Not from an ML research angle and not from a purist implementation standpoint. The thrill, for me, is in finding the limits as a user and then leaning on them until something gives. One of my favorite Hunter S. Thompson lines talks about “the tendency to push it as far as you can.” That has been my operating principle this entire year.

This build started as a portfolio experiment. It turned into something else entirely. This challenge became the cleanest environment I’ve found to test what actually happens when you step out of the implementation loop and let the model build the world without you.

What I Built with Google Gemini

When I saw the New Year, New You Portfolio Challenge, I knew it required a UI. That wasn’t a surprise. What was a surprise was how quickly I would realize I didn’t understand what I was looking at once it started coming together.

I’m a backend developer. You hand me a distributed systems problem and I’ll happily spend hours untangling it. You ask me to make a div visible in a browser and my brain actively searches for the exit. With only one weekend to build, there was no room for the "eyes-glazing-over" phase. Google Gemini would implement and I would supervise—that was my whole plan.

I walked in expecting Antigravity, powered primarily by Gemini Pro, to behave like every other AI system I’d tested—predictable and fairly easy to keep inside the guardrails. I thought I already knew what those guardrails looked like: strict types, linting, and the familiar routine of code review.

The Pivot: Dropping the Code Review Ritual

Initially, I followed the "responsible" pattern: prompt, review the diff, run tests, approve. It felt disciplined. It looked professional.

Very quickly, I realized I had no meaningful context for what I was reviewing in a frontend stack. I wasn't improving the output; I was participating in ceremony. So, I stopped reviewing code altogether.

Instead of validating lines of code, I validated outcomes. If the UI rendered correctly and passed functional tests, that was success. I cranked up the autonomy, taught Antigravity my repository expectations, and let it run. Copilot reviewed the code in my place, and Gemini responded in a closed loop. I stepped out of the implementation and into the role of a systems auditor.

Demo

This portfolio iteration documents what happens when you turn an agent loose inside a defined system.

For this build, the Antigravity panel was the primary interface. I defined the repo rules and testing expectations there, and Gemini implemented directly within that structure. It became the control surface for the entire loop.

V1 Release: Preserved version v1.1.0
Live Portfolio: https://anchildress1.dev

Replacing Trust With Systems

I didn’t simply remove oversight; I replaced it with Lighthouse audits and expanded test coverage. My assumption was simple: if the browser behaves and the tests pass, the code is "safe." I believed I had replaced trust in code with trust in systems. I was wrong—I had confused passing tests with structural integrity.

What I Learned

High Reasoning Isn’t Optional

I learned that for autonomous development, reasoning depth is a stability requirement. With lower reasoning modes (like Flash), changes were often partial—updating 2/3 of the files but "forgetting" the tests or documentation.

Switching to High Reasoning mode in Gemini Pro changed the pattern. Runtime errors dropped, and cross-file consistency improved. It finally started "remembering" to keep the docs aligned with the code changes without constant nudging.

Reasoning depth wasn’t about intelligence—it was about reliability under autonomy. Gemini’s deeper reasoning and context retention made the closed-loop workflow viable; without it, cross-file consistency collapsed quickly under autonomy.

The Reality Check: Sonar

After the high of the successful build wore off, I introduced Sonar as a retrospective audit. The UI rendered correctly. The tests passed. Everything appeared stable.

Sonar reported 13 reliability issues and assigned the project a C reliability rating. Of those issues, 66% were classified as high severity. Security review surfaced three hotspots, including a container running the default Python image as root and dependency references that did not pin full commit SHAs.
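To make the SHA-pinning finding concrete, here is a minimal sketch of the kind of check involved. It assumes the unpinned references were GitHub Actions `uses:` lines, which the Sonar report summary above does not actually specify. The function name `find_unpinned_refs` and the sample workflow are hypothetical.

```python
import re

# Assumed format: a workflow line such as `- uses: owner/repo@ref`.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)")
# A full Git commit SHA is 40 lowercase hex characters.
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def find_unpinned_refs(workflow_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, action, ref) for each ref that is not a full SHA."""
    unpinned = []
    for line_number, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match is None:
            continue  # not a `uses:` line with an @ref
        action, ref = match.groups()
        if not FULL_SHA_RE.fullmatch(ref):
            unpinned.append((line_number, action, ref))
    return unpinned


# Hypothetical workflow snippet: one tag reference, one full-SHA reference.
SAMPLE_WORKFLOW = """steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@0123456789abcdef0123456789abcdef01234567
"""

for line_number, action, ref in find_unpinned_refs(SAMPLE_WORKFLOW):
    print(f"line {line_number}: {action}@{ref} is not pinned to a full commit SHA")
```

A tag like `v4` can be moved to a different commit later, while a full SHA cannot, which is why scanners flag the former.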

Maintainability scored an A, but still carried 70 maintainability issues—structural patterns that didn’t break behavior, yet increased long-term complexity.

That was the moment confidence turned into scrutiny.

The application worked. The tests passed. But reliability, security posture, and structural integrity told a different story. The tests validated behavior; Sonar validated assumptions. And those are not the same thing.

The lesson? AI-generated tests can pass because they were written to satisfy the implementation, not challenge it. Structural validation requires an independent layer of review outside the generation loop.
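A small illustration of that failure mode, as a sketch rather than anything taken from the actual repository: `apply_discount` is a hypothetical function with a deliberate bug, and the two tests differ only in where the expected value comes from.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical implementation with a deliberate bug: it adds the discount."""
    return price + price * percent / 100  # bug: should subtract


def mirrored_test() -> bool:
    """Derives the expected value from the same formula the implementation uses."""
    price, percent = 200.0, 10.0
    expected = price + price * percent / 100  # copies the bug
    return apply_discount(price, percent) == expected


def independent_test() -> bool:
    """Uses an expected value worked out by hand from the requirement."""
    return apply_discount(200.0, 10.0) == 180.0


print("mirrored test passes:", mirrored_test())
print("independent test passes:", independent_test())
```

The mirrored test passes because it restates the implementation, bug included. The independent test fails because its expectation comes from the requirement instead of the code.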

Google Gemini Feedback

What Worked Well

Cohesive Implementation: High reasoning Gemini Pro produced cross-file changes that respected the intent of the repository.
Agentic Orchestration: The model switching was seamless, and the orchestration interface made it possible to define expectations clearly and enforce them consistently.

Where Friction Appeared

Cooldown Transparency: While the interface shows when current credits refresh, the length of the next cooldown remains a black box.
Tool Performance: MCP responsiveness materially impacted iteration speed, sometimes forcing me to batch requests rather than work in small, rapid increments.

💡 Pro Tip: It would be a massive UX win to see exactly how long your next cooldown will be (e.g., "Your next cooldown will be X hours long") directly on the models page. Knowing if the lockout is 1 hour or 96 hours is vital for developer planning.

The Final Verdict: Autonomy Still Demands an Audit

The lesson wasn’t that Gemini failed; it was that systems-level trust requires more than passing tests. In future builds, autonomy won’t ship without an explicit adversarial audit. Whether that means a mandatory Sonar gate, a red-team prompt pass, or a second high-reasoning model instructed to hunt for the first model’s shortcuts—the loop must be challenged.

This project began as a weekend experiment to escape the “teleportation” haze of frontend development. It ended as an exploration of the razor-thin edge of system-level trust. The real build wasn’t the portfolio—it was discovering what happens when you lean on the limits of AI until they finally give.

Removing myself from the implementation loop didn’t eliminate responsibility; it redefined it. The more freedom you give an agent, the more rigor you must give your audit.

🛡️ The Tools Behind The Curtain

This post was brewed by me—with a shot of Google Gemini and a splash of ChatGPT. If you catch a bias or a goof, call it out. AI isn’t perfect, and neither am I.

Top comments (12)

Alois Sečkár • Mar 4

Whenever I skip the verification phase of AI-generated code, it backfires almost immediately. For me it is an invaluable tool for moving forward when stuck, or for starting something you don't know how to do or don't want to do, but I am less and less confident in trusting any output. If nothing else, the proposed code is almost always unnecessarily bloated.

Ashley Childress • Mar 4

I agree on this, too. In prod-level code I'm almost always asking myself how it's possible it figured that was a good idea! It definitely was not.

markdown • Mar 6

I **agree** with you!

theScottyJam • Mar 7 • Edited

For static pages, I think it's probably fine to use an LLM without doing a personal review of the code, as long as you do thorough manual testing, make sure it loads at a good speed, follows good accessibility standards, etc.

For anything else, I would never let LLMs run loose - I'd be too scared of it introducing security vulnerabilities or disastrous bugs (such as dropping database data), and I would be responsible for any damage it caused.

Ashley Childress • Mar 7

One important thing to note is that this is not a production system, which changes the game entirely! This project is a personal playground designed to test these sorts of limits. In a real prod environment, I completely agree with you!

That being said, in the future this becomes more and more possible. This particular problem is already being addressed today with things like CodeQL and Sonar scans. Thorough tests beyond the standard unit/integration suites are also fast becoming a baseline requirement.

The question is not whether or not AI can handle the job, but what do we need to do as engineers to teach it how to do so properly?

Christie Cosky • Mar 8

"AI-generated tests can pass because they were written to satisfy the implementation, not challenge it."

I discovered the same thing earlier this year when using AI to write unit tests: the tests mirrored the code instead of validating it. Everything passed, even when the implementation was actually wrong.

I wonder if using TDD would result in better outcomes, but I haven't tried it yet myself. It's a concept I've read about, but have had a hard time figuring out how to put into practice.

Ashley Childress • Mar 8

I tested this a bit early on, but the AI ended up either writing incomplete tests or writing code to satisfy the tests rather than the functionality. You need an adversarial component of some kind to challenge the "quick solution". Most LLMs are trained to find the quickest correct path, which is rarely the accurate one.

Christie Cosky • Mar 8

We have some job security for now then :D

In addition to manual verification of tests, I also have a Claude skill that checks each unit test's correctness. When I generate them, they all follow a specific method name pattern:

<methodName>_when<Conditions>_<expectedBehavior>

Then I have a Claude skill check that the method under test matches the first part of the pattern, that the condition setup matches the second part of the pattern, and that the assertions match the expected behavior from the last part of the pattern. It actually does find problems this way.
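The structural half of that check can be sketched without an LLM. This is only an illustration of splitting a name into its three parts; the regex, `ParsedTestName`, and `parse_test_name` are hypothetical, and the harder step (comparing each part against the test body) is what the Claude skill described above handles.

```python
import re
from typing import NamedTuple, Optional

# Assumed reading of the pattern: <methodName>_when<Conditions>_<expectedBehavior>,
# with no underscores inside any of the three parts.
NAME_RE = re.compile(
    r"^(?P<method>[A-Za-z][A-Za-z0-9]*)"
    r"_when(?P<conditions>[A-Z][A-Za-z0-9]*)"
    r"_(?P<expected>[A-Za-z][A-Za-z0-9]*)$"
)


class ParsedTestName(NamedTuple):
    method: str
    conditions: str
    expected: str


def parse_test_name(name: str) -> Optional[ParsedTestName]:
    """Split a test name into its three parts, or return None if it does not fit."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return ParsedTestName(**match.groupdict())


# Hypothetical test names for illustration.
print(parse_test_name("withdraw_whenBalanceIsInsufficient_throwsException"))
print(parse_test_name("testWithdraw"))
```

Names that return `None` can be rejected before any semantic review starts, which keeps the more expensive check focused on tests that at least declare what they intend to verify.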

Ashley Childress • Mar 8

Thanks—that’s a helpful observation. I may adopt a similar pattern for a gap in my own testing while I work on improving my local implementation flow for Claude.

You might also want to take a look at Verdent. It’s a higher-cost option, but one of the more complete implementations I’ve tested so far. It automates several of the manual setup steps involved in this process. 😃

That said, most tools still require a decent amount of customization before this type of automation becomes practical in a production environment. It’s likely feasible long term, though I expect it to shift development responsibilities away from traditional implementation work toward configuring and guiding AI systems. Other factors—particularly operating cost—will likely influence how quickly broader AI implementation progresses.

In the meantime, it's definitely fun to experiment with!

ReRoutd Admin • Mar 8

Appreciate the honest write-up. This mirrors what a lot of backend/platform teams in the US are seeing: AI can accelerate review prep, but not replace human accountability.

The line about tests validating implementation instead of behavior is the key risk. We’ve had better outcomes when teams require:

contract tests against real integration boundaries
mutation testing for critical paths
and “human sign-off” gates for auth, billing, and data-deletion code

Curious if you tracked defect escape rate before/after this experiment? That metric usually makes the business case clear fast.
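For readers unfamiliar with the mutation testing item in the list above, here is a minimal hand-rolled sketch of the idea. Real mutation-testing tools generate mutants automatically; here the mutant is written by hand, and every name (`is_adult`, `weak_suite`, `mutant_killed`, and so on) is hypothetical.

```python
from typing import Callable

AgeCheck = Callable[[int], bool]


def is_adult(age: int) -> bool:
    """Original implementation."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Hand-written mutant: `>=` changed to `>`."""
    return age > 18


def weak_suite(fn: AgeCheck) -> bool:
    """Checks only values far from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn: AgeCheck) -> bool:
    """Adds the boundary case that distinguishes `>=` from `>`."""
    return weak_suite(fn) and fn(18) is True


def mutant_killed(suite: Callable[[AgeCheck], bool],
                  original: AgeCheck, mutant: AgeCheck) -> bool:
    """A mutant is killed when the suite passes on the original but fails on the mutant."""
    return suite(original) and not suite(mutant)


print("weak suite kills mutant:", mutant_killed(weak_suite, is_adult, is_adult_mutant))
print("strong suite kills mutant:", mutant_killed(strong_suite, is_adult, is_adult_mutant))
```

A surviving mutant means the suite cannot tell correct code from subtly broken code, which is the same gap that tests written to mirror an implementation tend to leave.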

Ashley Childress • Mar 8

I didn’t explicitly track it this time around, which in hindsight feels like an obvious thing I should probably start doing. Noted for the next build—thanks!

For v2, I leaned pretty heavily on Claude and the Sonar MCP. After a rather aggressive cleanup pass (and an extra GHA scan for Sonar), most of the bigger issues are now caught ahead of time. I'm still working out the best way to make the review pass more reliable and automatic though.

The higher-reasoning models are doing a lot of the heavy lifting when it comes to getting anything close to quality output. One thing I’m pretty convinced of now is that running multiple adversarial reviews, each with different LLMs, should help a lot. That’s next on my list to experiment with.

Robert Cizmas • Mar 4

Hi, Ashley! Great post! I invite you to test etiq.ai, I think you'll love it. It is an integrity layer for AI generated code and you can visualise your pipeline, debug fast and test different lines of code.
