DEV Community

Gerus Lab


We Let AI Agents Write 80% of Our Code for 6 Months — Here's What Actually Happened

The Experiment That Almost Killed Our Engineering Culture

Six months ago, we at Gerus-lab made a radical decision. We would let AI agents — Claude Code, Cursor, GitHub Copilot — handle the majority of our development work. Not just autocomplete. Not just suggestions. We gave them architecture decisions, full feature implementations, and even code reviews.

The industry was buzzing about "super agents" executing 80-90% of routine development tasks. CNN reported that developer jobs aren't dying despite AI's rapid advance. GitHub hit 43 million monthly pull requests. 84% of developers now use AI tools daily.

We wanted to push further. What happens when you actually trust AI agents with real production code?

Here's our honest, unfiltered report.

Month 1: The Honeymoon Phase

The productivity spike was intoxicating. Our team of 8 engineers at Gerus-lab was suddenly shipping features at the pace of a team of 20. AI agents handled boilerplate generation, API integrations, database migrations, and unit tests. We measured a 47% reduction in time-to-merge for standard features.

Our developers were ecstatic. No more writing CRUD endpoints by hand. No more tedious test scaffolding. The mundane work that used to eat 60% of engineering time? Gone.

We deployed Claude Code for complex refactoring across our monorepo. Cursor became our pair programming partner for new feature development. Copilot handled the autocomplete layer. Codium generated our test suites. It felt like we had unlocked a cheat code.

The numbers looked incredible:

  • Pull requests per week: up 156%
  • Lines of code per developer: up 340%
  • Bug tickets from QA: down 23%
  • Developer satisfaction: through the roof

We were ready to write the blog post about how AI had transformed everything. We were wrong.

Months 2-3: The Cracks Appear

The first sign of trouble came from our most senior engineer. She noticed something disturbing during a routine architecture review: the codebase was becoming homogeneous. Every service looked the same. Every API followed identical patterns. The AI agents had converged on a single "best practice" template and applied it everywhere — regardless of whether it was appropriate.

Our payment processing service had the same architecture as our notification service. Our real-time WebSocket handler was structured like a REST API. The AI didn't understand context — it understood patterns.

Then the subtle bugs started appearing. Not the obvious crashes that tests catch. The insidious logic errors that only surface under specific conditions:

  • A race condition in our order processing that only triggered under high concurrency
  • A memory leak in a data pipeline that only manifested after 72 hours of continuous operation
  • An authentication bypass that worked in edge cases our test suite didn't cover

These weren't bugs that AI-generated tests would catch, because the AI had written both the code AND the tests with the same blind spots.
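To make the first failure mode concrete, here is a minimal sketch of a check-then-act race of the kind described above. This is hypothetical illustration code, not our actual order-processing service: an unsynchronized stock check that two threads can both pass, overselling under load, next to the lock-protected version.

```python
import threading

class Inventory:
    """Toy stock counter for an order service (hypothetical example)."""

    def __init__(self, stock):
        self.stock = stock
        self._lock = threading.Lock()

    def reserve_unsafe(self):
        # Check-then-act with no synchronization: under high concurrency,
        # two threads can both see stock > 0 and both decrement -- the kind
        # of bug that never shows up in a single-threaded happy-path test.
        if self.stock > 0:
            self.stock -= 1
            return True
        return False

    def reserve_safe(self):
        # Holding a lock makes the check and the decrement atomic.
        with self._lock:
            if self.stock > 0:
                self.stock -= 1
                return True
            return False

def hammer(reserve, threads=50):
    """Call `reserve` from many threads at once; return successful count."""
    results = []
    workers = [threading.Thread(target=lambda: results.append(reserve()))
               for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return sum(results)
```

A concurrency test like `hammer` catches the unsafe variant only probabilistically, which is exactly why these bugs slip past generated suites that run each path once.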

Month 4: The Reckoning

We hit a critical incident. A production outage that lasted 4 hours, caused by an AI-generated database migration that looked correct but had a subtle ordering dependency. Our engineers had reviewed the code — or rather, they had "approved" it. The review process had degraded.

Here's what we discovered: when AI writes 80% of the code, humans start reviewing at 20% capacity. Our engineers had developed what we now call "AI trust bias." If the code looked clean, was well-commented (AI is excellent at comments), and passed automated tests, developers would approve it without the deep critical thinking they'd apply to human-written code.

We ran an internal audit. The findings were sobering:

  • Code review depth had dropped 62% — average review time fell from 45 minutes to 17 minutes per PR
  • Architecture decisions were being made by default rather than by design
  • Knowledge transfer between team members had nearly stopped — why discuss code you didn't write?
  • Junior developers weren't learning — they were becoming AI prompt operators, not engineers

This last point hit hardest. Our two junior engineers, hired to grow into mid-level roles, had essentially become AI supervisors. They could prompt Claude Code to generate features, but they couldn't explain WHY the code worked. They were shipping faster than ever but learning slower than ever.

Month 5: The Pivot

We didn't abandon AI. That would be foolish — the productivity gains are real. Instead, we at Gerus-lab developed what we call the "70-30 Framework" that we now use across all our client projects.

The 70-30 Framework

AI handles 70% — the mechanical work:

  • Boilerplate code generation
  • Test scaffolding (but humans write the test LOGIC)
  • Documentation generation
  • Dependency updates and migration scripts
  • Standard CRUD operations
  • Code formatting and linting fixes

Humans own 30% — the thinking work:

  • Architecture decisions and system design
  • Business logic implementation
  • Security-critical code paths
  • Performance-sensitive operations
  • Code review with mandatory "explain the why" comments
  • All database schema designs

The key insight: AI is a force multiplier for execution, not a replacement for engineering judgment.

New Review Rules

We instituted what we call "adversarial reviews" for AI-generated code:

  1. The Three Why Test: For every AI-generated function, the reviewing engineer must explain WHY this approach was chosen over alternatives. If they can't, the PR is rejected.

  2. Chaos Testing: AI-generated code must pass randomized fault injection before merging. AI writes for the happy path. Humans must ensure the sad path works too.

  3. Architecture Fridays: Weekly sessions where the entire team reviews AI-suggested architectural patterns and decides whether they're appropriate or just convenient.

  4. Junior Pairing: Junior engineers must implement at least two features per sprint WITHOUT AI assistance. Learning matters more than shipping speed.
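Rule 2 can start very simply: wrap a dependency in a deterministic fault injector and assert invariants rather than outputs. A minimal sketch, with hypothetical names and assuming the transient failure mode is a `TimeoutError`:

```python
import random

def save_order(order, db_write, retries=3):
    """Persist an order, retrying transient faults (hypothetical handler)."""
    for attempt in range(retries):
        try:
            return db_write(order)
        except TimeoutError:
            if attempt == retries - 1:
                raise  # sad path: surface the error after the last retry

def inject_faults(write, fail_rate, rng):
    """Chaos wrapper: randomly raise TimeoutError instead of writing."""
    def flaky(order):
        if rng.random() < fail_rate:
            raise TimeoutError("injected fault")
        return write(order)
    return flaky
```

Driving `save_order` through the flaky writer with a seeded RNG lets you check the invariant that every order is either stored exactly once or raises an error, which is precisely the sad path that happy-path suites never exercise.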

Month 6: The Results

After implementing the 70-30 Framework, our metrics tell a nuanced story:

  • Shipping speed: Still 89% faster than pre-AI baseline (down from 156%, but sustainable)
  • Production incidents: Down 71% from the AI-everything era
  • Code review quality: Back to pre-AI depth, with faster turnaround
  • Junior developer growth: Measurable skill improvement resumed
  • Developer satisfaction: Actually HIGHER than during the AI honeymoon — engineers feel empowered, not replaced
  • Client satisfaction: Up 34% — fewer bugs, better architecture, faster delivery

What We Learned: 5 Truths About AI in Production Engineering

1. AI Amplifies Your Engineering Culture — For Better or Worse

If your team already has strong code review practices, AI makes good code faster. If your reviews are sloppy, AI makes bad code faster. The tool magnifies whatever habits exist.

2. The "Vibe Coding" Trap Is Real

There's a seductive comfort in watching AI generate working code. It FEELS productive. But "works" and "works correctly under all conditions" are very different things. We've seen this pattern across our client projects at Gerus-lab — teams that "vibe code" with AI ship fast initially and pay the debt later.

3. Junior Developers Are the Canary in the Coal Mine

If your juniors stop asking "why does this work?" and start asking "what prompt should I use?", you've lost something important. The next generation of senior engineers needs to understand systems, not just orchestrate AI.

4. AI-Generated Tests Are Necessary But Not Sufficient

AI writes tests for the code it generated. It shares the same assumptions and blind spots. Always have humans write the edge case tests. Always question the test coverage AI reports.
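For instance, with a hypothetical pricing helper (not our code), a generated suite tends to cover one representative case, while the human-owned tests target the boundaries and invalid inputs:

```python
def apply_discount(price_cents, pct):
    """Hypothetical helper: apply a percentage discount to a price in cents."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    return price_cents - (price_cents * pct) // 100

# The kind of test AI tends to generate: one happy path.
assert apply_discount(1000, 10) == 900

# Human-written edge cases: boundaries and invalid input.
assert apply_discount(1000, 0) == 1000      # no discount
assert apply_discount(1000, 100) == 0       # full discount
assert apply_discount(0, 50) == 0           # free item stays free
try:
    apply_discount(1000, -5)
except ValueError:
    pass
else:
    raise AssertionError("negative pct must be rejected")
```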

5. The 80/20 Rule Is Actually 70/30

The hype says AI handles 80-90% of work. In our experience, the sustainable number for production-quality software is closer to 70%. That remaining 30% — the architecture, the judgment calls, the "is this actually the right solution" thinking — is where human engineers earn their keep.

The Bottom Line

AI agents are the most powerful tools to enter software engineering since the cloud. But they're tools, not engineers. The companies that will win in 2026 and beyond aren't the ones that automate the most — they're the ones that automate the RIGHT things and keep humans in charge of what matters.

At Gerus-lab, we've built this philosophy into every project we deliver. From Web3 platforms on TON and Solana to AI-powered SaaS products, we use AI to ship faster while maintaining the engineering rigor our clients depend on.

Thinking about integrating AI agents into your development workflow? Talk to our team — we've made the mistakes so you don't have to.


This post is based on real engineering data from our team at Gerus-lab. We're an engineering studio with 14+ completed projects across Web3, AI, GameFi, and SaaS. Explore our work at gerus-lab.com.
