Shinsuke KAGAWA

Rethinking Team Development in the Age of LLMs

This is a long read. Feel free to skim the headings and jump to what you care about.

TL;DR:

  • Code reviews shift from reviewing code to reviewing specs, plans, and rules
  • Design the whole Spec → Plan → Test → Implementation cycle, not just specs
  • Small teams (2-3 people) work better — scale by adding teams, not people
  • Run quality checks in isolated contexts to avoid LLM shortcuts
  • Document implicit knowledge explicitly — LLMs can't read your mind

As LLMs become part of the core development workflow, many of our long-standing processes start to break down. In this post, I'll walk through what changes, why it matters, and how teams can adapt.

Much of what I write here comes from building and maintaining a company-internal chatbot on my own, and later open-sourcing the framework behind it. Working solo with LLMs surfaced a lot of pain points that I didn't expect — especially around review, scaling, and process design.

For context, the framework I open-sourced is here (not required to read this post, but included for transparency):
👉 ai-coding-project-boilerplate


1. Why Our Processes Need to Change

1.1 The Problem We're Facing

In LLM-based development, the traditional code review feedback loop doesn't quite work anymore.

| Traditional Development | LLM-Based Development |
| --- | --- |
| Code gets corrected through reviews | Prompts need feedback |
| Feedback lives in the code | Prompt context doesn't stick around |
| Code quality comes from reviews | Final quality depends on prompts |

1.2 Code Reviews Aren't What They Used to Be

Traditional code reviews do a lot of things: security checks, performance tuning, readability improvements, and knowledge sharing. In the LLM era, these functions shift in two key ways:

1. Way More Output

As LLMs generate more code, it becomes unrealistic to review everything by hand.

2. Feedback Goes Somewhere Else

Here's the bigger issue: giving feedback on code doesn't actually change future output. To improve what LLMs generate, you need to provide feedback to prompts and rules (like CLAUDE.md).

That doesn't mean we can throw away code reviews — far from it. Security and performance still need human eyes. But things like readability and knowledge sharing? Those should be codified into standards and documentation. What we used to do implicitly in reviews now needs to be explicit rules for the LLM.

When I started building with LLMs, my basic stance was: don't manually fix the output. I wanted to see what the LLM could do on its own — partly to stress-test it, partly because I didn't want my own limitations to cap the LLM's performance. What I found was that output quality was wildly inconsistent. So I started asking: "Why did it generate that?" and writing down the answers. Over time, that turned into a set of rules that had mostly been stuck in my head.

1.3 Why "Just Add More People" Doesn't Work

The Scaling Problem

  • Communication paths grow as n(n-1)/2 for team size n
  • Governance overhead (reviews, syncs, docs) eats into productivity
  • Brooks's Law (1975) called this out decades ago — LLMs just make it more obvious

In my experience (and backed by plenty of research), larger teams tend to suffer from higher communication overhead and lower efficiency. So "more value with the same people" is hard to pull off when you're already overstaffed. A more realistic path: minimize the number of people per mission first, get efficient, then add people who fit the culture as needed.

2. Rethinking How We Build Software

2.1 It's About the Whole Process, Not Just Specs

You've probably heard "Spec-Driven Development" thrown around, but focusing only on specs isn't enough.

"The person who communicates the best will be the most valuable programmer in the future. The new scarce skill is writing specifications that fully capture your intent and values."
— Sean Grove (OpenAI), AI Engineer Conference

What actually matters is designing the whole Spec → Plan → Test → Implementation cycle — not just writing better specs.

2.2 Specs Will Never Be Perfect (And That's OK)

Writing specs with zero ambiguity? Nearly impossible. Instead of fighting that battle head-on, control it through process.

The approach I recommend:

  • Don't chase perfect specs — run short cycles of spec → plan → test → implementation instead
  • Use LLMs to review specs from multiple angles (edge cases, ambiguities, missing assumptions, etc.)
  • Humans make the final call
  • Catch spec gaps early in the testing phase

Don't aim for perfect specs — aim for a process that catches problems before they get expensive.
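
To make the "multiple angles" idea concrete, here's a minimal sketch of what that review pass could look like. Everything in it is illustrative: `callLLM` is a placeholder for whatever client or CLI you actually use, and the angle list is just the one from the bullets above.

```typescript
// Sketch: review one spec from several angles before implementation starts.
// `callLLM` is a placeholder for your actual LLM client or CLI wrapper.
type ReviewAngle = "edge cases" | "ambiguities" | "missing assumptions" | "conflicting requirements";

async function callLLM(prompt: string): Promise<string> {
  // Placeholder: wire this to whatever you actually use.
  throw new Error("not implemented");
}

async function reviewSpec(spec: string, angles: ReviewAngle[]): Promise<Record<string, string>> {
  const findings: Record<string, string> = {};
  for (const angle of angles) {
    // Each angle gets its own focused prompt, so one concern doesn't drown out another.
    findings[angle] = await callLLM(
      `Review the following specification strictly for ${angle}.\n` +
      `List concrete issues and the section they appear in. Do not rewrite the spec.\n\n${spec}`
    );
  }
  return findings; // A human triages these and decides what actually changes in the spec.
}
```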

2.3 Picking the Right Implementation Approach

There's no one-size-fits-all here — the right slicing strategy depends heavily on your project's maturity and architecture.

Vertical Slicing (Feature-Driven)

  • Build across all layers for each feature
  • Works when: features are mostly independent, you want to ship something usable early, or changes touch all layers anyway
  • How to validate: Does each slice deliver something real and usable to the end-user?

Horizontal Slicing (Foundation-Driven)

  • Build architecture layers one at a time
  • Works when: foundation stability matters most, multiple features share the same base, or you need to validate layer by layer
  • How to validate: Integration testing once all layers are ready

Hybrid (Mix and Match)

  • Combine approaches based on what the project needs
  • Works when: requirements are fuzzy, you need to shift gears between phases, or you're moving from prototype to production

For product work, I usually default to vertical slicing and switch to horizontal only for shared foundations. But honestly, the right answer depends on your context.

One thing I learned the hard way: without explicit guidance, LLMs tend to slip into horizontal slicing — building layer by layer — even when vertical makes more sense. They also tend to leave interfaces with existing code underspecified. When I open-sourced my framework, I had to document a step-by-step derivation process to get consistent vertical implementations. That documentation only exists because I ran into the wall first.

2.4 How Quality Assurance Changes

A typical QA flow might look like this:

Requirements →(Human Review)→ Specification
  ↓
Specification →(AI)→ Code
  ↓
Code →(AI + CI)→ Test Execution
  ↓
Test Results →(Human)→ Feedback

In practice, most of the important review work shifts to the specification stage. Code gets validated through test pass/fail. That said, for high-risk domains (security, finance, healthcare), code reviews may still have a role.

What About Test Quality?

If LLMs write both code and tests, how do you trust the tests? A few things help:

  • Generate tests from specs (design docs), so they're tied to requirements
  • Have humans review the specs — that indirectly validates test direction
  • Track coverage metrics for completeness
  • Use production monitoring to catch what slipped through
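
To make the first point concrete, here's roughly what a spec-derived test might look like. The `applyDiscount` function, the pricing module, and the SPEC-3.2 requirement ID are all made up for illustration; the pattern is simply that every test names the requirement it encodes, so reviewing the spec indirectly reviews the tests.

```typescript
// Sketch: tests derived from a spec, tagged with the requirement they cover.
// `applyDiscount` and SPEC-3.2 are hypothetical.
import { describe, it, expect } from "vitest";
import { applyDiscount } from "./pricing";

describe("SPEC-3.2: orders of 10+ items get a 5% discount", () => {
  it("applies the discount at exactly 10 items", () => {
    expect(applyDiscount({ items: 10, subtotal: 100 })).toBe(95);
  });

  it("does not apply the discount at 9 items", () => {
    expect(applyDiscount({ items: 9, subtotal: 100 })).toBe(100);
  });
});
```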

The point isn't to write perfect tests. It's to build a feedback loop that catches issues early, before they become expensive.

One pattern I kept hitting: LLMs love to generate tests. Too many tests. Without guidance, they'd create exhaustive test suites that were expensive to maintain and didn't add proportional value. I ended up documenting a filtering process that focused on ROI and critical user journeys — only then did test generation become manageable.

2.5 Keeping Code Maintainable

| Aspect | How to Ensure It |
| --- | --- |
| Understandability | Specs and design docs |
| Modifiability | Clear boundaries |
| Testability | Test-first approach + testable boundaries |
| Operability | Mature CI/CD |
| Analyzability | Clear mapping back to specs |

The way I see it:

Keeping things changeable usually costs less than trying to get them perfect upfront

2.6 Where Does Architecture Fit?

Specs drive design. Architecture is how you make specs happen. But for large, long-lived systems, specs change over time — so architecture that can handle change still matters.

In LLM-era development, spec clarity tends to outweigh architectural cleverness. But that's not "ignore architecture" — it's "unclear specs can't be saved by good architecture." Good architecture just makes spec changes less painful.

3. The Development Process, Step by Step

3.1 Overview

| Step | What Happens | Who's Responsible |
| --- | --- | --- |
| 1. Spec (Why/What) | Define requirements clearly. Lock down interfaces. | Human |
| 2. Plan (How) | Break work into the smallest useful slices | Human + AI |
| 3. Test Generation | Generate tests from the plan. Catch spec issues early. | AI |
| 4. Implementation | Generate code based on the plan | AI |
| 5. Continuous Refactoring | Improve code while keeping contracts stable | AI + Human |

3.2 Why Specs Matter So Much

If your specs are fuzzy, your architecture will just inherit that fuzziness

  • Specs are the source of truth
  • Architecture is just a means to fulfill specs
  • Ambiguity and contradictions need to be resolved at the spec stage

So what's important?

  1. Clarify specs to a level LLMs can actually work with
  2. Derive tests and implementation from specs consistently
  3. Run this as a repeatable process

Here's something that surprised me: when I started defining clear process steps — spec, plan, test, implement — each step naturally became its own unit of work with distinct inputs and outputs.

That's when I realized these steps could be handled by separate, specialized "sub-agents" with fresh context windows. The main orchestrator doesn't do the work — it just coordinates. This approach solved two problems at once: context exhaustion (each agent starts fresh) and quality consistency (each agent has a single responsibility).
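
Here's a rough sketch of that shape, with placeholder names (`runSubAgent`, the artifact types) standing in for whatever your tooling actually provides. The orchestrator only wires one step's output into the next step's input, and every sub-agent starts from a clean context.

```typescript
// Sketch: an orchestrator that delegates each step to a fresh, single-purpose sub-agent.
// `runSubAgent` is a placeholder for "start a new LLM session with only these inputs".
type Spec = { requirements: string };
type Plan = { tasks: string[] };
type TestSuite = { files: string[] };
type Implementation = { files: string[] };

async function runSubAgent<T>(role: string, input: unknown): Promise<T> {
  // Placeholder: each call would spin up a new context with a role-specific prompt.
  throw new Error("not implemented");
}

async function orchestrate(spec: Spec): Promise<Implementation> {
  // The orchestrator never does the work itself; it only passes artifacts between steps.
  const plan = await runSubAgent<Plan>("planner", spec);
  const tests = await runSubAgent<TestSuite>("test-generator", { spec, plan });
  return runSubAgent<Implementation>("implementer", { plan, tests });
}
```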

3.3 What Makes "Small Changes" Possible

You'll need:

  • Trunk-Based Development: Merge to main often
  • CI/CD Pipelines: Automated tests and deploys
  • Feature Flags: Roll out gradually in production
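
As a tiny illustration of the feature-flag piece, here's about the smallest gate that lets you merge unfinished work to main and keep it dark. The flag name, the env-var convention, and the `checkout` function are assumptions for the sketch; real setups usually sit behind a flag service with per-user or percentage rollouts.

```typescript
// Sketch: the smallest useful feature flag, read from the environment.
export function isEnabled(flag: string): boolean {
  // e.g. FEATURE_NEW_CHECKOUT=true enables "new-checkout"
  const key = "FEATURE_" + flag.toUpperCase().replace(/-/g, "_");
  return process.env[key] === "true";
}

// Call site: merge the new path to main early, but keep it dark until rollout.
export function checkout(_order: unknown) {
  if (isEnabled("new-checkout")) {
    // new implementation, merged early but dark until rollout
  } else {
    // existing implementation stays the default path
  }
}
```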

What DORA research tells us:

  • High-performing teams do "small batch changes" with "autonomy"
  • Deployment frequency and lead time are what set Elite teams apart

Kent Beck nailed it:

"Make it work, make it right, make it fast"

Don't aim for perfect on the first try. Get it working, then improve.

4. Data and Real-World Examples

4.1 Productivity Numbers

McKinsey (2024)

  • 35-50% time savings for certain tasks
  • 30-60% reduction for docs, new code, and refactoring

BairesDev Survey (2025)

  • 92% of devs use AI-assisted coding
  • Average 7.3 hours saved per week

A word of caution: A METR experiment found that for experienced devs working on familiar OSS projects, AI tools sometimes decreased productivity. Results vary a lot depending on the task and the developer's experience level.

4.2 What Companies Are Seeing

Google (Code Migration Project)

  • 80% of code changes successfully generated by AI
  • 50% reduction in migration time

Airbnb

  • Migrated 3,500 test files in 6 weeks (originally estimated at 1.5 years)

4.3 Refactoring at Scale

Qodo Survey

  • 60-80% time savings with AI refactoring tools
  • 4x ROI by focusing on high-impact legacy components
  • 70% fewer production incidents with systematic test protocols

"Consistent iteration, not one-time rewrites, transforms technical debt from an exponentially growing problem into a quantifiable, declining curve."

5. How Organizations Need to Change

5.1 Small Teams Win

Communication paths explode fast:

  • 3 people → 3 paths
  • 7 people → 21 paths (7x)
  • 10 people → 45 paths (15x)

What research says:

| Source | Finding |
| --- | --- |
| Hackman (2002) | 4-5 people max if everyone needs to understand the whole |
| Google DORA (2023-2024) | Elite teams tend to be small and autonomous |

On team size:

Research already says smaller is better. If LLMs boost productivity even more, 2-3 people might be enough. This isn't rigorously proven — it's a gut feel from practice. Some projects might need 3-4.

5.2 How to Scale

| Traditional | LLM Era |
| --- | --- |
| Grow teams to scale | Keep teams small, add more teams |
| Managers manage people | Managers optimize delivery |
| Split by specialty | Split by product/feature |
| Design governance processes | Enable autonomous teams |

What managers need to do now:

In the LLM era, managers connect product delivery with everything around it and maximize outcomes.

Specifically:

  • Optimize headcount and reform dev processes
  • Smooth out bottlenecks around dev (requirements, QA, etc.)

Implementation cost used to dominate. As LLMs shrink that, surrounding processes become the bottleneck. Managers need to optimize the whole system.

Also: managers need to actually use LLM-based development themselves. They become evangelists and enablers for the new way of working.

5.3 The Direction Forward

When efficiency improves, it looks like a choice between "fewer people per mission" and "more value with the same people." But Brooks's Law suggests the large team itself is the problem.

So: minimize people per mission first, get efficient, then grow thoughtfully. "More value with the same headcount" only makes sense once you're already efficient.

6. Risks and Limitations

6.1 Not All Bugs Are Equal

For typical app development:

  • Prioritize by impact × probability
  • Throwing resources at rare, low-impact bugs is over-engineering

Exceptions (these need different treatment):

  • Financial systems (regulatory, audit)
  • Healthcare/Aerospace (life-critical)
  • Security-critical (small-looking issues can be huge)
  • Distributed systems (race conditions, non-deterministic bugs)

For these, automated security scans and formal verification help more.

6.2 LLMs Miss Context (and Take Shortcuts)

Qodo Survey:

  • 65% say AI "misses relevant context" during refactoring
  • Structural tasks are hit hardest

There's another failure mode I kept seeing: quality checks happen late in the implementation process, when the context window is already strained. At that point, LLMs start taking shortcuts — they'll produce code that passes lint and tests technically, but cuts corners in ways that matter. They want to "finish" the task.

My fix was to run quality checks in a separate context — a fresh sub-agent that only does quality validation. It doesn't carry the baggage of the implementation session.

Bonus: I found that LLMs are actually better at reviewing and fixing code they didn't write. A separate reviewer agent can be more objective than the one that generated the code in the first place.
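
Here's a sketch of what that separation can look like. `startFreshAgent` is a placeholder for "new session, no prior history"; the key point is that the reviewer only sees the diff and the written rules, never the implementation conversation.

```typescript
// Sketch: quality review in a fresh context. The reviewer agent sees only the diff
// and the written rules, not the implementation session that produced the code.
async function startFreshAgent(systemPrompt: string, input: string): Promise<string> {
  // Placeholder for "new LLM session with no prior conversation history".
  throw new Error("not implemented");
}

export async function reviewDiff(diff: string, rules: string): Promise<string> {
  return startFreshAgent(
    "You are a code reviewer. You did not write this code. " +
      "Check it against the rules below and report violations; do not try to 'finish' the task.",
    `Rules:\n${rules}\n\nDiff under review:\n${diff}`
  );
}
```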

How to deal with it:

  • Tests guarantee behavior
  • Small changes limit blast radius
  • Early refactoring shortens how long bad code sticks around
  • Run quality checks in isolated contexts

6.3 Spec-Driven Development Has Limits

Marmelab raised some fair critiques:

  • Agents don't always follow specs
  • As projects grow, specs start missing the point
  • "It's just Waterfall coming back"

My response: That's exactly why you need to design the whole process, not just write better specs. Build systems that catch and fix spec issues early.

6.4 Applying This to Existing Orgs

  • In Japan (and many places), legal and cultural constraints make team downsizing hard
  • Understand the ideal, then find realistic compromises
  • Incremental progress is the practical path

7. Making It Real

7.1 What Changes

| Aspect | Traditional | With LLMs |
| --- | --- | --- |
| Quality Assurance | Code review | Tests + AI review |
| What Gets Reviewed | Code | Specs, plans, rules |
| Prompt Knowledge | N/A | Share via pairing |
| Initial Quality vs. Refactoring | Initial quality focus | Continuous refactoring |
| Change Strategy | Big changes, carefully | Small changes, often |

7.2 Document What Used to Be Implicit

What code reviews used to handle implicitly (readability, knowledge sharing) now needs to be written down for LLMs.

Write these down:

  • Coding standards (style, naming, patterns)
  • Architecture rules (separation of concerns, dependency direction)
  • Domain knowledge (business logic, terminology)

This stabilizes LLM output quality and reduces human review load.
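
One way to make a rule like "dependency direction" more than prose is to turn it into a check that neither the LLM nor CI can ignore. This is a toy sketch that assumes a `src/domain` / `src/infrastructure` layout; adjust the paths and the import pattern to your own structure.

```typescript
// Sketch: turn one architecture rule into a check instead of prose.
// Assumed layout: src/domain must not import from src/infrastructure.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

function listFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
    entry.isDirectory() ? listFiles(join(dir, entry.name)) : [join(dir, entry.name)]
  );
}

const violations = listFiles("src/domain")
  .filter((file) => file.endsWith(".ts"))
  .filter((file) => /from\s+["'].*infrastructure/.test(readFileSync(file, "utf8")));

if (violations.length > 0) {
  console.error("Domain code importing from infrastructure:", violations);
  process.exit(1); // fail CI so the rule is enforced, not just documented
}
```

Once a rule is executable, you stop re-litigating it in every review — and the LLM gets immediate feedback when it breaks it.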

When I open-sourced my framework, the biggest surprise was how many "implicit rules" had been living in my own head. I knew them, but I'd never written them down — because I didn't have to. Once I documented them for contributors (and for LLMs), output quality improved immediately. Even for me.

7.3 Sharing Prompt Know-How

The problem: Prompt context doesn't live in the code

Solutions:

  • Small teams: Share as tacit knowledge without formalizing everything
  • Pair/mob programming: Level up prompt skills across the team
  • Commit logs: Not ideal for lookups, but practical

7.4 Infrastructure You'll Need

  • CI/CD Pipelines: Automated tests and deploys
  • Trunk-Based Development: Merge often
  • Feature Flags: Gradual rollouts
  • Monitoring: Catch production issues early

8. Wrapping Up

These ideas didn't come from theory — they came from building real products with LLMs and hitting the limits repeatedly. Every rule I wrote down exists because I saw the LLM fail without it.

8.1 The Core Ideas

  • Design the whole Spec → Plan → Test → Implementation cycle, not just specs
  • Build processes that catch and fix spec issues early — not perfect specs
  • Pick your slicing strategy (vertical/horizontal/hybrid) based on context
  • Feedback on prompts and rules improves output — feedback on code doesn't
  • Document what used to be implicit

8.2 How Organizations Should Evolve

  1. Team size: Small (2-3 people) by default — a gut feel from practice
  2. Scaling: Add teams, not people
  3. Order of operations: Minimize people per mission first, then grow
  4. Independence: High autonomy, minimal dependencies

8.3 For Existing Organizations

  • Separate ideals from constraints
  • Aim for ideals in new projects
  • Make incremental progress on existing ones

Appendix: Choosing Your Implementation Strategy

Here's a framework I use when deciding how to slice work.

Phase 1: Analyze Current State

  • How is the existing codebase structured?
  • Look at: separation of concerns, data flow, dependencies, tech debt

Phase 2: Explore Strategies

  • What patterns should you reference?
  • Check: similar projects, OSS implementations, tech stack examples

Phase 3: Assess Risk

  • What could go wrong?
  • Evaluate: technical, operational, and project risks

Phase 4: Check Constraints

  • What are you working with?
  • Consider: technical, time, resource, and business constraints

Phase 5: Decide

  • Vertical / Horizontal / Hybrid?
  • Set verification levels per task (L1: Works → L2: Tests pass → L3: Build succeeds)
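
If it helps, the verification level can live in the plan as data rather than prose. A minimal sketch, with purely illustrative names:

```typescript
// Sketch: recording a verification level per task in the implementation plan.
// The level names mirror the ones above; everything else is illustrative.
type VerificationLevel = "L1-works" | "L2-tests-pass" | "L3-build-succeeds";

interface PlannedTask {
  id: string;
  description: string;
  verify: VerificationLevel; // how much proof this task needs before it counts as done
}

const example: PlannedTask = {
  id: "task-07",
  description: "Add the discount rule to the pricing service",
  verify: "L2-tests-pass",
};
```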

Phase 6: Document It

  • Write down why you chose this approach in a Design Doc

If this resonated with you, feel free to check out the repo — stars, issues, and PRs are all welcome:
👉 ai-coding-project-boilerplate

References

Academic Research

  • Hackman, J.R. (2002). Leading Teams: Setting the Stage for Great Performances
  • Google DORA Report (2023-2024). State of DevOps
  • McKinsey (2024). The Economic Potential of Generative AI
  • Becker, J. et al. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developers. METR.

Industry Reports and Case Studies

Classic Works

  • Brooks, F.P. (1975). The Mythical Man-Month
  • Beck, K. (1999). Extreme Programming Explained
