Shinsuke KAGAWA

Rethinking Team Development in the Age of LLMs

This is a long read. Feel free to skim the headings and jump to what you care about.

TL;DR:

  • Code reviews shift from reviewing code to reviewing specs, plans, and rules
  • Design the whole Spec → Plan → Test → Implementation cycle, not just specs
  • Small teams (2-3 people) work better — scale by adding teams, not people
  • Run quality checks in isolated contexts to avoid LLM shortcuts
  • Document implicit knowledge explicitly — LLMs can't read your mind

As LLMs become part of the core development workflow, many of our long-standing processes start to break down. In this post, I'll walk through what changes, why it matters, and how teams can adapt.

Much of what I write here comes from building and maintaining a company-internal chatbot on my own, and later open-sourcing the framework behind it. Working solo with LLMs surfaced a lot of pain points that I didn't expect — especially around review, scaling, and process design.

For context, the framework I open-sourced is here (not required to read this post, but included for transparency):
👉 ai-coding-project-boilerplate


1. Why Our Processes Need to Change

1.1 The Problem We're Facing

In LLM-based development, the traditional code review feedback loop doesn't quite work anymore.

| Traditional Development | LLM-Based Development |
| --- | --- |
| Code gets corrected through reviews | Prompts need feedback |
| Feedback lives in the code | Prompt context doesn't stick around |
| Code quality comes from reviews | Final quality depends on prompts |

1.2 Code Reviews Aren't What They Used to Be

Traditional code reviews do a lot of things: security checks, performance tuning, readability improvements, and knowledge sharing. In the LLM era, these functions shift in two key ways:

1. Way More Output

As LLMs generate more code, it becomes unrealistic to review everything by hand.

2. Feedback Goes Somewhere Else

Here's the bigger issue: giving feedback on code doesn't actually change future output. To improve what LLMs generate, you need to provide feedback to prompts and rules (like CLAUDE.md).

That doesn't mean we can throw away code reviews — far from it. Security and performance still need human eyes. But things like readability and knowledge sharing? Those should be codified into standards and documentation. What we used to do implicitly in reviews now needs to be explicit rules for the LLM.

When I started building with LLMs, my basic stance was: don't manually fix the output. I wanted to see what the LLM could do on its own — partly to stress-test it, partly because I didn't want my own limitations to cap the LLM's performance. What I found was that output quality was wildly inconsistent. So I started asking: "Why did it generate that?" and writing down the answers. Over time, that turned into a set of rules that had mostly been stuck in my head.

1.3 Why "Just Add More People" Doesn't Work

The Scaling Problem

  • Communication paths grow as n(n-1)/2 for team size n
  • Governance overhead (reviews, syncs, docs) eats into productivity
  • Brooks's Law (1975) called this out decades ago — LLMs just make it more obvious

In my experience (and backed by plenty of research), larger teams tend to suffer from higher communication overhead and lower efficiency. So "more value with the same people" is hard to pull off when you're already overstaffed. A more realistic path: minimize the number of people per mission first, get efficient, then add people who fit the culture as needed.

2. Rethinking How We Build Software

2.1 It's About the Whole Process, Not Just Specs

You've probably heard "Spec-Driven Development" thrown around, but focusing only on specs isn't enough.

"The person who communicates the best will be the most valuable programmer in the future. The new scarce skill is writing specifications that fully capture your intent and values."
— Sean Grove (OpenAI), AI Engineer Conference

What actually matters is designing the whole Spec → Plan → Test → Implementation cycle — not just writing better specs.

2.2 Specs Will Never Be Perfect (And That's OK)

Writing specs with zero ambiguity? Nearly impossible. Instead of fighting that battle head-on, control it through process.

The approach I recommend:

  • Don't chase perfect specs — run short cycles of spec → plan → test → implementation instead
  • Use LLMs to review specs from multiple angles (edge cases, ambiguities, missing assumptions, etc.)
  • Humans make the final call
  • Catch spec gaps early in the testing phase

Don't aim for perfect specs — aim for a process that catches problems before they get expensive.
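
To make the "multiple angles" idea concrete, here's a minimal sketch of what that review pass could look like. Everything in it is illustrative: `callLLM` is a placeholder for whatever client or CLI you actually use, and the angle list is just the one from the bullets above.

```typescript
// Sketch: review one spec from several angles before implementation starts.
// `callLLM` is a placeholder for your actual LLM client or CLI wrapper.
type ReviewAngle = "edge cases" | "ambiguities" | "missing assumptions" | "conflicting requirements";

async function callLLM(prompt: string): Promise<string> {
  // Placeholder: wire this to whatever you actually use.
  throw new Error("not implemented");
}

async function reviewSpec(spec: string, angles: ReviewAngle[]): Promise<Record<string, string>> {
  const findings: Record<string, string> = {};
  for (const angle of angles) {
    // Each angle gets its own focused prompt, so one concern doesn't drown out another.
    findings[angle] = await callLLM(
      `Review the following specification strictly for ${angle}.\n` +
      `List concrete issues and the section they appear in. Do not rewrite the spec.\n\n${spec}`
    );
  }
  return findings; // A human triages these and decides what actually changes in the spec.
}
```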

2.3 Picking the Right Implementation Approach

There's no one-size-fits-all here — the right slicing strategy depends heavily on your project's maturity and architecture.

Vertical Slicing (Feature-Driven)

  • Build across all layers for each feature
  • Works when: features are mostly independent, you want to ship something usable early, or changes touch all layers anyway
  • How to validate: Does each slice deliver something real and usable to the end-user?

Horizontal Slicing (Foundation-Driven)

  • Build architecture layers one at a time
  • Works when: foundation stability matters most, multiple features share the same base, or you need to validate layer by layer
  • How to validate: Integration testing once all layers are ready

Hybrid (Mix and Match)

  • Combine approaches based on what the project needs
  • Works when: requirements are fuzzy, you need to shift gears between phases, or you're moving from prototype to production

For product work, I usually default to vertical slicing and switch to horizontal only for shared foundations. But honestly, the right answer depends on your context.

One thing I learned the hard way: without explicit guidance, LLMs tend to slip into horizontal slicing — building layer by layer — even when vertical makes more sense. They also tend to leave interfaces with existing code underspecified. When I open-sourced my framework, I had to document a step-by-step derivation process to get consistent vertical implementations. That documentation only exists because I ran into the wall first.

2.4 How Quality Assurance Changes

A typical QA flow might look like this:

Requirements →(Human Review)→ Specification
  ↓
Specification →(AI)→ Code
  ↓
Code →(AI + CI)→ Test Execution
  ↓
Test Results →(Human)→ Feedback

In practice, most of the important review work shifts to the specification stage. Code gets validated through test pass/fail. That said, for high-risk domains (security, finance, healthcare), code reviews may still have a role.

What About Test Quality?

If LLMs write both code and tests, how do you trust the tests? A few things help:

  • Generate tests from specs (design docs), so they're tied to requirements
  • Have humans review the specs — that indirectly validates test direction
  • Track coverage metrics for completeness
  • Use production monitoring to catch what slipped through
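
To make the first point concrete, here's roughly what a spec-derived test might look like. The `applyDiscount` function, the pricing module, and the SPEC-3.2 requirement ID are all made up for illustration; the pattern is simply that every test names the requirement it encodes, so reviewing the spec indirectly reviews the tests.

```typescript
// Sketch: tests derived from a spec, tagged with the requirement they cover.
// `applyDiscount` and SPEC-3.2 are hypothetical.
import { describe, it, expect } from "vitest";
import { applyDiscount } from "./pricing";

describe("SPEC-3.2: orders of 10+ items get a 5% discount", () => {
  it("applies the discount at exactly 10 items", () => {
    expect(applyDiscount({ items: 10, subtotal: 100 })).toBe(95);
  });

  it("does not apply the discount at 9 items", () => {
    expect(applyDiscount({ items: 9, subtotal: 100 })).toBe(100);
  });
});
```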

The point isn't to write perfect tests. It's to build a feedback loop that catches issues early, before they become expensive.

One pattern I kept hitting: LLMs love to generate tests. Too many tests. Without guidance, they'd create exhaustive test suites that were expensive to maintain and didn't add proportional value. I ended up documenting a filtering process that focused on ROI and critical user journeys — only then did test generation become manageable.

2.5 Keeping Code Maintainable

| Aspect | How to Ensure It |
| --- | --- |
| Understandability | Specs and design docs |
| Modifiability | Clear boundaries |
| Testability | Test-first approach + testable boundaries |
| Operability | Mature CI/CD |
| Analyzability | Clear mapping back to specs |

The way I see it:

Keeping things changeable usually costs less than trying to get them perfect upfront

2.6 Where Does Architecture Fit?

Specs drive design. Architecture is how you make specs happen. But for large, long-lived systems, specs change over time — so architecture that can handle change still matters.

In LLM-era development, spec clarity tends to outweigh architectural cleverness. But that's not "ignore architecture" — it's "unclear specs can't be saved by good architecture." Good architecture just makes spec changes less painful.

3. The Development Process, Step by Step

3.1 Overview

| Step | What Happens | Who's Responsible |
| --- | --- | --- |
| 1. Spec (Why/What) | Define requirements clearly. Lock down interfaces. | Human |
| 2. Plan (How) | Break work into the smallest useful slices | Human + AI |
| 3. Test Generation | Generate tests from the plan. Catch spec issues early. | AI |
| 4. Implementation | Generate code based on the plan | AI |
| 5. Continuous Refactoring | Improve code while keeping contracts stable | AI + Human |

3.2 Why Specs Matter So Much

If your specs are fuzzy, your architecture will just inherit that fuzziness

  • Specs are the source of truth
  • Architecture is just a means to fulfill specs
  • Ambiguity and contradictions need to be resolved at the spec stage

So what's important?

  1. Clarify specs to a level LLMs can actually work with
  2. Derive tests and implementation from specs consistently
  3. Run this as a repeatable process

Here's something that surprised me: when I started defining clear process steps — spec, plan, test, implement — each step naturally became its own unit of work with distinct inputs and outputs.

That's when I realized these steps could be handled by separate, specialized "sub-agents" with fresh context windows. The main orchestrator doesn't do the work — it just coordinates. This approach solved two problems at once: context exhaustion (each agent starts fresh) and quality consistency (each agent has a single responsibility).
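
Here's a rough sketch of that shape, with placeholder names (`runSubAgent`, the artifact types) standing in for whatever your tooling actually provides. The orchestrator only wires one step's output into the next step's input, and every sub-agent starts from a clean context.

```typescript
// Sketch: an orchestrator that delegates each step to a fresh, single-purpose sub-agent.
// `runSubAgent` is a placeholder for "start a new LLM session with only these inputs".
type Spec = { requirements: string };
type Plan = { tasks: string[] };
type TestSuite = { files: string[] };
type Implementation = { files: string[] };

async function runSubAgent<T>(role: string, input: unknown): Promise<T> {
  // Placeholder: each call would spin up a new context with a role-specific prompt.
  throw new Error("not implemented");
}

async function orchestrate(spec: Spec): Promise<Implementation> {
  // The orchestrator never does the work itself; it only passes artifacts between steps.
  const plan = await runSubAgent<Plan>("planner", spec);
  const tests = await runSubAgent<TestSuite>("test-generator", { spec, plan });
  return runSubAgent<Implementation>("implementer", { plan, tests });
}
```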

3.3 What Makes "Small Changes" Possible

You'll need:

  • Trunk-Based Development: Merge to main often
  • CI/CD Pipelines: Automated tests and deploys
  • Feature Flags: Roll out gradually in production
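
As a tiny illustration of the feature-flag piece, here's about the smallest gate that lets you merge unfinished work to main and keep it dark. The flag name, the env-var convention, and the `checkout` function are assumptions for the sketch; real setups usually sit behind a flag service with per-user or percentage rollouts.

```typescript
// Sketch: the smallest useful feature flag, read from the environment.
export function isEnabled(flag: string): boolean {
  // e.g. FEATURE_NEW_CHECKOUT=true enables "new-checkout"
  const key = "FEATURE_" + flag.toUpperCase().replace(/-/g, "_");
  return process.env[key] === "true";
}

// Call site: merge the new path to main early, but keep it dark until rollout.
export function checkout(_order: unknown) {
  if (isEnabled("new-checkout")) {
    // new implementation, merged early but dark until rollout
  } else {
    // existing implementation stays the default path
  }
}
```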

What DORA research tells us:

  • High-performing teams do "small batch changes" with "autonomy"
  • Deployment frequency and lead time are what set Elite teams apart

Kent Beck nailed it:

"Make it work, make it right, make it fast"

Don't aim for perfect on the first try. Get it working, then improve.

4. Data and Real-World Examples

4.1 Productivity Numbers

McKinsey (2024)

  • 35-50% time savings for certain tasks
  • 30-60% reduction for docs, new code, and refactoring

BairesDev Survey (2025)

  • 92% of devs use AI-assisted coding
  • Average 7.3 hours saved per week

A word of caution: A METR experiment found that for experienced devs working on familiar OSS projects, AI tools sometimes decreased productivity. Results vary a lot depending on the task and the developer's experience level.

4.2 What Companies Are Seeing

Google (Code Migration Project)

  • 80% of code changes successfully generated by AI
  • 50% reduction in migration time

Airbnb

  • Migrated 3,500 test files in 6 weeks (originally estimated at 1.5 years)

4.3 Refactoring at Scale

Qodo Survey

  • 60-80% time savings with AI refactoring tools
  • 4x ROI by focusing on high-impact legacy components
  • 70% fewer production incidents with systematic test protocols

"Consistent iteration, not one-time rewrites, transforms technical debt from an exponentially growing problem into a quantifiable, declining curve."

5. How Organizations Need to Change

5.1 Small Teams Win

Communication paths explode fast:

  • 3 people → 3 paths
  • 7 people → 21 paths (7x)
  • 10 people → 45 paths (15x)

What research says:

| Source | Finding |
| --- | --- |
| Hackman (2002) | 4-5 people max if everyone needs to understand the whole |
| Google DORA (2023-2024) | Elite teams tend to be small and autonomous |

On team size:

Research already says smaller is better. If LLMs boost productivity even more, 2-3 people might be enough. This isn't rigorously proven — it's a gut feel from practice. Some projects might need 3-4.

5.2 How to Scale

| Traditional | LLM Era |
| --- | --- |
| Grow teams to scale | Keep teams small, add more teams |
| Managers manage people | Managers optimize delivery |
| Split by specialty | Split by product/feature |
| Design governance processes | Enable autonomous teams |

What managers need to do now:

In the LLM era, managers connect product delivery with everything around it and maximize outcomes.

Specifically:

  • Optimize headcount and reform dev processes
  • Smooth out bottlenecks around dev (requirements, QA, etc.)

Implementation cost used to dominate. As LLMs shrink that, surrounding processes become the bottleneck. Managers need to optimize the whole system.

Also: managers need to actually use LLM-based development themselves. They become evangelists and enablers for the new way of working.

5.3 The Direction Forward

When efficiency improves, it looks like a choice between "fewer people per mission" and "more value with the same people." But Brooks's Law suggests the large team itself is the problem.

So: minimize people per mission first, get efficient, then grow thoughtfully. "More value with the same headcount" only makes sense once you're already efficient.

6. Risks and Limitations

6.1 Not All Bugs Are Equal

For typical app development:

  • Prioritize by impact × probability
  • Throwing resources at rare, low-impact bugs is over-engineering

Exceptions (these need different treatment):

  • Financial systems (regulatory, audit)
  • Healthcare/Aerospace (life-critical)
  • Security-critical (small-looking issues can be huge)
  • Distributed systems (race conditions, non-deterministic bugs)

For these, automated security scans and formal verification help more.

6.2 LLMs Miss Context (and Take Shortcuts)

Qodo Survey:

  • 65% say AI "misses relevant context" during refactoring
  • Structural tasks are hit hardest

There's another failure mode I kept seeing: quality checks happen late in the implementation process, when the context window is already strained. At that point, LLMs start taking shortcuts — they'll produce code that passes lint and tests technically, but cuts corners in ways that matter. They want to "finish" the task.

My fix was to run quality checks in a separate context — a fresh sub-agent that only does quality validation. It doesn't carry the baggage of the implementation session.

Bonus: I found that LLMs are actually better at reviewing and fixing code they didn't write. A separate reviewer agent can be more objective than the one that generated the code in the first place.
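
Here's a sketch of what that separation can look like. `startFreshAgent` is a placeholder for "new session, no prior history"; the key point is that the reviewer only sees the diff and the written rules, never the implementation conversation.

```typescript
// Sketch: quality review in a fresh context. The reviewer agent sees only the diff
// and the written rules, not the implementation session that produced the code.
async function startFreshAgent(systemPrompt: string, input: string): Promise<string> {
  // Placeholder for "new LLM session with no prior conversation history".
  throw new Error("not implemented");
}

export async function reviewDiff(diff: string, rules: string): Promise<string> {
  return startFreshAgent(
    "You are a code reviewer. You did not write this code. " +
      "Check it against the rules below and report violations; do not try to 'finish' the task.",
    `Rules:\n${rules}\n\nDiff under review:\n${diff}`
  );
}
```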

How to deal with it:

  • Tests guarantee behavior
  • Small changes limit blast radius
  • Early refactoring shortens how long bad code sticks around
  • Run quality checks in isolated contexts

6.3 Spec-Driven Development Has Limits

Marmelab raised some fair critiques:

  • Agents don't always follow specs
  • As projects grow, specs start missing the point
  • "It's just Waterfall coming back"

My response: That's exactly why you need to design the whole process, not just write better specs. Build systems that catch and fix spec issues early.

6.4 Applying This to Existing Orgs

  • In Japan (and many places), legal and cultural constraints make team downsizing hard
  • Understand the ideal, then find realistic compromises
  • Incremental progress is the practical path

7. Making It Real

7.1 What Changes

| Aspect | Traditional | With LLMs |
| --- | --- | --- |
| Quality Assurance | Code review | Tests + AI review |
| What Gets Reviewed | Code | Specs, plans, rules |
| Prompt Knowledge | N/A | Share via pairing |
| Initial Quality vs. Refactoring | Initial quality focus | Continuous refactoring |
| Change Strategy | Big changes, carefully | Small changes, often |

7.2 Document What Used to Be Implicit

What code reviews used to handle implicitly (readability, knowledge sharing) now needs to be written down for LLMs.

Write these down:

  • Coding standards (style, naming, patterns)
  • Architecture rules (separation of concerns, dependency direction)
  • Domain knowledge (business logic, terminology)

This stabilizes LLM output quality and reduces human review load.
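
One way to make a rule like "dependency direction" more than prose is to turn it into a check that neither the LLM nor CI can ignore. This is a toy sketch that assumes a `src/domain` / `src/infrastructure` layout; adjust the paths and the import pattern to your own structure.

```typescript
// Sketch: turn one architecture rule into a check instead of prose.
// Assumed layout: src/domain must not import from src/infrastructure.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

function listFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
    entry.isDirectory() ? listFiles(join(dir, entry.name)) : [join(dir, entry.name)]
  );
}

const violations = listFiles("src/domain")
  .filter((file) => file.endsWith(".ts"))
  .filter((file) => /from\s+["'].*infrastructure/.test(readFileSync(file, "utf8")));

if (violations.length > 0) {
  console.error("Domain code importing from infrastructure:", violations);
  process.exit(1); // fail CI so the rule is enforced, not just documented
}
```

Once a rule is executable, you stop re-litigating it in every review — and the LLM gets immediate feedback when it breaks it.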

When I open-sourced my framework, the biggest surprise was how many "implicit rules" had been living in my own head. I knew them, but I'd never written them down — because I didn't have to. Once I documented them for contributors (and for LLMs), output quality improved immediately. Even for me.

7.3 Sharing Prompt Know-How

The problem: Prompt context doesn't live in the code

Solutions:

  • Small teams: Share as tacit knowledge without formalizing everything
  • Pair/mob programming: Level up prompt skills across the team
  • Commit logs: Not ideal for lookups, but practical

7.4 Infrastructure You'll Need

  • CI/CD Pipelines: Automated tests and deploys
  • Trunk-Based Development: Merge often
  • Feature Flags: Gradual rollouts
  • Monitoring: Catch production issues early

8. Wrapping Up

These ideas didn't come from theory — they came from building real products with LLMs and hitting the limits repeatedly. Every rule I wrote down exists because I saw the LLM fail without it.

8.1 The Core Ideas

  • Design the whole Spec → Plan → Test → Implementation cycle, not just specs
  • Build processes that catch and fix spec issues early — not perfect specs
  • Pick your slicing strategy (vertical/horizontal/hybrid) based on context
  • Feedback on prompts and rules improves output — feedback on code doesn't
  • Document what used to be implicit

8.2 How Organizations Should Evolve

  1. Team size: Small (2-3 people) by default — a gut feel from practice
  2. Scaling: Add teams, not people
  3. Order of operations: Minimize people per mission first, then grow
  4. Independence: High autonomy, minimal dependencies

8.3 For Existing Organizations

  • Separate ideals from constraints
  • Aim for ideals in new projects
  • Make incremental progress on existing ones

Appendix: Choosing Your Implementation Strategy

Here's a framework I use when deciding how to slice work.

Phase 1: Analyze Current State

  • How is the existing codebase structured?
  • Look at: separation of concerns, data flow, dependencies, tech debt

Phase 2: Explore Strategies

  • What patterns should you reference?
  • Check: similar projects, OSS implementations, tech stack examples

Phase 3: Assess Risk

  • What could go wrong?
  • Evaluate: technical, operational, and project risks

Phase 4: Check Constraints

  • What are you working with?
  • Consider: technical, time, resource, and business constraints

Phase 5: Decide

  • Vertical / Horizontal / Hybrid?
  • Set verification levels per task (L1: Works → L2: Tests pass → L3: Build succeeds)
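
If it helps, the verification level can live in the plan as data rather than prose. A minimal sketch, with purely illustrative names:

```typescript
// Sketch: recording a verification level per task in the implementation plan.
// The level names mirror the ones above; everything else is illustrative.
type VerificationLevel = "L1-works" | "L2-tests-pass" | "L3-build-succeeds";

interface PlannedTask {
  id: string;
  description: string;
  verify: VerificationLevel; // how much proof this task needs before it counts as done
}

const example: PlannedTask = {
  id: "task-07",
  description: "Add the discount rule to the pricing service",
  verify: "L2-tests-pass",
};
```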

Phase 6: Document It

  • Write down why you chose this approach in a Design Doc

If this resonated with you, feel free to check out the repo — stars, issues, and PRs are all welcome:
👉 ai-coding-project-boilerplate

References

Academic Research

  • Hackman, J.R. (2002). Leading Teams: Setting the Stage for Great Performances
  • Google DORA Report (2023-2024). State of DevOps
  • McKinsey (2024). The Economic Potential of Generative AI
  • Becker, J. et al. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developers. METR.

Industry Reports and Case Studies

Classic Works

  • Brooks, F.P. (1975). The Mythical Man-Month
  • Beck, K. (1999). Extreme Programming Explained
