AI coding tools are changing how software gets built, but they do not remove the need for software engineering discipline. In practice, they make fundamentals more important.
This post is a condensed write-up of a workshop by Matt Pocock on building better software with AI agents. The original workshop is available here: https://youtu.be/-QFHIoCo-Ko?si=9qyQKxnid9sE_ehc
The core mistake many developers make is treating AI as a “spec-to-code compiler”: write a vague requirement, hand it to an agent, and expect production-ready software to appear. That works for demos. It breaks down in real codebases.
A better model is this:
Use AI agents to accelerate implementation, but use software engineering fundamentals to control alignment, architecture, feedback loops, and quality.
The rest of this post distills that workshop into a concrete developer workflow: agentic planning, PRDs, vertical slicing, TDD, and codebase design for AI-assisted coding.
The Two Constraints of LLM Coding Agents
Before designing an AI-assisted workflow, you need to understand two constraints.
1. Agents have a “smart zone” and a “dumb zone”
An LLM performs best when the context is clean, focused, and not overloaded.
As the conversation grows, the agent has to reason across more tokens, more decisions, more previous mistakes, and more irrelevant detail. Eventually, it starts making worse decisions.
This is why giant context windows are not a free lunch. They are useful for retrieval, but not always for coding. A 1M-token window does not mean the agent stays sharp for 1M tokens.
For coding, the practical strategy is:
- Keep tasks small.
- Keep context clean.
- Avoid long, drifting conversations.
- Prefer fresh sessions for focused work.
- Do not let the agent accumulate too much conversational sediment.
2. Agents forget unless you externalize state
LLMs are stateless between sessions unless you give them state explicitly.
That means important decisions should not live only in the chat history. You need artifacts:
- Product requirements documents.
- Local issue files.
- GitHub issues.
- Architecture notes.
- Test cases.
- Commit history.
- Review summaries.
The trick is not to preserve everything. It is to preserve the useful shape of the work.
Step 1: Start with Alignment, Not Implementation
When a vague feature request arrives, the wrong move is to immediately ask the agent to code.
Example request:
“Retention is bad. Students sign up, do a few lessons, then drop off. Let’s add gamification.”
That sounds simple. It is not.
Before coding, you need to clarify:
- What actions earn points?
- Are points retroactive?
- Do streaks earn points?
- Where does the UI live?
- What counts as lesson completion?
- What is the progression curve?
- What is out of scope?
- What data model supports this?
- How will this be tested?
The useful pattern here is a grilling session.
Instead of saying:
“Create a plan.”
Ask the agent to interrogate the requirement:
“Interview me relentlessly about every aspect of this feature until we reach a shared understanding. Ask one question at a time. For each question, give your recommended answer.”
This changes the interaction.
The goal is not to produce a plan immediately. The goal is to reach a shared design concept between the human and the agent.
That matters because most AI coding failures are not syntax failures. They are alignment failures.
Step 2: Turn the Conversation into a Destination Document
Once the agent and human have converged on the feature, convert the alignment conversation into a PRD.
The PRD should not be a bloated corporate artifact. It should capture the destination.
A useful PRD contains:
# Feature: Gamification System
## Problem
Students start courses but do not consistently return or complete lessons.
## Solution
Add a lightweight gamification system with points, levels, and streaks to increase visible progress and motivation.
## User Stories
- As a student, I can earn points when I complete lessons.
- As a student, I can see my current points on the dashboard.
- As a student, I can see my level progression.
- As an instructor/admin, I can trust that points are derived from real completion events.
## Implementation Decisions
- Points are awarded for lesson completion.
- Video watch events are excluded because they are noisy and gameable.
- Existing completion records may be backfilled.
- Streaks are tracked separately from points.
## Out of Scope
- Leaderboards.
- Social sharing.
- Complex achievements.
- Manual admin point editing.
## Testing Decisions
- Core point logic is tested in a dedicated gamification service.
- Integration tests cover lesson completion triggering point awards.
- UI smoke tests verify dashboard visibility.
The PRD is not the implementation. It is the destination.
The point is to move from “vague intent” to “clear target.”
Step 3: Do Not Use Linear Phase Plans by Default
A common AI workflow is:
Phase 1: Database schema
Phase 2: Backend services
Phase 3: API routes
Phase 4: Frontend UI
Phase 5: Tests
This looks organized, but it has a major flaw: it is horizontal.
The agent builds layer by layer, but you do not get useful feedback until late in the process. The database may be done, the backend may be done, and the UI may be partially done before you discover that the full flow does not actually work.
That is bad engineering.
A better approach is to break work into vertical slices.
Step 4: Prefer Vertical Slices / Tracer Bullets
A vertical slice crosses the full stack and produces something testable.
Bad first task:
Create the gamification database schema and service.
Better first task:
Award points when a student completes a lesson and show the points on the dashboard.
That first slice may include:
- A minimal database change.
- A gamification service.
- A lesson completion hook.
- A dashboard display.
- A test proving points are awarded.
This is more valuable because the system becomes testable immediately.
The agent gets feedback earlier. The human gets something visible earlier. The architecture gets pressure-tested earlier.
This is the same idea as tracer bullets from The Pragmatic Programmer: build a thin, end-to-end path through the system so you can see where you are aiming.
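To make the first slice concrete, here is a minimal sketch of what the lesson completion hook might look like. `LessonStore`, `PointAwarder`, and `completeLesson` are hypothetical names for illustration, not code from the workshop:

```typescript
// Hypothetical wiring for the first slice. `LessonStore` and `PointAwarder`
// stand in for whatever the codebase already has.
type LessonStore = {
  markComplete(userId: string, lessonId: string): Promise<void>
}

type PointAwarder = {
  awardLessonCompletionPoints(input: {
    userId: string
    lessonId: string
    completedAt: Date
  }): Promise<void>
}

// The hook: existing completion logic, plus one new call for the slice.
export async function completeLesson(
  lessons: LessonStore,
  gamification: PointAwarder,
  userId: string,
  lessonId: string,
): Promise<void> {
  await lessons.markComplete(userId, lessonId) // existing behavior
  await gamification.awardLessonCompletionPoints({
    userId,
    lessonId,
    completedAt: new Date(),
  })
}
```

The point is that one new call inside an existing flow is enough to exercise the whole slice, end to end.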
Step 5: Convert the PRD into a Kanban Board, Not a Sequential Script
Instead of one long plan, convert the PRD into independently grabbable issues.
Example:
Issue 1: Award lesson completion points and display them on the dashboard
Blocked by: none
Type: AFK
Issue 2: Track student streaks
Blocked by: Issue 1
Type: AFK
Issue 3: Add level progression based on accumulated points
Blocked by: Issue 1
Type: AFK
Issue 4: Backfill points for existing lesson completions
Blocked by: Issue 1
Type: AFK
Issue 5: Add dashboard polish and empty states
Blocked by: Issues 1, 2, 3
Type: Human review
This gives you a directed acyclic graph of work.
That matters because agents can work in parallel only when dependencies are clear.
A linear plan can usually be executed by one agent. A Kanban-style graph can be executed by multiple agents safely.
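As a sketch of why the graph shape matters, here is how a dispatcher might compute which issues are currently grabbable. The `Issue` shape and the `grabbableIssues` helper are illustrative assumptions, not an existing tool:

```typescript
type Issue = {
  id: number
  title: string
  blockedBy: number[] // ids that must be done before this issue is grabbable
  done: boolean
}

// An issue is grabbable when it is not done and every blocker is done.
function grabbableIssues(issues: Issue[]): Issue[] {
  const doneIds = new Set(issues.filter((i) => i.done).map((i) => i.id))
  return issues.filter(
    (i) => !i.done && i.blockedBy.every((id) => doneIds.has(id)),
  )
}

const board: Issue[] = [
  { id: 1, title: "Award and display lesson completion points", blockedBy: [], done: false },
  { id: 2, title: "Track student streaks", blockedBy: [1], done: false },
  { id: 3, title: "Add level progression", blockedBy: [1], done: false },
  { id: 4, title: "Backfill existing completions", blockedBy: [1], done: false },
]

// Prints only Issue 1 now; once it is marked done, Issues 2-4
// become grabbable simultaneously by three parallel agents.
console.log(grabbableIssues(board).map((i) => i.title))
```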
Step 6: Separate Human-in-the-Loop Work from AFK Work
Not all tasks should be delegated equally.
Some work needs humans:
- Product alignment.
- Domain decisions.
- Architecture boundaries.
- UX judgment.
- QA.
- Final code review.
- Tradeoff decisions.
Some work can be AFK:
- Implementing a well-scoped issue.
- Adding tests.
- Running type checks.
- Fixing straightforward failures.
- Refactoring within a clear boundary.
- Generating boilerplate.
- Applying known patterns.
The practical split is:
Human-in-the-loop:
Idea → Grilling → PRD → Issue breakdown → QA → Review
AFK:
Issue implementation → Tests → Type checks → Automated review → Commit
This is the “day shift / night shift” model.
Humans prepare the backlog and define quality. Agents execute scoped tasks.
Step 7: Use TDD as an Agent Control Mechanism
TDD is not just a human discipline. It is especially useful for AI agents.
The pattern is:
1. Write a failing test.
2. Confirm it fails for the right reason.
3. Implement the smallest change.
4. Run the test.
5. Refactor.
6. Run full feedback loops.
Why this works well with agents:
- It prevents the agent from coding blind.
- It gives the agent immediate feedback.
- It makes it harder for the agent to fake success, because the test has to pass for real.
- It forces the agent to encode expected behavior before implementation.
- It leaves the codebase better tested after each task.
Without tests, agents tend to hallucinate correctness. With tests, they have a feedback loop.
Bad codebases produce bad agents partly because they lack feedback loops.
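As a concrete example of step 1 of that loop, here is what a first failing test for the gamification slice might look like. It assumes Vitest and the hypothetical `createGamificationService` factory sketched in Step 8; the 10-points-per-lesson value is an assumption:

```typescript
import { describe, expect, it } from "vitest"
// Hypothetical module path; the service itself is sketched in Step 8.
import { createGamificationService } from "./gamification-service"

describe("awardLessonCompletionPoints", () => {
  it("awards points exactly once per lesson completion", async () => {
    const service = createGamificationService()
    const completion = {
      userId: "student-1",
      lessonId: "lesson-1",
      completedAt: new Date("2024-01-01T10:00:00Z"),
    }

    // Completing the same lesson twice must not double-award points.
    await service.awardLessonCompletionPoints(completion)
    await service.awardLessonCompletionPoints(completion)

    const progress = await service.getStudentProgress("student-1")
    expect(progress.points).toBe(10) // assumed: 10 points per lesson
  })
})
```

Run it, confirm it fails for the right reason (the service does not exist yet), and only then let the agent implement.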
Step 8: Improve the Codebase for Agents by Deepening Modules
A codebase made of many tiny, shallow modules is hard for both humans and agents to reason about.
Shallow modules often look like this:
function A depends on helper B
helper B depends on utility C
utility C depends on config D
service E calls A, B, and C directly
tests mock half the graph
This creates problems:
- The dependency graph is hard to understand.
- Test boundaries are unclear.
- Agents modify the wrong layer.
- Small changes cause unexpected breakage.
- The agent has to inspect too many files to understand one behavior.
A better structure uses deep modules.
A deep module has:
- A small public interface.
- Significant internal functionality.
- Clear ownership of behavior.
- A natural test boundary.
Example:
```typescript
type AwardLessonCompletionPointsInput = {
  userId: string
  lessonId: string
  completedAt: Date
}

// Shape assumed for illustration; the workshop leaves it open.
type StudentGamificationProgress = {
  points: number
  level: number
  streakDays: number
}

type GamificationService = {
  awardLessonCompletionPoints(input: AwardLessonCompletionPointsInput): Promise<void>
  getStudentProgress(userId: string): Promise<StudentGamificationProgress>
}
```
Internally, the service may do many things:
- Check whether points were already awarded.
- Insert a point event.
- Update streaks.
- Recalculate level.
- Return dashboard data.
But callers do not need to know that.
This is good for humans and good for agents.
The human owns the interface. The agent can implement the internals.
That is the right abstraction boundary.
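To show what "significant internal functionality behind a small interface" can mean, here is a minimal in-memory sketch of the internals, using the types above. The point value, level curve, and storage are all illustrative assumptions; a real implementation would sit on the database:

```typescript
const POINTS_PER_LESSON = 10 // assumed value

export function createGamificationService(): GamificationService {
  // Keyed by `${userId}:${lessonId}` so repeat completions are idempotent.
  const awarded = new Map<string, AwardLessonCompletionPointsInput>()

  return {
    async awardLessonCompletionPoints(input) {
      const key = `${input.userId}:${input.lessonId}`
      if (awarded.has(key)) return // already awarded: do nothing
      awarded.set(key, input)
    },

    async getStudentProgress(userId) {
      const completions = [...awarded.values()].filter((e) => e.userId === userId)
      const points = completions.length * POINTS_PER_LESSON
      return {
        points,
        level: Math.floor(points / 100) + 1, // assumed curve: 100 points per level
        streakDays: 0, // streak tracking deliberately omitted from this sketch
      }
    },
  }
}
```

Callers only ever see the two methods. The idempotency check, the level curve, and the storage can all change without touching a single call site, which is exactly the boundary you want an agent working behind.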
Step 9: Use Push vs Pull Context Deliberately
Do not dump every rule into every prompt.
There are two ways to provide context to an agent.
Push context
You always include it.
Examples:
Follow these coding standards.
Use strict TypeScript.
Do not introduce new dependencies.
Run tests before committing.
Push context is useful for reviewers and critical constraints.
Pull context
You make information available, and the agent retrieves it when needed.
Examples:
/skills/react-patterns.md
/skills/database-migrations.md
/skills/testing-guidelines.md
/architecture/gamification.md
Pull context is useful for implementation guidance that is not always needed.
A good rule:
- Push constraints to reviewers.
- Let implementers pull guidance when needed.
The reviewer should be stricter than the implementer.
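Here is a hypothetical sketch of the difference, assuming the skill files listed above exist. `buildSystemPrompt` and `readSkill` are illustrative names, not a real tool's API:

```typescript
import { readFile } from "node:fs/promises"

// Push context: critical constraints included in every prompt.
const PUSH_RULES = [
  "Use strict TypeScript.",
  "Do not introduce new dependencies.",
  "Run tests before committing.",
].join("\n")

// Pull context: only the index is pushed; contents load on demand.
const SKILL_FILES = [
  "/skills/react-patterns.md",
  "/skills/database-migrations.md",
  "/skills/testing-guidelines.md",
  "/architecture/gamification.md",
]

export function buildSystemPrompt(): string {
  return [
    PUSH_RULES,
    "These skill files exist. Read one only when it is relevant to your task:",
    ...SKILL_FILES.map((path) => `- ${path}`),
  ].join("\n")
}

// The agent pulls a skill via a tool call when it decides it needs one.
export async function readSkill(path: string): Promise<string> {
  return readFile(path, "utf8")
}
```

The prompt stays small because only the index of skills is always present; the full guidance is fetched when it is actually relevant.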
Step 10: Always Review in a Fresh Context
If the same agent implements and reviews in one long session, the review often happens in the “dumb zone.”
Better:
Session 1:
Implement issue.
Clear context.
Session 2:
Review the diff against the issue, coding standards, and architecture rules.
This keeps the reviewer sharper.
It also reduces self-justification. Agents are less likely to catch their own mistakes when they are still carrying the implementation history.
Step 11: QA Is Where Taste Re-enters the System
Automated tests are necessary, but they are not enough.
Human QA is where you impose taste.
This is especially true for:
- Frontend behavior.
- UX quality.
- Product feel.
- Naming.
- Edge cases.
- “Does this actually solve the problem?”
- “Would I be happy merging this?”
If you automate everything from idea to QA, you often get software that technically exists but lacks judgment.
That is how teams produce AI slop.
The human role is not disappearing. It is moving upward:
- Less typing.
- More shaping.
- More reviewing.
- More boundary-setting.
- More taste enforcement.
A Practical Workflow You Can Steal
Here is the full loop.
1. Start with a vague idea or client brief.
2. Run a grilling session.
Goal: reach shared understanding.
3. Convert the conversation into a PRD.
Goal: define the destination.
4. Convert the PRD into vertical-slice issues.
Goal: create independently grabbable tasks.
5. Mark each issue:
- Human-in-the-loop
- AFK
- Blocked by X
- Blocks Y
6. Run one agent per available AFK issue.
Goal: scoped implementation.
7. Require TDD and feedback loops.
Goal: prevent blind coding.
8. Run automated review in a fresh context.
Goal: catch obvious problems.
9. Human QA and code review.
Goal: enforce correctness and taste.
10. Add new issues from QA findings.
Goal: keep the Kanban board alive.
11. Merge only when the slice is coherent.
The Bigger Lesson
AI coding is not replacing software engineering fundamentals.
It is punishing teams that ignored them.
If your codebase has:
- Poor tests.
- Shallow modules.
- Unclear boundaries.
- Weak architecture.
- Vague requirements.
- No review discipline.
- No product taste.
Then agents will amplify the mess.
If your codebase has:
- Clear modules.
- Strong feedback loops.
- Small vertical slices.
- Explicit requirements.
- Testable behavior.
- Good review practices.
Then agents can move extremely fast.
The future of software development is not “write specs and ignore code.”
It is closer to this:
Developers design the system, define the boundaries, create the feedback loops, and delegate scoped implementation to agents.
That is a much stronger model than vibe coding.
And it is much closer to real engineering.