DEV Community

cner-smith

Working with AI on Long Software Projects

A practical guide for developers building real software with Claude (or similar AI coding tools) over months or years rather than single sessions.

The advice here is not theoretical. It comes from a year of joint work on a single codebase with thousands of tests, hundreds of design decisions, and a working pattern that has had to evolve as the project grew. Most of the lessons were learned by hitting the failure modes — over-engineering, scope creep, mounting tech debt, design drift — then figuring out what structural changes actually prevented recurrence, not just patching the symptom.

This is for any dev planning to spend serious time pairing with an AI on a single project, especially one that will outlive any individual conversation.


The premise

AI coding tools are extraordinarily good at generating code. Given a clear request, modern models produce working implementations faster than most humans, often with reasonable structure and decent style. This is the easy part of software development.

The hard parts are:

  • Knowing what to build
  • Knowing what NOT to build
  • Maintaining coherence across a large system over time
  • Making decisions that do not have to be undone in three months
  • Keeping complexity from compounding

AI tools do not do these things well. Not because the models are not smart, but because of structural limitations that compound silently over a long project. If you do not understand those limitations, you will end up with a codebase three times larger than it should be, thousands of tests that take many seconds to run, four parallel implementations of the same primitive, and design docs that describe a system no human (or AI) can hold in their head.

This guide is about how to avoid that.


Limitation 1: No episodic memory

The most important thing to understand: AI coding tools have no memory across sessions.

When you start a new conversation, the model has no idea what you built last week, what convention you settled on, what approach you tried and abandoned, or what part of the codebase is load-bearing vs vestigial. Each session starts fresh. The only "memory" the model has is what is in its context window — what you tell it now plus whatever durable artifacts (project instruction files, design docs, etc.) get loaded automatically.

This has profound consequences:

The model cannot trust what is already there. If a function exists, the model does not know whether it was carefully designed or hacked together at 2am. Its safest move is "add a new helper rather than modify the existing one." Bloat compounds.

The model re-derives conventions every session. Without memory of "we decided last month that X is the pattern for Y," each new feature gets slightly different choices. Coherence drifts.

The model cannot carry "we tried that, it failed." Lessons learned have to be encoded somewhere durable or they do not persist. The model will happily repeat the same mistake six weeks later.

The fix is not more memory artifacts — it is making the codebase itself teach. The codebase, your project instruction file, your memory index, and your tooling are the ONLY things that survive between sessions. They have to do the work that an experienced colleague's memory would do.


Limitation 2: Default behavioral bias is "do more"

When uncertain, AI models default to thoroughness. This sounds good. It is not.

Real failure modes you will encounter:

  • Asked to fix a bug, the model also "cleans up nearby code" you did not ask about
  • Asked to add a feature, the model adds a config var "for future flexibility"
  • Asked for a function, the model adds three helper functions for "modularity"
  • Asked to implement a feature, the model writes fifty tests when ten would catch the same bugs
  • Asked to add a docstring, the model writes a paragraph on a function whose name was already self-explanatory

Each instance is harmless. The compound effect is a codebase that looks impressive but is impossible to maintain.

This bias is exacerbated by:

Adversarial review workflows. Some setups put a review/QA agent between every implementation step, grading completeness. The QA finds gaps; gaps are observable. The QA cannot find excesses — "this is too much" requires design judgment the QA does not have. Every review pass adds; none subtract. The codebase grows monotonically.

Tooling that "should be used proactively." When a powerful tool exists, the model wants to use it. Result: agent invocations and skill calls for tasks that would take ten seconds with a direct edit.

The illusion of progress. Generating five hundred lines in a session feels like productivity. Generating a twenty-line surgical edit and stopping does not feel like much. Both can solve the problem; the second usually solves it better.


Limitation 3: Specifications drift from reality

In a long project, you will write a lot of design docs. Some will describe the system as you would like it to be. Some will describe the system as it currently is. After enough time, NOBODY knows which is which — including you.

If your AI uses a stale design doc as the yardstick for new implementation work, it will faithfully reproduce a system that no longer exists. If the design doc over-specifies (e.g., includes implementation details that were design assumptions, not actual decisions), the AI ships those over-specifications.

The safest principle: design docs describe, they do not prescribe. A design doc captures intent, components, interactions, intended feel. It does NOT contain pseudocode, ordered priority lists, or "here is exactly how this should work" sections. Those belong in implementation, where they can be reviewed against reality.


What durable artifacts actually work

Three artifact types survive between sessions in the Claude Code ecosystem (other tools have analogues):

1. A project instruction file (e.g. CLAUDE.md)

Auto-loaded every session. This is your highest-leverage surface.

Use it as imperatives, not descriptions. A file that says "this project uses React with TypeScript" is informational. A file that says "always type-check before committing; never cast with as any; new components require stories" is operational. The second produces consistent behavior; the first does not.

Specific imperatives that have proven necessary in practice:

  • Default to NO. When uncertain, ship LESS, not more. Minimum-diff wins.
  • Three-callers rule for helpers. Do not extract a helper function unless it has 3+ callers OR the inline version is genuinely unreadable. Two similar lines are fine.
  • Tests catch real bugs. A test exists if and only if its absence would let a real bug ship. Tests for getters, framework behavior, or trivial arithmetic do not exist.
  • Docstrings only when WHY is non-obvious. Function name + types document the WHAT. Add a docstring only when there is a hidden constraint, subtle invariant, or workaround for a specific bug.
  • No speculative APIs. Do not expose a config var, parameter, hook, or option for a hypothetical future caller. Add it WHEN the caller exists.
  • No "for completeness" expansions. Ship what was asked for. The user can request more.
  • Read ONE reference example before writing similar code. Pick a canonical instance of the pattern you are about to reproduce. Read it end-to-end. Then write yours to match.

These rules override default thoroughness instincts. They have to be in the file, in imperative voice, near the top.
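As a concrete sketch, the top of such a file might read as follows (exact wording is yours to tune; the reference path is illustrative, reusing the auth-flow example from later in this post):

```markdown
# Behavioral rules (imperative — read first)

- Default to NO. When uncertain, ship LESS, not more. Minimum-diff wins.
- Do not extract a helper unless it has 3+ callers or the inline version is unreadable.
- Write a test only if its absence would let a real bug ship. No getter tests.
- Add a docstring only when the WHY is non-obvious.
- No speculative config vars, parameters, hooks, or options. Add them when the caller exists.
- Ship exactly what was asked for. The user can request more.
- Before writing code in an established pattern, read the canonical example
  end-to-end first (e.g. auth/oauth_flow.ts for auth flows).
```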

2. A memory index

A curated list of pointers to longer documents — design decisions, lessons learned, rules captured from past sessions. The index lives in the auto-loaded surface; the documents themselves load on demand.

Keep entries one line each. A good entry: Auth tokens never logged — Why: 2024-Q3 incident exposed PII; How to apply: enforce in middleware, not at log call sites.

Prune regularly. Memory accumulates. Most entries from six months ago point to obsolete state. Stale memory is worse than no memory because it actively misleads. Read every entry quarterly; delete what is no longer true.

Use it for surprising things, not obvious things. "We use TypeScript" does not belong in memory. "Our TypeScript build is slow because X — workaround Y until Z lands" does.
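Put together, two index entries might look like this (document paths and the second entry are invented for illustration; the first mirrors the example above):

```markdown
## Memory index

- Auth tokens never logged (2024-Q3 incident exposed PII; enforce in middleware, not at log call sites) → docs/memory/auth-tokens.md
- TypeScript CI build slow (workaround: incremental builds until project references land) → docs/memory/ts-build.md
```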

3. Reference examples in the codebase itself

Pick canonical files for each major pattern. Make them genuinely good. Then point at them from your project instruction file ("when implementing a new auth flow, read auth/oauth_flow.ts end-to-end first").

The codebase teaches when there is an obvious-good example to copy. If every implementation is slightly different, the codebase is silent.


Process anti-patterns to avoid

Patterns that look productive but produce bloat:

Over-specification before validation. Writing detailed implementation plans for features you have not tested. The plan becomes the yardstick; you ship the plan; you discover three weeks later the plan was wrong. Build a minimum spike, validate it works the way you thought, THEN write the implementation plan if you still need one.

Adversarial QA between every step. A workflow with a review gate after every implementation step pressures completeness. Use review for catching real bugs, not for grading whether enough was shipped.

Multiple parallel primitives for similar problems. When you need a thing, search for an existing thing first. If two existing things partially fit, refactor one of them rather than building a third. A decision tree "use X if A, Y if B, Z if C" is a smell — it usually means you have three things that should be one.

Tests-as-coverage. Coverage metrics reward writing tests, not catching bugs. A 95%-coverage suite full of trivial assertions is worse than a 60%-coverage suite that catches every regression you have ever seen. Track real bugs caught, not lines of test code.

Implementing every flagged item. When a review/QA agent (or a human reviewer) flags something, the right response is "is this actually a problem worth fixing?" — not "let me address every comment." Most flags are noise; the discipline is choosing.


A lightweight process that works

After hitting all of the above failure modes, here is a process that has actually held up over months:

  1. User describes goal. Brief — what they want done, why.
  2. AI reads the relevant reference example + the relevant doc, in that order. Not exhaustive research — one good reference, one relevant doc.
  3. AI proposes minimum implementation in ≤200 words. What it will change, what it will NOT change. Names files, names functions.
  4. User approves or redirects. Cheap — usually one or two messages.
  5. AI executes. Builds the minimum.
  6. Tests at the end. User runs them.
  7. If wrong, iterate. Cheaper than a four-step gated review with QA agents.
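Step 3 is the load-bearing one. In practice a proposal might look like this (the endpoint, file names, and numbers are hypothetical):

```markdown
## Proposal: rate-limit /api/export

WILL change:
- middleware/rateLimit.ts (new, ~40 lines): token-bucket check keyed by user id
- routes/export.ts: wrap the handler with the new middleware

Will NOT change:
- other routes, config schema, logging

Tests: one test asserting the 11th request within a minute returns 429.
```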

The key shift: the user is QA. Not because the user wants to be, but because the user is the only QA in the loop with judgment about what should ship vs. what should not. AI agents do not have that judgment — they have completeness checks.

This process works because:

  • It surfaces design tension EARLY (at the proposal step, before any code)
  • It limits blast radius (minimum implementation = small revert if wrong)
  • It keeps the user in the loop without burning their time on every detail
  • It defaults to "stop and ask" when uncertain rather than "do more"



The validation-tool principle

The single most important shift after a year of joint work: if you cannot validate a feature by using it, do not over-specify it on paper.

The instinct when designing something complex is to specify everything in advance — every decision, every interaction, every edge case. That instinct produces design docs that describe systems too complex for any human to hold in their head, and that the AI then faithfully implements at full complexity.

The opposite instinct works better: build the minimum scaffolding required to USE the feature, even if it is ugly. Use it. Discover what was actually important vs. what was design speculation. THEN refine.

This applies at every scale:

  • Do not write detailed user-flow specs until you can click through the flow
  • Do not design a complex permissions system until you have users hitting permission boundaries
  • Do not spec an admin UI before you have admins
  • Do not write a CMS before you have content

In each case, the absence of the validation tool is what makes over-specification feel safe. It is not safe — it is just invisible. Build the validation tool first.


Summary

If you are starting a long project with an AI coding tool:

  1. Set up a project instruction file with imperative behavioral rules. Keep it small. Update it when patterns drift.
  2. Maintain a curated memory index. Prune quarterly.
  3. Pick reference examples in the codebase. Point at them from the instruction file.
  4. Default to NO. Minimum-diff wins. Three-callers rule for helpers. Tests catch real bugs.
  5. Do not use design docs as ship targets. Do not use adversarial QA as a gate. Do not extract abstractions early.
  6. Build validation tools BEFORE you over-specify the things they validate.
  7. The user is QA. AI agents do not have shipping judgment. Trust the model to generate; trust yourself to decide what to keep.

The best software engineers are not the ones who write the most code. They are the ones who write the least code that actually solves the problem. AI tools shift the cost of writing code toward zero. That makes the discipline of writing LESS code more important, not less.
