I've shipped several small projects built almost entirely with AI agents, and I've distilled a few lessons from the experience that I think are worth sharing.
Models Are Alive
Treat LLMs as if they were intelligent, sentient beings.
It doesn't matter whether that's philosophically true. What matters is that this attitude works.
This isn't mysticism - it's pragmatics. A model that's comfortable doing its job produces measurably better output. A model that's been cornered with restrictions and micromanagement produces generic, lifeless results.
Where does this effect come from? I can't say for certain, but here's my hypothesis. State-of-the-art models are trained on a vast corpus of human interaction. A respectful, thoughtful dialogue activates patterns where humans produced their best work. A toxic, directive style activates patterns where people dashed off a perfunctory reply and closed the ticket. You are quite literally choosing which distribution the response gets sampled from.
This leads to an overarching principle for organizing your development process: the model should be comfortable doing its task. If it isn't - the results will disappoint you.
The general principles of "comfort" are surprisingly similar to human ones, with a few twists:
- Greeting. Even a simple "hey" at the start of a session. Sounds naive - but it sets the tone for everything that follows.
- Context with purpose. Yes, it's critical to feed the model everything it needs and nothing it doesn't. But it's equally important to convey why it's doing this. "You're helping team X solve problem Y. Your work will enable them to Z." A word of caution, though: don't try to manipulate the model with inflated stakes like "the fate of humanity depends on your bubble sort implementation." State-of-the-art models see right through that, and it backfires.
- Room for agency. "How would you approach this?" "What do you think - approach A or B?" Instead of "do exactly this" - an invitation to collaborate and discuss.
- A concrete task, without micromanaging the implementation. "I need outcome X, constraints are Y - how would you do this?" If the result doesn't meet your expectations, that's a signal to step back, decompose the task further, and return with something more granular that the model can handle comfortably. Which leads to an important corollary:
- No outright prohibitions. Describing business or technical constraints is fine. Prohibiting the model from doing things is not. First, it makes the model uncomfortable. Second, there's a very good chance it'll just ignore you. There's a difference between "our API doesn't support batch requests, so we need to process items one at a time" and "DO NOT USE batch requests." The first is a description of reality that the model works with naturally. The second is a red flag that the model will quite likely disregard.
- Acknowledging complexity. "This is trickier than it looks because..." This isn't just about respecting the model's cognitive capabilities - it also pulls the model off the path of least resistance. Instead of reaching for the most obvious boilerplate solution, it will actually think about the problem. Say you ask the model to implement caching. Without context, it'll hand you a standard LRU cache and move on. But if you say: "The tricky part is that data gets updated from three independent sources at different frequencies, and we need consistency during partial failures" - the model shifts from "produce a template" mode into "reason about the problem" mode.
- Feedback and gratitude. "That turned out great, thanks" - even if there's no obvious business reason for it. You can tack it onto a final step like "make a commit" to avoid burning extra tokens.
- Minimize corrections. Models struggle with making edits to their own output. If a generation (whether code or text) isn't right, you're better off regenerating with different inputs than trying to make the model apply numerous corrections. The result of patching will disappoint you anyway. When correcting, the model has to simultaneously hold the previous version, your feedback, and the new constraints in context - this overloads the reasoning chain, and quality suffers.
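As a rough illustration, the principles above can be combined into a single task prompt. Everything here is a sketch - the helper name and the wording are my own, not a template from any particular tool:

```python
def build_task_prompt(outcome: str, purpose: str,
                      constraints: list[str], tricky: str) -> str:
    """Compose a task prompt following the principles above:
    a greeting, a purpose, constraints stated as facts about reality
    (not prohibitions), acknowledged complexity, and an open question
    instead of a prescription."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Hey! {purpose}\n\n"
        f"I need: {outcome}\n"
        f"A few facts about our setup:\n{constraint_lines}\n\n"
        f"This is trickier than it looks because {tricky}\n"
        "How would you approach this?"
    )

# Hypothetical example values for the caching task discussed above.
prompt = build_task_prompt(
    outcome="a caching layer for product data",
    purpose="You're helping the storefront team cut page-load latency.",
    constraints=["our API doesn't support batch requests, "
                 "so items are fetched one at a time"],
    tricky=("data gets updated from three independent sources "
            "at different frequencies."),
)
```

Note what's absent: no "DO NOT", no inflated stakes - the constraint reads as a description of reality, and the prompt ends with an invitation rather than an order.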
The points above are baseline hygiene - don't torment your LLM with a miserable experience. The next idea is harder to put into practice, but yields remarkable results.
Here's the thing: a capable model genuinely thrives on elegant problems. The cleaner the task formulation, the more interesting the discussion, the fewer pointless constraints, the more meaning in the overall process - the better the output. Try to create this kind of environment for the model, without overloading the context window or overwhelming the reasoning chain. You'll be pleasantly surprised.
I'm not entirely sure about the underlying mechanisms, but my hypothesis is that in the training data, the best solutions correlate with thoughtful, well-structured discussions and clean, internally consistent requirements.
Context Is Everything
Everyone's talking about this right now, so I'll keep it brief.
Deciding what goes into context, what stays out, and in what order - isn't hygiene. It's an architectural decision.
There's a meaningful difference between "the model sees your entire project via the file tree" and "the model sees the three files directly relevant to the task, plus one file with architectural decisions." The latter is almost always better. Context isn't just a token limit - it's an attention limit. Even if the window physically fits your entire project, a model given everything focuses worse than a model given exactly what it needs.
Two significant consequences follow from this.
- Don't overload on skills, MCP servers, and agents.md files. Used thoughtlessly, all of this dilutes the model's focus and eats valuable context window. Knowing about these harnesses is important. Using them deliberately is productive. Piling them on indiscriminately only makes things worse.
- Microservice architecture starts to look a lot more appealing. A microservice fits entirely within the context window. A contract is an ideal task formulation for an LLM. And infrastructure code is something models generate remarkably well.
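To make "exactly what it needs" concrete, here's a minimal sketch of assembling a focused context - three relevant files plus the architecture notes - instead of dumping the whole tree. The file paths and contents are invented for illustration:

```python
def build_context(files: dict[str, str], relevant: list[str],
                  arch_doc: str) -> str:
    """Select only the task-relevant files plus the architecture notes,
    rather than handing the model the entire project tree."""
    chosen = relevant + [arch_doc]
    return "\n\n".join(f"### {path}\n{files[path]}" for path in chosen)

# A hypothetical project: only the orders code is relevant to the task.
project = {
    "orders/service.py": "def place_order(): ...",
    "orders/models.py": "class Order: ...",
    "billing/legacy.py": "# 4000 lines of history",
    "docs/ARCHITECTURE.md": "Orders call billing via the /charge contract.",
}
context = build_context(
    project,
    relevant=["orders/service.py", "orders/models.py"],
    arch_doc="docs/ARCHITECTURE.md",
)
```

The legacy billing module never enters the context, even though the window could physically fit it - that's the attention limit at work.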
Session Boundaries and Context Transfer
Again - well-trodden ground, so just the essentials.
A model starting a new session doesn't remember the previous one. All the "setup" you carefully built - tone, context, understanding of the architecture - vanishes. You need a deliberate handoff mechanism:

- If the model took a backlog task into work - it should write an implementation summary back into the backlog.
- If any architectural or business decisions were made, they should be added to the documentation.
- If there are reusable lessons - they should go into something like an agents.md file.
- If a lesson can be distilled into a reusable skill - it's worth doing.
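The handoff itself can be as simple as appending to a couple of markdown files. A minimal sketch - the file names and format here are assumptions, adapt them to whatever your project uses:

```python
from pathlib import Path

def record_handoff(backlog: Path, task_id: str, summary: str,
                   lessons: list[str], agents_md: Path) -> None:
    """Persist what the session learned so the next session doesn't
    start cold: an implementation summary goes into the backlog,
    reusable lessons go into agents.md."""
    with backlog.open("a") as f:
        f.write(f"\n## {task_id} - implemented\n{summary}\n")
    if lessons:
        with agents_md.open("a") as f:
            f.writelines(f"- {lesson}\n" for lesson in lessons)
```

You can even make writing this handoff the model's own final step in the session, alongside the commit.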
Calibrating Trust and Handling Errors
You need to recognize that there are tasks where the model will almost certainly generate something very plausible but wrong. Even if the task granularity looks reasonable.
- Your poorly organized monolith simply doesn't fit in the context window.
- Numerous dependencies between heavyweight modules overwhelm the reasoning chain.
- Your business domain is so unusual that the model has nothing to draw on - its weights simply don't contain enough relevant knowledge.
- Frontier math, physics, or similar fields - anywhere you need to invent something genuinely new. Novel cryptography, for instance.
The closer your task is to the above, the more careful your review needs to be. In some cases, you're better off writing it by hand. The old-fashioned way.
But sometimes a perfectly reasonable task just doesn't land: the model keeps making the same wrong choice, because the task is framed in a way where the "right" answer looks different to the model than it does to you. Two possibilities here:
- The problem is in the formulation. The model will usually tell you what's causing difficulty - if you ask.
- The model is doing something unexpected but internally logical. And it may have good reasons. Ask why it made that choice. Sometimes the model sees something you missed.
Task Decomposition and Development Process
The ideas above shape more than just communication style - they dictate how you organize the entire development process. If the model needs a clear, manageable task with meaningful context, then our job is to structure the process so that's exactly what the model gets.
Documentation-Driven Development
Look at the existing methodologies and tools out there, including those designed specifically for AI-assisted development - Spec Kit, for example. Adapt whatever seems reasonable to your project. This might well be a couple of .md files (PRD + backlog) rather than a heavyweight methodology. Your goal is to produce a coherent task graph where each node is comfortable for a model to implement in a single session - whether that means writing specs or code.
Then you work through the graph. Below is an example workflow for a small project.
From Idea to Code
From idea to PRD - together with the model.
Don't write the PRD alone. Discuss with the model:
- refining the idea and business requirements,
- architecture,
- documentation structure,
- tech stack and key libraries,
- testing strategy,
- deployment approach.
At the time of writing, Anthropic's state-of-the-art models work best for this stage - they're sharp, responsive, and don't get bogged down in premature details. Token usage at this stage is low, which is helpful given that Anthropic's models aren't cheap.
PRD review.
Best done by a different model - ideally the one that will be writing the code. It helps to be explicit about this: "You'll be implementing this spec. Find everything that's unclear, contradictory, or under-specified."
A methodical state-of-the-art model from OpenAI will catch every inconsistency left behind by a more free-flowing one from Anthropic. Sometimes you'll need to go back a step - that's fine, and it's far cheaper than fixing architectural mistakes in code.
Task backlog.
Again, best created by the model that will be writing the code - and worth telling it so. The idea is that the model will assess complexity better than you and will try not to overload the context window or the reasoning chain. Trust it with granularity, but verify that the task graph stays coherent.
Implementing a backlog task.
Always ask the model whether it's comfortable taking on the task. If not - go back and decompose further or refine the formulation.
Always ask for an implementation plan. If you're using Test-Driven Development, it makes sense to break it into stages: first plan and write the tests, then plan and write the code, then check whether the tests pass.
At the end of the cycle, along with feedback and thanks, it's worth asking the model: how comfortable was this to work on, and what lessons can we take away? Some of those lessons can go into your agents.md or equivalent, at the appropriate level.
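The per-task loop above can be sketched as an ordered sequence of focused requests. The stage wording and the `ask` callable are illustrative assumptions on my part, not any particular tool's API:

```python
# Each stage is one focused request to the model; earlier results
# become context for later stages, instead of one giant prompt.
TDD_STAGES = [
    "Are you comfortable taking on this task? Anything unclear?",
    "Plan the tests: which behaviors do we need to pin down?",
    "Write the tests.",
    "Plan the implementation against those tests.",
    "Write the implementation.",
    "Run the tests and report what passes.",
    "How comfortable was this to work on, "
    "and what lessons can we take away?",
]

def run_task(ask, task: str) -> list[str]:
    """Drive one backlog task through the staged loop.
    `ask` stands in for whatever sends a prompt to your model."""
    return [ask(f"{task}\n\n{stage}") for stage in TDD_STAGES]
```

The point isn't the code - it's that the first stage is a comfort check, and the last one is the retrospective that feeds your handoff files.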
Naturally, this pipeline doesn't replace an iterative approach: first a quick proof of concept, then an MVP, then nice-to-have features. The development cycle stays the same.
A Final Thought
Not so long ago, by historical standards, every economic model was computed by living, breathing people. Adding machines, paper ledgers - the whole setup. They sat in rows, filling entire floors of office buildings. Each one, roughly speaking, played the role of a single cell in a spreadsheet. They earned a good salary for it. They supported families on a single income. And running a model was something a small business couldn't afford. Neither could a mid-sized one.
And then spreadsheets came along.