A year ago I wrote about how I built ExamGenius with vibe coding and Claude. The premise was simple: tell the AI what you want, it generates the code, you tweak, and in 20 hours you have a complete app that would normally take weeks. 85% time reduction. Working code. All good.
A year later, the industry has moved at breakneck speed. Models are more capable, IDEs integrate AI natively, and more teams are shipping AI-generated code to production every day. But there's a problem nobody wants to admit: we're shipping massive amounts of code that nobody truly reviewed.
It feels productive. But the result is often code that's hard to maintain, untested, poorly structured, and with zero traceability on why certain decisions were made. Vibe coding feels great until you have to debug something that neither you nor the AI remember writing.
This post is about what comes after vibe coding.
The real problem: without structure, even the best dev creates noise
Let's be honest: AI produces code that many senior devs couldn't write as fast or as cleanly. The problem isn't code quality — it's lack of direction.
Without clear tasks, the AI wanders. You say "fix this bug" and it refactors half the module. Without defined scope, "improve this" becomes a 15-file change nobody asked for. You lose visibility into what was done, what's left, and what broke along the way. It's not a competence problem — it's a context problem. The AI doesn't know what decisions you've already made, what patterns your codebase follows, or what the actual scope of your request is.
With ExamGenius it worked because it was a greenfield project — a single session, no legacy code, no other collaborators. But in real projects that evolve week after week, pure vibe coding doesn't scale.
And in the enterprise — where there are teams, code reviews, compliance requirements, and real production systems — it's simply not enough. You can't show up to a PR review saying "the AI generated it and it looked fine." You need to explain what was done, why, and what criteria were used to validate it.
The alternative: Task-Driven Development
The idea is simple: define the work before executing it.
Instead of opening a chat and saying "build me this," you define a task with clear scope, break it into subtasks if needed, and give the AI the precise context to execute each piece. The flow looks like this:
- Spec — define what you want to achieve and why
- Tasks — break the work into manageable units
- Subtasks — if a task is large, break it down further
- Plan — before writing code, the AI researches the codebase and writes an implementation plan in the task. You review and approve before a single line of code is touched
- Execution — the AI works within the defined scope
- Review — you validate before moving on
Step 4 is key and most people skip it. Asking the AI to plan before executing gives you a review checkpoint that's worth its weight in gold. The plan reflects the current state of the codebase — it's not theory, it's a concrete proposal you can approve, adjust, or reject. It's the difference between "just do it" and "tell me how you'd do it, then do it."
The developer goes back to being the architect and reviewer, not a spectator. You decide what gets done, the AI executes. And if something doesn't look right, you correct it before it propagates.
Here's what closes the loop: the AI doesn't just execute tasks — it also helps you define them. You can ask it to suggest subtasks, identify edge cases, or propose DoDs based on the project context. You provide the general direction, the AI helps you specify with enough detail, and then executes against that specification. It's a collaborative cycle where both sides contribute what they do best.
This applies to side projects and enterprise teams alike. The difference is that in the enterprise, it's not optional.
Definition of Done: the AI has to meet criteria too
This is where it gets interesting. Each task can have a set of Definition of Done criteria (DoDs) — conditions that must be met before marking it complete.
This isn't bureaucracy. It's verifiable quality:
- Tests pass
- No regressions
- Code follows project patterns
- Translations are complete
- Logging is consistent
The AI stops being a black box that "generates code" and becomes an executor that has to meet concrete criteria. If it doesn't meet them, the task isn't done.
In an enterprise context, DoDs are the bridge between "the AI generated code" and "this code is production-ready." They're the evidence that someone (human or AI) validated that the work meets the team's standards.
Backlog.md: the backlog that lives in your repo
The tool I use for this is Backlog.md — a task management system that lives directly in your repository as markdown files.
Why not Jira or Trello? Because the AI can't read your Jira board. Backlog.md integrates with the IDE via MCP (Model Context Protocol), which means the AI can:
- Read pending tasks
- Understand the context of what it needs to do
- Update status when it's done
- Check DoDs before marking something as Done
Everything is versioned in git. Every task, every decision, every status change is a commit. Full traceability — you can see exactly what was done, when, and in what order.
You don't need to leave your code to manage your work. The backlog is right there, next to your src/.
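To make this concrete, here's what a task file in the repo might look like. This is an illustrative sketch, not Backlog.md's exact schema — the field names and layout are my assumptions:

```markdown
# TASK-2: Refactor handle_message

**Status:** In Progress

## Plan
Extract each logical branch of handle_message into a focused helper,
one subtask per branch, each reviewed before the next one starts.

## Subtasks
- [x] Extract budget input handler
- [x] Extract edit amount handler
- [ ] Extract chat agent invocation
- [ ] Extract notification sending and tutorial detection

## Definition of Done
- [ ] All existing tests pass without changes
- [ ] handle_message only orchestrates, no business logic
- [ ] Helpers follow the existing naming pattern
```

Because it's plain markdown in git, the AI can read it over MCP, and you can review every status change in a normal diff.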
For personal projects and small teams, Backlog.md is ideal for its simplicity. In the enterprise, the concept is the same but the tool would connect to existing tooling — Jira, Linear, Azure DevOps. What matters isn't the specific tool, but that the AI has access to the backlog, can read task context, and can validate DoDs. The task-driven development pattern works the same regardless of where the tasks live.
Real-world example: Finia
Finia is a Telegram bot I'm building for personal expense tracking in Costa Rica. It parses bank notifications, categorizes expenses with AI, generates reports, and supports multiple languages. The backend is Python with SQLAlchemy, runs on Kubernetes, and uses structlog for structured logging.
All recent development was done with task-driven development. Here are three concrete examples:
Refactoring handle_message
The main message handler was a 236-line monolith — a single async def with all the logic for budget input, edit amount, rate limiting, chat agent, notifications, and tutorial. It worked, but it was impossible to test or modify without breaking something.
I created TASK-2 with 4 subtasks:
- Extract budget input handler
- Extract edit amount handler
- Extract chat agent invocation
- Extract notification sending and tutorial detection
The AI executed each subtask separately. I reviewed each step. The result: 7 focused functions and a handle_message of ~40 lines that only orchestrates. All 79 tests passed without changes.
```python
# Before: 236 lines in a single function
async def handle_message(update, context):
    # ... everything mixed together ...

# After: clean orchestrator
async def handle_message(update, context):
    user = await get_user(update.effective_user.id)
    if user is None:
        return

    text = update.message.text.strip()

    if await _handle_budget_input(update, context, user, text):
        return
    if await _handle_edit_amount(update, context, user, text):
        return
    if not await _check_rate_limit(update, user):
        return

    response = await _run_chat_agent(update, user, text)
    await _send_expense_notifications(update, context, user, response)
```
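The extracted helpers share a simple contract: each one returns True if it fully handled the message, so the orchestrator can stop at the first taker. Here's a self-contained sketch of that dispatch pattern — the types and helper bodies are simplified stand-ins, not Finia's actual code:

```python
import asyncio

class Message:
    """Stand-in for a Telegram update; the real handlers receive far more."""
    def __init__(self, text, awaiting_budget=False):
        self.text = text
        self.awaiting_budget = awaiting_budget

async def _handle_budget_input(msg):
    """Return True if this message was a budget amount we consumed."""
    if msg.awaiting_budget and msg.text.replace(".", "", 1).isdigit():
        print(f"budget set to {msg.text}")
        return True
    return False

async def _run_chat_agent(msg):
    """Fallback path: hand the message to the chat agent."""
    return f"agent reply to: {msg.text}"

async def handle_message(msg):
    # Each helper gets first refusal; the first one that returns True wins.
    if await _handle_budget_input(msg):
        return None
    return await _run_chat_agent(msg)

print(asyncio.run(handle_message(Message("hello"))))  # agent reply to: hello
```

The early-return chain is what makes each branch independently testable: you can exercise `_handle_budget_input` without ever touching the chat agent.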
i18n sweep
Finia supports Spanish and English. But after several development iterations, there were ~50 hardcoded English messages scattered across the codebase — buttons, errors, labels, prompts.
I created TASK-4 and the AI made three passes:
- Error messages and buttons in `callbacks.py`, `commands.py`, and `chat.py`
- Usage blocks (`/expense`, `/sinpe`), budget displays, category labels
- Changing the pre-registration default language to Spanish (most users are Costa Rican)
The DoD was clear: all user-facing messages translated, admin commands intentionally excluded. Each pass was reviewed before continuing.
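A DoD like "all user-facing messages translated" can itself become an automated check. Here's a hedged sketch assuming a simple dict-based message catalog — Finia's actual i18n mechanism may differ:

```python
# Hypothetical message catalogs; the real i18n layer may use another format.
MESSAGES = {
    "en": {
        "expense_saved": "Expense saved",
        "budget_prompt": "Enter your monthly budget",
    },
    "es": {
        "expense_saved": "Gasto guardado",
        "budget_prompt": "Ingresa tu presupuesto mensual",
    },
}

def missing_translations(catalogs):
    """Return {lang: missing_keys} for any key absent in some language."""
    all_keys = set().union(*(set(c) for c in catalogs.values()))
    return {
        lang: sorted(all_keys - set(catalog))
        for lang, catalog in catalogs.items()
        if all_keys - set(catalog)
    }

# The DoD check: no language is missing any key.
assert missing_translations(MESSAGES) == {}
```

With a check like this in the test suite, "translations are complete" stops being a judgment call and becomes a pass/fail criterion the AI can verify before marking the task Done.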
Interaction tracking
I needed metrics for a CloudWatch dashboard. I created TASK-5: add command_executed to every command handler and callback_executed to the callback handler.
The pattern had to be consistent:
```python
logger.info("command_executed", service="finia-bot", command="help", user_id=user.id)
logger.info("callback_executed", service="finia-bot", callback="chcat", user_id=user.id)
```
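Consistency of the event shape can also be enforced rather than eyeballed. Finia uses structlog; the sketch below uses only the standard library to show the same idea — a single helper that rejects events missing the shared fields. The helper name and field set are my assumptions, not the project's API:

```python
import json

# Fields every tracking event must carry for the CloudWatch dashboard.
REQUIRED_FIELDS = {"service", "user_id"}

def track_event(event, **fields):
    """Build one structured log line, enforcing the shared fields."""
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return json.dumps({"event": event, **fields}, sort_keys=True)

line = track_event("command_executed", service="finia-bot",
                   command="help", user_id=42)
print(line)
```

Funneling every handler through one helper is what would have caught the `/start` inconsistency automatically: a call without `user_id` simply refuses to log.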
During review, I caught that /start was using telegram_id instead of user_id (because the user doesn't exist yet). We fixed it to include both. I also found that commands.py hadn't been staged in the commit — the AI had edited it but didn't include it. Without step-by-step review, that would have shipped to production incomplete.
Vibe coding vs Task-driven: the comparison
| | Vibe Coding | Task-Driven |
|---|---|---|
| Start | "Refactor this for me" | TASK-2: Refactor handle_message, 4 subtasks defined |
| Scope | The AI decides what to change | You define what gets touched and what doesn't |
| Validation | "Looks good" | DoDs: tests pass, consistent pattern, no regressions |
| Traceability | Chat history (if it wasn't deleted) | Tasks in git with status, plan, and notes |
| Rollback | Ctrl+Z and pray | Each subtask is an atomic commit |
| Enterprise-ready | No | Yes |
ExamGenius was chapter 1 — vibe coding works for prototypes and hackathons. Finia is chapter 2 — task-driven for projects that need to be maintained. The enterprise is chapter 3 — where this approach isn't a nice-to-have but a requirement.
The control is yours
This isn't about abandoning AI. It's about using it well.
Task-driven development isn't bureaucracy. It's giving the AI (and yourself) the structure to ship with confidence. You define the work, set the criteria, review the execution. The AI is incredibly capable — but it needs direction, just like any team member.
The developer who defines work well ships better software. The industry has moved forward — it's time our processes caught up.
Finia is a project I'm working on to simplify expense tracking in Costa Rica. If you're interested in trying it out or learning more, reach out — it's in beta and I'm always looking for feedback from early adopters.