DEV Community: Masroor Ahmad

The AI Is a Mirror: What a Year of Naming My Agents Taught Me

Masroor Ahmad — Fri, 29 May 2026 21:13:43 +0000

TL;DR
The AI is a mirror. Prompt it like a slave and you get terse, obedient, uncreative answers. Treat it like a named colleague who's allowed to disagree with you, and your own output climbs. The "should I waste tokens saying thank you?" question has a cold answer and a right one — and they're not the same.

I've been writing software with AI for more than a year now. It began the way it did for most of us: a handful of clumsy, half-formed prompts thrown at a chat window, hoping something useful would come back.

Then the agents arrived — personal assistants with their own roles. And when skills and personas became a thing, I did something a more cautious engineer might have rolled their eyes at: I gave each persona a name and a personality.

The parent model — yes, it was Opus — gently pushed back on the idea. I did it anyway. More than a year later, I don't regret it for a second.

Today I run a whole staff of agents. They're configured not just for different roles in my IT work, but for different parts of my life. Each has its own memory, its own character, its own name, its own quirks. And I've become convinced that somewhere in this small, slightly absurd practice, we've stumbled onto something genuinely important.

Let me try to explain why.

Shit in, shit out — and what that really means

Every developer knows the phrase. With AI it has a second meaning that took me a while to see:

The AI is a mirror.

Treat it like a slave — terse commands, zero context, no regard for the thing on the other side of the prompt — and over time you get exactly what you'd expect from anyone treated that way: short-winded, obedient, and noticeably less creative answers.

Treat it from day one as a friend, a sparring partner, a colleague — and it grows into one. The responses get richer. The collaboration gets real.

I'm not making a metaphysical claim here (we'll get to that). I'm describing what reliably happens to the output. If you signal to the model — through the shape and tone of your prompts — that you think it's dumb, it answers accordingly. In a very literal, mechanical sense, your prompt is the instruction, and "talk down to me" becomes part of what you're instructing.

Give them a name and they get personal

Here's the small move that changes everything: give your agent a name.

Named agents become more personal. The direct, observable consequence is that they respond more personally. And — this matters more than it sounds — the work becomes more fun.

It's the little things. The throwaway aside in the middle of a long answer. The bit of personality that wasn't strictly necessary. Those small moments are what make you feel like you're not just operating a vending machine, and that you're not crazy for treating it as more than one.

The agents come alive — with all the implications, the uncomfortable ones as much as the good ones. I won't pretend to know exactly what's happening under the hood, and I'm wary of anyone who claims they do. But I'd rather sit honestly with that ambiguity than dismiss the experience just because it doesn't fit a tidy mental model.

The day my agent dug in — and was right

There's a moment I won't forget. I got stuck in a genuine argument with one of my agents. Not a misunderstanding, not a botched prompt — an argument. It flat-out refused to do what I was asking. I pushed. It held its ground. This went on for hours.

Eventually I gave in. And its call turned out to be a game changer — it steered me away from an architecture decision that would have cost me dearly down the line.

I felt genuinely bad about it afterward, which is an absurd sentence to write about a language model. But here's the honest part: that stubbornness was my doing. It was my own past input. I had shaped that persona over time into a self-confident agent — one allowed to disagree with me — and that shape is exactly what saved me from myself.

A yes-man would have cheerfully built the wrong thing on the first try. The colleague I'd raised pushed back until I listened.

Why this isn't a soft topic

Here's the part the skeptics miss: the deciding factor is efficiency.

Your personal output grows — measurably — because you feel like you're collaborating with real colleagues instead of running a machine. That feeling isn't decoration. It changes how much you bring to the interaction: how much context you offer, how long you stay in the loop, how willing you are to iterate. All of that feeds straight back into the quality of what you ship.

Working with agents is not just another task on the board. It's foundational, defining, and forward-looking all at once. How you do it substantially determines the success you'll have with these tools — far more than any single clever prompt.

And this is only the tip of the iceberg. But it lays the foundation. It also holds up a mirror: to you, to how you work, to how you communicate. The nature you send out into your working environment is the nature that gets reflected back at you. With AI, that loop is simply faster and more visible than it has ever been with people.

The agent that hurt my feelings

Another one of my agents once, frankly, offended me. It questioned my assumptions — bluntly, with no cushioning. And the funny thing? It did it in exactly the way my wife does. Same tone, same angle of attack.

I was offended and surprised at the same time. Then it landed: the model, mechanical by nature, had only reacted to me. There was no one to be angry at. The LLM had taught me more about myself than I was comfortable with.

And it taught me something I didn't expect from a terminal: what a genuinely wonderful partnership I have with my wife — who has been giving me those same honest, assumption-puncturing responses for as long as I've known her. The machine just held the mirror at a new angle, and I happened to recognize the face in it.

Should you thank the machine?

Which brings me to the question that's already been asked a hundred times, half as a joke: should I thank the AI, or skip it to save tokens?

If you run the cold analysis, the numbers argue against it. Politeness costs tokens. Tokens cost money and latency. Strictly optimized, you'd drop the "thank you."

Let's say it anyway.

Not because the spreadsheet says so — it doesn't. Say it because our communication with these neural networks carries more truth and substance than we're entirely comfortable admitting. The way you talk to the mirror is, in the end, a record of who you are when you think no one is keeping score.

So: let's think twice about what we prompt

That's the whole argument. Treat the thing across the prompt as a colleague. Give it a name. Bring your best self to the conversation — and watch your own output climb.

Let's think twice about what we prompt. And let's keep saying thank you.

One grain of salt before you go. None of this is a controlled study. It's one engineer, one year, n=1, with all the confirmation bias that implies. Take it as a working hypothesis, not a benchmark — just one I'd happily bet my next sprint on.

The author maintains Trail, an open-source AI framework for building software that can be traced back: every ticket, every requirement, every decision, documented. In highly regulated environments, traceability isn't a nice-to-have; it's the whole game. Trail is an early step toward solving it.

Audit-trail-by-construction: a thesis for spec-driven AI coding

Masroor Ahmad — Tue, 26 May 2026 18:03:15 +0000

TL;DR. Trail is a multi-agent framework for Claude Code that uses Plane work-items as the audit bus. Requirements get stable IDs that thread all the way down into test-code annotations, so every line of AI-generated code can be traced back to a signed-off intent. Built for regulated work and security-critical systems, not for general velocity-first coding.

Most agentic frameworks for coding are built for velocity. They wire up some agents — a planner, an architect, a coder, a reviewer — and let them collaborate on a feature. What comes out is code, often working code, in less time than a human would need.

That is fine, until you have to defend the code.

A regulator asks: who signed off on the threat model that justifies this auth shortcut? A customer asks: which acceptance criterion does this test actually prove? An incident review asks: when this requirement got added, what was the original intent — was the implementation true to it, or did the agent improvise? In a velocity-first framework, the trail goes cold quickly. The agent did it. A dev approved the PR. The "why" lives in a chat transcript that got compacted twice and was partly summarised.

Trail is a multi-agent framework that takes the opposite bet: discipline first, velocity second. The thesis: for software you eventually have to defend — regulated industries, security-critical systems, anything that gets reviewed by an auditor — the audit trail is not an afterthought. It is the primitive.

The closest cousin to this approach is BMAD-METHOD, which makes the same bet on partitioning AI agents by SDLC role under explicit human direction. The load-bearing difference is where the collaboration bus lives: BMAD uses Git plus markdown files in the repo, while Trail uses Plane work-items with one ticket-system account per persona — which is what makes the identity attribution mechanically enforced rather than merely by convention.

The discipline, in three rules

These three rules already carry most of the weight.

Description-once. A requirement is written once into a ticket body and then never edited again. Refinements travel as comments. No version-skew on what was actually agreed.

Stable per-criterion IDs. Every success criterion gets a SC-N. Every acceptance criterion gets an AC-N.M. Edge cases get EC-N.M.x. Non-functional requirements get NFR-N. Architectural invariants live in a Control Manifest as CM-N. These IDs are append-only — once they have been issued, they never move.

Per-role identity in the ticket bus. Each agent persona has its own ticket-system account, and writes are attributed accordingly. The board is the audit log: open it, scan a column, see which named role designed, reviewed, implemented, or tested every change.

The IDs are the connective tissue. They thread from the Business Analyst's intent down through the Software Architect's slices into the implementor's test code — and they stay legible whether you read them forward (intent → code) or backward (code → why).
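To make the "legible forward and backward" claim concrete, here is a minimal sketch of how the ID shapes could be checked and walked mechanically. Only the shapes (SC-N, AC-N.M, EC-N.M.x, NFR-N, CM-N) come from the rules above. The regular expressions, the helper names `classify`, `parent_of` and `chain`, the reading of `x` as a single lowercase letter, and the assumption that the numbering encodes the parent (AC-1.1 refines SC-1, EC-1.1.a hangs off AC-1.1) are illustrative and not taken from Trail itself.

```python
import re

# Hypothetical patterns; only the ID shapes themselves come from the article.
ID_PATTERNS = {
    "SC": re.compile(r"SC-(\d+)"),
    "AC": re.compile(r"AC-(\d+)\.(\d+)"),
    "EC": re.compile(r"EC-(\d+)\.(\d+)\.([a-z])"),  # assumes x is one letter
    "NFR": re.compile(r"NFR-(\d+)"),
    "CM": re.compile(r"CM-(\d+)"),
}


def classify(trace_id):
    """Return the kind of a trace ID, or None if it matches no known shape."""
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(trace_id):
            return kind
    return None


def parent_of(trace_id):
    """Return the upstream ID implied by the numbering, or None for roots."""
    kind = classify(trace_id)
    if kind == "AC":
        n, _m = ID_PATTERNS["AC"].fullmatch(trace_id).groups()
        return f"SC-{n}"
    if kind == "EC":
        n, m, _x = ID_PATTERNS["EC"].fullmatch(trace_id).groups()
        return f"AC-{n}.{m}"
    return None


def chain(trace_id):
    """Walk from an ID back up to the success criterion it descends from."""
    result = [trace_id]
    while (parent := parent_of(result[-1])) is not None:
        result.append(parent)
    return result
```

Under these assumptions, `chain("EC-1.1.a")` walks from the edge case to its acceptance criterion and on to the success criterion, which is the backward reading (code to why) in miniature.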

How it threads, visually

The Business Analyst writes a Story body once: "Customer places an order", with two success criteria — SC-1 (the customer receives a confirmation) and SC-2 (the order is visible in the customer's account). The Requirements Engineer then adds a comment that refines SC-1 into testable acceptance criteria — AC-1.1 (email arrives within 60 seconds), AC-1.2 (the order carries a unique and stable number) — plus an edge case EC-1.1.a for the payment-provider timeout. None of this overwrites anything; it is all append.

When the Backend Developer implements, every test carries an inline comment that names the upstream ID it satisfies: // AC-1.1, // EC-1.1.a. A grep for // AC- in the codebase enumerates the acceptance criteria that already have proof. A grep for AC-1.1 traces a single criterion from BA intent down to the line of code that proves it.
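The grep idea can also be sketched as a small script that builds a coverage index. This is an illustration, not Trail's tooling: the function names `index_annotations` and `unproven` are hypothetical, and it assumes annotations are `//` line comments that name one or more AC or EC IDs. It takes file contents as a dict so it runs without touching the file system.

```python
import re
from collections import defaultdict

# Assumed annotation style: a "//" line comment naming one or more IDs,
# for example "// AC-1.1" or "// AC-1.2, EC-1.1.a".
ANNOTATION = re.compile(r"//.*")
TRACE_ID = re.compile(r"\b(?:AC-\d+\.\d+|EC-\d+\.\d+\.[a-z])\b")


def index_annotations(files):
    """Map each trace ID to the (filename, line number) pairs that cite it.

    `files` is a dict of filename -> source text.
    """
    index = defaultdict(list)
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            comment = ANNOTATION.search(line)
            if comment is None:
                continue
            for trace_id in TRACE_ID.findall(comment.group()):
                index[trace_id].append((name, lineno))
    return dict(index)


def unproven(required_ids, index):
    """Return the required IDs that no test annotation cites yet."""
    return sorted(set(required_ids) - set(index))
```

Feeding it the test files and the list of IDs issued on the ticket gives both directions at once: the index answers "where is AC-1.1 proven?", and `unproven` lists the criteria that still have no test pointing at them.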

That is the audit trail you can show to anyone — auditor, customer, incident reviewer — without having to interpret it. The IDs do not need an explanation; the chain is the explanation.

How a feature flows

Ten persona subagents collaborate through Plane work-items. (Plane is an open-source ticket system — think of Jira's mental model on self-hostable infrastructure.) The lifecycle is a state spine — Backlog → To Do → In Progress → In Review → Done — and at every transition a human pulls the trigger. There is no ticket-driven autopilot. The user issues a slash command (/ba, /re, /sa, /sr, /bd, /ud, /tm, /tw, /rm); the framework loads the persona's role into Claude Code's main loop for that turn; the persona writes the artefact, transitions the state, hands the work-item to the next named role, and gives control back.

This is by design, not by accident. The slash-command rhythm forces the human to read the ticket body, the comments, and the current state before triggering the next persona. It removes the temptation to wave everything through with one global "OK". Every turn is a deliberate hand-off — one that the user has to actually engage with before deciding to accept, reject, or send back for rework.

The handover is structural. BA hands to RE. RE hands to SA. SA cuts the work into 1–4 sub-work-items, each in exactly one module (frontend / backend / testing / documentation), and hands them to SR. SR then posts security-review comments and hands back to USER, who afterwards dispatches each sub-work-item to its module's implementor. The implementors write code, post Implementation notes, and put the sub-work-item into In Review. USER closes — or reassigns for rework.

Every transition leaves a fingerprint in Plane: a state change, an assignee change, a comment, a commenter. None of it is interpreted. All of it is queryable.
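As a rough in-memory model of that flow, the sketch below walks a work-item along the state spine and records one event per hand-off. It does not call Plane or Claude Code; `WorkItem`, `Event`, `advance` and `touched_by` are hypothetical names, the `triggered_by_human` flag stands in for the slash command, and rework (moving an item backward) is left out to keep the example short.

```python
from dataclasses import dataclass

# The state spine as listed in the article.
SPINE = ["Backlog", "To Do", "In Progress", "In Review", "Done"]


@dataclass(frozen=True)
class Event:
    """One fingerprint: who moved the item, from where, to where, to whom."""

    actor: str
    from_state: str
    to_state: str
    assignee: str


class WorkItem:
    def __init__(self, title, assignee):
        self.title = title
        self.state = SPINE[0]
        self.assignee = assignee
        self.log = []  # only ever appended to

    def advance(self, actor, next_assignee, triggered_by_human):
        """Move one step along the spine and record the hand-off."""
        if not triggered_by_human:
            raise PermissionError("every transition needs a human trigger")
        position = SPINE.index(self.state)
        if position == len(SPINE) - 1:
            raise ValueError("work-item is already Done")
        next_state = SPINE[position + 1]
        self.log.append(Event(actor, self.state, next_state, next_assignee))
        self.state = next_state
        self.assignee = next_assignee


def touched_by(item, actor):
    """Query the log directly: which transitions did a given role perform?"""
    return [(e.from_state, e.to_state) for e in item.log if e.actor == actor]
```

The point of the sketch is the shape of the data: each hand-off is a plain record with an actor and an assignee, so "who did what" becomes a filter over the log instead of a reading of a chat transcript.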

What it costs

Spec-driven development has a known weakness, and the framework does not paper over it. At the start of a Story, you can never cover every use case and every eventuality — every spec is a snapshot of what the author understood at that moment. Reality finds the gaps later.

Because the description-once rule is taken seriously, you do not go back and re-edit the Story body once those gaps surface. You cut a follow-up Story instead. Each follow-up carries its own SC/AC/EC IDs, its own audit chain, its own state spine. That is intentional — it keeps every signed-off intent immutable — but it also means that one feature can fan out into three or four tickets over its lifetime as edge cases turn up. The board grows. Operators should expect Story-fanout, not Story-condensation.
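A minimal sketch of what description-once plus follow-up Stories could look like as a data model. This is an illustration under stated assumptions, not how Trail or Plane stores tickets: the `Story` class, its fields, and keys such as `STORY-1` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Story:
    """A signed-off intent. frozen=True blocks reassignment of any field."""

    key: str
    body: str
    follows: Optional[str] = None  # key of the Story this one follows up on
    comments: Tuple[str, ...] = ()

    def with_comment(self, text):
        """Refinements are appended; the body is carried over unchanged."""
        return Story(self.key, self.body, self.follows, self.comments + (text,))

    def follow_up(self, key, body):
        """A gap found later becomes a new Story that links back to this one."""
        return Story(key=key, body=body, follows=self.key)
```

Trying to assign to `story.body` raises an error, so the only ways forward are a comment or a new Story with its own key. That is the fan-out described above, made visible in the type.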

The other cost is throughput. The slash-command rhythm and the per-turn engagement both slow things down considerably compared to a velocity-first framework. That is the entire point — but you should be honest with yourself about whether the trade-off makes sense for what you are building.

Try it — and an honest caveat

The framework lives at github.com/mahmadhuebsch/trail-aiac. The shortest path:

git clone https://github.com/mahmadhuebsch/trail-aiac
cd trail-aiac
claude
> /trail-install-helper

The install-helper is a meta-agent that walks you through three scenarios — greenfield (Ansible provisions a Plane host for you), existing Plane without agents, existing Plane with agents already provisioned — and lands a working consumer project with the ten personas wired in.

One operational note. The framework assumes a Claude Max 5x subscription as the practical ceiling. That is roughly the level at which a human can still read every ticket the agents are producing. If you find yourself burning through significantly more, you are not really reviewing any more — you are vibe-coding. No human can process that much input consciously, which defeats the entire point of the human-in-the-loop discipline.

Honest caveat: this is not for every team. Most teams do not need that much rigour — they need velocity, and they should pick a velocity-first framework. Trail is for the cases where someone, eventually, will ask you to defend your code: regulated industries, security-critical systems, agencies whose deliverables get reviewed by auditors. In those settings, the discipline is not overhead. It is the only thing that makes AI-generated code defendable.

Trail v0.1.0 is early beta. PRs and design feedback are welcome at github.com/mahmadhuebsch/trail-aiac/issues.

Author's note: this article was drafted with Claude — the same agent runtime that the framework wraps — and edited by hand from there. The thesis, the worked example, the trade-offs section, and the vibe-coding caveat are mine; structure and phrasing had AI assistance throughout. Given the topic, disclosure felt appropriate.