
Olexandr Uvarov

I Built an AI Agent to Do My Pre-Refinement. It Turned Into a Mirror of How We Wrote Tickets.

One hour. Seven hours. Same ticket, same prompt, two days apart.

The agent wasn't broken. It was showing me what my team had been doing silently for years.

But I'm getting ahead of myself.

The setup

Most teams I've worked on have a stage between "product writes a ticket" and "the team estimates it" — pre-refinement, grooming, technical intake, different names for the same homework. A developer reads the ticket before the team meeting, opens the design tool, searches the repos for similar components, and leaves a technical comment: what already exists, what needs to be built, files involved, an hour estimate, open questions. Then the team meets and estimates.

Without that homework, the meeting becomes the homework. Fifteen minutes stretches to forty-five.

That homework cost me about a day and a half per sprint. Not glamorous. Not particularly hard. Mostly reading, searching, context-switching, and a lot of "I swear we built something like this six months ago, where is it."

Perfect thing to hand to an AI agent.

The first version

The setup was intentionally boring. Nothing exotic under the hood — an LLM with tool access to our ticket system, the design tool, and both repositories, plus a prompt that said read these sources, produce this output, post it back as a comment. That's it.

What the agent was supposed to produce, per ticket:

  • What parts of the feature already exist in code
  • What needs to be built from scratch
  • A short plan with files involved
  • An hour estimate
  • Open questions

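For concreteness, here's that output contract sketched as a TypeScript type. The field names are mine, purely illustrative; the real comment was free-form text with these sections.

```typescript
// Hypothetical shape of the agent's per-ticket comment.
// Field names are illustrative, not anything our tooling mandates.
interface PreRefinementComment {
  alreadyExists: string[];                   // parts of the feature found in code
  toBuild: string[];                         // parts that need building from scratch
  plan: { step: string; files: string[] }[]; // short plan with files involved
  estimateHours: number;                     // an hour estimate
  openQuestions: string[];
}
```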
I ran it across the backlog. Comments appeared. Plans looked reasonable. I moved on, happy to have my afternoons back.

Then I started actually reading what it produced.

Discovery one: the agent was blind to what we already had

The agent kept confidently recommending we build things we'd already built.

A typical shape: a ticket would describe a feature — say, a section that displays a list of items with specific formatting, calculations, and localization rules. The agent would read it, search the code, and recommend building it from scratch. Meanwhile the exact calculation lived in another component, built months ago for a different flow. The agent missed it because the component name didn't match the language of the ticket, and because we have hundreds of components with similar-sounding names.

This wasn't the agent being dumb. This was me asking it to navigate our codebase the way a new hire does on day one — by keyword search and guesswork.

Why this matters more than it sounds

Our system is split across two repositories — a frontend one, and a CMS-style one. The frontend holds the actual UI components. The CMS-style repo holds configurable blocks that reference those components, and it's where product writes tickets from — picking blocks from a list, configuring their content, wiring them into flows.

The problem isn't that components are missing. They're there. Both in the frontend and registered in the CMS. The problem is that there are a lot of them, built over years, and nobody remembers all of them. Someone writing a ticket for a new section rarely scrolls through the full list of existing blocks to check. They describe what they want in their own words, and the ticket enters the backlog as "new feature" — even though the block they want already exists three dropdowns away.

The fix: give the agent institutional memory

The agent needed a map of what already existed. Not the code itself — the vocabulary of it. A living document listing our section types, screen types, their variants, and what each one actually does, in plain language.

So I wrote a second, smaller command: a sync script. It scans the frontend repo, pulls all the component schemas, and maintains a markdown file — something like component-patterns.md — that describes what exists, grouped by category, with short plain-language notes on what each component does. The main agent reads that file before doing anything else. Before searching. Before planning. Before estimating.
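The sync script itself is small. Here's a minimal sketch in TypeScript, assuming Node 18.17+ (for recursive `readdirSync`) and that every component ships a schema file with a name, category, and plain-language description. The file layout and field names are invented for illustration, not our actual ones.

```typescript
// Sketch of the sync script: scan the frontend repo, collect component
// schemas, and regenerate component-patterns.md grouped by category.
import * as fs from "fs";
import * as path from "path";

interface ComponentSchema {
  name: string;
  category: string;    // e.g. "sections", "screens"
  description: string; // plain-language note on what the component does
}

function collectSchemas(repoRoot: string): ComponentSchema[] {
  return (fs.readdirSync(repoRoot, { recursive: true }) as string[])
    .filter((f) => f.endsWith(".schema.json"))
    .map((f) => JSON.parse(fs.readFileSync(path.join(repoRoot, f), "utf8")));
}

function renderPatterns(schemas: ComponentSchema[]): string {
  const byCategory = new Map<string, ComponentSchema[]>();
  for (const s of schemas) {
    byCategory.set(s.category, [...(byCategory.get(s.category) ?? []), s]);
  }
  let md = "# Component patterns\n";
  for (const [category, items] of byCategory) {
    md += `\n## ${category}\n`;
    for (const s of items) md += `- ${s.name}: ${s.description}\n`;
  }
  return md;
}

fs.writeFileSync(
  "component-patterns.md",
  renderPatterns(collectSchemas("./frontend-repo"))
);
```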

That single change turned the agent from "a confident outsider" into "someone who's at least read the team wiki." It started flagging overlaps it used to miss — tickets where the real answer wasn't "build this," but "most of this already exists, we just need to extract it and reuse it."

A small but consistent pattern appeared after that: two to five tickets per sprint would collapse. Not get simpler — the scope itself would change, from "build a new section" to "reuse the existing one with minor tweaks." Work that, measured honestly, shouldn't have existed.

That was the first uncomfortable finding. The slow part of pre-refinement wasn't reading the ticket. It was the institutional forgetfulness around what we'd already built.

Discovery two: the estimates drifted

Once the agent had a map of the codebase, the plans got better. But the estimates started doing something weird.

I ran the same ticket through the agent on two different days. No changes to the ticket. No changes to the prompt. First run: one hour. Second run: seven hours.

I read both outputs carefully. In one run, the agent had quietly assumed mobile wasn't in scope, the text was hardcoded, and the amounts didn't need locale-specific formatting. In the other run it had assumed the opposite on all three. The ticket didn't specify any of them — so the agent filled the gaps differently each time. Gaps are gaps.

Then came the uncomfortable thought: this is exactly what humans on my team had been doing, too. Silently. Each person filling the same blanks with their own assumptions, and the aggregate came out looking like "team judgement" at the refinement meeting.

When estimates disagree in a room of people, we call it a discussion. When estimates disagree between two runs of the same model, it's glaringly obvious they're both guesses. The agent wasn't adding noise. It was surfacing noise we had been absorbing without noticing.

Discovery three: the same word meant different things

The same drift showed up in language, not just numbers.

A ticket uses the word "static." Everyone nods. Nobody stops to check: static in what sense? No animation? No saved user state? No editable copy in the CMS? Three different questions, one word.

The agent would hit tickets like that and — because it has no social cost for asking — just ask. "You wrote 'static.' Do you mean not animated, not interactive, or not editable? The design shows a button, which suggests interactivity, so I'm unsure."

Sometimes the PM meant one thing, the designer meant another, and whoever eventually built it would have read it as a third. Without the agent, those disagreements surfaced halfway through implementation. The ticket bounced around chat for a day, the developer rewrote the screen, everyone moved on. With the agent, they surfaced before anyone started coding. Same question. Vastly cheaper to ask.

The part I got wrong

My first instinct, after seeing these drifts, was to tune the agent harder. Add constraints. Force it to flag missing context instead of guessing. Make it more deterministic.

It helped a little. It didn't fix the root issue.

Because the agent was being asked to produce a technical plan from a ticket that wasn't ready to be turned into one. You can't tune your way out of that. You're asking the wrong question of the wrong document.

The weak link wasn't the agent. It was the input. And the input came from a different person — a product manager — who needed a different kind of feedback than the technical agent was producing. Files, gap analyses, hour estimates — useful to a developer. A PM reads that and has nowhere to hook in. They wrote requirements; they didn't write code. A technical review doesn't give them anything they can fix.

If I wanted the ticket to reach the technical agent in usable shape, I needed a different review — aimed at the ticket itself, written in the language of requirements.

Same access, opposite voice

So I built a second agent. A task quality reviewer.

Its plumbing is almost identical to the first one. Same ticket system access. Same repository search. Same design tool integration. Same component-patterns map.

The personality is the opposite. The system prompt has a hard rule near the top — roughly:

```
You have full access to both codebases. Use it internally to understand
what already exists and what's feasible.

STRICTLY FORBIDDEN in your output:
- File paths, code, schemas, APIs
- Technical jargon (component, props, endpoint, schema)
- Implementation details or options
- "How to build" anything
- Architecture suggestions

Your job is NOT to plan implementation.
Your job is to ask clarifying questions about REQUIREMENTS.

Transform technical findings into requirement questions.
```

And a translation table that turns things the agent finds in the code into questions a PM can answer:

| What the agent finds | What it actually asks |
| --- | --- |
| The CMS can't store this text | "Do editors need to change this wording, or is it fixed?" |
| A similar feature already exists elsewhere | "Should this behave the same way as that one, or differently?" |
| No error state is shown in the design | "What should the user see if something goes wrong?" |
| No mobile design exists | "Does this need to work on phones? Same layout, or a different one?" |
| A timing behavior isn't specified | "If the user has spent five minutes on the previous screen, does this start fresh or pick up where they left off?" |

The reviewer reads the ticket, checks the design, searches the code to know what's feasible — and produces a list of human-language questions about what the ticket doesn't yet say. The PM gets a comment that feels like a thoughtful peer asking for clarification. They update the ticket. Run it again. In practice it's usually one cycle — one round of questions, one round of answers, done. Occasionally a second pass for edge cases, but rarely a long loop.
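Concretely, "same plumbing, opposite voice" means the two agents differ in a single field. A sketch, with made-up names standing in for the real tools and prompts:

```typescript
// Both agents share one tool list; only the system prompt, and therefore the
// intended reader, differs. Every name here is illustrative.
type Tool = { name: string; run: (query: string) => Promise<string> };

interface AgentConfig {
  systemPrompt: string;
  tools: Tool[];
}

declare const ticketSystem: Tool;
declare const designTool: Tool;
declare const repoSearch: Tool;        // searches both repositories
declare const componentPatterns: Tool; // reads component-patterns.md
declare const TECH_PROMPT: string;     // "produce a plan, files, an estimate"
declare const REVIEWER_PROMPT: string; // the "strictly forbidden" prompt above

const sharedTools: Tool[] = [ticketSystem, designTool, repoSearch, componentPatterns];

const technicalAgent: AgentConfig = { systemPrompt: TECH_PROMPT, tools: sharedTools };
const qualityReviewer: AgentConfig = { systemPrompt: REVIEWER_PROMPT, tools: sharedTools };
```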

What actually fixed things: the gate between them

Having two agents didn't fix anything by itself. The fix was the order — and the rule that the second agent doesn't run on a ticket the first one hasn't cleared.

Here's the flow:

```
  PM writes ticket
        │
        ▼
  ┌──────────────────────┐
  │  Quality reviewer    │  ← same tools, PM-facing voice
  └──────────┬───────────┘
             │
     ┌───────┴───────┐
     │ Questions?    │
     └───────┬───────┘
    yes      │       no
    ◄────────┤
  PM updates │
  the ticket ▼
  ┌──────────────────────┐
  │  Technical agent     │  ← same tools, dev-facing voice
  └──────────┬───────────┘
             │
             ▼
    Team refinement
             │
             ▼
       Execute step
```

Nothing clever in the sequence itself. It's just a gate. But without the gate, the system falls back to the old problem: the technical agent producing confident plans from under-specified tickets, and estimates that silently disagree.

The value turned out not to be in having more agents. The value was in preventing each one from running on an input the previous step was supposed to clean up.
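The gate itself is a few lines of control flow. A sketch, reusing the made-up names from the config above; `runAgent` and `postComment` are hypothetical helpers:

```typescript
// Sketch of the gate. The point is the early return: the technical agent
// never sees a ticket the reviewer still has questions about.
interface Ticket { id: string; body: string }

declare function runAgent(agent: AgentConfig, ticket: Ticket): Promise<string[]>;
declare function postComment(ticket: Ticket, lines: string[]): Promise<void>;

async function processTicket(ticket: Ticket): Promise<void> {
  const questions = await runAgent(qualityReviewer, ticket);
  if (questions.length > 0) {
    await postComment(ticket, questions); // PM answers; ticket re-enters the queue
    return;                               // the gate: stop until the ticket is ready
  }
  const plan = await runAgent(technicalAgent, ticket);
  await postComment(ticket, plan);        // dev-facing write-up for refinement
}
```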

What this actually changed

A few concrete things, after running this setup for a while:

  • Pre-refinement vanished from my calendar. The day and a half per sprint that used to go to reading tickets, opening the design tool, and grepping through repos is roughly zero now. I read the agent's comment on my way to the meeting; if something is unclear, the comment points me to the exact files where the relevant logic lives.
  • Refinement meetings got shorter. The team walks in with the write-up already done. Discussions focus on trade-offs and priorities instead of "what exactly does this ticket even mean."
• Two to five tickets per sprint stopped being new work — they collapsed to "reuse the existing component with minor tweaks" once the agent surfaced what was already there.
  • Estimates stopped drifting on the same ticket. Because the ticket arriving at the technical agent is now complete, the agent doesn't have gaps to fill with guesses.

Honestly, the agent's write-ups are more thorough than mine ever were, because it doesn't get tired halfway through the backlog.

One more step (with caveats)

Once the flow was stable, I added one more thing — not an agent, a small script. It reads the technical plan from the ticket comment, creates a branch in the right repo, and implements the ticket following local conventions.
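Stripped to a skeleton, it looks roughly like this; `fetchPlan` and `runCodingTool` are placeholders for whichever ticket API and coding tool you wire in, not real APIs:

```typescript
// Sketch of the execute step. The git command is real; everything else
// stands in for your own ticket system and coding agent.
import { execSync } from "child_process";

interface Plan { repo: "frontend" | "cms"; text: string }

declare function fetchPlan(ticketId: string): Promise<Plan>;
declare function runCodingTool(plan: Plan, cwd: string): Promise<void>;

async function executeTicket(ticketId: string): Promise<void> {
  const plan = await fetchPlan(ticketId); // the technical agent's comment
  const cwd = plan.repo === "cms" ? "./cms-repo" : "./frontend-repo";
  execSync(`git checkout -b ticket/${ticketId}`, { cwd });
  await runCodingTool(plan, cwd);         // follow the plan, obey local conventions
}
```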

Quality is rough — closer to "a junior following clear instructions" than a senior who's seen the problem before, and not always even that. Small, well-scoped tickets it closes fine. Anything more complex it struggles with. Real room to improve, and it's the next thing on my list.

But the reason the rough version works at all is that by the time a ticket reaches the execute step, the plan is already a plan. Requirements are in the ticket. The existing-code check has been done. The "is this actually new work" check has been done. The script isn't being asked to decide anything big — it's being asked to follow a document.

When it fails, it tends to fail in small, specific ways that point back to a gap one of the earlier steps missed. Even in its current rough shape, it ends up acting as a final validator of the work that came before.

Three takeaways

An AI agent isn't a model. It's an output contract for a specific reader. The same access, same data, same tools can drive two agents that produce completely different — and equally useful — artifacts, just because they're written for different people. Trying to fit two readers into one agent is a design problem, not a prompt-tuning problem.

Automation reveals process. If an automated step produces unstable output, before blaming the model, look at the step before it. Humans may have been quietly absorbing that instability for years. The agent does the same work without the silent patching — and suddenly the noise becomes visible. That's a diagnostic, not a failure.

A chain of agents isn't valuable because there are more agents in it. It's valuable because of the gates between them. The order matters. The refusal of the next step to run on unready input matters. Remove the gates and adding agents just compounds the same mess in more places.


If I had built the pre-refinement agent and stopped there, I'd have reclaimed my afternoons and convinced myself I'd automated a process. Instead I accidentally ran an audit of how we wrote tickets — and that turned out to be worth more than the hours the agent itself saved.

Which is maybe the most honest thing I can say about using LLMs inside a team workflow: the first thing a well-built agent usually shows you is what the team was actually doing before you automated it. Sometimes that's the real story.


This is the opening article in my AI Agents series. It builds on lessons from my earlier CSS in a Team series — the same principle of "make the system enforce itself" shows up here, just in a different medium.
