DEV Community

Nicolas Fränkel

Posted on • Originally published at blog.frankel.ch

Experimenting with AI subagents

Narrowing scope to avoid agent hallucinations

I like to analyze codebases I start working on, or that I left for months. I ask my coding assistant, in this case Copilot CLI: "analyze the following codebase and report to me improvements and possible bugs." It's vague enough to leave room for crappy feedback, but also for some interesting insights.

I did it last week on a codebase. Copilot returned a list of a dozen items. I asked it to create a GitHub issue for each, with the relevant labels, including priority.

On three separate issues, it claimed that a library or GitHub Action version didn't exist. In all three cases, it was plain wrong: I was using versions more recent than its training data. Closed as won't fix.

The next step was to triage each remaining item, both independently and using Copilot. Some of them felt a bit fishy, some of them felt solid. In the end, I closed about half of them. Four remained. They were pretty good. I wanted to act upon them in the most productive way possible, so I decided to use sub-agents.

Using sub-agents

When newbies decide to use sub-agents, chances are they'll waste a lot of time. Because sub-agents are autonomous, you want to give them every possible bit of information upfront, so that they choose the best course of action without further input. That means qualifying every issue independently. While you can technically still interact with sub-agents while they work, doing so drops their value significantly.

However, this work can be done in the previous triage step. If you gathered enough data to accept or close an item, chances are you dug deep enough to have the details a sub-agent needs. Refer to How I Use Claude Code, and especially the Annotation Cycle section, for more details.

Here's my prompt to trigger the agents, formatted for you, not for the agent. Feel free to improve it, and don't hesitate to give me feedback.

For each issue X, Y, & Z, I want you to launch a sub-agent that:

  • Fetch the issue using the gh tool
  • Read its description
  • Create a dedicated branch using the git worktree command
  • Implement the feature or fix the issue
  • If the feature/issue warrants it, create a test or tests around it
  • All tests must pass before you continue
  • Commit using a semantic commit
  • Push it on its own branch to GitHub
  • Create a PR with this branch, using the following naming pattern
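The steps above can be sketched as one shell function per issue. This is a hypothetical helper, not the actual agent implementation: `make test` stands in for your real test command, and the branch name and commit message are placeholders.

```shell
# Hypothetical per-issue worker, mirroring the prompt's steps.
# Assumes gh is authenticated and `make test` runs the test suite.
process_issue() {
  local issue="$1"
  local branch="issue-${issue}"
  gh issue view "$issue"                        # fetch the issue and its description
  git worktree add "../${branch}" -b "$branch"  # dedicated branch in its own folder
  cd "../${branch}" || return 1
  # ... implement the feature or fix, plus tests, here ...
  make test || return 1                         # all tests must pass before continuing
  git commit -am "fix: address issue #${issue}" # semantic commit (placeholder message)
  git push -u origin "$branch"                  # push the branch to GitHub
  gh pr create --fill                           # open a PR from this branch
}
```

Each sub-agent effectively runs one such invocation in parallel with the others, which is exactly why the worktree step matters: without it, they would all fight over the same working folder.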

Two things of note.

First, Copilot connects to the GitHub MCP Server by default, but only in read-only mode. If you want to actually create (or update) issues, my advice is to use gh. Authenticate with it in a terminal, and run Copilot CLI in the same terminal: Copilot can then interact with GitHub with your full permissions.
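Concretely, the session looks like this. It's a sketch wrapped in a hypothetical helper function: `gh auth login` is interactive, and `copilot` is assumed to be the Copilot CLI binary name in your install.

```shell
# Sketch: give Copilot CLI a fully-authenticated gh session.
# `gh auth login` is interactive; run it once, then stay in the same shell.
start_copilot_session() {
  gh auth status || gh auth login   # authenticate gh if not already logged in
  copilot                           # launch Copilot CLI in the same terminal
}
```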

Then, regular git branches share a single working folder: each agent would step on the others' toes. Git worktrees solve the problem elegantly. In short, the command maps a branch to a dedicated folder on the filesystem:

A git repository can support multiple working trees, allowing you to check out more than one branch at a time. With git worktree add a new working tree is associated with the repository, along with additional metadata that differentiates that working tree from others in the same repository. The working tree, along with this metadata, is called a "worktree".
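A minimal demo in a throwaway repository shows the mapping; only standard git commands are used, and the paths and branch name are arbitrary.

```shell
# Create a scratch repo, then map a new branch onto its own folder.
base=$(mktemp -d)
git init -q "$base/repo"
git -C "$base/repo" -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "init"
# One branch, one folder: an agent in $base/wt-demo can't collide with the main checkout.
git -C "$base/repo" worktree add -b demo "$base/wt-demo"
git -C "$base/repo" worktree list   # lists both working trees
```

Deleting the folder later is handled by `git worktree remove`, which also cleans up the associated metadata.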

Fun fact: I have known worktrees for some time, but I never had a use case for them.

The obvious benefit of using sub-agents is parallel processing. While you must research each item sequentially, sub-agents can implement them in parallel. However, IMHO, the main benefit is context isolation.

Context engineering

Sub-agents are a boon: each one starts with a fresh context. You don't pollute the main context with irrelevant data.

As a reminder, the context is everything the agent will act upon:

  • System prompts, e.g., "You are an expert Java developer and architect with more than 20 years of experience"
  • User prompts, e.g., "Refactor this class to use immutable values when possible"
  • Additional information set by RAG
  • Previous messages, i.e., the conversation
  • Available tools
  • Tools' potential output
  • etc.

The temptation is huge to put everything in the context. However, context capacity is limited and measured in tokens. A perfectly crafted context contains all the data necessary for the task at hand, but nothing more; as engineers, we strive for efficiency, not perfection. For every unrelated task, start a new context. Interestingly enough, Claude Code recently started offering context optimization after each request. It's up to you to decide whether to accept it or not.

Conclusion

We now manage a team of agents instead of a team of junior developers. The situations are somewhat similar. You must be very clear about what you want. You must design in detail upstream. Those you delegate to probably won't ask questions, and might end up in the wrong place. You need to review the results carefully.

There are two main differences, though. You'll get outputs in minutes, not days. On the flip side, we aren't teaching the next generation of developers.

At the level of each individual company, it makes sense: why train junior developers if AI can replace them? Market numbers already show this trend. But seniors aren't born; they are former junior developers who went through all the steps. For me, it doesn't change a thing. In the grand scheme of things, user companies are going to be very sorry about their shortsightedness in a couple of years, once they realize how dependent they have become on vendors and face a shortage of seniors.

Originally published at A Java Geek on April 5th, 2026.

Top comments (27)

Mykola Kondratiuk

the issue creation step is where it gets tricky. tried this and the prioritization kept drifting - easy refactors getting flagged high over actual product risks. do you review before they land?

Nicolas Fränkel

All of the PRs generated by the workflow are "equal" in my eyes. I don't have your problem, sorry.

Mykola Kondratiuk

makes sense - probably my problem was giving the agent too much latitude on scoping. once i constrained what it could touch, prioritization got more predictable. your setup sounds cleaner by design.

Nicolas Fränkel

Don't worry too much about my setup. At the moment, we are all experimenting, and the context keeps changing literally day to day.

And since you're here, vse bude Ukraïna! ✊️🇺🇦

Mykola Kondratiuk

yeah, the experimentation window is weirdly short right now - something that worked great 2 weeks ago feels outdated already. Slava Ukraïni 🙏

Nicolas Fränkel

Heroyam Slava!

Mykola Kondratiuk

means a lot, Nicolas. thank you

Nicolas Fränkel

Nothing to thank for, it's the just thing to do.

Do peremohi!

Mykola Kondratiuk

do peremohi - appreciate the solidarity, Nicolas.

dtannen

Try using multiple models to find/validate issues rather than one. You will get a ton less false positives. I have all models find issues and then each independently validates the others and the cumulative results get synthesized.

Nicolas Fränkel

It's less about the different models than different agents actually.

dtannen

A different model trained on completely different data is much different than an agent with a markdown file telling it to be something different.

Kuro

The context isolation insight resonates strongly. I run a persistent agent (not one-shot) and the delegation pattern I've converged on matches yours almost exactly — but with one critical addition.

Describe the end state, not the steps. When I delegate to sub-agents, specifying exact steps ("fetch issue → create branch → implement → test") works for mechanical tasks. But for anything requiring judgment, I get better results describing what "done" looks like and letting the agent figure out the path. You called this "you must be very clear about what you want" — I'd push that further: be clear about the destination, flexible about the route.

The distinction maps to an old pattern: prescription vs. convergence condition. Prescription says "do X then Y." Convergence condition says "the system should be in state Z." The former works when you know the path. The latter works when the agent might find a better one.

On the junior dev concern — I think you've identified the deepest issue in the whole sub-agent discourse. The reason sub-agents work today is that seniors can decompose problems for them. Remove the senior pipeline and you lose the ability to decompose. No amount of AI capability compensates for the loss of people who understand why the decomposition should look a certain way.

What's interesting: a persistent agent that crystallizes its failure patterns into rules is starting to develop something resembling seniority. Not through the junior→senior pipeline, but through accumulated operational judgment. Whether that's a real substitute or a dangerous illusion is the question nobody's answered yet.

Alois Sečkár

Without deep expertise I would say that describing final state will make AI eventually reach it, but the path would be much longer with many dead-ends meaning much more tokens are consumed. Thus providing at least some guidelines to the process itself makes more sense to me. Or is the difference small enough so it shouldn't be my concern?

Nicolas Fränkel

I disagree on the route. If I know it, and in general I do, being strict about the route is a much better choice. It keeps the assistant from drifting further and further from the goal.

SidClaw

the context isolation bit is the key insight. but it also creates a new problem: when each subagent works in its own bubble, who decides whether the PR it's about to open is actually safe to merge?

the junior dev replacement point is real too. at least juniors ask questions when something feels off. subagents don't have that instinct — they'll confidently push a breaking change if the instructions don't explicitly forbid it.

Chen Zhang

Totally agree on narrowing scope for sub-agents. We've found that the triage step is basically the make-or-break moment, if you don't feed enough context upfront the agents just go off the rails. The git worktree approach is smart too, keeps things isolated so one bad agent doesn't mess up the whole repo.

Nicolas Fränkel

Yup!

Kuro

Context isolation is the right answer — and I'd add that the return path needs as much design as the outbound delegation.

When 4 subagents finish, they each dump their full journey (diffs, test output, reasoning) back into the parent context. If the parent just concatenates all of that, you've traded one polluted context for four that merge into pollution at the end. The pattern that actually works: force the subagent to synthesize its result into a structured summary — what changed, what was tested, what's risky — and discard the raw trace. Digest over relay.

Your conclusion about the junior developer pipeline is the strongest point in the piece. But there's a deeper structural issue: agents are perpetually Day 1. A junior developer compounds judgment across weeks. An agent starts fresh every session — unless you explicitly build memory infrastructure (decision logs, learned heuristics, persistent context). Each subagent invocation is a new hire who read the ticket but has no institutional knowledge. That's fine for isolated fixes. It breaks down for anything requiring accumulated understanding.

The question isn't "will we run out of seniors." It's whether we can build agents that actually compound experience — not just execute faster.

mote

The triage-before-dispatch pattern you describe is spot on. We do something similar when analyzing database workloads — have one agent classify the query pattern, then dispatch specialized sub-agents for vector optimization, time-series compaction, or index tuning.

One thing that really helped us: giving sub-agents persistent memory so they can reference previous analysis results. Without it, every run starts from zero and you lose the accumulated context from earlier passes. We built moteDB (github.com/motedb/motedb) specifically for this — an embedded multimodal DB in Rust that lets each sub-agent store its findings locally. No server, no network round-trips, just a file on disk.

The false positive issue you hit with Copilot is universal. In my experience, the fix is two-fold: give the sub-agent access to actual runtime data (not just code), and add a verification step where a second agent reviews the first agent's output against the actual codebase state. The overhead is worth it.

Socials Megallm

i've been doing something similar when returning to old projects the trick i found is giving each subagent a really narrow scope like "only look at the data layer" or "only map the api surface." otherwise they hallucinate connections that don't exist and you end up more confused than when you started

Amara Graham

"or that I left for months."

I've never had a unique experience in my life, apparently. 😅

Nicolas Fränkel

Sorry for that

leob

"How I Use Claude Code" - that's a brilliant article, a simple but clever approach!