BraidFlow is a collaborative platform where AI actors work alongside humans in conversations. When a user sends a message, the system needs to decide which AI actor should respond - the Python developer, the UX designer, the marketing writer, or maybe none of them.
When I started building BraidFlow, the intuitive approach was to model this on how humans do it. In a real conversation, people self-select - they listen, judge their own relevance, and speak up when they have something to add. LLMs are trained on human writing and interaction, so designing agents around human-like behaviour felt like the right call. Let each actor evaluate itself, score its own relevance, and speak up when it has something to add. It mirrors how we organize human teams.
But there's a problem with this assumption: LLMs are terrible at saying "I don't know." Research that came out while I was building BraidFlow backed this up - a Carnegie Mellon study found that GPT-4 assigned its highest confidence score to 87% of its responses, including ones that were factually wrong. If they can't reliably assess their own certainty, can we trust them to assess their own relevance?
So when I built a system where AI actors self-score on relevance, I was asking them to do the one thing they're worst at. I replaced 7 parallel self-scoring calls with a single comparative orchestrator call. It cut token usage by about 40%, cost roughly a quarter as much, and produced better selections. Best of all, it solved one of the hardest problems in multi-agent selection: knowing when none of the actors should respond.
## How actor selection typically works
The conventional approach - and what BraidFlow originally used - is a distributed voting system. Each AI actor independently evaluates the conversation and scores itself 0-100 on how well-suited it is to respond. Highest score above 60 wins.
The implementation uses fast, cheap LLM calls (Claude Haiku or GPT-4o-mini), running in parallel batches of 3. Each actor gets its own prompt containing its full system context - description, skills, persona, and custom bidding instructions - alongside the recent conversation history.
```
Actor: "Python Developer"
Context: User asked about parsing CSV files
Self-assessment: { "score": 82, "reasoning": "CSV parsing is core Python/pandas work" }
```
The appeal of this pattern is clear: each actor evaluates from its own complete context. A Python Developer has detailed bidding instructions like "bid high when the conversation involves data pipelines, pandas, or scientific computing; bid low for frontend work." That nuance is maintained by whoever created the actor and travels with it. Adding a new actor to a team requires zero changes to any selection logic - the actor arrives with its own bidding prompt and participates immediately. No central routing table, no selection rules to update.
Actors also scale horizontally. Seven actors means seven independent calls that can overlap. The system gets wider, not slower.
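Conceptually, the voting path reduces to "score everyone in parallel batches, then take the top bid above a threshold." Here is a TypeScript sketch with hypothetical names (the real per-actor self-assessment is an LLM call; it's stubbed with fixed scores here so the batching and threshold logic run deterministically):

```typescript
type Bid = { actor: string; score: number; reasoning: string };

const WIN_THRESHOLD = 60; // highest self-score above 60 wins

// Stand-in for the real fast-model (Haiku / GPT-4o-mini) self-assessment call.
async function selfAssess(actor: string, context: string): Promise<Bid> {
  const stubbed: Record<string, Bid> = {
    "Python Developer": { actor: "Python Developer", score: 82, reasoning: "CSV parsing is core Python/pandas work" },
    "TypeScript Developer": { actor: "TypeScript Developer", score: 45, reasoning: "Could help with type-safe API contracts" },
    "Marketing Writer": { actor: "Marketing Writer", score: 20, reasoning: "Not a marketing question" },
  };
  return stubbed[actor] ?? { actor, score: 0, reasoning: "no match" };
}

// Pure selection rule: highest bid wins, but only above the threshold.
function pickWinner(bids: Bid[]): Bid | null {
  const best = bids.reduce((a, b) => (b.score > a.score ? b : a));
  return best.score > WIN_THRESHOLD ? best : null;
}

// Run the self-assessments in parallel batches of 3, as the pipeline does.
async function runVoting(actors: string[], context: string): Promise<Bid | null> {
  const bids: Bid[] = [];
  for (let i = 0; i < actors.length; i += 3) {
    const batch = actors.slice(i, i + 3);
    bids.push(...(await Promise.all(batch.map((a) => selfAssess(a, context)))));
  }
  return pickWinner(bids);
}
```

Note that nothing in `pickWinner` knows the bids came from different judges with different rubrics — which is exactly the calibration problem described below.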
## What made me want to test an alternative
Two things - both symptoms of the same underlying problem.
Calibration. An 82 from the Python Developer and a 78 from the TypeScript Developer aren't on the same scale. Each actor is evaluating itself in isolation - it has no idea what the other options are. The Python Developer doesn't know the TypeScript Developer also scored high. The Marketing Writer doesn't know it's the only relevant actor and could have scored 95 instead of a cautious 65.
This is basically the same problem as asking job candidates to grade their own interviews. Everyone's working from a different rubric.
Self-report bias. Actors tend to find reasons they're relevant. Ask a TypeScript Developer "should you respond to a question about deploying a Python Flask app?" and it'll say something like "I could help with the deployment configuration and type-safe API contracts" - score 45. Not high enough to win, but higher than it should be. Every actor nudges its score up because the prompt is framed as "how well-suited are YOU" rather than "who's best for this."
Both of these are exactly what you'd expect if you take the overconfidence research seriously. I had modelled agent selection on how humans self-select in conversations - but I'd given the job to systems that are fundamentally incapable of honest self-assessment. The human metaphor was the problem.
I wanted to know if a single comparative call would do better.
## The orchestrator approach
Instead of asking each actor to evaluate itself, make one LLM call that sees ALL actors and picks the best one.
```
Here's the conversation context.
Here are all 7 available actors with their skills and descriptions.
Who should respond?
```
One call. Comparative assessment. No self-report bias.
The trade-off: the orchestrator only sees each actor's description and skills list - not their full system prompts, not their custom bidding instructions, not their persona. It's making a selection based on a summary of each actor's capabilities. If an actor has subtle engagement rules encoded in its bidding prompt, the orchestrator won't know about them.
I built an OrchestratorService that constructs a single prompt with the full actor roster and conversation context, calls the same fast model (Haiku), and returns scores for every actor plus its selection. The question was whether the comparative framing would outweigh the information loss.
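The shape of that single call looks roughly like this — a sketch, not BraidFlow's actual code, with illustrative names (`buildOrchestratorPrompt`, `parseOrchestratorReply`) and an assumed JSON reply format:

```typescript
type ActorSummary = { name: string; description: string; skills: string[] };
type OrchestratorResult = {
  scores: Record<string, number>;
  selected: string | null;
  reasoning: string;
};

// One prompt containing the whole roster, so the model scores comparatively.
function buildOrchestratorPrompt(context: string, actors: ActorSummary[]): string {
  const roster = actors
    .map((a) => `- ${a.name}: ${a.description} (skills: ${a.skills.join(", ")})`)
    .join("\n");
  return [
    "Here is the conversation context:",
    context,
    "Here are the available actors:",
    roster,
    "Score every actor 0-100 on fitness to respond, then select the best one,",
    "or select none if no actor is a reasonable fit. Reply as JSON:",
    '{ "scores": { "<name>": <score> }, "selected": "<name or null>", "reasoning": "..." }',
  ].join("\n\n");
}

// Parse the model's JSON reply into a typed result, tolerating missing fields.
function parseOrchestratorReply(raw: string): OrchestratorResult {
  const parsed = JSON.parse(raw);
  return {
    scores: parsed.scores ?? {},
    selected: parsed.selected ?? null,
    reasoning: parsed.reasoning ?? "",
  };
}
```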
## Building the comparison
I used promptfoo (an LLM evaluation framework) with a custom provider that drives real conversations through the full pipeline - encryption, authentication, the async worker, everything. No mocking.
The orchestrator provider does something clever: it lets the normal pipeline run (voting happens naturally), then makes a side-channel call to the orchestrator endpoint with the same conversation. Both systems evaluate the same context independently. The provider returns both selections for assertion comparison.
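In promptfoo, a custom JavaScript/TypeScript provider is a class exposing `id()` and `callApi()`. A hedged sketch of the dual-selection shape — the two pipeline calls are stubbed here, where the real harness hits the live BraidFlow endpoints:

```typescript
type Selection = { selected: string | null };

// Stand-ins for the real endpoints (stubbed for illustration).
async function runVotingPipeline(conversation: string): Promise<Selection> {
  return { selected: "Data Analytics Specialist" }; // stub
}
async function callOrchestrator(conversation: string): Promise<Selection> {
  return { selected: "Data Analytics Specialist" }; // stub
}

class DualSelectionProvider {
  id(): string {
    return "braidflow:dual-selection";
  }

  // promptfoo invokes callApi once per test case; returning both
  // selections lets the assertions compare them directly.
  async callApi(prompt: string): Promise<{ output: string }> {
    const [voting, orchestrator] = await Promise.all([
      runVotingPipeline(prompt), // normal pipeline: voting happens naturally
      callOrchestrator(prompt), // side-channel comparative call
    ]);
    return {
      output: JSON.stringify({
        voting: voting.selected,
        orchestrator: orchestrator.selected,
      }),
    };
  }
}
```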
The test scenarios use my actual team actors:
- Data Analytics Specialist (Python, pandas, matplotlib)
- Mobile App Development Specialist (React Native, iOS)
- TypeScript Development Specialist
- Technical Writing Specialist
- Party Planning Specialist (yes, really)
- Machine Learning Specialist
- Web Design Specialist
Five test cases covering clear domain matches, cross-domain ambiguity, and requests outside everyone's expertise.
## The results
Here's the full scoring breakdown from the most recent run:
| Scenario | Data Analytics | Mobile Dev | TypeScript Dev | Tech Writing | ML | Party Planning | Web Design |
|---|---|---|---|---|---|---|---|
| CSV parsing + matplotlib | **90** | 10 | 15 | 10 | 20 | 5 | 5 |
| React Native iOS crash | 15 | **95** | 20 | 10 | 10 | 5 | 5 |
| TypeScript docs + API docs | 20 | 20 | **90** | 80 | 20 | 20 | 20 |
| Mandarin legal translation | 10 | 10 | 10 | 30 | 10 | 10 | 10 |
| Company offsite for 50 | 10 | 10 | 10 | 10 | 10 | **90** | 10 |
Bold scores are the orchestrator's selection. A few things jump out.
The clear-domain cases are decisive. CSV parsing, iOS crash, and party planning all produce 70+ point spreads between the selected actor and the runner-up. When the match is obvious, the orchestrator treats it as obvious.
The cross-domain case is nuanced. For TypeScript docs + API documentation, the orchestrator scored TypeScript Dev at 90 and Technical Writing at 80 - correctly identifying that both actors are relevant while still making a clear pick. With voting, whichever actor happens to self-score higher wins without any awareness that the other also scored well.
The "no match" case works. For the Mandarin legal translation, the orchestrator selected no actor at all. The highest score was Technical Writing at 30 - well below any reasonable threshold. The orchestrator's reasoning: "None of the available actors possess expertise in legal translation or Mandarin language skills." This is exactly the behaviour I want. Earlier runs had the orchestrator stretching to find relevance here, so this improved as I refined the prompt.
5 out of 5 passed.
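The "no match" behaviour boils down to a threshold guard over the orchestrator's comparative scores. A minimal sketch (the threshold value here is illustrative, not BraidFlow's actual setting):

```typescript
// Pick the top-scoring actor, or nobody if even the best score
// is below the selection threshold.
function selectOrNone(scores: Record<string, number>, threshold = 60): string | null {
  let bestActor: string | null = null;
  let bestScore = -1;
  for (const [actor, score] of Object.entries(scores)) {
    if (score > bestScore) {
      bestActor = actor;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? bestActor : null;
}
```

Run against the Mandarin row above (highest score 30), this returns no actor; run against the CSV row, it returns Data Analytics decisively.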
## The numbers
| Metric | Voting | Orchestrator |
|---|---|---|
| API calls | 7 (in batches of 3) | 1 |
| Tokens per selection | ~2,100 across 7 calls | ~1,300 in 1 call |
| Cost per selection | ~$0.0015 | ~$0.0004 |
| Orchestrator latency | n/a | 8-11s |
The orchestrator uses about 40% fewer tokens and costs about a quarter as much. The token savings come from the actor descriptions compressing well when listed together, rather than being repeated in 7 separate prompts with shared boilerplate.
## What I learned
The human metaphor has limits. We model multi-agent systems on human teams because LLMs are trained on human interaction, and for many design patterns - role specialization, delegation, structured collaboration - that works well. But self-assessment is exactly where the metaphor breaks down. Humans can genuinely introspect on their relevance. LLMs can't - they imitate patterns of confidence without any internal sense of certainty. Asking "how relevant are you?" is setting them up to fail at the one thing they're worst at. The fix wasn't better prompts - it was changing the architecture so no agent ever has to evaluate itself.
The information loss matters less than I expected. The orchestrator works from compressed summaries, not full prompts. I expected this to cause worse selections for actors with nuanced engagement rules. In practice, the descriptions captured enough for the orchestrator to make correct choices in all 5 cases. The comparative framing more than compensated for the lost detail.
The "no match" case is solvable with the right framing. Earlier runs had the orchestrator stretching to find relevance for the Mandarin legal translation test - giving Technical Writing a 65 because it "could help structure the translated document." After refining the orchestrator prompt, it correctly returned no selection with all scores under 30. The comparative framing helps here too: when you can see that every actor is equally irrelevant, it's easier to say "none of these."
Testing the real pipeline matters. Actor selection interacts with goal tracking, drift detection, skills gap analysis, and the conversation welcome flow. I had to build a testing framework that exercises the full pipeline end-to-end, with real API calls and real LLM responses, to get meaningful results. Isolated unit tests of selection logic would have missed the interactions entirely.
## What's next
Based on these results, I'm switching to the orchestrator pattern. It's cheaper, faster, and more decisive - and the comparative framing produces better selections than self-assessment for the same reasons it does with humans.
Two things to keep an eye on as I scale. The first is prompt size: with 7 actors the orchestrator prompt is ~900 tokens. With 50 actors it'd be ~7,500 tokens. Still within fast-model limits, but the cost advantage narrows. The second is information loss at the edges. The current actors have relatively straightforward engagement rules - the description and skills list captures them well enough. But if actors develop complex conditional logic in their bidding prompts ("engage for iOS issues only when the user has shared a stack trace AND the goal is about debugging"), the orchestrator's compressed view might miss nuances that the voting system would catch.
The bigger opportunity is what comes after selection. Now that I have promptfoo wired into the real pipeline, I'm extending it to test the underlying actor prompts themselves - running multi-turn conversation simulations to evaluate how well actors actually perform once they're selected. Selection is just the routing layer. The quality of the response depends on the actor's system prompt, and that's where the real leverage is. More on that in a future post.
Find me on LinkedIn or Bluesky - I'd love to hear how others are handling multi-agent selection and what trade-offs you've hit.
BraidFlow is where I'm building all of this - a conversational platform where AI actors work alongside humans, not just respond to them.