I watched a recruiter share their screen on a client call and realized the worst possible thing was happening: the assistant’s raw “thinking” was spilling onto the screen like debug logs. The content wasn’t wrong—it was just the kind of internal narration you never want a client to read while you’re trying to sound decisive.
This is Part 9 of my series “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”. In Part 8, I talked about routing search across Azure AI Search, pgvector, and the CRM as a live fallback. This post is about what happened next: once the answers got good, the delivery became the product.
The core decision: progressive disclosure via dual-channel streaming (thinking + results) with an interruptible UX. I stream the model’s THINKING tokens on one channel, stream QUERY_RESULT events on another, and build candidate cards from structured events—not from text.
## The key insight (and why the naive approach fails)
The naive approach to streaming a chatbot is: open a socket, forward tokens as they arrive, and call it “real-time.” That works for demos.
It fails in a recruitment workflow for two reasons:
- Client-call UX is fragile. When someone is screen-sharing, you need the UI to be calm: candidate cards, match scores, and crisp summaries. A scrolling wall of chain-of-thought is visual chaos.
- Power users still want transparency. Recruiters doing deep research do want to see why the AI ranked someone highly—keyword matches, percentile rankings, and concerns—but only when they ask for it.
So I built a system that streams two different truths at the same time:
- THINKING: collapsible, optional, “for operators.”
- QUERY_RESULT: structured, progressive, “for the room.”
The trick is that these aren’t just two text streams. The result channel is events that the UI can render deterministically into candidate cards.
## Architecture: two streams, one conversation
Here’s the mental model I use: THINKING is like listening to the assistant talk to itself in the hallway; QUERY_RESULT is what it says when it walks into the meeting room. Same brain, different audience.
```mermaid
flowchart TD
    userQuery[User query] --> chatUi[SearchChatbot UI]
    chatUi --> signalr[SignalR connection]
    signalr -->|THINKING tokens| thinkingPanel[Collapsible thinking panel]
    signalr -->|QUERY_RESULT events| candidateCards[Candidate cards]
    signalr --> chatService[Chat service]
    chatService -->|asyncio.gather| toolCalls[Parallel tool calls]
    toolCalls --> searchTools[Search tools]
    searchTools --> chatService
    chatService --> signalr
```
That separation is what lets me keep the main UX clean while still exposing the “why” to the people who care.
## How it works under the hood
### 1) The UI listens for two different message types
In `src/pages/SearchChatbot.tsx`, the client subscribes to SignalR streaming and treats THINKING and QUERY_RESULT as different beasts.
The important pattern isn’t “we use SignalR.” The important pattern is: **I don’t render everything as text**. I route the message by type and update different parts of state.
```tsx
// src/pages/SearchChatbot.tsx
// SignalR streaming with dual THINKING / QUERY_RESULT channels
import React, { useEffect, useState } from 'react'

type StreamMessage = {
  type: 'THINKING' | 'QUERY_RESULT'
  payload: any
}

export default function SearchChatbot() {
  const [thinking, setThinking] = useState<string>('')
  const [results, setResults] = useState<any[]>([])
  const [isThinkingOpen, setIsThinkingOpen] = useState<boolean>(false)

  useEffect(() => {
    // SignalR subscription lives here; messages arrive as StreamMessage
    const onMessage = (message: StreamMessage) => {
      if (message.type === 'THINKING') {
        // Collapsible operator view: append narration tokens
        setThinking(prev => prev + String(message.payload ?? ''))
        return
      }
      if (message.type === 'QUERY_RESULT') {
        // Structured events build candidate cards deterministically
        setResults(prev => [...prev, message.payload])
      }
    }

    // Hook up SignalR handlers here, e.g. connection.on('StreamMessage', onMessage)
    return () => {
      // Cleanup SignalR handlers here, e.g. connection.off('StreamMessage', onMessage)
    }
  }, [])

  return (
    <div>
      <button onClick={() => setIsThinkingOpen(v => !v)}>
        {isThinkingOpen ? 'Hide thinking' : 'Show thinking'}
      </button>
      {isThinkingOpen ? <pre>{thinking}</pre> : null}
      <div>
        {results.map((r, idx) => (
          <div key={idx}>
            <pre>{JSON.stringify(r, null, 2)}</pre>
          </div>
        ))}
      </div>
    </div>
  )
}
```
What surprised me when I first wired this up is how quickly the UI becomes trustworthy once QUERY_RESULT is event-shaped. Even if the model’s narration is messy, the cards stay clean because the rendering path isn’t “whatever text came in.”
### 2) Candidate metadata arrives as events, not prose
I made a deliberate choice: candidate metadata arrives as structured events (not text). That means the UI can progressively populate cards with stable fields—while the assistant can still stream THINKING separately.
This is where progressive disclosure becomes real: the default view is not an essay; it’s a set of cards.
The user-facing transparency comes from per-candidate explanations—things like keyword matches, percentile rankings, and concerns—but those explanations belong to the candidate card model, not to the token stream.
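To make that concrete, here is one way the candidate-card event could be typed on the client. The core fields mirror what the backend emits (`full_name`, `location`, `designation`, `match_score`); the explanation shape and field names beyond those are illustrative assumptions, not the production schema.

```typescript
// Sketch of a candidate-card event type. Explanation fields are assumptions
// based on the prose (keyword matches, percentile rankings, concerns).
export interface CandidateExplanation {
  keywordMatches: string[]   // which query terms hit this profile
  percentileRank: number     // where this candidate sits in the result set
  concerns: string[]         // flags the recruiter may want to verify
}

export interface CandidateCardEvent {
  type: 'QUERY_RESULT'
  payload: {
    full_name: string
    location: string
    designation: string
    match_score: number
    explanation?: CandidateExplanation // may arrive late; cards render without it
  }
}

// Type guard the UI can apply before touching card state
export function isCandidateCardEvent(
  msg: { type: string; payload?: unknown }
): msg is CandidateCardEvent {
  return msg.type === 'QUERY_RESULT'
}
```

Because `explanation` is optional, a card can render immediately from the core fields and pick up its "why" when that data lands.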
### 3) The agentic tool loop streams THINKING, then executes tools in parallel
On the backend, the chat service runs an agentic loop. The key detail from `chat_service.py` is that `_run_agentic_loop()` streams the model response first, emitting THINKING tokens in real time, then executes tool calls in parallel and emits structured events for each result.
That matters because once you commit to progressive streaming, latency isn't just "time to first token." It's "time to first useful card." Parallel tool execution is how I avoid serial waits — and the loop structure means the model can request more tools after seeing the first batch of results.
```python
# chat_service.py - agentic loop (simplified from production)
async def _run_agentic_loop(self, messages, tool_executor, emit, state):
    iteration = 0
    while iteration < MAX_TOOL_ITERATIONS:
        iteration += 1

        # Stream the LLM response: THINKING tokens emit here,
        # while tool call arguments buffer in chunks
        response_text, tool_calls = await self._call_model_with_tools(
            messages=messages, emit=emit
        )

        if not tool_calls:
            return response_text  # No tools requested; we're done

        # Execute all tool calls in parallel
        tasks = [
            asyncio.wait_for(
                tool_executor.execute(
                    tc["function"]["name"],
                    json.loads(tc["function"]["arguments"]),
                ),
                timeout=TOOL_CALL_TIMEOUT,
            )
            for tc in tool_calls
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Emit structured events per tool result
        for tc, result in zip(tool_calls, results):
            func_name = tc["function"]["name"]
            if func_name in SEARCH_TOOLS and not isinstance(result, Exception):
                candidates = result.get("candidates", [])
                await emit("candidate_metadata", {
                    "candidates": [
                        {
                            "full_name": c.get("full_name", ""),
                            "location": c.get("location", ""),
                            "designation": c.get("designations", ""),
                            "match_score": c.get("search_score", 0),
                        }
                        for c in candidates
                    ],
                    "total": result.get("total", len(candidates)),
                })

        # Feed tool results back into the conversation for the next iteration
        messages.append({
            "role": "assistant", "content": response_text,
            "tool_calls": [
                {"id": f"call_{i}", "type": "function", "function": tc["function"]}
                for i, tc in enumerate(tool_calls)
            ],
        })
        for i, result in enumerate(results):
            messages.append({
                "role": "tool", "tool_call_id": f"call_{i}",
                "content": json.dumps(
                    result if not isinstance(result, Exception)
                    else {"error": str(result)}
                ),
            })
```
The key thing this reveals: THINKING tokens stream during `_call_model_with_tools`; that's when the model is deciding which tools to invoke. While it "thinks out loud," the UI's collapsible panel fills up. Then tools execute in parallel, and each result emits a typed `candidate_metadata` event that the UI renders as a card. The narration and the artifacts never share a rendering path.
### 4) Tools are a real surface area, so I keep them explicit
I have a dedicated module, `search_tools.py`, that contains 16+ tools spanning search, CRM, research, analytics, and workflow. That tool surface area is exactly why I don’t want “results” to be free-form text.
When a tool returns a candidate list, I want the UI to receive a candidate-card-shaped event. When a tool returns analytics, I want a different event type. The stream becomes a typed event bus, not a transcript.
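A minimal sketch of that "typed event bus" idea: each tool family resolves to its own event type, so the UI dispatches on type rather than parsing prose. The tool names and event payloads here are invented for illustration; only `candidate_metadata` appears in the backend code above.

```typescript
// Mapping tool families to typed events. Tool names and the non-candidate
// event shapes are hypothetical, not the production registry.
type ToolEvent =
  | { type: 'candidate_metadata'; payload: { candidates: unknown[]; total: number } }
  | { type: 'analytics_result'; payload: { metric: string; value: number } }
  | { type: 'workflow_update'; payload: { step: string; status: 'pending' | 'done' } }

const TOOL_EVENT_TYPE: Record<string, ToolEvent['type']> = {
  search_candidates: 'candidate_metadata', // hypothetical tool names
  pipeline_analytics: 'analytics_result',
  advance_workflow: 'workflow_update',
}

// Resolve a tool result to a typed event, or null for tools with no UI surface
export function toToolEvent(toolName: string, payload: any): ToolEvent | null {
  const type = TOOL_EVENT_TYPE[toolName]
  return type ? ({ type, payload } as ToolEvent) : null
}
```

The payoff is that adding a seventeenth tool means adding one registry entry and one event shape, not reworking how the chat transcript renders.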
## The interruptible UX: why “collapsible thinking” is not a gimmick
In this system, the collapsible thinking panel isn’t a toy feature. It’s a safety boundary.
- On a client call, the recruiter keeps it closed and gets a calm UI.
- During internal research, a power user opens it to see the reasoning trail.
That’s progressive disclosure: the system is transparent, but it doesn’t demand attention.
I built it because the audience changes minute-to-minute. A recruiter can be in “operator mode” at 9:58 AM, then in “presentation mode” at 10:00 AM. The UI has to support that flip without changing the underlying pipeline.
## Mobile nuance: reconnection delays aren’t one-size-fits-all
One of the details I had to bake in early was mobile behavior. In the SignalR service module, I implemented mobile-specific reconnection delays.
That’s not about being fancy; it’s about respecting the reality that phones drop connections differently than desktops.
```ts
// signalrService.ts
// Mobile-specific reconnection delays
export function getReconnectDelays(isMobile: boolean): number[] {
  if (isMobile) {
    return [2000, 5000, 10000] // slower backoff: let the radio settle
  }
  return [1000, 2000, 5000]
}
```
What I learned here is that “reconnect fast” can be counterproductive on mobile: you can end up thrashing the radio and making the experience feel worse. Different retry delays let the same streaming architecture behave politely across devices.
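For context on how a delay array like that maps onto attempts: SignalR's `withAutomaticReconnect` accepts such an array directly, retrying once per entry and then giving up. This pure helper mirrors those semantics so the schedule is testable without a connection; the wiring into the hub builder is an assumption about this codebase, not shown from it.

```typescript
// One delay per reconnect attempt; null signals "stop retrying",
// matching the contract of SignalR's IRetryPolicy.
export function nextRetryDelay(
  previousRetryCount: number, // 0 on the first reconnect attempt
  delays: number[]
): number | null {
  return previousRetryCount < delays.length ? delays[previousRetryCount] : null
}
```

With the mobile schedule `[2000, 5000, 10000]`, the first retry waits 2 seconds and the client stops after the third failure rather than hammering a flaky radio.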
## A concrete walkthrough: a search query that streams both channels
When someone asks a search question like “find candidates” with constraints (location, designations, limits), the experience splits:
- The THINKING channel streams the assistant’s internal narration—useful for recruiters who want to audit ranking.
- The QUERY_RESULT channel begins emitting structured candidate events as soon as tool calls return.
The UX effect is subtle but important: the user doesn’t wait for a monolithic answer. They see candidate cards appear progressively, with match scores and explanations attached to each card.
And because the result stream is structured, I can keep the “why” close to the candidate it applies to—keyword matches, percentile rankings, and concerns—without turning the entire chat into a forensic transcript.
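The channel split can be simulated without any framework: the same routing logic from the UI sketch, expressed as a pure reducer over an ordered message stream. The message shapes follow the `StreamMessage` type used earlier; the sample stream itself is invented for illustration.

```typescript
// Dual-channel split as a pure reducer: narration accumulates on one side,
// structured payloads on the other, regardless of interleaving order.
type Msg =
  | { type: 'THINKING'; payload: string }
  | { type: 'QUERY_RESULT'; payload: object }

export function reduceStream(messages: Msg[]) {
  let thinking = ''
  const cards: object[] = []
  for (const m of messages) {
    if (m.type === 'THINKING') thinking += m.payload // operator panel
    else cards.push(m.payload)                       // candidate cards
  }
  return { thinking, cards }
}
```

However the two channels interleave on the wire, the cards list only ever contains structured payloads, which is what keeps the presentation-mode view calm.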
## Tradeoffs and limitations
Dual-channel streaming isn’t free.
- You now have two contracts to maintain. THINKING is “best effort text,” while QUERY_RESULT is a schema you can’t casually break.
- Event ordering becomes a product decision. Once results are progressive, you have to decide whether to render partial cards, how to update them, and how to handle late-arriving explanations.
- Transparency is a dial, not a switch. The collapsible panel helps, but you still have to decide what belongs in THINKING versus what belongs as structured explanation.
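The event-ordering point deserves one concrete pattern. A sketch of how late-arriving data can be handled, assuming each candidate event carries a stable `id` (that field and the shapes here are illustrative, not from the production schema): upsert by id and merge, rather than append.

```typescript
// Upsert cards by a stable id so a late explanation patches the existing
// card instead of creating a duplicate.
export type CardPatch = { id: string; [field: string]: unknown }

export function upsertCard(cards: CardPatch[], patch: CardPatch): CardPatch[] {
  const idx = cards.findIndex(c => c.id === patch.id)
  if (idx === -1) return [...cards, patch]   // first sighting: render a partial card
  const merged = { ...cards[idx], ...patch } // late fields fill in
  return [...cards.slice(0, idx), merged, ...cards.slice(idx + 1)]
}
```

This is the product decision made explicit: partial cards render immediately, and explanations merge in whenever they arrive.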
I still take this trade every time because the alternative is worse: either you hide reasoning entirely (and lose trust with power users), or you expose it everywhere (and sabotage client-call usability).
## Closing
The moment I separated THINKING from QUERY_RESULT, the assistant stopped feeling like a streaming text generator and started feeling like an instrument panel: calm defaults for the room, sharp detail when the operator leans in. In Part 10, I’m taking the same philosophy—“make the safe path the default”—and applying it to privacy mode, where anonymization becomes a runtime toggle instead of a one-way data transformation.
🎧 Listen to the Enterprise AI Architecture audiobook
📖 Read the full 13-part series with an AI assistant