Guyoung Studio

Posted on Jun 13

BoxAgnts Tool System (6) — Multi-Provider Adaptation and the Agent Query Loop

#ai #agents #rust #webassembly

BoxAgnts' tool system, from the bottom-level WASM sandbox to the top-level Tool trait, has solved "how tools run safely." But tools ultimately need to be called by AI models — which introduces two engineering problems: the complete incompatibility of API formats across AI vendors, and the interleaved orchestration of conversation flow and tool execution. These two problems are solved by the Provider abstraction layer and the Agent query loop, respectively.

Provider Abstraction: Staying LLM-Vendor-Agnostic

AI vendors' model APIs differ significantly in request format, response format, and error handling.

Let's start with the request side. Anthropic allows only the user and assistant roles in the message list and takes the system prompt as a separate top-level system field; OpenAI treats the system prompt as a message with role: "system"; Google Gemini also places system_instruction at the top level of the request body, but as a structured content object rather than a plain string. If the upper-layer Agent loop had to handle these differences directly, the code would degenerate into one giant match provider_id { ... } branch.

BoxAgnts' solution introduces three layers of abstraction:

Layer 1: ProviderRequest / ProviderResponse Unified Data Model

// provider_types.rs
pub struct ProviderRequest {
    pub messages: Vec<ApiMessage>,
    pub system: Option<String>,
    pub tools: Vec<ApiToolDefinition>,
    pub max_tokens: u32,
    pub temperature: Option<f32>,
}

pub struct ProviderResponse {
    pub content: Vec<ContentBlock>,
    pub usage: UsageInfo,
    pub stop_reason: String,
}

The Agent loop only deals with these two structures, never needing to know whether the user has configured Anthropic or OpenAI.

Layer 2: LlmProvider trait

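// `async fn` in a trait is not dyn-compatible on its own, and the registry
// stores Arc<dyn LlmProvider>, so the async_trait macro is assumed here.
#[async_trait::async_trait]
pub trait LlmProvider: Send + Sync {
    fn id(&self) -> &ProviderId;
    async fn create_message_stream(
        &self, request: ProviderRequest
    ) -> Result<Pin<Box<dyn Stream<Item = Result<StreamEvent, ProviderError>> + Send>>>;
    async fn list_models(&self) -> Result<Vec<ModelInfo>>;
}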

create_message_stream returns a Pin<Box<dyn Stream>>, the standard idiom in Rust's async ecosystem for erasing the concrete stream type so that every Provider can return the same type (loosely comparable to returning an AsyncIterator in Python). Each Provider implementation internally handles its own HTTP request construction, authentication, and SSE parsing, and exposes a unified StreamEvent externally.

Layer 3: Transformer (Message Format Conversion)

Transformers handle the "last mile" of eliminating vendor format differences:

// transformers/anthropic.rs
pub fn to_anthropic_request(req: &ProviderRequest) -> AnthropicMessagesRequest { ... }

// transformers/openai_chat.rs
pub fn to_openai_request(req: &ProviderRequest) -> OpenAIChatRequest { ... }
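
As a rough illustration of what such a transformer does, the sketch below only moves the system prompt to where each vendor expects it. The names here (Msg, UnifiedRequest, AnthropicShape, OpenAiShape, to_anthropic_shape, to_openai_shape) are simplified stand-ins invented for this example, not the actual BoxAgnts types, and tools, sampling parameters, and JSON serialization are left out.

```rust
// Simplified stand-ins for illustration; not the real BoxAgnts types.
#[derive(Debug, Clone, PartialEq)]
pub struct Msg {
    pub role: String,
    pub content: String,
}

pub struct UnifiedRequest {
    pub system: Option<String>,
    pub messages: Vec<Msg>,
}

/// Anthropic-style shape: the system prompt is a top-level field.
#[derive(Debug, PartialEq)]
pub struct AnthropicShape {
    pub system: Option<String>,
    pub messages: Vec<Msg>,
}

/// OpenAI-style shape: the system prompt is the first message.
#[derive(Debug, PartialEq)]
pub struct OpenAiShape {
    pub messages: Vec<Msg>,
}

pub fn to_anthropic_shape(req: &UnifiedRequest) -> AnthropicShape {
    AnthropicShape {
        system: req.system.clone(),
        messages: req.messages.clone(),
    }
}

pub fn to_openai_shape(req: &UnifiedRequest) -> OpenAiShape {
    let mut messages = Vec::with_capacity(req.messages.len() + 1);
    if let Some(system) = &req.system {
        // Prepend the system prompt as a role: "system" message.
        messages.push(Msg {
            role: "system".to_string(),
            content: system.clone(),
        });
    }
    messages.extend(req.messages.iter().cloned());
    OpenAiShape { messages }
}
```

Both functions borrow the unified request and return a new value, so the same request can be converted for several vendors without being consumed.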

Transformers are pure functions — unified format in, vendor format out. Adding a new Provider only requires implementing a new Transformer and corresponding LlmProvider implementation. The shared ProviderRegistry looks up implementations by Provider ID:

pub struct ProviderRegistry {
    providers: HashMap<ProviderId, Arc<dyn LlmProvider>>,
    default_provider_id: ProviderId,
}
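
A lookup over such a registry might look like the sketch below. It is a minimal, synchronous model: Provider is a reduced stand-in for LlmProvider that only exposes an id, string ids replace ProviderId, and the resolve method with its fall-back-to-default behavior is a hypothetical choice made for this example rather than something confirmed by the BoxAgnts source.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Reduced stand-in for the LlmProvider trait: only the id is modeled.
pub trait Provider: Send + Sync {
    fn id(&self) -> &str;
}

pub struct StaticProvider(pub &'static str);

impl Provider for StaticProvider {
    fn id(&self) -> &str {
        self.0
    }
}

pub struct Registry {
    providers: HashMap<String, Arc<dyn Provider>>,
    default_provider_id: String,
}

impl Registry {
    pub fn new(default_provider_id: &str) -> Self {
        Self {
            providers: HashMap::new(),
            default_provider_id: default_provider_id.to_string(),
        }
    }

    pub fn register(&mut self, provider: Arc<dyn Provider>) {
        self.providers.insert(provider.id().to_string(), provider);
    }

    /// Hypothetical lookup: use the requested id if it is given and
    /// registered, otherwise fall back to the default provider.
    pub fn resolve(&self, id: Option<&str>) -> Option<Arc<dyn Provider>> {
        id.and_then(|id| self.providers.get(id))
            .or_else(|| self.providers.get(&self.default_provider_id))
            .cloned()
    }
}
```

Storing Arc<dyn Provider> lets several concurrent queries share one provider instance; cloning the Arc only bumps a reference count.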

Streaming Protocols and SSE Parsing

All Providers' streaming interactions rely on SSE (Server-Sent Events). But each vendor's SSE event granularity and semantics differ:

Anthropic's content_block_start / content_block_delta / content_block_stop form a three-level event hierarchy; a single ContentBlock spans multiple SSE messages from start to stop
OpenAI's choices[0].delta is a flat delta with no explicit block start/stop
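Google Gemini's streamGenerateContent endpoint (requested with alt=sse) sends a sequence of complete GenerateContentResponse chunks, with text and function calls carried inside candidates[].content.parts and no block-level start/stop events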
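
Underneath these vendor-specific event vocabularies, the SSE framing itself is uniform: event: and data: fields, with a blank line ending each frame. The sketch below parses that framing from a complete text buffer. SseFrame and parse_sse are hypothetical names for this example; id: and retry: fields are ignored, and a real parser would also have to handle frames split across network chunks.

```rust
#[derive(Debug, PartialEq)]
pub struct SseFrame {
    pub event: Option<String>,
    pub data: String,
}

pub fn parse_sse(input: &str) -> Vec<SseFrame> {
    let mut frames = Vec::new();
    let mut event: Option<String> = None;
    let mut data: Vec<&str> = Vec::new();

    for line in input.lines() {
        if line.is_empty() {
            // A blank line dispatches the frame, if it carried any data.
            if !data.is_empty() {
                frames.push(SseFrame {
                    event: event.take(),
                    data: data.join("\n"),
                });
                data.clear();
            }
            event = None;
            continue;
        }
        if line.starts_with(':') {
            continue; // comment line, often used as a keep-alive
        }
        // Split at the first colon and drop one leading space of the value.
        let (field, value) = match line.split_once(':') {
            Some((f, v)) => (f, v.strip_prefix(' ').unwrap_or(v)),
            None => (line, ""),
        };
        match field {
            "event" => event = Some(value.to_string()),
            "data" => data.push(value),
            _ => {} // id, retry and unknown fields are ignored in this sketch
        }
    }
    // A trailing frame without its closing blank line is not dispatched.
    frames
}
```

The data payload of each frame is then handed to the vendor-specific JSON decoder, which is where the differences listed above start to matter.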

BoxAgnts' stream_parser module digests all these differences and exposes a unified StreamEvent enum:

pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    ThinkingDelta { text: String },
    UsageUpdate { input_tokens: u32, output_tokens: u32 },
    MessageStop,
}

Internally, each Provider's stream parser is a finite state machine. Taking Anthropic as an example:

Wait for message_start
  │
  ├── message_start ──► extract model, initial usage
  │
  ├── content_block_start
  │     │ type = "text"        → create TextBlock state
  │     │ type = "tool_use"    → create ToolUseBlock state, emit ToolUseStart
  │     │ type = "thinking"    → create ThinkingBlock state
  │
  ├── content_block_delta
  │     │ text_delta           → append to current TextBlock, emit TextDelta
  │     │ input_json_delta     → concatenate JSON fragment to ToolUseBlock, emit ToolUseDelta
  │     │ thinking_delta       → append to ThinkingBlock, emit ThinkingDelta
  │
  ├── content_block_stop
  │     │ corresponding tool_use block → emit ToolUseEnd
  │
  └── message_stop ──► emit MessageStop, accumulate final usage
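
The transitions in the diagram can be sketched as a small state machine. This is a reduced model, not the BoxAgnts stream_parser: RawEvent and BlockKind are hypothetical types standing for Anthropic events that have already been decoded from JSON, thinking blocks and usage are omitted, and the only state kept is which block index belongs to which open tool call.

```rust
use std::collections::HashMap;

// Already-decoded Anthropic-style events (hypothetical input type).
pub enum RawEvent {
    BlockStart { index: usize, kind: BlockKind },
    TextDelta { index: usize, text: String },
    InputJsonDelta { index: usize, partial_json: String },
    BlockStop { index: usize },
    MessageStop,
}

pub enum BlockKind {
    Text,
    ToolUse { id: String, name: String },
}

#[derive(Debug, PartialEq)]
pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    MessageStop,
}

#[derive(Default)]
pub struct Parser {
    // Open tool_use blocks: block index -> tool call id.
    open_tools: HashMap<usize, String>,
}

impl Parser {
    /// Consumes one raw event and emits at most one unified event.
    pub fn step(&mut self, raw: RawEvent) -> Option<StreamEvent> {
        match raw {
            RawEvent::BlockStart { index, kind: BlockKind::ToolUse { id, name } } => {
                self.open_tools.insert(index, id.clone());
                Some(StreamEvent::ToolUseStart { id, name })
            }
            RawEvent::BlockStart { kind: BlockKind::Text, .. } => None,
            RawEvent::TextDelta { text, .. } => Some(StreamEvent::TextDelta { text }),
            RawEvent::InputJsonDelta { index, partial_json } => {
                // A JSON fragment only makes sense inside an open tool_use block.
                let id = self.open_tools.get(&index)?.clone();
                Some(StreamEvent::ToolUseDelta { id, json: partial_json })
            }
            RawEvent::BlockStop { index } => self
                .open_tools
                .remove(&index)
                .map(|id| StreamEvent::ToolUseEnd { id }),
            RawEvent::MessageStop => Some(StreamEvent::MessageStop),
        }
    }
}
```

Keying the open blocks by index is what allows a delta, which carries only an index, to be translated into a ToolUseDelta that carries the tool call id.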

StreamAccumulator maintains the state of all ContentBlocks in the current message:

pub struct StreamAccumulator {
    text_blocks: Vec<TextBlock>,
    tool_use_blocks: HashMap<String, ToolUseBlock>,
    thinking_block: Option<String>,
    usage: UsageInfo,
}

When MessageStop arrives, finish() assembles all accumulated blocks into a complete Message, returning stop_reason and final UsageInfo.
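
A reduced accumulator could work as sketched below. It assumes a cut-down StreamEvent, keeps tool calls in a Vec (instead of the HashMap shown above) so that call order is preserved, and leaves out thinking blocks, usage, and stop_reason. Accumulator and ToolCall are names made up for this example.

```rust
pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    MessageStop,
}

#[derive(Debug, PartialEq)]
pub struct ToolCall {
    pub id: String,
    pub name: String,
    pub input_json: String,
}

#[derive(Default)]
pub struct Accumulator {
    text: String,
    tools: Vec<ToolCall>,
}

impl Accumulator {
    pub fn on_event(&mut self, evt: &StreamEvent) {
        match evt {
            StreamEvent::TextDelta { text } => self.text.push_str(text),
            StreamEvent::ToolUseStart { id, name } => self.tools.push(ToolCall {
                id: id.clone(),
                name: name.clone(),
                input_json: String::new(),
            }),
            StreamEvent::ToolUseDelta { id, json } => {
                // Fragments are concatenated; the result is only valid JSON
                // once the block has ended.
                if let Some(call) = self.tools.iter_mut().find(|c| &c.id == id) {
                    call.input_json.push_str(json);
                }
            }
            StreamEvent::ToolUseEnd { .. } | StreamEvent::MessageStop => {}
        }
    }

    /// Hands back the accumulated text and tool calls.
    pub fn finish(self) -> (String, Vec<ToolCall>) {
        (self.text, self.tools)
    }
}
```

Because the tool input arrives as arbitrary JSON fragments, it should only be parsed after the stream has delivered the whole block.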

The Agent Query Loop

The stream parser has converted SSE events into a structured Message. Next, query::run_query_loop() hands this Message to the tool system.

Core flow:

loop {
    // 1. Send message history + system Prompt + tool list to the AI model
    // The history is cloned because `messages` keeps growing across iterations
    let request = CreateMessageRequest::builder(model, max_tokens)
        .messages(messages.clone())
        .tools(all_tools_as_definitions(tools))
        .build();

    // 2. Initiate streaming request, parse SSE events
    let mut rx = client.create_message_stream(request).await?;
    let mut acc = StreamAccumulator::new();

    while let Some(evt) = rx.recv().await {
        acc.on_event(&evt);
        match evt {
            StreamEvent::ToolUseStart { .. } | StreamEvent::ToolUseDelta { .. } => {
                // Send to frontend in real time (via WebSocket) so users can see what tools the model is using
            }
            StreamEvent::MessageStop => break,
            _ => {}
        }
    }

    // 3. Assemble the completed Message, check stop_reason
    let (msg, usage, stop_reason) = acc.finish();

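    // stop_reason is a String, so match on its &str view
    match stop_reason.as_str() {
        "end_turn" => return QueryOutcome::EndTurn { message: msg, usage },
        "tool_use" => {
            // The assistant turn that contains the tool_use blocks must be in
            // the history before the results that answer it
            messages.push(msg.clone());
            // 4. For each tool_use ContentBlock, call the corresponding tool
            for block in msg.content.iter() {
                if let ContentBlock::ToolUse { id, name, input, .. } = block {
                    let tool = find_tool(tools, name);
                    let result = tool.execute(input, &ctx).await;
                    // Each result is tied back to its call by the tool_use id
                    messages.push(result_to_message(id, result));
                }
            }
            // Return to loop top, continue to next round
        }
        "max_tokens" => {
            // 5. MaxTokens recovery: give up after 3 attempts; otherwise keep the
            // partial output and inject a hint message so the model can continue
            max_tokens_count += 1;
            if max_tokens_count > 3 { return MaxTokens { ... }; }
            messages.push(msg);
            messages.push(UserMessage("Output token limit hit. Resume directly."));
        }
        _ => return Error(...),
    }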

    turn += 1;
    if turn >= config.max_turns { break; }
}

Several details are worth noting:

Tool list injection strategy. Every API call sends the complete tool list (each tool's name, description, and input_schema) to the model in the tools field. This is a fixed per-round token cost that grows with the number of tools; beyond roughly 20 tools it becomes significant (potentially several thousand tokens per round). BoxAgnts currently injects the full list; tool selection and grouping mechanisms are under consideration for the future. Anthropic's tool_choice parameter is related but not the same thing: it constrains which tool the model may call, while the full definitions are still sent.

MaxTokens recovery. If a model exhausts its output token limit mid-response, it hasn't truly "failed" — it just hasn't finished speaking. BoxAgnts automatically injects a recovery message ("Output token limit hit. Resume directly...") to let the model continue. This loop executes at most 3 times — if after 3 attempts max_tokens is still hit, the task is genuinely too long; the system gives up and returns partial results.
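
The recovery rule can be sketched as a small helper. This is an illustration under assumptions: messages are modeled as (role, content) string pairs, on_max_tokens and Recovery are hypothetical names, and the partial assistant output is stored in the history before the hint so the model can see where it stopped.

```rust
const MAX_RECOVERIES: u32 = 3;
const RESUME_HINT: &str = "Output token limit hit. Resume directly.";

#[derive(Debug, PartialEq)]
pub enum Recovery {
    Continue,
    GiveUp,
}

/// Called each time a response stops with `max_tokens`.
pub fn on_max_tokens(
    count: &mut u32,
    partial: &str,
    messages: &mut Vec<(String, String)>,
) -> Recovery {
    *count += 1;
    if *count > MAX_RECOVERIES {
        // The caller returns the partial result instead of looping again.
        return Recovery::GiveUp;
    }
    messages.push(("assistant".to_string(), partial.to_string()));
    messages.push(("user".to_string(), RESUME_HINT.to_string()));
    Recovery::Continue
}
```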

Cancellation mechanism. Cancellation uses CancellationToken from the tokio ecosystem (it lives in the tokio-util crate). When the user clicks the "Stop" button in the frontend, the WebSocket handler cancels the corresponding token, and run_query_loop returns QueryOutcome::Cancelled at its next check.
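
The "next check" behavior is cooperative: the loop only notices cancellation at the points where it looks. The standard-library sketch below uses an AtomicBool as a stand-in for the token, so it shows the checking pattern only and not the async waiting that a real CancellationToken supports. run_turns and Outcome are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Finished { turns: u32 },
    Cancelled { turns: u32 },
}

/// Runs up to `max_turns` turns, checking the flag before each one.
/// `on_turn` stands in for one model round trip plus tool execution.
pub fn run_turns(cancel: &AtomicBool, max_turns: u32, mut on_turn: impl FnMut(u32)) -> Outcome {
    for turn in 0..max_turns {
        if cancel.load(Ordering::SeqCst) {
            return Outcome::Cancelled { turns: turn };
        }
        on_turn(turn);
    }
    Outcome::Finished { turns: max_turns }
}
```

A turn that is already running completes before the flag is seen, which is why long tool executions usually need their own check as well.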

Cost tracking. After each API call round, CostTracker adds that round's cost, computed from its token usage and the current model's pricing (input and output tokens are priced separately, and prices differ by model). If the cumulative cost exceeds budget_limit_usd, QueryOutcome::BudgetExceeded is returned. Cost information is pushed in real time to the frontend Dashboard via WebSocket.
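
The per-round arithmetic might look like the sketch below. The Pricing struct, the per-million-token units, and the record method are assumptions made for this example, and the tracker is a plain struct mutated through &mut self rather than anything shared across threads. Any prices passed in are placeholders supplied by the caller, not real vendor prices.

```rust
pub struct Pricing {
    pub input_usd_per_mtok: f64,
    pub output_usd_per_mtok: f64,
}

pub struct CostTracker {
    total_usd: f64,
    budget_limit_usd: f64,
}

impl CostTracker {
    pub fn new(budget_limit_usd: f64) -> Self {
        Self { total_usd: 0.0, budget_limit_usd }
    }

    /// Adds the cost of one round and reports whether the budget is now exceeded.
    pub fn record(&mut self, pricing: &Pricing, input_tokens: u32, output_tokens: u32) -> bool {
        let cost = f64::from(input_tokens) / 1e6 * pricing.input_usd_per_mtok
            + f64::from(output_tokens) / 1e6 * pricing.output_usd_per_mtok;
        self.total_usd += cost;
        self.total_usd > self.budget_limit_usd
    }

    pub fn total_usd(&self) -> f64 {
        self.total_usd
    }
}
```

With placeholder prices of 3.0 and 15.0 USD per million tokens, a round with 1,000,000 input tokens and 100,000 output tokens costs 3.0 + 1.5 = 4.5 USD.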

Error Handling and Retry Strategy

AI API calls have several typical failure modes:

Error Type	Typical HTTP Code	Strategy
Rate Limit	429	Exponential backoff retry, respect Retry-After header
Overloaded	529	Exponential backoff retry, optional fallback model
Auth Failure	401/403	No retry, return error immediately
Bad Request	400	No retry (retrying parameter errors is pointless)
Server Error	500+	Limited retry (max 3 times)
Network Timeout	—	Limited retry

Exponential backoff doubles the wait on each attempt (1s → 2s → 4s → 8s), implemented by multiplying a Duration. For 529 (Overloaded), model switching is additionally supported: if the user has configured a fallback model (for example, claude-haiku-4-5 as the fallback when claude-sonnet-4-5 is overloaded), subsequent calls automatically use the fallback.
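
The table and the backoff rule can be combined into one decision function, sketched below. Several details are assumptions made for this example: the decide function and RetryDecision type are hypothetical, the cap of 4 retries for 429 and 529 is chosen to match the 1s to 8s sequence, and "respecting Retry-After" is interpreted as waiting for whichever is longer, the header value or the computed backoff. Fallback-model switching and network timeouts are not modeled.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum RetryDecision {
    Retry(Duration),
    GiveUp,
}

const BASE_DELAY: Duration = Duration::from_secs(1);

/// `attempt` is zero-based: 0 means no retry has been made yet.
pub fn decide(status: u16, attempt: u32, retry_after: Option<Duration>) -> RetryDecision {
    let max_retries = match status {
        429 | 529 => 4,   // assumed cap, matching 1s -> 2s -> 4s -> 8s
        500..=599 => 3,   // "max 3 times" for other server errors
        _ => 0,           // 400, 401, 403 and anything else: never retry
    };
    if attempt >= max_retries {
        return RetryDecision::GiveUp;
    }
    // Double the base delay on every attempt: 1s, 2s, 4s, 8s.
    let backoff = BASE_DELAY * 2u32.pow(attempt);
    // If the server sent Retry-After, never wait less than it asked for.
    RetryDecision::Retry(retry_after.map_or(backoff, |ra| ra.max(backoff)))
}
```

Production implementations often add random jitter to the delay so that many clients do not retry in lockstep; that is omitted here to keep the arithmetic visible.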

Provider Extensibility

The steps for adding a new Provider are clear:

Add a new module under providers/, implement the LlmProvider trait
Implement the corresponding Transformer (if format conversion is needed)
Register in registry.rs's provider_from_key()
Add the Provider's supported model list in model_registry.rs

The openai_compat_providers module is a shortcut: for services using the OpenAI API format (DeepSeek, OpenCode, various domestic models), only API base URL and API key configuration is needed — no Provider code needs to be written. These services share the same OpenAI-compatible SSE parser and Request builder; only the configuration differs.

// Configuration example
"deepseek": {
    "provider_id": "deepseek",
    "api_base": "https://api.deepseek.com/v1",
    "api_key": "sk-...",
    "provider_type": "openai_compat"
}
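
To show why configuration alone can be enough, the sketch below derives the two things that differ between OpenAI-compatible services from such an entry: the endpoint URL and the Authorization header. CompatConfig and its methods are hypothetical names for this example; the /chat/completions path and the Bearer scheme follow the OpenAI Chat Completions convention.

```rust
pub struct CompatConfig {
    pub provider_id: String,
    pub api_base: String,
    pub api_key: String,
}

impl CompatConfig {
    /// Chat Completions endpoint for this service.
    pub fn chat_completions_url(&self) -> String {
        // Tolerate a trailing slash in the configured base URL.
        format!("{}/chat/completions", self.api_base.trim_end_matches('/'))
    }

    /// Value of the `Authorization` request header.
    pub fn authorization_header(&self) -> String {
        format!("Bearer {}", self.api_key)
    }
}
```

Everything after that point (request body, SSE parsing) is shared, which is what makes a new compatible service a configuration change rather than a code change.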

Summary

The Provider abstraction and Agent query loop constitute BoxAgnts' tool system "engine":

Provider abstraction solves the problem of integrating 12 AI APIs through three-layer decoupling (ProviderRequest/Response unified data model → LlmProvider trait → Transformer format conversion). Adding a new Provider requires only implementing the trait and registering it; for OpenAI-compatible services, the openai_compat_providers module reduces integration to configuration, because they share one SSE parser and Request builder.
Agent query loop achieves interleaved orchestration of conversation and tool execution through a closed loop of SSE state machine parsing, ToolUse detection, tool dispatch, and result feedback. MaxTokens automatic recovery (max 3 attempts) and exponential backoff retry strategy ensure reliability for long tasks.
The common feature of these two layers is dependency inversion: the Agent loop doesn't depend on a specific AI vendor, and the Provider implementation doesn't depend on specific conversation orchestration logic. The two sides meet only at trait interfaces.

Cost tracking (CostTracker + AtomicF64) and cancellation mechanism (CancellationToken) provide necessary operational observability and user control for production environments.

