BoxAgnts' tool system, from the bottom-level WASM sandbox to the top-level Tool trait, has solved "how tools run safely." But tools ultimately need to be called by AI models, which introduces two engineering problems: the API formats of different AI vendors are mutually incompatible, and conversation flow has to be interleaved with tool execution. These two problems are solved by the Provider abstraction layer and the Agent query loop, respectively.
Provider Abstraction: Being LLM-Vendor-Agnostic
AI model APIs from different vendors differ significantly in request format, response format, and error handling.
Let's start with the request side. Anthropic splits roles into user and assistant, with the system prompt as an independent top-level system field; OpenAI treats the system prompt as a role: "system" message; Google Gemini places system_instruction at the top level of the request body but with yet another format. If the upper-layer Agent loop had to handle these differences directly, the code would become a giant match provider_id { ... } branch.
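To make the request-side difference concrete, here is a minimal sketch of the same conversation laid out in the two shapes described above. The types `Turn`, `AnthropicShape`, and `OpenAiShape` and both function names are hypothetical simplifications for illustration, not BoxAgnts' or any vendor's actual types.

```rust
/// One conversation turn in a vendor-neutral form (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Turn {
    pub role: String,
    pub text: String,
}

/// Anthropic-style shape: the system prompt is a separate top-level field.
#[derive(Debug, PartialEq)]
pub struct AnthropicShape {
    pub system: Option<String>,
    pub messages: Vec<Turn>,
}

/// OpenAI-style shape: the system prompt is just another message.
#[derive(Debug, PartialEq)]
pub struct OpenAiShape {
    pub messages: Vec<Turn>,
}

pub fn to_anthropic_shape(system: Option<&str>, turns: &[Turn]) -> AnthropicShape {
    AnthropicShape {
        system: system.map(str::to_owned),
        messages: turns.to_vec(),
    }
}

pub fn to_openai_shape(system: Option<&str>, turns: &[Turn]) -> OpenAiShape {
    let mut messages = Vec::with_capacity(turns.len() + 1);
    if let Some(s) = system {
        // The system prompt becomes the first message with role "system".
        messages.push(Turn { role: "system".to_owned(), text: s.to_owned() });
    }
    messages.extend_from_slice(turns);
    OpenAiShape { messages }
}
```

With one user turn and a system prompt, the first function yields one message plus a `system` field, while the second yields two messages and no separate field.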
BoxAgnts' solution introduces three layers of abstraction:
Layer 1: ProviderRequest / ProviderResponse Unified Data Model
// provider_types.rs
pub struct ProviderRequest {
    pub messages: Vec<ApiMessage>,
    pub system: Option<String>,
    pub tools: Vec<ApiToolDefinition>,
    pub max_tokens: u32,
    pub temperature: Option<f32>,
}

pub struct ProviderResponse {
    pub content: Vec<ContentBlock>,
    pub usage: UsageInfo,
    pub stop_reason: String,
}
The Agent loop only deals with these two structures, never needing to know whether the user has configured Anthropic or OpenAI.
Layer 2: LlmProvider trait
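// `async fn` in a trait that is used as `dyn LlmProvider` needs boxing support,
// for example the async_trait crate (an assumption; the attribute is not shown in the article)
#[async_trait]
pub trait LlmProvider: Send + Sync {
    fn id(&self) -> &ProviderId;
    async fn create_message_stream(
        &self, request: ProviderRequest
    ) -> Result<Pin<Box<dyn Stream<Item = Result<StreamEvent, ProviderError>> + Send>>>;
    async fn list_models(&self) -> Result<Vec<ModelInfo>>;
}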
create_message_stream returns a Pin<Box<dyn Stream>>, the standard idiom in Rust's async ecosystem for erasing the concrete stream type so that every Provider can return a different stream behind one signature (roughly analogous to returning an AsyncIterator in Python). Each Provider implementation internally handles its own HTTP request construction, authentication, and SSE parsing, and exposes a unified StreamEvent externally.
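The type-erasure idea can be shown without any async machinery. The sketch below is a synchronous analogue that uses `Box<dyn Iterator>` in place of `Pin<Box<dyn Stream>>`; the alias `EventIter` and the two provider functions are hypothetical names, and plain strings stand in for stream events and errors.

```rust
/// Synchronous stand-in for the boxed stream: any iterator of results,
/// hidden behind one trait-object type.
pub type EventIter = Box<dyn Iterator<Item = Result<String, String>> + Send>;

/// "Provider A" yields many chunks; its concrete type is a mapped Vec iterator.
pub fn provider_a(chunks: Vec<String>) -> EventIter {
    Box::new(chunks.into_iter().map(Ok::<String, String>))
}

/// "Provider B" yields a single chunk; its concrete type is `std::iter::Once`.
pub fn provider_b(text: String) -> EventIter {
    Box::new(std::iter::once(Ok::<String, String>(text)))
}

/// The caller sees only `EventIter`, whichever provider produced it.
pub fn collect_text(events: EventIter) -> Result<String, String> {
    let mut out = String::new();
    for ev in events {
        out.push_str(&ev?);
    }
    Ok(out)
}
```

The async version follows the same shape; the extra `Pin` is there because a boxed `Stream` has to be pinned before it can be polled.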
Layer 3: Transformer (Message Format Conversion)
Transformers handle the "last mile" of eliminating vendor format differences:
// transformers/anthropic.rs
pub fn to_anthropic_request(req: &ProviderRequest) -> AnthropicMessagesRequest { ... }
// transformers/openai_chat.rs
pub fn to_openai_request(req: &ProviderRequest) -> OpenAIChatRequest { ... }
Transformers are pure functions: unified format in, vendor format out. Adding a new Provider requires only a new Transformer and a corresponding LlmProvider implementation. The shared ProviderRegistry looks up implementations by Provider ID:
pub struct ProviderRegistry {
    providers: HashMap<ProviderId, Arc<dyn LlmProvider>>,
    default_provider_id: ProviderId,
}
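A lookup over this structure might work as in the sketch below. It assumes a simplified, synchronous `LlmProvider` trait with string IDs; `StubProvider`, `register`, and `resolve` are hypothetical names, and falling back to the default Provider when no ID is given is an assumption about the intended behavior.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for the article's trait (no async, no streaming).
pub trait LlmProvider: Send + Sync {
    fn id(&self) -> &str;
}

/// Hypothetical provider used only to exercise the registry.
pub struct StubProvider {
    id: String,
}

impl StubProvider {
    pub fn new(id: &str) -> Self {
        Self { id: id.to_owned() }
    }
}

impl LlmProvider for StubProvider {
    fn id(&self) -> &str {
        &self.id
    }
}

pub struct ProviderRegistry {
    providers: HashMap<String, Arc<dyn LlmProvider>>,
    default_provider_id: String,
}

impl ProviderRegistry {
    pub fn new(default_provider_id: &str) -> Self {
        Self {
            providers: HashMap::new(),
            default_provider_id: default_provider_id.to_owned(),
        }
    }

    /// Stores the provider under its own id.
    pub fn register(&mut self, provider: Arc<dyn LlmProvider>) {
        self.providers.insert(provider.id().to_owned(), provider);
    }

    /// Looks up by id; uses the default provider id when none is given.
    pub fn resolve(&self, id: Option<&str>) -> Option<Arc<dyn LlmProvider>> {
        let key = id.unwrap_or(self.default_provider_id.as_str());
        self.providers.get(key).cloned()
    }
}
```

Because the map stores `Arc<dyn LlmProvider>`, callers get a cheap clone of the handle and never learn the concrete Provider type.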
Streaming Protocols and SSE Parsing
All Providers' streaming interactions rely on SSE (Server-Sent Events). But each vendor's SSE event granularity and semantics differ:
- Anthropic's content_block_start / content_block_delta / content_block_stop form a three-level event hierarchy; a single ContentBlock spans multiple SSE messages from start to stop
- OpenAI's choices[0].delta is a flat delta with no explicit block start/stop
- Google Gemini uses the gRPC-web protocol with its own streaming format
BoxAgnts' stream_parser module digests all these differences and exposes a unified StreamEvent enum:
pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    ThinkingDelta { text: String },
    UsageUpdate { input_tokens: u32, output_tokens: u32 },
    MessageStop,
}
Internally, each Provider's stream parser is a finite state machine. Taking Anthropic as an example:
Wait for message_start
│
├── message_start ──► extract model, initial usage
│
├── content_block_start
│ │ type = "text" → create TextBlock state
│ │ type = "tool_use" → create ToolUseBlock state, emit ToolUseStart
│ │ type = "thinking" → create ThinkingBlock state
│
├── content_block_delta
│ │ text_delta → append to current TextBlock, emit TextDelta
│ │ input_json_delta → concatenate JSON fragment to ToolUseBlock, emit ToolUseDelta
│ │ thinking_delta → append to ThinkingBlock, emit ThinkingDelta
│
├── content_block_stop
│ │ corresponding tool_use block → emit ToolUseEnd
│
└── message_stop ──► emit MessageStop, accumulate final usage
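The transitions in the diagram can be sketched as a small state machine. This is an illustration under simplifying assumptions, not BoxAgnts' parser: `WireEvent` and `BlockKind` are hypothetical types standing in for already-decoded SSE payloads, usage handling is left out, and the only state kept is the kind of each open block, keyed by its index.

```rust
use std::collections::HashMap;

/// Hypothetical, already JSON-decoded form of the vendor's SSE events.
#[derive(Debug, Clone)]
pub enum WireEvent {
    MessageStart,
    BlockStart { index: usize, kind: BlockKind },
    BlockDelta { index: usize, payload: String },
    BlockStop { index: usize },
    MessageStop,
}

#[derive(Debug, Clone, PartialEq)]
pub enum BlockKind {
    Text,
    ToolUse { id: String, name: String },
    Thinking,
}

/// Subset of the unified events (UsageUpdate omitted for brevity).
#[derive(Debug, Clone, PartialEq)]
pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    ThinkingDelta { text: String },
    MessageStop,
}

#[derive(Default)]
pub struct AnthropicParser {
    /// Kind of every content block that has started but not yet stopped.
    open: HashMap<usize, BlockKind>,
}

impl AnthropicParser {
    /// Consumes one wire event and returns the unified event it maps to, if any.
    pub fn feed(&mut self, ev: WireEvent) -> Option<StreamEvent> {
        match ev {
            WireEvent::MessageStart => None,
            WireEvent::BlockStart { index, kind } => {
                // Only tool_use blocks announce themselves to the caller.
                let out = match &kind {
                    BlockKind::ToolUse { id, name } => Some(StreamEvent::ToolUseStart {
                        id: id.clone(),
                        name: name.clone(),
                    }),
                    _ => None,
                };
                self.open.insert(index, kind);
                out
            }
            // A delta is interpreted according to the block it belongs to;
            // deltas for unknown blocks are ignored.
            WireEvent::BlockDelta { index, payload } => match self.open.get(&index)? {
                BlockKind::Text => Some(StreamEvent::TextDelta { text: payload }),
                BlockKind::Thinking => Some(StreamEvent::ThinkingDelta { text: payload }),
                BlockKind::ToolUse { id, .. } => Some(StreamEvent::ToolUseDelta {
                    id: id.clone(),
                    json: payload,
                }),
            },
            WireEvent::BlockStop { index } => match self.open.remove(&index)? {
                BlockKind::ToolUse { id, .. } => Some(StreamEvent::ToolUseEnd { id }),
                _ => None,
            },
            WireEvent::MessageStop => Some(StreamEvent::MessageStop),
        }
    }
}
```

Keying the open blocks by index is what lets a delta be routed correctly even though the delta itself does not repeat the block type.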
StreamAccumulator maintains the state of all ContentBlocks in the current message:
pub struct StreamAccumulator {
    text_blocks: Vec<TextBlock>,
    tool_use_blocks: HashMap<String, ToolUseBlock>,
    thinking_block: Option<String>,
    usage: UsageInfo,
}
When MessageStop arrives, finish() assembles all accumulated blocks into a complete Message, returning stop_reason and final UsageInfo.
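The accumulation step can be sketched as follows. This is a reduced illustration, not the actual StreamAccumulator: it keeps a single text buffer instead of a list of text blocks, ignores thinking and usage, stores tool input as raw JSON text without parsing it, and returns only the content blocks from `finish()`. The `order` field is an added assumption so that tool calls come out in arrival order.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum StreamEvent {
    TextDelta { text: String },
    ToolUseStart { id: String, name: String },
    ToolUseDelta { id: String, json: String },
    ToolUseEnd { id: String },
    MessageStop,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ContentBlock {
    Text(String),
    /// `input` stays as raw JSON text here; real code would parse it.
    ToolUse { id: String, name: String, input: String },
}

#[derive(Default)]
pub struct StreamAccumulator {
    text: String,
    /// Tool id -> (tool name, JSON fragments concatenated so far).
    tool_use: HashMap<String, (String, String)>,
    /// Tool ids in arrival order, so the output order is deterministic.
    order: Vec<String>,
}

impl StreamAccumulator {
    pub fn on_event(&mut self, ev: &StreamEvent) {
        match ev {
            StreamEvent::TextDelta { text } => self.text.push_str(text),
            StreamEvent::ToolUseStart { id, name } => {
                self.order.push(id.clone());
                self.tool_use.insert(id.clone(), (name.clone(), String::new()));
            }
            StreamEvent::ToolUseDelta { id, json } => {
                // Fragments are only valid JSON once all of them are joined.
                if let Some((_, buf)) = self.tool_use.get_mut(id) {
                    buf.push_str(json);
                }
            }
            StreamEvent::ToolUseEnd { .. } | StreamEvent::MessageStop => {}
        }
    }

    /// Assembles everything seen so far into content blocks.
    pub fn finish(mut self) -> Vec<ContentBlock> {
        let mut blocks = Vec::new();
        if !self.text.is_empty() {
            blocks.push(ContentBlock::Text(self.text));
        }
        for id in self.order {
            if let Some((name, input)) = self.tool_use.remove(&id) {
                blocks.push(ContentBlock::ToolUse { id, name, input });
            }
        }
        blocks
    }
}
```

The point of accumulating is visible in the tool input: no single delta is parseable, but the concatenation of all deltas for one tool id is.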
The Agent Query Loop
The stream parser has converted SSE events into a structured Message. Next, query::run_query_loop() hands this Message to the tool system.
Core flow:
loop {
    // 1. Send message history + system prompt + tool list to the AI model
    let request = CreateMessageRequest::builder(model, max_tokens)
        .messages(messages.clone()) // clone: the history is reused and extended in later rounds
        .tools(all_tools_as_definitions(tools))
        .build();

    // 2. Initiate streaming request, parse SSE events
    let mut rx = client.create_message_stream(request).await?;
    let mut acc = StreamAccumulator::new();
    while let Some(evt) = rx.recv().await {
        acc.on_event(&evt);
        match evt {
            StreamEvent::ToolUseStart { .. } | StreamEvent::ToolUseDelta { .. } => {
                // Send to frontend in real time (via WebSocket) so users can see what tools the model is using
            }
            StreamEvent::MessageStop => break,
            _ => {}
        }
    }

    // 3. Assemble the completed Message, check stop_reason
    let (msg, usage, stop_reason) = acc.finish();
    // Assuming stop_reason is a String (as in ProviderResponse), match on it as &str
    match stop_reason.as_str() {
        "end_turn" => return QueryOutcome::EndTurn { message: msg, usage },
        "tool_use" => {
            // The assistant message carrying the tool_use blocks goes into the history first,
            // so that every tool result follows the call it answers
            messages.push(msg.clone());
            // 4. For each tool_use ContentBlock, call the corresponding tool
            for block in msg.content.iter() {
                if let ContentBlock::ToolUse { name, input, .. } = block {
                    let tool = find_tool(tools, name);
                    let result = tool.execute(input, &ctx).await;
                    messages.push(result_to_message(result));
                }
            }
            // Return to loop top, continue to next round
        }
        "max_tokens" => {
            // 5. MaxTokens recovery: keep the partial answer, then inject a hint so the model can continue
            messages.push(msg);
            messages.push(UserMessage("Output token limit hit. Resume directly."));
            max_tokens_count += 1;
            if max_tokens_count > 3 { return MaxTokens { ... }; }
        }
        _ => return Error(...),
    }

    turn += 1;
    if turn >= config.max_turns { break; }
}
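The control flow above can be exercised without a network by replacing the model with a script of canned turns. The sketch below is a simplified, synchronous illustration under that assumption: `ModelTurn`, the string-based history, and the `MaxTurns` outcome are hypothetical stand-ins, and each scripted turn plays the role of one already-accumulated response.

```rust
use std::collections::VecDeque;

/// One already-accumulated model response in this sketch.
pub enum ModelTurn {
    EndTurn(String),
    ToolUse { name: String, input: String },
    MaxTokens(String),
}

#[derive(Debug, PartialEq)]
pub enum QueryOutcome {
    EndTurn(String),
    MaxTokens,
    MaxTurns,
    Error(String),
}

/// Runs the loop against a scripted model; `run_tool` stands in for tool dispatch.
pub fn run_query_loop(
    script: &mut VecDeque<ModelTurn>,
    run_tool: impl Fn(&str, &str) -> String,
    history: &mut Vec<String>,
    max_turns: usize,
) -> QueryOutcome {
    let mut max_tokens_count = 0;
    for _turn in 0..max_turns {
        // Stand-in for "send request, stream the response, accumulate it".
        let Some(turn) = script.pop_front() else {
            return QueryOutcome::Error("model returned nothing".to_owned());
        };
        match turn {
            ModelTurn::EndTurn(text) => return QueryOutcome::EndTurn(text),
            ModelTurn::ToolUse { name, input } => {
                // Record the call first, then the result that answers it.
                history.push(format!("assistant: tool_use {name}({input})"));
                let result = run_tool(&name, &input);
                history.push(format!("user: tool_result {result}"));
            }
            ModelTurn::MaxTokens(partial) => {
                history.push(format!("assistant: {partial}"));
                history.push("user: Output token limit hit. Resume directly.".to_owned());
                max_tokens_count += 1;
                if max_tokens_count > 3 {
                    return QueryOutcome::MaxTokens;
                }
            }
        }
    }
    QueryOutcome::MaxTurns
}
```

A script of one tool call followed by a final answer ends with `EndTurn` and leaves two entries in the history; a script that keeps hitting the token limit stops on the fourth hit.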
Several details worth noting:
Tool list injection strategy. Each API call round sends the complete tool list (every tool's name, description, and input_schema) to the AI model as the tools field. This incurs a fixed token overhead: the more tools, the more "tool description tokens" per round. Once there are more than 20 tools, the overhead becomes significant (potentially several thousand tokens per round). BoxAgnts currently uses full injection; tool selection and grouping mechanisms (similar to Anthropic's tool_choice) are under consideration for the future.
MaxTokens recovery. If a model exhausts its output token limit mid-response, it hasn't truly "failed" — it just hasn't finished speaking. BoxAgnts automatically injects a recovery message ("Output token limit hit. Resume directly...") to let the model continue. This loop executes at most 3 times — if after 3 attempts max_tokens is still hit, the task is genuinely too long; the system gives up and returns partial results.
Cancellation mechanism. The CancellationToken comes from the tokio ecosystem (the tokio-util crate provides such a type as tokio_util::sync::CancellationToken). When the user clicks the "Stop" button in the frontend, the WebSocket handler cancels the corresponding token, and run_query_loop returns QueryOutcome::Cancelled at its next check.
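The "cancel now, notice at the next check" behavior is cooperative. A minimal std-only sketch follows; `CancelFlag` is a hypothetical stand-in built on `AtomicBool`, not the tokio token itself, and `run_steps` stands in for the query loop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Std-only stand-in for a cancellation token: clones share one flag.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// All steps ran; carries the number of steps.
    Completed(usize),
    /// Stopped early; carries the index of the first step that was skipped.
    Cancelled(usize),
}

/// Runs `steps` units of work, checking the flag before each one.
/// `on_step` stands in for one loop round or one handled stream event.
pub fn run_steps(steps: usize, token: &CancelFlag, mut on_step: impl FnMut(usize)) -> Outcome {
    for i in 0..steps {
        if token.is_cancelled() {
            return Outcome::Cancelled(i);
        }
        on_step(i);
    }
    Outcome::Completed(steps)
}
```

Work that is already in progress when the flag flips still finishes; the loop only stops at its next check, which is why the check granularity determines how responsive the "Stop" button feels.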
Cost tracking. After each API call round, CostTracker adds that round's cost to a running total. The cost is computed from the current model's pricing, with input and output tokens priced separately and prices differing by model. If the cumulative cost exceeds budget_limit_usd, QueryOutcome::BudgetExceeded is returned. Cost information is pushed in real time to the frontend Dashboard via WebSocket.
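The per-round arithmetic can be sketched as below. The `ModelPrice` struct, the per-million-token unit, and the `record` method are assumptions for illustration, and the sketch uses a plain `f64` behind `&mut self` rather than an atomic value shared across tasks.

```rust
/// Hypothetical price table entry, in USD per million tokens.
#[derive(Clone, Copy)]
pub struct ModelPrice {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
}

pub struct CostTracker {
    total_usd: f64,
    budget_limit_usd: f64,
}

impl CostTracker {
    pub fn new(budget_limit_usd: f64) -> Self {
        Self { total_usd: 0.0, budget_limit_usd }
    }

    /// Adds one round's cost; returns false once the budget is exceeded.
    pub fn record(&mut self, price: ModelPrice, input_tokens: u32, output_tokens: u32) -> bool {
        let cost = f64::from(input_tokens) / 1e6 * price.input_per_mtok
            + f64::from(output_tokens) / 1e6 * price.output_per_mtok;
        self.total_usd += cost;
        self.total_usd <= self.budget_limit_usd
    }

    pub fn total_usd(&self) -> f64 {
        self.total_usd
    }
}
```

With placeholder prices of 3 and 15 USD per million input and output tokens, a round of 1,000,000 input and 100,000 output tokens costs 3.0 + 1.5 = 4.5 USD, so a 5 USD budget survives the first such round and is exceeded by the second.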
Error Handling and Retry Strategy
AI API calls have several typical failure modes:
| Error Type | Typical HTTP Code | Strategy |
|---|---|---|
| Rate Limit | 429 | Exponential backoff retry, respect Retry-After header |
| Overloaded | 529 | Exponential backoff retry, optional fallback model |
| Auth Failure | 401/403 | No retry, return error immediately |
| Bad Request | 400 | No retry (retrying parameter errors is pointless) |
| Server Error | 500+ | Limited retry (max 3 times) |
| Network Timeout | — | Limited retry |
Exponential backoff uses intervals of 1s → 2s → 4s → 8s, doubling a Duration after each failed attempt. For 529 (Overloaded), model switching is additionally supported: if the user has configured a fallback model (e.g., claude-sonnet-4-5 is overloaded and the fallback is claude-haiku-4-5), subsequent calls automatically use the fallback.
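The table and the backoff schedule can be combined into one decision function, sketched below. The function names are hypothetical; capping the delay at 8s for later attempts, using the Retry-After value verbatim when present, and leaving the overall attempt limit for 429 and 529 to the caller are assumptions of this sketch.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, then 8s.
pub fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(1) * 2u32.pow(attempt.min(3))
}

/// Returns how long to wait before retrying, or None when retrying is pointless.
/// `retry_after` is the parsed Retry-After header, if the server sent one.
pub fn next_delay(status: u16, attempt: u32, retry_after: Option<Duration>) -> Option<Duration> {
    match status {
        // Bad request and auth failures will fail the same way again.
        400 | 401 | 403 => None,
        // Rate limit: prefer the server's hint over our own schedule.
        429 => Some(retry_after.unwrap_or_else(|| backoff_delay(attempt))),
        // Overloaded: back off (a caller may also switch to a fallback model).
        529 => Some(backoff_delay(attempt)),
        // Other server errors: at most 3 retries.
        500..=599 if attempt < 3 => Some(backoff_delay(attempt)),
        _ => None,
    }
}
```

The 529 arm is listed before the general 5xx range on purpose, since match arms are tried in order and 529 would otherwise fall under the limited-retry rule.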
Provider Extensibility
The steps for adding a new Provider are clear:
- Add a new module under providers/, implement the LlmProvider trait
- Implement the corresponding Transformer (if format conversion is needed)
- Register in registry.rs's provider_from_key()
- Add the Provider's supported model list in model_registry.rs
The openai_compat_providers module is a shortcut: for services using the OpenAI API format (DeepSeek, OpenCode, various domestic models), only API base URL and API key configuration is needed — no Provider code needs to be written. These services share the same OpenAI-compatible SSE parser and Request builder; only the configuration differs.
// Configuration example
"deepseek": {
    "provider_id": "deepseek",
    "api_base": "https://api.deepseek.com/v1",
    "api_key": "sk-...",
    "provider_type": "openai_compat"
}
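What "only the configuration differs" means in practice can be sketched as follows. `OpenAiCompatConfig` and its methods are hypothetical names; the sketch relies only on the OpenAI-style conventions that the chat endpoint is `chat/completions` under the API base and that the key is sent as a Bearer token.

```rust
/// Hypothetical in-memory form of one openai_compat configuration entry.
pub struct OpenAiCompatConfig {
    pub provider_id: String,
    pub api_base: String,
    pub api_key: String,
}

impl OpenAiCompatConfig {
    /// Endpoint for chat completions, tolerating a trailing slash in api_base.
    pub fn chat_completions_url(&self) -> String {
        format!("{}/chat/completions", self.api_base.trim_end_matches('/'))
    }

    /// Header name and value used to authenticate the request.
    pub fn auth_header(&self) -> (String, String) {
        ("Authorization".to_owned(), format!("Bearer {}", self.api_key))
    }
}
```

Everything else (request body, SSE parsing) stays shared, so a new compatible service is one more configuration entry rather than one more module.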
Summary
The Provider abstraction and Agent query loop constitute BoxAgnts' tool system "engine":
- Provider abstraction solves the problem of integrating 12 AI APIs through three-layer decoupling (ProviderRequest/Response unified data model → LlmProvider trait → Transformer format conversion). Adding a new Provider requires only implementing the trait and registering it; the SSE parser and Request builder shared through the openai_compat module reduce integration costs further.
- Agent query loop achieves interleaved orchestration of conversation and tool execution through a closed loop of SSE state machine parsing, ToolUse detection, tool dispatch, and result feedback. MaxTokens automatic recovery (max 3 attempts) and the exponential backoff retry strategy ensure reliability for long tasks.
The common feature of these two layers is dependency inversion: the Agent loop doesn't depend on a specific AI vendor, and the Provider implementation doesn't depend on specific conversation orchestration logic. The two sides meet only at trait interfaces.
Cost tracking (CostTracker + AtomicF64) and cancellation mechanism (CancellationToken) provide necessary operational observability and user control for production environments.
References
- BoxAgnts source code: https://github.com/guyoung/boxagnts
- Anthropic Messages API documentation: https://docs.anthropic.com/en/api/messages
- OpenAI Chat Completions API: https://platform.openai.com/docs/api-reference/chat
- Server-Sent Events specification: https://html.spec.whatwg.org/multipage/server-sent-events.html
- Codex CLI Agent Loop design: https://openai.com/index/unrolling-the-codex-agent-loop/
- Claude Code architecture analysis: https://blog.promptlayer.com/claude-code-behind-the-scenes-of-the-master-agent-loop/
- tokio-cron-scheduler: https://docs.rs/tokio-cron-scheduler