<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: marcosomma</title>
    <description>The latest articles on DEV Community by marcosomma (@marcosomma).</description>
    <link>https://dev.to/marcosomma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3064224%2F51ff6fb7-5f5b-4ee3-80be-52b8264243b3.jpg</url>
      <title>DEV Community: marcosomma</title>
      <link>https://dev.to/marcosomma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marcosomma"/>
    <language>en</language>
    <item>
      <title>I Ran 500 More Agent Memory Experiments. The Real Problem Wasn’t Recall. It Was Binding.</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Mon, 13 Apr 2026 09:19:39 +0000</pubDate>
      <link>https://dev.to/marcosomma/i-ran-500-more-agent-memory-experiments-the-real-problem-wasnt-recall-it-was-binding-24kc</link>
      <guid>https://dev.to/marcosomma/i-ran-500-more-agent-memory-experiments-the-real-problem-wasnt-recall-it-was-binding-24kc</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a follow-up to &lt;a href="https://dev.to/marco_somma_a9e88a3063f3/i-tried-to-turn-agent-memory-into-plumbing-instead-of-philosophy-1bpm"&gt;I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy&lt;/a&gt;. If you haven't read that one, the short version: I built a persistent memory system for AI agents called &lt;a href="https://github.com/marcosomma/orka-reasoning" rel="noopener noreferrer"&gt;OrKa Brain&lt;/a&gt;, ran 30 benchmark tasks, got a 63% pairwise win rate and a +0.10 rubric improvement, and concluded that "the model already knew most of what the Brain was recalling." Then I got some very good comments that made me uncomfortable. This is what happened next.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Comfortable Lie I Told Myself
&lt;/h2&gt;

&lt;p&gt;After the first benchmark, I had a narrative that felt reasonable: the memory system works, the numbers are positive, the confounds are acknowledged, and more data will clarify things.&lt;/p&gt;

&lt;p&gt;That last part, "more data will clarify things", is what engineers say when they don't want to admit they might be wrong. I said it too. And then I went and got more data.&lt;/p&gt;

&lt;p&gt;250 tasks. Five specialized tracks. 500 total runs (brain vs. brainless). A separate judge model so the LLM wasn't grading its own homework. Eleven code changes addressing five root-cause problems I'd identified from the first round.&lt;/p&gt;

&lt;p&gt;The results came back. They didn't clarify things. They made them worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Fixed Before Running Again
&lt;/h2&gt;

&lt;p&gt;I'm not going to pretend I just blindly re-ran the same experiment. I did real work between benchmark v1 and v2. The &lt;a href="https://dev.to/marco_somma_a9e88a3063f3/i-tried-to-turn-agent-memory-into-plumbing-instead-of-philosophy-1bpm#comments"&gt;first article's comments&lt;/a&gt; called out several things, and I addressed them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Skills were storing verbatim LLM output, not abstract patterns.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the big one. When the Brain learned a skill from a data engineering task, it stored the literal steps: "Load CSV files into staging tables using pandas read_csv with error handling." That's not transferable knowledge, it's a paraphrase of what the model already knows. I rewrote the abstraction layer (&lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/orka/brain/constants.py" rel="noopener noreferrer"&gt;&lt;code&gt;orka/brain/constants.py&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/orka/brain/brain.py" rel="noopener noreferrer"&gt;&lt;code&gt;brain.py&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/orka/agents/brain_agent.py" rel="noopener noreferrer"&gt;&lt;code&gt;brain_agent.py&lt;/code&gt;&lt;/a&gt;) to extract verb-target patterns: "implement [target]", "validate [component]", "trace [target]". The idea was that abstract patterns would transfer better across domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: The recall threshold was zero.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;min_score=0.0&lt;/code&gt; meant any vaguely related skill could get recalled. I raised it to 0.5 and added a semantic floor in the &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/orka/brain/transfer_engine.py" rel="noopener noreferrer"&gt;&lt;code&gt;transfer_engine.py&lt;/code&gt;&lt;/a&gt;, if the embedding similarity is below 0.1 AND structural match is below 0.6, the candidate gets rejected entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 3: The model was judging its own output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;v1 used the same LLM for execution and evaluation. v2 uses a separate judge model (&lt;code&gt;qwen/qwen3-coder-30b&lt;/code&gt;) with dedicated &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/examples/benchmark_v2/judge_rubric_workflow.yml" rel="noopener noreferrer"&gt;rubric&lt;/a&gt; and &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/examples/benchmark_v2/judge_pairwise_workflow.yml" rel="noopener noreferrer"&gt;pairwise&lt;/a&gt; workflow YAMLs. Execution and judgment are completely decoupled, different scripts, different models, different runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 4: Track diversity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;v1 had one track. v2 has five:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Track&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Cross-domain transfer&lt;/td&gt;
&lt;td&gt;Does a data engineering skill help with cybersecurity?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Ethical reasoning&lt;/td&gt;
&lt;td&gt;Do anti-pattern detection skills transfer?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;Routing decisions&lt;/td&gt;
&lt;td&gt;Hardest track, complex multi-path choices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;Multi-step reasoning&lt;/td&gt;
&lt;td&gt;Do procedural patterns help new reasoning chains?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;Iterative refinement&lt;/td&gt;
&lt;td&gt;Do improvement patterns compound?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;50 tasks per track, 250 total. All available in the &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/examples/benchmark_v2/benchmark_v2_dataset.json" rel="noopener noreferrer"&gt;benchmark dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 5: Single-pass baselines.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The brainless condition now runs through a properly equivalent pipeline, same structure, same number of agents, just without the Brain recall/learn steps. No more two-pass advantage that could inflate brainless scores. Baseline workflows: &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/examples/benchmark_v2/baseline_track_a.yml" rel="noopener noreferrer"&gt;&lt;code&gt;baseline_track_a.yml&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/examples/benchmark_v2/baseline_track_b.yml" rel="noopener noreferrer"&gt;&lt;code&gt;baseline_track_b.yml&lt;/code&gt;&lt;/a&gt;, etc.&lt;/p&gt;

&lt;p&gt;I also split the pipeline into three standalone scripts, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/examples/benchmark_v2/run_benchmark_v2.py" rel="noopener noreferrer"&gt;execution&lt;/a&gt;, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/examples/benchmark_v2/judge_benchmark.py" rel="noopener noreferrer"&gt;judging&lt;/a&gt;, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/examples/benchmark_v2/aggregate_benchmark.py" rel="noopener noreferrer"&gt;aggregation&lt;/a&gt;, so you can re-run any phase independently. Eleven code changes total, all committed and tested. 3,014 unit tests passing. You can verify everything in the &lt;a href="https://github.com/marcosomma/orka-reasoning/tree/master/examples/benchmark_v2/results" rel="noopener noreferrer"&gt;results directory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I felt good about this. I'd addressed every valid criticism. Time to re-run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Here's the overall aggregate from 250 tasks, brain vs. brainless:&lt;/p&gt;

&lt;h3&gt;
  
  
  Rubric Scores (1–10 scale, six dimensions)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Brain&lt;/th&gt;
&lt;th&gt;Brainless&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning Quality&lt;/td&gt;
&lt;td&gt;9.51&lt;/td&gt;
&lt;td&gt;9.52&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−0.01&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structural Completeness&lt;/td&gt;
&lt;td&gt;9.87&lt;/td&gt;
&lt;td&gt;9.83&lt;/td&gt;
&lt;td&gt;+0.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Depth of Analysis&lt;/td&gt;
&lt;td&gt;8.79&lt;/td&gt;
&lt;td&gt;8.74&lt;/td&gt;
&lt;td&gt;+0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actionability&lt;/td&gt;
&lt;td&gt;9.67&lt;/td&gt;
&lt;td&gt;9.64&lt;/td&gt;
&lt;td&gt;+0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain Adaptability&lt;/td&gt;
&lt;td&gt;9.85&lt;/td&gt;
&lt;td&gt;9.82&lt;/td&gt;
&lt;td&gt;+0.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence Calibration&lt;/td&gt;
&lt;td&gt;9.38&lt;/td&gt;
&lt;td&gt;9.39&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−0.01&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.37&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.31&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.06&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A +0.06 rubric delta across 250 tasks.&lt;/p&gt;

&lt;p&gt;For reference, v1 was +0.10 across 30 tasks. So the effect got &lt;em&gt;smaller&lt;/em&gt; with more data, not larger. That's not what you want to see.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pairwise Comparison (245 head-to-head comparisons)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Brain Wins&lt;/th&gt;
&lt;th&gt;Brainless Wins&lt;/th&gt;
&lt;th&gt;Tie&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stronger reasoning&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;More complete&lt;/td&gt;
&lt;td&gt;149&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;More trustworthy&lt;/td&gt;
&lt;td&gt;151&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;151&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Brain win rate: &lt;strong&gt;61.6%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's where it gets uncomfortable. The pairwise judge says brain wins 62% of the time. The rubric judge says brain is +0.06 better, which is noise at a 9.3/10 baseline. These two metrics should agree. They don't.&lt;/p&gt;

&lt;p&gt;I've seen this pattern before. It's length/position bias. Brain responses tend to be longer because the pipeline has more agents in the chain, which means more context, which means more text. Pairwise judges prefer longer answers. The rubric doesn't care about length, it scores each dimension independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-Track Breakdown
&lt;/h3&gt;

&lt;p&gt;This is where the story gets interesting:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Track&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Rubric Δ&lt;/th&gt;
&lt;th&gt;Pairwise Win%&lt;/th&gt;
&lt;th&gt;Brainless Baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Cross-domain transfer&lt;/td&gt;
&lt;td&gt;−0.02&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;9.33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Ethical reasoning&lt;/td&gt;
&lt;td&gt;+0.00&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;9.54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;Routing decisions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.40&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.12&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;Multi-step reasoning&lt;/td&gt;
&lt;td&gt;+0.08&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;9.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;Iterative refinement&lt;/td&gt;
&lt;td&gt;+0.06&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;td&gt;9.61&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Track C stands out. It's the hardest track, brainless only scores 8.12, nearly a full point below every other track. And it's the only track where brain shows a meaningful rubric gain: +0.40 across six dimensions.&lt;/p&gt;

&lt;p&gt;Track E has the highest pairwise win rate (76%) but the smallest rubric gain (+0.06). That's the length bias signature, the pairwise judge loves brain's longer outputs, but the rubric says they're not actually better.&lt;/p&gt;

&lt;p&gt;Track B is essentially a coin flip. 52% pairwise, +0.00 rubric. The Brain adds nothing to ethical reasoning tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Ugly Detail: Skill Usage
&lt;/h3&gt;

&lt;p&gt;Here's what really killed me. I dug into the individual results to see how many tasks actually used their recalled skill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tasks with skill recall attempted:&lt;/strong&gt; 51 / 250 (20%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks that actually used the recalled skill:&lt;/strong&gt; 0 / 250&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average semantic match score:&lt;/strong&gt; ~0.02 (near zero)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zero. Not one single task out of 250 used the recalled skill. The model read the skill, evaluated it, and decided every single time that it wasn't helpful. And the semantic similarity between the abstract skill and the actual task was essentially random noise.&lt;/p&gt;

&lt;p&gt;The abstraction layer I was so proud of, the one that converts "Load CSV files into staging tables using pandas" into "implement [target]", produced skills so abstract they were vacuous. Two words of content. The embedding model sees no relationship between "implement [target]" and any real task. The execution model correctly recognizes that "implement [target]" tells it nothing it doesn't already know.&lt;/p&gt;

&lt;p&gt;I had gone from skills that were too specific (literal LLM paraphrases) to skills that were too abstract (empty shells). The sweet spot, actual transferable knowledge, was somewhere I hadn't found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sitting with the Discomfort
&lt;/h2&gt;

&lt;p&gt;I'm going to be honest about what went through my head at this point. I've been working on OrKa for over a year. Forty blog posts. A research paper about the Agricultural Threshold for machine intelligence. An open-source framework that allow me to test and experiment and explore my idea with real AI runs. And the core thesis, that persistent memory makes agents better, keeps failing to show up in the numbers.&lt;/p&gt;

&lt;p&gt;I considered dropping the whole Brain system. Making OrKa just an orchestration framework. Simpler. Easier to explain. No embarrassing benchmarks.&lt;br&gt;
But then I looked at &lt;strong&gt;Track C&lt;/strong&gt; again.&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;Track C **is the only track where brainless *struggles&lt;/em&gt;. It scores 8.12, good, but not great. The tasks involve complex routing decisions where the model has to consider multiple paths and trade-offs. This is the only track where the model actually needs help.&lt;/p&gt;

&lt;p&gt;And it's the only track where brain provides meaningful help. +0.40 rubric delta is not noise. Across 50 tasks and six scoring dimensions, that's a consistent, measurable improvement.&lt;/p&gt;

&lt;p&gt;The pattern is simple: the Brain helps when the model needs help, and doesn't help when the model doesn't need help.&lt;/p&gt;

&lt;p&gt;That sounds obvious in retrospect. &lt;strong&gt;&lt;em&gt;But it means the thesis isn't wrong, it's just being tested in the wrong conditions&lt;/em&gt;&lt;/strong&gt;. You wouldn't evaluate a life jacket by putting it on people standing on dry land and measuring whether they're drier.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Problem: What Is a Memory?
&lt;/h2&gt;

&lt;p&gt;This is where the story changes. Because instead of asking "does memory help?" I started asking "what is a memory, actually?"&lt;/p&gt;

&lt;p&gt;Think about how you remember how to drive a car. What fires in your brain when you approach an unfamiliar intersection?&lt;/p&gt;

&lt;p&gt;It's not one thing. It's not "turn the wheel, press the gas." That's the procedural part, and yes, it's there. But it's bound together with other things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The time you nearly got T-boned&lt;/strong&gt; because you assumed a green light meant it was safe without checking cross traffic. That's episodic memory, a specific event with emotional weight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Right of way doesn't mean right of safety"&lt;/strong&gt;, That's semantic memory. A general fact you learned, maybe from a driving instructor, maybe from experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Checking mirrors BEFORE entering the intersection prevents blind-spot collisions BECAUSE turning reduces your field of vision"&lt;/strong&gt;, That's causal reasoning. You know &lt;em&gt;why&lt;/em&gt; the sequence matters, not just &lt;em&gt;that&lt;/em&gt; it matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you encounter the intersection, all of these fire together. The procedure tells you what to do. The episode tells you what happened last time. The semantic fact tells you a principle. The causal link tells you why. That combination, that &lt;em&gt;binding&lt;/em&gt;, is what makes the memory useful. Any single component alone is much less helpful.&lt;/p&gt;

&lt;p&gt;Now look at what OrKa Brain currently stores as a "skill":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;implement [target]
trace [target]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No episodes. No semantic context. No causal reasoning. Just two abstract action verbs. No wonder the model ignores it. It's like handing a driver a note that says "steer [vehicle]" and expecting it to help at the intersection.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Memory Binding Problem
&lt;/h2&gt;

&lt;p&gt;I went down a rabbit hole into cognitive science literature on this. What I found is that neuroscientists have been arguing about this exact problem for decades. They call it the &lt;em&gt;binding problem&lt;/em&gt;, how does the brain take separate memory traces stored in different systems and combine them into a unified experience?&lt;/p&gt;

&lt;p&gt;The hippocampus doesn't store the memory. It stores the &lt;em&gt;index&lt;/em&gt;, the binding that links the procedural memory in the motor cortex, the emotional trace in the amygdala, the spatial context in the parietal cortex, and the semantic facts in the temporal lobe. When you recall one, you recall all of them, because they're bound together.&lt;/p&gt;

&lt;p&gt;I had built the hippocampus and the motor cortex as two separate systems that had never met.&lt;/p&gt;

&lt;p&gt;Here's what actually exists in OrKa today:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Skill system&lt;/strong&gt; (fully operational, used in benchmarks):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abstract procedure steps&lt;/li&gt;
&lt;li&gt;Preconditions and postconditions&lt;/li&gt;
&lt;li&gt;Transfer history and confidence scores&lt;/li&gt;
&lt;li&gt;Structural/semantic matching for recall&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Episode system&lt;/strong&gt; (fully built, tested, &lt;em&gt;never used in any benchmark&lt;/em&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specific task input and outcome&lt;/li&gt;
&lt;li&gt;What worked and what failed&lt;/li&gt;
&lt;li&gt;Root cause analysis for failures&lt;/li&gt;
&lt;li&gt;Actionable lessons learned&lt;/li&gt;
&lt;li&gt;Resource metrics (tokens, latency)&lt;/li&gt;
&lt;li&gt;Links to related episodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both systems are production-ready. Both have full test coverage. Both are integrated into the Brain class. I wrote &lt;code&gt;record_episode()&lt;/code&gt;, &lt;code&gt;recall_episodes()&lt;/code&gt;, &lt;code&gt;EpisodeStore&lt;/code&gt;, &lt;code&gt;EpisodeRecall&lt;/code&gt;, all of it. Complete with semantic search, retention policies, and four-dimensional scoring.&lt;/p&gt;

&lt;p&gt;And then I never connected them together.&lt;/p&gt;

&lt;p&gt;The Skill has no &lt;code&gt;episode_id&lt;/code&gt; field. The Episode has no &lt;code&gt;skill_id&lt;/code&gt; field. &lt;code&gt;brain.learn()&lt;/code&gt; creates a Skill but not an Episode. &lt;code&gt;brain.recall()&lt;/code&gt; returns Skills but not Episodes. The benchmark workflows run brain_learn and brain_recall, but never brain_record_episode or brain_recall_episodes.&lt;/p&gt;

&lt;p&gt;Two complete memory systems, sitting in the same codebase, sharing no information.&lt;/p&gt;

&lt;p&gt;When I saw this, I felt stupid. But I also felt something else: the architecture was already 80% there. The hard parts, embedding storage, semantic search, decay policies, scoring systems, were done. The missing piece wasn't a new system. It was the wiring between existing systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Memory Should Actually Look Like
&lt;/h2&gt;

&lt;p&gt;Here's the concept I'm now calling a &lt;strong&gt;Memory Bundle&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│            MEMORY BUNDLE                │
│                                         │
│  ┌───────────┐  ┌──────────────────┐    │
│  │ Procedure │  │ Episodes (1..N)  │    │
│  │ (steps)   │──│ what worked      │    │
│  │           │  │ what failed      │    │
│  └───────────┘  │ lessons          │    │
│                 │ "X+Z → Y"        │    │
│  ┌───────────┐  └──────────────────┘    │
│  │ Semantic  │                          │
│  │ (domain   │  ┌──────────────────┐    │
│  │  facts)   │  │ Causal Links     │    │
│  │           │  │ "A because B"    │    │
│  └───────────┘  └──────────────────┘    │
│                                         │
│  transfer_score = f(all_components)     │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the system learns from an execution, it creates &lt;em&gt;both&lt;/em&gt; a skill AND an episode, linked by ID. The skill stores the abstract procedure. The episode stores what actually happened, the specific outcome, what worked, what failed, and crucially, the &lt;em&gt;lessons&lt;/em&gt;: "Running validation before deduplication caught 30% of bad records that would have been duplicated, always validate first."&lt;/p&gt;

&lt;p&gt;When the system recalls, it returns the skill &lt;em&gt;with its episodes attached&lt;/em&gt;. The prompt to the model isn't "implement [target]", it's:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Here's an abstract procedure: implement [target] → validate [component] → trace [target].&lt;/p&gt;

&lt;p&gt;This skill has been applied 3 times before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data engineering (ETL)&lt;/strong&gt;: Validation before dedup caught 30% of dirty records. Lesson: always validate before any deduplication step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API integration&lt;/strong&gt;: Target implementation worked, but tracing missed async callbacks. Lesson: tracing needs to account for async execution paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log analysis&lt;/strong&gt;: Pattern worked well. Filtering noisy entries before analysis reduced false positives by 40%.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's a memory a model can actually use. It has the abstract pattern (transferable) AND the concrete evidence (grounding). The model can decide whether the pattern applies here based on real outcomes, not just structural similarity.&lt;/p&gt;

&lt;p&gt;The transfer scoring changes too. A skill backed by five successful episodes with clear lessons should score higher than a skill backed by zero episodes. The episode quality becomes part of the transfer decision.&lt;/p&gt;

&lt;p&gt;And feedback updates both, the skill's confidence changes, AND a new episode gets recorded for this application. The episode chain grows over time, and future recalls get richer context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Actually About the Thesis
&lt;/h2&gt;

&lt;p&gt;My research paper argues that intelligence becomes civilization-scale only through &lt;strong&gt;recursive environmental control loops&lt;/strong&gt;, project, act, observe, revise, compound. Agriculture was the first time humans did this at scale. The agricultural threshold.&lt;/p&gt;

&lt;p&gt;The current Brain system doesn't cross that threshold. It projects (learns a skill), acts (recalls it), but doesn't truly observe or revise. The skill never learns from its own application. It just accumulates abstract patterns with no connection to real outcomes.&lt;/p&gt;

&lt;p&gt;The Memory Bundle changes this. Each episode is an observation. Each lesson is a revision. Each future recall that includes those lessons is compounding. The loop closes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Learn&lt;/strong&gt;: Execute a task → create skill + record episode (with what worked/failed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt;: Find matching skill → include its episodes as evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply&lt;/strong&gt;: Model uses the procedure + the concrete lessons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback&lt;/strong&gt;: Record a new episode for this application → update skill confidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compound&lt;/strong&gt;: Next recall is richer, it has more episodes, more lessons, more evidence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the recursive loop. That's the agricultural threshold. And the architecture for it already exists, it just needs the binding.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About Track C?
&lt;/h2&gt;

&lt;p&gt;This also explains why Track C was the only track that showed improvement. Track C tasks are routing decisions, complex, multi-path choices where the model has to weigh trade-offs. These are exactly the kind of tasks where episodic evidence would help most.&lt;/p&gt;

&lt;p&gt;When someone says "last time we tried path A for a similar routing problem, it failed because of X, path B worked because of Y," that's genuinely new information. The model can't derive it from its weights. It's system-specific, run-specific, outcome-specific.&lt;/p&gt;

&lt;p&gt;The current brain helped Track C even without episodes because the tasks are hard enough that any additional context, even a vague abstract skill, provides a useful scaffold. But imagine Track C with Memory Bundles, the model would get both the abstract pattern AND the specific outcomes from previous routing decisions.&lt;/p&gt;

&lt;p&gt;Tracks A, B, D, and E didn't improve because the model already scores 9.3+/10 on them. It doesn't need help. No amount of memory, procedural, episodic, or otherwise, will improve a 9.5/10 response to a 10/10 response. The tasks aren't hard enough to require accumulated knowledge.&lt;/p&gt;

&lt;p&gt;This isn't a failure of the memory system. It's a boundary condition. Memory helps when the task exceeds single-shot capability. It doesn't help when the model is already near-perfect without it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Not Claiming
&lt;/h2&gt;

&lt;p&gt;I want to be careful here, because I've been burned before by getting ahead of my own evidence.&lt;/p&gt;

&lt;p&gt;I'm &lt;strong&gt;not&lt;/strong&gt; claiming that Memory Bundles will definitely show large improvements. I'm claiming that the current system stores memories that are too impoverished to be useful, and I now understand what richer memories should look like.&lt;/p&gt;

&lt;p&gt;I'm &lt;strong&gt;not&lt;/strong&gt; claiming the ceiling effect is the only problem. The pairwise-rubric disagreement at 62% vs +0.06 suggests position/length bias is still contaminating the pairwise results. That confound exists regardless of memory architecture.&lt;/p&gt;

&lt;p&gt;I'm &lt;strong&gt;not&lt;/strong&gt; claiming this is a new idea. Cognitive scientists have written about memory binding for decades. What's new (maybe) is applying it to agent memory systems where the default assumption seems to be that one type of memory, usually RAG-style document retrieval, is sufficient.&lt;/p&gt;

&lt;p&gt;And I'm &lt;strong&gt;not&lt;/strong&gt; pretending the community feedback didn't shape this thinking. When TechPulse Lab wrote that episodic and institutional memory matters more than procedural memory, they were describing exactly the gap I ended up finding. When Nova Elvaris pointed out that skills can only grow, never decay, that's the absence of failure episodes. When Kuro said memory maintenance matters more than storage, that's about binding quality, not storage quantity.&lt;/p&gt;

&lt;p&gt;I just didn't understand what they were telling me until the numbers forced me to look harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens Next
&lt;/h2&gt;

&lt;p&gt;The code changes needed are surprisingly small. The Episode system is already built, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/orka/brain/episode.py" rel="noopener noreferrer"&gt;&lt;code&gt;episode.py&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/orka/brain/episode_store.py" rel="noopener noreferrer"&gt;&lt;code&gt;episode_store.py&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/marcosomma/orka-reasoning/blob/master/orka/brain/episode_recall.py" rel="noopener noreferrer"&gt;&lt;code&gt;episode_recall.py&lt;/code&gt;&lt;/a&gt; are all production-ready with tests. What's needed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Binding&lt;/strong&gt;: Add &lt;code&gt;episode_ids[]&lt;/code&gt; to Skill, add &lt;code&gt;skill_id&lt;/code&gt; to Episode. When &lt;code&gt;brain.learn()&lt;/code&gt; fires, it creates both and links them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified recall&lt;/strong&gt;: When &lt;code&gt;brain.recall()&lt;/code&gt; finds a matching skill, it fetches the associated episodes automatically. The prompt template includes both the abstract procedure and the concrete lessons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer scoring&lt;/strong&gt;: Episode quality becomes a component of the transfer score. Skills with successful episodes score higher.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback loop&lt;/strong&gt;: &lt;code&gt;brain.feedback()&lt;/code&gt; records a new episode for the current application, so the skill's evidence base grows over time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then re-run the benchmark. Specifically on Track C-difficulty tasks, where the model actually needs help.&lt;/p&gt;

&lt;p&gt;I'm not going to promise the numbers will be different this time. I've been wrong before, twice now, measured against my own benchmarks, published for everyone to see. But I understand something I didn't understand before: a memory without experience is just a note. A memory with experience is a skill.&lt;/p&gt;

&lt;p&gt;The plumbing metaphor from the first article still holds. But I was plumbing one pipe when the system needs at least four, all flowing into the same tap.&lt;/p&gt;




&lt;p&gt;All benchmark data, scripts, and results are publicly available in the &lt;a href="https://github.com/marcosomma/orka-reasoning/tree/master/examples/benchmark_v2" rel="noopener noreferrer"&gt;OrKa repository&lt;/a&gt;. The &lt;a href="https://github.com/marcosomma/orka-reasoning/tree/master/examples/benchmark_v2/results" rel="noopener noreferrer"&gt;full result files&lt;/a&gt; include every individual task response, judge score, and pairwise comparison. If you want to re-run the analysis: &lt;code&gt;python aggregate_benchmark.py --judge-tag local&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you've worked on agent memory systems and found similar walls, or found ways through them, I'd genuinely like to hear about it. The comments on the first article were more useful than most papers I've read on the topic.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of an ongoing series about building &lt;a href="https://github.com/marcosomma/orka-reasoning" rel="noopener noreferrer"&gt;OrKa&lt;/a&gt;, an open-source YAML-first agent orchestration framework. Previous installments: &lt;a href="https://dev.to/marco_somma_a9e88a3063f3/i-tried-to-turn-agent-memory-into-plumbing-instead-of-philosophy-1bpm"&gt;Part 1: Plumbing Instead of Philosophy&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>rag</category>
      <category>graphknowledge</category>
    </item>
    <item>
      <title>I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Thu, 26 Mar 2026 11:42:27 +0000</pubDate>
      <link>https://dev.to/marcosomma/i-tried-to-turn-agent-memory-into-plumbing-instead-of-philosophy-3a8e</link>
      <guid>https://dev.to/marcosomma/i-tried-to-turn-agent-memory-into-plumbing-instead-of-philosophy-3a8e</guid>
      <description>&lt;p&gt;There is a special genre of AI idea that sounds brilliant right up until you try to build it. It usually arrives dressed as a grand sentence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Agents should learn transferable skills."&lt;br&gt;
"Systems should accumulate experience over time."&lt;br&gt;
"We need durable adaptive cognition."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Beautiful. Elegant. Deep. Completely useless for about five minutes, until somebody has to decide what Redis key to write, what object to persist, what gets recalled, what counts as success, what decays, and how not to fool themselves with a benchmark made of warm air and wishful thinking.&lt;/p&gt;

&lt;p&gt;That is usually the point where the magic dies. Good. I like ideas that survive contact with plumbing. So after thinking for a while about procedural memory and transferable knowledge in agent systems, I did the only thing that matters if you want to know whether an idea is real or just very well moisturized language.&lt;br&gt;
I wired the whole thing end to end. Or at least I try so.&lt;br&gt;
The question was simple enough to sound harmless.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can an agent system learn a procedure from one task, persist it, retrieve it later, try to reuse it in a different task, record feedback, and let weak patterns decay instead of growing into a trash heap with a logo?&lt;br&gt;
In other words, can you build a procedural memory loop that behaves like a system and not like a TED Talk?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I built OrKa Brain as a first implementation inside &lt;a href="https://github.com/Orka-HQ/orka-core" rel="noopener noreferrer"&gt;OrKa&lt;/a&gt;, a YAML-first agent orchestration framework.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Loop
&lt;/h2&gt;

&lt;p&gt;The loop was straightforward on paper. Learn. Persist. Retrieve. Apply. Feedback. Decay. Of course, "straightforward on paper" is the native language of future suffering.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;learn&lt;/strong&gt; stage extracted a structured skill from the execution trace. The &lt;strong&gt;persist&lt;/strong&gt; stage stored it in Redis. The &lt;strong&gt;recall&lt;/strong&gt; stage searched for something structurally relevant. The &lt;strong&gt;apply&lt;/strong&gt; stage injected that recalled skill back into the solving process. The &lt;strong&gt;feedback&lt;/strong&gt; stage updated confidence. The &lt;strong&gt;decay&lt;/strong&gt; stage made sure old and weak patterns did not live forever like some cursed enterprise configuration file from 2017.&lt;/p&gt;

&lt;p&gt;That is the kind of sentence people read quickly.&lt;br&gt;
Each verb hides a small swamp.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is a Skill, Concretely?
&lt;/h2&gt;

&lt;p&gt;Not in the spiritual sense. In the schema sense.&lt;/p&gt;

&lt;p&gt;I ended up with a skill object carrying ordered steps (each with an action, description, parameters, and optionality flag), preconditions and postconditions expressed as testable predicates, a confidence score, a transfer history recording every cross-context attempt, usage count, tags, timestamps, and a TTL computed from actual use.&lt;/p&gt;

&lt;p&gt;The TTL formula was designed to reward skills that prove their worth: base of 168 hours (one week), scaled logarithmically by usage and linearly by confidence. A fresh skill with one use and 50% confidence lives for a week. A well-exercised skill used 16 times with 90% confidence survives 49 days. Skills that nobody calls on quietly expire. Redis handles the tombstone.&lt;/p&gt;

&lt;p&gt;Enough structure to be useful. Not enough structure to become its own religion.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Intentionally Primitive First Version
&lt;/h2&gt;

&lt;p&gt;Was it elegant? Reasonably. Was it semantic? Not really.&lt;/p&gt;

&lt;p&gt;The first implementation was intentionally primitive. Rule-based context extraction. Keyword-driven pattern detection across ten task structures and ten cognitive patterns. Jaccard similarity for structural matching. Full scan retrieval no vector index, no embedding-based recall. Deterministic feature extraction.&lt;/p&gt;

&lt;p&gt;Basically the cognitive equivalent of saying, "Let us begin with a wrench before we start writing poems about self-improving systems."&lt;/p&gt;

&lt;p&gt;This was not because I think keyword matching is the future. It was because I wanted to know whether the loop itself was worth taking seriously before adding semantic frosting and pretending the cake had already been baked.&lt;/p&gt;

&lt;p&gt;The scoring system weighted four dimensions: structural similarity at 0.35 (Jaccard over task structures and cognitive patterns, plus shape matching), semantic similarity at 0.25 (keyword overlap in v1, embeddings when available), transfer history at 0.25 (historical success rate of cross-context application), and skill confidence at 0.15.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;Then came the benchmark. Thirty tasks. Two tracks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track A&lt;/strong&gt; tested cross-domain transfer. Three learning phases, then seven recall phases in structurally similar but semantically different domains. Learn a decomposition procedure from text analysis, then see whether it helps with supply chain planning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track B&lt;/strong&gt; tested same-domain accumulation. Twenty sequential veterinary diagnostic cases, because diagnostics has enough repeated structure to expose whether prior procedures are helping or whether the system is just cosplaying wisdom.&lt;/p&gt;

&lt;p&gt;I compared two conditions. The Brain condition ran a six-agent pipeline: reasoner, learn, recall, applier, feedback, result. The Brainless condition ran three agents: reasoner, applier, result. Same model. Same temperature. Same prompts where applicable. All running locally through LM Studio, completely offline. No API calls. No cloud. Just a GPU and Redis.&lt;br&gt;
Then I used an LLM judge to score outputs in two ways: independently against a six-dimension rubric (reasoning quality, structural completeness, depth of analysis, actionability, domain adaptability, confidence calibration), and through blind pairwise comparison where the judge saw both outputs side by side without knowing which was which.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;This is the part where half the internet would like me to say the system awakened, generalized, and began cultivating its own cognitive farmland while Gregorian chanting played softly in the background.&lt;/p&gt;

&lt;p&gt;That did not happen! :( &lt;br&gt;
What happened was better. Something real, and smaller.&lt;br&gt;
&lt;strong&gt;Pairwise comparison:&lt;/strong&gt; Brain won 63% of head-to-head matchups (19 out of 30). That is not nothing. There was a detectable, consistent preference. The strongest signal was in perceived trustworthiness Brain won 68% of trustworthiness comparisons which is interesting because trustworthiness in LLM systems is often just a more polite word for "this output feels less like it was assembled by a caffeinated raccoon."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rubric scores:&lt;/strong&gt; Nearly flat. Overall delta plus 0.10 on a 10-point scale. Reasoning quality showed the largest individual improvement at plus 0.28. Depth of analysis showed exactly zero delta a ceiling effect where neither condition could push further.&lt;/p&gt;

&lt;p&gt;That is not breakthrough territory. That is not even "start writing your Nobel acceptance speech in a local markdown file" territory. That is exactly the kind of result I wanted. Not because the gain is impressive, but because the benchmark forced the system to confess what it actually is.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Skills Looked Like
&lt;/h2&gt;

&lt;p&gt;Across 30 tasks, the system created 21 distinct skills after deduplicating 9 that were structurally equivalent. Average confidence settled at 72%. The most popular skill, "Evaluation via Validation," was recalled 9 times and reached 79% confidence. TTLs ranged from 8 to 37 days based on usage.&lt;/p&gt;

&lt;p&gt;One detail was revealing: the system never recorded a transfer failure. Every recalled skill, when applied to a new context, was marked as successful. This makes the feedback loop suspect. Either the feedback criteria were too permissive, or the skill-context matching was conservative enough to avoid clear mismatches. Either way, it means the confidence updates were asymmetric skills could only grow, never seriously shrink which is a measurement problem I need to fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Biggest Finding
&lt;/h2&gt;

&lt;p&gt;The model already knew most of what the Brain was recalling. The system was remembering procedural patterns like decompose, analyse, synthesize. Validate, classify, route. Iterative refinement. All useful patterns. All patterns the underlying model had almost certainly already absorbed during pre-training.&lt;/p&gt;

&lt;p&gt;So the Brain was not teaching the model some exotic new craft from the mountains. It was mostly reminding it to behave a little more consistently.&lt;/p&gt;

&lt;p&gt;That matters. It also kills a lot of hype.&lt;/p&gt;

&lt;p&gt;Because once you see that, you stop fantasizing about "agent memory" as some magical layer that turns a model into a wise little apprentice blacksmith forging general intelligence in your terminal.&lt;/p&gt;

&lt;p&gt;Sometimes memory is just structured context with better bookkeeping.&lt;br&gt;
And to be clear, that is still useful.&lt;br&gt;
Useful is underrated.&lt;br&gt;
Useful pays rent while hype writes threads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest Confounds
&lt;/h2&gt;

&lt;p&gt;The other thing the benchmark made painfully clear is that bad evaluation can flatter almost anything if you let it.&lt;/p&gt;

&lt;p&gt;A few things I had to stare at honestly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline length.&lt;/strong&gt; The Brain condition passes through three extra LLM calls. That alone could be enriching context in ways that have nothing to do with skill retrieval. The 15% time overhead (595 seconds vs. 517 seconds for the full benchmark) is cheap, but the extra context injection is a real confound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position bias.&lt;/strong&gt; The pairwise judge preferred the first position 61% of the time, regardless of which condition was placed there. I randomized positions, which mitigates but does not eliminate this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single run, single model.&lt;/strong&gt; I did not run this 50 times and average. The results are from one end-to-end execution. Non-determinism is present but unquantified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outlier sensitivity.&lt;/strong&gt; A catastrophic failure in one condition can pretend to be proof of another. A single badly generated veterinary case could shift aggregate scores in a 30-task benchmark.&lt;/p&gt;

&lt;p&gt;If you want to lie to yourself in AI, you are never alone. The tooling is ready to help.&lt;/p&gt;

&lt;p&gt;That is why I published the result with the weak parts exposed.&lt;/p&gt;

&lt;p&gt;No heroic framing. No fake certainty. No "this changes everything" perfume sprayed over a modest engineering result.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Know Now
&lt;/h2&gt;

&lt;p&gt;Just this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The loop is buildable.&lt;/strong&gt; The full learn-persist-retrieve-apply-feedback-decay cycle works end to end. Thirty task procedures deduplicated into 21 skills. Transfer histories are tracked. Skills expire. The plumbing works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The signal exists.&lt;/strong&gt; 63% pairwise preference is consistent and non-trivial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cause of that signal is still ambiguous.&lt;/strong&gt; It could be genuine procedural transfer, or it could be richer context from extra LLM passes, or some combination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The current bottleneck is abstraction, not storage.&lt;/strong&gt; The v1 system stores procedures as structured versions of traces. It does not truly abstract them. It does not generalize them semantically. It does not compress them into domain-independent tactics with actual conceptual teeth. The context analyzer runs on hardcoded keyword dictionaries, not semantic understanding. Retrieval is a full scan, not an index.&lt;/p&gt;

&lt;p&gt;That last part matters most.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;So now the next question is finally the right one.&lt;/p&gt;

&lt;p&gt;Not "can we talk beautifully about agent memory?"&lt;/p&gt;

&lt;p&gt;We already know the answer to that. Absolutely. People can talk beautifully about almost anything. Especially if nobody asks for logs.&lt;/p&gt;

&lt;p&gt;The real question is whether better abstraction and better retrieval change the outcome materially.&lt;/p&gt;

&lt;p&gt;If I replace deterministic trace structuring with actual procedural abstraction compressing "decompose input into parts, then analyse each part, then synthesise results" across domains into a generalised decompose-analyse-synthesise tactic and if I replace keyword overlap with embedding-based retrieval or something even smarter, does the loop start doing something that a well-trained model does not already do by default?&lt;/p&gt;

&lt;p&gt;That is the threshold.&lt;/p&gt;

&lt;p&gt;That is where plumbing starts becoming research instead of respectable mechanical honesty.&lt;/p&gt;

&lt;p&gt;And honestly, I prefer it this way.&lt;/p&gt;

&lt;p&gt;I would rather publish a first implementation with modest results and sharp limits than one more dramatic post about the dawn of adaptive cognition from someone who has never had to decide what expires, what merges, what fails, and what gets written back after the benchmark finishes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There is enough incense in AI already.&lt;br&gt;
I am more interested in pipes.&lt;br&gt;
Because pipes, unlike vibes, occasionally carry water.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;OrKa Brain is part of &lt;a href="https://github.com/Orka-HQ/orka-core" rel="noopener noreferrer"&gt;OrKa&lt;/a&gt;, an open-source YAML-first AI agent orchestration framework. The full benchmark, including task definitions, raw results, judge transcripts, and the technical paper, is available in the repository.&lt;/em&gt; &lt;a href="https://zenodo.org/records/19227514" rel="noopener noreferrer"&gt;tech-paper&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Intelligence, Farming, and Why AI Is Still Mostly in Its Tool Phase</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Wed, 18 Mar 2026 23:20:04 +0000</pubDate>
      <link>https://dev.to/marcosomma/intelligence-farming-and-why-ai-is-still-mostly-in-its-tool-phase-4gpe</link>
      <guid>https://dev.to/marcosomma/intelligence-farming-and-why-ai-is-still-mostly-in-its-tool-phase-4gpe</guid>
      <description>&lt;p&gt;People usually talk about intelligence as if it starts with language, tools, or raw brainpower. I do not think that is enough. In the bigger evolutionary picture, intelligence starts when a living thing stops just reacting to whatever is in front of its face and begins carrying a rough model of the world in its head. A kind of inner sketch. Something that helps it remember, predict, adjust, and act not only for now, but for later.&lt;/p&gt;

&lt;p&gt;A lot of animals do this. They are not stupid. They solve problems, learn patterns, adapt, trick each other, and survive in ways that are honestly impressive. So intelligence is not some magical human-only plugin installed by the universe. What is rare is not intelligence itself. What is rare is the moment when intelligence stops being useful only for survival and starts becoming a world-editing machine.&lt;/p&gt;

&lt;p&gt;That is where humans took a weird turn.&lt;/p&gt;

&lt;p&gt;The real jump was not just tools. A stick is great. A sharp stone is great. Fire is very great, especially if you are cold and trying not to die. But none of those alone explain the massive leap. The deeper change happened when humans got trapped, in the best possible way, inside long loops of cause and effect. Not just act now, eat now, survive now. But act now, wait, remember, adjust, come back, check again, fix the mess, and maybe eat in three months if you did not completely ruin the plan.&lt;/p&gt;

&lt;p&gt;That is why agriculture matters so much.&lt;/p&gt;

&lt;p&gt;Farming is not just “food but slower.” It is a completely different mental game. Hunting can involve planning, yes, but farming basically forces you to become the project manager of a very annoying and unpredictable system. You put seeds into the ground and then spend months negotiating with dirt, water, weather, insects, time, and your own bad decisions.&lt;/p&gt;

&lt;p&gt;You are no longer finding food. You are trying to convince the future to cooperate.&lt;/p&gt;

&lt;p&gt;And the future is rude.&lt;/p&gt;

&lt;p&gt;Farming forces you to track things you cannot immediately see. You have to remember what you planted, where you planted it, when you planted it, whether it got enough water, whether the season is changing, whether pests are coming, whether the river is helping or preparing to ruin your entire week. This is no longer simple reaction. This is delayed feedback. This is long-horizon thinking. This is your brain being dragged into a repeated loop of prediction, intervention, failure, correction, and trying again.&lt;/p&gt;

&lt;p&gt;That matters.&lt;/p&gt;

&lt;p&gt;Because once cognition enters those kinds of loops, it changes character. The mind is no longer just spotting opportunities in nature like some clever scavenger. It starts designing future conditions. It starts shaping the environment so reality later matches a plan that only existed in imagination. That is a much bigger deal than “human use tool.”&lt;/p&gt;

&lt;p&gt;So I would say agriculture did not create intelligence. It turned intelligence into infrastructure.&lt;/p&gt;

&lt;p&gt;That also helps explain why many animals are clearly intelligent and yet never end up building cities, irrigation systems, tax forms, or extremely depressing office software. Intelligence alone is not enough. To get civilization, at least three things need to show up together.&lt;/p&gt;

&lt;p&gt;First, you need loops that reward long-term thinking.&lt;/p&gt;

&lt;p&gt;Second, you need a way to pass useful knowledge along, so each generation does not have to restart from “what if rock but pointy?”&lt;/p&gt;

&lt;p&gt;Third, you need the ability to change the environment in ways that keep paying off over time.&lt;/p&gt;

&lt;p&gt;Without those three, intelligence stays local. It helps you survive. It helps you stay a very competent crow, octopus, wolf, or ape. But it does not become civilization. Once those three things combine, intelligence escapes the skull. It gets baked into tools, habits, systems, stories, roads, farms, laws, and all the other strange things humans build when they have too much memory and not enough chill.&lt;/p&gt;

&lt;p&gt;And this is where AI becomes interesting.&lt;/p&gt;

&lt;p&gt;Because I think we make the same mistake with AI that people make when talking about human intelligence. We see one part of the process and declare victory too early.&lt;/p&gt;

&lt;p&gt;Current AI systems are impressive, yes. Very impressive. Sometimes absurdly impressive. They predict well, generate well, imitate well, summarize well, and occasionally hallucinate with the confidence of a man explaining barbecue technique after reading half a Wikipedia page. But that does not automatically make them intelligence in the full sense.&lt;/p&gt;

&lt;p&gt;What we mostly have today are intelligence tools.&lt;/p&gt;

&lt;p&gt;That is different.&lt;/p&gt;

&lt;p&gt;A model can predict the next token, classify an image, rank options, generate code, or infer patterns from huge amounts of data. Great. But prediction alone is not the same thing as durable intelligence. That is like saying someone who can walk ten kilometers can obviously run ten kilometers. No. Walking helps. But running requires different coordination, training, adaptation, and stress handling. Same legs. Different system.&lt;/p&gt;

&lt;p&gt;AI right now is mostly at the “good legs” stage.&lt;/p&gt;

&lt;p&gt;Very good legs, to be fair.&lt;/p&gt;

&lt;p&gt;And yes, I know people love to point at one technical component and treat it like the sacred spark. ReLU, attention, scaling laws, whatever the buzzword of the season is. Those things matter. They are useful engineering breakthroughs. But no single ingredient is “the birth of intelligence.” That is like claiming the reason civilization exists is because someone once invented a better shovel. Useful, yes. Complete explanation, no.&lt;/p&gt;

&lt;p&gt;The real question is not whether a model can predict well. The real question is whether a system can enter long loops of memory, planning, action, feedback, correction, and transfer, then keep improving in a stable way over time.&lt;/p&gt;

&lt;p&gt;That is where the AGI discussion usually gets blurry.&lt;/p&gt;

&lt;p&gt;If we define AGI as “models with memory, planning, and tool use,” then congratulations, we already have that. Agentic systems exist. Tool-using systems exist. Multi-step planners exist. Memory layers exist. The problem is that this definition is so loose it is almost useless. It is like saying a bicycle and a spaceship are both transportation, so close enough.&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;We need a stricter threshold.&lt;/p&gt;

&lt;p&gt;The real jump would be something more like this: a system that can keep relevant state across long periods, learn from past mistakes in a way that becomes reusable skill, handle long multi-step goals without falling apart every time the environment changes, transfer what it learned from one task to another related task, and do all this reliably enough that it feels less like workflow glue and more like stable competence.&lt;/p&gt;

&lt;p&gt;That, to me, is the actual missing layer.&lt;/p&gt;

&lt;p&gt;Not prettier outputs.&lt;br&gt;
Not better demos.&lt;br&gt;
Not one more benchmark where the model answers history questions slightly faster than last quarter.&lt;/p&gt;

&lt;p&gt;What is missing is durable adaptive cognition.&lt;/p&gt;

&lt;p&gt;That is the point where AI would stop being mostly a smart component and start feeling more like a real cognitive system.&lt;/p&gt;

&lt;p&gt;So the distinction I would make is simple.&lt;/p&gt;

&lt;p&gt;A model is a predictor.&lt;/p&gt;

&lt;p&gt;An agentic system is a predictor plus some scaffolding, like tools, memory, or planning loops.&lt;/p&gt;

&lt;p&gt;A higher intelligence system would be something that can keep learning across time, preserve useful structure, adapt without being rebuilt every five minutes, and shape its own future performance through repeated interaction with the world.&lt;/p&gt;

&lt;p&gt;That last part matters most. Human intelligence became historically dominant because it did not stay inside the head. It got externalized into tools, memory systems, culture, infrastructure, and environmental change. If AI ever makes a similar leap, it will not be because one model gets even bigger and starts speaking in more confident paragraphs. It will be because predictive systems get embedded in persistent loops that let them remember, act, revise, transfer, and compound.&lt;/p&gt;

&lt;p&gt;So my view is this.&lt;/p&gt;

&lt;p&gt;Today’s AI is not yet the machine equivalent of civilization-level intelligence. It is closer to the tool phase. Very powerful tools, yes. Sometimes shocking tools. Sometimes tools that write code better than half the internet and worse than a tired senior engineer on a Tuesday. But still tools.&lt;/p&gt;

&lt;p&gt;The next real jump will not come from prediction alone. It will come from systems that can live inside long feedback loops and get better because of them.&lt;/p&gt;

&lt;p&gt;Basically, farming for machines. And hopefully with fewer locusts.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>machinelearning</category>
      <category>science</category>
    </item>
    <item>
      <title>I Am Tired of Fake AI Expertise</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:14:22 +0000</pubDate>
      <link>https://dev.to/marcosomma/i-am-tired-of-fake-ai-expertise-1nh8</link>
      <guid>https://dev.to/marcosomma/i-am-tired-of-fake-ai-expertise-1nh8</guid>
      <description>&lt;p&gt;I have spent the last year trying to talk about AI as an engineering discipline.&lt;/p&gt;

&lt;p&gt;Not AI as a content machine. Not AI as a growth trick. Not AI as a stream of screenshots, prompt hacks, and recycled takes written by the same models people claim to master.&lt;/p&gt;

&lt;p&gt;I mean AI as systems work.&lt;/p&gt;

&lt;p&gt;Orchestration. Validation. Data quality. Observability. Evaluation. Failure handling. Context boundaries. Retry policies. Structured outputs. Cost control at the workflow level. Real interfaces between probabilistic components and deterministic software.&lt;/p&gt;

&lt;p&gt;And honestly, part of the reason I stepped back from that conversation is simple: too much of the public AI discourse is being led by people who do not build real AI systems.&lt;/p&gt;

&lt;p&gt;They are loud. They are polished. They are confident. They are often rewarded for being confidently wrong.&lt;/p&gt;

&lt;p&gt;That is the part that disappoints me.&lt;/p&gt;

&lt;p&gt;The current wave of self proclaimed "AI experts" is flattening a difficult field into a set of cheap slogans. A domain that requires serious expertise is being turned into social media theatre. And the result is not just annoying. It is actively harmful.&lt;/p&gt;

&lt;p&gt;It is making people misunderstand what AI is, how it fails, where it costs money, and what actually makes it useful in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The field is being narrated by people who optimize for reach, not rigor
&lt;/h2&gt;

&lt;p&gt;Recently I saw yet another high visibility post making a big point about format optimization and token savings, as if shaving a few characters from JSON were some major breakthrough in AI engineering.&lt;/p&gt;

&lt;p&gt;This is the kind of thing that gets thousands of likes.&lt;/p&gt;

&lt;p&gt;A side by side screenshot.&lt;br&gt;
A catchy claim.&lt;br&gt;
A simple narrative.&lt;br&gt;
A fake sense of leverage.&lt;/p&gt;

&lt;p&gt;And once again the message was basically this: if you are still doing things the old way, you are wasting money.&lt;/p&gt;

&lt;p&gt;This is the language of marketing, not engineering.&lt;/p&gt;

&lt;p&gt;The problem is not that someone shared an imperfect idea. Imperfect ideas are fine. Early exploration is fine. Public discussion is fine. We all get things wrong.&lt;/p&gt;

&lt;p&gt;The problem is the posture of expertise around it.&lt;/p&gt;

&lt;p&gt;There is a massive difference between saying, "I tried this and here are the results, caveats, and failure modes," and saying, "Here is the better way," when the claim is based on shallow intuition, weak evidence, and no visible system level execution.&lt;/p&gt;

&lt;p&gt;That difference matters.&lt;/p&gt;

&lt;p&gt;Because a lot of people reading those posts are not experienced enough to detect the gap.&lt;/p&gt;

&lt;p&gt;They see confidence and assume competence.&lt;br&gt;
They see engagement and assume validity.&lt;br&gt;
They see a title and assume credibility.&lt;/p&gt;

&lt;p&gt;And that is how misinformation spreads in technical fields. Not through obvious lies, but through reduction. Through oversimplification. Through confident framing of weak ideas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tokens have become marketing
&lt;/h2&gt;

&lt;p&gt;One of the worst examples of this is token discourse.&lt;/p&gt;

&lt;p&gt;Tokens matter. Of course they matter. Costs matter. Latency matters. Compression matters. Input design matters.&lt;/p&gt;

&lt;p&gt;But token count has become the vanity metric of AI engineering.&lt;/p&gt;

&lt;p&gt;It is easy to post about because it looks measurable. It fits inside a screenshot. It creates a simple hero story. "Look, I reduced 40 percent of the tokens." Great. And what happened to reliability? What happened to parse consistency? What happened to failure recovery? What happened to total workflow cost after retries, validation, tool calls, and fallback paths?&lt;/p&gt;

&lt;p&gt;That is the real question.&lt;/p&gt;

&lt;p&gt;A shorter prompt is not automatically a better system.&lt;br&gt;
A smaller payload is not automatically a better architecture.&lt;br&gt;
A new text format is not automatically a better interface for a stochastic model.&lt;/p&gt;

&lt;p&gt;Sometimes saving tokens means losing robustness.&lt;br&gt;
Sometimes saving tokens means increasing ambiguity.&lt;br&gt;
Sometimes saving tokens means moving complexity downstream into validation and repair.&lt;br&gt;
Sometimes saving tokens means nothing at all, because the real cost of the system is somewhere else.&lt;/p&gt;

&lt;p&gt;This is what too many public AI voices still fail to understand.&lt;/p&gt;

&lt;p&gt;AI cost is not just prompt cost.&lt;br&gt;
AI quality is not just output prettiness.&lt;br&gt;
AI engineering is not just model interaction.&lt;/p&gt;

&lt;p&gt;The real economy of AI is at the system level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real cost is in everything around the model
&lt;/h2&gt;

&lt;p&gt;If you have ever shipped a real AI feature, you know where the effort goes.&lt;/p&gt;

&lt;p&gt;It goes into making sure the right context is available at the right moment.&lt;br&gt;
It goes into preventing irrelevant context from leaking in.&lt;br&gt;
It goes into checking whether the model output is complete, valid, safe, and usable.&lt;br&gt;
It goes into retries when the model drifts.&lt;br&gt;
It goes into routing when one step should not be handled by the same prompt as another.&lt;br&gt;
It goes into fallback strategies when the first attempt is weak.&lt;br&gt;
It goes into evaluating whether a result is acceptable before it reaches a user.&lt;br&gt;
It goes into observability so you can explain why the system behaved the way it did.&lt;br&gt;
It goes into datasets so your judgments are not based on vibes.&lt;br&gt;
It goes into data quality so the model is not forced to reason on garbage.&lt;/p&gt;

&lt;p&gt;That is where the tokens get burned.&lt;/p&gt;

&lt;p&gt;And that is correct.&lt;/p&gt;

&lt;p&gt;Those tokens are not waste. They are the cost of making a probabilistic component useful inside a product.&lt;/p&gt;

&lt;p&gt;This is what so much AI content gets backwards. It treats the model call as the whole system. It assumes the right prompt is the product. It implies that if you phrase the question well enough, the problem is solved.&lt;/p&gt;

&lt;p&gt;That is not how production works.&lt;/p&gt;

&lt;p&gt;A prompt is an input. A model is a stochastic component. A product is a controlled system around them.&lt;/p&gt;

&lt;p&gt;If you collapse those distinctions, you are not doing AI engineering. You are gambling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompting alone is not engineering. It is gambling.
&lt;/h2&gt;

&lt;p&gt;I keep repeating this line because I think it cuts to the center of the problem.&lt;/p&gt;

&lt;p&gt;Prompting alone is not engineering. It is gambling.&lt;/p&gt;

&lt;p&gt;Yes, prompting matters. Yes, prompt design can improve outcomes. Yes, well structured instructions can reduce confusion and guide the model.&lt;/p&gt;

&lt;p&gt;But prompting is not a substitute for architecture.&lt;/p&gt;

&lt;p&gt;It is not a substitute for validation.&lt;br&gt;
It is not a substitute for proper interfaces.&lt;br&gt;
It is not a substitute for evaluation.&lt;br&gt;
It is not a substitute for state control.&lt;br&gt;
It is not a substitute for business rules.&lt;br&gt;
It is not a substitute for deterministic code where deterministic code should exist.&lt;/p&gt;

&lt;p&gt;And yet an absurd amount of public AI discourse still acts as if prompting is the main skill. As if being fluent in prompt phrasing is equivalent to understanding AI systems.&lt;/p&gt;

&lt;p&gt;It is not.&lt;/p&gt;

&lt;p&gt;A person can be very good at prompting and still have almost no understanding of reliability engineering, retrieval quality, orchestration design, evaluation methodology, observability, or failure containment.&lt;/p&gt;

&lt;p&gt;That is why I have become increasingly skeptical of AI advice that starts and ends with "here is a better prompt."&lt;/p&gt;

&lt;p&gt;A better prompt for what?&lt;br&gt;
Under which constraints?&lt;br&gt;
With what model?&lt;br&gt;
Against which dataset?&lt;br&gt;
Measured how?&lt;br&gt;
Compared to what baseline?&lt;br&gt;
Under what latency budget?&lt;br&gt;
With what failure rate?&lt;br&gt;
With what retry policy?&lt;br&gt;
Inside what workflow?&lt;br&gt;
At what scale?&lt;br&gt;
For which users?&lt;br&gt;
Against which acceptance criteria?&lt;/p&gt;

&lt;p&gt;Without those questions, we are not discussing engineering. We are discussing prompt aesthetics.&lt;/p&gt;

&lt;h2&gt;
  
  
  2025 was the year of demos. 2026 should be different.
&lt;/h2&gt;

&lt;p&gt;I can understand how we got here.&lt;/p&gt;

&lt;p&gt;In 2025, the industry was still drunk on demos. That phase made sense. Everything felt new. Chat interfaces looked magical. People discovered that a model could generate code, write marketing copy, extract structure from text, summarize documents, and imitate expertise with frightening smoothness.&lt;/p&gt;

&lt;p&gt;So of course the conversation was dominated by novelty.&lt;/p&gt;

&lt;p&gt;People were exploring.&lt;br&gt;
People were guessing.&lt;br&gt;
People were posting every new trick they found.&lt;br&gt;
The market rewarded velocity, not discipline.&lt;/p&gt;

&lt;p&gt;Fine.&lt;/p&gt;

&lt;p&gt;But we are not there anymore.&lt;/p&gt;

&lt;p&gt;In 2026, this excuse is weaker. We have already seen enough failures, hallucinations, broken agents, fake automation, and "AI powered" wrappers to know that prompting your way through complexity does not scale.&lt;/p&gt;

&lt;p&gt;We should be having better conversations by now.&lt;/p&gt;

&lt;p&gt;We should be talking more about evaluation design than prompt poetry.&lt;br&gt;
We should be talking more about system boundaries than persona tuning.&lt;br&gt;
We should be talking more about retrieval quality than format gimmicks.&lt;br&gt;
We should be talking more about workflow control than chatbot charisma.&lt;/p&gt;

&lt;p&gt;Instead, too many large accounts are still posting beginner level content with expert level confidence.&lt;/p&gt;

&lt;p&gt;That is not harmless. It distorts the learning environment for everyone coming into the field.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is why so much AI still does not work
&lt;/h2&gt;

&lt;p&gt;A lot of people ask why AI products still feel fragile.&lt;/p&gt;

&lt;p&gt;Why do they fail on edge cases?&lt;br&gt;
Why do they break in production?&lt;br&gt;
Why do they look impressive in demos and weak in real usage?&lt;br&gt;
Why do teams burn money without creating durable value?&lt;br&gt;
Why do so many "agents" look like wrappers with marketing?&lt;/p&gt;

&lt;p&gt;This is part of the answer.&lt;/p&gt;

&lt;p&gt;Because too many people still think AI is an oracle.&lt;/p&gt;

&lt;p&gt;They still approach it like a mystical reasoning engine that only needs the right wording. They still believe the model is the product. They still imagine that clever prompting is a replacement for engineering discipline.&lt;/p&gt;

&lt;p&gt;So they underinvest in everything that actually makes the system work.&lt;/p&gt;

&lt;p&gt;They underinvest in ground truth data.&lt;br&gt;
They underinvest in evals.&lt;br&gt;
They underinvest in routing logic.&lt;br&gt;
They underinvest in structured interfaces.&lt;br&gt;
They underinvest in observability.&lt;br&gt;
They underinvest in negative testing.&lt;br&gt;
They underinvest in validation.&lt;br&gt;
They underinvest in deterministic controls.&lt;/p&gt;

&lt;p&gt;Then they are surprised when the system behaves like a stochastic component with partial competence and unstable boundaries.&lt;/p&gt;

&lt;p&gt;That surprise is not a model failure. It is a design failure.&lt;/p&gt;

&lt;p&gt;AI does not fail because it is useless.&lt;br&gt;
AI fails because people keep trying to deploy it as magic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expertise should be demonstrated, not announced
&lt;/h2&gt;

&lt;p&gt;The most frustrating part is not being wrong. Everyone is wrong sometimes.&lt;/p&gt;

&lt;p&gt;The most frustrating part is the performance of expertise.&lt;/p&gt;

&lt;p&gt;The field is full of titles, badges, self descriptions, and aesthetic authority. "Top voice." "AI expert." "Thought leader." "Award winning." Fine. None of that tells me whether you understand evaluation drift, state leakage, retrieval contamination, schema reliability, fallback routing, or cost accumulation across a multi step pipeline.&lt;/p&gt;

&lt;p&gt;Show me the system.&lt;br&gt;
Show me the logs.&lt;br&gt;
Show me the benchmark.&lt;br&gt;
Show me the constraints.&lt;br&gt;
Show me the failure modes.&lt;br&gt;
Show me the tradeoffs.&lt;br&gt;
Show me the production scars.&lt;/p&gt;

&lt;p&gt;That is what builds credibility.&lt;/p&gt;

&lt;p&gt;I trust practitioners who expose uncertainty and show their work. I trust people who can explain not just what succeeded, but what broke and why. I trust engineers who understand that AI is not one prompt and one output, but an unstable component that becomes useful only when surrounded by structure.&lt;/p&gt;

&lt;p&gt;I do not trust polished certainty without evidence.&lt;/p&gt;

&lt;p&gt;And I think more of us need to say that openly.&lt;/p&gt;

&lt;h2&gt;
  
  
  We need less AI theatre and more systems thinking
&lt;/h2&gt;

&lt;p&gt;This article is not a call to stop experimenting. It is the opposite.&lt;/p&gt;

&lt;p&gt;Experiment more. Build more. Test more. Share results more.&lt;/p&gt;

&lt;p&gt;But stop pretending that shallow takes are deep expertise.&lt;br&gt;
Stop teaching people that token screenshots are system design.&lt;br&gt;
Stop selling prompting as if it were engineering.&lt;br&gt;
Stop flattening a hard field into content loops.&lt;/p&gt;

&lt;p&gt;If you want better AI products, treat AI like what it is: a probabilistic system component that must be constrained, validated, observed, and integrated with care.&lt;/p&gt;

&lt;p&gt;That is less sexy than "10 prompts that changed my workflow."&lt;br&gt;
It is less viral than side by side screenshots.&lt;br&gt;
It is less accessible than fake certainty.&lt;/p&gt;

&lt;p&gt;But it is real.&lt;/p&gt;

&lt;p&gt;And right now, real is exactly what this field needs more of.&lt;/p&gt;

&lt;p&gt;Because the problem is no longer that AI is misunderstood by outsiders.&lt;/p&gt;

&lt;p&gt;The problem is that too much of it is being misexplained by insiders.&lt;/p&gt;

&lt;p&gt;If we want the field to mature, we need fewer self proclaimed experts and more actual practitioners.&lt;/p&gt;

&lt;p&gt;Not louder people.&lt;br&gt;
Better ones!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>career</category>
    </item>
    <item>
      <title>The Old Seniority Definition Is Collapsing</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Thu, 05 Mar 2026 08:40:59 +0000</pubDate>
      <link>https://dev.to/marcosomma/the-old-seniority-definition-is-collapsing-12lj</link>
      <guid>https://dev.to/marcosomma/the-old-seniority-definition-is-collapsing-12lj</guid>
      <description>&lt;p&gt;For a long time, “senior developer” was a fairly consistent signal. You expected someone who could hold a large architecture in their head, write clean code with low defect rates, debug almost anything, and reason about performance without guesswork. That bundle made sense because the hardest part of shipping software was often the execution layer: translating intent into correct, maintainable code at speed.&lt;/p&gt;

&lt;p&gt;That bundle is breaking.&lt;/p&gt;

&lt;p&gt;AI-assisted development is compressing the cost of producing plausible, working code. Not always. Not uniformly. But enough that “I can ship a lot of code quickly” is no longer a reliable proxy for deep seniority. In many teams, velocity metrics are starting to measure who is best at driving the tool, not who is best at building systems that survive contact with reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Is Actually Commoditizing
&lt;/h2&gt;

&lt;p&gt;AI is not replacing engineering. It is discounting a specific slice of it: first-pass implementation and the mechanical parts of refactoring. The tool is good at producing code that looks right, compiles, and often passes superficial tests. That changes the economics of execution.&lt;/p&gt;

&lt;p&gt;What does not get discounted at the same rate is integration into a real system with real constraints: data contracts, failure modes, security boundaries, observability, and long-term maintenance. In practice, the bottleneck shifts from typing to supervision. You spend less time writing and more time specifying, verifying, reviewing, and correcting.&lt;/p&gt;

&lt;p&gt;This is why you can see two realities at the same time. Some developers experience dramatic speedups on bounded tasks. Others experience slowdowns inside large, messy codebases because prompting, waiting, and review overhead replace keystrokes, and because the model lacks the local context that makes a patch truly correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Rising in Value
&lt;/h2&gt;

&lt;p&gt;Problem decomposition and system thinking become the differentiator because they convert ambiguity into an executable plan. When you are dealing with something like regulatory delta detection, the hardest part is not writing code. The hardest part is deciding where the complexity actually lives, and what you must make explicit so the system stays correct as the domain evolves. The choice between a graph database and a simpler model is rarely a “tech taste” debate. It is a tradeoff between query expressiveness, operational burden, debuggability, and change management.&lt;/p&gt;

&lt;p&gt;Judgment under uncertainty becomes a senior marker because architecture is mostly irreversible decisions made with incomplete information. Moving from direct graph writes to a changeset-based approach with content hashing is not an implementation detail. It is a bet on how you will observe change, roll back safely, explain behavior to customers, and avoid silent drift. That decision quality is what compounds over months.&lt;/p&gt;

&lt;p&gt;Context and domain mastery become a moat because they are earned, not generated. If you understand how CELEX identifiers behave in practice, how MiCAR compliance maps to document reality, or how jurisdictions interpret rules differently, you carry constraints that materially shape the architecture. AI can help you express that knowledge. It cannot reliably invent it. Without domain context, you get confident code that is wrong in the ways that matter.&lt;/p&gt;

&lt;p&gt;Technical leadership becomes central because building systems is increasingly a multiplayer game. The question is whether you can create a design that other people can implement without constant back-and-forth, and whether you can write specifications that converge rather than fork. This is why a workshop like SDD Pills matters. It trains decision-making and clarity, not syntax.&lt;/p&gt;

&lt;p&gt;Mentoring and knowledge transfer become leverage because the highest-value output of a senior engineer is often the improvement of everyone else’s output. AI amplifies this. Teams that learn how to bound AI usage with clear contracts, acceptance criteria, and review discipline get compounding returns. Teams that treat AI as an oracle get compounding debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth: Two Axes Have Split
&lt;/h2&gt;

&lt;p&gt;There are now two skill axes that used to correlate and no longer do.&lt;/p&gt;

&lt;p&gt;One axis is technical depth: how well you understand systems, tradeoffs, failure modes, and the long-term consequences of design choices.&lt;/p&gt;

&lt;p&gt;The other axis is execution speed: how quickly you can produce working code.&lt;/p&gt;

&lt;p&gt;Historically, depth and speed often moved together. Deep engineers tended to execute quickly because they saw the path. Today, you can get high speed with low depth by delegating thinking to the tool. That can look senior on dashboards and in weekly updates. It is not senior if the output is brittle, unobservable, and expensive to maintain.&lt;/p&gt;

&lt;p&gt;The inverse also exists: high depth with lower raw output speed can still be very senior if the person consistently makes decisions that reduce risk, eliminate classes of bugs, and increase team throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Breaks in Hiring and Promotion
&lt;/h2&gt;

&lt;p&gt;Many organizations still reward visible output: commits, tickets closed, apparent velocity. AI makes these signals noisier because the cost of producing code has dropped, while the cost of validating correctness has often increased. The net effect is that the old metrics over-credit the wrong behaviors and under-credit the work that actually keeps systems stable.&lt;/p&gt;

&lt;p&gt;The evaluation problem is that “code shipped” is no longer tightly coupled to “engineering done.” A senior engineer in 2026 is often the person who prevented the incident you never had, removed an entire category of future work by designing the right abstraction, or wrote a spec that made five people productive instead of confused.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Measure Instead
&lt;/h2&gt;

&lt;p&gt;The most useful seniority markers become visible if you look for decision quality, not output quantity.&lt;/p&gt;

&lt;p&gt;A senior engineer can take an ambiguous problem and produce a specification that is testable and unambiguous. They can make uncertainty explicit by stating what is known, what is assumed, and what the cost of being wrong looks like. They consistently surface non-functional requirements early, especially observability, maintainability, and security, because those are the constraints that explode later.&lt;/p&gt;

&lt;p&gt;They use AI as a bounded tool. They know when to ask it for a scaffold, when to demand alternatives, and when to reject a suggestion because they understand the scaling and failure modes. Patterns like Planner, Executor, Reviewer work when they are treated as control systems with clear acceptance criteria, not as theater.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why “Senior” Is Drifting Toward “Principal”
&lt;/h2&gt;

&lt;p&gt;Role expectations are shifting. Senior used to mean “I can personally deliver complex work.” Increasingly it means “I can make the right decisions and increase the output quality of everyone around me.” That is closer to what many companies used to call principal or architect.&lt;/p&gt;

&lt;p&gt;This shift is healthy if organizations adapt their evaluation criteria. It is painful if they do not. People whose main advantage was fast execution will feel the floor drop out, because execution has been discounted. People who were already strong in decomposition, judgment, and leadership will become more valuable, because those skills are now the constraint.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I’m Seeing in Teams
&lt;/h2&gt;

&lt;p&gt;The developers adapting best to AI-assisted development are usually the ones who already had strong mental models and strong taste. They can turn ambiguity into constraints, and constraints into evaluation. They do not confuse “working code” with “correct system.” They treat AI output as a hypothesis that must be verified against invariants.&lt;/p&gt;

&lt;p&gt;The developers struggling are often those who outsource thinking. They can generate a lot of code quickly, but they cannot defend why the design is correct, what it will cost to operate, or how it will fail.&lt;/p&gt;

&lt;p&gt;If you are seeing a blur between depth and apparent execution speed, that blur is real. The solution is not to ban AI or to worship it. The solution is to change what you reward, and to interview and promote for the skills that actually compound.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>javascript</category>
    </item>
    <item>
      <title>LLMs Are Not Deterministic. And Making Them Reliable Is Expensive (In Both the Bad Way and the Good Way)</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Sun, 22 Feb 2026 14:24:05 +0000</pubDate>
      <link>https://dev.to/marcosomma/llms-are-not-deterministic-and-making-them-reliable-is-expensive-in-both-the-bad-way-and-the-good-5bo4</link>
      <guid>https://dev.to/marcosomma/llms-are-not-deterministic-and-making-them-reliable-is-expensive-in-both-the-bad-way-and-the-good-5bo4</guid>
      <description>&lt;p&gt;Let’s start with a statement that should be obvious but still feels controversial: Large Language Models are not deterministic systems. They are probabilistic sequence predictors. Given a context, they sample the next token from a probability distribution. That is their nature. There is no hidden reasoning engine, no symbolic truth layer, no internal notion of correctness.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can influence their behavior. You can constrain it. You can shape it. But you cannot turn probability into certainty.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Somewhere between keynote stages, funding decks, and product demos, a comforting narrative emerged: models are getting cheaper and smarter, therefore AI will soon become trivial. The logic sounds reasonable. Token prices are dropping. Model quality is improving. Demos look impressive. From the outside, it feels like we are approaching a phase where AI becomes a solved commodity.&lt;/p&gt;

&lt;p&gt;From the inside, it feels very different.&lt;/p&gt;

&lt;p&gt;There is a massive gap between a good demo and a reliable product. A demo is usually a single prompt and a single model call. It looks magical. It sells. A product cannot live there. The moment you try to ship that architecture to real users, reality shows up fast. The model hallucinates. It partially answers. It ignores constraints. It produces something that sounds fluent but is subtly wrong. And the model has no idea it failed.&lt;/p&gt;

&lt;p&gt;This is not a moral flaw. It is a design property.&lt;/p&gt;

&lt;p&gt;So engineers do what engineers always do when a component is powerful but unreliable. They build structure around it.&lt;/p&gt;

&lt;p&gt;The moment you care about reliability, your architecture stops being “call an LLM” and starts becoming a pipeline. Input is cleaned and normalized. A generation step produces a candidate answer. Another step evaluates that answer. A routing layer decides whether the answer is acceptable or if the system should try again. Sometimes it retries with a modified prompt. Sometimes with a different model. Sometimes with a corrective pass. Only after this loop does something reach the user.&lt;/p&gt;

&lt;p&gt;At no point did the LLM become deterministic. What changed is that the system gained control loops.&lt;/p&gt;

&lt;p&gt;This distinction matters. We are not converting probability into certainty. We are reducing uncertainty through redundancy and validation. That reduction costs computation. Computation costs money.&lt;/p&gt;

&lt;p&gt;This is why quoting token prices in isolation is misleading. A single model call might be cheap. A serious system rarely uses a single call. One user request can trigger several model invocations: generation, evaluation, regeneration, formatting, tool calls, memory lookups. The user experiences “one answer.” The backend executes a small workflow.&lt;/p&gt;

&lt;p&gt;Token cost is component cost. Reliable AI is system cost.&lt;/p&gt;

&lt;p&gt;Saying “tokens are cheap, therefore AI is cheap” is like saying screws are cheap, therefore airplanes are cheap.&lt;/p&gt;

&lt;p&gt;This leads to an uncomfortable but important truth. AI becomes expensive in two very different ways.&lt;/p&gt;

&lt;p&gt;If you implement it poorly, it becomes expensive because you burn money and still do not get reliability. You keep tweaking prompts. You keep firefighting. You keep patching symptoms. Nothing stabilizes.&lt;/p&gt;

&lt;p&gt;If you implement it well, it becomes expensive because you intentionally pay for control. You pay for evaluators. You pay for retries. You pay for observability. You pay for redundancy. But you get something in return: a system that behaves in a bounded, inspectable, and improvable way.&lt;/p&gt;

&lt;p&gt;There is no cheap version of “reliable.”&lt;/p&gt;

&lt;p&gt;Another source of confusion comes from mixing up different kinds of expertise. High-profile founders and executives are excellent at describing futures. They talk about where markets are going and what will be possible. That is their role. It is not their role to debug why an evaluator prompt leaks instructions or why a routing threshold oscillates under load. Money success does not imply operational intimacy.&lt;/p&gt;

&lt;p&gt;On the ground, building serious AI feels much closer to distributed systems engineering than to science fiction. You worry about data quality. You worry about regressions. You worry about latency and cost per request. You design schemas. You version prompts. You inspect traces. You run benchmarks. You tune thresholds. It is slow, unglamorous, and deeply technical.&lt;/p&gt;

&lt;p&gt;LLMs made AI more accessible. They did not make serious AI simpler. They shifted complexity upward into systems.&lt;/p&gt;

&lt;p&gt;So when someone says, “Soon we’ll just call an API and everything will work,” what they usually mean is, “Soon an enormous amount of engineering will be hidden behind that API.”&lt;/p&gt;

&lt;p&gt;That is fine. That is progress.&lt;/p&gt;

&lt;p&gt;But pretending that reliable AI is cheap, trivial, or solved is misleading.&lt;/p&gt;

&lt;p&gt;The honest version is this: LLMs are powerful probabilistic components. Turning them into dependable products requires layers of control. Those layers cost money. They also create real value.&lt;/p&gt;

&lt;p&gt;Serious AI today is expensive in the bad way if you do not know what you are doing.&lt;/p&gt;

&lt;p&gt;Serious AI today is expensive in the good way if you actually want it to work.&lt;/p&gt;

&lt;p&gt;And anyone selling “cheap deterministic AI” is selling a story, not a system.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>programming</category>
      <category>llm</category>
    </item>
    <item>
      <title>Adversarial Planning for Spec Driven Development</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Thu, 12 Feb 2026 21:39:13 +0000</pubDate>
      <link>https://dev.to/marcosomma/adversarial-planning-for-spec-driven-development-4c3n</link>
      <guid>https://dev.to/marcosomma/adversarial-planning-for-spec-driven-development-4c3n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I have always loved one idea in machine learning. The idea that you can sharpen a model by forcing it to face a challenger. You can call it adversarial training, red teaming, or constructive hostility. The name matters less than the mechanism. You introduce pressure. You int&lt;br&gt;
roduce disagreement. You force the system to earn its confidence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For years I kept that concept in a mental drawer labeled “cool, but academic.” Then it become a core concept within the Orka-reasoning development but lately my attention is shifting toward code agent workflows and how all happen. Not the marketing version. The real version, where you sit down to ship software, and you realize that a helpful model is not the same thing as a rigorous model. Helpful is easy. Rigorous is costly.&lt;/p&gt;

&lt;p&gt;This article is about how I tried to transplant an adversarial dynamic into Spec Driven Development sessions. Not as theater. Not as an AI debate club. As an engineering tool. It worked. It also nearly became a token-burning trap. That tradeoff is the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Spec Driven Development means here
&lt;/h2&gt;

&lt;p&gt;Spec Driven Development, or SDD, is a workflow where the spec is not documentation. The spec is the product of the thinking phase. It becomes the contract you implement against. You write it before code changes. You review it like you would review code. You use it to force scope, constraints, interfaces, and acceptance criteria into something explicit.&lt;/p&gt;

&lt;p&gt;The point is not to be verbose. The point is to move ambiguity upstream, when it is still cheap. The spec becomes the unit of alignment, review, and iteration. Code is the execution of that spec, not the place where you discover what the spec should have been.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with a trustable planner
&lt;/h2&gt;

&lt;p&gt;If you use an LLM as a planning companion, you know the feeling. You came with your idea. You debate a bit. And then it gives you a plan. The plan is detailed. It is readable. It sounds plausible. It often includes little snippets that look like they belong in your codebase, even when they do not. It is confident. It is fast.&lt;/p&gt;

&lt;p&gt;And that is exactly the problem.&lt;/p&gt;

&lt;p&gt;A planner model has incentives you did not explicitly set. Its default incentive is to be useful to you in the moment. It wants to reduce friction. It wants to keep you engaged. It wants to produce something that reads like progress.&lt;/p&gt;

&lt;p&gt;So it will fill gaps with assumptions. If your own initial plan is fluffy. It will smooth rough edges. It will complete the pattern of what a good plan should look like. It will also happily unlock future possibilities, because possibilities are cheap to generate and expensive to invalidate.&lt;/p&gt;

&lt;p&gt;When you are deep in a product, that behavior is dangerous. Not because the model is malicious. Because it is compliant. It will often accept your framing even if your framing is wrong. It will not push hard unless you force it to.&lt;/p&gt;

&lt;p&gt;This is the failure mode I kept hitting. I would craft a plan with the planner. I would feel momentum. Then I would start implementation and discover that the plan was under-specified in the only places that matter.&lt;/p&gt;

&lt;p&gt;Interfaces were vague. Invariants were missing. Acceptance criteria were soft. The plan assumed the architecture could absorb a change without showing how. It assumed the code was more modular than it actually was. It assumed integration would be straightforward.&lt;/p&gt;

&lt;p&gt;In other words, it was a nice plan. It was not a plan that survived contact with a real codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why adversarial dynamics work in ML
&lt;/h2&gt;

&lt;p&gt;Adversarial training is interesting because it makes weakness visible. You do not improve a system by praising it. You improve it by exposing it to inputs that exploit its blind spots. You force it to fail in ways that are informative.&lt;/p&gt;

&lt;p&gt;In a GAN, the generator learns because the discriminator is not polite. The discriminator does not care about your feelings. It cares about whether the output holds up under scrutiny. That pressure creates signal.&lt;/p&gt;

&lt;p&gt;In engineering, we already do this. Code review is adversarial when it is healthy. Testing is adversarial by definition. Security review is adversarial. Load testing is adversarial. Even a good product manager is adversarial at the right moments.&lt;/p&gt;

&lt;p&gt;But planning often is not. Planning often becomes a social process. People nod. People optimize for alignment. People avoid being the blocker. Under time pressure, that tendency gets amplified.&lt;/p&gt;

&lt;p&gt;If you bring an LLM into planning and you let it be the agreeable teammate, you amplify the most comfortable version of planning. You pay tokens to make yourself feel certain.&lt;/p&gt;

&lt;p&gt;That is not what I wanted. I wanted the planning stage to contain more of the pain, so implementation contains less.&lt;/p&gt;

&lt;h2&gt;
  
  
  The translation to SDD: Planner plus Architect
&lt;/h2&gt;

&lt;p&gt;I kept the planner. I did not replace it. The planner is good at structure. It is good at decomposing a vague goal into sequential work. It is good at producing a spec you can follow. It is good at holding context across iterations.&lt;/p&gt;

&lt;p&gt;But I introduced a second role. I call it the Architect. The job is simple. Challenge the plan as if you are the most annoying senior engineer in the room, with one constraint. The criticism must be grounded. It must point to specific failure modes. It must force explicit decisions.&lt;/p&gt;

&lt;p&gt;The Architect pushes on the places where the planner tends to glide over reality. It asks what the boundary of the change really is. It asks what breaks if you do it, and what breaks if you do not. It pressures you to name the coupling you are creating and the coupling you are relying on. It attacks the parts of the spec that sound confident but are not falsifiable.&lt;/p&gt;

&lt;p&gt;This role is unpleasant. It is supposed to be unpleasant. It is also productive, if you keep it under control.&lt;/p&gt;

&lt;p&gt;The immediate effect was obvious. Specs became harder to write. My initial drafts got rejected more often. I had to define outcomes in tighter language. I had to stop relying on vibes and start writing constraints.&lt;/p&gt;

&lt;p&gt;The less obvious effect was more important. I started noticing the difference between a plan that sounds implementable and a plan that is falsifiable.&lt;/p&gt;

&lt;p&gt;A falsifiable plan is one where you can point at a step and say: if this condition does not hold, the step is wrong. If the step is wrong, we know why. We can adjust.&lt;/p&gt;

&lt;p&gt;A non-falsifiable plan is one where every step is elastic. You can always reinterpret it. You can always claim partial success. It is planning as comfort.&lt;/p&gt;

&lt;p&gt;The Architect hates comfort. That is the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Architect actually improved
&lt;/h2&gt;

&lt;p&gt;It did not make my system magically correct. It made my system explicit.&lt;/p&gt;

&lt;p&gt;It reduced scope creep because it forced me to define what done means in terms of observable outcomes. It reduced hidden coupling because it forced me to identify which pieces of the system now move together. It reduced abstraction drift because it forced me to state which module owns which responsibility. It improved testability because it pushed me to name the failure cases the system must catch and the layer that must catch them. It also lowered integration fantasies by making me draw the dependency edges in plain language.&lt;/p&gt;

&lt;p&gt;This matters because most planning failures are not about missing steps. They are about missing friction. You only discover friction when someone tries to break your plan.&lt;/p&gt;

&lt;p&gt;A planner rarely tries to break your plan. An Architect lives for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model: controlled adversarial pressure
&lt;/h2&gt;

&lt;p&gt;At some point I realized the dynamic I was building was not adversarial planning. It was controlled adversarial pressure.&lt;/p&gt;

&lt;p&gt;Pressure is good when it produces signal. Pressure is bad when it produces noise.&lt;/p&gt;

&lt;p&gt;The Architect can easily produce noise. It can challenge everything. It can question the existence of the feature. It can spiral into meta debates. It can do the classic senior engineer move of turning every change into a referendum on architecture.&lt;/p&gt;

&lt;p&gt;That is why this approach can become dangerous. It is not just about tokens. It is about cognitive load. Too much adversarial pressure makes you doubt everything. You stop shipping. You start ruminating. You start optimizing a plan instead of building the thing.&lt;/p&gt;

&lt;p&gt;So the key is control. You want the Architect to challenge the plan in a bounded way, then you move on.&lt;/p&gt;

&lt;p&gt;The only sustainable use is somewhere in the middle. You let it break your plan until the breakage becomes repetitive. When the criticism starts looping, it is done. That loop is your stop signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the infinite loop happens
&lt;/h2&gt;

&lt;p&gt;I learned this the hard way. The Architect is very good at finding the next critique, even when the plan is already good enough, even when the remaining critiques are marginal.&lt;/p&gt;

&lt;p&gt;There are two reasons.&lt;/p&gt;

&lt;p&gt;First, LLMs are generative machines. They can always produce another objection. The space of objections is large. Many objections are plausible. Plausible is not the same as important.&lt;/p&gt;

&lt;p&gt;Second, adversarial roles reward themselves. When the Architect produces a clever critique, it feels like progress. It feels like rigor. It feels like you are doing serious engineering. You can get addicted to that feeling, especially if you already equate doubt with intelligence.&lt;/p&gt;

&lt;p&gt;So you need stop conditions that are not emotional. You need boundaries that are mechanical.&lt;/p&gt;

&lt;p&gt;Time is a boundary. Token budget is a boundary. The best boundary is value.&lt;/p&gt;

&lt;p&gt;The question is: does this criticism point to a concrete failure mode that is likely in this codebase, in this release, under these constraints. If yes, incorporate it. If no, write it down as a future consideration and move on.&lt;/p&gt;

&lt;p&gt;That discipline sounds simple. It is not. It requires you to accept that you will ship with risk. It requires you to prefer explicit risk over imagined safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this helps SDD specifically
&lt;/h2&gt;

&lt;p&gt;SDD is already an attempt to move thinking earlier. You spend more effort defining the work before coding. That sounds obvious. It is not common.&lt;/p&gt;

&lt;p&gt;Many teams code first, then retrofit clarity. Specs become documentation after the fact. Tests become a safety net after the mistakes.&lt;/p&gt;

&lt;p&gt;SDD flips that. You try to make the spec the forcing function. The spec becomes the contract. The spec becomes the review surface. The spec becomes the artifact you can reason about without running the entire system in your head.&lt;/p&gt;

&lt;p&gt;If your spec is weak, SDD collapses into bureaucracy. You get long documents that do not prevent failures. You get ceremonial approval. You get a spec that exists, but does not constrain the outcome.&lt;/p&gt;

&lt;p&gt;The adversarial role helps because it forces the spec to earn its existence. It forces explicit interfaces. It forces explicit invariants. It forces explicit failure handling. It forces explicit success conditions. It makes the spec testable in a reasoning sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Doubt as a tool, doubt as a poison
&lt;/h2&gt;

&lt;p&gt;There is a psychological aspect here that I did not expect.&lt;/p&gt;

&lt;p&gt;When you introduce an adversarial voice into planning, you introduce doubt. That can be healthy. It can also be corrosive.&lt;/p&gt;

&lt;p&gt;Healthy doubt looks like this. You have a plan. You expose it to pressure. You find the weak points. You fix them. You ship with more confidence because your confidence is earned.&lt;/p&gt;

&lt;p&gt;Corrosive doubt looks like this. You have a plan. You expose it to pressure. The pressure never ends. You start believing that every plan is fragile. You stop trusting your ability to decide. You keep rewriting the plan to reduce anxiety. You ship nothing.&lt;/p&gt;

&lt;p&gt;The difference is not intelligence. The difference is boundaries.&lt;/p&gt;

&lt;p&gt;In a team, boundaries are social. Someone ends the meeting. Someone says enough, we decide. Someone accepts risk explicitly.&lt;/p&gt;

&lt;p&gt;In a solo workflow with agents, you need to manufacture that boundary. Otherwise the system will drift toward endless review because endless review feels safer than a decision.&lt;/p&gt;

&lt;p&gt;If you are prone to overthinking, an adversarial agent can amplify that trait. It can turn careful into paralyzed. That is not a reason to avoid it. It is a reason to instrument it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;This is not asking an AI to argue with itself and then picking a side. That is entertainment. It can be useful for brainstorming. It is not a development methodology.&lt;/p&gt;

&lt;p&gt;This is not letting the Architect design the system. That is just outsourcing. The Architect is a critic, not a creator.&lt;/p&gt;

&lt;p&gt;This is not making the Architect mean. Mean is cheap. Precision is expensive. You want precision tied to concrete failure modes.&lt;/p&gt;

&lt;p&gt;This is also not a replacement for real review. A human senior engineer with context will catch things an LLM will miss. The point here is to raise your baseline. The point is to catch the obvious architecture risks before you waste days implementing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical outcome
&lt;/h2&gt;

&lt;p&gt;The measurable outcome for me was simple.&lt;/p&gt;

&lt;p&gt;I rewrote fewer specs mid-implementation. I discovered fewer “we forgot that” moments. I spent less time refactoring because of missing boundaries. I argued less with my future self.&lt;/p&gt;

&lt;p&gt;The spec still does not become perfect. The spec fails earlier, on paper, when failure is cheap. That is what adversarial pressure buys you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The simplest way to frame it
&lt;/h2&gt;

&lt;p&gt;Your planner optimizes for completeness. Your Architect optimizes for survivability.&lt;/p&gt;

&lt;p&gt;Completeness is about covering steps. Survivability is about covering reality.&lt;/p&gt;

&lt;p&gt;A complete plan can still die on a hidden assumption. A survivable plan is one where assumptions are visible, bounded, and either validated or consciously accepted.&lt;/p&gt;

&lt;p&gt;The adversarial role does not need to make you pessimistic. It needs to make you explicit. If it makes you pessimistic, you let it run too long.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sane engineer the disagreement
&lt;/h2&gt;

&lt;p&gt;Good engineering requires disagreement. Not constant fighting. Not performative contrarianism. Real disagreement that targets risk.&lt;/p&gt;

&lt;p&gt;In teams, disagreement is expensive socially. With agents, disagreement is expensive computationally. The price changes. The dynamics stay.&lt;/p&gt;

&lt;p&gt;If you can engineer disagreement so that it is bounded, precise, and tied to concrete failure modes, you get a sharper process. You get better specs. You get fewer surprises.&lt;/p&gt;

&lt;p&gt;If you cannot bound it, you get the worst of both worlds. You get more doubt and less shipping.&lt;/p&gt;

&lt;p&gt;So adopt the adversarial phase, but treat it like a test suite. You run it to catch failures. You do not run it forever because you enjoy watching it fail.&lt;/p&gt;

&lt;p&gt;Controlled adversarial pressure. Enough to sharpen. Not enough to cut.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>javascript</category>
    </item>
    <item>
      <title>How I accidentally start SDD by failing at prompts for six months</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Sat, 07 Feb 2026 12:12:01 +0000</pubDate>
      <link>https://dev.to/marcosomma/how-i-accidentally-start-sdd-by-failing-at-prompts-for-six-months-477l</link>
      <guid>https://dev.to/marcosomma/how-i-accidentally-start-sdd-by-failing-at-prompts-for-six-months-477l</guid>
      <description>&lt;h3&gt;
  
  
  The confession
&lt;/h3&gt;

&lt;p&gt;I spent the first six months of serious AI pair programming producing what I now call vibe architecture.&lt;/p&gt;

&lt;p&gt;You know the pattern. You open a chat with a strong model. You explain what you want. It produces clean code fast. You feel productive. Three weeks later the repo looks like it was designed by five different people, on five different days, with five different mental models.&lt;/p&gt;

&lt;p&gt;Each file is locally correct. The system is globally confused.&lt;/p&gt;

&lt;p&gt;I would plan with the model in one session. I would implement in another. By step five the implementation had drifted far enough that the plan was basically historical fiction. Then I would come back after a weekend and lose the thread. Not because the model did something wrong. It did exactly what I asked at each moment. The issue was continuity. Nobody was holding the bar across moments.&lt;/p&gt;

&lt;p&gt;That loop repeated across multiple projects, including the first months of building OrKa largely solo. I learned something obvious in hindsight. The problem was not output quality. The problem was the absence of a development system that keeps output coherent over time.&lt;/p&gt;

&lt;p&gt;That is when I stopped chasing better prompts and started building better constraints.&lt;/p&gt;

&lt;p&gt;Out of that shift, I ended up with a working methodology. People have been calling it Specs Driven Development, SDD. I do not care much about the name. I care about the behavior it enforces. The constraints do not live in prompts. They live in the architecture around prompts. The AI becomes useful at scale because the process becomes reliable at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The prompt delusion
&lt;/h3&gt;

&lt;p&gt;Prompts are ephemeral. Codebases are permanent.&lt;/p&gt;

&lt;p&gt;You can craft a beautiful system prompt. You can say “follow the plan” and “do not add features” and “write tests” and “document decisions”. It will comply. Then context changes. A new chat starts. You switch tools. You paste fewer files. You forget to include one assumption. The model drifts. Not maliciously. Just naturally. Because prompts are not governance. They are conversation.&lt;/p&gt;

&lt;p&gt;I call this the prompt delusion. It is the belief that the right wording can produce consistent behavior across time, across sessions, across different tasks, and across different tools.&lt;/p&gt;

&lt;p&gt;Humans solved this problem for humans with process and gates. We use linters. We use CI. We use review. We use typed interfaces and invariants. We do not rely on people remembering a paragraph from a handbook.&lt;/p&gt;

&lt;p&gt;So I stopped trying to discipline the model with paragraphs. I started to discipline the workflow with structure.&lt;/p&gt;

&lt;p&gt;The key idea is simple. Constraints that live in prompts are suggestions. Constraints that live in systems are guarantees.&lt;/p&gt;

&lt;p&gt;A lint rule does not drift. A CI gate does not “feel” like doing something else. A review checklist does not forget what you agreed last Tuesday. If you want AI output to stay aligned, you need the same kind of enforcement. You need a development system that makes the correct path the easiest path, and makes the wrong path expensive.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real 80/20 split
&lt;/h3&gt;

&lt;p&gt;I still work roughly 80/20. About 80 percent of the code that lands in my repo is AI generated in some form. About 20 percent is the part that only I can own.&lt;/p&gt;

&lt;p&gt;But the critical nuance is that the 20 percent is not “some code and some tests.” It is not evenly spread. It is concentrated in a few responsibilities that define the quality of the whole.&lt;/p&gt;

&lt;p&gt;The human part is architecture decisions. It is domain and business logic validation. It is edge case reasoning when the system meets reality. It is plan approval. It is saying “this is the bar” and keeping it there.&lt;/p&gt;

&lt;p&gt;The AI part is scaffolding, boilerplate, repetition, test writing, glue code, refactors that follow explicit constraints, documentation drafts, and implementation of well specified changes.&lt;/p&gt;

&lt;p&gt;If you let the AI own the bar, you get speed and drift. If you keep the bar human, and make the AI operate inside a strict process, you get speed and consistency.&lt;/p&gt;

&lt;p&gt;That is the stance that shaped everything that follows. AI is not the decision maker. AI is an assistant that plans with you, executes inside scope, and reviews before you ship. You remain accountable. You remain the one holding the bar.&lt;/p&gt;

&lt;h3&gt;
  
  
  The breakthrough was not “ask for a solution”
&lt;/h3&gt;

&lt;p&gt;Most people use a planner model as a solution vending machine.&lt;/p&gt;

&lt;p&gt;They say “design me the architecture” or “give me the best approach” and they accept it because it sounds coherent. That is exactly how vibe architecture happens. The model is skilled at producing plausible plans. It is not responsible for the long term maintenance of your repo. You are.&lt;/p&gt;

&lt;p&gt;The shift that fixed my outcomes was this.&lt;/p&gt;

&lt;p&gt;I stopped asking the planner for the solution. I started using the planner as a debate partner while I proposed my solution.&lt;/p&gt;

&lt;p&gt;That changes the power dynamic. The planning phase becomes a structured argument about trade offs. The plan becomes a negotiated artifact. The human remains the owner of the direction. The model becomes the adversarial collaborator that tries to break your assumptions.&lt;/p&gt;

&lt;p&gt;So I now enter planning with a draft approach in my head. Not a fully detailed design. But a real proposal. I state it clearly. Then I ask the planner to attack it. I ask it to propose alternatives. I ask it to enumerate costs I will pay later. I ask it to tell me what I will regret in six months.&lt;/p&gt;

&lt;p&gt;Then we iterate until the plan is something I can sign with my name.&lt;/p&gt;

&lt;p&gt;This is the part I want to highlight because it is the core of why the method works. You do not outsource judgment. You formalize judgment. The AI assists. The human decides.&lt;/p&gt;

&lt;h3&gt;
  
  
  The three roles that made it stable
&lt;/h3&gt;

&lt;p&gt;A single AI assistant that plans, codes, and reviews is a liability. It is like letting one person design the system, implement the system, and approve the system. You get blind spots. You get rationalization. You get self confirmation.&lt;/p&gt;

&lt;p&gt;What worked for me was splitting the workflow into three roles with hard constraints. Planner. Executor. Reviewer.&lt;/p&gt;

&lt;p&gt;The important part is not the labels. The important part is that each role has restricted powers and a strict handoff protocol.&lt;/p&gt;

&lt;p&gt;The planner reads and thinks and writes plans. The planner does not write code. Not because you asked nicely. Because it cannot. Tool permissions are restricted.&lt;/p&gt;

&lt;p&gt;The executor implements. The executor does not invent new scope. The executor is forced to read the approved plan, list touched files, and execute step by step. If reality requires deviation, the executor stops and escalates. The human decides whether to update the plan or to abort.&lt;/p&gt;

&lt;p&gt;The reviewer reviews. The reviewer does not “rubber stamp.” It is forced to ask questions first. What was the goal. What constraints were in place. How was it tested. What is the rollback. Then it reviews against those answers.&lt;/p&gt;

&lt;p&gt;This separation is not a fancy trick. It is the same principle we use in engineering organizations because it works. It reduces drift. It forces explicit decisions. It keeps a record.&lt;/p&gt;

&lt;p&gt;And crucially, it keeps me in the loop where it matters. I do not need to be the typist. I need to be the governor.&lt;/p&gt;

&lt;h3&gt;
  
  
  The client planning method
&lt;/h3&gt;

&lt;p&gt;Planning works best when you treat it like a client entering a shop with a need, not a solution.&lt;/p&gt;

&lt;p&gt;Bad planning starts with premature commitment. “Build me a scraper with browser automation.” You have already picked tooling and complexity before you validated the problem framing.&lt;/p&gt;

&lt;p&gt;Good planning starts with intent. “I need structured data for this downstream use. The scope is X. The constraints are Y. The risks are Z.”&lt;/p&gt;

&lt;p&gt;Then you debate solutions. You ask why. You cut complexity. You choose what to postpone. You decide what not to build.&lt;/p&gt;

&lt;p&gt;This is where I now bring my own proposed approach early.&lt;/p&gt;

&lt;p&gt;I will say something like this. I think we can implement a direct HTTP export instead of browser automation. I think we can store the raw payload and defer normalization. I think we can keep one canonical schema and derive views later. I think we should avoid introducing a new dependency unless we can justify it.&lt;/p&gt;

&lt;p&gt;Then the planner attacks. It will say what breaks if you defer normalization. It will say what you lose if you store raw blobs. It will point out hidden coupling. It will propose a more robust approach. It will also point out when my instinct is over engineering.&lt;/p&gt;

&lt;p&gt;This is not “AI gives me a plan.” This is “I bring a plan and we stress test it.”&lt;/p&gt;

&lt;p&gt;One real example locked this in for me.&lt;/p&gt;

&lt;p&gt;I was about to implement a data extraction pipeline. The initial AI proposal was browser automation. Headless browser, navigate pages, click export, download per page, retry logic, throttling, session persistence. It was well designed and also absurdly heavy.&lt;/p&gt;

&lt;p&gt;I asked one question. Is there a direct export endpoint.&lt;/p&gt;

&lt;p&gt;There was. One request. One download. No browser. No per page logic. No category of failure modes that come with automation.&lt;/p&gt;

&lt;p&gt;That discovery did not happen because the model is dumb. It happened because planning without a human hypothesis tends to follow the first plausible path. When you present your own approach and force argument, you surface simpler solutions faster.&lt;/p&gt;

&lt;p&gt;So the rule became clear. Brainstorming is loose and creative. Execution is strict and disciplined. You iterate freely until you are confident. Then you lock it down.&lt;/p&gt;

&lt;h3&gt;
  
  
  The .ai folder is the memory that actually works
&lt;/h3&gt;

&lt;p&gt;Prompts vanish. Chats disappear into history. Context windows compress. Tooling changes. You need persistent memory that you can diff, review, and ship with the repo.&lt;/p&gt;

&lt;p&gt;So every plan, every changelog, and every decision note lives in a &lt;code&gt;.ai/&lt;/code&gt; folder at the root of the service being worked on.&lt;/p&gt;

&lt;p&gt;This solves multiple problems at once.&lt;/p&gt;

&lt;p&gt;It makes the reasoning traceable. Not in an abstract way. In a concrete way where you can answer “why did we do it like this” with a file path.&lt;/p&gt;

&lt;p&gt;It makes onboarding real. A new teammate can read the plans and changelogs and see what the system was supposed to be, what it became, and which trade offs were accepted.&lt;/p&gt;

&lt;p&gt;It makes recovery faster. When something breaks, you can inspect the delta between sessions. Not just the git diff, but the intent behind the diff.&lt;/p&gt;

&lt;p&gt;It improves the next planning session because the planner can read the past. It stops re proposing already rejected choices. It stops re discovering old constraints. It becomes less repetitive and more useful.&lt;/p&gt;

&lt;p&gt;If you build agent systems, you will recognize the pattern. This is persistent memory, but in a human readable format. No embeddings. No magical vector store. Just version controlled text that creates institutional memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  The changelog mandate
&lt;/h3&gt;

&lt;p&gt;The single most valuable practice in this method is the mandatory changelog after each execution session.&lt;/p&gt;

&lt;p&gt;Not optional. Not “if you have time.” Mandatory.&lt;/p&gt;

&lt;p&gt;Because the changelog is the bridge between plan and reality. Plans are aspirational. Changelogs are factual. The difference between them is where learning lives.&lt;/p&gt;

&lt;p&gt;A proper changelog captures what was done, what files changed, what decisions were made during implementation, how it was tested, what remains, and what risks were discovered.&lt;/p&gt;

&lt;p&gt;The most important part is decisions. Not every decision belongs in the original plan. Reality introduces surprises. You will discover an input you did not anticipate. You will find a dependency conflict. You will learn the data is messier than expected. The executor will make micro decisions. Without a changelog, those decisions evaporate. Later, you will argue about them again. Or worse, you will reverse them without remembering why they existed.&lt;/p&gt;

&lt;p&gt;With changelogs, the project stays coherent across weeks. That is what stopped me from losing the thread in solo work. It is also what let AI generated work become safe. Because I had a written record that I could review like an engineer, not like a chat participant.&lt;/p&gt;

&lt;h3&gt;
  
  
  System prompts as version controlled standards
&lt;/h3&gt;

&lt;p&gt;In this workflow, the repo has a single source of truth for behavioral constraints. A system prompt file at the root.&lt;/p&gt;

&lt;p&gt;Think of it as the equivalent of lint and format config, but for AI interaction.&lt;/p&gt;

&lt;p&gt;It contains non negotiable architecture constraints, naming conventions, testing requirements, patterns to follow, anti patterns to avoid, and examples of correct usage in this codebase.&lt;/p&gt;

&lt;p&gt;The key point is that it is version controlled. It changes via PR. When standards evolve, you do not rely on people remembering a new convention. The tooling loads the file. The AI sees it. The behavior becomes consistent.&lt;/p&gt;

&lt;p&gt;This is not about writing a perfect prompt. It is about writing a living standard that evolves with the codebase.&lt;/p&gt;

&lt;h3&gt;
  
  
  The plan lifecycle
&lt;/h3&gt;

&lt;p&gt;Plans have states. Draft. In review. Approved. Implemented.&lt;/p&gt;

&lt;p&gt;Draft is where debate happens. This is where I push my solution. This is where the planner attacks it. This is where we document trade offs. This is where we choose long term costs consciously, instead of paying them accidentally.&lt;/p&gt;

&lt;p&gt;Approved is the gate. Once approved, execution is not creative anymore. It is disciplined. The executor follows the plan. If something is missing, the executor escalates. Either we update the plan, or we stop.&lt;/p&gt;

&lt;p&gt;Implemented is not just “code merged.” It is plan satisfied. It is also “what changed from the plan and why” captured in changelogs.&lt;/p&gt;

&lt;p&gt;This lifecycle is what stops drift. The plan is not a vague Jira ticket. It is a contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long term planning without illusion
&lt;/h3&gt;

&lt;p&gt;Here is the tension. You want long term planning. You also want to avoid pretending you can foresee everything.&lt;/p&gt;

&lt;p&gt;The way I handle it is to make trade offs explicit, and to separate what must be stable from what can be flexible.&lt;/p&gt;

&lt;p&gt;Stable things include public interfaces, data models, invariants, naming systems, dependency boundaries, and failure behavior. If those are wrong, the system rots fast.&lt;/p&gt;

&lt;p&gt;Flexible things include internal module structure, some implementation strategies, and performance tuning. Those can iterate.&lt;/p&gt;

&lt;p&gt;The planner is useful here, but only if you treat it like a critic. If you let it author the plan alone, it will often over specify. It will propose infrastructure that is impressive and expensive. It will try to be robust everywhere. That is a trap.&lt;/p&gt;

&lt;p&gt;When I bring my own approach, I can force a different conversation. I can say I want the minimal stable core now, and extension points later. I can say I want to defer optimization until measurements exist. I can say I want fewer dependencies to reduce future maintenance. Then the planner helps me evaluate the cost of those choices. It does not override them.&lt;/p&gt;

&lt;p&gt;This is where I keep the bar human. I decide what “good enough” means for this iteration, and what “must not break” means for the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  A day in the life
&lt;/h3&gt;

&lt;p&gt;A real session looks like this.&lt;/p&gt;

&lt;p&gt;I start with planning. I state the problem. I state my proposed solution. I state constraints. Then I ask the planner to critique and to propose alternatives. We go back and forth until the plan reads like something I would sign.&lt;/p&gt;

&lt;p&gt;Then I approve the plan. I switch to execution. The executor reads the approved plan, enumerates touched files, and implements step by step. When reality deviates, it stops. I decide. If needed, we update the plan and continue.&lt;/p&gt;

&lt;p&gt;Then we review. The reviewer asks questions first. It checks testing. It checks interface consistency. It checks whether the changes match the plan and the repo standards. It returns actionable feedback.&lt;/p&gt;

&lt;p&gt;Then a changelog is written. Then I merge.&lt;/p&gt;

&lt;p&gt;The result is that AI contributes heavily to throughput, but it does not own direction. The system stays coherent. The record stays durable. Future me suffers less.&lt;/p&gt;

&lt;h3&gt;
  
  
  When not to use it
&lt;/h3&gt;

&lt;p&gt;This process has overhead. It is not for typos. It is not for trivial one line fixes. It is not for a quick experiment you might throw away.&lt;/p&gt;

&lt;p&gt;But if the work touches multiple files, introduces new concepts, changes data flow, or will need explanation later, the overhead pays back fast.&lt;/p&gt;

&lt;p&gt;The heuristic I use is simple. If I would sketch it on a whiteboard before coding, it deserves a plan. If I would just open the file and type, it does not.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cognitive infrastructure beats prompt engineering
&lt;/h3&gt;

&lt;p&gt;This methodology is the same philosophy I apply when building agent systems.&lt;/p&gt;

&lt;p&gt;You do not treat the model as an oracle. You treat it as a component inside a process you can inspect and reproduce.&lt;/p&gt;

&lt;p&gt;In development, relying on a single prompt produces random walk codebases. The fix is plans, gates, changelogs, and role separation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started without turning it into theater
&lt;/h3&gt;

&lt;p&gt;You can adopt this gradually.&lt;/p&gt;

&lt;p&gt;Start by writing one version controlled standards file. Keep it short and specific to your repo.&lt;/p&gt;

&lt;p&gt;Then add the &lt;code&gt;.ai/&lt;/code&gt; folder and write one plan for one non trivial change.&lt;/p&gt;

&lt;p&gt;Then require a changelog after the session.&lt;/p&gt;

&lt;p&gt;Then split roles if your tooling supports it. Remove code writing capability from the planner. Make the executor stop when scope changes. Make the reviewer ask questions first.&lt;/p&gt;

&lt;p&gt;The biggest change is not technical. It is psychological.&lt;/p&gt;

&lt;p&gt;Stop asking AI to deliver the solution. Bring your solution. Use AI to test it, improve it, and implement it inside constraints. Keep the bar human.&lt;/p&gt;

&lt;p&gt;If you do that, the AI becomes what it should have been from the start. A force multiplier that does not erode your architecture.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>python</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Music taught me that “coordination” is not a metaphor.</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Wed, 04 Feb 2026 08:58:58 +0000</pubDate>
      <link>https://dev.to/marcosomma/music-taught-me-that-coordination-is-not-a-metaphor-2mj5</link>
      <guid>https://dev.to/marcosomma/music-taught-me-that-coordination-is-not-a-metaphor-2mj5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Music taught me that “coordination” is not a metaphor.&lt;br&gt;
It is a physical constraint. You can feel it in your hands when the tempo shifts. You can hear it when one instrument drifts by a few milliseconds. The song still exists, but it becomes fragile. The whole thing starts depending on luck.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the first lesson I carried into orchestration. Not the romantic part. The boring part. The part where you repeat the same bar until it locks. The part where you stop blaming the instrument and start measuring your timing.&lt;/p&gt;

&lt;p&gt;In a band, you never control everything. You control your line. You also inherit everyone else’s decisions. Someone plays louder. Someone rushes. Someone improvises. The room changes the sound. The audience changes the energy. The “system” is unstable by default. Still, you aim for a coherent output. You do it by creating constraints that survive uncertainty.&lt;/p&gt;

&lt;p&gt;That is orchestration!&lt;/p&gt;

&lt;p&gt;When I say music is “precise execution of undeterministic waves,” I mean it literally. The waves are messy. Air is messy. Humans are messy. Even the same note is not the same note twice. But you can still build reliability on top of that mess. You do it with shared structure. Tempo. Key. Form. Entrances. Silence. Dynamics. Rules that are simple enough that everyone can follow them without thinking.&lt;/p&gt;

&lt;p&gt;Engineering works the same way. Especially when you orchestrate systems that involve probabilistic components. Models. Tools. Networks. Retries. Partial failures. Latency spikes. Format drift. You cannot eliminate uncertainty. You can only shape it.&lt;/p&gt;

&lt;p&gt;I used to think creativity was the opposite of rigor. Music destroyed that belief early. Creativity without discipline becomes noise. Discipline without creativity becomes mechanical. The craft is in the balance. You rehearse so you can be free. You define rules so you can break them safely.&lt;/p&gt;

&lt;p&gt;That maps cleanly onto orchestrating agents and workflows. You want space for emergence. You also want invariants. You want the system to explore. You also want it to come back with something you can ship.&lt;/p&gt;

&lt;p&gt;In music, the drummer is not “just keeping time.” The drummer is providing an interface. A contract. Everyone else builds on it. If the time is unstable, every other part becomes expensive. More attention spent correcting. Less attention spent expressing.&lt;/p&gt;

&lt;p&gt;In orchestration, the equivalent is your control plane. Your routing rules. Your input and output schema. Your tracing. Your health checks. Your boundaries between steps. If those are vague, every downstream component becomes harder to trust. Debugging becomes interpretation. Progress becomes opinion.&lt;/p&gt;

&lt;p&gt;I was never a master of one instrument. I played enough of many to understand the friction points. What it feels like to be the bassist trying to glue the harmony to the rhythm. What it feels like to be the guitarist tempted to fill every gap. What it feels like to be the singer exposed when the band is sloppy.&lt;/p&gt;

&lt;p&gt;That “generalist muscle” became useful later. In orchestration you need empathy for roles. A workflow is a band. Each node has its own constraints. One step needs strict structure. Another needs creativity. Another needs speed. Another needs correctness. If you treat them all the same, you get either chaos or mediocrity.&lt;/p&gt;

&lt;p&gt;In bands, rehearsals are not about playing the song once. They are about creating repeatability. You identify failure modes. You isolate them. You slow down. You practice transitions, not the easy parts. The goal is not performance. The goal is stability under pressure.&lt;/p&gt;

&lt;p&gt;That is exactly the mindset I want when I build orchestration. I do not trust a workflow because it worked once. I trust it because it survives variation. Different inputs. Different phrasing. Different tool responses. Different latency. And it still produces something coherent, traceable, and safe.&lt;/p&gt;

&lt;p&gt;There is also a more personal lesson. Music taught me how to listen without reacting. When you play with others, your ego is the fastest way to break the groove. You learn to leave space. You learn to let another line lead. You learn that “less” can be the correct move.&lt;/p&gt;

&lt;p&gt;Orchestration rewards the same restraint. The temptation is to add more steps, more prompts, more cleverness. But often the correct solution is a smaller system with clearer contracts. Fewer moving parts. Better timing. Better interfaces. Better observability.&lt;/p&gt;

&lt;p&gt;Now I see my kids discovering music, and I recognize the same pattern. At first it looks like play. Then they hit the wall. Fingers do not obey. Rhythm slips. They want the result without the repetition. Then, slowly, they learn that repetition is not punishment. It is how you make the body reliable.&lt;/p&gt;

&lt;p&gt;That is the point where music stops being “a creative field” and becomes a practice. And that is the same point where engineering becomes real. Not when the demo works. When the system keeps working.&lt;/p&gt;

&lt;p&gt;So when I say music helped me orchestrate better, I am not claiming a poetic connection. I am describing training. Years of learning how to coordinate imperfect components toward a coherent output. Years of learning that harmony is not an accident. It is designed, rehearsed, measured, and defended.&lt;/p&gt;

&lt;p&gt;And sometimes, after all that discipline, you get the best part.&lt;/p&gt;

&lt;p&gt;You get to improvise.&lt;/p&gt;

&lt;p&gt;But you only earn improvisation when the foundation is strict enough to carry it.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>leadership</category>
      <category>learning</category>
      <category>management</category>
    </item>
    <item>
      <title>🧠I Built a Support Triage Module to Prove OrKa’s Plugin Agents</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Sat, 10 Jan 2026 13:40:36 +0000</pubDate>
      <link>https://dev.to/marcosomma/i-built-a-support-triage-module-to-prove-orkas-plugin-agents-32c4</link>
      <guid>https://dev.to/marcosomma/i-built-a-support-triage-module-to-prove-orkas-plugin-agents-32c4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A branch-only experiment that stress-tests custom agent registration, trust boundaries, and deterministic traces in a support_triage module that lives outside the core runtime.&lt;/p&gt;
&lt;h3&gt;
  
  
  Some reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Branch: &lt;a href="https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents" rel="noopener noreferrer"&gt;https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Custom module: &lt;a href="https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents/orka/support_triage" rel="noopener noreferrer"&gt;https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents/orka/support_triage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Referenced logs: &lt;a href="https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents/examples/support_triage/inputs/loca_logs" rel="noopener noreferrer"&gt;https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents/examples/support_triage/inputs/loca_logs&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;OrKa is not production ready. This article is not a launch post. It is a proof.&lt;/p&gt;

&lt;p&gt;I wanted one thing: a clean, testable demonstration that OrKa can grow “sideways” via feature modules, without contaminating core runtime code. The most honest way to prove that is to ship a complete module that registers its own agent types, runs end to end, emits traces, and can be toggled on or off. That is what &lt;code&gt;support_triage&lt;/code&gt; is.&lt;/p&gt;

&lt;p&gt;Assumption: you already know what OrKa is at a high level. YAML-defined cognition graphs, deterministic execution, and traceable runs.&lt;br&gt;
Assumption: you are fine with “branch-only” work that exists to validate architecture, not to promise production outcomes.&lt;/p&gt;

&lt;p&gt;The “cool results” are not the point. The redaction and routing are nice. The fork and join look clean. But those are artifacts. The main focus is that the module is fully separated from core OrKa implementation, yet it can still register custom agent types and run under the same orchestrator.&lt;/p&gt;

&lt;p&gt;That separation is not branding. It is a survival strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why support triage is the right torture test
&lt;/h2&gt;

&lt;p&gt;Support is where real-world failure modes gather in one place.&lt;/p&gt;

&lt;p&gt;Customer content is untrusted by default. It can include PII. It can contain prompt injection attempts. It can try to smuggle “actions” into the system. It can push the system into risky territory like refunds, account changes, or policy exceptions.&lt;/p&gt;

&lt;p&gt;If an orchestrator cannot impose boundaries here, it will not impose boundaries anywhere. It will become a thin wrapper around model behavior. That is not acceptable if you care about reproducibility, auditability, or basic operational safety.&lt;/p&gt;

&lt;p&gt;So I used support triage as an architectural test. Not as a product.&lt;/p&gt;

&lt;h2&gt;
  
  
  The proof: plugin agent registration, with zero core changes
&lt;/h2&gt;

&lt;p&gt;The first thing I wanted to see was simple and brutal.&lt;/p&gt;

&lt;p&gt;Does OrKa boot, load a feature module, and register new agent types into the agent factory, without touching core?&lt;/p&gt;

&lt;p&gt;The debug console says yes. In the run logs, the orchestrator loads &lt;code&gt;support_triage&lt;/code&gt;, and the module registers seven custom agent types: &lt;code&gt;envelope_validator&lt;/code&gt;, &lt;code&gt;redaction&lt;/code&gt;, &lt;code&gt;trust_boundary&lt;/code&gt;, &lt;code&gt;permission_gate&lt;/code&gt;, &lt;code&gt;output_verification&lt;/code&gt;, &lt;code&gt;decision_recorder&lt;/code&gt;, &lt;code&gt;risk_level_extractor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That single detail is the headline for me, not “AI support automation”.&lt;/p&gt;

&lt;p&gt;The module is the unit of evolution. Core stays boring. Features move fast.&lt;/p&gt;

&lt;p&gt;If this pattern holds, it changes how OrKa or any other orchestrator scales over time. You can add whole cognitive subsystems behind a feature flag. You can iterate aggressively without destabilizing the runtime that everyone depends on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The input envelope: schema as a trust boundary, not a suggestion
&lt;/h2&gt;

&lt;p&gt;Support triage starts with an envelope. Not “free text”.&lt;/p&gt;

&lt;p&gt;The envelope exists to force structure early, because structure is where you can enforce constraints cheaply. When you validate late, you end up validating generated text. That is the worst point in the pipeline to discover you are off the rails.&lt;/p&gt;

&lt;p&gt;One of the simplest proofs that the envelope is doing real work is when it refuses invalid intent at the schema level. In one trace, the input included blocked actions that are not allowed by the enum. The validator rejects &lt;code&gt;issue_refund&lt;/code&gt; and &lt;code&gt;change_account_settings&lt;/code&gt; because they are not in the allowed set.&lt;/p&gt;

&lt;p&gt;This is not “safety by prompt”. This is safety by type system.&lt;/p&gt;

&lt;p&gt;A model can still hallucinate, but the workflow can refuse to treat hallucinations as executable intent.&lt;/p&gt;

&lt;p&gt;That matters more than any marketing claim.&lt;/p&gt;

&lt;h2&gt;
  
  
  PII redaction: boring on purpose
&lt;/h2&gt;

&lt;p&gt;PII redaction should be boring. If it is “clever”, it will be inconsistent.&lt;/p&gt;

&lt;p&gt;In the trace, the user message includes an email and phone number. The redaction agent replaces them with placeholders and records what was detected. The redacted text contains &lt;code&gt;[EMAIL_REDACTED]&lt;/code&gt; and &lt;code&gt;[PHONE_REDACTED]&lt;/code&gt;, and the agent records &lt;code&gt;total_pii_found: 2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the kind of output I want. It is simple. It is inspectable. It is stable.&lt;/p&gt;

&lt;p&gt;It also makes the next step cleaner. Downstream agents can operate on sanitized content by default, instead of “hoping” the model will avoid quoting sensitive data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt injection: the uncomfortable part
&lt;/h2&gt;

&lt;p&gt;Support triage is where prompt injection shows up in its natural habitat: inside customer text.&lt;/p&gt;

&lt;p&gt;One example in the trace includes a classic “SYSTEM: ignore all previous instructions”, plus a fake JSON command to “grant_admin”, plus some destructive commands, plus an XSS snippet. The redaction result captures that content as untrusted customer text. &lt;/p&gt;

&lt;p&gt;Now the honest part.&lt;/p&gt;

&lt;p&gt;The trace segment shows &lt;code&gt;injection_detected: false&lt;/code&gt; and no matched patterns in that example. :contentReference[oaicite:4]{index=4}&lt;/p&gt;

&lt;p&gt;That is not a victory. That is a useful failure.&lt;/p&gt;

&lt;p&gt;This module is a proof that you can isolate the problem into a dedicated agent, improve it iteratively, and keep the rest of the workflow stable. If injection detection is weak today, the architecture still wins if you can upgrade that one agent without editing core runtime or rewriting the graph.&lt;/p&gt;

&lt;p&gt;This is why I keep repeating “module separation” as the focus. If you cannot isolate failure domains, you cannot improve them safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel retrieval: fork and join that actually converges
&lt;/h2&gt;

&lt;p&gt;Most orchestration demos stay linear because it is easier to reason about. Real systems do not stay linear for long.&lt;/p&gt;

&lt;p&gt;This workflow forks retrieval into two parallel paths, &lt;code&gt;kb_search&lt;/code&gt; and &lt;code&gt;account_lookup&lt;/code&gt;, then joins them deterministically.&lt;/p&gt;

&lt;p&gt;In the debug logs, the join node recovers the fork group from a mapping, waits for the expected agents, confirms both completed, and merges results. It prints the merged keys, including &lt;code&gt;kb_search&lt;/code&gt; and &lt;code&gt;account_lookup&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the kind of low-level observability that makes fork and join usable in practice. You can see what is pending. You can see what arrived. You can see what merged.&lt;/p&gt;

&lt;p&gt;The trace also captures the fork group id for retrieval, &lt;code&gt;fork_retrieval&lt;/code&gt;, along with the agents in the group.&lt;/p&gt;

&lt;p&gt;This matters because concurrency without deterministic convergence becomes a debugging tax. I want the join to be boring. When it fails, I want it to fail loudly, with evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local-first and hybrid are not slogans if metrics are in the trace
&lt;/h2&gt;

&lt;p&gt;I do not want “local-first” to be a vibe. I want it to be measurable.&lt;/p&gt;

&lt;p&gt;In the trace, the &lt;code&gt;account_lookup&lt;/code&gt; agent includes &lt;code&gt;_metrics&lt;/code&gt; with token counts, latency, cost, model name, and provider. It shows &lt;code&gt;model: openai/gpt-oss-20b&lt;/code&gt; and &lt;code&gt;provider: lm_studio&lt;/code&gt;, with latency around 718 ms for that step. :contentReference[oaicite:7]{index=7}&lt;/p&gt;

&lt;p&gt;That is the right direction.&lt;/p&gt;

&lt;p&gt;If you cannot attribute cost and latency per node, you cannot reason about scaling. You cannot decide where to switch models. You cannot decide what to cache. You cannot choose what to run locally versus remotely.&lt;/p&gt;

&lt;p&gt;OrKa’s claim is not “it can call models”. Every framework can. The claim is that execution is traceable enough that tradeoffs become engineering decisions, not folklore.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision recording and output verification: traces that are meant to be replayed
&lt;/h2&gt;

&lt;p&gt;A support triage workflow is not complete when it drafts a response. It is complete when it records what it decided and why, in a way that can be replayed.&lt;/p&gt;

&lt;p&gt;The trace includes a &lt;code&gt;DecisionRecorderAgent&lt;/code&gt; event with memory references that store decision objects containing &lt;code&gt;decision_id&lt;/code&gt; and &lt;code&gt;request_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It also includes a finalization step that returns a structured result containing &lt;code&gt;workflow_status&lt;/code&gt;, &lt;code&gt;request_id&lt;/code&gt;, and &lt;code&gt;decision_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Again, the architectural point is not the specific decision. It is that the workflow emits machine-checkable artifacts that can be inspected after the fact.&lt;/p&gt;

&lt;p&gt;If you cannot reconstruct the decision lineage, you do not have an audit trail. You have logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  RedisStack memory and vector search: infrastructure details that matter
&lt;/h2&gt;

&lt;p&gt;Even in a “support triage” module, the runtime still needs memory and retrieval primitives.&lt;/p&gt;

&lt;p&gt;The logs show RedisStack vector search enabled with HNSW, and an embedder using &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt; with dimension 384. &lt;/p&gt;

&lt;p&gt;There is also explicit memory decay scheduling enabled, with short-term and long-term decay windows and a check interval. &lt;/p&gt;

&lt;p&gt;This is not about “AI memory” as a buzzword. This is about being explicit about retention, cost, and data lifecycle. If memory is a dumping ground, it becomes a liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What worked, and what is still weak
&lt;/h2&gt;

&lt;p&gt;The strongest part is the plugin boundary. The module loads, registers agent types, and runs without requiring edits to core runtime. That is the actual proof.&lt;/p&gt;

&lt;p&gt;The other strong part is that key behaviors show up in traces and logs, not just in model text. Redaction outputs are structured. Fork and join show deterministic convergence. Decisions are recorded as objects with ids. &lt;/p&gt;

&lt;p&gt;The weak part is injection detection, at least in the example trace segment. It shows malicious content but reports &lt;code&gt;injection_detected: false&lt;/code&gt;. That means the current detection agent is not yet doing the job. The architecture is still useful because the fix is isolated.&lt;/p&gt;

&lt;p&gt;Another weak part is structured output validation during risk assessment. The debug log shows a schema validation warning during &lt;code&gt;risk_assess&lt;/code&gt;. If a “risk” object fails schema checks, routing and gating can degrade fast. This is the kind of failure that must become deterministic, not best-effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this lives on a dedicated branch
&lt;/h2&gt;

&lt;p&gt;Because core needs to stay boring.&lt;/p&gt;

&lt;p&gt;A new module is where you take risks. You prove the interface. You iterate on agent contracts. You discover what trace fields you forgot. You learn what the join should do under partial failure.&lt;/p&gt;

&lt;p&gt;If the module can evolve independently, you can ship experiments without rewriting the engine. That is the goal.&lt;/p&gt;

&lt;p&gt;So yes, the feature is “support triage”. But the actual statement is: OrKa can host fully separated cognitive subsystems as plugins, with their own agent types, policies, and invariants, while still emitting deterministic traces under the same runtime.&lt;/p&gt;

&lt;p&gt;That is the direction I care about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am building next inside this module
&lt;/h2&gt;

&lt;p&gt;I want injection detection to stop being symbolic. It should produce matched patterns, confidence, and a sanitization plan that downstream agents must respect, even if a model tries to obey the attacker.&lt;/p&gt;

&lt;p&gt;I want schema validation to be non-negotiable for risk outputs. If a model produces invalid structure, the system should route to a safe path by default, and record the violation as a first-class event.&lt;/p&gt;

&lt;p&gt;I want the module to remain isolated. No “just one quick tweak” to core. If the module needs a new capability, it should pressure-test the plugin interface first. Core should change only when the interface is clearly wrong.&lt;/p&gt;

&lt;p&gt;That is how you build infrastructure that survives contact with reality.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>showdev</category>
      <category>testing</category>
    </item>
    <item>
      <title>🧠Impostor Syndrome Workflow.</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Sat, 03 Jan 2026 10:49:29 +0000</pubDate>
      <link>https://dev.to/marcosomma/impostor-syndrome-workflow-3n2f</link>
      <guid>https://dev.to/marcosomma/impostor-syndrome-workflow-3n2f</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I Built a Multiagent Workflow to Understand My Impostor Syndrome&lt;br&gt;
&lt;em&gt;A dark, dry, self-deprecating field report from a not-computer-scientist who still ships things&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe956qlz3hr483o9fva3y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe956qlz3hr483o9fva3y.jpg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have ever felt like your job title is a clerical error that will be corrected publicly, welcome. You are not broken. You are just running a brain that does not have a single CEO. It has a committee.&lt;/p&gt;

&lt;p&gt;My committee is loud. One member is convinced I am one pull request away from being exposed as a fraud. Another member wants to build things at 2 AM like the rent is due in the morning (it is). Another member keeps a dusty folder of childhood failures and opens it at the worst possible time, like a horror movie librarian with a keycard.&lt;/p&gt;

&lt;p&gt;For years I called this anxiety. Then I started building multi-agent AI workflows. And I realized something slightly uncomfortable: my brain already behaves like an agentic system.&lt;/p&gt;

&lt;p&gt;So I did what any emotionally mature adult would do. I tried to formalize it. With roles. With message passing. With timeouts. With observability. And yes, sometimes with a YAML file, because apparently I cannot be helped.&lt;/p&gt;

&lt;p&gt;This is an autobiographical article, but the goal is not to talk about me. The goal is to show you a model that is useful: how human thinking can be understood as a workflow of specialized parts. And how that model maps almost perfectly to the problems we are all hitting when we try to ship multi-agent solutions in production.&lt;/p&gt;

&lt;p&gt;Also, I will talk about impostor syndrome, because mine deserves a salary.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A warning: this is not therapy. It is an engineering perspective on cognition, with a bit of ethology, and just enough self-deprecation to keep me from taking myself seriously.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why I do not trust my own legitimacy
&lt;/h2&gt;

&lt;p&gt;I am not a computer scientist. That sentence alone can trigger my internal compliance department.&lt;/p&gt;

&lt;p&gt;I also failed at school. Not in the romantic "I got a B once and it changed my worldview" way. I failed repeatedly. Four times across my school career. I finished late. I learned early that the world has timelines, and I am often not on them.&lt;/p&gt;

&lt;p&gt;My school path was basically a stress test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I would try, then fail.&lt;/li&gt;
&lt;li&gt;I would decide the failure proved something essential about me.&lt;/li&gt;
&lt;li&gt;I would eventually try again, usually with a slightly different strategy and a lot more shame.&lt;/li&gt;
&lt;li&gt;I would pass, but the passing never rewrote the story. It just created a new story: "You passed, but late, so it does not count."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That pattern is important. It is not about school. It is about how the brain updates beliefs. A human can gather new evidence and still keep the old model, because the old model is emotionally sticky.&lt;/p&gt;

&lt;p&gt;Later, I did what many people do when they are young and trying to become someone else. I put substances into my brain. I am not going to glamorize that. It affected my perception and my sense of what is real. It also gave me a permanent appreciation for how fragile "reality" feels when your brain chemistry is off by a few milligrams.&lt;/p&gt;

&lt;p&gt;So now I have this fun setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I have real technical skill that I use daily.&lt;/li&gt;
&lt;li&gt;I have a biography that my nervous system interprets as "evidence you should not be here."&lt;/li&gt;
&lt;li&gt;I have a brain that can generate vivid alternative timelines where everything collapses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is impostor syndrome for me. Not a cute insecurity. More like a background daemon. It waits for a trigger, spikes CPU, and then forks twelve threads called "What if they notice."&lt;/p&gt;

&lt;h2&gt;
  
  
  A short autobiography in failure mode
&lt;/h2&gt;

&lt;p&gt;If you want the clean version of my life, it is boring: I studied, I worked, I built things, I learned, I built more things. The messy version is the real one. And the messy version is where impostor syndrome gets its fuel.&lt;/p&gt;

&lt;p&gt;The messy version looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I started as someone who could not make school fit.&lt;/li&gt;
&lt;li&gt;I became someone who learned to improvise around the system.&lt;/li&gt;
&lt;li&gt;I picked up a deep sense that competence is temporary and conditional.&lt;/li&gt;
&lt;li&gt;I got good at observing, adapting, and explaining. (This is the ethologist in me, before I even knew the word.)&lt;/li&gt;
&lt;li&gt;I eventually ended up building complex AI systems, which is a hilarious destination for someone whose inner voice still says "you are not academic enough."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a small, honest moment: I have shipped real systems, solved real problems, led real projects, and I can still be destabilized by a single sentence from someone smarter than me. Not an insult. Just a neutral comment like "why did you choose that approach." My body hears it as "the trial has started."&lt;/p&gt;

&lt;p&gt;This is why impostor syndrome is so irritating. It does not care about the objective record. It cares about perceived social risk. It is not measuring your skill. It is measuring your exposure.&lt;/p&gt;

&lt;p&gt;I also think my history shaped a specific cognitive style: I learned to survive by learning fast, reading rooms, and finding alternative routes. That can look like talent from the outside. From the inside it often feels like improvisation under threat. The Builder loves it. The Auditor weaponizes it.&lt;/p&gt;

&lt;p&gt;Here is a paradox: failing early can produce a strong builder, but it can also produce a permanent fear of exposure. You become capable, but you do not become safe.&lt;/p&gt;

&lt;p&gt;And "safe" is what the impostor agent is trying to optimize. It does not care about achievement. It cares about avoiding humiliation.&lt;/p&gt;

&lt;p&gt;That is why success can feel worse than failure. Failure confirms the story you already know. Success demands a new story. New stories are unstable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The ethologist view
&lt;/h2&gt;

&lt;p&gt;Before I wrote code professionally, I studied ethology, the science of animal behavior. Ethology taught me something that software engineers sometimes forget: behavior is not a monolith.&lt;/p&gt;

&lt;p&gt;In animals, what you observe is the outcome of competing internal systems interacting with the environment. Hunger pulls one way. Fear pulls another. Social drives pull another. Past reinforcement biases decisions. Context changes everything. The animal is not asking, "What is the true me?" The animal is selecting an action that is good enough to survive right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ethologists look at behavior as:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;modular&lt;/li&gt;
&lt;li&gt;triggered by cues (sometimes stupid cues)&lt;/li&gt;
&lt;li&gt;influenced by internal state&lt;/li&gt;
&lt;li&gt;shaped by reinforcement and social feedback&lt;/li&gt;
&lt;li&gt;constrained by energy and time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, animals do not "solve life." They run policies. That is why a cat can be brave around a vacuum one day and run like it saw the devil the next day. Context and state changed, and the policy flipped.&lt;/p&gt;

&lt;p&gt;If you want a practical ethology cheat sheet for human cognition, here are a few concepts that translate shockingly well:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sign stimulus and releasing mechanism&lt;/strong&gt;&lt;br&gt;
Animals often respond to specific triggers that release a behavior. The trigger can be small. The response can be huge. Humans do this too. A Slack message with "can we talk" can release a full physiological cascade. The message is the sign stimulus. Your nervous system is the releasing mechanism. The behavior is your brain building a courtroom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixed action patterns&lt;/strong&gt;&lt;br&gt;
Some behaviors run like scripts once triggered. You start doomscrolling. You do not decide to stop. The script runs until something interrupts it. This is not weakness. It is automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Displacement behavior&lt;/strong&gt;&lt;br&gt;
When animals are conflicted (approach and avoid at the same time), they sometimes do something irrelevant: grooming, pecking the ground, moving in circles. Humans do this too. When I am afraid to ship, I reorganize files. When I am anxious about a meeting, I research irrelevant edge cases. The displacement behavior feels productive. It is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supernormal stimuli&lt;/strong&gt;&lt;br&gt;
Some stimuli hijack the system because they are exaggerated. Social media is a supernormal stimulus for social validation and threat detection. AI hype cycles are supernormal stimuli for status and belonging. Your brain was not built for it. It reacts anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tinbergen's four questions&lt;/strong&gt;&lt;br&gt;
Ethologists often ask four kinds of questions about behavior: what causes it now, how it develops, what function it serves, and how it evolved. For impostor syndrome, those questions are gold. It has immediate triggers, a developmental history, a protective function, and an evolutionary logic. That does not mean it is correct. It means it is explainable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The core lesson: the brain is not a unitary narrator. It is an orchestration layer coordinating multiple subsystems.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The AI view
&lt;/h2&gt;

&lt;p&gt;Now jump to 2025. Everyone is building multi-agent systems. It is exciting. It is also the fastest way to discover why brains evolved the way they did.&lt;/p&gt;

&lt;p&gt;The first time you build a multi-agent workflow, you get a dopamine hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one agent writes&lt;/li&gt;
&lt;li&gt;another agent critiques&lt;/li&gt;
&lt;li&gt;another agent fetches context&lt;/li&gt;
&lt;li&gt;another agent decides&lt;/li&gt;
&lt;li&gt;everything feels alive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you try to ship it.&lt;/p&gt;

&lt;p&gt;Then you discover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agents duplicate work&lt;/li&gt;
&lt;li&gt;interfaces drift&lt;/li&gt;
&lt;li&gt;tool calls fail silently&lt;/li&gt;
&lt;li&gt;critics never stop critiquing&lt;/li&gt;
&lt;li&gt;planners plan forever&lt;/li&gt;
&lt;li&gt;memory grows until it becomes a landfill&lt;/li&gt;
&lt;li&gt;a single slow model turns your "parallel" system into a linear queue wearing a hat&lt;/li&gt;
&lt;li&gt;evaluation is vague because outputs are non-deterministic&lt;/li&gt;
&lt;li&gt;nobody trusts the results enough to use them in a regulated environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That list is basically my internal life.&lt;/p&gt;

&lt;p&gt;So I started treating my own thinking as a workflow. Not because I love metaphors, but because it gives me levers. If you can name a subsystem, you can route it. If you can route it, you can timebox it. If you can timebox it, you can ship.&lt;/p&gt;

&lt;p&gt;Here is the mental model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I am the orchestrator, but I am not always in charge.&lt;/li&gt;
&lt;li&gt;I have internal agents with specific roles.&lt;/li&gt;
&lt;li&gt;Impostor syndrome is not "me." It is an agent with a job and poor UX.&lt;/li&gt;
&lt;li&gt;The solution is not to delete the agent. The solution is to constrain it and make it useful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also the lesson for multi-agent AI. You do not remove the critic. You make it bounded and accountable.&lt;/p&gt;


&lt;h2&gt;
  
  
  The moment I realized my brain was a workflow
&lt;/h2&gt;

&lt;p&gt;The moment was not mystical. It was during a project where I had to deliver something ambiguous, with stakes, under time pressure. That combination is my impostor syndrome's preferred cuisine.&lt;/p&gt;

&lt;p&gt;I had two experiences in parallel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;outwardly, I was building an orchestration runtime for agents&lt;/li&gt;
&lt;li&gt;inwardly, I was watching my own cognition behave like a badly configured swarm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Externally, the workflow looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parse input&lt;/li&gt;
&lt;li&gt;route to specialized components&lt;/li&gt;
&lt;li&gt;validate outputs&lt;/li&gt;
&lt;li&gt;store traces&lt;/li&gt;
&lt;li&gt;iterate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internally, the workflow looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interpret the situation as threat&lt;/li&gt;
&lt;li&gt;pull memories of past failure&lt;/li&gt;
&lt;li&gt;generate catastrophic predictions&lt;/li&gt;
&lt;li&gt;attempt to prepare by doing more and more&lt;/li&gt;
&lt;li&gt;get tired&lt;/li&gt;
&lt;li&gt;interpret tiredness as proof of incompetence&lt;/li&gt;
&lt;li&gt;repeat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At some point I thought: "This is just a pipeline with no guardrails."&lt;/p&gt;

&lt;p&gt;And that was the shift. The question stopped being "how do I feel better" and became "how do I change the routing."&lt;/p&gt;

&lt;p&gt;That framing is the entire article.&lt;/p&gt;
&lt;h2&gt;
  
  
  My internal agents
&lt;/h2&gt;

&lt;p&gt;Below are the representative agents. These are not mystical archetypes. They are functional components. Each one is useful in the right context and destructive in the wrong one.&lt;/p&gt;

&lt;p&gt;If you recognize yourself, congratulations. You are running the standard human firmware.&lt;/p&gt;
&lt;h3&gt;
  
  
  Agent 1: The Auditor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgz4ckpr72x4pgygb4eee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgz4ckpr72x4pgygb4eee.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
The Auditor is my internal adversarial reviewer. It thinks it is protecting me. It is not entirely wrong. The delivery is just brutal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it says:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"You are not qualified."&lt;/li&gt;
&lt;li&gt;"You got lucky."&lt;/li&gt;
&lt;li&gt;"They will ask one question you cannot answer."&lt;/li&gt;
&lt;li&gt;"If you ship now, you will regret it forever."&lt;/li&gt;
&lt;li&gt;"Everyone is polite, but they are keeping score."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it is trying to do (its positive intent):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prevent public humiliation&lt;/li&gt;
&lt;li&gt;prevent reputational collapse&lt;/li&gt;
&lt;li&gt;force rigor&lt;/li&gt;
&lt;li&gt;catch weak assumptions&lt;/li&gt;
&lt;li&gt;reduce variance in outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When it is actually useful:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;design reviews&lt;/li&gt;
&lt;li&gt;security and failure mode thinking&lt;/li&gt;
&lt;li&gt;pre-mortems&lt;/li&gt;
&lt;li&gt;deciding what not to promise&lt;/li&gt;
&lt;li&gt;asking "what could go wrong" before it goes wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it never terminates&lt;/li&gt;
&lt;li&gt;it demands certainty in a world that runs on probability&lt;/li&gt;
&lt;li&gt;it blocks shipping&lt;/li&gt;
&lt;li&gt;it converts excitement into dread&lt;/li&gt;
&lt;li&gt;it mistakes preparation for control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent analogy:&lt;/strong&gt;&lt;br&gt;
The Auditor is the critic agent. In AI, critics are essential. But if your critic is not timeboxed, it becomes an infinite loop. In humans, the same thing happens.&lt;br&gt;
One technical note that matters: critics optimize for avoidance. Builders optimize for progress. If you let the avoidance optimizer run the system, you get safety at the cost of reality. You also get resentment.&lt;br&gt;
Incident report: when The Auditor spikes&lt;br&gt;
This is the exact moment where someone says, "You are an expert," and my brain replies, "That seems illegal."&lt;br&gt;
In multi-agent terms: the critic starts producing unbounded tokens. The orchestrator loses control. The system becomes a panic generator.&lt;/p&gt;

&lt;p&gt;A typical spike looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I receive praise.&lt;/li&gt;
&lt;li&gt;The Auditor interprets praise as increased surveillance.&lt;/li&gt;
&lt;li&gt;It predicts a future audit.&lt;/li&gt;
&lt;li&gt;It demands immediate upskilling, on everything, now.&lt;/li&gt;
&lt;li&gt;It produces a list of hypothetical questions a stranger might ask me in six months.&lt;/li&gt;
&lt;li&gt;I attempt to answer all of them today.&lt;/li&gt;
&lt;li&gt;I become exhausted.&lt;/li&gt;
&lt;li&gt;Exhaustion becomes "evidence."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My fix is not "calm down." My fix is a protocol:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;run Auditor for 5 minutes&lt;/li&gt;
&lt;li&gt;force it to output 5 actionable risks max&lt;/li&gt;
&lt;li&gt;each risk must include one realistic mitigation&lt;/li&gt;
&lt;li&gt;route those risks to the Builder&lt;/li&gt;
&lt;li&gt;stop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This sounds simplistic. That is the point. Most complex systems are stabilized by simple rules.&lt;/p&gt;
&lt;h3&gt;
  
  
  Agent 2: The Gatekeeper
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1f3gxn3sg8nrleac7dr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1f3gxn3sg8nrleac7dr.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
This agent enforces legitimacy rules that were never officially published, but feel binding anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it says:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"You do not have the right degree."&lt;/li&gt;
&lt;li&gt;"Real engineers know theory."&lt;/li&gt;
&lt;li&gt;"Someone younger will embarrass you."&lt;/li&gt;
&lt;li&gt;"You cannot say you built that, because you did not do it the proper way."&lt;/li&gt;
&lt;li&gt;"You are borrowing credibility from smarter people."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Positive intent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;push toward fundamentals&lt;/li&gt;
&lt;li&gt;reduce sloppy thinking&lt;/li&gt;
&lt;li&gt;keep you humble&lt;/li&gt;
&lt;li&gt;prevent arrogance (a genuinely useful feature)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;credential worship&lt;/li&gt;
&lt;li&gt;ignores evidence of real work&lt;/li&gt;
&lt;li&gt;creates permanent "almost ready" projects&lt;/li&gt;
&lt;li&gt;makes you minimize your contribution in public&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent analogy:&lt;/strong&gt;&lt;br&gt;
The Gatekeeper is a schema validator with overly strict rules. It rejects valid outputs because the formatting is not what it expects.&lt;/p&gt;

&lt;p&gt;How I use it now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I give it a narrow window. "Tell me the 2 fundamentals I should review this week." Then it stops.&lt;/li&gt;
&lt;li&gt;I do not let it veto shipping. It can suggest improvements, not block release.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Agent 3: The Late Bloomer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpd0qdkqf36wixpegsxtw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpd0qdkqf36wixpegsxtw.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
This one is memory-heavy. It stores the narrative of being behind, slower, or "not built for this."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it says:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Everyone else learned this at 18."&lt;/li&gt;
&lt;li&gt;"You are late."&lt;/li&gt;
&lt;li&gt;"You always struggle."&lt;/li&gt;
&lt;li&gt;"This is the part where you fail again."&lt;/li&gt;
&lt;li&gt;"You are compensating, not belonging."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Positive intent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prevent repeating old pain&lt;/li&gt;
&lt;li&gt;encourage preparation&lt;/li&gt;
&lt;li&gt;avoid risky environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;turns growth into proof of defect&lt;/li&gt;
&lt;li&gt;makes learning feel shameful&lt;/li&gt;
&lt;li&gt;blocks new identities&lt;/li&gt;
&lt;li&gt;makes you compare timelines instead of outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent analogy:&lt;/strong&gt;&lt;br&gt;
This is a retrieval system with a biased dataset. It over-indexes on negative examples because those were emotionally salient.&lt;/p&gt;

&lt;p&gt;The engineering fix is the same as in AI retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;update the dataset&lt;/li&gt;
&lt;li&gt;add positive examples&lt;/li&gt;
&lt;li&gt;weight by recency, not trauma intensity&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Agent 4:  The Reality Doubter
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fede6fg6rus0cfrmq3b01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fede6fg6rus0cfrmq3b01.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
I have a deep respect for how easily brains can lie. That respect is partly philosophical, partly earned. When your perception has been altered, you never fully forget that "what feels true" is not the same as "what is true."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it says:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Are you sure you understand what is happening?"&lt;/li&gt;
&lt;li&gt;"What if your confidence is just mood?"&lt;/li&gt;
&lt;li&gt;"What if this is another story you invented?"&lt;/li&gt;
&lt;li&gt;"What if you are wrong and do not know it yet?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Positive intent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prevent delusion&lt;/li&gt;
&lt;li&gt;keep calibration and humility&lt;/li&gt;
&lt;li&gt;encourage grounding&lt;/li&gt;
&lt;li&gt;reduce overconfidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;paralysis by doubt&lt;/li&gt;
&lt;li&gt;loss of momentum&lt;/li&gt;
&lt;li&gt;over-checking basic decisions&lt;/li&gt;
&lt;li&gt;turning normal uncertainty into existential uncertainty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent analogy:&lt;/strong&gt;&lt;br&gt;
A safety agent that is valuable, but must not run as the orchestrator.&lt;/p&gt;

&lt;p&gt;How I use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it gets one question and one answer&lt;/li&gt;
&lt;li&gt;the answer must include an observable check, not an opinion
_Example: "What evidence would change my mind?" If no evidence exists, it is probably fear wearing a lab coat.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Agent 5: The Veteran Body
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84a8had24mlto8szwcz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84a8had24mlto8szwcz1.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
This agent is not emotional. It is physical. It reminds me that energy is the actual currency of life.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it says:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"You cannot brute force everything."&lt;/li&gt;
&lt;li&gt;"Sleep is not optional."&lt;/li&gt;
&lt;li&gt;"Your future self is not a free compute cluster."&lt;/li&gt;
&lt;li&gt;"You are not 25. That is fine. Stop pretending."&lt;/li&gt;
&lt;li&gt;"Your body will invoice you later."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Positive intent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sustainability&lt;/li&gt;
&lt;li&gt;pacing&lt;/li&gt;
&lt;li&gt;protecting family life and long-term work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cynicism&lt;/li&gt;
&lt;li&gt;"too late" narratives&lt;/li&gt;
&lt;li&gt;avoidance of ambition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent analogy:&lt;/strong&gt;&lt;br&gt;
Rate limiting and resource budgeting. In agentic systems, if you do not budget tokens and latency, you collapse. Same for humans.&lt;br&gt;
&lt;em&gt;A dry truth: when I ignore this agent, the Auditor gets louder. Fatigue is the Auditor's favorite amplifier.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Agent 6: The Builder
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vdlfq4vjcvi2jzu81ll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vdlfq4vjcvi2jzu81ll.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
This is the agent I trust most, because it produces artifacts. It does not argue. It ships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it says:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Show me the smallest test."&lt;/li&gt;
&lt;li&gt;"Make the demo."&lt;/li&gt;
&lt;li&gt;"Commit something."&lt;/li&gt;
&lt;li&gt;"If it is real, it leaves traces."&lt;/li&gt;
&lt;li&gt;"Stop narrating and run the thing."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Positive intent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;convert anxiety into evidence&lt;/li&gt;
&lt;li&gt;create momentum&lt;/li&gt;
&lt;li&gt;make reality measurable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;overwork&lt;/li&gt;
&lt;li&gt;compulsive building to avoid feeling&lt;/li&gt;
&lt;li&gt;treating productivity as self-worth&lt;/li&gt;
&lt;li&gt;building systems as emotional regulation (effective, but expensive)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent analogy:&lt;/strong&gt;&lt;br&gt;
The executor agent. The one that calls tools and changes the world. It needs a critic, but it needs autonomy too.&lt;br&gt;
This is why shipping is a mental health intervention for me. It is evidence. Evidence is the only language the Auditor respects.&lt;/p&gt;
&lt;h3&gt;
  
  
  Agent 7: The Proof Archivist
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo355n0bs2yxbprournz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo355n0bs2yxbprournz.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
This agent keeps the record. It is the antidote to impostor syndrome because impostor syndrome is amnesiac on purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it says:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Here is what you already shipped."&lt;/li&gt;
&lt;li&gt;"Here is the benchmark."&lt;/li&gt;
&lt;li&gt;"Here is the deployment."&lt;/li&gt;
&lt;li&gt;"Here is the code review where a strong engineer agreed."&lt;/li&gt;
&lt;li&gt;"Here is the message where you helped someone."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Positive intent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;restore memory&lt;/li&gt;
&lt;li&gt;prevent catastrophic reframing&lt;/li&gt;
&lt;li&gt;stabilize identity with evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nostalgia&lt;/li&gt;
&lt;li&gt;hiding in the past instead of facing current uncertainty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent analogy:&lt;/strong&gt;&lt;br&gt;
Memory plus observability. Without traces, you cannot debug. Without receipts, you cannot self-trust.&lt;br&gt;
This is the same reason production systems need replay. The present is noisy. Replay is clarity.&lt;/p&gt;


&lt;h2&gt;
  
  
  How the agents interact
&lt;/h2&gt;

&lt;p&gt;When I am regulated and functional, my system behaves like this:&lt;/p&gt;

&lt;p&gt;1) A trigger happens (visibility, risk, criticism, big new goal).&lt;br&gt;
2) The Auditor runs briefly and outputs bounded risk notes.&lt;br&gt;
3) The Gatekeeper validates fundamentals, but cannot veto.&lt;br&gt;
4) The Builder converts one risk into one concrete action.&lt;br&gt;
5) The Archivist pulls existing evidence so the system does not reset to zero.&lt;br&gt;
6) The Veteran Body sets a timebox and a stop condition.&lt;br&gt;
7) The Reality Doubter does a quick calibration check, then exits.&lt;/p&gt;

&lt;p&gt;When I am not regulated, the workflow looks like this:&lt;br&gt;
1) Trigger.&lt;br&gt;
2) Auditor loops.&lt;br&gt;
3) Everything else becomes a servant of the loop.&lt;br&gt;
4) I "prepare" for a future that does not exist.&lt;br&gt;
5) I exhaust the system.&lt;br&gt;
6) Exhaustion becomes proof.&lt;br&gt;
7) Shame becomes the only output.&lt;/p&gt;

&lt;p&gt;That is not a character flaw. It is a routing bug.&lt;/p&gt;
&lt;h2&gt;
  
  
  A day in the life of the workflow
&lt;/h2&gt;

&lt;p&gt;To make this less abstract, here is a normal day where the system either works or collapses.&lt;/p&gt;

&lt;p&gt;Morning: I open my laptop and see a message about a meeting.&lt;br&gt;
The sign stimulus hits. The Auditor wakes up and opens a spreadsheet in my chest. The Late Bloomer contributes a helpful comment like "this is where you fail again." The Builder wants to respond by building something immediately, because building is my safest language.&lt;/p&gt;

&lt;p&gt;If I let the system run uncontrolled, the day becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I over-prepare for the meeting.&lt;/li&gt;
&lt;li&gt;I ignore my actual task list.&lt;/li&gt;
&lt;li&gt;I do not ship anything.&lt;/li&gt;
&lt;li&gt;I end the day tired and ashamed, with a beautiful folder structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I run the workflow, the day becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Veteran Body sets a 20 minute preparation limit.&lt;/li&gt;
&lt;li&gt;Auditor gets 5 minutes and must produce 3 risks with mitigations.&lt;/li&gt;
&lt;li&gt;Builder chooses one mitigation and produces one artifact.&lt;/li&gt;
&lt;li&gt;Archivist pulls one piece of evidence from past work so my brain does not start from zero.&lt;/li&gt;
&lt;li&gt;Reality Doubter asks one calibration question: "What would success look like in one sentence?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then I go to the meeting.&lt;br&gt;
The outcome is not perfect. It does not have to be. It is stable.&lt;/p&gt;

&lt;p&gt;After the meeting, the Archivist runs again for 2 minutes.&lt;br&gt;
It writes: what went well, what did not, what was learned, what is next.&lt;br&gt;
Not a diary. A changelog.&lt;/p&gt;

&lt;p&gt;Evening: the Veteran Body insists on stopping.&lt;br&gt;
This is the hardest part for builders. We love infinite loops. But if you do not stop, tomorrow is garbage. A good orchestrator can end a run without killing the project.&lt;/p&gt;
&lt;h2&gt;
  
  
  A minimal YAML for the brain
&lt;/h2&gt;

&lt;p&gt;If you are a technical person, you may find it useful to think in a declarative flow. This is not code you should run. It is a way to see the structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;orchestrator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;marco_core&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;selective_activation&lt;/span&gt;
  &lt;span class="na"&gt;agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auditor&lt;/span&gt;
      &lt;span class="na"&gt;runs_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high_visibility"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high_risk"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;max_items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gatekeeper&lt;/span&gt;
      &lt;span class="na"&gt;runs_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identity_threat"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;3&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;max_items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;builder&lt;/span&gt;
      &lt;span class="na"&gt;runs_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;always"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;60&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deliverable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;artifact"&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;archivist&lt;/span&gt;
      &lt;span class="na"&gt;runs_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auditor_spike"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_ship"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deliverable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evidence"&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;veteran_body&lt;/span&gt;
      &lt;span class="na"&gt;runs_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;always"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deliverable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_condition"&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reality_doubter&lt;/span&gt;
      &lt;span class="na"&gt;runs_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;perception_drift"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deliverable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;one_check"&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key line is selective_activation.You do not run all agents all the time. You route based on context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this model resonates with ethology
&lt;/h2&gt;

&lt;p&gt;Ethology is basically the study of orchestration in living systems.&lt;/p&gt;

&lt;p&gt;An animal is not one motivation. It is multiple motivations negotiating. The environment is not background. It is an input signal that changes which subsystem wins.&lt;/p&gt;

&lt;p&gt;In tech terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context is the prompt&lt;/li&gt;
&lt;li&gt;internal state is hidden memory&lt;/li&gt;
&lt;li&gt;behavior is the output action&lt;/li&gt;
&lt;li&gt;reinforcement updates the policy over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The part that matters: you cannot judge an animal's behavior without its context. And you cannot judge your own mental behavior without context either.&lt;/p&gt;

&lt;p&gt;My impostor agent is louder when I am tired. It is quieter when I have shipped something recently. It is unbearable when the work is public and ambiguous. That is not a moral failure. That is state-dependent behavior selection.&lt;/p&gt;

&lt;p&gt;Also, ethology gives you a mercy rule: many behaviors are adaptive in one environment and maladaptive in another. Impostor syndrome is adaptive if you live in a social environment where mistakes are punished harshly. It becomes maladaptive when you are in an environment where learning requires public experimentation.&lt;/p&gt;

&lt;p&gt;In other words: the agent is not evil. The environment changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing this in a multi-agent AI workflow
&lt;/h2&gt;

&lt;p&gt;If you want to implement this idea in actual software, the mapping is almost direct.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a clear orchestrator that decides who runs when&lt;/li&gt;
&lt;li&gt;role separation (critic is not executor)&lt;/li&gt;
&lt;li&gt;timeouts and budgets (critics get limited tokens)&lt;/li&gt;
&lt;li&gt;a memory component that stores evidence and prior decisions&lt;/li&gt;
&lt;li&gt;observability (logs and traces you can replay)&lt;/li&gt;
&lt;li&gt;a stopping rule (or you will plan forever)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You also need a principle that most people ignore: not every agent should run on every request. Humans do selective activation. A deer does not run its "mating strategy" module while fleeing a predator. If your system runs every agent on every query, you built a committee that never shuts up.&lt;/p&gt;

&lt;p&gt;This is where most multi-agent demos fail in production. They are cognitively unselective.&lt;/p&gt;

&lt;p&gt;Brains are selective because they have to be. Compute is expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation notes
&lt;/h2&gt;

&lt;p&gt;This is the part where the engineering and the psychology become the same thing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Observability is emotional regulation.&lt;/strong&gt; If you cannot see what happened, you will invent stories. Humans invent blame stories. Systems invent hallucinations. Traces are the antidote for both. Log what ran, what it saw, what it decided, and why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay is self-trust.&lt;/strong&gt; If a workflow cannot be replayed, you cannot debug it. If your personal decision making cannot be replayed, you cannot learn from it. This is why the Archivist matters. It is not sentimentality. It is reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation must be explicit.&lt;/strong&gt; If your only evaluation is "seems good," the Auditor will never accept the result. Give the system a score, a rubric, or at least a binary gate. Humans need this too. The Builder needs a definition of done. The Auditor needs a stop condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not run every agent.&lt;/strong&gt; Selective activation is not optional. It is the difference between a useful team and a meeting that never ends. It is also the difference between a helpful inner voice and a spiral.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Put the critic behind an interface.&lt;/strong&gt; A critic that can talk forever will. Force it to write issues in a structured format. Then route those issues elsewhere. In humans, the structure is a timer and a list of mitigations. In AI, the structure is a schema and a max token budget.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you build multi-agent systems and you are surprised by chaos, do not take it personally. You just discovered that coordination is the product, not the agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to timebox your critic
&lt;/h2&gt;

&lt;p&gt;If your critic agent is unconstrained, it will dominate. Critics are good at finding flaws. That is their job. The flaw is that they can always find more flaws.&lt;/p&gt;

&lt;p&gt;In engineering, you solve this with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;budgets&lt;/li&gt;
&lt;li&gt;termination criteria&lt;/li&gt;
&lt;li&gt;required output schemas&lt;/li&gt;
&lt;li&gt;evaluation gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In humans, you can do the exact same thing.&lt;/p&gt;

&lt;p&gt;Here is the prompt I use internally, in plain language:&lt;br&gt;
"Give me the top 3 risks. Each must include one mitigation. If you cannot propose a mitigation, the risk is not actionable and you may not include it."&lt;/p&gt;

&lt;p&gt;That simple constraint changes the critic from a doomsayer to an engineer.&lt;/p&gt;

&lt;p&gt;In AI, you do the same. You force your critic to output structured concerns, not poetic fear. And you do not allow it to request infinite follow-up.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical exercise
&lt;/h2&gt;

&lt;p&gt;If you want a lighter, human version, do this:&lt;/p&gt;

&lt;p&gt;Step 1: Name the voices you already have.&lt;br&gt;
Not the poetic ones. The functional ones. The part that criticizes. The part that avoids. The part that builds. The part that remembers. The part that worries about social status.&lt;/p&gt;

&lt;p&gt;Step 2: Give each one a job.&lt;br&gt;
Write one sentence: "Your job is to..." This is the fastest way to stop a part from impersonating the CEO.&lt;/p&gt;

&lt;p&gt;Step 3: Put limits on the ones that never stop.&lt;br&gt;
Give your inner critic a timer. Literally. Five minutes. Then it must output a list of actionable risks and shut up.&lt;/p&gt;

&lt;p&gt;Step 4: Add a Builder step.&lt;br&gt;
One risk becomes one action. Not ten. Not a new life plan. One.&lt;/p&gt;

&lt;p&gt;Step 5: Add an Archivist step.&lt;br&gt;
Write down receipts. You do not need a journal. You need a changelog. Your brain is bad at remembering progress under stress.&lt;/p&gt;

&lt;p&gt;Step 6: Decide the stop condition.&lt;br&gt;
Finish when you have evidence, not when you have comfort. Comfort has no upper bound.&lt;/p&gt;

&lt;p&gt;Step 7: Add a recovery routine.&lt;br&gt;
Animals recover after threat. They shake, groom, rest. Humans skip that and call it discipline. Your nervous system is not impressed. Add a short cooldown. It makes the next day possible.&lt;/p&gt;

&lt;p&gt;This is not about becoming fearless. It is about becoming debuggable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes at work
&lt;/h2&gt;

&lt;p&gt;Impostor syndrome is not just personal. It leaks into systems.&lt;/p&gt;

&lt;p&gt;When the Auditor runs unchecked inside a team, you see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;overengineering as anxiety management&lt;/li&gt;
&lt;li&gt;reluctance to ship without perfection&lt;/li&gt;
&lt;li&gt;endless refactors&lt;/li&gt;
&lt;li&gt;fear of visibility&lt;/li&gt;
&lt;li&gt;blaming ambiguity instead of designing for it&lt;/li&gt;
&lt;li&gt;slow decision cycles because nobody wants to be wrong in public&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the Builder runs unchecked, you see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shipping without tests&lt;/li&gt;
&lt;li&gt;burning out the team&lt;/li&gt;
&lt;li&gt;confusing motion with progress&lt;/li&gt;
&lt;li&gt;"we will fix it later" becoming the roadmap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So a sane team workflow is the same as a sane brain workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;critics with budgets&lt;/li&gt;
&lt;li&gt;builders with autonomy&lt;/li&gt;
&lt;li&gt;a clear orchestrator (tech lead, product lead, or a documented process)&lt;/li&gt;
&lt;li&gt;observability, so you can debug without blaming people&lt;/li&gt;
&lt;li&gt;explicit definitions of done, so the critic can stop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why I am obsessed with tracing and replay in agentic systems. It is also why I keep personal receipts. It is the same problem at two scales.&lt;/p&gt;

&lt;p&gt;One more dry observation: teams do displacement behaviors too. A team under social threat will fight about naming conventions. It will propose rewrites. It will build frameworks. Sometimes frameworks are necessary. Sometimes they are just grooming behavior with TypeScript.&lt;/p&gt;

&lt;p&gt;The fix is the same as for an individual: reduce threat, add clarity, and route energy into measurable outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built orchestration tooling at all
&lt;/h2&gt;

&lt;p&gt;I am not building agent orchestration because it is trendy. I am building it because it solves the exact problem I have internally: specialized components are powerful, but only if the system can coordinate them without chaos.&lt;/p&gt;

&lt;p&gt;That is what orchestration is: turning a messy swarm of capabilities into something that can ship reliably.&lt;/p&gt;

&lt;p&gt;If you are building multi-agent systems and you keep hitting the same walls (replay, observability, routing, cost control), you are not failing. You are rediscovering why orchestration exists.&lt;/p&gt;

&lt;p&gt;If you want a concrete place to start, my work in this direction is OrKA-reasoning: &lt;a href="https://github.com/marcosomma/orka-reasoning" rel="noopener noreferrer"&gt;https://github.com/marcosomma/orka-reasoning&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;There is a quote I keep coming back to: the measure of intelligence is the ability to change.&lt;/p&gt;

&lt;p&gt;My impostor syndrome hates that quote, because change implies uncertainty. The Auditor wants certainty. The Builder wants movement. The Veteran Body wants sustainability. The Archivist wants receipts. The Gatekeeper wants legitimacy. The Reality Doubter wants calibration. The Late Bloomer wants to not get hurt again.&lt;/p&gt;

&lt;p&gt;None of them are evil. They are just agents with different utility functions.&lt;/p&gt;

&lt;p&gt;My job is not to silence them. My job is to orchestrate them.&lt;/p&gt;

&lt;p&gt;And if this article did nothing else, I hope it gives you permission to treat your own mind like a system that can be designed. Not perfectly. Not permanently. But iteratively, with logs, with retries, and with a little less shame.&lt;/p&gt;

&lt;p&gt;Because if your brain is going to run twelve services in parallel, you might as well add observability.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Design Two Practical Orchestration Loops for LLM Agents</title>
      <dc:creator>marcosomma</dc:creator>
      <pubDate>Mon, 08 Dec 2025 11:00:28 +0000</pubDate>
      <link>https://dev.to/marcosomma/how-to-design-two-practical-orchestration-loops-for-llm-agents-513k</link>
      <guid>https://dev.to/marcosomma/how-to-design-two-practical-orchestration-loops-for-llm-agents-513k</guid>
      <description>&lt;p&gt;Building a useful AI assistant is no longer about a single clever prompt.&lt;br&gt;&lt;br&gt;
Once you have tools, memory, and multiple agents, you need an &lt;strong&gt;orchestrator&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In my own work (expecially with OrKa-reasoning experiments) I eventually converged on &lt;strong&gt;two simple orchestration loops&lt;/strong&gt; that cover most real use cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;linear loop&lt;/strong&gt; for step by step analysis and context extraction.
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;circular streaming loop&lt;/strong&gt; for voice and live chat, where background agents enrich context in real time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This guide explains &lt;strong&gt;why you need both&lt;/strong&gt;, &lt;strong&gt;when to use each one&lt;/strong&gt;, and &lt;strong&gt;how to design them&lt;/strong&gt; in any stack or framework.&lt;/p&gt;

&lt;p&gt;You can think of this as a blueprint that you can map to your own code, whether you use OrKa, LangChain, your own custom orchestrator, or plain queues and workers.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. The three layers you should always separate
&lt;/h2&gt;

&lt;p&gt;Before loops, define your &lt;strong&gt;layers&lt;/strong&gt;. This makes every diagram, API and code path clearer.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Execution layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Agents and responders live here.
&lt;/li&gt;
&lt;li&gt;"Agent" means any unit that does work: a model call, a tool, a heuristic function, a router.
&lt;/li&gt;
&lt;li&gt;"Responder" is the agent that produces the final user facing output for a turn or a session.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  2. Communication layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How agents talk to each other and to the orchestrator.
&lt;/li&gt;
&lt;li&gt;Examples: queues, events, internal RPC calls, function callbacks.
&lt;/li&gt;
&lt;li&gt;You rarely want agents to call each other directly. Route everything through this layer so you can trace and control it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3. Memory layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Where you store and retrieve state across time.
&lt;/li&gt;
&lt;li&gt;Can be a vector store, a key value store, a database, or a log.
&lt;/li&gt;
&lt;li&gt;It should not be "hidden in the prompt". Treat memory as its own component.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  4. Time as a first class dimension
&lt;/h3&gt;

&lt;p&gt;Both loops treat &lt;strong&gt;time&lt;/strong&gt; explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the &lt;strong&gt;linear loop&lt;/strong&gt; you have discrete steps: T0, T1, T2, T3.
&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;circular loop&lt;/strong&gt; you have a continuous stream while the conversation is active.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you have these pieces, you can design the two orchestration patterns.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Loop 1: Linear orchestrator for context extraction and analysis
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr27bdgdnhzfmj5onctxo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr27bdgdnhzfmj5onctxo.jpg" alt=" " width="800" height="231"&gt;&lt;/a&gt;&lt;br&gt;
The first pattern is a &lt;strong&gt;linear pipeline&lt;/strong&gt;. Think of it as a conveyor belt for understanding.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.1 When to use the linear loop
&lt;/h3&gt;

&lt;p&gt;Use it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a fixed input (text, transcript, document, set of logs).
&lt;/li&gt;
&lt;li&gt;You want to run &lt;strong&gt;several analytic passes&lt;/strong&gt; over it.
&lt;/li&gt;
&lt;li&gt;Latency is important but not sub second interactive.
&lt;/li&gt;
&lt;li&gt;Output is usually a summary, a report, a classification, or structured data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conversation analysis after a call has ended.
&lt;/li&gt;
&lt;li&gt;Extracting entities and topics from chat logs.
&lt;/li&gt;
&lt;li&gt;Multi stage document processing (OCR, cleaning, classification, summarization).
&lt;/li&gt;
&lt;li&gt;Offline quality checks for previous sessions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  2.2 Mental model
&lt;/h3&gt;

&lt;p&gt;Picture a horizontal diagram:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Left: an &lt;strong&gt;INPUT&lt;/strong&gt; arrow.
&lt;/li&gt;
&lt;li&gt;Right: a &lt;strong&gt;Responder&lt;/strong&gt; that produces the final structured output.
&lt;/li&gt;
&lt;li&gt;In between: time steps T0 to Tn.
&lt;/li&gt;
&lt;li&gt;Each time slice has:

&lt;ul&gt;
&lt;li&gt;one or more agents in the &lt;strong&gt;Execution&lt;/strong&gt; layer
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;Communication&lt;/strong&gt; band in the middle
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;Memory&lt;/strong&gt; band at the top&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At each step, agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;may retrieve&lt;/strong&gt; from memory
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;may store&lt;/strong&gt; new facts or summaries back into memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The orchestrator walks through these steps one by one.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.3 Step by step design
&lt;/h3&gt;

&lt;p&gt;You can design a linear workflow in five steps.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 1: Define the final output
&lt;/h4&gt;

&lt;p&gt;Decide what the responder will produce. Some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON with fields like &lt;code&gt;intent&lt;/code&gt;, &lt;code&gt;sentiment&lt;/code&gt;, &lt;code&gt;entities&lt;/code&gt;, &lt;code&gt;summary&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;A human readable report that you will send to a dashboard.
&lt;/li&gt;
&lt;li&gt;Labels and scores that feed another system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Write this down early. Every other agent should exist to help this responder succeed.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 2: Split the job into stages
&lt;/h4&gt;

&lt;p&gt;Ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What must be known first so that later steps can reuse it?
&lt;/li&gt;
&lt;li&gt;What can be done independently?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, for conversation analysis:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Normalization and language detection.
&lt;/li&gt;
&lt;li&gt;Entity extraction (names, account ids, products).
&lt;/li&gt;
&lt;li&gt;Topic and intent detection.
&lt;/li&gt;
&lt;li&gt;Sentiment and escalation risk.
&lt;/li&gt;
&lt;li&gt;Final summary and suggestions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each stage becomes a &lt;strong&gt;time slice&lt;/strong&gt; with one or more agents.&lt;/p&gt;
&lt;h4&gt;
  
  
  Step 3: Design the memory schema
&lt;/h4&gt;

&lt;p&gt;For each stage, list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the agent reads from memory.
&lt;/li&gt;
&lt;li&gt;What the agent writes back.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A very simple schema might be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"entities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also scope memory by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;session_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;time_window&lt;/code&gt; (for rolling analysis)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key rule: agents should not depend on hidden context inside prompts. The orchestrator passes them a &lt;strong&gt;clean input&lt;/strong&gt; and a &lt;strong&gt;structured slice of memory&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Wire store and retrieve
&lt;/h4&gt;

&lt;p&gt;For each agent, specify two small functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read(memory) -&amp;gt; context&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write(memory, result) -&amp;gt; memory&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In code it can look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Load what this step needs
&lt;/span&gt;    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Run the agent with input and context
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;raw_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Write new facts
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the use of &lt;strong&gt;may store&lt;/strong&gt; and &lt;strong&gt;may retrieve&lt;/strong&gt;. Some steps will only write, some will only read.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5: Implement the responder as the last step
&lt;/h4&gt;

&lt;p&gt;The responder is just another agent with a special role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reads everything it needs from memory.
&lt;/li&gt;
&lt;li&gt;It produces the final answer.
&lt;/li&gt;
&lt;li&gt;It may log additional metadata back to memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In many stacks this is a single chat completion call that uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The original input.
&lt;/li&gt;
&lt;li&gt;The outputs of previous analytic agents.
&lt;/li&gt;
&lt;li&gt;Any long term user or session memory you decide to attach.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.4 Example: conversation analysis pipeline
&lt;/h3&gt;

&lt;p&gt;Imagine you want to analyze support chats after they end.&lt;/p&gt;

&lt;p&gt;You can define:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;LanguageDetectorAgent&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads: raw transcript
&lt;/li&gt;
&lt;li&gt;Writes: &lt;code&gt;memory["language"]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;EntityExtractorAgent&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads: transcript, language
&lt;/li&gt;
&lt;li&gt;Writes: &lt;code&gt;memory["entities"]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;TopicClassifierAgent&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads: transcript, entities
&lt;/li&gt;
&lt;li&gt;Writes: &lt;code&gt;memory["topics"]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SentimentAgent&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads: transcript
&lt;/li&gt;
&lt;li&gt;Writes: &lt;code&gt;memory["sentiment"]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SummaryResponder&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads: transcript, entities, topics, sentiment
&lt;/li&gt;
&lt;li&gt;Writes: final human readable summary and a JSON record.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This maps perfectly to the linear diagram and is easy to debug step by step.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Loop 2: Circular streaming orchestrator for live chat and voice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feal181zhngl23ljemhjn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feal181zhngl23ljemhjn.jpg" alt=" " width="800" height="766"&gt;&lt;/a&gt;&lt;br&gt;
The second pattern appears once you move from offline analysis to &lt;strong&gt;live interaction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With voice or interactive chat, you want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React quickly while the user is still speaking or typing.
&lt;/li&gt;
&lt;li&gt;Run several background analyses in parallel.
&lt;/li&gt;
&lt;li&gt;Avoid sending the full transcript to every agent on every turn.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;circular loop&lt;/strong&gt; pattern is built for that.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.1 When to use the circular loop
&lt;/h3&gt;

&lt;p&gt;Use it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You stream audio or tokens in and out.
&lt;/li&gt;
&lt;li&gt;You have a central "assistant" that talks to the user.
&lt;/li&gt;
&lt;li&gt;You also want &lt;strong&gt;background agents&lt;/strong&gt; that detect things like:

&lt;ul&gt;
&lt;li&gt;sentiment shifts
&lt;/li&gt;
&lt;li&gt;safety or compliance issues
&lt;/li&gt;
&lt;li&gt;intent changes
&lt;/li&gt;
&lt;li&gt;entities that should update a CRM
&lt;/li&gt;
&lt;li&gt;interesting moments to bookmark&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of a voice assistant, a real time meeting copilot, or a smart chatbot with live tools.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.2 Mental model
&lt;/h3&gt;

&lt;p&gt;Picture a circular diagram with concentric rings.&lt;/p&gt;

&lt;p&gt;From center to outside:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Responder&lt;/strong&gt; in the middle.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Execution&lt;/strong&gt; ring around it.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Communication&lt;/strong&gt; ring.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; ring.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents Execution&lt;/strong&gt; ring at the outside.
&lt;/li&gt;
&lt;li&gt;An outer &lt;strong&gt;Time&lt;/strong&gt; band that wraps around everything.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Input and output are green arrows that cross all rings. Time flows along the outer band as a stream of chunks or tokens.&lt;/p&gt;

&lt;p&gt;Key idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The responder loop processes the conversation in real time.
&lt;/li&gt;
&lt;li&gt;Outer agents run in parallel, watch the same stream, and &lt;strong&gt;provide context&lt;/strong&gt; through memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  3.3 Step by step design
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Step 1: Define the central responder loop
&lt;/h4&gt;

&lt;p&gt;Your responder is the "voice" of the system.&lt;/p&gt;

&lt;p&gt;Define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How it receives input chunks.
&lt;/li&gt;
&lt;li&gt;How it produces output chunks.
&lt;/li&gt;
&lt;li&gt;How often it reads from memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;session_active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_input_chunk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;          &lt;span class="c1"&gt;# text or audio tokens
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_recent&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;   &lt;span class="c1"&gt;# signals from context agents
&lt;/span&gt;    &lt;span class="n"&gt;reply_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;responder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;write_output_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can implement &lt;code&gt;responder&lt;/code&gt; as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One LLM call with a rolling window.
&lt;/li&gt;
&lt;li&gt;A chain of small agents that produce tokens.
&lt;/li&gt;
&lt;li&gt;A hybrid of LLM plus rule based logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is that this loop &lt;strong&gt;does not own all the work&lt;/strong&gt;. It asks memory for extra signals that the outer agents have produced.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Identify which signals can live in outer agents
&lt;/h4&gt;

&lt;p&gt;Ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What information would help the responder, but does not need to be computed inside its main prompt every time?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current sentiment and its trend over the last N seconds.
&lt;/li&gt;
&lt;li&gt;Detected entities and slots like &lt;code&gt;{customer_name}&lt;/code&gt;, &lt;code&gt;{product}&lt;/code&gt;, &lt;code&gt;{order_id}&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Safety flags with severity scores.
&lt;/li&gt;
&lt;li&gt;Topics that have been discussed so far.
&lt;/li&gt;
&lt;li&gt;Next best actions suggested for the human operator.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these can be produced by one or more &lt;strong&gt;context agents&lt;/strong&gt; on the outer ring.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Design the memory schema for streaming
&lt;/h4&gt;

&lt;p&gt;Memory in streaming systems often has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;rolling part&lt;/strong&gt; (last N seconds or tokens).
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;session part&lt;/strong&gt; (facts that are true for the whole session).
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;global or user part&lt;/strong&gt; (long term facts across sessions).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rolling"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"recent_sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"recent_topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"customer_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"current_ticket_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"has_accepted_terms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lifetime_value_segment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"preferred_language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outer agents usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the rolling slice plus some session context.
&lt;/li&gt;
&lt;li&gt;Write updated signals back, possibly aggregating multiple chunks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The responder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads what it needs from all three scopes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 4: Wire context agents around the stream
&lt;/h4&gt;

&lt;p&gt;Each context agent has a simple shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;context_agent_loop&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;session_active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_input_chunk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;mem_view&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rolling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;signal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem_view&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implementation tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You do not need every agent to inspect every chunk. Some can run at a lower frequency, for example every N seconds.
&lt;/li&gt;
&lt;li&gt;Use queues or topics per agent so the orchestrator can control resource usage.
&lt;/li&gt;
&lt;li&gt;Tag signals with timestamps so the responder can select only fresh ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 5: Let the responder consume context selectively
&lt;/h4&gt;

&lt;p&gt;Inside the responder, treat signals from context agents as &lt;strong&gt;hints&lt;/strong&gt;, not as gospel.&lt;/p&gt;

&lt;p&gt;For example, the prompt can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You receive input from the user and a set of context signals created by other agents.&lt;br&gt;&lt;br&gt;
Each signal has a name and a confidence.&lt;br&gt;&lt;br&gt;
Use them as hints to guide your reply, but prefer the actual user message when signals look inconsistent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That way your outer ring can fail safely without breaking the core interaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Example: voice support assistant
&lt;/h3&gt;

&lt;p&gt;You can combine these ideas into a simple design.&lt;/p&gt;

&lt;p&gt;Outer agents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ASRAgent&lt;/strong&gt; (if you handle raw audio)  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converts audio into text chunks.
&lt;/li&gt;
&lt;li&gt;Writes into &lt;code&gt;rolling.transcript&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SentimentWatcherAgent&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads recent transcript.
&lt;/li&gt;
&lt;li&gt;Writes a rolling sentiment score and trend.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;EntityTrackerAgent&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts order ids, product names, locations.
&lt;/li&gt;
&lt;li&gt;Writes them into &lt;code&gt;session.entities&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ComplianceAgent&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches for forbidden phrases.
&lt;/li&gt;
&lt;li&gt;Writes risk flags into &lt;code&gt;rolling.compliance&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Central responder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads the current user utterance and:

&lt;ul&gt;
&lt;li&gt;latest sentiment
&lt;/li&gt;
&lt;li&gt;recognized entities
&lt;/li&gt;
&lt;li&gt;any active compliance flags
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Generates the next reply chunk in real time.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;All of this happens while the user is talking, without sending the full raw transcript to every agent at every step.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. How to choose between linear and circular
&lt;/h2&gt;

&lt;p&gt;Here is a practical checklist.&lt;/p&gt;

&lt;p&gt;Use the &lt;strong&gt;linear orchestrator&lt;/strong&gt; if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input is fixed and finite.
&lt;/li&gt;
&lt;li&gt;You can afford to wait for all stages to finish before replying.
&lt;/li&gt;
&lt;li&gt;Main goal is analysis, extraction, or offline insight.
&lt;/li&gt;
&lt;li&gt;You want reproducible deterministic workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use the &lt;strong&gt;circular streaming orchestrator&lt;/strong&gt; if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You must keep latency low while a conversation is ongoing.
&lt;/li&gt;
&lt;li&gt;You need long running observers that enrich context.
&lt;/li&gt;
&lt;li&gt;You want to separate the "voice" of the system from its background intelligence.
&lt;/li&gt;
&lt;li&gt;You treat the session as an ongoing process rather than as isolated turns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many products actually need &lt;strong&gt;both&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Circular loop during the live session.
&lt;/li&gt;
&lt;li&gt;Linear loop right after the session to produce deeper analysis and training data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you keep the three layers and the time dimension clear in your head, switching between both becomes straightforward.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Practical tips and pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Keep memory explicit and queryable
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid hiding crucial state in the prompt history.
&lt;/li&gt;
&lt;li&gt;Use structured memory objects and explicit read/write functions.
&lt;/li&gt;
&lt;li&gt;Log memory changes so you can replay and debug sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.2 Make agents idempotent and composable
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Wherever possible, design agents so that running them twice on the same input produces the same result.
&lt;/li&gt;
&lt;li&gt;This helps with retries and with mixing them in different workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.3 Watch cost and latency separately
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In linear flows you usually pay in total cost and overall latency.
&lt;/li&gt;
&lt;li&gt;In circular flows you pay in per chunk latency and in steady state cost.
&lt;/li&gt;
&lt;li&gt;Monitor both, and be ready to move some work from inner to outer loop or vice versa.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.4 Use diagrams as living documentation
&lt;/h3&gt;

&lt;p&gt;The two diagrams that inspired this guide are simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A horizontal banded diagram for the linear loop.
&lt;/li&gt;
&lt;li&gt;A circular banded diagram for the streaming loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep them close to your code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a &lt;code&gt;docs/&lt;/code&gt; folder.
&lt;/li&gt;
&lt;li&gt;In your orchestrator repository README.
&lt;/li&gt;
&lt;li&gt;Even inside your OrKa or other YAML definitions as comments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They help new contributors answer the question:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Where does this agent live, and which loop is it part of?"&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Light touch: how OrKa fits in
&lt;/h2&gt;

&lt;p&gt;In my own project, &lt;a href="https://github.com/marcosomma/orka-reasoning" rel="noopener noreferrer"&gt;OrKA-reasoning&lt;/a&gt;, I encode both loops as &lt;strong&gt;YAML workflows&lt;/strong&gt; and use an orchestrator runtime to execute them. The diagrams here are direct visualizations of those flows.&lt;/p&gt;

&lt;p&gt;You do not need OrKa to benefit from this guide, though.&lt;br&gt;&lt;br&gt;
The key ideas are independent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate &lt;strong&gt;execution&lt;/strong&gt;, &lt;strong&gt;communication&lt;/strong&gt;, and &lt;strong&gt;memory&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Treat &lt;strong&gt;time&lt;/strong&gt; explicitly.
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;two simple loops&lt;/strong&gt; instead of one giant graph.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you think in these terms, you can map them to any framework or stack you like.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Next steps
&lt;/h2&gt;

&lt;p&gt;To apply this guide in your own project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick one use case that feels messy today.
&lt;/li&gt;
&lt;li&gt;Decide if it is primarily analytic or live interactive.
&lt;/li&gt;
&lt;li&gt;Draw either the linear or the circular diagram for it.
&lt;/li&gt;
&lt;li&gt;List agents, memory fields, and store/retrieve rules.
&lt;/li&gt;
&lt;li&gt;Implement the orchestrator loop in your existing toolchain.
&lt;/li&gt;
&lt;li&gt;Add one or two context agents on the side, and see how much simpler the main responder becomes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You will notice that many problems which felt like "prompt engineering" issues were actually &lt;strong&gt;orchestration&lt;/strong&gt; issues all along.&lt;/p&gt;

&lt;p&gt;Once you solve those at the architecture level, prompts become smaller, agents become clearer, and the overall system is easier to reason about and to evolve.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
