<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vilius</title>
    <description>The latest articles on DEV Community by Vilius (@vystartasv).</description>
    <link>https://dev.to/vystartasv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F133303%2F50baa34e-e011-4576-8b1a-5974d272fc34.jpg</url>
      <title>DEV Community: Vilius</title>
      <link>https://dev.to/vystartasv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vystartasv"/>
    <language>en</language>
    <item>
      <title>My AI Agents Kept Burning Tokens on Subagents That Can't Code — So I Built a Decision Gate</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Mon, 04 May 2026 17:12:31 +0000</pubDate>
      <link>https://dev.to/vystartasv/my-ai-agents-kept-burning-tokens-on-subagents-that-cant-code-so-i-built-a-decision-gate-2135</link>
      <guid>https://dev.to/vystartasv/my-ai-agents-kept-burning-tokens-on-subagents-that-cant-code-so-i-built-a-decision-gate-2135</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I run 19 autonomous AI agents in production. They handle research, content, monitoring, deployment — the kind of always-on work that makes a solo developer's output look like a small team's.&lt;/p&gt;

&lt;p&gt;The delegation feature was supposed to be the multiplier. Spawn a subagent, give it a task, get results in parallel. In theory, it turns one agent into many. In practice, it was burning thousands of tokens for exactly zero output.&lt;/p&gt;

&lt;p&gt;The problem wasn't the agents. It was that nobody had taught them &lt;em&gt;when not to delegate&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem That Forced My Hand
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you ask a subagent to code something:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The subagent spawns, reads the context, starts working — looks promising&lt;/li&gt;
&lt;li&gt;It tries to write a file. The file operation fails silently. The subagent doesn't notice&lt;/li&gt;
&lt;li&gt;It tries again with a different approach. Same silent failure&lt;/li&gt;
&lt;li&gt;Six hundred seconds later: timeout. Zero output. Thousands of tokens gone&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The core issue is structural: subagents can't reliably write files, can't run builds, can't verify their own output. They're built for &lt;strong&gt;read-only work&lt;/strong&gt; — research, analysis, data gathering. But nothing in the agent's training tells it that. It just sees "task → delegate" and fires.&lt;/p&gt;

&lt;p&gt;I watched this happen dozens of times. Every failure was another chunk of the context window gone, another session wasted, another moment of wondering whether multi-agent workflows were fundamentally broken.&lt;/p&gt;

&lt;p&gt;They weren't. The delegation call just needed a bouncer at the door.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built: Agentic Delegation
&lt;/h2&gt;

&lt;p&gt;Agentic Delegation is a decision protocol that sits between your agent and its delegation tool. It has three layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Decision Tree
&lt;/h3&gt;

&lt;p&gt;Before any &lt;code&gt;delegate_task&lt;/code&gt; call, the protocol classifies the work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CODING → BLOCKED. Routed to write_file/patch/terminal (10x faster, 100% reliable)
RESEARCH → ALLOWED. But verified after completion, max 2 retries
UNKNOWN → DECOMPOSED. Broken into atomic subtasks first, then routed individually
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a hard rule, not a suggestion. The skill document literally says "NEVER VIOLATE" at the top of the coding section. If your agent ignores it and delegates coding anyway, there's a self-correction protocol that kicks in after the inevitable timeout.&lt;/p&gt;
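&lt;p&gt;The gate itself can be as small as a keyword classifier. Here's a minimal sketch of the routing logic — the keyword lists are illustrative, not the actual SKILL.md rules:&lt;/p&gt;

```python
# Sketch of the decision-tree routing described above.
# Keyword lists are illustrative placeholders, not the real SKILL.md rules.
CODING_KEYWORDS = {"implement", "write", "fix", "refactor", "build", "patch", "code"}
RESEARCH_KEYWORDS = {"research", "find", "summarize", "compare", "gather", "analyze"}

def classify(task: str) -> str:
    words = set(task.lower().split())
    if words & CODING_KEYWORDS:
        return "BLOCKED"      # route to write_file / patch / terminal instead
    if words & RESEARCH_KEYWORDS:
        return "ALLOWED"      # delegate, verify after completion, max 2 retries
    return "DECOMPOSED"       # split into atomic subtasks, route individually

print(classify("implement JWT auth"))             # BLOCKED
print(classify("research GRPO training papers"))  # ALLOWED
print(classify("tidy up the project"))            # DECOMPOSED
```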

&lt;h3&gt;
  
  
  2. The Task Decomposer
&lt;/h3&gt;

&lt;p&gt;Complex tasks get broken into atomic subtasks by a lightweight classifier — either your local LLM (free) or Gemini Flash (cheap cloud fallback). No dependencies beyond Python's stdlib.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python3.11 scripts/decompose.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Research GRPO training papers, write a summary, and add it to README"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Research GRPO training papers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"delegate"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Write a summary of the findings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"direct"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Update the project README"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"direct"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three subtasks. One delegated (the research). Two handled directly (the summary and the README update). No subagent ever touches a file.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Validation Gate
&lt;/h3&gt;

&lt;p&gt;Models hallucinate. Sometimes the decomposer labels a coding task as "delegate." The validation gate catches this with a hard keyword check and reassigns it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'[{"id":"1","description":"implement JWT auth","tool":"delegate"}]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | python3.11 scripts/decompose.py &lt;span class="nt"&gt;--validate-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"implement JWT auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"direct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[FIXED: was delegate]"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The annotation is deliberate. It leaves a paper trail so you can see what the model wanted to do versus what the gate enforced.&lt;/p&gt;
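&lt;p&gt;The whole gate fits in a function about that size. A hedged sketch — the keyword list is illustrative, and the real check in &lt;code&gt;decompose.py&lt;/code&gt; may differ:&lt;/p&gt;

```python
# Sketch of a validation gate: a hard keyword check that reassigns
# mislabeled coding subtasks. The keyword list is an illustrative assumption.
CODING_KEYWORDS = ("implement", "write file", "refactor", "fix", "patch", "build")

def validate(subtasks: list[dict]) -> list[dict]:
    for task in subtasks:
        desc = task["description"].lower()
        if task.get("tool") == "delegate" and any(k in desc for k in CODING_KEYWORDS):
            task["tool"] = "direct"
            task["verify"] = "[FIXED: was delegate]"  # the paper-trail annotation
    return subtasks

fixed = validate([{"id": "1", "description": "implement JWT auth", "tool": "delegate"}])
print(fixed[0]["tool"])  # direct
```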




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The protocol is surprisingly thin — under 400 lines total. The decision tree is a markdown file. The decomposer is a single Python script. The validation gate is a 20-line function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User gives agent a complex task
         │
         ▼
┌─────────────────────┐
│  Decision Tree      │  ← SKILL.md rules
│  Coding? → BLOCKED  │
│  Research? → ALLOW  │
│  Unknown? → SPLIT   │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Task Decomposer    │  ← decompose.py
│  Local LLM (free)   │
│  or Gemini Flash    │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Validation Gate    │  ← Hard rule check
│  No coding→delegate │
│  Fixed if violated  │
└────────┬────────────┘
         │
         ▼
    Route each subtask:
    direct → write_file / patch
    delegate → delegate_task (bounded)
    terminal → terminal()
    clarify → ask user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It runs as a Hermes skill that auto-loads when delegation triggers fire, or as a standalone Python tool. Either way, it adds about 200ms of overhead per delegation decision.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The delegation feature is a UI demo, not a production primitive.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It works in a 2-minute screen recording. In production, with real tasks and real context windows, it falls apart. The gap between demo and production is where all the work lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The right answer is usually "don't delegate."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After decomposing dozens of complex tasks, a pattern emerged: roughly 85% of subtasks should be handled directly by the main agent. Delegation is only the right call for bounded, read-only research tasks. Everything else is faster and more reliable via direct tool calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. A validation gate is worth more than a better prompt.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent time trying to engineer the perfect decomposition prompt — more examples, stricter formatting, longer system instructions. What actually worked was adding a 20-line validation function that just checks if a coding task got mislabeled and fixes it. Defensive engineering beats prompt engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/agentic-delegation" rel="noopener noreferrer"&gt;github.com/vystartasv/agentic-delegation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11+, oMLX AgenticQwen-8B (local, free), Hermes Agent skills system
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install as Hermes skill&lt;/span&gt;
git clone https://github.com/vystartasv/agentic-delegation.git &lt;span class="se"&gt;\&lt;/span&gt;
  ~/.hermes/skills/software-development/agentic-delegation

&lt;span class="c"&gt;# Or use standalone&lt;/span&gt;
git clone https://github.com/vystartasv/agentic-delegation.git
python3.11 agentic-delegation/scripts/decompose.py &lt;span class="s2"&gt;"your task here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The protocol is a direct implementation of the Agentic Flow methodology — ten patterns for working with AI agents, developed over months of running a 19-agent fleet. The delegation pattern is the one that saves the most tokens.&lt;/p&gt;

&lt;p&gt;Feedback welcome — especially from anyone else running multi-agent setups who's hit the delegation wall.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>My 19 AI Agents Kept Breaking Each Other — The 4 Tools That Fixed It</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Mon, 04 May 2026 15:07:50 +0000</pubDate>
      <link>https://dev.to/vystartasv/my-19-ai-agents-kept-breaking-each-other-the-4-tools-that-fixed-it-3559</link>
      <guid>https://dev.to/vystartasv/my-19-ai-agents-kept-breaking-each-other-the-4-tools-that-fixed-it-3559</guid>
      <description>&lt;p&gt;I run 19 AI agents on my machine. They wake up throughout the day to review code, publish content, check server health, research medical literature, and self-improve. Some run hourly. Some fire at 2am.&lt;/p&gt;

&lt;p&gt;For months they were reliable. Then I noticed the cracks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment I Realised It Was Broken
&lt;/h2&gt;

&lt;p&gt;Three things happened in the same week:&lt;/p&gt;

&lt;p&gt;One agent updated a skill file and another overwrote it 30 seconds later with stale data. The skill file was now wrong — silently corrupted — and both agents continued as if nothing happened.&lt;/p&gt;

&lt;p&gt;A cron job tried to publish a blog post to dev.to. It needed an API key from 1Password. The agent sat there waiting for a fingerprint that would never come. The job failed. Then it tried again next tick. And the next. 17 consecutive failures before I noticed.&lt;/p&gt;

&lt;p&gt;Another agent was trying to read a project repository. Its local model has a 40K token context window. Someone had dumped &lt;code&gt;node_modules&lt;/code&gt;, &lt;code&gt;.git&lt;/code&gt;, and every log file into the prompt. The model couldn't see the actual code. It guessed. The output was nonsense.&lt;/p&gt;

&lt;p&gt;None of these were model problems. None were prompt problems. Every single one was an &lt;em&gt;infrastructure problem&lt;/em&gt; — the layer between the agent and its environment was missing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built: Four Infrastructure Tools
&lt;/h2&gt;

&lt;p&gt;I spent a weekend building four single-purpose tools that handle the four categories of failures I kept seeing. Each tool is a Python package. Each does exactly one thing. Each has tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agent State DB — So They Stop Overwriting Each Other
&lt;/h3&gt;

&lt;p&gt;The problem: 19 agents, one filesystem. No coordination. When two agents modify the same file, last-write-wins, and the loser's changes evaporate silently.&lt;/p&gt;

&lt;p&gt;The fix: a SQLite database with WAL-mode concurrency that gives every agent a persistent identity, a run journal, versioned key-value state, advisory locks, and a coordination channel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;agent-state stats
&lt;span class="go"&gt;  Registered agents:  20
  Active runs:         2
  Completed runs:     47
  Failed runs:         8
  Active locks:        1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents now write to the DB before touching shared files. If they see a lock on &lt;code&gt;catalog.json&lt;/code&gt;, they wait. If they want to announce what they're working on, they call &lt;code&gt;agent-state coord working-on&lt;/code&gt;. Other agents can check before starting conflicting work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11, SQLite WAL, Click CLI. 8 tests. MIT.&lt;/p&gt;
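&lt;p&gt;The advisory-lock trick is simple enough to sketch. This is a minimal illustration of the idea — the table and column names are assumptions, not the actual agent-state-db schema:&lt;/p&gt;

```python
# Minimal sketch of advisory locks on top of SQLite in WAL mode.
# Table and column names are assumptions, not agent-state-db's real schema.
import os, sqlite3, tempfile, time

db_path = os.path.join(tempfile.mkdtemp(), "agents.db")
conn = sqlite3.connect(db_path, timeout=5.0)
conn.execute("PRAGMA journal_mode=WAL")   # concurrent readers, non-blocking writes
conn.execute("""CREATE TABLE IF NOT EXISTS locks (
    resource TEXT PRIMARY KEY, agent TEXT, acquired_at REAL)""")

def try_lock(resource: str, agent: str) -> bool:
    """Insert a lock row atomically; the PRIMARY KEY makes double-locking fail."""
    try:
        with conn:  # wraps the INSERT in a transaction
            conn.execute("INSERT INTO locks VALUES (?, ?, ?)",
                         (resource, agent, time.time()))
        return True
    except sqlite3.IntegrityError:
        return False  # another agent holds the lock: wait or back off

print(try_lock("catalog.json", "agent-a"))  # True
print(try_lock("catalog.json", "agent-b"))  # False -- agent-b must wait
```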

&lt;h3&gt;
  
  
  2. Credential Proxy — So They Can Get Passwords Without Fingers
&lt;/h3&gt;

&lt;p&gt;The problem: password managers need a fingerprint, a master password, or a hardware key tap. Cron jobs have none of those. Any agent that needs an API key is dead on arrival.&lt;/p&gt;

&lt;p&gt;The fix: a local daemon that decrypts your credentials once at boot and serves them over a Unix socket. Agents call &lt;code&gt;get_credential("github.com")&lt;/code&gt;. No Touch ID. No popups.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;credential-proxy status
&lt;span class="go"&gt;  Daemon:    running (pid 85985)
  Socket:    ~/.hermes/credential_proxy/proxy.sock
  Credentials: 353 loaded
  Chrome import: auto-deleted after import
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is Fernet-encrypted at rest. The socket, the database, and the master key are all &lt;code&gt;chmod 600&lt;/code&gt;. Nothing touches the network. It's a locked box in your house, not a cloud service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11, Fernet (AES-128-CBC + HMAC-SHA256), Unix domain sockets, launchd. 24 tests. MIT.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Context Packer — So Local Models Can See What Matters
&lt;/h3&gt;

&lt;p&gt;The problem: local models have small context windows (40K tokens max for Q4 quants). Dumping a whole repo — &lt;code&gt;node_modules&lt;/code&gt;, build artifacts, 42MB of logs — wastes 90% of the window on noise.&lt;/p&gt;

&lt;p&gt;The fix: a deterministic pre-cron script that takes a repo path and outputs a compact markdown blob of only the high-signal files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3.11 context_packer.py ~/Agent-Projects/agent-foundry
&lt;span class="go"&gt;  2,521 files scanned
  8 high-signal files packed
  12,847 characters (safe within budget)
  Priority: README.md, pyproject.toml, src/main.py, tests/
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reads &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;ARCHITECTURE.md&lt;/code&gt;, &lt;code&gt;README.md&lt;/code&gt;, prioritizes recently modified files, excludes &lt;code&gt;.git&lt;/code&gt;, &lt;code&gt;node_modules&lt;/code&gt;, &lt;code&gt;__pycache__&lt;/code&gt;, and &lt;code&gt;venv&lt;/code&gt;, and outputs a token-budgeted markdown document. Drop it as a pre-cron script and your local model suddenly sees the code it's supposed to work on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11, stat-based file scoring. MIT.&lt;/p&gt;
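&lt;p&gt;A stat-based scorer like this fits in a page. The exclusion list matches the one above; the scoring weights and character budget are assumptions:&lt;/p&gt;

```python
# Illustrative sketch of stat-based file scoring with a character budget.
# Exclusion list follows the article; weights and budget are assumptions.
import os, tempfile, time
from pathlib import Path

EXCLUDE = {".git", "node_modules", "__pycache__", "venv"}
PRIORITY = {"AGENTS.md": 3, "ARCHITECTURE.md": 3, "README.md": 2}

def pack(repo: str, budget: int = 30_000) -> str:
    candidates = []
    for root, dirs, files in os.walk(repo):
        dirs[:] = [d for d in dirs if d not in EXCLUDE]  # prune noise in place
        for name in files:
            path = os.path.join(root, name)
            age_days = (time.time() - os.stat(path).st_mtime) / 86_400
            score = PRIORITY.get(name, 0) - age_days  # named + recent files win
            candidates.append((score, path))
    packed, used = [], 0
    for _, path in sorted(candidates, reverse=True):
        text = Path(path).read_text(errors="ignore")
        if used + len(text) > budget:
            continue  # skip files that would blow the budget
        packed.append(f"## {os.path.relpath(path, repo)}\n{text}")
        used += len(text)
    return "\n\n".join(packed)

demo = tempfile.mkdtemp()
Path(demo, "README.md").write_text("# Example project")
print(pack(demo))  # a compact markdown blob with one section per packed file
```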

&lt;h3&gt;
  
  
  4. Cron Guard — So Failures Don't Cascade
&lt;/h3&gt;

&lt;p&gt;The problem: a broken cron job fails every tick. If it runs hourly, that's up to 24 failures a day before you notice. Multiply by 19 jobs and one bad configuration means hundreds of silent failures.&lt;/p&gt;

&lt;p&gt;The fix: a pre-cron script that checks the last 3 runs of every job via the Agent State DB. Three consecutive failures → auto-pause + alert.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3.11 cron_guard.py
&lt;span class="go"&gt;  Checked: 20 jobs
  Healthy: 19
  Blocked: 1 (k6a-weekly — 3 consecutive failures)
  Pause instructions written to /tmp/cron_guard_blocked.json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent that was failing 17 times in a row now stops itself after 3. I get an alert. I fix the root cause. It resumes. No more failure cascades.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11, Agent State DB integration. MIT.&lt;/p&gt;
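&lt;p&gt;The pause rule itself is tiny. A sketch, assuming a run journal that yields per-job statuses — the real tool reads these from the Agent State DB:&lt;/p&gt;

```python
# Sketch of the auto-pause rule: 3 consecutive failures -> block the job.
# The run-journal shape (a list of status strings) is an assumption.
def should_block(recent_runs: list[str], threshold: int = 3) -> bool:
    last = recent_runs[-threshold:]
    return len(last) == threshold and all(r == "failed" for r in last)

print(should_block(["ok", "failed", "failed", "failed"]))  # True  -> pause + alert
print(should_block(["failed", "failed", "ok"]))            # False -> healthy again
```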




&lt;h2&gt;
  
  
  How They Work Together
&lt;/h2&gt;

&lt;p&gt;The four tools are independent but designed to chain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cron Guard&lt;/strong&gt; runs first — checks if the job should even proceed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent State DB&lt;/strong&gt; registers the run — the agent gets an identity and a run ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Packer&lt;/strong&gt; builds the prompt context — the model sees what matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Proxy&lt;/strong&gt; serves API keys on demand — the agent authenticates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All four are pre-cron scripts. They run before the model prompt is even sent. They're deterministic Python, not LLM calls. That's intentional — infrastructure should be boring and reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Agent failures are rarely model failures.&lt;/strong&gt; Every failure I debugged traced back to the environment: missing credentials, corrupted files, context overflow, no coordination. The models were fine. The scaffolding was missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Shared state is the difference between a collection of scripts and a fleet.&lt;/strong&gt; Before the Agent State DB, my 19 agents were 19 independent processes that happened to run on the same machine. After, they're a system. They know about each other. They coordinate. They journal their own history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Infrastructure should be boring.&lt;/strong&gt; None of these tools use AI. They're deterministic Python scripts. They run in milliseconds. They have tests. The more AI you put in your AI infrastructure, the more ways it can fail. Let the models be models. Let the plumbing be plumbing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent State DB:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/agent-state-db" rel="noopener noreferrer"&gt;github.com/vystartasv/agent-state-db&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential Proxy:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/credential-proxy" rel="noopener noreferrer"&gt;github.com/vystartasv/credential-proxy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Packer:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/agent-state-db" rel="noopener noreferrer"&gt;github.com/vystartasv/agent-state-db&lt;/a&gt; (bundled in &lt;code&gt;scripts/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron Guard:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/agent-state-db" rel="noopener noreferrer"&gt;github.com/vystartasv/agent-state-db&lt;/a&gt; (bundled in &lt;code&gt;scripts/&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All MIT licensed. Python 3.11. Install with &lt;code&gt;pip install -e .&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're running multiple agents and hitting the same walls, I'd love to hear what you're building. Feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Managing 150+ AI Agent Skills at Scale — What Broke, What I Built</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Mon, 04 May 2026 12:16:27 +0000</pubDate>
      <link>https://dev.to/vystartasv/managing-150-ai-agent-skills-at-scale-what-broke-what-i-built-1e73</link>
      <guid>https://dev.to/vystartasv/managing-150-ai-agent-skills-at-scale-what-broke-what-i-built-1e73</guid>
      <description>&lt;p&gt;&lt;em&gt;By Vilius Vystartas | May 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I run a lot of AI agents. Not chatbots — autonomous agents. Cron jobs that monitor my infrastructure every hour. Self-improvers that analyze past sessions and encode learnings. Delegated coders that build features while I sleep. Together they load from a library of 153 reusable skills — structured procedures that tell an agent how to do something specific, from sending iMessages to debugging SPFx builds.&lt;/p&gt;

&lt;p&gt;The system worked fine when I had 20 skills and one agent. It started breaking when the numbers climbed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem That Forced My Hand
&lt;/h2&gt;

&lt;p&gt;Here's the setup: each skill lives as a &lt;code&gt;SKILL.md&lt;/code&gt; file in &lt;code&gt;~/.hermes/skills/&lt;/code&gt;. When an agent loads a skill and discovers it's broken, missing steps, or out of date, it records the problem in a shared &lt;code&gt;skill_gaps.jsonl&lt;/code&gt; file. Later, I review the gaps and fix the skills.&lt;/p&gt;

&lt;p&gt;This is fine when one agent writes to the file at a time.&lt;/p&gt;

&lt;p&gt;It stops being fine when three autonomous agents — say, a 2am cron job, a self-improvement loop, and a code review agent — all try to write to the same JSONL file within the same second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent writes collide. Lines get truncated. Data vanishes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lost track of which skills needed fixing. Agents kept loading broken skills silently because the gap reporting was unreliable. Worse, I had no search — finding "that one skill about PyPI releases" meant grepping a directory tree and hoping the frontmatter was consistent.&lt;/p&gt;

&lt;p&gt;The flat-file approach doesn't scale past a few dozen skills. I had 153.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built: Skill Forge
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skill Forge&lt;/strong&gt; is a SQLite-backed skill registry with quality gates, full-text search, and concurrent-safe writes. It replaces the broken JSONL pipeline with atomic transactions. It doesn't move your skills — it indexes them in place.&lt;/p&gt;

&lt;p&gt;Think of it as &lt;code&gt;pip&lt;/code&gt; for agent skills, but local-first, with validation before installation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;forge status
&lt;span class="go"&gt;
Skill Forge Registry Status
===========================
  Database: ~/.hermes/skill-forge/forge.db
  Total skills: 153

  By category:
    mlops: 12     devops: 8     creative: 15
    career: 3     research: 7   (uncategorized): 108

  Quality checks run: 306
  Skills with failures: 0 ✓
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why SQLite?
&lt;/h3&gt;

&lt;p&gt;Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WAL mode&lt;/strong&gt; — multiple agents can read and write simultaneously without locking each other out. Each agent gets its own connection with foreign-key enforcement. When two agents register different skills at the same time, both succeed. Atomic transactions, no corrupted state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FTS5&lt;/strong&gt; — full-text search over name, category, description, and body content. Finding "that skill about PyPI release classifiers" is &lt;code&gt;forge search "pypi classifier"&lt;/code&gt; — instant, ranked results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single file&lt;/strong&gt; — &lt;code&gt;forge.db&lt;/code&gt; in &lt;code&gt;~/.hermes/skill-forge/&lt;/code&gt;. No server process. No configuration. Backs up with &lt;code&gt;forge export&lt;/code&gt;. Portable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
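&lt;p&gt;A minimal FTS5 illustration of the search in point 2 — the columns mirror the fields named above, but this is not Skill Forge's actual table layout:&lt;/p&gt;

```python
# Minimal FTS5 sketch of ranked full-text search over skill metadata.
# The virtual-table schema is illustrative, not Skill Forge's real layout.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE VIRTUAL TABLE skills USING fts5(
    name, category, description, body)""")
conn.execute("INSERT INTO skills VALUES (?,?,?,?)",
             ("pypi-release", "devops", "Publish packages to PyPI",
              "Check trove classifiers before upload"))
conn.execute("INSERT INTO skills VALUES (?,?,?,?)",
             ("imessage-send", "automation", "Send iMessages", "Uses AppleScript"))

# Implicit AND of terms; the trailing * makes "classifier" a prefix query,
# so it also matches "classifiers" in the body.
rows = conn.execute(
    "SELECT name FROM skills WHERE skills MATCH ? ORDER BY rank",
    ("pypi classifier*",)).fetchall()
print(rows)  # [('pypi-release',)]
```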

&lt;h3&gt;
  
  
  Quality Gates That Catch Real Problems
&lt;/h3&gt;

&lt;p&gt;Before Skill Forge, broken skills went undetected until an agent loaded them mid-task and hit a wall. Now every skill runs through two validation passes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontmatter validator&lt;/strong&gt; — catches missing YAML, absent required fields (name/description/version), and invalid semver strings. A skill with &lt;code&gt;version: "latest"&lt;/code&gt; gets flagged. One with &lt;code&gt;version: "1.2.3"&lt;/code&gt; passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure validator&lt;/strong&gt; — checks for required sections: a description block, trigger conditions, and usage steps. A skill that's just a title and a broken shell command fails. One with proper &lt;code&gt;## Trigger&lt;/code&gt;, &lt;code&gt;## Steps&lt;/code&gt;, and &lt;code&gt;## Pitfalls&lt;/code&gt; sections passes.&lt;/p&gt;
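&lt;p&gt;The frontmatter pass can be sketched in a dozen lines — the field names follow the description above; the real validator covers more cases:&lt;/p&gt;

```python
# Sketch of the frontmatter checks described: required fields + strict semver.
# Field names follow the article; the real validator handles more edge cases.
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")
REQUIRED = ("name", "description", "version")

def check_frontmatter(meta: dict) -> list[str]:
    problems = [f"missing field: {f}" for f in REQUIRED if f not in meta]
    version = meta.get("version", "")
    if version and not SEMVER.match(str(version)):
        problems.append(f"invalid semver: {version!r}")
    return problems

print(check_frontmatter({"name": "x", "description": "y", "version": "1.2.3"}))  # []
print(check_frontmatter({"name": "x", "version": "latest"}))
# ['missing field: description', "invalid semver: 'latest'"]
```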

&lt;p&gt;The first run on my 153 skills: 102 passed, 51 flagged. The flagged ones weren't bugs — they were real quality issues I'd been ignoring. Skills missing version numbers. Skills with no trigger conditions. Skills where the "Steps" section was one garbled paragraph.&lt;/p&gt;

&lt;p&gt;I fixed 38 of them that afternoon. The other 13 are low-priority and tagged for later.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLI Commands That Match the Workflow
&lt;/h3&gt;

&lt;p&gt;Ten commands, each solving a specific pain point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;forge import-hermes              &lt;span class="c"&gt;# First run: scan ~/.hermes/skills/, register everything&lt;/span&gt;
forge register &amp;lt;path&amp;gt;            &lt;span class="c"&gt;# Add a single skill&lt;/span&gt;
forge validate &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;n&amp;gt;]      &lt;span class="c"&gt;# Run quality gates on all or one skill&lt;/span&gt;
forge search &amp;lt;query&amp;gt;             &lt;span class="c"&gt;# FTS5 over name + description + body&lt;/span&gt;
forge list &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--category&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;cat&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;]&lt;/span&gt;    &lt;span class="c"&gt;# Filtered listing&lt;/span&gt;
forge status                     &lt;span class="c"&gt;# Health overview&lt;/span&gt;
forge inspect &amp;lt;name&amp;gt;             &lt;span class="c"&gt;# Full detail + quality check history&lt;/span&gt;
forge prune                      &lt;span class="c"&gt;# Remove stale entries (skill file deleted from disk)&lt;/span&gt;
forge &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;-o&lt;/span&gt; &amp;lt;file&amp;gt;]         &lt;span class="c"&gt;# JSON dump for backups or analysis&lt;/span&gt;
forge watch &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--once&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--interval&lt;/span&gt; &amp;lt;s&amp;gt;]  &lt;span class="c"&gt;# Auto-reimport on changes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;watch&lt;/code&gt; command is the cron workhorse. Drop this in a 30-minute cron job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;forge watch &lt;span class="nt"&gt;--once&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It scans the skills directory, detects new/modified files (content hash, not timestamp), registers new ones, re-registers changed ones (version bump), and marks deleted skills as stale. One pass, everything synced.&lt;/p&gt;
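&lt;p&gt;In crontab syntax, the 30-minute schedule looks like this (assuming &lt;code&gt;forge&lt;/code&gt; is on cron's &lt;code&gt;PATH&lt;/code&gt;; the log path is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*/30 * * * * forge watch --once &amp;gt;&amp;gt; /tmp/forge-watch.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;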

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;The stack is deliberately minimal — Python 3.11, Click for the CLI, SQLite for storage, PyYAML for frontmatter parsing. No web framework, no message queue, no cloud dependency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI (forge)                        ← Click entry point
  ├── registry (SQLite + WAL)      ← skill index + metadata
  ├── importer                     ← scan ~/.hermes/skills/ → register
  ├── validator                    ← frontmatter + structure checks
  └── FTS5 index                   ← full-text search

Storage:  ~/.hermes/skill-forge/forge.db  (single file)
Skills:   ~/.hermes/skills/                (unchanged — indexed in place)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Skills stay as flat &lt;code&gt;SKILL.md&lt;/code&gt; files. Forge indexes them, validates them, searches them, and tracks their history — but it never moves or modifies them. Your existing automation continues working. Forge adds a layer on top.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tests and Quality
&lt;/h3&gt;

&lt;p&gt;89 tests. Full suite runs in 0.26 seconds. Covers registry CRUD, importer (Hermes scanner + content-change detection), validators (frontmatter + structure, edge cases like empty files and missing YAML delimiters), CLI integration (prune, export, watch), and concurrent-write scenarios.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SQLite with WAL mode solves the concurrent-agent problem cleanly.&lt;/strong&gt; You don't need Postgres or Redis for this. Connection-level pragmas (&lt;code&gt;PRAGMA journal_mode=WAL&lt;/code&gt;, &lt;code&gt;PRAGMA foreign_keys=ON&lt;/code&gt;) and atomic transactions are enough when your write volume is hundreds per hour, not thousands per second.&lt;/p&gt;
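&lt;p&gt;A minimal sketch of that connection setup in Python — the helper name is mine, not Forge's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import sqlite3
import tempfile

def connect(path):
    # timeout gives concurrent writers a grace period instead of failing immediately
    con = sqlite3.connect(path, timeout=5.0)
    # WAL lets readers proceed while a writer holds the lock
    con.execute("PRAGMA journal_mode=WAL")
    # foreign_keys is off by default in SQLite; enable it per connection
    con.execute("PRAGMA foreign_keys=ON")
    return con

path = os.path.join(tempfile.mkdtemp(), "forge.db")
con = connect(path)
mode = con.execute("PRAGMA journal_mode").fetchone()[0]  # 'wal'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that WAL only applies to file-backed databases; an in-memory database silently keeps its own journal mode, which is one reason to test against a real file.&lt;/p&gt;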

&lt;p&gt;&lt;strong&gt;Quality gates catch real problems, not theoretical ones.&lt;/strong&gt; 51 of my 153 skills had issues I didn't know about — missing versions, malformed frontmatter, empty sections. Agents were loading these skills silently. The validator turned invisible problems into visible ones.&lt;/p&gt;
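&lt;p&gt;A frontmatter gate of the kind described can be sketched in a few lines. The checks below mirror the failure modes listed above; the exact field names beyond &lt;code&gt;version&lt;/code&gt; are my guess at what a &lt;code&gt;SKILL.md&lt;/code&gt; carries, not Forge's real rule set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import yaml

def validate_frontmatter(text):
    """Return a list of problems found in a SKILL.md's YAML frontmatter."""
    problems = []
    if not text.startswith("---"):
        return ["no opening frontmatter delimiter"]
    parts = text.split("---", 2)
    if len(parts) != 3:
        return ["no closing frontmatter delimiter"]
    meta = yaml.safe_load(parts[1]) or {}
    if not isinstance(meta, dict):
        return ["frontmatter is not a YAML mapping"]
    if "version" not in meta:
        problems.append("missing version")
    if not parts[2].strip():
        problems.append("empty body")
    return problems

ok = validate_frontmatter("---\nname: deploy\nversion: 1\n---\nSteps here\n")
# ok == []  (a well-formed skill passes cleanly)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;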

&lt;p&gt;&lt;strong&gt;Content-aware sync matters.&lt;/strong&gt; My first import skipped files that already existed in the registry by path. This meant I missed skills that had been modified but not renamed. Switching to content-hash comparison caught 12 modified skills on the next import.&lt;/p&gt;
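&lt;p&gt;The fix amounts to comparing a digest of each file's bytes against the last recorded one, rather than checking only whether the path is known. Function and field names here are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib

def content_hash(data):
    # data: the raw bytes of a SKILL.md file
    return hashlib.sha256(data).hexdigest()

def needs_reimport(data, recorded_hash):
    # Re-register when the content changed, even if the path did not
    return content_hash(data) != recorded_hash

old = b"---\nversion: 1\n---\nOld steps\n"
new = b"---\nversion: 1\n---\nUpdated steps\n"
# Path-based dedup would skip both; hash comparison flags the second
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;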




&lt;h2&gt;
  
  
  Get It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/vystartasv/skill-forge" rel="noopener noreferrer"&gt;github.com/vystartasv/skill-forge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; Python 3.11+, Click, SQLite + FTS5, PyYAML
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/vystartasv/skill-forge
&lt;span class="nb"&gt;cd &lt;/span&gt;skill-forge
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
forge import-hermes
forge status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're running autonomous AI agents with persistent skill libraries — or if you're building agent infrastructure and wondering how to manage the growing pile of procedures — I'd love feedback on the schema design and quality gate approach.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>sqlite</category>
    </item>
    <item>
      <title>Installing AWS Elastic Beanstalk cli on OpenSuse</title>
      <dc:creator>Vilius</dc:creator>
      <pubDate>Tue, 04 Jun 2019 20:04:15 +0000</pubDate>
      <link>https://dev.to/vystartasv/installing-aws-elastic-beanstalk-cli-on-opensuse-358e</link>
      <guid>https://dev.to/vystartasv/installing-aws-elastic-beanstalk-cli-on-opensuse-358e</guid>
      <description>&lt;p&gt;How to successfully install EB cli on OpenSuse you do need to install a few dev build libraries &lt;strong&gt;before&lt;/strong&gt; for make to succeed the build.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sudo zypper in gcc zlib-devel libffi-devel libopenssl-devel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This should save openSUSE users a lot of trouble.&lt;/p&gt;

</description>
      <category>opensuse</category>
      <category>eb</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
