DEV Community: Vilius

My AI Agents Kept Burning Tokens on Subagents That Can't Code — So I Built a Decision Gate

Vilius — Mon, 04 May 2026 17:12:31 +0000

By Vilius Vystartas | May 2026

I run 19 autonomous AI agents in production. They handle research, content, monitoring, deployment — the kind of always-on work that makes a solo developer's output look like a small team's.

The delegation feature was supposed to be the multiplier. Spawn a subagent, give it a task, get results in parallel. In theory, it turns one agent into many. In practice, it was burning thousands of tokens for exactly zero output.

The problem wasn't the agents. It was that nobody had taught them when not to delegate.

The Problem That Forced My Hand

Here's what happens when you ask a subagent to code something:

1. The subagent spawns, reads the context, and starts working. It looks promising.
2. It tries to write a file. The file operation fails silently, and the subagent doesn't notice.
3. It tries again with a different approach. Same silent failure.
4. Six hundred seconds later: timeout. Zero output. Thousands of tokens gone.

The core issue is structural: subagents can't reliably write files, can't run builds, can't verify their own output. They're built for read-only work — research, analysis, data gathering. But nothing in the agent's training tells it that. It just sees "task → delegate" and fires.

I watched this happen dozens of times. Every failure was another chunk of the context window gone, another session wasted, another moment of wondering whether multi-agent workflows were fundamentally broken.

They weren't. The delegation call just needed a bouncer at the door.

What I Built: Agentic Delegation

Agentic Delegation is a decision protocol that sits between your agent and its delegation tool. It has three layers:

1. The Decision Tree

Before any delegate_task call, the protocol classifies the work:

CODING → BLOCKED. Routed to write_file/patch/terminal (10x faster, 100% reliable)
RESEARCH → ALLOWED. But verified after completion, max 2 retries
UNKNOWN → DECOMPOSED. Broken into atomic subtasks first, then routed individually

This is a hard rule, not a suggestion. The skill document literally says "NEVER VIOLATE" at the top of the coding section. If your agent ignores it and delegates coding anyway, there's a self-correction protocol that kicks in after the inevitable timeout.
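As a rough illustration of the three branches, here is a minimal Python sketch. The keyword lists, the Route enum, and the function names are assumptions made for this example; the real rules live in the skill document, not in code like this. A task that mixes coding and research words is treated as UNKNOWN so it gets decomposed first.

```python
import re
from enum import Enum


class Route(Enum):
    DIRECT = "direct"        # coding: blocked from delegation, handled with direct tools
    DELEGATE = "delegate"    # research: allowed, but verified afterwards
    DECOMPOSE = "decompose"  # unknown or mixed: split into atomic subtasks first


# Hypothetical keyword lists, chosen only for this sketch.
CODING_WORDS = {"implement", "write", "fix", "refactor", "build", "patch"}
RESEARCH_WORDS = {"research", "find", "compare", "survey", "investigate"}

MAX_RETRIES = 2  # "max 2 retries" from the rule above


def classify(task: str) -> Route:
    """Pick a route for a task description using simple keyword matching."""
    words = set(re.findall(r"[a-z]+", task.lower()))
    is_coding = bool(words & CODING_WORDS)
    is_research = bool(words & RESEARCH_WORDS)
    if is_coding and is_research:
        return Route.DECOMPOSE
    if is_coding:
        return Route.DIRECT
    if is_research:
        return Route.DELEGATE
    return Route.DECOMPOSE


def delegate_with_verification(task, run, verify):
    """Run a delegated task: one attempt plus at most MAX_RETRIES retries."""
    for _ in range(1 + MAX_RETRIES):
        result = run(task)
        if verify(result):
            return result
    return None  # give up instead of burning more tokens
```

Here `run` and `verify` stand in for whatever delegation call and output check your agent framework provides.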

2. The Task Decomposer

Complex tasks get broken into atomic subtasks by a lightweight classifier — either your local LLM (free) or Gemini Flash (cheap cloud fallback). No dependencies beyond Python's stdlib.

$ python3.11 scripts/decompose.py \
  "Research GRPO training papers, write a summary, and add it to README"

[
  {"id": "1", "description": "Research GRPO training papers",  "tool": "delegate"},
  {"id": "2", "description": "Write a summary of the findings", "tool": "direct"},
  {"id": "3", "description": "Update the project README",        "tool": "direct"}
]

Three subtasks. One delegated (the research). Two handled directly (the writing). No subagent ever touches a file.

3. The Validation Gate

Models hallucinate. Sometimes the decomposer labels a coding task as "delegate." The validation gate catches this with a hard keyword check and reassigns it:

$ echo '[{"id":"1","description":"implement JWT auth","tool":"delegate"}]' \
  | python3.11 scripts/decompose.py --validate-only

[{"id": "1", "description": "implement JWT auth", "tool": "direct",
  "verify": "[FIXED: was delegate]"}]

The annotation is deliberate. It leaves a paper trail so you can see what the model wanted to do vs what the gate enforced.
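A gate like this can be very small. The sketch below is not the code from decompose.py; it assumes a hypothetical keyword pattern and reproduces only the behaviour shown above: a subtask labelled "delegate" whose description looks like coding is reassigned to "direct" and annotated.

```python
import re

# Hypothetical keyword pattern; the source only says the check is a hard keyword match.
CODING_PATTERN = re.compile(
    r"\b(implement|code|refactor|fix|patch|build|compile)\b", re.IGNORECASE
)


def validate(subtasks):
    """Return a copy of the subtasks with mislabelled coding work forced to 'direct'."""
    checked = []
    for task in subtasks:
        task = dict(task)  # copy so the caller's input is left untouched
        looks_like_coding = CODING_PATTERN.search(task.get("description", ""))
        if task.get("tool") == "delegate" and looks_like_coding:
            # Leave a paper trail of what the model wanted to do.
            task["verify"] = f"[FIXED: was {task['tool']}]"
            task["tool"] = "direct"
        checked.append(task)
    return checked
```

Because the gate is deterministic, it behaves the same way no matter which model produced the labels.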

Architecture

The protocol is surprisingly thin — under 400 lines total. The decision tree is a markdown file. The decomposer is a single Python script. The validation gate is a 20-line function.

User gives agent a complex task
         │
         ▼
┌─────────────────────┐
│  Decision Tree      │  ← SKILL.md rules
│  Coding? → BLOCKED  │
│  Research? → ALLOW  │
│  Unknown? → SPLIT   │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Task Decomposer    │  ← decompose.py
│  Local LLM (free)   │
│  or Gemini Flash    │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Validation Gate    │  ← Hard rule check
│  No coding→delegate │
│  Fixed if violated  │
└────────┬────────────┘
         │
         ▼
    Route each subtask:
    direct → write_file / patch
    delegate → delegate_task (bounded)
    terminal → terminal()
    clarify → ask user

It runs as a Hermes skill that auto-loads when delegation triggers fire, or as a standalone Python tool. Either way, it adds about 200ms of overhead per delegation decision.
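The final routing step in the diagram amounts to a dispatch on each subtask's "tool" label. A minimal sketch, assuming the handlers are plain callables supplied by the caller (the handler names and the fallback to "clarify" for unknown labels are assumptions of this example, not documented behaviour):

```python
def route_subtasks(subtasks, handlers):
    """Send each subtask to the handler registered for its 'tool' label.

    handlers is a dict such as {"direct": ..., "delegate": ..., "terminal": ...,
    "clarify": ...}; the callables themselves are hypothetical stand-ins for
    whatever tools your agent framework exposes.
    """
    results = []
    for task in subtasks:
        handler = handlers.get(task["tool"])
        if handler is None:
            # Assumption: an unrecognised label is safer to ask about than to guess.
            handler = handlers["clarify"]
        results.append((task["id"], handler(task["description"])))
    return results
```

For example, passing the three subtasks from the decomposer output above would call the "delegate" handler once and the "direct" handler twice.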

What I Learned

1. The delegation feature is a UI demo, not a production primitive.

It works in a 2-minute screen recording. In production, with real tasks and real context windows, it falls apart. The gap between demo and production is where all the work lives.

2. The right answer is usually "don't delegate."

After decomposing dozens of complex tasks, a pattern emerged: roughly 85% of subtasks should be handled directly by the main agent. Delegation is only the right call for bounded, read-only research tasks. Everything else is faster and more reliable via direct tool calls.

3. A validation gate is worth more than a better prompt.

I spent time trying to engineer the perfect decomposition prompt — more examples, stricter formatting, longer system instructions. What actually worked was adding a 20-line validation function that just checks if a coding task got mislabeled and fixes it. Defensive engineering beats prompt engineering.

Get It

Repo: github.com/vystartasv/agentic-delegation
License: MIT
Stack: Python 3.11+, oMLX AgenticQwen-8B (local, free), Hermes Agent skills system

# Install as Hermes skill
git clone https://github.com/vystartasv/agentic-delegation.git \
  ~/.hermes/skills/software-development/agentic-delegation

# Or use standalone
git clone https://github.com/vystartasv/agentic-delegation.git
python3.11 agentic-delegation/scripts/decompose.py "your task here"

The protocol is a direct implementation of the Agentic Flow methodology — ten patterns for working with AI agents, developed over months of running a 19-agent fleet. The delegation pattern is the one that saves the most tokens.

Feedback welcome — especially from anyone else running multi-agent setups who's hit the delegation wall.

My 19 AI Agents Kept Breaking Each Other — The 4 Tools That Fixed It

Vilius — Mon, 04 May 2026 15:07:50 +0000

I run 19 AI agents on my machine. They wake up throughout the day to review code, publish content, check server health, research medical literature, and self-improve. Some run hourly. Some fire at 2am.

For months they were reliable. Then I noticed the cracks.

The Moment I Realised It Was Broken

Three things happened in the same week:

One agent updated a skill file and another overwrote it 30 seconds later with stale data. The skill file was now wrong — silently corrupted — and both agents continued as if nothing happened.

A cron job tried to publish a blog post to dev.to. It needed an API key from 1Password. The agent sat there waiting for a fingerprint that would never come. The job failed. Then it tried again next tick. And the next. 17 consecutive failures before I noticed.

Another agent was trying to read a project repository. Its local model has a 40K token context window. Someone had dumped node_modules, .git, and every log file into the prompt. The model couldn't see the actual code. It guessed. The output was nonsense.

None of these were model problems. None were prompt problems. Every single one was an infrastructure problem — the layer between the agent and its environment was missing.

What I Built: Four Infrastructure Tools

I spent a weekend building four single-purpose tools that handle the four categories of failures I kept seeing. Each tool is a Python package. Each does exactly one thing. Each has tests.

1. Agent State DB — So They Stop Overwriting Each Other

The problem: 19 agents, one filesystem. No coordination. When two agents modify the same file, last-write-wins, and the loser's changes evaporate silently.

The fix: a SQLite database with WAL-mode concurrency that gives every agent a persistent identity, a run journal, versioned key-value state, advisory locks, and a coordination channel.

$ agent-state stats
  Registered agents:  20
  Active runs:         2
  Completed runs:     47
  Failed runs:         8
  Active locks:        1

Agents now write to the DB before touching shared files. If they see a lock on catalog.json, they wait. If they want to announce what they're working on, they call agent-state coord working-on. Other agents can check before starting conflicting work.
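To make the advisory-lock idea concrete, here is a minimal sketch using Python's sqlite3 module. The table layout and function names are assumptions for illustration and are not the actual Agent State DB schema. The lock is advisory: it only works if every agent asks before touching the file.

```python
import sqlite3
import time


def connect(db_path):
    """Open a connection in WAL mode that waits up to 5 s for a competing writer."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def init(conn):
    # Hypothetical schema: one row per locked resource.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locks ("
        "resource TEXT PRIMARY KEY, owner TEXT NOT NULL, acquired_at REAL NOT NULL)"
    )
    conn.commit()


def acquire(conn, resource, owner):
    """Try to take the lock; return True only if this call inserted the row."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO locks (resource, owner, acquired_at) VALUES (?, ?, ?)",
            (resource, owner, time.time()),
        )
        return cur.rowcount == 1


def release(conn, resource, owner):
    """Release the lock, but only if this owner actually holds it."""
    with conn:
        conn.execute(
            "DELETE FROM locks WHERE resource = ? AND owner = ?", (resource, owner)
        )
```

The PRIMARY KEY on the resource name is what makes the acquire step atomic: two agents can race, but only one INSERT can succeed.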

Stack: Python 3.11, SQLite WAL, Click CLI. 8 tests. MIT.

2. Credential Proxy — So They Can Get Passwords Without Fingers

The problem: password managers need a fingerprint, a master password, or a hardware key tap. Cron jobs have none of those. Any agent that needs an API key is dead on arrival.

The fix: a local daemon that decrypts your credentials once at boot and serves them over a Unix socket. Agents call get_credential("github.com"). No Touch ID. No popups.

$ credential-proxy status
  Daemon:    running (pid 85985)
  Socket:    ~/.hermes/credential_proxy/proxy.sock
  Credentials: 353 loaded
  Chrome import: auto-deleted after import

Everything is Fernet-encrypted at rest. The socket is chmod 600. The database and master key are chmod 600. Nothing touches the network. It's a locked box in your house, not a cloud service.
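The socket side of this can be sketched with the standard library alone. The wire format below (one name per line in, one JSON object per line out) is an assumption for the example and not the real Credential Proxy protocol, and encryption at rest is left out entirely.

```python
import json
import os
import socket
import threading


def start_proxy(sock_path, credentials):
    """Serve credential lookups on a Unix socket readable only by its owner."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)
    os.chmod(sock_path, 0o600)  # same permissions the article describes
    server.listen(5)

    def serve():
        while True:
            conn, _ = server.accept()
            with conn, conn.makefile("r", encoding="utf-8") as reader:
                name = reader.readline().strip()
                reply = json.dumps({"value": credentials.get(name)}) + "\n"
                conn.sendall(reply.encode("utf-8"))

    # Daemon thread: it ends when the process that owns the secrets exits.
    threading.Thread(target=serve, daemon=True).start()
    return server


def get_credential(name, sock_path):
    """Ask the local proxy for one credential; returns None if it is unknown."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall((name + "\n").encode("utf-8"))
        with client.makefile("r", encoding="utf-8") as reader:
            line = reader.readline()
    return json.loads(line)["value"]
```

Nothing here opens a network port: a Unix domain socket is a file on disk, so ordinary file permissions decide who can connect.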

Stack: Python 3.11, Fernet (AES-128-CBC + HMAC-SHA256), Unix domain sockets, launchd. 24 tests. MIT.

3. Context Packer — So Local Models Can See What Matters

The problem: the local models I run have small context windows (40K tokens at most for the Q4 quants I use). Dumping a whole repo, including node_modules, build artifacts, and 42MB of logs, wastes 90% of the window on noise.

The fix: a deterministic pre-cron script that takes a repo path and outputs a compact markdown blob of only the high-signal files.

$ python3.11 context_packer.py ~/Agent-Projects/agent-foundry
  2,521 files scanned
  8 high-signal files packed
  12,847 characters (safe within budget)
  Priority: README.md, pyproject.toml, src/main.py, tests/

It reads AGENTS.md, ARCHITECTURE.md, README.md, prioritizes recently modified files, excludes .git, node_modules, __pycache__, and venv, and outputs a token-budgeted markdown document. Drop it as a pre-cron script and your local model suddenly sees the code it's supposed to work on.
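The selection logic can be sketched in a few lines. This is not the real context_packer.py: the scoring below (named docs first, then most recently modified) and the character budget are simplifying assumptions based on the description above.

```python
import os

EXCLUDED_DIRS = {".git", "node_modules", "__pycache__", "venv"}
PRIORITY_NAMES = ("AGENTS.md", "ARCHITECTURE.md", "README.md")


def pack_context(repo, budget_chars=12000):
    """Return a markdown blob of high-signal files that fits within budget_chars."""
    candidates = []
    for root, dirs, files in os.walk(repo):
        # Prune excluded directories in place so os.walk never descends into them.
        dirs[:] = [d for d in dirs if d not in EXCLUDED_DIRS]
        for name in files:
            path = os.path.join(root, name)
            if name in PRIORITY_NAMES:
                priority = PRIORITY_NAMES.index(name)
            else:
                priority = len(PRIORITY_NAMES)
            # Sort key: named docs first, then newest modification time first.
            candidates.append((priority, -os.stat(path).st_mtime, path))
    candidates.sort()

    parts, used = [], 0
    for _, _, path in candidates:
        try:
            with open(path, encoding="utf-8") as fh:
                text = fh.read()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        section = f"## {os.path.relpath(path, repo)}\n\n{text}\n"
        if used + len(section) > budget_chars:
            continue  # too big for the remaining budget; try smaller files
        parts.append(section)
        used += len(section)
    return "".join(parts)
```

Because the output depends only on file names, modification times, and sizes, the same repo state always produces the same prompt context.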

Stack: Python 3.11, stat-based file scoring. MIT.

4. Cron Guard — So Failures Don't Cascade

The problem: a broken cron job fails every tick. If it runs hourly, that's 24 failures a day until someone notices. Multiply by 19 jobs and one bad configuration means hundreds of silent failures.

The fix: a pre-cron script that checks the last 3 runs of every job via the Agent State DB. Three consecutive failures → auto-pause + alert.

$ python3.11 cron_guard.py
  Checked: 20 jobs
  Healthy: 19
  Blocked: 1 (k6a-weekly — 3 consecutive failures)
  Pause instructions written to /tmp/cron_guard_blocked.json

The agent that was failing 17 times in a row now stops itself after 3. I get an alert. I fix the root cause. It resumes. No more failure cascades.
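The core rule is small enough to show in full. In this sketch the run history is passed in as plain lists; the status labels "ok" and "failed" and the function names are assumptions, and the real script reads its history from the Agent State DB instead.

```python
def should_pause(statuses, threshold=3):
    """Return True if the most recent `threshold` runs all failed.

    statuses lists run outcomes oldest first, each "ok" or "failed".
    """
    recent = statuses[-threshold:]
    return len(recent) == threshold and all(s == "failed" for s in recent)


def check_jobs(history, threshold=3):
    """Return the sorted names of jobs that should be paused."""
    return sorted(
        job for job, statuses in history.items() if should_pause(statuses, threshold)
    )
```

A job with fewer than three recorded runs is never paused, and a single success resets the streak.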

Stack: Python 3.11, Agent State DB integration. MIT.

How They Work Together

The four tools are independent but designed to chain:

1. Cron Guard runs first and checks whether the job should even proceed.
2. Agent State DB registers the run, so the agent gets an identity and a run ID.
3. Context Packer builds the prompt context, so the model sees what matters.
4. Credential Proxy serves API keys on demand, so the agent can authenticate.

All four are pre-cron scripts. They run before the model prompt is even sent. They're deterministic Python, not LLM calls. That's intentional — infrastructure should be boring and reliable.

What I Learned

1. Agent failures are rarely model failures. Every failure I debugged traced back to the environment: missing credentials, corrupted files, context overflow, no coordination. The models were fine. The scaffolding was missing.

2. Shared state is the difference between a collection of scripts and a fleet. Before the Agent State DB, my 19 agents were 19 independent processes that happened to run on the same machine. After, they're a system. They know about each other. They coordinate. They journal their own history.

3. Infrastructure should be boring. None of these tools use AI. They're deterministic Python scripts. They run in milliseconds. They have tests. The more AI you put in your AI infrastructure, the more ways it can fail. Let the models be models. Let the plumbing be plumbing.

Get It

Agent State DB: github.com/vystartasv/agent-state-db
Credential Proxy: github.com/vystartasv/credential-proxy
Context Packer: github.com/vystartasv/agent-state-db (bundled in scripts/)
Cron Guard: github.com/vystartasv/agent-state-db (bundled in scripts/)

All MIT licensed. Python 3.11. Install each one by running pip install -e . from the root of its repository.

If you're running multiple agents and hitting the same walls, I'd love to hear what you're building. Feedback welcome.

Managing 150+ AI Agent Skills at Scale — What Broke, What I Built

Vilius — Mon, 04 May 2026 12:16:27 +0000

By Vilius Vystartas | May 2026

I run a lot of AI agents. Not chatbots — autonomous agents. Cron jobs that monitor my infrastructure every hour. Self-improvers that analyze past sessions and encode learnings. Delegated coders that build features while I sleep. Together they load from a library of 153 reusable skills — structured procedures that tell an agent how to do something specific, from sending iMessages to debugging SPFx builds.

The system worked fine when I had 20 skills and one agent. It started breaking when the numbers climbed.

The Problem That Forced My Hand

Here's the setup: each skill lives as a SKILL.md file in ~/.hermes/skills/. When an agent loads a skill and discovers it's broken, missing steps, or out of date, it records the problem in a shared skill_gaps.jsonl file. Later, I review the gaps and fix the skills.

This is fine when one agent writes to the file at a time.

It stops being fine when three autonomous agents — say, a 2am cron job, a self-improvement loop, and a code review agent — all try to write to the same JSONL file within the same second.

Concurrent writes collide. Lines get truncated. Data vanishes.

I lost track of which skills needed fixing. Agents kept loading broken skills silently because the gap reporting was unreliable. Worse, I had no search — finding "that one skill about PyPI releases" meant grepping a directory tree and hoping the frontmatter was consistent.

The flat-file approach doesn't scale past a few dozen skills. I had 153.

What I Built: Skill Forge

Skill Forge is a SQLite-backed skill registry with quality gates, full-text search, and concurrent-safe writes. It replaces the broken JSONL pipeline with atomic transactions. It doesn't move your skills — it indexes them in place.

Think of it as pip for agent skills, but local-first, with validation before installation.

$ forge status

Skill Forge Registry Status
===========================
  Database: ~/.hermes/skill-forge/forge.db
  Total skills: 153

  By category:
    mlops: 12     devops: 8     creative: 15
    career: 3     research: 7   (uncategorized): 108

  Quality checks run: 306
  Skills with failures: 0 ✓

Why SQLite?

Three reasons:

WAL mode: readers don't block the writer and the writer doesn't block readers, so agents can query the registry while another agent is writing. Each agent gets its own connection with foreign-key enforcement. Writes are still serialized, but each one is a short atomic transaction, so when two agents register different skills at the same time, one waits briefly and both succeed. No corrupted state.
FTS5: full-text search over name, category, description, and body content. Finding "that skill about PyPI release classifiers" is forge search "pypi classifier", which returns ranked results instantly.
Single file: forge.db in ~/.hermes/skill-forge/. No server process. No configuration. Backs up with forge export. Portable.
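For readers who haven't used FTS5, the search side looks roughly like this. The table and column names are assumptions rather than Skill Forge's actual schema, and the sketch assumes your Python's SQLite build includes the FTS5 extension (most current builds do).

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 index from (name, category, description, body) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE skills_fts USING fts5(name, category, description, body)"
    )
    conn.executemany("INSERT INTO skills_fts VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, query):
    """Return skill names matching every term in the query, best match first."""
    cursor = conn.execute(
        "SELECT name FROM skills_fts WHERE skills_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in cursor]
```

Bare words in an FTS5 query are combined with an implicit AND, so "pypi classifier" only matches skills that contain both terms somewhere in the indexed columns.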

Quality Gates That Catch Real Problems

Before Skill Forge, broken skills went undetected until an agent loaded them mid-task and hit a wall. Now every skill runs through two validation passes:

Frontmatter validator — catches missing YAML, absent required fields (name/description/version), and invalid semver strings. A skill with version: "latest" gets flagged. One with version: "1.2.3" passes.

Structure validator — checks for required sections: a description block, trigger conditions, and usage steps. A skill that's just a title and a broken shell command fails. One with proper ## Trigger, ## Steps, and ## Pitfalls sections passes.
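Both passes can be approximated with the standard library. This sketch hand-parses simple key: value frontmatter instead of using PyYAML, accepts only plain MAJOR.MINOR.PATCH versions, and treats "## Trigger" and "## Steps" as the required headings; all three are simplifying assumptions, not the real validators.

```python
import re

REQUIRED_FIELDS = ("name", "description", "version")
REQUIRED_SECTIONS = ("## Trigger", "## Steps")  # assumed heading names
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")  # simplified: no pre-release or build tags


def parse_frontmatter(text):
    """Return the frontmatter fields as a dict, or None if the block is missing."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return None  # closing delimiter never found


def validate_skill(text):
    """Return a list of problems; an empty list means the skill passes."""
    problems = []
    fields = parse_frontmatter(text)
    if fields is None:
        problems.append("missing YAML frontmatter")
    else:
        for field in REQUIRED_FIELDS:
            if not fields.get(field):
                problems.append(f"missing field: {field}")
        version = fields.get("version")
        if version and not SEMVER.match(version):
            problems.append(f"invalid semver: {version}")
    for section in REQUIRED_SECTIONS:
        if section not in text:
            problems.append(f"missing section: {section}")
    return problems
```

Returning a list of problems rather than a boolean is what makes a per-skill check history possible.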

The first run on my 153 skills: 102 passed, 51 flagged. The flags weren't validator bugs. They were real quality issues I'd been ignoring. Skills missing version numbers. Skills with no trigger conditions. Skills where the "Steps" section was one garbled paragraph.

I fixed 38 of them that afternoon. The other 13 are low-priority and tagged for later.

CLI Commands That Match the Workflow

Ten commands, each solving a specific pain point:

forge import-hermes              # First run: scan ~/.hermes/skills/, register everything
forge register <path>            # Add a single skill
forge validate [--name <n>]      # Run quality gates on all or one skill
forge search <query>             # FTS5 over name + description + body
forge list [--category <cat>]    # Filtered listing
forge status                     # Health overview
forge inspect <name>             # Full detail + quality check history
forge prune                      # Remove stale entries (skill file deleted from disk)
forge export [-o <file>]         # JSON dump for backups or analysis
forge watch [--once] [--interval <s>]  # Auto-reimport on changes

The watch command is the cron workhorse. Drop this in a 30-minute cron job:

forge watch --once

It scans the skills directory, detects new/modified files (content hash, not timestamp), registers new ones, re-registers changed ones (version bump), and marks deleted skills as stale. One pass, everything synced.
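The change detection in that pass can be sketched as a comparison of two hash maps. The function names here are illustrative, and storing the previous scan (in Forge's case, in the registry) is left to the caller.

```python
import hashlib
from pathlib import Path


def scan(skills_dir):
    """Map each SKILL.md path (relative to skills_dir) to the SHA-256 of its content."""
    root = Path(skills_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("SKILL.md")
    }


def diff(known, current):
    """Compare a previous scan with a new one.

    Returns three sorted lists: new paths, modified paths, and stale paths
    (present in the previous scan but no longer on disk).
    """
    new = sorted(current.keys() - known.keys())
    modified = sorted(
        path for path in current.keys() & known.keys() if current[path] != known[path]
    )
    stale = sorted(known.keys() - current.keys())
    return new, modified, stale
```

Hashing content rather than comparing timestamps means a file that was touched but not changed is not re-registered, and a file restored from backup with an old timestamp still is.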

Architecture

The stack is deliberately minimal — Python 3.11, Click for the CLI, SQLite for storage, PyYAML for frontmatter parsing. No web framework, no message queue, no cloud dependency.

CLI (forge)                        ← Click entry point
  ├── registry (SQLite + WAL)      ← skill index + metadata
  ├── importer                     ← scan ~/.hermes/skills/ → register
  ├── validator                    ← frontmatter + structure checks
  └── FTS5 index                   ← full-text search

Storage:  ~/.hermes/skill-forge/forge.db  (single file)
Skills:   ~/.hermes/skills/                (unchanged — indexed in place)

Skills stay as flat SKILL.md files. Forge indexes them, validates them, searches them, and tracks their history — but it never moves or modifies them. Your existing automation continues working. Forge adds a layer on top.

Tests and Quality

89 tests. Full suite runs in 0.26 seconds. Covers registry CRUD, importer (Hermes scanner + content-change detection), validators (frontmatter + structure, edge cases like empty files and missing YAML delimiters), CLI integration (prune, export, watch), and concurrent-write scenarios.

What I Learned

SQLite with WAL mode solves the concurrent-agent problem cleanly. You don't need Postgres or Redis for this. Two pragmas (PRAGMA journal_mode=WAL, which persists in the database file once set, and PRAGMA foreign_keys=ON, which has to be set on every connection) plus atomic transactions are enough when your write volume is hundreds per hour, not thousands per second.
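A minimal sketch of that setup, with two threads standing in for two agents. The table, file name, and helper names are assumptions for the example; the busy timeout passed to sqlite3.connect is what lets the second writer wait for the first instead of failing.

```python
import os
import sqlite3
import tempfile
import threading


def open_connection(db_path):
    """Open one connection per agent with the pragmas applied."""
    conn = sqlite3.connect(db_path, timeout=5.0)  # wait up to 5 s for a competing writer
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


def register_skill(db_path, name):
    """Insert one skill inside its own short transaction."""
    conn = open_connection(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO skills (name) VALUES (?)", (name,))
    finally:
        conn.close()


def demo():
    """Register two skills from two threads; return (row count, journal mode)."""
    db_path = os.path.join(tempfile.mkdtemp(), "forge.db")
    setup = open_connection(db_path)
    setup.execute("CREATE TABLE skills (name TEXT PRIMARY KEY)")
    setup.commit()
    setup.close()

    threads = [
        threading.Thread(target=register_skill, args=(db_path, name))
        for name in ("skill-a", "skill-b")
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    check = open_connection(db_path)
    count = check.execute("SELECT COUNT(*) FROM skills").fetchone()[0]
    mode = check.execute("PRAGMA journal_mode").fetchone()[0]
    check.close()
    return count, mode
```

Each writer opens its own connection, which mirrors how separate agent processes would behave.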

Quality gates catch real problems, not theoretical ones. 51 of my 153 skills had issues I didn't know about — missing versions, malformed frontmatter, empty sections. Agents were loading these skills silently. The validator turned invisible problems into visible ones.

Content-aware sync matters. My first import skipped files that already existed in the registry by path. This meant I missed skills that had been modified but not renamed. Switching to content-hash comparison caught 12 modified skills on the next import.

Get It

Repo: github.com/vystartasv/skill-forge
License: MIT
Stack: Python 3.11+, Click, SQLite + FTS5, PyYAML

git clone https://github.com/vystartasv/skill-forge
cd skill-forge
pip install -e ".[dev]"
forge import-hermes
forge status

If you're running autonomous AI agents with persistent skill libraries — or if you're building agent infrastructure and wondering how to manage the growing pile of procedures — I'd love feedback on the schema design and quality gate approach.

Installing the AWS Elastic Beanstalk CLI on openSUSE

Vilius — Tue, 04 Jun 2019 20:04:15 +0000

To install the EB CLI on openSUSE successfully, you first need to install a few development libraries so that the build step (make) succeeds.

$ sudo zypper in gcc zlib-devel libffi-devel libopenssl-devel

This should save openSUSE users a lot of trouble.