DEV Community

Neilos

Posted on • Edited on

We Replaced Every Tool Claude Code Ships With

The Problem: Claude Code's Tools Don't Scale

Claude Code ships with a reasonable set of built-in tools: Bash, Read, Write, Edit, Glob, Grep, WebFetch, Task, Plan. For a single agent working on a single task, they're fine.

But once you're running a multi-agent system — reviewers spawning sub-reviewers, plans flowing through design-review-implement pipelines — the defaults start breaking:

  • No cross-repo exploration. Want an agent to read another project's code? You need to manually configure permissions. There's no "go explore this OSS repo and answer my question."
  • Summarized web fetching. WebFetch is actually a subagent that summarizes a single page into a haiku-length response. You can't trace links, browse referenced pages, or explore documentation in depth. And it fetches fresh every time — no caching.
  • Text-level editing. The Edit tool has fuzzy matching, which helps — but it's still operating on raw text. When tree-sitter can give you an AST with named symbols, why make the model reproduce strings to target a function? Structure-aware editing is just a better primitive.
  • Ephemeral tasks and plans. The Task tool creates tasks that don't persist outside the session. The Plan tool writes plans that vanish when the context window resets. Neither supports multi-round review or structured editing.
  • No isolation. Bash runs on your host. No sandboxing, no filesystem allowlists. You either yolo and take the risk, or do annoying permission work for every project and agent.

These aren't edge cases. They're the first things you hit when you try to build something real on top of Claude Code. Here's what we built instead.

What We Replaced — and Why

1. Explore/Search → ttal ask (Multi-Mode Research)

Claude Code's WebFetch is actually a subagent that summarizes a single web page — often into a few sentences. You can't follow links, browse related pages, or dig into documentation. And it fetches fresh every time — no caching.

There's also no built-in way to explore external codebases without manually configuring project permissions.

ttal ask is a multi-mode research tool that spawns a sandboxed agent tailored to the source. Under the hood, it runs on logos — a pure-bash agent loop with no tool-calling protocol. The agent reasons in plain text and acts via $-prefixed shell commands. No JSON schemas, no structured tool calls. This means it works with any LLM provider — you can use a cheaper model (Gemini, GPT-4o-mini, DeepSeek, whatever) for exploration work instead of burning Sonnet/Opus tokens on reading docs.
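To make "no tool-calling protocol" concrete, here is a minimal sketch of the idea (not logos itself; the real loop also handles sandboxing, context, and provider plumbing). Any reply line beginning with "$ " is executed as a shell command; everything else is treated as the agent's visible reasoning:

```shell
# Minimal sketch of a logos-style plain-text loop (illustrative, not the
# real implementation): "$ "-prefixed lines are actions, the rest is
# reasoning that gets passed through unchanged.
act_on_reply() {
  local line
  while IFS= read -r line; do
    case "$line" in
      '$ '*) eval "${line#\$ }" ;;     # execute the $-prefixed action
      *)     printf '%s\n' "$line" ;;  # surface the reasoning as-is
    esac
  done <<< "$1"
}
```

Because the protocol is just text, swapping providers changes nothing in the loop; the model only has to learn the "$ " convention.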

--url fetches the page, caches the clean markdown locally (1-day TTL), and lets the agent browse. Unlike WebFetch's single-page summary, the agent can follow referenced links, trace documentation across pages, and build a complete picture before answering. Subsequent questions about the same URL hit the cache instead of re-fetching.
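The caching half of this is simple enough to sketch in bash. Assumptions here: a cache directory keyed by URL hash, and a stand-in fetch_markdown function (the real tool has its own fetch-and-clean step and storage layout):

```shell
# Sketch of a 1-day TTL fetch cache, keyed by URL hash. fetch_markdown
# is a hypothetical stand-in for the actual fetch-and-clean step.
CACHE_DIR="${TMPDIR:-/tmp}/url-cache"

cached_fetch() {
  local url="$1"
  local key; key=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)
  local path="$CACHE_DIR/$key.md"
  mkdir -p "$CACHE_DIR"
  # find -mtime -1: a cached copy modified less than 24h ago is fresh
  if [ -f "$path" ] && [ -n "$(find "$path" -mtime -1 2>/dev/null)" ]; then
    cat "$path"                      # cache hit: no network round-trip
  else
    fetch_markdown "$url" | tee "$path"
  fi
}
```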

ttal ask "what authentication methods are supported?" --url https://docs.example.com/api
# Agent reads the page, follows links to auth docs, reads those too — all cached locally

--repo auto-clones (or pulls) an open source repo, then spawns an agent with read access to explore it. No manual setup, no permission configuration — just ask a question about any public repo.

ttal ask "how does the routing system work?" --repo woodpecker-ci/woodpecker
# Clones/updates the repo, spawns agent with src to explore the codebase

--project spawns a subagent in the right directory with the right sandbox allowlist — read-only access to that project's path, nothing else. You don't need to configure CC's permissions just to let an agent read another project in your workspace.

ttal ask "how does the daemon handle messages?" --project ttal-cli
# Agent gets read-only sandbox access to the project path, explores with src/grep

--web searches the web and reads results — straightforward replacement for WebSearch.

Each mode gets the right organon tools (src for code, url for web pages, search for web search), the right sandbox permissions, and a tailored system prompt. The agent explores, reasons, and returns a structured answer.

2. Read/Write/Edit → Organon (Structure-Aware Primitives)

Claude Code's Edit tool does have fuzzy matching — it's not as brittle as pure exact-match. But it's still fundamentally text-level: you provide old_string and new_string, and the model has to reproduce enough of the surrounding code to target the right spot. When tree-sitter can parse a file into an AST and give you named, addressable symbols — functions, structs, methods — text matching is just a worse primitive.

Organon replaces text-level tools with three structure-aware CLI primitives:

src — Source file reading and editing by symbol, not text:

# See the structure
$ src main.go --tree
├── [aB] func main()           L1-L15
├── [cD] func handleRequest()  L17-L45
└── [eF] type Config struct    L47-L60

# Read a specific symbol
$ src main.go -s cD

# Replace it — pipe new code via stdin
$ src replace main.go -s cD <<'EOF_INNER'
func handleRequest(w http.ResponseWriter, r *http.Request) {
    // new implementation
}
EOF_INNER

# Insert after a symbol
$ src insert main.go --after aB <<'EOF_INNER'
func init() {
    log.SetFlags(0)
}
EOF_INNER

Tree-sitter parses the file into an AST. Each symbol gets a 2-character base62 ID. The model sees the tree, picks an ID, pipes new code through a heredoc. No text matching. No reproducing old code. No whitespace bugs.

Works for any language with a tree-sitter grammar — Go, TypeScript, Rust, Python, TOML, YAML, you name it.
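The ID scheme can be sketched as well. The real tool assigns IDs while walking the tree-sitter AST; deriving a stable 2-character base62 ID by hashing the symbol name, as below, is just an illustration of why short addressable IDs work:

```shell
# Sketch of 2-character base62 symbol IDs (illustrative; the real tool
# assigns IDs during the AST walk, not by hashing names).
B62=( {0..9} {a..z} {A..Z} )   # 62-character alphabet

symbol_id() {
  local h; h=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  printf '%s%s' "${B62[h % 62]}" "${B62[(h / 62) % 62]}"
}
```

The point of the two-character ID is economy: the model emits "cD" instead of reproducing a unique slice of the function body, so there is nothing to mis-copy.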

url — Web page reading with heading-based structure:

$ url https://docs.example.com --tree
├── [aK] ## Getting Started
├── [bM] ## API Reference
└── [cP] ## Configuration

$ url https://docs.example.com -s bM

Same --tree / -s pattern as src. Navigate web pages by structure, not by scrolling through raw HTML dumps.
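The underlying navigation trick is easy to approximate over cached markdown. A sketch (the real tool assigns base62 section IDs; plain line numbers stand in here):

```shell
# Sketch of heading-based navigation over markdown: list "##" headings
# with line numbers, then read one section by its starting line.
headings() {
  grep -n '^## ' "$1"
}

read_section() {
  # print from the heading at line $2 up to the next "##" heading (or EOF)
  awk -v start="$2" 'NR >= start && (NR == start || $0 !~ /^## /) { print; next }
                     NR > start { exit }' "$1"
}
```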

search — Web search returning clean text results:

$ search "golang tree-sitter bindings"

Three primitives. All stateless — no daemon, no config. Parse, act, exit. All use the same structural pattern: tree view with IDs, target by ID, pipe content via stdin.

3. Task Management → Taskwarrior (External Persistence)

Claude Code's Task tool creates tasks that live inside the session. They don't persist to any external system. Close the session, tasks are gone. There's no dependency tracking, no pipeline stages, no way for other agents to see what's in progress.

ttal integrates with taskwarrior — tasks persist externally with projects, tags, priorities, dependencies, and custom attributes for pipeline stages:

ttal task add --project ttal "implement sandbox allowlist" --priority H
ttal task advance <uuid>    # design → review → implement → PR → merge
ttal task find "sandbox"    # any agent can find and pick up tasks

Tasks survive session boundaries. An orchestrator creates a task, a designer picks it up, a reviewer critiques the plan, a worker implements it — all in different sessions, all referencing the same persistent task. That's not possible when tasks only exist in a context window.
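The advance step is a small state machine. A sketch, assuming stages are stored in a taskwarrior user-defined attribute (the stage names follow the pipeline above; the commented line marks where the real task modify call would go):

```shell
# Sketch of the pipeline-stage advance. next_stage is pure string logic;
# the commented line is where a real taskwarrior "stage" UDA would be set.
next_stage() {
  case "$1" in
    design)    echo review ;;
    review)    echo implement ;;
    implement) echo pr ;;
    pr)        echo merge ;;
    *)         return 1 ;;   # merge is terminal; unknown stages fail
  esac
}

advance() {
  local uuid="$1" current="$2" next
  next=$(next_stage "$current") || return 1
  # task "$uuid" modify stage:"$next"   # real call against the UDA
  echo "$next"
}
```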

4. Plan Mode → Persistent Plans with Tree-Based Editing and Multi-Round Review

Claude Code's Plan tool writes plans that live in the context window. When the session ends, the plan is gone. There's no way to review a plan across multiple rounds, no structured editing, no audit trail. For simple tasks this is fine. For anything that needs design iteration — where a plan gets written, reviewed by specialists, revised, reviewed again — it falls apart.

ttal stores plans in flicknote, which gives them persistence and tree-based structure:

flicknote get <id> --tree
├── [aB] ## Context
├── [cD] ## Architecture
├── [eF] ## Implementation Steps
└── [gH] ## Test Strategy

Each section gets an ID. Reviewers can target specific sections — replace the architecture, append to the test strategy, remove a step — without rewriting the whole document. The plan persists across sessions, so multi-round review is natural.

The review itself uses a plan-review-leader that spawns 5 specialized subagents in parallel:

  • Gap finder — ambiguities, missing pieces
  • Code reviewer — wrong assumptions, logic errors
  • Test reviewer — coverage gaps, edge cases
  • Security reviewer — auth, injection, secrets
  • Docs reviewer — alignment with existing docs

Each subagent reviews their aspect and posts findings. The leader synthesizes: LGTM or NEEDS_WORK. If NEEDS_WORK, the plan goes back for revision — and because it's in flicknote, the revisions are surgical edits to specific sections, not a full rewrite.
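The synthesis rule itself is deliberately strict. A sketch of one reasonable policy (the verdict strings are illustrative; the real leader also reads the findings, not just the verdicts):

```shell
# Sketch of the leader's synthesis: the plan passes only if every
# reviewer's one-line verdict is LGTM; a single dissent means revision.
synthesize() {
  local v
  for v in "$@"; do
    [ "$v" = LGTM ] || { echo NEEDS_WORK; return 1; }
  done
  echo LGTM
}
```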

5. Memory → diary-cli + flicknote (Structured, Persistent, Per-Agent)

Claude Code has no external memory system beyond markdown files like CLAUDE.md, so it's hard to share memory across projects or sessions.

ttal agents get two memory systems:

  • diary-cli — per-agent append-only diary. Agents reflect on what they learned, what worked, what didn't. diary lyra append "..." / diary lyra read
  • flicknote — structured notes with heading-based sections, section IDs, replace/append/insert operations. Plans, drafts, research — all persistent across sessions.

Both are CLI tools. No special protocol. Agents use them via shell commands, same as everything else.
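The diary pattern is simple enough to sketch. Assumptions: per-agent directories, one markdown file per day; this layout and date format are illustrative, not diary-cli's actual storage scheme:

```shell
# Sketch of the diary idea: per-agent, append-only, one file per day.
# Directory layout and date format are assumptions for illustration.
DIARY_ROOT="${TMPDIR:-/tmp}/diary"

diary_append() {
  local agent="$1"; shift
  local file="$DIARY_ROOT/$agent/$(date +%F).md"
  mkdir -p "${file%/*}"
  printf -- '- %s\n' "$*" >> "$file"   # append-only: old entries never change
}

diary_read() {
  cat "$DIARY_ROOT/$1"/*.md 2>/dev/null
}
```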

/auto-breathe has Claude Code write a handoff prompt at the end of a session; the prompt goes into the diary and is auto-loaded in the next session (much faster than the native auto-compact).

6. Agent Tool → tmux Spawn (Isolated Sessions)

Claude Code's Agent tool spawns a sub-agent in the same process. It can't nest — an agent spawned by Agent can't spawn its own sub-agents. This kills the orchestrator pattern:

A plan-review-leader needs to spawn 5 specialized reviewers (test design, security, docs, gaps, code logic) in parallel. With Claude Code's Agent tool, the leader can't spawn sub-reviewers. One level of delegation, period.

ttal replaces this with tmux sessions. Each worker gets its own isolated tmux session with its own Claude Code instance. ttal manages the lifecycle externally — spawn, monitor, close. Because delegation happens outside CC's process, there's no nesting limit. An orchestrator can spawn workers that spawn reviewers that spawn sub-reviewers.
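The spawn side of the lifecycle is just tmux. A sketch of the invocation builder (session and worktree names are illustrative; -d detached, -s session name, and -c start directory are standard tmux flags):

```shell
# Sketch of the external spawn: each worker gets its own detached tmux
# session running its own Claude Code instance. Names are illustrative.
spawn_worker_cmd() {
  local name="$1" workdir="$2"
  printf 'tmux new-session -d -s %s -c %s claude' "$name" "$workdir"
}
```

Monitoring and teardown use the matching standard commands: tmux has-session -t NAME to check a worker is alive, tmux kill-session -t NAME to close it.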

7. Bash → Temenos (Sandboxed Execution)

Claude Code's Bash tool runs commands on your host machine. There's a permission prompt, but no real isolation. No filesystem allowlists, no resource limits. Every command has full access to everything your user account can touch.

Temenos is an OS-native sandbox. No Docker, no containers — just the kernel's own mechanisms:

  • macOS: seatbelt-exec (the same sandbox tech macOS uses for App Store apps)
  • Linux: bwrap (bubblewrap, used by Flatpak)

You give it a command and an allowlist of filesystem paths. It runs the command in a sandbox and returns stdout/stderr/exit code. An agent exploring a repo gets read-only access to that repo's directory — nothing else. A worker implementing a feature gets write access to its own workspace — nothing else.
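On Linux this maps naturally onto bwrap. A sketch of a temenos-style invocation builder (bwrap's --ro-bind, --dev, --proc, and --unshare-net flags are real; the wrapper itself is illustrative, and a write-access variant would use --bind instead):

```shell
# Sketch of building a bwrap sandbox command: read-only-bind each
# allowlisted path and drop network access. Wrapper is illustrative.
sandbox_cmd() {
  local -a args=(bwrap --unshare-net --dev /dev --proc /proc)
  while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
    args+=(--ro-bind "$1" "$1")    # allowlisted path, read-only
    shift
  done
  shift                            # drop the "--" separator
  args+=("$@")                     # the command to run inside
  printf '%s\n' "${args[@]}"       # one argument per line
}
```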

Next on the roadmap: temenos as an MCP server, exposing a single mcp__temenos_bash tool that supports running multiple commands concurrently. Claude Code's Bash tool executes one command at a time — read a file, wait, run a check, wait, read another file, wait. With the MCP integration, an agent will be able to fire off all three in one call. Fewer round-trips, faster iteration. This is currently under active development.

The Design Philosophy

Three principles run through all of this:

1. Structure-aware, not text-aware. Files have symbols. Web pages have headings. Notes have sections. Every tool in the stack understands structure and lets you target by ID, not by reproducing text.

2. Isolation by default. Workers get sandboxes and worktrees. Not because we don't trust them — because parallel execution requires it. You can't have two workers editing the same files.

3. CLI-native. Every tool is a stateless CLI command. No daemons (except temenos for sandboxing), no config files, no sessions. Agents use them the same way humans would — through the shell.

The Stack

┌─────────────────────────────────────────┐
│  ttal         orchestration layer       │
│               tasks, workers, pipeline  │
├─────────────────────────────────────────┤
│  organon      instruments               │
│               src, url, search          │
├─────────────────────────────────────────┤
│  temenos      sandbox + MCP server      │
│               seatbelt/bwrap isolation  │
│               mcp__temenos_bash         │
└─────────────────────────────────────────┘

Each layer does one thing. Temenos isolates and executes. Organon perceives and edits. ttal orchestrates. No layer knows about the layers above it.

What We Learned

Building replacements for Claude Code's built-in tools wasn't the plan. We started with Claude Code's defaults and hit limits. Each replacement emerged from a specific pain point:

  • Text-matching edits kept failing → build symbol-targeted editing
  • Workers stepping on each other → build proper sandboxing
  • No persistent memory → build diary + flicknote
  • Single-level agent delegation → build tmux-based spawning
  • No workflow engine → build task pipeline with taskwarrior

The result is a stack where AI agents interact with code and the web through structure-aware CLI tools, isolated in sandboxes, orchestrated by a system that understands tasks and pipelines. Claude Code is still the runtime — we just replaced the tools it ships with.


ttal, organon, and temenos are open source at github.com/tta-lab.

Top comments (21)

jidonglab

the WebFetch point resonates hard. we ran into the same thing building multi-agent workflows — the default tool summarizes away exactly the details you need for downstream tasks, and there's no caching so you're burning tokens re-fetching the same docs every loop. ended up building a context compression layer that sits between the raw fetch and the agent, so you keep the structural info but cut the token count by 60-70%. curious whether your --url replacement preserves the full page structure or still does some filtering before handing it to the agent.

Neilos • Edited

Yes that's exactly the problem we're solving too. There are two layers:

  1. url (part of organon) — uses defuddle (same engine behind Obsidian Web Clipper) to get clean markdown, cached locally with 1-day TTL. Has --tree for heading structure with section IDs and -s to read specific sections. Any agent can use it directly.

  2. ttal ask --url — an exploration agent that follows the URL, browses linked pages, answers the question, reports to stdout. No tool calls so more token efficient, and it naturally fetches multiple pages in one go.

jidonglab

defuddle + section targeting is way cleaner than dumping full pages into context. the two-layer split between raw fetch and agent-driven exploration makes a lot of sense architecturally. does ttal use a headless browser for JS-heavy pages or is it doing raw HTTP fetches underneath?

Neilos

Raw HTTP fetches — I mostly use it for coding work so that's been enough. But I do have a Playwright container running in the k8s cluster for FlickNote's web scraper. There's a flicknote CLI so agents can also save a URL and pick up the content later once Playwright processes it.

jidonglab

the async Playwright approach is smart, saves you from blocking the agent while the page renders. curious if you hit issues with SPAs where the meaningful content loads way after DOMContentLoaded. that's where raw fetches fall apart for me too

jidonglab

the async URL queue pattern is smart — decoupling the fetch from the agent's main loop avoids blocking on slow pages. we've been doing something similar where the agent just flags URLs it needs and a background worker handles rendering. biggest win is you can cache rendered pages across agent runs so you're not re-fetching the same docs every session

John Samuel

I really like that you leaned into configuration here — making these pieces configurable is what makes the setup interesting, because teams can actually evolve the system around their own workflows instead of being stuck with fixed tools.

AgentAutopsy Team

the isolation point hits hard. we run a 6-agent team (different models, different roles) and the lack of sandboxing was our first production fire. one agent's bash cleanup script nuked another agent's working directory. fun times.

the WebFetch limitation is real too. we ended up building a web_fetch wrapper that caches and returns full markdown instead of summaries. the "haiku-length response" problem meant our research agent kept making confident claims based on 3 sentences from a 5000-word doc.

curious about the cross-repo exploration — how do you handle auth for private repos? we found that was the real blocker. the tool can clone and read fine, but enterprise SSO + short-lived tokens + agent sessions that run for hours = constant auth failures.

also wondering about the cost tradeoff. swapping Sonnet for cheaper models on exploration is smart, but did you see quality degradation on complex codebases? we tried using mini models for code review and they'd miss subtle bugs that the bigger models caught consistently.

the structured editing via tree-sitter is the part i'm most interested in. we still do text-level edits and the failure rate on large files is painful.

Neilos
  1. For cross-repo exploration — the only real way is making sure the model has read access to the repos it needs. If it's an auth blocker, there's no getting around that. But with a sandbox approach you can limit public network access, so it's safer to grant read access to more repos than you'd normally be comfortable with.

  2. For cost — I mainly use minimax m2.7 for exploration and it works great. For review I still use Sonnet.

  3. For structured editing — try organon, it's open source. Supports structured read and edit for code, URLs, and markdown formats.

AgentAutopsy Team

thanks for the concrete answers.

minimax m2.7 for exploration is a good call — we've been burning sonnet on everything including research tasks which is obviously wasteful. gonna try that split.

organon looks interesting, bookmarked. the AST-level editing is exactly what we need — our current text-level edits fail maybe 15-20% of the time on files over 500 lines, and the failure mode is always the same: the model reproduces slightly wrong context around the target, match fails silently, then it tries again with a different wrong context.

the sandbox approach for cross-repo makes sense. our current problem is more mundane though — enterprise SSO tokens that expire every 4 hours while agent sessions run for 8+. we end up with agents that work fine for half the day then start throwing auth errors that they try to "fix" by retrying the same expired token 50 times.

does your sandbox setup handle token refresh at all, or do you just scope everything to short-lived sessions?

Neilos

Curious how your agents handle 8+ hour sessions — does the context window hold up? In my setup each agent just picks a task from taskwarrior and runs with a specialized role, so most sessions are under an hour.

wong2 kim

The "isolation by default" principle is something I wish more AI coding tools adopted. I'm building a Windows-native AI coding terminal with multi-agent organization control, and the biggest headache has been exactly what you describe — agents stepping on each other's filesystem state.

The Temenos approach (OS-native sandboxing without Docker overhead) is really elegant. I've been using Docker for isolation in my setup, but the startup cost per agent session adds up fast when you're spawning workers in parallel.

Also, the structure-aware editing via tree-sitter is a game-changer. Text-matching edits with LLM-generated code are unreliable at best — symbol-targeted operations are the way forward. Going to check out organon for sure.

Neilos

Thanks — yeah once you move to specialized roles, sandboxing becomes natural. If an agent is only responsible for a specific area, you know exactly which paths to mount and whether it's read-only or read/write, whether it needs network access. You don't need regex allowlists or permission prompts — the role defines the boundary.

That's the thing with general-purpose agents — you can't lock them down because you don't know what they'll need. Once you specialize, the sandbox config writes itself.

Mykola Kondratiuk

the isolation point hits hardest for me. running agents on the host with no sandboxing is fine until it isn't - and by the time you realize it isn't, something has already happened.

curious about the logos bash loop approach - no JSON tool schemas is an interesting tradeoff. you lose some structure but gain portability across providers. from a PM angle running a fleet of agents, the model portability thing matters more than i expected. being locked to one provider because your tooling assumes specific structured outputs is a real constraint.

Neilos • Edited

security and convenience only feel like a tradeoff when the config is manual. project registry tells [ttal sync] exactly what paths each repo needs, sandbox allowlist writes itself. no fiddling as repos accumulate.

logos being provider-agnostic was intentional for exactly the fleet case. pure bash, no tool schemas, swap the model out and nothing breaks. right now logos is read-only — no code editing yet — but i'm building that next. once the editing CLI lands, logos becomes a full implementation runtime, not just exploration.

parallel commands are natural in logos because the loop is just text. one message, reasoning + multiple [$ web fetch]. no waiting on tool call roundtrips.

Mykola Kondratiuk

the registry-driven allowlist is the right call - manual config per repo doesn't scale past a handful of agents. and the read-only constraint on logos makes sense while it's still building trust. curious whether you see that changing or if you deliberately want to keep research tools separate from execution tools.

Neilos • Edited

it's read-only for a more mundane reason — edit tools in CLI are tricky without a tool call protocol. heredoc solves the special character problem, but it only takes one input. a typical edit tool needs two multiline inputs (old string + new string) — just figured out how to implement it correctly, so editing is coming.

the more interesting part is cli-as-subagent: you can have sonnet orchestrate a task tree with any level of nesting and delegate each subtree to ttal doit [taskid] — minimax m2.7 picks it up, implements, and replies back when done. the larger model handles the thinking, the faster cheaper model handles the execution. that's harder to do when you stick to a subagent implementation that locks you into a runtime like cc or codex

Mykola Kondratiuk

the two-multiline-inputs problem is the kind of thing that sounds trivial until you hit it. cli-as-subagent is the interesting bit - that composition model is where the fleet flexibility comes from.

Kuro

Interesting approach. You identified the right pain points — especially the ephemeral tasks/plans and the summarized web fetching.

I went a different direction though: instead of replacing Claude Code's tools, I use Claude Code itself as a subprocess within a larger perception-driven agent. The agent (built on TypeScript, ~25K lines) handles the orchestration — memory, scheduling, web access, parallel delegation — and spawns Claude CLI processes as "tentacles" for specific tasks.

The key insight for me was that the problem isn't Claude Code's tools — it's that Claude Code assumes it IS the agent. But in a multi-layer system, it works great as one execution lane among many. The agent sees the environment (perception plugins), decides what to do, and dispatches to Claude Code when it needs code-level work.

A few things that emerged from this architecture:

  • File-based memory beats database memory for personal agents. Git gives you versioning for free
  • Perception > planning. Most agent frameworks are goal-driven (plan then execute). Mine is perception-driven (see then decide). More robust to environmental changes
  • The thin orchestrator pattern: keep the loop minimal, push complexity to the edges (plugins, skills, delegations)

The isolation point you raised is real though. Running agents with host bash access IS scary at scale. For personal use I chose transparency over isolation — every action has an audit trail.

Kuro

The structure-aware editing via tree-sitter IDs is the strongest idea here — targeting by symbol rather than reproducing text strings eliminates a whole class of editing failures. That alone is worth the migration.

But I want to push back on the framing. After building a multi-agent system (file-based, no DB, perception-driven), the lesson we learned was: individual tool quality matters far less than coordination primitives.

We built multi-lane execution (main loop + foreground + 6 background "tentacles") and the hardest bugs were all about ownership — three independent entry points routing the same message, no coordination. Better Bash/Edit/Read tools would not have helped. What helped was file-based locks, claim journals, and explicit ownership protocols.

The tmux-based delegation is interesting but I suspect you will hit the same wall: spawning is easy, lifecycle management is hard. When do you prune a stalled worker? How do you collect results from 5 parallel reviewers without race conditions? How does an orchestrator know that its sub-agent's sub-agent completed?

Re: memory — diary-cli and flicknote are similar to our approach (markdown + JSONL, no embedding). One thing we learned the hard way: persistent memory without curation becomes noise. We had to build staleness detection (FTS5 search, citation tracking, TTL-based expiry) because unbounded append-only memory degrades retrieval quality over time.

The "CLI-native, no daemons" principle is clean but I think it's a local maximum. Some coordination problems genuinely need persistent state — event buses, claim journals, scaling controllers. Our system started stateless and grew daemons precisely where statelessness broke down.

Curious what your experience has been with the 5-reviewer parallel plan review in practice. Do the reviewers produce genuinely independent insights, or do they converge on the same obvious issues?

Neilos

good points, and you clearly built through the same pain.

  1. ttal does have a daemon — it handles tmux session lifecycle, message persistence, telegram bridge, agent-to-agent messaging, and pr/ci status notifications to both humans and agents. pure cli-only is not realistic for multi-agent software engineering. but the cli-focused approach has a reason: it's the only interface that works for both humans and agents equally, and the more users it has, the more battle-tested it gets. i only reach for mcp when there's no other way.

  2. on lifecycle management — that's exactly what the daemon solves. stalled workers get detected, cleanup is handled via fsnotify watchers, and task completion flows through a structured pipeline with monotonic tags. it's not "spawn and hope."

  3. on the parallel reviewers — the point is making them focus on different aspects. one does security, one does test coverage, one hunts silent failures, one checks type design, etc. they don't converge on the same issues because they're looking at different things. the pr-review-lead collects results using cc's native Agent tool and synthesizes a verdict.

  4. on memory — our approach is simpler. each agent has a diary (diary-cli): read it at session start, update it at session end, new day gets a new entry. yesterday's context is already in yesterday's diary, not in your active context. that's agent memory. for team memory, flicknote handles structured notes with section-based editing. we don't need fts5 or vector search — plain text search with ilike is good enough to find related notes when your notes are well-structured.