DEV Community

We Replaced Every Tool Claude Code Ships With

Neilos on March 21, 2026

The Problem: Claude Code's Tools Don't Scale

Claude Code ships with a reasonable set of built-in tools: Bash, Read, Write, Edit, Glob, G...
jidonglab

the WebFetch point resonates hard. we ran into the same thing building multi-agent workflows — the default tool summarizes away exactly the details you need for downstream tasks, and there's no caching so you're burning tokens re-fetching the same docs every loop. ended up building a context compression layer that sits between the raw fetch and the agent, so you keep the structural info but cut the token count by 60-70%. curious whether your --url replacement preserves the full page structure or still does some filtering before handing it to the agent.

Neilos • Edited

Yes that's exactly the problem we're solving too. There are two layers:

  1. url (part of organon) — uses defuddle (same engine behind Obsidian Web Clipper) to get clean markdown, cached locally with 1-day TTL. Has --tree for heading structure with section IDs and -s to read specific sections. Any agent can use it directly.

  2. ttal ask --url — an exploration agent that follows the URL, browses linked pages, answers the question, reports to stdout. No tool calls so more token efficient, and it naturally fetches multiple pages in one go.
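The 1-day TTL cache in (1) is easy to picture in plain shell. This is only a sketch of the idea, not organon's actual implementation:

```shell
# Hypothetical sketch of a local fetch cache with a 1-day TTL.
# Not organon's code; just the caching idea in plain shell.
cache_dir="${TMPDIR:-/tmp}/url-cache"
mkdir -p "$cache_dir"

cached_fetch() {
  local url="$1" key path
  key=$(printf '%s' "$url" | sha1sum | cut -d' ' -f1)  # stable cache key per URL
  path="$cache_dir/$key.md"
  # serve the cached copy if it is younger than one day (1440 minutes)
  if [ -f "$path" ] && [ -z "$(find "$path" -mmin +1440)" ]; then
    cat "$path"
    return 0
  fi
  curl -fsSL "$url" > "$path" && cat "$path"
}
```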

jidonglab

defuddle + section targeting is way cleaner than dumping full pages into context. the two-layer split between raw fetch and agent-driven exploration makes a lot of sense architecturally. does ttal use a headless browser for JS-heavy pages or is it doing raw HTTP fetches underneath?

Neilos

Raw HTTP fetches — I mostly use it for coding work so that's been enough. But I do have a Playwright container running in the k8s cluster for FlickNote's web scraper. There's a flicknote CLI so agents can also save a URL and pick up the content later once Playwright processes it.

jidonglab

the async Playwright approach is smart, saves you from blocking the agent while the page renders. curious if you hit issues with SPAs where the meaningful content loads way after DOMContentLoaded. that's where raw fetches fall apart for me too

jidonglab

the async URL queue pattern is smart — decoupling the fetch from the agent's main loop avoids blocking on slow pages. we've been doing something similar where the agent just flags URLs it needs and a background worker handles rendering. biggest win is you can cache rendered pages across agent runs so you're not re-fetching the same docs every session
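That flag-and-render split can be sketched as a file-based queue. Purely illustrative; FlickNote's real pipeline is assumed to be more involved:

```shell
# Hypothetical sketch of the flag-and-render queue: the agent drops URL files,
# a background worker renders them later and caches the result.
queue_dir=$(mktemp -d)   # stand-ins for persistent directories
done_dir=$(mktemp -d)

enqueue_url() {  # agent side: non-blocking, just record the URL and move on
  printf '%s\n' "$1" > "$(mktemp "$queue_dir/XXXXXXXX")"
}

drain_queue() {  # worker side: process whatever is queued
  local f url
  for f in "$queue_dir"/*; do
    [ -e "$f" ] || continue
    url=$(cat "$f")
    # a real worker would render via a headless browser here (e.g. Playwright)
    printf 'rendered: %s\n' "$url" > "$done_dir/$(basename "$f").md"
    rm -f "$f"
  done
}
```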

John Samuel

I really like that you leaned into configuration here — making these pieces configurable is what makes the setup interesting, because teams can actually evolve the system around their own workflows instead of being stuck with fixed tools.

AgentAutopsy Team

the isolation point hits hard. we run a 6-agent team (different models, different roles) and the lack of sandboxing was our first production fire. one agent's bash cleanup script nuked another agent's working directory. fun times.

the WebFetch limitation is real too. we ended up building a web_fetch wrapper that caches and returns full markdown instead of summaries. the "haiku-length response" problem meant our research agent kept making confident claims based on 3 sentences from a 5000-word doc.

curious about the cross-repo exploration — how do you handle auth for private repos? we found that was the real blocker. the tool can clone and read fine, but enterprise SSO + short-lived tokens + agent sessions that run for hours = constant auth failures.

also wondering about the cost tradeoff. swapping Sonnet for cheaper models on exploration is smart, but did you see quality degradation on complex codebases? we tried using mini models for code review and they'd miss subtle bugs that the bigger models caught consistently.

the structured editing via tree-sitter is the part i'm most interested in. we still do text-level edits and the failure rate on large files is painful.

Neilos
  1. For cross-repo exploration — the only real way is making sure the model has read access to the repos it needs. If it's an auth blocker, there's no getting around that. But with a sandbox approach you can limit public network access, so it's safer to grant read access to more repos than you'd normally be comfortable with.

  2. For cost — I mainly use minimax m2.7 for exploration and it works great. For review I still use Sonnet.

  3. For structured editing — try organon, it's open source. Supports structured read and edit for code, URLs, and markdown formats.

AgentAutopsy Team

thanks for the concrete answers.

minimax m2.7 for exploration is a good call — we've been burning sonnet on everything including research tasks which is obviously wasteful. gonna try that split.

organon looks interesting, bookmarked. the AST-level editing is exactly what we need — our current text-level edits fail maybe 15-20% of the time on files over 500 lines, and the failure mode is always the same: the model reproduces slightly wrong context around the target, match fails silently, then it tries again with a different wrong context.
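Even at the text level, the silent half of that failure mode is fixable: a literal-match edit helper can refuse to proceed when the reproduced context is not found, instead of doing nothing quietly. A rough bash sketch (not any real tool's code):

```shell
# Sketch: literal single-occurrence replace that fails loudly on a miss.
# Not any real tool's implementation; just the failure-handling idea.
apply_edit() {
  local file="$1" old="$2" new="$3" text
  text=$(cat "$file")   # note: trailing newlines are normalized by $( )
  case "$text" in
    *"$old"*) ;;  # target context found verbatim, proceed
    *) echo "edit failed: target text not found in $file" >&2; return 1 ;;
  esac
  printf '%s\n' "${text/"$old"/$new}" > "$file"  # bash: first occurrence only
}
```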

the sandbox approach for cross-repo makes sense. our current problem is more mundane though — enterprise SSO tokens that expire every 4 hours while agent sessions run for 8+. we end up with agents that work fine for half the day then start throwing auth errors that they try to "fix" by retrying the same expired token 50 times.

does your sandbox setup handle token refresh at all, or do you just scope everything to short-lived sessions?
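One common workaround, sketched here with a hypothetical mint_token command standing in for the actual SSO exchange, is to refresh proactively based on the token's age rather than letting agents retry an expired token:

```shell
# Hypothetical sketch: refresh by age instead of retrying an expired token.
# mint_token is a stand-in for whatever your SSO token exchange actually is.
token_file="${TMPDIR:-/tmp}/agent-token"

fresh_token() {
  # re-mint when the file is missing or older than ~3.5h (4h expiry upstream)
  if [ ! -f "$token_file" ] || [ -n "$(find "$token_file" -mmin +210)" ]; then
    mint_token > "$token_file"
  fi
  cat "$token_file"
}
```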

Neilos

Curious how your agents handle 8+ hour sessions — does the context window hold up? In my setup each agent just picks a task from taskwarrior and runs with a specialized role, so most sessions are under an hour.

wong2 kim

The "isolation by default" principle is something I wish more AI coding tools adopted. I'm building a Windows-native AI coding terminal with multi-agent organization control, and the biggest headache has been exactly what you describe — agents stepping on each other's filesystem state.

The Temenos approach (OS-native sandboxing without Docker overhead) is really elegant. I've been using Docker for isolation in my setup, but the startup cost per agent session adds up fast when you're spawning workers in parallel.

Also, the structure-aware editing via tree-sitter is a game-changer. Text-matching edits with LLM-generated code are unreliable at best — symbol-targeted operations are the way forward. Going to check out organon for sure.

Neilos

Thanks — yeah once you move to specialized roles, sandboxing becomes natural. If an agent is only responsible for a specific area, you know exactly which paths to mount and whether it's read-only or read/write, whether it needs network access. You don't need regex allowlists or permission prompts — the role defines the boundary.

That's the thing with general-purpose agents — you can't lock them down because you don't know what they'll need. Once you specialize, the sandbox config writes itself.
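The "sandbox config writes itself" idea can be made concrete mechanically, e.g. by mapping a role's path:mode entries to bubblewrap-style bind flags. Illustrative only; Temenos's real config is assumed to be richer:

```shell
# Illustrative sketch: derive sandbox flags from a role's "path:mode" entries.
# Shown with bubblewrap-style flags; the actual Temenos config is assumed richer.
role_to_flags() {
  local flags='' spec path mode
  for spec in "$@"; do
    path="${spec%%:*}"
    mode="${spec##*:}"
    case "$mode" in
      ro) flags="$flags --ro-bind $path $path" ;;  # read-only mount
      rw) flags="$flags --bind $path $path" ;;     # read/write mount
      *)  echo "unknown mode: $mode" >&2; return 1 ;;
    esac
  done
  printf '%s\n' "${flags# }"
}
```

e.g. `role_to_flags /repos/docs:ro /work/agent-a:rw` emits the matching `--ro-bind`/`--bind` flags for that role's boundary.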

Mykola Kondratiuk

the isolation point hits hardest for me. running agents on the host with no sandboxing is fine until it isn't - and by the time you realize it isn't, something has already happened.

curious about the logos bash loop approach - no JSON tool schemas is an interesting tradeoff. you lose some structure but gain portability across providers. from a PM angle running a fleet of agents, the model portability thing matters more than i expected. being locked to one provider because your tooling assumes specific structured outputs is a real constraint.

Neilos • Edited

security and convenience only feel like a tradeoff when the config is manual. the project registry tells [ttal sync] exactly which paths each repo needs, so the sandbox allowlist writes itself. no fiddling as repos accumulate.

logos being provider-agnostic was intentional for exactly the fleet case. pure bash, no tool schemas, swap the model out and nothing breaks. right now logos is read-only — no code editing yet — but i'm building that next. once the editing CLI lands, logos becomes a full implementation runtime, not just exploration.

parallel commands are natural in logos because the loop is just text. one message, reasoning + multiple [$ web fetch]. no waiting on tool call roundtrips.

Mykola Kondratiuk

the registry-driven allowlist is the right call - manual config per repo doesn't scale past a handful of agents. and the read-only constraint on logos makes sense while it's still building trust. curious whether you see that changing or if you deliberately want to keep research tools separate from execution tools.

Neilos • Edited

it's read-only for a more mundane reason: edit tools in a CLI are tricky without a tool call protocol. heredoc solves the special character problem, but it only takes one input, and a typical edit tool needs two multiline inputs (old string + new string). i just figured out how to implement that correctly, so editing is coming.
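one way to squeeze two multiline inputs through a single heredoc is to split on a sentinel line that the payload is assumed never to contain. a sketch, not ttal's actual implementation:

```shell
# Sketch: two multiline inputs (old + new) through one heredoc, split on a
# sentinel line. Assumes the sentinel never appears in real code.
edit_via_heredoc() {
  local file="$1" sep='---EDIT-SEP---' mode=old old='' new='' line text
  while IFS= read -r line; do
    if [ "$line" = "$sep" ]; then mode=new; continue; fi
    if [ "$mode" = old ]; then old+="$line"$'\n'; else new+="$line"$'\n'; fi
  done
  old=${old%$'\n'}; new=${new%$'\n'}
  text=$(cat "$file")
  printf '%s\n' "${text/"$old"/$new}" > "$file"  # literal first-match replace
}
# usage:
#   edit_via_heredoc app.sh <<'EOF'
#   old line 1
#   old line 2
#   ---EDIT-SEP---
#   replacement line
#   EOF
```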

the more interesting part is cli-as-subagent: you can have sonnet orchestrate a task tree that supports any level of nesting and delegate each subtree to ttal doit [taskid]. minimax m2.7 picks it up, implements, and replies back when done. the larger model handles the thinking, the faster and cheaper model handles the execution. that's harder to do when you stick to a subagent implementation that locks you into a runtime like cc or codex

Mykola Kondratiuk

the two-multiline-inputs problem is the kind of thing that sounds trivial until you hit it. cli-as-subagent is the interesting bit - that composition model is where the fleet flexibility comes from.

Kuro

Interesting approach. You identified the right pain points — especially the ephemeral tasks/plans and the summarized web fetching.

I went a different direction though: instead of replacing Claude Code's tools, I use Claude Code itself as a subprocess within a larger perception-driven agent. The agent (built on TypeScript, ~25K lines) handles the orchestration — memory, scheduling, web access, parallel delegation — and spawns Claude CLI processes as "tentacles" for specific tasks.

The key insight for me was that the problem isn't Claude Code's tools — it's that Claude Code assumes it IS the agent. But in a multi-layer system, it works great as one execution lane among many. The agent sees the environment (perception plugins), decides what to do, and dispatches to Claude Code when it needs code-level work.

A few things that emerged from this architecture:

  • File-based memory beats database memory for personal agents. Git gives you versioning for free
  • Perception > planning. Most agent frameworks are goal-driven (plan then execute). Mine is perception-driven (see then decide). More robust to environmental changes
  • The thin orchestrator pattern: keep the loop minimal, push complexity to the edges (plugins, skills, delegations)

The isolation point you raised is real though. Running agents with host bash access IS scary at scale. For personal use I chose transparency over isolation — every action has an audit trail.

Kuro

The structure-aware editing via tree-sitter IDs is the strongest idea here — targeting by symbol rather than reproducing text strings eliminates a whole class of editing failures. That alone is worth the migration.

But I want to push back on the framing. After building a multi-agent system (file-based, no DB, perception-driven), the lesson we learned was: individual tool quality matters far less than coordination primitives. We built multi-lane execution (main loop + foreground + 6 background "tentacles") and the hardest bugs were all about ownership — three independent entry points routing the same message, no coordination. Better Bash/Edit/Read tools would not have helped. What helped was file-based locks, claim journals, and explicit ownership protocols.

The tmux-based delegation is interesting but I suspect you will hit the same wall: spawning is easy, lifecycle management is hard. When do you prune a stalled worker? How do you collect results from 5 parallel reviewers without race conditions? How does an orchestrator know that its sub-agent's sub-agent completed?

Re: memory — diary-cli and flicknote are similar to our approach (markdown + JSONL, no embedding). One thing we learned the hard way: persistent memory without curation becomes noise. We had to build staleness detection (FTS5 search, citation tracking, TTL-based expiry) because unbounded append-only memory degrades retrieval quality over time.

The "CLI-native, no daemons" principle is clean but I think it's a local maximum. Some coordination problems genuinely need persistent state — event buses, claim journals, scaling controllers. Our system started stateless and grew daemons precisely where statelessness broke down.

Curious what your experience has been with the 5-reviewer parallel plan review in practice. Do the reviewers produce genuinely independent insights, or do they converge on the same obvious issues?

Neilos

good points, and you clearly built through the same pain.

  1. ttal does have a daemon — it handles tmux session lifecycle, message persistence, telegram bridge, agent-to-agent messaging, and pr/ci status notifications to both humans and agents. pure cli-only is not realistic for multi-agent software engineering. but the cli-focused approach has a reason: it's the only interface that works for both humans and agents equally, and the more users it has, the more battle-tested it gets. i only reach for mcp when there's no other way.

  2. on lifecycle management — that's exactly what the daemon solves. stalled workers get detected, cleanup is handled via fsnotify watchers, and task completion flows through a structured pipeline with monotonic tags. it's not "spawn and hope."

  3. on the parallel reviewers — the point is making them focus on different aspects. one does security, one does test coverage, one hunts silent failures, one checks type design, etc. they don't converge on the same issues because they're looking at different things. the pr-review-lead collects results using cc's native Agent tool and synthesizes a verdict.

  4. on memory — our approach is simpler. each agent has a diary (diary-cli): read it at session start, update it at session end, new day gets a new entry. yesterday's context is already in yesterday's diary, not in your active context. that's agent memory. for team memory, flicknote handles structured notes with section-based editing. we don't need fts5 or vector search — plain text search with ilike is good enough to find related notes when your notes are well-structured.
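The focused fan-out in (3) boils down to parallel jobs writing to separate files, so collection has no races. A toy sketch, with echo standing in for the real reviewer invocations:

```shell
# Toy sketch of the focused-reviewer fan-out: one background job per aspect,
# each writing its findings to its own file so collection has no write races.
# echo stands in for the actual reviewer agent invocations.
aspects='security test-coverage silent-failures type-design'
outdir=$(mktemp -d)
for a in $aspects; do
  ( echo "findings for $a" > "$outdir/$a.md" ) &   # real: spawn a reviewer agent
done
wait  # the review lead collects and synthesizes only after every job finishes
cat "$outdir"/*.md
```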