The Problem: Claude Code's Tools Don't Scale
Claude Code ships with a reasonable set of built-in tools: Bash, Read, Write, Edit, Glob, G...
the WebFetch point resonates hard. we ran into the same thing building multi-agent workflows — the default tool summarizes away exactly the details you need for downstream tasks, and there's no caching so you're burning tokens re-fetching the same docs every loop. ended up building a context compression layer that sits between the raw fetch and the agent, so you keep the structural info but cut the token count by 60-70%. curious whether your --url replacement preserves the full page structure or still does some filtering before handing it to the agent.
Yes that's exactly the problem we're solving too. There are two layers:
url (part of organon) — uses defuddle (same engine behind Obsidian Web Clipper) to get clean markdown, cached locally with 1-day TTL. Has --tree for heading structure with section IDs and -s to read specific sections. Any agent can use it directly.
ttal ask --url — an exploration agent that follows the URL, browses linked pages, answers the question, and reports to stdout. No tool calls, so it's more token-efficient, and it naturally fetches multiple pages in one go.
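For readers who want the shape of that caching layer, here is a minimal bash sketch, not the actual implementation. It assumes Linux `stat -c %Y`, and `fetch_markdown` is a hypothetical stand-in for the real defuddle-based fetcher:

```shell
# Sketch of URL caching with a 1-day TTL. fetch_markdown is a hypothetical
# stand-in for the real HTML-to-markdown step.
fetch_markdown() { curl -fsSL "$1"; }     # stub: swap in the real converter

cached_fetch() {
  local url=$1 ttl=${2:-86400}            # TTL in seconds, default 1 day
  local key cache
  key=$(printf '%s' "$url" | md5sum | awk '{print $1}')
  cache="/tmp/urlcache-$key.md"
  if [ -f "$cache" ] && [ $(( $(date +%s) - $(stat -c %Y "$cache") )) -lt "$ttl" ]; then
    cat "$cache"                          # fresh enough: serve the cached copy
  else
    fetch_markdown "$url" | tee "$cache"  # stale or missing: refetch and store
  fi
}
```

The real tool can key its cache however it likes; the point is that the TTL check happens before any network call, so repeated doc lookups in a loop cost nothing.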
defuddle + section targeting is way cleaner than dumping full pages into context. the two-layer split between raw fetch and agent-driven exploration makes a lot of sense architecturally. does ttal use a headless browser for JS-heavy pages or is it doing raw HTTP fetches underneath?
Raw HTTP fetches — I mostly use it for coding work so that's been enough. But I do have a Playwright container running in the k8s cluster for FlickNote's web scraper. There's a flicknote CLI so agents can also save a URL and pick up the content later once Playwright processes it.
the async Playwright approach is smart, saves you from blocking the agent while the page renders. curious if you hit issues with SPAs where the meaningful content loads way after DOMContentLoaded. that's where raw fetches fall apart for me too
the async URL queue pattern is smart — decoupling the fetch from the agent's main loop avoids blocking on slow pages. we've been doing something similar where the agent just flags URLs it needs and a background worker handles rendering. biggest win is you can cache rendered pages across agent runs so you're not re-fetching the same docs every session
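that decoupling can be sketched in a few lines of bash. this is a hypothetical queue layout, not either of our actual implementations; `render_page` stands in for whatever does the headless rendering:

```shell
QUEUE=/tmp/urlq
mkdir -p "$QUEUE/pending" "$QUEUE/done"

render_page() { curl -fsSL "$1"; }        # stub: replace with headless rendering

queue_url() {                             # agent side: flag the URL and move on
  printf '%s\n' "$1" > "$QUEUE/pending/$(date +%s%N)"
}

drain_queue() {                           # worker side: render, cache, clear
  for job in "$QUEUE"/pending/*; do
    [ -e "$job" ] || continue
    render_page "$(cat "$job")" > "$QUEUE/done/$(basename "$job").md"
    rm -f "$job"
  done
}
```

because results land in `done/` keyed by job id, later sessions can pick up rendered pages without refetching, which is exactly the cross-run caching win.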
I really like that you leaned into configuration here — making these pieces configurable is what makes the setup interesting, because teams can actually evolve the system around their own workflows instead of being stuck with fixed tools.
the isolation point hits hard. we run a 6-agent team (different models, different roles) and the lack of sandboxing was our first production fire. one agent's bash cleanup script nuked another agent's working directory. fun times.
the WebFetch limitation is real too. we ended up building a web_fetch wrapper that caches and returns full markdown instead of summaries. the "haiku-length response" problem meant our research agent kept making confident claims based on 3 sentences from a 5000-word doc.
curious about the cross-repo exploration — how do you handle auth for private repos? we found that was the real blocker. the tool can clone and read fine, but enterprise SSO + short-lived tokens + agent sessions that run for hours = constant auth failures.
also wondering about the cost tradeoff. swapping Sonnet for cheaper models on exploration is smart, but did you see quality degradation on complex codebases? we tried using mini models for code review and they'd miss subtle bugs that the bigger models caught consistently.
the structured editing via tree-sitter is the part i'm most interested in. we still do text-level edits and the failure rate on large files is painful.
For cross-repo exploration — the only real way is making sure the model has read access to the repos it needs. If it's an auth blocker, there's no getting around that. But with a sandbox approach you can limit public network access, so it's safer to grant read access to more repos than you'd normally be comfortable with.
For cost — I mainly use minimax m2.7 for exploration and it works great. For review I still use Sonnet.
For structured editing — try organon, it's open source. Supports structured read and edit for code, URLs, and markdown formats.
thanks for the concrete answers.
minimax m2.7 for exploration is a good call — we've been burning sonnet on everything including research tasks which is obviously wasteful. gonna try that split.
organon looks interesting, bookmarked. the AST-level editing is exactly what we need — our current text-level edits fail maybe 15-20% of the time on files over 500 lines, and the failure mode is always the same: the model reproduces slightly wrong context around the target, match fails silently, then it tries again with a different wrong context.
the sandbox approach for cross-repo makes sense. our current problem is more mundane though — enterprise SSO tokens that expire every 4 hours while agent sessions run for 8+. we end up with agents that work fine for half the day then start throwing auth errors that they try to "fix" by retrying the same expired token 50 times.
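the fix we keep circling is making refresh part of the call path rather than hoping the agent notices. a bash sketch, with `api_call` and `refresh_token` as hypothetical stand-ins for the real SSO plumbing:

```shell
api_call() { [ "$1" = "valid" ]; }        # stub: succeeds only with a valid token
refresh_token() { echo "valid"; }         # stub: mints a fresh token

with_auth() {
  if ! api_call "$TOKEN" "$@"; then       # first try with whatever we have
    TOKEN=$(refresh_token)                # auth failed: refresh exactly once
    api_call "$TOKEN" "$@" || return 1    # still failing: surface the error
  fi
}

TOKEN="expired"
with_auth && echo "succeeded after refresh"
```

the key property is that the expired token triggers one refresh inline instead of a retry storm the agent has to reason about.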
does your sandbox setup handle token refresh at all, or do you just scope everything to short-lived sessions?
Curious how your agents handle 8+ hour sessions — does the context window hold up? In my setup each agent just picks a task from taskwarrior and runs with a specialized role, so most sessions are under an hour.
The "isolation by default" principle is something I wish more AI coding tools adopted. I'm building a Windows-native AI coding terminal with multi-agent organization control, and the biggest headache has been exactly what you describe — agents stepping on each other's filesystem state.
The Temenos approach (OS-native sandboxing without Docker overhead) is really elegant. I've been using Docker for isolation in my setup, but the startup cost per agent session adds up fast when you're spawning workers in parallel.
Also, the structure-aware editing via tree-sitter is a game-changer. Text-matching edits with LLM-generated code are unreliable at best — symbol-targeted operations are the way forward. Going to check out organon for sure.
Thanks — yeah once you move to specialized roles, sandboxing becomes natural. If an agent is only responsible for a specific area, you know exactly which paths to mount and whether it's read-only or read/write, whether it needs network access. You don't need regex allowlists or permission prompts — the role defines the boundary.
That's the thing with general-purpose agents — you can't lock them down because you don't know what they'll need. Once you specialize, the sandbox config writes itself.
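"The sandbox config writes itself" can be taken almost literally: map the role to its mounts and network policy. A sketch that just derives bubblewrap-style flags (`--ro-bind`, `--bind`, `--unshare-net`) from a role name; the roles and paths here are invented for illustration:

```shell
# Derive sandbox flags from a role. Roles and paths are illustrative.
sandbox_args() {
  case $1 in
    reviewer)     # read the code, no writes, no network
      echo "--ro-bind $HOME/repos /work --unshare-net" ;;
    implementer)  # read code, write only to a scratch dir
      echo "--ro-bind $HOME/repos /work --bind $HOME/scratch /work/scratch" ;;
    researcher)   # notes only, network stays on for fetching docs
      echo "--ro-bind $HOME/notes /work/notes" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}
```

A launcher can then splice these flags into the sandbox invocation, so adding a repo means updating the registry, not hand-editing per-agent configs.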
the isolation point hits hardest for me. running agents on the host with no sandboxing is fine until it isn't - and by the time you realize it isn't, something has already happened.
curious about the logos bash loop approach - no JSON tool schemas is an interesting tradeoff. you lose some structure but gain portability across providers. from a PM angle running a fleet of agents, the model portability thing matters more than i expected. being locked to one provider because your tooling assumes specific structured outputs is a real constraint.
security and convenience only feel like a tradeoff when the config is manual. project registry tells [ttal sync] exactly what paths each repo needs, sandbox allowlist writes itself. no fiddling as repos accumulate.
logos being provider-agnostic was intentional for exactly the fleet case. pure bash, no tool schemas, swap the model out and nothing breaks. right now logos is read-only — no code editing yet — but i'm building that next. once the editing CLI lands, logos becomes a full implementation runtime, not just exploration.
parallel commands are natural in logos because the loop is just text. one message, reasoning + multiple [$ web fetch]. no waiting on tool call roundtrips.
the registry-driven allowlist is the right call - manual config per repo doesn't scale past a handful of agents. and the read-only constraint on logos makes sense while it's still building trust. curious whether you see that changing or if you deliberately want to keep research tools separate from execution tools.
it's read-only for a more mundane reason — edit tools in CLI are tricky without a tool call protocol. heredoc solves the special character problem, but it only takes one input. a typical edit tool needs two multiline inputs (old string + new string) — just figured out how to implement it correctly, so editing is coming.
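for anyone hitting the same wall: one way to get two multiline inputs into a single bash invocation is attaching a separate heredoc to each file descriptor. a sketch (bash-specific; `edit_replace` is a made-up tool name, not the real CLI):

```shell
# edit_replace FILE 3<<'OLD' 4<<'NEW': reads the old string from fd 3 and
# the new string from fd 4, so both inputs can be multiline.
edit_replace() {
  local file=$1 old new text
  old=$(cat <&3)                          # first multiline input
  new=$(cat <&4)                          # second multiline input
  text=$(cat "$file")
  printf '%s\n' "${text/"$old"/$new}" > "$file"   # replace first occurrence
}

printf 'keep\nhello\nworld\nkeep\n' > /tmp/demo.txt
edit_replace /tmp/demo.txt 3<<'OLD' 4<<'NEW'
hello
world
OLD
hi
there
NEW
```

quoting the pattern in `${text/"$old"/$new}` makes bash treat it literally, which matters when the old string contains glob characters.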
the more interesting part is cli-as-subagent: you can have sonnet orchestrate a task tree with any level of nesting and delegate each subtree to ttal doit [taskid] — minimax m2.7 picks it up, implements, replies back when done. the larger model handles the thinking, the faster cheaper model handles the execution. that's harder to do when you stick to a subagent implementation that locks you into a runtime like cc or codex
the two-multiline-inputs problem is the kind of thing that sounds trivial until you hit it. cli-as-subagent is the interesting bit - that composition model is where the fleet flexibility comes from.
Interesting approach. You identified the right pain points — especially the ephemeral tasks/plans and the summarized web fetching.
I went a different direction though: instead of replacing Claude Code's tools, I use Claude Code itself as a subprocess within a larger perception-driven agent. The agent (built on TypeScript, ~25K lines) handles the orchestration — memory, scheduling, web access, parallel delegation — and spawns Claude CLI processes as "tentacles" for specific tasks.
The key insight for me was that the problem isn't Claude Code's tools — it's that Claude Code assumes it IS the agent. But in a multi-layer system, it works great as one execution lane among many. The agent sees the environment (perception plugins), decides what to do, and dispatches to Claude Code when it needs code-level work.
A few things that emerged from this architecture:
The isolation point you raised is real though. Running agents with host bash access IS scary at scale. For personal use I chose transparency over isolation — every action has an audit trail.
The structure-aware editing via tree-sitter IDs is the strongest idea here — targeting by symbol rather than reproducing text strings eliminates a whole class of editing failures. That alone is worth the migration.

But I want to push back on the framing. After building a multi-agent system (file-based, no DB, perception-driven), the lesson we learned was: individual tool quality matters far less than coordination primitives. We built multi-lane execution (main loop + foreground + 6 background "tentacles") and the hardest bugs were all about ownership — three independent entry points routing the same message, no coordination. Better Bash/Edit/Read tools would not have helped. What helped was file-based locks, claim journals, and explicit ownership protocols.

The tmux-based delegation is interesting but I suspect you will hit the same wall: spawning is easy, lifecycle management is hard. When do you prune a stalled worker? How do you collect results from 5 parallel reviewers without race conditions? How does an orchestrator know that its sub-agent's sub-agent completed?

Re: memory — diary-cli and flicknote are similar to our approach (markdown + JSONL, no embedding). One thing we learned the hard way: persistent memory without curation becomes noise. We had to build staleness detection (FTS5 search, citation tracking, TTL-based expiry) because unbounded append-only memory degrades retrieval quality over time.

The "CLI-native, no daemons" principle is clean but I think it's a local maximum. Some coordination problems genuinely need persistent state — event buses, claim journals, scaling controllers. Our system started stateless and grew daemons precisely where statelessness broke down.

Curious what your experience has been with the 5-reviewer parallel plan review in practice. Do the reviewers produce genuinely independent insights, or do they converge on the same obvious issues?
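The file-based locks mentioned above can be as primitive as an atomic `mkdir` claim: one process wins, the rest back off. A sketch with illustrative paths:

```shell
CLAIMS=/tmp/claims
mkdir -p "$CLAIMS"

claim() {
  # mkdir is atomic on the filesystem: exactly one caller creates the
  # directory and wins; everyone else gets a nonzero exit and backs off.
  mkdir "$CLAIMS/$1.lock" 2>/dev/null
}

handle_message() {
  if claim "$1"; then
    echo "this lane owns message $1"       # safe to process: we hold the claim
  else
    echo "message $1 already claimed" >&2  # another entry point got there first
  fi
}
```

It is crude, but it is exactly the kind of coordination primitive that better Bash/Edit/Read tools cannot substitute for.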
good points, and you clearly built through the same pain.
ttal does have a daemon — it handles tmux session lifecycle, message persistence, telegram bridge, agent-to-agent messaging, and pr/ci status notifications to both humans and agents. pure cli-only is not realistic for multi-agent software engineering. but the cli-focused approach has a reason: it's the only interface that works for both humans and agents equally, and the more users it has, the more battle-tested it gets. i only reach for mcp when there's no other way.
on lifecycle management — that's exactly what the daemon solves. stalled workers get detected, cleanup is handled via fsnotify watchers, and task completion flows through a structured pipeline with monotonic tags. it's not "spawn and hope."
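heartbeat-based pruning is one concrete way to do the stalled-worker half of that. a sketch, not ttal's actual mechanism (assumes Linux `stat -c %Y` and a heartbeat file each worker touches periodically):

```shell
# The daemon prunes any session whose heartbeat file is older than the
# threshold, then kills the matching tmux session best-effort.
prune_stalled() {
  local threshold=${1:-300}               # seconds of silence before pruning
  for hb in /tmp/agents/*.heartbeat; do
    [ -e "$hb" ] || continue
    local age=$(( $(date +%s) - $(stat -c %Y "$hb") ))
    if [ "$age" -gt "$threshold" ]; then
      local session=$(basename "$hb" .heartbeat)
      tmux kill-session -t "$session" 2>/dev/null || true  # best effort
      rm -f "$hb"                         # forget the dead worker
    fi
  done
}
```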
on the parallel reviewers — the point is making them focus on different aspects. one does security, one does test coverage, one hunts silent failures, one checks type design, etc. they don't converge on the same issues because they're looking at different things. the pr-review-lead collects results using cc's native Agent tool and synthesizes a verdict.
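the fan-out shape is simple enough to sketch in bash; `run_reviewer` here is a stand-in for however a role-scoped agent actually gets launched, not the real mechanism:

```shell
run_reviewer() { echo "[$1] findings for PR $2"; }  # stub: real agent launch goes here

run_reviews() {
  local pr=$1; shift
  mkdir -p /tmp/reviews && rm -f /tmp/reviews/*.md
  for role in "$@"; do
    run_reviewer "$role" "$pr" > "/tmp/reviews/$role.md" &  # one reviewer per focus
  done
  wait                                    # block until every reviewer reports
  cat /tmp/reviews/*.md                   # combined findings for the lead to synthesize
}

run_reviews 1234 security test-coverage silent-failures type-design
```

writing each reviewer to its own file sidesteps the race-condition question: collection is just reading the per-role files after `wait`.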
on memory — our approach is simpler. each agent has a diary (diary-cli): read it at session start, update it at session end, new day gets a new entry. yesterday's context is already in yesterday's diary, not in your active context. that's agent memory. for team memory, flicknote handles structured notes with section-based editing. we don't need fts5 or vector search — plain text search with ilike is good enough to find related notes when your notes are well-structured.
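for anyone wanting to copy the diary convention, it really is just this small (illustrative paths, GNU `date`):

```shell
DIARY="${DIARY_DIR:-$HOME/.agent-diary}"
mkdir -p "$DIARY"

session_start() {                          # load yesterday's context on demand
  local yesterday="$DIARY/$(date -d yesterday +%F).md"
  [ -f "$yesterday" ] && cat "$yesterday"
}

session_end() {                            # append to today's entry;
  echo "- $(date +%H:%M) $*" >> "$DIARY/$(date +%F).md"   # a new day starts a new file
}
```

the date-based filename is what keeps yesterday's context out of today's active window by default.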