
clearloop for OpenWalrus

Posted on • Originally published at openwalrus.xyz

Built-in tools: what your agent can reach

Every coding agent ships a toolbox. But what's actually in it?

The tools an agent has access to define its ceiling. An agent without a browser can't test web apps. An agent without code intelligence can't jump to definitions. An agent without a terminal can't run tests. What products choose to include — and what they leave out — reveals their theory of what an agent should do.

We cataloged the built-in tools in ten AI coding products, all from official documentation and public repos as of March 2026. This is a companion to our survey of built-in agents — that post covered the agents, this one covers what those agents can reach.

What we surveyed

Ten products, same list as the agents survey:

  • Claude Code (Anthropic) — CLI agent
  • GitHub Copilot — IDE + CLI agent
  • Cursor — IDE agent
  • Windsurf (Codeium) — IDE agent
  • Devin (Cognition) — autonomous cloud agent
  • OpenHands (All Hands AI) — multi-agent framework
  • Aider — terminal pair programmer
  • Amazon Q Developer — AWS-integrated agent
  • Gemini Code Assist (Google) — IDE agent
  • Augment Code — IDE agent + orchestration

Product by product

Claude Code

The most granular tool separation we found. Eleven named tools, each with distinct parameters and individually permissioned:

| Tool | Category | Purpose |
| --- | --- | --- |
| Read | File | Read file contents with optional line range |
| Write | File | Create or overwrite a file |
| Edit | File | Exact string replacement in a file |
| Glob | File | Fast file pattern matching |
| Bash | Terminal | Execute shell commands |
| Grep | Search | Content search built on ripgrep |
| WebSearch | Web | Search the web for information |
| WebFetch | Web | Fetch and process URL content |
| NotebookEdit | File | Edit Jupyter notebook cells |
| TodoWrite | Planning | Structured task tracking |
| Task | Agent | Launch subagent for complex work |

Each tool can be placed in an allow list (auto-approved), a deny list (blocked), or the default ask list (requires user approval). Subagents receive restricted tool sets: the Explore and Plan subagents both get read-only tools (Write and Edit denied). This per-tool, per-agent permission model is the most fine-grained we found.
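Concretely, these lists live in Claude Code's settings files. A minimal sketch of a project-level `.claude/settings.json`, following the shape of Anthropic's documented permission rules (treat the exact matcher syntax, especially the `Bash(...)` prefix rules, as an assumption that may vary by version):

```json
{
  "permissions": {
    "allow": ["Read", "Grep", "Glob"],
    "ask": ["Write", "Edit"],
    "deny": ["WebFetch", "Bash(rm:*)"]
  }
}
```

Anything not matched by a rule falls through to the default ask behavior, which is what makes the allow list useful: it removes approval prompts only for the operations you have explicitly pre-cleared.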

GitHub Copilot

The CLI and IDE agent modes share a common tool set, though the exact tool names aren't published the way Claude Code's are:

| Category | Capabilities |
| --- | --- |
| File | Read files, edit files, create files |
| Terminal | Run commands, view output |
| Search | Semantic code search, file search by name |
| Web | Web search, browser preview |
| Agent | Delegate to Explore/Task/Plan/Code Review agents |

Custom agents (Markdown files with YAML frontmatter) can restrict tool access via a tools list — the same pattern as Claude Code. MCP servers extend the tool set beyond built-in capabilities.
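The custom-agent format is compact. An illustrative sketch; the file location (e.g. a `.github/agents/explorer.md` path) and the exact tool identifiers are assumptions here and vary by Copilot surface:

```markdown
---
name: explorer
description: Read-only exploration agent
tools: ['read', 'search']
---

Explore the codebase and answer questions about its structure.
Do not edit files or run commands.
```

The frontmatter's `tools` list acts as an allow list: tools left off the list are simply unavailable to that agent, which is the same restriction mechanism Claude Code applies to its subagents.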

Copilot's agent mode automatically selects which tools to use and can run multiple tool calls in parallel. Tool selection is not user-visible in the same way as Claude Code's individual permission prompts.

Cursor

Ten documented tools available in Agent mode:

| Tool | Category | Purpose |
| --- | --- | --- |
| Semantic search | Search | Search by meaning across indexed codebase |
| File/folder search | Search | Find by name, directory, keywords |
| Web search | Web | Search the internet |
| Fetch rules | Config | Retrieve project rules |
| Read files | File | Read text and image files |
| Edit files | File | Suggest and auto-apply edits |
| Run shell commands | Terminal | Execute terminal commands |
| Browser control | Web | Navigate, screenshot, interact with pages |
| Image generation | Vision | Generate images |
| Ask clarifying questions | User | Request information from user |

Browser control is notable — Cursor can navigate to URLs, take screenshots, click elements, and type text. Most products don't ship browser interaction at all. Image generation is also rare; Cursor can generate images as part of its workflow.

Custom Modes restrict which tools are available. Ask Mode removes write capabilities. Manual Mode limits to explicit file editing.

Windsurf

Windsurf's Cascade agent has a smaller, less granular tool set:

| Category | Capabilities |
| --- | --- |
| Search | Search and analyze codebase |
| Web | Web search |
| Terminal | Terminal command execution |
| Code quality | Linter integration (auto-fixes lint errors) |
| Package management | Auto-detects and installs missing packages |

A hard limit of 20 tool calls per prompt caps how much work the agent can do in a single turn. This is the only product in our survey with a documented tool-call ceiling.

Extensibility is limited to MCP server configuration. There's no per-tool permission model — tools are either available in a mode or not.

Devin

The broadest tool surface in our survey. Devin runs in a cloud IDE environment with full system access:

| Category | Capabilities |
| --- | --- |
| File | Full filesystem access (editor + file explorer) |
| Terminal | Multiple terminal sessions |
| Browser | Full Chromium browser (real, not headless) |
| Search | Devin Search with "Deep Mode" for complex queries |
| Knowledge | Devin Wiki (auto-indexed repo documentation) |
| Review | Devin Review (code review with commit application) |
| Testing | Desktop Testing via computer use (Linux) |
| Git | Full git operations |

Devin's tools aren't discrete named functions — they're a full operating environment. The agent can open multiple terminals, browse the web, interact with GUIs, and run desktop applications. This is closer to giving the agent a full computer than a set of API tools.

The tradeoff: Devin runs in the cloud, not locally. Everything happens in Cognition's sandboxed VMs.

OpenHands

OpenHands takes the most radical approach to tooling: CodeActAgent unifies all actions into executable code.

| Category | Implementation |
| --- | --- |
| File operations | open(), os.path, shell commands |
| Terminal | Bash execution (arbitrary commands) |
| Python | Interactive Python interpreter |
| Browser | Delegated to BrowsingAgent |
| User interaction | Natural language conversation |

There are no named tools like "Read" or "Edit." The agent writes Python or bash that does what it needs. Want to read a file? `cat file.txt`. Want to search? `grep -r pattern .`. Want to install a package? `pip install package`.

This "code action space" approach means OpenHands has no tool ceiling — anything you can do in a terminal or Python REPL, the agent can do. But it also means there's no tool-level permission control. You can't say "allow file reads but deny file writes" because both happen through the same execution mechanism. We explored the security implications of this in our tool permissions survey.

Aider

Aider doesn't expose tools in the agent framework sense. Instead, capabilities are built into the conversation loop:

| Category | Implementation |
| --- | --- |
| File editing | Built into the LLM response format (diff/whole-file/udiff edit formats) |
| Code search | Repository map via tree-sitter (symbol-level index of entire repo) |
| Code quality | Auto-lint after every LLM edit |
| Testing | /test command runs tests and auto-fixes failures |
| File management | File watching + auto-add when referenced |
| Git | Auto-commit with descriptive messages after each edit |
| Voice | Voice coding support (transcription → code changes) |
| Vision | Image input for vision-capable models |

The repository map is Aider's standout capability. It uses tree-sitter to build a symbol-level index of the entire codebase — function signatures, class definitions, method names — and sends a compressed map to the LLM as context. This gives the model a structural understanding of the codebase without reading every file. No other product in our survey uses tree-sitter this way.
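Aider builds its map with tree-sitter so it works across languages; the core idea, extract signatures rather than bodies, can be sketched for a single Python file with the stdlib `ast` module (a simplified illustration, not Aider's code):

```python
import ast

source = '''
class Cache:
    def get(self, key): ...
    def set(self, key, value): ...

def connect(url, timeout=30): ...
'''

def repo_map(src):
    """Emit a compressed, symbol-level map: names and signatures only,
    no function bodies, so it fits in the LLM's context window."""
    lines = []
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
        elif isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
    return "\n".join(lines)

print(repo_map(source))
```

Run over a whole repository, a map like this gives the model the shape of the codebase (which classes exist, what their methods take) at a tiny fraction of the token cost of the raw source.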

No terminal access is exposed to the model directly — Aider runs commands (lint, test) on the model's behalf but doesn't give the model a shell.

Amazon Q Developer

Amazon Q's agent capabilities are organized as specialized features rather than named tools:

| Category | Capabilities |
| --- | --- |
| Code generation | Real-time inline suggestions (25+ languages) |
| File editing | Multi-file implementation with test validation |
| Security | Vulnerability scanning (exposed credentials, injection, etc.) |
| Testing | Iterative unit test generation |
| Documentation | In-depth doc generation with data flow diagrams |
| Code review | Logical errors, anti-patterns, security issues |
| Transformation | .NET porting (Windows → Linux), Java version upgrades |

The software development agent runs build and test scripts to validate generated code before presenting results. The CLI supports MCP for external tool integration.

Unlike Claude Code or Cursor, Amazon Q doesn't publish a list of discrete, named tools. The agent's capabilities are described as features, not as an API surface.

Gemini Code Assist

The most IDE-integrated tool set. Google's agent mode documentation lists ten named tools for IntelliJ:

| Tool | Category | Purpose |
| --- | --- | --- |
| read_file | File | Retrieve text content |
| write_file | File | Write text to files |
| find_files | File | Locate files by name or path |
| list_files | File | Enumerate directory contents |
| grep | Search | Search for text patterns |
| analyze_current_file | Code Intel | Check for errors and warnings |
| resolve_symbol | Code Intel | Trace symbol declarations |
| find_usages | Code Intel | Identify all references to a symbol |
| git | Git | Execute git CLI commands |
| list_vcs_roots | Git | Return version control repositories |

resolve_symbol and find_usages are the standouts. These are code intelligence operations — go-to-definition and find-all-references — that leverage the IDE's language server. No other product in our survey exposes these as first-class agent tools. When Gemini needs to understand how a function is used across a codebase, it can ask the language server rather than grepping for text patterns.
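The gap between text search and symbol resolution is easy to demonstrate. grep reports every occurrence of a name; an AST (or a language server) separates definitions from calls from string literals. A sketch with Python's stdlib `ast` module:

```python
import ast

source = '''
def render(page):
    return page

result = render("render")
log = "render failed"
'''

# grep would report every textual occurrence of "render" in this file.
# An AST walk classifies each occurrence by its syntactic role instead.
tree = ast.parse(source)
defs, calls, strings = [], [], []
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef) and node.name == "render":
        defs.append(node.lineno)          # the definition
    elif (isinstance(node, ast.Call)
          and isinstance(node.func, ast.Name)
          and node.func.id == "render"):
        calls.append(node.lineno)         # a call site
    elif (isinstance(node, ast.Constant)
          and isinstance(node.value, str)
          and "render" in node.value):
        strings.append(node.lineno)       # a string literal, not code

print(defs, calls, strings)
```

A rename refactor should touch the definition and the call but leave both string literals alone; text search alone cannot make that distinction, which is exactly what resolve_symbol and find_usages buy an agent.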

In VS Code, all Gemini CLI built-in tools are available instead. MCP servers extend the set further.

Augment Code

Augment's IDE agent has the broadest integration surface:

| Category | Capabilities |
| --- | --- |
| File | File operations (read, write, edit) |
| Terminal | Terminal execution |
| Search | Web search |
| Vision | Image understanding |
| Multi-repo | Cross-repository coordination |
| Native integrations | GitHub, Linear, Jira, Confluence, Notion, Sentry, Stripe |
| MCP | 100+ configurable tools |
| Multi-model | Multiple AI models (Claude, GPT, etc.) |

Two implementation details stand out. Parallel tool execution — Augment runs independent tool calls concurrently, claiming 2x faster turns. Most products execute tools sequentially. Native integrations — instead of generic MCP connections, Augment ships purpose-built integrations with project management (Linear, Jira), documentation (Confluence, Notion), and monitoring (Sentry, Stripe) tools. This means the agent can read Jira tickets and Sentry errors without configuring MCP servers.
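Why parallel execution roughly halves turn latency is visible in a few lines. A sketch with asyncio and hypothetical tool functions (not Augment's API); each simulated tool takes 0.1 s, but two independent calls complete in about 0.1 s total, not 0.2 s:

```python
import asyncio
import time

# Hypothetical tool calls; asyncio.sleep stands in for I/O latency.
async def read_file(path):
    await asyncio.sleep(0.1)
    return f"contents of {path}"

async def web_search(query):
    await asyncio.sleep(0.1)
    return f"results for {query}"

async def agent_turn():
    # Independent calls run concurrently: total latency is the max
    # of the two, not the sum.
    start = time.monotonic()
    contents, results = await asyncio.gather(
        read_file("README.md"),
        web_search("walrus agent tools"),
    )
    return contents, results, time.monotonic() - start

contents, results, elapsed = asyncio.run(agent_turn())
print(f"{elapsed:.2f}s")
```

The catch is dependency analysis: the runtime must know two calls are independent (an edit that follows a read of the same file is not) before it can safely gather them.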

The inventory at a glance

| Product | File Ops | Terminal | Search | Web/Browser | Code Intel | Git | Vision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Code | Read/Write/Edit/Glob | Bash | Grep | WebSearch + WebFetch | | via Bash | |
| Copilot | Read/Edit | Terminal | Semantic + file | Web search + preview | | via terminal | |
| Cursor | Read/Edit | Shell | Semantic + file + web | Browser control | | via shell | Image gen + read |
| Windsurf | Search/analyze | Terminal | | Web search | Linter | via terminal | |
| Devin | Editor + filesystem | Terminal | Devin Search | Full browser | | Full git | Desktop use |
| OpenHands | via code | Bash + Python | via code | BrowsingAgent | | via code | |
| Aider | Built-in edit | | Repo map (tree-sitter) | | tree-sitter | Auto-commit | Image input |
| Amazon Q | Suggestions + edit | Build/test | | | Security scan | | |
| Gemini Code Assist | read/write/find/list | | grep + find_files | | resolve_symbol, find_usages | git CLI | |
| Augment | File ops | Terminal | Web search | Native integrations | | GitHub native | Image understanding |

Three design philosophies

The ten products fall into three approaches to tool design:

Granular named tools. Claude Code and Gemini Code Assist give each operation a distinct name, specific parameters, and independent permissions. Read is not Grep is not Glob. The LLM sees a menu of specific operations and picks the right one. This enables fine-grained permission control — you can allow Read but deny Write, or allow Grep but deny Bash. The cost is more tool definitions consuming context window space, and more decision points where the model can pick the wrong tool.
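What "a menu of specific operations" means in practice is a typed schema per tool. An illustrative function-calling-style definition (not any one product's exact format); the model sees the name, description, and parameter types, and emits a structured call rather than free-form code:

```json
{
  "name": "read_file",
  "description": "Read file contents with an optional line range",
  "parameters": {
    "type": "object",
    "properties": {
      "path": {"type": "string"},
      "start_line": {"type": "integer"},
      "end_line": {"type": "integer"}
    },
    "required": ["path"]
  }
}
```

The typed boundary is what makes per-tool permissions enforceable: the runtime can inspect the call (which tool, which path) before executing it, which is impossible when the action is an opaque script.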

Code-as-tools. OpenHands and (to a lesser degree) Aider skip the named-tool abstraction. The agent writes executable code — bash or Python — that performs whatever operation it needs. The "tool set" is infinite: anything you can do in a REPL is available. This is maximally expressive but minimally controllable. As we explored in our sandbox permissions survey, the security boundary shifts from "which tools are allowed" to "what can the sandbox environment access."

IDE-integrated tools. Cursor, Gemini Code Assist, and Augment map tools to IDE capabilities. Semantic search uses the IDE's index. resolve_symbol uses the language server. Browser control uses an embedded browser. The agent inherits whatever the IDE can do. This is powerful — code intelligence operations like find-all-references are genuinely useful for refactoring — but ties the agent to a specific IDE runtime.
[Interactive chart — see original post]

What stands out

Code intelligence is the biggest gap. Only Gemini Code Assist ships resolve_symbol and find_usages as named tools. Every other product relies on text search (grep, ripgrep, semantic search) to understand code structure. Text search can find where a function name appears, but it can't distinguish a definition from a call from a string literal. For large-scale refactoring, this difference matters — and it's the clearest area where IDE-integrated agents have an advantage.

Browser interaction is rare. Only Cursor (browser control: navigate, screenshot, click, type) and Devin (full Chromium in cloud VM) ship browser interaction. The other eight products can't test web UIs, can't follow links in documentation, and can't verify rendered output. As agent tasks get more complex, this gap will grow.

The granularity spectrum is wide. Claude Code has 11+ named tools. OpenHands has effectively 2 (bash + Python interpreter). Both ship, both work, and both have users. The tradeoff is control vs. expressiveness — and the bash bypass problem shows that granular tools don't provide real security if the agent also has a shell.

Vision is emerging but uneven. Cursor generates and reads images. Devin has full desktop computer use. Augment understands images. Aider accepts image input. But Claude Code, Copilot, Windsurf, OpenHands, Amazon Q, and Gemini Code Assist are primarily text-only in their tool interactions.

MCP is the escape hatch. Eight of ten products support MCP for adding tools beyond the built-in set. The built-in tools define the floor — the minimum capability surface. MCP raises the ceiling. But no two products ship the same MCP servers by default, so the "extended" tool set varies widely. We discussed MCP's role as a universal extensibility layer in our skills design post.
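Most MCP-enabled products consume a server list shaped roughly like the block below. The filesystem server shown is one of the Model Context Protocol reference servers; the config file name, location, and key names differ per product, so treat this as a representative sketch rather than any single product's format:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
    }
  }
}
```

Each entry spawns (or connects to) a server process that advertises its tools over the protocol; the host agent merges those tools into its built-in set at session start.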
[Interactive chart — see original post]

What the research says

Tool selection accuracy remains an active research area. The ToolBench benchmark (May 2023) showed that GPT-4 achieved 56.6% pass rate on complex tool-use tasks involving 16,000+ real-world APIs — demonstrating that more tools doesn't automatically mean better performance. Models make selection errors when the tool set is large and tools have overlapping functionality.

The CodeAct paper (February 2024) that inspired OpenHands' approach found that code actions outperformed JSON-based tool calls on 6 of 7 benchmarks, with an average 20% improvement. The argument: LLMs are better at writing code than selecting from a tool menu, so "code is the tool" produces better results.

However, Gorilla (May 2023) showed that fine-tuning on API documentation significantly improves tool-use accuracy, and that constrained API calls (named tools with typed parameters) reduce hallucinated function calls compared to free-form code generation. The granular-tools camp has evidence too.

The tradeoff may not be universal. For coding tasks with well-known operations (read, write, search, run), named tools reduce errors. For novel tasks requiring creative tool composition, code-as-tools offers more flexibility.

Open questions

Will code intelligence tools become standard? Gemini Code Assist is alone in shipping resolve_symbol and find_usages. If agents become primary refactoring tools, every product will need symbol-level operations — not just text search. Will they build it, or will MCP language server integrations fill the gap?

Does tool granularity help or hurt LLM performance? Claude Code has 11+ tools; OpenHands has 2. ToolBench suggests more tools can reduce accuracy, but CodeAct suggests code beats API calls. The answer may depend on the model — larger models handle more tools better, but tool-call overhead costs tokens regardless of model size.

Will browser interaction become baseline? Cursor and Devin have it. Eight products don't. As agents take on full-stack tasks (frontend + backend + testing), can they remain effective without seeing the rendered page?

Does "code-as-tools" scale? OpenHands' approach is elegant — infinite expressiveness, zero tool ceiling. But it means every operation goes through bash or Python, making audit trails harder to parse and permissions harder to enforce. Does this matter at scale, or is it a theoretical concern?

Should the built-in tool set be standardized? MCP standardizes the protocol for adding tools. But there's no standard for what tools should ship built-in. If you write an MCP server that provides file operations, does it need to match Claude Code's Read/Write/Edit/Glob interface, or can it define its own? Tool portability across products doesn't exist yet.

What's the right tool-call limit? Windsurf caps at 20 tool calls per prompt. Most products have no documented limit. Is a limit a safety feature (prevents runaway agents) or a capability ceiling (prevents complex multi-step work)?

What this means for walrus

Walrus exposes capabilities to agents through WHS hooks — and the design questions in this survey map directly to WHS architecture.

The granularity question applies to hooks. Should a WHS memory hook expose fine-grained operations (store, query, delete, list) or a single broad operation (execute_memory_operation)? The Claude Code/Gemini approach (granular named tools) enables per-operation permissions. The OpenHands approach (code-as-tools) maximizes expressiveness. WHS currently leans toward granularity — each hook has a typed protobuf interface — and this survey suggests that's the right call for permission control.
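A hypothetical sketch of what a granular memory-hook interface could look like as protobuf (all names here are invented for illustration, not WHS's actual schema): one RPC per operation, so each can be permissioned and rate-limited independently.

```protobuf
syntax = "proto3";
package whs.memory.v1;

// One RPC per operation: allow Query while denying Delete, or cap
// Store at N calls per turn, without inspecting payloads.
service MemoryHook {
  rpc Store  (StoreRequest)  returns (StoreResponse);
  rpc Query  (QueryRequest)  returns (QueryResponse);
  rpc Delete (DeleteRequest) returns (DeleteResponse);
}

message StoreRequest  { string key = 1; bytes value = 2; }
message StoreResponse { bool ok = 1; }
message QueryRequest  { string query = 1; uint32 limit = 2; }
message QueryResponse { repeated bytes results = 1; }
message DeleteRequest { string key = 1; }
message DeleteResponse { bool ok = 1; }
```

A single `execute_memory_operation(bytes payload)` RPC would collapse all three into one permission decision, which is the code-as-tools tradeoff in miniature.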

Code intelligence is a differentiation opportunity. Nine of ten products can't do resolve_symbol or find_usages. Only Gemini Code Assist ships it, and only because it integrates with the IDE's language server. A WHS hook that provides language-server-style code intelligence (backed by tree-sitter, LSP, or a custom index) would give walrus-powered agents a capability most competitors lack.

Tool-call limits are worth considering. Windsurf's 20-call cap prevents runaway tool use. WHS hooks could implement per-hook rate limits — a memory hook might allow 50 operations per turn, while an inference hook might allow 1. This is more granular than a global tool-call cap and maps naturally to the hook lifecycle.
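A per-hook cap is a small amount of code. A sketch with hypothetical hook names and a simple per-turn counter (fixed window reset at each turn boundary), not an actual WHS component:

```python
class HookRateLimiter:
    """Caps tool calls per hook per agent turn (hypothetical WHS policy)."""

    def __init__(self, limits):
        self.limits = limits   # hook name -> max calls allowed per turn
        self.counts = {}       # hook name -> calls used this turn

    def start_turn(self):
        self.counts = {}       # reset counters at each turn boundary

    def allow(self, hook):
        """Return True and count the call if the hook is under its cap."""
        cap = self.limits.get(hook, 0)   # unknown hooks get a cap of 0
        used = self.counts.get(hook, 0)
        if used >= cap:
            return False
        self.counts[hook] = used + 1
        return True

limiter = HookRateLimiter({"memory": 50, "inference": 1})
limiter.start_turn()
print(limiter.allow("inference"))  # True: first call this turn
print(limiter.allow("inference"))  # False: cap of 1 reached
```

Defaulting unknown hooks to a cap of zero makes the policy deny-by-default, which matches the permission posture the rest of this survey argues for.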

