<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kuroko</title>
    <description>The latest articles on DEV Community by kuroko (@kuroko1t).</description>
    <link>https://dev.to/kuroko1t</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F168435%2Fb7b5bb69-c228-4766-8b85-7b9cc4322d73.jpeg</url>
      <title>DEV Community: kuroko</title>
      <link>https://dev.to/kuroko1t</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kuroko1t"/>
    <language>en</language>
    <item>
      <title>I Built a Tool to Stop Losing My Claude Code Conversation History</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Sat, 14 Mar 2026 03:02:40 +0000</pubDate>
      <link>https://dev.to/kuroko1t/i-built-a-tool-to-stop-losing-my-claude-code-conversation-history-5500</link>
      <guid>https://dev.to/kuroko1t/i-built-a-tool-to-stop-losing-my-claude-code-conversation-history-5500</guid>
      <description>&lt;p&gt;A few weeks ago I needed to revisit a debugging session. Claude had walked me through a nasty race condition in my app — it took over an hour, and the fix was subtle. I knew exactly which session it was.&lt;/p&gt;

&lt;p&gt;I went to find the JSONL file. Gone. No warning, no "this file will be deleted in 3 days." Just gone.&lt;/p&gt;

&lt;p&gt;If you've been using Claude Code for more than a couple of months, this has probably happened to you too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, Claude Code Deletes My History?
&lt;/h2&gt;

&lt;p&gt;Yeah. Claude Code stores conversations as JSONL files under &lt;code&gt;~/.claude/projects/&lt;/code&gt;, and old files are &lt;a href="https://github.com/anthropics/claude-code/issues/4172" rel="noopener noreferrer"&gt;automatically deleted over time&lt;/a&gt;. You can change this in settings, but that only solves the auto-deletion problem. &lt;code&gt;/compact&lt;/code&gt; still replaces your context with a lossy summary, and version updates can &lt;a href="https://github.com/anthropics/claude-code/issues/29154" rel="noopener noreferrer"&gt;break session compatibility&lt;/a&gt;. Even with deletion disabled, the JSONL files are scattered across directories with no way to search across sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Tried (and Why It Wasn't Enough)
&lt;/h2&gt;

&lt;p&gt;I tried &lt;a href="https://github.com/raine/claude-history" rel="noopener noreferrer"&gt;claude-history&lt;/a&gt; (Rust TUI) and &lt;a href="https://github.com/jhlee0409/claude-code-history-viewer" rel="noopener noreferrer"&gt;Claude Code History Viewer&lt;/a&gt; (desktop app). Both are great for browsing, but they read JSONL files directly, so once those files get deleted, they can't show you anything either. &lt;a href="https://github.com/thedotmack/claude-mem" rel="noopener noreferrer"&gt;claude-mem&lt;/a&gt; does persist data into its own database, but it's a full memory system with Node.js, an MCP server, and semantic search, which is more than I needed. I just wanted to archive conversations before they disappear.&lt;/p&gt;

&lt;p&gt;What I was missing: a simple, durable archive I could set up once and forget about.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I Built One
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kuroko1t/claude-vault" rel="noopener noreferrer"&gt;claude-vault&lt;/a&gt; is a single Rust binary that imports your Claude Code conversations into SQLite with full-text search. No Node.js, no Python, no MCP server — just download and run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault import
&lt;span class="c"&gt;# Imported 94562 messages (0 skipped, 12847 filtered, 0 errors) from 203 files&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once conversations are in SQLite, they survive file deletion, compaction, updates — whatever happens to the original JSONL files.&lt;/p&gt;

&lt;h3&gt;
  
  
  What About All the Noise?
&lt;/h3&gt;

&lt;p&gt;If you've ever opened a Claude Code JSONL file, you know it's mostly noise — tool results, system tags, file read outputs, progress messages. claude-vault strips all of that during import, keeping only what matters: your questions, Claude's responses, and code-modifying actions.&lt;/p&gt;
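
&lt;p&gt;A minimal Python sketch of this kind of import-time filter. The record shape (a &lt;code&gt;type&lt;/code&gt; field and a &lt;code&gt;text&lt;/code&gt; field) and the set of kept types are hypothetical, chosen only for illustration; they are not Claude Code's actual JSONL schema or claude-vault's real rules.&lt;/p&gt;

```python
import json

# Hypothetical record types worth archiving; everything else counts as noise.
KEEP_TYPES = {"user", "assistant", "edit"}


def filter_records(lines):
    """Parse JSONL lines and split them into kept records and a filtered count."""
    kept, filtered = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") in KEEP_TYPES:
            kept.append(record)
        else:
            filtered += 1
    return kept, filtered


# Made-up sample session, not real Claude Code output.
sample = [
    '{"type": "user", "text": "Why does this test hang?"}',
    '{"type": "tool_result", "text": "...4000 lines of file output..."}',
    '{"type": "assistant", "text": "Two threads take the locks in opposite order."}',
    '{"type": "progress", "text": "Reading file"}',
]
kept, filtered = filter_records(sample)
print(len(kept), "kept,", filtered, "filtered")
```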

&lt;h3&gt;
  
  
  Search That Actually Works
&lt;/h3&gt;

&lt;p&gt;Search uses FTS5 with Porter stemming, so "running" matches "run" and "configurations" matches "configure":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault search &lt;span class="s2"&gt;"race condition fix"&lt;/span&gt;
claude-vault search &lt;span class="s2"&gt;"deploy"&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt; my-app &lt;span class="nt"&gt;--since&lt;/span&gt; 2025-01-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
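
&lt;p&gt;To see what Porter stemming buys you, here is a small Python sketch against an in-memory SQLite database. It assumes your Python's SQLite build includes FTS5; the table and column names are made up for the example and are not claude-vault's actual schema.&lt;/p&gt;

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE messages USING fts5(content, tokenize = 'porter')"
)
conn.executemany(
    "INSERT INTO messages (content) VALUES (?)",
    [
        ("The fix was to run the migration first",),
        ("Unrelated note about CSS",),
    ],
)


def search(term):
    """Return the content of every row whose stemmed tokens match the term."""
    rows = conn.execute(
        "SELECT content FROM messages WHERE messages MATCH ? ORDER BY rowid",
        (term,),
    )
    return [row[0] for row in rows]


# The query word and the stored word reduce to the same stem.
print(search("running"))
```

&lt;p&gt;Without the &lt;code&gt;porter&lt;/code&gt; tokenizer, the same query would only match rows containing the literal token "running".&lt;/p&gt;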



&lt;p&gt;You can also emit JSON with &lt;code&gt;--json&lt;/code&gt;, which is handy for piping results to Claude itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault search &lt;span class="s2"&gt;"previous auth implementation"&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Part That Made It Actually Useful: Hooks
&lt;/h2&gt;

&lt;p&gt;Manually running &lt;code&gt;import&lt;/code&gt; is fine, but I kept forgetting. The real fix was hooking it into Claude Code's lifecycle. Add this to &lt;code&gt;~/.claude/settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PreCompact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-vault import &amp;gt;/dev/null 2&amp;gt;&amp;amp;1"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SessionEnd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-vault import &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 &amp;amp;"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PreCompact&lt;/strong&gt; captures the full conversation before &lt;code&gt;/compact&lt;/code&gt; summarizes it. &lt;strong&gt;SessionEnd&lt;/strong&gt; archives in the background when you exit. Once set up, I never think about it — every session is archived automatically.&lt;/p&gt;
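
&lt;p&gt;If you already have a &lt;code&gt;settings.json&lt;/code&gt; with other keys, you will want to merge these hooks in rather than overwrite the file. A rough Python sketch of that merge, operating on an in-memory dict: &lt;code&gt;add_hook&lt;/code&gt; and &lt;code&gt;someOtherSetting&lt;/code&gt; are hypothetical names, and the output redirections from the config above are left out of the command string for brevity.&lt;/p&gt;

```python
import json


def add_hook(settings, event, command):
    """Add a command hook for an event, keeping whatever is already configured."""
    groups = settings.setdefault("hooks", {}).setdefault(event, [])
    for group in groups:
        for hook in group.get("hooks", []):
            if hook.get("command") == command:
                return settings  # already registered, nothing to do
    groups.append({"hooks": [{"type": "command", "command": command}]})
    return settings


# Start from an existing settings object with an unrelated (placeholder) key.
settings = {"someOtherSetting": True}
for event in ("PreCompact", "SessionEnd"):
    add_hook(settings, event, "claude-vault import")
add_hook(settings, "SessionEnd", "claude-vault import")  # duplicate is ignored

print(json.dumps(settings, indent=2))
```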

&lt;h2&gt;
  
  
  What It Doesn't Do (Honest Assessment)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It's an &lt;strong&gt;archive&lt;/strong&gt;, not a memory system. It won't inject past context into new sessions automatically.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;CLI-only&lt;/strong&gt;. If you want a TUI, &lt;a href="https://github.com/raine/claude-history" rel="noopener noreferrer"&gt;claude-history&lt;/a&gt; is great.&lt;/li&gt;
&lt;li&gt;No semantic search — it's keyword-based FTS5 with stemming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does one thing: makes sure your conversations don't disappear. That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;claude-vault
&lt;span class="c"&gt;# or download a prebuilt binary from GitHub Releases&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seriously, run &lt;code&gt;claude-vault import&lt;/code&gt; now. If you've been using Claude Code for a while, some of your old sessions might already be gone — archive what's left before it's too late.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kuroko1t/claude-vault" rel="noopener noreferrer"&gt;GitHub: kuroko1t/claude-vault&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you lost Claude Code sessions you wish you could get back? What's your approach to preserving conversation history? I'd love to hear what others are doing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>What Happens When Local LLMs Fail at Tool Calling — Testing 7 Models with a Rust Coding Agent</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Sun, 01 Mar 2026 14:28:05 +0000</pubDate>
      <link>https://dev.to/kuroko1t/what-happens-when-local-llms-fail-at-tool-calling-testing-7-models-with-a-rust-coding-agent-cep</link>
      <guid>https://dev.to/kuroko1t/what-happens-when-local-llms-fail-at-tool-calling-testing-7-models-with-a-rust-coding-agent-cep</guid>
      <description>&lt;p&gt;I tested 7 local LLMs on the same simple coding task. 4 succeeded. 3 failed — each in a different way. One model burned 30K tokens retrying the exact same broken call because my system prompt told it to.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Whet&lt;/a&gt;, a coding agent written in Rust. It connects to local LLMs through Ollama and gives them tools — read files, edit files, run shell commands, search code — so the model can actually modify your project instead of just suggesting changes. Think of it as a local, open-source alternative to tools like Claude Code or Cursor, but running entirely on your machine with whatever model you choose.&lt;/p&gt;

&lt;p&gt;The key mechanism is &lt;strong&gt;tool calling&lt;/strong&gt;: instead of the model printing "you should edit line 5," the model returns a structured API call like &lt;code&gt;edit_file(path, old_text, new_text)&lt;/code&gt;, and the agent executes it. When this works, the model can autonomously chain multiple tools to complete a task. When it breaks, things get interesting.&lt;/p&gt;
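
&lt;p&gt;To make the mechanism concrete, here is a simplified Python sketch of the agent side: the model returns a structured call, and the agent looks up the tool and runs it. Whet itself is written in Rust, so this is not its code; the &lt;code&gt;execute&lt;/code&gt; function and the result shape are assumptions for illustration.&lt;/p&gt;

```python
import os
import tempfile


def read_file(path):
    """Return the full text of a file."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def edit_file(path, old_text, new_text):
    """Replace one exact occurrence of old_text; fail loudly if it is absent."""
    content = read_file(path)
    if old_text not in content:
        raise ValueError("old_text not found in " + path)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content.replace(old_text, new_text, 1))
    return "edited " + path


TOOLS = {"read_file": read_file, "edit_file": edit_file}


def execute(call):
    """Run one structured tool call and report the result or the error."""
    tool = TOOLS.get(call["name"])
    if tool is None:
        return {"ok": False, "output": "unknown tool: " + call["name"]}
    try:
        return {"ok": True, "output": tool(**call["arguments"])}
    except Exception as exc:
        return {"ok": False, "output": str(exc)}


# Demo on a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "hello.py")
with open(path, "w", encoding="utf-8") as f:
    f.write("def greet(name):\n    return name\n")

result = execute({
    "name": "edit_file",
    "arguments": {
        "path": path,
        "old_text": "return name",
        "new_text": "return name.title()",
    },
})
print(result)
```

&lt;p&gt;The error branch matters as much as the success branch: whatever &lt;code&gt;execute&lt;/code&gt; returns is what the model sees next, and how it reacts to a failure is where the models below diverge.&lt;/p&gt;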

&lt;p&gt;This article documents the failure patterns I found, which ones were the model's fault vs. my agent's fault, and what I did about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important caveat&lt;/strong&gt;: I built Whet as a personal project, so I'm biased toward finding and fixing issues in my own agent rather than blaming models. The "model vs agent" distinction below is my interpretation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Whet&lt;/a&gt; — a single-binary Rust coding agent with 9 built-in tools (read_file, edit_file, shell, grep, etc.) plus optional web tools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;: "Read hello.py and add a farewell function"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# hello.py (before)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough that any tool-calling model should handle it. The expected tool chain is: &lt;code&gt;read_file&lt;/code&gt; → &lt;code&gt;edit_file&lt;/code&gt;. Two calls, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models&lt;/strong&gt;: 7 models available via Ollama, ranging from 7B to 24B parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode&lt;/strong&gt;: Yolo (auto-approve all tool calls). Max 10 iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to reproduce&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;whet
ollama pull qwen3:8b  &lt;span class="c"&gt;# or any model below&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'def greet(name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("World"))'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; hello.py
whet &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Read hello.py and add a farewell function"&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3:8b &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Tool Calls&lt;/th&gt;
&lt;th&gt;Failure Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;devstral-small-2&lt;/td&gt;
&lt;td&gt;24B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;5,990&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;glm-4.7-flash&lt;/td&gt;
&lt;td&gt;19B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;6,684&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3:8b&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;6,895&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;8,946&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6,013&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Wrong &lt;code&gt;old_text&lt;/code&gt;, gave up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:7b&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3,801&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Read file, asked user instead of editing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,873&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Output JSON as text instead of calling tool&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

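&lt;p&gt;4 passed. 3 failed. The qwen3:14b row shows its run after the prompt fix described under "What I Did About It"; its first run hit the retry loop covered in Pattern 3. Parameter count didn't predict success: qwen3:8b (8B) passed while qwen2.5-coder:14b (14B) failed.&lt;/p&gt;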




&lt;h2&gt;
  
  
  What Success Looks Like
&lt;/h2&gt;

&lt;p&gt;Before the failures, here's a successful run (devstral-small-2, 5,990 tokens):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] read_file {"path": "hello.py"}
    → returned file content (5 lines)

[2] edit_file {"path": "hello.py", "old_text": "if __name__...", "new_text": "def farewell..."}
    → added farewell function ✓

Done. Task complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two tool calls, clean execution. The model read the file, understood the structure, wrote a valid edit, and stopped. This is what all 7 models should have done.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Failure Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Refusing to Act (qwen2.5:7b)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[tool: read_file] {"path":"hello.py"}  ← only tool call

"Should I edit the file?"  ← asked user instead of editing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model read the file successfully, then asked for permission instead of using &lt;code&gt;edit_file&lt;/code&gt;. The system prompt says "ACT, DON'T ASK" — the model ignored it. 1 tool call, 3,801 tokens, task incomplete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Tool Format Confusion (qwen2.5-coder:14b)
&lt;/h3&gt;

&lt;p&gt;The model output what &lt;em&gt;looks like&lt;/em&gt; a tool call, but as plain text instead of using the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# What the model printed (as text, NOT an actual tool call):
{"name": "read_file", "arguments": {"path": "hello.py"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model understood it needed to call &lt;code&gt;read_file&lt;/code&gt;, but output the JSON as text inside a markdown code block instead of using the tool calling API. Zero actual tool calls. 1,873 tokens wasted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Retry Loop (qwen3:14b)
&lt;/h3&gt;

&lt;p&gt;This was the most interesting failure because it was &lt;strong&gt;both the model's and my agent's fault&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Tool Call&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;read_file {"path": "hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Error&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Error&lt;/strong&gt; (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Error&lt;/strong&gt; (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;(max iterations)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gave up&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;30K tokens. 10+ tool calls. The model hit an error on &lt;code&gt;shell&lt;/code&gt;, then repeated the exact same call 5+ times. It never tried a different approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model side&lt;/strong&gt;: qwen3:14b didn't adapt after seeing the error. Other models (qwen3:8b, devstral) changed their approach on failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent side&lt;/strong&gt;: My system prompt said &lt;em&gt;"if shell command fails: read the error output, fix the issue, and retry"&lt;/em&gt; — which the model interpreted literally as "call the same thing again."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I Did About It
&lt;/h2&gt;

&lt;p&gt;Pattern 3 was the most actionable. One line added to the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- NEVER repeat the same failing tool call more than once.
  If it failed, change your approach (different arguments,
  different tool, or ask the user).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;qwen3:14b (before)&lt;/th&gt;
&lt;th&gt;qwen3:14b (after)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task completed&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;~30,000&lt;/td&gt;
&lt;td&gt;8,946&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calls&lt;/td&gt;
&lt;td&gt;10+&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool success rate&lt;/td&gt;
&lt;td&gt;&amp;lt; 20%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One line of prompt turned a 30K-token failure into a 9K-token success.&lt;/p&gt;

&lt;p&gt;For the other two patterns, I added agent-level recovery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern 2 (JSON as text)&lt;/strong&gt;: A fallback parser that scans the model's text output for JSON objects matching the tool call format and executes them. This successfully extracted &lt;code&gt;read_file&lt;/code&gt; calls from qwen2.5-coder:14b's text output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern 1 (refusing to act)&lt;/strong&gt;: A question detector that catches when the model asks instead of acting, and re-prompts it to use tools instead of asking. This fired in 3 out of 5 test runs with qwen2.5:7b.&lt;/li&gt;
&lt;/ul&gt;
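
&lt;p&gt;The fallback parser idea can be sketched in a few lines of Python. This is an illustration of the technique, not Whet's Rust implementation; &lt;code&gt;extract_tool_calls&lt;/code&gt; and the &lt;code&gt;KNOWN_TOOLS&lt;/code&gt; set are hypothetical names.&lt;/p&gt;

```python
import json

# Tool names the agent is willing to execute (taken from the tools named above).
KNOWN_TOOLS = {"read_file", "edit_file", "shell", "grep"}


def extract_tool_calls(text):
    """Scan free text for JSON objects shaped like a tool call."""
    decoder = json.JSONDecoder()
    calls, pos = [], 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here, keep scanning
            continue
        if (
            isinstance(obj, dict)
            and obj.get("name") in KNOWN_TOOLS
            and isinstance(obj.get("arguments"), dict)
        ):
            calls.append(obj)
        pos = end
    return calls


reply = (
    "I will read the file first:\n"
    '{"name": "read_file", "arguments": {"path": "hello.py"}}\n'
    "Then edit it."
)
print(extract_tool_calls(reply))
```

&lt;p&gt;Checking the name against a known set keeps the parser from executing arbitrary JSON that merely happens to appear in the model's prose.&lt;/p&gt;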

&lt;p&gt;Both helped partially, but neither is a complete fix — ultimately the model needs to use the tool calling API correctly.&lt;/p&gt;
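
&lt;p&gt;For reference, a question detector can be as crude as the Python sketch below. The heuristic (no tool call plus a reply ending in a question mark) and the re-prompt wording are assumptions for illustration, not necessarily what Whet does.&lt;/p&gt;

```python
# Hypothetical nudge sent back to the model when it asks instead of acting.
REPROMPT = "Do not ask for permission. Use the available tools to complete the task."


def needs_reprompt(reply_text, tool_calls):
    """True when the model made no tool call and its reply ends in a question."""
    if tool_calls:
        return False
    return reply_text.strip().endswith("?")


def next_message(reply_text, tool_calls):
    """Return a nudge to send back to the model, or None to proceed normally."""
    return REPROMPT if needs_reprompt(reply_text, tool_calls) else None


print(next_message("Should I edit the file?", []))
```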




&lt;h2&gt;
  
  
  What the Data Shows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Model generation matters more than size
&lt;/h3&gt;

&lt;p&gt;All three qwen2.5 models failed. Both qwen3 models passed (qwen3:14b only after the prompt fix). devstral-small-2 and glm-4.7-flash also passed. The qwen3/qwen2.5 boundary is a clearer predictor of tool-calling success than parameter count.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Each failure is different
&lt;/h3&gt;

&lt;p&gt;The failures took distinct forms: refusing to act (qwen2.5:7b), format confusion (qwen2.5-coder:14b), a wrong &lt;code&gt;old_text&lt;/code&gt; (qwen2.5:14b), and a retry loop (qwen3:14b before the prompt fix). There's no single "tool calling doesn't work" failure mode. Each model fails differently, which means each failure needs its own investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Agent bugs hide behind smart models
&lt;/h3&gt;

&lt;p&gt;qwen3:8b and devstral never triggered the retry loop bug because they recover gracefully from errors. If I'd only tested with these models, the prompt bug would still be in my code. The "worst" model (qwen3:14b pre-fix) was the most useful for finding agent bugs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single task&lt;/strong&gt;: These results are from one task. A model that passes "add a function" might fail at "debug a test failure" or "refactor across files." I'm working on a broader benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic&lt;/strong&gt;: LLM outputs vary between runs. qwen2.5:14b might succeed on a retry. I ran each model once for the initial results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama-specific&lt;/strong&gt;: Results may differ with other inference engines (llama.cpp, vLLM). Tool calling implementation varies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author bias&lt;/strong&gt;: I built Whet. I'm inclined to fix my agent rather than blame models. Another developer might classify some "agent bugs" as "model limitations" or vice versa.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test with multiple models, not just the best one.&lt;/strong&gt; Smart models hide agent bugs by working around them. The model that fails the most dramatically teaches you the most about your agent's weaknesses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"Retry on failure" is dangerous prompt guidance.&lt;/strong&gt; Humans understand "retry" as "try differently." LLMs may read it as "call the exact same function again." Be explicit about what NOT to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check the generation, not just the size.&lt;/strong&gt; qwen3:8b (8B) outperformed qwen2.5-coder:14b (14B) at tool calling. Newer model families tend to have better tool-use training regardless of parameter count.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent can compensate — partially.&lt;/strong&gt; JSON fallback parsing and question re-prompting helped, but the biggest win was a one-line prompt fix. Invest in your system prompt before building workarounds.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;The code is &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;open source&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>How Accessibility Tree Formatting Affects Token Cost in Browser MCPs</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Thu, 26 Feb 2026 07:58:44 +0000</pubDate>
      <link>https://dev.to/kuroko1t/how-accessibility-tree-formatting-affects-token-cost-in-browser-mcps-n2a</link>
      <guid>https://dev.to/kuroko1t/how-accessibility-tree-formatting-affects-token-cost-in-browser-mcps-n2a</guid>
      <description>&lt;p&gt;Token cost in browser automation MCPs has become a real topic — articles like &lt;a href="https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0" rel="noopener noreferrer"&gt;"Playwright MCP Burns 114K Tokens Per Test"&lt;/a&gt; have been making the rounds. Tools are approaching this from different angles: Playwright MCP's &lt;code&gt;--output-mode file&lt;/code&gt; option saves snapshots to disk instead of returning them in LLM context, Vercel's &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt; compresses DOM state to a fraction of the original, and some tools add vision-based fallbacks for layout understanding.&lt;/p&gt;

&lt;p&gt;I've been working on &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;WebClaw&lt;/a&gt;, an open-source Chrome extension-based browser MCP. It takes the accessibility tree approach like Playwright MCP, but with a more compact format. I wanted to measure the actual difference — not guess, but measure — so I set up a side-by-side test.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Measured
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Versions tested:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Playwright MCP: &lt;code&gt;@playwright/mcp&lt;/code&gt; v0.0.68 (&lt;code&gt;npx @playwright/mcp@0.0.68 --headless&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;WebClaw: &lt;code&gt;webclaw-mcp&lt;/code&gt; v0.9.0 + Chrome extension v0.9.0&lt;/li&gt;
&lt;li&gt;Measured: February 26, 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I registered both &lt;a href="https://github.com/microsoft/playwright-mcp" rel="noopener noreferrer"&gt;Playwright MCP&lt;/a&gt; and WebClaw as MCP servers in the &lt;strong&gt;same Claude Code session&lt;/strong&gt;, then ran the same steps on each:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to the target URL&lt;/li&gt;
&lt;li&gt;Call the snapshot tool (&lt;code&gt;browser_snapshot&lt;/code&gt; / &lt;code&gt;page_snapshot&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Measure the full response text length in characters&lt;/li&gt;
&lt;li&gt;Estimate tokens as &lt;code&gt;characters / 4&lt;/code&gt; (approximation — actual tokenization varies by model)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Both tools return the complete accessibility tree with no truncation.&lt;/strong&gt; WebClaw's default is unlimited output (no token budget), so this is a pure format efficiency comparison.&lt;/p&gt;

&lt;p&gt;I picked three pages with different content patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wikipedia&lt;/strong&gt; — long article with many reference links and navigation templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt; — repository page with file listing, README, and sidebar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News&lt;/strong&gt; — list-style page with 30 items&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important caveat on fairness:&lt;/strong&gt; Playwright MCP runs a headless Chromium (not logged in). WebClaw runs in the user's Chrome (logged in to GitHub in my case). This means WebClaw sees &lt;em&gt;more&lt;/em&gt; UI on GitHub — authenticated menus, notifications, repo actions — which actually increases its output. The comparison is biased against WebClaw on that page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Format Efficiency
&lt;/h2&gt;

&lt;p&gt;Both tools returning full, untruncated accessibility trees:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Site&lt;/th&gt;
&lt;th&gt;Playwright MCP&lt;/th&gt;
&lt;th&gt;WebClaw&lt;/th&gt;
&lt;th&gt;Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://en.wikipedia.org/wiki/Model_Context_Protocol" rel="noopener noreferrer"&gt;Wikipedia (MCP article)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;16,044 tokens (64,176 chars)&lt;/td&gt;
&lt;td&gt;7,860 tokens (31,439 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-cookbooks" rel="noopener noreferrer"&gt;GitHub (anthropics/claude-cookbooks)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;19,409 tokens (77,637 chars)&lt;/td&gt;
&lt;td&gt;4,304 tokens (17,215 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News (front page)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14,547 tokens (58,189 chars)&lt;/td&gt;
&lt;td&gt;3,052 tokens (12,207 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The range is &lt;strong&gt;51% to 79%&lt;/strong&gt; depending on the page. Let me dig into why.&lt;/p&gt;
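
&lt;p&gt;The "Difference" column is one minus the ratio of character counts. As a quick check, here is a small sketch that recomputes it from the character counts in the table (the dictionary layout and the function name &lt;code&gt;percent_smaller&lt;/code&gt; are only for illustration):&lt;/p&gt;

```python
# Character counts copied from the results table: (Playwright MCP, WebClaw).
RESULTS = {
    "Wikipedia": (64_176, 31_439),
    "GitHub": (77_637, 17_215),
    "Hacker News": (58_189, 12_207),
}


def percent_smaller(baseline_chars: int, compact_chars: int) -> int:
    """Size reduction of the compact output relative to the baseline, in whole percent."""
    return round((1 - compact_chars / baseline_chars) * 100)


for site, (playwright_chars, webclaw_chars) in RESULTS.items():
    print(f"{site}: {percent_smaller(playwright_chars, webclaw_chars)}% smaller")
# Wikipedia: 51% smaller
# GitHub: 78% smaller
# Hacker News: 79% smaller
```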

&lt;h2&gt;
  
  
  What Creates the Difference
&lt;/h2&gt;

&lt;p&gt;Comparing the actual output for the same Wikipedia page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright MCP&lt;/strong&gt; (&lt;code&gt;browser_snapshot&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;generic [active] [ref=e1]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;link "Jump to content" [ref=e2] [cursor=pointer]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;/url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#bodyContent"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;banner [ref=e4]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;navigation "Site" [ref=e6]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;generic "Main menu" [ref=e7]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;button "Main menu" [ref=e8] [cursor=pointer]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;WebClaw&lt;/strong&gt; (&lt;code&gt;page_snapshot&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[page "Model Context Protocol - Wikipedia"]
 [banner]
  [nav "Site"]
  [@e2 link]
 [search]
  [@e3 searchbox "Search Wikipedia"]
  [@e4 button "Search"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference comes down to design choices — each reasonable on its own, but they compound:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Design choice&lt;/th&gt;
&lt;th&gt;Playwright MCP&lt;/th&gt;
&lt;th&gt;WebClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Which elements get refs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All elements (&lt;code&gt;generic&lt;/code&gt;, &lt;code&gt;rowgroup&lt;/code&gt;, &lt;code&gt;cell&lt;/code&gt;...)&lt;/td&gt;
&lt;td&gt;Only interactive elements (buttons, links, inputs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Attribute output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[active]&lt;/code&gt;, &lt;code&gt;[cursor=pointer]&lt;/code&gt;, &lt;code&gt;/url:&lt;/code&gt; on every applicable element&lt;/td&gt;
&lt;td&gt;Minimal — only what's needed for action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Table representation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full nested structure per cell&lt;/td&gt;
&lt;td&gt;Compressed single-line rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ref count (GitHub)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;789 refs&lt;/td&gt;
&lt;td&gt;245 refs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Playwright MCP's approach — labeling every element with a ref — gives maximum flexibility for targeting any element. WebClaw trades that completeness for compactness by only labeling things the AI can actually interact with.&lt;/p&gt;
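
&lt;p&gt;To make the "refs only on interactive elements" idea concrete, here is a toy serializer. It is a sketch under stated assumptions, not WebClaw's actual code: the &lt;code&gt;Node&lt;/code&gt; class, the &lt;code&gt;INTERACTIVE_ROLES&lt;/code&gt; set, and the exact line layout are all hypothetical and only loosely modeled on the sample output above.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Roles treated as interactive in this sketch; the real set is an assumption.
INTERACTIVE_ROLES = {"button", "link", "searchbox", "textbox", "checkbox"}


@dataclass
class Node:
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def serialize(root: Node) -> str:
    """Render a tree as indented bracket lines, numbering only interactive nodes."""
    lines: list[str] = []
    next_ref = 1

    def visit(node: Node, depth: int) -> None:
        nonlocal next_ref
        parts = []
        if node.role in INTERACTIVE_ROLES:
            # Structural nodes (banner, nav, ...) never consume a ref.
            parts.append(f"@e{next_ref}")
            next_ref += 1
        parts.append(node.role)
        if node.name:
            parts.append(f'"{node.name}"')
        lines.append(" " * depth + "[" + " ".join(parts) + "]")
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(lines)


tree = Node("page", "Example", [
    Node("banner", children=[Node("nav", "Site", [Node("link", "Home")])]),
    Node("search", children=[
        Node("searchbox", "Search"),
        Node("button", "Search"),
    ]),
])
print(serialize(tree))
```

&lt;p&gt;In this example seven nodes produce three refs. A format that labels every node would emit seven, plus the extra characters for each label.&lt;/p&gt;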

&lt;h3&gt;
  
  
  Why the range is so wide (51% to 79%)
&lt;/h3&gt;

&lt;p&gt;The format savings vary by page structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub (78%)&lt;/strong&gt;: The file listing table accounts for the biggest difference. Playwright MCP assigns refs to every &lt;code&gt;row&lt;/code&gt;, &lt;code&gt;cell&lt;/code&gt;, and &lt;code&gt;generic&lt;/code&gt; wrapper (789 total). WebClaw labels only interactive elements such as links and buttons (245 total). Additionally, WebClaw follows the W3C Accessible Name and Description Computation specification, using &lt;code&gt;textContent&lt;/code&gt; before the &lt;code&gt;title&lt;/code&gt; attribute for buttons and links. On GitHub, many buttons have short display text ("X") but verbose title attributes ("Close dialog"), so the spec-compliant order avoids the bloat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News (79%)&lt;/strong&gt;: Simple, repetitive table structure. WebClaw's table compression (&lt;code&gt;[row] 1. | link | link&lt;/code&gt;) eliminates most of the verbosity. Playwright MCP outputs nested &lt;code&gt;rowgroup &amp;gt; row &amp;gt; cell &amp;gt; generic &amp;gt; link&lt;/code&gt; for each of the 30 items.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wikipedia (51%)&lt;/strong&gt;: The article body has many inline links that both tools represent similarly. The savings come primarily from the navigation templates (Generative AI, Artificial Intelligence navboxes) where structural compression helps, but the text content itself is irreducible.&lt;/li&gt;
&lt;/ul&gt;
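
&lt;p&gt;Two of the mechanisms in this list are easy to sketch: preferring visible text over &lt;code&gt;title&lt;/code&gt; when naming a button or link, and collapsing a table row into one line. Both functions below are hypothetical illustrations. &lt;code&gt;accessible_name&lt;/code&gt; covers only a heavily simplified subset of the W3C name computation, and &lt;code&gt;compress_row&lt;/code&gt; merely mimics the &lt;code&gt;[row] 1. | link | link&lt;/code&gt; shape shown above.&lt;/p&gt;

```python
from typing import Optional


def accessible_name(aria_label: Optional[str], text_content: str, title: Optional[str]) -> str:
    """Simplified name lookup for buttons and links: visible text wins over title."""
    for candidate in (aria_label, text_content, title):
        if candidate and candidate.strip():
            return candidate.strip()
    return ""


def compress_row(cells: list[str]) -> str:
    """Join the non-empty cell texts of one table row into a single line."""
    kept = [cell.strip() for cell in cells if cell.strip()]
    return "[row] " + " | ".join(kept)


print(accessible_name(None, "X", "Close dialog"))
# X
print(compress_row(["1.", "", "Show HN: Example", "42 comments"]))
# [row] 1. | Show HN: Example | 42 comments
```

&lt;p&gt;The &lt;code&gt;title&lt;/code&gt; value is used only when there is no label and no visible text, which is why the short "X" is chosen over "Close dialog".&lt;/p&gt;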

&lt;h2&gt;
  
  
  Controlling Output Size
&lt;/h2&gt;

&lt;p&gt;WebClaw defaults to unlimited output — no truncation. But when you need to manage token costs, two options are available:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactive elements only&lt;/strong&gt; — &lt;code&gt;interactiveOnly&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"interactiveOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strips all text content. A 2,000-line page becomes ~200 lines of buttons, links, and inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Landmark region focus&lt;/strong&gt; — &lt;code&gt;focusRegion&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"focusRegion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"main"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns only the requested landmark region: &lt;code&gt;main&lt;/code&gt;, &lt;code&gt;nav&lt;/code&gt;, &lt;code&gt;header&lt;/code&gt;, or &lt;code&gt;footer&lt;/code&gt;. Useful when you already know which region holds the content you need.&lt;/p&gt;

&lt;p&gt;Playwright MCP doesn't have equivalents — it always returns the full tree.&lt;/p&gt;
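
&lt;p&gt;To show the effect of the two options, here is a toy post-filter over snapshot text. This is only an illustration and an assumption about behavior, not how WebClaw implements &lt;code&gt;interactiveOnly&lt;/code&gt; or &lt;code&gt;focusRegion&lt;/code&gt;. The sample snapshot and both function names are made up.&lt;/p&gt;

```python
SNAPSHOT = """\
[page "Example"]
 [banner]
  [nav "Site"]
   [@e1 link "Home"]
 [main]
  [heading "Title"]
  [@e2 button "Save"]
 [footer]
  [@e3 link "About"]"""


def indent_of(line: str) -> int:
    """Number of leading spaces, used as the nesting depth."""
    return len(line) - len(line.lstrip(" "))


def interactive_only(snapshot: str) -> str:
    """Keep only the lines that carry an @e ref, dropping structure and text."""
    return "\n".join(
        line.strip() for line in snapshot.splitlines() if line.lstrip().startswith("[@e")
    )


def focus_region(snapshot: str, region: str) -> str:
    """Keep the first line that opens `region` plus everything nested under it."""
    kept: list[str] = []
    region_indent = None
    for line in snapshot.splitlines():
        if region_indent is None:
            if line.strip().startswith(f"[{region}"):
                region_indent = indent_of(line)
                kept.append(line)
        elif indent_of(line) > region_indent:
            kept.append(line)
        else:
            break  # Back at the region's own depth: the region has ended.
    return "\n".join(kept)


print(interactive_only(SNAPSHOT))
print(focus_region(SNAPSHOT, "main"))
```

&lt;p&gt;On this nine-line sample, the first filter keeps three lines and the second keeps the three lines of the &lt;code&gt;main&lt;/code&gt; region.&lt;/p&gt;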

&lt;h2&gt;
  
  
  The Broader Landscape
&lt;/h2&gt;

&lt;p&gt;This comparison only covers in-context accessibility trees. The ecosystem is moving fast, and there are other approaches worth knowing about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Playwright MCP file output&lt;/strong&gt; (&lt;code&gt;--output-mode file&lt;/code&gt;): Saves snapshots to files on disk instead of returning them in the LLM context. Clients that support file references can read these without consuming context tokens. This is a fundamentally different approach to the same problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOM compression tools&lt;/strong&gt; (Vercel's &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt;, &lt;a href="https://github.com/browser-use/browser-use" rel="noopener noreferrer"&gt;browser-use&lt;/a&gt;, etc.): These extract and compress DOM/accessibility tree state, filtering down thousands of nodes to the most relevant elements. Some also support optional vision models for layout understanding as a secondary input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WebClaw's approach is narrower: same accessibility tree method as Playwright MCP's &lt;code&gt;browser_snapshot&lt;/code&gt;, but with a more compact format. The numbers above show what format choices alone can do — but they don't capture the full picture of what's possible with file-based or DOM compression approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Format Efficiency Still Matters
&lt;/h2&gt;

&lt;p&gt;Even with file-based alternatives emerging, in-context snapshots remain the default for most MCP setups. A browser automation task rarely reads a page just once — navigate, read, click, read again, fill a form, check the result — that's easily 5-10 snapshot calls. A 51-79% format reduction compounds across those calls.&lt;/p&gt;
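
&lt;p&gt;A back-of-the-envelope calculation shows how this adds up. The sketch below reuses the GitHub row from the results table and assumes, purely for illustration, that every snapshot in a task is about that size and that all of them stay in context:&lt;/p&gt;

```python
# Per-snapshot token estimates from the results table (GitHub row).
PLAYWRIGHT_TOKENS = 19_409
WEBCLAW_TOKENS = 4_304


def task_total(tokens_per_snapshot: int, snapshot_calls: int) -> int:
    """Total snapshot tokens if every call returns a page of similar size."""
    return tokens_per_snapshot * snapshot_calls


for calls in (5, 10):
    saved = task_total(PLAYWRIGHT_TOKENS, calls) - task_total(WEBCLAW_TOKENS, calls)
    print(f"{calls} calls: about {saved:,} fewer tokens")
# 5 calls: about 75,525 fewer tokens
# 10 calls: about 151,050 fewer tokens
```

&lt;p&gt;Real tasks mix pages of different sizes, so treat these totals as an order-of-magnitude guide rather than a measurement.&lt;/p&gt;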

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;I'm biased — I built WebClaw — so let me be upfront about the tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Playwright MCP is the better choice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CI/headless environments (WebClaw needs a visible Chrome window)&lt;/li&gt;
&lt;li&gt;Cross-browser testing (Chromium, Firefox, WebKit)&lt;/li&gt;
&lt;li&gt;Zero-install setup (&lt;code&gt;npx&lt;/code&gt; one-liner vs. Chrome extension)&lt;/li&gt;
&lt;li&gt;Complete output — every element gets a ref, nothing is omitted&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--output-mode file&lt;/code&gt; for file-based snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where WebClaw fits better:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-sensitive workflows where format compactness matters&lt;/li&gt;
&lt;li&gt;Logged-in sessions (runs in your existing Chrome — no re-authentication)&lt;/li&gt;
&lt;li&gt;Bot-resistant sites (Chrome extension, no WebDriver flags)&lt;/li&gt;
&lt;li&gt;When you need output size controls (&lt;code&gt;interactiveOnly&lt;/code&gt;, &lt;code&gt;focusRegion&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;WebClaw limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Chrome + extension install&lt;/li&gt;
&lt;li&gt;No headless mode&lt;/li&gt;
&lt;li&gt;No test code generation&lt;/li&gt;
&lt;li&gt;Uses your real session (the AI operates with your credentials)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add webclaw &lt;span class="nt"&gt;--&lt;/span&gt; npx &lt;span class="nt"&gt;-y&lt;/span&gt; webclaw-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude Desktop&lt;/strong&gt; — add to &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"webclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"webclaw-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install the &lt;a href="https://github.com/kuroko1t/webclaw/releases/latest" rel="noopener noreferrer"&gt;Chrome extension&lt;/a&gt;: extract the zip, go to &lt;code&gt;chrome://extensions/&lt;/code&gt;, enable Developer mode, and load the &lt;code&gt;dist/&lt;/code&gt; folder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The takeaway isn't "use WebClaw instead of Playwright MCP" — it's that &lt;strong&gt;accessibility tree format choices matter more than you'd expect&lt;/strong&gt;. Assigning refs to every element vs. only interactive ones, including &lt;code&gt;[cursor=pointer]&lt;/code&gt; hints vs. omitting them, following the W3C accessible name spec vs. using title attributes — these small decisions compound into a 51-79% difference on real pages.&lt;/p&gt;

&lt;p&gt;The browser MCP space is evolving quickly. File-based snapshots, DOM compression tools, and hybrid approaches are all worth watching. If you're hitting token limits with your current setup, the data here might help you understand why — and what to try next.&lt;/p&gt;

&lt;p&gt;If you want to reproduce these measurements or try WebClaw, the &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;repo is open&lt;/a&gt;. Issues and feedback welcome — this is a solo project and I'm still figuring out the right tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;github.com/kuroko1t/webclaw&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;npm&lt;/strong&gt;: &lt;code&gt;npx -y webclaw-mcp&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;WebClaw is MIT-licensed open source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>webdev</category>
      <category>playwright</category>
    </item>
  </channel>
</rss>
