Dave London

Structured Output for AI Coding Agents: Why I Built Pare

If you've spent any time watching Claude Code, Cursor, or Copilot work through a coding task, you've seen it: the agent runs git log and gets back 200 lines of formatted terminal output. It runs npm outdated and parses an ASCII table. It runs docker ps and tries to extract container IDs from column-aligned text that was designed for a human glancing at a terminal.

Most of the time it works. Sometimes it doesn't — the agent misreads a column boundary, hallucinates a field that wasn't there, or burns through context window space on ANSI color codes and decorative characters that carry zero information. And every time, it's spending your tokens on text it has to re-parse back into the structured data the CLI tool had internally before it formatted it for human eyes.

I started tracking this in my own workflows. In a typical 30-minute coding session, an agent might make 40-60 tool calls. Each one returns raw terminal text that the model has to interpret. The token overhead from progress bars, ANSI escape sequences, column padding, and repeated headers was consistently 3-10x more than the actual data the agent needed. On multi-file refactors with heavy test/build cycles, I was watching context windows fill up with formatting noise while the agent lost track of the actual code changes it was supposed to be reasoning about.

The frustrating part: the data was already structured inside the tool. git knows the commit hash, author, and file list as distinct fields. eslint has a JSON formatter built in. cargo test tracks pass/fail per test case internally. But the default output mode — the one every agent uses — throws all that structure away and paints a picture for a human reading a terminal.

What if the tools just spoke the agent's language?

That was the premise behind Pare: build MCP servers that wrap real CLI tools and return typed, schema-validated JSON. Not approximations — real parsers that handle the full output surface of each tool, including edge cases, errors, and platform differences.

The scope grew fast. What started as a few servers for git, npm, and Docker turned into 25 MCP servers covering 222 tools across the developer CLI landscape:

| Category | Servers | What They Wrap |
|---|---|---|
| Version control | Git, GitHub | 28 git operations + PRs, issues, actions, releases |
| JavaScript/TS | npm, Bun, Deno | Package management, scripts, builds, tests |
| Build & lint | Build, Lint, Search | tsc, vite, webpack, esbuild, ESLint, Prettier, Biome, ripgrep |
| Testing | Test | vitest, jest, mocha, pytest — auto-detected |
| Containers | Docker | Images, containers, compose, networks, volumes |
| Languages | Cargo, Go, Python, Swift, Ruby, .NET | Build, test, lint, format for 6 language ecosystems |
| JVM | JVM | Gradle and Maven — build, test, dependencies |
| Infrastructure | Infra, K8s | Terraform, Vagrant, Ansible, Helm, kubectl |
| Databases | DB | PostgreSQL, MySQL, MongoDB, Redis |
| Other | HTTP, Security, Remote, Process, Make, Nix, Bazel, CMake | curl, trivy, semgrep, gitleaks, SSH, rsync, and more |

Each one uses the Model Context Protocol (MCP) — the standard for AI-tool communication supported by Claude, Cursor, Windsurf, VS Code, Zed, Gemini CLI, OpenAI Codex, and others. Every tool call returns both structured JSON with a Zod-validated schema and human-readable text for chat display.

Here's a before/after with git log --stat. This is what the agent sees today:

```
commit a1b2c3d4e5f67890abcdef1234567890abcdef12
Author: Jane Developer <jane@example.com>
Date:   Mon Feb 10 14:32:01 2026 +0200

    Add user authentication middleware

 src/auth/middleware.ts | 45 +++++++++++++++++++++++++++++++++++++++++++++
 src/routes/api.ts      |  2 +-
 2 files changed, 46 insertions(+), 1 deletion(-)
```

That's ~95 tokens. The agent has to extract the hash, author, date, file list, and diff stats by pattern-matching against whitespace-aligned text. Usually it works. Sometimes it miscounts the + characters or misreads the file path boundaries.

Here's the same commit through Pare:

```json
{
    "commits": [
        {
            "hash": "a1b2c3d4e5f6",
            "hashShort": "a1b2c3d",
            "message": "Add user authentication middleware",
            "author": "Jane Developer",
            "date": "2026-02-10T14:32:01+02:00",
            "files": ["src/auth/middleware.ts", "src/routes/api.ts"],
            "insertions": 46,
            "deletions": 1
        }
    ],
    "total": 1
}
```

~55 tokens. Every field is typed and directly addressable — no regex, no guessing. And the gap widens fast: run git log --stat on 10 commits and you're looking at ~950 tokens of terminal formatting versus ~310 tokens of structured JSON.

The token savings are real

I ran extensive benchmarks comparing Pare's structured outputs against their raw CLI equivalents:

| Tool | Raw CLI Tokens | Pare Tokens | Reduction |
|---|---|---|---|
| git status | 428 | 86 | 80% |
| git log (10 commits) | 1,847 | 312 | 83% |
| docker ps | 892 | 147 | 84% |
| npm outdated | 634 | 97 | 85% |
| eslint (with errors) | 2,156 | 234 | 89% |
| cargo test | 1,423 | 198 | 86% |
| pytest (with failures) | 3,241 | 287 | 91% |
| terraform plan | 4,567 | 342 | 93% |

Pare also has an automatic compact mode: when the structured JSON would exceed the raw CLI token count (possible for very terse commands), it applies a compact projection that strips verbose fields while keeping everything the agent needs to make decisions. The net effect is that Pare's output never costs more tokens than the raw output.

Some things that weren't obvious

From the outside, "wrap a CLI and return JSON" sounds like a weekend project. In practice, a few categories of problems kept coming up.

Every CLI is its own parsing problem. git log output varies by platform — Windows cmd.exe misinterprets angle brackets in format strings. docker ps column widths change based on content. cargo test interleaves compiler output with test results. ansible-playbook has a PLAY RECAP section with a completely different structure from the rest of its output. Each tool needed its own parser, and each parser had its own edge cases to discover.
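To make the parsing problem concrete, here is a minimal TypeScript sketch of a git log --stat parser for the format shown earlier. This is a hypothetical illustration, far simpler than Pare's actual parsers, which also have to handle merge commits, binary files, commit bodies, and platform differences.

```typescript
// Hypothetical sketch of a `git log --stat` parser. Field names mirror the
// JSON example above; the regexes only cover the happy path shown there.
interface Commit {
  hash: string;
  author: string;
  message: string;
  files: string[];
  insertions: number;
  deletions: number;
}

function parseGitLogStat(raw: string): Commit[] {
  const commits: Commit[] = [];
  let current: Commit | null = null;
  for (const line of raw.split(/\r?\n/)) {
    const commit = line.match(/^commit ([0-9a-f]{7,40})/);
    if (commit) {
      current = { hash: commit[1], author: "", message: "", files: [], insertions: 0, deletions: 0 };
      commits.push(current);
      continue;
    }
    if (!current) continue;
    const author = line.match(/^Author:\s+(.+?)\s+</);
    if (author) { current.author = author[1]; continue; }
    // Subject line: first 4-space-indented line after the headers.
    if (/^ {4}\S/.test(line) && !current.message) { current.message = line.trim(); continue; }
    // Per-file stat line: " path | 45 +++---"
    const file = line.match(/^ (\S+)\s+\|\s+\d+/);
    if (file) { current.files.push(file[1]); continue; }
    // Summary line: " 2 files changed, 46 insertions(+), 1 deletion(-)"
    // (a commit with deletions only would need an extra case).
    const summary = line.match(/(\d+) insertions?\(\+\)(?:, (\d+) deletions?\(-\))?/);
    if (summary) {
      current.insertions = parseInt(summary[1], 10);
      current.deletions = summary[2] ? parseInt(summary[2], 10) : 0;
    }
  }
  return commits;
}
```

Even this toy version needs four distinct line shapes; multiply that by every output mode of every wrapped tool and the scope of the real parsers becomes clear.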

Cross-platform differences add up. Pare runs CI on Linux, macOS, and Windows with Node.js 20 and 22. Path separators, line endings, shell quoting, and process group handling all differ in ways that don't surface until CI catches them. One example: async taskkill on Windows was leaving orphan processes after timeouts, which required switching to synchronous execFileSync for the kill logic.

Schema design is an iterative process. The first version of the output schemas included everything each CLI returned. Over time it became clear that agents don't benefit from — and are sometimes confused by — fields like resolvedUrl in npm packages or endLine/endColumn in lint diagnostics. Each round of pruning was guided by watching what agents actually used versus what they read and ignored. It ended up being closer to API design than CLI wrapping.

The testing regime is substantial. 222 tools need more than just unit tests:

  • Parser and formatter tests cover every output format with realistic CLI fixtures — currently over 4,500 tests across 218 test files
  • Fidelity tests run the real CLI tool and the Pare parser against the same inputs, then diff the results. If the parser drops or misrepresents data, the test fails. This catches regressions that unit tests miss because the fixture is stale.
  • Security tests on every package verify that flag injection is blocked on all positional parameters and that Zod input limits prevent DoS via oversized payloads
  • Smoke tests replay recorded MCP sessions — real tool call transcripts captured from actual agent usage — to verify the full request/response cycle hasn't regressed
  • Integration tests spawn real MCP servers via StdioClientTransport and make actual tool calls, validating the entire chain from input schema through CLI execution to output schema

This isn't test theater. The fidelity and smoke layers exist because I kept finding bugs that unit tests missed: parsers that worked on the fixture but broke on real output, schema changes that compiled fine but broke the MCP response format, compact mode projections that accidentally dropped error information.

The architecture that makes 25 servers maintainable

Scaling to 25 servers without the codebase becoming a mess required deliberate architectural investment:

Shared foundations. A common library (@paretools/shared) provides the dual-output system, command execution with execFile (no shell injection surface), input validation, error categorization, and a createServer() factory that eliminates boilerplate. When I add a new server, the entry point is 6 lines of code.
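The factory pattern can be sketched roughly like this. This is a hypothetical simplification, not the real @paretools/shared API, which also wires up the MCP stdio transport, Zod schemas, and the dual JSON/text output; it only shows the registration shape that keeps entry points tiny.

```typescript
// Hypothetical, stripped-down sketch of a createServer()-style factory.
type ToolHandler = (args: Record<string, unknown>) => object;

function createServer(name: string, tools: Record<string, ToolHandler>) {
  return {
    name,
    listTools: (): string[] => Object.keys(tools),
    call: (tool: string, args: Record<string, unknown> = {}): object => {
      const handler = tools[tool];
      if (!handler) throw new Error(`Unknown tool: ${tool}`);
      return handler(args);
    },
  };
}

// With the factory in place, a server entry point reduces to one registration call:
const gitServer = createServer("pare-git", {
  status: () => ({ branch: "main", clean: true }),
});
```

The payoff of centralizing everything else is exactly this: adding a server means writing parsers and handlers, not plumbing.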

Structured error recovery. Every Pare tool classifies failures into categories an agent can match on programmatically — command-not-found, permission-denied, timeout, network-error, authentication-error, conflict, and others. Instead of parsing "Error: EACCES: permission denied" from stderr, the agent gets { "category": "permission-denied", "command": "git", "exitCode": 128 } and can decide what to do next without guessing.
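A sketch of what that categorization might look like. The category names come from the list above; the matching heuristics here are my own illustration, not Pare's actual logic:

```typescript
// Illustrative error categorizer: map stderr patterns to the machine-readable
// categories described above. Pattern list is a hypothetical subset.
type ErrorCategory =
  | "command-not-found"
  | "permission-denied"
  | "timeout"
  | "network-error"
  | "unknown";

interface StructuredError {
  category: ErrorCategory;
  command: string;
  exitCode: number;
}

function categorize(command: string, exitCode: number, stderr: string): StructuredError {
  const patterns: [ErrorCategory, RegExp][] = [
    ["command-not-found", /ENOENT|command not found/i],
    ["permission-denied", /EACCES|permission denied/i],
    ["timeout", /ETIMEDOUT|timed? ?out/i],
    ["network-error", /ECONNREFUSED|ENETUNREACH|could not resolve/i],
  ];
  for (const [category, re] of patterns) {
    if (re.test(stderr)) return { category, command, exitCode };
  }
  return { category: "unknown", command, exitCode };
}
```

The agent then branches on `error.category` instead of re-deriving the failure mode from free-form stderr on every call.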

Centralized input schemas. Common parameters like path, compact, fix, config, and filePatterns are defined once in the shared library and reused across all 222 tools. This ensures consistent behavior and makes it impossible for one server to accidentally define path differently from another.

Automatic compact mode. Every tool measures whether structured output saves tokens compared to raw CLI output. If structured is more expensive (rare, but possible for very terse commands), it automatically switches to a compact projection. The agent can override this with compact: false if it needs full details.
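The fallback logic can be sketched as below. The chars/4 token estimate and the shape of the projections are illustrative assumptions, not Pare's actual heuristics:

```typescript
// Sketch of the automatic compact fallback: emit full structured output
// unless it would cost more tokens than the raw CLI text.
function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token (assumption, not Pare's measure).
  return Math.ceil(text.length / 4);
}

interface ToolOutput {
  full: object;    // complete structured result
  compact: object; // projection with verbose fields stripped
}

function selectOutput(rawCli: string, output: ToolOutput, forceFull = false): object {
  const fullCost = estimateTokens(JSON.stringify(output.full));
  // `forceFull` models the agent overriding with compact: false.
  if (forceFull || fullCost <= estimateTokens(rawCli)) return output.full;
  return output.compact;
}
```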

Security when agents are the users

When an AI agent constructs CLI commands from natural language, the attack surface changes. A prompt injection that tricks the agent into passing --output=/etc/passwd as a "filename" is a real threat.

Every Pare tool defends against this:

  • execFile everywhere: Argument arrays, never shell string concatenation
  • Flag injection detection: assertNoFlagInjection() on every positional string parameter — anything starting with - is rejected
  • Input size limits: Zod .max() constraints on all strings and arrays prevent payload-based DoS
  • Policy gates: Destructive operations like vagrant destroy or terraform apply require explicit opt-in via environment variables
  • Docker volume blocking: Mount validation prevents access to sensitive host paths
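The first two defenses are simple in principle. Here is a minimal sketch; the real assertNoFlagInjection() in @paretools/shared may differ in details, and buildGitLogArgs is a hypothetical helper for illustration:

```typescript
// Reject any positional value that could be smuggled in as a CLI flag.
function assertNoFlagInjection(value: string, paramName: string): void {
  if (value.startsWith("-")) {
    throw new Error(`Invalid ${paramName}: positional values must not start with "-"`);
  }
}

// Hypothetical helper: arguments are built as an array for execFile-style
// invocation, never concatenated into a shell string. The "--" separator
// tells git that everything after it is a path, not an option.
function buildGitLogArgs(path: string): string[] {
  assertNoFlagInjection(path, "path");
  return ["log", "--stat", "--", path];
}
```

With this in place, a prompt-injected "filename" like `--output=/etc/passwd` is rejected before any process is spawned.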

"But loading all those tools costs tokens too"

This is one of the most frequent pieces of pushback I hear when discussing Pare with other devs, and it's fair. Every MCP server registers tool definitions upfront — that's context the model carries for the whole session. But the math works out differently than you'd expect.

First, there is rarely a need to load all 222 tools. Pare is split into 25 servers of roughly 9 tools each, so you install and use only what you need: if you just want git and test, that's 27 tool definitions. You can filter even further with environment variables:

```bash
# Only register status and log in the git server
PARE_GIT_TOOLS=status,log npx @paretools/git
```

Second, the savings compound on the output side. Each tool call returns structured JSON that's typically 30-90% leaner than raw CLI output. In a benchmark across a real coding session, the aggregate reduction was 72% — and that's counting the upfront tool registration cost. After two or three tool calls, you're ahead on net context usage. By the end of a session with 40-60 tool calls, the savings are substantial.

Third, and this is the part people miss: even when the token count is similar, structured output is higher quality context. An agent reading { "success": false, "category": "permission-denied", "exitCode": 128 } doesn't need to pattern-match against stderr text. It reasons about typed fields directly. That means fewer wasted inference cycles, fewer misinterpretations, and less backtracking — which saves tokens downstream in ways that don't show up in a simple input/output comparison.

All coding agents are welcome

Pare works with any MCP-compatible client. Here's the setup for some of the popular ones:

Claude Code (one command per server):

```bash
claude mcp add --transport stdio pare-git -- npx -y @paretools/git
claude mcp add --transport stdio pare-test -- npx -y @paretools/test
```

Claude Desktop / Cursor / Windsurf / Cline / Gemini CLI (JSON config):

```json
{
    "mcpServers": {
        "pare-git": {
            "command": "npx",
            "args": ["-y", "@paretools/git"]
        },
        "pare-test": {
            "command": "npx",
            "args": ["-y", "@paretools/test"]
        }
    }
}
```

VS Code / GitHub Copilot (.vscode/mcp.json):

```json
{
    "servers": {
        "pare-git": {
            "type": "stdio",
            "command": "npx",
            "args": ["-y", "@paretools/git"]
        }
    }
}
```

OpenAI Codex (.codex/config.toml):

```toml
[mcp_servers.pare-git]
command = "npx"
args = ["-y", "@paretools/git"]
```

That's it. The agent immediately gets access to structured tool output. No further configuration, no API keys, no runtime dependencies beyond the CLI tools themselves.

Telling the agent to prefer Pare

Once the servers are configured, add a one-liner to your project's agent instruction file so the agent reaches for Pare tools instead of raw CLI commands:

CLAUDE.md (Claude Code):

```markdown
## MCP Tools

When Pare MCP tools are available (prefixed with mcp__pare-*), prefer them
over running raw CLI commands via Bash. Pare tools return structured JSON
with ~85% fewer tokens than CLI output.
```

AGENTS.md (OpenAI Codex, Gemini CLI):

```markdown
## MCP Servers

This project uses Pare MCP servers for structured, token-efficient dev
tool output. Prefer Pare MCP tools over raw CLI commands for git, testing,
building, linting, npm, docker, python, cargo, and go.
```

.cursor/rules/pare.mdc (Cursor):

```
---
description: Use Pare MCP tools for structured dev tool output
globs: ["**/*"]
alwaysApply: true
---

When Pare MCP tools are available, prefer them over running CLI commands
in the terminal. Pare tools return structured JSON with up to 95% fewer
tokens than raw CLI output.
```

With this in place, the agent will automatically use mcp__pare-git__status instead of running git status through Bash — and get typed JSON back instead of terminal text.

What's next

The MCP ecosystem is young. The patterns established now — how tools structure their output, what schemas look like, how errors are categorized — will shape how AI agents interact with developer infrastructure for years.

I've open-sourced Pare under the MIT license because this should be shared infrastructure, not a proprietary advantage. The codebase is designed to make contributing straightforward: each server is self-contained, follows the same architecture, and has the same test patterns. If there's a CLI tool you wish your agent handled better, the pattern for adding it is well-established.

GitHub: github.com/Dave-London/Pare
npm: All 25 packages at npmjs.com/org/paretools


Built by Dave London.

Top comments (1)

MaxxMini

This resonates with real pain I've experienced. I run automated workflows that chain 50+ CLI calls per session, and the token overhead from ANSI codes and ASCII table formatting is brutal — especially docker ps and git log --stat where the column alignment tricks models into hallucinating non-existent fields.

The MCP approach is smart. One question: how do you handle tools that mix structured data with genuinely useful freeform output? For example, cargo build warnings often contain contextual hints in the prose that a pure JSON extraction might lose. Do Pare servers preserve those as a separate text field, or do you have a heuristic for what's "signal" vs "formatting noise"?

Also curious about the latency overhead — adding a parse + validate layer per call seems worth it for context savings, but have you measured the wall-clock cost on hot paths like test runners that get called dozens of times?