Fernando Rodriguez

Originally published at frr.dev

Agentic Experience: 1,324 calls to my CLI, 15.9% error rate

My CLI's most frequent user isn't me

lql is a Rust CLI for managing Linear issues. I wrote it because none of the existing alternatives worked for my actual use case: an AI agent that manages issues autonomously.

Why I had to write my own CLI

Linear's MCP was the first attempt. The idea is elegant: an MCP server that exposes Linear's API directly to the agent. In practice, it was slow, unstable, and the agent had to build GraphQL queries from scratch on every call. Each call was an opportunity to invent a field that doesn't exist. I uninstalled it after two weeks.

The community CLI (linear, by schpet) was the second attempt. Designed for humans: interactive menus, arrow key selection, confirmation prompts. An agent can't navigate interactive menus. Next.

Linear's "agent." In March 2026, Linear announced their own AI agent. It sounds perfect until you read the fine print: it only works inside Linear's web interface. You can't call it from the terminal, it has no API, and it doesn't integrate with anything external. It's a chatbot glued to their own UI. If your workflow is "the agent that codes also manages issues," Linear's agent solves nothing.

Data-driven design: the first analysis

Before writing lql, I parsed 165 Claude Code sessions looking for every error when interacting with Linear. The result: over 500 errors, 370 retries, and a conservative estimate of 700,000 tokens per month wasted. --sort forgotten 40 times. --state "Todo" instead of --state unstarted, 12 times. --no-interactive missing and the CLI hanging waiting for keyboard input, 64 times. (Full details are in the original post.)

With that data I designed lql's interface. I didn't guess what an agent needed — I measured it. That's where the foundational decisions came from: compact output in TOON (~25 tokens per issue instead of verbose JSON), aliases for flags that LLMs confuse (--status → --state), value normalization (Todo → unstarted, urgent → 1), and error messages that suggest the correct command instead of just saying "unknown flag."
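To make the normalization idea concrete, here is a minimal sketch; the mappings below are illustrative, not lql's actual tables:

// Hypothetical sketch of value normalization: map the words an LLM
// tends to write to the values Linear's API expects. The mappings
// are illustrative; lql's real tables may differ.
fn normalize_state(input: &str) -> String {
    match input.to_ascii_lowercase().as_str() {
        "todo" => "unstarted".to_string(),
        "in progress" | "doing" => "started".to_string(),
        "done" => "completed".to_string(),
        _ => input.to_string(), // unknown values pass through unchanged
    }
}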

Without knowing it, I was applying Postel's Law. But I discovered that later.

The second analysis: a month with lql in production

lql has been in production for a month. Claude Code calls it 30-50 times per day — creating issues, updating states, linking dependencies, querying details. I never call it. I'm not interested. I don't want to manage issues manually; that's what I have an agent for. If I ever have to run lql myself, something has gone very wrong.

lql's only user is an LLM. That changes everything about the design.

So I repeated the exercise: I parsed Claude Code's session logs looking for errors when calling lql.

| Metric | Value |
| --- | --- |
| Sessions analyzed | 200 |
| lql calls | 1,324 |
| Errors (is_error: true) | 210 |
| Error rate | 15.9% |

The 15.9% doesn't include calls canceled due to parallelism (when one tool call fails and Claude cancels the others in that batch). Only actual CLI errors.

Error classification

Not all errors are equal. Some reveal missing conventions; others reveal operations that should exist.

| Error | Frequency | Real example |
| --- | --- | --- |
| Label not found | 20 | --label tokamak (doesn't exist in that team) |
| --title as flag in create | 8 | lql create --title "Epic: ..." --team PROD |
| show/get instead of view | 6 | lql show PROD-911 |
| relate args in wrong order | 12 | lql relate PROD-834 PROD-833 blocked-by |
| update --team (move issue) | 15+ | lql update PRIV-32 --team PROD |
| relates instead of related | 2 | lql relate PROD-912 relates PROD-910 |
| --body in comment | 1 | lql comment PROD-926 --body "text" |
| --comments in view | 1 | lql view PROD-824 --comments |

The rest were Linear API errors, 1Password authentication issues, or shell errors (broken quoting in long heredocs).

What the errors reveal

LLMs don't read --help. They guess by semantic intuition.

When a developer doesn't know how a command works, they run lql view --help. When Claude doesn't know, it guesses. And it guesses right 84% of the time — but the remaining 16% reveals its biases.

lql show is more intuitive than lql view. Popular tools use show or get: git show, kubectl get, docker inspect. Claude doesn't consult lql's documentation to choose the verb. It uses the one that seems most natural given the thousands of CLIs it has seen in its training data.

The solution isn't better documentation. It's accepting the synonym:

// In the clap subcommand enum: view stays canonical; show and get
// become silent aliases.
#[command(alias = "show", alias = "get")]
View(ViewOpts),

One line. Six errors eliminated.

Agents prefer named flags over positional arguments

In lql create, the title is a positional argument:

lql create "My title" --team PROD

Claude, on 8 occasions, wrote it like this:

lql create --title "My title" --team PROD

--title didn't exist as a flag. For a human this is obvious — you read the --help and see that <TITLE> is positional. For an LLM, named flags are safer because they don't depend on position.

The fix: accept both.

pub struct CreateOpts {
    /// Title as a positional argument (the documented form)
    pub title: Option<String>,

    /// The same title as a hidden --title flag, for agents that
    /// prefer named flags over positionals
    #[arg(long = "title", hide = true)]
    pub title_flag: Option<String>,
    // ...
}

The --title flag is hidden in --help (humans don't need it), but it works.
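The post doesn't show how the two fields get reconciled; a plausible sketch, assuming lql merges them before validation:

// Assumed reconciliation step (not shown above): prefer the
// positional title, fall back to the hidden flag.
impl CreateOpts {
    pub fn effective_title(&self) -> Option<&str> {
        self.title.as_deref().or(self.title_flag.as_deref())
    }
}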

If you can detect the error, fix it instead of rejecting it

The most revealing case. lql relate expects three positional arguments in strict order:

lql relate <FROM> <RELATION_TYPE> <TO>

Claude wrote this 12 times:

lql relate PROD-834 PROD-833 blocked-by

The natural order for an LLM is FROM TO TYPE — "relate this to this, in this way." The CLI's order is FROM TYPE TO — "from here, relation type, to there."

POSIX philosophy says: reject incorrect input with a descriptive error. AX philosophy says: if you can detect that the second argument is an issue ID and the third is a relation type, reorder them automatically.

// Detect and repair the FROM TO TYPE ordering before the arguments
// reach clap. args[0] is the binary name, args[1] the subcommand.
pub fn normalize_args(args: &[String]) -> Option<Vec<String>> {
    if args.len() < 5 { return None; }
    if args[1] == "relate"
        && looks_like_issue_id(&args[2])
        && looks_like_issue_id(&args[3])   // an issue ID where TYPE should be
        && !looks_like_issue_id(&args[4])  // a relation type where TO should be
    {
        let mut fixed = args.to_vec();
        fixed.swap(3, 4);
        eprintln!(
            "ℹ Reordered: relate {} {} {} → relate {} {} {}",
            args[2], args[3], args[4], fixed[2], fixed[3], fixed[4]
        );
        return Some(fixed);
    }
    None
}

The detection is deterministic: an issue ID has the format TEAM-123 (uppercase, dash, number). A relation type doesn't. There's no possible ambiguity.
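For reference, a minimal version of that check, assuming exactly the TEAM-123 format described above (lql's real helper may be stricter):

// Sketch of the deterministic ID check: uppercase team prefix,
// a dash, then digits (e.g. PROD-834).
fn looks_like_issue_id(s: &str) -> bool {
    match s.split_once('-') {
        Some((team, num)) => {
            !team.is_empty()
                && team.chars().all(|c| c.is_ascii_uppercase())
                && !num.is_empty()
                && num.chars().all(|c| c.is_ascii_digit())
        }
        None => false,
    }
}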

A note is emitted to stderr (ℹ Reordered: ...) so there's a record of the correction. If the heuristic ever fails, the user can trace what happened.

If an operation is attempted repeatedly, it should exist

The agent tried lql update PRIV-32 --team PROD more than 15 times across multiple sessions. Moving an issue from one team to another is a legitimate Linear operation that lql simply didn't implement.

It wasn't an interface error. It was a missing feature. The data made it visible.

Adding --team to update required 3 lines in the clap parser and an additional meta.find_team() call in the update logic. Linear's API already supported teamId in the issueUpdate mutation.
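The post doesn't include the diff; a hedged sketch of what the clap side plausibly looks like (the struct and field names are assumptions):

// Assumed shape of the clap change; the real UpdateOpts and the
// meta.find_team() lookup live elsewhere in lql.
#[derive(clap::Args)]
pub struct UpdateOpts {
    /// Destination team key (e.g. PROD), resolved to a teamId for
    /// Linear's issueUpdate mutation
    #[arg(long)]
    pub team: Option<String>,
    // ... existing update flags ...
}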

What tolerance doesn't fix

We need to be honest about the limits. Of the 210 errors:

  • 20 were non-existent labels. Claude invented labels like tokamak or improvement that don't exist in that Linear team. This doesn't get fixed with aliases — it requires the agent to query available labels before creating. lql already returns fuzzy suggestions ("Closest: ..."), but Claude doesn't always retry.

  • 18 were API errors (wrong team labels, invalid GraphQL queries in raw). These are agent errors, not CLI errors.

  • 7 were authentication errors (1Password down or expired session). Infrastructure, not interface.

Interface tolerance covers maybe 60% of the errors. The rest requires the agent to be more disciplined or the tool to validate more things before sending to the API.

Postel's Law applied to CLIs

Jon Postel wrote in 1980: "Be conservative in what you send, be liberal in what you accept" (RFC 761). It's TCP's robustness principle. Every internet protocol that works applies it.

Nobody applies it to CLIs. POSIX orthodoxy is the opposite: reject any input that doesn't exactly match the specification, return a descriptive error, and let the user fix it. When your user is a human who reads the error and adjusts, it works. When your user is an LLM that retries with a random variation, it's a waste of time and tokens.

Agentic Experience is Postel's Law applied to CLI arguments. It's not a new idea. It's a 1980 principle that was never applied to this context.

Five concrete rules emerge from the data:

  1. Accept natural synonyms. If the verb exists in popular CLIs (show, get, display), accept it as an alias. It costs nothing and eliminates vocabulary errors.

  2. Accept named flags in addition to positionals. LLMs prefer --title "X" to putting "X" in the correct position. Hide them in --help if you don't want to confuse humans.

  3. Reorder rather than reject. If arguments are of distinguishable types (issue ID vs. enum string), wrong order can be detected and corrected automatically.

  4. Normalize close variants. relates → related, blockedby → blocked-by. Edit distance is 1. The cost of accepting it is zero. The cost of rejecting it is an error and a retry. (A sketch follows this list.)

  5. If it's attempted >3 times, it probably should exist. Session logs are a goldmine of data about missing features. An agent doesn't insist on an absurd operation 15 times. If it insists, the operation makes sense and the tool doesn't support it.
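A minimal sketch of rule 4, with an illustrative relation list and a textbook Levenshtein distance (lql's actual matching may differ):

// Accept a near-miss token when it's within edit distance 1 of a
// known relation type. RELATIONS is illustrative, not lql's list.
fn closest_relation(input: &str) -> Option<&'static str> {
    const RELATIONS: [&str; 3] = ["related", "blocks", "blocked-by"];
    RELATIONS
        .into_iter()
        .find(|&candidate| edit_distance(input, candidate) <= 1)
}

// Standard single-row Levenshtein distance.
fn edit_distance(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

With this, relates matches related and blockedby matches blocked-by, both at distance 1.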

The meta angle

I used Claude to parse Claude's logs, classify Claude's errors when using my tool, and then implement the fixes. The tool adapts to its most frequent user with data from that same user.

All the code is public. The commit with the corrections is 34f1f08. The data can be reproduced by parsing the JSONL files in ~/.claude/projects/.

How to do it yourself

  1. Parse the logs. Claude Code's JSONL files are in ~/.claude/projects/<project>/. Each tool_result with is_error: true is a data point. The format is the same for any tool, not just CLIs. (A sketch follows this list.)

  2. Classify before fixing. Not all errors are equal. Separate interface errors (the CLI rejects valid input) from logic errors (the agent asks for something absurd). Only the first type gets fixed with tolerance.

  3. Measure afterwards. The error rate before changes was 15.9%. Next time I analyze, I'll know if it went down. Without the first measurement, there's no baseline.
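A schema-light sketch of step 1 in Rust; it greps each JSONL line for the two markers mentioned above rather than parsing the full log schema, which isn't documented here:

// Rough pass over Claude Code session logs: count tool_result lines
// and how many carry is_error: true. The marker strings are
// assumptions based on the article, not a documented schema.
use std::fs;
use std::io::{self, BufRead, BufReader};

fn count_tool_errors(dir: &str) -> io::Result<(usize, usize)> {
    let (mut results, mut errors) = (0usize, 0usize);
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().map_or(false, |ext| ext == "jsonl") {
            for line in BufReader::new(fs::File::open(&path)?).lines() {
                let line = line?;
                if line.contains("tool_result") {
                    results += 1;
                    if line.contains("\"is_error\":true") || line.contains("\"is_error\": true") {
                        errors += 1;
                    }
                }
            }
        }
    }
    Ok((results, errors))
}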

lql installs with brew install frr149/tools/lql. The repo is at github.com/frr149/lql.

This article was originally written in Spanish and translated with the help of AI.
