Chris Yao
I Built a CLI That X-Rays Your AI Coding Sessions — No LLM, <5ms (Open Source)

I score every prompt I send to AI coding tools. My average across 3,140 prompts over ten weeks: 38 out of 100.

Not because I'm bad at prompting. Because at 2am debugging an auth bug, I type "fix the auth bug" and hit enter. Same intent as a well-structured prompt, completely different quality.

So I built ctxray — a CLI that analyzes how you actually use AI coding tools. Not what the AI outputs. What you type into it. Rule-based, local-only, under 5ms per prompt.

Before / After

$ ctxray check "fix the auth bug"

  DRAFT · 29

  Clarity     ███████░░░░░░░░░░░░░  9/25
  Context     ░░░░░░░░░░░░░░░░░░░░  0/25
  Position    ████████████████████ 20/20

  Auto-rewrite (+24 pts)
  ✓ Added debug prompt structure

  Rewritten:
    fix the auth bug
    Error: <paste the error message or stack trace>
    File: <which file and function>
    Expected: <what should happen vs what actually happens>

It detected "fix the auth bug" as a debug task and added the slots that debug prompts need. Implement prompts get I/O specs + edge cases. Refactor gets scope + constraints. Five task types, each with different structural scaffolding.
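The mechanics can be sketched with plain keyword rules. This is an illustrative reconstruction, not ctxray's actual internals: the pattern tables, task names, and slot templates below are assumptions based on the behavior described above.

```python
import re

# Hypothetical task-type detector: keyword patterns map a prompt to a
# task category, and each category has structural "slots" to append.
TASK_PATTERNS = {
    "debug": re.compile(r"\b(fix|debug|error|crash|broken|bug)\b", re.I),
    "implement": re.compile(r"\b(implement|add|create|build|write)\b", re.I),
    "refactor": re.compile(r"\b(refactor|clean ?up|restructure|rename)\b", re.I),
}

TASK_SLOTS = {
    "debug": [
        "Error: <paste the error message or stack trace>",
        "File: <which file and function>",
        "Expected: <what should happen vs what actually happens>",
    ],
    "implement": [
        "Input/Output: <expected I/O spec>",
        "Edge cases: <what should happen at the boundaries>",
    ],
    "refactor": [
        "Scope: <which files/functions are in scope>",
        "Constraints: <what must not change>",
    ],
}

def detect_task(prompt: str) -> str:
    # First matching category wins; fall back to a generic bucket.
    for task, pattern in TASK_PATTERNS.items():
        if pattern.search(prompt):
            return task
    return "general"

def scaffold(prompt: str) -> str:
    # Append the slot template for the detected task type.
    return "\n".join([prompt, *TASK_SLOTS.get(detect_task(prompt), [])])
```

Because the rules are ordered and deterministic, the same prompt always lands in the same category, which is what makes the rewrite reproducible.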

The same prompt with actual context scores 58:

$ ctxray check "Fix the NPE in auth.service.ts:47 when session expires,
  expected AuthException not HTTP 200"

  GOOD · 58

  Clarity     █████████████████░░░ 22/25
  Context     ████████████░░░░░░░░ 16/25
  Position    ████████████████████ 20/20

Same intent. Twice the score. The difference is file path, line number, error message — context that the model needs but I keep forgetting to include.

What it actually does

ctxray scans session files from 9 AI coding tools on your machine. Claude Code, Cursor, Aider, Gemini CLI, Cline, OpenClaw, Codex CLI, plus ChatGPT and Claude.ai web exports.

pip install ctxray

ctxray scan                    # auto-discover sessions
ctxray check "your prompt"     # score + lint + rewrite
ctxray insights                # personal patterns vs research benchmarks
ctxray sessions                # session quality + frustration signals
ctxray agent                   # agent workflow efficiency
ctxray privacy --deep          # find leaked API keys in sessions

ctxray insights is the one that surprised me most. It told me 32% of my prompts were near-duplicates — same structure, different variable names. I'm asking the same thing across sessions without remembering I figured it out last week.
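"Same structure, different variable names" suggests normalizing identifiers away before comparing. Here is a hedged sketch of how that kind of near-duplicate detection can work; the normalization rules, shingle size, and 0.7 threshold are illustrative assumptions, not ctxray's actual algorithm.

```python
import re
from itertools import combinations

def normalize(prompt: str) -> list[str]:
    # Collapse the parts that vary between near-duplicates:
    # quoted literals, dotted paths/filenames, snake_case identifiers.
    text = prompt.lower()
    text = re.sub(r"[`'\"][^`'\"]*[`'\"]", "<lit>", text)
    text = re.sub(r"\b\w+\.\w+(?:\.\w+)*\b", "<path>", text)
    text = re.sub(r"\b\w+_\w+\b", "<id>", text)
    return re.findall(r"\w+|<\w+>", text)

def shingles(tokens: list[str], n: int = 3) -> set[tuple[str, ...]]:
    # Overlapping n-grams of the normalized token stream.
    return {tuple(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(prompts: list[str], threshold: float = 0.7) -> list[tuple[int, int]]:
    # Pairwise compare shingle sets; report index pairs above threshold.
    shingled = [shingles(normalize(p)) for p in prompts]
    return [(i, j) for i, j in combinations(range(len(prompts)), 2)
            if jaccard(shingled[i], shingled[j]) >= threshold]
```

After normalization, "refactor get_user in auth.py" and "refactor get_order in orders.py" become identical token streams, which is exactly the pattern the insight above describes.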

ctxray sessions scores entire sessions and detects frustration signals — error loops where the same fix gets tried 3+ times, repetitive prompts that signal the model isn't understanding, and sessions where >60% of turns are filler.
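The simplest of those signals, the error loop, reduces to counting normalized repeats within a session. A minimal sketch, assuming whitespace/case normalization and a threshold of three (both illustrative, not ctxray's real heuristics):

```python
from collections import Counter

def error_loops(turns: list[str], min_repeats: int = 3) -> list[str]:
    # Flag any prompt that recurs min_repeats+ times in one session,
    # after collapsing case and whitespace so trivial edits still match.
    counts = Counter(" ".join(t.lower().split()) for t in turns)
    return [prompt for prompt, n in counts.items() if n >= min_repeats]
```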

The compression engine

Your prompts probably contain more filler than you think:

$ ctxray compress "I was wondering if you could please help me refactor
  the authentication middleware to use JWT tokens instead of session
  cookies. Basically what I need is for the current implementation in
  src/auth/middleware.ts to be updated."

  Tokens: 50 → 33 (34% saved)
  Research: Moderate compression improves LLM output (Zhang+ 2505.00019)

Four layers: character normalization, phrase simplification, filler deletion, structure cleanup. All regex. Works for English and Chinese.
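The four-layer pipeline can be sketched as ordered lists of regex substitutions. These specific patterns are demonstration assumptions, not ctxray's shipped rule set:

```python
import re

# Each layer is a list of (pattern, replacement) pairs applied in order.
LAYERS = [
    # 1. Character normalization: collapse runs of spaces/tabs.
    [(r"[ \t]+", " ")],
    # 2. Phrase simplification: shrink long polite openers.
    [(r"\bI was wondering if you could\b", "please"),
     (r"\bwhat I need is for\b", "")],
    # 3. Filler deletion.
    [(r"\b(please|basically|actually|just|really)\b\s*", "")],
    # 4. Structure cleanup: fix spacing around punctuation, trim ends.
    [(r"\s+([.,])", r"\1"), (r"^\s+|\s+$", "")],
]

def compress(text: str) -> str:
    for layer in LAYERS:
        for pattern, repl in layer:
            text = re.sub(pattern, repl, text, flags=re.I)
    return text
```

Ordering matters: phrase simplification can introduce filler words ("please") that the next layer then deletes, so each layer only has to solve one problem.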

Why rule-based in 2026?

Everyone's using LLMs to analyze prompts. I went the other way. Three reasons:

Speed. Under 5ms per prompt. Runs in a pre-commit hook and nobody notices. An LLM call takes 2-5 seconds.

Determinism. Same input, same output, every time. I track my scores weekly — if the scoring function shifts with model version, the trend is meaningless. I use ctxray lint --score-threshold 50 in CI. Random failures from a creative LLM would not be fun.

Privacy. My prompts contain file paths, function names, error messages with stack traces, sometimes credentials I forgot to redact. That's a map of my codebase. Sending it to another LLM for "improvement" defeats the purpose.

The tradeoff is real. Structural signals miss semantic intent. An LLM would understand "this is where I changed my whole approach." My heuristics just see a long turn with high vocabulary shift. For daily feedback loops, structural analysis catches 80-90% of what matters.
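"High vocabulary shift" is itself a cheap structural signal. One way to compute it, as a hedged sketch (the exact metric ctxray uses is not specified above): score each turn by how little its vocabulary overlaps with everything said so far.

```python
def vocabulary_shift(turns: list[str]) -> list[float]:
    # 1.0 = entirely new vocabulary vs. the session so far; 0.0 = all seen.
    seen: set[str] = set()
    shifts = []
    for turn in turns:
        words = set(turn.lower().split())
        overlap = len(words & seen) / len(words) if words else 0.0
        shifts.append(1.0 - overlap)
        seen |= words
    return shifts
```

A turn like "switch to a totally different approach" spikes this metric, which is the structural shadow of the semantic event the heuristic can't directly see.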

The research behind the scoring

The scoring engine uses 30+ features calibrated against 10 NLP papers. Not as decoration — each paper maps to a specific dimension:

  • Position bias is architectural (Stanford 2307.03172, confirmed by Chowdhury 2603.10123): models weight beginnings and ends of prompts more than the middle. Front-load your instructions.
  • Moderate repetition helps (Google 2512.14982): repeating key requirements at the end improves recall up to 76%. But excessive repetition hurts.
  • Specificity > length (Zi+ 2508.03678): file paths, line numbers, error messages improve output more than verbose explanations.

Works in CI

The part that makes this sticky: ctxray lint runs as a CI quality gate.

# GitHub Action
- uses: ctxray/ctxray-action@v1
  with:
    score-threshold: 50

There's a .ctxray.toml config for team rules, a --format github flag that posts score breakdowns as PR comments, and a pre-commit hook. Think ESLint for AI prompts — configurable, deterministic, fast.

Why now

Every prompt analysis tool is getting acquired. Promptfoo joined OpenAI (March 2026). Humanloop was acqui-hired by Anthropic. PromptPerfect got absorbed into Jina. The tools that measured model behavior now belong to the labs that make models.

ctxray stays independent and model-agnostic. It's the only tool that sees your Claude Code sessions and your Cursor sessions and your ChatGPT history together, locally, without sending anything anywhere.

What I haven't figured out

The scoring handles maybe 30% of what makes a good prompt. The other 70% is stuff only you know — the error message on your screen, the file you just edited, the approach you already tried. No tool can add that for you.

I also don't think the scoring is "right" yet. A 3-word prompt from someone deep in a debugging session can be more effective than a 200-word structured request from someone who doesn't understand the codebase. Context that lives in your head doesn't show up in a score.

Try it

pip install ctxray
ctxray demo              # try with built-in sample data
ctxray scan              # discover your sessions
ctxray check "your prompt here"

1,941 tests, strict mypy, MIT licensed. Everything local, no account, no telemetry.

Star on GitHub

What do your prompts look like when you actually measure them? I'm genuinely curious whether people who use CLAUDE.md files or cursor rules have noticeably different patterns.

Top comments

Chris Yao
Author here. Some details that didn't fit above.

The adapter architecture is probably the most satisfying part. Every AI tool stores sessions differently: Claude Code writes JSONL with tool_use blocks, Cursor stores in SQLite blobs, Aider uses markdown chat history, ChatGPT exports as nested JSON trees. Each adapter implements parse_session() and parse_conversation() for full turn reconstruction. Adding a new adapter is about 60 lines plus parsing logic.
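To make that concrete, here is a hedged sketch of the adapter shape. The method names parse_session() and parse_conversation() come from the comment above; the Turn/Session types and the JSONL example are illustrative assumptions, not ctxray's real classes.

```python
import json
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Turn:
    role: str
    text: str

@dataclass
class Session:
    tool: str
    turns: list[Turn] = field(default_factory=list)

class Adapter(Protocol):
    # Each tool-specific adapter implements these two methods.
    def parse_session(self, raw: str) -> Session: ...
    def parse_conversation(self, raw: str) -> list[Turn]: ...

class JsonlAdapter:
    """Example adapter for a JSONL transcript: one JSON object per line."""
    tool = "example-jsonl"

    def parse_conversation(self, raw: str) -> list[Turn]:
        turns = []
        for line in raw.splitlines():
            if not line.strip():
                continue
            obj = json.loads(line)
            turns.append(Turn(role=obj["role"], text=obj["text"]))
        return turns

    def parse_session(self, raw: str) -> Session:
        return Session(tool=self.tool, turns=self.parse_conversation(raw))
```

The payoff of this shape is that everything downstream (scoring, insights, privacy scans) only ever sees Session and Turn, never the tool-specific storage format.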

There's also an MCP server if you want scoring inside Claude Code without leaving your session:

{
  "mcpServers": {
    "ctxray": {
      "type": "stdio",
      "command": "ctxray",
      "args": ["mcp-serve"]
    }
  }
}

7 tools: score_prompt, compress_prompt, compare_prompts, build_prompt_from_parts, check_privacy, search_prompts, scan_sessions.

And ctxray install-hook adds a post-session hook to Claude Code that prints your score after each session. It's the simplest integration and the one that actually changed my habits — my debug prompt average went from 31 to 48 after ten weeks just from seeing the number.
