<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chris Yao</title>
    <description>The latest articles on DEV Community by Chris Yao (@chrishohoho).</description>
    <link>https://dev.to/chrishohoho</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842007%2F7f466cad-bc1a-4fcd-96bb-2fa749a9f8ff.png</url>
      <title>DEV Community: Chris Yao</title>
      <link>https://dev.to/chrishohoho</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chrishohoho"/>
    <language>en</language>
    <item>
      <title>I Built a CLI That X-Rays Your AI Coding Sessions — No LLM, &lt;5ms (Open Source)</title>
      <dc:creator>Chris Yao</dc:creator>
      <pubDate>Wed, 08 Apr 2026 14:01:00 +0000</pubDate>
      <link>https://dev.to/chrishohoho/i-built-a-cli-that-x-rays-your-ai-coding-sessions-no-llm-5ms-open-source-4l01</link>
      <guid>https://dev.to/chrishohoho/i-built-a-cli-that-x-rays-your-ai-coding-sessions-no-llm-5ms-open-source-4l01</guid>
      <description>&lt;p&gt;I score every prompt I send to AI coding tools. My average across 3,140 prompts over ten weeks: 38 out of 100.&lt;/p&gt;

&lt;p&gt;Not because I'm bad at prompting. Because at 2am debugging an auth bug, I type "fix the auth bug" and hit enter. Same intent as a well-structured prompt, completely different quality.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/ctxray/ctxray" rel="noopener noreferrer"&gt;ctxray&lt;/a&gt; — a CLI that analyzes how you actually use AI coding tools. Not what the AI outputs. What you type into it. Rule-based, local-only, under 5ms per prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before / After
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ctxray check &lt;span class="s2"&gt;"fix the auth bug"&lt;/span&gt;

  DRAFT · 29

  Clarity     ███████░░░░░░░░░░░░░  9/25
  Context     ░░░░░░░░░░░░░░░░░░░░  0/25
  Position    ████████████████████ 20/20

  Auto-rewrite &lt;span class="o"&gt;(&lt;/span&gt;+24 pts&lt;span class="o"&gt;)&lt;/span&gt;
  ✓ Added debug prompt structure

  Rewritten:
    fix the auth bug
    Error: &amp;lt;&lt;span class="nb"&gt;paste &lt;/span&gt;the error message or stack trace&amp;gt;
    File: &amp;lt;which file and &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    Expected: &amp;lt;what should happen vs what actually happens&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It detected "fix the auth bug" as a debug task and added the slots that debug prompts need. Implement prompts get I/O specs + edge cases. Refactor gets scope + constraints. Five task types, each with different structural scaffolding.&lt;/p&gt;
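&lt;p&gt;For a feel of what rule-based detection like this can look like, here is a minimal Python sketch. The keyword patterns and slot templates are illustrative, not ctxray's actual rule set:&lt;/p&gt;

```python
import re

# Hypothetical sketch of rule-based task-type detection: keyword patterns
# map a prompt to a task type, and each type carries the context slots a
# good prompt of that type should fill.
TASK_PATTERNS = {
    "debug":     re.compile(r"\b(fix|debug|error|crash|broken|bug)\b", re.I),
    "implement": re.compile(r"\b(implement|add|create|build|write)\b", re.I),
    "refactor":  re.compile(r"\b(refactor|clean up|restructure|rename)\b", re.I),
}

TASK_SLOTS = {
    "debug":     ["Error: <paste the error message or stack trace>",
                  "File: <which file and function>",
                  "Expected: <what should happen vs what actually happens>"],
    "implement": ["Input/Output: <expected signature and examples>",
                  "Edge cases: <what must not break>"],
    "refactor":  ["Scope: <which files are in and out of bounds>",
                  "Constraints: <e.g. do not change public APIs>"],
}

def detect_task(prompt: str) -> str:
    """Return the first task type whose pattern matches, else 'general'."""
    for task, pattern in TASK_PATTERNS.items():
        if pattern.search(prompt):
            return task
    return "general"

def scaffold(prompt: str) -> str:
    """Append the unfilled slots for the detected task type."""
    slots = TASK_SLOTS.get(detect_task(prompt), [])
    return "\n".join([prompt, *slots])
```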

&lt;p&gt;The same prompt with actual context scores 58:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ctxray check &lt;span class="s2"&gt;"Fix the NPE in auth.service.ts:47 when session expires,
  expected AuthException not HTTP 200"&lt;/span&gt;

  GOOD · 58

  Clarity     █████████████████░░░ 22/25
  Context     ████████████░░░░░░░░ 16/25
  Position    ████████████████████ 20/20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same intent. Twice the score. The difference is file path, line number, error message — context that the model needs but I keep forgetting to include.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it actually does
&lt;/h2&gt;

&lt;p&gt;ctxray scans session files from 9 AI coding tools on your machine: Claude Code, Cursor, Aider, Gemini CLI, Cline, OpenClaw, and Codex CLI, plus ChatGPT and Claude.ai web exports.&lt;br&gt;

&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ctxray

ctxray scan                    &lt;span class="c"&gt;# auto-discover sessions&lt;/span&gt;
ctxray check &lt;span class="s2"&gt;"your prompt"&lt;/span&gt;     &lt;span class="c"&gt;# score + lint + rewrite&lt;/span&gt;
ctxray insights                &lt;span class="c"&gt;# personal patterns vs research benchmarks&lt;/span&gt;
ctxray sessions                &lt;span class="c"&gt;# session quality + frustration signals&lt;/span&gt;
ctxray agent                   &lt;span class="c"&gt;# agent workflow efficiency&lt;/span&gt;
ctxray privacy &lt;span class="nt"&gt;--deep&lt;/span&gt;          &lt;span class="c"&gt;# find leaked API keys in sessions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ctxray insights&lt;/code&gt; is the one that surprised me most. It told me 32% of my prompts were near-duplicates — same structure, different variable names. I'm asking the same thing across sessions without remembering I figured it out last week.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ctxray sessions&lt;/code&gt; scores entire sessions and detects frustration signals — error loops where the same fix gets tried 3+ times, repetitive prompts that signal the model isn't understanding, and sessions where &amp;gt;60% of turns are filler.&lt;/p&gt;

&lt;h2&gt;
  
  
  The compression engine
&lt;/h2&gt;

&lt;p&gt;Your prompts probably contain more filler than you think:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ctxray compress &lt;span class="s2"&gt;"I was wondering if you could please help me refactor
  the authentication middleware to use JWT tokens instead of session
  cookies. Basically what I need is for the current implementation in
  src/auth/middleware.ts to be updated."&lt;/span&gt;

  Tokens: 50 → 33 &lt;span class="o"&gt;(&lt;/span&gt;34% saved&lt;span class="o"&gt;)&lt;/span&gt;
  Research: Moderate compression improves LLM output &lt;span class="o"&gt;(&lt;/span&gt;Zhang+ 2505.00019&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four layers: character normalization, phrase simplification, filler deletion, structure cleanup. All regex. Works for English and Chinese.&lt;/p&gt;
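&lt;p&gt;The layered pipeline can be sketched as an ordered list of regex substitutions. The rules below are simplified examples for English only, not ctxray's actual rule tables:&lt;/p&gt;

```python
import re

# Illustrative sketch of a four-layer regex compressor:
# (1) character normalization, (2) phrase simplification,
# (3) filler deletion, (4) structure cleanup.
LAYERS = [
    # 1. character normalization: straighten curly quotes, cap ellipses
    (re.compile(r"[\u201c\u201d]"), '"'),
    (re.compile(r"\.{3,}"), "..."),
    # 2. phrase simplification
    (re.compile(r"\bin order to\b", re.I), "to"),
    (re.compile(r"\bis able to\b", re.I), "can"),
    # 3. filler deletion
    (re.compile(r"\b(I was wondering if you could|please help me|basically)\s*", re.I), ""),
    # 4. structure cleanup: collapse runs of whitespace
    (re.compile(r"\s{2,}"), " "),
]

def compress(prompt: str) -> str:
    """Apply each layer in order; later layers see earlier layers' output."""
    for pattern, repl in LAYERS:
        prompt = pattern.sub(repl, prompt)
    return prompt.strip()
```

&lt;p&gt;Ordering matters: filler deletion runs before whitespace cleanup so any gaps it leaves get collapsed.&lt;/p&gt;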

&lt;h2&gt;
  
  
  Why rule-based in 2026?
&lt;/h2&gt;

&lt;p&gt;Everyone's using LLMs to analyze prompts. I went the other way. Three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed.&lt;/strong&gt; Under 5ms per prompt. Runs in a pre-commit hook and nobody notices. An LLM call takes 2-5 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Determinism.&lt;/strong&gt; Same input, same output, every time. I track my scores weekly — if the scoring function shifts with model version, the trend is meaningless. I use &lt;code&gt;ctxray lint --score-threshold 50&lt;/code&gt; in CI. Random failures from a creative LLM would not be fun.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy.&lt;/strong&gt; My prompts contain file paths, function names, error messages with stack traces, sometimes credentials I forgot to redact. That's a map of my codebase. Sending it to another LLM for "improvement" defeats the purpose.&lt;/p&gt;

&lt;p&gt;The tradeoff is real. Structural signals miss semantic intent. An LLM would understand "this is where I changed my whole approach." My heuristics just see a long turn with high vocabulary shift. For daily feedback loops, structural analysis catches 80-90% of what matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The research behind the scoring
&lt;/h2&gt;

&lt;p&gt;The scoring engine uses 30+ features calibrated against 10 NLP papers. Not as decoration — each paper maps to a specific dimension:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Position bias is architectural&lt;/strong&gt; (Stanford 2307.03172, confirmed by Chowdhury 2603.10123): models weight beginnings and ends of prompts more than the middle. Front-load your instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moderate repetition helps&lt;/strong&gt; (Google 2512.14982): repeating key requirements at the end improves recall by up to 76%. But excessive repetition hurts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specificity &amp;gt; length&lt;/strong&gt; (Zi+ 2508.03678): file paths, line numbers, error messages improve output more than verbose explanations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Works in CI
&lt;/h2&gt;

&lt;p&gt;The part that makes this sticky: &lt;code&gt;ctxray lint&lt;/code&gt; runs as a CI quality gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Action&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ctxray/ctxray-action@v1&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;score-threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a &lt;code&gt;.ctxray.toml&lt;/code&gt; config for team rules, a &lt;code&gt;--format github&lt;/code&gt; flag that posts score breakdowns as PR comments, and a pre-commit hook. Think ESLint for AI prompts — configurable, deterministic, fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why now
&lt;/h2&gt;

&lt;p&gt;Every prompt analysis tool is getting acquired. Promptfoo joined OpenAI (March 2026). Humanloop was acqui-hired by Anthropic. PromptPerfect got absorbed into Jina. The tools that measured model behavior now belong to the labs that make models.&lt;/p&gt;

&lt;p&gt;ctxray stays independent and model-agnostic. It's the only tool that sees your Claude Code sessions and your Cursor sessions and your ChatGPT history together, locally, without sending anything anywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I haven't figured out
&lt;/h2&gt;

&lt;p&gt;The scoring handles maybe 30% of what makes a good prompt. The other 70% is stuff only you know — the error message on your screen, the file you just edited, the approach you already tried. No tool can add that for you.&lt;/p&gt;

&lt;p&gt;I also don't think the scoring is "right" yet. A 3-word prompt from someone deep in a debugging session can be more effective than a 200-word structured request from someone who doesn't understand the codebase. Context that lives in your head doesn't show up in a score.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ctxray
ctxray demo              &lt;span class="c"&gt;# try with built-in sample data&lt;/span&gt;
ctxray scan              &lt;span class="c"&gt;# discover your sessions&lt;/span&gt;
ctxray check &lt;span class="s2"&gt;"your prompt here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1,941 tests, strict mypy, MIT licensed. Everything local, no account, no telemetry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ctxray/ctxray" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Star on GitHub&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;What do your prompts look like when you actually measure them? I'm genuinely curious whether people who use CLAUDE.md files or cursor rules have noticeably different patterns.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>opensource</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>This CLI Rewrites Your AI Prompts — No LLM, No API, 50ms (Open Source)</title>
      <dc:creator>Chris Yao</dc:creator>
      <pubDate>Wed, 01 Apr 2026 11:29:23 +0000</pubDate>
      <link>https://dev.to/chrishohoho/this-cli-rewrites-your-ai-prompts-no-llm-no-api-50ms-open-source-30p6</link>
      <guid>https://dev.to/chrishohoho/this-cli-rewrites-your-ai-prompts-no-llm-no-api-50ms-open-source-30p6</guid>
      <description>&lt;p&gt;I score every prompt I send to Claude Code. My average is 38 out of 100.&lt;/p&gt;

&lt;p&gt;Not because I'm bad at prompting — because I'm human. At 2am debugging an auth bug, I don't carefully structure my request. I type "fix the auth bug" and hit enter.&lt;/p&gt;

&lt;p&gt;I built a scoring engine. Then a compression engine. They told me &lt;em&gt;what was wrong&lt;/em&gt; but didn't fix anything. So I built the part I actually wanted: a rewrite engine that takes a lazy prompt and makes it better. No LLM. No API call. Just rules extracted from NLP papers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before / After
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;reprompt rewrite &lt;span class="s2"&gt;"I was wondering if you could maybe help me fix the authentication bug that seems to be kind of broken"&lt;/span&gt;

  34 → 56 &lt;span class="o"&gt;(&lt;/span&gt;+22&lt;span class="o"&gt;)&lt;/span&gt;

  ╭─ Rewritten ────────────────────────────────────────╮
  │ Help me fix the authentication bug that seems to   │
  │ be broken.                                         │
  ╰────────────────────────────────────────────────────╯

  Changes
  ✓ Removed filler &lt;span class="o"&gt;(&lt;/span&gt;18% shorter&lt;span class="o"&gt;)&lt;/span&gt;
  ✓ Removed hedging language

  You should also
  → Add actual code snippets or error messages &lt;span class="k"&gt;for &lt;/span&gt;context
  → Reference specific files or functions by name
  → Add constraints &lt;span class="o"&gt;(&lt;/span&gt;e.g., &lt;span class="s2"&gt;"Do not modify existing tests"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "You should also" section is honestly the most useful part. The machine handles what it can — filler removal, restructuring — and tells you what only a human can add.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Rewriter Does
&lt;/h2&gt;

&lt;p&gt;Four transformations, applied in order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Strip filler.&lt;/strong&gt; "Please help me with", "basically what I need is", "I would like you to" — these add tokens without adding information. 40+ English rules, 40+ Chinese rules (reuses the compression engine).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Front-load instructions.&lt;/strong&gt; If your key ask is buried in the middle, the rewriter moves it to the front. This matters: Stanford's "Lost in the Middle" paper found models recall instructions at the start 2-3x better than instructions in the middle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Echo key requirements.&lt;/strong&gt; For long prompts (40+ words) with low repetition, the main instruction gets repeated at the end. Google Research (arXiv:2512.14982) found moderate repetition improves recall by up to 76%. This only fires when the prompt is long enough that the model might lose the thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Remove hedging.&lt;/strong&gt; "Maybe", "perhaps", "I was wondering", "kind of", "sort of". These weaken the instruction signal without adding information. 12 regex patterns.&lt;/p&gt;
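&lt;p&gt;Two of these transformations are easy to sketch in a few lines of Python. The patterns and the 40-word threshold below are illustrative examples, not reprompt's actual rules:&lt;/p&gt;

```python
import re

# Sketch of hedge removal and requirement echoing as described above.
HEDGES = re.compile(
    r"\b(maybe|perhaps|kind of|sort of|I was wondering if|possibly)\b\s*",
    re.I,
)

def remove_hedging(prompt: str) -> str:
    """Strip hedge phrases, then collapse any whitespace they leave behind."""
    return re.sub(r"\s{2,}", " ", HEDGES.sub("", prompt)).strip()

def echo_requirement(prompt: str, min_words: int = 40) -> str:
    """For long prompts, repeat the first sentence (taken as the key ask)
    at the end so the model sees it in a high-recall position."""
    if len(prompt.split()) < min_words:
        return prompt
    first_sentence = prompt.split(".")[0].strip()
    return f"{prompt}\n\nKey requirement: {first_sentence}."
```

&lt;p&gt;The echo rule is gated on length, matching the idea that short prompts never need it.&lt;/p&gt;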

&lt;h2&gt;
  
  
  Why Not Use an LLM to Rewrite?
&lt;/h2&gt;

&lt;p&gt;I thought about it. Three reasons I went rule-based:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's fast.&lt;/strong&gt; Under 50ms. You can run it in a pre-commit hook or CI pipeline and nobody notices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's deterministic.&lt;/strong&gt; Same input, same output. I actually use &lt;code&gt;reprompt lint&lt;/code&gt; in CI with a score threshold — if I used an LLM rewriter, my CI would randomly fail on Tuesdays because GPT was feeling creative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's private.&lt;/strong&gt; My prompts contain production error messages, internal file paths, sometimes API keys I forgot to redact. That's exactly the kind of thing I don't want to send to another LLM for "improvement."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broader Toolkit
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;rewrite&lt;/code&gt; is one command. Here's what else is in the box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;reprompt check &lt;span class="s2"&gt;"your prompt"&lt;/span&gt;          &lt;span class="c"&gt;# full diagnostic: score + lint + rewrite&lt;/span&gt;
reprompt build &lt;span class="s2"&gt;"task"&lt;/span&gt; &lt;span class="nt"&gt;--file&lt;/span&gt; auth.ts  &lt;span class="c"&gt;# assemble a prompt from components&lt;/span&gt;
reprompt compress &lt;span class="s2"&gt;"your prompt"&lt;/span&gt;       &lt;span class="c"&gt;# save 40-60% tokens&lt;/span&gt;
reprompt scan                         &lt;span class="c"&gt;# discover sessions from 9 AI tools&lt;/span&gt;
reprompt privacy &lt;span class="nt"&gt;--deep&lt;/span&gt;               &lt;span class="c"&gt;# find leaked API keys in sessions&lt;/span&gt;
reprompt lint &lt;span class="nt"&gt;--score-threshold&lt;/span&gt; 50    &lt;span class="c"&gt;# CI quality gate (GitHub Action included)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auto-discovers sessions from Claude Code, Cursor, Aider, Codex CLI, Gemini CLI, Cline, and OpenClaw. ChatGPT and Claude.ai via export. A browser extension shows a live score badge as you type; click it for inline suggestions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I still haven't figured out
&lt;/h2&gt;

&lt;p&gt;The rewriter handles maybe 30% of what makes a good prompt. The other 70% is stuff only you know — the error message you're staring at, the file you just edited, the thing you tried that didn't work. No tool can add that for you.&lt;/p&gt;

&lt;p&gt;I also don't think the scoring is "right" yet. A 3-word prompt from someone deep in a debugging session can be more effective than a beautifully structured 200-word request from someone who doesn't understand the codebase. Context that lives in your head doesn't show up in a score.&lt;/p&gt;

&lt;p&gt;The weights are calibrated against 4 NLP papers, but papers study prompts in isolation. Real prompting happens in the middle of a conversation, at 2am, when you've already explained the problem three times. I'm not sure how to score that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;reprompt-cli
reprompt check &lt;span class="s2"&gt;"your worst prompt"&lt;/span&gt;
reprompt rewrite &lt;span class="s2"&gt;"your worst prompt"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MIT, local-only, 1,800+ tests. &lt;a href="https://github.com/reprompt-dev/reprompt" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://pypi.org/project/reprompt-cli/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Honestly curious: do you think about your prompts before sending them, or is it more stream-of-consciousness? I've been tracking mine for months and I still default to lazy prompts when I'm tired. Starting to think that's just how humans work.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Audited 1,000+ Prompts I Sent to AI Coding Tools. Here's What I Found.</title>
      <dc:creator>Chris Yao</dc:creator>
      <pubDate>Sun, 29 Mar 2026 05:13:43 +0000</pubDate>
      <link>https://dev.to/chrishohoho/i-audited-1000-prompts-i-sent-to-ai-coding-tools-heres-what-i-found-4bp9</link>
      <guid>https://dev.to/chrishohoho/i-audited-1000-prompts-i-sent-to-ai-coding-tools-heres-what-i-found-4bp9</guid>
      <description>&lt;p&gt;I've been using AI coding tools daily for months. Claude Code, Cursor, Codex CLI, sometimes Aider. By rough estimate, I've sent over a thousand prompts to various AI services.&lt;/p&gt;

&lt;p&gt;Recently I built a tool to answer a simple question: &lt;strong&gt;what exactly did I send?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer was uncomfortable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding 1: Leaked Credentials
&lt;/h3&gt;

&lt;p&gt;Running &lt;code&gt;reprompt privacy --deep&lt;/code&gt; on my prompt history surfaced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 API keys&lt;/strong&gt; (OpenAI, GitHub, one internal service)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 JWT token&lt;/strong&gt; (from a debugging session)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12 email addresses&lt;/strong&gt; (from log outputs I pasted)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;47 internal file paths&lt;/strong&gt; (including home directory paths)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these were pasted intentionally. They were in error messages, stack traces, and log outputs that I copy-pasted when asking the AI for help debugging. The typical pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Fix this error: AuthenticationError: Invalid API key 'sk-proj-...' for model gpt-4"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That prompt just sent my API key to whatever service processes it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding 2: Agent Error Loops
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;reprompt agent&lt;/code&gt; analyzes Claude Code and Codex CLI sessions for workflow efficiency. It fingerprints each tool call (tool name + target file + error flag) and detects when the agent gets stuck in a loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My error loop rate: 35%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That means in over a third of my agent sessions, the AI got stuck retrying the same failing approach three or more times. The most common pattern: &lt;code&gt;Bash(test.py):error -&amp;gt; Edit(auth.py) -&amp;gt; Bash(test.py):error&lt;/code&gt; — edit a file, run the test, fail, edit, test, fail.&lt;/p&gt;

&lt;p&gt;The agent burned tokens and time on approaches that clearly weren't working. Knowing this changed how I intervene in agent sessions.&lt;/p&gt;
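&lt;p&gt;The fingerprint-and-detect idea is simple enough to sketch. Field names and the threshold here are illustrative, not reprompt's actual session schema:&lt;/p&gt;

```python
from collections import Counter

# Sketch of fingerprint-based loop detection: each tool call becomes a
# (tool, target, errored) fingerprint, and a session is flagged when any
# consecutive fingerprint pair repeats 3+ times.
def fingerprint(call: dict) -> tuple:
    return (call["tool"], call.get("file", ""), call.get("error", False))

def error_loop_detected(calls: list[dict], threshold: int = 3) -> bool:
    prints = [fingerprint(c) for c in calls]
    # Count repeated adjacent pairs, e.g. Bash(test.py):error -> Edit(auth.py)
    pairs = Counter(zip(prints, prints[1:]))
    return any(n >= threshold for n in pairs.values())
```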

&lt;h3&gt;
  
  
  Finding 3: Most Conversation Turns Are Filler
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;reprompt distill&lt;/code&gt; scores every conversation turn using 6 signals (position, length, tool trigger, error recovery, topic shift, vocabulary uniqueness).&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;50-70% of my turns carry near-zero information.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"ok try that", "continue", "looks good", "hmm interesting" — these are the prompting equivalent of "um" and "uh." They don't guide the AI in any useful direction. The actually productive turns — the ones that specify files, constraints, and context — typically make up only 15-20 turns out of a 100-turn session.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Privacy Angle
&lt;/h3&gt;

&lt;p&gt;The EU AI Act took effect in August 2025. Organizations are increasingly required to understand what data flows to AI services. But most developers have no visibility into what they've actually sent.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;reprompt privacy&lt;/code&gt; shows a per-tool breakdown: which adapter (Claude Code, Cursor, ChatGPT) received which types of content. &lt;code&gt;reprompt privacy --deep&lt;/code&gt; goes further and scans for 12 categories of sensitive content, including API keys (OpenAI, AWS, GitHub, Anthropic, Stripe), JWT tokens, emails, IP addresses, password assignments, environment secrets, and home directory paths.&lt;/p&gt;

&lt;p&gt;All detection is regex-based. Zero network calls. Your prompts never leave your machine.&lt;/p&gt;
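&lt;p&gt;A regex-only scanner of this kind is a dictionary of patterns and a loop. The patterns below are deliberately simplified examples; a real scanner needs many more categories and tighter anchors:&lt;/p&gt;

```python
import re

# Minimal sketch of local, regex-only secret scanning. Zero network calls.
SECRET_PATTERNS = {
    "openai_key":   re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "jwt":          re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "email":        re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def scan(text: str) -> dict[str, list[str]]:
    """Return every category with at least one match in the text."""
    hits = {}
    for name, pattern in SECRET_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```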

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;reprompt reads session files that AI tools already store locally:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;JSONL&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.claude/projects/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;JSONL&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.codex/sessions/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;SQLite&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.cursor/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.aider.chat.history.md&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.gemini/tmp/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No instrumentation required. No code changes. Just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;reprompt-cli
reprompt scan
reprompt privacy &lt;span class="nt"&gt;--deep&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scoring engine is calibrated against 4 NLP research papers. The agent analyzer builds tool call fingerprints and detects repetition patterns. The distiller uses TF-IDF cosine similarity for topic shift detection. Everything runs in &amp;lt;50ms for a typical session.&lt;/p&gt;
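&lt;p&gt;Topic-shift detection via cosine similarity boils down to comparing term vectors of adjacent turns. This toy version uses plain term counts rather than TF-IDF weights, and the threshold is an arbitrary example:&lt;/p&gt;

```python
import math
from collections import Counter

# Toy sketch of topic-shift detection: a turn whose vocabulary barely
# overlaps the previous turn's is flagged as a topic shift.
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_shift(prev_turn: str, turn: str, threshold: float = 0.2) -> bool:
    return cosine(Counter(prev_turn.lower().split()),
                  Counter(turn.lower().split())) < threshold
```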

&lt;h3&gt;
  
  
  What I Changed
&lt;/h3&gt;

&lt;p&gt;After running reprompt on my history:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I stopped copy-pasting full error messages with credentials. Instead, I redact API keys before pasting.&lt;/li&gt;
&lt;li&gt;I intervene earlier in agent sessions when I see the same test failing twice.&lt;/li&gt;
&lt;li&gt;My debug prompts went from averaging 31/100 to 52/100 — not from trying harder, just from seeing the score.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Try It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;reprompt-cli
reprompt scan                     &lt;span class="c"&gt;# discover sessions from installed AI tools&lt;/span&gt;
reprompt                          &lt;span class="c"&gt;# see your dashboard&lt;/span&gt;
reprompt privacy &lt;span class="nt"&gt;--deep&lt;/span&gt;           &lt;span class="c"&gt;# scan for leaked credentials&lt;/span&gt;
reprompt agent &lt;span class="nt"&gt;--last&lt;/span&gt; 5           &lt;span class="c"&gt;# analyze recent agent sessions&lt;/span&gt;
reprompt distill &lt;span class="nt"&gt;--last&lt;/span&gt; 3         &lt;span class="c"&gt;# extract important turns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1,529 tests. MIT license. Zero network calls. Supports 9 AI tools.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/reprompt-dev/reprompt" rel="noopener noreferrer"&gt;reprompt-dev/reprompt&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What would your numbers look like?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
