I score every prompt I send to AI coding tools. My average across 3,140 prompts over ten weeks: 38 out of 100.
Not because I'm bad at prompting. Because at 2am debugging an auth bug, I type "fix the auth bug" and hit enter. Same intent as a well-structured prompt, completely different quality.
So I built ctxray — a CLI that analyzes how you actually use AI coding tools. Not what the AI outputs. What you type into it. Rule-based, local-only, under 5ms per prompt.
Before / After
$ ctxray check "fix the auth bug"
DRAFT · 29
Clarity ███████░░░░░░░░░░░░░ 9/25
Context ░░░░░░░░░░░░░░░░░░░░ 0/25
Position ████████████████████ 20/20
Auto-rewrite (+24 pts)
✓ Added debug prompt structure
Rewritten:
fix the auth bug
Error: <paste the error message or stack trace>
File: <which file and function>
Expected: <what should happen vs what actually happens>
It detected "fix the auth bug" as a debug task and added the slots that debug prompts need. Implement prompts get I/O specs + edge cases. Refactor gets scope + constraints. Five task types, each with different structural scaffolding.
The same prompt with actual context scores 58:
$ ctxray check "Fix the NPE in auth.service.ts:47 when session expires,
expected AuthException not HTTP 200"
GOOD · 58
Clarity █████████████████░░░ 22/25
Context ████████████░░░░░░░░ 16/25
Position ████████████████████ 20/20
Same intent. Twice the score. The difference is file path, line number, error message — context that the model needs but I keep forgetting to include.
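To make the three dimensions concrete, here is a toy version of such a rubric. The weights, regexes, and caps are my own assumptions for illustration; the real engine uses 30+ calibrated features:

```python
import re

def score_prompt(prompt: str) -> dict:
    """Toy three-dimension rubric: clarity (0-25), context (0-25), position (0-20)."""
    # Context: concrete anchors the model can act on.
    context = 0
    if re.search(r"\b[\w/]+\.(ts|tsx|py|js|go|rs)\b", prompt):
        context += 8                     # file path
    if re.search(r":\d+\b", prompt):
        context += 4                     # line number
    if re.search(r"\b\w*(Error|Exception|NPE)\w*\b", prompt):
        context += 8                     # named error
    context = min(context, 25)

    # Clarity: expected-vs-actual phrasing and enough words to carry specifics.
    clarity = 9                          # baseline for an intelligible sentence
    if re.search(r"\bexpected\b", prompt, re.IGNORECASE):
        clarity += 8
    if len(prompt.split()) >= 8:
        clarity += 5
    clarity = min(clarity, 25)

    # Position: does the instruction verb come first (front-loading)?
    position = 20 if prompt.split()[0].lower() in {"fix", "add", "refactor", "explain"} else 10

    return {"clarity": clarity, "context": context, "position": position,
            "total": clarity + context + position}
```

Note how "fix the auth bug" earns full position marks but zero context: the verb is fine, the anchors are missing.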
What it actually does
ctxray scans session files from 9 AI coding tools on your machine. Claude Code, Cursor, Aider, Gemini CLI, Cline, OpenClaw, Codex CLI, plus ChatGPT and Claude.ai web exports.
pip install ctxray
ctxray scan # auto-discover sessions
ctxray check "your prompt" # score + lint + rewrite
ctxray insights # personal patterns vs research benchmarks
ctxray sessions # session quality + frustration signals
ctxray agent # agent workflow efficiency
ctxray privacy --deep # find leaked API keys in sessions
ctxray insights is the one that surprised me most. It told me 32% of my prompts were near-duplicates — same structure, different variable names. I'm asking the same thing across sessions without remembering I figured it out last week.
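Near-duplicate detection of the "same structure, different variable names" kind can be done with a structural fingerprint: collapse identifiers, paths, and numbers to placeholders, then count collisions. A minimal sketch (my own, not ctxray's algorithm):

```python
import re
from collections import Counter

def fingerprint(prompt: str) -> str:
    """Collapse variable-ish tokens so structurally identical prompts collide."""
    text = prompt.lower()
    text = re.sub(r"[\w/]+\.\w+(:\d+)?", "<path>", text)     # file paths / line refs
    text = re.sub(r"\b[a-z]+(_[a-z]+)+\b", "<ident>", text)  # snake_case identifiers
    text = re.sub(r"\b\d+\b", "<num>", text)                 # bare numbers
    return re.sub(r"\s+", " ", text).strip()

def near_duplicate_ratio(prompts: list[str]) -> float:
    """Fraction of prompts that repeat an already-seen structural fingerprint."""
    counts = Counter(fingerprint(p) for p in prompts)
    dupes = sum(c - 1 for c in counts.values() if c > 1)
    return dupes / len(prompts) if prompts else 0.0
```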
ctxray sessions scores entire sessions and detects frustration signals — error loops where the same fix gets tried 3+ times, repetitive prompts that signal the model isn't understanding, and sessions where >60% of turns are filler.
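An error loop, in the simplest terms, is a run of near-identical consecutive prompts. One way to detect it is with a streak counter over pairwise text similarity; the threshold and streak length here are assumptions, not ctxray's tuned values:

```python
from difflib import SequenceMatcher

def detect_error_loop(turns: list[str], threshold: float = 0.8,
                      min_repeats: int = 3) -> bool:
    """Flag a session where near-identical prompts recur min_repeats+ times in a row."""
    streak = 1
    for prev, cur in zip(turns, turns[1:]):
        if SequenceMatcher(None, prev, cur).ratio() >= threshold:
            streak += 1
            if streak >= min_repeats:
                return True
        else:
            streak = 1
    return False
```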
The compression engine
Your prompts probably contain more filler than you think:
$ ctxray compress "I was wondering if you could please help me refactor
the authentication middleware to use JWT tokens instead of session
cookies. Basically what I need is for the current implementation in
src/auth/middleware.ts to be updated."
Tokens: 50 → 33 (34% saved)
Research: Moderate compression improves LLM output (Zhang+ 2505.00019)
Four layers: character normalization, phrase simplification, filler deletion, structure cleanup. All regex. Works for English and Chinese.
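A stripped-down version of that pipeline fits in a few lines. The specific patterns below are a toy subset I chose to mirror the example above, not ctxray's actual rule set:

```python
import re

# Each layer is (pattern, replacement), applied in order.
LAYERS = [
    (re.compile(r"\s+"), " "),                                   # 1. normalize whitespace
    (re.compile(r"\bI was wondering if you could\b", re.I),
     "please"),                                                  # 2. simplify phrases
    (re.compile(r"\bwhat I need is for\b", re.I), ""),           # 2. simplify phrases
    (re.compile(r"\b(basically|actually|just|really|please)\b\s*", re.I),
     ""),                                                        # 3. delete filler
]

def compress(prompt: str) -> str:
    for pattern, repl in LAYERS:
        prompt = pattern.sub(repl, prompt)
    return re.sub(r"\s+", " ", prompt).strip()                   # 4. structure cleanup
```

Because every layer is a plain substitution, the output is deterministic and the whole pass is a handful of regex scans.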
Why rule-based in 2026?
Everyone's using LLMs to analyze prompts. I went the other way. Three reasons:
Speed. Under 5ms per prompt. Runs in a pre-commit hook and nobody notices. An LLM call takes 2-5 seconds.
Determinism. Same input, same output, every time. I track my scores weekly — if the scoring function shifts with model version, the trend is meaningless. I use ctxray lint --score-threshold 50 in CI. Random failures from a creative LLM would not be fun.
Privacy. My prompts contain file paths, function names, error messages with stack traces, sometimes credentials I forgot to redact. That's a map of my codebase. Sending it to another LLM for "improvement" defeats the purpose.
The tradeoff is real. Structural signals miss semantic intent. An LLM would understand "this is where I changed my whole approach." My heuristics just see a long turn with high vocabulary shift. For daily feedback loops, structural analysis catches 80-90% of what matters.
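For the curious, the "high vocabulary shift" heuristic can be as blunt as a Jaccard distance between consecutive turns (a toy sketch, not ctxray's implementation):

```python
def vocab_shift(prev_turn: str, turn: str) -> float:
    """1.0 = entirely new vocabulary vs the previous turn; 0.0 = same words."""
    a, b = set(prev_turn.lower().split()), set(turn.lower().split())
    if not a or not b:
        return 1.0
    return 1 - len(a & b) / len(a | b)
```

A long turn with `vocab_shift` near 1.0 is structurally indistinguishable from "I changed my whole approach", which is exactly the semantic gap described above.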
The research behind the scoring
The scoring engine uses 30+ features calibrated against 10 NLP papers. Not as decoration — each paper maps to a specific dimension:
- Position bias is architectural (Stanford 2307.03172, confirmed by Chowdhury 2603.10123): models weight beginnings and ends of prompts more than the middle. Front-load your instructions.
- Moderate repetition helps (Google 2512.14982): repeating key requirements at the end improves recall up to 76%. But excessive repetition hurts.
- Specificity > length (Zi+ 2508.03678): file paths, line numbers, error messages improve output more than verbose explanations.
Works in CI
The part that makes this sticky: ctxray lint runs as a CI quality gate.
# GitHub Action
- uses: ctxray/ctxray-action@v1
  with:
    score-threshold: 50
There's a .ctxray.toml config for team rules, a --format github flag that posts score breakdowns as PR comments, and a pre-commit hook. Think ESLint for AI prompts — configurable, deterministic, fast.
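For the local side of the gate, a pre-commit entry could look like the following. This is a hypothetical `.pre-commit-config.yaml` using a `repo: local` hook; the hook id, name, and the assumption that `ctxray lint` runs without filename arguments are mine, not documented behavior:

```yaml
repos:
  - repo: local
    hooks:
      - id: ctxray-lint
        name: ctxray prompt lint
        entry: ctxray lint --score-threshold 50
        language: system
        pass_filenames: false
```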
Why now
Every prompt analysis tool is getting acquired. Promptfoo joined OpenAI (March 2026). Humanloop was acqui-hired by Anthropic. PromptPerfect got absorbed into Jina. The tools that measured model behavior now belong to the labs that make models.
ctxray stays independent and model-agnostic. It's the only tool that sees your Claude Code sessions and your Cursor sessions and your ChatGPT history together, locally, without sending anything anywhere.
What I haven't figured out
The scoring handles maybe 30% of what makes a good prompt. The other 70% is stuff only you know — the error message on your screen, the file you just edited, the approach you already tried. No tool can add that for you.
I also don't think the scoring is "right" yet. A 3-word prompt from someone deep in a debugging session can be more effective than a 200-word structured request from someone who doesn't understand the codebase. Context that lives in your head doesn't show up in a score.
Try it
pip install ctxray
ctxray demo # try with built-in sample data
ctxray scan # discover your sessions
ctxray check "your prompt here"
1,941 tests, strict mypy, MIT licensed. Everything local, no account, no telemetry.
What do your prompts look like when you actually measure them? I'm genuinely curious whether people who use CLAUDE.md files or cursor rules have noticeably different patterns.
Top comments (2)
Author here. Some details that didn't fit above.
The adapter architecture is probably the most satisfying part. Every AI tool stores sessions differently: Claude Code writes JSONL with tool_use blocks, Cursor stores in SQLite blobs, Aider uses markdown chat history, ChatGPT exports as nested JSON trees. Each adapter implements parse_session() and parse_conversation() for full turn reconstruction. Adding a new adapter is about 60 lines plus parsing logic.
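The shape of such an adapter is easy to sketch. The `Turn`, `Session`, and `JsonlAdapter` names below are illustrative; only `parse_session()` comes from the description above, and the JSONL field names are assumptions about a Claude Code-like format:

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str

@dataclass
class Session:
    tool: str
    turns: list[Turn]

class Adapter(ABC):
    """One adapter per tool; each knows how to read that tool's on-disk format."""
    @abstractmethod
    def parse_session(self, path: Path) -> Session: ...

class JsonlAdapter(Adapter):
    """JSONL sessions: one JSON object per line with role/content fields."""
    def parse_session(self, path: Path) -> Session:
        turns = []
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            obj = json.loads(line)
            turns.append(Turn(role=obj["role"], text=obj.get("content", "")))
        return Session(tool="claude-code", turns=turns)
```

Everything downstream (scoring, insights, frustration signals) consumes the normalized `Session`, which is why a new tool costs only the parsing logic.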
There's also an MCP server if you want scoring inside Claude Code without leaving your session:
7 tools: score_prompt, compress_prompt, compare_prompts, build_prompt_from_parts, check_privacy, search_prompts, scan_sessions.
And ctxray install-hook adds a post-session hook to Claude Code that prints your score after each session. It's the simplest integration and the one that actually changed my habits: my debug prompt average went from 31 to 48 after ten weeks just from seeing the number.

A surprising insight from working with enterprise teams is that many developers overestimate the impact of prompt scores while underestimating the importance of integrating AI feedback into their coding workflows. The real value often lies in how seamlessly these suggestions are implemented into daily practices, rather than just scoring well. Consider creating agents that guide developers in embedding AI insights directly into their coding processes for more tangible improvements. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)