Medhansh Pratap Singh

I audited my AI tool catalog with Claude — turns out 50% was mis-categorized

A week before my college's AI Expo, I was demoing my side project — a curated AI tool finder for students called AI Compass.

I picked "Coding" as the goal. The wizard recommended Suno in the top 6.

Suno makes AI music. It has nothing to do with coding.

I checked the data file. Suno's category? "Coding". So was Power BI. So was Loom. So was Discord, somehow.

The catalog was lying.

The shape of the problem

AI Compass works simply: students answer 4 questions (goal, use case, budget, platform), and the wizard returns 5-6 tools with a one-line reason for each. The catalog had ~450 tools, each tagged into one category from a fixed list (Coding, Writing & Chat, Research, Productivity, Image Gen, Video Gen).

The wizard had a strict hard-gate: pick goal=coding → only tools tagged "Coding" can surface.

The gate worked. The data didn't.
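For context, the gate is basically a one-line filter. Here is a minimal sketch in Python; the field names and the goal-to-category mapping are illustrative, not the actual AI Compass code:

```python
# Minimal sketch of the hard gate. Field names and GOAL_TO_CATEGORY are illustrative.
GOAL_TO_CATEGORY = {
    "coding": "Coding",
    "writing": "Writing & Chat",
    # ...one entry per wizard goal
}

def hard_gate(tools, goal):
    """Only tools whose category matches the selected goal can surface."""
    wanted = GOAL_TO_CATEGORY[goal]
    return [t for t in tools if t["category"] == wanted]

catalog = [
    {"name": "Suno", "category": "Coding", "tags": ["music", "vocals"]},   # mis-tagged in the data
    {"name": "Cursor", "category": "Coding", "tags": ["code", "editor"]},
]

# The gate happily surfaces Suno, because the data says "Coding".
print([t["name"] for t in hard_gate(catalog, "coding")])  # ['Suno', 'Cursor']
```

The filter trusts `category` completely, so a wrong label sails straight through.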

The midnight band-aid

I had hours, not days. So I shipped a defense-in-depth fix instead of touching all 450 entries: a tag-based veto.

A tool now had to pass two checks: its category had to match, AND at least one of its tags or use_cases had to contain a category-relevant keyword.

For Coding, that meant code, programming, developer, github, debug, framework, etc.

Suno's tags: ['music', 'vocals', 'AI generation', 'creative', 'design']. Zero coding keywords. Vetoed.
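The whole veto fits in a few lines. Here's a rough sketch; the keyword lists and helper names are mine for illustration, not the production code:

```python
# Sketch of the tag-based veto layered on top of the category hard gate.
# Keyword lists and function names are illustrative, not the real AI Compass code.
CATEGORY_KEYWORDS = {
    "Coding": ["code", "programming", "developer", "github", "debug", "framework"],
    # ...one keyword list per category
}

def passes_veto(tool, category):
    """Check 2: at least one tag or use_case must contain a category-relevant keyword."""
    keywords = CATEGORY_KEYWORDS[category]
    haystack = [s.lower() for s in tool.get("tags", []) + tool.get("use_cases", [])]
    return any(kw in field for field in haystack for kw in keywords)

def recommend(tools, category):
    """Check 1 (category match) AND check 2 (keyword veto)."""
    return [t for t in tools if t["category"] == category and passes_veto(t, category)]

suno = {"name": "Suno", "category": "Coding",
        "tags": ["music", "vocals", "AI generation", "creative", "design"]}
print(passes_veto(suno, "Coding"))  # False -> vetoed despite the "Coding" label
```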

The veto caught 60 mis-tagged tools across all 6 goals: Suno, Power BI, Loom, Discord, and Khan Academy (all "Coding" for some reason), plus presentation tools tagged "Writing & Chat" and education courses tagged "Video Generation."

Shipped. Suno gone. Crisis averted.

But the band-aid was hiding the real problem: about half of my categories were just wrong.

The actual fix: auditing 450 tools with Claude

Cold-prompting an LLM with "categorize my catalog" is a hallucination factory. It'll confidently relabel tools based on training data that's outdated or just wrong.

I structured the audit around three constraints:

1. No web lookups, no training-data inference. Categorize each tool only using fields already in the data file (name, description, tags, use_cases). If those don't make the category obvious, mark confidence: low and flag for human review. Don't guess.

2. Apply an explicit taxonomy. I wrote the 8-category rulebook upfront — what each category means, what the multi-category tiebreaker is, where edge cases go. The prompt embedded this verbatim.

3. Output a proposal, not changes. The audit wrote a separate JSON file with one entry per tool: current category, proposed category, confidence, reasoning. I reviewed before anything got applied.
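To make that concrete, here's roughly how one audit item was assembled and what a proposal entry looks like. The taxonomy text, JSON keys, and wording are illustrative; only the three constraints above come from the real setup:

```python
# Sketch of the audit setup: prompt construction and the proposal-file format.
# The taxonomy text, JSON keys, and example values here are illustrative.
import json

TAXONOMY = """Coding: tools whose primary job is writing, reviewing, or debugging code.
Writing & Chat: general-purpose assistants and writing aids.
...one explicit rule per category, plus the multi-category tiebreaker..."""

def audit_prompt(tool):
    """Constraint 1: only fields from the data file. Constraint 2: taxonomy embedded verbatim."""
    facts = {k: tool.get(k) for k in ("name", "description", "tags", "use_cases")}
    return (
        TAXONOMY
        + "\n\nCategorize this tool using ONLY the fields below. No outside knowledge, no guessing.\n"
        + 'If the fields are not enough, set confidence to "low" and flag for human review.\n'
        + "Respond as JSON with keys: proposed_category, confidence, reasoning.\n\n"
        + json.dumps(facts, indent=2)
    )

# Constraint 3: answers go into a separate proposal file, never straight into the catalog.
# One entry per tool, reviewed by hand before anything is applied:
proposal_entry = {
    "name": "Suno",
    "current_category": "Coding",
    "proposed_category": "Audio & Voice",
    "confidence": "high",
    "reasoning": "Tags and description are all about music generation; no coding signal.",
}
```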

The audit found:

  • 226 proposed changes (50% of the catalog)
  • 3 categories the taxonomy was missing: Audio & Voice, Courses & Tutorials, Design & Graphics
  • ~60 tools borderline on the inclusion bar: vanilla productivity apps without meaningful AI features
  • 40 tools with thin metadata — even with the rules explicit, the data was too sparse to categorize

I approved 209 changes, removed 7 vanilla apps (Slack, Zulip, Apple Notes, etc.), and added the 3 new categories.

What changed for users

Top results for goal = coding, before:

  1. ChatGPT
  2. Suno (Music AI)
  3. Power BI (BI Tool)
  4. Some educational courses

After:

  1. ChatGPT
  2. Claude
  3. GitHub Copilot
  4. Cursor
  5. VS Code

The wizard finally returns the tools you'd actually expect. Claude went from "Writing & Chat" to "Coding" — matching its 2026 reputation as one of the strongest coding models, instead of being buried under chat queries.

What I'm taking from this

Bad data is silent. No test caught Suno-for-coding because the schema was valid — the value was wrong. I need tests that validate "goal X should surface tools like Y," not just that the JSON parses.
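Something like this pytest-style sketch is what I mean; load_catalog and the recommend helper from the veto sketch above are stand-ins, and the catalog path is hypothetical:

```python
# Sketch of a data-quality test: assert on surfaced values, not just that the JSON parses.
# load_catalog() and recommend() (from the veto sketch above) stand in for the wizard code.
import json

def load_catalog(path="data/tools.json"):  # hypothetical path
    with open(path) as f:
        return json.load(f)

EXPECTED = {
    "Coding": {
        "must_include": {"GitHub Copilot", "Cursor"},
        "must_exclude": {"Suno", "Power BI", "Loom"},
    },
}

def test_coding_goal_surfaces_coding_tools():
    surfaced = {t["name"] for t in recommend(load_catalog(), "Coding")}
    assert EXPECTED["Coding"]["must_include"] <= surfaced
    assert EXPECTED["Coding"]["must_exclude"].isdisjoint(surfaced)
```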

LLMs are great at applying rules, terrible at inferring them. The audit worked because I wrote the taxonomy first. "Categorize this however you think best" would've been 450 hallucinations.

Boring infrastructure work is the actual moat. TAAFT has 48,000 tools scraped. AI Compass has ~440 hand-curated. The only reason a student picks the smaller one is curation quality — and that breaks the moment your data lies.


If you want to try it: ai-compass.in/ai-tool-finder — 4 questions, 5-6 hand-picked tools, no signup.

Honest feedback welcome, especially queries where the recommendations still feel wrong. That's how the catalog actually gets better.

19, CS at RVCE Bangalore. Building AI Compass as my first indie project.
