How I Used Claude to Finish Building an AI That Evaluates AI — and Caught It Hallucinating

#devchallenge #githubfinishupathon #ai #githubchallenge

The Project I Started But Never Finished

Earlier this year I started building ai-qe-agent —
a multi-agent system that auto-generates QA test cases
using Claude (Anthropic's AI).

8 specialized agents. TypeScript. Direct Anthropic SDK.

It worked. But it had a critical problem:

No visibility into whether the outputs were actually correct.

Agents were generating test cases, reviewing them,
converting them to Playwright scripts — and I had
no idea if Claude was hallucinating, truncating,
or silently failing between agents.

That's what I set out to finish.

The Before

How Claude Helped Me Finish It

I used Claude (via Claude Code) as my primary
AI coding assistant throughout this project.

Claude helped me:

Design the LLM-as-Judge eval architecture
Generate eval_suite.py from scratch
Debug LangSmith tracing integration
Build the TruLens monitoring setup
Create the Fintech AI Agent Gradio app

The meta-irony: I used Claude to build a system
that evaluates Claude's own outputs.

What I Finished

1. Custom LLM Eval Suite

Built eval_suite.py using the LLM-as-Judge pattern:
Claude scores Claude's own outputs across 4 dimensions:

Completeness — did the agent complete the full task?
Specificity — were outputs precise and detailed?
Faithfulness — did the agent follow all instructions?
Hallucination detection — did it invent facts not in context?
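To make the pattern concrete, here is a minimal sketch of a judge call over those four dimensions. It is not the actual eval_suite.py: the function names, the prompt wording, the 0.0 to 1.0 scale, and the `call_model` callable (a stand-in for whatever wraps the Claude API call) are all assumptions for illustration.

```python
import json
from typing import Callable, Dict

# The four dimensions listed above; the key names are illustrative.
DIMENSIONS = ("completeness", "specificity", "faithfulness", "hallucination")


def build_judge_prompt(task: str, context: str, output: str) -> str:
    """Ask the judge model for one score per dimension, returned as JSON."""
    keys = ", ".join(f'"{d}"' for d in DIMENSIONS)
    return (
        "You are grading another model's output.\n"
        f"TASK:\n{task}\n\nCONTEXT:\n{context}\n\nOUTPUT:\n{output}\n\n"
        f"Reply with a JSON object with exactly these keys: {keys}. "
        "Each value is a number from 0.0 to 1.0. "
        "For hallucination, 1.0 means nothing was invented."
    )


def judge(task: str, context: str, output: str,
          call_model: Callable[[str], str]) -> Dict[str, float]:
    """Send the grading prompt through call_model and validate the reply."""
    raw = call_model(build_judge_prompt(task, context, output))
    scores = json.loads(raw)
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"judge reply is missing: {missing}")
    # Clamp to [0, 1] so a sloppy judge reply cannot skew averages.
    return {d: min(1.0, max(0.0, float(scores[d]))) for d in DIMENSIONS}
```

Passing the model call in as a function keeps the scoring logic testable with a stub, without spending tokens.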

2. TruLens Monitoring Dashboard

Real-time quality metrics across all 4 agents:

Faithfulness scores
Hallucination flags
Chain compatibility checks
Quality score trends

3. LangSmith Production Tracing

Every Claude API call is now traced:

Input prompt
Output response
Latency per agent
Token usage
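The four fields above can be pictured as one record per call. The sketch below is a plain-Python stand-in that captures the same fields. It is not LangSmith's API, and `TraceRecord`, `traced_call`, and the `call_model` signature are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TraceRecord:
    agent: str
    prompt: str
    response: str
    latency_s: float
    input_tokens: int
    output_tokens: int


# In-memory store; a real tracer would ship records to a backend instead.
TRACES: List[TraceRecord] = []


def traced_call(agent: str, prompt: str,
                call_model: Callable[[str], Tuple[str, int, int]]) -> str:
    """Run one model call and record its input, output, latency, and tokens.

    call_model is assumed to return (response_text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    response, input_tokens, output_tokens = call_model(prompt)
    latency = time.perf_counter() - start
    TRACES.append(
        TraceRecord(agent, prompt, response, latency, input_tokens, output_tokens)
    )
    return response
```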

4. Pinecone Vector Store

Semantic deduplication for test cases:

Prevents duplicate test generation
A cosine similarity of 0.85 or higher raises a HIGH OVERLAP flag
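The overlap rule itself is simple to sketch without a vector database. The code below assumes the test cases have already been embedded as equal-length float vectors. The function names are hypothetical, and in the real setup the similarity would presumably come back from the Pinecone query rather than being computed locally.

```python
import math
from typing import Iterable, Sequence

HIGH_OVERLAP_THRESHOLD = 0.85  # threshold stated in the post


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat it as sharing nothing.
    return dot / norm if norm else 0.0


def is_high_overlap(new_vec: Sequence[float],
                    existing_vecs: Iterable[Sequence[float]]) -> bool:
    """True if the new test case is too similar to any stored one."""
    return any(
        cosine_similarity(new_vec, v) >= HIGH_OVERLAP_THRESHOLD
        for v in existing_vecs
    )
```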

5. Fintech AI Agent (New HF Space)

Live demo combining everything:

Fraud detection with risk scoring (0-10)
Compliance Q&A (KYC/AML/GDPR/SOX/PCI-DSS)
AML risk report generation (6-section formal reports)
Real-time eval dashboard

The Findings — What Claude Found About Itself

Running the eval suite on my own pipeline revealed:

🔴 2 hallucinations caught

AutomationScriptGenerator invented 'Invalid credentials'
as error text — never specified in the input context.
SelfHealingAgent fabricated DOM selectors even though no DOM was provided.
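Both cases share a shape: a string appears in the generated script that was never in the input. A cheap deterministic check can complement the LLM judge here. The sketch below is an assumed heuristic, not the method the eval suite uses: it flags quoted string literals in generated code that do not occur anywhere in the context. The function name is hypothetical.

```python
import re
from typing import List

# Matches single- or double-quoted literals on one line (no escape handling).
_LITERAL = re.compile(r"""(['"])(.+?)\1""")


def ungrounded_literals(generated_code: str, context: str) -> List[str]:
    """Return quoted literals in the code that never appear in the context."""
    haystack = context.lower()
    found: List[str] = []
    for match in _LITERAL.finditer(generated_code):
        text = match.group(2)
        # Case-insensitive substring match; skip literals already reported.
        if text.lower() not in haystack and text not in found:
            found.append(text)
    return found
```

A heuristic like this will produce false positives (for example, literals the agent is legitimately allowed to choose), so it works better as a flag for review than as a hard failure.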

🔴 2 pipeline breaks found

ManualTestGenerator returned a bare array,
but QAReviewAgent expected a wrapped ManualTestSuite object.
chain_compatibility = 0. This would fail silently in production.
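A shape check between agents is enough to catch this kind of break before the next agent runs. The sketch below is one possible way to compute such a score. The wrapper key `testCases` is an assumption, since the post does not show the fields of ManualTestSuite, and the function is illustrative rather than the author's implementation.

```python
import json
from typing import Any


def chain_compatibility(upstream_output: str, list_key: str = "testCases") -> int:
    """Return 1 if the upstream JSON is an object wrapping a list, else 0.

    list_key is a hypothetical field name for the wrapped test cases.
    """
    try:
        data: Any = json.loads(upstream_output)
    except json.JSONDecodeError:
        return 0  # not even valid JSON, so the next agent cannot consume it
    return int(isinstance(data, dict) and isinstance(data.get(list_key), list))
```

Under that assumption, a bare array such as `[{"id": 1}]` scores 0, while `{"testCases": [{"id": 1}]}` scores 1.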

🔴 2 faithfulness failures

ManualTestGenerator generated only 2 of the 8 required test cases.
Stopped with no error. No warning. Just silent truncation.
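Silent truncation like this is easy to turn into a loud failure once the required count is known. A minimal sketch, with hypothetical names, that both scores the shortfall and raises on it:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


class TruncatedOutputError(RuntimeError):
    """Raised when an agent returns fewer items than the task required."""


def count_ratio(required: int, generated: int) -> float:
    """Fraction of the required items that were actually produced, capped at 1."""
    if required <= 0:
        raise ValueError("required must be positive")
    return min(generated, required) / required


def require_count(items: Sequence[T], required: int) -> List[T]:
    """Pass items through unchanged, or fail loudly if any are missing."""
    if len(items) < required:
        raise TruncatedOutputError(
            f"expected {required} test cases, got {len(items)}"
        )
    return list(items)
```

With the numbers above, `count_ratio(8, 2)` is 0.25, and `require_count` would have raised instead of letting the pipeline continue.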

🟢 0.902 avg quality score

AutomationScriptGenerator: 0.94
SelfHealingAgent: 1.0 quality — but 0.0 faithfulness.
Good output. Wrong process. Only eval catches this.