The Project I Started But Never Finished
Earlier this year I started building ai-qe-agent —
a multi-agent system that auto-generates QA test cases
using Claude (Anthropic's AI).
8 specialized agents. TypeScript. Direct Anthropic SDK.
It worked. But it had a critical problem:
No visibility into whether the outputs were actually correct.
Agents were generating test cases, reviewing them,
converting them to Playwright scripts — and I had
no idea if Claude was hallucinating, truncating,
or silently failing between agents.
Closing that visibility gap is what I set out to finish.
The Before
How Claude Helped Me Finish It
I used Claude (via Claude Code) as my primary
AI coding assistant throughout this project.
Claude helped me:
- Design the LLM-as-Judge eval architecture
- Generate eval_suite.py from scratch
- Debug LangSmith tracing integration
- Build the TruLens monitoring setup
- Create the Fintech AI Agent Gradio app
The meta-irony: I used Claude to build a system
that evaluates Claude's own outputs.
What I Finished
1. Custom LLM Eval Suite
Built eval_suite.py using the LLM-as-Judge pattern,
with Claude evaluating Claude's own outputs across 4 dimensions:
- Completeness — did the agent complete the full task?
- Specificity — were outputs precise and detailed?
- Faithfulness — did the agent follow all instructions?
- Hallucination detection — did it invent facts not in context?
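To make the LLM-as-Judge pattern concrete, here is a minimal sketch of how a judge call over the four dimensions could be structured. It is not the project's eval_suite.py: the function names, the prompt wording, the 0.0 to 1.0 scale, and the JSON reply format are all assumptions for illustration. The judge is passed in as a plain callable so the sketch runs without an API key.

```python
import json
from typing import Callable

# The four dimensions named above; the 0.0-1.0 scale is an assumption.
DIMENSIONS = ("completeness", "specificity", "faithfulness", "hallucination")


def build_judge_prompt(task: str, context: str, output: str) -> str:
    """Assemble the rubric prompt that is sent to the judge model."""
    rubric = "\n".join(f"- {dim}: score from 0.0 to 1.0" for dim in DIMENSIONS)
    return (
        "You are grading another model's output.\n\n"
        f"Task:\n{task}\n\nContext:\n{context}\n\nOutput:\n{output}\n\n"
        f"Score each dimension:\n{rubric}\n"
        "Reply with a single JSON object keyed by dimension name."
    )


def parse_scores(raw: str) -> dict[str, float]:
    """Pull the JSON object out of the judge reply and validate every score."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge reply contains no JSON object")
    data = json.loads(raw[start:end + 1])
    scores: dict[str, float] = {}
    for dim in DIMENSIONS:
        value = float(data[dim])  # KeyError here means the judge skipped a dimension
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dim} score {value} is outside 0.0-1.0")
        scores[dim] = value
    return scores


def evaluate(task: str, context: str, output: str,
             judge: Callable[[str], str]) -> dict[str, float]:
    """Run one judge call; `judge` maps a prompt string to the reply text."""
    return parse_scores(judge(build_judge_prompt(task, context, output)))
```

In a real run, `judge` would wrap a Claude API call. Parsing strictly and raising on a missing dimension keeps a malformed judge reply from turning into one more silent failure.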
2. TruLens Monitoring Dashboard
Real-time quality metrics across all 4 agents:
- Faithfulness scores
- Hallucination flags
- Chain compatibility checks
- Quality score trends
3. LangSmith Production Tracing
Every Claude API call is now traced:
- Input prompt
- Output response
- Latency per agent
- Token usage
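The four traced fields can be pictured with a small stdlib-only stand-in. This is not LangSmith's API; `TraceRecord`, `TRACES`, and `traced` are hypothetical names, and the sketch assumes each agent function returns its response text together with input and output token counts.

```python
import time
from dataclasses import dataclass
from functools import wraps
from typing import Callable


@dataclass
class TraceRecord:
    agent: str
    prompt: str          # input prompt
    response: str        # output response
    latency_s: float     # wall-clock latency for this call
    input_tokens: int    # token usage, as reported by the wrapped function
    output_tokens: int


# In-memory sink; a tracing service would receive these records instead.
TRACES: list[TraceRecord] = []


def traced(agent: str):
    """Decorator that records one TraceRecord per call.

    The wrapped function must accept a prompt string and return a tuple
    (response_text, input_tokens, output_tokens).
    """
    def decorator(fn: Callable[[str], tuple[str, int, int]]):
        @wraps(fn)
        def wrapper(prompt: str) -> str:
            start = time.perf_counter()
            response, tokens_in, tokens_out = fn(prompt)
            latency = time.perf_counter() - start
            TRACES.append(
                TraceRecord(agent, prompt, response, latency, tokens_in, tokens_out)
            )
            return response
        return wrapper
    return decorator
```

Decorating an agent's model-calling function with `@traced("ManualTestGenerator")` would append one record per call, which is enough to compare latency and token usage between agents.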
4. Pinecone Vector Store
Semantic deduplication for test cases:
- Prevents duplicate test generation
- Cosine similarity of 0.85 or higher raises a HIGH OVERLAP flag
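The threshold rule can be sketched without a vector store. The example below computes cosine similarity in plain Python over already-embedded vectors; in the real setup Pinecone would return the similarity scores. The function names and the "OK" label for the non-flagged case are assumptions.

```python
import math
from typing import Sequence

OVERLAP_THRESHOLD = 0.85  # threshold stated in the post


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat a zero vector as no overlap


def overlap_flag(candidate: Sequence[float],
                 existing: Sequence[Sequence[float]],
                 threshold: float = OVERLAP_THRESHOLD) -> str:
    """Flag a new test case whose embedding is too close to a stored one."""
    best = max((cosine_similarity(candidate, v) for v in existing), default=0.0)
    return "HIGH OVERLAP" if best >= threshold else "OK"
```

A candidate flagged HIGH OVERLAP would be skipped or sent for review instead of being added as a new test case.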
5. Fintech AI Agent (New HF Space)
Live demo combining everything:
- Fraud detection with risk scoring (0-10)
- Compliance Q&A (KYC/AML/GDPR/SOX/PCI-DSS)
- AML risk report generation (6-section formal reports)
- Real-time eval dashboard
The Findings — What Claude Found About Itself
Running the eval suite on my own pipeline revealed:
🔴 2 hallucinations caught
AutomationScriptGenerator invented 'Invalid credentials'
as error text — never specified in the input context.
SelfHealingAgent fabricated DOM selectors without a DOM.
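A cheap deterministic check can back up the judge for this kind of failure. The sketch below is a crude heuristic, not the method the eval suite uses: it pulls quoted string literals out of a generated script and lists the ones that never appear in the input context. The function name and the case-insensitive substring match are assumptions.

```python
import re

# Matches 'single-quoted' or "double-quoted" literals on one line.
_STRING_LITERAL = re.compile(r"""(['"])(.+?)\1""")


def ungrounded_literals(script: str, context: str) -> list[str]:
    """Return string literals in `script` that do not occur in `context`."""
    haystack = context.lower()
    found: list[str] = []
    for match in _STRING_LITERAL.finditer(script):
        literal = match.group(2)
        if literal.lower() not in haystack and literal not in found:
            found.append(literal)
    return found
```

Anything this returns, whether error text or a selector, is a candidate for review rather than proof of a hallucination, since a literal can be legitimate without being quoted verbatim in the context.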
🔴 2 pipeline breaks found
ManualTestGenerator returned a bare array.
QAReviewAgent expected a wrapped ManualTestSuite object.
chain_compatibility = 0. This would fail silently in production.
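The mismatch is easy to check for at the handoff between agents. Here is a minimal sketch, assuming the wrapped object holds its cases under a `testCases` key; that key name and both function names are hypothetical, since the post does not show the ManualTestSuite shape.

```python
from typing import Any


def chain_compatibility(output: Any, required_key: str = "testCases") -> int:
    """Return 1 if `output` is an object wrapping a list under `required_key`, else 0."""
    if isinstance(output, dict) and isinstance(output.get(required_key), list):
        return 1
    return 0


def handoff(output: Any, required_key: str = "testCases") -> dict:
    """Pass output to the next agent, failing loudly on a shape mismatch."""
    if chain_compatibility(output, required_key) == 0:
        raise TypeError(
            f"expected an object with a '{required_key}' list, "
            f"got {type(output).__name__}"
        )
    return output
```

Raising at the boundary turns the silent failure into an error that names the shape the next agent needed.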
🔴 2 faithfulness failures
ManualTestGenerator generated only 2 of the 8 required test cases.
Stopped with no error. No warning. Just silent truncation.
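Truncation of this kind can also be caught with a plain count check that needs no judge at all. The sketch below is a deterministic complement to the LLM-scored faithfulness dimension, not how the eval suite computes it; both function names are hypothetical.

```python
def count_coverage(generated: int, required: int) -> float:
    """Fraction of the required test cases that were actually produced."""
    if required <= 0:
        raise ValueError("required must be positive")
    return min(generated, required) / required


def require_full_output(cases: list, required: int) -> list:
    """Raise instead of passing a truncated batch down the pipeline."""
    if len(cases) < required:
        raise ValueError(f"expected {required} test cases, got {len(cases)}")
    return cases
```

For the case above, `count_coverage(2, 8)` gives 0.25, and `require_full_output` would have stopped the pipeline with an explicit error.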
🟢 0.902 avg quality score
AutomationScriptGenerator: 0.94
SelfHealingAgent: 1.0 quality — but 0.0 faithfulness.
Good output. Wrong process. Only eval catches this.
The After
The Key Insight
AI systems fail silently.
No errors. No warnings. No crashes.
Just wrong outputs — shipped with confidence.
This is why LLM Evaluation Engineering exists.
And why finishing this project mattered.
Demo
🤗 Fintech AI Agent (live):
https://huggingface.co/spaces/Vijayarv07/fintech-ai-agent
🤗 ai-qe-agent (live):
https://huggingface.co/spaces/Vijayarv07/ai-qe-agent
⭐ GitHub:
https://github.com/vijayarjun7/ai-qe-agent
Tech Stack
- Claude (claude-sonnet-4-20250514) — Anthropic
- Python + TypeScript
- TruLens (eval monitoring)
- LangSmith (production tracing)
- Pinecone (vector store)
- Gradio (HF Space UI)
- Playwright (automation)
Built in public. Follow my journey: #BuildInPublic

Top comments (2)
That's inspiring. Thanks for sharing.
Just curious, were you using the Claude Code free tier, or did you have to upgrade to complete your project?
Yes, I needed to upgrade in order to complete it.