If you are shipping AI features, quality can drift silently. A model update, prompt tweak, or retrieval change can lower answer quality without obvious runtime errors.
@hazeljs/eval gives you a practical way to measure that quality with:
- Golden datasets
- Retrieval metrics (precision@k, recall@k, MRR, NDCG)
- Agent trajectory scoring
- CI-ready pass/fail reporting
This starter explains how the package works and includes a real-time evaluation example you can run locally.
## What @hazeljs/eval Solves
@hazeljs/eval is an evaluation toolkit for HazelJS AI applications. You can use it to:
- Define expected behavior as golden test cases.
- Run your RAG/agent pipeline against those cases.
- Score retrieval and response quality.
- Fail CI when quality drops below your threshold.
That makes AI quality regression-testable, not guesswork.
## Install

```bash
npm install @hazeljs/eval @hazeljs/core
```
## Core APIs (Quick Overview)

- `loadGoldenDatasetFromJson(path)`: loads a dataset from a JSON file.
- `runGoldenDataset(dataset, runner, options)`: executes every case and aggregates scores.
- `evaluateRetrieval(...)`: computes retrieval quality metrics.
- `answerContextOverlap(...)`: a quick answer-context overlap heuristic.
- `reportEvalForCi(result, { exitOnFail: true })`: CI-friendly reporting.
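To build intuition for what `answerContextOverlap` measures, here is a minimal token-overlap heuristic. This is an illustrative sketch of the general idea, not the package's actual implementation, which may use different tokenization or weighting:

```typescript
// Illustrative sketch of an answer-context overlap heuristic.
// Returns the fraction of answer tokens that also appear in the
// retrieved contexts (0 = ungrounded, 1 = fully grounded).
// NOTE: an assumption about how such a heuristic could work,
// not the real @hazeljs/eval internals.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function overlapHeuristic(answer: string, contexts: string[]): number {
  const answerTokens = tokenize(answer);
  if (answerTokens.length === 0) return 0;
  const contextTokens = new Set(contexts.flatMap(tokenize));
  const hits = answerTokens.filter((t) => contextTokens.has(t)).length;
  return hits / answerTokens.length;
}

console.log(
  overlapHeuristic('Refunds are available within 30 days.', [
    'Refunds are available within 30 days of purchase.',
  ]),
); // every answer token appears in the context, so this prints 1
```

A high score does not prove correctness, but a low score is a cheap signal that the answer drifted away from the retrieved evidence.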
## Real-Time Example: Evaluate Live Support Responses
In this example, we simulate incoming support events and score them in real time.
Every incoming event is evaluated for:
- Retrieval quality (precision@k, recall@k, MRR, NDCG)
- Answer-context overlap
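If these four metrics are new to you, here are their standard definitions as a self-contained sketch. `@hazeljs/eval` computes them for you; the library's internals may differ in details such as tie handling, so treat this as an illustration of the formulas rather than the package's source:

```typescript
// Standard retrieval metrics with binary relevance.
function retrievalMetrics(retrieved: string[], relevant: string[], k: number) {
  const topK = retrieved.slice(0, k);
  const relevantSet = new Set(relevant);
  const hits = topK.filter((id) => relevantSet.has(id)).length;

  // MRR: reciprocal rank of the first relevant result (0 if none found).
  const firstHit = topK.findIndex((id) => relevantSet.has(id));
  const mrr = firstHit === -1 ? 0 : 1 / (firstHit + 1);

  // NDCG@k: discounted cumulative gain over the ideal DCG.
  const dcg = topK.reduce(
    (sum, id, i) => sum + (relevantSet.has(id) ? 1 / Math.log2(i + 2) : 0),
    0,
  );
  const idealHits = Math.min(relevant.length, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);

  return {
    precisionAtK: hits / topK.length,
    recallAtK: relevant.length === 0 ? 0 : hits / relevant.length,
    mrr,
    ndcgAtK: idcg === 0 ? 0 : dcg / idcg,
  };
}

const m = retrievalMetrics(
  ['billing-policy', 'blog-post', 'pricing-page'],
  ['billing-policy', 'pricing-page'],
  3,
);
console.log(m); // precision 2/3, recall 1, mrr 1, ndcg ≈ 0.92
```

Intuitively: precision@k penalizes junk in the top results, recall@k penalizes missing relevant documents, MRR rewards putting a relevant document first, and NDCG rewards ranking relevant documents higher overall.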
## Suggested project structure

```text
hazeljs-eval-starter/
  package.json
  tsconfig.json
  src/
    realtime-eval.ts
  eval/
    golden.realtime.json
```
**package.json**

```json
{
  "name": "hazeljs-eval-starter",
  "version": "1.0.0",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "tsx src/realtime-eval.ts"
  },
  "dependencies": {
    "@hazeljs/core": "latest",
    "@hazeljs/eval": "latest"
  },
  "devDependencies": {
    "tsx": "^4.20.0",
    "typescript": "^5.8.0"
  }
}
```
**tsconfig.json**

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "Bundler",
    "strict": true,
    "skipLibCheck": true
  },
  "include": ["src"]
}
```
**eval/golden.realtime.json**

```json
{
  "name": "realtime-support",
  "version": "1.0.0",
  "cases": [
    {
      "id": "billing-01",
      "input": "When will I be charged?",
      "expectedRetrievedIds": ["billing-policy", "pricing-page"]
    },
    {
      "id": "refund-01",
      "input": "Can I get a refund after 20 days?",
      "expectedRetrievedIds": ["refund-policy"]
    }
  ]
}
```
**src/realtime-eval.ts**

```typescript
import { EventEmitter } from 'node:events';
import {
  answerContextOverlap,
  evaluateRetrieval,
  loadGoldenDatasetFromJson,
  reportEvalForCi,
  runGoldenDataset,
} from '@hazeljs/eval';

type LiveInference = {
  output: string;
  retrievedIds: string[];
  contexts: string[];
};

const stream = new EventEmitter();
const rollingWindow: number[] = [];
const MAX_WINDOW = 20;

function pushScore(score: number) {
  rollingWindow.push(score);
  if (rollingWindow.length > MAX_WINDOW) rollingWindow.shift();
}

function windowAverage(): number {
  if (rollingWindow.length === 0) return 0;
  return rollingWindow.reduce((a, b) => a + b, 0) / rollingWindow.length;
}

// Stand-in for your real RAG pipeline; returns canned answers and contexts.
async function fakeRagAnswer(input: string): Promise<LiveInference> {
  if (input.toLowerCase().includes('refund')) {
    return {
      output: 'Yes, refunds are available within 30 days of purchase.',
      retrievedIds: ['refund-policy', 'terms-page'],
      contexts: [
        'Refunds are available within 30 days of purchase.',
        'Terms and conditions apply for enterprise plans.',
      ],
    };
  }
  return {
    output: 'You are charged on the first day of each month.',
    retrievedIds: ['billing-policy', 'pricing-page'],
    contexts: [
      'Billing occurs monthly, on the first day of each month.',
      'Pricing details are listed on the public pricing page.',
    ],
  };
}

async function handleLiveEvent(input: string, relevantIds: string[]) {
  const result = await fakeRagAnswer(input);

  const retrieval = evaluateRetrieval({
    query: input,
    retrievedIds: result.retrievedIds,
    relevantIds,
    k: 5,
  });
  const overlap = answerContextOverlap(result.output, result.contexts);

  const liveScore = (retrieval.ndcgAtK + overlap) / 2;
  pushScore(liveScore);

  console.log('--- live event ---');
  console.log('input:', input);
  console.log('retrievedIds:', result.retrievedIds.join(', '));
  console.log('precision@k:', retrieval.precisionAtK.toFixed(3));
  console.log('recall@k:', retrieval.recallAtK.toFixed(3));
  console.log('mrr:', retrieval.mrr.toFixed(3));
  console.log('ndcg@k:', retrieval.ndcgAtK.toFixed(3));
  console.log('answer overlap:', overlap.toFixed(3));
  console.log('rolling quality score:', windowAverage().toFixed(3));
}

async function runBaselineGoldenEval() {
  const dataset = loadGoldenDatasetFromJson('./eval/golden.realtime.json');
  const evalResult = await runGoldenDataset(
    dataset,
    async ({ input }) => {
      const output = await fakeRagAnswer(input);
      return {
        output: output.output,
        retrievedIds: output.retrievedIds,
      };
    },
    { concurrency: 1, minAverageScore: 0.7 },
  );
  reportEvalForCi(evalResult, { exitOnFail: false });
}

async function main() {
  console.log('Starting baseline golden eval...');
  await runBaselineGoldenEval();

  stream.on('support-query', async (payload: { input: string; relevantIds: string[] }) => {
    await handleLiveEvent(payload.input, payload.relevantIds);
  });

  // Simulate real-time production traffic.
  setInterval(() => {
    const events = [
      { input: 'When will I be charged?', relevantIds: ['billing-policy', 'pricing-page'] },
      { input: 'Can I get a refund after 20 days?', relevantIds: ['refund-policy'] },
    ];
    const next = events[Math.floor(Math.random() * events.length)];
    stream.emit('support-query', next);
  }, 3000);
}

main().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
```
## Run It

```bash
npm install
npm run dev
```
You will see:
- A baseline golden dataset evaluation summary at startup.
- Ongoing real-time event-by-event quality metrics every few seconds.
- A rolling quality score you can alert on in production.
## Real-Life Example: SaaS Support Bot in Production
Here is a practical pattern used by AI teams:
- You run a HazelJS support bot for billing/refund questions.
- Every new production conversation is evaluated right after inference.
- If rolling quality drops below a threshold, you alert Slack and block new rollouts.
- Nightly golden dataset runs enforce a CI quality gate.
Install the extra packages this setup needs:

```bash
npm install @hazeljs/ai @hazeljs/rag
```
**src/real-life-support-eval.ts**

```typescript
import { EventEmitter } from 'node:events';
import { HazelAI } from '@hazeljs/ai';
import {
  answerContextOverlap,
  evaluateRetrieval,
  loadGoldenDatasetFromJson,
  reportEvalForCi,
  runGoldenDataset,
} from '@hazeljs/eval';

type SupportEvent = {
  ticketId: string;
  userQuestion: string;
  relevantDocIds: string[];
};

const liveBus = new EventEmitter();
const rollingScores: number[] = [];
const ROLLING_WINDOW = 50;
const ALERT_THRESHOLD = 0.62;

function pushRollingScore(score: number) {
  rollingScores.push(score);
  if (rollingScores.length > ROLLING_WINDOW) rollingScores.shift();
}

function getRollingAverage() {
  if (rollingScores.length === 0) return 0;
  return rollingScores.reduce((sum, score) => sum + score, 0) / rollingScores.length;
}

async function evaluateLiveSupportEvent(ai: ReturnType<typeof HazelAI.create>, event: SupportEvent) {
  const rag = await ai.rag.ask(event.userQuestion, { topK: 6 });
  const retrievedIds = rag.sources.map((source) => source.id);

  const retrieval = evaluateRetrieval({
    query: event.userQuestion,
    retrievedIds,
    relevantIds: event.relevantDocIds,
    k: 6,
  });

  const overlap = answerContextOverlap(
    rag.answer,
    rag.sources.map((source) => source.text),
  );

  const qualityScore = (retrieval.ndcgAtK + overlap) / 2;
  pushRollingScore(qualityScore);
  const rollingAverage = getRollingAverage();

  console.log(`[ticket=${event.ticketId}] score=${qualityScore.toFixed(3)} rolling=${rollingAverage.toFixed(3)}`);

  if (rollingAverage < ALERT_THRESHOLD) {
    // Replace with a Slack/PagerDuty webhook in production.
    console.warn(`ALERT: rolling support quality dropped below ${ALERT_THRESHOLD}`);
  }
}

async function runNightlyGoldenEval(ai: ReturnType<typeof HazelAI.create>) {
  const dataset = loadGoldenDatasetFromJson('./eval/golden.realtime.json');
  const result = await runGoldenDataset(
    dataset,
    async ({ input }) => {
      const rag = await ai.rag.ask(input, { topK: 6 });
      return {
        output: rag.answer,
        retrievedIds: rag.sources.map((source) => source.id),
      };
    },
    { concurrency: 2, minAverageScore: 0.7 },
  );
  reportEvalForCi(result, { exitOnFail: process.env.CI === 'true' });
}

async function main() {
  const ai = HazelAI.create({
    defaultProvider: 'openai',
    persistence: {
      rag: {
        vectorStore: 'qdrant',
        connectionString: process.env.QDRANT_URL ?? 'http://127.0.0.1:6333',
        indexName: 'support-docs',
      },
    },
  });

  await runNightlyGoldenEval(ai);

  liveBus.on('support_ticket_answered', async (event: SupportEvent) => {
    await evaluateLiveSupportEvent(ai, event);
  });

  // Simulated production events from your message queue / webhook.
  setInterval(() => {
    const sampleEvents: SupportEvent[] = [
      {
        ticketId: 'T-1001',
        userQuestion: 'Am I charged on the first day of the month?',
        relevantDocIds: ['billing-policy', 'pricing-page'],
      },
      {
        ticketId: 'T-1002',
        userQuestion: 'Can I request refund after 20 days?',
        relevantDocIds: ['refund-policy'],
      },
    ];
    const next = sampleEvents[Math.floor(Math.random() * sampleEvents.length)];
    liveBus.emit('support_ticket_answered', next);
  }, 4000);
}

main().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
```
## Environment variables for real integrations

```bash
export OPENAI_API_KEY=your_key_here
export QDRANT_URL=http://127.0.0.1:6333
```
## Why this is "real life"

- Uses live retrieval results from your actual vector index.
- Scores each production response continuously.
- Adds rolling quality alerting for operations.
- Keeps CI quality gates with `reportEvalForCi`.
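In the example above, `console.warn` stands in for a real alert. A hedged sketch of the next step follows: building the alert payload as a pure, testable function. The message shape and the `SLACK_WEBHOOK_URL` variable are illustrative assumptions, not a real integration:

```typescript
// Sketch: build a Slack-style alert payload when rolling quality drops.
// The payload format and webhook wiring are assumptions; swap in your
// real Slack/PagerDuty integration.
type QualityAlert = { text: string } | null;

function buildQualityAlert(rollingAverage: number, threshold: number): QualityAlert {
  if (rollingAverage >= threshold) return null; // quality is healthy, no alert
  return {
    text: `Support-bot quality dropped: rolling average ${rollingAverage.toFixed(3)} < threshold ${threshold}`,
  };
}

// In production you might POST this to a webhook, e.g.:
// const alert = buildQualityAlert(getRollingAverage(), ALERT_THRESHOLD);
// if (alert) {
//   await fetch(process.env.SLACK_WEBHOOK_URL!, {
//     method: 'POST',
//     headers: { 'Content-Type': 'application/json' },
//     body: JSON.stringify(alert),
//   });
// }

console.log(buildQualityAlert(0.55, 0.62)); // prints the alert object
console.log(buildQualityAlert(0.8, 0.62)); // prints null
```

Keeping the payload builder separate from the transport makes the alerting logic unit-testable without mocking HTTP.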
## Why This Pattern Works
- Golden datasets protect against regressions before deploy.
- Real-time metrics monitor quality after deploy.
- CI thresholds let you enforce quality gates automatically.
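The gate logic itself is simple. This sketch shows one plausible way a `minAverageScore` threshold could translate into a CI exit code; it is an assumption about the option's semantics, so check the package docs for the exact behavior of `reportEvalForCi`:

```typescript
// Sketch: a CI quality gate over per-case scores (assumed semantics,
// not the real reportEvalForCi implementation).
function qualityGate(caseScores: number[], minAverageScore: number) {
  const average = caseScores.length === 0
    ? 0
    : caseScores.reduce((a, b) => a + b, 0) / caseScores.length;
  const passed = average >= minAverageScore;
  return { average, passed, exitCode: passed ? 0 : 1 };
}

const gate = qualityGate([0.9, 0.8, 0.6], 0.7);
console.log(gate); // average ≈ 0.767, passed: true, exitCode: 0
// In a CI script you would then set: process.exitCode = gate.exitCode;
```

A non-zero exit code is all CI needs: any pipeline runner (GitHub Actions, GitLab CI, Jenkins) will fail the job automatically.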
The natural next step is to connect this starter to your actual HazelJS app (HazelAI plus a real vector store) so the same evaluator runs on true production responses.