Muhammad Arslan
Building Reliable AI in Node.js with TypeScript

If you are shipping AI features, quality can drift silently. A model update, prompt tweak, or retrieval change can lower answer quality without obvious runtime errors.

@hazeljs/eval gives you a practical way to measure that quality with:

  • Golden datasets
  • Retrieval metrics (precision@k, recall@k, MRR, NDCG)
  • Agent trajectory scoring
  • CI-ready pass/fail reporting

This starter explains how the package works and includes a real-time evaluation example you can run locally.


What @hazeljs/eval Solves

@hazeljs/eval is an evaluation toolkit for HazelJS AI applications. You can use it to:

  1. Define expected behavior as golden test cases.
  2. Run your RAG/agent pipeline against those cases.
  3. Score retrieval and response quality.
  4. Fail CI when quality drops below your threshold.

That makes AI quality regression-testable, not guesswork.


Install

npm install @hazeljs/eval @hazeljs/core

Core APIs (Quick Overview)

  • loadGoldenDatasetFromJson(path): load dataset from JSON.
  • runGoldenDataset(dataset, runner, options): execute all cases and aggregate scores.
  • evaluateRetrieval(...): retrieval quality metrics.
  • answerContextOverlap(...): quick answer-context overlap heuristic.
  • reportEvalForCi(result, { exitOnFail: true }): CI-friendly reporting.
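Before wiring anything up, it helps to see what `evaluateRetrieval` is measuring. Below are the standard information-retrieval definitions of precision@k, recall@k, and reciprocal rank in plain TypeScript. This is an illustrative sketch of the textbook formulas, not the library's internal code:

```typescript
// Standard IR metric definitions, shown for intuition only --
// @hazeljs/eval computes these for you via evaluateRetrieval().

function precisionAtK(retrieved: string[], relevant: string[], k: number): number {
  // Fraction of the top-k retrieved documents that are relevant.
  const topK = retrieved.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.includes(id)).length;
  return hits / topK.length;
}

function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  // Fraction of all relevant documents that appear in the top k.
  if (relevant.length === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.includes(id)).length;
  return hits / relevant.length;
}

function reciprocalRank(retrieved: string[], relevant: string[]): number {
  // 1 / rank of the first relevant hit; MRR averages this across queries.
  const rank = retrieved.findIndex((id) => relevant.includes(id));
  return rank === -1 ? 0 : 1 / (rank + 1);
}

const retrieved = ['refund-policy', 'terms-page', 'pricing-page'];
const relevant = ['refund-policy'];

console.log(precisionAtK(retrieved, relevant, 3)); // 1 hit in the top 3
console.log(recallAtK(retrieved, relevant, 3));    // 1 of 1 relevant found
console.log(reciprocalRank(retrieved, relevant));  // first hit at rank 1
```

NDCG additionally discounts hits by their position, so a relevant document at rank 1 counts more than the same document at rank 5.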

Real-Time Example: Evaluate Live Support Responses

In this example, we simulate incoming support events and score them in real time.
Every incoming event is evaluated for:

  • Retrieval quality (precision@k, recall@k, MRR, NDCG)
  • Answer-context overlap
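Answer-context overlap is a cheap groundedness check: how much of the answer is actually supported by the retrieved contexts? One plausible token-overlap implementation looks like this (illustrative only; the library's actual `answerContextOverlap` scoring may differ):

```typescript
// A simple token-overlap heuristic: the fraction of answer tokens
// that also appear somewhere in the retrieved contexts.
// Illustrative only -- not the library's actual implementation.

function tokenOverlap(answer: string, contexts: string[]): number {
  const tokenize = (text: string) =>
    text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const contextTokens = new Set(contexts.flatMap(tokenize));
  const answerTokens = tokenize(answer);
  if (answerTokens.length === 0) return 0;
  const hits = answerTokens.filter((t) => contextTokens.has(t)).length;
  return hits / answerTokens.length;
}

console.log(
  tokenOverlap('Refunds are available within 30 days.', [
    'Refunds are available within 30 days of purchase.',
  ]),
); // every answer token appears in the context, so the score is 1
```

A low score flags answers that drift away from the retrieved evidence, which often correlates with hallucination.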

Suggested project structure

hazeljs-eval-starter/
  package.json
  tsconfig.json
  src/
    realtime-eval.ts
  eval/
    golden.realtime.json

package.json

{
  "name": "hazeljs-eval-starter",
  "version": "1.0.0",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "tsx src/realtime-eval.ts"
  },
  "dependencies": {
    "@hazeljs/core": "latest",
    "@hazeljs/eval": "latest"
  },
  "devDependencies": {
    "tsx": "^4.20.0",
    "typescript": "^5.8.0"
  }
}

tsconfig.json

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "Bundler",
    "strict": true,
    "skipLibCheck": true
  },
  "include": ["src"]
}

eval/golden.realtime.json

{
  "name": "realtime-support",
  "version": "1.0.0",
  "cases": [
    {
      "id": "billing-01",
      "input": "When will I be charged?",
      "expectedRetrievedIds": ["billing-policy", "pricing-page"]
    },
    {
      "id": "refund-01",
      "input": "Can I get a refund after 20 days?",
      "expectedRetrievedIds": ["refund-policy"]
    }
  ]
}

src/realtime-eval.ts

import { EventEmitter } from 'node:events';
import {
  answerContextOverlap,
  evaluateRetrieval,
  loadGoldenDatasetFromJson,
  reportEvalForCi,
  runGoldenDataset,
} from '@hazeljs/eval';

type LiveInference = {
  output: string;
  retrievedIds: string[];
  contexts: string[];
};

const stream = new EventEmitter();
const rollingWindow: number[] = [];
const MAX_WINDOW = 20;

function pushScore(score: number) {
  rollingWindow.push(score);
  if (rollingWindow.length > MAX_WINDOW) rollingWindow.shift();
}

function windowAverage(): number {
  if (rollingWindow.length === 0) return 0;
  return rollingWindow.reduce((a, b) => a + b, 0) / rollingWindow.length;
}

// Simulated RAG pipeline: swap in your real retrieval + generation.
async function fakeRagAnswer(input: string): Promise<LiveInference> {
  if (input.toLowerCase().includes('refund')) {
    return {
      output: 'Yes, refunds are available within 30 days of purchase.',
      retrievedIds: ['refund-policy', 'terms-page'],
      contexts: [
        'Refunds are available within 30 days of purchase.',
        'Terms and conditions apply for enterprise plans.'
      ]
    };
  }

  return {
    output: 'You are charged on the first day of each month.',
    retrievedIds: ['billing-policy', 'pricing-page'],
    contexts: [
      'Billing occurs monthly, on the first day of each month.',
      'Pricing details are listed on the public pricing page.'
    ]
  };
}

async function handleLiveEvent(input: string, relevantIds: string[]) {
  const result = await fakeRagAnswer(input);

  const retrieval = evaluateRetrieval({
    query: input,
    retrievedIds: result.retrievedIds,
    relevantIds,
    k: 5,
  });

  const overlap = answerContextOverlap(result.output, result.contexts);
  const liveScore = (retrieval.ndcgAtK + overlap) / 2;
  pushScore(liveScore);

  console.log('--- live event ---');
  console.log('input:', input);
  console.log('retrievedIds:', result.retrievedIds.join(', '));
  console.log('precision@k:', retrieval.precisionAtK.toFixed(3));
  console.log('recall@k:', retrieval.recallAtK.toFixed(3));
  console.log('mrr:', retrieval.mrr.toFixed(3));
  console.log('ndcg@k:', retrieval.ndcgAtK.toFixed(3));
  console.log('answer overlap:', overlap.toFixed(3));
  console.log('rolling quality score:', windowAverage().toFixed(3));
}

async function runBaselineGoldenEval() {
  const dataset = loadGoldenDatasetFromJson('./eval/golden.realtime.json');
  const evalResult = await runGoldenDataset(
    dataset,
    async ({ input }) => {
      const inference = await fakeRagAnswer(input);
      return {
        output: inference.output,
        retrievedIds: inference.retrievedIds,
      };
    },
    { concurrency: 1, minAverageScore: 0.7 }
  );

  reportEvalForCi(evalResult, { exitOnFail: false });
}

async function main() {
  console.log('Starting baseline golden eval...');
  await runBaselineGoldenEval();

  stream.on('support-query', async (payload: { input: string; relevantIds: string[] }) => {
    await handleLiveEvent(payload.input, payload.relevantIds);
  });

  // Simulate real-time production traffic.
  setInterval(() => {
    const events = [
      { input: 'When will I be charged?', relevantIds: ['billing-policy', 'pricing-page'] },
      { input: 'Can I get a refund after 20 days?', relevantIds: ['refund-policy'] },
    ];
    const next = events[Math.floor(Math.random() * events.length)];
    stream.emit('support-query', next);
  }, 3000);
}

main().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});

Run It

npm install
npm run dev

You will see:

  • A baseline golden dataset evaluation summary at startup.
  • Ongoing real-time event-by-event quality metrics every few seconds.
  • A rolling quality score you can alert on in production.
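The example uses a fixed-size rolling window. An exponentially weighted moving average (EWMA) is a common alternative that reacts faster to recent drops and needs O(1) memory. This is a sketch you could swap in for `pushScore`/`windowAverage`; the `alpha` value is illustrative and should be tuned to how quickly you want alerts to fire:

```typescript
// EWMA alternative to the fixed rolling window: each new score is blended
// with the running average, weighting recent scores by alpha.

class EwmaQuality {
  private average: number | null = null;
  constructor(private readonly alpha = 0.2) {}

  push(score: number): number {
    this.average =
      this.average === null
        ? score
        : this.alpha * score + (1 - this.alpha) * this.average;
    return this.average;
  }
}

const ewma = new EwmaQuality(0.5);
ewma.push(1.0);
console.log(ewma.push(0.0)); // 0.5 * 0 + 0.5 * 1 = 0.5
```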

Real-Life Example: SaaS Support Bot in Production

Here is a practical pattern used by AI teams:

  • You run a HazelJS support bot for billing/refund questions.
  • Every new production conversation is evaluated right after inference.
  • If rolling quality drops below a threshold, you alert Slack and block new rollouts.
  • Nightly golden dataset runs enforce a CI quality gate.
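The Slack alert in that list can be a plain incoming-webhook POST. A minimal sketch, assuming a `SLACK_WEBHOOK_URL` environment variable (hypothetical name) pointing at a Slack incoming webhook, and Node 18+ for the global `fetch`:

```typescript
// Hypothetical alert hook: posts to a Slack incoming webhook when the
// rolling quality average crosses the threshold. SLACK_WEBHOOK_URL is an
// assumed env var name -- set it to your own incoming-webhook URL.

function formatQualityAlert(rollingAverage: number, threshold: number): string {
  return `Support bot quality alert: rolling average ${rollingAverage.toFixed(3)} fell below ${threshold}`;
}

async function alertSlack(rollingAverage: number, threshold: number): Promise<void> {
  const webhookUrl = process.env.SLACK_WEBHOOK_URL;
  if (!webhookUrl) return; // no webhook configured; skip silently

  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: formatQualityAlert(rollingAverage, threshold) }),
  });
}
```

You would call `alertSlack(rollingAverage, ALERT_THRESHOLD)` where the example below logs its `console.warn`.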

Install the extra packages for this setup

npm install @hazeljs/ai @hazeljs/rag

src/real-life-support-eval.ts

import { EventEmitter } from 'node:events';
import { HazelAI } from '@hazeljs/ai';
import {
  answerContextOverlap,
  evaluateRetrieval,
  loadGoldenDatasetFromJson,
  reportEvalForCi,
  runGoldenDataset,
} from '@hazeljs/eval';

type SupportEvent = {
  ticketId: string;
  userQuestion: string;
  relevantDocIds: string[];
};

const liveBus = new EventEmitter();
const rollingScores: number[] = [];
const ROLLING_WINDOW = 50;
const ALERT_THRESHOLD = 0.62;

function pushRollingScore(score: number) {
  rollingScores.push(score);
  if (rollingScores.length > ROLLING_WINDOW) rollingScores.shift();
}

function getRollingAverage() {
  if (rollingScores.length === 0) return 0;
  return rollingScores.reduce((sum, score) => sum + score, 0) / rollingScores.length;
}

async function evaluateLiveSupportEvent(ai: ReturnType<typeof HazelAI.create>, event: SupportEvent) {
  const rag = await ai.rag.ask(event.userQuestion, { topK: 6 });
  const retrievedIds = rag.sources.map((source) => source.id);

  const retrieval = evaluateRetrieval({
    query: event.userQuestion,
    retrievedIds,
    relevantIds: event.relevantDocIds,
    k: 6,
  });

  const overlap = answerContextOverlap(
    rag.answer,
    rag.sources.map((source) => source.text),
  );

  const qualityScore = (retrieval.ndcgAtK + overlap) / 2;
  pushRollingScore(qualityScore);

  const rollingAverage = getRollingAverage();
  console.log(`[ticket=${event.ticketId}] score=${qualityScore.toFixed(3)} rolling=${rollingAverage.toFixed(3)}`);

  if (rollingAverage < ALERT_THRESHOLD) {
    // Replace with Slack/PagerDuty webhook in production.
    console.warn(`ALERT: rolling support quality dropped below ${ALERT_THRESHOLD}`);
  }
}

async function runNightlyGoldenEval(ai: ReturnType<typeof HazelAI.create>) {
  const dataset = loadGoldenDatasetFromJson('./eval/golden.realtime.json');

  const result = await runGoldenDataset(
    dataset,
    async ({ input }) => {
      const rag = await ai.rag.ask(input, { topK: 6 });
      return {
        output: rag.answer,
        retrievedIds: rag.sources.map((source) => source.id),
      };
    },
    { concurrency: 2, minAverageScore: 0.7 },
  );

  reportEvalForCi(result, { exitOnFail: process.env.CI === 'true' });
}

async function main() {
  const ai = HazelAI.create({
    defaultProvider: 'openai',
    persistence: {
      rag: {
        vectorStore: 'qdrant',
        connectionString: process.env.QDRANT_URL ?? 'http://127.0.0.1:6333',
        indexName: 'support-docs',
      },
    },
  });

  await runNightlyGoldenEval(ai);

  liveBus.on('support_ticket_answered', async (event: SupportEvent) => {
    await evaluateLiveSupportEvent(ai, event);
  });

  // Simulated production events from your message queue / webhook.
  setInterval(() => {
    const sampleEvents: SupportEvent[] = [
      {
        ticketId: 'T-1001',
        userQuestion: 'Am I charged on the first day of the month?',
        relevantDocIds: ['billing-policy', 'pricing-page'],
      },
      {
        ticketId: 'T-1002',
        userQuestion: 'Can I request a refund after 20 days?',
        relevantDocIds: ['refund-policy'],
      },
    ];
    const next = sampleEvents[Math.floor(Math.random() * sampleEvents.length)];
    liveBus.emit('support_ticket_answered', next);
  }, 4000);
}

main().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});

Environment variables for real integrations

export OPENAI_API_KEY=your_key_here
export QDRANT_URL=http://127.0.0.1:6333

Why this is "real life"

  • Uses live retrieval results from your actual vector index.
  • Scores each production response continuously.
  • Adds rolling quality alerting for operations.
  • Keeps CI quality gates with reportEvalForCi.

Why This Pattern Works

  • Golden datasets protect against regressions before deploy.
  • Real-time metrics monitor quality after deploy.
  • CI thresholds let you enforce quality gates automatically.

From here, the next step is to connect this starter to your actual HazelJS app (HazelAI plus a real vector store) so the same evaluator runs on true production responses.
