Priyanshu S

Simplifying AI Testing with Evaliphy

I have been a QA consultant for more than a decade. I test software for a living. APIs, UIs, integrations: if they break, I have probably written a test for them at some point.

A few months ago, I started building something called Evaliphy, an SDK for AI testing.

Not because I set out to build a tool. But because I spotted a gap.

The context

For the last few months, I have been diving into the AI space to stay up to date. At the same time, the teams I was working with started shipping AI features. RAG was the first feature we delivered, about a year back, and we casually shipped it through vibe-testing.

A few months later, the same RAG feature, which we had shipped as a beta, was ready to scale and go live for all customers.

The casual vibe-testing we did a year back was no longer an effective way to test, and honestly, we did not know how to test non-deterministic systems.

The only option we had at that point was to learn AI evaluation. So that is what we did.

We explored tools like DeepEval, Ragas, and Promptfoo.

All of them are established, powerful tools. We decided on Promptfoo and built an evaluation suite with it.


The exploration phase

These tools are genuinely impressive. The people who built them clearly understand AI evaluation deeply. The metrics they expose — faithfulness scores, retrieval precision, context recall — are grounded in real ML research.

But at some point, I stopped and asked myself an honest question.

Who is this actually built for?

Because it was not built for me. Not for the version of me that spends most of my working life writing Playwright tests, treating systems as black boxes, and validating what users actually experience.

I kept hitting the same wall. To use these tools properly, I needed access to retrieval pipeline internals. I needed to understand embedding similarity. I needed to construct test cases from the internals of the system, supplying the retrieved context myself, which I, as a black-box QA engineer, simply did not want to do.
[Note: this is not a skill gap. It is a testing risk: we end up testing the system from the LLM's perspective while neglecting the end-to-end tests that ultimately matter more to end users.]


The gap nobody was talking about

Here is what I noticed. Every article, every tutorial, every conference talk about AI evaluation was aimed at ML engineers and developers who built the AI system. The assumption was always that the person doing the evaluation had built the thing, understood the internals, and could instrument the pipeline.

But that is not how QA works.

In every team I have ever worked with, QA engineers own end-to-end black box testing. They test the behaviour the user sees. They do not need to understand how the database query optimizer works to test whether search results are correct. They do not need to understand OAuth internals to test whether login works.

That separation exists for good reasons. Independent testing means independent perspective. A QA engineer who did not build the system will catch things the developer never will, precisely because they are not anchored to how it is supposed to work internally.

But for AI features, that separation had completely broken down. The only people testing AI quality were the people who built it. Using tools that required them to build it.

I found that genuinely concerning.


What I was trying to build

I did not want to replace DeepEval or Ragas. Those tools solve real problems for the right audience. What I wanted was something that felt like Playwright — familiar, approachable, built around the workflows QA engineers already know.

The core idea was simple. A QA engineer should be able to:

  • Point the tool at an HTTP endpoint
  • Write an assertion in plain English
  • Get a pass or fail result

No Python environment. No pipeline access. No ML concepts to learn before writing the first test. Just an endpoint and an assertion.

I wanted the entry point to be npm install and the first test to take fifteen minutes to write, not three days.

And that is how I built Evaliphy.


What Evaliphy looks like today

After weeks of building, iterating, breaking things, and fixing them, Evaliphy is in beta.

Here is what a basic evaluation looks like:

import { evaluate, expect } from 'evaliphy';

evaluate('return policy check', async ({ httpClient }) => {
  // Hit your real API, exactly as a user would.
  const res = await httpClient.post('/api/chat', {
    message: 'What is your return policy?'
  });

  const { answer, context } = await res.json();

  const input = {
    query: 'What is your return policy?',
    response: answer,
    context,
  };

  // Is the answer grounded in the retrieved context?
  await expect(input).toBeFaithful();

  // Does the answer actually address the question?
  await expect(input).toBeRelevant();
});

That is it. You point it at your API. You write assertions. You run it.

No Python. No pipeline access. No ML background required. If you have written a Playwright test before, this will feel familiar within minutes.

The built-in judge handles the LLM evaluation underneath. The prompts are shipped with the SDK. You configure your API key and base URL, and everything else is taken care of.
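If you are curious what "the built-in judge" means in practice, the general pattern is LLM-as-judge: build a prompt that asks a model to grade the response, then map its reply to a verdict. Here is a simplified sketch of that idea. The function names and prompt wording below are illustrative only, not the exact prompts or internals shipped with the SDK:

```typescript
// Illustrative sketch of the LLM-as-judge pattern behind an assertion
// like toBeFaithful(). Names and prompt wording here are hypothetical.

interface JudgeInput {
  query: string;
  response: string;
  context: string[];
}

// Assemble a judge prompt asking whether the response is grounded
// in the retrieved context.
function buildFaithfulnessPrompt({ query, response, context }: JudgeInput): string {
  return [
    'You are an evaluation judge.',
    'Decide whether the RESPONSE is fully supported by the CONTEXT.',
    `QUERY: ${query}`,
    `CONTEXT:\n${context.join('\n')}`,
    `RESPONSE: ${response}`,
    'Answer with exactly PASS or FAIL on the first line, then a short reason.',
  ].join('\n\n');
}

// Turn the judge model's raw reply into a pass/fail verdict.
function parseVerdict(raw: string): { pass: boolean; reason: string } {
  const lines = raw.trim().split('\n');
  const pass = (lines[0] ?? '').trim().toUpperCase().startsWith('PASS');
  return { pass, reason: lines.slice(1).join('\n').trim() };
}
```

The SDK wraps this loop behind the assertion API, so as a test author you never have to see or maintain the judge prompt yourself.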


The honest limitations

I want to be clear about something because I think honesty matters more than marketing.

Evaliphy is not a replacement for deep AI evaluation. If you are an ML engineer who wants to measure retrieval precision, chunk quality, and embedding similarity — DeepEval and Ragas are still the right tools for that job.

What Evaliphy gives you is a starting point. Meaningful coverage today, from the QA engineer's perspective, without the prerequisite of ML expertise.

Does that mean you never need to understand the internals? No. Deep knowledge always improves testing. It does for every system. But should that knowledge be the entry ticket before a QA engineer can write their first test? I do not think so.

We do not ask QA engineers to understand database internals before testing search functionality. We should not ask them to understand retrieval pipelines before testing AI responses.


Where it goes from here

Evaliphy is open source and completely free. It is in beta, which means the API will change, there will be rough edges, and your feedback will directly shape what it becomes.

The vision is bigger than RAG. RAG is where the pain is most acute right now but the same problem exists across AI systems — there is no standard black box E2E testing tool for any of them. That is what I want to build toward.

If you are a QA engineer who has been handed an AI feature and is not sure where to start — Evaliphy is for you.

npm install evaliphy
evaliphy init my-first-eval

And if you try it and something does not work, open an issue. If an assertion you need does not exist, tell me. If the whole approach feels wrong, I want to know that too.


Evaliphy is open source and in beta. You can find it at evaliphy.com or install it directly with npm install evaliphy. Honest feedback is more valuable than stars right now.
