
Derf

Testing AI agents before users do

Site: https://test.qlankr.com

A lot of AI testing still feels too dependent on gut feeling.

You run an agent, chatbot, or RAG workflow, tweak a prompt, change a tool, try again, and then ask yourself:

Did this actually get better, or does it just feel different?

That was the starting point for QLANKR Test.

I built it because I wanted a faster and more structured way to test AI systems before users do.

The problem

A lot of builders are shipping:

  • AI agents
  • chatbots
  • RAG systems
  • tool-calling workflows

But the evaluation loop is often messy.

It is easy to demo something.
It is harder to inspect quality clearly, compare runs over time, and understand where a system breaks down.

What QLANKR Test does

QLANKR Test lets you run an evaluation and get:

  • a structured report
  • a QI score
  • clearer signals on what feels weak, inconsistent, or unreliable

The goal is not to replace human judgment.

The goal is to make AI evaluation more structured, repeatable, and easier to inspect.
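To make "structured report" and "score" a bit more concrete, here is a rough sketch of the shape such an artifact could take. This is a simplified illustration, not the actual report schema or how the QI score is computed:

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CaseResult:
    # One evaluated test case: the input, what the system produced,
    # and a 0-1 quality judgment (human- or model-graded).
    prompt: str
    output: str
    score: float
    notes: str = ""


@dataclass
class EvalReport:
    # A structured, comparable artifact instead of a one-off impression.
    run_id: str
    cases: list[CaseResult] = field(default_factory=list)

    def aggregate_score(self) -> float:
        # Plain average as a stand-in for an overall quality score.
        return mean(c.score for c in self.cases) if self.cases else 0.0

    def weakest(self, n: int = 3) -> list[CaseResult]:
        # Surface the lowest-scoring cases so "what feels weak" becomes inspectable.
        return sorted(self.cases, key=lambda c: c.score)[:n]
```

The point of the shape is that two reports from different runs can be compared side by side, which is exactly what a one-off demo cannot give you.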

What I wanted to improve

The main thing I wanted to avoid was “vibe-based testing”.

That feeling where you:

  • try a few prompts
  • get a decent answer once
  • assume the system is good enough
  • then discover later that it breaks in real usage

I wanted something that helps create a better feedback loop.
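A minimal sketch of what I mean by a feedback loop, with `run_agent` and `grade` as stand-ins for whatever system and scorer you actually use (neither is a real library call):

```python
# Fixed test cases: the same inputs every run, so results are comparable.
TEST_SET = [
    {"prompt": "Cancel order #123", "expected": "confirms the cancellation"},
    {"prompt": "What is your refund policy?", "expected": "cites the policy"},
]


def evaluate(run_agent, grade):
    # Run every case through the system and score each output.
    scores = []
    for case in TEST_SET:
        output = run_agent(case["prompt"])
        scores.append(grade(output, case["expected"]))
    # Return an aggregate you can track across runs.
    return sum(scores) / len(scores)

# Re-run after every prompt or tool change and compare the number to the
# previous run, instead of relying on how one answer felt.
```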

What I am still figuring out

The big questions for me right now are:

  • does the report feel genuinely useful?
  • does the score make sense?
  • what is still missing for real-world AI testing?

If you work on AI products, agents, or evaluation workflows, I would genuinely love feedback.

Site: https://test.qlankr.com

Top comments (1)

Alexandru A

The "vibe-based testing" problem is real and I think it's the same root issue we see in traditional QA — evaluating outputs without a structured baseline makes it impossible to know if a change was actually an improvement or just different.

One thing I've been exploring is how much input quality drives output quality in AI-assisted testing. The pattern holds whether you're generating test cases from acceptance criteria or evaluating agent responses: garbage in, garbage out, just wrapped in a structured report.

Curious about the QI score — how do you handle consistency across runs where the model output is non-deterministic? That's the part that tends to break structured evaluation in practice.