Load testing and LLM observability are two separate categories of tools, and as far as I can tell, nobody has combined them.
So I built something that does. It's called QueryScope.
The problem
k6, JMeter, and Locust are great tools. They fire requests, measure latency, and produce a report. But the report just tells you what happened. p99 spiked. Error rate went up. It doesn't tell you why. LangSmith and Langfuse are also great. But they monitor AI apps passively. They don't run load tests.
If you want to benchmark an endpoint AND ask "why did tail latency get worse after my last deploy?", you're stitching together multiple tools manually.
You are still the workflow engine. And that was the part that bothered me.
What QueryScope does
You point QueryScope at any REST or LLM endpoint, configure the requests and concurrency, and get real p50/p95/p99 latency percentiles, throughput, and error rate in a live dashboard.
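To make the stats concrete, here's a minimal sketch of how raw per-request results reduce to those dashboard numbers. This is illustrative pure Python, not QueryScope's actual runner, and the function and field names are mine:

```python
import statistics

def summarize(latencies_ms: list[float], errors: int, duration_s: float) -> dict:
    """Reduce raw per-request latencies into dashboard stats.

    latencies_ms: round-trip time of each request in milliseconds
    errors: count of failed requests
    duration_s: wall-clock time of the whole run
    """
    # quantiles(n=100) returns the 99 cut points p1..p99 (0-indexed: q[48] is p49, etc.)
    q = statistics.quantiles(sorted(latencies_ms), n=100)
    total = len(latencies_ms)
    return {
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "throughput_rps": total / duration_s,
        "error_rate": errors / total,
    }

# A run where 5 of 100 requests hit a 580ms tail:
stats = summarize([12.0] * 95 + [580.0] * 5, errors=1, duration_s=2.0)
# p99 lands on the 580ms outliers even though the median stays at 12ms
```

The point of p99 over averages: those five slow requests barely move the mean, but they dominate the tail percentile.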
Now that's just the load testing layer. Here's the interesting part:
Every completed run gets embedded via OpenAI text-embedding-3-small and indexed into Azure AI Search via LlamaIndex. When you ask, "explain the latest benchmark on the Y Combinator page", a LangChain LCEL retrieval chain finds semantically similar historical runs, injects your 5 most recent runs from Postgres as ground truth, and GPT-4o-mini generates a grounded diagnosis.
Not a hallucination. An explanation grounded in your actual benchmark data.
The MCP server
This is the component I'm most proud of. I created a Node.js MCP server that exposes two tools: run_benchmark and query_runs. Users can connect it to Claude Desktop and prompt it, "benchmark this endpoint with 50 requests at concurrency 5", and Claude will call the tool, fire the actual HTTP requests, and analyze the results autonomously.
No UI needed. This is agentic behavior in the real sense: Claude is driving the execution, not just answering a question.
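For context, an MCP server advertises its tools to the client as JSON Schema declarations (name, description, inputSchema, per the MCP spec). Here's a hedged sketch of what the two tools might declare; the actual Node.js server's parameter names and defaults may differ:

```python
# Illustrative MCP tool declarations. The real QueryScope server is Node.js;
# the parameter names below are my guesses, not its actual interface.
TOOLS = [
    {
        "name": "run_benchmark",
        "description": "Fire HTTP requests at an endpoint and return latency stats.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
                "requests": {"type": "integer", "default": 50},
                "concurrency": {"type": "integer", "default": 5},
            },
            "required": ["url"],
        },
    },
    {
        "name": "query_runs",
        "description": "Ask a natural-language question about past benchmark runs.",
        "inputSchema": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
]
```

When Claude sees "benchmark this endpoint with 50 requests at concurrency 5", it maps the prompt onto run_benchmark's schema and calls the tool itself.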
How the RAG pipeline works end to end
This is the piece I spent the most time on:
Indexing - after every benchmark completes, the indexer builds a plain-text summary:
"Benchmark run {id} against {url} ({method}) with {n} requests. p50={p50}ms p95={p95}ms p99={p99}ms throughput={tps}req/s error_rate={err}"
Embedding models are trained on natural language, so I chose plain text over JSON: "p99 spiked to 582ms" carries more semantic signal than {"p99": 582}. That summary gets embedded and upserted into Azure AI Search via LlamaIndex.
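The template above, as a tiny sketch. The field names in the run dict are hypothetical, not the project's actual schema:

```python
def run_summary(run: dict) -> str:
    """Render a benchmark run as the plain-text line that gets embedded."""
    return (
        f"Benchmark run {run['id']} against {run['url']} ({run['method']}) "
        f"with {run['n']} requests. "
        f"p50={run['p50']}ms p95={run['p95']}ms p99={run['p99']}ms "
        f"throughput={run['tps']}req/s error_rate={run['err']}"
    )

text = run_summary({
    "id": 42, "url": "https://api.example.com/v1/chat", "method": "POST",
    "n": 50, "p50": 120, "p95": 340, "p99": 582, "tps": 9.8, "err": 0.02,
})
# → "Benchmark run 42 against https://api.example.com/v1/chat (POST) with 50 requests. ..."
```

One string per run keeps the index simple: the embedding model sees a sentence it was trained to understand, and the run id survives for lookup.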
Retrieval - when you ask a question, two things happen in parallel:
- The question gets embedded and Azure AI Search does a vector similarity search, returning the top 5 semantically relevant runs
- The 5 most recent runs get fetched directly from Postgres as ground truth
Both get injected into a LangChain LCEL prompt alongside your question. GPT-4o-mini generates the diagnosis grounded in both sources: semantic relevance AND recency.
The recency injection was a fix I had to add: pure vector search doesn't understand queries like "my last two runs"; it finds semantically similar runs regardless of time. Injecting the most recent runs straight from Postgres solved that.
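In QueryScope the LCEL chain wires the two sources together; here's the merge step in plain stdlib Python to show the idea (function and field names are mine, not the project's):

```python
def build_context(semantic_hits: list[dict], recent_runs: list[dict]) -> str:
    """Merge vector-search hits with the latest Postgres rows into one
    prompt context, deduplicating by run id and labeling each source so
    the model can weigh semantic relevance against recency."""
    seen: set[int] = set()
    lines: list[str] = []
    # Recent runs go first: they are ground truth for "my last N runs" questions.
    for label, runs in (("RECENT", recent_runs), ("SIMILAR", semantic_hits)):
        for run in runs:
            if run["id"] in seen:
                continue  # a run can appear in both sources; keep one copy
            seen.add(run["id"])
            lines.append(f"[{label}] run {run['id']}: p99={run['p99']}ms at {run['ts']}")
    return "\n".join(lines)

ctx = build_context(
    semantic_hits=[{"id": 7, "p99": 582, "ts": "2024-06-01"}],
    recent_runs=[
        {"id": 9, "p99": 130, "ts": "2024-06-03"},
        {"id": 7, "p99": 582, "ts": "2024-06-01"},  # also a semantic hit; kept once
    ],
)
```

The labels matter: they let the prompt tell GPT-4o-mini which runs are there because they're recent and which because they're semantically close to the question.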
The full stack
- FastAPI + async SQLAlchemy + asyncpg → benchmark runner and REST API
- LlamaIndex + Azure AI Search → indexing and vector retrieval
- LangChain LCEL → RCA chain with Postgres context injection
- React + Recharts → live polling dashboard
- Node.js + @modelcontextprotocol/sdk → MCP server
- Docker Compose → one-command local setup
- Kubernetes with HPA → scales benchmark workers under load
- MySQL adapter → sa.JSON replaces Postgres-native ARRAY for cross-DB compatibility
Try it
Self-hostable, open source, runs with one command.
Full demo walkthrough: https://www.loom.com/share/aa0458b3b73849f4b8c731217b443b6f
GitHub: https://github.com/kavishkartha05/QueryScope
Happy to hear feedback or answer questions about the RAG pipeline, MCP integration, or anything else in the comments.