Load testing and LLM observability are two separate categories of tools, and as far as I can tell, nobody has combined them.
So I built something that does. It's called QueryScope.
The problem
k6, JMeter, and Locust are great tools. They fire requests, measure latency, and produce a report. But the report just tells you what happened. p99 spiked. Error rate went up. It doesn't tell you why. LangSmith and Langfuse are also great. But they monitor AI apps passively. They don't run load tests.
If you want to benchmark an endpoint AND ask "why did tail latency get worse after my last deploy?", you're stitching together multiple tools manually.
You are still the workflow engine. And that was the part that bothered me.
What QueryScope does
You point QueryScope at any REST or LLM endpoint, configure the requests and concurrency, and get real p50/p95/p99 latency percentiles, throughput, and error rate in a live dashboard.
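To make the stats concrete, here's a minimal sketch of how raw per-request results reduce to those dashboard numbers. This is illustrative pure Python, not QueryScope's actual runner, and the function and field names are mine:

```python
import statistics

def summarize(latencies_ms: list[float], errors: int, duration_s: float) -> dict:
    """Reduce raw per-request latencies into dashboard stats.

    latencies_ms: round-trip time of each request in milliseconds
    errors: count of failed requests
    duration_s: wall-clock time of the whole run
    """
    # quantiles(n=100) returns the 99 cut points p1..p99 (0-indexed: q[48] is p49, etc.)
    q = statistics.quantiles(sorted(latencies_ms), n=100)
    total = len(latencies_ms)
    return {
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "throughput_rps": total / duration_s,
        "error_rate": errors / total,
    }

# A run where 5 of 100 requests hit a 580ms tail:
stats = summarize([12.0] * 95 + [580.0] * 5, errors=1, duration_s=2.0)
# p99 lands on the 580ms outliers even though the median stays at 12ms
```

The point of p99 over averages: those five slow requests barely move the mean, but they dominate the tail percentile.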
Now that's just the load testing layer. Here's the interesting part:
Every completed run gets embedded via OpenAI text-embedding-3-small and indexed into Azure AI Search via LlamaIndex. When you ask, "explain the latest benchmark on the Y Combinator page", a LangChain LCEL retrieval chain finds semantically similar historical runs, injects your 5 most recent runs from Postgres as ground truth, and GPT-4o-mini generates a grounded diagnosis.
Not a hallucination. An explanation grounded in your actual benchmark data.
The MCP server
This is the component I'm most proud of. I created a Node.js MCP server that exposes two tools: run_benchmark and query_runs. Users can connect it to Claude Desktop and prompt it, "benchmark this endpoint with 50 requests at concurrency 5", and Claude will call the tool, fire the actual HTTP requests, and analyze the results autonomously.
No UI needed. This is agentic behavior in the real sense: Claude is driving the execution, not just answering a question.
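For context, an MCP server advertises its tools to the client as JSON Schema declarations (name, description, inputSchema, per the MCP spec). Here's a hedged sketch of what the two tools might declare; the actual Node.js server's parameter names and defaults may differ:

```python
# Illustrative MCP tool declarations. The real QueryScope server is Node.js;
# the parameter names below are my guesses, not its actual interface.
TOOLS = [
    {
        "name": "run_benchmark",
        "description": "Fire HTTP requests at an endpoint and return latency stats.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
                "requests": {"type": "integer", "default": 50},
                "concurrency": {"type": "integer", "default": 5},
            },
            "required": ["url"],
        },
    },
    {
        "name": "query_runs",
        "description": "Ask a natural-language question about past benchmark runs.",
        "inputSchema": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
]
```

When Claude sees "benchmark this endpoint with 50 requests at concurrency 5", it maps the prompt onto run_benchmark's schema and calls the tool itself.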
How the RAG pipeline works end to end
This is the piece I spent the most time on:
Indexing - after every benchmark completes, the indexer builds a plain-text summary:
"Benchmark run {id} against {url} ({method}) with {n} requests. p50={p50}ms p95={p95}ms p99={p99}ms throughput={tps}req/s error_rate={err}"
Embedding models are trained on natural language, so I chose plain text over JSON: "p99 spiked to 582ms" carries more semantic signal than {"p99": 582}. That summary gets embedded and upserted into Azure AI Search via LlamaIndex.
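The template above, as a tiny sketch. The field names in the run dict are hypothetical, not the project's actual schema:

```python
def run_summary(run: dict) -> str:
    """Render a benchmark run as the plain-text line that gets embedded."""
    return (
        f"Benchmark run {run['id']} against {run['url']} ({run['method']}) "
        f"with {run['n']} requests. "
        f"p50={run['p50']}ms p95={run['p95']}ms p99={run['p99']}ms "
        f"throughput={run['tps']}req/s error_rate={run['err']}"
    )

text = run_summary({
    "id": 42, "url": "https://api.example.com/v1/chat", "method": "POST",
    "n": 50, "p50": 120, "p95": 340, "p99": 582, "tps": 9.8, "err": 0.02,
})
# → "Benchmark run 42 against https://api.example.com/v1/chat (POST) with 50 requests. ..."
```

One string per run keeps the index simple: the embedding model sees a sentence it was trained to understand, and the run id survives for lookup.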
Retrieval - when you ask a question, two things happen in parallel:
- The question gets embedded and Azure AI Search does a vector similarity search, returning the top 5 semantically relevant runs
- The 5 most recent runs get fetched directly from Postgres as ground truth
Both get injected into a LangChain LCEL prompt alongside your question. GPT-4o-mini generates the diagnosis grounded in both sources: semantic relevance AND recency.
The recency injection was a fix I had to add: pure vector search doesn't understand queries like "my last two runs"; it finds semantically similar runs regardless of time. Injecting the most recent runs straight from Postgres solved that.
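In QueryScope the LCEL chain wires the two sources together; here's the merge step in plain stdlib Python to show the idea (function and field names are mine, not the project's):

```python
def build_context(semantic_hits: list[dict], recent_runs: list[dict]) -> str:
    """Merge vector-search hits with the latest Postgres rows into one
    prompt context, deduplicating by run id and labeling each source so
    the model can weigh semantic relevance against recency."""
    seen: set[int] = set()
    lines: list[str] = []
    # Recent runs go first: they are ground truth for "my last N runs" questions.
    for label, runs in (("RECENT", recent_runs), ("SIMILAR", semantic_hits)):
        for run in runs:
            if run["id"] in seen:
                continue  # a run can appear in both sources; keep one copy
            seen.add(run["id"])
            lines.append(f"[{label}] run {run['id']}: p99={run['p99']}ms at {run['ts']}")
    return "\n".join(lines)

ctx = build_context(
    semantic_hits=[{"id": 7, "p99": 582, "ts": "2024-06-01"}],
    recent_runs=[
        {"id": 9, "p99": 130, "ts": "2024-06-03"},
        {"id": 7, "p99": 582, "ts": "2024-06-01"},  # also a semantic hit; kept once
    ],
)
```

The labels matter: they let the prompt tell GPT-4o-mini which runs are there because they're recent and which because they're semantically close to the question.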
The full stack
- FastAPI + async SQLAlchemy + asyncpg → benchmark runner and REST API
- LlamaIndex + Azure AI Search → indexing and vector retrieval
- LangChain LCEL → RCA chain with Postgres context injection
- React + Recharts → live polling dashboard
- Node.js + @modelcontextprotocol/sdk → MCP server
- Docker Compose → one-command local setup
- Kubernetes with HPA → scales benchmark workers under load
- MySQL adapter → sa.JSON replaces Postgres-native ARRAY for cross-DB compatibility
Try it
Self-hostable, open source, runs with one command.
Full demo walkthrough: https://www.loom.com/share/aa0458b3b73849f4b8c731217b443b6f
GitHub: https://github.com/kavishkartha05/QueryScope
Happy to hear feedback or answer questions about the RAG pipeline, MCP integration, or anything else in the comments.