How a weekend experiment turned into a full agentic architecture, and why I’m really proud of it, because the future of architecture is where my secret sauce is.
🔗 GitHub: github.com/aiqualitylab/ai-natural-language-tests
🌐 Website: tests.aiqualitylab.org
One weekend, I wrote a script. Give it a sentence and a URL. It calls an LLM, gets some code back, saves it to a file. Dead simple.
That was v1. It was dumb. It worked sometimes.
But something about it felt right. So I kept going.
It Started Simple
The first version was embarrassingly basic. One script. One API call. One prompt that said “write me a test for this requirement.”
No memory. No analysis. No structure.
The problems showed up immediately. But when it worked, it saved me time. When it didn’t, I spent time fixing it. The math still worked in my favour.
So instead of throwing it away, I started asking: what would it take to make this actually reliable?
Then the Architecture Grew
Over the next few months, the script turned into a system. Not because I planned it that way, but because each pain point led to a clear next step.
Pain: The AI picked terrible selectors.
Solution: I wrote a set of rules and injected them into every prompt. Always prefer stable attributes. Never use fragile selectors.
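A minimal sketch of what injecting rules into every prompt could look like. The rule wording, the example attributes (`data-testid`, `id`, `name`), and the `build_prompt` helper are illustrative assumptions, not code from the repo.

```python
# Hypothetical rule set; the project's actual rules are not shown in this post.
SELECTOR_RULES = [
    "Prefer stable attributes such as data-testid, id, or name.",
    "Avoid fragile selectors such as nth-child chains or auto-generated class names.",
]


def build_prompt(requirement: str, url: str, rules: list = SELECTOR_RULES) -> str:
    """Return a generation prompt with the selector rules placed before the task."""
    rule_block = "\n".join(f"- {rule}" for rule in rules)
    return (
        "You are writing an end-to-end test.\n"
        f"Selector rules:\n{rule_block}\n\n"
        f"Requirement: {requirement}\n"
        f"URL: {url}\n"
    )


if __name__ == "__main__":
    print(build_prompt("User can log in", "https://example.com/login"))
```

Because the rules live in one list, tightening them changes every future prompt at once.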
Pain: The AI was writing code blind.
Solution: I added a step before generation — fetch the page, analyze the structure, extract every element into a clean structured format.
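As a rough sketch, the analysis step could look something like this, using only Python's standard `html.parser`. It assumes the page HTML has already been fetched, and it keeps only a handful of interactive tags. The `ElementExtractor` class, the tag list, and the output fields are my own illustrative choices, not the project's actual format.

```python
from html.parser import HTMLParser

# Assumption: only interactive elements matter for test generation.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea", "form"}


class ElementExtractor(HTMLParser):
    """Collect interactive elements and the attributes useful as selectors."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            attr_map = dict(attrs)
            self.elements.append({
                "tag": tag,
                "id": attr_map.get("id"),
                "name": attr_map.get("name"),
                "test_id": attr_map.get("data-testid"),
                "type": attr_map.get("type"),
            })


def extract_elements(html: str) -> list:
    """Parse an HTML string and return one dict per interactive element."""
    parser = ElementExtractor()
    parser.feed(html)
    return parser.elements


if __name__ == "__main__":
    sample = '<form id="login"><input name="email" type="email"><button data-testid="submit">Go</button></form>'
    for element in extract_elements(sample):
        print(element)
```

The structured list, not the raw HTML, is what would be handed to the model, so it sees real selectors instead of guessing.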
Pain: Every run started from zero.
Solution: I added a vector database. Every generated output gets stored as an embedding. Next time a similar requirement comes in, the system pulls up past patterns as references. First run: scratch. Fiftieth run: it knows your style.
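To make the idea concrete, here is a toy sketch of store-then-recall. It swaps the real pieces for stand-ins: word counts instead of model embeddings, and an in-memory list instead of a vector database. `PatternMemory`, `embed`, and `cosine` are hypothetical names used only for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class PatternMemory:
    """In-memory stand-in for the vector database."""

    def __init__(self):
        self._items = []  # (vector, requirement, generated_code)

    def store(self, requirement: str, generated_code: str) -> None:
        self._items.append((embed(requirement), requirement, generated_code))

    def search(self, requirement: str, top_k: int = 3) -> list:
        """Return up to top_k (requirement, code) pairs, most similar first."""
        query = embed(requirement)
        scored = [(cosine(query, vec), req, code) for vec, req, code in self._items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(req, code) for score, req, code in scored[:top_k] if score > 0]


if __name__ == "__main__":
    memory = PatternMemory()
    memory.store("user can log in with valid password", "LOGIN_SPEC")
    memory.store("user can add item to cart", "CART_SPEC")
    print(memory.search("user logs in with wrong password", top_k=1))
```

An empty memory returns nothing, which is the "first run: scratch" case. Every stored output makes later searches more useful.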
Pain: Locked into one framework.
Solution: I made the framework architecture simpler. Adding a third framework took almost 6 months.
Pain: Locked into one AI provider.
Solution: I built a thin layer over three providers — OpenAI, Anthropic, and Google. One flag switches the brain. The pipeline doesn’t care which model is thinking.
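One way such a thin layer could be shaped is a registry of interchangeable callables. The sketch below uses stub functions in place of real SDK calls, so nothing here reflects the actual OpenAI, Anthropic, or Google client APIs. The `PROVIDERS` registry and `generate` function are assumed names.

```python
from typing import Callable, Dict

# A provider is any function that turns a prompt into generated text.
Provider = Callable[[str], str]


def _openai_stub(prompt: str) -> str:
    return f"[openai] {prompt}"  # placeholder for a real SDK call


def _anthropic_stub(prompt: str) -> str:
    return f"[anthropic] {prompt}"  # placeholder for a real SDK call


def _google_stub(prompt: str) -> str:
    return f"[google] {prompt}"  # placeholder for a real SDK call


PROVIDERS: Dict[str, Provider] = {
    "openai": _openai_stub,
    "anthropic": _anthropic_stub,
    "google": _google_stub,
}


def generate(prompt: str, provider: str = "openai") -> str:
    """Dispatch the prompt to the selected provider; callers never see which SDK runs."""
    try:
        backend = PROVIDERS[provider]
    except KeyError:
        raise ValueError(
            f"Unknown provider {provider!r}; choose from {sorted(PROVIDERS)}"
        ) from None
    return backend(prompt)


if __name__ == "__main__":
    print(generate("write a login test", provider="anthropic"))
```

The rest of the pipeline only calls `generate`, which is what lets a single flag switch the model.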
Pain: No idea why things break.
Solution: I added OpenTelemetry spans to every step. Traces go to Grafana Tempo. Logs go to Grafana Loki. Now when something goes wrong, I see the entire decision chain. Like a flight recorder for an AI system.
Pain: When things fail, you’re on your own.
Solution: I built a failure analyzer. Feed it an error, it classifies the failure and gives you a structured diagnosis — what went wrong, why, and how to fix it.
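The shape of a structured diagnosis might look like the sketch below. The real analyzer may well use an LLM for classification. This keyword-based version is only an assumption-laden stand-in that shows the "what, why, how to fix" output, and the categories and messages are invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    category: str  # what went wrong
    cause: str     # why it likely happened
    fix: str       # how to fix it


# Hypothetical rules, checked in order; the first match wins.
_RULES = [
    (re.compile(r"timeout|timed out", re.I),
     Diagnosis("timeout", "The page or element took too long to respond.",
               "Wait for a specific element or raise the timeout.")),
    (re.compile(r"not found|no such element", re.I),
     Diagnosis("selector", "The selector did not match any element.",
               "Re-run page analysis and use a stable attribute.")),
    (re.compile(r"assert|expected", re.I),
     Diagnosis("assertion", "The page state differed from the expectation.",
               "Check the expected value against the requirement.")),
]


def analyze_failure(error_text: str) -> Diagnosis:
    """Classify an error message into a structured diagnosis."""
    for pattern, diagnosis in _RULES:
        if pattern.search(error_text):
            return diagnosis
    return Diagnosis("unknown", "No known pattern matched the error text.",
                     "Inspect the full trace manually.")


if __name__ == "__main__":
    print(analyze_failure("Element not found: #submit"))
```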
Where It Stands Today
The weekend script is now a five-step workflow.
It takes a requirement in plain English and a URL. It analyses the page. It searches its memory for similar patterns. It generates a complete, runnable spec — constrained by rules and informed by past experience. And if you want, it runs the test right there.
Three AI providers. Three frameworks. A growing pattern library. Structured prompt specs. A failure analyser. Full observability.
One sentence in. A working test out.
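Put together, the five steps could be wired up roughly like this. Each step is passed in as a plain function so the sketch runs without any network or model access. `run_pipeline`, `RunResult`, and the step names are hypothetical and only mirror the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RunResult:
    spec: str = ""
    steps: List[str] = field(default_factory=list)
    passed: Optional[bool] = None  # stays None when the test is not executed


def run_pipeline(
    requirement: str,
    url: str,
    analyze_page: Callable[[str], list],
    search_memory: Callable[[str], list],
    generate_spec: Callable[[str, list, list], str],
    run_test: Optional[Callable[[str], bool]] = None,
) -> RunResult:
    """Run the workflow: input, analyze, recall, generate, and optionally run."""
    result = RunResult()
    result.steps.append("input")       # 1. requirement and URL received

    elements = analyze_page(url)       # 2. analyze the page structure
    result.steps.append("analyze")

    patterns = search_memory(requirement)  # 3. look up similar past patterns
    result.steps.append("recall")

    result.spec = generate_spec(requirement, elements, patterns)  # 4. generate
    result.steps.append("generate")

    if run_test is not None:           # 5. optional execution
        result.passed = run_test(result.spec)
        result.steps.append("run")
    return result


if __name__ == "__main__":
    outcome = run_pipeline(
        "User can log in",
        "https://example.com/login",
        analyze_page=lambda url: [{"tag": "button"}],
        search_memory=lambda req: [],
        generate_spec=lambda req, els, pats: f"// spec for: {req}",
    )
    print(outcome)
```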
But Here’s What Gets Me Excited
The architecture mindset is shifting.
Automation engineer → Architectural LLM Engineer. That’s the shift.
I’m excited. Are you?
If so — wait and watch. The gun is loaded.
The Honest Part
I want to be straight about what this is and what it isn’t.
It doesn’t know your business logic. It doesn’t know which flows are critical and which don’t matter. It doesn’t know your edge cases unless you tell it.
The system handles the mechanical work. You handle the strategic work.
The project is open source under AGPL-3.0 at github.com/aiqualitylab/ai-natural-language-tests.
Clone it. Install dependencies.
Give it a sentence and a URL.
Watch it work.
It started as a weekend idea. Now it’s a full agentic pipeline with memory, observability, and multi-framework support.
And the best part? I’m just getting started.
