I run a product studio building apps. Nowadays every app needs at least one LLM integration.
I subscribe to Claude ($20/mo) and usually use its API since it's good for complex tasks, but sometimes it's overkill, so I switch to ChatGPT. Then Gemini came out. Then Grok. Then DeepSeek.
You never really know which one is actually best for your specific use case. They're all changing daily. New models, new sub-models, different pricing.
I was spending hours reading "ChatGPT vs Claude" Reddit threads and still guessing.
And even worse: I'd integrate a model, then discover a cheaper alternative that works just as well. Too late - already spent 2-3 days on integration.
The solution
I built Test AI Models: paste your actual prompt, see quality, speed, and cost across 9 LLMs side by side in 30 seconds.
No API keys needed. No reading benchmarks that test "write a poem" when you need to debug code.
Test YOUR production prompts. See which model actually wins for YOUR use case.
How it started
Built the first version in less than a week for a Bubble/Contra hackathon. 4 LLMs, basic comparison. Won "Best Use of AI" + $5K award.
Then I kept reading Reddit threads where developers argue ChatGPT vs Claude - nobody ever wins, because nobody can prove their pick.
That was the signal - other developers have this exact problem too.
Current status
Launched BETA: Feb 24, 2026 (Product Hunt)
Models: 9 total (ChatGPT, Claude, Gemini, Grok, Perplexity, DeepSeek, Qwen, Kimi, Mistral)
Users: 90+ early testers
Test model runs: 420+
Pricing: 50 free test selections, then $9/mo + API credits (billed 1:1, no markup)
Real example (I used it on itself)
Our app auto-generates titles for tests. I was using Claude Sonnet ($423/1M runs).
Tested alternatives:
Claude: $423/1M - perfect quality ✓
DeepSeek: $31/1M - cheapest BUT failed format requirements ✗
Qwen: $43/1M - also failed ✗
Grok: $49/1M - perfect quality ✓ + 8.6x cheaper
Switched to Grok. Saved $45/year on ONE tiny feature.
The lesson: "Use the cheapest model" doesn't work if it breaks your requirements. You have to test with YOUR actual needs.
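For anyone who wants to sanity-check the savings math, here's a minimal sketch in Python. The per-1M-run prices are the ones quoted above; the monthly run volume (~10K title generations) is my assumption, picked so the numbers roughly reproduce the ~$45/year figure.
```python
# Rough cost comparison for the title-generation feature.
# Prices are the per-1M-run figures quoted above; the monthly run
# volume is an assumption, not a measured number.

PRICE_PER_MILLION = {
    "Claude Sonnet": 423.0,  # passed format checks
    "DeepSeek": 31.0,        # cheapest, but failed format checks
    "Qwen": 43.0,            # failed format checks
    "Grok": 49.0,            # passed format checks
}

RUNS_PER_MONTH = 10_000  # hypothetical volume

def yearly_cost(price_per_million: float, runs_per_month: int = RUNS_PER_MONTH) -> float:
    """Yearly cost for a given per-1M-run price and monthly volume."""
    return price_per_million / 1_000_000 * runs_per_month * 12

claude = yearly_cost(PRICE_PER_MILLION["Claude Sonnet"])
grok = yearly_cost(PRICE_PER_MILLION["Grok"])

print(f"Claude Sonnet: ${claude:.2f}/yr")          # ~$50.76/yr
print(f"Grok:          ${grok:.2f}/yr")            # ~$5.88/yr
print(f"Savings:       ${claude - grok:.2f}/yr")   # ~$44.88/yr, ~8.6x cheaper
```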
Tech stack
Platform: Bubble.io (no-code)
APIs: OpenAI, Anthropic, Google, xAI, DeepSeek, Alibaba, Moonshot...
Email: Brevo
Payments: Paddle
What's next
Deciding the roadmap based on user feedback. What should I build?
A) Submodels (GPT-4o vs GPT-4o-mini, Claude Opus vs Sonnet vs Haiku)
B) API access (trigger tests from n8n, Zapier, agentic workflows)
C) Quality scoring (hallucination detection, consistency testing)
D) Image/voice generation comparison (DALL-E vs Midjourney, ElevenLabs vs Play.ht)
What would you actually use?
The ask
- Try it: testaimodels.com - run one test with your actual prompt, tell me what breaks or what's confusing
- Feedback: What's missing? What would make this 10x more useful?
- Roadmap input: A, B, C, or D above (or tell me what I'm missing)
I'm building this in public. Every decision is shaped by what early users say matters.
Questions I have
- Is $9/mo too expensive for indie devs? (API credits are 1:1, no markup)
- Is "test selections" pricing confusing? (50 selections = 5-25 full tests depending on how many models you compare)
- What modality matters most after text? (Image, audio, video?)
Drop a comment. I read and reply to everything.